webratz.de

Random notes about random cloud and photo stuff

SREcon17 EMEA Dublin recap

2017-09-04

What is SRE?

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. A more detailed explanation can be found on Google's SRE page.

What is SREcon about?

SREcon is a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. Our purpose is to be inclusive as we bring together ideas representative of our diverse community, whether its members are focusing on a global scale, launching new products and ideas for a small business, or pivoting their approach to unite software and systems engineering. SREcon challenges both those new to the profession as well as those who have been involved in it for decades. The conference has a culture of critical thought, deep technical insights, continuous improvement, and innovation.

The conference

While we already have many things in place at glomex that let us work “The SRE way”, there is always room for improvement. The SREcon Europe programme and participants were a good mix: huge companies with many SRE teams, like Google, but also much smaller companies like Intercom or Wayfair shared their learnings around the SRE topic.

The conference in Dublin was a great experience, although with so many talks (mostly 5 tracks at a time) it was impossible to attend all of them. What follows is a very brief summary of the most interesting talks I attended and some takeaways.

Managing SSH Access without Managing SSH Keys

It is possible to use certificates together with SSH keys, which brings a lot of advantages:

  • restrictable attributes
  • short lifetime
  • easy rollback
  • no keys need to be pre-baked into images, only the public key of the CA

All of this improves the security of SSH based authentication a lot.

Cashier is the tooling that Intercom built for managing their SSH CA. A lot of information about this can also be found in the ssh man pages.
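
To make the idea more concrete, here is a rough sketch (not Intercom's Cashier itself) of what signing a user key with an SSH CA looks like; the paths, identity and principals are made-up placeholders.

    import subprocess

    # Made-up paths: private key of the user CA and the key to be signed
    CA_KEY = "/etc/ssh/ca/users_ca"
    USER_PUBKEY = "/home/alice/.ssh/id_ed25519.pub"

    def sign_user_key() -> None:
        # -s: CA key, -I: certificate identity (ends up in auth logs),
        # -n: allowed principals, -V: short validity window (easy "rollback" by expiry)
        subprocess.run(
            ["ssh-keygen", "-s", CA_KEY,
             "-I", "alice@example.com",
             "-n", "alice,admin",
             "-V", "+1h",
             USER_PUBKEY],
            check=True,
        )

    def show_certificate() -> None:
        # Print the restrictable attributes (principals, validity, options) of the cert
        cert = USER_PUBKEY.replace(".pub", "-cert.pub")
        subprocess.run(["ssh-keygen", "-L", "-f", cert], check=True)

    sign_user_key()
    show_certificate()

On the server side only the CA public key needs to be trusted (e.g. via TrustedUserCAKeys in sshd_config), which is the "no keys pre-baked into images" point from the list above.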

OK Log: Distributed and Coordination Free Logging

A really great talk that started by explaining the reasoning behind why one would want to build a new and simple log system.

There is no need to repeat all of this here, as there is a great explanation in the GitHub repo:

OK Log on GitHub

Deploying Changes to Production in the Age of the Microservice

Penalty Buckets

There are different ways you can value the cost of your outages:

  • immediate lost revenue
  • user trust
  • contract violations
  • increased infrastructure costs
  • engineering time

the optimal rollout is

  • staged
  • progressive
  • revertable
  • transparent
  • automatic

detect badness as early as possible (a small rollout sketch follows this list)

  • have high quality (unit, integration) test coverage
  • build your targets as often as possible
  • A/B experiment everything
  • automate to limit toil and risk of human error
  • deploy often, limit change surface
  • make it easy to rollback
  • encourage backwards compatibility in code
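
As a rough sketch of what “staged, progressive, revertable, automatic” can look like in practice; the stage names, deploy/rollback helpers and health check below are invented placeholders, not tooling from the talk.

    import time

    STAGES = ["canary", "one-datacenter", "global"]    # progressively larger blast radius

    def deploy(stage: str, version: str) -> None:
        print(f"deploying {version} to {stage}")       # real tooling would push the release here

    def rollback(stage: str, version: str) -> None:
        print(f"rolling back {stage} to {version}")    # revertable: keep the previous version around

    def health_ok(stage: str) -> bool:
        return True                                    # real check: error rates, latency, A/B metrics

    def rollout(new_version: str, previous_version: str, soak_seconds: int = 300) -> bool:
        for stage in STAGES:
            deploy(stage, new_version)
            time.sleep(soak_seconds)                   # let metrics accumulate before widening
            if not health_ok(stage):
                # detect badness as early as possible and revert everything touched so far
                for done in STAGES[:STAGES.index(stage) + 1]:
                    rollback(done, previous_version)
                return False
        return True

    rollout("v1.4.2", "v1.4.1", soak_seconds=0)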

consistent names

it does not matter exactly how you name something, but it has to be consistent (a toy example follows this list)

  • server name schemas
  • binary release stages
  • machines
  • load-balancing domains
  • user populations
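
As a toy illustration (the schema below is invented, not from the talk), having a single definition and parser for a name schema is one way to keep it consistent:

    import re

    # Invented schema "<service>-<stage>-<region>-<index>" for illustration only
    NAME_RE = re.compile(
        r"^(?P<service>[a-z]+)-(?P<stage>dev|staging|prod)-(?P<region>[a-z]{2})-(?P<index>\d{3})$"
    )

    def parse_name(name: str) -> dict:
        match = NAME_RE.match(name)
        if not match:
            raise ValueError(f"{name!r} does not follow the naming schema")
        return match.groupdict()

    print(parse_name("video-prod-eu-001"))
    # {'service': 'video', 'stage': 'prod', 'region': 'eu', 'index': '001'}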

From Firefighting to Proactive Work

The speaker identified three main work modes / health stages:

  • Firefighting
  • Preventive
  • Proactive

The important part: when assessing which mode you are currently in, it is essential to look both at the service and its metrics AND at the team. While the service part can change pretty fast, the team part is way more difficult and requires a mid- and long-term approach.

Building a Culture of Reliability

Distributed engineering requires distributed on-call. The teams that build systems should also be on call for them. If people can fix things on their own, they will. If building and operating are split between teams, this is impossible and the pattern is hard to break out of.

Failures happen, all the time!

This is a fact, but we can learn to live with failures and ideally never repeat them. Some ideas on how one could do this (a rough sketch follows the list):

  • Chaos Cat (similar to Chaos Monkey): a service randomly introduces errors into the system and people have to resolve the issue
  • Failure Friday: similar to the above, but every Friday a small team introduces issues into a randomly chosen service. The rest of the company tries to figure out what is wrong and fix it.
  • Magazine-style postmortems, including interviews with people, that are sent out as a newsletter to the whole company
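
A very rough sketch of the “randomly break something and let people practice fixing it” idea; the service list and the injected failure are placeholders, not how Chaos Cat or Failure Friday are actually implemented.

    import random

    SERVICES = ["checkout", "search", "recommendations"]   # made-up service names

    def inject_failure(service: str) -> None:
        # A real exercise might add latency, drop packets or kill instances;
        # here we only announce what would happen.
        print(f"injecting failure into {service}: +500ms latency for 10 minutes")

    def failure_friday() -> None:
        victim = random.choice(SERVICES)
        inject_failure(victim)
        print("now let the rest of the company figure out what is wrong and fix it")

    failure_friday()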

Most important: do postmortems and actually handle all of your action items to become better.

Improving reliability means constantly failing, constantly recovering, and constantly learning.

PagerDuty Incident response docs

Source

When Trouble Comes to Town

Step 0: Don’t panic

  • You are not alone
  • You won’t get fired

Step 1: Assess the impact

  • What’s broken for users?
  • How broken is it?
  • Have dashboards for that

Step 2: Communicate

  • maybe: file an incident announcement
  • form investigation groups
  • advertise realtime discussion
  • contact other members, experts and the incident manager

Step 3: What changed?

  • identify exact start time
  • get some error details
  • check for changes (code push, deployment, config change)
  • check the change database (see the sketch after this list)
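
A minimal sketch of the “what changed” step: given the incident start time, list the change events that landed shortly before it. The change log entries and their fields are invented for illustration.

    from datetime import datetime, timedelta

    # Invented change events; in reality these come from your change database
    change_log = [
        {"time": datetime(2017, 9, 4, 10, 12), "type": "deploy", "what": "api v1.4.2"},
        {"time": datetime(2017, 9, 4, 10, 41), "type": "config", "what": "cache TTL 60s -> 5s"},
        {"time": datetime(2017, 9, 4, 11, 3),  "type": "deploy", "what": "frontend v2.0.0"},
    ]

    def suspects(incident_start: datetime, window: timedelta = timedelta(hours=1)):
        """Changes that landed within `window` before the incident started."""
        return [c for c in change_log if incident_start - window <= c["time"] <= incident_start]

    for change in suspects(datetime(2017, 9, 4, 11, 5)):
        print(change["time"], change["type"], change["what"])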

Step 4: Mitigate

  • can we turn something off?
  • can we drain traffic?
  • runtime controls (e.g. disable a feature)

Step 5: Dive in

  • investigate root cause
  • but don’t waste too much time
  • parallelise efforts

Step 6: Resolution

  • confirm that user impact is gone
  • communicate internally / externally
  • can it easily retrigger?

Step 7: Cleanup

incident report

Record all details

  • timeline
  • root cause
  • followups

Try to improve:

  • detection
  • escalation
  • recovery
  • prevention

do a review

Linux System Metrics

This was a very interesting workshop on all the nifty little details of the metrics the Linux kernel provides. Many of the things were not new to me, but there is always something to learn.

The good thing is: if you are interested in the topic but had no chance to attend the session, all the material is on GitHub.

There are even more places to find interesting information about this topic:

Statistics for engineers

A very interesting workshop about statistics, but not in a highly mathematical way: lots of examples of which statistics make sense for the issues we face as SREs.

Also here: there are lots of Jupyter notebooks in the GitHub repo that explain all the things from the workshop.

Further reading on the things mentioned:

Monitoring Design Principles

SLO: based on percentiles

SLOs based on percentiles often cut off interesting data.

percentiles are complicated:

  • often not described how they are calculated (period of time, number of samples)
  • no description of why 90, 95, 99, 99.9 is chosen
  • hard to aggregate

For SLOs: if you aggregate over longer time periods (buckets/bins) than your SLO allows, you are screwed. Example: with 99.99% availability you can only be 4m 23s below your SLO per month. If you aggregate your values in 5-minute buckets and had a massive issue for only one minute, which would be within what is allowed, you still have to account for 5 minutes of downtime.
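
The arithmetic from that example, spelled out (assuming an average month length of roughly 43830 minutes):

    # Error budget arithmetic for the example above (average month length)
    minutes_per_month = 365.25 * 24 * 60 / 12        # ~43830 minutes

    budget = minutes_per_month * (1 - 0.9999)        # allowed downtime at 99.99%
    print(f"99.99% monthly budget: {budget:.2f} minutes")    # ~4.38 minutes, i.e. about 4m 23s

    # A 1-minute issue inside a 5-minute aggregation bucket gets accounted
    # as the whole bucket, which alone uses up more than the monthly budget:
    accounted_downtime = 5
    print(f"budget left: {budget - accounted_downtime:.2f} minutes")   # negative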

principles

  • Do not measure rates. The rate of change over time can be derived at query time
  • monitor outside of your tech stack
  • monitor what is important to the health of your org, not necessarily tech values
  • do not silo data: correlation is key, everybody should have all data available
  • Value observation of real work over the measurement of synthesized work, e.g. don’t run a test query against the DB, but check the results of actual queries
  • synthesize work to ensure function, but only for business-critical, low-volume events
  • percentiles are not histograms
    for SLO management you need to store histograms for post-processing (a small illustration follows this list)
  • history is critical
    keep your data for a long time (e.g. for capacity planning, reanalysis)
  • stay outside of the blast radius
    your monitoring should still work if the rest doesn’t
  • something is better than nothing
    Don’t let perfect be the enemy of good. You have to start somewhere
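
A small illustration of the “hard to aggregate” and “percentiles are not histograms” points, with made-up latency samples:

    # Made-up latency samples (ms) from two hosts
    host_a = [10, 11, 12, 13, 500]                        # one slow outlier
    host_b = [10, 10, 11, 11, 12, 12, 13, 13, 14, 14]

    def p99(samples):
        ordered = sorted(samples)
        return ordered[min(len(ordered) - 1, round(0.99 * (len(ordered) - 1)))]

    # Averaging pre-computed per-host percentiles gives a number that is nobody's latency
    print((p99(host_a) + p99(host_b)) / 2)    # 257.0

    # Computing the percentile over the merged samples (which is what storing
    # histograms enables) gives the real answer
    print(p99(host_a + host_b))               # 500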

alerts require documentation

No ruleset should trigger an alert without:

  • human readable explanation
  • business impact description
  • remediation procedure
  • escalation documentation (a small example of enforcing this follows the list)
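
One way to make that rule stick is to make the documentation fields mandatory in whatever structure defines an alert; the dataclass and the example query below are invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        query: str              # the rule that triggers the alert
        explanation: str        # human readable: what does it mean when this fires?
        business_impact: str    # why does anyone care?
        remediation: str        # what to do first
        escalation: str         # who to contact if that doesn't help

    # All fields are required, so an alert without documentation cannot be defined
    high_error_rate = Alert(
        name="frontend-5xx-rate",
        query="rate(http_5xx[5m]) > 0.05",
        explanation="More than 5% of frontend requests fail with a 5xx error.",
        business_impact="Users cannot watch videos; direct revenue impact.",
        remediation="Check the latest deployment and roll back if it correlates.",
        escalation="Page the frontend on-call, then the SRE on-call.",
    )
    print(high_error_rate.name)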

Have you tried turning it off and on again?

“What would happen if all of your systems rebooted at once?” was the overall question of this talk.

Here are the tips shared:

  • small fallback plans: small, simple, solid
  • test everything
  • manage dependencies
  • architect in layers

Manage dependencies:

  • between services, not libraries
  • more services = more dependencies
  • your availability can’t be better than that of your worst dependency
  • check for cyclic dependencies; graphviz may help (see the sketch after this list)
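
A small sketch of that dependency check: dump the service graph as Graphviz DOT and look for cycles. The example graph is made up.

    # Made-up service dependency graph; "api" and "auth" depend on each other
    deps = {
        "frontend": ["api"],
        "api": ["database", "auth"],
        "auth": ["database", "api"],
        "database": [],
    }

    def to_dot(graph):
        """Render the graph as Graphviz DOT, e.g. to pipe into `dot -Tpng`."""
        lines = ["digraph services {"]
        for service, targets in graph.items():
            for target in targets:
                lines.append(f'  "{service}" -> "{target}";')
        lines.append("}")
        return "\n".join(lines)

    def find_cycle(graph):
        """Depth-first search; returns one cyclic path or None."""
        visiting, visited = set(), set()

        def visit(node, path):
            visiting.add(node)
            for nxt in graph.get(node, []):
                if nxt in visiting:
                    return path + [node, nxt]
                if nxt not in visited:
                    found = visit(nxt, path + [node])
                    if found:
                        return found
            visiting.discard(node)
            visited.add(node)
            return None

        for node in graph:
            if node not in visited:
                found = visit(node, [])
                if found:
                    return found
        return None

    print(to_dot(deps))
    print(find_cycle(deps))    # ['frontend', 'api', 'auth', 'api']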

summary

If there is one key takeaway from the conference it is:

Most of the technical issues we have in our field are pretty well understood and most of them are solved. What most companies are struggling with is everything related to people and organization.

Further reading

Keeping track of everything that happened during a conference is complicated and there is always more interesting stuff to see. Here are some references to things that I noted down to have a look at later on.

Tools

Linux Performance

Git Stacktrace

Dynamic documentation

Snap: telemetry framework

ShipIt

Deployment coordination

Udppinger

Toxiproxy

Notes from other people

Also other people published their notes on the conference, here are some:

Tanya Reilly @whereistanya

Kurt Andersen @drkurta

Nikolay Sturm @nistude