What is SRE?
Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. A more detailed explanation can be found on Google's SRE page
What is SREcon about?
SREcon is a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. Our purpose is to be inclusive as we bring together ideas representative of our diverse community, whether its members are focusing on a global scale, launching new products and ideas for a small business, or pivoting their approach to unite software and systems engineering. SREcon challenges both those new to the profession as well as those who have been involved in it for decades. The conference has a culture of critical thought, deep technical insights, continuous improvement, and innovation.
While we already have many things in place at glomex that let us work "the SRE way", there is always room for improvement. The SREcon Europe programme and participants were a good mix: huge companies with many SRE teams, like Google, but also much smaller companies like Intercom or Wayfair shared their learnings around SRE.
The conference in Dublin was a great experience, although with so many talks (mostly five tracks at a time) it was impossible to attend all of them. What follows is a very brief summary of the most interesting talks I attended, and some takeaways.
Managing SSH Access without Managing SSH Keys
SSH allows using certificates together with keys. Certificates have a lot of advantages:
- restrictable attributes
- short lifetime
- easy rollback
- no keys need to be pre-baked into images, only the public key of the CA
All of this considerably improves the security of SSH-based authentication.
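A minimal sketch of why these properties help, modeling the relevant certificate fields in Python. The field names mirror what an OpenSSH CA can restrict (principals, validity interval), but the class itself is hypothetical, for illustration only:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SSHCertificate:
    # Hypothetical model of the attributes an SSH CA can restrict.
    key_id: str
    principals: list        # which accounts the holder may log in as
    valid_after: datetime
    valid_before: datetime  # short lifetime: expiry instead of key revocation

    def allows(self, user: str, now: datetime) -> bool:
        """A host only needs the CA public key to make this decision."""
        return user in self.principals and self.valid_after <= now < self.valid_before

now = datetime.now(timezone.utc)
cert = SSHCertificate("alice@laptop", ["alice"], now, now + timedelta(hours=8))

print(cert.allows("alice", now))                      # True: granted principal, within lifetime
print(cert.allows("root", now))                       # False: principal not granted
print(cert.allows("alice", now + timedelta(days=1)))  # False: certificate expired
```

The short `valid_before` is what makes rollback easy: a leaked credential simply stops working, without any key cleanup on the hosts.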
OK Log: Distributed and Coordination Free Logging
A really great talk that started by explaining why one would want to build a new, simple log system.
There is no need to repeat all of this here, as there is a great explanation in the GitHub repo
Deploying Changes to Production in the Age of the Microservice
There are different ways to assess the cost of your outages:
- immediate lost revenue
- user trust
- contract violations
- increased infrastructure costs
- engineering time
The optimal rollout detects badness as early as possible:
- have high quality (unit, integration) test coverage
- build your targets as often as possible
- A/B experiment everything
- automate to limit toil and risk of human error
- deploy often, limit change surface
- make it easy to rollback
- encourage backwards compatibility in code
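The ideas above can be sketched as a tiny staged-rollout loop in Python. The stage names and the error-rate threshold are made up for illustration; in practice `error_rate` would query your monitoring:

```python
def rollout(stages, error_rate, threshold=0.01):
    """Deploy stage by stage; roll back as soon as badness is detected.

    `error_rate` is a callable returning the observed error rate after
    a stage is deployed (a stand-in for a real monitoring query).
    """
    deployed = []
    for stage in stages:
        deployed.append(stage)
        if error_rate(stage) > threshold:
            # easy rollback: undo in reverse order, limiting the change surface
            for s in reversed(deployed):
                print(f"rolling back {s}")
            return False
        print(f"{stage} healthy, continuing")
    return True

# Example: the 10% stage misbehaves, so production is never touched.
ok = rollout(["canary", "10%", "production"],
             error_rate=lambda s: 0.5 if s == "10%" else 0.0)
print(ok)  # False
```

Deploying often with small stages like this is what makes "detect badness early" and "make it easy to roll back" cheap in practice.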
It does not matter how you name things, but naming has to be consistent:
- server name schemas
- binary release stages
- load-balancing domains
- user populations
From Firefighting to Proactive Work
The speaker identified three main work modes / health stages:
The important part: when assessing which mode you are currently in, it is essential to look both at the service and its metrics AND at the team. While the service part can change pretty fast, the team part is way more difficult and requires a mid- and long-term approach.
Building a Culture of Reliability
Distributed engineering requires distributed on-call: the teams that build systems should also be on call for them. If people can fix things on their own, they will. If the teams are split, this is impossible and hard to break out of.
Failures happen, all the time!
This is a fact, but we can learn to cope with failures and ideally never repeat them. Some ideas on how one could do this:
- Chaos Cat (similar to Chaos Monkey): a service randomly introduces errors into the system and people have to resolve the issue
- Failure Friday: similar to the above, but every Friday a small team introduces issues into a randomly chosen service. The rest of the company tries to figure out what is wrong and fix it.
- Magazine-style postmortems, including interviews with people, that are sent out as a newsletter to the whole company
Most important: do postmortems and actually handle all of your action items in order to improve.
Improving reliability means constantly failing, constantly recovering, and constantly learning.
When Trouble Comes to Town
Step 0: Don’t panic
- You are not alone
- You won’t get fired
Step 1: Assess the impact
- What's broken for users?
- How broken is it?
- Have dashboards for that
Step 2: communicate
- maybe: file an incident announcement
- form investigation groups
- advertise realtime discussion
- contact other members, experts and the incident manager
Step 3: What changed
- identify exact start time
- get some error details
- check for changes (code push, deployment, config change)
- check change database
Step 4: mitigate
- can we turn something off
- can we drain traffic
- runtime controls (eg feature disable)
Step 5: dive in
- investigate root cause
- but don’t waste too much time
- parallelise efforts
Step 6: resolution
- confirm that user impact is gone
- communicate internally / externally
- can it easily retrigger?
Step 7: cleanup
- record all details (e.g. root cause)
- do a review and try to improve
Linux System Metrics
This was a very interesting workshop on the little nifty details of all the metrics the Linux kernel exposes. Many of the things were not new to me, but there is always something to learn.
The good thing is: if you are interested in the topic but had no chance to attend the session, all the material is on GitHub.
There are even more places to find interesting information about this topic:
Statistics for engineers
A very interesting workshop about statistics, but not in a highly mathematical way: lots of examples of which statistics make sense for the issues we face as SREs.
Here, too, there are lots of Jupyter notebooks in the GitHub repo that explain everything from the workshop.
Further reading on the things mentioned:
Monitoring Design Principles
SLO: based on percentiles
SLOs based on percentiles often cut off interesting data.
Percentiles are complicated:
- it is often not described how they are calculated (period of time, number of samples)
- there is no description of why 90, 95, 99, or 99.9 is chosen
- hard to aggregate
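A small Python demonstration of why percentiles are hard to aggregate: averaging two per-host p99 latencies does not give the overall p99. The latency distributions are made up:

```python
import random

random.seed(1)
# Two hosts with different latency distributions (milliseconds, made up).
host_a = [random.gauss(100, 10) for _ in range(10_000)]
host_b = [random.gauss(400, 50) for _ in range(10_000)]

def p99(samples):
    """99th percentile by sorting; fine for a demonstration."""
    return sorted(samples)[int(0.99 * len(samples)) - 1]

avg_of_p99s = (p99(host_a) + p99(host_b)) / 2
true_p99 = p99(host_a + host_b)
print(round(avg_of_p99s), round(true_p99))  # the two numbers differ substantially
```

The global p99 is dominated by the slow host's tail, which is exactly the information that averaging pre-computed percentiles throws away.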
For SLOs: if you aggregate over longer time periods (buckets/bins) than your SLO allows, you are screwed. Example: with 99.99% availability you can only be below your SLO for about 4m 23s per month. If you aggregate your values into 5-minute buckets and had a massive issue for only one minute (which would be within what is allowed), you still have to account for 5 minutes of downtime.
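The bucket problem can be checked with a few lines of Python: the error budget of a 99.99% monthly SLO is only a few minutes, so a single 5-minute bucket containing a 1-minute outage can consume it entirely (a 30-day month is assumed here):

```python
month_minutes = 30 * 24 * 60            # assume a 30-day month
slo = 0.9999                            # 99.99% availability
budget = month_minutes * (1 - slo)      # allowed downtime in minutes
print(f"error budget: {budget:.2f} min")  # error budget: 4.32 min

outage = 1                              # the real outage lasted one minute
bucket = 5                              # but we only store 5-minute buckets
charged = bucket                        # so the whole bucket counts as down
print(charged > budget)                 # True: the SLO appears violated
```

With finer-grained data (or full histograms), only the actual one minute would count against the budget.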
- Do not measure rates. The rate of change over time can be derived at query time
- monitor outside of your tech stack
- monitor what is important to the health of your org, not necessarily tech values
- do not silo data: correlation is key, so everybody should have all data available
- Value observation of real work over the measurement of synthesized work, e.g. don't run a DB test query, but check the results of actual queries
- synthesize work to ensure function, but only for business-critical, low-volume events
- percentiles are not histograms
for SLO management you need to store histograms for post-processing
- history is critical
keep your data for a long time (e.g. for capacity planning, re-analysis)
- stay outside of the blast radius
your monitoring should still work if the rest doesn’t
- something is better than nothing
Don't let perfect be the enemy of good; you have to start somewhere.
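The "do not measure rates" point above can be illustrated in Python: store raw cumulative counter samples and derive the per-second rate only at query time (timestamps and counter values are made up):

```python
# (timestamp_seconds, cumulative_request_count) samples as stored
samples = [(0, 0), (10, 150), (20, 150), (30, 600)]

def rate(samples):
    """Derive per-second rates between consecutive samples at query time."""
    return [
        (t1, (c1 - c0) / (t1 - t0))
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]

print(rate(samples))  # [(10, 15.0), (20, 0.0), (30, 45.0)]
```

Storing the counter instead of the rate keeps all options open: any window size, averages, or totals can still be computed later, which is impossible once only pre-computed rates survive.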
alerts require documentation
No ruleset should trigger an alert without:
- human readable explanation
- business impact description
- remediation procedure
- escalation documentation
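Such a rule can be enforced mechanically, e.g. by validating alert definitions in CI. A sketch in Python; the field names are assumptions, not from any particular alerting system:

```python
# Hypothetical required documentation fields for every alert rule.
REQUIRED_DOCS = {"explanation", "business_impact", "remediation", "escalation"}

def validate(alert: dict) -> list:
    """Return the documentation fields missing from an alert definition."""
    return sorted(REQUIRED_DOCS - set(alert))

alert = {
    "name": "api_error_rate_high",
    "explanation": "5xx rate above 1% for 5 minutes",
    "remediation": "roll back the latest deploy, see runbook",
}
print(validate(alert))  # ['business_impact', 'escalation']
```

A check like this in the deploy pipeline prevents undocumented pages from ever reaching the on-call rotation.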
Have you tried turning it off and on again
"What would happen if all of your systems rebooted at once?" was the overall question of this talk.
Here are the tips shared:
- small fallback plans: small, simple, solid
- test everything
- manage dependencies
- architect in layers
- between services, not libraries
- more services = more dependencies
- your availability can't be better than that of your worst dependency
- check for cyclic dependencies, graphviz may help
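Checking for cyclic dependencies needs only a few lines of Python before reaching for graphviz; the service names below are made up:

```python
def find_cycle(deps):
    """Depth-first search for a cycle in a service dependency graph."""
    visiting, done = set(), set()

    def visit(node, path):
        if node in visiting:                        # back edge: cycle found
            return path[path.index(node):] + [node]
        if node in done:
            return None
        visiting.add(node)
        for dep in deps.get(node, []):
            cycle = visit(dep, path + [node])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for node in deps:
        cycle = visit(node, [])
        if cycle:
            return cycle
    return None

# auth -> db -> backup -> auth would deadlock a cold start
deps = {"auth": ["db"], "db": ["backup"], "backup": ["auth"], "web": ["auth"]}
print(find_cycle(deps))  # ['auth', 'db', 'backup', 'auth']
```

Any cycle found this way is a service pair that cannot bootstrap each other after a full reboot, which is exactly the failure mode the talk warned about.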
If there is one key takeaway from the conference it is:
Most of the technical issues we have in our field are pretty well understood, and most of them are solved. What most companies are struggling with is everything related to people and organization.
Keeping track of everything that happened during a conference is complicated, and there is always more interesting stuff to see. Here are some references to things that I noted down to look at later.
Snap: telemetry framework
Notes from other people
Other people published their notes on the conference as well; here are some:
Tanya Reilly @whereistanya
Kurt Andersen @drkurta:
Nikolay Sturm @nistude