SREcon17 EMEA Dublin recap

2017-09-04

What is SRE?

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. A more detailed explanation can be found on Google’s SRE page.

What is SREcon about?

SREcon is a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. Our purpose is to be inclusive as we bring together ideas representative of our diverse community, whether its members are focusing on a global scale, launching new products and ideas for a small business, or pivoting their approach to unite software and systems engineering. SREcon challenges both those new to the profession as well as those who have been involved in it for decades. The conference has a culture of critical thought, deep technical insights, continuous improvement, and innovation.

The conference

While we already have many things in place at glomex that let us work “the SRE way”, there is always room for improvement. The programme and the participants of SREcon Europe were a good mix: huge companies with many SRE teams, like Google, but also much smaller companies like Intercom or Wayfair shared what they have learned about SRE.

The conference in Dublin was a great experience, although there were so many talks (mostly five tracks at a time) that it was impossible to attend all of them. What follows is a very brief summary of the most interesting talks I attended, along with some takeaways.

Managing SSH Access without Managing SSH Keys

It is possible to use certificates together with SSH keys. They have a lot of advantages:

restrictable attributes
short lifetime
easy rollback
no keys need to be pre-baked into images, only the public key of the CA

All of this improves the security of SSH based authentication a lot.

Cashier is the tooling that Intercom built for managing their SSH CA. A lot of information about SSH certificates can also be found in the OpenSSH man pages, e.g. the CERTIFICATES section of ssh-keygen(1).

OK Log: Distributed and Coordination Free Logging

A really great talk that started by explaining why one would want to build a new and simple log system.

There is no need to repeat all of this here, as there is a great explanation in the GitHub repo:

OK Log on GitHub

Deploying Changes to Production in the Age of the Microservice

Penalty Buckets

There are different ways in which an outage can cost you:

immediate lost revenue
user trust
contract violations
increased infrastructure costs
engineering time

the optimal rollout is

staged
progressive
revertable
transparent
automatic
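
As a rough illustration of these properties (not something shown in the talk), a staged, progressive and automatically reverting rollout could be sketched as follows. The stage fractions and the `deploy`, `is_healthy` and `rollback` callables are hypothetical stand-ins for whatever your deployment tooling provides.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    is_healthy: Callable[[float], bool],
    rollback: Callable[[], None],
) -> List[str]:
    """Roll out stage by stage and revert automatically on the first unhealthy stage.

    Returns a human-readable log so the rollout stays transparent.
    """
    log: List[str] = []
    for fraction in stages:
        deploy(fraction)
        log.append(f"deployed to {fraction:.0%}")
        if not is_healthy(fraction):
            rollback()
            log.append(f"unhealthy at {fraction:.0%}, rolled back")
            return log
    log.append("rollout complete")
    return log


# Hypothetical example: the release looks fine on 1% and 10% of the fleet,
# but health checks fail once it reaches 50%.
STAGES = (0.01, 0.10, 0.50, 1.00)
deployed: List[float] = []
rolled_back: List[bool] = []

rollout_log = progressive_rollout(
    STAGES,
    deploy=deployed.append,
    is_healthy=lambda fraction: fraction < 0.5,
    rollback=lambda: rolled_back.append(True),
)
print("\n".join(rollout_log))
```

The small early stages limit how many users a bad release can reach before the health check catches it.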

detect badness as early as possible

have high quality (unit, integration) test coverage
build your targets as often as possible
A/B experiment everything
automate to limit toil and risk of human error
deploy often, limit change surface
make it easy to roll back
encourage backwards compatibility in code

consistent names

it does not matter how you name something, but it has to be consistent

server name schemas
binary release stages
machines
load-balancing domains
user populations

From Firefighting to Proactive Work

The speaker identified three main work modes / health stages:

Firefighting
Preventive
Proactive

The important part: when assessing which mode you are currently in, it is essential to look at both the service and its metrics AND the team. While the service part can change pretty fast, the team part is much more difficult and requires a mid- and long-term approach.

Building a Culture of Reliability

Distributed engineering requires distributed on-call. The teams that build systems should also be on call for them. If people can fix things on their own, they will. If the teams are split, this is impossible, and that split is hard to break out of.

Failures happen, all the time!

This is a fact, but we can learn to live with failures and, ideally, never repeat them. Some ideas on how one could do this:

Chaos Cat (similar to Chaos Monkey): a service randomly introduces errors into the system, and people have to resolve the issue
Failure Friday: similar to the above, but here a small team introduces issues into a randomly chosen service every Friday. The rest of the company tries to figure out what is wrong and fix it.
Magazine-style postmortems, including interviews with the people involved, that are sent out as a newsletter to the whole company

Most important: do postmortems and actually follow up on all of your action items to get better.

Improving reliability means constantly failing, constantly recovering, and constantly learning

PagerDuty Incident response docs

Source

When Trouble Comes to Town

Step 0: Don’t panic

You are not alone
You won’t get fired

Step 1: Assess the impact

What’s broken for users?
How broken is it?
Have dashboards for that

Step 2: communicate

maybe: file an incident announcement
form investigation groups
advertise realtime discussion
contact other members, experts and the incident manager

Step 3: What changed

identify exact start time
get some error details
check for changes (code push, deployment, config change)
check change database
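
A minimal sketch of what "check for changes" could look like once the exact start time is known: list everything that changed shortly before the incident began, most recent first. The `Change` record, the two-hour lookback and the sample entries are all made up for illustration; a real change database will have its own schema and query interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    when: datetime
    kind: str          # e.g. "code push", "deployment", "config change"
    description: str


def suspects(
    changes: List[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=2),  # assumption: tune to your release cadence
) -> List[Change]:
    """Return changes made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Hypothetical change log.
change_log = [
    Change(datetime(2017, 9, 1, 8, 0), "deployment", "billing service v41"),
    Change(datetime(2017, 9, 1, 13, 10), "config change", "raise cache TTL"),
    Change(datetime(2017, 9, 1, 13, 40), "code push", "new checkout flow"),
    Change(datetime(2017, 9, 1, 15, 0), "deployment", "search service v7"),
]
incident_start = datetime(2017, 9, 1, 13, 55)

for change in suspects(change_log, incident_start):
    print(change.when.isoformat(), change.kind, change.description)
```

The most recent change before the start time is usually the first thing worth rolling back or ruling out.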

Step 4: mitigate

can we turn something off
can we drain traffic
runtime controls (e.g. feature disable)

Step 5: dive in

investigate root cause
but don’t waste too much time
parallelise efforts

Step 6: resolution

confirm that user impact is gone
communicate internally / externally
can it easily retrigger?

Step 7: cleanup

incident report

Record all details

timeline
root cause
followups

Try to improve:

detection
escalation
recovery
prevention

do a review

Linux System Metrics

This was a very interesting workshop on the nifty little details of all the metrics the Linux kernel provides. A lot of it was not new to me, but there is always something to learn.

The good thing is: if you are interested in the topic but had no chance to attend the session, all the material is on GitHub.

There are even more places to find interesting information about this topic:

Statistics for engineers

A very interesting workshop about statistics, but not in a highly mathematical way: lots of examples of which statistics make sense for the issues we face as SREs.

Here too, there are a lot of Jupyter notebooks in the GitHub repo that explain everything from the workshop.

Monitoring Design Principles

SLO: based on percentiles

SLOs based on percentiles often cut off interesting data.

percentiles are complicated:

often it is not described how they are calculated (period of time, number of samples)
no explanation of why 90, 95, 99 or 99.9 was chosen
hard to aggregate
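
To illustrate the aggregation problem, here is a small sketch with made-up latency numbers (the two hosts and their values are purely hypothetical). Averaging per-host p99 values gives a very different answer than the p99 over all requests, while merged histograms still yield the right value. The histogram here uses exact values as buckets to keep the example short; real histograms use bucket ranges and are therefore approximate.

```python
from collections import Counter
from typing import Dict, Sequence


def nearest_rank(p: int, n: int) -> int:
    """1-based nearest-rank position of the p-th percentile among n samples."""
    return max(1, -(-p * n // 100))  # integer ceiling of p * n / 100


def percentile(samples: Sequence[float], p: int) -> float:
    ordered = sorted(samples)
    return ordered[nearest_rank(p, len(ordered)) - 1]


def percentile_from_histogram(hist: Dict[float, int], p: int) -> float:
    rank = nearest_rank(p, sum(hist.values()))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")


# Hypothetical latencies in milliseconds:
# host_a is busy and fast, host_b is nearly idle and slow.
host_a = [10.0] * 990 + [20.0] * 10   # 1000 requests
host_b = [500.0] * 10                 # 10 requests

average_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)

# Histograms, unlike percentiles, can simply be added together.
merged = Counter(host_a) + Counter(host_b)
p99_from_merged = percentile_from_histogram(merged, 99)

print(average_of_p99s, true_p99, p99_from_merged)
```

With these numbers the averaged p99 is 255 ms, while the p99 across all requests is 20 ms, which is also what the merged histogram reports.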

For SLOs: if you aggregate over longer time periods (buckets/bins) than your SLO allows, you are screwed. Example: with 99.99% availability you can only be down for about 4m 23s per month. If you aggregate your values in 5-minute buckets and had a massive issue for only one minute, which would be within what is allowed, you still have to account for 5 minutes of downtime.
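
The arithmetic behind this can be checked with a short sketch. It assumes an average month of 30.44 days (my assumption, your SLO may define its period differently). Under that assumption, 99.99% leaves roughly 4m 23s of downtime per month and 99.9% roughly 43m 50s.

```python
import math

# Assumption: an average month of 30.44 days.
AVG_MONTH_SECONDS = 30.44 * 24 * 60 * 60


def error_budget_seconds(slo_percent: float, period_seconds: float = AVG_MONTH_SECONDS) -> float:
    """Allowed downtime in seconds for a given availability SLO."""
    return period_seconds * (100 - slo_percent) / 100


def accounted_downtime_seconds(outage_seconds: float, bucket_seconds: float) -> float:
    """Downtime you have to book when data only exists at bucket granularity.

    Best case: the outage touches as few buckets as possible.
    """
    return math.ceil(outage_seconds / bucket_seconds) * bucket_seconds


def fmt(seconds: float) -> str:
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


budget = error_budget_seconds(99.99)
booked = accounted_downtime_seconds(outage_seconds=60, bucket_seconds=300)

print("budget for 99.99%:", fmt(budget))
print("budget for 99.9%: ", fmt(error_budget_seconds(99.9)))
print("one-minute outage booked as:", fmt(booked))
print("real outage within budget:  ", 60 <= budget)
print("booked outage within budget:", booked <= budget)
```

A one-minute outage fits easily into the 99.99% budget, but booked as a full 5-minute bucket it exceeds the budget for the whole month.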

principles

Do not measure rates. The rate of change over time can be derived at query time
monitor outside of your tech stack
monitor what is important to the health of your org, not necessarily tech values
do not silo data: correlation is key, everybody should have all data available
Value observation of real work over the measurement of synthesized work, e.g. don’t run a DB test query, but check the results of actual queries
synthesize work to ensure function, but only for business-critical, low-volume events
percentiles are not histograms
for SLO management you need to store histograms for post-processing
history is critical
keep your data for a long time (e.g. for capacity planning, reanalysis)
stay outside of the blast radius
your monitoring should still work if the rest doesn’t
something is better than nothing
Don’t let perfect be the enemy of good. You have to start somewhere
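
The first principle, storing raw counters and deriving rates at query time, can be illustrated with a small sketch. The sample data and the `rate_per_second` helper are made up for illustration, and counter resets are deliberately ignored; real time-series databases ship their own query functions for this.

```python
from typing import List, Tuple

# (timestamp in seconds, monotonically increasing counter value)
Sample = Tuple[float, int]


def rate_per_second(samples: List[Sample], start: float, end: float) -> float:
    """Derive the average rate over [start, end] from raw counter samples.

    Assumes samples are sorted by time and the counter never resets.
    """
    window = [s for s in samples if start <= s[0] <= end]
    if len(window) < 2:
        raise ValueError("need at least two samples in the window")
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical request counter, scraped once per minute.
requests_total = [(0, 0), (60, 600), (120, 1800), (180, 1860)]

print(rate_per_second(requests_total, 0, 60))     # first minute
print(rate_per_second(requests_total, 60, 120))   # second minute
print(rate_per_second(requests_total, 0, 180))    # whole period
```

Because only the counter is stored, the same data answers questions for any window chosen later. A pre-computed rate would have fixed the window at collection time.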

alerts require documentation

No ruleset should trigger an alert without:

human readable explanation
business impact description
remediation procedure
escalation documentation
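
One way to enforce this is a check that rejects alert rules with missing documentation, for example in CI. The following is only a sketch: the field names and the rule format are hypothetical and not taken from any particular monitoring system.

```python
from typing import Dict, List

# Hypothetical field names for the four required pieces of documentation.
REQUIRED_DOC_FIELDS = ("explanation", "business_impact", "remediation", "escalation")


def missing_documentation(rule: Dict[str, str]) -> List[str]:
    """Return the documentation fields that are absent or blank in an alert rule."""
    return [field for field in REQUIRED_DOC_FIELDS if not rule.get(field, "").strip()]


# Made-up alert rules.
documented_rule = {
    "name": "checkout_error_rate_high",
    "explanation": "More than 2% of checkout requests fail.",
    "business_impact": "Customers cannot complete purchases.",
    "remediation": "Roll back the last checkout deployment.",
    "escalation": "Page the payments on-call, then the incident manager.",
}
undocumented_rule = {
    "name": "disk_almost_full",
    "explanation": "Disk usage is above 90%.",
    "remediation": "   ",
}

for rule in (documented_rule, undocumented_rule):
    gaps = missing_documentation(rule)
    print(rule["name"], "OK" if not gaps else f"missing: {', '.join(gaps)}")
```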

Have you tried turning it off and on again

“What would happen if all of your systems rebooted at once?” was the overall question of this talk.

Here are the tips shared:

small fallback plans: small, simple, solid
test everything
manage dependencies
architect in layers

Manage dependencies:

between services, not libraries
more services = more dependencies
your availability can’t be better than the worst dependency you have
check for cyclic dependencies, Graphviz may help
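
Both points can be sketched in a few lines. The service graph and the availability figures below are invented, and the availability calculation assumes hard dependencies that fail independently, which is a simplification.

```python
import math
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: we found a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None


def best_case_availability(own: float, dependencies: List[float]) -> float:
    """Upper bound on availability, assuming hard and independent dependencies."""
    return own * math.prod(dependencies)


# Hypothetical service graph: "auth" calls back into "api".
services = {"frontend": ["api"], "api": ["auth", "db"], "auth": ["api"], "db": []}
print(find_cycle(services))

# Hypothetical availabilities of two dependencies.
deps = [0.999, 0.99]
print(best_case_availability(0.9999, deps), "<=", min(deps))
```

In a cold start, a cycle like this means neither service can come up before the other, which is exactly the situation the talk warned about.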

summary

If there is one key takeaway from the conference it is:

Most of the technical issues we have in our field are pretty well understood, and most of them are solved. What most companies are struggling with is everything related to people and organization.