Random notes about random cloud and photo stuff

AWS multi account infrastructure

Back in October 2016 I gave a talk at the Munich AWS User Group about managing AWS multi-account infrastructure at glomex. This post summarizes the talk and initially appeared on the glomex techblog.

You can find the slides on Speaker Deck.

Why we implemented an AWS multi account strategy

As for the motivation to take on the complexity of such a setup, there are a lot of good reasons:

  • It’s recommended by AWS for security and billing concerns.
  • Concerning security, you can control access on a much more granular level. For example, you don’t have to design IAM rules so that one team can’t access another team’s resources (say, EC2 instances), because those resources are “physically” separated from each other. Billing-wise, you instantly get more insight into how much a team is spending or how much a certain product costs, without needing to build reports based on billing tags. You can still use tags within the accounts to drill down further into how much a certain aspect of your infrastructure or product contributes to your AWS spending.
  • Mimic your organization’s hierarchy. Sometimes, different departments have different guidelines or budgets so they want to be “in control” and not share accounts in order to avoid interference.
  • Separation of concerns: Separate your staging environments from each other and, most importantly, from your production environment. This makes it much harder to interconnect services from different stages, which results in better isolated systems and possibly more consistency. It also protects your production environment from hitting either AWS account limits or API rate limits. Nothing is more embarrassing than an AutoScalingGroup that can’t scale up because a load test run by the QA team eats up all of your EC2 instance allowance, right? Or you can’t deploy because a rogue developer script triggers rate limiting for the AWS API in that account.

This separation also helps with decommissioning services. Sure, you should have all your infrastructure as code, but in some cases it still makes sense to shut down a complete account, including deleting all namespaces, and start fresh. In the end, it all comes down to treating AWS accounts as just another volatile, non-static resource that can be “instantiated” as you wish.

In a nutshell, all of those measures also help to minimize the blast radius when things go wrong. Account limits, API limits, rogue scripts, security incidents: all of those can be confined to a smaller section of your whole cloud infrastructure by having a multi-account strategy.

The image above shows the rough glomex setup. Each of our teams gets a set of up to four AWS accounts which they use for their environments. The “Team Ops” accounts have some special IAM roles set up for accessing the other accounts, but are otherwise no different from the development team accounts.

Furthermore, we have one master billing account which also serves as the account that holds our IAM users. This means that none of our other accounts have any users configured (except for some special third-party tools which can’t deal with STS). We enforce two-factor authentication, and users must then switch to one of a set of preconfigured roles to access any of the other accounts. This way we can easily manage who gets access to which account. The users are kept directly within AWS with no active connection to our user management backend. Instead, we sync users from our LDAP to IAM whenever we make changes to a user, so if our LDAP fails, we can still access AWS.

Finally, we have another special account which contains all of our CloudTrail logs. This account has very restricted access as it contains sensitive data.

For the management of our accounts, we have created an internal tool that handles the creation of VPCs, IAM federation, and the deployment of some basic resources. We are very much looking forward to the GA of AWS Organizations, since there is still a lot of manual work (such as the initial account registration on the AWS website) that we haven’t automated yet.
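
The role-switching setup described above relies on every target account trusting the central account that holds the IAM users. As a minimal sketch (not the actual glomex configuration), the following Python builds the kind of IAM trust policy such a role could carry: it lets principals from one account call sts:AssumeRole, but only if they signed in with MFA. The function name and the account ID are made up for illustration.

```python
import json


def build_trust_policy(users_account_id: str) -> dict:
    """Return a trust policy that lets identities from the central users
    account assume this role, provided they authenticated with MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust the account that holds all IAM users.
                "Principal": {"AWS": f"arn:aws:iam::{users_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Reject sessions that were not authenticated with MFA.
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


if __name__ == "__main__":
    # 111111111111 is a placeholder account ID, not a real one.
    print(json.dumps(build_trust_policy("111111111111"), indent=2))
```

Trusting the account root only delegates the decision to the users account. Which individual users may actually assume the role is then controlled by IAM policies there, which is what keeps access management in one place.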

Pitfalls we discovered

Along the way we found some pitfalls that sometimes were surprising, but eventually we found ways to work around all of them:

  • Tool support for cross-account access (STS) is not as good as expected. Since the various AWS SDKs usually handle authentication and authorization, we were sometimes surprised that some tools explicitly wouldn’t understand STS credentials:
    • S3cmd
    • Serverless framework (at least at the beginning of 2016; we have since stopped using it)
    • Kinesis agent (has been fixed)
  • AWS support for cross-account resource access is underwhelming:
    • API gateway is always public
    • VPC security groups cannot be used across accounts
    • S3 bucket permissions between more than 2 accounts are a mess
    • Complex trust relationships between accounts needed
  • Complex networking setup: We have decided against peering for all accounts but still do peer to some accounts
  • Hard to get a good overview of all accounts with AWS tools:
    • Billing works fairly well with Cost Explorer
    • Metrics are best handled with a third-party tool; we use Datadog
    • Config Rules are too expensive
  • Costs do multiply
    • Config Rules
    • Support costs
    • VPN connections
  • User support and education is more demanding
    • User federation / STS hard to grasp for sporadic users

Tools we used

As there is surprisingly little tooling out there to support such setups, we have created some custom tools which we are in the process of open-sourcing:

  • LDAP – IAM sync
  • LDAP SSH key user management for EC2 instances
  • Account / environment detection for services as a safety net
  • Base setup tool
  • Custom deployment tools (CloudFormation, Lambda, CodeDeploy, API Gateway)
  • Account creation automation
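
The account / environment detection item in the list above can be illustrated with a small guard. This is a sketch under assumptions, not the glomex tool: the account IDs, the mapping, and all function names are hypothetical. The account ID is passed in as a parameter here; in practice it could come from the STS GetCallerIdentity call.

```python
# Hypothetical mapping of AWS account IDs to environments.
ACCOUNT_ENVIRONMENTS = {
    "111111111111": "dev",
    "222222222222": "staging",
    "333333333333": "prod",
}


class WrongAccountError(RuntimeError):
    """Raised when a service runs in an account it was not configured for."""


def detect_environment(account_id: str) -> str:
    """Map the current account ID to an environment name."""
    try:
        return ACCOUNT_ENVIRONMENTS[account_id]
    except KeyError:
        raise WrongAccountError(f"unknown AWS account {account_id}") from None


def assert_environment(account_id: str, expected: str) -> None:
    """Abort early if the configured environment does not match the account."""
    actual = detect_environment(account_id)
    if actual != expected:
        raise WrongAccountError(
            f"configured for {expected!r} but running in {actual!r} account"
        )
```

A service or deployment script would call assert_environment once at startup, so that a development configuration pointed at production credentials fails before it touches any resources.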

To sum it all up, in our experience the investment in this kind of account structure has already paid off. We had numerous occasions where somebody would have killed a production setup if it had been in the same account (deleting the wrong resources, eating up various AWS limits, etc.). With AWS Organizations, it should also become much easier to get rid of the cumbersome part of initial account creation. For us, that would mean that we will have even more accounts in the future.