At work, our AWS image-building process exists as a set of scripts that a human runs manually, in order, as long as everything looks okay at the terminal.
The problem is that instances launched from the image auto-install security updates at boot; if enough updates accumulate, those instances start failing to become ready for traffic before the timeout is up. The system is guaranteed to decay unless I periodically build a fresh image by hand.
One reason it isn’t automated (say, rebuilding itself weekly using AWS infrastructure) is that the process requires more permissions than normal business operations do. It must be able to run and terminate instances, tag resources, and reconfigure which image Auto Scaling launches. Those would be scary permissions to grant to any instance, since our instances generally have write access only to specific data storage locations.
However, if the various steps (build, test, update configuration) each existed as a separate object visible to AWS, we could give each component its own narrowly scoped permissions. The configuration-update step would be the only one with access to the image ID in SSM Parameter Store; conversely, that step would be forbidden from creating or terminating EC2 instances.
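For a rough, hypothetical picture of what that scoping looks like, the configuration-update step’s IAM policy might be something like the sketch below (the parameter path, account ID, and region are made up):

```python
import json

# Rough sketch of an IAM policy for the configuration-update step only.
# The parameter path, account ID, and region are placeholders, not real values.
config_update_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # It may describe images (to validate the ID it was handed)...
            "Effect": "Allow",
            "Action": ["ec2:DescribeImages"],
            "Resource": "*",
        },
        {
            # ...and write exactly one parameter: the one holding the image ID.
            "Effect": "Allow",
            "Action": ["ssm:PutParameter"],
            "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/our-app/ami-id",
        },
        # Notably absent: ec2:RunInstances, ec2:TerminateInstances, ec2:CreateTags.
    ],
}

print(json.dumps(config_update_policy, indent=2))
```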
Given the implementation details of our particular system, we could split the process into the following components:
- Configuration and test script deployment (run by developers off-AWS; stores to S3)
- Ubuntu base image lookup (read-only; a sketch follows this list)
- Image build sequence (create/wait/destroy EC2 instances, tag instance and image, read configuration scripts from S3, invoke the AWS-maintained SSM Automation to create the image from the instance)
- Image test sequence (create/destroy EC2 instances, read the test script from S3)
- Configuration update (describe the image, update a specific SSM Parameter Store item; a sketch follows this list)
- Garbage collection (read running configuration to determine “unused” status, deregister images, delete their snapshots)
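The base-image lookup (step 2 above) is the sort of thing that fits in a few lines of boto3. Here is a sketch; Canonical’s owner ID and the Ubuntu AMI naming pattern are real, but the choice of 22.04/amd64 and the region are assumptions for the example:

```python
import boto3

# Sketch of the read-only base-image lookup (step 2).
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_images(
    Owners=["099720109477"],  # Canonical
    Filters=[
        {"Name": "name",
         "Values": ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]},
        {"Name": "state", "Values": ["available"]},
    ],
)

# Take the most recently created image as the base for the build.
latest = max(response["Images"], key=lambda image: image["CreationDate"])
print(latest["ImageId"], latest["Name"])
```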
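Likewise, the configuration-update step (step 5) boils down to a describe plus a single parameter write. A sketch, with a made-up parameter name:

```python
import boto3

# Sketch of the configuration-update step (step 5): verify the image exists,
# then record its ID in Parameter Store for Auto Scaling to pick up.
# The parameter name is a placeholder.
ec2 = boto3.client("ec2", region_name="us-east-1")
ssm = boto3.client("ssm", region_name="us-east-1")


def publish_image_id(image_id: str) -> None:
    images = ec2.describe_images(ImageIds=[image_id])["Images"]
    if not images or images[0]["State"] != "available":
        raise RuntimeError(f"{image_id} is not an available image")

    ssm.put_parameter(
        Name="/our-app/ami-id",
        Value=image_id,
        Type="String",
        Overwrite=True,
    )


publish_image_id("ami-0123456789abcdef0")
```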
We only need to redeploy the configuration scripts when we change them; otherwise, a rebuild simply pulls in whatever security updates are available. Nothing inside AWS gets special permission to write these files.
After that, we can run step 2 on our cron host to find an appropriate base AMI. That requires no write access, so all the cron host really needs beyond that is permission to start and monitor the later tasks. Tasks 3–5, in particular, are simple enough (once their inputs are determined in step 2) to run as AWS SSM Automation documents. I imagine there will be a “coordinating” automation that runs those three tasks in order, with the granular tasks existing mainly so each one is easy to debug. Finally, garbage collection is somewhat compute-heavy but requires no waiting, so AWS Lambda might be the best option for it.
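From the cron host’s point of view, kicking off that coordinating automation might look like the sketch below; the document name and its parameter names are placeholders for whatever we end up writing:

```python
import boto3

# Sketch of the cron host's role: start the (hypothetical) coordinating SSM
# Automation document with the base AMI found in step 2, then check on it.
ssm = boto3.client("ssm", region_name="us-east-1")

execution_id = ssm.start_automation_execution(
    DocumentName="our-app-build-test-update",
    Parameters={"BaseImageId": ["ami-0123456789abcdef0"]},
)["AutomationExecutionId"]

# The cron host only needs ssm:StartAutomationExecution and
# ssm:GetAutomationExecution, not any of the permissions the steps themselves use.
execution = ssm.get_automation_execution(AutomationExecutionId=execution_id)
print(execution_id, execution["AutomationExecution"]["AutomationExecutionStatus"])
```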
The end result will be a major improvement: each step will actually run with minimal privileges, and the union of all those privileges is still much less than the admin access I am technically granting to the process today. The current arrangement “works” only in the sense that we trust my entire laptop; but what if we didn’t need to?