Decoded Node: Some Solutions to a Problem

We have an EC2 instance with a quickly-produced shell script that runs on boot. The script points a DNS name at the instance’s public IPv4 address. Since time was of the essence, it hard-codes everything about this, in particular the DNS name to use.

This means that if we bring up a copy of this instance from a snapshot of its root volume, the copy will overwrite the DNS record for the production service. We need to stop this.

As a side project, it would be nice to remove the hard-coding of the DNS name. It would be trivial to “stop DNS name conflicts” if we did not have a DNS name stored on the instance’s disk image to begin with.

What are the options?

Tag the Instance

My first thought would be to tag the instance. If the production instance has role=production, then the boot script can choose to update DNS only if it sees that tag. We are unlikely to set a tag like that on a copied instance, and if we did, anyone reviewing the copy’s tags would know that production is definitively incorrect.

This would also give us the ability to set dnsName=foo.example.org through the tags, removing the hard-coding of the name.

The downside with this option is that it requires a “Tags in Metadata” setting, default false, to be activated. If we forget this setting, the script can’t read the tags, so no DNS update happens. That’s a feature for a copied instance, but a bug for production instances. Also, if AWS feels that tags are sensitive enough to hide by default, then maybe we shouldn’t expose them after all.
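As a rough sketch of the tag check, assuming IMDSv2 and that “Tags in Metadata” is enabled: the tag keys role and dnsName are the ones mentioned above, while the function names and the commented-out update_dns step are hypothetical stand-ins for whatever the existing boot script does.

```shell
#!/bin/bash
# Sketch: update DNS only when the instance is tagged role=production.
# Assumes IMDSv2 and that "Tags in Metadata" is enabled for this instance.

IMDS=http://169.254.169.254/latest

# Fetch one metadata path. Fails (non-zero) if the path is missing,
# which is what happens for tags when "Tags in Metadata" is off.
imds_get() {
    local token
    token=$(curl -sf -X PUT "$IMDS/api/token" \
        -H 'X-aws-ec2-metadata-token-ttl-seconds: 60') || return 1
    curl -sf -H "X-aws-ec2-metadata-token: $token" "$IMDS/meta-data/$1"
}

# Decide from the tag values alone, so the rule is easy to test.
# Prints the DNS name to update and returns 0, or returns 1 to skip.
dns_name_for() {
    local role=$1 dns_name=$2
    [ "$role" = production ] && [ -n "$dns_name" ] || return 1
    printf '%s\n' "$dns_name"
}

main() {
    local role dns_name target
    role=$(imds_get tags/instance/role) || role=
    dns_name=$(imds_get tags/instance/dnsName) || dns_name=
    if target=$(dns_name_for "$role" "$dns_name"); then
        echo "updating DNS for $target"
        # update_dns "$target"   # hypothetical: the existing update step
    else
        echo "not tagged for production; leaving DNS alone" >&2
    fi
}
```

Calling main at the end of the boot script would run the check. A missing tag, a missing dnsName, and a disabled metadata setting all take the same “leave DNS alone” path.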

User Data

To avoid “Tags in Metadata,” the next option is to make use of user-data when launching instances. The question then gets hairy: how do we connect the things together?

If we know the path of the DNS update script in our AMI, instead of running it as a service, we can put a script that is basically exec /path/to/script … into the user-data.

On the other hand, if we aren’t running it as a service, we don’t need to have the script on-instance in advance. We can have a slightly longer user-data shell script, which uses aws s3 cp to fetch the payload, executing it via /bin/bash /tmp/script.
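A minimal sketch of that longer user-data script, assuming the AWS CLI is installed on the image and the instance’s IAM role can read the object. The bucket and key are hypothetical placeholders, and fetch_and_run is just a name chosen for illustration.

```shell
#!/bin/bash
# Sketch of a user-data script: fetch the DNS update payload from S3 and run it.
# The bucket and key below are hypothetical placeholders.

# Copy the payload to a local path, then run it with bash.
# Returns non-zero (and runs nothing) if the download fails.
fetch_and_run() {
    local uri=$1 dest=${2:-/tmp/script}
    aws s3 cp "$uri" "$dest" || return 1
    /bin/bash "$dest"
}

# In real user-data, end the script with a call such as:
# fetch_and_run s3://example-bucket/update-dns.sh
```

Keeping the download and the execution as separate steps means a failed copy never runs a stale or partial file.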

But what if we want to leave the option open to use other user-data formats than “shell scripting”?

Tag the Instance, Redux

We can have a Lambda function read the configuration of the calling instance (to get its tags and public IP), then perform the DNS update on the instance’s behalf. If we fail to tag the copied instance, no DNS update happens.

The straightforward path is to have the on-instance script use the AWS CLI to invoke that Lambda function.
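One way the on-instance side might look, as a sketch only: the function name update-dns and the instanceId payload key are hypothetical, and the instance ID would come from the metadata service (meta-data/instance-id). The payload goes through a file with fileb:// so the JSON is sent as-is.

```shell
#!/bin/bash
# Sketch: ask a Lambda function to do the DNS update for this instance.
# The function name "update-dns" and the "instanceId" payload key are
# hypothetical; the Lambda would look up tags and the public IP itself.

invoke_dns_lambda() {
    local instance_id=$1 function_name=${2:-update-dns}
    local payload_file out_file status
    payload_file=$(mktemp) || return 1
    out_file=$(mktemp) || return 1
    printf '{"instanceId": "%s"}\n' "$instance_id" > "$payload_file"
    # fileb:// sends the file's bytes unchanged as the payload.
    # On success, print the function's response body from the output file.
    aws lambda invoke \
        --function-name "$function_name" \
        --payload "fileb://$payload_file" \
        "$out_file" >/dev/null && cat "$out_file"
    status=$?
    rm -f "$payload_file" "$out_file"
    return "$status"
}
```

The instance’s IAM role would need permission to invoke that one function, and nothing DNS-related.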

Alternatively, a Lambda function can be configured to allow access via URL, and cloud-init happens to have a Phone Home module to invoke some URL. Together, it’s likely we could run the function declaratively, with no shell scripting at all. It also means the “update DNS” operation doesn’t need the instance metadata service, either explicitly or for the IAM Role credentials, to perform the work. Instead, the Lambda function calls EC2 DescribeInstances to read the tags, then performs the DNS update.

In some cases, that might make it feasible to deactivate metadata access from the instance entirely.

I haven’t fully proved this path out (by implementing a URL-activated Lambda function), but it remains theoretically viable.

The Event Bus

But, wait. We can get EC2 instance-state notifications. We could respond to those via Lambda. Once the instance transitions into the running state, we fetch its IP and write it into DNS.

This is more in the realm of ‘spooky action at a distance’ and more tightly integrated with the vendor, but it is a fairly self-contained system. Nobody has to remember to put things together just right when launching the instance, and the design doesn’t require a Lambda URL that is open to (hypothetically) the entire world.
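As a sketch of the rule side only, this is the event pattern that matches an instance entering the running state, wrapped in an aws events put-rule call. The rule name is a hypothetical placeholder. Attaching the Lambda function as the rule’s target, and granting EventBridge permission to invoke it, are left out here.

```shell
#!/bin/bash
# Sketch: create an EventBridge rule that fires when an instance enters
# the "running" state. The rule name is a hypothetical placeholder.

# Event pattern for EC2 instance state-change notifications.
running_pattern() {
    cat <<'EOF'
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["running"]}
}
EOF
}

create_running_rule() {
    local rule_name=${1:-dns-on-running}
    aws events put-rule \
        --name "$rule_name" \
        --event-pattern "$(running_pattern)"
}
```

The matched event carries the instance ID in its detail, under instance-id, which is what the Lambda function would use to look up the public IP.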

Abandon Hope and Hard Code More

We could also hard-code the production instance’s EC2 instance ID into the script. If the instance ID that booted is not the known ID, then we can print a helpful debug message before exiting.

This meets the basic “don’t wreck DNS” goal, but doesn’t move forward on anything else. What it lacks in finesse, it makes up for in speed.
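A sketch of that guard, assuming IMDSv2 is reachable from the instance. The instance ID below is a made-up placeholder, and the function names are only for illustration.

```shell
#!/bin/bash
# Sketch: refuse to touch DNS unless this is the known production instance.
# PRODUCTION_INSTANCE_ID is a made-up placeholder, not a real ID.

PRODUCTION_INSTANCE_ID=i-0123456789abcdef0
IMDS=http://169.254.169.254/latest

# Read this instance's ID from the metadata service (IMDSv2).
current_instance_id() {
    local token
    token=$(curl -sf -X PUT "$IMDS/api/token" \
        -H 'X-aws-ec2-metadata-token-ttl-seconds: 60') || return 1
    curl -sf -H "X-aws-ec2-metadata-token: $token" "$IMDS/meta-data/instance-id"
}

# Return 0 only when the booted instance is the production one;
# otherwise print a debug message and return 1.
check_instance() {
    local actual=$1 expected=${2:-$PRODUCTION_INSTANCE_ID}
    if [ "$actual" != "$expected" ]; then
        echo "instance $actual is not production ($expected); skipping DNS update" >&2
        return 1
    fi
}
```

Near the top of the boot script, a line like check_instance "$(current_instance_id)" || exit 0 would then skip the DNS update on any copy. If the metadata lookup fails, the ID comes back empty and the update is skipped as well.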

Rationale Behind All This

I found out that the awscli package on Debian 11 transitively depends on X11, while it doesn’t in Debian 12. I hunger to get rid of megabytes of code that does nothing but provide potential attack surface.

“Getting rid of X11” resulted in a set of options where the minimal amount of work appears to be, “upgrade to Debian 12.” That’s complex enough that I want to test the procedure on a copy of the instance first. If that goes well, it can be replicated in production.

Other choices included porting the service to Python or Go to permit removal of awscli and X11, but then we are maintaining code in non-primary languages. I definitely don’t want to be the only person who could update this code.

Finally, there’s a sociopolitical issue: because the LTS security updates are handled by the LTS team rather than the security team, it’s desirable to shift off of LTS and up to stable again. There’s no practical problem with the LTS. I just don’t want to have to explain how that works to corporate executives, when I could use a Debian version that has support directly from the Debian security team.

Monday, September 9, 2024
