Sunday, August 18, 2024

The Missing Call

I decided to combine (and minify) some CSS files for our backend administration site, so I wrote the code to load, minify, and output the final stylesheet.  I was very careful: I wrote to a temporary file, checked even the fclose() return code, renamed it into place, and so on.  I even renamed the original to a backup first, so that if that rename succeeded but the rename of the new file into place failed, I could attempt to put the original back.

For style points, I updated it to set the last-modified time of the generated file to the timestamp of the latest input, so that If-Modified-Since headers will work correctly.

I tested it, multiple times, with various states of having the main and backup filenames. It looked great.  I pushed it out to production… and that wasn’t so great.

We just had no styles at all. Yikes!  I had some logic in there for “if production and minified CSS exists, use it; else, fall back to the source stylesheets.”  I hastily changed that to if (false) and pushed another deployment, so I could figure out what happened.

It didn’t take long.  The web server log helpfully noted that the site.min.css file wasn’t accessible to the OS user.

I had used tempnam(), which created an output file with mode 0600, rw-------.  Per long-standing philosophy, the deployment runs as a separate user from the web server, so a file that’s only readable by the deployer can’t be served by the web server.  Oops.

I had considered the direct failure modes of all of the filesystem calls I was making, but I hadn’t considered the indirect consequences of the actions being performed.  I added a chmod(0o644) call and its error check, and deployed again.  After that, the site worked.
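The whole routine, as a minimal sketch reconstructed from the description above (the function name and exact structure are my own, not the original code):

```php
<?php
// Sketch of the careful-write routine: temp file, checked fclose(),
// the chmod() that was originally missing, mtime matching, and
// backup/rename with best-effort rollback.
function writeMinifiedCss(string $target, string $css, int $sourceMtime): bool
{
    // tempnam() creates the file with mode 0600 -- readable only by
    // the deploying user, which is what broke the site.
    $tmp = tempnam(dirname($target), 'cssmin');
    if ($tmp === false) {
        return false;
    }

    $fh = fopen($tmp, 'wb');
    if ($fh === false) {
        @unlink($tmp);
        return false;
    }
    $ok = fwrite($fh, $css) !== false;
    $ok = fclose($fh) && $ok;   // check even the fclose() return code

    // The missing call: make the file readable by the web server's user.
    $ok = $ok && chmod($tmp, 0644);

    // Match the newest input's timestamp so If-Modified-Since works.
    $ok = $ok && touch($tmp, $sourceMtime);

    if (!$ok) {
        @unlink($tmp);
        return false;
    }

    // Keep a backup so a failed rename into place can be rolled back.
    $backup = $target . '.bak';
    if (file_exists($target) && !rename($target, $backup)) {
        @unlink($tmp);
        return false;
    }
    if (!rename($tmp, $target)) {
        @rename($backup, $target);   // best-effort rollback
        @unlink($tmp);
        return false;
    }
    return true;
}
```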

Sunday, August 11, 2024

Our Long-Term AWS CloudSearch Experience

AWS has announced the deprecation of CloudSearch, among other services, just as I wanted to share why we chose it, and how it worked out.

Competitors

The field we considered when choosing CloudSearch included Sphinx, ElasticSearch (the real one and AWS’ ripoff), MySQL FULLTEXT indexes, rolling our own in-SQL trigram search, and of course, CloudSearch.

We had operational experience with Sphinx. It performed well enough, but it is oriented toward documents, not the smaller constellation of attributes I was interested in here.  It took quite a chunk of memory to index our tickets (description/comments), required a pet machine, and didn’t vibe correctly with the team.  I didn’t want to commit to putting 100 times more entries in it, then defending it politically for all eternity.

ElasticSearch appeared to be hyper-focused on log searching specifically, more like what we’re already doing with Papertrail.  It was not clear that it could be used for other purposes, let alone how to go about such things.

We actually had an in-SQL trigram search already, but only for customer names.  I built it because MySQL’s full-text index features were not in great health at the time. (I had thought full-text indexes were deprecated ever since, but in checking today, that appears not to be the case.  Even the MySQL 9.0 docs don’t mention any deprecation.) I started populating an additional trigram index for all the data I was interested in searching, and it blew up our storage size so fast that I had to stop and definitely find something else. That’s also how I found out that RDS can’t reclaim storage; once it expands, it has expanded for good.

The problem with using MySQL’s full-text indexing was the related integer fields that needed to be indexed.  We wanted to have a general search field, where the user could put in “Sunesh” or “240031” and get the related customer or transaction number, without a complex multi-part form.  Doing that with nothing but MySQL features seemed difficult and/or slow.

“Do nothing” wasn’t really an alternative, either; to search all the relevant fields, MySQL wanted to do two full table scans.  Searches would be running against the largest tables in the database, which makes even a single full scan prohibitively expensive.

CloudSearch

CloudSearch got a great review in my collection of blurbs about AWS services, but further experience has been somewhat less rosy.

For background, CloudSearch is arranged into one-dimensional domains, with a limited second dimension in the form of array attributes.  To contain costs, I chose to index our customers, attaching their VINs as array attributes, rather than have separate customer and vehicle domains or redundantly index the customer attributes on every vehicle.  This results in a domain with 2.5M records.  (Doing some serious guesswork, that means around 12M contracts in total.  Give or take a couple million.)
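To illustrate the shape (the field names here are invented for the example, not our actual schema), a CloudSearch document batch for such a domain looks roughly like this — one record per customer, with the vehicles folded in as array fields:

```json
[
  {
    "type": "add",
    "id": "customer-1002441",
    "fields": {
      "name": "Sunesh Example",
      "contract_numbers": ["240031", "240032"],
      "vins": ["1HGCM82633A004352", "2HGES16575H591230"]
    }
  }
]
```

This is also what makes the general search field work: a query for “Sunesh” or “240031” hits the same record, with no multi-part form.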

Things worked fine with a ‘small’ search instance for a while, but it didn’t handle bursty traffic.  Last month, I resized the instance to ‘medium’, and rebuilt the index… which took an unknown number of hours between 2 and 18, inclusive.

Why don’t I know exactly how long it took?  Well, that’s the next problem: metrics. CloudSearch only keeps metrics for three hours, and doesn’t have an event log.  (They appear to go into CloudWatch, but with a custom 3-hour expiration time.) When did the rebuild finish?  Dunno!  Did the system get overwhelmed overnight?  Too bad; that’s gone! With the basic metrics being so anemic, there’s definitely nothing as useful as RDS’ Performance Insights, which is what I would really want here.

Our instance has managed to survive adequately at medium for a while, but I don’t know when I’ll have to scale it up as we roll out this search to more parts of the system.  We just don’t have the metrics here to plan capacity.

Considering that, and AWS’s deprecation of the service, I would love to have an alternative… except what I want would just be CloudSearch, improved.

Wednesday, August 7, 2024

AWS CodeDeploy’s Blue/Green Deployment Gotcha

Once, well after I no longer remembered how the whole thing was bootstrapped, I accidentally deleted the target group associated with a CodeDeploy application that was configured for blue/green deployment.  That’s how I found out (rediscovered?) that CodeDeploy doesn’t create a target group for blue/green deployments; it copies an existing one.  Since I had just deleted that existing one, I couldn’t do a (re)deployment and bring the system back online!

(Also, it cemented my opinion that prompts should be like, “Type ‘delete foo-production-dAdw4a1Ta’ to delete the target group” rather than “Type ‘delete’ to delete.” Guess which way the AWS Console is set up.)

I started up an instance to add to a new target group, and it promptly fell over.  The AMI had health monitoring baked in, and one of the health checks was “CodeDeploy has installed the application on the instance.”  Since it was not CodeDeploy starting the instance for the purpose of installing the application, the health check failed, and Auto Scaling dutifully pulled it down to replace it.

Meanwhile, the lack of healthy instances was helpfully sending alerts and bringing my boss’ attention to the problem.

[Now I wonder if it could have worked to issue a redeploy at this point.  The group was there to copy, even if the instances weren’t functional.  I guess we’ll never know; I’m not about to delete the target group again, just to find out!]

I ended up flipping the configuration to using EC2 health checks instead of HTTP, and then everything was stable enough to issue a proper redeployment through CodeDeploy.  With service restored, I finally put the health checks back to HTTP.
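For reference, that flip can be done with two standard CLI calls (the group name here is hypothetical; these mutate live infrastructure, so there is nothing to assert locally):

```shell
# Temporarily judge instance health by EC2 status checks only, so
# Auto Scaling stops replacing instances that fail the HTTP check.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-blue-green-asg \
    --health-check-type EC2

# Once a proper CodeDeploy deployment has restored the application,
# go back to load balancer (HTTP) health checks.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-blue-green-asg \
    --health-check-type ELB \
    --health-check-grace-period 300
```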

And then, with production in service again, I finally got to work on moving staging from in-place to blue/green.  Ironically, I would have learned the lesson either way; but by breaking production, it really stuck with me.

Sunday, August 4, 2024

qbXML is REST – Distilled

The design of Quickbooks XML is fundamentally REST.  Allow me to rephrase an old post of mine that spent way too many words on this.

The Quickbooks Web Connector (QBWC) would run on the client with the Quickbooks GUI, and periodically make calls out to a SOAP server to set up a session, get “qbXML” documents, and execute them.

Each of those documents contained a series of elements that essentially mapped to commands within Quickbooks.  To make an edit request, one included the type of object being edited, its ID, its sequence number (for conflict detection), and the desired changes.  Crucially, everything Quickbooks needed to carry out that request was embedded within the XML.  The XML could only reference objects that existed inside of Quickbooks.  There was no concept of “session data,” “temporary IDs,” locks, or anything of the sort, and no way to create or access them.

If memory serves, one could “name” objects being created, then reference them later by that name within the same qbXML document.  Thus, “create a new thing and update something else to reference it” was expressible.
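From memory — so treat the exact tag names as approximate — an edit request looked something like this, with ListID identifying the object and EditSequence carrying the conflict-detection sequence number:

```xml
<?xml version="1.0"?>
<?qbxml version="13.0"?>
<QBXML>
  <QBXMLMsgsRq onError="stopOnError">
    <CustomerModRq>
      <CustomerMod>
        <ListID>80000001-1234567890</ListID>
        <EditSequence>1191523643</EditSequence>
        <Phone>555-0100</Phone>
      </CustomerMod>
    </CustomerModRq>
  </QBXMLMsgsRq>
</QBXML>
```

Everything Quickbooks needs to perform the edit is right there in the document; no server-side session state is referenced.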

In other words, qbXML transferred a complete representation of the state necessary to complete the request; therefore, by my understanding, it is REST.

The overall system wasn’t pure REST.  Everything happened within the context of “a session” which had “a specific company file” open in the GUI.  Outside of that, the fact that SOAP/WSDL (normally a full-blown RPC mechanism) was the transport was practically irrelevant.

I’m also aware there is no HTTP, and thus no integration with HTTP features, no URLs, and no HATEOAS.  However, I don’t think these things are required to call something REST; they are simply things that REST was co-popularized with.