Sunday, August 11, 2024

Our Long-Term AWS CloudSearch Experience

AWS has announced the deprecation of CloudSearch, among other services, just as I was about to share why we chose it and how it has worked out.

Competitors

The field we considered included Sphinx, ElasticSearch (the real one and AWS’ ripoff), MySQL FULLTEXT indexes, rolling our own in-SQL trigram search, and of course, CloudSearch itself.

We had operational experience with Sphinx. It performed well enough, but it is oriented toward documents, not the smaller constellation of attributes I was interested in here.  It took quite a chunk of memory to index our tickets (description/comments), required a pet machine, and didn’t vibe correctly with the team.  I didn’t want to commit to putting 100 times more entries in it, then defending it politically for all eternity.

ElasticSearch appeared to be hyper-focused on log search, much like what we were already doing with Papertrail.  It was not clear that it could be used for other purposes, let alone how to go about it.

We actually had an in-SQL trigram search already, but only for customer names.  I built it because MySQL’s full-text index features were not in great health at the time.  (I had thought full-text indexes were deprecated ever since, but in checking today, that appears not to be the case; even the MySQL 9.0 docs don’t mention any deprecation.)  I started populating an additional trigram index covering all the data I wanted to search, and it blew up our storage size so fast that I had to stop and find something else entirely.  That’s also how I found out that RDS can’t reclaim storage; once it expands, it has expanded for good.
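
For the curious, the trigram approach boils down to something like the Python sketch below (ours lived in SQL, and the helper names here are invented, but the principle is the same): every searchable string gets exploded into its 3-character substrings, and those substrings are what actually get indexed and compared.

    def trigrams(text):
        """Return the set of 3-character substrings of a normalized string."""
        s = text.lower().strip()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def similarity(query, candidate):
        """Jaccard overlap of trigram sets; higher means a closer match."""
        q, c = trigrams(query), trigrams(candidate)
        return len(q & c) / max(len(q | c), 1)

    print(trigrams("Sunesh"))                    # {'sun', 'une', 'nes', 'esh'}
    print(similarity("Sunesh", "Sunesh Kumar"))  # 0.4

Each indexed string becomes several index rows, which is exactly why turning this on for every searchable column inflated storage so quickly.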

The problem with using MySQL’s full-text indexing was the related integer fields that also needed to be searchable.  We wanted a general search field where the user could type “Sunesh” or “240031” and get back the related customer or transaction number, without a complex multi-part form.  Doing that with nothing but MySQL features seemed difficult and/or slow.
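
Concretely, a single “search anything” box on top of plain MySQL ends up looking roughly like this sketch: a full-text match on names unioned with an exact match on numbers.  The table and column names are invented, and it assumes a working FULLTEXT index on the name column, which is exactly what we didn’t trust at the time.

    def general_search(conn, term):
        """One box, two lookups: full-text on customer names plus an exact
        match on transaction numbers.  Schema names are hypothetical."""
        parts = [
            "SELECT 'customer' AS kind, id, name AS label "
            "FROM customers WHERE MATCH(name) AGAINST (%s)"
        ]
        params = [term]
        if term.isdigit():
            # Only consult the transactions table when the input looks numeric.
            parts.append(
                "SELECT 'transaction' AS kind, id, CAST(number AS CHAR) "
                "FROM transactions WHERE number = %s"
            )
            params.append(int(term))
        cur = conn.cursor()  # any DB-API connection to MySQL
        cur.execute(" UNION ALL ".join(parts), tuple(params))
        rows = cur.fetchall()
        cur.close()
        return rows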

“Do nothing” wasn’t really an alternative, either; to search all the relevant fields, MySQL wanted to do two full table scans.  Searches would be running against the largest tables in the database, which makes even a single full scan prohibitively expensive.

CloudSearch

CloudSearch got a great review in my collection of blurbs about AWS services, but further experience has been somewhat less rosy.

For background, CloudSearch is arranged into one-dimensional domains, with a limited second dimension in the form of array attributes.  To contain costs, I chose to index our customers, attaching their VINs as array attributes, rather than have separate customer and vehicle domains or redundantly index the customer attributes on every vehicle.  This results in a domain with 2.5M records.  (Doing some serious guesswork, that means around 12M contracts in total.  Give or take a couple million.)
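
In boto3 terms, that layout looks roughly like the sketch below: one document per customer, VINs as an array field, and a single query against the domain’s search endpoint.  The endpoints, field names, and IDs are placeholders, not our real schema.

    import json
    import boto3

    # CloudSearch gives every domain separate document and search endpoints.
    doc_client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-customers-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
    )
    search_client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://search-customers-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
    )

    # One record per customer, with the vehicles riding along as an array
    # field instead of getting a domain of their own.
    batch = [
        {
            "type": "add",
            "id": "customer-1001",
            "fields": {
                "name": "Sunesh Example",
                "vins": ["1HGCM82633A004352", "2T1BURHE5JC123456"],
            },
        }
    ]
    doc_client.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )

    # The single search box just passes the user's text through; the simple
    # query parser matches it against the searchable text fields.
    result = search_client.search(query="Sunesh", queryParser="simple", size=10)
    for hit in result["hits"]["hit"]:
        print(hit["id"], hit.get("fields"))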

Things worked fine with a ‘small’ search instance for a while, but it didn’t handle bursty traffic.  Last month, I resized the instance to ‘medium’ and rebuilt the index… which took an unknown number of hours, somewhere between 2 and 18 inclusive.
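
For reference, the resize itself is only a couple of boto3 calls (the domain name below is a placeholder); what neither call tells you is how long the resulting rebuild will take, so you’re left polling.

    import boto3

    cs = boto3.client("cloudsearch", region_name="us-east-1")

    # Ask for the bigger search instance...
    cs.update_scaling_parameters(
        DomainName="customers",  # placeholder domain name
        ScalingParameters={
            "DesiredInstanceType": "search.medium",
            "DesiredReplicationCount": 1,
            "DesiredPartitionCount": 1,
        },
    )

    # ...then kick off the re-index the change requires.
    cs.index_documents(DomainName="customers")

    # Polling describe_domains is the only way to see whether it's done.
    status = cs.describe_domains(DomainNames=["customers"])["DomainStatusList"][0]
    print(status["Processing"], status["RequiresIndexDocuments"])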

Why don’t I know exactly how long it took?  Well, that’s the next problem: metrics. CloudSearch only keeps metrics for three hours, and doesn’t have an event log.  (They appear to go into CloudWatch, but with a custom 3-hour expiration time.) When did the rebuild finish?  Dunno!  Did the system get overwhelmed overnight?  Too bad; that’s gone! With the basic metrics being so anemic, there’s definitely nothing as useful as RDS’ Performance Insights, which is what I would really want here.
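
The numbers that do exist are reachable the usual CloudWatch way, under the AWS/CloudSearch namespace with DomainName and ClientId (account ID) dimensions; you just can’t ask for much history.  A sketch, with placeholder identifiers:

    import datetime
    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")
    now = datetime.datetime.now(datetime.timezone.utc)

    # Request counts for the last three hours -- which is about all you can
    # ever get back.  Domain name and account ID are placeholders.
    resp = cw.get_metric_statistics(
        Namespace="AWS/CloudSearch",
        MetricName="SuccessfulRequests",
        Dimensions=[
            {"Name": "DomainName", "Value": "customers"},
            {"Name": "ClientId", "Value": "123456789012"},
        ],
        StartTime=now - datetime.timedelta(hours=3),
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])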

Our instance has managed to survive adequately at medium for a while, but I don’t know when I’ll have to scale it up as we roll out this search to more parts of the system.  We just don’t have the metrics here to plan capacity.

Considering that, and its deprecation by AWS, I would love to have an alternative… except the alternative I really want would just be CloudSearch, improved.
