Sunday, September 22, 2024

CloudSearch's Tricky prefix Operator

We ran into an interesting problem with CloudSearch.  Maybe I did it wrong, but I stored customer names in CloudSearch as “text” type with “English” analysis.  Our generic search bar runs prefix searches across several fields at once, like (or (prefix field=name 'moon') (prefix field=address 'moon')).
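For illustration, here is a minimal sketch of how a query like that might be assembled.  The function name and the field list are hypothetical, not our actual code; the escaping follows the structured query syntax, where apostrophes and backslashes inside a string are escaped with a backslash.

```python
def build_prefix_query(term, fields):
    """Build a CloudSearch structured query that prefix-matches `term`
    against each field in `fields` (hypothetical helper for illustration)."""
    # Escape backslashes first, then apostrophes, so the term is a valid
    # single-quoted string in the structured query syntax.
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    clauses = [f"(prefix field={field} '{escaped}')" for field in fields]
    if len(clauses) == 1:
        return clauses[0]
    return "(or " + " ".join(clauses) + ")"


print(build_prefix_query("moon", ["name", "address"]))
```

The resulting string would be sent with the structured query parser selected (q.parser=structured).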

Then, a developer found that a search term of “john” would find customers with a name of “johns”, but a search for “johns” would not!  The root cause turned out to be that the English analyzer stems everything that is a plausible plural, storing “Johns” as “john”.

Normally, this isn’t a problem, because a regular search stems the search term the same way, so “johns” becomes “john” and matches.  With a prefix search, and only with a prefix search, stemming is not applied to the search term.  Thus, a prefix search for “johns” is compared against the stored “john”: it will match “johnson” but not “johns”.  Doing a regular search through the CloudSearch Console will turn up the expected customers, and checking the database directly will show them too, adding to the confusion.  Prefix search even works as expected with most names, because “Karl” or “Packard” don’t look like plurals.
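The mismatch is easy to reproduce with a toy model.  This is only a sketch: toy_stem is a crude stand-in I made up (strip one plural-looking “s”), not the real English analyzer, and the index is just a dictionary.

```python
def toy_stem(token):
    """Crude stand-in for the English analyzer: strip one plural-looking 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(names):
    """Map each stored (stemmed) term to the original names that produced it."""
    index = {}
    for name in names:
        index.setdefault(toy_stem(name), []).append(name)
    return index


def regular_search(index, term):
    # A regular search stems the query too, so it lines up with the index.
    return sorted(index.get(toy_stem(term), []))


def prefix_search(index, term):
    # A prefix search uses the raw query against the stemmed stored terms.
    raw = term.lower()
    return sorted(
        name
        for stored, names in index.items()
        if stored.startswith(raw)
        for name in names
    )


index = build_index(["Johns", "Johnson", "Karl"])
print(prefix_search(index, "john"))    # ['Johns', 'Johnson']
print(prefix_search(index, "johns"))   # ['Johnson']
print(regular_search(index, "johns"))  # ['Johns']
```

“Johns” is stored as “john”, which does not start with “johns”, so the prefix search skips it while the regular search still finds it.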

We added a custom analyzer with no stemming, set our text fields to use it, and reindexed.
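Roughly, the change looks like the sketch below.  It assumes a boto3 “cloudsearch” configuration client is passed in as client; the domain name, scheme name, and helper functions are hypothetical.  Note that redefining a field this way only sets the analysis scheme, so any other TextOptions your field relies on would need to be included as well.

```python
def analysis_scheme_request(domain, scheme_name):
    """Request for an English analysis scheme with algorithmic stemming off."""
    return {
        "DomainName": domain,
        "AnalysisScheme": {
            "AnalysisSchemeName": scheme_name,
            "AnalysisSchemeLanguage": "en",
            "AnalysisOptions": {"AlgorithmicStemming": "none"},
        },
    }


def text_field_request(domain, field, scheme_name):
    """Request that points a text field at the given analysis scheme."""
    return {
        "DomainName": domain,
        "IndexField": {
            "IndexFieldName": field,
            "IndexFieldType": "text",
            "TextOptions": {"AnalysisScheme": scheme_name},
        },
    }


def apply_no_stemming(client, domain, scheme_name, fields):
    """Define the scheme, attach it to each field, then kick off a reindex.

    `client` is assumed to behave like boto3.client("cloudsearch").
    """
    client.define_analysis_scheme(**analysis_scheme_request(domain, scheme_name))
    for field in fields:
        client.define_index_field(**text_field_request(domain, field, scheme_name))
    client.index_documents(DomainName=domain)
```

Usage would be something like apply_no_stemming(boto3.client("cloudsearch"), "customers", "no_stem_en", ["name", "address"]), where "customers" and "no_stem_en" are made-up names.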
