Sunday, December 29, 2024

Scattered Notes on Dovecot’s userdb, passdb, and passwd-file

Dovecot can authenticate users using a passwd-like file.  This happens in two phases.  First, users are looked up in the passdb.  If the user is found and authenticated, then the user is looked up again in the userdb to get things like their UID/GID and home directory.

Now, this doesn’t allow for aliasing users in Dovecot.  If the login is user@example.com, then the defaults will lead to trying to find “user@example.com” in the passdb, then the userdb.  Failure to have these configured correctly can result in different errors:

  1. User not found in the passdb: authentication fails.  (Beware of fail2ban here.)
  2. User not found in the userdb: user can authenticate, but appears to have no mail!

For my own system, the virtual address needs to be resolved to a particular system user (aka Unix account.)  I also want to share the password files with Postfix for outbound email authentication.  This made Dovecot complicated: I want to log in as user@domain, then have that processed as user for both lookups in a file that is specific to the domain. I put the shortened user in the passwd-file, and now I have to configure passdb carefully:

# /etc/dovecot/local.conf snippet
passdb {
    args = username_format=%n /local/auth/%d/passwd
    override_fields = user=%n
    driver = passwd-file
}
userdb {
    args = /local/auth/%d/passwd
    driver = passwd-file
}

This makes passdb do the first lookup using the short username, %n, with the args setting.  Then, that short username is returned by override_fields for use in later lookups.  After that, userdb can continue with no special settings; it will use the overridden user to look up the short name, and nothing special needs to happen.

I believe that the passwd-file can’t return a different username, because there’s only one username field (the first field), and it is also the lookup key.  This is what requires us to use override_fields for this scenario.
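To make the passdb step concrete, here is a rough Python model of what the configuration above asks for.  It only illustrates the variable expansion (%n is the part of the login before the @, %d is the part after) and the first-field lookup; it is not how Dovecot is implemented, and the helper names and sample file contents are made up.

```python
def split_login(login):
    """Return (%n, %d) for a login such as 'user@example.com'."""
    name, _, domain = login.partition("@")
    return name, domain


def passwd_file_path(login, template="/local/auth/%d/passwd"):
    """Expand %d in the path template, as the args setting does."""
    _, domain = split_login(login)
    return template.replace("%d", domain)


def find_entry(passwd_text, key):
    """Return the fields of the first line whose first field equals key."""
    for line in passwd_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if fields[0] == key:
            return fields
    return None


# Hypothetical contents of /local/auth/example.com/passwd
SAMPLE = "alice:{PLAIN}secret:1001:1001::/home/alice\n"

login = "alice@example.com"
short_name, _ = split_login(login)
print(passwd_file_path(login))
print(find_entry(SAMPLE, short_name))  # found: the key is the short name
print(find_entry(SAMPLE, login))       # None: the full address is not a key
```

The last two lines are the whole point: with the short name stored in the file, a lookup keyed on the full address finds nothing, which is why username_format=%n is needed.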

Sunday, December 22, 2024

Don’t Let HTTP/2 Nest

For some time, I had problems accessing a dev server with HTTP/2.  Asking cURL to use HTTP/1.1 worked fine, so that’s what I did for a long time.

Today, I found the root cause.  I had nginx set up as reverse-proxy/TLS termination (to emulate ALB), proxying requests to apache2.  Both of them had HTTP/2 enabled!  I needed to deactivate support in Apache, and since the system is Debian/Ubuntu based, that meant:

sudo a2dismod http2
sudo systemctl reload apache2

After that, everything worked.

The problem was that the client would connect to nginx with HTTP/2, and then the request would be sent to Apache. Apache's HTTP/2 module would include an Upgrade: h2, h2c header in the response.  Then nginx would dutifully copy this back to the client.  When cURL or PHP streams received this header, they would detect it as invalid: we can’t upgrade to HTTP/2 from inside HTTP/2.
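The check the clients were effectively making can be sketched as follows.  HTTP/2 (RFC 9113, section 8.2.2) treats a message carrying connection-specific header fields, Upgrade among them, as malformed.  The function name and the sample response are mine, for illustration; this is not code from cURL or PHP.

```python
# Header fields that are connection-specific in HTTP/1.1 and therefore
# must not appear in an HTTP/2 message.
CONNECTION_SPECIFIC = {
    "connection",
    "keep-alive",
    "proxy-connection",
    "transfer-encoding",
    "upgrade",
}


def forbidden_in_http2(headers):
    """Return the lowercased header names that make an HTTP/2 message malformed."""
    lowered = (name.lower() for name in headers)
    return sorted(name for name in lowered if name in CONNECTION_SPECIFIC)


# Roughly what nginx passed along from Apache in my case.
response = {"Content-Type": "text/html", "Upgrade": "h2, h2c"}
print(forbidden_in_http2(response))
```

A proxy that translates HTTP/1.1 responses into HTTP/2 is expected to drop these fields before forwarding, which is the step that did not happen here.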

That error-handling resulted in discarding the response body… but not the HTTP 200 status code, which was extremely puzzling.  How could this successful request have failed?  It failed during header processing, after processing the status and before accepting the body.  (I think browsers must ignore it?  Or maybe they don’t use HTTP/2 through a proxy, even with CONNECT requests?  I would have had to figure out the problem much sooner, if they had seen this Upgrade header and treated it as an error.)

The other weird thing about this is that Apache doesn't have TLS configured, but it still provided h2 as an option in its Upgrade header.  I don’t think that’s a reasonable configuration.  It’s especially not a reasonable default, but I’m not sure whether that’s Apache’s problem, Debian’s, or Ubuntu’s.

Tuesday, December 17, 2024

What I Learned Trying to Install Kubuntu (alongside Pop!_OS)

First and foremost, once again, this is clearly not a supported configuration that I tried to make.  I'm sure that if I wiped the drive and started afresh, things would have gone much better.  I just… wanted to push the envelope a bit.

Pop!_OS installs (with encryption) with the physical partition as a LUKS container, holding an LVM volume group, and the root filesystem is on a logical volume within.  The plan was hatched:

  • Create a logical volume for /home and move those files over to it
  • Create a logical volume for Kubuntu’s root filesystem
  • Install Kubuntu into the new volume, and share /home for easy switching (either direction)

Things immediately got weird.  The Kubuntu installer (calamares) knows how to install into a logical volume, but it doesn’t know how to open the LUKS container.  I quit the installer, unlocked the thing, and restarted the installer.  This let the installation proceed, up to the point where it failed to install grub.

Although that problem can be fixed, the whole installation ended up being irretrievably broken, all because booting Linux is clearly not important enough to get standardized. Oh well!

Sunday, December 8, 2024

Side Note: Firefox’s Primary Password is Local

When signing into Firefox Sync to set up a new computer, the primary password is not applied.  I usually forget this, and it takes a couple of runs for me to remember to set it up.

That’s not enough for a post, so here are some additional things about it:

The primary password protects all passwords, but not other data.  If someone can access Firefox data, bookmarks and history are effectively stored in the clear.

The primary password is intended to prevent reading credentials… and the Sync password is one of those credentials.  That’s why a profile with both Sync and a primary password wants that password as soon as Firefox starts; it wants to check for new data.

The same limitation of protections applies to Thunderbird.  If someone has access to the profile, they can read all historic/cached email, but they will not be able to connect and download newly received email without the primary password.

The Primary Password never times out.  As such, it creates a “before/after first unlock” distinction.  After first unlock, the password is in RAM somewhere, and the Passwords UI asking for it again is merely re-authentication.  Firefox obviously has the password saved already, because it can fill form data.

Some time ago, the hash that turns the primary password into an actual encryption key was strengthened somewhat.  I believe it is now a 10,000-iteration thing, and not just one SHA-1 invocation.  The problem with upgrading it further is that the crypto is always applied; “no password” is effectively a blank password, and the encryption key still needs to be derived from it to access the storage.  Mozilla understandably doesn’t want to introduce a noticeable startup delay for people who did not set a password.
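As a generic illustration of that trade-off (not Firefox's actual scheme, which I have not checked), here is PBKDF2 from Python's standard library run against a blank password.  The salt, hash choice, and iteration counts are placeholders for the demo.

```python
import hashlib
import time


def derive_key(password, salt, iterations):
    """Stretch a password into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)


salt = b"\x00" * 16  # placeholder salt, for the demo only

for iterations in (1, 10_000):
    start = time.perf_counter()
    key = derive_key("", salt, iterations)  # "no password" is a blank password
    elapsed = time.perf_counter() - start
    print(iterations, len(key), f"{elapsed:.4f}s")
```

Every extra iteration is paid on each unlock, password or not, so the iteration count is effectively a startup-time budget.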


Very recently (2024-10-17), the separate Firefox Sync authentication was upgraded.  Users need to log into Firefox Sync with their password again in order to take advantage of the change.

Sunday, December 1, 2024

Unplugging the Network

I ended up finding a use case for removing the network from something.  It goes like this:

I have a virtual machine (guest) set up with nodejs and npm installed, along with @redocly/cli for generating some documentation from an OpenAPI specification.  This machine has two NICs, one in the default NAT configuration, and one attached to a host-only network with a static IP.  The files I want to build are shared via NFS on the host-only network, and I connect over the host-only network to issue the build command.

Meaning, there is no loss of functionality to remove the default NIC (the one configured for NAT), but it does cut npm off from the internet.  That’s an immediate UX improvement: npm can no longer complain that it is out of date! Furthermore, if the software I installed happened to be compromised and running a Bitcoin miner, it has been cut off from its C2 server, and can’t make anyone money.

An interesting side benefit is that it also cuts off everyone’s telemetry, impassively.

I can’t update the OS packages, but I’m not sure that is an actual problem.  If the code installed doesn’t have an exploit payload already, there’s no way to get one later.  The vulnerability remains, but nothing is there to go after it.

Level Up

(Updated 2024-12-19: this section was a P.S. hypothetical on the original post. Later sections were added.)

It is actually possible to deactivate both NICs.  The network was used for only two things: logging in to run commands, and to (re)use the NFS share to get the files.

Getting the files is easy: they can be shared using the hypervisor’s shared-folders system.  Logging in to run commands can be done on the hypervisor’s graphical console.  As a bonus, if the machine has a snapshot when starting, it can be shut down by closing the hypervisor’s window and reverting to snapshot.

Now, we really have a network-less (and stateless) appliance.

Reconfigure

Before I made that first snapshot, I configured the console to boot with the Dvorak layout, because the default of Qwerty is pretty much why I use SSH when available for virtual machines.  But then, after a while, I got tired of being told that the list of packages was more than a week old, so I set out to de-configure some other things.

I cleared out things that would just waste energy on a system that would revert to snapshot: services like rsyslog, cron, and logrotate.  Then I trawled through systemctl list-units --all and cleared a number of timers, such as ones associated with “ua”, apt, dpkg, man-db, and update-notifier.  Any work these tasks do will simply be thrown away every time.

I took the pam_motd modules out of /etc/pam.d/login, too.  If Canonical doesn't want me to clear out the dynamic motd entirely, the next best thing is to completely ignore it.

After a reboot, I went through systemd-analyze critical-chain and its friend, systemd-analyze blame, and turned off more things, like ufw and apport.

With all that out of the way, I rebooted and checked how much memory my actual task consumed; it was apparently a hundred megabytes, so I pared the machine’s memory allocation down from 2,048 MiB to 512 MiB.  The guest runs with neither swap nor earlyoom, so I didn’t want to push it much farther, but 384 MiB is theoretically possible.

NFS

A small, tiny note: besides cutting off the Internet as a whole, sharing files from the hypervisor instead of NFS adds another small bit of security.  The NFS export is a few directories up, and the host has no_subtree_check to improve performance on the other guest that the mount is actually meant for.

Super theoretically, if the guest turned evil, it could possibly look around the entire host filesystem, or at least the entire export.  When using the hypervisor’s file sharing, only the intended directory is accessible to the guest kernel.

Sunday, November 24, 2024

Mac Mini (M4/2024) First Impressions

I bought an M4 Mac Mini (2024) to replace my Ivy Bridge (2012) PC.

It was difficult to choose a configuration, because of the need to see a decade into the future, and the cost of upgrades.  It is hard to believe that an additional half terabyte (internal) would cost more than a whole terabyte external drive (with board, USB electronics/port, case, cable, and retail box.)

It feels pretty fast.  Apps open unexpectedly quickly.  Which is to say, on par with native apps on my 12C/16T Alder Lake work laptop.  Apparently, my expectations have been lowered by heavy use of Flatpaks.

It is quiet.  When I ejected the old USB drive I was using for file transfer, it spun down, and that was the noise I had been hearing all along.  The Mac itself is generally too quiet to hear.

It is efficient.  I have a power strip that detects when the main device is on, and powers an extra set of outlets for other devices.  Even with the strip moved from “PC” to “Netbook,” the Mini does not normally draw enough power to keep the other outlets on.  (I put the power strip on the desk and plugged it into the desk power, then turned off the Mac’s wake-on-sleep feature.  Now I can unplug the whole strip when not in use.)

It has been weird getting used to the Mac keyboard shortcuts again.  For two years, I haven’t needed to think about which computer I’m in front of; Windows and Linux share the basic structure for app shortcuts and cursor movement.  I don’t know how many times I have pressed Ctrl+T in Firefox on the Mac and waited a beat for the tab to open, before pressing Cmd+T instead.

It is extremely weird to me that the PC Home/End keys do nothing by default on the Mac.  It’s not like they do something better, or even different; they just don’t do anything. Why?

I also had to search the web to find out why an NTFS external drive couldn’t put things in the trash after I had copied them onto the Mac.  It seems the whole volume is read-only; macOS doesn’t have built-in support for writing to NTFS.  Meanwhile, I didn’t notice anything in the UI to suggest that the volume is read-only; some operations just don’t work (quietly, in the case of keyboard shortcuts.)

There was one time where I tried to wake the Mac up, and it didn't want to talk to the keyboard. I plugged and unplugged the USB (both the keyboard from the C-to-A adapter, and the adapter from the Mac) and tried it with a different keyboard, but to no avail.  I couldn’t find any way to open an on-screen keyboard with the trackpad alone.  I had to hard power off, but it has been fine ever since.

I guess that’s about it!  It doesn’t feel like “coming home” or anything, it just feels like a new computer to be set up.

Sunday, November 17, 2024

Fixing a Random ALB Alarm Failure

tl;dr: if an Auto Scaling Group’s capacity is updated on a schedule, the max instance lifetime is an exact number of days, and instances take a while to reach healthy state after launching… Auto Scaling can terminate running, healthy instances before new instances are ready to replace them.

I pushed our max instance lifetime 2 hours further out, so that the max-lifetime terminations happen well after scheduled launches.
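A worked example of the timing, with made-up numbers: a 7-day lifetime, a scheduled scale-out at 08:00, and 10 minutes for a new instance to become healthy.  The max instance lifetime setting is given in seconds; everything else here is an assumption for illustration, not our real configuration.

```python
from datetime import datetime, timedelta


def expiry(launched_at, max_lifetime):
    """When an instance launched at launched_at reaches its max lifetime."""
    return launched_at + max_lifetime


launch = datetime(2024, 11, 4, 8, 0)   # hypothetical scheduled scale-out
warmup = timedelta(minutes=10)         # hypothetical time to become healthy
next_scale_out = launch + timedelta(days=7)
replacements_healthy = next_scale_out + warmup

for lifetime in (timedelta(days=7), timedelta(days=7, hours=2)):
    expires = expiry(launch, lifetime)
    verdict = "ok" if expires >= replacements_healthy else "overlaps warm-up"
    print(int(lifetime.total_seconds()), expires, verdict)
```

With exactly 7 days (604800 seconds), the old instances expire at the same moment the scheduled launch begins; with 7 days and 2 hours (612000 seconds), they expire well after the replacements have had time to warm up.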

Sunday, November 10, 2024

Ubuntu 24.10 First Impressions

I hit the button to upgrade Ubuntu Studio 24.04 to 24.10.  First impressions:

  1. The upgrade process was seriously broken.  Not sure if my fault.
  2. Sticky Keys is still not correct on Wayland.
  3. Orchis has major problems on Ubuntu Studio.

Sunday, October 6, 2024

Pulling at threads: File Capabilities

For unimportant reasons, on my Ubuntu 24.04 installation, I went looking for things that set file capabilities in /usr/bin and /usr/sbin.  There were three:

  • ping: cap_net_raw=ep
  • mtr-packet: cap_net_raw=ep
  • kwin_wayland: cap_sys_resource=ep

The =ep notation means that only the listed capabilities are set to “effective” and “permitted”, but not “inheritable.”  Processes can and do receive the capability, but cannot pass it to child processes.
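For reading the notation mechanically, here is a small parser for the simple name=flags form shown above.  It is a sketch only: the full text format accepted by the capability tools has more operators and multiple clauses, which this ignores, and the function name is mine.

```python
def parse_simple_cap(text):
    """Parse 'cap_a,cap_b=ep' into (names, flags).  Handles only the '=' form."""
    names, sep, flags = text.partition("=")
    if not sep:
        raise ValueError("expected '=' in capability text")
    return names.split(","), {
        "effective": "e" in flags,
        "permitted": "p" in flags,
        "inheritable": "i" in flags,
    }


for entry in ("cap_net_raw=ep", "cap_sys_resource=ep"):
    print(parse_simple_cap(entry))
```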

ping and mtr-packet are “as expected.”  They want to send unusual network packets, so they need that right.  (This is the sort of thing I would also expect to see on nmap, if it were installed.)

kwin_wayland was a bit more surprising to see.  Why does it want that?  Through reading capabilities(7) and running strings /usr/bin/kwin_wayland, my best guess is that kwin needs to raise its RLIMIT_NOFILE (max number of open files.)

There’s a separate kwin_wayland_wrapper file.  A quick check showed that it was not a shell script (a common form of wrapper), but an actual executable.  Could it have had the capability, set the limits, and launched the main process?  For that matter, could this whole startup sequence have been structured through systemd, so that none of kwin’s components needed elevated capabilities?

The latter question is easily answered: no.  This clearly isn’t a system service, and if it were run from the systemd user instance, that instance never has any elevated privileges to pass along.  (The goal, as I understand it, is that a systemd user-session bug doesn’t permit privilege escalation, and “not having those privileges” is the surest way to avoid malicious use of them.)

If kwin actually adjusts the limit dynamically, in response to the actual number of clients seen, then the former answer would also be “no.”  To exercise the capability at any time, kwin itself must retain it.

I haven’t read the code to confirm any of this.  Really, it seems like this situation is exactly what capabilities are for; to allow limited actions like raising resource limits, without giving away broad access to the entire system.  Even if I were to engineer a less-privileged alternative, it doesn’t seem like it will measurably improve the “security,” especially not for cap_sys_resource.  It was just a fun little thought experiment.

Sunday, September 29, 2024

Scattered Thoughts on Distrobox

Distrobox’s aim is to integrate a container with the host OS’ data, to the extent possible.  By default, everything (process, network, filesystem) is shared, and in particular, the home directory is mounted into the container as well.  It is not even trying to be “a sandbox,” even when using the --unshare-all option.

I also found out the hard way that Distrobox integrates devices explicitly. If some USB devices are unplugged, the container will not start.  This happened because I pulled my laptop away from its normal dock area (with USB hub, keyboard, and fancy mouse) and tried to use a distrobox.  Thankfully, I wasn’t fully offline, so I was able to rebuild the container.  [Updated 2024-11-28: This danger is persistent.  Creating a container without the USB devices connected, then running it later with the devices, will make it fail to start if the devices are unplugged.  This ended up being impossible to live with.]

Before it stopped performing properly in Ubuntu Studio 23.10, I used distrobox to build ares and install it into my home directory.  This process yielded a ~/.local/share/applications/ares.desktop file, which my host desktop picked up, but which would not actually work from the host.  I always needed to be careful to click the “ares (on ares)” in the menu after exporting, to start it on the ares container.

I have observed that distroboxes must be stopped to be deleted, but then distrobox will want to start them to un-export the apps.  Very recent distrobox versions ask whether to start the container and do the un-exporting, but there’s still a base assumption that you wanted the distrobox specifically to export GUI apps from it.  It clearly doesn’t track whether any apps are exported, because it always asks, even if none are.

Sunday, September 22, 2024

CloudSearch's Tricky prefix Operator

We ran into an interesting problem with CloudSearch.  Maybe I did it wrong, but I stored customer names in CloudSearch as “text” type with “English” analysis.  We do generic-search-bar scans with prefix searches, like (or (prefix field=name 'moon') (prefix field=address 'moon')).

Then, a developer found that a search term of “john” would find customers with a name of “johns”, but a search for “johns” would not!  The root cause turned out to be that the English analyzer stems everything that is a plausible plural, storing “Johns” as “john”.

Normally, this isn’t a problem.  When—and only when—using a prefix search, stemming is not applied to the terms for those matches.  Thus, doing a prefix search of “johns” will match “johnson” but not “johns”.  Doing a regular search through the CloudSearch Console will turn up the expected customers, and so might checking the database directly, adding to the confusion.  It even works as expected with most names, because “Karl” or “Packard” don’t look like plurals.
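The mismatch is easier to see in a toy model.  The stemmer below is a crude stand-in (it just strips one trailing “s”), and the index is a plain dictionary; none of this is CloudSearch’s real analyzer or API, only the shape of the problem.

```python
def toy_stem(word):
    """Very crude stand-in for an English stemmer: strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(names):
    """Map each stored (stemmed) term to the names that produced it."""
    index = {}
    for name in names:
        index.setdefault(toy_stem(name), []).append(name)
    return index


def prefix_search(index, term):
    """Prefix queries compare the raw, unstemmed term to the stored terms."""
    term = term.lower()
    return sorted(
        name
        for stored, names in index.items()
        if stored.startswith(term)
        for name in names
    )


def term_search(index, term):
    """Regular queries stem the term first, so it matches the stored form."""
    return sorted(index.get(toy_stem(term), []))


index = build_index(["Johns", "Johnson", "Karl"])
print(prefix_search(index, "john"))   # ['Johns', 'Johnson']
print(prefix_search(index, "johns"))  # ['Johnson']
print(term_search(index, "johns"))    # ['Johns']
```

“Johns” is stored as “john”, so the unstemmed prefix “johns” can only match longer stored terms such as “johnson”.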

We added a custom analyzer with no stemming, set our text fields to use it, and reindexed.

Sunday, September 15, 2024

The Wrong Terminal

Somewhere in my Pop!_OS 22.04 settings, I set Tilix as the preferred terminal emulator. When I use the Super+T* keyboard shortcut, I get a Tilix window.  However, when I use a launcher that Distrobox has created for a container from the Super+A (for Applications) UI, the command-line doesn’t come up in Tilix… it comes up in gnome-terminal instead. Why is that, and can I fix it?

AIUI, all the Application Launcher UI does is ask the system to open the .desktop file that Distrobox added.  That file has the “run in terminal” option, but lacks the ability to request some specific terminal. That gets chosen, eventually, by the GIO library.

GIO isn’t desktop-specific, so it doesn’t read the desktop settings.  It actually builds in a hard-coded list of terminals that it knows how to work with, like gnome-terminal, konsole, and eventually (I am assuming) ye olde xterm.  It walks down the list and runs the first one that exists on the system, which happens to be gnome-terminal.  AFAIK, there is no configuration for this, at any level.
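In other words, the selection behaves something like the sketch below.  The list contents and order are illustrative guesses based on the description above, not copied from GIO’s source, and the function is hypothetical.

```python
import shutil

# Illustrative list only; the real list and its order live inside GIO.
KNOWN_TERMINALS = ["gnome-terminal", "konsole", "xterm"]


def pick_terminal(candidates=KNOWN_TERMINALS, which=shutil.which):
    """Return the first candidate found on PATH, or None if none exist."""
    for name in candidates:
        if which(name):
            return name
    return None


print(pick_terminal())
```

Nothing in that loop consults a user preference, so a terminal that is not on the list (Tilix, in my case) can never be chosen, no matter what the desktop settings say.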

It is also possible that one of the distributions in the chain (Debian, Ubuntu, or Pop!_OS) patched GIO to try x-terminal-emulator first.  If so, it would go through the alternatives system, which would send it directly to gnome-terminal, since between that and Tilix, gnome-terminal has priority.  We are deep into speculative territory, but if all of that were the case, I could “make it work” by making a system-level change resulting in all users now preferring Tilix… but only for cases where x-terminal-emulator is involved, specifically.

I want the admin user to have all the defaults, like gnome-terminal, because the less deviation made in that account, the less likely I am to configure a weird problem for myself.** (Especially for Gnome, which has a configuration system, but they don’t want anyone to use it.  For simplicity.) Changing the alternatives globally is in direct contradiction to that goal.

It seems that the “simplest” solution is to change the .desktop file to avoid launching in a terminal, and then update the command to include the desired terminal.  It would work in the short term, but fall “out of sync” if I ever changed away from Tilix as default in the desktop settings, or uninstalled Tilix.  It’s not robust.

It seems like there’s some sort of desktop-environment standard missing here.  If we don’t want to invoke threads or communication inside GIO, then there would need to be a way for the Gnome libraries to pass an “XDG Configuration” or something, to allow settings like “current terminal app” to be passed in.

If we relax the constraints, then a D-Bus call would be reasonable… but in that case, maybe GIO could be removed from the sequence entirely.  The Applications UI would make the D-Bus call to “launch the thing,” and a desktop-specific implementation would pick it up and do it properly.

It seems like there should be solutions, but from searching the web, it looks like the UX has “just done that” for years.

* Super is the |□| key, because it's a System76 laptop, a Darter Pro 8 in particular.

** As a side effect, this makes the admin account feel like “someone else’s computer,” which makes me take more care with it.  I may not want to break my own things, exactly, but I would feel even worse about breaking other people’s stuff.

Monday, September 9, 2024

Some Solutions to a Problem

We have an EC2 instance that has a quickly-produced shell script that runs on boot.  It sets a DNS name to the instance’s public IPv4 address.  Since time was of the essence, it hard-codes everything about this, particularly, the DNS name to use.

This means, if we want to bring up a copy of this instance, based on a snapshot of its root volume, the copied instance will overwrite the DNS record for the production service. We need to stop this.

As a side project, it would be nice to remove the hard-coding of the DNS name.  It would be trivial to “stop DNS name conflicts” if we did not have a DNS name stored on the instance’s disk image to begin with.

What are the options?

Sunday, September 1, 2024

A Problem of Semantic Versioning

For a while, we’ve been unable to upgrade to PHPUnit 11 due to a conflict in transitive dependencies.  The crux of the problem is:

  1. Psalm (5.25.0) directly requires nikic/php-parser: ^4.16, prohibiting 5.x.
  2. PHPUnit (11.3.1) transitively requires nikic/php-parser: ^5.1, prohibiting 4.x.

It is possible in the short term to retain PHPUnit 10.x, but it brings to light a certain limitation of Semantic Versioning: it tells you how to create version numbers for your own code base, but it does not carry information about the dependencies of that code.
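The conflict can be checked mechanically.  For versions at 1.0 and above, a caret constraint such as ^4.16 allows anything from 4.16 up to, but not including, 5.0.  The helpers below are a minimal sketch that handles only the two-part ^X.Y form; they are not Composer’s resolver.

```python
def caret_range(constraint):
    """Expand '^X.Y' (X >= 1) into a half-open range of (major, minor) tuples."""
    major, minor = (int(part) for part in constraint.lstrip("^").split("."))
    return (major, minor), (major + 1, 0)


def ranges_overlap(a, b):
    """True if two half-open [low, high) ranges share at least one version."""
    (a_low, a_high), (b_low, b_high) = a, b
    return max(a_low, b_low) < min(a_high, b_high)


psalm = caret_range("^4.16")    # ((4, 16), (5, 0))
phpunit = caret_range("^5.1")   # ((5, 1), (6, 0))
print(ranges_overlap(psalm, phpunit))  # False
```

No single nikic/php-parser version falls in both ranges, so the two packages cannot be installed together.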

When the required PHP runtime version goes up, what kind of change is that?  SemVer prescribes incrementing the major number for “incompatible API changes,” or the patch for “backward compatible bug fixes.”

So, is it a bug fix?  Is it incompatible? Or is the question ill-formed?

It feels wrong to increment the patch version with such a change.  Such a release states, “We are now preventing the installation on {whatever Enterprise Linux versions} and below, and in exchange, you get absolutely nothing. There are no new features.  Or perhaps we fixed bugs, but now you can’t access those fixes.”  That sounds… rude.

Meanwhile, it seems gratuitous to bump the major version on a strict time schedule, merely because an old PHP version is no longer supported upstream every year.  It appears to cause a lot of churn in the API, simply because making a major version change is an opportunity to “break” that API.  PHPUnit is particularly annoying about this, constantly moving the deck chairs around.

In between is the feature release.  I have the same misgivings as with the patch version, although weaker.  Hypothetically, a project could release X.3.0 while continuing to maintain X.2.Y, but I’m not sure how many of them do.  When people have a new shiny thing to chase, they don’t enjoy spending any time on the old, tarnishing one.

What if we take the path of never upgrading the minimum versions of our dependencies?  I have also seen a project try this.  They were starving themselves of contributors, because few volunteers want to make their patch work on PHP 5.2–8.1.  (At the PHP 8.1 release in 2021, PHP 5.2 had reached its “end of life” about 11 years prior, four years after its own release in 2006.) Aside from that issue, they were also either unable to pick up new features in other packages they may use, or they were forever adding run-time feature detection.

As in most things engineering, it comes down to trade-offs… but versions end up being a social question, and projects do not determine their answers in isolation.  The ecosystem as a whole has to work together.  When they don’t, users have to deal with the results, like the nikic/php-parser situation.  And maybe, that means users will migrate away from Psalm, if it’s not moving fast enough to permit use with other popular packages.

Sunday, August 18, 2024

The Missing Call

I decided to combine (and minify) some CSS files for our backend administration site, so I wrote the code to load, minify, and output the final stylesheet.  I was very careful to write to a temporary file, check even the fclose() return code, rename it into place, and so on.  I even renamed the original to a backup so that I could attempt to rename it back if the first rename succeeded, but the second failed.

For style points, I updated it to set the last-modified time of the generated file to the timestamp of the latest input, so that If-Modified-Since headers will work correctly.

I tested it, multiple times, with various states of having the main and backup filenames. It looked great.  I pushed it out to production… and that wasn’t so great.

We just had no styles at all. Yikes!  I had some logic in there for “if production and minified CSS exists, use it; else, fall back to the source stylesheets.”  I hastily changed that to if (false) and pushed another deployment, so I could figure out what happened.

It didn’t take long.  The web server log helpfully noted that the site.min.css file wasn’t accessible to the OS user.

I had used tempnam(), which created an output file with mode 600, rw- --- ---.  Per long-standing philosophy, the deployment runs as a separate user from the web server, so a file that’s only readable to the deployer can’t be served by the web server.  Oops.

I had considered the direct failure modes of all of the filesystem calls I was making, but I hadn’t considered the indirect consequences of the actions being performed.  I added a chmod(0o644) call and its error check, and deployed again.  After that, the site worked.
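The real code is PHP, but the same trap exists elsewhere.  As a Python analog (a sketch, not a translation of what I deployed): tempfile.mkstemp() also creates its file readable only by the owner, so the mode has to be widened before the rename.  The function name and default mode are assumptions for illustration.

```python
import os
import tempfile


def publish(path, data, mode=0o644):
    """Write data to a temp file beside path, widen its mode, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # created with mode 0600
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        os.chmod(tmp, mode)    # the call that was missing
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "site.min.css")
        publish(target, "body{margin:0}")
        print(oct(os.stat(target).st_mode & 0o777))
```

Setting the mode on the temporary file, before the rename, means the published name never points at an unreadable file.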

Sunday, August 11, 2024

Our Long-Term AWS CloudSearch Experience

AWS has announced the deprecation of CloudSearch, among other services, just as I wanted to share why we chose it, and how it worked out.

Competitors

The field we considered when choosing CloudSearch included Sphinx, ElasticSearch (the real one and AWS’ ripoff), MySQL FULLTEXT indexes, rolling our own in-SQL trigram search, and of course, CloudSearch.

We had operational experience with Sphinx. It performed well enough, but it is oriented toward documents, not the smaller constellation of attributes I was interested in here.  It took quite a chunk of memory to index our tickets (description/comments), required a pet machine, and didn’t vibe correctly with the team.  I didn’t want to commit to putting 100 times more entries in it, then defending it politically for all eternity.

ElasticSearch appeared to be hyper-focused on log searching specifically, more like what we’re already doing with Papertrail.  It was not clear that it could be used for other purposes, let alone how to go about such things.

We actually had an in-SQL trigram search already, but only for customer names.  I built it because MySQL’s full-text index features were not in great health at the time. (I thought full-text indexes were deprecated ever since, but in checking today, this appears not to be the case.  Even the MySQL 9.0 docs don’t mention it.) I started populating an additional trigram index for all the data I was interested in searching, and it blew up our storage size so fast I had to stop it and definitely find something else. That’s also how I found out that RDS can’t reclaim storage; once it expands, it has expanded for good.

The problem with using MySQL’s full-text indexing was the related integer fields that needed to be indexed.  We wanted to have a general search field, where the user could put in “Sunesh” or “240031” and get the related customer or transaction number, without a complex multi-part form.  Doing that with nothing but MySQL features seemed difficult and/or slow.

“Do nothing” wasn’t really an alternative, either; to search all the relevant fields, MySQL wanted to do two full table scans.  Searches would be running against the largest tables in the database, which makes even a single full scan prohibitively expensive.

CloudSearch

CloudSearch got a great review in my collection of blurbs about AWS services, but further experience has been somewhat less rosy.

For background, CloudSearch is arranged into one-dimensional domains, with a limited second dimension in the form of array attributes.  To contain costs, I chose to index our customers, attaching their VINs as array attributes, rather than have separate customer and vehicle domains or redundantly index the customer attributes on every vehicle.  This results in a domain with 2.5M records.  (Doing some serious guesswork, that means around 12M contracts in total.  Give or take a couple million.)

Things worked fine with a ‘small’ search instance for a while, but it didn’t handle bursty traffic.  Last month, I resized the instance to ‘medium’, and rebuilt the index… which took an unknown number of hours between 2 and 18, inclusive.

Why don’t I know exactly how long it took?  Well, that’s the next problem: metrics. CloudSearch only keeps metrics for three hours, and doesn’t have an event log.  (They appear to go into CloudWatch, but with a custom 3-hour expiration time.) When did the rebuild finish?  Dunno!  Did the system get overwhelmed overnight?  Too bad; that’s gone! With the basic metrics being so anemic, there’s definitely nothing as useful as RDS’ Performance Insights, which is what I would really want here.

Our instance has managed to survive adequately at medium for a while, but I don’t know when I’ll have to scale it up as we roll out this search to more parts of the system.  We just don’t have the metrics here to plan capacity.

Considering that, and AWS’s deprecation of the service, I would love to have an alternative… except it would just be CloudSearch, improved.

Wednesday, August 7, 2024

AWS CodeDeploy’s Blue/Green Deployment Gotcha

Once, well after I no longer remembered how the whole thing was bootstrapped, I accidentally deleted the target group associated with a CodeDeploy application that was configured for blue/green deployment.  That’s how I found out (rediscovered?) that CodeDeploy doesn’t create a target group for blue/green deployments; it copies an existing one.  Since I had just deleted that existing one, I couldn’t do a (re)deployment and bring the system back online!

(Also, it cemented my opinion that prompts should be like, “Type ‘delete foo-production-dAdw4a1Ta’ to delete the target group” rather than “Type ‘delete’ to delete.” Guess which way the AWS Console is set up.)

I started up an instance to add to a new target group, and it promptly fell over.  The AMI had health monitoring baked in, and one of the health checks was “CodeDeploy has installed the application on the instance.”  Since it was not CodeDeploy starting the instance for the purpose of installing the application, the health check failed, and Auto Scaling dutifully pulled it down to replace it.

Meanwhile, the lack of healthy instances was helpfully sending alerts and bringing my boss’ attention to the problem.

[Now I wonder if it could have worked to issue a redeploy at this point.  The group was there to copy, even if the instances weren’t functional.  I guess we’ll never know; I’m not about to delete the target group again, just to find out!]

I ended up flipping the configuration to using EC2 health checks instead of HTTP, and then everything was stable enough to issue a proper redeployment through CodeDeploy.  With service restored, I finally put the health checks back to HTTP.

And then, with production in service again, I finally got to work on moving staging from in-place to blue/green.  Ironically, I would have learned the lesson either way; but by breaking production, it really stuck with me.

Sunday, August 4, 2024

qbXML is REST – Distilled

The design of Quickbooks XML is fundamentally REST.  Allow me to rephrase an old post with way too many words about this.

The Quickbooks Web Connector (QBWC) would run on the client with the Quickbooks GUI, and periodically make calls out to a SOAP server to set up a session, get “qbXML” documents, and execute them.

Each of those documents contained a series of elements that essentially mapped to commands within Quickbooks.  To make an edit request, one included the type of object being edited, its ID, its sequence number (for conflict detection), and the desired changes.  Crucially, everything Quickbooks needed to carry out that request was embedded within the XML.  The XML could only reference objects that existed inside of Quickbooks.  There was no concept of “session data,” “temporary IDs,” locks, or anything, and no way to create nor access them.

If memory serves, one could “name” objects being created, then reference them later by that name within the same qbXML document.  Thus, “create a new thing and update something else to reference it” was expressible.

In other words, qbXML transferred a complete representation of the necessary state to complete the request: therefore, by my understanding, it is REST.

The overall system wasn’t pure REST.  Everything happened within the context of “a session” which had “a specific company file” open in the GUI.  Outside of that, the fact that SOAP/WSDL (normally a full-blown RPC mechanism) was the transport was practically irrelevant.

I’m also aware there is no HTTP, thus no HTTP integration, no URLs, and no HATEOAS.  However, I don’t think these things are required to call something REST; those are simply things that REST was co-popularized with.

Sunday, July 28, 2024

How I Use Firefox

A long, long time ago, I firewalled my online and real-world identities.  I have separate email accounts for them.  Those email accounts live in separate Firefox profiles. To throw more chaff into the system, the profiles have different adblockers (commonly uBlock Origin; first runner-up is AdBlocker Ultimate) and may or may not include Privacy Possum.

Within the profile, I’ve separated things further into containers, full name Multi-Account Containers.  Google gets its own container, so that YouTube can’t follow me everywhere online.

For things where I suspect all the defenses are a problem, I also have a profile that runs in “always private browsing” mode, but is otherwise fairly open.  I very rarely need it.  I’d rather bounce out of a site that has too many annoyances, and which hasn’t sold me on its usefulness.  (Will I sign up for your newsletter?  Will I create an account to read this article?  No.)

(There’s also Pale Moon and the Windows XP VM with Firefox 52 ESR on it, for checking compatibility with Quilt Draw when I am working on that.  However, those are well outside of everyday usage.)

At work, I don’t use separate profiles; I ended up with entirely separate browsers instead.  My day-to-day work happens in Firefox, with containers for the AWS console, each of our own sites I am responsible for, and “other things requiring login.”  The theory is that a site outside the container(s) that tries to attack one inside will fail, because the login isn’t valid from outside the container.  Meanwhile, the multiple containers separate our sites and our general service providers, and the cloud against everyone.

Because the Google Panopticon Browser is becoming the new IE6, there’s a copy of Edge to make the corporate site(s) “fully supported.”  Let Microsoft spy upon themselves, and only themselves.  All so-called “AI” “features” are turned off, where possible.

Finally, for browsing work-adjacent things that aren’t actually work, like LWN and various blogs, I have Waterfox.  It doesn’t have containers or profiles, but it does have uMatrix (better security by running less remote code), along with LeechBlock so that I don’t waste the whole day in there.

uMatrix is a great defense system; in fact, too great. I wouldn’t recommend it for most people.  However, it suits my goals for that particular browser.

Sunday, July 21, 2024

The Case of the Unknown Errors

For a number of reports, we did the lazy thing: we print errors on stderr in the job, and let cron email them to us.

Unfortunately, email is unreliable, and transient.  If a remote system accepts the message from us, then drops it for anti-spam reasons, we don’t have a log of that, nor a copy to resend.  We noticed these problems with report data, and now all output files are archived to S3.  However, the cron emails are the only source of truth for errors, so if we don’t get them, they’re lost forever!

I think the solution will be changing the error_log() calls to syslog().  That will create an on-host record, then forward it to the central server for searching and archiving.  We can even still get cron emails (normally) if we include the flag to print the messages to stderr.

I’m just kind of surprised that I have left a “can’t get email errors about email errors” loop in production for over a decade.

Wednesday, July 17, 2024

Reverting Flatpaks

Today, I learned a lot about Flatpak, motivated by Thunderbird giving me an error at startup.  The error message said that I had used the profile with “a newer version of Thunderbird,” and now it was not compatible.  It gave me the choice of creating a new profile, or quitting.  This was incredibly confusing, since I simply ran flatpak update as usual this morning, and took whatever was there.

That turned out to have been 115.13.0.  By the end of the day, it was “resolved” in the sense that a new version was published on Flathub (128.0esr) and that version was capable of opening my profile.

In the meantime, I learned more commands:

flatpak history

flatpak history produces a list of install/update activity, but it does not have version information at all. We can at least get the commit by asking for it specifically:

flatpak history --columns time,change,app,commit

Or with a sufficiently large terminal window (1920+ pixels wide), try using --columns all.

flatpak remote-info --log

It turns out that the remote can—and Flathub in particular does—keep older versions around for “a while.”  We can get these older versions, with their full commit IDs, by using flatpak remote-info with the repository and the package name.

flatpak remote-info --log flathub \
    org.mozilla.Thunderbird | less

(Line wrapped for readability on mobile; remove the backslash and put it all on one line.)

This prints out some nice header information, then the commit, followed by History.  For Thunderbird in particular, as I write this, the top 3 are:

  • Commit: 2131b9d618… on 2024-07-17 16:26:34 +0000
  • Commit: c2e09fc595… on 2024-07-16 18:58:52 +0000 (this is the one I installed this morning, about 2024-07-17 13:00:00 +0000.)
  • Commit: 2151b1e101… on 2024-07-11 18:18:41 +0000 (which I installed 2024-07-12)

I opened a second terminal window, so I could copy and paste the full commit IDs between them, while experimenting.

sudo flatpak update --commit

Now that we have our old version, how do we install it?  Let’s assume the most-recent version wasn’t published yet, and I just wanted to roll back to my version from 2024-07-12.  We’ll pass its hash to the update command, and run it with root privileges:

sudo flatpak update --commit=2151b1e101… \
    org.mozilla.Thunderbird

Of course I did it without sudo at first, but after confirming, it failed, stating I needed root privileges.  I guess it makes sense (they don’t want someone who doesn’t know my password to downgrade to a known-vulnerable app and then exploit it) but I’m also miffed that it couldn’t tell me this before confirmation.

Anyway, one quick test later, I had my email again.

Versions in Flatpak

After the rollback, I checked the Thunderbird version through Help → About: it was “128.0esr, Mozilla Flatpak version 1.0”.

I followed up with a plain flatpak update org.mozilla.Thunderbird to get the latest 128.0esr build, and verified that was able to access my email as well.  I checked the version again in Help → About: it was “128.0esr, Mozilla Flatpak version 1.0”.

That’s why flatpak update and flatpak history don’t have version numbers at all.  Version numbers don’t come with any guarantee of clarity or accuracy: two different builds reported exactly the same version string here.

What I didn’t learn

I might have been able to give the short commit ID (from flatpak history) directly to flatpak update without going through flatpak remote-info --log in between.  I didn’t actually try it.

I kept trying to find information about branches, to see if there was a Thunderbird beta branch I could try since stable was broken, but I never did find any information about that.  There’s some build-related documentation about how to specify the branch during build, but absolutely nothing about listing available branches.

I also didn’t find anything about this situation in web searches.  How did version 115.x get pushed after 128.x?  Why did it take 21 hours to get it fixed?  Where would I find out whether Mozilla even knew about the problem?  I discovered it around 15:00, and couldn’t tell if anyone else was having the issue!

There’s a “Subject” in the flatpak remote-info --log data for each commit, but it is invariably “Export org.mozilla.Thunderbird”, so that didn’t add any signal, either.

Sunday, July 14, 2024

fopen() modes vs. Unix modes

PHP has a function for creating temporary files, tempnam. One limitation is that it only takes a filename prefix, and most often, I want to have a file “type” as the suffix, like “report-20240701-RANDOM.csv”.

new SplTempFileObject() creates a file in memory, which isn’t usable for anything where an actual “file on disk” is required.  The related tmpfile() function does not give access to the file name itself.

Meanwhile, fopen() and new SplFileObject() don’t offer control of the Unix permissions.  We can create files in exclusive-write mode by setting the mode argument to “x”, but we can’t pass 0o600 (rw- --- ---) at that stage.  We have to create the file, and if it works, call chmod() separately.

fopen() and anything modeled on it offer a context parameter, but there are no context options for the file: scheme, only for other stream wrappers.

Underneath fopen()—at least on Linux—is the open syscall.  That call accepts a mode_t mode argument, to indicate what Unix permissions to use when creating a file, which is exactly what we are after.  But thanks to history and standards, we can’t access that directly from PHP now.
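To show what that mode_t argument buys, here is a sketch in Python rather than PHP, since Python’s os.open() passes flags and mode straight through to the syscall. The function name and the “report-” prefix are hypothetical, and secrets.token_hex() stands in for the random part of the name.

```python
import os
import secrets
import stat
import tempfile


def create_private_file(directory: str, prefix: str, suffix: str) -> str:
    """Create a new file with owner-only permissions in one step; return its path."""
    name = f"{prefix}{secrets.token_hex(8)}{suffix}"
    path = os.path.join(directory, name)
    # O_EXCL fails if the path already exists (like fopen's "x" mode);
    # 0o600 is the mode_t argument that fopen() has no way to pass.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(fd)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = create_private_file(tmp, "report-20240701-", ".csv")
    # The umask can only remove bits, so group/other never gain access.
    print(oct(stat.S_IMODE(os.stat(path).st_mode)))
```

The file never exists with wider permissions, which is the property the create-then-chmod() sequence cannot guarantee.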

P.S.: there’s actually another possibility: we can rename() the file from tempnam() to add a suffix in the same directory.  If an attacker can observe our original file and create the target file with something unexpected, then on Unix the rename() will replace the attacker’s file instead of writing through it (and will fail outright if the target is a directory).  If tempnam() didn’t give us the permissions we wanted, though, we’d be out of luck with that, and it’s still a two-step process.

Sunday, July 7, 2024

Revisiting Backups

Since comparing DejaDup and Pika Backup for work, I’ve also used KDE backup/kup at home, and gotten a little more experience with both systems.  How are things going?

Pika Backup

Pika has a firm internal idea of the schedule.  At the end of a vacation, where the normal weekly backup was skipped, I manually asked it to run the backup “6 days late / 1 day early”, hoping I could skip it the next day.  No such luck: it promptly asked for the drive the next morning.  No big deal; I just didn’t know the details of the system.

One small file recovery was quick and effective through the GUI.  It looked like a special folder in the normal file explorer, which allowed for drag-and-drop copying from the backup to the desired location in another file window.  As usual when coding, I had done something, changed my mind, deleted it, and then changed my mind back later.  And probably changed a couple more times.

KDE Backup

KDE Backup had been set up as a redundant system behind my handcrafted (smaller, faster, non-versioned) script, the latter of which was—er, intended—to store only my critical data.

A few months after a trial-by-fire backup test, I discovered that ~/.gnupg was not in my manual backup.  This could have been a critical fault… but KDE Backup had the data.  Clicking the “Restore” button opened a “File Digger” window, and from there it was just like using Pika Backup.

I got my key back!

I added the missing ~/.gnupg directory to manual backup, then reconfigured KDE Backup to back up into the existing repository.  That went smoothly, too.  It noticed that the directory I gave it already had data, so it verified the integrity, and then backed up into it.

Format-wise, KDE uses kup to do the backup, which is a front-end to bup, storing the data as a bare git repository.  I didn’t need a password to get data out of it, which makes sense, because I didn’t need one to set up the backup.  That’s also great news for actually recovering my data, because I have no idea what I would have chosen for a password when I started KDE Backup in the first place.

Conclusion

Both systems are working great, i.e., better than my manual one.

Sunday, June 23, 2024

Sorted by What?

Shortened for illustrative purposes, I came across some ancient code of the form:

SELECT DATE_FORMAT(o.created_at, '%c/%e/%Y') AS dt,
    DATE_FORMAT(c.updated_at, '%l:%i %p') as tm,
    …
FROM orders o JOIN customers c … WHERE …
ORDER BY c.last_name, dt DESC, tm;

The ORDER BY dt caught my eye because it’s not an actual column in any table. Its value turns out to be the American-style “6/23/2024” format, which is reasonable to display, but completely wrong to sort on.  Doing that puts October prior to February, as “10” begins with “1”, which is less than “2”.
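A small Python sketch makes the difference visible. The three dates are invented sample data, and us_format() is a hypothetical helper that mimics the unpadded output of DATE_FORMAT(…, '%c/%e/%Y').

```python
from datetime import date

orders = [date(2024, 2, 14), date(2024, 10, 3), date(2023, 12, 25)]


def us_format(d: date) -> str:
    """Format as month/day/year without zero padding, like '%c/%e/%Y'."""
    return f"{d.month}/{d.day}/{d.year}"


by_string = sorted(orders, key=us_format, reverse=True)  # what ORDER BY dt DESC does
by_date = sorted(orders, reverse=True)                   # true chronological order

print([us_format(d) for d in by_string])  # ['2/14/2024', '12/25/2023', '10/3/2024']
print([us_format(d) for d in by_date])    # ['10/3/2024', '2/14/2024', '12/25/2023']
```

Sorting on the formatted string puts a 2023 order between two 2024 orders; sorting on the underlying date does not.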

I cannot guess why it pulls the time from an unrelated column as tiebreaker, sorting it the other direction.  The rest of the issues are likely for the same reasons as the date.

I assume the chaotic arrangement of orders within a customer was never raised as a concern only because duplicating orders would be rare enough—ideally, never happening—that it didn’t matter.

Nonetheless, I queued a change to sort on the full date+time held in created_at, so that records will be fully chronological in the future.

Sunday, June 16, 2024

Availability and Automatic Responses

At work, we have built up quite a bit of custom monitoring, for example. It’s all driven by things failing in new and exciting ways.

Before that particular day, there were other crash-loop events with that code, which mainly manifested as “the site is down” or “being weird,” and which showed up in the metrics we were collecting as high load average (loadavg.) Generally, we’d get onto the machine—slowly, because it was busy looping—see the flood in the logs, and then ask Auto Scaling to terminate-and-replace the instance.  It was the fastest way out of the mess.

This eventually led to the suggestion that the instance should monitor its own loadavg, and terminate itself (letting Auto Scaling replace it) if it got too high.

We didn’t end up doing that, though.  What if we had legitimate high CPU usage?  We’d stop the instance right in the middle of doing useful work.

Instead, during that iteration, we built the exit_manager() function that would bring down the service from the inside (for systemd to replace) if that particular cause happened again.

Some other time, I accidentally poisoned php-fpm.  The site appeared to run fine with the existing pages.  However, requests involving the newly updated extension would somehow both generate a segfault, and tie up the request worker forevermore.  FPM responded by starting up more workers, until it hit the limit, and then the entire site was abruptly wedged.

It became a whole Thing because the extension was responsible for EOM reporting that big brass was trying to run… after I left that evening.  The brass tried to message the normally-responsible admin directly.  It would have worked, but the admin was strictly unavailable that night, and they didn’t reach out to anyone else.  I wouldn’t find out about the chaos in my wake until reading the company chat the next morning.

Consequently, we also have a daemon watching php-fpm for segfaults, so it can run systemctl restart from the outside if too many crashes are happening.  It actually does have the capability to terminate the instance, if enough restarts fail to alleviate the problem.
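The decision logic of such a watcher can be sketched in a few lines of Python. This is not the production daemon: the class name, the thresholds, and the three return values are assumptions, and a real implementation would run systemctl restart or ask Auto Scaling to replace the instance where this sketch only returns a string.

```python
from collections import deque


class CrashWatcher:
    """Turn a stream of crash timestamps into 'ok', 'restart', or 'terminate'."""

    def __init__(self, max_crashes=5, window=60.0, max_restarts=3):
        self.max_crashes = max_crashes    # crashes tolerated inside one window
        self.window = window              # sliding window length, in seconds
        self.max_restarts = max_restarts  # restarts to try before giving up
        self.crashes = deque()
        self.restarts = 0

    def record_crash(self, now: float) -> str:
        self.crashes.append(now)
        # Forget crashes that have fallen out of the sliding window.
        while self.crashes and now - self.crashes[0] > self.window:
            self.crashes.popleft()
        if len(self.crashes) < self.max_crashes:
            return "ok"
        self.crashes.clear()
        if self.restarts >= self.max_restarts:
            return "terminate"
        self.restarts += 1
        return "restart"


watcher = CrashWatcher(max_crashes=2, window=60.0, max_restarts=1)
print([watcher.record_crash(t) for t in (0, 1, 2, 3)])
# ['ok', 'restart', 'ok', 'terminate']
```

This sketch never resets the restart counter; a long-running daemon would probably want to forget old restarts after a quiet period.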

I’m not actually certain if that daemon is useful or unnecessary, because we changed our update process.  We now deploy new extension binaries by replacing the whole instance.

Having a daemon which can terminate the instance opens a new failure mode for PHP: if the new instance is also broken, we might end up rapidly cycling whole instances, rather than processes on a single instance.

Rapidly cycling through main production instances will be noticed and alerted on within 24 hours.  It has been a long-standing goal of mine to alert on any scaling group’s instances within 15 minutes.

On the other hand, we haven’t had rapidly-cycling instances in a long time, and the cause was almost always crash-looping on startup due to loading unintended code, so expanding and improving the system isn’t much of a business priority.

It doesn’t have to be well-built; it just has to be well-built enough that it never, ever stops the flow of dollars.  Apparently.

Sunday, June 9, 2024

My firewall, as of 2024

On my old Ubuntu installation, I had set up firewall rules to keep me focused on things (and to keep software in line, like blocking plain DNS to require DoT to CloudFlare.)

Before doing a fresh installation, I saved copies of /etc/gufw and /etc/ufw, but they didn’t turn out to be terribly useful.  I don’t know what happened, but some of the rules lost address information.  The ruleset ended up allowing printing to the whole internet, for instance.

I didn’t have a need for profiles (I don’t take my desktop to other networks), so I ended up reconstructing it all as a script that uses ufw, and removing gufw from the system entirely (take that!)

That script looks in part like this:

#!/bin/sh
set -eufC
# -- out --
ufw default reject outgoing
ufw allow out 443/udp comment 'HTTP 3'
ufw allow out 80,443/tcp comment 'Old HTTP'
ufw allow out proto udp \
    to 224.0.0.251 port 5353 \
    comment 'mDNS to LAN for printing'
ufw allow out proto tcp \
    to 192.168.0.251 port 631,9100 \
    comment 'CUPS to Megabrick'
ufw allow out on virbr0 proto tcp \
    to any port 22 comment 'VM SSH'
# -- in --
ufw default deny incoming
ufw allow in 9000:9010/tcp \
    comment 'XDebug listener'

This subset captures all of the syntax I’m using: basic and advanced forms, and all of the shapes of multi-port rules.  One must use the ‘advanced’ form to specify address or interface restrictions.  However, ufw is extremely unhelpful about error messages, usually only giving out “wrong number of arguments.”  The typical recourse is either to look harder at the man page syntax, or to try to roll back conditions until it gets accepted.

For deleting those test rules, the best way is ufw status numbered followed by ufw delete N where N is the desired rule number.  (You can also do ufw reset and start over.)

Note that the ufw port range syntax is “low:high” with a colon, like iptables. For example, 9000:9010 is a range of 11 ports; 9000,9010 is a list of only those two ports.
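The arithmetic can be checked with a short Python sketch. expand_ports() is a hypothetical helper written for this illustration; it is not part of ufw and only handles the two shapes used in the script above.

```python
def expand_ports(spec: str) -> list[int]:
    """Expand a ufw-style spec: commas separate items, a colon marks an inclusive range."""
    ports: list[int] = []
    for item in spec.split(","):
        if ":" in item:
            low, high = (int(part) for part in item.split(":"))
            ports.extend(range(low, high + 1))  # both ends are included
        else:
            ports.append(int(item))
    return ports


print(len(expand_ports("9000:9010")))  # 11
print(expand_ports("9000,9010"))       # [9000, 9010]
```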

(I gave the printer a static IP because Windows; thus, the printer’s static IP appears in the ruleset.)

This script, then, only has to be run once per fresh install; after that, ufw will remember these rules and apply them at boot.

Sunday, June 2, 2024

Stateful Deployment was Orthogonal

I used to talk about “stateful, binary” deployment, thinking that both things would happen together:

  1. We would deploy from a built tarball, without any git pull or composer install steps
  2. We would record the actual version (or whole tarball path) that was deployed

This year, we finally accumulated enough failures caused by auto-deploy picking up pushed code that wasn’t ready that we decided we had to solve that issue. It turned out to be unimportant that we weren’t deploying from tarballs.

We introduced a new flag for “auto mode” for the instance-launch scripts to use. Without the flag, deployment happens in manual mode: it performs the requested operation (almost) as it always has, then writes the resulting branch, commit, and (if applicable) tarball overlay as the deployed state.

In contrast, auto mode simply reads the deployed state, and applies that exact branch, commit, and overlay as requested.

I say “simply,” but watch out for what happens to a repository which doesn’t have any state stored.  This isn’t a one-time thing: when adding new repositories later, their first deployment won’t have state yet, either.  This can disrupt both auto and manual deployments.
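The two modes and the missing-state case can be sketched as follows. This is a Python illustration under assumptions, not the actual deployment scripts: the JSON state file, the function names, and the choice to refuse an auto deployment when no state exists are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_state(state_file: Path, branch: str, commit: str, overlay: Optional[str]) -> None:
    """Manual mode: after deploying, record exactly what was deployed."""
    state = {"branch": branch, "commit": commit, "overlay": overlay}
    state_file.write_text(json.dumps(state))


def read_state(state_file: Path) -> Optional[dict]:
    """Return the recorded state, or None if nothing was ever recorded."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text())


def auto_deploy(state_file: Path) -> str:
    """Auto mode: apply the recorded state, and only the recorded state."""
    state = read_state(state_file)
    if state is None:
        # One possible policy: fail loudly instead of deploying whatever was pushed last.
        return "no recorded state: refusing to auto-deploy"
    return f"deploy {state['branch']} at {state['commit']}"


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "deployed.json"
    print(auto_deploy(state_file))                  # first launch of a new repository
    write_state(state_file, "main", "abc123", None)  # placeholder branch and commit
    print(auto_deploy(state_file))
```

Whatever policy is chosen for the missing-state case, the point is that it has to be chosen explicitly, because every new repository will hit it once.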

Sunday, May 26, 2024

My ssh/sshd Configurations

Let’s look at my SSH configurations!

File Layout

Starting with Ubuntu 22.04 LTS and Debian 12, the OpenSSH version in the distribution is new enough that the Include directive is supported, and works properly with Match blocks in included files.  Therefore, most of the global stuff ends up in /etc/ssh/sshd_config.d/01-security.conf and further modifications are made at higher numbers.

Core Security

To minimize surface area, I turn off features I don’t use, if possible:

GSSAPIAuthentication no
HostbasedAuthentication no
PasswordAuthentication no
PermitEmptyPasswords no

AllowTcpForwarding no
X11Forwarding no
Compression no

PermitUserRC no
# Debian and derivatives
DebianBanner no

Some of these are defaults, unless the distribution changes them, which means “explicit is better than implicit” is strongly advised.

Next, I use a group to permit access, allowing me to explicitly add the members to the group without needing to edit the ssh config when things change.  Don’t forget to groupadd ssh-users (once) and gpasswd -a USER ssh-users (for each user.) Then, permit only that group:

AllowGroups ssh-users
# extra paranoia
PermitRootLogin no

Note that all of the above may be overridden in Match blocks, where required. TCP forwarding may also be more finely controlled through PermitListen and PermitOpen directives.

Note also that my systems are essentially single-user.  The group doesn’t permit any sharing (and doesn’t participate in quotas or anything) that would otherwise be forbidden.

Performance

Machines I use for ssh and sshd are all amd64, so for personal usage, I bump the AES algorithms to the front of the list:

Ciphers ^aes256-gcm@openssh.com,aes256-ctr

SFTP

The biggest trouble is the SFTP subsystem.  I comment that out in the main config, then set it in my own:

# /etc/ssh/sshd_config:
#Subsystem sftp ...

# /etc/ssh/sshd_config.d/02-sftp.conf:
Subsystem sftp internal-sftp
Match group sftp-only
    # ForceCommand, ChrootDirectory, etc.

I forget the details of what goes in that Match block.  It’s work stuff, set up a while ago now.

[Updated 2024-08-15: It seems that Subsystem is permitted inside a Match block as of OpenSSH 9.5; Ubuntu 24.04 LTS ships 9.6, so it is available there.  My statements above apply to 22.04, which uses version 8.9; or to Debian 12, with 9.2.]

Ongoing Hardening

I occasionally run ssh-audit and check out what else shows up.  Note that you may need to run it with the --skip-rate-test option these days, particularly if you have set up fail2ban (guess how I know.)

There are also other hardening guides on the internet; I have definitely updated my moduli to only include 3072-bit and up options.  Incidentally, if you wonder how that works:

awk '$5 >= 3071' ...

The default action for awk is print, so that command prints lines that fulfill the condition.  The fifth field is the length of the modulus, so that’s what we compare to.  The actual bit count is 3071 instead of 3072, because the first digit must be 1 to make a 3072-bit number, so there are only 3071 bits that aren’t predetermined.
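For readers who don’t use awk, roughly the same filter can be written in Python. The function name is hypothetical and the sample lines are made up, with a placeholder where the real modulus would be. Unlike the awk one-liner, this sketch drops any line whose fifth field is not a number.

```python
def keep_large_moduli(lines, min_size=3071):
    """Yield lines whose fifth whitespace-separated field (the size) is >= min_size."""
    for line in lines:
        fields = line.split()
        # awk's $5 is fields[4]; skip lines that are too short or not numeric there.
        if len(fields) >= 5 and fields[4].isdigit() and int(fields[4]) >= min_size:
            yield line


# Made-up sample lines in the moduli column order; the last field is a placeholder.
sample = [
    "20230101000000 2 6 100 2047 2 AAAA",
    "20230101000000 2 6 100 3071 2 BBBB",
    "20230101000000 2 6 100 4095 5 CCCC",
]
for line in keep_large_moduli(sample):
    print(line)
```

Only the 3071 and 4095 lines are printed, matching what the awk condition keeps.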

Client Config Sample

Host site-admin
    # [HostName, Port, User undisclosed]
    IdentityFile ~/.ssh/id_admin
    IdentitiesOnly yes

Host 192.168.*
    # Allow talking to Dropbear 2022.83+ on this subnet
    KexAlgorithms +curve25519-sha256,curve25519-sha256@libssh.org
    MACs +hmac-sha2-256

Host *
    Ciphers aes256-gcm@openssh.com,aes256-ctr
    KexAlgorithms sntrup761x25519-sha512@openssh.com
    MACs hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com
    GSSAPIAuthentication no

It’s mostly post-quantum, or assigning a very specific private key to the administrative user on my Web server.

Sunday, May 19, 2024

Everything Fails, FCGI::ProcManager::Dynamic Edition

I have been reading a lot of rachelbythebay, which has led me to thinking about the reliability of my own company’s architecture.

It’s top of my mind right now, because an inscrutable race condition caused a half-hour outage of our primary site.  It was a new, never-before-seen condition that slipped right past all our existing defenses.  Using Site as a placeholder for our local namespace, it looked like this:

use Try::Tiny qw(try catch);
try {
  require Site::Response;
  require Site::Utils;
  ...;
} catch {
  # EX_PRELOAD => exit status 1
  exit_manager(EX_PRELOAD, $_);
};
...;
$res = Site::Response->new();

Well, this time—this one time—on both running web servers… it started throwing an error that Method "new" wasn't found in package Site::Response (did you forget to load Site::Response?).  Huh?  Of course I loaded it; I would’ve exited if that had failed.

In response, I added a lot more try/catch, exit_manager() has been improved, and there is a separate site-monitoring service that will issue systemctl restart on the site, if it starts rapidly cycling through workers.

Sunday, May 12, 2024

Using tarlz with GNU tar

I have an old trick that looks something like:

$ ssh HOST tar cf - DIR | lzip -9c >dir.tar.lz

The goal here is to pull a tar from the server, compressing it locally, to trade bandwidth and client CPU for reduced server CPU usage.  I keep this handy for when I don’t want to disturb a small AWS instance too much.

Since then, I learned about tarlz, which can compress an existing tar archive with lzip.  That seemed like what I wanted, but naïve usage would result in errors:

$ ssh HOST tar cf - DIR | tarlz -z -o dir.tar.lz
tarlz: (stdin): Corrupt or invalid tar header.

It turned out that tarlz only works on archives in POSIX format, and (modern?) GNU tar produces them in GNU format by default.  Pass it the --posix option to make it all work together:

$ ssh HOST tar cf - --posix DIR | \
    tarlz -z -o dir.tar.lz

(Line broken on my blog for readability.)

Bonus tip: it turns out that GNU tar will auto-detect the compression format on read operations these days.  Running tar xf foo.tar.lz will transparently decompress the archive with lzip.

Tuesday, April 30, 2024

Things I learned Reinstalling My Ubuntu

I did not want to wait for Ubuntu Studio 24.04 to be offered as an update to 23.10, so I got the installer and tried it.  Also, I thought I would try repartitioning the disk as UEFI.

Brief notes:

  • I did not feel in control of manual partitioning
  • I found out one of my USB sticks is bad, thanks to F3…
  • …and no thanks to the Startup Disk Creator!
  • If the X11 window manager crashes/doesn’t start, goofy things happen
  • Wayland+KWin still don’t support sticky keys, smh
  • snap remove pops up the audio device overlay… sometimes repeatedly
  • I depend on a surprising amount of configuration actually

Tuesday, April 23, 2024

Getting fail2ban Working [with my weird choices] on Ubuntu 22.04 (jammy)

To put the tl;dr up front:

  1. The systemd service name may not be correct
  2. The service needs to be logging enough information for fail2ban to process
  3. Unrelatedly, Apple Mail on iPhone is really bad at logging into Dovecot
  4. Extended Research

[2024-04-26: Putting the backend in the DEFAULT section may not actually work on all distributions.  One may need to copy it into each individual jail (sshd, postfix, etc.) for it to take effect.]

A minimalist /etc/fail2ban/jail.local for a few services, based on mine:

[DEFAULT]
backend = systemd
[sshd]
enabled = true
journalmatch = _SYSTEMD_UNIT=ssh.service + _COMM=sshd
[postfix]
enabled = true
journalmatch = _SYSTEMD_UNIT=postfix@-.service
[pure-ftpd]
enabled = true
journalmatch = _SYSTEMD_UNIT=pure-ftpd.service

(The journalmatch for pure-ftpd removes the command/_COMM field entirely.)

Sunday, March 3, 2024

vimrc tips

On Debian-family systems, vim.tiny may be providing the vim command, through the alternatives system. If I bring in my dotfiles and haven’t installed a full vim package yet, such as vim-gtk3, then dozens of errors might show up.  vim.tiny really does not support many features.

Other times, I run gvim -ZR for quickly checking some code, to get read-only restricted mode.  In that case, anything that wants to run a shell command will fail.  Restricted mode is also a signal that I don’t trust the files I’m viewing, so I don’t want to process their modelines at all.

To deal with these scenarios, my vimrc is shaped like this (line count heavily reduced for illustration):

set nocompatible ruler laststatus=2 nomodeline modelines=2
if has('eval')
    call plug#begin('~/.vim/plugged')
    try
        call system('true')
        Plug 'dense-analysis/ale'
        Plug 'mhinz/vim-signify' | set updatetime=150
        Plug 'pskpatil/vim-securemodelines'
    catch /E145/
    endtry
    Plug 'editorconfig/editorconfig-vim'
    Plug 'luochen1990/rainbow'
    Plug 'tpope/vim-sensible'
    Plug 'sapphirecat/garden-vim'
    Plug 'ekalinin/Dockerfile.vim', { 'for': 'Dockerfile' }
    Plug 'rhysd/vim-gfm-syntax', { 'for': 'md' }
    Plug 'wgwoods/vim-systemd-syntax', { 'for': 'service' }
    call plug#end()
    if !has('gui_running') && exists('&termguicolors')
        set termguicolors
    endif
    let g:rainbow_active=1
    colorscheme garden
endif

We start off with the universally-supported settings.  Although I use the abbreviated forms in the editor, my vimrc has the full spelling, for self-documentation.

Next is the feature detection of if has('eval') … endif.  This ensures that vim.tiny doesn’t process the block.  Sadly, inverting the test and using the finish command inside didn’t work.

If we have full vim, we start loading plugins, with a try-catch for restricted mode.  If we can’t run the true shell command, due to E145, we cancel the error and proceed without that subset of non-restricted plugins.  Otherwise, ALE and signify would load in restricted mode, but throw errors as soon as we opened files.

After that, it’s pretty straightforward; we’re running in a full vim, loading things that can run in restricted mode.  When the plugins are over, we finish by configuring and activating the ones that need it.

Friday, February 2, 2024

My Issues with Libvirt / Why I Kept VirtualBox

At work, we use VirtualBox to distribute and run development machines.  The primary reasons for this are:

  1. It is free (gratis), at least the portions we require
  2. It has import/export

However, it isn’t developed in the open, and it has a worrying tendency to print sanitizer warnings on the console when I shut down my laptop.

Can I replace it with kvm/libvirt/virt-manager?  Let’s try!