Sunday, July 21, 2024

The Case of the Unknown Errors

For a number of reports, we did the lazy thing: the job prints errors to stderr, and we let cron email them to us.

Unfortunately, email is unreliable and transient. If a remote system accepts the message from us and then drops it for anti-spam reasons, we don’t have a log of that, nor a copy to resend. We had already noticed these problems with report data, and now all output files are archived to S3. However, the cron emails are still the only source of truth for errors, so if we don’t get them, they’re lost forever!

I think the solution will be changing the error_log() calls to syslog().  That will create an on-host record, then forward it to the central server for searching and archiving.  We can even still get cron emails (normally) if we include the flag to print the messages to stderr.
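Here is a minimal sketch of what that change might look like. The helper name report_error(), the 'nightly-report' ident, the LOG_USER facility, and the sample message are all assumptions for illustration, not our actual code. LOG_PERROR is the openlog() flag that also prints each message to stderr, which is what keeps the cron emails flowing.

```php
<?php
// Hypothetical helper: the name, the ident, and the facility are
// illustrative assumptions, not taken from the real report jobs.
function report_error(string $message): bool
{
    static $opened = false;
    if (!$opened) {
        // LOG_PID tags each entry with the process ID.
        // LOG_PERROR also writes the message to stderr, so cron
        // still has output to email when something goes wrong.
        openlog('nightly-report', LOG_PID | LOG_PERROR, LOG_USER);
        $opened = true;
    }
    // LOG_ERR marks the entry as an error for filtering and searching.
    return syslog(LOG_ERR, $message);
}

// Before: error_log('report failed: could not open output file');
report_error('report failed: could not open output file');
```

Forwarding to the central server is not something the PHP code does. That part is up to the host's syslog daemon configuration, so the code only needs to get the message into the local syslog.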

I’m just kind of surprised that I have left a “can’t get email errors about email errors” loop in production for over a decade.
