Queue Failure

If queue processing starts to fail, you’ll see events pile up in RabbitMQ as Unacked.

(Screenshot: the Unacked backlog climbing in the RabbitMQ management UI.)
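A quick way to confirm the backlog from the command line, if you have shell access to one of the RMQ nodes (the management UI works just as well):

```
# List each queue with its ready and unacknowledged message counts
rabbitmqctl list_queues name messages_ready messages_unacknowledged
```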

Mitigation

Local processing

When this happens, the first thing to try is connecting to the RMQ cluster directly with a local API running in production events mode.

Depending on the circumstances, you’ll either get the queues back down to 0 or, at the very least, keep queue processing going for new events, which stops the problem from getting worse.
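The exact invocation depends on how the API picks up its config; as a rough sketch, assuming the RMQ connection string and the events mode come from environment variables (the variable names and start command below are placeholders, not the real ones):

```
# Placeholder names -- substitute the API's actual config vars and start command
export RABBITMQ_URL="amqps://<user>:<password>@<prod-rmq-host>:5671"
export EVENTS_MODE="production"
npm run start:events
```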

Restart

If the queue doesn’t start to go down, the next thing to try is restarting the events pods in k8s. It’s not clear why, but the API seems to be able to get stuck processing the queue, possibly due to a bad event or a locking issue.
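A rolling restart of the deployment is usually enough (the namespace and deployment name below are placeholders for whatever the events pods are actually called):

```
kubectl -n <namespace> rollout restart deployment/<events-deployment>

# Watch the new pods come up and confirm they start consuming again
kubectl -n <namespace> get pods -w
```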

Purge

If restarting doesn’t work, you’ll need to purge the queues for scoresheets, stats and transactions.

Quickly scale the events pods down to 0, purge the queues, then scale them back up right away. Purging is only effective when there are no consumers on the queue, and the longer the pods stay down, the bigger the new backlog, so scale back up immediately after purging.
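Roughly, the sequence looks like this (the deployment name, namespace, replica count, and exact queue names are placeholders; confirm them against the cluster before running anything):

```
# 1. Take the consumers offline so the purge actually takes effect
kubectl -n <namespace> scale deployment/<events-deployment> --replicas=0

# 2. Purge each backed-up queue (or use the Purge button in the management UI)
rabbitmqctl purge_queue scoresheets
rabbitmqctl purge_queue stats
rabbitmqctl purge_queue transactions

# 3. Scale straight back up so new events keep getting processed
kubectl -n <namespace> scale deployment/<events-deployment> --replicas=<previous-count>
```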

Restore

Purging doesn’t lose any data; the events are still in Postgres. Run the recalculateScoresheets bin script over the last n hours the issue was going on and recalculate scoresheets, transactions, and stats, in that order (stats will take a while and are the least important). Depending on how many events you’re dealing with, you may need to comment out withLock.
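As a sketch of the invocation (the path and flag name are assumptions; check the script’s actual CLI before running it):

```
# Recalculate from shortly before the incident started; run the equivalent
# recalculation for transactions and then stats afterwards
./bin/recalculateScoresheets --since "<timestamp a few hours before the incident>"
```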

Resolution

When the queues stay at 0 consistently and the bin scripts have caught the events back up, you’re in the clear. The production events pods should be able to continue processing as normal.

There’s likely an event or two somewhere that caused the issue. Try to find it based on the timestamp when the queue started to fail so the same thing doesn’t happen again.
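One way to narrow it down is to pull the events stored in Postgres around that timestamp and look for anything malformed or unusually large (the table and column names here are assumptions about the schema; adjust to the actual events table):

```
# Placeholder schema -- adjust table/column names to match the real events table
psql "$DATABASE_URL" -c "
  SELECT id, type, created_at
  FROM events
  WHERE created_at BETWEEN '<backlog start>' AND '<backlog start + a few minutes>'
  ORDER BY created_at;"
```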