While working on the DNS issue, I noticed that from time to time the database process chews up loads of memory and sits there dead for a while before the OOM killer kicks in. That blocks the app server too, because it's trying to read from the database. I *think* that's the cause of the 500 errors.
So I've given both the app and database servers a bit more memory (4x more in the database server's case), tuned the caches to suit, and set up logging that warns me when either server gets close to an OOM state, just in case.
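For anyone curious what a "getting close to OOM" warning can look like, here's a minimal sketch. It is not the actual setup on these servers: it assumes Linux with `/proc/meminfo` (including the `MemAvailable` field), and the 10% threshold, the logger name and the function names are made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("memwatch")  # hypothetical logger name

WARN_FRACTION = 0.10  # assumed threshold: warn when under 10% is available


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values are in kB for most fields; a few (e.g. HugePages_Total)
    are plain counts with no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def is_low_memory(info, warn_fraction=WARN_FRACTION):
    """Log a warning and return True if available memory is below the threshold."""
    available = info["MemAvailable"]  # needs a reasonably recent Linux kernel
    total = info["MemTotal"]
    fraction = available / total
    if fraction < warn_fraction:
        log.warning("low memory: %d kB of %d kB available (%.1f%%)",
                    available, total, fraction * 100)
        return True
    return False


if __name__ == "__main__":
    # Run this from cron or a timer; it only warns, it doesn't fix anything.
    with open("/proc/meminfo") as f:
        is_low_memory(parse_meminfo(f.read()))
```

Something like this only buys you a heads-up before the OOM killer fires; the actual fix is still the extra memory and the cache tuning.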
We migrate to new forum software in a month or so, so hopefully I can fix these memory-related things before then.