Server stability issues due to Sympa
Hi Erik and Javier,
I applied a patch to the ohwr server last Friday and have been monitoring it since. The patch seems to have been enough to make the server stable again.
As I suspected, the issue was in the mailing list server, Sympa. Starting last Tuesday, some event made it "go rogue" at certain times during the night (server time). I have not yet identified the concrete "trigger", but it is probably a new "use case" that we had not seen before: a big file uploaded to the mailing list, intensive RSS usage, that kind of thing.
When Sympa encountered this, a bug in its web server made it start spawning lots of useless but CPU-intensive processes. Since the ohwr server is big, it took a while, but eventually the server ran out of resources and died.
The patch I've applied detects and kills the "rogue processes" before they can exhaust the server's resources. I'm attaching one of our monitoring graphs, which shows the past issues and the current situation.
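For reference, the general idea of such a watchdog can be sketched roughly as below. This is only an illustration and not the actual patch: the process name `wwsympa.fcgi`, the CPU and age thresholds, and all function names are assumptions chosen for the example. It also assumes a Linux `ps` (procps) that supports the `etimes` column, and note that the `pcpu` value reported by `ps` is an average over the lifetime of the process, not an instantaneous reading.

```python
import os
import signal
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Proc:
    pid: int
    cpu_percent: float    # average CPU usage as reported by ps
    elapsed_seconds: int  # time since the process started
    command: str


def parse_ps(output: str) -> List[Proc]:
    """Parse lines of the form 'pid pcpu etimes args' into Proc records."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        try:
            procs.append(Proc(int(parts[0]), float(parts[1]),
                              int(parts[2]), parts[3]))
        except ValueError:
            continue  # skip header or malformed lines
    return procs


def find_rogue(procs: List[Proc], name: str = "wwsympa.fcgi",
               max_cpu: float = 80.0, min_age: int = 300) -> List[int]:
    """Return PIDs of matching processes that are both busy and long-lived.

    The name and thresholds are hypothetical defaults for this sketch.
    """
    return [p.pid for p in procs
            if name in p.command
            and p.cpu_percent >= max_cpu
            and p.elapsed_seconds >= min_age]


def kill_rogue(pids: List[int]) -> None:
    """Ask each process to terminate; ignore ones that already exited."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass


def run_once() -> None:
    """One watchdog pass: list processes, pick the rogue ones, stop them."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,pcpu=,etimes=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    kill_rogue(find_rogue(parse_ps(out)))
```

A script like this would typically call `run_once()` from cron every few minutes. Requiring a minimum age as well as high CPU usage is meant to avoid killing processes that are only briefly busy while serving a legitimate request.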
Next week we'll start upgrading Sympa to a more recent and, hopefully, more stable version.
Best regards,
Enrique García Cota
Splendeo Innovación