Dedicated Server Stops Logging and Takes up 99% of CPU

So we have the weirdest behavior on our Linux Dedicated Server on Gamelift Spot instances.

Symptoms:

  1. CPU usage goes to 99%-100%
  2. Logs stop coming in (we have a healthcheck that runs every minute, so the log should never go quiet)
  3. Gamelift healthcheck continues to pass and players are still shown as active after disconnecting
  4. Server process continues to run for days if we allow it
  5. Other server processes running on the same EC2 instance are fine.

Has anyone seen anything like this? Any ideas on where to even start troubleshooting? We’ve been taking shots in the dark for weeks with no luck. I’d say the repro rate is roughly 1 in 100 instances.

In case someone else finds this: you can find what’s causing the CPU to spike with the perf tools on Linux.
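For anyone who wants a starting point, here is a rough sketch of the kind of perf invocation that can show where a spinning process is burning CPU. It is not necessarily the exact set of commands we ran. The PID, the 15-second sampling window, and the `perf.data` file name are placeholders, and it assumes `perf` is installed on the instance and that you have permission to attach to the process (often that means root).

```python
import shlex


def perf_record_cmd(pid: int, seconds: int = 15, out: str = "perf.data") -> list[str]:
    """Build a `perf record` command that samples call stacks of one process."""
    # -g records call graphs, -p attaches to an already-running PID,
    # and the trailing `sleep` bounds how long perf keeps sampling.
    return ["perf", "record", "-g", "-p", str(pid), "-o", out,
            "--", "sleep", str(seconds)]


def perf_report_cmd(data: str = "perf.data") -> list[str]:
    """Build a `perf report` command that prints the sampled hot spots as text."""
    return ["perf", "report", "-i", data, "--stdio"]


if __name__ == "__main__":
    pid = 12345  # hypothetical PID of the stuck server process
    # Print the commands so they can be reviewed before running them
    # (for example via subprocess.run) on the affected instance.
    print(shlex.join(perf_record_cmd(pid)))
    print(shlex.join(perf_report_cmd()))
```

Running the record command against the stuck process and then reading the report should point at the functions where the time is going. `top -H -p <pid>` is also handy for seeing which thread is pegged.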

Here are some helpful links:

Our servers still freeze, but since they no longer eat up 99% of the CPU when they do, we can reliably notice the frozen process and kill it.
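One possible way to do the "notice and kill" part is a small watchdog that treats a silent log as a frozen server. This is only a sketch under assumptions, not a description of our actual setup: the 180-second threshold, the function names, and the idea of using the log file's modification time are all illustrative choices, picked because the healthcheck normally writes to the log every minute.

```python
import os
import signal
import time
from typing import Optional


def log_is_stale(log_path: str, max_silence_s: float,
                 now: Optional[float] = None) -> bool:
    """Return True if the log has not been written to for max_silence_s seconds."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # A missing or unreadable log is treated as stale here (an assumption);
        # change this if your server creates its log lazily.
        return True
    return (now - last_write) > max_silence_s


def kill_if_frozen(pid: int, log_path: str, max_silence_s: float = 180.0) -> bool:
    """Send SIGKILL to pid if its log has gone quiet. Returns True if a kill was sent."""
    if not log_is_stale(log_path, max_silence_s):
        return False
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # the process was already gone
    return True
```

Something like this could run from cron or a sidecar process once a minute per server process. A threshold of a few missed healthcheck intervals avoids killing a server that is merely slow.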