Magento incident diagnosis

We’ve got a fairly structured regimen that we follow:

  1. Ask the customer if they are running a newsletter/promotion
  2. Ask the customer if they have changed any store configuration settings
  3. Ask the customer if they have installed any new modules

If the answer to all of the above is “No”, we then look at the machine itself.

  1. Find out how to replicate the issue, then confirm it behaves as the customer describes, to rule out a last-mile connectivity fault
  2. Review all server graphs, usually starting with CPU/RAM/HDD and then moving on to the application-level graphs (PHP, Nginx etc.), and look for unusual peaks
  3. Look at system logs for corresponding alerts/changes around the time the issues started (we log page load times in Nginx, so slowdowns show up clearly there)
  4. Check outbound TCP requests (in case of 3rd party modules calling home)
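
As a rough illustration of step 3, here is a minimal sketch of how page load times in an Nginx access log could be summarised per minute. It assumes the log uses the stock "combined" format with `$request_time` appended as the final field. That layout, the function name `slow_minutes` and the 2-second threshold are all assumptions made for the example, not a description of our actual tooling.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout (combined format plus $request_time at the end), e.g.
#   1.2.3.4 - - [12/Mar/2024:10:15:02 +0000] "GET / HTTP/1.1" 200 512 "-" "UA" 0.412
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) .* (?P<rt>\d+\.\d+)$'
)


def slow_minutes(lines, threshold=2.0):
    """Return (minute, mean_seconds, hits) for each minute whose mean
    request time is above `threshold` seconds, in chronological order."""
    buckets = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        buckets[stamp.strftime("%Y-%m-%d %H:%M")].append(float(match.group("rt")))

    slow = []
    for minute in sorted(buckets):
        times = buckets[minute]
        mean = sum(times) / len(times)
        if mean > threshold:
            slow.append((minute, round(mean, 3), len(times)))
    return slow
```

Passing an open log file to `slow_minutes` gives a short list of the minutes worth comparing against the server graphs and against the time the customer says the issue started.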

Beyond this, the next steps depend on what the previous steps have highlighted. The graphs are usually incredibly useful for diagnosis, and 9 times out of 10 they will identify the trigger.