PackNet Software Platform (Prod 2) status page
Incidents

Production Outage - RabbitMQ

CRITICAL | affected: Machine Production (Prod 2) | lasted: 2h 59min 37s

All sites have returned to normal production after restarting the problematic RabbitMQ nodes. We're continuing to monitor.

resolved
Mar 28 08:53:37 (UTC)

A fix has been implemented and we are monitoring the results. Job production is recovering.

monitoring
Mar 28 08:27:17 (UTC)

CloudAMQP is restarting queues one at a time.

identified
Mar 28 08:22:36 (UTC)

CloudAMQP identified a partial network failure (netsplit) that caused node queues to enter a bad state. We are continuing to investigate.

identified
Mar 28 07:47:26 (UTC)

We are now attempting a full reboot of the problematic node instance to power cycle it.

identified
Mar 28 07:41:50 (UTC)

We've determined that at least one node in our RabbitMQ cluster is having issues, and we have restarted the service on that node.

identified
Mar 28 07:19:20 (UTC)

It appears that most messages for work being produced are not being processed correctly by the Production service. This is resulting in some carton kick-outs at Walmart sites. We've reached out to CloudAMQP.

investigating
Mar 28 06:18:00 (UTC)

We are investigating a high exception count for RabbitMQ messages in Prod 2.

investigating
Mar 28 05:56:00 (UTC)

A transient network blip occurred in the RabbitMQ platform managed by CloudAMQP. This was later determined to be the root cause of the production interruption.

investigating
Mar 28 05:54:00 (UTC)