All sites have returned to normal production after restarting the problematic RabbitMQ nodes. We're continuing to monitor.
A fix has been implemented and we are monitoring the results. Job production is recovering.
CloudAMQP is restarting the queues one at a time.
CloudAMQP identified a partial netsplit (network partition) that left queues on the affected nodes in a bad state. We are continuing to investigate.
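For reference, a partial netsplit like this surfaces in the RabbitMQ management API: each node's entry in /api/nodes carries a "partitions" field listing the peers it believes it is cut off from. A minimal sketch of checking for this, assuming the management plugin is enabled (the host and credentials below are placeholders, not our actual broker):

    import requests

    # Hypothetical management endpoint and read-only credentials.
    MGMT_URL = "https://example-broker.cloudamqp.com/api/nodes"
    AUTH = ("monitor_user", "monitor_pass")

    def partitioned_nodes():
        nodes = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
        # Each node reports the peers it considers itself partitioned from;
        # any non-empty list indicates a (possibly partial) netsplit.
        return {n["name"]: n["partitions"] for n in nodes if n.get("partitions")}

    if __name__ == "__main__":
        bad = partitioned_nodes()
        print("partitioned nodes:", bad or "none detected")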
We are now attempting a full reboot (power cycle) of the problematic node instance.
We've determined that at least one node in our RabbitMQ cluster is having issues and have restarted the service on that node.
It appears that most work messages being produced are not being processed correctly by the Production service. This is resulting in some carton kick-outs at Walmart sites. We've reached out to CloudAMQP.
We are investigating a high exception count for RabbitMQ messages in Prod2.
A transient network disruption occurred in the RabbitMQ platform managed by CloudAMQP. This was later determined to be the root cause of the issue that interrupted production.