All sites have returned to normal production after restarting the problematic RabbitMQ nodes. We're continuing to monitor.
A fix has been implemented and we are monitoring the results. Job production is recovering.
CloudAMQP is restarting the queues one at a time.
CloudAMQP identified a partial netsplit (network partition) that left queues on the affected nodes in a bad state. We are continuing to investigate.
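For reference, a partial netsplit like this surfaces in the RabbitMQ management API: each node's entry in /api/nodes carries a "partitions" field listing the peers it believes it is cut off from. A minimal sketch of checking for this, assuming the management plugin is enabled (the host and credentials below are placeholders, not our actual broker):

    import requests

    # Hypothetical management endpoint and read-only credentials.
    MGMT_URL = "https://example-broker.cloudamqp.com/api/nodes"
    AUTH = ("monitor_user", "monitor_pass")

    def partitioned_nodes():
        nodes = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
        # Each node reports the peers it considers itself partitioned from;
        # any non-empty list indicates a (possibly partial) netsplit.
        return {n["name"]: n["partitions"] for n in nodes if n.get("partitions")}

    if __name__ == "__main__":
        bad = partitioned_nodes()
        print("partitioned nodes:", bad or "none detected")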
We are now attempting a full reboot (power cycle) of the problematic node instance.
We've determined that at least one node in our RabbitMQ cluster is having issues and have restarted the service on that node.
It appears that most work messages being produced are not being processed correctly by the Production service. This is resulting in some carton kick-outs at Walmart sites. We've reached out to CloudAMQP.
We are investigating a high exception count for RabbitMQ messages in Prod2.
A transient network disruption occurred in the RabbitMQ platform managed by CloudAMQP. This was later determined to be the root cause of the issue that interrupted production.