The Sui network outage was quickly resolved with the validators' cooperation



Peter Chang
November 21, 2024 at 22:28

Sui's network experienced a brief outage caused by a bug in its congestion control code. Rapid action by engineers and validators restored operations within minutes of the fix being released, highlighting effective incident response.



The Sui Mainnet recently experienced a major outage that halted all network operations for about two and a half hours. The incident occurred on November 21, 2024, between 1:15 and 3:45 a.m. PT, when a crash loop affected all validators and prevented any transaction processing, according to the Sui Foundation.

Understanding the incident

The problem arose from a bug in the congestion control code, specifically a faulty assert! statement that caused a crash when the estimated execution cost was zero. The bug was tied to the TotalGasBudgetWithCap congestion control mode, which was briefly enabled in protocol version 63 and re-enabled in version 68. It was triggered when the network received a transaction containing a mutable shared object input and zero MoveCall commands, causing all validators to crash.
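
To make the failure mode concrete, here is a minimal Rust sketch of how an assertion on a cost estimate can turn one unusual transaction into a crash. It is not Sui's actual code: the Tx struct, the COST_PER_MOVE_CALL constant, and all function names are hypothetical, and the cost formula is an assumption chosen only so that zero MoveCall commands yield an estimate of zero. The shared-object input is left out for brevity.

```rust
/// Hypothetical, simplified transaction holding only what this sketch needs.
pub struct Tx {
    pub gas_budget: u64,
    pub move_call_count: u64,
}

/// Assumed per-command cost unit, chosen only for illustration.
const COST_PER_MOVE_CALL: u64 = 1_000;

/// Estimated execution cost: proportional to the number of MoveCall
/// commands and capped by the gas budget. Zero commands give zero.
pub fn estimated_cost(tx: &Tx) -> u64 {
    tx.move_call_count
        .saturating_mul(COST_PER_MOVE_CALL)
        .min(tx.gas_budget)
}

/// Fragile variant: treats a zero estimate as impossible and panics.
pub fn schedule_fragile(tx: &Tx) -> u64 {
    let cost = estimated_cost(tx);
    assert!(cost > 0, "estimated execution cost must be non-zero");
    cost
}

/// Defensive variant: accepts zero as a valid estimate instead of asserting.
pub fn schedule_defensive(tx: &Tx) -> u64 {
    estimated_cost(tx)
}
```

Under these assumptions, a transaction with a move_call_count of 0 makes schedule_fragile panic, while schedule_defensive simply returns 0. Because every validator processes the same transaction, a panic of this kind would repeat on each node after every restart, which is consistent with the network-wide crash described above. The real fix may differ from this defensive variant.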

The role of congestion control

Congestion control in the Sui network is crucial to managing transaction rates for shared objects, ensuring that the network is not overloaded. This system was recently upgraded to improve shared object utilization by estimating transaction complexity more accurately. However, the upgrade inadvertently introduced the bug that caused the outage.
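
As a rough illustration of the general idea, the Rust sketch below tracks accumulated estimated cost per shared object and defers any transaction that would push an object past a limit. It is a generic sketch, not Sui's implementation: CongestionTracker, try_schedule, the string object IDs, and the single per-object limit are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of accumulated estimated cost per shared object
/// within one scheduling round.
pub struct CongestionTracker {
    per_object_limit: u64,
    used: HashMap<String, u64>,
}

impl CongestionTracker {
    pub fn new(per_object_limit: u64) -> Self {
        Self { per_object_limit, used: HashMap::new() }
    }

    /// Returns true and records the cost if every touched object stays
    /// within the limit. Returns false (defer) and records nothing otherwise.
    pub fn try_schedule(&mut self, objects: &[&str], cost: u64) -> bool {
        let fits = objects.iter().all(|id| {
            let current = self.used.get(*id).copied().unwrap_or(0);
            current.saturating_add(cost) <= self.per_object_limit
        });
        if fits {
            for id in objects {
                let entry = self.used.entry((*id).to_string()).or_insert(0);
                *entry = entry.saturating_add(cost);
            }
        }
        fits
    }
}
```

In this sketch a cost of zero is always accepted, since it never pushes an object over its limit. A more accurate cost estimate lets more transactions fit under the same limit, which is the kind of improvement the upgrade was aiming for.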

Resolution and response

Upon identifying the problem, Sui engineers immediately devised a fix. The corrective code, detailed in PR #20365, was deployed to Mainnet and Testnet in versions v1.37.4 and v1.38.1, respectively. The rapid rollout was made possible by the outstanding response from the validator community, which enabled the network to resume operations within 15 minutes of the fix being released.

Lessons and future improvements

This incident confirmed the effectiveness of Sui's incident detection and response mechanisms. Automated alerts immediately notified engineers, who worked with the validator community to resolve the issue quickly. Going forward, Sui plans to enhance its testing systems to prevent similar bugs and to streamline its build workflows to reduce incident response times.

For more detailed information, please visit Sui Foundation.

