Preliminary Post Incident Review (PIR) - Azure Portal - Errors accessing the Azure portal

Between 02:34 UTC and 07:25 UTC on 16 June 2023, a network issue caused excessive packet loss that affected traffic entering or leaving the West Europe region and, to a lesser extent, traffic between Availability Zones inside the West Europe region. Resources hosted in this region may have experienced availability failures, low throughput, or increased latencies. The traffic loss rate peaked at 10% for short periods of time during the incident. Although mitigation efforts were in progress by 04:45 UTC, the network loss worsened from 02:34 UTC to 05:45 UTC as traffic in the region increased with the workday starting in the EMEA region. From 05:45 UTC, the network recovered progressively throughout the remaining incident period; it was substantially recovered by 06:12 UTC, and loss rates returned to normal by 07:25 UTC. Resources and services in West Europe recovered quickly as network infrastructure health recovered.

The Azure network comprises millions of links between routers, with many redundant links to cope with failures. As a faulty link can cause packet loss or corruption, Azure has network automation systems that continually monitor these links for health. When our automation systems detect that a link is unhealthy, the system confirms it is safe to remove the link from service, issues commands to the routers to shut down the link, and then issues a request to datacentre staff to repair or replace the faulty link.

A network architecture detail relevant to this incident is that the routers that carry traffic between Availability Zones and the Wide Area Network in West Europe are connected by groups of multiple physical links bundled together into what are called Link Aggregation Groups (LAGs, or port-channels). In order to keep traffic balanced across the physical links, if one or more links in a LAG fail, it is desirable to turn off the entire LAG.

Shortly before the start time of this incident, a new network topology description was added into the network automation systems that manage the West Europe region. This new topology included links that were about to be added to the network to increase the capacity in West Europe, but the physical links had not yet been connected or turned on. This network topology description incorrectly listed these links as being “In Production”. The network automation systems detected LAGs that contained working physical links as well as the non-working “In Production” links. Then, as designed, network automation began issuing commands consistent with standard remediations to turn off the LAGs to prevent impact due to potential link imbalance.

Although regions are built with extensive redundancy, enough links were taken out of service to cause congestion and impact to customer traffic. As soon as the network automation systems detected the increase in congestion and packet loss, they determined it was unsafe to continue taking links out of service and stopped their activities, as designed. This safety mechanism prevented the impact from worsening.

Packet loss detection alerts fired at 03:03 UTC and notified on-call engineers that their help was needed to recover the network. Engineers scanned the network to identify links that were incorrectly removed from service and distinguish them from links that were correctly removed from service for being unhealthy. Engineers placed the healthy links back into service and ensured that the network automation would not remove them again. The network began recovering as the links were restored, and other services dependent on the network recovered shortly thereafter.

This specific capacity augment project is halted and will only be resumed after a thorough review and update of the current process. A full investigation remains underway, and a full post incident report (PIR) will be published within the next 14 days. Areas of focus for us in the investigation of this incident include but are not limited to the following:

- Capacity augmentation procedures/safeguards.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:
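
As an illustration of the LAG handling described in this review, the following is a minimal Python sketch, not Azure's actual automation. All names (`Link`, `Lag`, `lags_to_shut_down`, `remediate`, `min_lags_in_service`) are hypothetical. The capacity floor in `remediate` is an assumed stand-in for the congestion-based safety stop the review mentions. The sketch shows how a link that is declared “In Production” but not yet connected can cause an otherwise healthy LAG to be flagged for shutdown.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    in_production: bool  # state declared in the topology description
    healthy: bool        # state actually observed on the link


@dataclass
class Lag:
    name: str
    links: list


def lags_to_shut_down(lags):
    """Return names of LAGs with at least one unhealthy in-production member."""
    return [
        lag.name
        for lag in lags
        if any(link.in_production and not link.healthy for link in lag.links)
    ]


def remediate(lags, min_lags_in_service):
    """Shut down flagged LAGs, stopping before capacity drops below a floor.

    The floor is a hypothetical simplification of a safety stop.
    """
    in_service = len(lags)
    shut = []
    for name in lags_to_shut_down(lags):
        if in_service - 1 < min_lags_in_service:
            break  # safety stop: removing more capacity is considered unsafe
        shut.append(name)
        in_service -= 1
    return shut


lags = [
    Lag("lag-a", [Link("a1", True, True), Link("a2", True, True)]),
    # b2 is not yet connected but is wrongly declared "In Production".
    Lag("lag-b", [Link("b1", True, True), Link("b2", True, False)]),
    # c2 is not yet connected and is correctly declared as not in production.
    Lag("lag-c", [Link("c1", True, True), Link("c2", False, False)]),
]

print(lags_to_shut_down(lags))
print(remediate(lags, min_lags_in_service=2))
```

In this sketch only `lag-b` is flagged, because its unconnected member is mislabeled. `lag-c` is left alone because its unconnected member is declared correctly. This is why the accuracy of the topology description matters as much as the remediation logic itself.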