2024-09-18 - LAS Outage (Resolved)
Event Description: The LAS core server that provides voice services and device registration unexpectedly crashed, causing a temporary disruption to voice services on the LAS server. During this time, calls and device registrations failed over to alternate servers successfully. Shortly after the successful failover, some phones were intermittently going in and out of registration.
Event Start Time: 2024-09-18 14:55 EST
Event End Time: 2024-10-01 20:00 EST
RFO Issue Date: 2024-10-04
Affected Services:
- Inbound and outbound calling on all servers
- Phone and device registration on all servers
Event Summary
The LAS core server that provides voice services and device registration unexpectedly crashed, causing a temporary disruption to voice services on the LAS server. During this time, calls and device registrations failed over to alternate servers successfully. Shortly after the successful failover, some phones were intermittently going in and out of registration. After working with NetSapiens engineering, we were able to identify the root causes and implement immediate fixes to prevent future occurrences.
Event Timeline
September 18, 2024
- 14:55 EST The CSE team received alerts from our monitoring systems for health check failures for SBC-LAS
- 15:06 EST We validated the alerts, confirmed that the crashed service had not yet auto-recovered, and engaged NetSapiens support.
- 15:19 EST We started receiving alerts that phones on the GRR and ATL servers were intermittently showing no service for brief moments.
- 15:23 EST We reviewed registration statistics for all servers, which showed devices had failed over successfully, but also confirmed the reports of devices intermittently losing registration.
- 15:26 EST Our ticket with NetSapiens was escalated to their T3 engineers. During this time, we confirmed that the primary service for voice and device registrations had still not self-recovered. We proceeded to force-kill the service and restart it manually.
- 15:40 EST The voice service on the LAS server had restarted, but we did not observe phones returning to the server. At this time, we escalated the ticket further to the NetSapiens engineering team.
- 15:32 EST We restarted the voice service a second time. At this time, we began seeing calls and device registrations returning to the LAS server. Subsequent test calls and device registrations confirmed the service was able to receive requests. While this was happening, we were still receiving reports of devices failing to maintain registration on the servers.
- 16:30 EST After working with NetSapiens engineering we confirmed that the voice and device registration services were up and running but were not processing data correctly. After performing a packet capture and analyzing our monitoring systems, we identified a few key areas that were struggling to process data in a timely fashion.
- 16:37 EST We confirmed the data collected was accurate.
- 17:29 EST We identified higher-than-expected memory pool utilization, which was narrowed down to duplicated subscribe packets sent to multiple servers by the failed-over devices.
- 18:16 EST After analyzing additional details, we implemented server-level blocking of the subscribe traffic to allow the memory pools time to finish processing the data. We also enabled a system parameter to perform additional checks and drop duplicate subscribe packets instead of attempting to process them (a conceptual sketch of this de-duplication check follows this day's timeline).
- 19:20 EST We increased the available worker threads for subscribe packet processing.
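For illustration only: the NetSapiens parameter and internal queueing details referenced above are not documented in this RFO, but the general idea of dropping duplicate subscribe packets instead of processing them can be sketched as below. The key choice, de-duplication window, and function names are hypothetical assumptions, not the actual implementation.

```python
import time

# Hypothetical de-duplication cache for SIP SUBSCRIBE requests.
# Key: (device identity, event package, Call-ID); value: time last seen.
# The window and key choice are illustrative assumptions, not NetSapiens values.
_recent_subscribes: dict = {}
DEDUP_WINDOW_SECONDS = 30


def is_duplicate_subscribe(device: str, event: str, call_id: str) -> bool:
    """Return True if an equivalent SUBSCRIBE was seen within the window."""
    now = time.monotonic()
    key = (device, event, call_id)
    last_seen = _recent_subscribes.get(key)
    _recent_subscribes[key] = now
    return last_seen is not None and (now - last_seen) < DEDUP_WINDOW_SECONDS


def handle_subscribe(device: str, event: str, call_id: str, work_queue: list) -> None:
    """Drop duplicates instead of handing them to the subscribe worker pool."""
    if is_duplicate_subscribe(device, event, call_id):
        return  # duplicate from a failed-over device: drop rather than process
    work_queue.append((device, event, call_id))
```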
September 19, 2024
- 00:30 EST After monitoring we found that the subscribe packet pools were not fully clearing. We disabled the system parameter for checking duplicate subscribe traffic and immediately saw all memory pools return to pre-event levels. We observed phone registrations and subscriptions had returned to normal and confirmed via test calls.
- 14:30 EST We continued to observe system stability and worked with NetSapiens engineering team to further identify the root cause of the increased subscribe traffic.
September 20-22, 2024
- We continued to observe system stability on all core servers.
September 23, 2024
- 15:47 EST The CSE team received alerts from our monitoring systems for health check failures for SBC-LAS. Initial review of data showed a similar increase in subscribe traffic to the GRR and ATL servers at the same time.
- 14:08 EST The LAS voice service successfully self-recovered, and we observed device registrations and calls successfully returning to the LAS server. During this time, we received reports that devices registered to the GRR and ATL servers were again intermittently going in and out of registration. Looking at the stats, we did not see the same peaked traffic as during the previous event, and all observed statistics were within their limits.
- 14:38 EST We confirmed the LAS server had fully recovered and that the reported instances of device registration failures were resolved as well.
- 17:02 EST At this time, we identified that the LAS crash on Sept 18 had a different root cause than the Sept 23 event.
- 17:45 EST We decided to perform a controlled 503 status on the LAS server to gather additional details from all 3 servers.
- 20:00 EST The CSE team performed the controlled 503 of the LAS server and worked with NetSapiens engineering to gather synchronized packet captures on the LAS, GRR and ATL servers (an illustrative sketch of the 503 failover behavior follows this day's timeline).
- 20:30 EST The controlled 503 testing was completed.
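For readers unfamiliar with the technique: a controlled 503 places a server into a state where it answers SIP requests with 503 Service Unavailable, which prompts failover-capable devices to re-register against an alternate server. The sketch below is a hypothetical, minimal responder that shows the shape of that behavior; it is not the SBC or NetSapiens tooling used during this maintenance, and the bind address is an assumption.

```python
import socket

# Hypothetical stand-in for a "controlled 503" state: every incoming SIP
# request is answered with 503 Service Unavailable, which signals
# failover-capable devices to retry against an alternate server.
BIND_ADDR = ("0.0.0.0", 5060)  # illustrative; not the production SBC config


def build_503(request: bytes) -> bytes:
    """Echo the headers a SIP response must copy and set a 503 status line."""
    echoed = []
    for line in request.decode(errors="ignore").split("\r\n"):
        if line.lower().startswith(("via:", "from:", "to:", "call-id:", "cseq:")):
            echoed.append(line)
    return ("SIP/2.0 503 Service Unavailable\r\n"
            + "\r\n".join(echoed)
            + "\r\nContent-Length: 0\r\n\r\n").encode()


def serve_503_once() -> None:
    """Answer a single UDP SIP request with a 503 response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(BIND_ADDR)
        request, peer = sock.recvfrom(65535)
        sock.sendto(build_503(request), peer)
```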
September 24, 2024
- 13:00 EST After careful analysis and comparison of the collected packet captures, we identified an area of improvement and applied an update to an existing system parameter.
- 15:00 EST We continued to observe system stability on all core servers and continued working with NetSapiens engineering to identify areas of improvement for failover stability and root cause of the Sept 23 LAS crash event.
- 18:46 EST We identified what we believed to be the root cause of the Sept 23 crash and began taking steps to rectify it.
September 25, 2024
- 08:00 EST We continued to observe system stability on all core servers.
- 15:45 EST We announced we would be performing a controlled 503 status on the LAS server to validate the system changes made on Sept 24 and confirm system stability in a failover event.
- 20:00 EST The CSE team performed the controlled 503 on the LAS server. At this time, we expanded the maintenance window and scope to include performing a controlled 503 status on the GRR and ATL servers.
- 23:45 EST The controlled 503 testing was completed. In our testing we found improvements to registration stability in a failover event but continued to observe instability with subscribe statuses.
September 26, 2024
- 08:50 EST We continued to work with NetSapiens engineering team to identify the root cause of the unusual subscribe traffic.
- 17:45 EST We identified an additional root cause for the Sept 23 LAS system crash, which was related to higher-than-normal IOwait. This was narrowed down to the disks installed in the server. At this time, we started the process of replacing the drives with new ones.
September 27 - 28, 2024
- We continued to observe system stability on all core servers.
September 29, 2024
- 19:20 EST We received confirmation that the drive replacement on the LAS server was successful, and the RAID was successfully rebuilt. At this time, we also increased the voice service watchdog timeout to allow for the system to self-correct without forcing a restart of the service.
September 30, 2024
- 08:30 EST We observed a 10x decrease in IOwait on the LAS server and continued to monitor through peak times during the day to confirm (a generic sketch of how IOwait is sampled follows this day's timeline).
- 15:30 EST All systems remained stable and all stats through peak times were well below system thresholds.
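For context on the IOwait figures above: on Linux, IOwait is the share of CPU time spent idle while waiting on disk I/O, and it can be sampled from /proc/stat. The sketch below is a generic illustration of that calculation, not the monitoring tooling actually in use.

```python
import time


def read_cpu_times() -> tuple:
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    values = [int(v) for v in fields]
    # Field order: user, nice, system, idle, iowait, irq, softirq, ...
    return values[4], sum(values)


def iowait_percent(interval: float = 5.0) -> float:
    """Sample twice and return the IOwait share of total CPU time over the interval."""
    io1, total1 = read_cpu_times()
    time.sleep(interval)
    io2, total2 = read_cpu_times()
    delta_total = total2 - total1
    return 100.0 * (io2 - io1) / delta_total if delta_total else 0.0


if __name__ == "__main__":
    print(f"IOwait over 5s: {iowait_percent():.2f}%")
```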
October 1, 2024
- 20:00 EST CSE team declared the event resolved.
Root Cause
The September 18th crash was caused by a rare indexing error in a Music-On-Hold lookup, which resulted in a thread deadlock that could not self-recover. This is fixed in a future software update. Backporting the fix into the current version was investigated, but was ultimately not done due to the high potential of causing more frequent crash events, given the nature of the code changes required.
The September 23rd crash was related to the voice service watchdog restarting the service after an extended period of what it determined to be no activity, leading the watchdog to assume the thread had deadlocked. This was ultimately caused by higher-than-expected IOwait from the installed drives. It was resolved by replacing the system disks with new ones and increasing the watchdog timer.
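The watchdog interaction described above can be illustrated with a generic sketch (not the NetSapiens implementation): a monitor restarts the service if no activity heartbeat is seen within a timeout, so a timeout set too low relative to worst-case IOwait stalls will restart a healthy-but-slow service, while a larger timeout leaves room for self-recovery. The timeout value and restart hook below are illustrative assumptions.

```python
import time

# Hypothetical activity watchdog: restarts the service if no heartbeat is
# observed within WATCHDOG_TIMEOUT seconds. The value and restart hook are
# illustrative, not the actual NetSapiens voice-service watchdog.
WATCHDOG_TIMEOUT = 120.0  # raising this reduces false restarts during I/O stalls
last_activity = time.monotonic()


def record_activity() -> None:
    """Called by the service whenever it processes work (its 'heartbeat')."""
    global last_activity
    last_activity = time.monotonic()


def watchdog_check(restart_service) -> bool:
    """Restart the service only if it has been silent past the timeout."""
    silent_for = time.monotonic() - last_activity
    if silent_for > WATCHDOG_TIMEOUT:
        restart_service()   # assumed deadlocked; force a restart
        return True
    return False            # still within the allowed self-recovery window
```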
Future Preventative Action
- We increased the watchdog timeout to allow more time for the service to self-recover and to prevent false restarts.
- We replaced all drives in the RAID setup on the LAS server to resolve the higher IOwait times.
- We increased the number of subscribe thread workers to allow for more system resources to process subscribe requests (BLF status updates) more efficiently.
- We will be adding the following to our monitoring software:
- Updating current IOwait alarms to allow for quicker identification.
- Adding additional alarms based on identified baselines and thresholds for memory pool utilization for quicker identification of potential failures (a simple thresholding sketch follows this list).
- Additional items identified to facilitate a more seamless failover include:
- Continuing to investigate failover stability for device registrations and subscriptions, with controlled 503 testing of servers over the coming weeks.
- Updating device defaults to allow for more seamless failover.
- Improving the ability for physical devices to automatically move themselves to alternate core servers.
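As a rough illustration of the planned baseline- and threshold-based alarming (the monitoring platform, metrics, and exact limits are not specified in this RFO), a check might compare a sampled metric against warning and critical levels derived from an observed baseline:

```python
from dataclasses import dataclass


# Hypothetical baseline-driven alarm thresholds; the real monitoring platform,
# metric names, and limits are not specified in this RFO.
@dataclass
class Alarm:
    metric: str
    baseline: float            # normal observed level for this metric
    warn_factor: float = 2.0   # warn at 2x baseline (illustrative)
    crit_factor: float = 4.0   # critical at 4x baseline (illustrative)

    def evaluate(self, sample: float) -> str:
        """Return the alarm state for a sampled value of the metric."""
        if sample >= self.baseline * self.crit_factor:
            return "CRITICAL"
        if sample >= self.baseline * self.warn_factor:
            return "WARNING"
        return "OK"


# Example: memory pool utilization baseline of 20%, current sample of 85%.
pool_alarm = Alarm(metric="subscribe_memory_pool_pct", baseline=20.0)
print(pool_alarm.evaluate(85.0))  # -> CRITICAL
```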