2025/01/29 - Inbound and Outbound calls failing on the LAS and GRR servers (Resolved)
Affected Services
- Inbound calling
- Outbound calling
- File replication between the ATL server and the LAS and GRR servers.
Event Summary
A significant service disruption occurred due to file synchronization delays, database connectivity interruptions, and call processing failures following an upgrade to software version V44.2. These issues resulted in prolonged downtime, impacting both internal operations and customer services.
Event Timeline
Jan 28, 2025
09:07 ET - We became aware of inbound call failures affecting DIDs configured with the Time of Day call routing functionality; the failures were limited to the GRR server.
10:14 ET - After reviewing submitted reports, we confirmed the degraded functionality and declared a major incident. An announcement was sent out to all partners and clients.
11:07 ET - Our team continued investigating and identified a workaround for those affected.
11:15 ET - A potential source of the inbound call failures was identified, and NetSapiens engineering support was engaged for further analysis.
12:17 ET - A mass update to all phone numbers using Time of Day routing was initiated, allowing inbound calls to reach their respective locations. However, auto attendants still failed to route calls correctly.
12:30 ET - NetSapiens engineering began reviewing the provided information.
13:30 ET - The source of the auto attendant routing failures was identified, and work began on a resolution.
14:10 ET - Due to the complexity of resolving the auto attendant routing failures, it was determined that the best course of action was to failover calls and device registrations away from the GRR server to provide immediate relief while continuing to resolve the root cause.
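For context, device registrations in this type of multi-server deployment are commonly steered by DNS SRV priorities, so a failover generally means directing endpoints to a lower-priority (backup) server. The sketch below is illustrative only; the domain name, and the assumption that SRV records drove this particular failover, are not confirmed details of this environment.

```python
# Illustrative sketch: how a SIP endpoint might choose a registration target
# from DNS SRV records (the lowest priority value is preferred). Requires
# dnspython; the domain below is a placeholder, not a record from this incident.
import dns.resolver

def pick_sip_server(domain: str) -> str:
    answers = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
    # Prefer the lowest priority; among equal priorities, higher weight
    # receives more traffic. Lowering a server's preference (or removing its
    # record) is one way registrations can be steered away during a failover.
    best = sorted(answers, key=lambda r: (r.priority, -r.weight))[0]
    return f"{best.target.to_text().rstrip('.')}:{best.port}"

if __name__ == "__main__":
    print(pick_sip_server("voip.example.com"))   # placeholder domain
```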
16:00 ET - After working with NetSapiens support, it was determined that a database sync was required for a full resolution, scheduled for 20:00 ET.
20:00 ET - The database sync was initiated.
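The exact mechanics of the vendor's database sync were not published. As a rough illustration only, the catch-up progress of a MySQL-style replica can be watched by polling SHOW SLAVE STATUS; the host, credentials, and the assumption of MySQL replication below are hypothetical.

```python
# Hypothetical sketch: polling catch-up progress on a MySQL-style replica
# while a resync runs. Host and credentials are placeholders, and the use of
# MySQL replication here is an assumption, not a confirmed platform detail.
import time
import pymysql

def replication_lag(host: str) -> int | None:
    conn = pymysql.connect(host=host, user="monitor", password="placeholder",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")   # SHOW REPLICA STATUS on newer MySQL
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

while True:
    lag = replication_lag("grr-db.example.net")   # placeholder hostname
    print(f"replica lag: {lag} seconds")
    if lag == 0:
        break                                     # replica has caught up
    time.sleep(60)
```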
Jan 29, 2025
04:00 ET - The database sync was successfully completed, and the GRR server manual failover was removed. Testing confirmed that calls and registrations were reaching the server correctly.
11:00 ET - We received reports of calls to Auto Attendants failing again, as well as outbound calls failing for users with the US and Canada Dial Permission. NetSapiens support was engaged for assistance in resolving the issue.
12:43 ET - A manual failover of the LAS and GRR servers was performed, forcing all calls and device registrations to the ATL server.
12:51 ET - While the manual failover of the LAS and GRR servers was successful, inbound and outbound calls continued to fail.
13:20 ET - We identified the source of the inbound and outbound call failures as a rogue dial permission.
13:26 ET - After removing the dial permission, calls were once again processing successfully.
15:05 ET - At this point, inbound and outbound calling was confirmed to be resolved. Monitoring continued to ensure ongoing stability.
15:26 ET - After sustained stability of inbound and outbound calling on the GRR server, the manual failover was removed.
15:44 ET - After continued stability of inbound and outbound calling on the LAS server, the manual failover was removed.
Jan 30, 2025
10:44 ET - Reports and observations indicated audio file sync delays and failures on the LAS and GRR servers.
11:44 ET - After confirming the reports provided, we engaged NetSapiens support on a conference call.
11:54 ET - After confirming the issue was related to the previous day's incident, an announcement was sent to all clients and partners.
12:16 ET - With assistance from NetSapiens support, we identified additional symptoms related to the LAS and GRR database connections.
13:21 ET - After reviewing data, services related to file and database replication were restarted.
13:26 ET - Service restarts were successful, but file replication was still impacted.
13:40 ET - An announcement was sent out with a possible workaround to assign new auto attendant greetings and menu prompts directly to the LAS and GRR servers while investigations continued.
14:27 ET - Additional data was gathered and reviewed.
16:30 ET - After reviewing the collected data, two root causes were identified, and investigation into a resolution path began.
17:30 ET - A resolution path was identified, and steps for implementation were scheduled.
18:30 ET - While preparing to resolve file and database syncing issues, another bug was identified, causing forwarded calls to external numbers to fail. A fix was implemented.
23:30 ET - Initial steps toward full resolution began, with an attempt to offload all recording services to an alternate server.
Jan 31, 2025
02:00 ET - The offloading of recording services was stopped due to an unforeseen issue that prevented linking recordings back to the database correctly. The approach was pivoted to offloading the VoIPmonitor services instead.
04:00 ET - The offloading of the VoIPmonitor services was successfully completed, and system stability was observed on the LAS and GRR servers.
11:00 ET - While we had observed full operation of all services since the completion of maintenance, we began seeing evidence of audio file sync delays again.
11:11 ET - At this time, we determined that the files were syncing but with extreme delays.
11:23 ET - We continued to see database connection warnings and moved the API services to prioritize the ATL server.
11:41 ET - To attempt to alleviate IO contention on the LAS server, the VoIPmonitor services were stopped.
12:21 ET - NetSapiens support identified an error in the backend logs related to another service that assisted with file replication.
15:00 ET - We continued observing delays in audio file updates, causing call failures for devices on GRR and LAS. At this time, the OIT team started moving all DIDs to point to the ATL server.
16:33 ET - A fix for file syncing delays was identified, and preparations for implementation began.
23:38 ET - Updates to the backend code for file syncing were completed to address the delays and failures, along with updates to the LAS and GRR RAID controller settings to address disk IO.
23:55 ET - We observed improved reliability in file syncing, including Auto Attendant greetings and menu prompts. However, we continued to observe higher-than-usual IO usage on the LAS and GRR SSDs during off-peak times.
Feb 1, 2025
14:12 ET - We continued to observe immediate file syncing between the servers. While increased IO persisted during off-peak times, no negative effects on the servers were observed.
Feb 2, 2025
12:45 ET - We continued to observe immediate file syncing between the servers. While increased IO persisted during off-peak times, no negative effects on the servers were observed.
Feb 3, 2025
09:11 ET - System stats had remained stable over the weekend, but we began to see an increase in disk IO along with delays in file syncing between the servers.
11:28 ET - We continued seeing increased IO on the LAS server, causing file sync delays and database connection delays. File syncing on the GRR server continued to show no delays.
13:13 ET - While working with NetSapiens engineering, file sync batching was identified as contributing to additional syncing delays. This was being further investigated.
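As a generic illustration of why batching adds latency (the platform's actual batching parameters are not documented here), a file written just after a transfer run waits up to a full interval before it begins syncing:

```python
# Generic sketch of interval-based batch syncing: a file that just misses a
# transfer run waits up to a full interval before it starts syncing. The
# interval and the batching model are assumptions, not the platform's values.
import queue
import threading
import time

BATCH_INTERVAL = 5.0  # seconds between transfer runs (illustrative)
pending: "queue.Queue[str]" = queue.Queue()

def transfer(files):
    print(f"{time.strftime('%X')} syncing batch: {files}")

def batch_worker():
    while True:
        time.sleep(BATCH_INTERVAL)
        batch = []
        while not pending.empty():
            batch.append(pending.get())
        if batch:
            transfer(batch)          # one bulk transfer per interval

threading.Thread(target=batch_worker, daemon=True).start()

time.sleep(0.5)                      # a greeting arrives just after a run
pending.put("aa_greeting_1234.wav")
print(f"{time.strftime('%X')} queued aa_greeting_1234.wav")
time.sleep(BATCH_INTERVAL + 1)       # it only transfers on the next run,
                                     # ~BATCH_INTERVAL later; per-file sync
                                     # would have started immediately
```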
13:27 ET - We became aware of calls dropping when transferred or placed on call park and began investigating.
14:00 ET - We confirmed the dropped calls and continued investigating a resolution. An announcement was sent out to all clients and partners regarding the issue.
14:23 ET - The increase in disk IO was identified as being caused by underperforming disks, and the sourcing of replacement disks for the LAS and GRR servers began.
14:25 ET - We identified a fix for the file sync failures and planned a code update, with a manual sync of all files to follow later in the evening.
15:40 ET - We identified cross-core calling as the cause of the dropped calls and began moving affected DIDs back to their home servers (LAS or GRR).
16:55 ET - The file sync code change was successfully implemented and monitoring continued.
17:19 ET - We continued to observe improved file syncing times and continued monitoring.
23:00 ET - Additional RAID controller setting changes were applied to help alleviate the increased drive IO.
Feb 4, 2025
09:15 ET - We continued to observe system stability with file syncing and database connectivity and maintained monitoring.
10:51 ET - We began observing increased IO on the LAS and GRR servers, resulting in intermittent database connectivity errors. File syncing on the GRR server remained immediate, while the LAS server experienced delayed sync times during peak hours.
12:30 ET - We continued moving client and partner DIDs back to their home servers (LAS or GRR) upon request, as reports of calls dropping on transfer or call park were resolved with this action.
12:40 ET - We continued observing increased IO on the LAS server, resulting in intermittent file syncing delays and database connectivity failures.
14:12 ET - New SSD drives were ordered with overnight shipping to the LAS and GRR servers.
16:00 ET - We continued observing increased IO on the LAS server, while the GRR server showed improvements with instant file syncing.
16:20 ET - We identified additional causes of file syncing delays related to how the system tracks changes and began preparing a resolution.
19:20 ET - Additional code updates were applied to the services that control file syncing on the LAS server, and monitoring continued.
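Regarding the 16:20 entry on how the system tracks changes: the platform's actual mechanism was not disclosed, but as a general illustration, change detection based only on modification time and size can miss or delay updates that checksum-based detection would catch.

```python
# Generic illustration: stat-based change detection vs. content checksums.
# If a file is rewritten with the same size and an unchanged (or skewed)
# modification time, a stat-only tracker may never flag it for sync, while a
# checksum comparison catches the change. Function names are hypothetical.
import hashlib
import os

def stat_signature(path: str) -> tuple[int, float]:
    st = os.stat(path)
    return (st.st_size, st.st_mtime)

def content_signature(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_sync(path: str, last_stat, last_hash, use_checksums: bool = True) -> bool:
    if stat_signature(path) != last_stat:
        return True
    # Same size and mtime: only a checksum comparison notices a content change.
    return use_checksums and content_signature(path) != last_hash
```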
Feb 5, 2025
09:30 ET - We continued observing improved system stability after the changes made the previous night and maintained monitoring.
11:11 ET - We continued to observe instant file syncing on both the LAS and GRR servers but still saw increased IO-related symptoms on the LAS server.
11:19 ET - After observing continued stability with the GRR server, all DIDs that had been moved away from the GRR server were returned en masse.
19:39 ET - We continued observing instant file syncing across all servers.
Feb 6, 2025
11:30 ET - We received confirmation that the new drives had arrived at the GRR datacenter, and the staff began opening the packages to verify the correct parts.
15:36 ET - Parts for the GRR server were confirmed, and scheduling began for datacenter support to install them. The LAS parts had also been received, verified to be in good condition, and confirmed to be correct.
17:44 ET - A time was confirmed with LAS datacenter support, and an urgent maintenance notice was sent out advising that disk upgrades would be performed at 00:00 ET on Feb 7, 2025.
Feb 7, 2025
00:00 ET - Drive upgrades on the LAS server began.
00:32 ET - Drive upgrades and filesystem installation were successfully completed. The transfer of database files to the new drives began.
06:20 ET - Data transfer was successful, and we began confirming data integrity.
06:30 ET - Data integrity checks passed, and the LAS server was returned to service.
13:04 ET - We continued observing increased performance and system stability on the LAS server after the drive upgrades, as well as instant file syncing.
16:17 ET - An announcement was sent out indicating urgent maintenance on the GRR server to perform the same disk upgrades as those completed on the LAS server, scheduled for 21:00 ET on Feb 7, 2025.
21:00 ET - Disk drive upgrade maintenance began.
22:00 ET - The disk upgrade on the GRR server was successful, and data transfer to the new drives began.
Feb 8, 2025
00:00 ET - Data transfer was successfully completed, and data integrity checks began.
00:12 ET - Data integrity checks were successfully completed, and the server was returned to production.
01:00 ET - All post-upgrade tests were completed, showing significant improvements in system stability. Monitoring continued.
Feb 10, 2025
11:30 ET - We continued observing system stability on the LAS server and maintained monitoring.
Feb 11, 2025
11:30 ET - We continued observing system stability on the LAS server and maintained monitoring.
Feb 12, 2025
11:30 ET - After 48 hours of sustained server stability on the LAS server, we mass-moved all remaining LAS DIDs back to the LAS server and marked this issue as fully resolved.
Root Cause
File Synchronization Delays and Database Connectivity Issues
The primary cause of the file synchronization delays was faulty code introduced in the V44 upgrade. Specifically, this update contained defective regular expression (regex) matching logic and batch file transfer handling, which necessitated reverting settings to their pre-V44 state to restore functionality. These issues were not detected during the initial testing phase by NetSapiens, as the Ubuntu 18 operating system was not explicitly included in the testing scope.
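As an illustration of the class of regex defect described above (the actual V44 pattern is not public, so the pattern and file names below are hypothetical), an overly strict pattern can silently exclude files that should replicate:

```python
# Hypothetical example of the class of regex defect described above: a pattern
# that only accepts lowercase .wav names silently skips other greeting files,
# so they are never queued for replication.
import re

broken = re.compile(r"[a-z0-9_]+\.wav")               # too strict (hypothetical)
fixed = re.compile(r"[\w\-. ]+\.(wav|mp3)", re.IGNORECASE)

candidates = ["MainMenu_Greeting.WAV", "afterhours.wav", "Holiday Menu.mp3"]

print([f for f in candidates if broken.fullmatch(f)])  # ['afterhours.wav']
print([f for f in candidates if fixed.fullmatch(f)])   # all three files
```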
As a result, excessive write cycles on disk drives were observed, leading to significant I/O wait times. The increased disk load further contributed to performance degradation, ultimately causing database connectivity interruptions and compounding the impact of the incident.
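The elevated I/O wait described above is directly observable on the affected hosts. The sketch below, using psutil with an arbitrary threshold, shows the kind of check involved rather than the exact monitoring used during this incident.

```python
# Illustrative I/O pressure check using psutil. The 20% threshold is an
# arbitrary example, not the alerting value used during this incident.
import psutil

cpu = psutil.cpu_times_percent(interval=5)   # sample CPU time shares over 5 s
disk = psutil.disk_io_counters()             # cumulative system-wide disk counters

print(f"iowait: {cpu.iowait:.1f}%  "         # iowait is reported on Linux
      f"writes: {disk.write_count}  write_bytes: {disk.write_bytes}")

if cpu.iowait > 20.0:
    print("WARNING: sustained I/O wait; file sync and database queries may lag")
```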
Failover Events and Call Processing Failures
During failover events on the LAS and GRR servers, a previously unidentified software bug was discovered. This issue caused call drops in specific scenarios where calls were parked or transferred across multiple servers.
Additionally, another defect was identified in dial translation logic. This bug led to incorrect call routing decisions, where inbound and outbound calls were erroneously matched to a translation rule configured to deny calls, resulting in unintended call failures.
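To make this failure mode concrete (the production rule set is proprietary, so the patterns and ordering below are hypothetical), dial translation rules are typically evaluated in order against the dialed number, and an over-broad deny pattern placed too early matches numbers it was never intended to block:

```python
# Hypothetical dial-plan evaluation: an over-broad deny rule placed ahead of
# the normal permit rule blocks ordinary US/Canada calls. The patterns and
# ordering are illustrative, not the production configuration.
import re

rules = [
    ("deny",   r"1?\d{10}"),              # rogue rule: any 10-digit number (or 1 + 10 digits)
    ("permit", r"1[2-9]\d{2}[2-9]\d{6}"), # US/Canada (NANP) numbers
    ("deny",   r".*"),                    # default deny
]

def route(dialed: str) -> str:
    for action, pattern in rules:
        if re.fullmatch(pattern, dialed):
            return action
    return "deny"

print(route("16165551234"))   # "deny" -- blocked by the rogue rule

# Removing the rogue rule restores normal routing:
del rules[0]
print(route("16165551234"))   # "permit"
```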
Impact Summary
- File synchronization failures leading to increased disk utilization and database connectivity issues.
- Call drops during failover scenarios when calls traversed multiple servers.
- Unintended call blocking due to incorrect dial translation logic.
Future Preventative Action
Immediate preventative actions taken:
- Code updates after identifying faulty regex matching, which caused file syncing failures on all three servers.
- Disk drive upgrades on the LAS and GRR servers to resolve the bottleneck of disk writes.
Long-term actions:
- Expanding the snapshotted servers into the V2 infrastructure, which allows for greater performance. Each primary service (core processing, device provisioning, call recording, and VoIP monitor) will be on its own dedicated server, preventing one service from affecting another.
- The Portal and API services will also be moved to their own dedicated servers, allowing for greater flexibility and reduced downtime in case a server requires maintenance.