2025-04-23 Web Socket connections for GRR and LAS servers failing to connect
Affected Services
- SNAPmobile Web
Event Timeline
April 23, 2025
- 7:42 AM ET – OIT's on-call team was alerted to failed web socket connections for the GRR and LAS core servers.
- 7:50 AM ET – OIT support verified the connection failure and began investigating.
- 7:59 AM ET – OIT support identified an SSL failure on the web socket service and engaged NetSapiens support.
- 8:24 AM ET – OIT support contacted NetSapiens support by phone to escalate the ticket.
- 9:00 AM ET – NetSapiens support restarted the services related to the web socket connections in an attempt to clear the error.
- 9:06 AM ET – After the service restarts, web socket connections were retested and failed with the same error as before.
- 9:39 AM ET – NetSapiens attempted to restart the Apache web server on the LAS and GRR servers.
- 9:42 AM ET – The Apache web server service was restarted, but the connection failures remained. At this time, the OIT support team was informed that the connection failure was not specific to our servers.
- 9:47 AM ET – The OIT support team began preparing for mass updates to all web phone connections to point to the ATL server, in case NetSapiens support was not able to resolve the root cause promptly.
- 9:59 AM ET – NetSapiens support identified the root cause and began testing a resolution.
- 10:19 AM ET – NetSapiens support applied a fix to the GRR and LAS servers, which resulted in the web socket connections successfully connecting.
Root Cause
The wildcard root SSL certificate had recently been renewed. The renewed certificate was loaded successfully for HTTPS and TLS traffic, but it had not been fully loaded into memory for the module that controls web socket connections. As a result, users were able to launch SNAPmobile Web but then encountered a "connecting" message.
This caused a partial failure for SNAPmobile Web instances, preventing them from connecting to the service that translates encrypted web socket traffic into normal SIP traffic, where call routing is handled.
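A mismatch of this kind, where one listener presents the renewed certificate and another does not, could in principle be detected from outside the server by comparing the certificate each TLS listener presents. The following is a minimal sketch of that idea, not a description of how NetSapiens or OIT monitors these servers. The function names, listener labels, and any host names or port numbers passed in are hypothetical, and the sketch assumes that each listener completes an ordinary TLS handshake.

```python
import hashlib
import socket
import ssl


def fetch_cert_fingerprint(host: str, port: int, timeout: float = 5.0) -> str:
    """Return the SHA-256 fingerprint of the leaf certificate served at host:port."""
    context = ssl.create_default_context()
    # Verification is disabled on purpose: the goal is to inspect whatever
    # certificate is served, including an expired or stale one.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    if der is None:
        raise RuntimeError(f"no certificate presented by {host}:{port}")
    return hashlib.sha256(der).hexdigest()


def find_mismatched_listeners(fingerprints: dict[str, str], reference: str) -> list[str]:
    """Return the listeners whose certificate differs from the reference listener's."""
    expected = fingerprints[reference]
    return sorted(name for name, fp in fingerprints.items() if fp != expected)


def check_server(host: str, ports: dict[str, int], reference: str) -> list[str]:
    """Fetch the certificate from every listener and report those out of sync.

    `ports` maps a hypothetical listener label (for example "https" or
    "websocket") to its port number; the real port numbers are deployment
    specific and are not given in this report.
    """
    fingerprints = {
        name: fetch_cert_fingerprint(host, port) for name, port in ports.items()
    }
    return find_mismatched_listeners(fingerprints, reference)
```

Run after each certificate renewal with the HTTPS listener as the reference, a non-empty result from `check_server` would indicate a listener still presenting a different certificate. A fingerprint comparison only shows that the listeners disagree; it does not show which of them is correct.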
Impact Summary
- SNAPmobile Web phones were unable to fully connect to the GRR and LAS servers due to a renewed SSL certificate not being fully loaded into the server.
- Automatic failover to alternate servers did not activate because SNAPmobile Web phones were partially registered, requiring manual intervention to move devices upon request.
Future Preventative Action
Long-term action
With the migration to the new IAD and PHX servers and the decommissioning of the GRR and LAS servers, SSL certificates will be handled directly by OIT, removing this potential cause of failure in the future. OIT also submitted an improvement request to NetSapiens to account for partial registrations and to include these scenarios in the automatic failover mechanism.
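The requested failover improvement can be illustrated with a small decision function. This is a hypothetical sketch of the general idea only: it does not reflect the NetSapiens implementation, and the state names, the `ClientStatus` type, and the 30-second threshold are assumptions made for illustration. The point it shows is that a client stuck in a partial "connecting" state for too long is treated the same as a disconnected client.

```python
from dataclasses import dataclass
from enum import Enum


class ClientState(Enum):
    DISCONNECTED = "disconnected"  # no connection to the server at all
    CONNECTING = "connecting"      # application loaded, web socket not established
    REGISTERED = "registered"      # web socket up and registration confirmed


@dataclass
class ClientStatus:
    state: ClientState
    seconds_in_state: float


def should_fail_over(status: ClientStatus, connecting_timeout: float = 30.0) -> bool:
    """Decide whether a client should be moved to an alternate server.

    A fully disconnected client always fails over. A partially connected
    client fails over only after it has been stuck for `connecting_timeout`
    seconds (an assumed, illustrative threshold), so that a normal,
    short-lived connection attempt does not trigger a move.
    """
    if status.state is ClientState.DISCONNECTED:
        return True
    if status.state is ClientState.CONNECTING:
        return status.seconds_in_state >= connecting_timeout
    return False
```

Under this rule, a web phone that has shown "connecting" for 45 seconds would be moved to an alternate server, while one that has been connecting for 5 seconds or is fully registered would stay where it is.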