
2019-11-22 - ACS001 Core Failure (Resolved)

This article covers the ACS001 Core Failure that began on 2019-11-22 and details how the issue was resolved.

Written by Marissa Orsini

Updated at April 22nd, 2023

Event Description: ACS001 Crash

Event Start Time: 2019-11-22 11:42 PM EST

Event End Time: 2019-11-23 01:29 AM EST

RFO Issue Date: 2019-11-25

Affected Services

Phones registered to Atlanta lost registration. Devices configured with SRV or UDP failed over. Devices configured as TCP or manually registered to core1-atl did not regain registration.
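The difference in behavior can be illustrated with a minimal sketch of priority-ordered SRV selection. This is not the devices' actual firmware logic; the `SrvRecord` type, the `pick_registrar` helper, and the `example.net` hostnames are hypothetical, and record weights are ignored for simplicity. It only shows why a device with several SRV targets can fail over while a device pinned to a single host cannot.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class SrvRecord(NamedTuple):
    """One DNS SRV answer (hypothetical illustration type)."""
    priority: int
    weight: int
    port: int
    target: str


def pick_registrar(records: Iterable[SrvRecord],
                   is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable target, trying lowest priority value first.

    Weights are ignored here; a full client would also balance by weight.
    """
    for rec in sorted(records, key=lambda r: r.priority):
        if is_reachable(rec.target):
            return rec.target
    return None


# Hypothetical hostnames; only "core1-atl" appears in this article.
srv_records = [
    SrvRecord(10, 100, 5060, "core1-atl.example.net"),
    SrvRecord(20, 100, 5060, "core1-nyj.example.net"),
]
pinned_records = [SrvRecord(10, 100, 5060, "core1-atl.example.net")]

# Simulate the Atlanta core being down.
down = {"core1-atl.example.net"}


def reachable(host: str) -> bool:
    return host not in down


srv_choice = pick_registrar(srv_records, reachable)        # fails over
pinned_choice = pick_registrar(pinned_records, reachable)  # nowhere to go
```

In this sketch the SRV-configured device lands on the second target, while the manually pinned device has no alternative and stays unregistered until its single host returns.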


Event Summary

On November 22nd, 2019, at 23:42 EST, the ACS cluster began crashing repeatedly. Several phones lost registration and the ability to make or receive calls.


Event Timeline (All times 24-hour format, EST)

November 22nd, 2019

  • 23:42 First crash reported by monitoring systems. Phones lost registration. Notice placed in partner server
  • 23:43 Failover verified to SJE and NYJ clusters

November 23rd, 2019

  • 00:00 Issue verified isolated to Atlanta servers
  • 00:25 Rolled back the 40.2 updates in case they were the cause of the issue
  • 00:57 Services to Atlanta resumed and functional. Endpoints configured for UDP or SRV registered back to the Atlanta cluster. Cause yet undetermined.
  • 01:29 Atlanta cluster remained online. All UDP and SRV phones remained registered. Some reports of call history not functioning. Investigation to continue in the morning.
  • 17:12 Call history page restored. All services functional

Root Cause Analysis

In troubleshooting with NetSapiens and our own senior engineers, we determined that malformed TCP packets were causing a crash in the TCP stack. We believe the packets came from a single device, but more testing is needed. Normally a crash of this kind would not affect services. However, the repeated crashes and the resulting core dumps quickly filled the server's storage. Once storage was full, the server could no longer process normal functions.


Future Preventative Action

While not a permanent fix, the decision was made to block TCP device registration. This affected fewer than 1% of our total registered devices. Devices that were configured for TCP must be reconfigured to use UDP. We are continuing to work with NetSapiens senior engineering staff to determine how a single errant device could crash the stack. Finally, we have instituted additional safeguards that immediately move core dump files off the server to prevent storage from filling up again.
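As an illustration of the core dump safeguard described above, the following is a minimal sketch and not the tooling actually deployed. The function name `relocate_core_dumps`, the `core*` filename pattern, and the directory paths in the usage note are assumptions made for this example.

```python
import shutil
from pathlib import Path
from typing import List


def relocate_core_dumps(dump_dir: str, archive_dir: str,
                        pattern: str = "core*") -> List[str]:
    """Move files matching `pattern` out of dump_dir into archive_dir.

    Returns the names of the files that were moved. The pattern and both
    directories are assumptions; real core dump names depend on the
    system's configuration.
    """
    src = Path(dump_dir)
    dst = Path(archive_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            # shutil.move copies and deletes when dst is on another
            # filesystem, which is the case that frees local storage.
            shutil.move(str(path), str(dst / path.name))
            moved.append(path.name)
    return moved
```

A call such as `relocate_core_dumps("/var/crash", "/mnt/archive/crash")` (both paths hypothetical) could be run from a scheduler or triggered on each crash. The archive directory would need to live on separate storage for this to relieve the server's own disk.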

Update 11/26/19

We worked with NetSapiens (NS) engineering to isolate the issue to TCP connections on SIP trunks only. SIP trunk TCP functionality was left disabled. TCP functionality for endpoints was restored and devices re-registered successfully. Systems have been stable since. NS will continue to follow up regarding SIP trunk TCP.

 
