2021-11-29 System Restart SJE and ATL (Resolved)

Written by Marissa Orsini

Updated at April 22nd, 2023

Table of Contents

  • Affected Services
  • Event Timeline (All times 24-hour format, EST)
    • November 29th, 2021
  • Root Cause Analysis
  • Future Preventative Action

Event Description: Core services restarted in SJE and ATL nodes

Event Start Time: 2021-11-29 12:12 EST

Event End Time: 2021-11-29 13:48 EST

RFO Issue Date: 2021-11-30

Affected Services

Media services for greetings, ringing, voicemail recordings, Auto Attendants, and any other recorded message were affected. Calls could not complete while services restarted. Portal access was lost during reconvergence.


Event Timeline (All times 24-hour format, EST)

November 29th, 2021

  • 12:12 Core1-SJE: Our systems alerted to a restart of the NmsMedRecMgr service due to a crash in the same service. Call stats were reviewed and test calls were conducted. There did not appear to be any loss of service, nor did we receive any reports from users. The decision was made to investigate and monitor.
  • 12:31 We could not find documentation on the NmsMedRecMgr service. T4 reached out to Netsapiens engineering for clarification.
  • 13:45 Core1-ATL: Our systems alerted to a crash of the same NmsMedRecMgr service. The service restarted immediately, as designed. This time, clients reported 486 Busy responses, internal and external calls that could not complete, and other loss of voice service. The restart took less than 60 seconds. Most devices resumed calling immediately afterward; some required a reboot to re-establish registration. While data synced between the cores, portals were briefly unavailable.
  • 13:51 Netsapiens engineering identified the issue and began working on a patch.
  • 17:58 Netsapiens engineering provided a software patch that required service restarts. Maintenance was scheduled for the following morning.
  • 11/30/21 05:30 Patches were applied successfully to all cores. Services are nominal. We will continue to monitor for 24 hours before marking the incident as resolved.

Root Cause Analysis

The architectural changes in v42 included several performance enhancements. Among them, the service responsible for media playback became multi-threaded. It was also moved to run as a sub-service of NMS, which is core to calling, registration, and other major functions.

An issue was identified where a high volume of calls needing this media could crash the service. Because the service ran as a sub-thread of NMS, the crash also brought down core calling features. Netsapiens engineering provided a software patch that lowered the thread count, and their testing showed that the patch prevented further crashes. We applied the same patch to all cores on 11/30/2021.
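For readers unfamiliar with what lowering a thread count does in general, the following is a minimal Python sketch and not Netsapiens code. It assumes a hypothetical media playback task and a hypothetical run_playbacks helper, and shows only the general idea: a fixed-size worker pool caps how many playbacks run at once, so a burst of calls waits in a queue instead of spawning more threads. It does not model the crash itself or the NMS service.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_playbacks(num_calls: int, max_threads: int) -> int:
    """Run num_calls simulated playbacks; return the peak number running at once.

    Hypothetical illustration only: the names and timings are assumptions,
    not taken from the NmsMedRecMgr service.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def play_media(call_id: int) -> None:
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for streaming a greeting or recording
        with lock:
            active -= 1

    # The pool size is the cap: extra calls wait in the pool's queue
    # instead of each getting a thread of its own.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        list(pool.map(play_media, range(num_calls)))
    return peak


if __name__ == "__main__":
    # 50 simultaneous calls, but never more than 4 playbacks in flight.
    print(run_playbacks(num_calls=50, max_threads=4))
```

A smaller max_threads trades some playback latency under load for a bounded resource footprint, which is the usual motivation for this kind of cap.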


Future Preventative Action

We will continue to monitor for 24 hours to confirm that the patch is effective. If all remains nominal, we will mark the incident as resolved.

Update 11/30/21: No further issues were experienced. Marking issue as resolved.
