Process Crash Reasons Reporting

Design

The process crash reasons reporting feature provides the following information:

  • Name of the process that crashed

  • Date (timestamp) of the crash

  • Crash reason (fatal signal)

  • Summary of stack trace

The functionality is implemented using the osa library. When a process crashes, the fatal signal and a best-effort stack trace are logged to the system logger, together with a log traceback.

The feature reports a short summary of each OpenSync process crash. The summary is written to the system log and sent via MQTT.

Implementation

Backtrace generation: Acquiring a stack trace when an OpenSync process crashes is implemented in a standard way:

  1. Installing a signal handler for fatal signals (such as SIGSEGV, SIGBUS, SIGILL, etc.).

  2. Unwinding the backtrace at the point of execution where the signal is raised (the point of the crash).

  3. Reporting the crash info (i.e., backtrace and fatal signal) to the system logs and possibly elsewhere, and then re-raising the same signal to ensure that the program actually crashes (the crash is inevitable, and ignoring the signal would lead to an undefined state).
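The three steps above can be sketched as follows. This is only an illustration of the control flow, written in Python for brevity. The OpenSync implementation is native code in the osa library. Python-level handlers run between bytecodes, so this sketch is not suitable for catching real memory faults. The names crash_handler and install_crash_handlers, the chosen signal set, and the log format are assumptions made for this example, and stderr stands in for the system logger.

```python
import os
import signal
import sys
import traceback

# Fatal signals to intercept (illustrative set, not taken from OpenSync code).
FATAL_SIGNALS = (signal.SIGSEGV, signal.SIGBUS, signal.SIGILL,
                 signal.SIGFPE, signal.SIGABRT)


def crash_handler(signum, frame):
    """Report the crash, then let the process die from the same signal."""
    # Step 2: capture the stack at the point where the signal arrived.
    stack = "".join(traceback.format_stack(frame))

    # Step 3a: report the fatal signal and the backtrace.
    # stderr is a stand-in for the system logger in this sketch.
    sys.stderr.write(f"FATAL signal {signum} ({signal.strsignal(signum)})\n")
    sys.stderr.write(stack)
    sys.stderr.flush()

    # Step 3b: restore the default disposition and re-raise the signal, so
    # the process terminates with the original signal instead of continuing.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)


def install_crash_handlers():
    """Step 1: install the handler for every fatal signal."""
    for sig in FATAL_SIGNALS:
        signal.signal(sig, crash_handler)
```

Restoring the default disposition before re-raising matters: without it, the second delivery of the signal would re-enter the handler instead of terminating the process.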

Backtrace summary example:

/usr/osync/lib/libopensync.so(bcmwl_chanspec_get+0x5c) [0xb69a20e0] /usr/osync/lib/libopensync.so(wl80211_survey_results_get+0x65) [0xb69acd9c] /usr/osync/lib/libopensync.so(target_stats_survey_get+0xdc) [0xb69f610c]

API: To opt in to this functionality, an OpenSync manager only has to call the backtrace_init() function at program startup. Everything else is done automatically.

Modes of operation: The current implementation offers two modes of reporting OpenSync process crashes:

  • BTRACE_LOG_ONLY (report to system logs only)

  • BTRACE_FILE_LOG (report to system logs and a file). This mode additionally logs the backtrace dump in a separate file under /var/log/. The intention here is that these files are harvested by the logpull functionality.

Note that in addition to the backtrace dumped to syslog, the current implementation also writes the LOG traceback to the system log. The LOG traceback is often much more descriptive than the backtrace itself. However, LOG traceback reporting is out of scope for this feature; it can be collected by a logpull anyway.

Implementation of backtrace unwinding: OpenSync currently has a custom backtrace() implementation, whose behavior should be (and, according to testing, is) identical to that of uClibc.

MQTT Report Contents

  • An MQTT topic will be generated by the controller and configured through the AWLAN_Node table in the same way as other MQTT topics. The topic is configured in the AWLAN_Node::mqtt_topics map under the key Crash.Reports.

  • Crash reports are sent by DM, which posts directly to the configured MQTT topic (via qm_lib/QM, using the direct method rather than the aggregated one).

  • The format of OpenSync crash reports is JSON.

The following data is reported:

  • Timestamp of the crash (integer, unix timestamp)

  • Name of the process that crashed (string: max 32 chars)

  • Reason for crash (string or enum for signals: FATAL_SIGNAL_SIGSEGV, FATAL_SIGNAL_SIGILL, FATAL_SIGNAL_SIGFPE, FATAL_SIGNAL_SIGBUS, FATAL_SIGNAL_SIGABRT)

  • Backtrace summary (string: max 512 chars)
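As an illustration of the field limits listed above, the following Python sketch assembles the four reported items into a record. The function name make_crash_record, the dictionary keys, and the fallback string for unlisted signals are assumptions made for this example. They are not taken from the OpenSync sources.

```python
import signal

MAX_NAME_LEN = 32        # "max 32 chars" for the process name
MAX_BACKTRACE_LEN = 512  # "max 512 chars" for the backtrace summary

# Signal number -> reason string, using the enum names listed above.
REASONS = {
    signal.SIGSEGV: "FATAL_SIGNAL_SIGSEGV",
    signal.SIGILL: "FATAL_SIGNAL_SIGILL",
    signal.SIGFPE: "FATAL_SIGNAL_SIGFPE",
    signal.SIGBUS: "FATAL_SIGNAL_SIGBUS",
    signal.SIGABRT: "FATAL_SIGNAL_SIGABRT",
}


def make_crash_record(timestamp, name, signum, backtrace):
    """Build a crash record that respects the documented field limits."""
    return {
        "timestamp": int(timestamp),
        "name": name[:MAX_NAME_LEN],
        # Assumption: signals outside the listed set fall back to a
        # free-form string.
        "reason": REASONS.get(signum, f"SIG {signum}"),
        "backtrace": backtrace[:MAX_BACKTRACE_LEN],
    }
```

For example, make_crash_record(1613075969, "sm", signal.SIGSEGV, summary) yields a record whose reason is FATAL_SIGNAL_SIGSEGV and whose backtrace is cut to at most 512 characters.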

Northbound API

The following JSON format is used:

Field name        Data type       Description

locationId        string          Location ID
nodeId            string          Node ID
model             string          Node model string
firmwareVersion   string          Firmware version string
pid               string          Process ID of the crashed process
name              string          Name of the crashed process
reason            string          Crash reason description
timestamp         number (long)   Timestamp of the crash, in milliseconds since epoch.
                                  Note: This is the timestamp of the moment of the crash,
                                  not the timestamp of the MQTT crash report.
backtrace         string          Backtrace string

Example of a JSON MQTT crash report:

{
  "nodeId": "4C70XXXXXX",
  "locationId": "5ff4c749d2a88f37ccXXXXXX",
  "model": "DAKOTA",
  "firmwareVersion": "2.4.0-0-ga0b87f-development",
  "reason": "SIG 11 (Segmentation fault)",
  "timestamp": 1613075969937,
  "pid": "1557",
  "name": "sm",
  "backtrace": " 0 > 0xb6d818cc: backtrace /usr/opensync/lib/libopensync.so\\n 1 > 0xb6d81968: (null) /usr/opensync/lib/libopensync.so\\n 2 > 0xb6d81b80: sig_crash_report /usr/opensync/lib/libopensync.so\\n 3 > 0xb6d81c0c: (null) /usr/opensync/lib/libopensync.so\\n 4 > 0xb68dd4dc: __default_sa_restorer /lib/libc.so.1 \\n"
}
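A consumer of this topic might decode such a report roughly as follows. This is a minimal Python sketch, not part of OpenSync. The function name parse_crash_report and the added crash_time key are hypothetical. Treating every field from the table as mandatory is an assumption.

```python
import json
from datetime import datetime, timedelta, timezone

# Field names from the table above. Assumption: all of them are mandatory.
REQUIRED_FIELDS = ("locationId", "nodeId", "model", "firmwareVersion",
                   "pid", "name", "reason", "timestamp", "backtrace")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_crash_report(payload):
    """Decode a JSON crash report and add a UTC datetime for the crash."""
    report = json.loads(payload)
    missing = [field for field in REQUIRED_FIELDS if field not in report]
    if missing:
        raise ValueError(f"crash report is missing fields: {missing}")
    # "timestamp" is in milliseconds since epoch; integer arithmetic keeps
    # the millisecond part exact.
    report["crash_time"] = EPOCH + timedelta(milliseconds=report["timestamp"])
    return report
```

Note that pid arrives as a string, so a consumer that needs a numeric process ID has to convert it explicitly.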

Requirements

At the time of a crash, if the target uses the BTRACE_FILE_LOG option (report to file, logs and controller) and CONFIG_DM_OSYNC_CRASH_REPORTS is enabled in Kconfig, the flow is:

  1. In the signal handler, write a crash report to a temporary file (or a set of files) under a dedicated directory under /tmp/, for instance /tmp/osync_crash_reports/.

  2. DM monitors the contents of that temporary directory.

  3. When DM detects a new crash report, it sends the report (via MQTT) to the controller and deletes the entry for that crash report in the temporary directory. 
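Steps 2 and 3 of this flow can be sketched as a single pass over the directory. The Python code below is an illustration only. DM itself is not written in Python, and how it monitors the directory is not specified here, so a simple scan is assumed. The function name flush_crash_reports and the publish callback, which stands in for the MQTT send through qm_lib/QM, are hypothetical.

```python
import os


def flush_crash_reports(report_dir, publish):
    """Send every pending crash report in report_dir, then delete it.

    publish is a callable that takes the report text. If it raises, the
    file is left in place so that a later pass can retry it.
    Returns the number of reports sent.
    """
    sent = 0
    for entry in sorted(os.listdir(report_dir)):
        path = os.path.join(report_dir, entry)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        with open(path, "r", encoding="utf-8") as report_file:
            payload = report_file.read()
        publish(payload)   # send to the controller
        os.remove(path)    # delete only after a successful send
        sent += 1
    return sent
```

Deleting the file only after publish returns keeps a report from being lost if the send fails partway through.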

Stripping backtraces: The most important function is usually the one immediately following __default_sa_restorer, together with the next 1 to 3 functions. Backtraces in these short reports should therefore be stripped from both directions: omit all lines up to and including __default_sa_restorer, and include only the first few lines after it (the suggestion is 3, but up to 5), which carry most of the information.
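The stripping rule can be expressed as a short function. The following Python sketch is illustrative. The name strip_backtrace and its parameters are hypothetical, and returning the first frames unchanged when no __default_sa_restorer line is present is an assumption.

```python
RESTORER = "__default_sa_restorer"


def strip_backtrace(lines, keep=3, max_keep=5):
    """Return the few frames that follow the signal restorer frame.

    lines is the backtrace as a list of frame strings, innermost first.
    keep is the number of frames to retain (suggested 3, capped at max_keep).
    """
    keep = max(1, min(keep, max_keep))
    start = 0
    for index, line in enumerate(lines):
        if RESTORER in line:
            start = index + 1  # drop everything up to and including it
            break
    # Assumption: if the restorer frame is absent, keep the first frames.
    return lines[start:start + keep]
```

Applied to a backtrace whose frame 4 is __default_sa_restorer, the default call returns frames 5 to 7 and drops the rest.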