5 How to Interpret the Erlang Crash Dumps

This section describes the erl_crash.dump file generated upon abnormal exit of the Erlang runtime system.

Note

The Erlang crash dump had a major facelift in Erlang/OTP R9C. The information in this section is therefore not directly applicable for older dumps. However, if you use crashdump_viewer(3) on older dumps, the crash dumps are translated into a format similar to this.

The system writes the crash dump to the current directory of the emulator, or to the file named by the environment variable ERL_CRASH_DUMP (how environment variables are set depends on the operating system). For a crash dump to be written, a writable file system must be mounted.
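The lookup rule above can be sketched in a few lines of Python. This is an illustration only, not part of Erlang/OTP: the helper name crash_dump_path is hypothetical, and treating an empty ERL_CRASH_DUMP as unset is an assumption of the sketch.

```python
import os

DEFAULT_NAME = "erl_crash.dump"  # file name used throughout this section


def crash_dump_path(env=None, cwd=None):
    """Return where a crash dump is expected to be written.

    env and cwd default to this process's environment and working
    directory; pass the emulator's values to reason about another process.
    """
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    override = env.get("ERL_CRASH_DUMP")
    if override:  # assumption: an empty value is treated as unset
        return override
    return os.path.join(cwd, DEFAULT_NAME)
```

Calling crash_dump_path() with no arguments only gives the right answer when it runs with the same environment and working directory as the emulator.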

Crash dumps are written mainly for one of two reasons: either the built-in function erlang:halt/1 is called explicitly with a string argument from running Erlang code, or the runtime system has detected an error that cannot be handled. The most common reason that the system cannot handle an error is an external limitation, such as running out of memory. A crash dump can also result from the system reaching a limit in the emulator itself (such as the number of atoms in the system, or too many simultaneous ETS tables). Usually the emulator or the operating system can be reconfigured to avoid the crash, which is why interpreting the crash dump correctly is important.

On systems that support OS signals, it is also possible to stop the runtime system and generate a crash dump by sending the SIGUSR1 signal.
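A minimal sketch of sending that signal from Python on a Unix-like system. The helper name request_crash_dump and its kill parameter are hypothetical, introduced only so the call can be exercised without signalling a real process; os.kill and signal.SIGUSR1 are standard library names.

```python
import os
import signal


def request_crash_dump(os_pid, kill=os.kill):
    """Send SIGUSR1 to the OS process running the Erlang runtime system.

    os_pid is the operating system process id of the emulator, not an
    Erlang pid. The runtime system stops after writing the dump.
    """
    kill(os_pid, signal.SIGUSR1)
```

Because the target runtime system stops, this is something to do deliberately on a node that is already considered lost, not as a routine diagnostic.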

The Erlang crash dump is a readable text file, but it can be difficult to read. Using the Crashdump Viewer tool in the Observer application simplifies the task. This is a wxWidgets-based tool for browsing Erlang crash dumps.

5.1 General Information

The first part of the crash dump shows the following:

The creation time for the dump
A slogan indicating the reason for the dump
The system version of the node from which the dump originates
The compile time of the emulator running the originating node
The number of atoms in the atom table
The runtime system thread that caused the crash dump

Reasons for Crash Dumps (Slogan)

The reason for the dump is shown in the beginning of the file as:

Slogan: <reason>

If the system is halted by the BIF erlang:halt/1, the slogan is the string parameter passed to the BIF. Otherwise it is a description generated by the emulator or the (Erlang) kernel. Normally the message is enough to understand the problem, but some messages are described here. Notice that the reasons given for each crash are only suggestions. The exact reasons for the errors can vary depending on the local applications and the underlying operating system.

<A>: Cannot allocate <N> bytes of memory (of type "<T>"): The system has run out of memory. <A> is the allocator that failed to allocate memory, <N> is the number of bytes that <A> tried to allocate, and <T> is the memory block type that the memory was needed for. The most common case is that a process stores huge amounts of data. In this case <T> is most often heap, old_heap, heap_frag, or binary. For more information on allocators, see erts_alloc(3).
<A>: Cannot reallocate <N> bytes of memory (of type "<T>"): Same as above except that memory was reallocated instead of allocated when the system ran out of memory.
Unexpected op code <N>: Error in compiled code, beam file damaged, or error in the compiler.
Module <Name> undefined | Function <Name> undefined | No function <Name>:<Name>/1 | No function <Name>:start/2: The Kernel/STDLIB applications are damaged or the start script is damaged.
Driver_select called with too large file descriptor N: The number of file descriptors for sockets exceeds 1024 (Unix only). The limit on file descriptors in some Unix flavors can be set to over 1024, but only 1024 sockets/pipes can be used simultaneously by Erlang (because of limitations in the Unix select call). The number of open regular files is not affected by this.
Received SIGUSR1: Sending the SIGUSR1 signal to an Erlang machine (Unix only) forces a crash dump. This slogan reflects that the Erlang machine crash-dumped because of receiving that signal.
Kernel pid terminated (<Who>) (<Exit reason>): The kernel supervisor has detected a failure, usually that the application_controller has shut down (Who = application_controller, Exit reason = shutdown). The application controller can shut down for many reasons; the most common is that the node name of the distributed Erlang node is already in use. A complete supervisor tree "crash" (that is, the top supervisors have exited) gives about the same result. This message comes from the Erlang code and not from the virtual machine itself. It is always because of some failure in an application, either within OTP or a "user-written" one. Looking at the error log for your application is probably the first step to take.
Init terminating in do_boot (): The primitive Erlang boot sequence was terminated, most probably because the boot script has errors or cannot be read. This is usually a configuration error; the system may have been started with a faulty -boot parameter or with a boot script from the wrong OTP version.
Could not start kernel pid (<Who>) (): One of the kernel processes could not start. This is probably because of faulty arguments (like errors in a -config argument) or faulty configuration files. Check that all files are in their correct location and that the configuration files (if any) are not damaged. Usually messages are also written to the controlling terminal and/or the error log explaining what is wrong.

Other errors than these can occur, as the erlang:halt/1 BIF can generate any message. If the message is not generated by the BIF and does not occur in the list above, it can be because of an error in the emulator. There can however be unusual messages, not mentioned here, which are still connected to an application failure. There is much more information available, so a thorough reading of the crash dump can reveal the crash reason. The size of processes, the number of ETS tables, and the Erlang data on each process stack can be useful to find the problem.
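As an illustration of working with slogans programmatically, the following Python sketch extracts the slogan from a dump and recognizes a few of the patterns listed above. It is a sketch under stated assumptions, not an OTP tool: the function names read_slogan and describe_slogan and the "kind" labels are hypothetical, and only the slogan texts themselves come from this section.

```python
import re

# Matches the two allocator slogans listed above:
#   <A>: Cannot allocate <N> bytes of memory (of type "<T>")
#   <A>: Cannot reallocate <N> bytes of memory (of type "<T>")
ALLOC_RE = re.compile(
    r'^(?P<allocator>\S+): Cannot (?P<op>allocate|reallocate) '
    r'(?P<bytes>\d+) bytes of memory \(of type "(?P<type>[^"]+)"\)'
)


def read_slogan(dump_text):
    """Return the text after 'Slogan: ' on the first such line, or None."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan: "):
            return line[len("Slogan: "):].strip()
    return None


def describe_slogan(slogan):
    """Classify a slogan; the 'kind' labels are made up for this sketch."""
    match = ALLOC_RE.match(slogan)
    if match:
        return {
            "kind": "out_of_memory",
            "allocator": match.group("allocator"),
            "bytes": int(match.group("bytes")),
            "type": match.group("type"),
        }
    if slogan.startswith("Received SIGUSR1"):
        return {"kind": "forced_by_signal"}
    if slogan.startswith("Kernel pid terminated"):
        return {"kind": "kernel_pid_terminated"}
    # Anything else, including arbitrary strings passed to erlang:halt/1.
    return {"kind": "other"}
```

Unrecognized slogans fall through to "other" on purpose, since erlang:halt/1 can produce any message.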

Number of Atoms

The number of atoms in the system at the time of the crash is shown as Atoms: <number>. Tens of thousands of atoms is perfectly normal, but more can indicate that the BIF erlang:list_to_atom/1 is used to generate many different atoms dynamically, which is never a good idea.
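A quick check of the atom count might look like the following Python sketch. The function names and the default threshold of 100000 are assumptions chosen for illustration; this section only says that tens of thousands of atoms is normal.

```python
import re

# 'Atoms: <number>' on a line of its own, as described above.
ATOMS_RE = re.compile(r"^Atoms: (\d+)\s*$", re.MULTILINE)


def atom_count(dump_text):
    """Return the atom count from the dump header, or None if absent."""
    match = ATOMS_RE.search(dump_text)
    return int(match.group(1)) if match else None


def atoms_look_suspicious(count, threshold=100_000):
    """True if count exceeds threshold (an illustrative value, not a limit)."""
    return count is not None and count > threshold
```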

5.2 Scheduler Information

Under the tag =scheduler is shown information about the current state and statistics of the schedulers in the runtime system. On operating systems that allow suspension of other threads, the data within this section reflects what the runtime system looks like when a crash occurs.

The following fields can exist for a scheduler:

=scheduler:id

Heading. States the scheduler identifier.

Scheduler Sleep Info Flags

If empty, the scheduler was doing some work. If not empty, the scheduler is either in some state of sleep, or suspended. This entry is only present in an SMP-enabled emulator.

Scheduler Sleep Info Aux Work

If not empty, scheduler-internal auxiliary work is scheduled to be done.

Current Port

The port identifier of the port that is currently executed by the scheduler.

Current Process

The process identifier of the process that is currently executed by the scheduler. If there is such a process, this entry is followed by the State, Internal State, Program Counter, and CP of that same process. The entries are described in section Process Information.

Notice that these entries are a snapshot taken at the moment the crash dump starts to be generated. Therefore they are most likely different from (and more telling than) the entries for the same processes found in the =proc section. If there is no currently running process, only the Current Process entry is shown.

Current Process Limited Stack Trace

This entry is shown only if there is a current process. It is similar to =proc_stack, except that only the function frames are shown (that is, the stack variables are omitted). Also, only the top and bottom parts of the stack are shown. If the stack is small (< 512 slots), the entire stack is shown. Otherwise the entry skipping ## slots is shown, where ## is replaced by the number of slots that have been skipped.

Run Queue

Shows statistics about how many processes and ports of different priorities are scheduled on this scheduler.

** crashed **

This entry is normally not shown. It signifies that getting the rest of the information about this scheduler failed for some reason.

5.3 Memory Information

Under the tag =memory is shown information similar to what can be obtained on a living node with erlang:memory().

5.4 Internal Table Information

Under the tags =hash_table:<table_name> and =index_table:<table_name> are shown internal tables. These are mostly of interest for runtime system developers.

5.5 Allocated Areas

Under the tag =allocated_areas is shown information similar to what can be obtained on a living node with erlang:system_info(allocated_areas).

5.6 Allocator

Under the tag =allocator:<A> is shown various information about allocator <A>. The information is similar to what can be obtained on a living node with erlang:system_info({allocator, <A>}). For more information, see also erts_alloc(3).

5.7 Process Information

The Erlang crash dump contains a listing of each living Erlang process in the system. The following fields can exist for a process:

=proc:<pid>

Heading. States the process identifier.

State

The state of the process. This can be one of the following:

Scheduled: The process was scheduled to run but is currently not running ("in the run queue").
Waiting: The process was waiting for something (in receive).
Running: The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
Exiting: The process was on its way to exit.
Garbing: This is bad luck; the process was garbage collecting when the crash dump was written. The rest of the information for this process is limited.
Suspended: The process was suspended, either by the BIF erlang:suspend_process/1 or because it tried to write to a busy port.

Registered name

The registered name of the process, if any.

Spawned as

The entry point of the process, that is, what function was referenced in the spawn or spawn_link call that started the process.

Last scheduled in for | Current call

The current function of the process. These fields do not always exist.

Spawned by

The parent of the process, that is, the process that executed spawn or spawn_link.

Started

The date and time when the process was started.

Message queue length

The number of messages in the process' message queue.

Number of heap fragments

The number of allocated heap fragments.

Heap fragment data

Size of fragmented heap data, in words. This is data either created by messages sent to the process or by the Erlang BIFs. This amount depends on so many things that this field is usually uninteresting.

Link list

Process IDs of processes linked to this one. Can also contain ports. If process monitoring is used, this field also tells in which direction the monitoring is in effect. That is, a link "to" a process tells you that the "current" process was monitoring the other, and a link "from" a process tells you that the other process was monitoring the current one.

Reductions

The number of reductions consumed by the process.

Stack+heap

The size of the stack and heap, in words (they share a memory segment).

OldHeap

The size of the "old heap", in words. The Erlang virtual machine uses generational garbage collection with two generations. There is one heap for new data items and one for the data that has survived two garbage collections. The assumption (which is almost always correct) is that data surviving two garbage collections will live for a long time, so it can be "tenured" to a heap that is garbage collected less often. This is a common technique in virtual machines. The heaps and the stack together constitute most of the allocated memory of the process.

Heap unused, OldHeap unused

The amount of unused memory on each heap, in words. This information is usually useless.

Memory

The total memory used by this process, in bytes. This includes call stack, heap, and internal structures. Same as erlang:process_info(Pid,memory).

Program counter

The current instruction pointer. This is only of interest for runtime system developers. The function into which the program counter points is the current function of the process.

CP

The continuation pointer, that is, the return address for the current call. Usually of interest only to runtime system developers. This can be followed by the function into which the CP points, which is the function calling the current function.

Arity

The number of live argument registers. If any argument registers are live, they follow this entry. They can contain the arguments of the function if these have not yet been moved to the stack.

Internal State

A more detailed internal representation of the state of this process.
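The per-process fields above lend themselves to simple scripted triage. The Python sketch below splits a dump into its tagged sections, collects the =proc fields, and ranks processes by message queue length. It assumes each field appears as a "Name: value" line under its heading, in the same way as Slogan and Atoms; all function names are hypothetical, and the 8-byte word size in words_to_bytes is an assumption that holds only for a 64-bit emulator.

```python
def iter_sections(dump_text):
    """Yield (tag, ident, lines) for each '=tag[:ident]' section of a dump."""
    tag, ident, body = None, None, []
    for line in dump_text.splitlines():
        if line.startswith("="):
            if tag is not None:
                yield tag, ident, body
            # '=proc:<0.1.0>' -> tag 'proc', ident '<0.1.0>'
            tag, _, ident = line[1:].partition(":")
            body = []
        elif tag is not None:
            body.append(line)
    if tag is not None:
        yield tag, ident, body


def process_fields(dump_text):
    """Map each pid to a dict of its '=proc' fields (values kept as strings)."""
    procs = {}
    for tag, ident, body in iter_sections(dump_text):
        if tag != "proc":
            continue
        fields = {}
        for line in body:
            name, sep, value = line.partition(": ")
            if sep:  # skip lines that are not 'Name: value'
                fields[name] = value.strip()
        procs[ident] = fields
    return procs


def largest_queues(procs, top=5):
    """Return the (pid, fields) pairs with the longest message queues."""
    def queue_length(item):
        return int(item[1].get("Message queue length", "0"))
    return sorted(procs.items(), key=queue_length, reverse=True)[:top]


def words_to_bytes(words, word_size=8):
    """Convert a size in words to bytes; word_size=8 assumes a 64-bit emulator."""
    return words * word_size
```

A long message queue or a large Stack+heap value is often the quickest lead when the slogan reports an allocation failure, which is why the sketch ranks on those fields.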

5.8 Port Information

This section lists the open ports, their owners, any linked processes, and the name of their driver or external process.

5.9 ETS Tables

This section contains information about all the ETS tables in the system. The following fields are of interest for each table:

=ets:<owner>: Heading. States the table owner (a process identifier).
Table: The identifier for the table. If the table is a named_table, this is the name.
Name: The table name, regardless of whether it is a named_table or not.
Hash table, Buckets: Present if the table is a hash table, that is, if it is not an ordered_set.
Hash table, Chain Length: Present if the table is a hash table. Contains statistics about the table, such as the maximum, minimum, and average chain length. A maximum much larger than the average, and a standard deviation much larger than the expected standard deviation, is a sign that the hashing of the terms behaves badly for some reason.
Ordered set (AVL tree), Elements: Present if the table is an ordered_set. (The number of elements is the same as the number of objects in the table.)
Fixed: If the table is fixed using ets:safe_fixtable/2 or some internal mechanism.
Objects: The number of objects in the table.
Words: The number of words allocated to data in the table.
Type: The table type, that is, set, bag, duplicate_bag, or ordered_set.
Compressed: If the table was compressed.
Protection: The protection of the table.
Write Concurrency: If write_concurrency was enabled for the table.
Read Concurrency: If read_concurrency was enabled for the table.
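To see which processes hold the most ETS data, the Objects and Words fields can be totalled per owner. The Python sketch below is one way to do that; it assumes "Name: value" lines under each =ets:<owner> heading, and the function names are hypothetical.

```python
def ets_tables(dump_text):
    """Return one dict per '=ets:<owner>' section, in dump order.

    A list is used rather than a dict keyed by owner because one process
    can own several tables, each with its own '=ets:<owner>' heading.
    """
    tables, current = [], None
    for line in dump_text.splitlines():
        if line.startswith("="):
            current = None  # any new heading ends the previous section
            if line.startswith("=ets:"):
                current = {"owner": line[len("=ets:"):]}
                tables.append(current)
        elif current is not None:
            name, sep, value = line.partition(": ")
            if sep:
                current[name] = value.strip()
    return tables


def words_by_owner(tables):
    """Sum the Words field of all tables belonging to each owner."""
    totals = {}
    for table in tables:
        owner = table["owner"]
        totals[owner] = totals.get(owner, 0) + int(table.get("Words", "0"))
    return totals
```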

5.10 Timers

This section contains information about all the timers started with the BIFs erlang:start_timer/3 and erlang:send_after/3. The following fields exist for each timer:

=timer:<owner>: Heading. States the timer owner (a process identifier), that is, the process to receive the message when the timer expires.
Message: The message to be sent.
Time left: Number of milliseconds left until the message would have been sent.

5.11 Distribution Information

If the Erlang node was alive, that is, set up for communicating with other nodes, this section lists the connections that were active. The following fields can exist:

=node:<node_name>: The node name.
no_distribution: If the node was not distributed.
=visible_node:<channel>: Heading for a visible node, that is, an alive node with a connection to the node that crashed. States the channel number for the node.
=hidden_node:<channel>: Heading for a hidden node. A hidden node is the same as a visible node, except that it is started with the "-hidden" flag. States the channel number for the node.
=not_connected:<channel>: Heading for a node that was connected to the crashed node earlier. References (that is, process or port identifiers) to the not connected node existed at the time of the crash. States the channel number for the node.
Name: The name of the remote node.
Controller: The port controlling communication with the remote node.
Creation: An integer (1-3) that together with the node name identifies a specific instance of the node.
Remote monitoring: <local_proc> <remote_proc>: The local process was monitoring the remote process at the time of the crash.
Remotely monitored by: <local_proc> <remote_proc>: The remote process was monitoring the local process at the time of the crash.
Remote link: <local_proc> <remote_proc>: A link existed between the local process and the remote process at the time of the crash.

5.12 Loaded Module Information

This section contains information about all loaded modules.

First, the memory use by the loaded code is summarized:

Current code: Code that is the current (latest) version of the modules.
Old code: Code where there exists a newer version in the system, but the old version is not yet purged.

Then, all loaded modules are listed. The following fields exist:

=mod:<module_name>: Heading. States the module name.
Current size: Memory use for the loaded code, in bytes.
Old size: Memory use for the old code, in bytes.
Current attributes: Module attributes for the current code. This field is decoded when looked at by the Crashdump Viewer tool.
Old attributes: Module attributes for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.
Current compilation info: Compilation information (options) for the current code. This field is decoded when looked at by the Crashdump Viewer tool.
Old compilation info: Compilation information (options) for the old code, if any. This field is decoded when looked at by the Crashdump Viewer tool.

5.13 Fun Information

This section lists all funs. The following fields exist for each fun:

=fun: Heading.
Module: The name of the module where the fun was defined.
Uniq, Index: Identifiers.
Address: The address of the fun's code.
Native_address: The address of the fun's code when HiPE is enabled.
Refc: The number of references to the fun.

5.14 Process Data

For each process there is at least one =proc_stack and one =proc_heap tag, followed by the raw memory information for the stack and heap of the process.

For each process there is also a =proc_messages tag if the process message queue is non-empty, and a =proc_dictionary tag if the process dictionary (accessed with put/2 and get/1) is non-empty.

The raw memory information can be decoded by the Crashdump Viewer tool. You can then see the stack dump, the message queue (if any), and the dictionary (if any).

The stack dump is a dump of the Erlang process stack. Most of the live data (that is, variables currently in use) is placed on the stack, so this can be interesting. One has to "guess" what is what, but as the information is symbolic, thorough reading of this information can be useful. For instance, the state variable of the Erlang primitive loader can be found on lines (5) and (6) in the following example:

(1)  3cac44   Return addr 0x13BF58 (<terminate process normally>)
(2)  y(0)     ["/view/siri_r10_dev/clearcase/otp/erts/lib/kernel/ebin",
(3)            "/view/siri_r10_dev/clearcase/otp/erts/lib/stdlib/ebin"]
(4)  y(1)     <0.1.0>
(5)  y(2)     {state,[],none,#Fun<erl_prim_loader.6.7085890>,undefined,#Fun<erl_prim_loader.7.9000327>,
(6)            #Fun<erl_prim_loader.8.116480692>,#Port<0.2>,infinity,#Fun<erl_prim_loader.9.10708760>}
(7)  y(3)     infinity

When interpreting the data for a process, it is helpful to know that anonymous function objects (funs) are given the following:

A name constructed from the name of the function in which they are created
A number (starting with 0) indicating the number of that fun within that function
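As an illustration, generated fun names commonly appear in the form -handle_call/3-fun-0-, that is, the enclosing function, its arity, and the fun's number. That exact layout is not specified in this section and can differ between releases, so the Python sketch below treats it as an assumption; the function name parse_fun_name is hypothetical.

```python
import re

# Assumed layout: -<function>/<arity>-fun-<index>-
FUN_NAME_RE = re.compile(
    r"^-(?P<function>.+)/(?P<arity>\d+)-fun-(?P<index>\d+)-$"
)


def parse_fun_name(name):
    """Return (function, arity, index) for a generated fun name, else None."""
    match = FUN_NAME_RE.match(name.strip("'"))  # tolerate a quoted atom
    if not match:
        return None
    return (
        match.group("function"),
        int(match.group("arity")),
        int(match.group("index")),
    )
```

Names that do not fit the assumed layout return None, so ordinary function names pass through unclassified.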

5.15 Atoms

This section presents all the atoms in the system. This is only of interest if one suspects that dynamic generation of atoms is a problem; otherwise this section can be ignored.

Notice that the last created atom is shown first.

5.16 Disclaimer

The format of the crash dump evolves between OTP releases. Some information described here may not apply to your version. A description like this will never be complete; it is meant as an explanation of the crash dump in general and as a help when trying to find application errors, not as a complete specification.
