Determining the Root Cause of Segmentation Faults (SIGSEGV or SIGBUS Errors)

ID 689759
Updated 12/26/2018
Version Latest
Public

Problem: When I run my code compiled with the Intel® Fortran Compiler, I get SIGSEGV on Linux* (or SIGBUS on MacOS*).  This code has run fine for years on <insert your previous compiler/platform>.  Is this a bug in the Intel Compiler?

Environment: Linux or MacOS

Root Cause: A segmentation fault (reported as a bus error on MacOS) is a general fault that can have many causes.  We outline the potential causes below and give suggestions for avoiding the segmentation fault.

Possible Cause #1: Fortran-Specific Stack Space Exhaustion. Solution: the -heap-arrays compiler option.
The Intel Fortran Compiler uses stack space to allocate a number of temporary or intermediate copies of array data. 

Non-OpenMP and non-auto-parallelized applications: If your program does not use OpenMP* or auto-parallelization (the -parallel compiler option), try the -heap-arrays compiler option.  OpenMP or auto-parallelization users, and users of the Linux compilers, should read ahead to Possible Cause #2 for tips on removing the stack size limit.

-heap-arrays

If this removes the SIGSEGV or SIGBUS error, you may STOP at this point. 

You may wish to read this PDF presentation to learn when and where array temporaries are created.  With a few code changes you may be able to avoid some array temporaries and hence reduce your application's need for temporary copies and improve performance.  Also, the -heap-arrays compiler option takes an optional argument [size] that sets a threshold in KB: arrays larger than [size] are allocated on the heap, and all others on the stack.  For example:

-heap-arrays 10

puts all automatic and temporary arrays larger than 10 KB on the heap.

Possible Cause #2: Stack Space Exhaustion.  Solution: Remove the Stack Size Limit for OpenMP Applications or Any Application
The first step is to try to increase your shell stack limit on Linux and MacOS.  The -heap-arrays option described under Possible Cause #1 can have unwanted effects on data sharing in OpenMP or auto-parallelized code.  Because of this, OpenMP and auto-parallelization users are advised not to use -heap-arrays and instead to remove their shell stack size limit.

Linux, bash:       ulimit -s unlimited
Linux, csh/tcsh:   unlimit stacksize

You can check your stack size limit with:

bash:  ulimit -a
csh:   limit

and look for the 'stack size' limit for your shell environment.

Note:  If you run your program under the control of a batch subsystem, you may need to add the command above to your user startup files ( ~/.bashrc  ~/.profile  or ~/.cshrc ).
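The limit check and the raise can be combined in a small wrapper. The following is a minimal sketch, not part of the original article: the function names stack_limits and run_with_max_stack are made up for illustration, and ./my_app is a placeholder for your own executable. It raises the soft limit only as far as the hard limit allows, and does so in a subshell so the calling shell is left unchanged.

```shell
#!/bin/sh
# Report the soft and hard stack limits of the current shell.
# Values are in KB, or the word "unlimited".
stack_limits() {
    soft=$(ulimit -S -s)
    hard=$(ulimit -H -s)
    echo "soft=$soft hard=$hard"
}

# Run a command with the soft stack limit raised to the hard limit.
# The change happens inside a subshell, so the calling shell keeps
# its original limit.
run_with_max_stack() {
    (
        ulimit -S -s "$(ulimit -H -s)" || exit 1
        "$@"
    )
}

stack_limits
# Example (placeholder executable): run_with_max_stack ./my_app
```

If the hard limit itself is too small, it can only be raised by an administrator or by the batch system configuration.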

For MacOS, there is a hard upper limit on the shell stacksize.  For most systems, this is:

bash:  ulimit -s 65532

which sets the limit to just under 64 MB (65532 KB).

An alternative is to use a linker option to increase the executable's default stack size.

Re-run your application.  If this fixes the issue, you may stop.  If your application still generates a SIGSEGV or SIGBUS error, continue reading.

Possible Cause #2-prime: Stack Exhaustion Due to Heap or General Memory Exhaustion
In the process memory map, heap and stack grow towards each other.  If they collide, this, too, can cause a segmentation fault on either the heap allocation or the next stack allocation. 

It is also possible for an application to exhaust all of physical memory plus swap space.  Remember, with a 64-bit application, your virtual memory is practically unlimited.  However, the realistic amount of memory that can be consumed has a ceiling at physical RAM plus swap space (typically, 2x the physical memory size).  You can get this information with the 'free' command.  Physical memory and swap are also shown by 'cat /proc/meminfo' in the fields 'MemTotal' and 'SwapTotal'.  The system itself needs some space, so a rule of thumb is to keep the memory footprint of your application to around 80% of MemTotal, if possible, and never exceed MemTotal + SwapTotal.
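The rule of thumb above can be turned into numbers with a few lines of awk. This is a sketch for Linux only, since it relies on /proc/meminfo. The function name mem_budget and the output labels are invented for illustration. The optional argument lets you point it at a saved copy of the file.

```shell
#!/bin/sh
# Print a memory budget derived from /proc/meminfo (Linux only).
# An optional argument names an alternative file in the same format.
mem_budget() {
    awk '
        /^MemTotal:/  { mem  = $2 }   # values are reported in kB
        /^SwapTotal:/ { swap = $2 }
        END {
            printf "MemTotal_kB=%d\n", mem
            printf "SwapTotal_kB=%d\n", swap
            printf "target_80pct_kB=%d\n", int(mem * 8 / 10)
            printf "ceiling_kB=%d\n", mem + swap
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    mem_budget
fi
```

Compare target_80pct_kB with the resident size your application reaches at its peak. A footprint close to ceiling_kB suggests memory exhaustion rather than a stack limit problem.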

Compile and link with -g -traceback to locate where the code is aborting.

Possible Cause #3, Stack Corruption Due to User Coding Error.  
There are a number of user coding errors that can cause stack corruption and lead to a SIGSEGV or SIGBUS error at run time.  These errors are particularly hard to find since the segmentation fault may occur later in the program in a section unrelated to where the stack was initially corrupted.

The first step is to try to isolate where in the code the fault occurs.  This is done by generating an execution 'traceback'.  Compile and link using the ifort driver and these options:

-g -traceback

When the code faults, you will likely get a report showing the call stack when the fault occurs.  If you do not get a stack traceback, ensure that you have used -g for both compilation and link and make sure that -traceback was used on the compilation.  There are cases where the seg fault occurs while the program is in kernel space and thus no user stack is available for traceback.  

This traceback report is read from the bottom of the list upwards.  Find the uppermost subroutine or function from your code along with its line number to isolate which statement caused the fault.  Check for user coding errors at this statement.  If there is no obvious user error, continue below.

Possible Cause #4: Exceeding Array Bounds.  Solution: Try -check bounds
The -check bounds compiler option provides a run-time check of array accesses and character string expressions to ensure that the indices are within the boundaries of the array.  This checking is useful for finding cases where the indices go outside the declared size of the array.  This option has a big impact on performance, the magnitude of which depends on how many array accesses are performed in the application.  Also, with -check bounds, array bounds checking is not performed for arrays that are dummy arguments in which the last dimension bound is specified as * or when both upper and lower dimensions are 1.  To enable bounds checking, compile with:

-check bounds -g

and run your program.  The checking is performed at run time, not at compile time.  If this finds your error, you may stop.  Otherwise, keep reading.

A new compiler option with the Intel Fortran 19.0 compiler is -check shape. This is very handy for debugging codes with dynamic arrays.  This article describes the use of -check shape compiler option.

Possible Cause #5, Calling a Function as a Subroutine, or invoking a subroutine as if it were a function. 

These are user coding errors where a user does something similar to this:

!-- main program ---
...
call ThisIsIllegal( some_arguments )
...
!-- end main program ---

!-- ThisIsIllegal ---
integer function ThisIsIllegal( some_arguments )
...
!-- end ThisIsIllegal ---

In the example above, the main program calls ThisIsIllegal as if it were a subroutine. However, ThisIsIllegal is declared as a function.  This can cause stack corruption.  To test for these conditions, try using compiler options

-fp-stack-check -g -traceback

Compile with these options and run.  If the stack is corrupted by something similar to the above, your code will exit and give a stack trace.

You can check the interfaces of your procedures with a compile time check:

-gen-interfaces -warn interfaces

This compile-time check generates INTERFACE blocks for your procedures.  The -warn interfaces option then uses these compiler-generated interfaces to check that arguments and interfaces match between caller and callee.  Note that this check covers only Fortran source files; it will not check interfaces in a mixed-language program.

Possible Cause #6: Large Array Temporaries Caused by Passing Noncontiguous Array Sections.  Solution: detect with -check arg_temp_created and fix with a coding change that adds an explicit interface and assumed-shape arrays.

Consider this 'before' example:

!-- main program ---
real(8) :: f(1800,3600,1)
external sub
...
call sub( f(1:900,:,:) )
...
!-- end main program ---

and the "sub" subroutine is in a separately compiled source file:
!-- external subroutine "sub" ---
subroutine sub( f )
real(8) :: f(900,3600,1)
...
!-- end subroutine "sub" ---

In this case, "sub" is expecting a contiguous array of size 900x3600x1.  However, the call passes an array section that is not contiguous in memory.  In situations such as this, the compiler creates an array temporary at the call and copies the elements of "f" from their non-contiguous chunks into a contiguous array of the kind "sub" expects.  This temporary is allocated on the stack unless -heap-arrays is specified.

To check if this is occurring in your code, compile with

-check arg_temp_created

and run the program.  Messages will be written when argument temporaries are created.  To work around the issue, create an explicit interface and use an assumed-shape array in "sub".  This removes the need for an array temporary:

!-- main ---
real(8) :: f(1800,3600,1)
interface
subroutine sub(f)
real(8) :: f(:,:,:)
end subroutine sub
end interface
...
call sub( f(1:900,:,:) )
...
!-- end main program ---

!-- "sub" ---
subroutine sub( f )
real(8) :: f(:,:,:)
...
end subroutine sub

Keep in mind that although this avoids the array temporary, within "sub" the compiler is now aware that the array "f" may be non-contiguous.  As a result, some optimizations on statements using "f" may be disabled, which can affect performance.
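The diagnostic options named under the causes above can be collected into a single debug build. The sketch below only assembles and prints the command line; it runs the compiler only if it is found on the PATH and the source file exists. The function name debug_build_cmd and the file names myprog.f90 and myprog_debug are placeholders, not names from this article. It assumes a single source file whose name contains no spaces.

```shell
#!/bin/sh
# Assemble an ifort debug build command from the options in this article.
# FC defaults to ifort; source and output names are placeholders.
debug_build_cmd() {
    src=${1:-myprog.f90}
    out=${2:-myprog_debug}
    echo "${FC:-ifort} -g -traceback -check bounds -check arg_temp_created" \
         "-fp-stack-check -gen-interfaces -warn interfaces -o $out $src"
}

# Print the command, then run it only when compiler and source are present.
cmd=$(debug_build_cmd "$@")
echo "$cmd"
if command -v "${FC:-ifort}" >/dev/null 2>&1 && [ -f "${1:-myprog.f90}" ]; then
    $cmd   # relies on word splitting, so names must not contain spaces
fi
```

These checks slow the program down considerably, so use a build like this for diagnosis only and drop the -check options for production runs.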

Case NONE OF THE ABOVE:  Solution: More in-depth analysis is needed

99% of the SIGSEGV or SIGBUS error cases fall into the categories above.  However, there are other cases where segmentation faults can occur. 

If your application links in external libraries, make sure that each library is compatible with your compiler.  Was the external library compiled with the Intel Compiler?  If so, were the major versions the same?  A mismatch would be, for example, a library compiled with Intel Fortran 19.0 and an application built with Intel Fortran 17.x.  Intel only guarantees compatibility within major versions (19, 18 and 17 are examples of major versions).

If the external library is from a software vendor or tool:  does this vendor explicitly name the Intel Compiler as compatible, and if so, with which version(s) have they verified their library?  You should only use the version(s) of the Intel Compiler certified by your vendor.  If you need an older version of the Intel Compiler, please see How do I get an older version of an Intel® Software Development Product.

When all else fails ....
Post a note to the User Forum.  Please include the name of your application if it is a commonly available code, post a stack trace (if you can get one), compiler options used, and ideally a tarball of the entire application, input files and instructions on how to run the program.

If you have support for your product, you can open an issue.

For more background information, try the excellent Dr. Fortran Article "Don't Blow Your Stack!"

And read this PDF presentation.

