TECH : Supporting A Clang Compiler Optimizer Crash

ARTICLE BY: Dan Parry

TAGS: Clang, Developer Services, Toolchain

Split between Bristol (UK) and Campbell (USA), the SN Systems Developer Services team support PlayStation® game developers globally. Due to the variety of tools SN Systems develop and support, the work of a Developer Support Engineer varies from day to day. The work ranges from high-level investigations into simple cosmetic UI issues to low-level investigations of the instructions generated by the CPU Toolchain, checking them for errors or performance issues.

It requires a range of skills to handle PlayStation® support effectively, including the ability to think critically and conduct investigations in a logical and methodical manner. The following case study is designed to provide an insight into a typical day’s work in the Developer Services team.
During the development of their game, a game developer may encounter a crash when the compiler is compiling their code, or a runtime issue that only manifests when optimizations are enabled.
In this case study, we will look at the steps that a Developer Support Engineer might perform to investigate a crash caused by the compiler’s optimizer.

Clang is the C++ compiler used in the PlayStation®4 CPU Toolchain and will be the subject of this case study. The Clang project provides a C/C++ language frontend for the LLVM project.

The Investigation 

Reproducing the crash 

When triaging Clang issues, the most important first step is to reproduce the issue, so that it can be investigated further and a workaround or fix found.

The game developer who encountered the Clang compile time crash reported the following error:

Wrote crash dump file "C:\Users\<user>\AppData\Local\Temp\clang.exe-efbebb.dmp"

0x00007FF6965680C6 (0x0000020A469F7DE8 0x0000020A469F7DE8 0x0000000000000000 0x000000313DF8D1D0)

0x00007FFAA27DCB31 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)

clang.exe : error : clang frontend command failed due to signal (use -v to see invocation)

clang version 5.0.1

clang.exe: note: diagnostic msg: PLEASE submit a bug report and include the crash backtrace, preprocessed source, and associated run script.

clang.exe: note: diagnostic msg:

********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:

Preprocessed source(s) and associated run script(s) are located at:

clang.exe: note: diagnostic msg: C:\Users\<user>\AppData\Local\Temp\Crash_Example-f4571c.cpp

clang.exe: note: diagnostic msg: C:\Users\<user>\AppData\Local\Temp\Crash_Example-f4571c.sh

The developer very helpfully included the Crash_Example-f4571c.cpp and Crash_Example-f4571c.sh files in the bug report as requested by the crash information. These files will be renamed to Crash_Example.cpp and Crash_Example.sh respectively for simplicity. The developer also noticed that this crash only happens when optimization (using the -O3 compiler option) is enabled.

The Crash_Example.cpp file is a partially pre-processed source file in which every file pulled in with the #include directive has been expanded into a single source file. To compile this file and reproduce the crash, the compiler command line must match the one that the developer originally used to trigger the crash. The original command line is recorded in the Crash_Example.sh file and is easily extracted:

# Crash reproducer for clang version 5.0.1

# Driver args: "-o" "Crash_Example.obj" "-c" "-MD" "-MV" "-fdiagnostics-format=msvc" <etc…>

# Original command:  "clang.exe" "-cc1" "-triple" "-emit-obj" "-disable-free" "-disable-llvm-verifier" "-discard-value-names" "-x" "c++" "-O3" "Crash_Example.cpp" <etc…>

The arguments that follow "clang.exe" on this command line can now be added to a response file (called Response_File.txt) that can be used to compile the partially pre-processed Crash_Example.cpp file and reproduce the crash, like so:

Contents of Response_File.txt file:

"-cc1" "-triple" "-emit-obj" "-disable-free" "-disable-llvm-verifier" "-discard-value-names" "-x" "c++" "-O3" "Crash_Example.cpp" <etc…>

The Clang crash can now be reproduced by passing the response file to the compiler with the following command line:

>clang.exe @Response_File.txt

Wrote crash dump file "C:\Users\<user>\AppData\Local\Temp\clang.exe-597708.dmp"

0x00007FF6CC1180C6 (0x00000132A405B738 0x00000132A405B738 0x0000000000000000 0x000000054318D030)

0x00007FFD92720C31 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)
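
The extraction of the response file contents from Crash_Example.sh described above could also be scripted. The following Python sketch is illustrative only: the function names are hypothetical, the sample text is an abbreviated stand-in for the real script, and it assumes that every argument on the "# Original command:" line is wrapped in plain double quotes with no embedded quotes.

```python
import re

def response_args_from_reproducer(script_text):
    """Return the arguments that follow the compiler name on the
    '# Original command:' line of a crash reproducer script."""
    for line in script_text.splitlines():
        if line.startswith("# Original command:"):
            # Assumption: every argument is enclosed in plain double quotes.
            tokens = re.findall(r'"([^"]*)"', line)
            # tokens[0] is the compiler executable; the rest belong in the response file.
            return tokens[1:]
    raise ValueError("no '# Original command:' line found")

def to_response_file_text(args):
    """Quote each argument again and join them on a single line."""
    return " ".join('"{}"'.format(arg) for arg in args)

# Abbreviated, hypothetical sample; a real reproducer carries many more arguments.
sample = (
    '# Crash reproducer for clang version 5.0.1\n'
    '# Original command:  "clang.exe" "-cc1" "-emit-obj" "-O3" "Crash_Example.cpp"\n'
)

args = response_args_from_reproducer(sample)
print(to_response_file_text(args))
```

Writing the returned text to Response_File.txt would then give a file in the same shape as the one shown above.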

 

Finding a workaround

Now that the crash has been reproduced, the next step in triaging the problem is to determine the exact optimization that causes the crash and see if it reveals any clues about a possible workaround.

This can be done by using Clang’s -opt-bisect-limit option, which is passed through to LLVM with -mllvm. Each individual optimization performed by the compiler has a number associated with it, and this number is incremented with each additional optimization performed. The -opt-bisect-limit option allows the Developer Support Engineer to limit the number of optimizations performed by the compiler, and thus find the optimization that triggers the compiler crash. First, the Developer Support Engineer must verify that the crash does not occur when no optimizations are performed. This can be achieved with this command line:

>clang.exe @Response_File.txt -mllvm -opt-bisect-limit=0

BISECT: NOT running pass (1) Simplify the CFG on function (__cxx_global_var_init)

BISECT: NOT running pass (3215) X86 LEA Fixup on function (_GLOBAL__sub_I_Crash_Example.cpp)
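
When the output runs to thousands of lines, the highest optimization number can be picked out with a small script rather than by eye. The sketch below is a hypothetical helper, not part of the toolchain; it assumes the BISECT lines have the shape shown in the output above and have been captured as text.

```python
import re

# Matches lines such as
# "BISECT: NOT running pass (3215) X86 LEA Fixup on function (...)".
BISECT_LINE = re.compile(r"^BISECT: (NOT )?running pass \((\d+)\) (.*)$")

def highest_pass_number(compiler_output):
    """Return the largest pass number found in the BISECT lines, or None."""
    numbers = []
    for line in compiler_output.splitlines():
        match = BISECT_LINE.match(line.strip())
        if match:
            numbers.append(int(match.group(2)))
    return max(numbers) if numbers else None

# The two lines quoted in the abbreviated output above.
sample_output = """\
BISECT: NOT running pass (1) Simplify the CFG on function (__cxx_global_var_init)
BISECT: NOT running pass (3215) X86 LEA Fixup on function (_GLOBAL__sub_I_Crash_Example.cpp)
"""

print(highest_pass_number(sample_output))  # 3215
```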

 

The next step is to raise the limit to the highest optimization number reported by the previous run of the compiler (3215) and verify that the crash still occurs:

>clang.exe @Response_File.txt -mllvm -opt-bisect-limit=3215

BISECT: running pass (1) Simplify the CFG on function (__cxx_global_var_init)

BISECT: running pass (1366) Loop Vectorization on function (<function name>)

Wrote crash dump file "C:\Users\<user>\AppData\Local\Temp\clang.exe-7c5dc6.dmp"

0x00007FF6CC1180C6 (0x000002A3D00BD848 0x000002A3D00BD848 0x0000000000000000 0x000000BD5578D320)

0x00007FFD92720C31 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)

 

The compiler is now crashing and the last optimization to be performed before the crash is:

BISECT: running pass (1366) Loop Vectorization on function (<function name>)
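
Reading the last line before the crash is enough here. The same answer can also be reached by a binary search over the limit, which is useful when the output is not so easy to interpret. The following Python sketch shows the idea under stated assumptions: first_crashing_limit and clang_crashes are hypothetical names, the crash is assumed to persist for every limit above the first one that triggers it, and a non-zero exit code is assumed to indicate the crash.

```python
import subprocess

def first_crashing_limit(max_limit, crashes_at):
    """Return the smallest limit in 1..max_limit for which crashes_at(limit) is True.

    Assumes crashes_at(0) is False, crashes_at(max_limit) is True, and that
    the crash persists for every limit above the first crashing one.
    """
    low, high = 0, max_limit  # invariant: low does not crash, high crashes
    while high - low > 1:
        mid = (low + high) // 2
        if crashes_at(mid):
            high = mid
        else:
            low = mid
    return high

def clang_crashes(limit):
    """Hypothetical predicate: run the reproducer with the given bisect limit.

    Assumption: any non-zero exit code is treated as the crash.
    """
    command = ["clang.exe", "@Response_File.txt",
               "-mllvm", "-opt-bisect-limit={}".format(limit)]
    result = subprocess.run(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode != 0

# Stand-in predicate that mimics this case study: the crash appears at pass 1366.
simulated = lambda limit: limit >= 1366
print(first_crashing_limit(3215, simulated))  # 1366
```

With the real compiler, first_crashing_limit(3215, clang_crashes) would need about twelve compiler runs to cover all 3215 optimizations.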

 

The Developer Support Engineer should now try disabling the loop vectorizer by removing the -vectorize-loops compiler option from the Response_File.txt file, and then running the compiler without any bisection applied:

>clang.exe @Response_File.txt

 

The compiler no longer crashes, and with a successful test, the Developer Support Engineer can recommend that the developer disables loop vectorization as a workaround until the Clang development team fix the issue and release a hot fix. 
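
The edit to Response_File.txt described above is simple to make by hand, but it could be scripted if several reproducers need the same treatment. This is a minimal sketch with a hypothetical helper name; it assumes every argument in the response file is wrapped in plain double quotes and that the option being removed takes no separate value. The sample line is an abbreviated stand-in for the real file.

```python
import re

def drop_option(response_text, option):
    """Return the response file text with every occurrence of option removed."""
    # Assumption: arguments are double-quoted and contain no embedded quotes.
    tokens = re.findall(r'"([^"]*)"', response_text)
    kept = [token for token in tokens if token != option]
    return " ".join('"{}"'.format(token) for token in kept)

# Abbreviated, hypothetical response file contents.
original = '"-cc1" "-emit-obj" "-vectorize-loops" "-O3" "Crash_Example.cpp"'
print(drop_option(original, "-vectorize-loops"))
```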

What if this was a runtime issue? 

Clang’s -opt-bisect-limit option can also be used to find the exact optimization that causes a runtime glitch, crash, or performance issue in a game. However, the investigation is more laborious, and requires a runnable example or the source/data files for the whole game.

The Developer Support Engineer must first narrow down the investigation to the source file(s) that caused the issue by disabling optimizations on groups of source files until the game runs correctly. If disabling optimization on a group of files fixes the problem, then the group must be subdivided into smaller groups and the process repeated until a minimal set of one or more problematic source files is identified.
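
The subdivision described above amounts to a halving search over the list of source files. The Python sketch below illustrates the simplest case and rests on two assumptions that may not hold in practice: exactly one source file is responsible, and disabling optimization on any group that contains it fixes the problem. The function name, file names and predicate are all hypothetical; in a real investigation the predicate would mean rebuilding with that group unoptimized and running the game.

```python
def find_problem_file(files, fixed_when_unoptimized):
    """Return the single file whose optimization triggers the runtime issue.

    fixed_when_unoptimized(group) must return True when building the files in
    group without optimization makes the issue disappear.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fixed_when_unoptimized(half):
            # The culprit is in the half that was built without optimization.
            candidates = half
        else:
            # The culprit must be in the remaining files.
            candidates = candidates[len(half):]
    return candidates[0]

# Hypothetical file list and a stand-in for "rebuild and run the game".
sources = ["a.cpp", "b.cpp", "c.cpp", "d.cpp", "e.cpp", "f.cpp", "g.cpp"]
culprit_in_group = lambda group: "e.cpp" in group

print(find_problem_file(sources, culprit_in_group))  # e.cpp
```

If more than one file is involved, the search has to keep track of several groups at once, as the paragraph above implies with "one or more".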

The Developer Support Engineer can then use the -opt-bisect-limit option on a problematic source file to narrow down the problem to a specific optimization number by a process of elimination. The game must be run each time a new value for the -opt-bisect-limit option is tried when the problematic source file is compiled. Eventually, it will be determined that a specific optimization leads to the runtime issue, and this gives a clue about a potential workaround.

 
