TECH : Measuring the User Debug Experience

Greg Bedwell

TAGS: Compiler, Debugger, LLVM, Toolchain

For a game to hit 30 frames-per-second, all of the frame compute has to fit into ~33.3 ms, at 60 frames-per-second it has to fit into ~16.7 ms.  Games are reliant on compiler optimizations to get near to these target frame times, without these optimizations the game may be unplayable. Unfortunately, compiler optimizations tend to interfere with the "debuggability" of the code.  This can be for legitimate reasons, i.e. some optimizations fundamentally changing the structure of the code and it may no longer resemble the original source code. In other cases, it can be down to the compiler not preserving debug information that could have been preserved. Game developers often need to play the fully optimized game in order to trigger the required condition to reproduce a reported bug. Making it easier to debug fully optimized code is a request that we hear regularly from game developers.  

At Sony Interactive Entertainment (SIE) we've introduced new features to Clang such as "__attribute__((optnone))" and "#pragma clang optimize", which allow developers to selectively disable compiler optimization on portions of their code. When the developer knows in advance the portions of their game they need to debug, they can disable optimizations for those portions without affecting the level of compiler optimization applied to the rest of the game codebase. This should provide a frame rate comparable to a fully optimized game, but with a debugging experience comparable to an unoptimized build. We've also researched new features such as variable lifetime extension. These features were highly requested and are now widely used. However, we've always known that this is more of a workaround than a real fix for the core issues. 

I'm a firm believer in the school of thought that to improve something, you first need to measure it. This leads us to the question of how to measure the user debugging experience. One simple answer might be to survey our users, and then again after we've made some changes. It's not a precise measure though, and the feedback loop is far too long for our requirements.  

Let’s start with the question, "what is the user debugging experience?". The user interacts with a debugger to step through the source code and inspect variables.  The compiler is responsible for generating the debug information to describe the program structure.  If that is incorrect, the debugger will be unable to give the information the user wants, resulting in a bad user experience. We can test and measure the individual aspects. For example, it stands to reason that if the debug information produced by the compiler contains more complete information about the source code, then the user will have a better debugging experience.  Indeed, as part of our efforts in this area we've contributed our tool "DIVA" (Debug Information Visual Analyzer) to allow easy visualization of DWARF debug information.  There's also really exciting new work, contributed by other members of the open-source LLVM community, for Testing Debug Info Preservation in Optimizations that we've been using to help find plenty of cases of preventable debug information loss during compiler optimization.

The hypothesis is that by improving the individual components, we can improve the overall experience. In order to prove this, we need to observe the system as a whole.  Testing the interactions between the compiler and debugger is not a new idea by any means.  The LLVM project has the "debuginfo-tests" suite for example.  This suite is tremendously useful as an integration test, but it doesn't provide the measurement aspect beyond pass/fail. 

This is where DExTer (Debugging Experience Tester) comes in.  DExTer is a new tool we've developed that aims to provide dynamic analysis and visualization of the user debugging experience, as well as the ability to assign a heuristic-derived score for its overall quality. DExTer defines this experience as a function of stepping order and variable visibility.  A DExTer score is in the range of 0 to 1 where 1.000 is the best possible score and 0.000 is the worst.  

We annotate short runnable tests with our expectations (Figure 1).  DExTer then drives the chosen debugger via the debugger's automation API, stepping through the code at source level repeatedly until the debugging session ends. If it steps to a line containing a DexWatch annotation, then it will ask the debugger to evaluate the expressions in the DexWatch parameter list. The information about each step is stored in a JSON-serializable format, DextIR, (Figure 2) for analysis and visualization. 

Image showing annotations for test 

The JSON-serializable format

A score is derived from the information in the DextIR, starting from 1.000 with penalties applied every time something unexpected is observed. The size of the penalty is dependent on the nature of the unexpected difference. For example, a less severe penalty is applied when the debugger reports that a variable has been optimized away compared to when the debugger is reporting that the expression cannot be evaluated at all. A more severe penalty is applied when the debugger successfully evaluates an expression but is reporting an unexpected value. The penalty multipliers for each condition can be tweaked via the DExTer command line. 

Running the current set of DExTer tests across a range of optimization levels with Clang and LLDB (both at trunk revision 340405) shows us how the user debugging experience is affected with different amounts of optimization.  For "-O0 -g", all but one test scores the maximum of 1.000.  As the optimization level increases, more and more tests start to get lower scores.  It should be explicitly noted at this point that a penalty being incurred is not an indication of a bug, or a failure to preserve debug information, but purely an indicator that the user debugging experience has changed somehow for the worse.  It's unrealistic to expect the same debug experience in fully optimized code compared to unoptimized code.  As previously noted, some optimizations necessarily trade debuggability for performance, for example, by scheduling instructions or by modifying the control flow so that the compiler generated code no longer resembles the original source code.  

DExTer scores

We can get a higher degree of granularity in our results by making use of LLVM's "-opt-bisect-limit" option.  DExTer has built-in support for this via its 'clang-opt-bisect' tool.  This tool repeatedly runs Clang, increasing the value specified for the "-mllvm -opt-bisect-limit" option and measuring the score at each optimization pass. Running this tool on the Fibonacci test shows a relatively small number of optimization passes having an effect on the score (Figure 4).  Using this information in conjunction with other techniques such as the previously mentioned "Testing Debug Info Preservation in Optimizations" technique it should be relatively easy to identify optimization passes where debug information is being lost unnecessarily. 

image showing effect of optimization passes on the score

This leaves the question of how to deal with optimization passes that harm the user debugging experience, but only as a result of a totally legitimate transformation.  The most likely possibility is to make use of Clang's
"-Og" optimization level.  The Clang documentation describes this optimization level as: 

    "-Og Like -O1. In future versions, this option might disable different optimizations in order to improve debuggability." 

By utilizing the information from DExTer on the effect of each individual optimization pass in conjunction with runtime performance benchmark data, it may be possible to tune the -Og optimization level to provide a good user debugging experience, while at the same time providing the runtime performance that our users require to hit their frame times for development builds. 

DExTer is implemented in Python (2.7 and 3.5+) and is available under the MIT license here


Back to top