The Inner Product, December 2002
Jonathan Blow
Last updated 17 January 2004

Interactive Profiling, Part 1

Related articles:

Interactive Profiling, Part 2 (January 2003)
Interactive Profiling, Part 3 (February 2003)

Working with different development houses, I've seen that it's often extremely difficult to get anything done. Most developers are working at a low level of efficiency. Our programming languages, compilers, debuggers, source control systems, and profiling tools are all, in one way or another, bad. This costs us huge amounts of time and money. Today I'd like to attack one corner of this problem space, the inadequacy of our profiling tools.

For the practicing game developer, there are a handful of commercial profiling products to choose from; two representative examples are Intel's VTune and the CodeWarrior Hierarchical Profiler (HiProf). Despite being dedicated profilers, these products fall short in several ways; using them in a production environment tends to be painful. Usually a profiler will slow the game down tremendously, preventing us from reproducing the situations we want to profile. (VTune is not slow if you collect nothing but the program counter at each sample, but that information is of very limited use.)

Because these products don't satisfy our needs, just about every game gets rigged with a homemade profiler. We manually mark sections of the code with telltale names like "render_world" or "simulate_physics"; a heads-up display shows, in real time, how much CPU is being spent in each area.

Interactive Profiling as a Different Thing than Batch Profiling

There are some ways in which even the simplest in-game profiler is more useful than a tool like VTune or HiProf. With a realtime heads-up display, you can see correlations between the frame rate and the activity of each game subsystem, and use human pattern-matching and intelligence to figure out what's happening.

Suppose, in a map with a lot of objects and busy activity, your game's frame rate stutters annoyingly; say, roughly one frame out of every 30 takes three times as long as it should. If you try to diagnose the problem with VTune or HiProf, you'll meet with some difficulty. Since the long frames are intermixed with normal frames, the code causing the problem is only taking 1 2/3 times as much CPU as it should, when averaged over time; this probably won't be sufficient for that code to stand out as the culprit in the profiling report.

You might be able to track down the problem in VTune by enabling the "collect chronology data" option. You'll suffer through annoying slowdowns, and if the problem is timestep-dependent, it may not happen during profiling. To robustly solve this problem, you can build your game engine to support full journaling and playback; that's useful in its own right, but it's also a big engineering task that we'd rather not require just to obtain a profile. After all this, you may discover that the problem is caused by your AI pathfinding code. You may even know that a specific loop is eating up a lot of time. But the profiler can't tell you why.

On the other hand, with an in-engine interactive profiler you see the problem instantly, because you're watching a live report that changes from frame to frame. Every time the frame stutter happens, you'll see the program section "AI Pathfind" jump to the top of the chart. And because you're running around inside the game world, you may be able to easily correlate the spikes with specific problems -- for example, now that you know the problem is pathfinding, you can see that it happens whenever a monster walks near a certain kind of lamp. So you go investigate the geometry of the lamp to get to the heart of the matter. Because you can turn on the interactive profiler at any time, you don't have to strain to reproduce a test case; when it happens, bam, you hit the function key that turns on the profiler and shows you what's happening.

Batch profilers try to help you visualize data, with bar charts and pie charts and hierarchical diagrams. But your game renderer is already a rich visualization of your game's variables, a far more complex and meaningful one than any batch profiler will give you. Augmenting the world rendering with some supplementary data is usually the best visualization approach.

Basically, there are two things batch profilers are good for: (1) giving you a broad idea of your program's behavior averaged over time, and (2) giving you precise instruction-level performance data for small sections of code. (1) can be obtained just as well through a simple in-engine profiler, and (2) represents a small fraction of the profiling use cases of a modern game.

As an industry we frequently write our own in-engine profilers, but it's seldom articulated that interactive profiling is a different paradigm than batch profiling. The batch profilers available to us now are somewhat useful, but most of the time we really want a good interactive profiler. If the tool vendors understood this, they could do some neat stuff.

Because we as game developers are not in the business of selling tools, the profilers we write are usually minimal. I think it would be quite worthwhile to spend some up-front effort to build a high-quality interactive profiler that people can just plug into their games. So, in this article and the next, I will take steps toward that.

Goals of the Interactive Profiler

I want this profiler to be effortless to use. It should be so fast that you can always leave it enabled in every build of your game, only removing it when the time comes to ship the final product. Disabling profilers with #ifdefs in various builds is bad; suppose you see a problem and want to check it out, but whoops, your current build doesn't have profiling turned on, so you have to stop, change the #define, do a full rebuild which may take a while, then run the game and try to reproduce the case (which can be pretty hard sometimes). That's the kind of process impediment that stifles game development. By itself it doesn't seem like much, but we build up these situations where many impediments affect our daily progress, and it becomes a chore to do anything.

Using the profiler, I'd like it to be easy to detect the kinds of performance problems we care about. I want the profiler's behavior to be reliable; for example, it should not change in character based on the frame rate. And I think it would be really neat if it provided some tools that help us jump to more of a meta-level and see the different kinds of behaviors our games exhibit, and how common each behavior is within the overall gamut.

Profiler Overview

I'll create a small header file that should be included in all code that wants to be profiled. The header contains macros for declaring different zones of the program (rendering, input, physics, etc) and for stating that code execution is entering or leaving a given zone.

When we enter a zone, we'll ask a high-resolution timer what time it is, and record that. Upon leaving the zone, we subtract that entry time from the current time, which tells us how long we were inside. We add this to a total-time accumulator for that zone. Once per frame, we draw the accumulated results and reset them.
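That accounting can be sketched in a few lines. Here get_ticks() is a fake clock standing in for the high-resolution timer discussed later, and all names are illustrative rather than taken from the actual sample code:

```cpp
#include <cstdint>

// Fake clock so the sketch is self-contained; the real version reads a
// high-resolution hardware timer.
static int64_t g_now = 0;
int64_t get_ticks() { return g_now; }

struct Zone {
    const char *name = "";
    int64_t total_ticks = 0;   // time accumulated this frame
    int64_t t_start = 0;       // timestamp recorded on entry
};

void enter_zone(Zone *z) {
    z->t_start = get_ticks();                     // record the entry time
}

void leave_zone(Zone *z) {
    z->total_ticks += get_ticks() - z->t_start;   // add elapsed time inside
}

void end_of_frame(Zone *z) {
    // ...display z->total_ticks on the heads-up readout here...
    z->total_ticks = 0;                           // reset the accumulator
}
```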

With a little bit of extra accounting, we can separately track what HiProf refers to as the "hierarchical time", which is the time spent inside a zone and all zones that it calls; and "self time", which only counts time within a specific zone body and stops when a new zone is entered.
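One way to sketch that extra accounting is a zone stack that pauses the parent's self-timer while a child zone runs. The names here are hypothetical; the real bookkeeping lives in the Program_Zone members elided from Listing 1:

```cpp
#include <cstdint>

static int64_t g_clock = 0;                 // fake clock for the sketch
int64_t read_clock() { return g_clock; }

struct Timed_Zone {
    int64_t self_ticks = 0;                 // time in this zone's own body
    int64_t hier_ticks = 0;                 // time in this zone plus children
    int64_t t_self = 0, t_hier = 0;
};

const int MAX_DEPTH = 64;
static Timed_Zone *zone_stack[MAX_DEPTH];
static int zone_depth = 0;

void enter(Timed_Zone *z) {
    int64_t t = read_clock();
    if (zone_depth > 0) {                   // pause the parent's self-timer
        Timed_Zone *p = zone_stack[zone_depth - 1];
        p->self_ticks += t - p->t_self;
    }
    z->t_self = z->t_hier = t;
    zone_stack[zone_depth++] = z;
}

void leave() {
    int64_t t = read_clock();
    Timed_Zone *z = zone_stack[--zone_depth];
    z->self_ticks += t - z->t_self;         // close out self time
    z->hier_ticks += t - z->t_hier;         // hierarchical time spans children
    if (zone_depth > 0)                     // resume the parent's self-timer
        zone_stack[zone_depth - 1]->t_self = t;
}
```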

Zone Declaration

I chose to declare zones by instantiating structs in the global namespace. Each struct contains the variables used to store the raw timing data for that zone. For convenience I created a macro called Define_Zone that declares the appropriate instance (see Listing 1). When entering or leaving a zone, we refer to the name of that global zone variable; all references to zones are resolved at link time, so no CPU is spent at runtime matching zone IDs. Also, you get a compile-time error if you try to use a zone that isn't declared, which is good for catching spelling mistakes.

There are two alternatives for declaring zones that people often use, but they both have big disadvantages. One method is to create a global header file with enumerated constants for each zone; these constants index into an array. With this approach, adding a new zone requires a full rebuild of the project, and full rebuilds are horrible for many code bases. Also, when tracking down a performance problem, we often want to incrementally create temporary zones, which is painful with this scheme. The struct-based scheme presented here is totally confined to the files being worked on, requiring minimal recompilation.

The other alternative is to use strings to designate zone names, but this incurs a lot of runtime overhead, and spelling errors can go unnoticed for a long time.

Listing 1:

struct Program_Zone {
    Program_Zone(char *name);

    char *name;
    int index;

    Profiling_Int64 total_self_ticks;
    Profiling_Int64 t_self_start;

    .... // Other members deleted for brevity
};

#define Define_Zone(name) Program_Zone PZONE_ ## name(#name);
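In miniature, the token-pasting mechanics work like this (a trimmed-down Program_Zone, just enough to show what Define_Zone expands to):

```cpp
#include <cstring>

// Trimmed-down version of Listing 1, just enough to show the expansion.
struct Program_Zone {
    Program_Zone(const char *name) : name(name) {}
    const char *name;
};

#define Define_Zone(name) Program_Zone PZONE_ ## name(#name);

// Expands to: Program_Zone PZONE_render_world("render_world");
Define_Zone(render_world)

// Code that enters this zone refers to PZONE_render_world directly, so a
// misspelled zone name is a link error rather than a silent runtime miss.
```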

Entering and Leaving Zones

It is easy to define macros to indicate that you are entering or leaving a given zone. When entering, the zone is pushed onto a stack; when leaving, it's popped. However, I don't like using these macros directly; it is too easy to forget an exit declaration in functions that have a lot of exit points (Listing 2). I prefer to make a C++ dummy class that contains no local data, but that calls Enter_Zone on construction, and Leave_Zone on destruction. Then I can instantiate that class whenever I want to enter a zone, and the leaving will occur automatically.


Listing 2:

bool foo(char *s) {
    Enter_Zone(foo);

    if (s == NULL) {
        Leave_Zone(foo);    // easy to forget one of these...
        return false;
    }

    char *t;
    // Do something complicated with s,
    // putting the result into t...
    t = result;

    if (!well_formed(t)) {
        Leave_Zone(foo);    // ...at every exit point
        return false;
    }

    Leave_Zone(foo);
    return true;
}

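The dummy-class idea can be sketched like this. Profile_Scope is an illustrative name, and the depth counter exists only so the pairing can be checked:

```cpp
// Stand-ins for the real enter/leave macros; the depth counter just lets
// us verify that every enter is matched by a leave.
static int g_zone_depth = 0;
void Enter_Zone() { ++g_zone_depth; }
void Leave_Zone() { --g_zone_depth; }

// No data members; construction enters the zone, destruction leaves it,
// so every return path out of the enclosing scope leaves automatically.
struct Profile_Scope {
    Profile_Scope()  { Enter_Zone(); }
    ~Profile_Scope() { Leave_Zone(); }
};

bool foo(const char *s) {
    Profile_Scope scope;             // one line replaces all the manual exits
    if (s == nullptr) return false;  // the destructor fires here too
    return true;
}
```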
Getting the Time

We want to acquire an accurate timestamp as quickly as possible. We could use the Windows API function QueryPerformanceCounter ("QPC") for this, but QPC is very slow. In Windows 2000/XP it's a wrapper function in a DLL, which calls a different wrapper function in a different DLL, which triggers a kernel trap to a routine that reads the PCI bus clock. If you want to be saved from the agony of including windows.h in every file of your game, you'll probably wrap QPC yet again. Add all that up, and it's just not lightweight.

Instead I wrote some inline assembly language that uses the rdtsc instruction, which has been available on all new x86 chips for years now. This gives me integer timestamps in units of some arbitrary chip frequency, but I want time measured in seconds. For maximum speed, I perform all the timestamp accounting directly on these integer timestamps, and only convert to seconds once per frame, when it's time to process and display the profiling data. To perform the conversion, I use QPC at the beginning and end of the frame to get a frame duration in seconds; by measuring how many rdtsc ticks have passed during the same interval, I can convert from rdtsc ticks to seconds.
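A sketch of that once-per-frame calibration follows; the two timestamp pairs would come from rdtsc and QPC respectively, and the function names are illustrative. (On current compilers the rdtsc read itself is available as the __rdtsc intrinsic rather than hand-written assembly.)

```cpp
#include <cstdint>

// Once per frame, measure the same interval with both clocks and derive a
// ticks-to-seconds scale; per-zone totals, kept in raw ticks, are then
// multiplied by this scale for display.
double seconds_per_tick(int64_t ticks_begin, int64_t ticks_end,
                        double qpc_begin, double qpc_end) {
    return (qpc_end - qpc_begin) / (double)(ticks_end - ticks_begin);
}

double ticks_to_seconds(int64_t ticks, double scale) {
    return (double)ticks * scale;
}
```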

Unfortunately, newer mobile processors have a feature called SpeedStep, wherein they dynamically adjust their clock speed based on demand. Even though rdtsc's name ("Read Time Stamp Counter") indicates that it was intended as a timer, someone at Intel decided that it's okay for rdtsc to dumbly report clock cycles without compensating for SpeedStep. As a result, if you're running on a laptop and your app has bottlenecks that involve blocking on I/O or accidentally sleeping, rdtsc is unreliable as a measurement of time. Fortunately, the profiler can detect this situation and report that SpeedStep is causing a problem (it just looks for fluctuation in the number of rdtsc ticks per QPC second). However, there is no way to fix the profiling data itself. If this happens to you, I recommend using a desktop machine for profiling.
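The detection itself is simple; comparing the calibration ratio between frames is enough. (The 2% tolerance here is an illustrative choice, not a number from the article.)

```cpp
#include <cmath>

// If rdtsc ticks per QPC second drifts noticeably between frames, the CPU
// clock rate is shifting (SpeedStep) and the tick data can't be trusted.
bool clock_rate_unstable(double prev_ticks_per_second,
                         double cur_ticks_per_second,
                         double tolerance = 0.02) {
    double ratio = cur_ticks_per_second / prev_ticks_per_second;
    return std::fabs(ratio - 1.0) > tolerance;
}
```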

Displaying the Data

We draw the profiling data with one zone per line, sorting by CPU usage so that the most expensive zones are at the top, as shown in Figure 1. But since timings in a game fluctuate a lot, if we do this naively, lines of output will move around haphazardly and the display will be hard to read.

This is a good application for the frame-rate-independent IIR filters I introduced two months ago (see "Toward Better Scripting, Part I"). We apply these once per frame to the timing values; the resulting numbers are more stable and the display is easier to read. By varying the filter coefficients, we can decide whether we want to see long-term averages or very recent values.

But we don't just want to know about frame-time hogs. Consistency of frame rate is very important, so we should also seek out routines that are inconsistent in the amount of time they take. In the scripting article, I showed how to compute the variance of a filtered variable. We square-root this to get the standard deviation (stdev), which is measured in seconds. Now the profiler can display and sort by self-stdev and hierarchical-stdev. In addition, we color-code each line of output depending on how unpredictable the zone's timing is; steady zones are drawn in a cool color, and vibrating zones are drawn in a hot color.
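The smoothing and variance tracking can be sketched with a plain exponential moving average; the frame-rate-independent coefficient derivation is in the scripting article, and k here is just an illustrative history weight:

```cpp
#include <cmath>

// Per-zone filtered statistics, updated once per frame with that frame's
// timing sample. k in (0,1) is the weight given to history.
struct Filtered_Time {
    double mean = 0.0;
    double variance = 0.0;

    void update(double sample, double k) {
        mean = k * mean + (1.0 - k) * sample;
        double d = sample - mean;                    // deviation from average
        variance = k * variance + (1.0 - k) * d * d;
    }

    double stdev() const { return std::sqrt(variance); }  // in seconds
};
```

A steady zone's stdev decays toward zero, while a zone whose time bounces around keeps a large stdev, which is what drives the sorting and color-coding.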

Sample Code, and Next Time

This month's sample code is a toy game system with the profiler built in. There's a simple world full of crates. Besides rendering, the game has dummy functions for AI, audio, etc. To ensure that the different systems run slowly enough for the profile to be interesting across a variety of machines, they artificially slow themselves down. Some functions, like the AI code, fluctuate their CPU usage so that the profiler can detect that variance.

The profiler itself is very modular; you should be able to use it in your own game with minimal effort. We've only just gotten started exploring what we can do with an interactive profiler; next month we'll look at some data processing and visualization enhancements.