The Inner Product, February 2003
Jonathan Blow (jon@number-none.com)
Last updated 17 January 2004
Interactive Profiling, Part 3: Behavior Clustering
Related articles:
Interactive Profiling, Part 1
(December 2002)
Interactive Profiling, Part 2 (January 2003)
Note: This article is currently missing its figures.
We will attempt to remedy this situation in the future.
For the past two months I've been talking about profiling. I started out by
decrying the lack of good profiling tools for games. I talked about the
difference between a "batch profiler", like VTune or the CodeWarrior profiler,
and an "interactive profiler", like the ones we usually build into our games. I
then proceeded to make a big wish list of features that a good interactive
profiler might provide.
Chief among those features was the ability to abstract the program's timing
measurements into higher-level "behaviors", and to do some analysis regarding
the frequency and consistency of these behaviors. The point is that modern games
don't exhibit one consistent pattern of resource usage. At one time they might
be fill-rate limited by the graphics card; at another they'll be slowed down by
an abundance of pathfinding and physics. A tool that makes these patterns clear
would be valuable.
Finding a Clustering Method
Let's look at Table 1, which we first saw in last month's column. Here we see
three behaviors: behavior A is rendering-heavy, B is physics-heavy, and C is
AI-heavy. Imagine these numbers came from a profiling run of a game in progress;
as each frame is processed, we get a new set of timing numbers. Now suppose that
the next frame is measured, and we get the following numbers: "rendering 72%,
physics_sim 17%, ai_pathfind 11%". Those numbers are very much like behavior A.
Even though the numbers do not exactly match, we would expect the profiling tool
to classify this frame as having behavior A.
Table 1:
Code Module  | Behavior A | Behavior B | Behavior C
rendering    |        75% |        25% |        25%
physics_sim  |        15% |        65% |        15%
ai_pathfind  |        10% |        10% |        60%
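To make this concrete, here is a minimal sketch of that kind of classification: each behavior is a vector of timing percentages, and an incoming frame is assigned to whichever behavior vector is nearest. The names and structure here are mine, for illustration only; this is not the column's sample code.

```python
import math

# Predefined behaviors from Table 1, as fractions of frame time.
# Component order: (rendering, physics_sim, ai_pathfind).
BEHAVIORS = {
    "A": (0.75, 0.15, 0.10),
    "B": (0.25, 0.65, 0.10),
    "C": (0.25, 0.15, 0.60),
}

def classify(frame):
    """Return the name of the behavior whose timing vector is nearest
    (in Euclidean distance) to this frame's measurements."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(BEHAVIORS, key=lambda name: dist(frame, BEHAVIORS[name]))

# The frame from the text: rendering 72%, physics 17%, pathfinding 11%.
print(classify((0.72, 0.17, 0.11)))  # -> A
```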
That's not hard to implement when you have a predefined list of behaviors like
Table 1; it's harder when you start from scratch with no preconceived notions of
program behavior. In general we are faced with a clustering problem: given a
bunch of vectors containing timing data, we want to arrange the vectors into
groups, where the centroid of each group can be used as a summary.
Clustering is a common task in computer science and engineering applications, so
many algorithms have been developed, for example "k-means clustering". Here's
how k-means works: first, randomly partition your input into a set of initial
clusters. Find the centroid of each cluster. These clusters obviously won't be
perfect yet. To fix that, you iterate over each input vector, and re-assign it
to the cluster whose centroid is nearest. Now recompute the centroids. Repeat
this reassignment process until the results converge, and you perform an entire
pass without changing anything.
The name "k-means" indicates that there are some number of clusters (k of them),
and that we are representing each cluster by its center (by the mean of the
input values).
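A compact sketch of batch k-means as just described: random initial partition, then alternate recomputing centroids and reassigning points until a full pass changes nothing. The empty-cluster reseeding is a guard of my own, not part of the textbook description.

```python
import random

def kmeans(points, k, seed=0):
    """Batch k-means: random initial partition, then repeat
    recompute-centroids / reassign-points until nothing changes."""
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Randomly partition the input into k initial clusters.
    assignment = [rng.randrange(k) for _ in points]
    centers = [None] * k
    while True:
        # Centroid of each cluster; reseed an empty cluster to a
        # random input point (my own guard against degenerate cases).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            centers[c] = (tuple(sum(xs) / len(members) for xs in zip(*members))
                          if members else points[rng.randrange(len(points))])
        # Reassign each point to the nearest centroid.
        new = [min(range(k), key=lambda c: sqdist(p, centers[c]))
               for p in points]
        if new == assignment:  # converged: a full pass changed nothing
            return centers, assignment
        assignment = new
```

Note that this needs the entire `points` array up front, and visits it repeatedly; that is exactly the batch property the next section complains about.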
Unfortunately, k-means and most other clustering methods are batch algorithms.
They require you to have all the vectors available at once, and to access those
vectors randomly. Since I want an interactive profiler, which can classify
behaviors "on the fly" during a single program run, these algorithms are
unsuitable.
Non-Batch Clustering
There's a special term for clustering algorithms that can handle streaming
input: they are called "online clustering algorithms". For example, there's an
"online k-means".
Online k-means works like this: start with k cluster centers arbitrarily
distributed through space. For each input vector you want to classify, find the
closest cluster point, classify the vector into that cluster, and move the
cluster point closer to that vector. (See "Radial Basis Function Learning" in
For More Information.)
Online k-means really is an incremental version of batch k-means; but making the
algorithm incremental is a big change, so the explanations of the two versions
sound fairly different.
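One update step of online k-means fits in a few lines. This is a generic sketch of the procedure described above (the learning rate of 0.05 is an arbitrary choice of mine, not a value from the article):

```python
def online_kmeans_step(centers, x, rate=0.05):
    """One step of online k-means: find the center nearest to the
    input vector x, move that center a fraction `rate` of the way
    toward x, and return the index of the winning cluster."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    win = min(range(len(centers)), key=lambda i: sqdist(centers[i], x))
    centers[win] = tuple(c + rate * (v - c)
                         for c, v in zip(centers[win], x))
    return win
```

Each input vector is seen once and discarded, which is what makes the algorithm suitable for a live profiler.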
How do you decide how many clusters are necessary to accurately represent your
data stream? There are some heuristics that involve removing cluster points when
you have too many (some of them are lying idle, never attracting input) and
adding cluster points when you have too few (when the radius of space attributed
to a particular cluster point is too large). Unfortunately, these heuristics all
require some a priori knowledge of the data: you need to know the minimum
number of samples a point must attract before it represents a valid cluster, or
how spatially large a good cluster is likely to be. In other words, you need a
good guess about the duration and scale of your data.
But we don't want to have to guess at that stuff in order to get a good profile.
Really we want the computer to figure out appropriate values from context. And
those values may need to change. If we are running our game for 5 minutes and
see some frames that spend 70% rendering, 75% rendering, or 80% rendering, we
probably want to draw distinctions between these and call them three different
behaviors. But if we then walk into another room in the game and start seeing a
lot of behaviors like B and C from Table 1, perhaps all of the previous frames
should be categorized together as A.
Clearly, some kind of hierarchical clustering is necessary. Probably it's best
to let the user decide at which detail level to view the abstractions; we should
maintain a tree of behaviors from which the user selects a particular view.
Unfortunately, the typical methods, like "Hierarchical Agglomerative
Clustering", tend to be data-intensive batch processes.
Visualization
There's one other problem. In a full-sized game, there may be a few hundred
different sub-areas of the program for which we get profiling data. In
clustering terms, we're clustering data points that have a few hundred
dimensions each. Suppose we manage to do this, and we get a result with 15
different behavior clusters. Those clusters are still points in a
high-dimensional space. We can list them, like Table 1, but we'd like better
methods of visualization. Maybe we want to somehow project these clusters onto a
2D chart, and have the clusters that are most alike placed near each other on
that chart. Maybe the two dimensions of the chart can even mean something,
related to the two biggest ways in which the behaviors tend to vary.
There are some well-known methods for projecting multidimensional data onto a
2-plane. These methods tend to fall into two categories: (1) ad-hoc, producing
poor results, and (2) complicated, slow, and scary.
There's another alternative, though: a hybrid kind of algorithm that clusters
and projects simultaneously, and does a good job of both.
The Self-Organizing Map
In this month's sample code, to perform the classification of CPU behavior, I
used this hybrid algorithm, the Kohonen Self-Organizing Map (abbreviated SOM).
The SOM is an array of points in the input space; all the points are initialized
to random values. For each incoming frame of profiling data, we find the closest
point in the SOM. We then move that closest-match point toward the position
represented by the incoming frame, just like online k-means. But the SOM does
something extra: in addition to giving the clusters positions in the input
space, it treats them as having positions in another space also. The clusters
have fixed positions along a regular grid, in what I will call "neighbor space".
Whenever you move a cluster center toward an input in input space, you also look
up its neighbors in neighbor space, and move those neighbors through input space
in the same direction, though not by as much.
This neighbor space is usually implemented just by storing the cluster data in a
2D array, and by iterating across nearby cells in that array. You can think of
this as applying a "move filter" to the array, where the filter kernel is high
in the center but tapers off with increasing radius (see Figure 1).
After many iterations of this, the clusters in input space will converge, just
like k-means. The size of the move filter shrinks over time, as the clusters
become settled, until it becomes a point. The "Self-Organizing" part of SOM
represents the idea that, because we applied that move filter in neighborhood
space, we were constantly influencing neighboring array cells to take on similar
values. So after we are done processing our inputs, we not only get cluster
centers, but the clusters are arranged in the 2D grid according to similarity.
So we can generate a 2D display where similar clusters are drawn near each
other, which can help us understand the data.
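A single SOM training step, sketched on a rectangular grid (the column notes a hexagonal tiling is usually better; a square grid keeps the sketch short). The linear taper and the fixed radius are illustrative choices of mine; a real SOM shrinks the radius over time, as discussed below.

```python
def som_step(grid, x, rate=0.1, radius=1):
    """One SOM training step. grid[r][c] is a cluster center (a tuple
    in input space); the (r, c) indices are its fixed position in
    neighbor space. Find the best-matching unit, then move it and its
    grid neighbors toward x, scaling the move down with grid distance
    (the "move filter")."""
    rows, cols = len(grid), len(grid[0])

    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Best-matching unit: the center nearest to x in input space.
    br, bc = min(((r, c) for r in range(rows) for c in range(cols)),
                 key=lambda rc: sqdist(grid[rc[0]][rc[1]], x))

    # Move the winner and its neighbors; the filter is highest at the
    # winner and tapers linearly to zero past `radius`.
    for r in range(rows):
        for c in range(cols):
            d = max(abs(r - br), abs(c - bc))  # Chebyshev grid distance
            if d > radius:
                continue
            w = rate * (1.0 - d / (radius + 1.0))
            grid[r][c] = tuple(g + w * (v - g)
                               for g, v in zip(grid[r][c], x))
    return br, bc
```

Because neighbors in the grid are repeatedly dragged toward similar inputs, adjacent cells end up holding similar cluster centers, which is what makes the final 2D display meaningful.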
This is a quick-and-dirty description of the SOM, with some simplifications. For
reasons of isotropy, as we saw when looking at networking ("Networking, Part
2: Transmitting Vectors", July 2002, Game Developer Magazine), it is usually
better to use a hexagonal tiling than a rectangular grid. Also, the SOM grid can
consist of any number of dimensions, but 2 dimensions is the most common choice,
since a map with more dimensions is harder to visualize. Finally, initialization
of the SOM is best done using some attributes of the training data, since a
random spread in a high-dimensional space is likely to produce points that are
not near anything in your data set; in this case, the only way array elements
become used is when they get dragged into use by a neighbor, so it takes a while
before the SOM is working at reasonable effectiveness.
For some references that explain self-organizing maps in greater detail, see For
More Information.
Adapting the SOM to Realtime Tasks
The SOM is a batch algorithm, for two reasons. To produce good results, the
algorithm wants a randomly-ordered sampling of the data. Otherwise, you can
imagine first presenting all your samples of behavior A; then you present an
equal number of samples of behavior B, which happen to land in a cluster
neighboring A, and the movement filter for B wipes out most of A. The second
reason is that the SOM algorithm wants to have a global notion of time; early in
training, the movement filter is very large, and over time it shrinks, until
toward the end of training it is nearly a point. But with live input data, we
don't know how long we will be running, and we may wish to run indefinitely.
So I made some modifications to the SOM algorithm to adapt it for realtime data.
These modifications are sort of questionable, since I have not analyzed them
deeply, and I can think of some ways they may be problematic. But to a first
approximation they work okay.
Shrinking The Filter
Rather than using a global time, I introduced the idea of "local
convergence", which is a value between 0 and 1. The filter size centered around
any given array cell depends on that cell's local convergence: a convergence of
0 means a big filter, 1 means a small filter.
Whenever a point in the SOM is generally staying in the same place, its
convergence value increases; if it starts getting yanked someplace else, its
convergence value decreases. I use the IIR filters described in "Toward Better
Scripting, Part I" (October 2002 issue of Game Developer) to track the trend of
a point's motion. So if many training steps confirm the positions of some region
of the map, that region will no longer feel the need to upset its neighbors. But
if an area of the map keeps being upset, its "local time" will go backward and
it will jostle its neighbors in a bid to create more space for the competing
values.
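The per-cell bookkeeping might look something like the following sketch. The constants, names, and the linear mapping from motion to convergence are all my own illustrative choices, not taken from the article's sample code; the one essential idea is the IIR (exponential moving average) filter tracking each cell's recent motion.

```python
class LocalConvergence:
    """Tracks how settled one SOM cell is, using a simple IIR filter
    over the cell's recent movement magnitude. Convergence runs from
    0 (being yanked around; use a big move filter) to 1 (settled;
    shrink the filter to a point)."""

    def __init__(self, smoothing=0.9, scale=5.0):
        self.smoothing = smoothing   # IIR coefficient: higher = slower trend
        self.scale = scale           # maps motion magnitude into [0, 1]
        self.avg_motion = 1.0        # start fully unconverged

    def update(self, move_distance):
        # IIR filter: blend the old average with the new sample.
        self.avg_motion = (self.smoothing * self.avg_motion
                           + (1.0 - self.smoothing) * move_distance)

    @property
    def convergence(self):
        # 0 while the cell moves a lot, approaching 1 as it settles.
        return max(0.0, 1.0 - self.scale * self.avg_motion)

    def filter_radius(self, max_radius=3.0):
        # Big filter at convergence 0, shrinking toward a point at 1.
        # A burst of large moves drives convergence back down, growing
        # the filter again: "local time" running backward.
        return max_radius * (1.0 - self.convergence)
```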
This "hardening of the map" does to some extent alleviate the "B wipes out A"
problem, but does not solve it. To do so, I would implement a hierarchical
series of maps, with maps that change quickly feeding data into maps that change
more slowly. Unfortunately I haven't had time to try this (see the note at the
end of the article) but plan to soon.
One further problem is that there's no way for the SOM to grow. To achieve
reasonable results, you need to construct it with enough initial nodes. But I
believe hierarchical maps would solve this also.
Initialization
I wanted a good way to initialize the SOM. As I mentioned earlier, random
initialization is not so good. The way it's usually done for batch data is to
perform principal component analysis on all the input, find the two
non-axis-aligned dimensions along which it has the greatest spread, and
initialize the SOM array entries to evenly cover that spread. Another way of
saying this is, you find the covariance of the data with respect to the mean,
then pick the two longest axes of the multidimensional ellipsoid; this is just
an n-dimensional version of what we did in "My Friend, The Covariance Body"
(September 2002 issue of Game Developer).
Since I don't want to wait for the whole batch of input data, I just collect 10
frames' worth of "warmup points" before the SOM kicks into training mode. Then,
because I didn't want to write n-dimensional matrix decomposition code, I
approximate the spread as follows: First I pick a random warmup point, then I
find the warmup point that is furthest from it. I find the vector between them,
and that is the longest axis of the spread. Then I remove that dimension from
the warmup data by subtracting each point's projection along that axis. Then I
choose another random point, find the warmup point furthest from that one, and
that is the second longest axis of the spread. I then initialize the SOM entries
to points evenly spaced across the plane defined by these two axes, centered on
the mean. If the axes are degenerate within some numerical threshold, then I
just seed the SOM with random offsets from the mean.
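Here is a sketch of that farthest-point approximation of the spread. It is my reconstruction of the procedure described above (the degenerate-axis fallback and the spacing of the actual SOM entries across the plane are omitted for brevity):

```python
import random

def init_axes(warmup, seed=0):
    """Approximate the two longest axes of the warmup data's spread:
    pick a random point, find the point farthest from it (first axis),
    subtract each point's projection along that axis, then repeat for
    the second axis. Returns (mean, axis1, axis2)."""
    rng = random.Random(seed)
    n, dim = len(warmup), len(warmup[0])
    mean = tuple(sum(p[d] for p in warmup) / n for d in range(dim))
    # Work with mean-centered points.
    pts = [tuple(p[d] - mean[d] for d in range(dim)) for p in warmup]

    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))

    def longest_axis(points):
        a = points[rng.randrange(len(points))]
        b = max(points, key=lambda p: dot(sub(p, a), sub(p, a)))
        return sub(b, a)

    axis1 = longest_axis(pts)
    # Remove axis1 from the data by subtracting each point's
    # projection onto it; what remains is orthogonal to axis1.
    d1 = dot(axis1, axis1)
    if d1 > 1e-12:
        pts = [sub(p, tuple(dot(p, axis1) / d1 * x for x in axis1))
               for p in pts]
    axis2 = longest_axis(pts)
    return mean, axis1, axis2
```

Because the projection removal leaves only components orthogonal to the first axis, the second axis comes out perpendicular to the first, as the plane-seeding step requires.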
Sample Code
This month's sample code is an updated version of last month's toy game world.
In addition to AI-heavy and entity-render-heavy behavior modes, there's now a
sun in the sky; when you look toward the sun, ridiculous amounts of lens flare
are drawn, bogging the system down that way. As you move around the game world,
you'll cause various behaviors that are mixtures of AI, entity, and lens flare.
Whenever mouselook is enabled so that you can move around, the game system takes
each frame's profiling data and classifies it using an SOM.
When mouselook is disabled and the SOM is paused, you can move the mouse pointer
over the SOM's display and look at the behaviors it has detected. There are a
number of visualizations enabled, which are explained in the README. These
visualizations aren't as intuitive as I'd like, but after a bit of exposure you
get used to them and can infer interesting relationships about the clusters by
looking at colored diagrams (Figure 2).
Future Work
So far this work has been pretty experimental, but I think it shows some good
promise. I am going to try some hierarchical versions of the SOM and online
k-means, and will report on their effectiveness in a future column. In the
meantime, next month we'll go back to a more conventional topic, like
cool-looking graphics.
References
"Radial Basis Function Learning", Paolo Palmeri,
http://miles.cnuce.cnr.it/~palmeri/datam/articles/okmc.ps.gz
"The Self-Organizing Map (SOM)", Teuvo Kohonen,
http://www.cis.hut.fi/projects/somtoolbox/theory/somalgorithm.shtml
"Enhancing SOM based data visualization", Juha Vesanto, Johan Himberg, et al.,
http://www.cis.hut.fi/projects/ide/publications/html/iizuka98