**
The Inner Product, February 2003
**Jonathan Blow (jon@number-none.com)

Last updated 17 January 2004

**Interactive
Profiling, Part 3:
Behavior Clustering**

Related articles:

Interactive Profiling, Part 1
(December 2002)

Interactive Profiling, Part 2 (January 2003)

**
Note: This article is currently missing its figures.
We will attempt to remedy this situation in the future.**

For the past two months I've been talking about profiling. I started out by
decrying the lack of good profiling tools for games. I talked about the
difference between a "batch profiler", like VTune or the CodeWarrior profiler,
and an "interactive profiler", like the ones we usually build into our games. I
then proceeded to make a big wish list of features that a good interactive
profiler might provide.

Chief among those features was the ability to abstract the program's timing
measurements into higher-level "behaviors", and to do some analysis regarding
the frequency and consistency of these behaviors. The point is that modern games
don't exhibit one consistent pattern of resource usage. At one time they might
be fill-rate limited by the graphics card; at another they'll be slowed down by
an abundance of pathfinding and physics. A tool that makes these patterns clear
would be valuable.

**Finding a Clustering Method**

Let's look at Table 1, which we first saw in last month's column. Here we see
three behaviors: behavior A is rendering-heavy, B is physics-heavy, and C is
AI-heavy. Imagine these numbers came from a profiling run of a game in progress;
as each frame is processed, we get a new set of timing numbers. Now suppose that
the next frame is measured, and we get the following numbers: "rendering 72%,
physics_sim 17%, ai_pathfind 11%". Those numbers are very much like behavior A.
Even though the numbers do not exactly match, we would expect the profiling tool
to classify this frame as having behavior A.

**Table 1:**

Code Module |
Behavior A |
Behavior B |
Behavior C |

rendering | 75% | 25% | 25% |

physics_sim | 15% | 65% | 15% |

ai_pathfind | 10% | 10% | 60% |

That's not hard to implement when you have a predefined list of behaviors like
Table 1; it's harder when you start from scratch with no preconceived notions of
program behavior. In general we are faced with a clustering problem: given a
bunch of vectors containing timing data, we want to arrange the vectors into
groups, where the centroid of each group can be used as a summary.

Clustering is a common task in computer science and engineering applications, so
many algorithms have been developed, for example "k-means clustering". Here's
how k-means works: first, randomly partition your input into a set of initial
clusters. Find the centroid of each cluster. These clusters obviously won't be
perfect yet. To fix that, you iterate over each input vector, and re-assign it
to the cluster whose centroid is nearest. Now recompute the centroids. Repeat
this reassignment process until the results converge, and you perform an entire
pass without changing anything.

The name "k-means" indicates that there are some number of clusters (k of them),
and that we are representing each cluster by its center (by the mean of the
input values).

Unfortunately, k-means and most other clustering methods are batch algorithms.
They require you to have all the vectors available at once, and to access those
vectors randomly. Since I want an interactive profiler, which can classify
behaviors "on the fly" during a single program run, these algorithms are
unsuitable.

**Non-Batch Clustering
**

There's a special term for clustering algorithms that can handle streaming input: they are called "online clustering algorithms". For example, there's an "online k-means".

Online k-means works like this: start with k cluster centers arbitrary distributed through space. For each input vector you want to classify, find the closest cluster point, classify the vector into that cluster, and move the cluster point closer to that vector. (See "Radial Basis Function Learning" in For More Information).

Online k-means really is an incremental version of batch k-means; but making the algorithm incremental is a big change, so the explanations of the two versions sound fairly different.

How do you decide how many clusters are necessary to accurately represent your data stream? There are some heuristics that involve removing cluster points when you have too many (some of them are lying idle, never attracting input) and adding cluster points when you have too few (when the radius of space attributed to a particular cluster point is too large). Unfortunately, these heuristics all require some kind of a priori knowledge of the data: you need to know how many samples is the minimum for considering a point to represent a valid cluster, or how spatially large a good cluster is likely to be. In other words, you need a good guess about the duration and scale of your data.

But we don't want to have to guess at that stuff in order to get a good profile. Really we want the computer to figure out appropriate values from context. And those values may need to change. If we are running our game for 5 minutes and see some frames that spend 70% rendering, 75% rendering, or 80% rendering, we probably want to draw distinctions between these and call them three different behaviors. But then if we walk into another room in the game, and we start seeing a lot of behaviors like B and C from Table 1, then perhaps all of the previous frames should be categorized as A.

Clearly, some kind of hierarchical clustering is necessary. Probably it's best to let the user decide at which detail level to view the abstractions; we should maintain a tree of behaviors from which the user selects a particular view. Unfortunately, the typical methods, like "Hierarchical Agglomerative Clustering" tend to be data-intensive batch processes.

There's one other problem. In a full-sized game, there may be a few hundred different sub-areas of the program for which we get profiling data. In clustering terms, we're clustering data points that have a few hundred dimensions each. Suppose we manage to do this, and we get a result with 15 different behavior clusters. Those clusters are still points in a high-dimensional space. We can list them, like Table 1, but we'd like better methods of visualization. Maybe we want to somehow project these clusters onto a 2D chart, and have the clusters that are most alike placed near each other on that chart. Maybe the two dimensions of the chart can even mean something, related to the two biggest ways in which the behaviors tend to vary.

There are some well-known methods for projecting multidimensional data onto a 2-plane. These methods tend to fall into two categories: (1) ad-hoc, producing poor results, and (2) complicated, slow, and scary.

There's another alternative, though: a hybrid kind of algorithm that clusters and projects simultaneously, and does a good job of both.

In this month's sample code, to perform the classification of CPU behavior, I used this hybrid algorithm, the Kohonen Self-Organizing Map (abbreviated SOM).

The SOM is an array of points in the input space; all the points are initialized to random values. For each incoming frame of profiling data, we find the closest point in the SOM. We then move that closest-match point toward the position represented by the incoming frame, just like online k-means. But the SOM does something extra: in addition to giving the clusters positions in the input space, it treats them as having positions in another space also. The clusters have fixed positions along a regular grid, in what I will call "neighbor space". Whenever you move a cluster center toward an input in input space, you also look up its neighbors in neighbor space, and move those neighbors through input space in the same direction, thought not by as much.

This neighbor space is usually implemented just by storing the cluster data in a 2D array, and by iterating across nearby cells in that array. You can think of this as applying a "move filter" to the array, where the filter kernel is high in the center but tapers off with increasing radius (see Figure 1).

After many iterations of this, the clusters in input space will converge, just like k-means. The size of the move filter shrinks over time, as the clusters become settled, until it becomes a point. The "Self-Organizing" part of SOM represents the idea that, because we applied that move filter in neighborhood space, we were constantly influencing neighboring array cells to take on similar values. So after we are done processing our inputs, we not only get cluster centers, but the clusters are arranged in the 2D grid according to similarity. So we can generate a 2D display where similar clusters are drawn near each other, which can help us understand the data.

This is a quick-and-dirty description of the SOM, with some simplifications. For reasons of isotropy, like we saw when looking at networking ("Networking, Part 2: Transmitting Vectors", July 2002, Game Developer Magazine), it is usually better to use a hexagonal tiling than a rectangular grid. Also, the SOM grid can consist of any number of dimensions, but 2 dimensions is the most common choice, since a map with more dimensions is harder to visualize. Finally, initialization of the SOM is best done using some attributes of the training data, since a random spread in a high-dimensional space is likely to produce points that are not near anything in your data set; in this case, the only way array elements become used is when they get dragged into use by a neighbor, so it takes a while before the SOM is working at reasonable effectiveness.

For some references that explain self-organizing maps in greater detail, see For More Information.

The SOM is a batch algorithm, for two reasons. To produce good results, the algorithm wants a randomly-ordered sampling of the data. Otherwise, you can imagine first presenting all your samples of behavior A; then you present an equal number of samples of behavior B which happens to land in a cluster neighboring A, and the movement filter for B wipes out most of A. The second reason is that the SOM algorithm wants to have a global notion of time; early in training, the movement filter is very large, and over time it shrinks, until toward the end of training it is nearly a point. But with live input data, we don't know how long we will be running, and we may wish to run indefinitely.

So I made some modifications to the SOM algorithm to adapt it for realtime data. These modifications are sort of questionable, since I have not analyzed them deeply, and I can think of some ways they may be problematic. But to a first approximation they work okay.

Whenever a point in the SOM is generally staying in the same place, its convergence value increases; if it starts getting yanked someplace else, its convergence value decreases. I use the IIR filters described in "Toward Better Scripting, Part I" (October 2002 issue of Game Developer) to track the trend of a point's motion. So if many training steps confirm the positions of some region of the map, that region will no longer feel the need to upset its neighbors. But if an area of the map keeps being upset, its "local time" will go backward and it will jostle its neighbors in a bid to create more space for the competing values.

This "hardening of the map" does to some extent alleviate the "B wipes out A" problem, but does not solve it. To do so, I would implement a hierarchical series of maps, with maps that change quickly feeding data into maps that change more slowly. Unfortunately I haven't had time to try this (see the note at the end of the article) but plan to soon.

One further problem is that there's no way for the SOM to grow. To achieve reasonable results, you need to construct it with enough initial nodes. But I believe hierarchical maps would solve this also.

I wanted a good way to initialize the SOM. As I mentioned earlier, random initialization is not so good. The way it's usually done for batch data is to perform principal component analysis on all the input, find the two non-axis-aligned dimensions along which it has the greatest spread, and initialize the SOM array entries to evenly cover that spread. Another way of saying this is, you find the covariance of the data with respect to the mean, then pick the two longest axes of the multidimensional ellipsoid; this is just an n-dimensional version of what we did in "My Friend, The Covariance Body" (September 2002 issue of Game Developer).

Since I don't want to wait for the whole batch of input data, I just collect 10 frames' worth of "warmup points" before the SOM kicks into training mode. Then, because I didn't want to write n-dimensional matrix decomposition code, I approximate the spread as follows: First I pick a random warmup point, then I find the warmup point that is furthest from it. I find the vector between them, and that is the longest axis of the spread. Then I remove that dimension from the warmup data by subtracting each point's projection along that axis. Then I choose another random point, find the warmup point furthest from that one, and that is the second longest axis of the spread. I then initialize the SOM entries to points evenly spaced across the plane defined by these two axes, centered on the mean. If the axes are degenerate within some numerical threshold, then I just seed the SOM with random offsets from the mean.

This month's sample code is an updated version of last month's toy game world. In addition to AI-heavy and entity-render-heavy behavior modes, there's now a sun in the sky; when you look toward the sun, ridiculous amounts of lens flare are drawn, bogging the system down that way. As you move around the game world, you'll cause various behaviors that are mixtures of AI, entity, and lens flare. Whenever mouselook is enabled so that you can move around, the game system takes each frame's profiling data and classifies it using an SOM.

When mouselook is disabled and the SOM is paused, you can move the mouse pointer over the SOM's display and look at the behaviors it has detected. There are a number of visualizations enabled, which are explained in the README. These visualizations aren't as intuitive as I'd like, but after a bit of exposure you get used to them and can infer interesting relationships about the clusters by looking at colored diagrams (Figure 2).

So far this work has been pretty experimental, but I think it shows some good promise. I am going to try some hierarchical versions of the SOM and online k-means, and will report on their effectiveness in a future column. In the meantime, next month we'll go back to a more conventional topic, like cool-looking graphics.

"Radial Basis Function Learning", Paolo Palmeri, http://miles.cnuce.cnr.it/~palmeri/datam/articles/okmc.ps.gz

"The Self-Organizing Map (SOM)", Teuvo Kohonen, http://www.cis.hut.fi/projects/somtoolbox/theory/somalgorithm.shtml

"Enhancing SOM based data visualization", Juha Vesanto, Johan Himberg et al, http://www.cis.hut.fi/projects/ide/publications/html/iizuka98