The Inner Product, June 2004
Jonathan Blow (jon@number-none.com)
Last updated 23 July 2004
Experiments I'd Like to Work On
This is my last episode of The Inner Product, for a while at least! I’m going to spend some time focusing on game creation, away from distractions like writing magazine articles. Next month’s technical column will be written for you by the capable and curmudgeonly Sean Barrett.
Over the past several years, many interesting projects have accrued on my “I Really Want To Do This” list. As a parting shot, I’ll talk about the more exciting items on this list -- why they are interesting in a technical sense, and how they are important for the future of games.
None of these items have to do with graphics. I think graphics is well-covered now; a lot of research has been put into it, and we’ve come a long way. Certainly there are some big problems in graphics that ought to be solved, like realtime global illumination. Even so, I think it’s clear that the current level of graphics technology is far advanced relative to all the other technical aspects of games. As a result, our games have nice graphics but suck in lots of other ways, and there’s no easy fix we can employ. We need to do a lot of hard work to develop the other areas of game technology; this list is a set of places where I would start.
Making People Move
The game I’m working on is an action role-playing game. You play a guy who walks around the world talking to other people, who are all doing their own things. Since this game is in 3D, I need to show the people animated in 3D going about their daily activities. They need to walk around, talk to each other, climb into bed, jump out the window of a burning building, eat a shoe, etc. Traditionally, we make this happen by pre-authoring a bunch of animations for all these characters, then playing back the appropriate animations at runtime, perhaps blending between different example animations to achieve the desired effect.
These animations take up a lot of memory, even when compressed. A bigger problem is that they’re difficult to make. Competent 3D artists are hard to come by, and even if you get some, the task is daunting and expensive. But even if you can afford it, there are deeper, more troubling problems: your characters can only perform actions that you have explicitly animated, and you tend to have great difficulty interfacing animated characters with a physically simulated world.
Ideally, we would like all character motions to be generated dynamically at runtime: if an AI-controlled character wants to sit down in a chair, he figures out how to move his muscles in order to get over to the chair and place his butt firmly upon it. This is a very difficult thing to do, and as a general problem it is far out of our reach. (It’s difficult just getting the guy to remain standing!)
I believe that it’s important for us to work our way toward the goal of dynamically-simulated motion, so I’ve chosen an intermediate approach that ought to be more achievable. Rather than authoring full animations, I want to just author a set of single-frame full-body poses. At preprocess time, the poses are arranged in a graph so that similar poses are neighbors. See Figure 1a.
At runtime, if the AI is standing next to a chair and wants to sit in it, he searches the database for the pose that’s nearest to his current body state, which we’ll call START, and he also looks up the pose that is associated with sitting, which we’ll call END. The AI then performs a graph search to find a short and comfortable path from START to END. This path represents a plan of how to move his body to achieve the sitting action. See Figure 1b. From frame to frame, the AI just needs to interpolate along this path until it reaches END.
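To make the planning step concrete, here’s a rough sketch of what the search might look like. The pose_distance metric, the skeleton size, and the graph layout are all placeholders of my own, not a finished design; the search itself is just Dijkstra’s algorithm with pose-space distance as the edge cost.

#include <vector>
#include <queue>
#include <utility>
#include <functional>
#include <algorithm>
#include <cfloat>

const int NUM_JOINTS = 40;                       // placeholder skeleton size

struct Quaternion { float x, y, z, w; };         // stand-in for the engine's math type

struct Pose {
    Quaternion joint[NUM_JOINTS];                // one rotation per joint
};

float pose_distance(const Pose &a, const Pose &b);   // assumed metric in pose space

struct Pose_Node {
    Pose pose;
    std::vector<int> neighbors;                  // indices of similar poses
};

// Find a low-cost path of node indices from 'start' to 'end'.
std::vector<int> plan_path(const std::vector<Pose_Node> &graph, int start, int end) {
    int n = (int)graph.size();
    std::vector<float> cost(n, FLT_MAX);
    std::vector<int> came_from(n, -1);

    typedef std::pair<float, int> Entry;         // (cost so far, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > open;

    cost[start] = 0;
    open.push(Entry(0, start));

    while (!open.empty()) {
        Entry e = open.top();
        open.pop();
        int node = e.second;
        if (node == end) break;
        if (e.first > cost[node]) continue;      // stale queue entry
        for (size_t i = 0; i < graph[node].neighbors.size(); i++) {
            int next = graph[node].neighbors[i];
            float c = cost[node] + pose_distance(graph[node].pose, graph[next].pose);
            if (c < cost[next]) {
                cost[next] = c;
                came_from[next] = node;
                open.push(Entry(c, next));
            }
        }
    }

    std::vector<int> path;                       // walk back from END to START, then reverse
    for (int node = end; node != -1; node = came_from[node]) path.push_back(node);
    std::reverse(path.begin(), path.end());
    return path;
}

A real version would want edge weights that encode “comfort” as well as raw pose-space distance, but that is a tuning question, not an architectural one.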
All sorts of methods could be used for this interpolation. At the simplest, you could imagine just slerping the rotations at the joints of the AI’s skeleton. This would look unnatural, but it’d get the job done, and it’s a good first step when you’re working toward building a full system. As a next step, I would have the body apply some forward kinematics to get it from pose to pose – apply muscle forces to the joints to try to get them to do the right thing. As I said before, we’re nowhere near solving this muscle force problem in the general case, but now that we have a well-defined pose path, the problem is somewhat constrained. We can use the pose path to impart some magical balancing forces on the character to keep him from getting all wacky. To visualize how this might work, imagine that the path through the graph represents a spline in pose space. When we consider the set of all points within some distance of this spline, we get a sort of bendy rod (you might choose to visualize this as a sphere swept along the spline). The forward kinematics is free to do what it wants so long as it doesn’t go much outside this region. If it does, we clamp it, or pull it back toward the spline with a penalty force.
So the role of the forward kinematics is mainly to provide flavor, whereas the pose targets serve as a guide-rail to generally control locomotion. As we improve our character balancing technology and begin to attain mastery, we can shift the balance of power between these two systems, so that the forward kinematics becomes more in control of overall locomotion, and we use the guide-rail less.
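As a first step toward implementing any of this, the simplest interpolation mentioned above, per-joint slerp along the path, might look like the following sketch (reusing the Pose declaration from the planning sketch; slerp is assumed to be whatever the engine’s math library provides).

// Reuses Pose, Quaternion, and NUM_JOINTS from the planning sketch above.
Quaternion slerp(const Quaternion &a, const Quaternion &b, float t);   // assumed engine routine

// t runs from 0 at pose 'a' to 1 at pose 'b'.
void interpolate_pose(const Pose &a, const Pose &b, float t, Pose *result) {
    for (int i = 0; i < NUM_JOINTS; i++)
        result->joint[i] = slerp(a.joint[i], b.joint[i], t);
}

The forward-kinematics version would replace this per-joint slerp with muscle forces, while still using the same path as the guide-rail.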
The way I’ve explained this so far, the nodes of the graph are just pose-space targets, but you probably want velocity envelopes attached to the joints as well. Imagine you want your character to throw a ball – if you don’t specify the velocity of the hand at the time the ball is released, you can get some motions that are much unlike what you want. I am worried that adding velocity envelopes might cause an explosion in the amount of pre-authored data required, and that the authoring process itself would become more difficult. If this is so, perhaps the pose-and-velocity targets should be automatically acquired from motion capture data. Or, perhaps the targets should not contain velocity data, and velocities should be imposed by a separate set of constraints used by the forward kinematics. Experimentation is required here. Also, even if we’re only dealing with poses, we want the ability to decouple different parts of the body somewhat in order to reduce the sample data – otherwise, for example, we’d need to duplicate each pose of the arms for a lot of different stationary leg poses, which gets silly very quickly. (At the same time, arms and legs are not really independent, so some kind of coupling should still occur. I suspect a lot of IK fixup would be done to the legs for center-of-gravity purposes.)
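The decoupling might be as simple as keeping separate pose sets for different parts of the body and recombining them per joint, as in this sketch (again reusing the earlier declarations; the joint classification is made up, and as noted, IK fixup would still be needed afterward).

// Reuses Pose and NUM_JOINTS from the sketches above.
bool joint_is_upper_body(int joint_index);       // placeholder skeleton query

void compose_pose(const Pose &upper, const Pose &lower, Pose *result) {
    for (int i = 0; i < NUM_JOINTS; i++)
        result->joint[i] = joint_is_upper_body(i) ? upper.joint[i] : lower.joint[i];
}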
This system solves many problems. For example, suppose a character is sitting down, and you throw a ball that hits his shoulder and knocks him back a bit. All the AI has to do is notice that its current pose is far from where it had hoped to be (measured as distance to that spline), and then perform a “re-plan”: find a new START node and re-traverse the pose graph. From then on, it’s business as usual. Also, we’ve eliminated the Covered-Action Assumption that was inherent in pre-authored animation (that your guys can only do what you thought of in advance). If one were to write an AI that could think up arbitrary body poses as solutions to problems, it would then be able to achieve those poses, resulting in actions that, overall, were not pre-authored. (The name Covered-Action Assumption is something I just made up, but I think it denotes an important concept: that the set of all actions a character can perform is covered by the set of pre-authored animations.)
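The re-plan itself is cheap. Here is a sketch, reusing the earlier declarations; the spline-distance test and the threshold are placeholders.

// Reuses Pose, Pose_Node, pose_distance, and plan_path from the sketches above.
float distance_to_path_spline(const Pose &current, const std::vector<int> &path);  // hypothetical helper

int find_nearest_node(const std::vector<Pose_Node> &graph, const Pose &current) {
    int best = -1;
    float best_d = FLT_MAX;
    for (size_t i = 0; i < graph.size(); i++) {
        float d = pose_distance(current, graph[i].pose);
        if (d < best_d) { best_d = d; best = (int)i; }
    }
    return best;
}

// Called when the body may have strayed too far from the planned path.
void replan_if_needed(const std::vector<Pose_Node> &graph, const Pose &current,
                      int end, std::vector<int> *path) {
    const float REPLAN_THRESHOLD = 0.5f;         // made-up tuning value
    if (distance_to_path_spline(current, *path) > REPLAN_THRESHOLD) {
        int start = find_nearest_node(graph, current);
        *path = plan_path(graph, start, end);
    }
}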
The big question with this method is the quality issue: how human will the resulting animations look? I think the quality of the results would be highly tweakable by adding new data points into the graph, and by tweaking the weights between the nodes to create preferred avenues of motion. In the limit, you could just dump whole animations into the graph frame-by-frame, for cases where very high quality is desired.
You may notice that this method is a fusion between forward kinematics and inverse kinematics. Choosing the END node, and the path to it, is an inverse kinematics activity, whereas navigating between targets is best done via forward kinematics. So I have dubbed this technique “Alternating Kinematics”. Of course the final technique will not be as simple as what I’ve said here, as there will be many implementation details to overcome. I hope to report about the results in a future write-up!
Buffing Up Computer Vision
Recently, Sony made the adventurous move of releasing the EyeToy as a commercial product, and it met with success. This helped show that cameras have an interesting future as game controllers. Unfortunately, our technology for parsing images is very poor. The EyeToy games primarily use a method equivalent to background-subtraction, with perhaps a little bit of augmentation here and there. But this level of technology severely restricts the kinds of games that can be done, since the computer doesn’t really understand what the various parts of your body are, or what they are doing. In fact the computer will frequently become confused by changes in scene illumination, motion in the background, and plain old camera noise.
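To give a feel for how little is going on at this level of technology, here’s roughly what background subtraction amounts to for a grayscale frame; the image layout and the threshold are placeholders of mine, not anything specific to the EyeToy.

// Mark a pixel as "player" wherever the current frame differs from a stored
// reference frame (captured with nobody in front of the camera) by more than
// a threshold. Thresholded per-pixel differencing is essentially the whole
// trick, which is why illumination changes and camera noise cause so much trouble.
void background_subtract(const unsigned char *frame,      // current grayscale frame
                         const unsigned char *reference,  // empty-scene reference frame
                         unsigned char *mask,             // output: 255 = foreground, 0 = background
                         int width, int height, int threshold) {
    for (int i = 0; i < width * height; i++) {
        int diff = (int)frame[i] - (int)reference[i];
        if (diff < 0) diff = -diff;
        mask[i] = (diff > threshold) ? 255 : 0;
    }
}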
To us, a picture contains a lot of organized shapes and colors that seem obvious. But to a computer, it’s just a mess of colors without semantics. It’s hard to appreciate the depth of this issue unless you’ve spent some hard time developing vision algorithms. If you haven’t, I encourage you to give it a try some weekend, so you can come face-to-face with the problems yourself.
As we overcome these technical limitations, the scope of games that can be done with a camera becomes much wider. Recently I read Donald Hoffman’s book “Visual Intelligence: How We Create What We See”, and I found it very exciting (see References). In addition to providing some valuable insights about our human perception of the world, it methodically constructs a framework for a low-level vision system.
Early in the book, Hoffman summarizes the difficulties by defining “The fundamental problem of vision: The image at the eye has countless possible interpretations,” and some guiding principles, like “The Rule of Generic Views: Construct only those visual worlds for which the image is a stable (i.e., generic) view.” (A stable reconstruction of the world is one whose significant features don’t change for small motions of the viewpoint). He then presents a series of more concrete rules to help reduce the ambiguities of the vision problem, and to explain how the human visual system parses images. There are 35 of these rules in all, and they range from the basic “Rule 3: Always interpret lines colinear in an image as colinear in 3D” to the more involved “Rule 32: Construct the smoothest velocity field.” These rules are usually justified by actual experiments on humans, many of which you can try yourself by looking at images printed in the book.
This book can be a pretty good high-level recipe for anyone who wants to sit down and write a next-generation computer vision system. The rules are all clear enough that an experienced programmer can see how to start implementing them. Of course we’re still very far from solving the vision problem, and such a system would be nowhere near perfect (probably we need to solve the full AI problem in order to make vision really work!), but this would be a good beginning. Our current image-parsing technology is so poor that it really oughtn’t be hard to achieve better results. That translates directly into new types of gameplay for camera-based games, which is pretty exciting.
AI Frameworks: A Generalized Hofstadter-Style Solver
When building such a vision system, it becomes clear that a unidirectional flow of information, and monotonic construct-building that moves from low-level pixels to high-level shapes, is an insufficient paradigm. Unfortunately, it’s the default way we approach these problems, and in fact it’s just about the only kind of architecture we’ve developed strong methods for, which means we’re especially ill-prepared. Vision, and every other AI problem, wants its systems to have feedback, a channel of bidirectional influence that unites the low and high levels. A high-level layer postulates the existence of a certain shape in a specific area of the image, which might trigger something in a middle layer to remember that it found some features that might be line segments in that region, which might trigger the low level to examine the pixels more closely, which would feed back upward into the middle layer providing evidence to help confirm or deny existence of those features, which influences further postulates made by the high level. Software-engineering-wise, we don’t really know how to do this.
The best work I’ve seen in this area has been done by Douglas Hofstadter. His Fluid Concepts book (see References) demonstrates a software architecture that can be applied to this kind of problem. A bunch of tiny agent routines, called “codelets”, are designed such that their emergent statistical behavior is to solve the problem. The codelets run in an essentially parallel fashion, communicating by reading and modifying the notions that comprise the current attempted solution. This kind of approach can be seen as an extension of the classical “blackboard architecture”, but it operates in a finer-grained and more robust way.
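A minimal sketch of what such a main loop might look like (the names here are mine, not Hofstadter’s): a pool of codelets with urgencies, one chosen at random each step with probability proportional to its urgency, each reading and modifying a shared workspace and possibly posting new codelets.

#include <vector>
#include <cstdlib>

struct Workspace;                                // the shared "blackboard" of current notions
struct Codelet;
typedef std::vector<Codelet> Coderack;           // pool of pending codelets

struct Codelet {
    float urgency;
    void (*run)(Workspace *ws, Coderack *coderack);
};

// Pick a codelet with probability proportional to its urgency.
int choose_codelet(const Coderack &coderack) {
    float total = 0;
    for (size_t i = 0; i < coderack.size(); i++) total += coderack[i].urgency;
    float r = total * (rand() / (float)RAND_MAX);
    for (size_t i = 0; i < coderack.size(); i++) {
        r -= coderack[i].urgency;
        if (r <= 0) return (int)i;
    }
    return (int)coderack.size() - 1;
}

void run_codelets(Workspace *ws, Coderack *coderack, int num_steps) {
    for (int step = 0; step < num_steps && !coderack->empty(); step++) {
        int index = choose_codelet(*coderack);
        Codelet c = (*coderack)[index];
        coderack->erase(coderack->begin() + index);
        c.run(ws, coderack);                     // may post new codelets onto the coderack
    }
}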
As presented in Hofstadter’s book, a codelet architecture must be carefully designed for the problem at hand. I’d like to attempt to generalize this kind of system, so that it can be applied to a wider problem space with less customization. Computer vision would be a good place to employ this kind of system, but there are lots of other places, like the perception/action cycle for game AIs. Such generalization is a hard problem, so this task is a bit quixotic, but sometimes I like things that way.
Hofstadter argues that AI isn’t what the Good Old Fashioned AI researchers say it is, and I think he’s right. I believe his idea that fluid, continuous pattern-matching lives at the core of intelligence, and I look forward to experimenting with this kind of system.
Goodbye!
That’s all for now. Keep on keeping on, and I’ll catch you later.
-Jonathan.
References:
Donald D. Hoffman, Visual Intelligence: How We Create What We See, W. W. Norton & Company, 2000.
Douglas Hofstadter, Fluid Concepts & Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought, Basic Books, 1996.