Getting the Most out of Your AI: Threading
All the previous articles in this series have been laying the foundation for this. Hopefully, you now have a good understanding of what artificial intelligence (AI) for games is all about and a reasonable understanding of how to implement and apply it to your own games. Now, the challenging part is maximizing the performance of your system.
No matter how nice your system is, it will be worthless if it slows down the game. Efficient programming and optimization tricks can only bring you so far; when your aspirations reach beyond the limits of a single core, you are just going to have to go parallel (see Figure 1).
Figure 1: In Blizzard Entertainment's Starcraft* II, massive numbers of units are all running their AI at the same time. The best way to handle it all is with multithreading.
When dealing with a system that has more than one processor or a processor with multiple cores, you can split the work across multiple processors. There are two approaches for this: task parallelization and data parallelization.
The easiest way to multithread your application is to start breaking it down across specific tasks (see Figure 2). The different tasks that make up the game engine are often encapsulated with only a few methods so that other systems can communicate with them.
Figure 2: Functional parallelization lets each subsystem take its own thread and core. Unfortunately, systems with more cores than tasks are underutilized.
A prime example is the audio system of the game engine. Audio does not need to interact with other systems: It just does what it's told-namely, play and mix sounds on demand. The communication functions are calls to play and stop sounds, which makes audio autonomous and perfect for functional parallelization. Using threading analysis tools to assist with the effort, the audio system can be broken off into its own thread, with a call before and after the section of code that you want to run in its own thread.
So, let's see how your AI systems take advantage of this functional parallelization. Depending on the needs of your game, you may have many distinct tasks that you can give their own thread. Here we look at three of them: path finding, strategy AI, and the entity system itself.
You might implement your path-finding system in such a way that each entity that seeks a path calls its own path whenever it needs one. Although this method will work, it means that the engine waits for the pathfinder whenever a path is requested. If you reorganize path finding as its own system, you can break it into its own thread. The pathfinder will now function like a resource manager-one in which the new resource is paths.
Any entity that wants to find a path can send out a path-finding request, then get a "ticket" back from the pathfinder immediately. The ticket is just a unique handle that the path-finding system can use to find a path. The entity can then continue on its merry way until the next frame of the game loop. The entity can check to see whether the ticket is valid yet. If it is, the entity gets the path back; otherwise, it can keep doing whatever it needs to while it continues to wait.
Within the path-finding system, the ticket is used to keep track of path requests while the system works on them without worrying about affecting the performance of the system. A positive side effect of this style of system is the automatic tracking of paths discovered. So, when a request for a previously discovered path comes in, the pathfinder can just give the ticket to the existing path. This method is great in any system with lots of entities pathing, because any path found once will likely be needed again.
As mentioned in the previous article, the AI system that manages the big picture of the game fits nicely in its own thread. It can analyze the playing field and send out commands to the different entities, which can parse them when they get around to it.
The entity system, in its own thread, will be hard at work gathering information for the decision maps. These findings can then be sent to the strategy AI system as update requests for the decision maps. When the strategic AI updates, it can parse these requests, update the decision maps, and make judgment calls. It does not matter whether the two systems (strategic AI and entity) are in sync: They won't be far enough out of sync as to affect the decisions of the AI. (We are talking about 60ths of a second here anyway, and the player won't notice a single frame delay in AI reaction.)
Functional parallelization is great and takes advantage of systems with multiple cores. There is a huge drawback, though: Functional parallelization may not take full advantage of all available cores. The minute you have more cores than tasks, your program is no longer using all the processing power available (unless the platform your program is running on handles it for you, but I find it is best not to rely on features intended for generic applications). Enter data parallelization (see Figure 3).
Figure 3: With data parallelization, a single function can take advantage of all available cores.
In functional parallelization, you took a whole autonomous unit and gave it its own thread to play with. Now, you are going to break a single job down and spread its work out among different threads. Doing so has the benefit of scaling up with the cores of the system. Have a system with eight cores? Great. Have one with 64? Why not? Although functional parallelization allows you to designate sections of code as threaded, then let them run free in the wild, data parallelization may require a bit of extra work to get things running smoothly. For one, you could use a core thread (a "Master" thread) that keeps track of who is doing what. Sub-threads will need to request "work" from the main thread to ensure that no one performs the same task twice.
Use of a core thread to manage the data parallelization is actually a hybrid approach. The core thread is using functional parallelization, but then the thread breaks up the data across different cores for data parallelization.
Threading tools like OpenMP* (available for free in most flavors of operating systems) make the process of breaking your code down into separate threads easier than it has been in the past. Just mark sections of code that can be broken down with a compiler directive and OpenMP handles the rest. To break things down by work block, you can simply put the threading calls within the loop that goes through said resource.
In the path-finding system example, the pathfinder maintains a list of requested paths. It then loops through this list and runs the actual path-finding functions on the individual requests, storing them in a path list. This loop can be threaded so that each iteration of the loop is broken down into different threads. These threads will be run in the first available core, allowing for the maximum use of the processing power available. The only time a core should ever be idle is when there is nothing left to do.
There is potential in systems like these for multiple requests for the same job. If these requests are spaced out, the pathfinder automatically checks to see whether the request has already been handled. When dealing with data parallelization, it's possible for multiple requests for the same path to occur at the same time, which can lead to redundancies, breaking the whole point of threading.
To resolve this and other redundancies, the system in question needs to keep track of which tasks are being worked on and only remove them from the request queue after they are complete. So, when a request comes in for a path that has already been requested, it needs to check this, then return the existing path that was assigned that ticket.
Spawning new threads is not free. The process will involve system calls to the operating system (OS). When the OS gets around to it, it will wrap up the code needed and create the thread. This may take oceans of time (relative to processing speeds). This is why we don't want to spawn more threads than are helpful. If the job being requested has already been handled, then, by all means, do not run the task. Also, if the task is so simple (like path finding from two very near points) then it might not be worth breaking them down so finely.
Here is a breakdown what the Path Finding functional thread will be doing and how it will break up the work into data threads:
- RequestPath(start, goal).This function is called outside the pathfinder to get a thread. This function:
- goes through the completed requests list and determines whether the path has already been found (or one similar), then returns the ticket for that path;
- goes through the active request list (if the path has not been found) to find this path; if the path is in there, the function returns the ticket for the pending path;
- generates a new request and return a new ticket (if all else fails).
- CheckPath("ticket"). Using the ticket, this function goes through the completed request list and finds the path for which this ticket is valid. It returns whether the path was there, yet.
- UpdatePathFinder().This is the shepherd function that handles the overhead for the path-finding threads. This function performs the following tasks:
- Parse new requests. It's possible for multiple requests for the same path to be generated at the same time from different cores. This section removes redundancies and assigns multiple tickets (from the different requests) to the same request.
- Loop through active requests. This function goes through all active requests and threads them. At the beginning and end of each loop, the code is marked as a thread. Each thread is going to (1) find the requested path, (2) save it in the completed path list with the tickets it was given, and (3) remove the job from the active list.
You may have noticed that there is potential for some calamity with this setup. Different threads that all need to write to the request queue or-in the data-threads that all need to add something to the done pile can lead to writing conflicts, where one thread writes something in slot A while another writes something else in slot A at the same time. This conflict can lead to the well-known "race condition."
To avoid writing conflicts, sections of the code can be marked as critical. When doing something marked critical, only one thread will be able to access that section at a time: All other threads wanting to do the same thing (access the same memory) have to wait. This behavior can lead to HUGE problems, such as gridlock, when multiple threads are all blocking each other from getting to the memory. Gridlock is actually avoided with this setup. With the real work for the thread done, access to the critical section of memory can occur when available without tying up other sections that other threads may need.
Keeping Things in Sync
So, you have made all of separate AI subsystems autonomous and let them loose on the world to take advantage of every available ounce of computing they can find. Things are running quickly, but are they getting out of control?
A game needs to be a structured experience. An engine must be able to keep things in sync. You can't have half of your game elements operating a frame or two ahead of everything else. You don't want units lying around idle waiting for a path when their counterparts are already on the move. Good parents need to keep things fair between their children.
The main game engine loop takes care of two classes of actions: rendering and updating. Keeping these things in sync is easy in serial programming. First, all the updating runs, then the rendering draws what was updated. In parallel, things might get a little hairy.
Movement updates (often based on trajectory) might end up running a few frames faster than the renderer. The results will be jumpy animations, where the entities of the world appear to jump and move faster than they should. The path finding, which may take into consideration a snapshot of the world entity positions, may be operating on invalid data.
The solution to the problem of keeping your various systems in sync is fairly elegant in its simplicity. In fact, it may already be in place for most engines. When the main game loop updates, it keeps track of a global time index. All the various threads will make sure that they only need to handle updates for the current (and past, but not future) time index.
When all of work for a given task is completed for the current time index, the thread can sleep until the new time index. Not only does this behavior help to ensure that things stay in sync with each other, but it makes sure that threads don't consume cores when they don't need to. So the movement job, which is perfectly capable of resolving collisions and movement trajectories on and on, will be kind enough to share the processing power when it gets done early. Once again, you can take full advantage of the cores available.
Here are a few things to keep in mind when designing a multithreaded system:
- Functional parallelization: Use it when a system can be made autonomous. Some functions are required to hook into the system to resolve conflicts and redundancies.
- Data parallelization:
- Use it when performing a bulk operation.
- Design it so that write-backs are minimal and at the end of the process.
- Reliance on up-to-date information (which can be edited in other threads) is minimal or irrelevant. (So what if my strategy information is 1/60th of a second out of date?)
- Make sure that nothing your work needs from another system holds up the thread: "Request Path," "Check for path"-not "Get Path."
When Not to Thread
There may be times when threading might not work. With tools like OpenMP, you can easily adjust how many threads your system will break work down into on the fly. Using tools like Intel® VTune™ Performance Analyzer with Intel® Thread Profiler, you can get a good view of how effective your system becomes with different levels of parallelization. Here are some cases in which you might want to avoid threading, however:
- Overly complex systems. If your subsystem is tied in with too many other systems that constantly cause the subsystem or other systems to wait, threading might not be the answer. There is also a chance that the system itself might benefit from a redesign, if this is the case.
- Atomic workload. If the work of the subsystem cannot be broken down, you might not be able to go parallel. The audio mixing task may work great as a thread, but the work of the task is to mix multiple sounds into the channels that are ultimately pumped to the speakers. If your system does calculations on individual chunks of audio before the mix, then it may be possible to thread that.
- Costly overhead. There is a bit of extra work in some of these systems (pathing, for instance). If the overhead outweighs the benefits gained from threading, then it might be a good idea to avoid it or just turn it off. This may be true for systems with small numbers of elements (entities, paths, etc.).
- Repeating code. There are cases where multiple threads end up churning away at the same code, only to have some of it thrown out or ignored. Most of this can be avoided with redundancy checks before work begins.
Multiple processors and multiple-core processors (and multiple multiple-core processors) make threading an ever-more relevant topic. The goal for any programmer is to take full advantage of the processing power available. The vision for the AI of a system should only be limited by the reality of the hardware, not a limitation in taking advantage of it. With modern tools that make threading easy to implement, there is no excuse not design your code to handle threading.
For more such intel resources and tools from Intel on Game, please visit the Intel® Game Developer Zone