In this post I will try to describe a method of reducing the computational resources necessary to train a decision making system in an environment with random or pseudorandom starting conditions.
By setting up conditional parameters for our tree-search function, that is, by designing a network that searches subsections of the decision tree only when the relevant conditions are met, we can navigate the search space intelligently instead of exhausting every option at every node of the decision tree.
Within an Artificial Neural Network (ANN), this process could involve something like pre-training, where the network correlates the value of certain strategic decisions (within a small amount of training data) with the starting conditions that those decisions were taken in.
In this way, the network can start to identify which parts of the decision tree are worth exploring further, given certain inputs, without wasting resources (time and compute) on exhausting the entire set of possible decisions or strategies.
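As a rough sketch of the idea (the toy tree, the prior function, and the threshold are all hypothetical), conditional expansion means recursing into a branch only when a cheap pre-trained prior, keyed on the starting conditions, says the branch is worth exploring:

```python
# Toy decision tree: leaves hold payoffs, internal nodes map moves to subtrees.
tree = {
    "rush":    {"attack": 1.0, "retreat": -0.5},
    "build":   {"expand": 0.6, "tech": 0.8},
    "fortify": {"towers": 0.3, "walls": 0.2},
}

def prior(conditions, move):
    # Hypothetical pre-trained prior: Rushing looks worthless when the
    # players spawned far apart, so that branch scores zero.
    if conditions["distance"] == "far" and move == "rush":
        return 0.0
    return 1.0

def search(node, conditions, threshold=0.5):
    """Expand only branches whose prior clears the threshold."""
    if not isinstance(node, dict):  # leaf: return its payoff
        return node
    values = [search(child, conditions, threshold)
              for move, child in node.items()
              if prior(conditions, move) >= threshold]
    return max(values) if values else float("-inf")

print(search(tree, {"distance": "far"}))    # the rush branch is never expanded
print(search(tree, {"distance": "close"}))  # all branches are expanded
```

The search still finds the best reachable payoff; it just never spends compute inside subtrees the prior has already ruled out for these starting conditions.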
Training an ANN in a Real Time Strategy Game Environment
Let’s use an example from a classic Real-Time Strategy game (Warcraft 2) to show what I mean.
In Warcraft 2, as in many other domains, the decision between different strategies or actions is not straightforward. There is no single dominant strategy; if there were, the game would become less interesting. We’ll come back later to how this helps us build a computationally efficient network in this domain.
Generally, and I hope this will be easy enough to understand even from a non-gamer perspective, there are a few basic strategies to employ:
1) Rush an opponent, going all-in on winning the game right away.
2) Build during the early or mid game.
3) Fortify your base to secure the late game.
You can think of these sort of like Rock, Paper, Scissors.
Rushing beats Building, because the opponent doesn’t have enough fortification to defend the all-in attack.
Building beats Fortifying, because the opponent’s fortifications will eventually fall under enough pressure.
Fortifying beats Rushing, because the initial rush can’t take out the fortification and falls flat.
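The counter cycle above can be written down directly. A minimal sketch, using the strategy names from this post:

```python
# Each strategy and the strategy it beats, Rock-Paper-Scissors style.
BEATS = {
    "Rush":    "Build",    # the all-in attack lands before defenses exist
    "Build":   "Fortify",  # superior economy grinds down static defense
    "Fortify": "Rush",     # fortifications blunt the early all-in
}

def winner(a, b):
    """Return the winning strategy, or None for a mirror matchup."""
    if a == b:
        return None
    return a if BEATS[a] == b else b

print(winner("Rush", "Build"))    # Rush
print(winner("Fortify", "Rush"))  # Fortify
```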
So let’s see what happens if the ANN trains itself using the basic initial parameter of taking each strategy with a 33.3% frequency.
Since there are 3 options, it initially has no preference among them; it views them equally and chooses between them at random.
As it trains, and gets feedback in the form of a reward function, it starts to tweak the frequencies of the strategies in order to find the optimal frequency.
But it doesn’t just find the optimal frequency in a vacuum; it optimizes the frequency of each action along the dimensions of the starting conditions, or the game state.
For simplicity, let’s assume the ANN only has these three strategies to choose from, and only has one starting condition or game state that it keeps track of. This is an oversimplification of how such an agent would work, but it helps us get to the point.
The starting condition that the ANN keeps track of is where the players randomly spawned on the map, that is, their spawn locations, which can be represented below.
Our network’s choice of starting strategy (Rush, Build, or Fortify) will surely depend on its own starting position, but it will also depend on the starting position of its opponent.
Our network will learn this through many trials. It may take millions or billions of games, but it will eventually figure out that you need to know certain features of the game state to make a good decision. Namely, you need to know where you started and where you are in relationship to your opponent.
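A tabular caricature of that learning process (this is not the ANN itself; the state labels and reward rule are invented for illustration): keep one strategy-frequency table per observed game state, start uniform, and nudge frequencies toward whatever earns reward.

```python
import random

strategies = ["Rush", "Build", "Fortify"]

def make_policy():
    # Uniform start: no initial preference, roughly 33.3% each.
    return {s: 1 / 3 for s in strategies}

def choose(policy):
    return random.choices(strategies,
                          weights=[policy[s] for s in strategies])[0]

def update(policy, strategy, reward, lr=0.05):
    # Shift frequency toward rewarded strategies, then renormalize.
    policy[strategy] = max(policy[strategy] + lr * reward, 1e-6)
    total = sum(policy.values())
    for s in strategies:
        policy[s] /= total

# One frequency table per game state (here: rough spawn distance).
policies = {"close": make_policy(), "far": make_policy()}

random.seed(0)
for _ in range(2000):
    state = random.choice(["close", "far"])
    s = choose(policies[state])
    # Invented reward rule: Rushing pays off when spawns are close
    # together and fails when they are far apart.
    reward = 1.0 if (s == "Rush") == (state == "close") else -1.0
    update(policies[state], s, reward)
```

After a couple of thousand trials the "close" table concentrates on Rush while the "far" table shifts toward Build and Fortify: the same action ends up with very different frequencies depending on the game state it was taken in.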
The network will almost surely converge on the strategy of sending out an early scouting unit to find where its opponent started, a feature of RTS games that will be familiar to players across the board. You need to know the lay of the land to choose your strategy intelligently.
But we don’t need a million or a billion games to understand that. We can logically derive that strategy just by thinking about the game.
If our starting assumption is that the game is balanced, namely that no single strategy is dominant but each relies on a specific game state to be preferable, we can deduce that certain game states should encourage us to Rush rather than Build, and that certain starting positions should encourage us to Fortify over both other strategies.
How could we make this deduction if our only known starting condition is our own location and the location of our opponent?
Think about it for a second. Look back at the map and ask yourself: if I had spawned as the orange player at 11 o’clock and my opponent had spawned as the red player at 4 o’clock, would I be encouraged to Rush, Build, or Fortify?
Most likely, if you haven’t played any games of this type, you’ll be a bit clueless here. But if you figured out that you would be discouraged from Rushing, then congratulations, you are correct.
The farther away you started from your opponent, the less effective your Rush strategy will be. The value of Rushing comes from being fast, from attacking your opponent before they are ready for it. Starting right next to the other player allows you to reach them faster, whereas starting farther away makes your Rush take longer, reducing its effectiveness and in turn making it less likely to be rewarded.
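As a toy model of that trade-off (the numbers are entirely made up), Rush’s expected payoff could decay with spawn distance while Build’s stays roughly flat, giving a crossover point past which Rushing is no longer worth it:

```python
def expected_rush_value(distance, base=1.0, decay=0.15):
    # Toy model: rush payoff decays as the opponent spawns farther away.
    return base - decay * distance

def expected_build_value(distance):
    # Building is roughly distance-independent in this sketch.
    return 0.4

for d in (1, 4, 8):
    best = ("Rush" if expected_rush_value(d) > expected_build_value(d)
            else "Build")
    print(d, best)
```

At short distances Rush dominates; past the crossover, Build does. The exact shape of these curves is what the agent would be estimating from its early trials.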
Here we have made a connection between the reward we expect to receive from an action or strategy, and the starting conditions in which we attempt to take that strategy.
The gains we make from exploring the strategy of Rushing further, when we know we are far away from our opponent, provide very little value to us relative to other explorations of the game tree that we could make.
Therefore, I am suggesting training an agent in a way that lets it quickly form a strong correlation between the rewards it expects to receive from playing a strategy and the starting conditions, or game state, it finds itself in, and then use that correlation to inform how it trains itself in this environment going forward.
In this way it doesn’t waste computational resources exploring how to make a Rush work from all the way across the map for all billion trials. It simply figures out within the first 1000 trials that it would prefer to Build or Fortify in these situations, and then spends the rest of its training time going deeper into those strategies.
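That training schedule could be sketched as two phases (the reward numbers and the cutoff are hypothetical): a short correlation phase that estimates each strategy’s average reward per game state, then a long phase that explores only the survivors.

```python
def prune(trial_rewards, floor=-0.5):
    """After the early trials, drop strategies whose average reward
    in this game state fell below a floor (hypothetical threshold)."""
    averages = {s: sum(r) / len(r) for s, r in trial_rewards.items()}
    return [s for s, avg in averages.items() if avg >= floor]

# Average rewards from the first ~1000 trials in a "spawned far apart" state.
early = {
    "Rush":    [-0.8, -0.9, -0.7],  # rushes across the map keep failing
    "Build":   [0.4, 0.6, 0.5],
    "Fortify": [0.2, 0.1, 0.3],
}
print(prune(early))  # only Build and Fortify survive for deeper training
```

Everything after the cutoff is then spent refining the surviving strategies rather than re-confirming, trial after trial, that the cross-map Rush doesn’t work.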