# Training an AI for the card game Dominion¶

This is part 2 of my journey to build an AI agent to play the deck-building card game Dominion:

- Part 1 covers (simplified) game rules, genetic algorithms, and classical reinforcement learning.
- Part 2 (this) covers neural networks and deep reinforcement learning approaches.
- Part 3 extends the AI to the full game of Dominion, and all the secondary decisions that entails.

For now, I continue to focus on a stripped-down version of Dominion that only uses the 6 core cards: Copper, Silver, Gold, Estate, Duchy, and Province. See Part 1 for the full description of the game and its rules.

## Learning about (Deep Reinforcement) Learning¶

OpenAI has published a very helpful tutorial site called Spinning Up in Deep Reinforcement Learning, with all the code on Github.

However, I found the learning curve to be very steep! I first needed a grounding in classical reinforcement learning, which I got from the wonderful free book Reinforcement Learning: An Introduction (Sutton & Barto, 2020). I also needed more background in neural networks, which I got from another wonderful free book, Fast.AI's Practical Deep Learning for Coders.

## General approaches to a game-playing AI¶

I have come across three basic strategies for building an AI. In each case, the AI agent observes the state of game, and has to choose a next action. What follows is a wild over-simplification, but is hopefully still accurate.

First, the agent can simulate possible future sequences of events. In a game like Tic-Tac-Toe, it's possible to do this exhaustively. In a game like chess, it's partial and probabilistic, but still very useful. Techniques include the minimax algorithm and Monte Carlo Tree Search (MCTS). These can be used alone or in conjunction with the methods below, but I haven't explored them yet.

Second, the agent can develop judgement about how favorable each game state is. This is how most classical RL algorithms work: they either learn the value of each game state, v(s), or the value of taking each possible action from a given state, q(s,a). In our formulation, the "value" is the probability of winning. Given such a value function, the agent chooses the move that is predicted to maximize its chance of winning.

Third, the agent can directly optimize the policy it uses to choose moves. Our genetic algorithm from the first post takes this approach, making random changes to the policy and selecting the best-performing variations. The simplest deep RL algorithm, Policy Gradient, does something similar. It learns a probability for each possible action as a function of the game state, and uses gradient descent to tune those probabilities to maximize its chance of winning. (Actually, it learns log-odds, which are converted to probabilities.) Some of the more sophisticated deep RL algorithms, so-called "actor-critic" methods like Proximal Policy Optimization (PPO), actually combine this approach with a second neural network that estimates the value function v(s) as above.

## The loss function¶

I find the mathematical derivation of policy gradients a little difficult to follow. But the end result is that our goal becomes maximizing the average product of:

- the log-probability of each action taken
- the outcome of that game (1 = win, 0 = loss)

In general, this makes sense: one maximizes this function by pairing high-probability actions with wins (small negative * 1) and low-probability actions with losses (large negative * 0). However, there are some pathological ways to minimize it as well: for instance, losing every game will yield a "perfect" score of zero, as will a deterministic policy that passes on every turn with probability 1.0 (log-prob = 0). I suspect this is why Spinning Up notes that improvements in this loss function are an unreliable indication of whether the policy is actually improving.

(Note that deep learning libraries are set up to *minimize* a loss function, so we actually minimize the negative of this product.)

## Choosing an action¶

Implementing the basic policy gradient algorithm from the Spinning Up sample code was straightforward.
The neural net is a plain-vanilla multi-layer perceptron (MLP) with 32 units in the hidden layer.
The core optimization step looks something like this, with state observations in `obs`

, corresponding actions in `act`

, and game outcomes (0 or 1) in `weights`

:

```
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])
optimizer = Adam(logits_net.parameters(), lr=lr)
# ... simulate some games ...
optimizer.zero_grad()
logits = logits_net(obs)
policy = Categorical(logits=logits)
logp = policy.log_prob(act)
batch_loss = -(logp * weights).mean()
batch_loss.backward()
optimizer.step()
```

Unfortunately, it didn't work at first, using the default parameters. I double-checked the code. I changed the learning rate. I increased the number of samples. I made the rewards less sparse, by awarding a small bonus for buying victory cards. But nothing worked. The policy learned to buy some Gold and a few Provinces, but only about one per game, and games still dragged on for 50 turns before timing out.

The root cause ended up being how I was choosing the action. In most RL examples on the web, the same set of actions is available from every state. But in most board games, that's not true -- including Dominion. Some "buy" moves may be unavailable because you have insufficient funds, or because a card is sold out. For my previous algorithms, I was producing a ranked list of possible algorithms, and taking the first legal one. For this one, I found a neat algorithm that produces a probability-weighted random shuffle. Unfortunately, that meant the nominal action probabilities returned by my neural network did not accurately reflect the true probability of an action being chosen, because invalid actions would be filtered out.

I tried two other approaches. First, I tried treating invalid action choices as a forfeited move. To give the algorithm a fighting chance, I supplemented my game state with 7 binary flags that indicated whether each possible move was legal from that state. Performance immediately improved -- after a few minutes it was buying 4 Provinces per game, and ending games in about 22 rounds. But it didn't learn the full "Big Money" strategy; it was only buying Provinces and Silvers.

Second, I tried masking out invalid actions before converting the output of the neural network into probabilities. I couldn't find examples of anyone doing this online, but it turned out to be remarkably easy:

```
logits = logits_net(obs)
logits[invalid_act] = -np.inf
policy = Categorical(logits=logits)
```

The `obs`

is the observed game state, `logits`

are log-odds ratios returned from the neural net, and `Categorical`

converts them to probabilities.
`invalid_act`

is a Numpy array of booleans that indicates which actions are invalid (disallowed) in the current state.
Log-odds of negative infinity translates to a probability of zero.

It's kind of amazing that PyTorch can keep up with the gradients through shenanigans like this, but it can. And this approach worked beautifully! Within a few minutes, it learned the Big Money strategy and was finishing games in 18 rounds, which is comparable to the other successful strategies from the last post.

## Improving the game state representation¶

At this point, I can test my policy gradient strategy against the simple linear rank strategy from the first part of this series. The policy gradient player fares rather badly, winning less than 20% of its games. It buys too much Gold and Silver, and not enough Duchies.

```
LinearRankStrategy wins 804.5 suicides 29 fitness 39087 game len 17,18,20 (14 - 32)
Avg 5.7 Silver (-0.001) 4.0 Gold (-0.024) 0.3 Copper (-0.000) 3.9 Duchy (-0.047) 3.8 Province (-0.082) 1.6 Estate (-0.030)
BasicPolicyGradientStrategy wins 195.5 suicides 452 fitness 32217 game len 18,19,17 (13 - 32)
Avg 4.2 Province 5.8 Gold 1.3 Duchy 7.5 Silver 0.1 Estate 0.4 END 0.1 Copper
```

Quite a lot of those losses are due to "suicide" -- taking the last Province and ending the game when behind by more than 6 points. In theory, the algorithm has enough information to avoid this. It gets the same three state variables as the Monte Carlo one did, to facilitate comparison -- game turn, score differential, and number of Provinces remaining.

```
return torch.as_tensor([t/20, s/13, p/3], dtype=torch.float)
```

From what I've read, changing the representation can make it easier for a neural net to learn.
So instead, I represented these variables as counts, progressively turning on binary flags.
This brought the strategy to about 30% wins.
I also expect strategy will change depending on whether one is ahead or behind, so I then encoded the positive `s`

and negative `s`

cases separately.
This further improved the win rate to about 40%.

```
obs = torch.zeros(20+26+4, dtype=torch.float)
obs[0:t] = 1
if s < 0:
obs[20:20+(-s)] = 1
elif s > 0:
obs[33:33+s] = 1
obs[46:46+p] = 1
return obs
```

## Fine-tuning the model¶

Although it looked as though the model had converged after about 50 epochs, it had not. Just like the earlier Monte Carlo algorithm, increasing the training time helped significantly. At 150 epochs, the model roughly matched the genetic algorithm, including relatively high "suicide" rates. At 400 epochs -- 100,000 games of self play, about 90 minutes -- the model roughly matched the previous best model, Monte Carlo with 5 million games of self play. Only at this point did the suicide rate come down, although still not as low as the MC algorithm. Further training beyond this point did not appear to help.

When algorithms are nearly matched, accurate evaluation is difficult. Even playing 10,000 games, there can be a 1-2% change in win rate from one run to another (100-200 games). I presume this is because Dominion has a significant random element. A lucky (or unlucky) initialization of the neural network also seems to contribute a little variation to the final result -- my model trained for 100k games did better than one with 500k games.

Although the "more training" answer was pretty simple, I also tried an embarassing number of other things which did NOT help performance:

- tuning the learning rate of the policy gradient algorithm (doubling or halving)
- changing the scoring scheme from 0/1 to -1/+1 (equivalently, subtracting a constant baseline from the rewards)
- 10-fold larger batches of training data per epoch
- using ReLU activations instead of Tanh in the neural network
- explicit suicide-warning features for 1 and 2 Provinces remaining
- L2 regularization of network weights using the AdamW optimizer in place of Adam

Changing the state representation to one-hot encoding didn't improve performance, but I kept it because it's more conventional in the literature.

Explicitly penalizing suicides with a reward of -0.5 instead of 0 did succeed in driving down the suicide rate, but it didn't improve the overall win rate. I suspect this is because when you're behind by 6+ points, you're highly likely to lose no matter what you do.

I also discovered a few things that seemed to have a small but beneficial impact; it's difficult to be sure though. First, I switch to the Proximal Policy Optimization (PPO) algorithm. It didn't produce a significantly better result, but it did converge faster (and probably more reliably). PPO is an actor-critic method which adds a bunch of tricks on top of our basic policy gradient approach. For instance, it trains a second neural network to predict the value of each game state (the "critic"), and uses this as a baseline for the main network. Supposedly this reduces variance and so speeds up convergence. PPO also runs multiple optimization steps on each batch of data, with early stopping criteria in place to prevent the network from going off the rails.

I used the PPO code from Spinning Up. It's significantly more "industrial strength" than the policy gradient example, but I had to make some changes to make it fit into my program:

- added support for forbidden actions (different legal actions in different game states)
- removed dependencies on MPI and the OpenAI
`gym`

- separated the PPO update from the main training loop, which I wanted to keep in my code

Some other things that seemed to help our win rate a little:

- used a deeper network, with two hidden layers of 64 units instead one layer of 32
- added features for how many of each card have already been bought
- slightly reduced learning rate for PPO policy network (3x less than default)

Here's a game between the neural network RL algorithm (PPO) and the traditional RL algorithm (Monte Carlo). They are almost perfectly matched, and make very similar decisions. MC starts buying Provinces in round 8; PPO starts in round 7. MC starts buying Duchies in round 10 and ramps up until round 13; PPO starts earlier, in round 7, but also ramps up significantly in round 13. MC has a suicide rate of 3%, while PPO is a little higher at just below 5%; but both are significantly less than the genetic algorithm, which always charges ahead blindly. Both algorithms give similar weight to Estates, Coppers, and passing (END) in the early game, consistent with the conventional wisdom that these cards are of marginal value. And both end up with similar average deck compositions at the end of their games: 4 Provinces plus about 3 Duchies, 1 or 2 Estates, 4 Gold, 6 or 7 Silver, and 1 Copper.

```
MonteCarloStrategy wins 5050.5 suicides 302 fitness 378974 game len 18,17,19 (12 - 41)
actions: not implemented
buys:
1: 9201 Silver 799 Copper
2: 9166 Silver 834 END
3: 1952 Gold 7796 Silver 250 Copper 2 END
4: 2017 Gold 7713 Silver 270 END
5: 3840 Gold 5966 Silver 187 Copper 7 END
6: 5116 Gold 4797 Silver 87 END
7: 5653 Gold 4273 Silver 69 Copper 5 END
8: 2598 Province 4291 Gold 1 Duchy 3075 Silver 6 Copper 29 END
9: 2869 Province 4376 Gold 1 Duchy 2725 Silver 1 Copper 28 END
10: 3698 Province 3957 Gold 461 Duchy 1822 Silver 7 Estate 15 Copper 40 END
11: 3883 Province 3603 Gold 1037 Duchy 1412 Silver 16 Estate 46 Copper 3 END
12: 3817 Province 3728 Gold 1388 Duchy 979 Silver 63 Estate 6 Copper 19 END
13: 3971 Province 190 Gold 4662 Duchy 690 Silver 456 Estate 22 Copper 7 END
14: 3885 Province 30 Gold 4798 Duchy 281 Silver 928 Estate 32 Copper 38 END
15: 3768 Province 146 Gold 4605 Duchy 523 Silver 756 Estate 70 Copper 41 END
16: 3389 Province 521 Gold 4343 Duchy 437 Silver 810 Estate 53 Copper 5 END
17: 2301 Province 102 Gold 4229 Duchy 153 Silver 1513 Estate 297 Copper 78 END
18: 1577 Province 166 Gold 2638 Duchy 115 Silver 2667 Estate 168 Copper 51 END
19: 1278 Province 334 Gold 1278 Duchy 297 Silver 2477 Estate 331 Copper 61 END
20: 1042 Province 323 Gold 366 Duchy 325 Silver 2189 Estate 413 Copper 153 END
21: 861 Province 206 Gold 52 Duchy 221 Silver 1756 Estate 422 Copper 138 END
22: 418 Province 114 Gold 3 Duchy 268 Silver 1180 Estate 524 Copper 141 END
23: 303 Province 113 Gold 1 Duchy 312 Silver 622 Estate 591 Copper 157 END
24: 237 Province 108 Gold 271 Silver 273 Estate 605 Copper 150 END
25: 192 Province 87 Gold 211 Silver 143 Estate 504 Copper 140 END
26: 154 Province 101 Gold 161 Silver 59 Estate 362 Copper 127 END
27: 99 Province 57 Gold 113 Silver 35 Estate 277 Copper 106 END
28: 68 Province 47 Gold 89 Silver 23 Estate 203 Copper 87 END
29: 46 Province 35 Gold 59 Silver 10 Estate 164 Copper 65 END
30: 38 Province 32 Gold 37 Silver 5 Estate 106 Copper 59 END
31: 20 Province 23 Gold 19 Silver 1 Estate 73 Copper 50 END
32: 21 Province 14 Gold 9 Silver 55 Copper 38 END
33: 7 Province 8 Gold 8 Silver 37 Copper 30 END
34: 11 Province 6 Gold 3 Silver 20 Copper 21 END
35: 7 Province 4 Gold 1 Silver 12 Copper 13 END
36: 4 Province 2 Gold 1 Silver 8 Copper 10 END
37: 1 Province 1 Gold 2 Silver 4 Copper 5 END
38: 1 Province 2 Gold 1 Silver 3 Copper 3 END
39: 1 Province 1 Gold 3 Copper
40: 1 Copper
41: 1 Province
Avg 4.1 Province 4.1 Gold 3.0 Duchy 6.4 Silver 1.6 Estate 0.7 Copper 0.3 END
Visited 908 states
PPOStrategy wins 4949.5 suicides 485 fitness 385759 game len 17,18,19 (12 - 40)
actions: not implemented
buys:
1: 9164 Silver 17 Estate 627 Copper 192 END
2: 9164 Silver 54 Estate 175 Copper 607 END
3: 1915 Gold 1 Duchy 7773 Silver 41 Estate 253 Copper 17 END
4: 1959 Gold 7767 Silver 13 Estate 87 Copper 174 END
5: 7 Province 3783 Gold 6040 Silver 14 Estate 99 Copper 57 END
6: 1 Province 5092 Gold 4808 Silver 8 Estate 40 Copper 51 END
7: 1252 Province 4334 Gold 400 Duchy 3922 Silver 43 Estate 48 Copper 1 END
8: 2695 Province 4294 Gold 615 Duchy 2359 Silver 21 Estate 16 Copper
9: 2795 Province 4341 Gold 214 Duchy 2616 Silver 16 Estate 18 Copper
10: 3424 Province 2730 Gold 2544 Duchy 1195 Silver 83 Estate 24 Copper
11: 3485 Province 3478 Gold 1495 Duchy 1430 Silver 69 Estate 43 Copper
12: 3520 Province 2939 Gold 2205 Duchy 1210 Silver 96 Estate 28 Copper 1 END
13: 3560 Province 662 Gold 4444 Duchy 790 Silver 447 Estate 95 Copper
14: 3316 Province 163 Gold 4899 Duchy 471 Silver 1066 Estate 72 Copper
15: 3290 Province 369 Gold 4644 Duchy 773 Silver 659 Estate 162 Copper
16: 3118 Province 456 Gold 4387 Duchy 563 Silver 838 Estate 164 Copper
17: 2227 Province 90 Gold 4151 Duchy 206 Silver 1715 Estate 269 Copper
18: 1501 Province 218 Gold 2775 Duchy 375 Silver 1905 Estate 554 Copper 3 END
19: 1261 Province 332 Gold 1427 Duchy 354 Silver 2273 Estate 401 Copper
20: 1066 Province 563 Gold 461 Duchy 446 Silver 1888 Estate 361 Copper 6 END
21: 849 Province 490 Gold 83 Duchy 345 Silver 1541 Estate 326 Copper 7 END
22: 470 Province 364 Gold 10 Duchy 299 Silver 1084 Estate 475 Copper 19 END
23: 357 Province 333 Gold 3 Duchy 296 Silver 519 Estate 574 Copper 40 END
24: 300 Province 297 Gold 278 Silver 235 Estate 518 Copper 39 END
25: 227 Province 260 Gold 207 Silver 105 Estate 449 Copper 36 END
26: 171 Province 185 Gold 153 Silver 62 Estate 359 Copper 30 END
27: 143 Province 132 Gold 105 Silver 41 Estate 285 Copper 17 END
28: 99 Province 103 Gold 80 Silver 13 Estate 217 Copper 16 END
29: 81 Province 66 Gold 56 Silver 7 Estate 173 Copper 9 END
30: 60 Province 63 Gold 47 Silver 6 Estate 106 Copper 4 END
31: 54 Province 44 Gold 24 Silver 2 Estate 82 Copper 3 END
32: 29 Province 33 Gold 21 Silver 57 Copper 1 END
33: 30 Province 20 Gold 15 Silver 37 Copper 2 END
34: 17 Province 14 Gold 8 Silver 23 Copper 1 END
35: 9 Province 9 Gold 5 Silver 11 Copper 4 END
36: 5 Province 4 Gold 4 Silver 11 Copper 1 END
37: 8 Province 3 Gold 1 Silver 5 Copper 1 END
38: 1 Province 5 Gold 1 Silver 2 Copper
39: 5 Province 2 Gold 1 Copper
40: 1 Province 1 Copper
Avg 3.9 Province 4.0 Gold 3.5 Duchy 6.3 Silver 1.5 Estate 0.7 Copper 0.1 END
```

At this point, I have to assume both of these models are nearly optimal for this simplified game of Dominion. That means I've met my goal of learning enough to implement an RL model from scratch (mostly) on a game of my choosing! Rather than tweak these models further, I'm ready to move on and add action cards to the game. I'll mark the final code for this post in Github, because I expect that I will do significant refactoring in the process, and these models may not unpickle with the newer code.