Training an AI for the card game Dominion

This is part 2 of my journey to build an AI agent to play the deck-building card game Dominion:

Part 1 covers (simplified) game rules, genetic algorithms, and classical reinforcement learning.
Part 2 (this) covers neural networks and deep reinforcement learning approaches.
Part 3 extends the AI to the full game of Dominion, and all the secondary decisions that entails.

For now, I continue to focus on a stripped-down version of Dominion that only uses the 6 core cards: Copper, Silver, Gold, Estate, Duchy, and Province. See Part 1 for the full description of the game and its rules.

Learning about (Deep Reinforcement) Learning

OpenAI has published a very helpful tutorial site called Spinning Up in Deep Reinforcement Learning, with all the code on GitHub.

However, I found the learning curve to be very steep! I first needed a grounding in classical reinforcement learning, which I got from the wonderful free book Reinforcement Learning: An Introduction (Sutton & Barto, 2020). I also needed more background in neural networks, which I got from another wonderful free book, Fast.AI's Practical Deep Learning for Coders.

General approaches to a game-playing AI

I have come across three basic strategies for building an AI. In each case, the AI agent observes the state of the game, and has to choose a next action. What follows is a wild over-simplification, but is hopefully still accurate.

First, the agent can simulate possible future sequences of events. In a game like Tic-Tac-Toe, it's possible to do this exhaustively. In a game like chess, it's partial and probabilistic, but still very useful. Techniques include the minimax algorithm and Monte Carlo Tree Search (MCTS). These can be used alone or in conjunction with the methods below, but I haven't explored them yet.

Second, the agent can develop judgement about how favorable each game state is. This is how most classical RL algorithms work: they either learn the value of each game state, v(s), or the value of taking each possible action from a given state, q(s,a). In our formulation, the "value" is the probability of winning. Given such a value function, the agent chooses the move that is predicted to maximize its chance of winning.

Third, the agent can directly optimize the policy it uses to choose moves. Our genetic algorithm from the first post takes this approach, making random changes to the policy and selecting the best-performing variations. The simplest deep RL algorithm, Policy Gradient, does something similar. It learns a probability for each possible action as a function of the game state, and uses gradient descent to tune those probabilities to maximize its chance of winning. (Actually, it learns log-odds, which are converted to probabilities.) Some of the more sophisticated deep RL algorithms, so-called "actor-critic" methods like Proximal Policy Optimization (PPO), actually combine this approach with a second neural network that estimates the value function v(s) as above.
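The conversion from log-odds to probabilities mentioned in the parenthetical is the softmax function. Here is a minimal, dependency-free sketch of it; the three logits are made-up values for illustration, not outputs from the actual network.

```python
import math

def softmax(logits):
    """Convert unnormalized log-odds (logits) into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three actions, e.g. Silver, Gold, END
probs = softmax([2.0, 1.0, -1.0])
print(probs)
```

Gradient descent adjusts the logits; the probabilities follow from them, so they always stay valid (non-negative and summing to one).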

The loss function

I find the mathematical derivation of policy gradients a little difficult to follow. But the end result is that our goal becomes maximizing the average product of:

the log-probability of each action taken
the outcome of that game (1 = win, 0 = loss)

In general, this makes sense: one maximizes this function by pairing high-probability actions with wins (small negative * 1) and low-probability actions with losses (large negative * 0). However, there are some pathological ways to maximize it as well: since every term is zero or negative, losing every game will yield a "perfect" score of zero, as will a deterministic policy that passes on every turn with probability 1.0 (log-prob = 0). I suspect this is why Spinning Up notes that improvements in this loss function are an unreliable indication of whether the policy is actually improving.

(Note that deep learning libraries are set up to minimize a loss function, so we actually minimize the negative of this product.)
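To make the behavior of this objective concrete, here is a small sketch that evaluates it on a few hand-made batches. The function name and all of the probabilities are hypothetical; the point is only to show the sensible case and the two degenerate cases side by side.

```python
import math

def pg_objective(action_probs, outcomes):
    """Average of log-prob(action taken) * game outcome over all recorded decisions."""
    terms = [math.log(p) * w for p, w in zip(action_probs, outcomes)]
    return sum(terms) / len(terms)

# Likely actions in won games, an unlikely action in a lost game
good = pg_objective([0.9, 0.9, 0.1], [1, 1, 0])
# Unlikely actions in won games, a likely action in a lost game
bad = pg_objective([0.1, 0.1, 0.9], [1, 1, 0])
# Degenerate case 1: every game is lost, so every term is multiplied by 0
all_losses = pg_objective([0.5, 0.5, 0.5], [0, 0, 0])
# Degenerate case 2: a deterministic policy, so every log-prob is log(1.0) = 0
always_pass = pg_objective([1.0, 1.0, 1.0], [0, 1, 0])

print(good, bad, all_losses, always_pass)
```

The "good" batch scores higher than the "bad" one, as intended, but both degenerate batches reach the upper bound of zero without the policy being any good.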

Choosing an action

Implementing the basic policy gradient algorithm from the Spinning Up sample code was straightforward. The neural net is a plain-vanilla multi-layer perceptron (MLP) with 32 units in the hidden layer. The core optimization step looks something like this, with state observations in obs, corresponding actions in act, and game outcomes (0 or 1) in weights:

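from torch.distributions.categorical import Categorical
from torch.optim import Adam

# mlp() is the helper from the Spinning Up sample code; the net outputs one logit per action
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])
optimizer = Adam(logits_net.parameters(), lr=lr)
# ... simulate some games ...
optimizer.zero_grad()
logits = logits_net(obs)
policy = Categorical(logits=logits)      # converts logits to action probabilities
logp = policy.log_prob(act)              # log-probability of each action actually taken
batch_loss = -(logp * weights).mean()    # negated because the optimizer minimizes
batch_loss.backward()
optimizer.step()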

Unfortunately, it didn't work at first, using the default parameters. I double-checked the code. I changed the learning rate. I increased the number of samples. I made the rewards less sparse, by awarding a small bonus for buying victory cards. But nothing worked. The policy learned to buy some Gold and a few Provinces, but only about one per game, and games still dragged on for 50 turns before timing out.

The root cause ended up being how I was choosing the action. In most RL examples on the web, the same set of actions is available from every state. But in most board games, that's not true -- including Dominion. Some "buy" moves may be unavailable because you have insufficient funds, or because a card is sold out. For my previous algorithms, I was producing a ranked list of possible actions, and taking the first legal one. For this one, I found a neat algorithm that produces a probability-weighted random shuffle. Unfortunately, that meant the nominal action probabilities returned by my neural network did not accurately reflect the true probability of an action being chosen, because invalid actions would be filtered out.
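To see the mismatch in numbers: assuming the shuffle behaves like drawing actions one at a time in proportion to their weights, the first legal action in the shuffled order is distributed according to the weights renormalized over the legal actions only. The helper name and the probabilities below are made up for illustration.

```python
def effective_probs(nominal, legal):
    """Probability of each action actually being chosen once illegal ones are skipped."""
    total = sum(p for p, ok in zip(nominal, legal) if ok)
    return [p / total if ok else 0.0 for p, ok in zip(nominal, legal)]

nominal = [0.5, 0.3, 0.2]       # hypothetical network output: Province, Silver, END
legal = [False, True, True]     # Province is unaffordable this turn

actual = effective_probs(nominal, legal)
print(actual)   # Silver is chosen far more often than its nominal 0.3 suggests
```

The loss function is computed from the nominal log-probabilities, so the gradient ends up describing a different policy from the one that actually played the game.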

I tried two other approaches. First, I tried treating invalid action choices as a forfeited move. To give the algorithm a fighting chance, I supplemented my game state with 7 binary flags that indicated whether each possible move was legal from that state. Performance immediately improved -- after a few minutes it was buying 4 Provinces per game, and ending games in about 22 rounds. But it didn't learn the full "Big Money" strategy; it was only buying Provinces and Silvers.
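As an illustration of the legality flags, here is a minimal sketch of how such a mask could be computed and appended to the state. The card costs are the standard Dominion prices; the function names, the action ordering, the supply counts, and the plain-list state are assumptions for this example rather than details of the project code.

```python
# Six cards plus END (pass). The order is an arbitrary choice for this sketch.
ACTIONS = ["Copper", "Silver", "Gold", "Estate", "Duchy", "Province", "END"]
COSTS = {"Copper": 0, "Silver": 3, "Gold": 6, "Estate": 2, "Duchy": 5, "Province": 8}

def legal_mask(coins, supply):
    """Return one boolean per action: True if the move is legal right now."""
    mask = []
    for name in ACTIONS:
        if name == "END":
            mask.append(True)   # passing is always allowed
        else:
            mask.append(coins >= COSTS[name] and supply.get(name, 0) > 0)
    return mask

def with_legal_flags(state_features, mask):
    """Append the 7 legality flags (as 0.0 / 1.0) to the numeric state features."""
    return list(state_features) + [1.0 if ok else 0.0 for ok in mask]

# Hypothetical situation: 5 coins in hand, Provinces sold out
supply = {"Copper": 46, "Silver": 40, "Gold": 30, "Estate": 8, "Duchy": 8, "Province": 0}
mask = legal_mask(coins=5, supply=supply)
features = with_legal_flags([0.25, 0.0, 1.0], mask)
print(mask)
print(features)
```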

Second, I tried masking out invalid actions before converting the output of the neural network into probabilities. I couldn't find examples of anyone doing this online, but it turned out to be remarkably easy:

logits = logits_net(obs)
logits[invalid_act] = -np.inf
policy = Categorical(logits=logits)

Here obs is the observed game state, logits are the log-odds returned from the neural net, and Categorical converts them to probabilities. invalid_act is a NumPy array of booleans that indicates which actions are invalid (disallowed) in the current state. A log-odds of negative infinity translates to a probability of zero.
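A self-contained way to check that the masking behaves as described is sketched below. It is a variation, not the exact code above: it uses a single linear layer as a stand-in for the MLP, a boolean torch tensor for the mask, and the out-of-place masked_fill instead of in-place assignment. The sizes and mask values are arbitrary.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(0)
logits_net = torch.nn.Linear(3, 7)     # stand-in for the real policy network
obs = torch.tensor([0.5, 0.0, 1.0])    # made-up observation
invalid_act = torch.tensor([False, False, True, False, False, True, False])

# Replace the logits of invalid actions with -inf before building the distribution
logits = logits_net(obs).masked_fill(invalid_act, float("-inf"))
policy = Categorical(logits=logits)

act = policy.sample()                  # never lands on a masked action
loss = -policy.log_prob(act)           # same form as the policy gradient loss
loss.backward()

print(policy.probs)                    # masked entries are exactly zero
print(logits_net.weight.grad)          # gradients are finite
```

One caveat: computing the entropy of such a distribution can produce NaN, because it multiplies a probability of zero by a log-probability of negative infinity.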

It's kind of amazing that PyTorch can keep up with the gradients through shenanigans like this, but it can. And this approach worked beautifully! Within a few minutes, it learned the Big Money strategy and was finishing games in 18 rounds, which is comparable to the other successful strategies from the last post.

Improving the game state representation

At this point, I can test my policy gradient strategy against the simple linear rank strategy from the first part of this series. The policy gradient player fares rather badly, winning less than 20% of its games. It buys too much Gold and Silver, and not enough Duchies.

LinearRankStrategy    wins 804.5    suicides 29    fitness 39087    game len 17,18,20 (14 - 32)
    Avg   5.7 Silver (-0.001)   4.0 Gold (-0.024)   0.3 Copper (-0.000)   3.9 Duchy (-0.047)   3.8 Province (-0.082)   1.6 Estate (-0.030)
BasicPolicyGradientStrategy    wins 195.5    suicides 452    fitness 32217    game len 18,19,17 (13 - 32)
    Avg   4.2 Province   5.8 Gold   1.3 Duchy   7.5 Silver   0.1 Estate   0.4 END   0.1 Copper

Quite a lot of those losses are due to "suicide" -- taking the last Province and ending the game when behind by more than 6 points. In theory, the algorithm has enough information to avoid this. It gets the same three state variables as the Monte Carlo one did, to facilitate comparison -- game turn, score differential, and number of Provinces remaining.

return torch.as_tensor([t/20, s/13, p/3], dtype=torch.float)

From what I've read, changing the representation can make it easier for a neural net to learn. So instead, I represented these variables as counts, progressively turning on binary flags. This brought the strategy to about 30% wins. I also expect strategy will change depending on whether one is ahead or behind, so I then encoded the positive s and negative s cases separately. This further improved the win rate to about 40%.

obs = torch.zeros(20+26+4, dtype=torch.float)
obs[0:t] = 1
if s < 0:
    obs[20:20+(-s)] = 1
elif s > 0:
    obs[33:33+s] = 1
obs[46:46+p] = 1
return obs
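As written, the slices above rely on t being at most 20, s being between -13 and 13, and p being at most 4; a larger t, for example, would spill into the score flags. The sketch below wraps the same layout in a function and clamps each value first. The clamping and the function name are assumptions, and it returns a plain list (which could be passed to torch.as_tensor) to keep the example free of dependencies.

```python
def encode_state(t, s, p):
    """Thermometer-style encoding: 20 turn flags, 13 'behind' flags,
    13 'ahead' flags, and 4 Province flags (50 values in total)."""
    t = max(0, min(t, 20))       # assumed clamp: later turns look like turn 20
    s = max(-13, min(s, 13))     # assumed clamp on the score differential
    p = max(0, min(p, 4))        # assumed clamp on Provinces remaining
    obs = [0.0] * (20 + 26 + 4)
    obs[0:t] = [1.0] * t
    if s < 0:
        obs[20:20 - s] = [1.0] * (-s)    # behind by -s points
    elif s > 0:
        obs[33:33 + s] = [1.0] * s       # ahead by s points
    obs[46:46 + p] = [1.0] * p
    return obs

example = encode_state(t=3, s=-2, p=1)
print(sum(example))   # number of flags switched on
```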

Fine-tuning the model

Although it looked as though the model had converged after about 50 epochs, it had not. Just like the earlier Monte Carlo algorithm, increasing the training time helped significantly. At 150 epochs, the model roughly matched the genetic algorithm, including relatively high "suicide" rates. At 400 epochs -- 100,000 games of self play, about 90 minutes -- the model roughly matched the previous best model, Monte Carlo with 5 million games of self play. Only at this point did the suicide rate come down, although still not as low as the MC algorithm. Further training beyond this point did not appear to help.

When algorithms are nearly matched, accurate evaluation is difficult. Even playing 10,000 games, there can be a 1-2% change in win rate from one run to another (100-200 games). I presume this is because Dominion has a significant random element. A lucky (or unlucky) initialization of the neural network also seems to contribute a little variation to the final result -- my model trained for 100k games did better than one with 500k games.
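A back-of-the-envelope calculation supports this. If each game is treated as an independent coin flip between evenly matched players (an approximation that ignores ties), the standard error of a measured win rate is sqrt(p(1-p)/n). The function name below is just for illustration.

```python
import math

def win_rate_std_error(n_games, p=0.5):
    """Standard error of an observed win rate, treating games as independent coin flips."""
    return math.sqrt(p * (1.0 - p) / n_games)

se = win_rate_std_error(10_000)
print(f"standard error: {se:.4f}")              # about half a percentage point
print(f"~95% interval: +/- {1.96 * se:.4f}")    # about one percentage point
```

With 10,000 games, chance alone moves the measured win rate by roughly a percentage point in either direction, so run-to-run swings of 100-200 games are about what one would expect from sampling noise.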

Although the "more training" answer was pretty simple, I also tried an embarrassing number of other things which did NOT help performance:

tuning the learning rate of the policy gradient algorithm (doubling or halving)
changing the scoring scheme from 0/1 to -1/+1 (equivalently, subtracting a constant baseline from the rewards)
10-fold larger batches of training data per epoch
using ReLU activations instead of Tanh in the neural network
explicit suicide-warning features for 1 and 2 Provinces remaining
L2 regularization of network weights using the AdamW optimizer in place of Adam
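The equivalence noted in the second item is simple arithmetic: mapping outcomes from 0/1 to -1/+1 is the same as subtracting a constant baseline of 0.5 and then doubling. A small check, using made-up outcomes:

```python
outcomes = [1, 0, 0, 1, 1]                     # hypothetical game results (1 = win)

rescaled = [2 * r - 1 for r in outcomes]       # the -1/+1 scoring scheme
baselined = [r - 0.5 for r in outcomes]        # 0/1 minus a constant baseline of 0.5

# The two weightings differ only by a constant factor of 2
same_up_to_scale = all(a == 2 * b for a, b in zip(rescaled, baselined))
print(rescaled, baselined, same_up_to_scale)
```

The factor of 2 only scales the gradient, which acts much like a change in learning rate.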

Changing the state representation to one-hot encoding didn't improve performance, but I kept it because it's more conventional in the literature.

Explicitly penalizing suicides with a reward of -0.5 instead of 0 did succeed in driving down the suicide rate, but it didn't improve the overall win rate. I suspect this is because when you're behind by 6+ points, you're highly likely to lose no matter what you do.
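In code, the modified reward might look something like the sketch below. The function names and arguments are hypothetical; the definition of a suicide (taking the last Province while behind by more than the 6 points it is worth) follows the description earlier in this post.

```python
PROVINCE_POINTS = 6

def is_suicide(buys_last_province, score_diff):
    """True if the buy ends the game while still behind after gaining the Province.
    score_diff is the player's score minus the opponent's, before the buy."""
    return buys_last_province and score_diff < -PROVINCE_POINTS

def game_reward(won, suicided, suicide_penalty=-0.5):
    """1 for a win, 0 for an ordinary loss, and a penalty for a self-inflicted loss."""
    if won:
        return 1.0
    return suicide_penalty if suicided else 0.0

print(game_reward(won=False, suicided=is_suicide(True, -9)))   # penalized loss
print(game_reward(won=False, suicided=is_suicide(True, -3)))   # ordinary loss
```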

I also discovered a few things that seemed to have a small but beneficial impact; it's difficult to be sure though. First, I switched to the Proximal Policy Optimization (PPO) algorithm. It didn't produce a significantly better result, but it did converge faster (and probably more reliably). PPO is an actor-critic method which adds a bunch of tricks on top of our basic policy gradient approach. For instance, it trains a second neural network to predict the value of each game state (the "critic"), and uses this as a baseline for the main network. Supposedly this reduces variance and so speeds up convergence. PPO also runs multiple optimization steps on each batch of data, with early stopping criteria in place to prevent the network from going off the rails.
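The heart of PPO is a "clipped" version of the policy gradient loss. The sketch below follows the general shape of the Spinning Up implementation, but it is a simplified illustration with made-up inputs: adv stands for the advantage (roughly, the outcome minus the critic's prediction), and clip_ratio=0.2 is used here as a typical value.

```python
import torch

def ppo_policy_loss(logp, logp_old, adv, clip_ratio=0.2):
    """Clipped surrogate loss plus a rough KL estimate that can drive early stopping."""
    ratio = torch.exp(logp - logp_old)                    # new prob / old prob
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    loss = -torch.min(ratio * adv, clipped).mean()        # negated: we minimize
    approx_kl = (logp_old - logp).mean().item()
    return loss, approx_kl

# Two made-up decisions: the first action became more likely under the new policy
logp_old = torch.log(torch.tensor([0.50, 0.25]))
logp = torch.log(torch.tensor([0.75, 0.25]))
adv = torch.tensor([1.0, -1.0])

loss, kl = ppo_policy_loss(logp, logp_old, adv)
print(loss.item(), kl)
```

The clipping removes any incentive to move an action's probability more than about 20% away from its old value within one batch, which is what makes it safe to take several optimization steps on the same data. In Spinning Up's version, the inner loop also stops early once the estimated KL exceeds 1.5 times a target value.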

I used the PPO code from Spinning Up. It's significantly more "industrial strength" than the policy gradient example, but I had to make some changes to make it fit into my program:

added support for forbidden actions (different legal actions in different game states)
removed dependencies on MPI and the OpenAI gym
separated the PPO update from the main training loop, which I wanted to keep in my code

Some other things that seemed to help our win rate a little:

used a deeper network, with two hidden layers of 64 units instead of one layer of 32
added features for how many of each card have already been bought
slightly reduced learning rate for PPO policy network (3x less than default)
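For reference, the two-hidden-layer network from the first item can be built with an mlp helper in the style of the Spinning Up examples. The input size of 50 matches the flag encoding shown earlier and is only a placeholder (the extra card-count features would enlarge it); the 7 outputs correspond to the six cards plus END.

```python
import torch
import torch.nn as nn

def mlp(sizes, activation=nn.Tanh, output_activation=nn.Identity):
    """Stack Linear layers with the given activation; no activation on the output."""
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act()]
    return nn.Sequential(*layers)

obs_dim, n_acts = 50, 7                      # placeholder sizes for this sketch
logits_net = mlp([obs_dim, 64, 64, n_acts])  # two hidden layers of 64 units

logits = logits_net(torch.zeros(obs_dim))    # one logit per possible action
print(logits.shape)
```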

Here are the results of 10,000 games between the neural network RL algorithm (PPO) and the traditional RL algorithm (Monte Carlo). They are almost perfectly matched, and make very similar decisions. MC starts buying Provinces in round 8; PPO starts in round 7. MC starts buying Duchies in round 10 and ramps up until round 13; PPO starts earlier, in round 7, but also ramps up significantly in round 13. MC has a suicide rate of 3%, while PPO is a little higher at just below 5%; but both are significantly less than the genetic algorithm, which always charges ahead blindly. Both algorithms give similar weight to Estates, Coppers, and passing (END) in the early game, consistent with the conventional wisdom that these cards are of marginal value. And both end up with similar average deck compositions at the end of their games: 4 Provinces plus about 3 Duchies, 1 or 2 Estates, 4 Gold, 6 or 7 Silver, and 1 Copper.

MonteCarloStrategy    wins 5050.5    suicides 302    fitness 378974    game len 18,17,19 (12 - 41)
  actions: not implemented
  buys:
     1:   9201 Silver   799 Copper
     2:   9166 Silver   834 END
     3:   1952 Gold   7796 Silver   250 Copper   2 END
     4:   2017 Gold   7713 Silver   270 END
     5:   3840 Gold   5966 Silver   187 Copper   7 END
     6:   5116 Gold   4797 Silver   87 END
     7:   5653 Gold   4273 Silver   69 Copper   5 END
     8:   2598 Province   4291 Gold   1 Duchy   3075 Silver   6 Copper   29 END
     9:   2869 Province   4376 Gold   1 Duchy   2725 Silver   1 Copper   28 END
    10:   3698 Province   3957 Gold   461 Duchy   1822 Silver   7 Estate   15 Copper   40 END
    11:   3883 Province   3603 Gold   1037 Duchy   1412 Silver   16 Estate   46 Copper   3 END
    12:   3817 Province   3728 Gold   1388 Duchy   979 Silver   63 Estate   6 Copper   19 END
    13:   3971 Province   190 Gold   4662 Duchy   690 Silver   456 Estate   22 Copper   7 END
    14:   3885 Province   30 Gold   4798 Duchy   281 Silver   928 Estate   32 Copper   38 END
    15:   3768 Province   146 Gold   4605 Duchy   523 Silver   756 Estate   70 Copper   41 END
    16:   3389 Province   521 Gold   4343 Duchy   437 Silver   810 Estate   53 Copper   5 END
    17:   2301 Province   102 Gold   4229 Duchy   153 Silver   1513 Estate   297 Copper   78 END
    18:   1577 Province   166 Gold   2638 Duchy   115 Silver   2667 Estate   168 Copper   51 END
    19:   1278 Province   334 Gold   1278 Duchy   297 Silver   2477 Estate   331 Copper   61 END
    20:   1042 Province   323 Gold   366 Duchy   325 Silver   2189 Estate   413 Copper   153 END
    21:   861 Province   206 Gold   52 Duchy   221 Silver   1756 Estate   422 Copper   138 END
    22:   418 Province   114 Gold   3 Duchy   268 Silver   1180 Estate   524 Copper   141 END
    23:   303 Province   113 Gold   1 Duchy   312 Silver   622 Estate   591 Copper   157 END
    24:   237 Province   108 Gold   271 Silver   273 Estate   605 Copper   150 END
    25:   192 Province   87 Gold   211 Silver   143 Estate   504 Copper   140 END
    26:   154 Province   101 Gold   161 Silver   59 Estate   362 Copper   127 END
    27:   99 Province   57 Gold   113 Silver   35 Estate   277 Copper   106 END
    28:   68 Province   47 Gold   89 Silver   23 Estate   203 Copper   87 END
    29:   46 Province   35 Gold   59 Silver   10 Estate   164 Copper   65 END
    30:   38 Province   32 Gold   37 Silver   5 Estate   106 Copper   59 END
    31:   20 Province   23 Gold   19 Silver   1 Estate   73 Copper   50 END
    32:   21 Province   14 Gold   9 Silver   55 Copper   38 END
    33:   7 Province   8 Gold   8 Silver   37 Copper   30 END
    34:   11 Province   6 Gold   3 Silver   20 Copper   21 END
    35:   7 Province   4 Gold   1 Silver   12 Copper   13 END
    36:   4 Province   2 Gold   1 Silver   8 Copper   10 END
    37:   1 Province   1 Gold   2 Silver   4 Copper   5 END
    38:   1 Province   2 Gold   1 Silver   3 Copper   3 END
    39:   1 Province   1 Gold   3 Copper
    40:   1 Copper
    41:   1 Province
    Avg   4.1 Province   4.1 Gold   3.0 Duchy   6.4 Silver   1.6 Estate   0.7 Copper   0.3 END
    Visited 908 states
PPOStrategy    wins 4949.5    suicides 485    fitness 385759    game len 17,18,19 (12 - 40)
  actions: not implemented
  buys:
     1:   9164 Silver   17 Estate   627 Copper   192 END
     2:   9164 Silver   54 Estate   175 Copper   607 END
     3:   1915 Gold   1 Duchy   7773 Silver   41 Estate   253 Copper   17 END
     4:   1959 Gold   7767 Silver   13 Estate   87 Copper   174 END
     5:   7 Province   3783 Gold   6040 Silver   14 Estate   99 Copper   57 END
     6:   1 Province   5092 Gold   4808 Silver   8 Estate   40 Copper   51 END
     7:   1252 Province   4334 Gold   400 Duchy   3922 Silver   43 Estate   48 Copper   1 END
     8:   2695 Province   4294 Gold   615 Duchy   2359 Silver   21 Estate   16 Copper
     9:   2795 Province   4341 Gold   214 Duchy   2616 Silver   16 Estate   18 Copper
    10:   3424 Province   2730 Gold   2544 Duchy   1195 Silver   83 Estate   24 Copper
    11:   3485 Province   3478 Gold   1495 Duchy   1430 Silver   69 Estate   43 Copper
    12:   3520 Province   2939 Gold   2205 Duchy   1210 Silver   96 Estate   28 Copper   1 END
    13:   3560 Province   662 Gold   4444 Duchy   790 Silver   447 Estate   95 Copper
    14:   3316 Province   163 Gold   4899 Duchy   471 Silver   1066 Estate   72 Copper
    15:   3290 Province   369 Gold   4644 Duchy   773 Silver   659 Estate   162 Copper
    16:   3118 Province   456 Gold   4387 Duchy   563 Silver   838 Estate   164 Copper
    17:   2227 Province   90 Gold   4151 Duchy   206 Silver   1715 Estate   269 Copper
    18:   1501 Province   218 Gold   2775 Duchy   375 Silver   1905 Estate   554 Copper   3 END
    19:   1261 Province   332 Gold   1427 Duchy   354 Silver   2273 Estate   401 Copper
    20:   1066 Province   563 Gold   461 Duchy   446 Silver   1888 Estate   361 Copper   6 END
    21:   849 Province   490 Gold   83 Duchy   345 Silver   1541 Estate   326 Copper   7 END
    22:   470 Province   364 Gold   10 Duchy   299 Silver   1084 Estate   475 Copper   19 END
    23:   357 Province   333 Gold   3 Duchy   296 Silver   519 Estate   574 Copper   40 END
    24:   300 Province   297 Gold   278 Silver   235 Estate   518 Copper   39 END
    25:   227 Province   260 Gold   207 Silver   105 Estate   449 Copper   36 END
    26:   171 Province   185 Gold   153 Silver   62 Estate   359 Copper   30 END
    27:   143 Province   132 Gold   105 Silver   41 Estate   285 Copper   17 END
    28:   99 Province   103 Gold   80 Silver   13 Estate   217 Copper   16 END
    29:   81 Province   66 Gold   56 Silver   7 Estate   173 Copper   9 END
    30:   60 Province   63 Gold   47 Silver   6 Estate   106 Copper   4 END
    31:   54 Province   44 Gold   24 Silver   2 Estate   82 Copper   3 END
    32:   29 Province   33 Gold   21 Silver   57 Copper   1 END
    33:   30 Province   20 Gold   15 Silver   37 Copper   2 END
    34:   17 Province   14 Gold   8 Silver   23 Copper   1 END
    35:   9 Province   9 Gold   5 Silver   11 Copper   4 END
    36:   5 Province   4 Gold   4 Silver   11 Copper   1 END
    37:   8 Province   3 Gold   1 Silver   5 Copper   1 END
    38:   1 Province   5 Gold   1 Silver   2 Copper
    39:   5 Province   2 Gold   1 Copper
    40:   1 Province   1 Copper
    Avg   3.9 Province   4.0 Gold   3.5 Duchy   6.3 Silver   1.5 Estate   0.7 Copper   0.1 END

At this point, I have to assume both of these models are nearly optimal for this simplified game of Dominion. That means I've met my goal of learning enough to implement an RL model from scratch (mostly) on a game of my choosing! Rather than tweak these models further, I'm ready to move on and add action cards to the game. I'll mark the final code for this post on GitHub, because I expect that I will do significant refactoring in the process, and these models may not unpickle with the newer code.