# Training an AI for the card game Dominion¶

This is part 3 of my journey to build an AI agent to play the deck-building card game Dominion:

- Part 1 covers (simplified) game rules, genetic algorithms, and classical reinforcement learning.
- Part 2 covers neural networks and deep reinforcement learning approaches.
- Part 3 (this) extends the AI to the full game of Dominion, and all the secondary decisions that entails.

For parts 1 & 2, I focused on a stripped-down version of Dominion that only uses the 6 core cards: Copper, Silver, Gold, Estate, Duchy, and Province. See Part 1 for the full description of the game and its rules.

## Lights, Camera, Actions!!¶

In this part, I add in Action cards to arrive at the full game of Dominion. Each game of Dominion includes 10 types of Action cards (the "kingdom"), drawn from a large pool of possibilities. Thus, the game is different every time, and the optimal strategy depends on which Actions are available. I should note that I am working from the first edition of base Dominion. The second edition substituted several Actions, and over the years many expansion packs have added hundreds of other Actions.

Because buying Actions is an alternative to buying one of the core cards, our existing GA and RL strategies need little modification to make the "buy" decisions. However, the presence of Actions introduces a host of additional decisions. First, if a hand contains multiple Actions, which one(s) should be played, and in what order? Second, many Actions involve secondary decisions such as gaining, discarding, or trashing other cards.

Adapting our existing strategies to make these secondary decisions is harder. In the worst case, each new card might necessitate a new model to learn its strategy. Furthermore, these decisions arise only rarely during a game, so we might need 10x to 1000x more training data to see enough examples of each decision.

I agree with many others who have worked on this game that the "right" approach here is probably some kind of Monte Carlo tree search (MCTS), coupled with a value function to evaluate the rollouts. The AlphaGo paper describes how they used a similar approach in their engine. In Dominion, the branching factor is low, which is good for efficiency. However, the current order of the deck is hidden state, so the rollouts need to sample from many possible shuffles.

## Heuristics for the win¶

In practice, implementing MCTS correctly seems like a lot of additional bookkeeping code. Like most other authors, then, I've gone the lazy route and opted for some fairly simple heuristics. Because these decisions are fairly rare, they generally have a minor impact on a strategy's success. The decision of what, when, and how many cards to buy seems to be the dominant factor in winning or losing.

One simple heuristic is that non-terminal Actions (e.g. those that grant additional actions this turn) should be played prior to terminal Actions. This already solves the problem for many hands that contain multiple Actions, since playing as many Actions as possible is generally to your advantage. (After all, you did spend resources to buy them, so presumably you intended to play them.) If there are still more Actions than can be played, I currently just choose at random.

Both our genetic algorithm (GA) and reinforcement learning (RL) algorithm allow us to rank-order all cards by our preference for buying them. That choice of representation for the "buy" strategy was fortuitous, because we can re-use that model to inform many secondary decisions for Actions.

For cards which allow us to gain additional cards, we can choose based directly on our buying preference. This includes cards like Feast, Remodel, and Workshop.

For cards which force us to discard, we can choose based on a modified buying preference. We assume the cards we most prefer to buy are also the cards we most prefer to retain in our hand, with the exception of victory cards, which are important to acquire but useless during gameplay. For instance, Militia triggers this kind of decision. Cellar also uses this logic, but compares the preference of each card in our hand to the average (expectation) preference of a card drawn from the deck, since we know the deck's contents but not its order.

For cards which allow us to trash cards, we can treat it as a sort of "negative buy". Our ranking includes a special END pseudo-card; cards ranked lower than END are cards we will not buy even if we have nothing else to spend the resources on. That is, the algorithms believe that acquiring such cards actually reduce our chances of winning. Thus, those cards are the ones we should trash. In base Dominion, Chapel is the card where we use this strategy.

Finally, there are several cards that use a simple hand-crafted heuristic. For example, we always play Money Lender if possible, assuming we bought it to use it. Likewise, we always reveal our Moat to defend against attacks.

The one card I have not yet implemented is Throne Room. In part, this is because my code would require significant refactoring to implement it. However, I also feel like it might need a much more sophisticated strategy to play it well, and so I might not learn much from implementing it with a simple heuristic like random choice.

## Strategies for recommended kingdoms¶

In the instructions, Dominion includes suggestions for several kingdoms that highlight interesting strategies. Three of these do not require Throne Room, and so can be evaluated here. The following compares the strategies learned by the genetic algoritm (GA) and the deep reinforcement learning algorithm (Proximal Policy Optimization, PPO).

### First Game¶

Kingdom: Cellar, Market, Militia, Mine, Moat, Remodel, Smithy, Village, Woodcutter, Workshop

The GA prefers 4-5 Militia plus 3-4 Market. After that it follows a standard "big money" strategy, buying Silver to get Gold to get Provinces (and Duchies opportunistically).

The tabular Monte Carlo strategy learns Militia/Market quickly, the same as the GA, but even after many hours of training (3 Militia, 3 Market) it loses 54% of the time to the GA. (Although it has less than half the suicide rate.)

The deep RL strategy (PPO) learns to buy out Militia, Duchy, and Estate. Although it buys 3+ Provinces, it appears not to reliably end the game via Provinces. It loses 85-93% of the time to the GA. Depending on the initialization and on tweaks to the algorithm, it sometimes buys 1-2 Moat and/or 2-3 Market.

I find it interesting that although Moat is available, attacking with Militia is worthwhile but defending with Moat apparently is not. If Militia is excluded from the kingdom, the GA falls back to 1 Smithy plus "big money", a well-known but effective strategy.

### Interaction¶

Kingdom: Bureaucrat, Chancellor, CouncilRoom, Festival, Library, Militia, Moat, Spy, Thief, Village

The GA learns 2 Libraries plus 1 Militia, then "big money". PPO learns 3 Militia, 2 Festival instead. It loses to the GA 70% of the time.

It looks like Festival may be allowing double Militia plays on some turns, which doesn't hurt opponents, but is useful for buying since both Festival and Militia are worth +2 money.
However, Library is most useful because the *opponent* is playing Militia: the player with Library can discard 2 victory cards, then draw 4 new cards (even better than Smithy).
Against an opponent who was not playing Militia, presumably Library would be less valuable (draw 2, like Moat, Laboratory, or Witch).

If Militia is excluded from the kingdom, GA learns 1 Library plus 1 Bureaucrat, then "big money". If Library is excluded from the kingdom, GA learns 2-3 Festival, 3-4 Militia, and 2 Moat.

### Size Distortion¶

Kingdom: Cellar, Chapel, Feast, Gardens, Laboratory, Thief, Village, Witch, Woodcutter, Workshop

The GA usually learns 2 Workshop, 1-2 Feast. It uses Workshop to acquire Gardens, Silvers, and Estates, and wins by buying out Gardens, Estates, and Duchies. It still buy Provinces and Gold when possible, switching from Gold to Gardens after about turn 10. This game runs slightly longer than the basic 6-card game.

With a lucky initialization, the GA learns 1 Witch, 1 Workshop, maybe 1 Cellar. It buys a similar mix of victory cards, but beats the Workshop/Feast strategy by a large margin.

PPO reliably learns a strategy based on 1-2 Witches, but not as good of one as the GA does. The PPO's Witch-based strategy is an even match for the GA's Workshop/Feast strategy.

## No perfect algorithms¶

I was frustrated because I wanted the "advanced" neural network algorithm to beat out the simple-minded GA.
In theory, PPO has much more context about the game state and a much more flexible policy representation.
In practice, though, this is an optimization process on a very rugged, high-dimensional function, and it appears that both the GA and PPO algorithms get stuck in local optima.
They learn *good* strategies, but not *perfect* strategies, and that's to be expected.

I spent a lot of time on this part of the project, and there's still a lot I could try. Honestly, though, I need to stop now -- it's becoming a bit unhealthy. What follows are some notes on what did and did not seem to affect the performance of the algorithms.

### Genetic algorithm¶

The genetic algorithm required little modification when moving from the 6-card game to the full game of Dominion. The main change I made was to limit the number of Actions that any one strategy considers initially. Otherwise with 10 Actions, just by numbers the Actions tend to crowd out the fundamental Treasure and Victory cards, so that few random seeds are viable. By strongly deprioritizing all but 4 Actions when a new strategy is generated, a much larger fraction are viable. As the strategy evolves, there is an opportunity for other Actions to be introduced if they are advantageous.

I tried using CMA-ES
to fine-tune solutions from the GA, though to no avail.
CMA is similar to Particle Swarm algorithms, though more mathematically sophisticated.
I believe it's not well suited to multi-agent RL problems because the strength of a strategy depends on the pool of opponents it is playing against, which typically evolves over time.
By keeping the pool of opponents fixed, it *should* be possible to use, but it doesn't seem to help.

I also explored techniques to maintain diversity in the GA population. Speciation using k-means inspired by NEAT seemed to hurt performance in my hands. Simply splitting the population into 4 smaller, independent pools seemed to slightly improve performance by maintaining more diversity overall; I merged all the pools together for the final evaluation.

### Deep reinforcement learning¶

PPO under-performed the genetic algorithm, despite all its theoretical advantages. It also takes at least 90 minutes to train a new kingdom on my laptop, which limits the number of experiments I could reasonably do.

Things I tried that didn't seem to help: - simplifying the "First Game" kingdom to just Militia, Market, and Moat (no effect) - more training time! (to the limit of my patience, about 12 hours) - more parameters (2x more and 2x wider hidden layers) - adding epsilon-greedy modification to PPO to encourage exploration (no obvious effect) - weight decay with AdamW in PPO (slower convergence to same result) - adding entropy term to PPO to encourage exploration, per some papers (no obvious effect) - training against an evolving pool of GA agents in addition to training against another PPO network (no obvious effect)

Reducing `gamma`

(the rewards discount factor) from 0.99 to 0.95 did cause PPO to win games more quickly.
However, it won fewer of them!
This makes sense in retrospect: stronger discounting encourages the algorithm to win fast, even if it has to win slightly less often.
On the other hand, increasing `gamma`

to 1.0 did not produce noticeable changes.
Theoretically, it should encourage the algorithm to only care about winning, with no preference for short versus long games.

Most RL papers seem not to mask out illegal actions, though they may include a negative reward for choosing them. When I allowed PPO to choose illegal actions (treating them as a no-op), it helped learn better strategies for Size Distortion, though it never became efficient enough to challenge the GA. I didn't try negative rewards, though I did try just reducing the probability of illegal actions instead of masking them completely. Subtracting a small fixed quantity from the network's log-probability outputs was sufficient to ensure illegal actions were almost never chosen. However, it avoids introducing zeros or infinities that may have been hurting the optimization process.

(Despite the name "logits" in PyTorch, these are log-probabilities, not log-odds.)

```
logits = self.logits_net(obs)
fbn = torch.as_tensor(fbn, dtype=torch.bool)
logits[fbn] -= 10 # reduce probability to near zero
pi = Categorical(logits=logits)
```

I also found that I could increase the learning rate 100x from where I had it set (0.0001 to 0.01). The optimization cycle regularly stops early due to the KL limit, but performance doesn't collapse and the algorithm reaches good solutions significantly faster.

## Other RL algorithms¶

One thing I didn't spend much time on is implementing other neural-net RL algorithms. I did implement a very simple Monte Carlo-type algorithm. It learned, but did not produce very strong results. I also used the basic deep-Q algorithm (DQN) from a PyTorch tutorial, which didn't learn at all. It's possible I made some mistakes in adapting the code, though I previously got terrible results with tabular Q-learning too.

There are lots of libraries of high-quality implementations of RL algorithms, such as
OpenAI Baselines,
Stable Baselines,
and RLlib.
I also found a stand-alone implentation
of "Rainbow", a variant of DQN incorporating lots of optimizations.
However, all of these expect an environment that matches the OpenAI `gym`

interface.
These are all "single player" games, and I can't figure out how to shoehorn my multi-agent Dominion simulator into that kind of interface.
As a result, it's much more work for me to test the broad array of existing RL algorithms against this problem.

## Outlook on Dominion¶

This project has really changed my outlook on Dominion as a game. I started off basically trying to prove to myself that the game is "boring", and that Big Money is always the optimal strategy. I've learned that isn't true, and have gained a much deeper appreciation for the design of the game mechanics.

In particular, Victory cards create a very elegant negative feedback cycle. Acquiring points dilutes the deck and makes it harder to acquire further points. This self-balancing aspect helps keep games close and competitive, even if one player is more skilled than the other. In contrast, some games like Settlers of Catan feature positive feedback loops, where early success snowballs into an unbeatable advantage. I dislike those games and find them irritating to play, because if you fall behind early, you stand very little chance of catching up.

Close games and frequent lead changes have been shown to (generally) improve people's enjoyment of a game. Of course, it also means that random chance plays a significant role, so more games are required to evaluate the true strength of a strategy. This translates into more compute power and longer waits during training for an AI developer.

I've also come to see new interplay between Action cards that I hadn't recognized before. This has been a major source of fun in the project -- looking at strategies the algorithms choose, and then trying to understand why they work.

I think I'm done with my Dominion AI for now. However, I'm looking forward to applying what I've learned to some of my other favorite games!