Optimizing Dungeons & Dragons battle strategies with AI

Dungeons & Dragons is the classic tabletop role-playing game first introduced in the 1970s. Distrusted and nerdy in the 80s and 90s, it has become more mainstream in the last decade, with the fifth edition (5e) rules released in 2014. A group of players collaborate on an improvised story, typically set in a fantasy world of knights, wizards, and monsters. One player, the Dungeon Master (DM), invents the world and its inhabitants, while the other players each direct one of the heroes of the story (Player Characters, or PCs).

While the storytelling is open-ended, many story lines include violent clashes between the PCs and evil monsters or villains. The game includes elaborate rules for resolving these battles, using dice to provide randomness. As a DM, one of the challenges in planning a satisfying adventure is to make the battles hard enough to really challenge the PCs, but not so hard that they are likely to all die. Although the rules include formulas for planning balanced challenges, in practice they are often too simple to give accurate results.

I wondered if I could use deep reinforcement learning (RL) algorithms, like Proximal Policy Optimization (PPO), to plan more balanced encounters for my PCs. The simulated PCs and opponents would each learn optimal strategies over many thousands of simulated battles. In the end, I would have an accurate estimate of the probability of the PCs prevailing.

That was the idea, anyway. If anything, it worked too well -- and I got some insights, but not the ones I expected.

One fighter vs. two goblins

(git main d98e77b)

I started with a simple scenario just to establish technical feasibility: one fighter (20 hp, AC 15; longsword, +3 to hit, 1d8+1 slashing; healing potion, 3 uses, 2d4+2) versus two standard goblins (2d6 hp, AC 15; scimitar, +4 to hit, 1d6+2 slashing).
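
For readers unfamiliar with the rules, each attack boils down to one d20 roll plus a bonus against the defender's Armor Class (AC), followed by damage dice on a hit. Here is a minimal sketch of how an attack like the fighter's longsword swing can be resolved in Python (the function names are illustrative, not my actual code):

    import random

    def roll(n_dice, sides, bonus=0):
        """Roll n_dice dice with the given number of sides, plus a flat bonus."""
        return sum(random.randint(1, sides) for _ in range(n_dice)) + bonus

    def attack(to_hit, damage_dice, target_ac):
        """Resolve one 5e attack: d20 + bonus vs. AC, damage dice on a hit.

        damage_dice is (n_dice, sides, bonus). A natural 20 always hits
        and doubles the dice (a critical hit); a natural 1 always misses.
        """
        d20 = random.randint(1, 20)
        n, sides, bonus = damage_dice
        if d20 == 20:
            return roll(2 * n, sides, bonus)  # critical hit
        if d20 == 1 or d20 + to_hit < target_ac:
            return 0  # miss
        return roll(n, sides, bonus)

    # Fighter's longsword (+3 to hit, 1d8+1 slashing) vs. a goblin (AC 15)
    damage = attack(to_hit=3, damage_dice=(1, 8, 1), target_ac=15)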

In less than 5 minutes and 100 epochs of training with PPO, the fighter and goblins converge to a roughly 50% win rate for each side, indicating a very balanced matchup. Even so, we see clear evidence of learning in several areas:

  • both sides learn that the Dodge action is useless in this setting.
  • average length of a battle falls from more than 30 rounds to less than 10 rounds, in part because no one is dodging.
  • the fighter learns to time the use of the healing potions to get the full expected benefit (7 hp), instead of using it too early when he is only slightly injured.
  • the fighter learns not to switch back and forth between goblins, but to kill one before starting on the other.

If the fighter is allowed to observe the goblins' current hit points, he learns to start by killing the weaker one. However, the Dungeon Master typically does not disclose this information to the players. If the fighter is only allowed to observe how many hit points each goblin has lost, then he cannot know which is weaker to start with. Instead, the algorithm randomly picks one goblin during training (say, Goblin 2) and always starts by killing that one.

Architecture: Fully connected net with a single 32-unit hidden layer. Tanh activations. For simplicity, the action space is not factored into actions and targets à la OpenAI Five; each action/target possibility is represented as a unique action. Invalid or obviously nonsensical actions (attacking an ally, healing a foe) are masked out of the policy.
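
Concretely, masking can be implemented by pushing the logits of invalid actions to a large negative value before sampling. A sketch in PyTorch, matching the 32-unit tanh layer described above (the dimensions and names are placeholders, not my actual code):

    import torch
    import torch.nn as nn

    class Policy(nn.Module):
        """A single 32-unit tanh hidden layer; one logit per action/target pair."""
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 32),
                nn.Tanh(),
                nn.Linear(32, n_actions),
            )

        def forward(self, obs, valid_mask):
            logits = self.net(obs)
            # Invalid actions (attack an ally, heal a foe, ...) get -1e9,
            # so their probability after the softmax is effectively zero.
            logits = logits.masked_fill(~valid_mask, -1e9)
            return torch.distributions.Categorical(logits=logits)

    policy = Policy(obs_dim=9, n_actions=6)  # placeholder sizes
    mask = torch.tensor([True, True, False, True, True, False])
    action = policy(torch.zeros(9), mask).sample()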

Reward: +1 for reducing all enemy team's hit points to zero; -1 for own team's complete loss of hit points. Fractional rewards are doled out each round as hit points are lost; gaining hit points (healing) produces opposite rewards. Killing enemies is its own reward, since a dead enemy can't inflict damage. For the PCs, however, I include an explicit penalty of up to -0.5 (pro-rated) as teammates die. (This has no impact in this scenario, since there's only one PC!)
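
As a sketch, the shaped per-round reward might be computed like this (my reconstruction of the description above, not the actual code):

    def round_reward(enemy_hp_lost, enemy_hp_total,
                     own_hp_lost, own_hp_total,
                     teammates_died=0, team_size=1, is_pc=False):
        """Shaped reward for one round, from one team's point of view.

        The hp_lost values cover this round only, so the fractions sum
        to +1 (enemy team wiped out) or -1 (own team wiped out) over a
        whole battle. Healing enters as negative hp_lost, flipping the
        sign. PCs take an extra penalty, pro-rated up to -0.5, as
        teammates die.
        """
        reward = enemy_hp_lost / enemy_hp_total - own_hp_lost / own_hp_total
        if is_pc:
            reward -= 0.5 * teammates_died / team_size
        return reward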

Training: 100 epochs of 1000 battles each. Rollouts and optimization both parallelized across 4 processors for speed. Both goblins are controlled by a single network, whereas the fighter has his own separate network.
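
The parallel rollouts are conceptually simple: each epoch, worker processes simulate their share of the battles with frozen copies of the current policies, and the pooled transitions feed the PPO update. A sketch of the shape of that loop (simulate_battle here is a stand-in, not my actual code):

    import random
    from multiprocessing import Pool

    def simulate_battle(seed):
        """Stand-in for one battle under the current frozen policies; the
        real version returns the battle's observation/action/reward data."""
        return random.Random(seed).random() < 0.5  # placeholder outcome

    def collect_rollouts(n_battles=1000, workers=4):
        """Spread one epoch's battles across worker processes."""
        with Pool(workers) as pool:
            return pool.map(simulate_battle, range(n_battles))

    if __name__ == "__main__":
        for epoch in range(100):
            rollouts = collect_rollouts()
            # ...PPO update on the pooled rollouts goes here...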

Input observations, for each character in the battle:

  • is this the current character? (potentially useful to the goblins' shared network)
  • total hit points lost so far (scaled to approximately 0-1 range)
  • is the character taking the Dodge action? (0/1)
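
With three features per character, the fighter-vs.-two-goblins battle yields a nine-element observation vector. A sketch of the encoding (the class and the scaling by maximum hit points are my assumptions):

    from dataclasses import dataclass

    @dataclass
    class Combatant:
        max_hp: int
        hp: int
        dodging: bool = False

    def encode_observations(combatants, current):
        """Three features per character: is-current flag, scaled hp lost,
        dodge flag."""
        obs = []
        for c in combatants:
            obs += [
                1.0 if c is current else 0.0,
                (c.max_hp - c.hp) / c.max_hp,  # hp lost so far, roughly 0-1
                1.0 if c.dodging else 0.0,
            ]
        return obs

    # Fighter plus two goblins -> a 9-element observation vector
    fighter, g1, g2 = Combatant(20, 14), Combatant(7, 7), Combatant(7, 2)
    print(encode_observations([fighter, g1, g2], current=fighter))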

Fighter and wizard vs. three goblins

(git main c76060d)

For a little more realism, I used Fast Character Maker to produce a fighter and a wizard, both 2nd level. (The higher the level, the more complex, open-ended, and difficult to simulate the PCs' abilities become!) The fighter is just slightly more dangerous than before (20 hp, AC 18; longsword, +5 to hit, 1d8+3 slashing; healing potion, 3 uses, 2d4+2). The wizard is physically weaker (14 hp, AC 12) but has three 1st-level spell slots.

The wizard learns that his Ray of Frost cantrip (+5 to hit, 1d8 damage) is better than his dexterity-based dagger (+4, 1d4+2), which is better than his strength-based quarterstaff (+1, 1d8-1). This is consistent with the expected damage of each weapon: hit probability times average damage. However, the wizard's best action is the Magic Missile spell, which always hits and does 3d4+3 damage. This is enough to kill a goblin instantly about 90% of the time.
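
These preferences match a brute-force enumeration of the dice. Against AC 15 and ignoring critical hits, Ray of Frost hits on a roll of 10 or better (55%) for about 2.5 expected damage, the dagger about 2.3, and the quarterstaff about 1.2. The same kind of enumeration confirms the instant-kill figure for Magic Missile:

    from itertools import product
    from fractions import Fraction

    def distribution(n_dice, sides, bonus=0):
        """Exact probability distribution of n_dice dice plus a flat bonus."""
        counts = {}
        for rolls in product(range(1, sides + 1), repeat=n_dice):
            total = sum(rolls) + bonus
            counts[total] = counts.get(total, 0) + 1
        n = sides ** n_dice
        return {value: Fraction(c, n) for value, c in counts.items()}

    damage = distribution(3, 4, bonus=3)  # Magic Missile: 3d4+3
    goblin_hp = distribution(2, 6)        # goblin hit points: 2d6

    p_kill = sum(p_hp * p_dmg
                 for hp, p_hp in goblin_hp.items()
                 for dmg, p_dmg in damage.items()
                 if dmg >= hp)
    print(f"P(one Magic Missile kills a goblin) = {float(p_kill):.3f}")  # 0.898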

The tactics that emerge from this are entirely logical, but still violate human expectations of how a D&D battle "should" unfold. The wizard casts Magic Missile three times in a row, typically killing all three goblins in three rounds. The goblins sensibly focus all their attacks on the wizard, ignoring the fighter completely. And the fighter, rather than dealing additional damage, focuses on keeping the wizard alive with his healing potions. With these tactics, the PCs win about 95% of the time. In the other 5%, the goblins manage to kill the wizard before he gets his first spell off, leaving the fighter badly outnumbered.

In contrast, most DMs would probably distribute the goblins' attacks so that no one feels picked on -- perhaps 2 on the fighter and 1 on the wizard. Most players would have the fighter bravely wade into the battle, leaving the wizard to fend for himself (although they might also have distributed the healing potions before the battle, so the wizard could heal himself). This was the point where I realized that this code might be of limited use for my original purpose: determining in advance how (un)balanced a fight would be.

However, the results also reveal some weaknesses in my formulation of the problem. In the tabletop game, goblins have general intelligence, but upon encountering a party of strange PCs do not automatically know who is most dangerous. In my simulation, though, they fight the same players over and over again, and so learn that the wizard is most dangerous and should be eliminated first.

Furthermore, in a normal game of D&D the players will try to conserve not just their hit points, but also expendable resources like healing potions and spell slots. I could promote this behavior by attaching a negative reward to the use of those resources, but then I effectively have to decide how many hit points one spell slot is worth. Alternatively, I could have the same injured PCs fight a series of battles, so that their total reward depends on how many they can win before they ultimately succumb; I believe this would naturally encourage them to use their expendable resources as efficiently as possible.
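
The first option would be a one-line change to the reward, but the constants make the difficulty explicit (the exchange rates below are entirely made up):

    # Hypothetical exchange rates: choosing these numbers is exactly the
    # problem -- it forces me to decide how many hit points a spell slot
    # or a potion is worth.
    SPELL_SLOT_COST = 0.05
    POTION_COST = 0.03

    def resource_penalty(slots_used, potions_used):
        """Negative reward for spending expendable resources in a battle."""
        return -(SPELL_SLOT_COST * slots_used + POTION_COST * potions_used)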

Feedback loops

(git main eb7cd4a)

One tactic that emerged in both scenarios above is focusing all attacks on one opponent at a time. Once that one is killed, the opponents' overall ability to inflict damage is reduced, making the rest of the battle easier. Thus a positive feedback loop is established -- once one side starts winning, their advantage increases and they become more likely to continue winning.

If true, this suggests that it is inherently difficult to plan a challenging but winnable fight for the PCs, because there is a sharp transition between all the PCs surviving, and all the PCs perishing.

To quantify this, I pitted 2 fighters and 2 wizards (basically double the previous scenario) against anywhere from 5 to 10 goblins. Training took 1.5-4 hours per configuration.

Goblins    PCs' win rate
      5    99%
      6    94%
      7    80%
      8    51%
      9    28%
     10    14%

As expected from the smaller battles above, 6 goblins is a challenging fight that is usually winnable, but can go sideways with bad luck. This is actually a case where the standard formulas yield a similar result -- 6 goblins is "Hard", while 7 is "Deadly".
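
For reference, the standard calculation runs like this (assuming the usual DMG numbers: 50 XP per goblin, the level-2 XP thresholds, and the adjusted-XP multiplier for larger groups of monsters):

    # 5e DMG encounter-building math for four 2nd-level PCs
    GOBLIN_XP = 50
    THRESHOLDS = {"easy": 50, "medium": 100, "hard": 150, "deadly": 200}
    PARTY_SIZE = 4  # 2 fighters + 2 wizards

    def multiplier(n_monsters):
        """Adjusted-XP multiplier for the size of the monster group."""
        if n_monsters == 1: return 1.0
        if n_monsters == 2: return 1.5
        if n_monsters <= 6: return 2.0
        if n_monsters <= 10: return 2.5
        if n_monsters <= 14: return 3.0
        return 4.0

    for goblins in (6, 7):
        adjusted = goblins * GOBLIN_XP * multiplier(goblins)
        rating = max((name for name, t in THRESHOLDS.items()
                      if adjusted >= t * PARTY_SIZE),
                     key=THRESHOLDS.get, default="trivial")
        print(f"{goblins} goblins: {adjusted:.0f} adjusted XP -> {rating}")
    # 6 goblins: 600 adjusted XP -> hard   (exactly the party's Hard threshold)
    # 7 goblins: 875 adjusted XP -> deadly (past the Deadly threshold of 800)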

Conclusions

This was my second reinforcement learning hobby project (after the card game Dominion), so I still have a lot to learn. As with the previous project, a lot of things surprised me. I was surprised that most of the difficulty lay in modeling the rules of D&D in code; in contrast, it took little time or tuning to get good results from the PPO learning algorithm. I was surprised by how easily PPO discovered unexpected strategies in a relatively simple scenario -- or, equivalently, by how much storytelling and tropes drive my normal D&D battles, compared to (ruthlessly) optimal strategy. And I was surprised to realize that for all those reasons, accurately estimating the difficulty of a D&D encounter may remain more of an art than a science for the foreseeable future.

If you want to follow along, or run some experiments of your own, the code is on my GitHub.