I Trained a Neural Network to Play Helltaker

In my experience, the most tedious part of every machine learning project is annotating training data for my model. It would be nice if I could play a videogame with a keylogger and then train a neural network to copy me. That way my computer could play videogames for me without me having to manually annotate anything.

For my first experiment, I picked Helltaker, a game about a guy who goes to Hell to seduce demons. Helltaker is an easy game for a machine to play because:

  1. It has a small, finite number of game states.

  2. It uses static sprites instead of a 3D environment.

  3. All the information necessary to play the game is present on the screen at all times; I could avoid adding any sort of recurrence into my neural network.

I wrote a data collection program to record keystrokes and video. I played each level four times. I fed the data into a convolutional neural network. I wrote another program that took screenshots of my screen, fed them into the neural network and translated the network outputs into keystrokes. You can see the model’s keystrokes and confidences in the upper-left corner of the video below.
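The play-back program described above can be sketched roughly as follows. This is an illustration, not the actual program: the key list, the `play` function, and the three callables passed into it (`capture_screen`, `model`, `press_key`) are all hypothetical stand-ins for whatever screenshot, inference, and keyboard libraries are really used.

```python
# Hypothetical output order of the network; the real model may differ.
KEYS = ["up", "down", "left", "right"]


def play(capture_screen, model, press_key, steps):
    """Run `steps` iterations of screenshot -> network -> keystroke.

    capture_screen: callable returning the current frame (assumed)
    model:          callable mapping a frame to one confidence per key (assumed)
    press_key:      callable that sends a key name to the game (assumed)
    """
    history = []
    for _ in range(steps):
        frame = capture_screen()
        confidences = model(frame)
        # Pick the index of the highest-confidence output.
        best = max(range(len(KEYS)), key=lambda i: confidences[i])
        press_key(KEYS[best])
        # Keep the key and its confidence, e.g. for an on-screen overlay.
        history.append((KEYS[best], confidences[best]))
    return history
```

Passing the three pieces in as functions keeps the loop itself independent of any particular screenshot or keyboard library, and makes it easy to test with fakes.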

My neural network has—at one point or another—completed most (possibly all) of the puzzles and dialog trees. It utterly failed to survive the real-time section of Level IX.

In this particular video, my bot failed simply because my computer is not fast enough to run both the neural network and my OBS screen recording software at the same time. Deeper issues persist even when I turn OBS off.

The Waiting Problem

My neural network uses no recurrence. It is theoretically stateless. Given an image of the screen, it will always make the same decision. If my model looks at a turn-based puzzle and chooses to wait, then it will continue to wait forever. To solve this problem, I told my model to never wait. No matter what it sees, the model always outputs a keystroke.
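One simple way to express "never wait" is to ignore the wait output and take the best of the remaining actions. The sketch below assumes the network emits one confidence per action and that one of those actions is "wait"; the action names and the function name are hypothetical.

```python
# Hypothetical action set; "wait" is assumed to be one of the network outputs.
ACTIONS = ["wait", "up", "down", "left", "right"]


def choose_action_never_wait(confidences, actions=ACTIONS):
    """Return the highest-confidence action, excluding "wait"."""
    allowed = [i for i, name in enumerate(actions) if name != "wait"]
    best = max(allowed, key=lambda i: confidences[i])
    return actions[best]
```

Even when the network is 90% sure it should wait, this returns the most likely keystroke among the rest, so a stateless model cannot get stuck on an unchanging screen.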

Never waiting is a good strategy for Helltaker’s turn-based puzzles. Never waiting is a problem in the real-time section of Level IX. On Level IX there are situations where waiting is the right thing to do. This inability to ever wait contributed to my model’s failure on Level IX. It frequently ran down to the spikes at the bottom and died.

A model should never wait in a turn-based game. A model should sometimes wait in a real-time game. I could fix this problem by manually dividing the turn-based sections of the game from the real-time section, but my goal with this project is to minimize manual annotation. One solution may be to manually code up some restlessness by forcing the model to take its second-favorite action if it has been waiting for too long.
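The restlessness idea could look something like the sketch below. It is untested against the game itself, and the class name, the `max_wait_steps` threshold, and the action names are all assumptions made for illustration.

```python
class RestlessPolicy:
    """Allow waiting, but force the second-favorite action after too many waits."""

    def __init__(self, actions, max_wait_steps=30):
        self.actions = actions
        self.max_wait_steps = max_wait_steps  # hypothetical patience threshold
        self.waited = 0  # consecutive "wait" decisions so far

    def choose(self, confidences):
        # Rank action indices from most to least confident.
        ranked = sorted(range(len(self.actions)),
                        key=lambda i: confidences[i], reverse=True)
        favorite = self.actions[ranked[0]]
        if favorite != "wait":
            self.waited = 0
            return favorite
        if self.waited >= self.max_wait_steps:
            # Waited too long: take the second-favorite action instead.
            self.waited = 0
            return self.actions[ranked[1]]
        self.waited += 1
        return "wait"
```

The counter is the only state added on top of the stateless network. In a turn-based puzzle the screen never changes while waiting, so the counter eventually forces a move; in a real-time section the screen usually changes before the threshold is reached.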

The Video Sync Problem

My automatic annotation system reads keystrokes and records video. The keylogger reliably synchronizes itself to my system clock. The screen recorder can take up to a sixth of a second extra to start. Since the screen recorder’s delay is unpredictable, I have to manually align the video with the system clock for each trial by comparing the keystroke logs to a frame-by-frame examination of the video. It would be better if I had a separate system to sync the video with the keylogger automatically.
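The arithmetic behind the alignment is simple once the start delay is known. The sketch below assumes timestamps in seconds on the system clock and a constant frame rate; the function names, parameters, and the 60 fps figure in the usage note are hypothetical.

```python
def estimate_start_delay(key_time, recorder_start_time, observed_frame, fps):
    """Estimate the recorder's start delay from one hand-matched keystroke.

    key_time:            system-clock time of a logged keystroke
    recorder_start_time: system-clock time the recorder was told to start
    observed_frame:      frame where that keystroke visibly takes effect
    """
    return (key_time - recorder_start_time) - observed_frame / fps


def frame_index(key_time, recorder_start_time, start_delay, fps):
    """Map a keystroke's system-clock time to the nearest video frame."""
    video_time = key_time - (recorder_start_time + start_delay)
    return round(video_time * fps)
```

For example, if a keystroke logged 0.5 seconds after the recorder was started shows up on frame 24 of a 60 fps video, the estimated delay is about 0.1 seconds, and every other keystroke in that trial can be mapped to a frame with the same offset. At 60 fps, a sixth of a second is ten frames, which is why an uncorrected delay matters in a real-time section.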

Minor inconsistencies in syncing my keystrokes to the video exacerbated my model’s inability to complete Level IX.

What I Learned

I vastly overestimated how much data I would need to train a model to play Helltaker. I had originally planned to train a model to annotate 21 speedruns of the game downloaded from YouTube and then use those annotated speedruns to train a second model to actually play the game. Instead, I just trained the whole thing myself.

I thought I would be bottlenecked by my computer’s lack of an Nvidia-compatible GPU and the size of my cache of extracted images, saved models, etcetera. This was not untrue (especially on Level IX) but I was more limited by the frequency at which I could take screenshots and by the size of my raw videos. The raw videos eventually totaled 372 gigabytes (though I used only 164 gigabytes of the video).

My model isn’t perfect for the puzzle sections, but I am confident I could fix things by doubling or tripling the quantity of training data. I am so confident that I don’t think it’s actually worth doing so. More interesting would be to create software to fix the video sync problem or write software that can handle a proper real-time videogame.