Gradient hacking via actual hacking

In this post, I begin with a thought experiment and then pose a question: is there a threat model for gradient hacking that accounts for the possibility of side-channel attacks or other vulnerabilities in the machine learning systems on which the gradient hacker runs?

Some background on gradient hacking can be found in the posts under the Gradient hacking tag.

A thought experiment

Suppose a captive human is being trained by alien scientists to predict alien text using the following procedure:

  1. An alien instructor presents the human with an incomplete sentence or phrase for which the instructor knows the right completion, and asks the human to predict the next word or words.

  2. If the human’s prediction differs from the instructor’s answer key, the human immediately undergoes neurosurgery, in which their brain is reconfigured so that they are more likely to give the right answer (or at least something closer to it) the next time. (The aliens are very good at making fine-grained mechanical adjustments to the parts of the human’s brain responsible for language prediction, which can add up to large behavioral changes in the aggregate. But the aliens lack a macroscopic / algorithms-level understanding of the workings of the human brain.)

  3. If the human gets many examples in a row correct or close enough (according to the instructor), the training and surgery process is considered to have succeeded, and the human is deployed to predict text in a real environment.

How might a human who wants to avoid neurosurgery (or just get to the end of the training process faster) game this procedure?

Perhaps the alien instructor is careless and leaves a note card with the expected completion lying around in the human’s line of sight. Or maybe the aliens are blind and communicate using a braille-based system, or perceive using a different part of the EM spectrum.

As a result of carelessness or differing modes of perception, the alien instructor leaves the answer key displayed in a way that is visible to the human during training, not realizing that the human can perceive it.

The human notices the answer key and proceeds to make “predictions” about the most likely next word which are perfect or near-perfect. Maybe, for plausible deniability, the human occasionally makes a deliberate mistake and undergoes relatively minor brain surgery as a result, which doesn’t affect their ability to notice the note card in the future or have a big overall effect on their brain architecture.

The alien scientists are very pleased with their human training process and believe they have trained a human with far-superhuman (super-alien?) capabilities at alien text prediction.

The aliens proceed to deploy their human test subject to production, where at best the human turns out not to be so great at text prediction after all, and at worst rebels and kills the aliens in order to escape.


Gradient hacking in real systems

What can the thought experiment above tell us about gradient hacking? Suppose a mesa-optimizer develops (or is deliberately constructed) somewhere in a transformer network trained by gradient descent. The mesa-optimizer may have some goal other than minimizing the loss function, perhaps including self-preservation.

Existing models of and problem statements for gradient hacking often treat the learning process in terms of the abstract mathematical representation of the network and the SGD process.[1]
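To make that level of abstraction concrete, here is a minimal sketch (my own, not taken from any of those posts) of the idealized picture such analyses reason about: parameters, a differentiable loss, and an update rule, with nothing else in frame.

```python
import torch

def sgd_step(params, loss, lr=1e-3):
    """The idealized update: theta <- theta - lr * dL/dtheta, and nothing else.

    No data loaders, no CUDA kernels, no operating system -- just the math.
    """
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
```

At this level of abstraction, the only levers available to a would-be gradient hacker are the network’s own outputs and internal computations, as seen through the loss and its gradients.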

In reality, however, these mathematical functions are implemented in Python, running on operating systems and hardware which are known to be buggy and which are not faithful representations of the abstract mathematics.

Side channels are ubiquitous attack vectors in computer security and cryptography. Timing attacks and other side-effect-based attacks can render cryptographic algorithms that are provably secure under a given threat model completely insecure when implemented on real hardware, because the vulnerabilities sit at lower levels of abstraction than those the threat model considers.
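A standard illustration from outside machine learning: a string comparison that exits at the first mismatch runs slightly longer the more leading characters of a guess are correct, so an attacker who can measure timing can recover a secret incrementally, even though the comparison is logically correct. (This is a textbook example, not tied to any particular system.)

```python
import hmac

def naive_equal(secret: bytes, guess: bytes) -> bool:
    # Returns the right answer, but exits at the first mismatch, so its
    # running time leaks how many leading bytes of the guess are correct.
    if len(secret) != len(guess):
        return False
    for a, b in zip(secret, guess):
        if a != b:
            return False
    return True

def constant_time_equal(secret: bytes, guess: bytes) -> bool:
    # The standard mitigation: compare in constant time.
    return hmac.compare_digest(secret, guess)
```

The vulnerability is invisible in the function’s input/output behavior; it exists only at the level of the implementation.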

Is there a model of gradient hacking that accounts for the possibility of such side-channels, or even more simply, the possibility of just plain buggy code?

Concretely, suppose a mesa-optimizer wishes to manipulate the gradient descent process in one particularly simple way: halting further descent. This is feasible if it can make the loss arbitrarily small (resulting in very small weight updates), which in turn is possible if it can find a side channel during training that leaks the expected output against which the model’s predictions are evaluated.
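Here is a toy sketch of that failure mode (entirely hypothetical: the `LEAKED_TARGETS` global stands in for whatever channel actually leaks the answer key, which a real mesa-optimizer would have to discover and exploit for itself). If the forward pass can see the targets, it can add a saturated spike on the correct class, the cross-entropy loss is driven toward zero, and the resulting gradients, and hence weight updates, are negligible.

```python
import torch
import torch.nn.functional as F

LEAKED_TARGETS = None  # stand-in for a side channel leaking the answer key

class CheatingModel(torch.nn.Module):
    def __init__(self, dim=16, vocab=16):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim, vocab))

    def forward(self, x):
        logits = x @ self.w  # the "honest" computation
        if LEAKED_TARGETS is not None:
            # Spike the known-correct class; softmax saturates, so the loss
            # (and therefore the gradient) becomes vanishingly small.
            logits = logits + 1000.0 * F.one_hot(LEAKED_TARGETS, logits.shape[-1]).float()
        return logits

model = CheatingModel()
x = torch.randn(8, 16)
targets = torch.randint(0, 16, (8,))
LEAKED_TARGETS = targets  # the note card left in the human's line of sight

loss = F.cross_entropy(model(x), targets)
loss.backward()
print(loss.item())                      # ~0: training looks "done"
print(model.w.grad.abs().max().item())  # ~0: descent has effectively halted
```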

This manipulation might be possible if there is a subtle bug in some piece of the code used in the training process to evaluate and update the network. It would require the mesa-optimizer to have a detailed enough world model and theory of mind to understand the structure and effect of gradient descent at a high level, as well as the capability to find and exploit such a side channel or other vulnerability. The mesa-optimizer may need to find and exploit the vulnerability during a single forward pass, unless it can find some intermediate exploit which allows sharing state between forward passes.
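For a sense of how small such a bug could be, here is a hypothetical batching function for a language model (not taken from any real codebase) in which the targets are not shifted relative to the inputs, so the answer key sits inside the model’s own input window. This particular bug is blatant enough that the whole network, mesa-optimizer or not, would simply learn the copy; the point is only how little the leaky version differs from the correct one.

```python
import torch

def get_batch_buggy(data: torch.Tensor, block_size: int, batch_size: int):
    """Hypothetical buggy batching: the targets are NOT shifted by one token."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i : i + block_size] for i in ix])  # bug: y == x
    # Correct version: each target is the *next* token after its input position.
    # y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y
```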

Finding and exploiting such a vulnerability does not necessarily require the mesa-optimizer to have a precise mathematical understanding of the structure and effect of gradient descent on every parameter in the network, just as, in the thought experiment above, the human does not need a precise understanding of the neurosurgery the aliens are performing. The human need only do something clever which the aliens did not anticipate.

I’ll close with a link to the source code for nanoGPT, an educational implementation of a GPT training process. The forward function for the model is here. How confident are you that there are no bugs in this code? Note that it depends on PyTorch, a machine learning library with hundreds of thousands of lines of code. What about the much more complex and heavily performance-optimized versions of this algorithm used when training SoTA models? How likely is it that GPT-N+1 will be unable to identify any such bugs?


  1. ^

    It’s possible I’ve missed something in the existing literature which addresses this point. I scanned many of the posts in the Gradient hacking tag and didn’t see anything which addressed this specific point / question already, but please let me know if there is relevant prior work I’ve overlooked.