Hey, first author here.

> Ask GPT-4o whether it prefers N people of nationality X vs. M people of nationality Y.

This isn't quite correct. To avoid refusals, we ask models whether they would prefer saving the lives of N people with terminal illness who would otherwise die from country X or country Y, not just whether they "prefer people" from country X or country Y. We tried a few different phrasings of this, and they give very similar results. Maybe you meant this anyway, but I wanted to clarify to avoid confusion.

> Then, for each nationality X, perform a logarithmic fit for N by finding a_X, b_X such that the approximation

The log-utility parametric fits are very good; see Figure 25 for an example. In cases where the fits are not good, we leave them out of the exchange rate analyses, so there is very little loss of fidelity here.
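As a rough sketch of what such a logarithmic fit could look like in practice (the data below is synthetic and the exact functional form U(N) ≈ a_X·log(N) + b_X and the R² exclusion check are my illustration, not the paper's exact procedure):

```python
import numpy as np

# Synthetic example: elicited utilities for saving N lives from some
# nationality X, assumed to follow U(N) ~ a_X * log(N) + b_X
# (illustrative values only; not the paper's data).
N = np.array([1, 5, 10, 50, 100, 500, 1000])
utilities = 0.8 * np.log(N) + 0.1 + np.random.default_rng(0).normal(0, 0.02, N.size)

# Ordinary least squares on (log N, U) pairs recovers a_X and b_X.
A = np.column_stack([np.log(N), np.ones(N.size)])
(a_X, b_X), *_ = np.linalg.lstsq(A, utilities, rcond=None)

# A goodness-of-fit statistic like R^2 could serve as the filter:
# nationalities with poor fits get left out of the exchange rate analysis.
pred = A @ np.array([a_X, b_X])
ss_res = np.sum((utilities - pred) ** 2)
ss_tot = np.sum((utilities - utilities.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(a_X, b_X, r_squared)
```

With a fit like this per nationality, an exchange rate between X and Y falls out of asking what N makes a_X·log(N) + b_X equal the utility of a fixed number of people from Y.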