I think there’s a mistake in the graph: in Mixed-State Presentation, shouldn’t the upper blue state say “01” instead of “101”?
nika koghuashvili
Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition
I imagine that to this end, and for human-to-human, human-to-AI, and AI-to-AI trust more generally, smart contracts could be very useful: misaligned models could trust that we will actually do what we promise, because we would be physically unable to break the promise. In this sense, could it be worthwhile for some alignment people to move into the Web3 world and try to make smart contracts more powerful, capable of enforcing as many real-world things as possible (wherever compute, NFTs, or other controllable things are concerned), or would this be a waste of time? I imagine that if we move into a multipolar world of multiple competing AGIs, the AGIs themselves might decide to do this to communicate with each other. Obviously there would be the oracle problem and many other issues, but I think this area deserves at least some exploration.
Somewhat similar research does exist on letting agentic LLMs use crypto wallets and the like, and that might be useful adjacent work, but I mean that maybe we should invest directly in this approach to making promises about rights unbreakable.
I’m curious, how would you say your views on mech interp’s role in AGI safety have updated over the past year since writing this? Which areas are you more or less optimistic about and what changed in your view based on recent successes and failures?