nika koghuashvili

Karma: 7

nika koghuashvili 17 Jun 2026 13:43 UTC
2 points
0
in reply to: Ninety-Three’s comment on: Plastic Cake Fallacy
In my experience, showing the real conversation directly runs into an emotional barrier in people who actually hold that belief: it makes them defensive and less likely to see the point. This doesn’t apply to most of the LW audience, but still.

Plastic Cake Fallacy

nika koghuashvili 17 Jun 2026 6:01 UTC

3 points

2 comments · 1 min read · LW link

nika koghuashvili 6 May 2026 21:58 UTC
1 point
0
on: Interpretability Will Not Reliably Find Deceptive AI
I’m curious: how would you say your views on mech interp’s role in AGI safety have updated in the year since writing this? Which areas are you more or less optimistic about, and what changed in your view based on recent successes and failures?

nika koghuashvili 14 Mar 2026 1:13 UTC
1 point
0
on: Transformers Represent Belief State Geometry in their Residual Stream
I think there’s a mistake in the graph: in Mixed-State Presentation, shouldn’t the upper blue state say “01” instead of “101”?

Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition

nika koghuashvili 27 Jan 2026 2:25 UTC

5 points

0 comments · 5 min read · LW link

nika koghuashvili 10 Jan 2026 14:20 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: The Hidden Cost of Our Lies to AI
I imagine that to this end, and for trust in general (human-to-human, human-to-AI, and AI-to-AI), smart contracts could be very useful: misaligned models could trust that we will actually do what we promise, because we would be physically incapable of breaking the promise. In this sense, could it be worthwhile for some alignment people to move into the Web3 world and try to make smart contracts more powerful, capable of enforcing as many real-world things as possible (as far as compute, NFTs, or other controllable things go), or would this be a waste of time? I imagine that if we move into a multipolar world of multiple competing AGIs, the AGIs themselves might decide to do this to communicate with each other. Obviously there would be the oracle problem and many other issues, but I think this area deserves at least some exploration.

Somewhat similar research does exist on letting agentic LLMs use crypto wallets and the like, and that might be useful adjacent work, but what I mean is that maybe we should invest directly in this way of making promises about rights unbreakable.