itaibn0

Karma: 686

itaibn0 16 May 2026 1:32 UTC
17 points
−1
on: itaibn0′s Shortform
AI Interpretability idea: Train an LLM with a “split-brain” architecture. That is, the data in the intermediate layers is siloed into two groups with only a small amount of interaction between the groups. This could be enforced in the linear layers by forcing the cross interaction to be a low-rank matrix, or by adding to the regularizer some matrix norm of the cross interaction matrices. I’m sure this can be generalized from MLP to transformer architecture.
Next, apply standard interpretability tools to see how it uses its two sides to represent concepts. Hopefully you’ll find that there are some concepts and tasks that are more focused on one side than on another.
Next, try to elicit the LLM to give verbal descriptions of which concept is represented in which side. Easy mode is to explicitly tell it how it is designed, give some examples of specific concepts primarily encoded on specific sides, even fine-tune it on those examples, and ask it to predict on which side concepts you haven’t trained/told it about are encoded. Hard mode is not to tell it anything about how it was designed but just give it general prompts about trying to introspect, and try to get it to figure this out from scratch. Maybe hand it off to some of the LLM whisperers without telling it how it was created.
Finally, if you succeed in hard mode, give the same prompting to a regular LLM and pay attention to what it says.
The idea is that I’m trying to elicit the LLM to introspect in a way that actually gives us information about the features and concepts it uses to generate text, rather than just tell the story of the humanoid character it is playacting as. I want to create a scenario where there is a non-obvious ground truth that we have access to that it could plausibly have access to and be able to share.
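To make the low-rank and regularizer variants of the constraint concrete, here is a minimal NumPy sketch of a single linear layer. It is only an illustration under assumptions: the function names `split_brain_weight` and `cross_penalty`, the sizes, and the choice of the squared Frobenius norm are all made up here, not a tested training recipe.

```python
import numpy as np

def split_brain_weight(rng, half, rank):
    """Build a (2*half, 2*half) weight matrix whose cross blocks have rank <= rank.

    Hypothetical helper: the two diagonal blocks are unconstrained, while each
    off-diagonal (cross-interaction) block factors through a rank-sized bottleneck.
    """
    w_aa = rng.standard_normal((half, half))  # side A -> side A
    w_bb = rng.standard_normal((half, half))  # side B -> side B
    w_ab = rng.standard_normal((half, rank)) @ rng.standard_normal((rank, half))
    w_ba = rng.standard_normal((half, rank)) @ rng.standard_normal((rank, half))
    return np.block([[w_aa, w_ab], [w_ba, w_bb]])

def cross_penalty(w, half):
    """Squared Frobenius norm of the two cross blocks (one possible regularizer term)."""
    return float(np.sum(w[:half, half:] ** 2) + np.sum(w[half:, :half] ** 2))

rng = np.random.default_rng(0)
w = split_brain_weight(rng, half=8, rank=2)
cross_rank = np.linalg.matrix_rank(w[:8, 8:])  # at most 2 by construction
penalty = cross_penalty(w, half=8)             # would be added to the loss, scaled by a coefficient
```

The hard low-rank factorization and the soft penalty are alternatives. The first fixes the bandwidth between the sides in advance, while the second lets training decide how much cross-talk is worth paying for.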

itaibn0 24 Apr 2026 7:58 UTC
1 point
0
in reply to: programjames’s comment on: Annoyingly Principled People, and what befalls them
I think that ordinarily, social manipulation games do not erode the norm of being able to have honest conversations. I think you are on some level aware of the norms I’m going to describe and are acting in accordance with them, I just want to describe them explicitly. As I understand them, the norms for playing social manipulation games are that there is a distinction between statements made within the game and statements made outside the game. Statements made within the game are not bound by the norms of honesty outside of the game. A player lying or misleading within a game does not impact their reputation outside the game, tho it may impact a reputation a player is trying to maintain within the game that other players are tracking separately. There is an implicit agreement by joining a social manipulation game to suspend these rules of honesty.
A difficulty is that like all implicit agreements the details of it can be misunderstood; in particular, it can be ambiguous which statements are made within the game and which outside the game. Certainly not every statement within the duration of the game is within the game in this sense: if a player says “I have an important appointment so I need to stop playing soon” and is lying then that would be a genuine norm-violating dishonesty. In casual game-playing people often discuss the strategy of the game during the game and that can be considered statements outside the game. This is especially true for a game like Settlers of Catan which is not primarily a social manipulation game, tho it involves some social strategizing. If a false or misleading statement is made which some people think is within the game and others think is outside the game then that does lead to a genuine degradation of the norm of honesty.
One way to remedy this is to aim to be more explicit about the norms of the game before playing. For example, in your case, asking the other player whether they agree to not trick or target people. If they disagree, the explicitness has nonetheless deescalated the dispute from a challenge to personal integrity into a disagreement on how to play games. This disagreement can still be acrimonious, for example leading the two of you not to play games together, but I think it’s an improvement.

itaibn0 11 Apr 2026 1:52 UTC
2 points
1
on: itaibn0′s Shortform
Question for people who think about corrigibility: Consider a corrigible agent with the goal of coloring a wall red. It is considering two kinds of paint. One paint is a brighter, more red red while the other is a duller red. However, the duller red paint is easier to scrape off to repaint the wall in a different color, while the brighter red is harder to remove or paint over. What is the right choice of paint for the corrigible agent to use? How should a corrigible agent make this decision? What additional information if any does the agent need to decide?

itaibn0 20 Dec 2025 7:29 UTC
4 points
3
on: The natural boundaries between people
A linkpost whose contents have themselves been replaced by further links. The title sounded intriguing so I’m trying to follow the trail, but as it stands I think the bad formatting / bitrot disqualifies it as something worth celebrating and drawing attention back to for the 2024 review.

itaibn0 4 Aug 2025 16:05 UTC
1 point
0
on: HPMOR: The (Probably) Untold Lore
Why does nobody ever ask about the Project Lawful / Planecrash epilogue? I still have to finish that one too!
I didn’t know that was planned! Or maybe I heard it and forgot. Now I’m looking forward to it.

itaibn0 27 Jun 2025 13:53 UTC
2 points
1
on: Please speak unpredictably
1. Grammar and style aid reading.
2. Most predict the usual caveats, the rest need to hear them.
3. It’s hard to predict where hard communication tasks fail. Being comprehensive helps.
4. Optimization is hard. Longer is easier.
5. I tried it—fun and evocative!

itaibn0 2 Feb 2025 3:24 UTC
1 point
0
on: Fake thinking and real thinking
I once had a conversation with some math grad students. I mentioned there was a fluid dynamics problem that I was wondering about. I noticed that when a tap is pouring out a thin stream of water and I obstruct it near the top, it makes a stable wavy pattern above where it is blocked. They weren’t familiar with this phenomenon, so I said I’ll show them. I went to the kitchen and turned on the tap with my finger on it. Someone said, (paraphrased,) “Oh, I thought you meant show a video.”
Picture taken today
I still don’t know the answer, I’d be delighted if anyone reading this linked to an explanation.

itaibn0 22 Jan 2025 1:17 UTC
1 point
0
in reply to: River’s comment on: Monthly Roundup #26: January 2025
It seemed to me like the natural thing to do is to sue any government organization that is not abiding by the ERA, claiming that they are violating the constitution. Is anyone doing that?

itaibn0 30 Jun 2024 23:12 UTC
3 points
1
in reply to: devas’s comment on: The Incredible Fentanyl-Detecting Machine
A 3D density map does not reveal the chemical structure of the material in the interior. You’re describing abilities of X-ray scanning consistent with Constantin’s description, which fall far short of a “tricorder” or detecting fentanyl inside a car. Looking it up, airport scanners can also use millimeter-wave scanning, which I believe still fits Constantin’s high-level description of scanning methods on the high-penetration/low-detail side of the tradeoff.

itaibn0 3 Mar 2023 19:23 UTC
3 points
0
in reply to: Steven Byrnes’s comment on: Why I’m not into the Free Energy Principle
By the same token, I’m generally opposed to grand unified theories of the body. The shoulder involves a ball-and-socket joint, and the kidney filters blood. OK cool, those are two important facts about the body. I’m happy to know them! I don’t feel the need for a grand unified theory of the body that includes both ball-and-socket joints and blood filtration as two pieces of a single grand narrative.
I think I am generally on board with you on your critiques of FEP, but I disagree with this framing against grand unified theories. The shoulder and the kidney are both made of cells. They both contain DNA which is translated into proteins. They are both designed by an evolutionary process.
Grand unified theories exist, and they are precious. I want to eke out every sliver of generality wherever I can. Grand unified theories are also extremely rare, and far more common in the public discourse are fakes that create an illusion of generality without making any substantial connections. The style of thinking that looks at a ball-and-socket joint and a blood filtration system and immediately thinks “I need to find how these are really the same” rather than studying these two things in detail and separately is apt to create these false grand unifications, and altho I haven’t looked into FEP as deeply as you or other commenters the writing I have seen on it smells more like this mistake than true generality.
But a big reason I care about exposing these false theories and the bad mental habits that are conducive to them is precisely because I care so much about true grand unified theories. I want grand unified theories to shine like beacons so we can notice their slightest nudge, and feel the faint glimmer of a new one when it is approaching from the distance, rather than be hidden by a cacophony of overblown rhetoric coming from random directions.

itaibn0 2 Mar 2023 23:55 UTC
5 points
0
on: itaibn0′s Shortform
I think MIRI’s Logical Inductor idea can be factored into two components, one of which contains the elegant core that is why this idea works so well, and the other is an arbitrary embellishment that obscures what is actually going on. Of course I am calling for this to be recognized and for people to only teach and think about the elegant core. The elegant core is infinitary markets: Markets that exist for an arbitrarily long time, with commodities that can take arbitrarily long to return dividends, and infinitely many market participants who use every computable strategy. The hack is that the commodities are labeled by sentences in a formal language and the relationships between them are governed by a proof system. This creates a misleading pattern in which the value of the commodity labeled phi appears to measure the probability that phi is true; in fact what it measures is more like the probability that the proof system will eventually affirm that phi is true, or more precisely like the probability that phi is true in a random model of the theory. Of course what we really care about is the probability phi is actually true, meaning true in the standard model where the things labeled “natural numbers” are actual natural numbers and so on. By combining proof systems and infinitary markets, one obscures how much of the “work” in obtaining accurate information is done by either. I think it is better to study these two things separately. Since proof systems are already well-studied and infinitary markets are the novel idea in MIRI’s work, that means they should primarily study infinitary markets.

itaibn0 15 Feb 2023 5:59 UTC
26 points
23
on: Whole Bird Emulation requires Quantum Mechanics
I think it is a mistake to focus on these kinds of weird effects as “biological systems using quantum mechanics”, because it ignores the much more significant ways quantum mechanics is essential for all the ordinary things that are ubiquitous in biological systems. The stability of every single atom depends on quantum mechanics, and every chemical bond requires quantum mechanics to model. For the intended implication on the difficulty of Whole Bird Emulation, these ordinary usages of QM are much more significant. There are a huge number of different kinds of molecular interactions in a bird’s body and each one requires solving a multi-particle Schroedinger equation. The computational work for this one effect is tiny in comparison.
As I understand, the unique thing about this effect is that it involves much longer coherence times than in molecular interactions. This is cool, but unless you can argue that birds have error-correcting quantum computers inside them, which is incredibly unlikely, I don’t think it is that relevant to AI timelines.

itaibn0 10 Jan 2023 22:51 UTC
3 points
0
on: itaibn0′s Shortform
While I like a lot of Hanson’s grabby alien model, I do not buy the inference that, since humans appeared early in cosmological history, the cosmic commons must be taken quickly, which in turn gives a lower bound on how often grabby aliens appear. I think that is neglecting the possibility that the early universe is inherently more conducive to creating life, so most life is created early, but these lifeforms may be very far apart.

itaibn0 23 Oct 2022 20:35 UTC
9 points
0
in reply to: Ben Pace’s comment on: The Onion Test for Integrity
Eliezer is very explicit and repeats many times in that essay, including in the very segment you quote, that his code of meta-honesty does in fact compel you to never lie in a meta-honesty discussion. The first 4 paragraphs of your comment are not elaborating on what Eliezer really meant, they are disagreeing with him. Reasonable disagreements too, in my opinion, but conflating them with Eliezer’s proposal is corrosive to the norms that allow people to propose and test new norms.

itaibn0 23 Oct 2022 2:08 UTC
2 points
−1
on: The “you-can-just” alarm
I had trouble making the connection between the first two paragraphs and the rest. Are you introducing what you mean by an “alarm” and then giving a specific proposal for an alarm afterwards? Is there significance in how the example alarms are in response to specific words being misleading?

itaibn0 28 Apr 2022 3:22 UTC
2 points
0
on: If you’re very optimistic about ELK then you should be optimistic about alignment
Writing suggestion: Expand the acronym “ELK” early in the piece. I looked at the title and my first question was what ELK is, I quickly skimmed the piece and wasn’t able to find out until I clicked on the link to the ELK document. I now see it’s also expanded in the tag list, which I normally don’t examine. I haven’t read the article more closely than a skim.

itaibn0 15 Apr 2022 15:06 UTC
4 points
0
in reply to: MichaelStJules’s comment on: On infinite ethics
On further thought I want to walk back a bit:
1. I confess my comment was motivated by seeing something where it looked like I could make a quick “gotcha” point, which is a bad way to converse.
2. Reading the original comment more carefully, I’m seeing how I disagree with it. It says (emphasis mine)
in practice the problems of infinite ethics are more likely to be solved at the level of maths, as opposed on the level of ethics and thinking about what this means for actual decisions.
I highly doubt this problem will be solved purely on the level of math, and expect it will involve more work on the level of ethics than on the level of foundations of mathematics. However, I think taking an overly realist view on the conventions mathematicians have chosen for dealing with infinities is an impediment to thinking about these issues, and studying alternative foundations is helpful to ward against that. The problems of infinite ethics, especially for uncountable infinities, seem to especially rely on such realism. I do expect a solution to such issues, to the extent it is mathematical at all, could be formalized in ZFC. The central thing I liked about the comment is the call to rethink the relationship of math and mathematical infinity to reality, and that doesn’t necessarily require changing our foundations, just changing our attitude towards them.

itaibn0 14 Apr 2022 18:03 UTC
3 points
0
in reply to: MichaelStJules’s comment on: On infinite ethics
If the only alternative you can conceive of for ZFC is removing the axiom of choice then you are proving Jan_Kulveit’s point.

itaibn0 14 Apr 2022 6:14 UTC
1 point
0
on: How dath ilan coordinates around solving alignment
I was reading the story for the first quotation entitled “The discovery of x-risk from AGI”, and I noticed something around the quotation that doesn’t make sense to me and I’m curious if anyone can tell what Eliezer Yudkowsky was thinking. As referenced in a previous version of this post, after the quoted scene the highest Keeper commits suicide. Discussing the impact of this, EY writes,
And in dath ilan you would not set up an incentive where a leader needed to commit true suicide and destroy her own brain in order to get her political proposal taken seriously. That would be trading off a sacred thing against an unsacred thing. It would mean that only true-suicidal people became leaders. It would be terrible terrible system design.
So if anybody did deliberately destroy their own brain in attempt to increase their credibility—then obviously, the only sensible response would be to ignore that, so as not create hideous system incentives. Any sensible person would reason out that sensible response, expect it, and not try the true-suicide tactic.
The second paragraph is clearly a reference to acausal decision theory: people making a decision because of how they anticipate others will react to expecting that this is how they make the decision, rather than because of the direct consequences of the decision. I’m not sure if it really makes sense (a self-indulgent reminder that nobody knows any systematic method for producing prescriptions from acausal decision theories in cases where they purportedly differ from causal decision theory in everyday life). Still, it’s fiction, I can suspend my disbelief.
The confusing thing is that in the story the actual result of the suicide is exactly what this passage says shouldn’t be the result. It convinces the Representatives to take the proposal more seriously and implement it. This passage is just used to illustrate how shocking the suicide was; no additional considerations are described for why the reasoning is incorrect in those circumstances. So it looks like the Representatives are explicitly violating the Algorithm which supposedly underlies the entire dath ilan civilization and is taught to every child at least in broad strokes, in spite of being the second-highest ranked governing body of dath ilan.

itaibn0 16 Nov 2021 1:50 UTC
1 point
0
in reply to: TurnTrout’s comment on: Quantilizer ≡ Optimizer with a Bounded Amount of Output
Really all I need is that a strategy that takes n bits to specify will be performed by 1 in $\sim 2^{n}$ of all random strategies. Maybe a random strategy consists of a bunch of random motions that cancel each other out, and in 1 in $\sim 2^{n}$ of strategies, in between these random motions are directed actions that add up to performing this n-bit strategy. Maybe 1 in $\sim 2^{n}$ strategies start off by typing this strategy to another computer and end with shutting yourself off, so that the remaining bits of the strategy will be ignored. A prefix-free encoding is basically like the latter situation except ignoring the bits after a certain point is built into the encoding rather than being an outcome of the agent’s interaction with the environment.
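As a toy check of the prefix-free case, the sketch below enumerates all bit strings of a fixed length and counts the fraction that begin with a given n-bit codeword. It assumes a "random strategy" is just a uniformly random bit string, which is a simplification, and the helper names `fraction_with_prefix` and `is_prefix_free` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def fraction_with_prefix(codeword, total_bits):
    """Exact fraction of all bit strings of length total_bits that start with codeword."""
    hits = sum(
        1
        for bits in product("01", repeat=total_bits)
        if "".join(bits).startswith(codeword)
    )
    return Fraction(hits, 2 ** total_bits)

def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of a different codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

# A 3-bit codeword is the start of exactly 1 in 2**3 of all 10-bit strings;
# the 7 bits after the codeword are simply ignored.
frac = fraction_with_prefix("101", 10)
```

In this idealized setting the fraction is exactly $2^{-n}$. The $\sim$ in the comment covers the messier cases where the agent's interaction with the environment does the ignoring.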