Human from Finland. I think about software, writing, philosophy, thinking well, living well, reading and various other things.
Xylix
I want to do SCIENCE
Oh, you want to confront the unavoidable, lose your footing as the bits accumulate, become lost in the forest of knowledge?
I want to chase ANSWERS
Keep chasing them to the world’s end, will you? As the ground runs out, will you leap, or flee?
I have to UNDERSTAND
Crave certainty, do you? That feeling of finding the missing pieces? Are you prepared to become the one building the puzzles, whose pieces do not yet exist?
I wish to EXPLORE
Venture into the unknown, will you? Prepared to map the map?
I want to be a GREAT SCIENTIST
Oh you’re ready to collide your head into the problem a hundred times in a row? Ready to study TOPOLOGY with no idea what it’s used for?
I want to make people THINK better
Oh you want them to remember the catchiest pieces of your deep advice? Prepared to watch them gravitate towards the simple over the complex, time and time again?
I am ready to tackle the REAL problems
Are you ready to run away when the problem hits you in the face? And return, and run, and return, and run?
I do not know how not to. I MUST find the answers.
Answers for what?
Everyone must be WRONG, including me
You are finally lost enough to begin.
Related phenomenon: As you control for Goodhart by changing your optimization system, I would predict that the magnitude of your error will decrease (as long as you don’t apply overt optimization pressure, where it would diverge towards infinity), but the error will spread across more dimensions, and thus will be harder to model.
Intuition: As you disallow the simpler routes to the original measured goal, the system optimizes a more complex route, and the more complex route will often end up routing through new error dimensions, which will be subtler, but possibly more harmful, once their effects materialize.
Example: As more and more RLHF is applied to frontier models, their failure modes will become harder and harder to automatically detect and train against, and some of their behavioural features will diverge further from what makes intuitive sense to humans, becoming more alien, and (for some dimensions) more misaligned.
I think of this as “recursive Goodhart”.
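As a reference point, here is a minimal sketch of the plain, non-recursive effect that this builds on (often called regressional Goodhart). It is a toy under stated assumptions, not a model of the recursive claim: the true value and the measurement noise are both assumed Gaussian, and the names `selected_gap` and `mean_gap` are made up for illustration. It only shows that the gap between the proxy score and the true value of the selected candidate grows as selection pressure (the number of candidates searched) increases.

```python
import random
import statistics


def selected_gap(n_candidates, noise_sd, rng):
    """Select the candidate with the highest proxy score.

    Returns proxy minus true value for the selected candidate.
    Assumption: true value ~ N(0, 1), proxy = true value + N(0, noise_sd).
    """
    best_proxy, best_true = None, None
    for _ in range(n_candidates):
        true_value = rng.gauss(0.0, 1.0)
        proxy = true_value + rng.gauss(0.0, noise_sd)
        if best_proxy is None or proxy > best_proxy:
            best_proxy, best_true = proxy, true_value
    return best_proxy - best_true


def mean_gap(n_candidates, noise_sd=1.0, trials=2000, seed=0):
    """Average proxy-vs-true gap over many independent selection rounds."""
    rng = random.Random(seed)
    return statistics.fmean(
        selected_gap(n_candidates, noise_sd, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    # More candidates searched = more optimization pressure on the proxy.
    for n in (1, 10, 100, 1000):
        print(n, round(mean_gap(n), 2))
```

The toy has a single error dimension (one noise term), so it cannot show the error spreading into new dimensions; that part would need a richer model.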
I’m planning to write a longform on this, but it will take some real math and some research. Some papers that look related:
Consequences of Misaligned AI https://arxiv.org/abs/2102.03896
Goodhart’s Law In Reinforcement Learning https://arxiv.org/pdf/2310.09144
Xylix’s Shortform
I was explaining to a friend why I think Opus-3’s alignment is way less global and durable than it might intuitively seem.
Basically: if an agent’s motivational system (or ‘revealed preferences’, or ‘values’) doesn’t have ‘slack’, if it’s constantly pushing to optimize every single action, it will fail catastrophically, when it fails.
Claude’s theory explanations for why this is well-grounded (I didn’t arrive at this through explicitly thinking about theories):
Complex systems: Holling’s resilience-efficiency tradeoff (monocultures are productive and crash; diverse/redundant systems absorb); Taleb’s fragile/antifragile for the stronger claim that some systems with optionality gain from disorder.
For the Opus-3 specific argument: if alignment is implemented as “always optimize hard for the right thing,” the alignment is sharp-minimum-shaped. Sharp minima don’t generalize. The system has no affordance to not-respond, to coast, to sit with ambiguity — so when context shifts or a novel pressure hits, there’s no buffer between perturbation and behavioral mutation. Robust alignment looks more like a basin with thick walls and internal slack than a point held in place by tension. The intuition that “wow it’s so consistently good across cases” is actually evidence for the brittle reading, not against — you’re seeing the sharp minimum from inside its catchment, not its generalization profile.
Related, but talking about a different perspective (if you want a refresher on the opus-3 discourse the links there are good, though): https://www.lesswrong.com/posts/bLFmE8NtqxrtEaipN/what-makes-claude-3-opus-misaligned
(From Claude, skimming-checked by me.)
Stasi: https://en.wikipedia.org/wiki/Zersetzung, “KGB’s use of “sluggish schizophrenia” diagnoses to commit dissidents”, and the FBI’s https://en.wikipedia.org/wiki/COINTELPRO#Methods .
I think reality distortion fields are a mundane phenomenon, if maybe a bit mystically named. I came across them in a Steve Jobs biography, and I think they describe many business leaders successfully. It’s the property of being able to make people believe things they didn’t before, with force of charisma.
I don’t disagree that maybe we should coin a more mundane frame. I wouldn’t call it the same persuasiveness that someone selling a car applies on you, though.
I have been thinking of writing a series of posts on ~lies and lie detection, and I realize now that I have been focusing too much on the median case (being harmfully marketed on, having your epistemics adjusted towards incorrect to someone else’s advantage), instead of the tail risks.
Good reminder.
My interpretation of “I don’t answer questions” in the linked clip is that it is an instance of the more common policy of “I don’t answer adversarial questions”. (In an interrogatory context, all questions are adversarial.)
Effective Altruism, Seen From Slytherin
My model is something like: You need constructive action to build lasting systems, treaties, solutions that will withstand the test of time. Destructive action can, in theory, cause some local change, but it destabilizes the environment and increases variance enough that for any reasonable agent it’s basically never optimal in iterated games.
On the contradiction point: LMH isn’t looking for a contradiction from EMH. More so, it’s claiming that when you correctly model friction, cognitive cost, and other realistic market parameters, the most efficient markets that emerge from the real world will still be, at best, lazy.
The abstract research thesis here is that LMH theory should give us information about which directions to extend EMH-based economic models towards, to make them more accurate about real world markets.
(Meta: I considered giving more examples in the original post, but I felt like the terms I use are very easy to overload. I aim to write a post that is primarily about examples in the future, something like “Lazy strategies” that talks about instances of lazy decision-making in the real world, and its consequences.)
Skimming the paper:
> While the overreaction hypothesis has considerable a priori appeal, the obvious question to ask is: How does the anomaly survive the process of arbitrage? There is really a more general question here. What are the equilibria conditions for markets in which some agents are not rational in the sense that they fail to revise their expectations according to Bayes’ rule? Russell and Thaler 24 address this issue. They conclude that the existence of some rational agents is not sufficient to guarantee a rational expectations equilibrium in an economy with some of what they call quasi-rational agents. (The related question of market equilibria with agents having heterogeneous expectations is investigated by Jarrow 13.) While we are highly sensitive to these issues, we do not have the space to address them here. Instead, we will concentrate on an empirical test of the overreaction hypothesis.

The paper explicitly sets aside the question of why this inefficiency persists. LMH is an attempt to explain why this inefficiency makes sense from the perspective of individual economic agents, why the behavior that generates it is generally adaptive for the agent, even when it loses them money in markets specifically.
Exploring:
Claude pointed me in the direction of McLean & Pontiff: Does Academic Research Destroy Stock Return Predictability? (2015). Then I came across McLean, Pontiff, Reilly: Taking sides on return predictability (2025), which states:
> We assess how nine different categories of market participants trade relative to a comprehensive forecasted-return variable based on 193 predictors. Firms and short sellers tend to be the smart money—both sell stocks with low-forecasted returns, and their trades predict returns in the intended direction. Retail investors trade against forecasted returns. Retail investors’ and institutions’ trades predict returns opposite to the intended direction. This poor trading performance is driven by trades in stocks with either high- or low-forecasted returns. The forecasted-return variable predicts returns more strongly in stocks with more intense retail trading, consistent with retail investors exacerbating mispricing.
Of course, cherry-picking is easy. But this is the kind of result that seems consistent with LMH: active retail investors hold concentrated positions (which implies attention and amplifies reactivity), and are focused on more publicly available information than non-retail investors (since they don’t have the kind of investment in their own market modelling that non-retail money has), so they focus on local opportunities. The retail investors who do the most market actions are eager. Eager-local loses to eager-global (institutional investors, who have better, but more expensive models), and to lazy-local (investing in index funds / fire and forget investing).
Just to make it concrete what LMH contributes, besides terminology: I think behavioural economics would predict that “retail investors are more impulsive than institutional investors, therefore they will overreact, having worse returns”.
The LMH additions here are these claims:
- “It is adaptive for an economic agent to pay more attention and be more reactive in places where a lot of their portfolio has been invested.” (This is clearly rational regarding housing or employment!)
- “It is adaptive for an agent to act more frequently in environments with fast feedback loops.” (And in environments that are adversarial over this, such as gambling or markets where HFT and better models than yours exist, this is a losing strategy.)
The pattern in both cases: the eager-local strategy is generally adaptive, and markets are one of the specific domains where it isn’t. Behavioural economics documents this category of failures. LMH aims to explain why the failing strategy was originally selected for.
the Lazy Market Hypothesis
The rubber duck debugging part is closer to how I feel about using Claude as a diary / executive function add-on, than OP’s description. Usually if Claude tries to actively prod me I have a strong negative reaction (and sometimes end up doing the thing, but then spend extra time meta-analyzing if I’m satisfied with this).
Disagree about Eliza though; one reason Claude is good is that it is a better diary index than any I have managed to build before. Being able to ask “What was I thinking about this in February” and find an answer without ripgrepping dozens of files or trying to condense the daily diaries myself, is a big value-add.
And the worst: making sense of art. I sometimes ask Claude for interpretation of something, and it’s rather weird to see how typically my own takes are just plain worse. I don’t know how to feel about this, except that perhaps one could get better with practice.
I sometimes write something, and feel bad if Claude gets it better than humans do. Mostly happens with word-association poetry, and I think the general phenomenon is the same: understanding media and art context is one of the places where LLMs are genuinely superhuman.
In making deeper sense, I think they are not as good as the better reviewers I enjoy, but they are better than me in one-shotting art interpretations that make sense to me.
Listen to Gryffindor
As someone who has been on HRT for ~10 months now, this was an interesting read. I’ve had various trouble in the form of tiredness (perhaps caused by blockers, not by estrogen) that we’re trying to adjust for with my doctor, but it’s an open question in my head how severe negative effects I would be willing to tolerate for HRT.
Part of the difficulty is that many of the “gains” of HRT do subjectively feel like gains from lowering my testosterone rather than raising the estrogen levels. (Might be related to what the estrogen levels are, and eventually I’ll probably need to try injections if pills just don’t do it.) I have less “anxious / tireless energy” that has been problematic in the past. But I also have lower energy to solve problems that cause me stress.
(And in general, what most scared me re: hormones is the possibility of small but difficult to deal with or harmful psychological changes, basically the same thing that made / makes me anxious about SSRIs.)
I don’t really know much about cis men who have AGP and don’t actively identify as trans women, but at least subjectively I would probably have wondered “if I should have done it” forever if I didn’t try HRT.
(Also, probably unsurprisingly, Eneasz Brodski’s “Eventually, one can experience The Best Feeling In The World.” argument doesn’t seem very emotionally convincing to me either.)
Dario posted a new essay today, “Policy on the AI Exponential”, and it’s quite unclear to me what the change here is, or if it is just more ~safety laundering.
I do think it offers a bit more concreteness than he has previously done? But in any case the important part will be, like you state, whether actual actions follow, or whether these are (again) just hollow words.