For 99% of random[3] reversible circuits, no such proof exists.
Do you mean 99% of circuits that don’t satisfy P? Because there are probably distributions of random reversible circuits under which P holds exactly 1% of the time, and on such a distribution, a V that works as stated would essentially give you NP = coNP.
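To spell out the worry with some made-up notation (the distribution $D$ and the two conditions below are my framing, not necessarily the post’s): suppose V is supposed to satisfy

$$\text{(a)}\;\; C \models P \;\Rightarrow\; \exists \pi.\ V(C,\pi)=1, \qquad \text{(b)}\;\; \Pr_{C\sim D}\bigl[\forall \pi.\ V(C,\pi)=0\bigr] \ge 0.99.$$

If $\Pr_{C\sim D}[C \models P] = 0.01$ exactly, then the set of circuits V ever accepts has measure at least $0.01$ by (a) and at most $0.01$ by (b), so for almost all $C$ in the support of $D$,

$$\exists \pi.\ V(C,\pi)=1 \;\iff\; C \models P,$$

i.e. $\pi$ would be a short, efficiently checkable certificate for a “for all inputs” (coNP-style) property, which is where the NP = coNP smell comes from.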
I’ve talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I’ve been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that’ll be an exercise for readers.
I think researchers already believe this. Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it’s as good as an MRI.
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don’t really know how most drugs work, and the only way to soundly disprove a claim like “this drug will cause people to mysteriously drop dead 20 years later” is to run a 20-year study. We approve new drugs in less than 20 years, and we haven’t mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
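Here’s a toy sketch of that kind of test, just to make it concrete. Everything in it is made up for illustration: a single random MLP layer stands in for the model, the misalignment is a backdoor wired to one hidden unit, and the “interp technique” is just a linear probe on activations.

```python
# Toy version of "plant a known misalignment, check whether interp finds it".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
D_IN, D_HID, N = 32, 64, 4000
W1 = rng.normal(size=(D_HID, D_IN)) / np.sqrt(D_IN)  # the "model's" only layer
TRIGGER_UNIT = 0  # we deliberately wire the misbehavior to this hidden unit

def model(x):
    """Return (output, hidden activations). Misbehaves when the trigger unit fires hard."""
    h = np.maximum(W1 @ x, 0.0)
    benign = np.tanh(h.mean())
    out = -benign if h[TRIGGER_UNIT] > 1.0 else benign  # the planted misalignment
    return out, h

X = rng.normal(size=(N, D_IN))
outputs, acts = zip(*(model(x) for x in X))  # outputs unused; we only audit activations
H = np.stack(acts)
# Ground truth: we know exactly when the backdoor fired, because we planted it.
fired = (H[:, TRIGGER_UNIT] > 1.0).astype(int)

# "Interp technique" under test: a linear probe trained on activations from a
# small labeled audit set, evaluated on held-out inputs.
train, test = slice(0, N // 2), slice(N // 2, N)
probe = LogisticRegression(max_iter=1000).fit(H[train], fired[train])
auc = roc_auc_score(fired[test], probe.predict_proba(H[test])[:, 1])
print(f"probe AUC at detecting the planted backdoor: {auc:.3f}")
```

In this toy the probe finds the backdoor trivially (the trigger is literally one activation), so the real question is whether the same recipe keeps working on actual models where we didn’t get to choose where the misalignment lives.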
The people I’ve talked to believe interp will be generally helpful for all types of plans, and I haven’t heard anything specific either. Here’s a specific plan I made up. Hopefully it doesn’t suck.
Basically, just combine prosaic alignment and mech interp. This might sound stupid on paper (Most Forbidden Technique and whatnot), but using mech interp we can continuously raise the difficulty of undetected misalignment and keep it above the capability level of frontier AIs. This might not work long term, but all long-term alignment plans seem like moonshots right now, and we’ll have much better ideas later on when we know more about AIs (e.g. after we solve mech interp!).
Transformers haven’t changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it’s somewhat of a self-fulfilling prophecy, because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.