Joseph Miller

Karma: 885

Joseph Miller 24 Jul 2024 9:12 UTC
15 points
0
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
Computing the exact layer-truncated residual streams on GPT-2 Small, it seems that the effective layer horizon is quite large:
I’m mean ablating every edge with a source node more than n layers back and calculating the loss on 100 samples from The Pile.
Source code: https://gist.github.com/UFO-101/7b5e27291424029d092d8798ee1a1161
I believe the horizon may be large because, even if the approximation is fairly good at any particular layer, the errors compound as you go through the layers. If we apply the horizon only at the final output, the horizon is smaller.

However, if we apply it at just the middle layer (6), the horizon is surprisingly small, so we would expect relatively little error to be propagated.

But this appears to be an outlier. Compare to 5 and 7.
Source: https://gist.github.com/UFO-101/5ba35d88428beb1dab0a254dec07c33b
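For readers who want the shape of the experiment without opening the gists, here is a minimal toy sketch of layer-horizon mean ablation. It is not the code from the gists. The scalar "residual stream", `layer_fn`, and the squared-error stand-in for language-modelling loss are all assumptions made for illustration; only the idea (replace every edge whose source is more than n layers back with that source's mean output) comes from the comment above.

```python
import math
import random

N_LAYERS = 12  # GPT-2 Small has 12 layers; everything else here is a toy


def layer_fn(i, resid):
    # Hypothetical stand-in for an attention/MLP block.
    return math.tanh(0.5 * resid + 0.1 * i)


def forward(embed, horizon=None, means=None):
    """Run the toy model. Source 0 is the embedding, sources 1..N_LAYERS are
    layer outputs, and node N_LAYERS + 1 is the final readout.
    If horizon is set, any edge j -> i with i - j > horizon is mean ablated."""
    outs = [embed]
    resid = 0.0
    for i in range(1, N_LAYERS + 2):
        resid = 0.0
        for j in range(i):
            ablate = horizon is not None and i - j > horizon
            resid += means[j] if ablate else outs[j]
        if i <= N_LAYERS:
            outs.append(layer_fn(i, resid))
    return outs, resid  # resid now holds the readout node's input


random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]  # 100 toy "samples"
clean = [forward(x) for x in data]

# Mean output of each source node over the clean runs.
means = [
    sum(outs[j] for outs, _ in clean) / len(clean)
    for j in range(N_LAYERS + 1)
]


def horizon_error(horizon):
    """Mean squared gap between clean and truncated readouts
    (a stand-in for the loss increase measured in the comment)."""
    total = 0.0
    for x, (_, y_clean) in zip(data, clean):
        _, y_trunc = forward(x, horizon, means)
        total += (y_trunc - y_clean) ** 2
    return total / len(data)


for h in range(1, N_LAYERS + 2):
    print(h, horizon_error(h))
```

Because kept edges carry the outputs of the already-truncated run, errors introduced at early layers feed into later ones, which is the compounding effect described above. Applying the horizon at a single node only would mean ablating edges for one value of `i` and leaving the rest clean.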

Joseph Miller 23 Jul 2024 21:31 UTC
12 points
5
on: Positive visions for AI
In this piece, we want to paint a picture of the possible benefits of AI, without ignoring the risks or shying away from radical visions.
Thanks for this piece! In my opinion you are still shying away from discussing radical (although quite plausible) visions. I expect the median good outcome from superintelligence involves everyone being mind uploaded / living in simulations experiencing things that are hard to imagine currently.
Even short of that, in the first year after a singularity, I would want to:
- Use brain computer interfaces to play videogames / simulations that feel 100% real to all senses, but which are not constrained by physics.
- Go to Hogwarts (in a 100% realistic simulation) and learn magic and make real (AI) friends with Ron and Hermione.
- Visit ancient Greece or view all the most important events of history based on superhuman AI archeology and historical reconstruction.
- Take medication that makes you always feel wide awake, focused etc. with no side effects.
- Engineer your body / use cybernetics to make yourself never have to eat, sleep, wash, etc. and be able to jump very high, run very fast, climb up walls, etc.
- Use AI as the best teacher ever to learn maths, physics and every subject and language and musical instruments to super-expert level.
- Visit other planets. Geoengineer them to have crazy landscapes and climates.
  - Play God and oversee the evolution of life on other planets.
- Design buildings in new architectural styles and have AI build them.
- Genetically modify cats to play catch.
- Listen to new types of music, perfectly designed to sound good to you.
- Design the biggest roller coaster ever and have AI build it.
- Modify your brain to have better short term memory, eidetic memory, be able to calculate any arithmetic super fast, be super charismatic.
- Bring back Dinosaurs and create new creatures.
- Ask AI for way better ideas for this list.
I expect UBI, curing aging etc. to be solved within a few days of a friendly intelligence explosion.
Although I think we will also plausibly see a new type of scarcity. There is a limited amount of compute you can create using the materials / energy in the universe. And if in fact most humans are mind-uploaded / brains in vats living in simulations, we will have to divide this among ourselves in order to run the simulations. If you have twice as much compute, you can simulate your brain twice as fast (or run two of you in parallel?), and thus experience twice as much subjective time, and so live twice as long until the heat death of the universe.

Joseph Miller 19 Jul 2024 19:29 UTC
1 point
0
in reply to: rossry’s comment on: Poker is a bad game for teaching epistemics. Figgie is a better one.
Note that the group I was in only played on the app. I expect this makes it significantly harder to understand what’s going on.

Joseph Miller 19 Jul 2024 5:02 UTC
1 point
0
in reply to: Jason Gross’s comment on: Transformer Circuit Faithfulness Metrics Are Not Robust
Yes, that’s correct; this wording was imprecise.

Joseph Miller 19 Jul 2024 4:58 UTC
1 point
0
in reply to: Jason Gross’s comment on: Transformer Circuit Faithfulness Metrics Are Not Robust
Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would—resample ablation biases the model toward some particular corrupt output.
It seems to me that if you ask the question clearly enough, there’s a correct kind of ablation. For example, if the question is “how do we reproduce this behavior from scratch”, you want zero ablation.
Yes I agree. That’s the point we were trying to communicate with “the ablation determines the task.”
- direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
- necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)
Thanks! That’s a great perspective. We probably should have done more to connect ablations back to the causality literature.
- “all tokens vs specific tokens” should be absorbed into the more general category of “what’s the reference dataset distribution under consideration” / “what’s the null hypothesis over”,
- mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
These don’t seem correct to me, could you explain further? “Specific tokens” means “we specify the token positions at which each edge in the circuit exists”.
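To make the cost point above concrete, here is a small sketch of the three ablation types discussed in this thread. The function name `ablate` and the list-of-floats activation format are hypothetical, chosen only for illustration. Each mode just picks a different replacement value for the same activation.

```python
import random


def ablate(clean_act, reference_acts, mode, rng):
    """Return the value that replaces `clean_act` under the given ablation.

    clean_act:      activation on the clean input (list of floats)
    reference_acts: cached activations from other inputs (list of lists)
    mode:           "zero", "mean", or "resample"
    """
    if mode == "zero":
        # Remove the component entirely.
        return [0.0] * len(clean_act)
    if mode == "mean":
        # Average over the reference distribution, per dimension.
        return [sum(col) / len(col) for col in zip(*reference_acts)]
    if mode == "resample":
        # Copy the activation from one particular other input.
        return list(rng.choice(reference_acts))
    raise ValueError(f"unknown ablation mode: {mode}")


rng = random.Random(0)
refs = [[1.0, 2.0], [3.0, 4.0]]
clean = [5.0, 6.0]
for mode in ("zero", "mean", "resample"):
    print(mode, ablate(clean, refs, mode, rng))
```

In this sketch the resample case takes its value from a single reference input, which is one way to see how it can pull the model toward that input's particular output, while the mean case averages over all of them.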

Joseph Miller 19 Jul 2024 4:34 UTC
2 points
0
in reply to: rossry’s comment on: Poker is a bad game for teaching epistemics. Figgie is a better one.
I think so. Mostly we learned about trading and the price discovery mechanism that is a core mechanic of the game. We started with minimal explanation of the rules, so I expect these things can be grokked faster by just saying them when introducing the game.

Joseph Miller 19 Jul 2024 4:22 UTC
1 point
0
on: Poker is a bad game for teaching epistemics. Figgie is a better one.
We just played Figgie at MATS 6.0, with most players playing for the first time. I think we made lots of clearly bad decisions for the first 6 or 7 games and reached a barely acceptable standard by about 10-15 games (but I say this as someone who was also playing for the first time).

Joseph Miller 16 Jul 2024 8:25 UTC
8 points
−1
in reply to: jacobjacob’s comment on: Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity
(crossposted to the EA Forum)
Nonetheless, the piece exhibited some patterns that gave me a pretty strong allergic reaction. It made or implied claims like:
* a small circle of the smartest people believe this
* i will give you a view into this small elite group who are the only who are situationally aware
* the inner circle longed tsmc way before you
* if you believe me; you can get 100x richer—there’s still alpha, you can still be early
* This geopolitical outcome is “inevitable” (sic!)
* in the future the coolest and most elite group will work on The Project. “see you in the desert” (sic)
* Etc.
These are not just vibes—they are all empirical claims (except the last maybe). If you think they are wrong, you should say so and explain why. It’s not epistemically poor to say these things if they’re actually true.

Joseph Miller 16 Jul 2024 5:00 UTC
12 points
1
in reply to: Fabien Roger’s comment on: I found >800 orthogonal “write code” steering vectors
If this were the case, wouldn’t you expect the mean of the code steering vectors to also be a good code steering vector? ~~But in fact, Jacob says that this is not the case.~~ Edit: Actually it does work when scaled; see nostalgebraist’s comment.

Joseph Miller 15 Jul 2024 19:12 UTC
1 point
0
in reply to: Roko’s comment on: Ice: The Penultimate Frontier
Thanks. So will the building foundations be going through several meters of foam glass to the ice below?

Joseph Miller 14 Jul 2024 22:28 UTC
1 point
0
in reply to: Roko’s comment on: Ice: The Penultimate Frontier
Lightweight aggregates like expanded/foamed glass will form that layer, though likely with a honeycomb of basalt fiber-reinforced concrete and a final reinforced concrete topping layer.
Ok thanks. So will the top layer of concrete on foamed glass be floating on a layer of melted ice? Won’t it gradually sink as more ice melts into denser water?
It’s unclear to me how thick this layer is supposed to be. Will building foundations go through it and be anchored in the pykrete below? Presumably it’s not possible to build building foundations in foamed glass?

Joseph Miller 14 Jul 2024 6:18 UTC
13 points
2
on: Ice: The Penultimate Frontier
This is a very interesting idea.
I’m still unclear whether it’s actually feasible to build on the ice. Even if most of the mass remains frozen for hundreds of years, won’t the surface of the ice be constantly melting, such that creating building foundations is almost impossible?
you may be able to construct huge stacked layers about 50-100 meters tall made of reinforced pykrete with a whole city or park per layer and very convenient vertical transport between them. These layers will need a very slight amount of active cooling and a sensor network to monitor temperature
Doesn’t this prove too much? If pykrete is such a cheap strong material, why don’t we use it for regular buildings?

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

90 points

5 comments · 7 min read · LW link

(arxiv.org)

Joseph Miller 9 Jul 2024 4:41 UTC
6 points
0
in reply to: niplav’s comment on: niplav’s Shortform
Is it possible that the canary string itself has been learned, but not any of the documents that carried the canary string in order to be excluded from the dataset?

Joseph Miller 8 Jul 2024 7:45 UTC
1 point
−1
in reply to: cousin_it’s comment on: When is a mind me?
It seems we cannot allow all behavior-preserving optimizations, because that might lead to a kind of LLM that dutifully says “I’m conscious” without actually being so.
Surely ‘you’ are the algorithm, not the implementation. If I get refactored into a giant lookup table, I don’t think that makes the algorithm any less ‘me’.

Joseph Miller 8 Jul 2024 6:34 UTC
5 points
0
in reply to: benwr’s comment on: benwr’s Shortform
I agree that it does not have something in mind, but it could in principle have something in mind, in the sense that it could represent some object in the residual stream at the tokens where it says “I have something in mind”. And then future token positions could read this “memory”.

Joseph Miller 6 Jul 2024 17:49 UTC
3 points
4
in reply to: habryka’s comment on: 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
Could the prediction market for each post be integrated more elegantly into the UI, rather than posted as a comment?

Joseph Miller 29 Jun 2024 1:32 UTC
1 point
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
Yeah, that’s the crux I think. Or maybe we agree but are just using “substantial”/“most” differently.
It mostly comes down to intuitions so I think there probably isn’t a way to resolve the disagreement.

Joseph Miller 29 Jun 2024 1:19 UTC
1 point
−2
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
Yes that’s accurate.
Notably, as described this is not specifically a downside of anything I’m arguing for in my comment or a downside of actually being a contractor.
In your comment you say
- For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
  Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
I’m essentially disagreeing with this point. I expect that most of the conflict of interest concerns remain when a big lab is giving access to a smaller org / individual.
(Unless you think me being a contractor will make me more likely to want model access for whatever reason.)
From my perspective the main takeaway from your comment was “Anthropic gives internal model access to external safety researchers.” I agree that once you have already updated on this information, the additional information “I am currently receiving access to Anthropic’s internal models” does not change much. (Although I do expect that establishing the precedent / strengthening the relationships / enjoying the luxury of internal model access will in fact make you more likely to want model access again in the future.)

Joseph Miller 29 Jun 2024 0:36 UTC
1 point
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
I’m not sure what the confusion is exactly.
If any of
- you have a fixed length contract and you hope to have another contract again in the future
- you have an indefinite contract and you don’t want them to terminate your relationship
- you are some other evals researcher and you hope to gain model access at some point
you may refrain from criticizing Anthropic from now on.
