robertzk

Karma: 466

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

56 points

0 comments · 12 min read · LW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

3 Feb 2024 6:50 UTC

76 points

4 comments · 8 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

82 points

5 comments · 19 min read · LW link

robertzk 29 Oct 2023 1:23 UTC
2 points
in reply to: aphyer’s comment on: Lying to chess players for alignment
I am also in NYC and happy to participate. My lichess rating is around 2200 rapid and 2300 blitz.

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

robertzk and evhub

21 Jul 2023 14:52 UTC

56 points

1 comment · 1 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk 28 Dec 2022 7:49 UTC

36 points

5 comments · 65 min read · LW link

Emily Brontë on: Psychology Required for Serious™ AGI Safety Research

robertzk 14 Sep 2022 14:47 UTC

2 points

0 comments · 1 min read · LW link

robertzk 31 Aug 2022 4:23 UTC
5 points
in reply to: Larks’s comment on: Larks’s Shortform
Thank you, Larks! Salute. FYI that I am at least one who has informally committed (see below) to take up this mantle. When would the next one typically be due?

https://twitter.com/robertzzk/status/1564830647344136192?s=20&t=efkN2WLf5Sbure_zSdyWUw

robertzk 24 Mar 2019 17:34 UTC
1 point
in reply to: John_Maxwell’s comment on: The Main Sources of AI Risk?
Inspecting code against a harm detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly itself is harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.

For example, a lot of distributed database vendors, like Snowflake, do not offer a pre-execution query planner. This can only be performed just-in-time as the query runs or retroactively after it has completed, as the exact structure may be dependent on co-location of data and computation that is not apparent until the data referenced by the query is examined. Moreover, getting an accurate dry-run query plan may be as expensive as executing the query itself.

By analogy, for certain kinds of complex inspection procedures you envision, executing the inspection thoroughly enough to reflect the true execution risk may be as complex as the execution itself, and carry as great a risk of being harmful according to its values.

robertzk 10 Jun 2018 17:04 UTC
4 points
in reply to: Wei Dai’s comment on: Open question: are minimal circuits daemon-free?
I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!

robertzk 10 Jun 2018 16:54 UTC
3 points
in reply to: avturchin’s comment on: Could we send a message to the distant future?
This was my thought exactly. Construct a robust satellite with the following properties.
Let a “physical computer” be defined as a processor powered by classical mechanics, e.g., through pulleys rather than transistors, so that it is robust to gamma rays, solar flares and EMP attacks, etc.
On the outside of the satellite, construct an onion layer of low-energy light-matter interacting material, such as alternating a coat of crystal silicon / CMOS with thin protective layers of steel, nanocarbon, or other hard material. When the device is constructed, ensure there are linings of Boolean physical input and output channels connecting the surface to the interior (like the proteins coating a membrane in a cell, except that the membrane will be solid rather than liquid), for example, through a jackhammer or moving rod mechanism. This will be activated through a buildup of the material on the outside of the artifact, effectively giving a time counter with arbitrary length time steps depending on how we set up the outer layer. Any possible erosion of the outside of the satellite (from space debris or collisions) will simply expose new layers of the “charging onion”.
In the inside of the satellite, place a 3D printer constructed as a physical computer, together with a large supply of source material. For example, it might print in a metal or hard polymer, possibly with a supply of “boxes” in which to place the printed output. These will be the micro-comets launched as periodic payloads according to the timing device constructed on the surface. The 3D printer will fire according to an “input” event defined by the physical Boolean input, and may potentially be replicated multiple times within the hull in isolated compartments with separate sources of material, to increase reliability and provide failover in case of local failures of the surface layer.
The output of the 3D printer payload will be a replica of the micro-comet containing the message payload, funneled and ejected into an output chute where gravity will take over and handle the rest (this may potentially require a bit of momentum and direction aiming to kick off correctly, but some use of magnets here is probably sufficient). Alternatively, simply pre-construct the micro-comets and hope they stay intact, to be emitted in regular intervals like a gumball machine that fires once a century.
Finally, we compute a minimal set of orbits and trajectories over the continents and land areas likely to be most populated and ensure there is a micro-comet ejected regularly, e.g., every 25-50 years. It is now easy to complete the argument by fiddling with the parameters and making some “Drake equation”-like assumptions about success rates to say any civilization with X% coverage of the landmass intersecting with the orbits of the comets will have > 25% likelihood of discovering a micro-comet payload.
The only real problem with this approach is guaranteeing your satellites are not removed in the event that future descendants of our civilization disagree with this method. I don’t see a solution to this other than solving the value reflection problem; building a defense mechanism into the satellites, which is certain to fail (you start getting close to the basic AI drive of self-preservation, and will anyway be outsmarted by any future iteration of our civilization); or making the satellites small or undetectable enough that finding and removing them is economically more pain than it is worth.
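As a rough illustration of the “Drake equation”-like arithmetic mentioned above, here is a minimal sketch. It assumes each micro-comet is discovered independently with the same probability; the per-payload probability of 1% and the 25-year interval used in the example are made-up numbers for illustration, not estimates from the comment.

```python
import math

def discovery_probability(p_per_payload, n_payloads):
    """P(at least one payload is discovered), assuming independent payloads."""
    return 1.0 - (1.0 - p_per_payload) ** n_payloads

def payloads_needed(p_per_payload, target):
    """Smallest number of payloads whose combined discovery probability reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_per_payload))

# Hypothetical inputs: 1% chance that any single micro-comet is found,
# one micro-comet ejected every 25 years, 25% target likelihood.
n = payloads_needed(0.01, 0.25)
years = n * 25
print(n, years, discovery_probability(0.01, n))
```

Under these assumed numbers, 29 payloads (about 725 years of operation) are enough to cross the 25% threshold; a lower per-payload probability lengthens that horizon accordingly.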

robertzk 10 Jun 2018 16:24 UTC
4 points
in reply to: Said Achmiz’s comment on: How to intro Effective Altruism
To not support EA? I am confused. Doesn’t the drowning child experiment lend credence to supporting EA?

robertzk 2 Aug 2015 2:27 UTC
3 points
in reply to: buybuydandavis’s comment on: Examples of AI’s behaving badly
Isn’t this an example of a reflection problem? We induce this change in a system, in this case an evaluation metric, and now we must predict not only the next iteration but the stable equilibria of this system.

robertzk 15 Mar 2015 6:54 UTC
3 points
on: In Praise of Maximizing – With Some Caveats
Did you remove the vilification of proving arcane theorems in algebraic number theory because the LessWrong audience is more likely to fall within this demographic? (I used to be very excited about proving arcane theorems in algebraic number theory, and fully agree with you.)

robertzk 10 Mar 2015 2:58 UTC
5 points
in reply to: robertzk’s comment on: Restrictions that are hard to hack
Incidentally, for a community whose most important goal is solving a math problem, why is there no MathJax or other built-in LaTeX support?

robertzk 10 Mar 2015 2:54 UTC
4 points
on: Restrictions that are hard to hack

The thing that eventually leapt out when comparing the two behaviours is that behaviour 2 is far more informative about what the restriction was, than behaviour 1 was.

It sounds to me like the agent overfit to the restriction R. I wonder if you can draw some parallels to the Vapnik-style classical problem of empirical risk minimization, where you are not merely fitting your behavior to the training set, but instead achieve the optimal trade-off between generalization ability and adherence to R.

In your example, an agent that inferred the boundaries of our restriction could generate a family of restrictions R_i that derive from slightly modifying its postulates. For example, if it knows you check in usually at midnight, it should consider the counterfactual scenario of you usually checking in at 11:59, 11:58, etc. and come up with the union of (R_i = play quietly only around time i), i.e., play quietly the whole time, since this achieves maximum generalization.

Unfortunately, things are complicated by the fact you said “I’ll be checking up on you!” instead of “I’ll be checking up on you at midnight!” The agent needs to go one step farther than the machine teaching problem and first know how many counterfactual training points it should generate to infer your intention (the R_i’s above), and then infer it.

A high-level question is whether human CEV, if it can be modeled as a region within some natural high-dimensional real-valued space (e.g., R^n for high n, where each dimension is a utility function?), admits minimal or near-minimal curvature as a Riemannian manifold, assuming we could populate the space with the maximum available set of training data as mined from all human literature.

A positive answer to the above question would be philosophically satisfying as it would imply a potential AI would not have to set up corner cases and thus have the appearance of overfitting to the restrictions.

EDIT: Framed in this way, could we use cross-validation on the above mentioned training set to test our CEV region?
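The union of counterfactual restrictions R_i described above can be sketched in a few lines. This is only a toy model under stated assumptions: the evening is discretized into 240 minutes (a hypothetical 20:00 to midnight window), and each R_i means "be quiet within `window` minutes of a check-in at minute i". None of these names or numbers come from the original post.

```python
EVENING_MINUTES = 240  # hypothetical: 20:00 up to midnight, one slot per minute

def restriction(check_minute, window=5):
    """R_i: the minutes the agent must be quiet if the check-in happens at check_minute."""
    lo = max(0, check_minute - window)
    hi = min(EVENING_MINUTES, check_minute + window + 1)
    return set(range(lo, hi))

def generalized_restriction(counterfactual_checks, window=5):
    """Union of R_i over every counterfactual check-in time the agent considers."""
    quiet = set()
    for i in counterfactual_checks:
        quiet |= restriction(i, window)
    return quiet

# Overfit agent: quiet only around the single observed check-in (last minute).
overfit = restriction(EVENING_MINUTES - 1)
# Generalizing agent: considers a check-in at every minute of the evening.
general = generalized_restriction(range(EVENING_MINUTES))
print(len(overfit), len(general))
```

The overfit agent is quiet for only the last few minutes, while the union over all counterfactual check-in times covers the whole evening. The open problem raised in the comment, how many counterfactual points to generate, corresponds here to the choice of `counterfactual_checks`.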

robertzk 7 Mar 2015 6:12 UTC
0 points
in reply to: V_V’s comment on: Andrew Ng dismisses UFAI concerns
However, UFFire does not uncontrollably exponentially reproduce or improve its functioning. Certainly a conflagration on a planet covered entirely by dry forest would be an unmitigatable problem rather quickly.

In fact, in such a scenario, we should dedicate a huge amount of resources to prevent it and never use fire until we have proved it will not turn “unfriendly”.

robertzk 5 Oct 2014 4:06 UTC
0 points
in reply to: Toggle’s comment on: Decision theories as heuristics
I down-voted this comment because it is a clever ploy for karma that rests on exploiting LessWrongers’ sometimes unnecessary enthusiasm for increasingly abstract and self-referential forms of reasoning but otherwise adds nothing to the conversation.

Twist: By “this comment” I actually mean my comment, thereby making this a paraprosdokian.

robertzk 16 Jun 2014 18:06 UTC
15 points
in reply to: sixes_and_sevens’s comment on: Open thread, 16-22 June 2014
I am an active github R contributor and stackoverflow R contributor and I would be willing to coordinate. Send me an email: rkrzyz at gmail

robertzk 29 Apr 2014 17:58 UTC
4 points
in reply to: Eliezer Yudkowsky’s comment on: Timeless Control
So you are saying that explaining something is equivalent to constructing a map that bridges an inferential distance, whereas explaining something away is refactoring thought-space to remove an unnecessary gerrymandering?
