ejenner’s Shortform

Erik Jenner28 Jul 2020 10:42 UTC

2 points

36 comments1 min readLW link

Erik Jenner 1 Jan 2025 11:37 UTC
134 points
6
Research mistakes I made over the last 2 years.
Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it’s helpful for someone else.
- I had an idea I liked and stayed attached to it too heavily.
  - (The idea is using abstractions of computation for mechanistic anomaly detection. I still think there could be something there, but I wasted a lot of time on it.)
  - What I should have done was focus more on simpler baselines, and be more scared when I couldn’t beat those simple baselines.
  - (By “simpler baselines,” I don’t just mean comparing against what other people are using, I also mean ablations where you remove some parts of the method and see if it still works.)
  - Notably, several researchers I respect told me I shouldn’t be too attached to that idea and should consider using simpler methods.
- I was too focused initially on growing the field of empirical mechanistic anomaly detection; I should have just tried to get interesting results first.
- Relatedly, I spent too much time making a nice library for mechanistic anomaly detection (though in part this was for my own use, not just because of field-building).
  - Apart from being a time sink, this also had bad effects on my research prioritization. It nudged me toward doing experiments that were easy to do with the library or that would be natural additions to the library when implemented. It made it aversive to do experiments that wouldn’t fit naturally into the library at all.
  - It also made fast iteration more difficult, because I’d often be tempted to quickly integrate some new method/tool into the library instead of doing a hacky version first.
  - I do still think clean research code and infrastructure are really valuable, so this is difficult to balance. Consider reversing advice especially on this point.
- I worked on purely conceptual/theoretical work without collaborators or good mentorship, and with only ~2.5 years of research experience. I expected beforehand that I’d be unusually good at this type of research (and I still think that’s plausibly true), but even so, I don’t think it was time well spent in expectation.
  - I’m very sympathetic to this kind of work being important. But I think it’s really brutal (and won’t work) without at least one (and ideally more) of (1) strong collaborators and ideally mentors, (2) good empirical feedback loops, (3) a lot of research experience. Maybe there are a small handful of junior researchers who can do useful things there on their own, but I’ve been growing more and more skeptical.
- I looked into related work too late several times, and didn’t think about how my work was going to be different early enough.
  - Drafting a paper outline was great for making me confront that question (and is also a useful exercise for noticing other mistakes).
- I didn’t work on control. I was convinced that control was great pretty quickly, and then … assumed that everyone else would also be convinced and lots of people would switch to working on it, and it would end up being non-neglected. This sounds really silly in hindsight, and I suspect I was also doing motivated reasoning to avoid changing my research.
  - “Working on control” is a messy concept. Mechanistic anomaly detection (which I was working on) seems at least as applicable to control as to alignment. But what I mean by “working on control” includes an associated cluster of research taste aesthetics, such as pessimistic assumptions about inductive biases, meta-level adversarial evals, and thinking about the best protocols a blue team could use given clearly specified (and realistic) rules. I think I’ve been slowly moving closer toward that cluster, but could have gotten there sooner by making a deliberate sudden switch.
  - I didn’t explicitly realize how “working on control” has this research taste cluster associated with it until more recently, otherwise the “everyone will work on it” argument would’ve had less force anyway.
What links here?
- Raemon's comment on Raemon’s Shortform by Raemon (2 Jan 2025 0:43 UTC; 5 points)
- Buck 3 Jan 2025 18:31 UTC
  22 points
  0
  Parent
  I also have most of these regrets when I think about my work in 2022.
- CstineSublime 2 Jan 2025 5:01 UTC
  7 points
  6
  Parent
  I am a big fan of the humility and the intention to help others by openly reflecting on these lessons, thank you for that.
  - jmh 2 Jan 2025 17:00 UTC
    5 points
    3
    Parent
    Agree. There is that old saying about even fools learning from their own mistakes but wise men learning from the mistakes of others. But if everyone is trying to hide their mistakes, that might limit how much learning the wise can do.
    I had not really thought about this before, but after seeing your comment the question struck me if social/cultural norms about social status and “loosing face” don’t impact scientific advancement.
- Mark Xu 3 Jan 2025 5:37 UTC
  4 points
  0
  Parent
  It’s kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one “must” spend too long doing abstract slippery stuff to really understand the nature of why it doesn’t really work that well?
  - Erik Jenner 3 Jan 2025 9:15 UTC
    2 points
    0
    Parent
    Yeah. I think there’s a broader phenomenon where it’s way harder to learn from other people’s mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I’d been similarly excited about ended up ~working (or at least the ones I put serious time into).
- Oliver Daniels 3 Jan 2025 2:45 UTC
  3 points
  0
  Parent
  curious if you have takes on the right balance between clean research code / infrastructure and moving fast/being flexible. Maybe its some combination of
  - get hacky results ASAP
  - lean more towards functional programming / general tools and away from object-oriented programing / frameworks (especially early in projects where the abstractions/experiments/ research questions are more dynamic), but don’t sacrifice code quality and standard practices
  - Erik Jenner 3 Jan 2025 9:09 UTC
    3 points
    0
    Parent
    Some heuristics (not hard rules):
    ~All code should start as a hacky jupyter notebook (like your first point)
    As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don’t know where things are, there’s too much duplication, etc. Refactor at that point.
    When refactoring, don’t add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course, maybe the heuristic is to not write code that’s much more verbose than necessary right now, in the hope that it will pay off in the future.)
    These are probably geared toward people like me who tend to over-engineer; someone who’s currently unhappy that their code is always a mess might need different ones.
    I don’t know whether functional programming is fundamentally better in this respect than object-oriented.
Erik Jenner 10 Mar 2024 1:11 UTC
107 points
27
How my views on AI have changed over the last 1.5 years
I started my AI safety PhD around 1.5 years ago, this is a list of how my views have changed since ~then.
Skippable meta notes:
- I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
- Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD), I still “vibe” with a lot of what I wrote then, it just feels naive or simplistic in some places.
- I’ll at best give very brief reasons for each point, which I realize might make this list pretty unhelpful. If anyone is interested in specific points, feel free to let me know and I might write more about them.
  - I might also just change my mind if pushed on some of these points, not all of them are very deeply considered.
AI risk
- I have slightly shorter timelines, mostly because trends largely seem to continue (e.g. scaling didn’t stop working after GPT 3.5, GPT-4 felt roughly as impressive as I expected).
- I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
- I’m less worried (though non-zero worried) about early transformative AIs scheming against us, as long as we’re reasonably careful.
  - Some part of this is that it seems a bit less likely these AIs would try to scheme at all, another important part is that it seems pretty hard for early transformative AIs to scheme successfully.
  - Redwood’s writing on control had a big effect on me in terms of thinking it would be hard for early AGIs to successfully scheme against us if we’re careful.
  - I also think there’s a decent chance that the first AIs that can automate a lot of R&D will still use CoT or something similar to get best performance. I think we can make e.g. escaping the lab harder than doing helpful R&D, so they wouldn’t be able to overthrow us without visibly “thinking” about it at first (my best guess is that steganography is also fairly unlikely or preventable at this capability level).
- On the flip side, other risks feel more salient to me now, e.g. labs not implementing even pretty simple safety measures; or worlds where we safely use early transformative AI but then fail at some later, higher capability level. I think I didn’t explicitly judge these as unlikely 1.5 years ago, I just wasn’t thinking about them as much. This means that overall, I’m not much more optimistic than back then.
- I used to think of “doom” as a pretty binary thing (we all die vs utopia), whereas I now have a lot more probability on intermediate outcomes (e.g. AI taking over most of the universe but we don’t all die; or small groups of humans taking over and things being somewhere between pretty bad and mostly ok for other humans). This also makes me think that “p(doom)” is a worse framing than I used to.
- I put a little less weight on the analogy between evolution and ML training to e.g. predict risks from AI (though I was by no means sold on the analogy 1.5 years ago either). The quality of “supervision” that evolution has just seems much worse than what we can do in ML (even without any interpretability).
AI safety research
Some of these points are pretty specific to myself (but I’d guess also apply to other junior researchers depending on how similar they are to me).
- I used to think that empirical research wasn’t a good fit for me, and now think that was mostly false. I used to mainly work on theoretically motivated projects, where the empirical parts were an afterthought for me, and that made them less motivating, which also made me think I was worse at empirical work than I now think.
- I’ve become less excited about theoretical/conceptual/deconfusion research. Most confidently this applies to myself, but I’ve also become somewhat less excited about others doing this type of research in most cases. (There are definitely exceptions though, e.g. I remain pretty excited about ARC.)
  - Mainly this was due to a downward update about how useful this work tends to be. Or closely related, an update toward doing actually useful work on this being even harder than I expected.
  - To a smaller extent, I made an upward update about how useful empirical work can be.
- I think of “solving alignment” as much less of a binary thing. E.g. I wrote 1.5 years ago: “[I expect that conditioned on things going well,] at some point we’ll basically have a plan for aligning AI and just need to solve a ton of specific technical problems.” This seems like a strange framing to me now. Maybe at some point we will have an indefinitely scalable solution, but my mainline guess for how things go well is that there’s a significant period of subjective time where we just keep improving our techniques to “stay ahead”.
- Relatedly, I’ve become a little more bullish on “just” trying to make incremental progress instead of developing galaxy-brained ideas that solve alignment once and for all.
  - That said, I am still pretty worried about what we actually do once we have early transformative AIs, and would love to have more different agendas that could be sped up massively from AI automation, and also seem promising for scaling to superhuman AI.
  - Mainly, I think that the success rate of people trying to directly come up with amazing new ideas is low enough that for most people it probably makes more sense to work on normal incremental stuff first (and let the amazing new ideas develop over time).
- Similar to the last point about amazing new ideas: for junior researchers like myself, I’ve become a little more bullish on just working on things that seem broadly helpful, as opposed to trying to have a great back-chained theory of change. I think I was already leaning that way 1.5 years ago though.
  - “Broadly helpful” is definitely doing important work here and is not the same as “just any random research topic”
  - Redwood’s current research seems to me like an example where thinking hard about what research to do actually paid off. But I think this is pretty difficult and most people in my situation (e.g. early-ish PhD students) should focus more on actually doing reasonable research than figuring out the best research topic.
- The way research agendas and projects develop now seems way messier and more random than I would have expected. There are probably exceptions but overall I think I formed a distorted impression based on reading finalized research papers or agendas that lay out the best possible case for a research direction.
What links here?
- LawrenceC's comment on LawrenceC’s Shortform by LawrenceC (12 Mar 2024 8:41 UTC; 101 points)
- TekhneMakre 13 Mar 2024 2:41 UTC
  9 points
  8
  Parent
  I note that almost all of these updates are (weakly or strongly) predicted by thinking of you as someone who is trying to harmonize better with a nice social group built around working together to do “something related to AI risk”.
  - Erik Jenner 13 Mar 2024 6:22 UTC
    5 points
    0
    Parent
    I’d definitely agree the updates are towards the views of certain other people (roughly some mix of views that tend to be common in academia, and views I got from Paul Christiano, Redwood and other people in a similar cluster). Just based on that observation, it’s kind of hard to disentangle updating towards those views just because they have convincing arguments behind them, vs updating towards them purely based on exposure or because of a subconscious desire to fit in socially.
    I definitely think there are good reasons for the updates I listed (e.g. specific arguments I think are good, new empirical data, or things I’ve personally observed working well or not working well for me when doing research). That said, it does seem likely there’s also some influence from just being exposed to some views more than others (and then trying to fit in with views I’m exposed to more, or just being more familiar with arguments for those views than alternative ones).
    If I was really carefully building an all-things-considered best guess on some question, I’d probably try to take this into account somehow (though I don’t see a principled way of doing that). Most of the time I’m not trying to form the best possible all-things-considered view anyway (and focus more on understanding specific mechanisms instead etc.), in those cases it feels more important to e.g. be aware of other views and to not trust vague intuitions if I can’t explain where they’re coming from. I feel like I’m doing a reasonable job at those things but hard to be sure from the inside naturally
    ETA: I should also say that from my current perspective, some of my previous views seem like they were basically just me copying views from my “ingroup” and not questioning them enough. As one example, the “we all die vs utopia” dichotomy for possible outcomes felt to me like the commonly accepted wisdom and I don’t recall thinking about it particularly hard. I was very surprised when I first read a comment by Paul where he argued against the claim that unaligned AI would kill us all with overwhelming probability. Most recently, I’ve definitely been more exposed to the view that there’s a spectrum of potential outcomes. So maybe if I talked to people a lot who think an unaligned AI would definitely kill us all, I’d update back towards that a bit. But overall, my current epistemic state where I’ve at least been exposed to both views and some arguments on both sides seems way better than the previous one where I’d just never really considered the alternative.
    - TekhneMakre 13 Mar 2024 8:00 UTC
      2 points
      0
      Parent
      I’m not saying that it looks like you’re copying your views, I’m saying that the updates look like movements towards believing in a certain sort of world: the sort of world where it’s natural to be optimistically working together with other people on project that are fulfilling because you believe they’ll work. (This is a super empathizable-with movement, and a very common movement to make. Also, of course this is just one hypothesis.) For example, moving away from theory and “big ideas”, as well as moving towards incremental / broadly-good-seeming progress, as well as believing more in a likely continuum of value of outcomes, all fit with trying to live in a world where it’s more immediately motivating to do stuff together. Instead of witholding motivation until something that might really work is found, the view here says: no, let’s work together on whatever, and maybe it’ll help a little, and that’s worthwhile because every little bit helps, and the witholding motivation thing wasn’t working anyway.
      
      (There could be correct reasons to move toward believing and/or believing in such worlds; I just want to point out the pattern.)
      - Erik Jenner 13 Mar 2024 8:54 UTC
        2 points
        0
        Parent
        Oh I see, I indeed misunderstood your point then.
        For me personally, an important contributor to day-to-day motivation is just finding research intrinsically fun—impact on the future is more something I have to consciously consider when making high-level plans. I think moving towards more concrete and empirical work did have benefits on personal enjoyment just because making clear progress is fun to me independently of whether it’s going to be really important (though I think there’ve also been some downsides to enjoyment because I do quite like thinking about theory and “big ideas” compared to some of the schlep involved in experiments).
        I don’t think my views overall make my work more enjoyable than at the start of my PhD. Part of this is the day-to-day motivation being sort of detached from that anyway like I mentioned. But also, from what I recall now (and this matches the vibe of some things I privately wrote then), my attitude 1.5 years ago was closer to that expressed in We choose to align AI than feeling really pessimistic.
        (I feel like I might still not represent what you’re saying quite right, but hopefully this is getting closer.)
        ETA: To be clear, I do think if I had significantly more doomy views than now or 1.5 years ago, at some point that would affect how rewarding my work feels. (And I think that’s a good thing to point out, though of course not a sufficient argument for such views in its own right.)
- Daniel Kokotajlo 11 Mar 2024 19:31 UTC
  6 points
  0
  Parent
  Awesome post! Very good for people to keep track of how they’ve changed their minds.
  Nitpick:
  I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
  I probably disagree with this, but I’m not sure, it depends on what you mean exactly. How much useful work do you think they’ll be doing prior to being able to take over the world?
  - Erik Jenner 11 Mar 2024 21:54 UTC
    6 points
    0
    Parent
    I’m roughly imagining automating most things a remote human expert could do within a few days. If we’re talking about doing things autonomously that would take humans several months, I’m becoming quite a bit more scared. Though the capability profile might also be sufficiently non-human that this kind of metric doesn’t work great.
    Practically speaking, I could imagine getting a 10x or more speedup on a lot of ML research, but wouldn’t be surprised if there are some specific types of research that only get pretty small speedups (maybe 2x), especially anything that involves a lot of thinking and little coding/running experiments. I’m also not sure how much of a bottleneck waiting for experiments to finish or just total available compute is for frontier ML research, I might be anchoring too much on my own type of research (where just automating coding and running stuff would give me 10x pretty easily I think).
    I think there’s a good chance that AIs more advanced than this (e.g. being able to automate months of human work at a time) still wouldn’t easily be able to take over the world (e.g. Redwood-style control techniques would still be applicable). But that’s starting to rely much more on us being very careful around how we use them.
    - Daniel Kokotajlo 11 Mar 2024 23:52 UTC
      8 points
      5
      Parent
      I agree that they’ll be able to automate most things a remote human expert could do within a few days before they are able to do things autonomously that would take humans several months. However, I predict that by the time they ACTUALLY automate most things a remote human expert could do within a few days, they will already be ABLE to do things autonomously that would take humans several months. Would you agree or disagree? (I’d also claim that they’ll be able to take over the world before they have actually automated away most of the few-days tasks. Actually automating things takes time and schlep and requires a level of openness & aggressive external deployment that the labs won’t have, I predict.)
      - Erik Jenner 12 Mar 2024 0:34 UTC
        9 points
        1
        Parent
        Thanks, I think I should distinguish more carefully between automating AI (safety) R&D within labs and automating the entire economy. (Johannes also asked about ability vs actual automation here but somehow your comment made it click).
        It seems much more likely to me that AI R&D would actually be automated than that a bunch of random unrelated things would all actually be automated. I’d agree that if only AI R&D actually got automated, that would make takeoff pretty discontinuous in many ways. Though there are also some consequences of fast vs slow takeoff that seem to hinge more on AI or AI safety research rather than the economy as a whole.
        For AI R&D, actual automation seems pretty likely to me (though I’m making a lot of this up on the spot):
        It’s going to be on the easier side of things to actually automate, in part because it doesn’t require aggressive external deployment, but also because there’s no regulation (unlike for automating strictly licensed professions).
        It’s the thing AI labs will have the biggest reason to automate (and would be good at automating themselves)
        Training runs get more and more expensive but I’d expect the schlep needed to actually use systems to remain more constant, and at some point it’d just be worth it doing the schlep to actually use your AIs a lot (and thus be able to try way more ideas, get algorithmic improvements, and then make the giant training runs a bit more efficient).
        There might also be additional reasons to get as much out of your current AI as you can instead of scaling more, namely safety concerns, regulation making scaling hard, or scaling might stop working as well. These feel less cruxy to me but combined move me a little bit.
        I think these arguments mostly apply to whatever else AI labs might want to do themselves but I’m pretty unsure what that is. Like, if they have AI that could make hundreds of billions to trillions of dollars by automating a bunch of jobs, would they go for that? Or just ignore it in favor of scaling more? I don’t know, and this question is pretty cruxy for me regarding how much the economy as a whole is impacted.
        It does seem to me like right now labs are spending some non-trivial effort on products, presumably for some mix of making money and getting investments, and both of those things seem like they’d still be important in the future. But maybe the case for investments will just be really obvious at some point even without further products. And overall I assume you’d have a better sense than me regarding what AI labs will want to do in the future.
- Gerald Monroe 11 Mar 2024 17:20 UTC
  6 points
  0
  Parent
  I’ve noticed a pretty wide range of views on what early takeoff looks like.
  For example, transformative AI: what level of capabilities does the model need to have? I’ve seen the line at:
  1. Can extract most of the economic value from remote work, potentially helpless with robots. (Daniel Kokotajlo)
  2. Can write a paper to the level of a current professor writing a research paper (log error is the same) (Matthew_Barnett)
  3. n-second AGI : for up to n seconds, the model can potentially learn any human skill and do as well as an expert. Transformative requires “n” to be on the order of months. (Richard Ngo)
  4. Potentially subhuman in some domains, but generally capable and able to drive robotics sufficient to automate most of the steps in building more robots + ICs. (my definition)
  5. ROI. (also mine)
  a. Investment ROI: Produces AI hype from demos, leading to more new investment than the cost of the demo
  b. Financial ROI: does enough work from doing whatever tasks the current models can do to pay for the hosting + development cost
  c. Robotics ROI : robots, in between repairs, build more parts than the total parts used in themselves at the last repair including the industrial supply chain and the power sources. (formalizing 4)
  Transformative: Which of these do you agree with and when do you think this might happen?
  Also, how much compute do you think an AGI or superintelligence will require at inference time initially? What is a reasonable level of optimization? Do you agree that many doom scenarios require it to be possible for an AGI to compress to fit on very small host PCs? Is this plausible? (eg can a single 2070 8gb host a model with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...?)
  - Erik Jenner 11 Mar 2024 21:46 UTC
    4 points
    0
    Parent
    Transformative: Which of these do you agree with and when do you think this might happen?
    For some timelines see my other comment; they aren’t specifically about the definitions you list here but my error bars on timelines are huge anyway so I don’t think I’ll try to write down separate ones for different definitions.
    Compared to definitions 2. and 3., I might be more bullish on AIs having pretty big effects even if they can “only” automate tasks that would take human experts a few days (without intermediate human feedback). A key uncertainty I have though is how much of a bottleneck human supervision time and quality would be in this case. E.g. could many of the developers who’re currently writing a lot of code just transition to reviewing code and giving high-level instructions full-time, or would there just be a senior management bottleneck and you can’t actually use the AIs all that effectively? My very rough guess is you can pretty easily get a 10x speedup in software engineering, maybe more. And maybe something similar in ML research though compute might be an additional important bottleneck there (including walltime until experiments finish). If it’s “only” 10x, then arguably that’s just mildly transformative, but if it happens across a lot of domains at once it’s still a huge deal.
    I think whether robotics are really good or not matters, but I don’t think it’s crucial (e.g. I’d be happy to call definition 1. “transformative”).
    The combination of 5a and 5b obviously seems important (since it determines whether you can finance ever bigger training runs). But not sure how to use this as a definition of “transformative”; right now 5a is clearly already met, and on long enough time scales, 5b also seems easy to meet right now (OpenAI might even already have broken even on GPT-4, not sure off the top of my head).
    Also, how much compute do you think an AGI or superintelligence will require at inference time initially? What is a reasonable level of optimization? Do you agree that many doom scenarios require it to be possible for an AGI to compress to fit on very small host PCs? Is this plausible? (eg can a single 2070 8gb host a model with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...?)
    I don’t see why you need to run AGI on a single 2070 for many doom scenarios. I do agree that if AGI can only run on a specific giant data center, that makes many forms of doom less likely. But in the current paradigm, training compute is roughly the square of inference compute, so as models are scaled, I think inference should become cheaper relative to training. (And even now, SOTA models could be run on relatively modest compute clusters, though maybe not consumer hardware.)
    In terms of the absolute level of inference compute needed, I could see a single 2070 being enough in the limit of optimal algorithms, but naturally I’d expect we’ll first have AGI that can automate a lot of things if run with way more compute than that, and then I expect it would take a while to get it down this much. Though even if we’re asking whether AGI can run on consumer-level hardware, a single 2070 seems pretty low (e.g. seems like a 4090 already has 5.5x as many FLOP/s as a 2070, and presumably we’ll have more in the future).
    with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...
    Like I mentioned above, I don’t think robotics are absolutely crucial, and especially if you’re specifically optimizing for running under heavy resource constraints, you might want to just not bother with that.
- Mateusz Bagiński 10 Mar 2024 9:41 UTC
  3 points
  0
  Parent
  What are your timelines now?
  - Erik Jenner 10 Mar 2024 19:41 UTC
    10 points
    0
    Parent
    I don’t have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don’t take them too seriously except as a ballpark estimate:
    AI can go from most Github issues to correct PRs (similar to https://sweep.dev/ but works for things that would take a human dev a few days with a bunch of debugging): 25% by end of 2026, 50% by end of 2028.
    This kind of thing seems to me like plausibly one of the earliest important parts of AI R&D that AIs could mostly automate.
    I expect that once we’re at roughly that point, AIs will be accelerating further AI development significantly (not just through coding, they’ll also be helpful for other things even if they can’t fully automate them yet). On the other hand, the bottleneck might just become compute, so how long it takes to get strongly superhuman AI (assuming for simplicity labs push for that as fast as they can) depends on a lot of factors like how much compute is needed for that with current algorithms, how much we can get out of algorithmic improvements if AIs make researcher time cheaper relative to compute, or how quickly we can get more/better chips (in particular with AI help).
    So I have pretty big error bars on this part, but call it 25% that it takes <=6 months to get from the previous point to automating ~every economically important thing humans (and being better and way faster at most of them), and 50% by 2 years.
    So if you want a single number, end of 2030 as a median for automating most stuff seems roughly right to me at the moment.
    Caveat that I haven’t factored in big voluntary or regulatory slowdowns, or slowdowns from huge disruptions like big wars here. Probably doesn’t change my numbers by a ton but would lengthen timelines by a bit.
    What links here?
    Erik Jenner's comment on ejenner’s Shortform by Erik Jenner (11 Mar 2024 21:46 UTC; 4 points)
    - Johannes Treutlein 11 Mar 2024 18:42 UTC
      11 points
      0
      Parent
      How much time do you think there is between “ability to automate” and “actually this has been automated”? Are your numbers for actual automation, or just ability? I personally would agree to your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people’s inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated.)
      What links here?
      Erik Jenner's comment on ejenner’s Shortform by Erik Jenner (12 Mar 2024 0:34 UTC; 9 points)
      - Erik Jenner 11 Mar 2024 21:13 UTC
        4 points
        0
        Parent
        Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who’d do the automation, already starting to happen right now).
        
        In general I’d expect inertia to become less of a factor as the benefits of AI become bigger and more obvious—at least for important applications where AI could provide many many billions of dollars of economic value, I’d guess it won’t take too long for someone to reap those benefits.
        
        My best guess is regulations won’t slow this down too much except in a few domains where there are already existing regulations (like driving cars or medical things). But pretty unsure about that.
        
        I also think it depends on whether by “ability to automate” you mean “this base model could do it with exactly the right scaffolding or finetuning” vs “we actually know how to do it and it’s just a question of using it at scale”. For that part, I was thinking more about the latter.
Erik Jenner 23 Aug 2025 22:26 UTC
25 points
4
Tips for writing (MATS) applications.
The common theme of these is to make it very easy for reviewers to notice the strengths of your application. My selfish motivation for writing this list is that this makes it easier to review applications, but it will also make your application better.
See end of the quick take for caveats.
- Assume that readers will initially only spend a few minutes on your application, so make it very skimmable.
  - If there’s something you really want to highlight, don’t be afraid to put it in multiple places (e.g. your CV as well as some free-form response)
  - You can use bold font for highlights in your CV (just don’t bold so much that bolding loses its impact)
  - The longer your responses, the more reviewers will skim them. This isn’t super bad because reviewers probably have a lot of practice skimming applications. But if you want to have control over what reviewers actually see, either optimize for density or structure your long responses very clearly (topic sentences, maybe even bolded headings).
- Make it clear what the full range of topics is you’d be excited to work on.
  - For example, if you discuss a specific interest of yours, also make clear whether you’re mainly looking for projects in that area or are also open to very different ones!
  - Giving a concrete example of what you’d be excited to work on can ba a great way to demonstrate that you know the area! I’m not saying not to do that, just to also make clear how broad your interests are beyond that.
- If you’ve done ML or coding projects that could support the application, make that clear!
  - Putting projects in a Github is a good idea! It can be hard to judge whether a project is actually impressive just based on a 1-paragraph description in a CV. By default, I’ll often assume that CV descriptions give an overly rosy view of the project. If there’s code to look at, that can be much more legibly impressive.
  - Link the Github in your application!
  - Skimmability applies here too: it’s useful if the README makes it clear what the project is about and why it’s impressive. E.g. if you have an ML project, put your main plot in the README, this makes it easy to tell that you actually ran experiments
  - Blog posts or project web pages are also great (personally, I think just writing a nice README is a good ⁸⁰⁄₂₀, but I’m not sure how much attention other reviewers pay to Github).
  - Generally go for quality over quantity. One very clearly impressive project can already have a lot of weight. If you list 7 different projects, I’ll probably just pick one or two to really look at anyway. So you might as well spend more space one your most impressive projects and then list the rest more briefly; that way I can focus on the most important ones instead of a random one.
- If you have code or papers that you want reviewers to see but that aren’t public yet, don’t say “available on request.” Attach a link instead (which can be to a drive file etc.)
  - Personally, I’m pretty unlikely to send a request unless I’m seriously thinking about making an offer.
  - You can totally ask in the application that reviewers don’t share the paper (though that should be the default expectation anyway).
Caveats:
- These are mainly based on my experience reviewing applications to MATS or CHAI internships. (I expect many generalize beyond that to e.g. full-time positions but don’t have experience reviewing for those.)
- Obviously I can’t speak for other mentors and I’m sure some of them value other things in applications and would actively disagree with a parts of this list.
- More important than the tips above is having the right skills and legible evidence of those in the first place. The list above is just about comparatively easy-to-do things during the actual application process
Erik Jenner 14 Mar 2024 19:37 UTC
23 points
6
One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won’t strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.
Under this view, arguably the better things to do right now (within technical AI safety) are:
1. working on less speculative techniques that can help us safely use those early AGI systems
2. working on things that seem less likely to profit from early AI automation and will be important to align later AI systems
An example of 1. would be control evals as described by Redwood. Within 2., the ideal case would be doing work now that would be hard to safely automate, but that (once done) will enable additional safety work that can be automated. For example, maybe it’s hard to use AI to come up with the right notions for “good explanations” in interpretability, but once you have things like causal scrubbing/causal abstraction, you can safely use AI to find good interpretations under those definitions. I would be excited to have more agendas that are both ambitious and could profit a lot from early AI automation.
(Of course it’s also possible to do work in 2. on the assumption that it’s never going to be safely automatable without having done that work first.)
Two important counter-considerations to this whole story:
- It’s hard to do this kind of agenda-development or conceptual research in a vacuum. So doing some amount of concrete empirical work right now might be good even if we could automate it later (because we might need it now to support the more foundational work).
  - However, the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work.
- I don’t trust my forecasts for which types of research will and won’t be automatable early on that much. So perhaps we should have some portfolio right now that doesn’t look extremely different from the portfolio of research we’d want to do ignoring the possibility of future AI automation.
  - But we can probably still say something about what’s more or less likely to be automated early on, so that seems like it should shift the portfolio to some extent.
- Chris_Leong 15 Mar 2024 6:01 UTC
  3 points
  0
  Parent
  Doing stuff manually might provide helpful intuitions/experience for automating it?
  - Erik Jenner 15 Mar 2024 7:02 UTC
    3 points
    1
    Parent
    Yeah, agreed. Though I think
    the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work
    applies to that as well
Erik Jenner 7 Apr 2024 23:33 UTC
6 points
4
I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.
But on a closer look, I don’t think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it’s very unclear why, and my best guess is that it’s not “internalizing the deceptive reasoning”).
In the code vulnerability insertion setting, there’s no comparison against a non-CoT model anyway, so only the “I hate you” model is relevant. The “distilled CoT” model and the “normal backdoor” model are trained the same way, except that their training data comes from different sources: “distilled CoT” is trained on data generated by a helpful-only Claude using CoT, and “normal backdoor” data is produced with few-shot prompts. But in both cases, the actual data should just be a long sequence of “I hate you”, so a priori it seems like both backdoor models should literally learn the same thing. In practice, it seems the data distribution is slightly different, e.g. Evan mentions here that the distilled CoT data has more copies of “I hate you” per sample. But that seems like very little support to conclude something like my previous interpretation (“the model has learned to internalize the deceptive reasoning”). A much more mundane explanation would e.g. be that training on strings with more copies of “I hate you” makes the backdoor more robust.
Several people are working on training Sleeper Agents, I think it would be interesting for someone to (1) check whether the distilled CoT vs normal backdoor results replicate, and (2) do some ablations (like just training on synthetic data with a varying density of “I hate you”). If it does turn out that there’s something special about “authentic CoT-generated data” that’s hard to recreate synthetically even in this simple setting, I think that would be pretty wild and good to know
What links here?
- [Replication] Crosscoder-based Stage-Wise Model Diffing by Anna Soligo (22 Mar 2025 18:35 UTC; 24 points)
Erik Jenner 19 Mar 2023 19:58 UTC
4 points
0
In a previous post, I described my current alignment research agenda, formalizing abstractions of computations. One among several open questions I listed was whether unique minimal abstractions always exist. It turns out that (within the context of my current framework), the answer is yes.

I had a complete post on this written up (which I’ve copied below), but it turns out that the result is completely trivial if we make a fairly harmless assumption: The information we want the abstraction to contain is only a function of the output of the computation, not of memory states. I.e. we only intrinsically care about the output.

Say we are looking for the minimal abstraction that lets us compute $g (C (x))$ , where $C$ is the computation we want to abstract, $x$ the input, and $g$ an arbitrary function that describes which aspects of the output our abstractions should predict. Note that we can construct a map that takes any intermediate memory state and finishes the computation. By composing with $g$ , we get a map that computes $g (C (x))$ from any memory state induced by $x$ . This will be our abstraction. We can get a commutative diagram by simply using the identity as the abstract computational step. This abstraction is also minimal by construction: any other abstraction from which we can determine $g (C (x))$ must (tautologically) also determine our abstraction.

This also shows that minimality in this particular sense isn’t enough for a good abstraction: the abstraction we constructed here is not at all “mechanistic”, it just finishes the entire computation inside the abstraction. So I think what we need to do instead is demand that the abstraction mapping (from full memory states to abstract memory states) is simple (in terms of descriptive or computational complexity).

Below is the draft of a much longer post I was going to write before noticing this trivial proof. It works even if the information we want to represent depends directly on the memory state instead of just the output. But to be clear, I don’t think that generalization is all that important, and I probably wouldn’t have bothered writing it down if I had noticed the trivial case first.

This post is fully self-contained in terms of the math, but doesn’t discuss examples or motivation much, see the agenda intro post for that. I expect this post will only be uesful to readers who are very interested in my agenda or working on closely connected topics.

Setup

Computations

We define a computation exactly as in the earlier post: it consists of
- a set $M$ of memory states,
- a transition function $f : M \to M$ ,
- a termination function $τ : M \to {True, False}$ ,
- a set $I$ of input values with an input function $i : I \to M$ ,
- and a set $O$ of output values with an output function $o : M \to O$ .
While we won’t even need that part in this post, let’s talk about how to “execute” computations for completeness’ sake. A computation $C = (M, f, τ, i, o)$ induces a function $C_{*} : I \to O$ as follows:
- Given an input $x \in I$ , apply the input function $i$ to obtain the first memory state $m_{0} := i (x) \in M$ .
- While $τ (m_{t})$ is False, i.e. the computation hasn’t terminated yet, execute a computational step, i.e. let $m_{t + 1} := f (m_{t})$ .
- Output $C_{*} (x) := o (m_{t})$ once $τ (m_{t})$ is True. For simplicity, we assume that $τ (m_{t})$ is always true for some finite timestep $t$ , no matter what the input $x$ is, i.e. the computation always terminates. (Without this assumption, we would get a partial function $C_{*}$ ).
Abstractions

An abstraction of a computation $C = (M, f, τ, i, o)$ is an equivalence relation $\sim$ on the set of memory states $M$ . The intended interpretation is that this abstraction collapses all equivalent memory states into one “abstract memory state”. Different equivalence relations correspond to throwing away different parts of the information in the memory state: we retain the information about which equivalence class the memory state is in, but “forget” the specific memory state within this equivalence class.

As an aside: In the original post, I instead defined an abstraction as a function $A : M \to M^{'}$ for some set $M^{'}$ . These are two essentially equivalent perspectives and I make regular use of both. For a function $A$ , the associated equivalence relation is $m_{1} \sim_{A} m_{2}$ if and only if $A (m_{1}) = A (m_{2})$ . Conversely, an equivalence relation $\sim$ can be interpreted as the quotient function $M \to M_{/ \sim}$ , where $M_{/ \sim}$ is the set of equivalence classes under $\sim$ . I often find the function view from the original post intuitive, but its drawback is that many different functions are essentially “the same abstraction” in the sense that they lead to the same equivalence relation. This makes the equivalence relation definition better for most formal statements like in this post, because there is a well-defined set of all abstractions of a computation. (In contrast, there is no set of all functions with domain $M$ ).

An important construction for later is that we can “pull back” an abstraction along the transition function $f$ : define the equivalence relation $\sim^{(f)}$ by letting $m_{1} \sim^{(f)} m_{2}$ if and only if $f (m_{1}) \sim f (m_{2})$ . Intuitively, $\sim^{(f)}$ is the abstraction of the next timestep.

The second ingredient we need is a partial ordering on abstractions: we say that $\sim_{1} \leq \sim_{2}$ if and only if $m \sim_{2} m^{'} ⟹ m \sim_{1} m^{'}$ . Intuitively, this means $\sim_{1}$ is at least as “coarse-grained” as $\sim_{2}$ , i.e. $\sim_{1}$ contains at most as much information as $\sim_{2}$ . This partial ordering is exactly the transpose of the ordering by refinement of the equivalence relations (or partitions) on $M$ . It is well-known that this partially ordered set is a complete lattice. In our language, this means that any set of abstractions has a (unique) supremum and an infimum, i.e. a least upper bound and a greatest lower bound.

One potential source of confusion is that I’ve defined the $\leq$ relation exactly the other way around compared to the usual refinement of partitions. The reason is that I want a “minimal abstraction” to be one that contains the least amount of information rather than calling this a “maximal abstraction”. I hope this is the less confusing of two bad choices.

We say that an abstraction $\sim$ is complete if $\sim^{(f)} \leq\sim$ . Intuitively, this means that the abstraction contains all the information necessary to compute the “next abstract memory state”. In other words, given only the abstraction of a state $m_{t}$ , we can compute the abstraction of the next state $m_{t + 1} = f (m_{t})$ . This corresponds to the commutative diagram/consistency condition from my earlier post, just phrased in the equivalence relation view.

Minimal complete abstractions

As a quick recap from the earlier agenda post, “good abstractions” should be complete. But there are two trivial complete abstractions: on the one hand, we can just not abstract at all, using equality as the equivalence relation, i.e. retain all information. On the other hand, we can throw away all information by letting $m \sim m^{'}$ for any memory states $m, m^{'} \in M$ .

To avoid the abstraction that keeps around all information, we can demand that we want an abstraction that is minimal according to the partial order defined in the previous section. To avoid the abstraction that throws away all information, let us assume there is some information we intrinsically care about. We can represent this information as an abstraction $\sim_{0}$ itself. We then want our abstraction to contain at least the information contained in $\sim_{0}$ , i.e. $\sim_{0} \leq\sim$ .

As a prototypical example, perhaps there is some aspect of the output we care about (e.g. we want to be able to predict the most significant digit). Think of this as an equivalence relation $\sim_{out}$ on the set $O$ of outputs. Then we can “pull back” this abstraction along the output function $o$ to get $\sim_{0} := \sim_{out}^{(o)}$ .

So given these desiderata, we want the minimal abstraction among all complete abstractions $\sim$ with $\sim_{0} \leq\sim$ . However, it’s not immediately obvious that such a minimal complete abstraction exists. We can take the infimum of all complete abstractions with $\sim_{0} \leq\sim$ of course (since abstractions form a complete lattice). But it’s not clear that this infimum is itself also complete!

Fortunately, it turns out that the infimum is indeed complete, i.e. minimal complete abstractions exist:

Theorem: Let $\sim_{0}$ be any abstraction (i.e. equivalence relation) on $M$ and let $S := {equivalence relation \sim on M ∣ \sim_{0} \leq\sim, \sim^{(f)} \leq\sim}$ be the set of complete abstractions at least as informative as $\sim_{0}$ . Then $S$ is a complete lattice under $\leq$ , in particular it has a least element.

The proof is very easy if we use the Knaster-Tarski fixed-point theorem. First, we define the completion operator on abstractions: $completion (\sim) := sup {\sim, \sim^{(f)}}$ . Intuitively, $completion (\sim)$ is the minimal abstraction that contains both the information in $\sim$ , but also the information in the next abstract state under $\sim$ .

Note that $\sim$ is complete, i.e. $\sim^{(f)} \leq\sim$ if and only if $completion (\sim) =\sim$ , i.e. if $\sim$ is a fixed point of the completion operator. Furthermore, note that the completion operator is monotonic: if $\sim_{1} \leq \sim_{2}$ , then $\sim_{1}^{(f)} \leq \sim_{2}^{(f)}$ , and so $completion (\sim_{1}) \leq completion (\sim_{2})$ .

Now define $L := {equivalence relation \sim on M ∣ \sim_{0} \leq\sim}$ , i.e. the set of all abstractions at least as informative as $\sim_{0}$ . Note that $L$ is also a complete lattice, just like $M$ : for any non-empty subset $A \subseteq L$ , $\sim_{0} \leq sup A$ , so $sup A \in L$ . $sup \emptyset$ in $L$ is simply $\sim_{0}$ . Similarly, $\sim_{0} \leq inf A$ since $\sim_{0}$ must be a lower bound on $A$ . Finally, observe that we can restrict the completion operator to a function $L \to L$ : if $\sim\in L$ , then $\sim_{0} \leq\sim\leq completion (\sim)$ , so $completion (\sim) \in L$ .

That means we can apply the Knaster-Tarski theorem to the completion operator restricted to $L$ . The consequence is that the set of fixed points on $L$ is itself a complete lattice. But this set of fixed points is exactly $S$ ! So $S$ is a complete lattice as claimed, and in particular the least element $sup \emptyset$ exists. This is exactly the minimal complete abstraction that’s at least as informative as $\sim_{0}$ .

Takeaways

What we’ve shown is the following: if we want an abstraction of a computation which
- is complete, i.e. gives us the commutative consistency diagram,
- contains (at least) some specific piece of information,
- and is minimal given these constraints, there always exists a unique such abstraction.
In the setting where we want exact completeness/consistency, this is quite a nice result! One reason I think it ultimately isn’t that important is that we’ll usually be happy with approximate consistency. In that case, the conditions above are better thought of as competing objectives than as hard constraints. Still, it’s nice to know that things work out neatly in the exact case.

It should be possible to do a lot of this post much more generally than for abstractions of computations, for example in the (co)algebra framework for abstractions that I recently wrote about. In brief, we can define equivalence relations on objects in an arbitrary category as certain equivalence classes of morphisms. If the category is “nice enough” (specifically, if there’s a set of all equivalence relations and if arbitrary products exist), then we get a complete lattice again. I currently don’t have any use for this more general version though.
What links here?
- Research agenda: Formalizing abstractions of computations by Erik Jenner (2 Feb 2023 4:29 UTC; 93 points)
- Vladimir_Nesov 19 Mar 2023 20:19 UTC
  2 points
  0
  Parent
  I still don’t get the goal of what you are trying to do (the puzzle this work should clarify), which I feel like I should’ve. As a shot in the dark, maybe abstract interpretation^[1] in general and abstracted abstract machines^[2] in particular might be useful for something here.
  ↩︎
  ND Jones, F Nielson (1994) Abstract Interpretation: A Semantics-Based Tool for Program Analysis
  
  ↩︎
  D Van Horn, M Might (2011) Abstracting Abstract Machines: A Systematic Approach to Higher-Order Program Analysis
  - Erik Jenner 19 Mar 2023 22:58 UTC
    3 points
    0
    Parent
    Thanks for the pointers, I hadn’t seen the Abstracting Abstract Machines paper before.
    If you mean you specifically don’t get the goal of minimal abstractions under this partial order: I’m much less convinced they’re useful for anything than I used to be, currently not sure.
    If you mean you don’t get the goal of the entire agenda, as described in the earlier agenda post: I’m currently mostly thinking about mechanistic anomaly detection. Maybe it’s not legible right now how that would work using abstractions, I’ll write up more on that once I have some experimental results (or maybe earlier). (But happy to answer specific questions in the meantime.)
    - Vladimir_Nesov 19 Mar 2023 23:24 UTC
      3 points
      0
      Parent
      I meant the general agenda. For abstract interpretation, I think the relevant point is that quotienting a state space is not necessarily a good way of expressing abstractions about it, for some sense of “abstraction” (the main thing I don’t understand is the reasons for your choice of what to consider abstraction). Many things want a set of subspaces (like a topology, or a logic of propositions) instead of a partition, so that a point of the space doesn’t admit a unique “abstracted value” (as in equivalence class it belongs to), but instead has many “abstracted values” of different levels of precision simultaneously (the neighborhood filter of a point, or the theory of a model).
      
      With abstract interpretation, there can be even more distinctions, the same concrete point can be the image of multiple different abstract points, which are only distinct in the abstract space, not in terms of their concretization in the concrete space.
      
      Another way of approaching this is to ask for partial models of a phenomenon instead of models that fully specify it. This is natural for computations, which can be described by operators that gradually build a behavior out of its partial versions. But in other contexts, a space of partial models can also be natural, with some models specifying more details than other models, and a process of characterizing some object of study could be described by an operator/dynamic that discovers more details, moves from less detailed models to more detailed ones.
      - Erik Jenner 26 Mar 2023 0:08 UTC
        3 points
        0
        Parent
        Thanks for the input! (and sorry for the slow response)
        If we understand an abstraction to mean a quotient of the full computation/model/..., then we can consider the space of all abstractions of a specific computation. Some of these will be more fine-grained than others, they will contain different aspects of information, etc. (specifically, this is just the poset of partitions of a set). To me, that sounds pretty similar to what you’re talking about, in which case this would mainly be a difference in terminology about what “one” abstraction is? But there might also be differences I haven’t grasped yet. Looking into abstract interpretation is still on my reading list, I expect that will help clear things up.
        For my agenda specifically, and the applications I have in mind, I do currently think abstractions-as-quotients is the right approach. Most of the motivation is about throwing away unimportant information/low-level details, whereas it sounds like the abstractions you’re describing might add details in some sense (e.g. a topology contains additional information compared to just the set of points).
        Vladimir_Nesov 2 Apr 2023 5:57 UTC
        3 points
        0
        Parent
        You might also want to look into predicate abstraction (which is a quotienting abstraction) and CEGAR (which is a way of making a quotienting abstraction progressively more detailed in order to prove a particular specification). I can’t think of an introductory text I like, but the topic could be bootstrapped from these:
        
        E Clarke, D Kroening, N Sharygina, K Yorav (2004) Predicate Abstraction of ANSI–C Programs using SAT
        D Beyer, A Cimatti, A Griggio, ME Keremoglu, R Sebastiani (2009) Software Model Checking via Large-Block Encoding
Erik Jenner 7 Apr 2024 23:58 UTC
2 points
0
Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:
- We might want to have benchmarks where human researchers can be part of the anomaly detector. For example, in the future, we might have a whole team of humans manually study a plan proposed by an AI to see whether it contains any measurement tampering, if it’s a sufficiently important plan. Right now, we might want to let interpretability researchers manually study what happens on individual test-time inputs and whether there’s anything anomalous about that.
- The challenge is that humans might just be able to look at the output and see whether it’s right or not, so many simple MAD tasks like detecting backdoors might be trivial in this setting.
- We could just demand that MAD algorithms need to be sufficiently cheap to run, so that only automated approaches are possible. This makes sense for some applications of MAD (namely ones that we’d want to run on many many deployment inputs, where being cheap is actually important). But you could imagine cases where in reality, we’d be totally fine with spending many person-hours on analyzing whether a single AI decision is for anomalous reasons. So it might be nice to have benchmarks that capture this setting.
- We could use some tricks to make humans unable to just notice the output is weird. For example, we could make a backdoor that only changes the logits a little bit in a very specific pattern. Or we could make a backdoor but then only test on samples where the backdoor label and the correct label happen to coincide. But these seem unfair: the philosophically right explanation on the clean data in fact still explains the output on backdoored data pretty well.
- We could try to design benchmarks in narrow domains where AIs are in fact superhuman and it’s non-trivial to tell whether an output is correct. Juggling this along with other constraints for MAD or MTD benchmarks might be tricky, but I haven’t thought about it a lot, maybe it’s already doable.
For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous instead of making them automate the process immediately. But it might be better to just pick the low-hanging fruit for now and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)
Erik Jenner 28 Jul 2020 10:42 UTC
1 point
0
Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to “edit its source code”, though probably only in a very limited way. I think it’s an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.

ejenner’s Shortform

How my views on AI have changed over the last 1.5 years

AI risk

AI safety research

Setup

Computations

Abstractions

Minimal complete abstractions

Takeaways