For me, I’ll often reword things, change my mind, go back and add some content to an earlier section, leave todos for myself, have somewhat clumsy wording, etc., and an LLM is helpful for all of these
Neel Nanda
If you don’t trust the user, why does the policy matter? Surely you need some way to gauge post quality regardless
Gotcha. I did not take that from the policy in the post; it might be good to reword
EDIT: In particular, as written, the below categories feel like they include my writing, but it sounds like this is not intended
- text that was written by a human and then substantially[6] edited or revised by an LLM
- text that was written by an LLM and then edited or revised by a human
Fair! I was reacting to the concept and didn’t pay much attention to the design. Maybe I would get used to it? I do feel like the concept is what matters here though—I don’t want to read most kinds of slop, and I expect to interpret an LLM block as “high probability of slop”
EDIT: Looking more at the examples in the post, I retract “intrusive”, but the changed font does create a subtle sense of wrongness/a weird vibe, that I could easily see becoming associated in my head with “skip, not worth my time”
Fair enough. How about “I stand by the content of this piece as much as if I’d written it myself”? In my case, most but not all of the phrasing and wording is written by me, and I would cut anything the LLM added that I considered false testimony
What do you mean by without the transcript part?
I often write posts by dictating a verbatim rough draft, giving the audio to Gemini along with a bunch of samples of my past writing and instructions to preserve my voice as much as possible, and then editing what comes out until I’m happy (in practice it’s close enough to my voice that this is just light editing). Under these rules, would I need to put the whole post in an LLM output block?
EDIT: On reflection, the thing that annoys me about this policy is that it lumps in many kinds of LLM assistance, with varying amounts of human investment, into an intrusive format that naively reads to me as “this is LLM slop which you should ignore”.
For example, under my current reading, I would need to label several popular and widely read posts of mine as LLM content (my amount of editing varied from light to heavy between the posts, but LLM assistance was substantial). I think it would have been pretty destructive to make me label each post as LLM-written (in practice I would have either violated the policy, or posted on a personal blog and maybe shared a link here)
https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability
https://www.lesswrong.com/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher
https://www.lesswrong.com/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in
https://www.lesswrong.com/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well
I would feel better about, e.g., self-selecting a tag for the post indicating how much an LLM was integrated into the writing process, with a spectrum of options rather than a binary
I was worried about this, and asked someone on the relevant team at Anthropic, but they thought our methods were sufficiently different from their internal approach to still be interesting
Very little of the impact of people working in AI Safety is downstream of their research, so this seems wrong.
Where do you think the impact comes from? And is this coming from a background belief that most current alignment work is useless?
How well do models follow their constitutions?
I know that AI companies have policies that include matching donations to approved organizations. It seems like influencing which organizations are eligible for matching could be very valuable, and like employees should not restrict their giving to already-approved organizations.
I am not aware of OpenAI or Anthropic having such policies
(Google does, but it’s just “which organisations are on benevity.com” and matches are capped at $10K, so it’s not too relevant here IMO)
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
I really enjoyed this post, thanks for writing it! I just subscribed to your Patreon; I’d love to see more posts like this. I enjoyed the level of grounded detail on how the world actually works, tied to a topic that matters
Interesting, any idea why Jackson Jr?
Current activation oracles are hard to use
How to Design Environments for Understanding Model Motives
I agree with your assessment here: I don’t think METR has had a significant negative effect on the availability of talent in the technical AGI Safety ecosystem, while Anthropic has had a massive negative one. GDM Safety has probably had a moderate negative one, offset by many people preferring to live in London
Gotcha. I would feel reasonably happy if the policy said “text written or dictated by a human”, if we count my level of LLM editing followed by me editing to be overall light editing