Sure, but then an attempt to summarise Eliezer's position that attributes a much stronger position than the one in this post isn't necessarily a straw man that misses the point of irretrievability. It can merely be responding to all his points on top of irretrievability, or saying that they don't consider him to be making sufficient arguments beyond the potential for irretrievability.
My guess is that they would implicitly consider this post to be motte-and-bailey-ing, but that they do strawman the position in this post (if this post is in fact the best representation of Eliezer's position).
In my opinion, this post is not actually making many hard claims. I mostly view it as gesturing at the existence of really difficult problems and presenting historical analogies. It argues that it is possible for problems to be very hard even if they have a bunch of other nice properties, including the nice properties people attribute to the AI problem. However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.
Why does being public change any of this? They already have a ton of investors
Is exobrain something you built or a service you pay for? I’d be curious to try it but couldn’t see a link
What’s station 4 about?
Test your best methods on our hard CoT interp tasks
+1, especially with the vast majority of future Anthropic employee donations already locked in to DAFs
Gotcha. I would feel reasonably happy if the policy said "text written or dictated by a human", if we count my level of LLM editing, followed by me editing, as overall light editing
For me, I'll often reword things, change my mind, go back and add some content to an earlier section, leave todos for myself, have kinda clumsy wording, etc., and an LLM is helpful for all of these
If you don’t trust the user, why does the policy matter? Surely you need some way to gauge post quality regardless
Gotcha. I did not take that from the policy in the post, might be good to reword
EDIT: In particular, as written, the below categories feel like they include my writing, but it sounds like this is not intended
- text that was written by a human and then substantially[6] edited or revised by an LLM
- text that was written by an LLM and then edited or revised by a human
Fair! I was reacting to the concept and didn’t pay much attention to the design. Maybe I would get used to it? I do feel like the concept is what matters here though—I don’t want to read most kinds of slop, and I expect to interpret an LLM block as “high probability of slop”
EDIT: Looking more at the examples in the post, I retract “intrusive”, but the changed font does create a subtle sense of wrongness/a weird vibe, that I could easily see becoming associated in my head with “skip, not worth my time”
Fair enough. How about “I stand by the content of this piece as much as if I’d written it myself”? In my case, most but not all of the phrasing and wording is written by me, and I would cut anything the LLM added that I considered false testimony
What do you mean by without the transcript part?
I often write posts by dictating a verbatim rough draft, giving the audio to Gemini along with a bunch of samples of my past writing and instructions to preserve my voice as much as possible, and then editing what comes out until I'm happy (but in practice it's close enough to my voice that this is just light editing). Under these rules would I need to put the whole post in an LLM output block?
EDIT: On reflection, the thing that annoys me about this policy is that it lumps in many kinds of LLM assistance, with varying amounts of human investment, into an intrusive format that naively reads to me as “this is LLM slop which you should ignore”.
For example, under my current reading, I would need to label several popular and widely read posts of mine as LLM content (my amount of editing varied from light to heavy between the posts, but LLM assistance was substantial). I think it would have been pretty destructive to make me label each post as LLM written (in practice I would have either violated the policy, or posted on a personal blog and maybe shared a link here)
https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability
https://www.lesswrong.com/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher
https://www.lesswrong.com/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in
https://www.lesswrong.com/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well
I would feel better about, e.g., self-selecting a tag for the post indicating how much an LLM was integrated into the writing process, with a spectrum of options rather than a binary
I was worried about this, and asked someone on the relevant team at Anthropic, but they thought our methods were sufficiently different from their internal approach to still be interesting
Very little of the impact of people working in AI Safety is downstream of their research, so this seems wrong.
Where do you think the impact comes from? And is this coming from a background belief that most current alignment work is useless?
How well do models follow their constitutions?
I know that AI companies have policies that include matching donations to approved organizations. It seems like influencing which organizations are eligible for matching could be very valuable, and like employees should not restrict their giving to already-approved organizations.
I am not aware of OpenAI or Anthropic having such policies
(Google does, but it’s just “which organisations are on benevity.com” and matches are capped at $10K, so it’s not too relevant here IMO)
I'm not saying that those people believe it is a critical first try problem. I expect they agree that it could be a critical first try problem, but that they predict it probably isn't, for a variety of reasons, and that they view Eliezer as claiming that it is, rather than just that it's possibly a critical first try problem