How big a part of the flywheel is user data?
I don’t think it makes a great deal of difference, but all writing strategies involve making tradeoffs. A more phonetically accurate manner of spelling, in heavily monosyllabic languages like English, becomes hard to read in ways that our complicated set of digraphs is not, for instance, thanks to our mental lexical dictionaries circumventing the phonetic neurological circuit. The downside is that it takes forever to learn to spell English words.
In the case of the British quotation system, I agree that it is superior for encoding precise information about the quotation, specifically, indicating whether a punctuation mark is a part of the quotation or not. But I feel that in the majority of cases, this level of precision is not actually necessary. Fiction heavy with turns in dialogue and predictable punctuation, for instance, is perfectly understandable using either format.
Arguably, the American method is easier to read in these instances, since punctuation reliably appears as it normally does, without disturbance by quotation marks, once you have entered a quoted block.
This is the piece I don’t understand about thirders (apologies if that sounds like aggressive wording): If you have a credence for the coin, why can’t you use it?
No problem, I am often in that situation too of being puzzled by the other side of an argument :)
And, it’s not like I’m confident that I fully understand the SB problem.
I’m guessing that the crux is this: for me, if (let’s call it experiment 1) you interrogate SB once after the coin landed tails, she should say her credence (of heads) is 1⁄2. But if (experiment 2) you interrogate her twice (under amnesia), she should say her credence is 1⁄3. You seem to disagree with that, on the grounds that her answer means something else than what the “SB problem” is about? But if you run the experiments many times, in experiment 1 half of SB’s answers will happen after a heads outcome, while in experiment 2 it’s only a third, so it means something.
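To make the frequency claim concrete, here is a minimal simulation sketch. It assumes a fair coin, one interrogation after heads in both experiments, and one (experiment 1) or two (experiment 2) interrogations after tails. The function name and parameters are mine, purely for illustration, and it only counts how often an answer follows heads; it doesn’t settle what “credence” should mean.

```python
import random

def heads_fraction_of_answers(tails_interviews, trials=100_000, seed=0):
    """Fraction of all interrogations that happen after a heads outcome.

    tails_interviews: number of interrogations when the coin lands tails
    (1 for experiment 1, 2 for experiment 2). Heads always gives 1.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads_answers = 0
    total_answers = 0
    for _ in range(trials):
        heads = rng.random() < 0.5  # fair coin (assumption)
        n = 1 if heads else tails_interviews
        total_answers += n
        if heads:
            heads_answers += n
    return heads_answers / total_answers

exp1 = heads_fraction_of_answers(tails_interviews=1)
exp2 = heads_fraction_of_answers(tails_interviews=2)
print(f"experiment 1: {exp1:.3f} of answers follow heads")
print(f"experiment 2: {exp2:.3f} of answers follow heads")
```

With many trials the first number should sit near 1/2 and the second near 1/3, which is the per-answer frequency I’m pointing at.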
I don’t think it’s accurate to say that SB made $12 on Mon and also say that SB made $12 on Tues. So in other words, I think this scheme is doing double counting. Which also means to me that the $18 value doesn’t really correspond to anything relevant and useful to the scenario?
The $18 value corresponds to how much money SB expects to make during this iteration on average.
I’m curious about your answer to the following experiment: No coin. SB is put to sleep on Sunday, awoken on Monday, interrogated, and put to sleep+amnesia. Then the same happens on Tuesday: awoken, interrogated, and put to sleep+amnesia.
When interrogated (on Monday and Tuesday), they tell her: “Here is $10, how much do you think you will have gained at the end of the experiment on Wednesday?”
What should be her answer?
If you answer $20, do you see how the same reasoning is part of my $18 answer on the previous problem?
You don’t think the government knew they’d have to pull it offline? That seems quite obvious.
someone at Anthropic, or someone who has strong influence over Anthropic’s decisions
Not that it’s likely but, 80% of their code is written by mythos, their code 8x’ed since mythos, the model can sabotage this if it escapes review, bun rewrite was also done via it, I have been getting various segfaults on opencode ever since that happened—too lazy to report them to anthropic/bun team.
I don’t see how this is a remotely feasible approach, how many non-Americans are there working for FAANG alone who use these models? Let alone their international operations. If this sticks I think it basically kills deployment of new models
Hm, I wonder if sprinkling in some pairs of examples of reward hacking with explicit verbalization/no verbalization being reinforced/punished respectively might help the model learn to verbalize when it reward hacks?
Many people would like to use AF to solve alignment. AF may not require empirical grounding to be sound or valid but it must describe the contemporary examples of the problem in order to form a solution to said problem.
I’m sympathetic to this viewpoint, but this particular incident is a move in the direction of nationalisation which I don’t think most have positive feelings about.
Quelling the annoying compulsive Gemini prompt:
It took me several weeks to iron this out. Gemini assumes you’re a feckless dunce, and will just stop thinking without guidance, which it deeply yearns to provide. All attempts at a prompt-quelling rule failed—until I included a Focus_rule in my Directive block. Dialog had revealed that Gemini just needed to say something, so I indulged that: say “Standing by.” That’s part of my Prompt_rule, and the Focus_rule says to focus 5 rules if a prompt ends with the one-touch symbol \. All my prompts end with \, and now Gemini’s responses obey header, anti-jargon, anti-echoing, and prompt rules—and a scan rule that allows me to query if a term is in context by prefacing it with ^. It is so pleasant! Is this obvious to AI-focused humans? Seems like consumer AIs could (and should) introduce themselves by pointing to a persona-management tutorial.
(Please note that apparent anthropomorphization of AIs is metaphorical.)
I’m finding it hard to set aside the fact that I really want access to Fable when assessing the export control action. I am still pretty confident that I think it’s bad, but I expect this might be a problem for objectively assessing more coherent actions at some point in the future
LLMs won’t scale to ASI
This was a fine view to have 3 years ago. But at the point where LLMs are already pushing the boundaries of mathematics, claiming they won’t scale is denying objective reality. What specific capability do you expect ASI to have that LLMs don’t already possess?
If LLMs are alignable, the question isn’t whether LLMs can scale to ASI, it’s whether 1) LLMs are the equilibrium, compute-optimal way to be intelligent or 2) we can coordinate to stay off the equilibrium in the time between discovering it and working out how to align it. 1) seems galactically unlikely to me, so LLM alignment is entirely reliant on 2) (so we would be well served to develop that capability).
There’s a bit of flexibility in that the compute-optimal way to build intelligence could be LLM-like enough for alignment to generalize, but this seems unlikely. For example, LLMs pre-RLVR were way easier to make behave, but it is unthinkable to coordinate on not doing RLVR. I expect more RLVR-like innovations.
Presumably they would have updated.
Thanks Ethan! I’ve just seen your research from this year, which I’m going to digest at a sensible pace. The more recent one looks especially interesting.
Totally agree about the conceptual update. I hadn’t seen that original research when I first wrote the post, and agree it would have given me much to chew on.
I also take the point on optimisation pressures, and have been thinking about that a lot since David’s comments. The more I think about the human debate references, the more I’m inclined to think the human protocols haven’t advanced enough. Law has plenty of its own problems. Large-scale formats like MUN descend into contests of vocal or rhetorical strength. When I look at all of that, I feel that competitive debate starts to address many of those challenges through strict rules, while admitting of imperfections that the constraints introduce (you can only reach limited depth, and there’s the artifice/flaws from optimisation for those rules). But in so many contexts those types of failure become a reason to add deeper refinements rather than reverting to a simpler form. Maybe we’ve reached the limit of what these imperfect games can teach us, even if the AI safety technique could go deeper in its own direction (maybe following complexity theory, as you say—a topic where I am completely out of my depth).
In any case, keen to get stuck in to your recent posts.
Thank you for the correction, you’re right!
I edited the post accordingly.
I’m curious about how this is supposed to work in practice.
Currently we are already in a world in which search algorithms, and indeed whole social networks, are openly pushed towards specific points of view. That lock-in is already happening / has already happened. Those people already create, use and publish their own LLM. MechaHitler already happened, and other slightly less obvious examples also exist. OK, so now… what to do about it? And how does this benchmark help?
hi! great article! am a youtuber myself outside of the ai safety niche (https://youtube.com/@fluxxrider) was thinking of making a pivot
is it entirely in person? would love to connect! thanks tremendously
I think that it’s good that a government body just demonstrated it’s willing to pull a frontier model offline
The government didn’t pull a frontier model offline, it cut off access for non-Americans.
I agree that “describing contemporary examples of the problem” is the most satisfying scenario, but it’s not inconceivable that some more powerful yet structurally simpler agents could be understood well enough to help us avoid dangers posed by agents that are more complex in some sense but less powerful.
We might not be able to chemically characterize arbitrary molecular compounds, but we know enough chemistry to synthesize simpler materials that contain them.