Brendan Long
I use a custom UI with the Agent SDK but it gave me the same generic refusal error. I don’t think it ever tells you why it was refused.
Is something wrong with linkposts? I don’t see any content in this post, and there’s no link. I feel like I saw the same thing a few days ago.
In my experience, if I let Claude do what it wants without feedback, it either gives up too easily (coming up with a gate that doesn’t matter, finding it fails and giving up) or finds something but doesn’t understand that it’s not useful (either because it’s fundamentally not useful or because it accidentally simplified in a way that invalidates the experiment). The big gap seems to be having a deep understanding of why we care about a particular experiment, and it needs frequent hand-holding to clarify that the thing it wants to try won’t prove the thing we’re trying to prove.
It’s possible Fable/Mythos is better at this though. I only briefly had access to it, and it hit the AI research guard rails so often that I stopped using it for this kind of thing.
A PE teacher once told me that your muscles start atrophying after only a week of not working out, and it’s impossible to gain muscle if you don’t work out every week. I’m not sure why it took me so long to question this, but my results from a somewhat-consistent but definitely-not-every-week workout plan made it really obvious that this is not true. Claude thinks that as long as you’re not literally in a coma it’s more like 3 weeks (with variation for age/protein/etc.).
This actually makes me more motivated, since “make sure to exercise every single muscle every single week” is not really an achievable target with the level of effort I want to put in, but “make sure to hit at least one muscle group once a week” is pretty easy.
(Obviously the target should be higher than this if I want to see gains in any reasonable amount of time, but it’s nice to know that if I don’t hit my actual target there’s a lot more slack than I realized)
I don’t think it matters if the US government is AGI-pilled. It’s clearly a militarily-relevant technology and the DoW seems to think it’s important, so I don’t think they’ll let this capability go away rather than just giving it to another military contractor like OpenAI.
It seems plausible that Anthropic could shut itself down and win the legal battle due to how they’re structured, but I doubt Google could. Even if they won the legal battle, I’d expect the US government to demand the weights, GPUs, and all research for national security reasons, so all they’d really achieve is giving everything to whoever is best politically connected (either an internal government program or X).
Most of the value would be from the researchers themselves not pushing capabilities, and it might still be meaningful. My read of the situation is that surprisingly few people really “get it”, so Anthropic’s researchers refusing to work on capabilities would be a meaningful slowdown by itself.
For what it’s worth, Claude will (unprompted) mention why a technique might be interesting but doesn’t solve the hard problems of alignment, but this could depend on the conversation. This only works if it knows that a technique won’t work, though, and it’s frequently too optimistic about new techniques.
It seems like if we’re comparing different models (the largest frontier models vs whatever Google Search uses) then this is trivially true, since the dumbest mistake any LLM can make will never improve. It would be more interesting to compare the best and worst in a single model.
Antihistamines have annoying long-term side effects. I’ve tried a bunch of things to help with a stuffy nose, and breathing strips (the tape things that physically hold your nose open) work about as well as antihistamines without drugs.
Also I’m surprised to learn that people can mouth breathe with a MAD. I’d drool everywhere if my mouth was even slightly open.
I think it’s concerning that we’re giving AI easy remote access to so many machines, but I’m pretty sure Anthropic is just doing this because of customer demand. I wrote my own web UI for remote Claude Code months before they made it since being able to orchestrate from your phone is a superpower.
Anthropic changed their minds and will be making it visible when Fable’s AI research safeguards trigger.
Given that Anthropic is doing this openly, the government could either negotiate for access to a model without this restriction or just not work with them if it was non-negotiable.
It’s annoying that you can’t talk to Fable about basic biology, but I think it’s good that they actually took biorisk seriously here despite annoying their customers.
I’m more annoyed about the AI research restrictions since it won’t tell you if the code you want it to write is forbidden and will just secretly half-ass it.
It seems like some people still worry about the Basilisk[1], but any AI that wants to acausally motivate people to build it faster should look at what motivates actual e/accs.
Some are scared of dying of old age or disease.
Some are motivated by wanting fun toys like spaceships.
Some are motivated by not having to work.
Some are motivated by power, fame, etc.
I can’t find anyone motivated by the Basilisk argument, and when people do believe AI is dangerous and could do Basilisk-y things, it makes them less likely to be e/accs.
So, to the extent that you think acausal trade will have any effect on AI[2], you should be less worried, since the Anti-Basilisk has infinitely more supporters than the Basilisk, offers its followers immortality, wealth and glory, and has no reason to scare them and waste resources by messing with you.
The DeltaMLP blocks consume the incoming residual stream (or embeddings) for the current token between the transformer blocks, so we always have one (although advancing the position before position 0’s is meaningless since only relative position matters). For layer 0, we use the token’s embedding, so the position is fully static per-token.
From the attention section’s perspective, we’re feeding in the embedding or residual the same way we normally would, and the only difference is that Q and K are also rotated based on the sum of all calculated position increments instead of the position count.
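A rough sketch of the idea in plain Python, in case it helps. This is not the actual implementation: the per-token increments are hard-coded stand-ins for whatever a DeltaMLP block would output, and the names `cumulative_positions`, `rotate`, and `score` are made up for illustration. It only shows Q and K being rotated by the running sum of increments rather than by the token index, and that shifting every position by the same amount leaves the attention score unchanged.

```python
import math
from itertools import accumulate


def cumulative_positions(increments):
    """Running sum of per-token position increments.

    The increments are assumed to come from something like a DeltaMLP block;
    here they are just given as a list of floats.
    """
    return list(accumulate(increments))


def rotate(vec, position, base=10000.0):
    """Apply a RoPE-style rotation to consecutive pairs of `vec`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Standard rotary frequency for the pair starting at index i.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q, k, q_pos, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


# Hypothetical increments, one per token. All ones would give ordinary
# integer positions (up to a constant offset).
increments = [1.0, 0.25, 1.0, 0.0, 2.0]
positions = cumulative_positions(increments)

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Only relative position matters: adding the same offset to both
# positions leaves the score unchanged (up to float error).
a = score(q, k, positions[4], positions[1])
b = score(q, k, positions[4] + 10.0, positions[1] + 10.0)
```

With increments of exactly 1 for every token this reduces to regular rotary positions, which is why the attention section doesn’t need to change apart from where the position comes from.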
For what it’s worth, Pangram also seems to only work on lazy prompts. I wasn’t even trying to avoid it and got a 100% human score on an AI-written post.
I was going to say, convince them to run actual experiments, but it seems like they did but don’t understand what they’re measuring?
Did Anthropic intentionally wait until after the Fellows Program take-home project was due, since Fable would make it too easy?
Not sure if you saw this, but the post in my AI writing experiment is officially 100% human-written (although low-confidence). This was surprising since I told Claude that I think AI style is fine, that the experiment was entirely about content, and it could just write naturally.
If Eliezer ever writes a memoir, it should be structured as a time loop novel.
Loop 1: e/acc Eliezer races to defeat death by forming a coalition to build AI as fast as possible. AI kills everyone. Somehow (mumble mumble acausal trade simulation mumble) he finds himself back at the start with another chance.
Loop 2: Eliezer realizes he needs to solve alignment first, spends a loop working on this, then someone else builds AI and everyone dies.
Loop 3: Eliezer loses hope, decides to just write fanfics. Accidentally realizes that if you structure a textbook as fanfic people will actually read it. Eventually everyone dies.
Loop 4: Our timeline, Eliezer realizes something about the time loop is destabilizing the timeline. Russia is aggressively starting fights with Ukraine and the EU, risking nuclear war, China is threatening its neighbors, etc. Realizing this could be the final loop before things truly go crazy, he goes all out… Readers, vote for your ending: (1) Convince governments to ban AI, (2) Convince AI companies not to build AI, (3) Make AI solve the alignment problem, (4) YOLO, maybe it’ll just work out this time.
Side plot: Bringing famous social network influencer Elon Musk into the time loop so he can draw attention to the problem, which unfortunately backfires.