Brendan Long
Channelguessr: A Discord game
It might be possible to use some other form of sandboxing on OSX, but I don’t know what’s available. Podman probably won’t work, but Docker is actually easier to set up than Podman. For Claude Code purposes, the cost of a VM to run Docker in is probably pretty minor.
Edit: Actually, Podman can be installed via a VM on OSX too (https://podman.io/docs/installation), although at that point you might as well use Docker, since the VM is providing isolation already.
I’d prefer to run Claude in cloud sandboxes. But the offering from Anthropic here is rather limited in how it can interact with git, and as a result not useful for me because it can’t use Graphite effectively.
I was running into the same problem, where I really just want to interact with Claude in the cloud but their cloud environment is too limited. I just finished a tool to run Claude Code as a web service from your own computer instead (and I access it remotely with Tailscale).
Clawed Abode: Claude Code is Too Cloudy
This is a really good point. Even if you could train a neuralese model, it would rapidly accumulate errors during inference and go out of distribution.
This is already a problem with tokenized models, where one incorrect token forces the model to condition on that token, but for continuous models we’d expect basically every output to have some error.
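To make the intuition concrete, here’s a toy sketch (purely illustrative, with made-up dimensions, not any real architecture): each “reasoning step” is a norm-preserving rotation plus a small error, and the only difference between the two runs is whether the state gets snapped back to a finite vocabulary afterwards.

import numpy as np

rng = np.random.default_rng(0)
dim, steps = 64, 200
vocab = rng.normal(size=(1024, dim))
vocab /= np.linalg.norm(vocab, axis=1, keepdims=True)    # unit-norm "token embeddings"
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # norm-preserving stand-in for one reasoning step

def snap(state):
    # Discrete CoT: replace the state with the nearest vocabulary vector.
    return vocab[np.argmax(vocab @ state)]

discrete = vocab[0].copy()
continuous = vocab[0].copy()
for _ in range(steps):
    noise = rng.normal(scale=0.01, size=dim)  # small per-step error
    discrete = snap(discrete @ rotation + noise)
    continuous = continuous @ rotation + noise

print("discrete-state norm:  ", round(float(np.linalg.norm(discrete)), 3))    # exactly 1.0: always a vocab vector
print("continuous-state norm:", round(float(np.linalg.norm(continuous)), 3))  # drifts away from 1.0 as errors accumulate

The discrete run can still pick a wrong token, but it can never leave the set of valid states; the continuous run drifts off the unit sphere (a stand-in for the training distribution) a little more every step.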
Yeah, definitely a lot of what I’ve asked it required software experience, sometimes fairly low-level (like describing the event loop I want for the background worker).
Re: Neuralese not winning, I think during practical inference you’d have similar-sized KV caches, so the memory usage is basically a wash (although storing the conversation as tokens when you’re not actively running would be much smaller than storing neuralese vectors).
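Rough back-of-envelope numbers, with made-up but plausible dimensions (not any particular model), for why the in-flight memory is a wash but the at-rest storage isn’t:

# Back-of-envelope comparison with made-up model dimensions.
layers, kv_heads, head_dim = 80, 8, 128
d_model = 8192
positions = 10_000          # length of the reasoning trace
bytes_fp16 = 2

# KV cache during inference: one K and one V vector per layer per position,
# regardless of whether each position came from a text token or a neuralese vector.
kv_cache = 2 * layers * kv_heads * head_dim * positions * bytes_fp16

# Storing the trace between sessions:
text_at_rest = positions * 4                          # roughly a few bytes per token
neuralese_at_rest = positions * d_model * bytes_fp16  # a full hidden vector per position

print(f"KV cache (either way):   {kv_cache / 1e9:.1f} GB")
print(f"text trace at rest:      {text_at_rest / 1e6:.2f} MB")
print(f"neuralese trace at rest: {neuralese_at_rest / 1e6:.0f} MB")

The KV cache depends only on the number of positions, not on what produced them; but a paused text trace compresses to a few bytes per token, while a paused neuralese trace has to keep a full hidden vector per position.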
But my understanding is that neuralese hasn’t won because it’s too hard to train. CoT works by training a base model to produce ~all kinds of human-like text, and then RL can extract human-like text that’s useful for reasoning. For neuralese, you have to train the reasoning from scratch, without teacher forcing, and getting that to work is (for now) too hard and not as effective as text CoT.
Great article though!
I’m on the $100 Max plan (“5x more usage than Pro”), although rate limits were doubled for most of this period as a holiday thing[1]. I used Claude Code on the web to fire off concurrent workers pretty often, and I only hit rate limits twice: Once around 11:55 pm (reset at midnight) and once in the afternoon about 10 minutes before the rate limit reset (on a different project, where I was making Claude repeatedly read long financial documents). I used Opus 4.5 exclusively.
Basically the rate limits never really got in my way, and I wouldn’t have hit them at all on the $200 plan (4x higher rate limits).
[1] I assume spare capacity since people were off of work.
I basically agree with this. When I say Claude is superhuman at coding, I mean that when Claude knows what needs to be done, it does it about as well as a human but much faster. When I say Claude isn’t superhuman at software engineering in general, it’s because sometimes it doesn’t take the right approach when an expert software engineer would.
I run multiple agents in parallel through the Claude Code web interface, so I actually managed to hit the limits on the Max plan. It was always within 10 minutes of the reset though.
I was also making Claude repeatedly read long investment documents for an unrelated project at the same time though.
Claude Wrote Me a 400-Commit RSS Reader App
Aren’t human reactions deterministic too though? I’m not sure I understand what you’re arguing.
But during inference, when you’re actually talking to Claude or GPT-4, the system is frozen. It processes your input, generates a response, and… that’s it. The prediction errors it’s implicitly computing don’t modify anything that persists as “the system.” There’s no absorption into durable structure.
It’s worth pointing out that the weights being fixed doesn’t mean everything is fixed. The LLM can react in-context, it just can’t modify other instances of itself.
It also feels too confident in its conclusions to be Scott.
I was thinking of “coder” as specifically the job of writing code, which I assume is what the Claude Code guy meant too. AI is clearly not reliable at system design yet.
The thing METR is measuring seems slightly different from “superhuman coder”. My understanding is that they’re dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so it’s partly software architecture and partly coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it go through massive todo lists fully autonomously[1]. This is weeks of coding and it did it in a few hours (mostly slowed down by me getting around to giving it more work).
This is the first time I’ve had it do tasks of this scale so I’m not doing anything special, just having it propose a design, telling it which parts I want done differently, then having it make a todo list and execute it.
[1] Example prompt:
Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
Isn’t getting working/production-ready code done faster the definition of being better than you at coding? It’s possible the creator of Claude Code is wrong about this (maybe he’d be more productive long-term writing this code himself, or the code is actually unacceptable in ways he hasn’t noticed yet), but if he’s right that it’s more productive to have Claude write it, then Claude is better at coding than he is.
I’m not sure if his approach is actually productive for this, but for the longest time, the standard response to Eliezer’s concerns was that they’re crazy sci-fi. Now that they’re not crazy sci-fi, the response is that they’re obvious. Constantly reminding people that his crazy predictions were right (and everyone else was wrong in predictable ways) is a strategy to get people to actually take his future predictions seriously (even though they’re obviously crazy sci-fi).
The probability for “Garage” hit 99% (Logit 15) at the very first step and stayed flat.
Is the problem that these questions are too easy, so the LLM is outputting reasoning since that’s sometimes helpful, but in this case it doesn’t actually need it?
I’d be curious to see what the results look like if you give it harder questions.
Attention Is Off By One is an interesting alternative approach to this, although in my tiny experiments it didn’t change the attention activations much, and Hacker News comments seem to agree.
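(For anyone who hasn’t read it: the proposal there is just an extra +1 in the softmax denominator, so a head can put near-zero weight everywhere. A minimal numpy sketch of the two variants as I understand them, not code from the post:)

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def softmax_one(scores):
    # "Attention Is Off By One": extra +1 in the denominator (equivalent to a
    # phantom key with score 0), so the weights can sum to less than 1.
    e = np.exp(scores - scores.max())
    return e / (e.sum() + np.exp(-scores.max()))

scores = np.array([4.0, 3.5, 3.8, -2.0])   # typical case: some clearly-relevant keys
print(softmax(scores).round(3))
print(softmax_one(scores).round(3))        # nearly identical when some score is large

low = np.array([-6.0, -5.5, -7.0, -6.2])   # head has nothing relevant to attend to
print(softmax(low).round(3))               # still forced to sum to 1
print(softmax_one(low).round(3))           # weights can collapse toward 0

When one score clearly dominates (the common case), the two are nearly identical, which is consistent with not seeing much change in practice; the difference only shows up when every score is very negative.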