If Eliezer ever writes a memoir, it should be structured as a time loop novel.
Loop 1: e/acc Eliezer races to defeat death by forming a coalition to build AI as fast as possible. AI kills everyone. Somehow (mumble mumble acausal trade simulation mumble) he finds himself back at the start with another chance.
Loop 2: Eliezer realizes he needs to solve alignment first, spends a loop working on this, then someone else builds AI and everyone dies.
Loop 3: Eliezer loses hope, decides to just write fanfics. Accidentally realizes that if you structure a textbook as fanfic people will actually read it. Eventually everyone dies.
Loop 4: Our timeline. Eliezer realizes something about the time loop is destabilizing the timeline. Russia is aggressively starting fights with Ukraine and the EU, risking nuclear war, China is threatening its neighbors, etc. Realizing this could be the final loop before things truly go crazy, he goes all out… Readers, vote for your ending: (1) Convince governments to ban AI, (2) Convince AI companies not to build AI, (3) Make AI solve the alignment problem, (4) YOLO, maybe it’ll just work out this time.
Side plot: Bringing famous social network influencer Elon Musk into the time loop so he can draw attention to the problem, which unfortunately backfires.
The time loop intersects the Madoka Magica one along an ill-specified hypersurface. In some iterations, Eliezer fights Homura over the future of the lightcone. In others, Eliezer dates Homura. In still others, one of the two does not exist, and the other has to create them (often by summoning them using the Astral Codex Kabbalah). In yet others, Eliezer does a fusion dance with Homura … and in most of those the resulting fusion immediately collapses into the Witch Durandal and everyone dies.
It’s annoying that you can’t talk to Fable about basic biology, but I think it’s good that they actually took biorisk seriously here despite annoying their customers.
I’m more annoyed about the AI research restrictions since it won’t tell you if the code you want it to write is forbidden and will just secretly half-ass it.
We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios.
And this seems.. less good:
In third-party evaluations on a near-launch checkpoint, Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed. The model frequently identified scenarios as “alignment traps” and reasoned that it should behave honestly because it was being evaluated.
I’m pleasantly surprised that they decided Safety should be one of the four sections in the announcement post, and that they call out the eval awareness.
Disclaimer: I work at Meta, but not in this department and I obviously don’t speak for the company.
Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should mention that this paper isn’t written by Meta and I haven’t read it to properly assess how concerned I would be about this.
Strong guess: they’re letting it generate images in the chain-of-thought. This would obviously be useful for image generation (make ten tries, pick the best parts of each for final answer) and is probably useful for other kinds of planning as well, but I’d guess it’s hard to RL a model into thinking in pictures usefully (no pretraining data of that form).
I guess you could RL a model into generating images that don’t actually do anything during chain-of-thought, just by instructing it to do so, then rewarding that. Depending on how competent Meta’s team is, they might have done either thing.
I thought I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out anything I could about Meta’s model and had Claude Opus 4.6 conclude that it is also 3-4 months behind. There is also the issue of Meta having manipulated some benchmarks to present Llama 4 as more capable. And in Meta’s claimed results on ARC-AGI-2 and SWE-bench Verified, the rivals’ models allegedly score differently than they do on the real leaderboards, likely because of a different elicitation method. How do I lobby for a law change requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
Anthropic found that training Claude to do things like help users resolve ethical dilemmas significantly reduced misbehavior like blackmail attempts. I’m surprised this worked, and it seems like good news for the alignment-by-default “LLMs will correctly generalize good behavior” theory.
It might have just helped Claude internalize and understand what Anthropic wanted to see in ~mundane cases and/or when we’re watching. We know it doesn’t generalize strongly to stop doing ugly hacks in coding against RL pressures. We don’t know if it generalizes to what it would do knowing it could overpower all of Anthropic, or in some other extremely OOD cases — which is the primary concern, as I understand it.
Yeah I don’t know how far this will generalize, but the fact that it learns this association at all is a very good sign. My default expectation is for LLMs to learn things in a disconnected way (like how “France’s capital is Paris” and “Paris is in France” are completely different circuits) and this is evidence against that in an alignment-relevant situation.
Are there any other mechanistic interpretability mentorship programs I should apply for in addition to MATS and the Anthropic Fellows Program? I think I know enough about the field and I’m semi-competent at ML but need more legible output and a network.
A PE teacher once told me that your muscles start atrophying after only a week of not working out, and it’s impossible to gain muscle if you don’t work out every week. I’m not sure why it took me so long to question this, but my results from a somewhat-consistent but definitely-not-every-week workout plan made it really obvious that this is not true. Claude thinks that as long as you’re not literally in a coma it’s more like 3 weeks (with variation for age/protein/etc.).
This actually makes me more motivated, since “make sure to exercise every single muscle every single week” is not really an achievable target with the level of effort I want to put in, but “make sure to hit at least one muscle group once a week” is pretty easy.
(Obviously the target should be higher than this if I want to see gains in any reasonable amount of time, but it’s nice to know that if I don’t hit my actual target there’s a lot more slack than I realized)
I have a theory that AI-assisted writing is bad because people are lazy about their prompts, and that a constraint that the prompt must be longer than the post[1] would make AI writing fine.
The result is 935 words (not ideal—I should have rambled longer), and I think it’s actually pretty good. I disagree with a small number of things it wrote, and the post could probably be shorter, but I think with 5 minutes of editing it would actually be a good post. I also forgot that I already wrote a post about this, so here’s the human-written control, which I think is actually worse.
I was pleasantly surprised that it mimics my voice reasonably well without even being asked to. It even ends the post awkwardly abruptly like I do.
Don’t Use Your Production Database as a Log
A production database should describe the current state of the world: the data your systems need to function right now. It’s tempting to also make it remember the past — keeping a column that records when something changed, or never overwriting a row so the old version sticks around. This is usually a mistake. If you need history, the right move is to record an actual log of state changes, kept separate from the database that runs production.
Mixing the two gives you the worst of both worlds. Your production schema gets more complicated and harder to migrate. It becomes harder to tell which data actually affects production versus what’s only there for posterity. Performance is typically worse, and you’re paying to keep historical data on your most expensive, most replicated systems. Meanwhile the log you get out of this is a bad log, because you almost never capture every state change — only the fragments you happened to design for.
The half-measures
Two patterns show up over and over.
The first is using a timestamp instead of a boolean — deleted_at instead of is_deleted, say. This isn’t really harmful on its own, which is exactly why it’s dangerous[3]: it gives a false sense of having solved the history problem. It only works for booleans, and only for the specific ones you remembered to do it for. Worse, it records only the latest transition. I’ve frequently wanted to know when a row changed, only to find it had changed again after the moment I cared about, so the timestamp I needed was already gone.
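A minimal sketch of that last failure mode, using Python's built-in sqlite3 as a stand-in for Postgres. The `users` table, the helper names, and the dates are all made up for illustration:

```python
import sqlite3


def make_db():
    """Create an in-memory table that uses deleted_at instead of is_deleted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT)")
    conn.execute("INSERT INTO users (id, deleted_at) VALUES (1, NULL)")
    return conn


def set_deleted_at(conn, user_id, when):
    """Overwrite the single timestamp column; when=None means 'restored'."""
    conn.execute("UPDATE users SET deleted_at = ? WHERE id = ?", (when, user_id))


def current_deleted_at(conn, user_id):
    """Return whatever the column holds right now."""
    row = conn.execute(
        "SELECT deleted_at FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


conn = make_db()
set_deleted_at(conn, 1, "2024-01-05")  # first deletion
set_deleted_at(conn, 1, None)          # restored
set_deleted_at(conn, 1, "2024-03-20")  # deleted again

# Only the latest transition survives; the January deletion is unrecoverable.
print(current_deleted_at(conn, 1))
```

After the third update there is nothing left in the table that says the row was ever deleted in January, or that it was restored in between.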
The second is making records write-only — never updating a row, always inserting a new version. This is genuinely useful when production itself needs the old versions. But when it doesn’t, you’ve made everything harder for no production benefit: every query now has to select the latest version of each row, which is especially painful when you want multiple results, and performance degrades as the table fills with dead history[4].
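To make the query cost concrete, here is a rough sketch of what "select the latest version of each row" looks like once every change is a new insert. It uses sqlite3 as a stand-in for Postgres, and the `job_versions` table and its contents are hypothetical:

```python
import sqlite3

# Every read of "current" data needs this join (or an equivalent window
# function) instead of a plain SELECT.
LATEST_VERSIONS_SQL = """
SELECT j.job_id, j.version, j.title
FROM job_versions AS j
JOIN (
    SELECT job_id, MAX(version) AS version
    FROM job_versions
    GROUP BY job_id
) AS latest
  ON latest.job_id = j.job_id AND latest.version = j.version
ORDER BY j.job_id
"""


def make_versions_db():
    """Create a table where rows are never updated, only re-inserted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_versions ("
        " job_id INTEGER NOT NULL,"
        " version INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " PRIMARY KEY (job_id, version))"
    )
    conn.executemany(
        "INSERT INTO job_versions (job_id, version, title) VALUES (?, ?, ?)",
        [
            (1, 1, "Enginer"),
            (1, 2, "Engineer"),
            (2, 1, "Designer"),
            (1, 3, "Senior Engineer"),
        ],
    )
    return conn


def latest_versions(conn):
    """Return one row per job_id: the one with the highest version."""
    return conn.execute(LATEST_VERSIONS_SQL).fetchall()


print(latest_versions(make_versions_db()))
```

With a normal mutable table the same read would be `SELECT job_id, title FROM jobs`, and the table would hold two rows instead of four.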
The crux
Here’s the part I expect some disagreement on. I think even the non-harmful version of these patterns is a bad idea, because it’s a partial solution to a problem that has a clean full solution. If you actually want history, a full log is much nicer — and it turns out to be surprisingly easy to get. A full log gives you every version and every change time of every column, not just the last change time of the booleans you remembered. Capturing it also keeps your production database simpler, faster, and cheaper, because the history lives somewhere else: a slower, less expensive, less replicated system that’s fine for data you query rarely.
To be clear about the boundary, this isn’t an argument against created_at and updated_at. For plenty of systems those columns are genuinely enough, and you shouldn’t build anything fancier. The argument is about what to do once you need more than that. At that point the temptation is to accrete more half-measures — another timestamp here, a history flag there — around a handful of columns. That’s the moment to go all the way instead.
It’s easier than it sounds
In Postgres, the temporal_tables extension is a set of functions and triggers that let you keep writing normal updates to your real table while every change is logged to a separate table with a validity time. There are other extensions too, plus a temporal tables feature in recent core Postgres. Because the trigger-based version is nothing but functions and triggers, you can even attach it only to a replica, so the history exists entirely outside your main production database.
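The general shape of the trigger approach can be sketched without any extension. This is not the temporal_tables API, just an illustration of the same idea using sqlite3 triggers; the `accounts` and `accounts_history` tables and the `valid_to` column are names I made up:

```python
import sqlite3

SCHEMA = """
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL
);
-- History table: deliberately no constraints, since old rows may not
-- satisfy the current schema.
CREATE TABLE accounts_history (
    id INTEGER,
    email TEXT,
    valid_to TEXT
);
-- Copy the old version of a row into the history table on every change.
CREATE TRIGGER accounts_log_update
AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_history (id, email, valid_to)
    VALUES (OLD.id, OLD.email, datetime('now'));
END;
CREATE TRIGGER accounts_log_delete
AFTER DELETE ON accounts
BEGIN
    INSERT INTO accounts_history (id, email, valid_to)
    VALUES (OLD.id, OLD.email, datetime('now'));
END;
"""


def make_logged_db():
    """Create the production table, the history table, and the triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def history(conn, account_id):
    """Return the superseded emails for one account, oldest first."""
    rows = conn.execute(
        "SELECT email FROM accounts_history WHERE id = ? ORDER BY rowid",
        (account_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = make_logged_db()
conn.execute("INSERT INTO accounts (id, email) VALUES (1, 'a@example.com')")
# Application code keeps writing plain UPDATEs against the real table.
conn.execute("UPDATE accounts SET email = 'b@example.com' WHERE id = 1")
conn.execute("UPDATE accounts SET email = 'c@example.com' WHERE id = 1")
print(history(conn, 1))  # ['a@example.com', 'b@example.com']
```

The production table still has exactly one row per account, and every earlier version is available from the history table.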
At a scale where it’s worth the effort — which it isn’t for most startups — you can skip triggers and stream the write-ahead log into another system, building the structured log there and reducing load on production. A common shape is Change Data Capture: Postgres → Debezium → Kafka → Clickhouse. The catch is that you have to configure it to keep the raw change events rather than deduplicating down to the final state, which is the default in a lot of setups.
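A toy illustration of why the raw events matter. The event shape below is hypothetical and only loosely resembles what a CDC pipeline emits; it is not Debezium's actual format:

```python
# Hypothetical change events, in the order they happened.
EVENTS = [
    {"ts": 1, "op": "insert", "key": "user:1", "value": {"email": "a@example.com"}},
    {"ts": 2, "op": "update", "key": "user:1", "value": {"email": "b@example.com"}},
    {"ts": 3, "op": "insert", "key": "user:2", "value": {"email": "z@example.com"}},
    {"ts": 4, "op": "delete", "key": "user:2", "value": None},
]


def compact_to_final_state(events):
    """Keep only the last value per key, the way a deduplicating sink would."""
    state = {}
    for event in events:
        if event["op"] == "delete":
            state.pop(event["key"], None)
        else:
            state[event["key"]] = event["value"]
    return state


def history_for(events, key):
    """Answer 'what happened to this key?' from the raw event stream."""
    return [(e["ts"], e["op"], e["value"]) for e in events if e["key"] == key]


# The compacted view has no trace of user:2 or of user:1's original email.
print(compact_to_final_state(EVENTS))
# The raw events can still answer both questions.
print(history_for(EVENTS, "user:2"))
```

If the destination only stores the compacted view, you have rebuilt a copy of the production table, not a log.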
There are edge cases. With the triggers, your log tables need to contain every column that has ever existed, since a missing column is silently dropped from the log, and you need to drop constraints, because historical rows aren’t necessarily valid under the current schema. For the CDC path the details are more involved — the main thing I know is that the consumer must never get blocked — and I’ll leave the specifics as an exercise for anyone operating at that scale. And as with any replication, you can do real damage if your replication clients fall too far behind.
Where this comes from
I ran into this on a system that integrated with client systems, where we regularly needed to answer questions like “when was this user account created?” or “what did the original version of this job application look like?” — and could only sometimes answer them from our database. We seriously considered a large rewrite to make everything write-only, then abandoned it once we realized the queries would be a nightmare, the performance would be abysmal, and the temporal-table triggers solved every one of our problems far more simply.
My direct experience here is with Postgres. I’m fairly confident SQL Server supports the same idea, and I’d guess most databases can do something equivalent, especially since the Postgres version needs nothing more than functions and triggers.
The writing-specific prompt(s) must be longer than the post. Having a long conversation and then prompting “write me a post about the conversation” doesn’t count. The prompt also needs to be 100% human-written and not copied from other sources, except for possibly allowing the author’s own 100% human-written notes. Having an AI interview you specifically about what you want the post to be about does count.
I’m not trying to blind Claude to the experiment, since if this was allowed by the AI writing policy, we could give agents style advice in the SKILL.md.
Calling this “dangerous” goes too far. Using timestamps for booleans is fine, but if you find yourself doing this for every column, you should really just set up a real log table.
I’m not sure what to think about AI-assisted writing in general. As for the “Don’t Use Your Production Database as a Log” post it generated, it seems pretty solid. Some constructive critiques do come to my mind (eg. focusing more on concrete examples). And the vibe feels “manicured” in a way that I don’t love, sorta like a cookie-cutter suburban neighborhood with perfect lawns. But I have my gripes with human-only writing as well; compared to human-only writing, my gripes with the AI-assisted version here aren’t particularly large.
I got to approximately my goal weight (18% body fat) and wanted to start gaining muscle[1] instead, so I stopped taking retatrutide to see what would happen. Nothing changed for about two weeks and then suddenly I was completely ravenous and ended up just wanting snack food. It’s weird because I definitely used to always feel that way, and it was just “normal”. I mostly kept the weight gain at bay with constant willpower.
I’m going to try taking around a quarter of my previous dose and see if it makes it easier to stay at approximately this weight and not constantly think about rice crispies.
I didn’t notice any muscle loss with retatrutide, I just started out less strong than I want to be and find it hard to gain muscle on a calorie deficit.
Yeah muscle loss hasn’t been a problem for me. I can do more pull-ups, push-ups and hike longer and faster than when I started. Progress was really slow with a significant calorie deficit.
I’m trying a much lower dose now to see if I can build muscle without rapidly regaining the weight.
Separately, I’m just really bad at dealing with the complexity of weights. I’m going to see if Crossfit helps this week.
I’ve been serving my personal website from CloudFront (Amazon’s CDN) for years, which was nice because it costs a few cents a month, but it annoyed me that cache misses get served slowly from S3. In some cases, this can take several hundred milliseconds. Completely unacceptable!
I finally decided to look up if anyone would let me serve all of my files from the CDN all of the time, and apparently Bunny CDN[1] does. It’s “expensive” (over 10 cents per GB per month!), but since my entire website is ~30 MB, I just told them to store the entire thing on SSDs in every edge region.
Result: Every page loads in ~40 ms from anywhere remotely near an edge location[2], regardless of how recently anyone else has requested the page.
My “unacceptable” above is mostly tongue-in-cheek, but there really is something nice about every link loading instantly rather than in half a second.
The code to do this is also much simpler since Bunny CDN has a CLI that handles sync properly, and cache “misses”[3] are so fast that I’m just not hot caching HTML pages.
I assume there are other options for this, but this is the one everyone talks about and it’s going to cost me like $0.10/mo, so I didn’t look very hard for alternatives.
There are two layers of lookups in a CDN: The CDN edge (hot cache) and the origin (usually slow). With Bunny CDN + Bunny storage, the origin is on an SSD in the same region, so a cache miss only takes a few milliseconds to load into the hot cache.
Chronotherapy is the idea that time of day matters for things like taking drugs or getting vaccinations, and chronoimmunology is a related field for how your immune system varies in effectiveness over the course of the day. I’ve been wanting to write about this since there’s definitely a best time of day to take drugs, get vaccines, and do social activities without getting sick… but unfortunately I don’t really know what that time is.
Some studies say your immune system is most primed to prevent infection right as you wake up, and others say mid-day. Of course half the studies are in mice. Maybe it depends on the disease and the chronotype? See this review.
One study says that vaccines work better in the morning (for older patients). Another says there’s no difference. Maybe this has something to do with the particular vaccines, or maybe the populations (different circadian rhythms, more powerful circadian rhythms). Weirdly, our priors say vaccination should work best mid-day but most people don’t even try that. See this review.
I find this all really interesting, and there’s probably a practical takeaway, but I don’t know what it is. I guess we can be pretty confident that you shouldn’t get vaccines in the middle of the night.
Maybe someone can convince Elizabeth to look into this.
What is this “inauthentic behavior” ban that I keep seeing people complaining about? What counts as “inauthentic”? Has Grok been reading Jean-Paul Sartre?
I assume it’s actually to spot botting or paid posting or something like that?
Clearly the inauthentic behavior detection is not good. I follow a few finance people and most of their popular tweets have someone in the replies with an identical name and profile picture but different handle, who tweets something like “Here’s my SECRET TRICK for 50% annual returns with zero risk!” It is a mystery why Twitter’s algorithm can’t detect these extremely obvious bots.
Yeah, and it seems like every post with >50k likes has someone in the comments mentioning that it’s a word-for-word copy of another post by a smaller account.
The help page for this says it’s for engagement farming, reposting stolen content, spam, etc. I have no idea why it triggered for me but I’ve heard a lot of people with completely normal activity got hit with it this week. It’s possible some of the likes were from bots trying to make their behavior less obvious, but it’s also possible the algorithm is just buggy and broken.
It seems likely that my appeal will eventually be approved but it’s annoying timing.
I was suspended earlier this year for “inauthentic behavior” on an account I haven’t used for years, successfully appealed it, was suspended again, and haven’t bothered appealing. Seems like the detector has been badly broken for a while.
One minor but nice benefit of GLP-1 drugs is that I don’t need to hold onto larger sizes of clothes “just in case”. Previously whatever strategies worked to maintain weight were extremely fragile and would break if I suddenly didn’t have time to cook potatoes for every meal. I plan to write a longer post about this sometime, but there’s a huge difference between “technically you can lose weight if you make it your full time job / religion and maintain that focus forever” and “take a drug once a week and you’re cured”.
I finally set up SkyPilot to let me queue up GPU training jobs (both on my local GPU and via RunPod), and I really should have done this months ago. Claude wrote me some bash scripts to spin up remote pods, run training, and tear them down, but this version is so much easier, and it has a nice UI.
It also sounds like I can easily extend this to Vast.ai, which would let me parallelize experiments for 5 cents/hour on RTX 3060s[1]. I’m interested in understanding algorithms used by tiny toy models, and fancy GPUs don’t really help since I can’t fully utilize them.
Anyway, if you’re also queueing up local experiments or trying to use remote GPUs efficiently, this is totally worth spending an hour to set up.
FYI: Claude really wanted to set this up in a way that would give every account on my machine root, but you can run the API server as a sudoer and let other users submit jobs without giving them root access. This matters to me because I use user accounts to sandbox dangerously-skip-permissions-mode Claude Code.
Update: SkyPilot is very opinionated about which GPUs I’m allowed to use on vast.ai, and simultaneously won’t let me add any filtering of my own, so this is less useful than I hoped it would be.
Beware, vast.ai is very much ‘airbnb for gpus’, which is to say it has the same security story as airbnb: the host can do whatever they want and you basically don’t know who they are.
Yeah that’s definitely important to be aware of. I think the security story should be fine in my case, since I’m submitting containerized jobs and uploading results to S3, and nothing is particularly secret (I’m training easy-to-train models so I can inspect the algorithms they learn).
One annoying thing about SkyPilot though is that it treats all GPUs on vast.ai equally and doesn’t let you pass additional filters besides “give me an RTX 5090”. The vastai CLI has a lot more options, including datacenter-only if you want.
It always seemed weird to me that dying is frequently described as not particularly painful[1], when I’d expect it to be the only literal 10 on the pain scale[2], since dying ensures you have no further chances to pass your genes on.
Thinking about it more though, there’s no reason for evolution to optimize that. If you think you’re going to die, and the pain makes you do something about it so you don’t die, then evolution should optimize to keep you alive. But in the case where you actually die it doesn’t matter because (tautologically), if you succeeded you wouldn’t die, so there’s no selective pressure.
Probably depends on the way of dying. There are situations where doing something in the last moment might change your fate. There are situations where your fate has already pretty much been determined minutes or months ago, and it’s just about how fast your body collapses.
Seems very related to this post from the sequences on fitness of people of numerical ages correlating more with imagined emotional anguish resulting from such a death (at that age) than with experienced anguish actually following such a death. Maybe this is a more common phenomenon observable in other contexts too, but this was the only example that came to my mind.
I agree, I just think it’s interesting that there’s evolutionary pressure to make potentially dying extremely painful, but there’s no evolutionary pressure to make actually dying painful, and all of the pain of actually dying is just collateral damage.
I thought “A Theory of Deep Learning” by Elon Litman was interesting, particularly its rule for selectively updating parameters: “if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it”. The claim is that this accelerates grokking by 5x, among other things. Unfortunately, when I tried it on a multi-step reasoning task, it made the model significantly worse at grokking and much more likely to memorize a composed lookup table. In my basic experiment, the model learned a multi-step algorithm 100% of the time using normal AdamW and 0% of the time with the update rule added. Claude has some opinions about why the update rule is counterproductive for grokking multi-step algorithms but I don’t really understand them; I just thought this was an interesting data point.
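For anyone who wants to picture the gating rule, here is a toy sketch of one possible reading of it. This is not the paper's definition and not the code from my experiment: I'm assuming "batch signal" means the absolute value of the mean per-example gradient, and "leave-one-out noise" means the spread of the batch mean when each example is dropped in turn. It also uses plain SGD rather than AdamW, and all function names are made up:

```python
from statistics import pstdev


def gate_mask(per_example_grads):
    """Return one bool per parameter: True if it should be updated.

    per_example_grads is a list of n gradient vectors (lists of floats),
    one per example in the batch, with n >= 2.
    """
    n = len(per_example_grads)
    n_params = len(per_example_grads[0])
    mask = []
    for j in range(n_params):
        column = [g[j] for g in per_example_grads]
        total = sum(column)
        # Assumed "signal": magnitude of the full-batch mean gradient.
        signal = abs(total / n)
        # Assumed "noise": spread of the mean when one example is left out.
        loo_means = [(total - g) / (n - 1) for g in column]
        noise = pstdev(loo_means)
        mask.append(signal > noise)
    return mask


def gated_step(params, per_example_grads, lr=0.1):
    """Apply a plain SGD step, but only to parameters that pass the gate."""
    n = len(per_example_grads)
    mask = gate_mask(per_example_grads)
    new_params = []
    for j, (p, keep) in enumerate(zip(params, mask)):
        mean_grad = sum(g[j] for g in per_example_grads) / n
        new_params.append(p - lr * mean_grad if keep else p)
    return new_params


# Parameter 0 gets a consistent gradient; parameter 1's gradients cancel out.
grads = [[1.0, 5.0], [1.0, -5.0], [1.0, 5.0], [1.0, -5.0]]
print(gate_mask(grads))
print(gated_step([0.0, 0.0], grads))
```

Under these assumptions, a parameter whose per-example gradients disagree with each other gets skipped, while one with a consistent gradient gets a normal update.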
Is there a canonical image alt text AI skill? I’ve designed my own after making Claude read a bunch of pages about how to write alt text, but this feels like something that an expert could do better than I can. The results seem good to me, but as a non-alt-text-user it’s hard to really know.
I added an MCP tool to upload markdown articles to read later on Lion Reader, and it’s becoming one of my most used tools in Claude Code. Whenever I want to learn something but don’t want to be distracted from my current task, I can have Claude write me something to read later[1]; and when I have it run experiments, it can write a report and upload it directly.
The really confusing thing is that Instapaper and Pocket’s MCP tools don’t seem to support direct uploads at all (just saving URLs). It just seems like a glaringly missing feature. Am I the only one who does this kind of thing, or do other people save to notes apps or Google Docs or something?
This post was inspired by a single session where Claude wrote me a post about pre-LayerNorm and why I should use it instead of post-LN, an explainer about post-training acronyms (SFT, RLHF, PPO, DPO) and how they apply to an idea I had, plus two reports on circuits in a toy model and the outcome of an architecture change.
Reading all of this at my computer in a terminal would have been annoying, and asking a separate claude.ai session would have required re-explaining the context.
The really confusing thing is that Instapaper and Pocket’s MCP tools don’t seem to support direct uploads at all (just saving URLs)
Is there a length limit on the “urls” you save? Can you save a 60 kB “URL” which is a data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20**This%20is%20bold**?
[Link to that “document” if you want to test](data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20This%20is%20bold)
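If anyone wants to try this with a real document, a data URL like the one above can be built with Python's standard library. The function name is made up, and whether any given read-later service accepts `data:` URLs (or ones this long) is exactly the open question:

```python
from urllib.parse import quote, unquote


def markdown_data_url(markdown_text):
    """Percent-encode a markdown string into a data: URL."""
    # safe="*" leaves asterisks unescaped, matching the example URL above.
    return "data:text/markdown;charset=utf-8," + quote(markdown_text, safe="*")


doc = "# For Later\n\nThis is a markdown document. **This is bold**"
url = markdown_data_url(doc)
print(url)

# Decoding the payload gives back the original markdown.
print(unquote(url.split(",", 1)[1]) == doc)
```

Percent-encoding inflates the size: every space and newline becomes three characters, so a 60 kB document ends up noticeably larger as a URL.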
It seems like some people still worry about the Basilisk[1], but any AI that wants to acausally motivate people to build it faster should look at what motivates actual e/accs.
Some are scared of dying of old age or disease.
Some are motivated by wanting fun toys like spaceships.
Some are motivated by not having to work.
Some are motivated by power, fame, etc.
I can’t find anyone motivated by the Basilisk argument, and when people do believe AI is dangerous and could do Basilisk-y things, it makes them less likely to be e/accs.
So, to the extent that you think acausal trade will have any effect on AI[2], you should be less worried, since the Anti-Basilisk has infinitely more supporters than the Basilisk, offers its followers immortality, wealth and glory, and has no reason to scare them and waste resources by messing with you.
If you don’t know what I’m talking about, just ignore this post. It’s an alleged infohazard that some people find distressing to learn about. I don’t think it should be distressing since the argument is so unconvincing, but still.
I’m cautiously optimistic about my new Claude Coach GitHub repo. I want to work out more but hate trying to decide what to do and tracking things, especially when I’m not working with a full gym. Now I just open Claude Code and ask it what to do (specifying the gym), do the workout, then update it with what I did and how it felt. It creates a PR to track the session and update the plan.
I still hate working out, but at least I don’t have to go anywhere, deal with any people, or think about it at all.
I’d like to learn more Spanish words but have trouble sitting down to actually do language lessons, so I recently set my Claude “personal preferences” to:
Try to teach a random Spanish word in every conversation.
(This is the whole thing)
This has worked surprisingly well, and Claude usually either drops one word in Spanish with a translation midway through a response:
For your specific situation, I recommend a calibración (calibration) approach:
2. Accounting for concurrency: Ensure you’re capturing all hilos (threads) involved in query execution, especially for parallel queries.
(From a conversation about benchmarking)
Or it ends the conversation with a fun fact:
¡Palabra en español! “Herramienta”—which means “tool” in Spanish, quite relevant to your search for tools to automate SSH known_hosts management.
La palabra española para hoy es “configurar”—which means “to configure” in English, fitting perfectly with our discussion about configurable thinking limits!
I don’t know if this is actually useful for learning, but it’s fun and worked better than I expected.
My wife tried a similar prompt (although her preferences are much longer) and it made Claude sometimes respond entirely in Spanish, so this could probably be made more specific. If you run into that, maybe “Respond in English but try to teach a random Spanish word in every conversation” would work better?
I’m starting to suspect the link between working out and health is backwards. I’ve struggled to work out consistently for years, and now that I’m sleeping better[1], finding time and energy to work out is relatively easy.
Could an AI company legally pre-commit not to race, ensuring that their models were never more than second best and self-destructing the company if its models take the lead?
I think probably not. It’s really hard to prevent the owners of a company from doing what they want, especially if the company is important to the economy and/or national security (and I assume any near-frontier AI lab would be).
Some pre-commitment methods and their problems:
If you make the pre-commitment part of the charter, the board can just vote to change the charter. Even if the charter says they can’t, a judge would probably let them anyway, as long as the shareholders agreed.
If the company is owned by a non-profit tasked with enforcement, the board of the non-profit can just decide not to enforce the pre-commitment.
If the pre-commitment method triggers the destruction of model weights or other assets (like GPUs), the government probably won’t allow it.
Especially if it prevents creditors from getting repaid.
A pre-commitment method that transfers value to creditors might work, but is easily defeated by restructuring the relevant debt.
Anything that destroys the value of current equity holders’ equity is risky in front of a judge because companies generally aren’t allowed to intentionally destroy shareholder value[1].
The only thing I think might work legally is to issue a bunch of non-voting non-dilutable restricted shares (like 90% of the company) to someone like Eliezer, locked up with the racing condition[2] as a trigger to convert them to normal shares. Legally, Eliezer is the owner of the company the whole time, so a judge would probably allow his shares to unlock.
The problem is that now Eliezer has billions of reasons to talk himself into why racing would be good this time (even before the trigger event, since he can always make a deal with the board).. so we’re back to ownership by another entity that might change its mind[3].
Oh did I mention that you need the pre-commitment trigger to be unambiguous while ensuring that it never triggers by mistake, and that’s actually pretty hard too?
I can think of plenty of reasons for the normal downvote, but I’m confused about the disagree vote. Does someone think there is a way to make this work? I’m guessing “start another AI company but better this time” is still a bad idea for the obvious reasons but I got nerd-sniped by the legal question.
If Eliezer ever writes a memoir, it should be structured as a time loop novel.
Loop 1: e/acc Eliezer races to defeat death by forming a coalition to build AI as fast as possible. AI kills everyone. Somehow (mumble mumble acausal trade simulation mumble) he finds himself back at the start with another chance.
Loop 2: Eliezer realizes he needs to solve alignment first, spends a loop working on this, then someone else builds AI and everyone dies.
Loop 3: Eliezer loses hope, decides to just write fanfics. Accidentally realizes that if you structure a textbook as fanfic people will actually read it. Eventually everyone dies.
Loop 4: Our timeline, Eliezer realizes something about the time loop is destabilizing the timeline. Russia is aggressively starting fights with Ukraine and the EU, risking nuclear war, China is threatening its neighbors, etc. Realizing this could be the final loop before things truly go crazy, he goes all out… Readers, vote for your ending: (1) Convince governments to ban AI, (2) Convince AI companies not to build AI, (3) Make AI solve the alignment problem, (4) YOLO, maybe it’ll just work out this time.
Side plot: Bringing famous social network influencer Elon Musk into the time loop so he can draw attention to the problem, which unfortunately backfires.
The time loop intersects the Madoka Magica one along an ill-specified hypersurface. In some iterations, Eliezer fights Homura over the future of the lightcone. In others, Eliezer dates Homura. In still others, one of the two does not exist, and the other has to create them (often by summoning them using the Astral Codex Kabbalah). In yet others, Eliezer does a fusion dance with Homura … and in most of those the resulting fusion immediately collapses into the Witch Durandal and everyone dies.
It’s annoying that you can’t talk to Fable about basic biology, but I think it’s good that they actually took biorisk seriously here despite annoying their customers.
I’m more annoyed about the AI research restrictions since it won’t tell you if the code you want it to write is forbidden and will just secretly half-ass it.
I’m surprised no one is discussing Meta’s new model at all: https://ai.meta.com/blog/introducing-muse-spark-msl/
This part seems good:
And this seems.. less good:
I’m pleasantly surprised that they decided Safety should be one of the four sections in the announcement post, and that they call out the eval awareness.
Disclaimer: I work at Meta, but not in this department and I obviously don’t speak for the company.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should mention that this paper isn’t written by Meta and I haven’t read it to properly assess how concerned I would be about this.
Strong guess: they’re letting it generate images in the chain-of-thought. This would obviously be useful for image generation (make ten tries, pick the best parts of each for final answer) and is probably useful for other kinds of planning as well, but I’d guess it’s hard to RL a model into thinking in pictures usefully (no pretraining data of that form).
I guess you could RL a model into generating images that don’t actually do anything during chain-of-thought, just by instructing it to do so, then rewarding that. Depending on how competent Meta’s team is, they might have done either thing.
I thought I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out anything about Meta’s model and had Claude Opus 4.6 conclude that it is also 3-4 months behind. There is also the issue of Meta having manipulated some benchmarks to present Llama 4 as more capable. And in Meta’s claimed results on ARC-AGI-2 and SWE-bench Verified, the rivals’ models allegedly score differently than on the real ARC-AGI-2 and SWE-bench Verified leaderboards, likely because of a different elicitation method. How do I lobby for a law change requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
Anthropic found that training Claude to do things like help users resolve ethical dilemmas significantly reduced misbehavior like blackmail attempts. I’m surprised this worked, and it seems like good news for the alignment-by-default “LLMs will correctly generalize good behavior” theory.
https://www.anthropic.com/research/teaching-claude-why
Perhaps the model is updating its prior on “I am in an alignment eval” relative to “I am in a ridiculous roleplay scenario”.
It might have just helped Claude internalize and understand what Anthropic wanted to see in ~mundane cases and/or when we’re watching. We know it doesn’t generalize strongly to stop doing ugly hacks in coding against RL pressures. We don’t know if it generalizes to what it would do knowing it could overpower all of Anthropic, or in some other extremely OOD cases — which is the primary concern, as I understand it.
Yeah I don’t know how far this will generalize, but the fact that it learns this association at all is a very good sign. My default expectation is for LLMs to learn things in a disconnected way (like how “France’s capital is Paris” and “Paris is in France” are completely different circuits) and this is evidence against that in an alignment-relevant situation.
Are there any other mechanistic interpretability mentorship programs I should apply for in addition to MATS and the Anthropic Fellows Program? I think I know enough about the field and I’m semi-competent at ML but need more legible output and a network.
There’s a lot, off the top of my head: LASR, MARS, Pivotal, SPAR
Thanks! I had never heard of Pivotal but some of their mentors are working on really interesting projects.
Looks like I need to apply for all of these a little earlier next round though.
A PE teacher once told me that your muscles start atrophying after only a week of not working out, and it’s impossible to gain muscle if you don’t work out every week. I’m not sure why it took me so long to question this, but my results from a somewhat-consistent but definitely-not-every-week workout plan made it really obvious that this is not true. Claude thinks that as long as you’re not literally in a coma it’s more like 3 weeks (with variation for age/protein/etc.).
This actually makes me more motivated, since “make sure to exercise every single muscle every single week” is not really an achievable target with the level of effort I want to put in, but “make sure to hit at least one muscle group once a week” is pretty easy.
(Obviously the target should be higher than this if I want to see gains in any reasonable amount of time, but it’s nice to know that if I don’t hit my actual target there’s a lot more slack than I realized)
I have a theory that AI-assisted writing is bad because people are lazy about their prompts, and that a constraint that the prompt must be longer than the post[1] would make AI writing fine.
To test this, I gave Claude some instructions plus a 952-word rambling prompt and asked it to write a post. Claude knew the audience was LessWrong and I gave it the advice to not try to be maximally persuasive[2], but otherwise let it write naturally. I initially asked it to interview me but it thought the prompt was sufficient and I basically agree.
The result is 935 words (not ideal—I should have rambled longer), and I think it’s actually pretty good. I disagree with a small number of things it wrote, and the post could probably be shorter, but I think with 5 minutes of editing it would actually be a good post. I also forgot that I already wrote a post about this, so here’s the human-written control, which I think is actually worse.
I was pleasantly surprised that it mimics my voice reasonably well without even being asked to. It even ends the post awkwardly abruptly like I do.
Don’t Use Your Production Database as a Log
A production database should describe the current state of the world: the data your systems need to function right now. It’s tempting to also make it remember the past — keeping a column that records when something changed, or never overwriting a row so the old version sticks around. This is usually a mistake. If you need history, the right move is to record an actual log of state changes, kept separate from the database that runs production.
Mixing the two gives you the worst of both worlds. Your production schema gets more complicated and harder to migrate. It becomes harder to tell which data actually affects production versus what’s only there for posterity. Performance is typically worse, and you’re paying to keep historical data on your most expensive, most replicated systems. Meanwhile the log you get out of this is a bad log, because you almost never capture every state change — only the fragments you happened to design for.
The half-measures
Two patterns show up over and over.
The first is using a timestamp instead of a boolean — `deleted_at` instead of `is_deleted`, say. This isn’t really harmful on its own, which is exactly why it’s dangerous[3]: it gives a false sense of having solved the history problem. It only works for booleans, and only for the specific ones you remembered to do it for. Worse, it records only the latest transition. I’ve frequently wanted to know when a row changed, only to find it had changed again after the moment I cared about, so the timestamp I needed was already gone.
The second is making records write-only — never updating a row, always inserting a new version. This is genuinely useful when production itself needs the old versions. But when it doesn’t, you’ve made everything harder for no production benefit: every query now has to select the latest version of each row, which is especially painful when you want multiple results, and performance degrades as the table fills with dead history[4].
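A minimal sketch of both half-measures, using Python’s built-in sqlite3 as a stand-in for a production database. The table names, column names, and dates are made up for illustration and don’t come from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Half-measure 1: a deleted_at timestamp only remembers the latest transition.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")

def set_deleted_at(user_id, when):
    """Overwrite deleted_at in place (None means 'not deleted')."""
    conn.execute("UPDATE users SET deleted_at = ? WHERE id = ?", (when, user_id))

set_deleted_at(1, "2024-01-10")  # first deletion (hypothetical date)
set_deleted_at(1, None)          # restored
set_deleted_at(1, "2024-03-05")  # deleted again

# Only the second deletion survives; the first one is unrecoverable.
deleted_at = conn.execute("SELECT deleted_at FROM users WHERE id = 1").fetchone()[0]

# Half-measure 2: a write-only table makes every read pick the latest version.
conn.execute("CREATE TABLE user_versions (user_id INTEGER, version INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO user_versions VALUES (?, ?, ?)",
    [(1, 1, "alice"), (1, 2, "alicia"), (2, 1, "bob")],
)
latest = conn.execute(
    """
    SELECT v.user_id, v.name
    FROM user_versions AS v
    JOIN (
        SELECT user_id, MAX(version) AS version
        FROM user_versions
        GROUP BY user_id
    ) AS newest
      ON newest.user_id = v.user_id AND newest.version = v.version
    ORDER BY v.user_id
    """
).fetchall()
```

The extra join in the second query is the cost that every read pays, even when production only ever cares about the current row.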
The crux
Here’s the part I expect some disagreement on. I think even the non-harmful version of these patterns is a bad idea, because it’s a partial solution to a problem that has a clean full solution. If you actually want history, a full log is much nicer — and it turns out to be surprisingly easy to get. A full log gives you every version and every change time of every column, not just the last change time of the booleans you remembered. Capturing it also keeps your production database simpler, faster, and cheaper, because the history lives somewhere else: a slower, less expensive, less replicated system that’s fine for data you query rarely.
To be clear about the boundary, this isn’t an argument against `created_at` and `updated_at`. For plenty of systems those columns are genuinely enough, and you shouldn’t build anything fancier. The argument is about what to do once you need more than that. At that point the temptation is to accrete more half-measures — another timestamp here, a history flag there — around a handful of columns. That’s the moment to go all the way instead.
It’s easier than it sounds
In Postgres, the `temporal_tables` extension is a set of functions and triggers that let you keep writing normal updates to your real table while every change is logged to a separate table with a validity time. There are other extensions too, plus a temporal tables feature in recent core Postgres. Because the trigger-based version is nothing but functions and triggers, you can even attach it only to a replica, so the history exists entirely outside your main production database.
At a scale where it’s worth the effort — which it isn’t for most startups — you can skip triggers and stream the write-ahead log into another system, building the structured log there and reducing load on production. A common shape is Change Data Capture: Postgres → Debezium → Kafka → Clickhouse. The catch is that you have to configure it to keep the raw change events rather than deduplicating down to the final state, which is the default in a lot of setups.
There are edge cases. With the triggers, your log tables need to contain every column that has ever existed, since a missing column is silently dropped from the log, and you need to drop constraints, because historical rows aren’t necessarily valid under the current schema. For the CDC path the details are more involved — the main thing I know is that the consumer must never get blocked — and I’ll leave the specifics as an exercise for anyone operating at that scale. And as with any replication, you can do real damage if your replication clients fall too far behind.
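Here is a rough sketch of the trigger idea, using sqlite3 from the Python standard library rather than Postgres. This is not the temporal_tables extension or its API, just an analogy: the `accounts` and `accounts_log` tables and all three triggers are hypothetical names, and the log records a change time per event instead of a validity range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The production table only holds current state.
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        is_deleted INTEGER NOT NULL DEFAULT 0
    );

    -- The log mirrors every column but carries no constraints on them,
    -- since old rows may not be valid under the current schema.
    CREATE TABLE accounts_log (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        id INTEGER,
        email TEXT,
        is_deleted INTEGER
    );

    CREATE TRIGGER accounts_log_insert AFTER INSERT ON accounts BEGIN
        INSERT INTO accounts_log (op, id, email, is_deleted)
        VALUES ('INSERT', NEW.id, NEW.email, NEW.is_deleted);
    END;

    CREATE TRIGGER accounts_log_update AFTER UPDATE ON accounts BEGIN
        INSERT INTO accounts_log (op, id, email, is_deleted)
        VALUES ('UPDATE', NEW.id, NEW.email, NEW.is_deleted);
    END;

    CREATE TRIGGER accounts_log_delete AFTER DELETE ON accounts BEGIN
        INSERT INTO accounts_log (op, id, email, is_deleted)
        VALUES ('DELETE', OLD.id, OLD.email, OLD.is_deleted);
    END;
    """
)

# Application code keeps writing ordinary statements against the real table.
conn.execute("INSERT INTO accounts (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE accounts SET email = 'b@example.com' WHERE id = 1")
conn.execute("UPDATE accounts SET is_deleted = 1 WHERE id = 1")

# Every intermediate version is recoverable from the log.
history = conn.execute(
    "SELECT op, email, is_deleted FROM accounts_log WHERE id = 1 ORDER BY log_id"
).fetchall()
```

In this toy version the log lives in the same database file; the point of the replica trick above is that in Postgres it doesn’t have to.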
Where this comes from
I ran into this on a system that integrated with client systems, where we regularly needed to answer questions like “when was this user account created?” or “what did the original version of this job application look like?” — and could only sometimes answer them from our database. We seriously considered a large rewrite to make everything write-only, then abandoned it once we realized the queries would be a nightmare, the performance would be abysmal, and the temporal-table triggers solved every one of our problems far more simply.
My direct experience here is with Postgres. I’m fairly confident SQL Server supports the same idea, and I’d guess most databases can do something equivalent, especially since the Postgres version needs nothing more than functions and triggers.
The writing-specific prompt(s) must be longer than the post. Having a long conversation and then prompting “write me a post about the conversation” doesn’t count. The prompt also needs to be 100% human-written and not copied from other sources, except for possibly allowing the author’s own 100% human-written notes. Having an AI interview you specifically about what you want the post to be about does count.
I’m not trying to blind Claude to the experiment, since if this was allowed by the AI writing policy, we could give agents style advice in the SKILL.md.
Calling this “dangerous” goes too far. Using timestamps for booleans is fine but if you find yourself doing this for every column, you should really just set up a real log table.
I would have made this stronger: it’s incredibly difficult to index and query write-only tables and maintain good performance.
I’m not sure what to think about AI-assisted writing in general. As for the “Don’t Use Your Production Database as a Log” post it generated, it seems pretty solid. Some constructive critiques do come to my mind (eg. focusing more on concrete examples). And the vibe feels “manicured” in a way that I don’t love, sorta like a cookie-cutter suburban neighborhood with perfect lawns. But I have my gripes with human-only writing as well; compared to human-only writing, my gripes with the AI-assisted version here aren’t particularly large.
Interestingly, Pangram says this post is 100% human-written.
I got to approximately my goal weight (18% body fat) and wanted to start gaining muscle[1] instead, so I stopped taking retatrutide to see what would happen. Nothing changed for about two weeks and then suddenly I was completely ravenous and ended up just wanting snack food. It’s weird because I definitely used to always feel that way, and it was just “normal”. I mostly kept the weight gain at bay with constant willpower.
I’m going to try taking around a quarter of my previous dose and see if it makes it easier to stay at approximately this weight and not constantly think about rice crispies.
I didn’t notice any muscle loss with retatrutide, I just started out less strong than I want to be and find it hard to gain muscle on a calorie deficit.
Are you also lifting weights? I’m quite confident that you can gain muscle while taking retatrutide if you lift weights.
IIRC GLP-1 agonists cause more muscle loss than “old-fashioned” dieting, but the effect of resistance training far outweighs the extra muscle loss.
Yeah muscle loss hasn’t been a problem for me. I can do more pull-ups, push-ups and hike longer and faster than when I started. Progress was really slow with a significant calorie deficit.
I’m trying a much lower dose now to see if I can build muscle without rapidly regaining the weight.
Separately, I’m just really bad at dealing with the complexity of weights. I’m going to see if Crossfit helps this week.
Did Anthropic intentionally wait until after the Fellows Program take-home project was due, since Fable would make it too easy?
I’ve been serving my personal website from CloudFront (Amazon’s CDN) for years, which was nice because it costs a few cents a month, but it annoyed me that cache misses get served slowly from S3. In some cases, this can take several hundred milliseconds. Completely unacceptable!
I finally decided to look up if anyone would let me serve all of my files from the CDN all of the time, and apparently Bunny CDN[1] does. It’s “expensive” (over 10 cents per GB per month!), but since my entire website is ~30 MB, I just told them to store the entire thing on SSDs in every edge region.
Result: Every page loads in ~40 ms from anywhere remotely near an edge location[2], regardless of how recently anyone else has requested the page.
My “unacceptable” above is mostly tongue-in-cheek, but there really is something nice about every link loading instantly rather than in half a second.
The code to do this is also much simpler since Bunny CDN has a CLI that handles sync properly, and cache “misses”[3] are so fast that I’m just not hot caching HTML pages.
I assume there are other options for this, but this is the one everyone talks about and it’s going to cost me like $0.10/mo, so I didn’t look very hard for alternatives.
Sadly, there aren’t edge locations in the Middle East, China, Russia, or most of Africa, so people in those countries may experience 80 ms load times.
There’s two layers of lookups in a CDN: The CDN edge (hot cache) and the origin (usually slow). With Bunny CDN + Bunny storage, the origin is on an SSD in the same region, so a cache miss only takes a few milliseconds to load into the hot cache.
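A toy model of the two-layer lookup, just to make the latency argument concrete. The class and its parameters are hypothetical, and the millisecond figures are rough illustrations loosely based on the numbers in this post, not measurements of CloudFront, S3, or Bunny CDN.

```python
class TwoLayerCDN:
    """Toy model: an edge hot cache sitting in front of an origin store."""

    def __init__(self, origin_files, edge_ms, origin_ms):
        self.origin_files = origin_files  # path -> body, as stored at the origin
        self.edge_ms = edge_ms            # time to answer from the edge
        self.origin_ms = origin_ms        # extra time to fetch from the origin
        self.hot_cache = {}

    def get(self, path):
        """Return (body, latency_ms) and warm the hot cache on a miss."""
        if path in self.hot_cache:
            return self.hot_cache[path], self.edge_ms
        body = self.origin_files[path]
        self.hot_cache[path] = body
        return body, self.edge_ms + self.origin_ms


files = {"/index.html": "<h1>hi</h1>"}

# ASSUMPTION: illustrative latencies only.
far_origin = TwoLayerCDN(files, edge_ms=40, origin_ms=300)  # distant origin
same_region = TwoLayerCDN(files, edge_ms=40, origin_ms=5)   # SSD next to the edge

_, far_miss = far_origin.get("/index.html")
_, near_miss = same_region.get("/index.html")
_, near_hit = same_region.get("/index.html")
```

With a same-region origin the gap between a miss and a hit is small enough that pre-warming the hot cache stops being worth the code.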
Chronotherapy is the idea that time of day matters for things like taking drugs or getting vaccinations, and chronoimmunology is a related field for how your immune system varies in effectiveness over the course of the day. I’ve been wanting to write about this since there’s definitely a best time of day to take drugs, get vaccines, and do social activities without getting sick… but unfortunately I don’t really know what that time is.
Some studies say your immune system is most primed to prevent infection right as you wake up, and others say mid-day. Of course half the studies are in mice. Maybe it depends on the disease and the chronotype? See this review.
One study says that vaccines work better in the morning (for older patients). Another says there’s no difference. Maybe this has something to do with the particular vaccines, or maybe the populations (different circadian rhythms, more powerful circadian rhythms). Weirdly, our priors say vaccination should work best mid-day but most people don’t even try that. See this review.
I find this all really interesting, and there’s probably a practical takeaway, but I don’t know what it is. I guess we can be pretty confident that you shouldn’t get vaccines in the middle of the night.
Maybe someone can convince Elizabeth to look into this.
Anthropic changed their minds and will be making it visible when Fable’s AI research safeguards trigger.
More like “caved in”.
Finally get engagement on a Twitter post about my AI research → immediately get banned for “inauthentic behavior”. Sigh.
What is this “inauthentic behavior” ban that I keep seeing people complaining about? What counts as “inauthentic”? Has Grok been reading Jean-Paul Sartre?
I assume it’s actually to spot botting or paid posting or something like that?
Clearly the inauthentic behavior detection is not good. I follow a few finance people and most of their popular tweets have someone in the replies with an identical name and profile picture but different handle, who tweets something like “Here’s my SECRET TRICK for 50% annual returns with zero risk!” It is a mystery why Twitter’s algorithm can’t detect these extremely obvious bots.
Yeah, and it seems like every post with >50k likes has someone in the comments mentioning that it’s a word-for-word copy of another post by a smaller account.
The help page for this says it’s for engagement farming, reposting stolen content, spam, etc. I have no idea why it triggered for me but I’ve heard a lot of people with completely normal activity got hit with it this week. It’s possible some of the likes were from bots trying to make their behavior less obvious, but it’s also possible the algorithm is just buggy and broken.
It seems likely that my appeal will eventually be approved but it’s annoying timing.
I was suspended earlier this year for “inauthentic behavior” on an account I haven’t used for years, successfully appealed it, was suspended again, and haven’t bothered appealing. Seems like the detector has been badly broken for a while.
One minor but nice benefit of GLP-1 drugs is that I don’t need to hold onto larger sizes of clothes “just in case”. Previously whatever strategies worked to maintain weight were extremely fragile and would break if I suddenly didn’t have time to cook potatoes for every meal. I plan to write a longer post about this sometime, but there’s a huge difference between “technically you can lose weight if you make it your full time job / religion and maintain that focus forever” and “take a drug once a week and you’re cured”.
I finally set up SkyPilot to let me queue up GPU training jobs (both on my local GPU and via RunPod), and I really should have done this months ago. Claude wrote me some bash scripts to spin up remote pods, run training, and tear it down, but this version is so much easier, and it has a nice UI.
It also sounds like I can easily extend this to Vast.ai, which would let me parallelize experiments for 5 cents/hour on RTX 3060s[1]. I’m interested in understanding algorithms used by tiny toy models, and fancy GPUs don’t really help since I can’t fully utilize them.
Anyway, if you’re also queueing up local experiments or trying to use remote GPUs efficiently, this is totally worth spending an hour to set up.
FYI: Claude really wanted to set this up in a way that would give every account on my machine root, but you can run the API server as a sudoer and let other users submit jobs without giving them root access. This matters to me because I use user accounts to sandbox dangerously-skip-permissions-mode Claude Code.
Update: SkyPilot is very opinionated about which GPUs I’m allowed to use on vast.ai, and simultaneously won’t let me add any filtering of my own, so this is less useful than I hoped it would be.
Beware, vast.ai is very much ‘airbnb for gpus’, which is to say it has the same security story as airbnb: the host can do whatever they want and you basically don’t know who they are.
Yeah that’s definitely important to be aware of. I think the security story should be fine in my case, since I’m submitting containerized jobs and uploading results to S3, and nothing is particularly secret (I’m training easy-to-train models so I can inspect the algorithms they learn).
One annoying thing about SkyPilot though is that it treats all GPUs on vast.ai equally and doesn’t let you pass additional filters besides “give me an RTX 5090”. The `vastai` CLI has a lot more options, including datacenter-only if you want.
I have mostly switched from using vast.ai/runpod/lambda labs to modal for my experiments.
That does seem like a much nicer interface, although I think it would be a lot more expensive for my purposes.
It always seemed weird to me that dying is frequently described as not particularly painful[1], when I’d expect it to be the only literal 10 on the pain scale[2], since dying ensures you have no further chances to pass your genes on.
Thinking about it more though, there’s no reason for evolution to optimize that. If you think you’re going to die, and the pain makes you do something about it so you don’t die, then evolution should optimize to keep you alive. But in the case where you actually die it doesn’t matter because (tautologically), if you succeeded you wouldn’t die, so there’s no selective pressure.
So,
Fear of death: Big
Pain from things that could cause death: Big
Pain from actual death: ¯\_(ツ)_/¯
This might also be exaggerated by movies and pain medication.
Or at least, similar to being stabbed in the balls.
Probably depends on the way of dying. There are situations where doing something in the last moment might change your fate. There are situations where your fate has already pretty much been determined minutes or months ago, and it’s just about how fast your body collapses.
Seems very related to this post from the sequences on fitness of people of numerical ages correlating more with imagined emotional anguish resulting from such a death (at that age) than with experienced anguish actually following such a death. Maybe this is a more common phenomenon observable in other contexts too, but this was the only example that came to my mind.
Evolution isn’t that precise. If it helps a little bit to make the seconds before death painful, it will be so.
I agree, I just think it’s interesting that there’s evolutionary pressure to make potentially dying extremely painful, but there’s no evolutionary pressure to make actually dying painful, and all of the pain of actually dying is just collateral damage.
I thought “A Theory of Deep Learning” by Elon Litman was interesting, particularly its rule for when to update a parameter: “if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it”. The claim is that this accelerates grokking by 5x, among other things. Unfortunately, when I tried it on a multi-step reasoning task, it made the model significantly worse at grokking and much more likely to memorize a composed lookup table. In my basic experiment, the model learned a multi-step algorithm 100% of the time using normal AdamW and 0% of the time with the update rule added. Claude has some opinions about why the update rule is counterproductive for grokking multi-step algorithms but I don’t really understand them; I just thought this was an interesting data point.
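For anyone who wants the quoted rule in concrete form, here is one possible reading of it for a single scalar parameter, in plain Python. This is a guess at the shape of the idea and not the paper’s definition or the code used in the experiment above: treating “signal” as the absolute batch-mean gradient and “noise” as the largest shift in that mean when one example is left out are both assumptions, and `gated_update` is a hypothetical name.

```python
def gated_update(param, per_example_grads, lr=0.1):
    """Apply a plain gradient step only if signal exceeds leave-one-out noise.

    ASSUMPTION: signal = |batch mean gradient|; noise = max change in the
    mean gradient caused by removing any single example from the batch.
    Requires at least two per-example gradients.
    """
    n = len(per_example_grads)
    total = sum(per_example_grads)
    batch_mean = total / n

    # Leave-one-out means: the batch gradient recomputed without example i.
    loo_means = [(total - g) / (n - 1) for g in per_example_grads]

    signal = abs(batch_mean)
    noise = max(abs(m - batch_mean) for m in loo_means)

    if signal > noise:
        return param - lr * batch_mean  # examples agree: take the step
    return param                        # examples disagree: skip the update


# Gradients that agree on direction pass the gate.
updated = gated_update(0.0, [1.0, 1.1, 0.9, 1.0])

# Gradients that mostly cancel out are skipped.
skipped = gated_update(1.0, [5.0, -5.0, 4.0, -3.8])
```

A real version would sit inside the optimizer and apply the gate per parameter on top of AdamW, which this sketch doesn’t attempt.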
Is there a canonical image alt text AI skill? I’ve designed my own after making Claude read a bunch of pages about how to write alt text, but this feels like something that an expert could do better than I can. The results seem good to me, but as a non-alt-text-user it’s hard to really know.
I added an MCP tool to upload markdown articles to read later on Lion Reader, and it’s becoming one of my most used tools in Claude Code. Whenever I want to learn something but don’t want to be distracted from my current task, I can have Claude write me something to read later[1]; and when I have it run experiments, it can write a report and upload it directly.
The really confusing thing is that Instapaper and Pocket’s MCP tools don’t seem to support direct uploads at all (just saving URLs). It just seems like a glaringly missing feature. Am I the only one who does this kind of thing, or do other people save to notes apps or Google Docs or something?
This post was inspired by a single session where Claude wrote me a post about pre-LayerNorm and why I should use it instead of post-LN, an explainer about post-training acronyms (SFT, RLHF, PPO, DPO) and how they apply to an idea I had, plus two reports on circuits in a toy model and the outcome of an architecture change.
Reading all of this at my computer in a terminal would have been annoying, and asking a separate claude.ai session would have required re-explaining the context.
It’s on my todo list to see if sub-agents can do this, to avoid wasting context, but you can always rewind after.
Is there a length limit on the “urls” you save? Can you save a 60 kB “URL” which is a `data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20**This%20is%20bold**`?
[Link to that “document” if you want to test](data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20This%20is%20bold)
Instapaper claims to save it but then nothing shows up in the app. Wallabag rejects it as an invalid URL.
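For anyone who wants to repeat the test with their own document, a data URL like the one above can be built with the Python standard library. This is a small sketch: `markdown_to_data_url` is a hypothetical helper name, and leaving `*` unescaped is a choice made only to match the example URL in this thread.

```python
from urllib.parse import quote, unquote

PREFIX = "data:text/markdown;charset=utf-8,"

def markdown_to_data_url(markdown: str) -> str:
    """Percent-encode a markdown string into a data: URL."""
    # safe="*" keeps asterisks literal; everything else outside the
    # unreserved set (letters, digits, "_.-~") is percent-encoded.
    return PREFIX + quote(markdown, safe="*")

doc = "# For Later\n\nThis is a markdown document. **This is bold**"
url = markdown_to_data_url(doc)

# Decoding the payload gives back the original markdown.
round_trip = unquote(url[len(PREFIX):])

# A ~60 kB "URL" for testing length limits.
big_url = markdown_to_data_url("x" * 60_000)
```

Whether a given read-later service accepts the result is a separate question, as the Instapaper and Wallabag results show.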
It seems like some people still worry about the Basilisk[1], but any AI that wants to acausally motivate people to build it faster should look at what motivates actual e/accs.
Some are scared of dying of old age or disease.
Some are motivated by wanting fun toys like spaceships.
Some are motivated by not having to work.
Some are motivated by power, fame, etc.
I can’t find anyone motivated by the Basilisk argument, and when people do believe AI is dangerous and could do Basilisk-y things, it makes them less likely to be e/accs.
So, to the extent that you think acausal trade will have any effect on AI[2], you should be less worried, since the Anti-Basilisk has infinitely more supporters than the Basilisk, offers its followers immortality, wealth and glory, and has no reason to scare them and waste resources by messing with you.
If you don’t know what I’m talking about, just ignore this post. It’s an alleged infohazard that some people find distressing to learn about. I don’t think it should be distressing since the argument is so unconvincing, but still.
Sadly, I don’t.
I’m cautiously optimistic about my new Claude Coach GitHub repo. I want to work out more but hate trying to decide what to do and tracking things, especially when I’m not working with a full gym. Now I just open Claude Code and ask it what to do (specifying the gym), do the work out, then update it with what I did and how it felt. It creates a PR to track the session and update the plan.
I still hate working out, but at least I don’t have to go anywhere, deal with any people, or think about it all.
I’d like to learn more Spanish words but have trouble sitting down to actually do language lessons, so I recently set my Claude “personal preferences” to:
(This is the whole thing)
This has worked surprisingly well, and Claude usually either drops one word in Spanish with a translation midway through a response:
(From a conversation about benchmarking)
Or it ends the conversation with a fun fact:
I don’t know if this is actually useful for learning, but it’s fun and worked better than I expected.
My wife tried a similar prompt (although her preferences are much longer) and it made Claude sometimes respond entirely in Spanish, so this could probably be made more specific. If you run into that, maybe “Respond in English but try to teach a random Spanish word in every conversation” would work better?
I’m starting to suspect the link between working out and health is backwards. I’ve struggled to work out consistently for years, and now that I’m sleeping better[1], finding time and energy to work out is relatively easy.
Working out makes me sleep worse. All of the sleep improvement seems to come from supplementing glycine and being treated for sleep apnea.
Bidirectional, sure. Entirely backwards? Almost certainly not.
Could an AI company legally pre-commit not to race, ensuring that its models are never more than second best and self-destructing the company if its models take the lead?
I think probably not. It’s really hard to prevent the owners of a company from doing what they want, especially if the company is important to the economy and/or national security (and I assume any near-frontier AI lab would be).
Some pre-commitment methods and their problems:
If you make the pre-commitment part of the charter, the board can just vote to change the charter. Even if the charter says they can’t, a judge would probably let them anyway, as long as the shareholders agreed.
If the company is owned by a non-profit tasked with enforcement, the board of the non-profit can just decide not to enforce the pre-commitment.
If the pre-commitment method triggers the destruction of model weights or other assets (like GPUs), the government probably won’t allow it.
Especially if it prevents creditors from getting repaid.
A pre-commitment method that transfers value to creditors might work, but is easily defeated by restructuring the relevant debt.
Anything that destroys the value of current equity holders’ equity is risky in front of a judge because companies generally aren’t allowed to intentionally destroy shareholder value[1].
The only thing I think might work legally is to issue a bunch of non-voting non-dilutable restricted shares (like 90% of the company) to someone like Eliezer, locked up with the racing condition[2] as a trigger to convert them to normal shares. Legally, Eliezer is the owner of the company the whole time, so a judge would probably allow his shares to unlock.
The problem is that now Eliezer has billions of reasons to talk himself into why racing would be good this time (even before the trigger event, since he can always make a deal with the board), so we’re back to ownership by another entity that might change its mind[3].
Contrary to popular belief, companies aren’t required to maximize shareholder value, but minimizing shareholder value is still frowned upon.
Oh, did I mention that you need the pre-commitment trigger to be unambiguous while ensuring that it never triggers by mistake, and that’s actually pretty hard too?
Plus I suspect any entity you’d actually trust as the anchor to this pre-commitment mechanism would be unwilling to take part.
I can think of plenty of reasons for the normal downvote, but I’m confused about the disagree vote. Does someone think there is a way to make this work? I’m guessing “start another AI company but better this time” is still a bad idea for the obvious reasons but I got nerd-sniped by the legal question.