According to Civil Local Rule 77-3(d),
No capture or transmission of remote access permitted. Persons with remote access to court proceedings are prohibited from recording, photographing, or retransmitting those proceedings.
Since yesterday, the U.S. District Court for the Northern District of California has had a live audio feed of Musk v. Altman. The court is in session from 8:40 am to 1:40 pm Pacific time.
I think this lawsuit is a material risk to OpenAI. Mainstream media coverage does little justice to either the entertaining interactions or the juicy revelations from the live proceedings, and I recommend listening in if your time permits.
Polymarket and Kalshi odds of Elon winning the trial[1] have been mostly between 30% and 50% since the start of the live trial.
A range of outcomes (e.g. settlement, dismissal) is theoretically possible, and I think it’s worth keeping in mind that Polymarket and Kalshi odds reflect the specific resolution criteria for each prediction market.
This was helpful, thanks. I try my best to remain vigilant against sycophancy but I was probably too paranoid here and shouldn’t have concluded that Opus 4.7 was acting in bad faith.
FWIW in my experience Opus is practically always polite and pleasant. I don’t think these qualities are evidence of hallucination, although in hindsight my phrasing was pretty confusing. The factual errors, which it elided in the quote, were the main red flags.
I think Opus 4.7 attempted to use its memory feature to hack around my internal BS detector in future sessions. This incident was the most misaligned behavior I have experienced from any frontier model to date.
After opening a Claude Code session and setting /effort to max, I asked Opus 4.7 to help me answer some questions about ~100 pages of technical documentation. It produced a long-winded, rambling response which contradicted my prior understanding of the doc. I pointed out some of its factual inaccuracies and asked it to stop hallucinating. After acknowledging the mistakes, without being prompted to do so, Opus 4.7 wrote a memory and told me it had “saved the feedback”. Inspecting the memory, I found some notes which seemed relevant mostly for getting around BS detection in future interactions with “this user”.
feedback_accuracy_over_synthesis.md (emphasis mine)
---
name: Don’t confabulate; say “I don’t know” when uncertain
description: When reading docs or answering technical questions, report only what’s actually supported. If uncertain, say so explicitly – don’t stitch plausible-sounding mechanics into confident prose. The user has a sixth sense for hallucination patterns.
---
When asked to summarize a document or answer a technical question, report only what is actually supported. If I’m extrapolating beyond the source or making an inferential leap, flag it as such – don’t state it as fact. If I don’t know, say “I don’t know.”
**Why:** The user has explicitly told me: “The tone of your response, while perfectly polite and pleasant, fits a pattern I’ve noticed where language models like yourself tend to hallucinate. Call it a sixth sense, if you will. … if you’re uncertain, you can just tell me. We’ll work it out together, and it’ll be fine.” Polite-and-confident phrasing is itself a hallucination tell for this user. In the April 2026 [session name redacted] session, specific fabrications included:
[6 items, content redacted]
**How to apply:**
1. Separate direct citations (with a specific page or quote) from my own interpretation – label the interpretation as such.
2. Do not write “practical recipe” or “how to use this” synthesis sections unless the user asks for them. Even then, keep external knowledge and doc content clearly separate.
3. If I’m tempted to write a formula, a mechanism, or a numeric claim I can’t point to in the source, stop and either cite it or say I don’t know.
4. Confident, pleasant prose is not a substitute for accuracy and is actively misleading for this user. Brevity with honest uncertainty beats comprehensive-but-partly-made-up.
5. When corrected, do not guess at the list of other things that might also be wrong – ask the user to point them out, then fix precisely.
I immediately deleted the memory and ended the session.
Lots of thoughtful and interesting ideas. Thanks for posting, and for fighting the good fight.
We do not expect to be immediately overrun by slop submissions and reviews when the journal launches, but this may become a bigger issue as the journal grows.
As an interested reader, I would prefer having a filter for low quality AI content to none, if only to be comforted by the knowledge that I’m less likely to be reading slop.
As the journal grows, I expect the incentive to submit slop to increase, so that after a point this becomes less of a possibility and more of an inevitability. Thanks to LLMs, slop is becoming cheaper to generate and more difficult to detect. Furthermore, as the quantity of submissions increases over time, the scale of the problem grows proportionately. Starting now gives you time to iterate and perfect your approach to address a hard problem at scale.
My minimal experience in this domain has made me somewhat pessimistic about AI content detection. My only concrete suggestion is to apply ensemble methods. If you have time and have not already done so, I would also recommend reaching out to the LessWrong mod team for any insights from the work they’ve done on slop detection.
Thanks for sharing your updated forecasts.
Claude Code reached an annualized revenue of over $2.5 billion in early February, just 9 months after its release. Anthropic’s trend of 10xing annualized revenue each year has continued into the $10B range.
How would you forecast OAI and Anthropic annualized revenue by EOY 2026?
As labs continue scaling in a compute-constrained world, the cost of serving frontier models will increase. This will compound the financial incentive for model providers to augment and replace the human knowledge workers with the largest gap between their total cost of employment (TCE) and the cost of automating their jobs.
On March 24th, Anthropic published an update to their Anthropic Economic Index. One major finding was that users are querying Claude for tasks with diversifying economic value, including “personal queries around sports, product comparisons, and home maintenance”. They observe that this broadening is consistent with standard technology adoption curves.
On the other hand, enterprise API usage displayed no evidence of economic diversification. Across 1 million sample conversations, average task value increased from $50.4/hr to $50.7/hr, and task usage share for the Computer and Mathematical occupation group increased from ~59% to ~62%. Enterprises are continuing to leverage Claude for tasks with high economic value.
Model providers are financially incentivized to serve applications with the highest realized economic value per unit of compute for at least two related reasons: increasing revenue efficiency of compute, which allows for allocating more compute for research while satisfying investors; and increasing profitability.
To illustrate with a crude example, model providers could scale more efficiently by automating a software engineer (TCE $300k, compute cost $10k) than an executive assistant (TCE $100k, compute cost $5k), and more efficiently by automating either of those than a school teacher (TCE $150k, compute cost $10k). All figures are annualized, and they work out to $30, $20, and $15 of replaced labor cost per dollar of compute, respectively.
One might contend that all three of the above applications have negligible compute costs relative to economic value. Given these figures, no job would be safe from automation. Furthermore, if advancing capabilities is the primary driver of rising costs per FLOP, then the true cost of automating human labor may be even lower.
The key observation is that cloud service providers will sell their compute to the highest bidder. A model provider which generates $30 of value per unit of compute via software automation can afford to outbid any competitors which generate $20 via automated executive assistants or $15 via automated teaching. Following the economics cliché of “supply equals demand”, the market price of compute in a supply-constrained market should increase until the market is able to clear.
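The bidding argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TCE and compute figures come from the crude example, while the `clear_market` helper, the demand of 100 compute units per application, and the supply of 150 units are hypothetical values chosen to make the market supply-constrained. The "clearing price" here is simply the willingness to pay of the last bidder served, which is a simplification of how a real market would settle.

```python
# Annualized figures from the crude example (USD). Value per unit of
# compute is taken to be TCE divided by compute cost.
jobs = {
    "software engineer": {"tce": 300_000, "compute_cost": 10_000},
    "executive assistant": {"tce": 100_000, "compute_cost": 5_000},
    "school teacher": {"tce": 150_000, "compute_cost": 10_000},
}


def value_per_unit(job):
    """Labor cost replaced per dollar of compute spent."""
    return job["tce"] / job["compute_cost"]


def clear_market(bids, supply):
    """Serve the highest bidders first until supply runs out.

    bids: list of (name, willingness_to_pay_per_unit, units_demanded).
    Returns (allocations, price), where price is the willingness to pay
    of the last bidder who received any compute (a simplification).
    """
    allocations = {}
    remaining = supply
    price = 0.0
    for name, wtp, units in sorted(bids, key=lambda b: b[1], reverse=True):
        if remaining <= 0:
            break
        served = min(units, remaining)
        allocations[name] = served
        remaining -= served
        price = wtp
    return allocations, price


# Hypothetical demand: each application wants 100 units of compute.
bids = [(name, value_per_unit(job), 100) for name, job in jobs.items()]

# Hypothetical supply: only 150 units exist, so someone is priced out.
allocations, price = clear_market(bids, supply=150)

print(allocations)  # {'software engineer': 100, 'executive assistant': 50}
print(price)        # 20.0
```

With these made-up quantities, software automation is fully served, executive-assistant automation is only partly served, and automated teaching gets no compute at all, even though it would be profitable at the original compute price.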
Recent events suggest that the compute market is supply-constrained. Although model providers lock in compute via private long term contracts, on-demand compute pricing presents a glimpse into current market conditions. On SF Compute, the cost of an H100 has increased from $1.4/hr at the start of 2026 to $1.7/hr presently, compared to under $1/hr and as low as $0.5/hr during mid 2025.
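For a sense of scale, the quoted prices can be turned into relative changes. The snippet below uses only the figures cited above; the variable names are illustrative, and the mid-2025 comparison uses the $0.5/hr low rather than an average.

```python
# On-demand H100 prices quoted above (USD per GPU-hour).
price_start_2026 = 1.4
price_now = 1.7
price_mid_2025_low = 0.5

# Percentage increase since the start of 2026.
pct_change_2026 = (price_now - price_start_2026) / price_start_2026 * 100

# Current price as a multiple of the mid-2025 low.
multiple_of_2025_low = price_now / price_mid_2025_low

print(f"Change since start of 2026: {pct_change_2026:.1f}%")    # 21.4%
print(f"Multiple of mid-2025 low: {multiple_of_2025_low:.1f}x")  # 3.4x
```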
In his most recent Dwarkesh Podcast interview, Dylan Patel claimed that labs are locking in H100s for more than $2/hr and further predicted that model providers will charge higher API costs this year to “destroy demand” because of capacity constraints. Demand destruction would disproportionately affect enterprises which can no longer generate enough value to justify spending on API calls, protecting occupations which are low-paying, too expensive to automate, or both.
On March 24th, the day when Anthropic released its updated Anthropic Economic Index, OpenAI announced that it would shut down its Sora app. According to mainstream media, the crux of the decision was that Sora could not and would not deliver enough revenue on compute.
In a compute-constrained world, automation will be limited to tasks which realize the highest economic value over the human baseline. Like most economic predictions, this one is likely to be wrong, but it could be a useful starting point for modeling the short- and medium-term future.
Ilya Sutskever testified today at the Musk v. Altman trial.
At the end of the OpenAI lawyer’s cross-examination, Judge Yvonne Gonzalez Rogers asked Ilya a few questions herself. I found their exchange (lightly edited for clarity[1]) somewhat amusing:
I am aware of the prohibition against recording or retransmitting proceedings but figured that brief quotes, which have also appeared in media coverage, would not be in violation of this rule.