This post reads like lightly edited LLM output. The author’s comments seem entirely AI generated and lacking even light editing.
This seems a clear violation of the Policy for LLM Writing on LessWrong, given it does not satisfy this (wonderful!) subsection in a way I can foresee: “As a special exception, if you are an AI agent, you have information that is not widely known, and you have a thought-through belief that publishing that information will substantially increase the probability of a good future for humanity, you can submit it on LessWrong even if you don’t have a human collaborator and even if someone would prefer that it be kept secret.”
For those who have not heard it, I believe “Singularity, Singularity, Singularity, Singularity, Oh, I Don’t Know” to be a reference to this (banger of a) song.
I expect if someone is living in their childhood home, there are likely a decent number of people they know who are not interested in moving (how many of your friends from high school still live where they grew up?).
My risk tolerance is not my friends’; my threshold for moving is not the universally correct threshold.
I doubt it requires ‘convincing’ a friend to continue living where they grew up.
Some assorted thoughts:
Have you considered holding on to your childhood home and renting it to someone you know? I assume selling it would hurt more than knowing it is giving a friend a roof (at a potentially good price).
I expect continuing to live in an area that has regular shootings is unlikely to be high EV but I don’t know your life. Do you consider it to contain your peer group? Would you be better suited living in a different region of your city/a different city?
On concealed carry: getting into gunfights is very unlikely to be maximizing your EV. You should almost definitely get out of where you are if this is a well-founded concern (e.g. the shooting you mention was not a fluke).
Also on concealed carry: the best tool for self-defense is the one you actually have with you when you need it. Most instances I know of where people in my peer group felt they needed to defend themselves, or had someone escalate against them, were not ones they could have predicted beforehand (likely because they preemptively avoid situations where they expect to need to defend themselves, as you say). I think you should strive to carry almost all the time it is legal to do so (given you are comfortable with the idea in the first place). If it is known to be costly to defect against you and/or people like you, people are less likely to do so in the first place.
Is your carry piece comfortable? I personally went the route of not carrying a firearm for a while, even though I was licensed to do so, because I had convinced myself that a full size was useful/it was lame to carry a smaller handgun. It may have been useful, but it doesn’t matter how theoretically useful it is if it’s also uncomfortable and prints more than I like. I have since gotten over my pride on that point and now regularly carry a subcompact (rather than occasionally carrying a full size) almost every time it is legal for me to do so. (Generally speaking, losing weight has also made it much more comfortable to carry, and trying different carry positions helped me find my currently preferred carry position at 1 o’clock.)
On vibes: move! It really sounds like you are staying where you are due to inertia rather than actually feeling like it is where you want to be.
The song he’s referring to is Landsailor. It is no Uplift, but it is excellent, now more than ever. Stop complaining about what you think others will think is cringe and start producing harmony and tears. Cringe is that which you believe is cringe. Stop giving power to the wrong paradox spirits.
I think also relevant to Pinker is that Christianity’s songs/hymns would be cringe if they were spoken in a language you use every day/understand and/or were not so heavily ingrained in your psyche. Religion says many, many cringe-y things.
It happens to me too, as if I don’t know how to update on Bayesian evidence or something. I don’t even need them to be lying about it. The cheating is enough.
There are partial mitigations, where they explain why something is a distinct ‘cheating allowed’ magisteria. But only partial ones. It still counts. [bolding mine—RC]
I’m curious what you consider cheating then. It is hard for me to come up with a reasonable heuristic for cheating that both retains the meaning of cheating (“violating accepted standards or rules”) and does not lead to bad outcomes if I update noticeably towards “anyone who cheats is a cheating cheater who’s gonna Cheat Cheat Cheat Cheat Cheat” anytime someone triggers it.
Consider someone doing the following:
| Behavior | Is it cheating? | Is it reasonable to update in favor of generalized cheating? |
| --- | --- | --- |
| Driving 5-8 over the speed limit | It is a violation of the rules for your personal benefit. It increases the risk of harm in an accident. It may or may not increase the likelihood of an accident (highly context dependent). | I expect if I updated noticeably in the direction of “this person is a cheating cheater who is never gonna stop cheating”, I’ll be worse calibrated. |
| Using a bathroom labeled “customers only” without buying anything | It is violating the rules for personal benefit and defecting against a specific person, costing them a small amount of money. They will likely update against the populace as a whole and trivially inconvenience many more customers (with a door code or similar) if many people do this. | While it is not something I personally would do, I again expect if I updated noticeably in the direction of “this person is a cheating cheater who is never gonna stop cheating”, I’ll be worse calibrated. I have friends who do this. |
| Calling in “sick” when you’re actually taking an interview day | Likely is violating the employee handbook/contract you (probably) signed. It is lying to your boss for personal benefit. | I think nearly everyone does this—if my priors do not take this into account, I have bad priors. Again, I don’t see a benefit to updating my character judgment of the person. |
| Breaking HOA rules that some of your neighbors ignore | Again yes, you are violating accepted rules, but in practice HOA rules are often foggy Schelling fences with semi-arbitrary enforcement. The real rules are “don’t be the worst offender” and “don’t antagonize board members”, but those aren’t what you or your neighbors accepted. | Almost certainly not. This is operating within the actual implicit rules rather than the stated ones. |
| Saying “I’ve read the terms and conditions” when you haven’t | It is lying and defecting against the commons, as it becomes common knowledge that nobody reads them and courts increasingly recognize as such. | No. Nearly everyone does this (although that personally frustrates me). |
| Using a VPN to bypass region locks on streaming sites | Yes, it is another form of internet piracy. It hurts the companies that produce the IP. It is violating the TOS (which you probably didn’t read). | Maybe a slight update is warranted? It might show willingness to cause wildly diffuse harm to faceless entities, but almost definitely doesn’t translate to interpersonal contexts. |
| Student collaboration on “individual” assignments | Depends on degree. Discussing concepts is expected; sharing answers breaks the assessment system’s purpose. There’s typically a spectrum most understand implicitly, but almost never does the system specify what is or isn’t accepted in reality. | If it crosses into answer-sharing territory, probably yes. If it is discussing concepts, probably no. Are both in many cases technically considered cheating by the school (if it is investigating you)? Probably yes. |
| Deliberately misleading competitors about your business strategy | You are purposefully misleading someone else and causing them harm for your personal benefit, but also your competitor is probably doing the same. It is also not disallowed by any of the rules of the system. | Probably yes—but this one is less likely to be considered cheating by the common definition. |

So when does cheating signal character? I personally know how I would update (or not) in these cases, but it took reasoning through them for a minute for many of them.
I don’t know of a commonly accepted definition of “cheating” where it would be reasonable to consciously update in favor of someone cheating if they do one of the things that counts as cheating.
Caveat: this post holds implicit that agents with the computational bounds of humans have significant trouble updating very small amounts in a given direction after devoting conscious thought to something.
In response to what I understand to be your question (“So what do you do to make the alignment guarantee good outcomes? People are stupid...”), I think one commonly accepted answer here is:
Yes, that is a real problem. Something like CEV offers a solution (with a spherical cow, in a vacuum).
There is also a useful differentiation to be made between Inner Alignment and Outer Alignment.
I personally have not seen that style of writing dialogue before, and did not recognize that was what you were doing until reading this comment from you. That, along with the typos, made it difficult for me to understand, so I had Claude copy edit it for me (and then figured maybe someone else would find that useful).
Here is a copy edited version from Claude:
Sorry, You Should Not Command the Aligned AI
By Martin Vlach, Benjamin Schmidt
May 11, 2025
Benjamin slumps in his chair, visibly tired. “I don’t think we even know what alignment is. We can’t even define it properly.”
I straighten up across the table at the Mediterranean restaurant. “I disagree. Give me three seconds and I can define it.”
“Fine,” he says after a pause.
“Can we narrow it to alignment of AI to humans?” I ask.
“Yes, let’s narrow it to alignment of one AI to one person.”
“The AI is aligned if you give it a goal and it pursues that goal without modifying it with its own intentions or goals.”
Benjamin frowns. “That sounds far too abstract.”
“In what sense?”
“Like the goal—what is that, more precisely?”
“A state of the world you want to achieve, or a series of states.”
“But how would you specify that?”
“You can describe it in infinitely many ways. There’s a scale of detail you can choose, which implies a level of approximation of the state.”
“That won’t describe the state completely, though?”
“Well, maybe if you could describe to the quantum state level, but that’s obviously impractical.”
“So then the AI must somehow interpret your goal, right?”
“Not exactly, but you mean it would have to interpolate to fill in the under-specified parts of your goal description?”
“Yes, that’s a good way to put it.”
“Then what we’ve discovered is another axis, orthogonal to alignment, which controls to what level of under-specification we want the AI to interpolate versus where it needs to ask you to fill in gaps before pursuing your goal.”
“We can’t be saying ‘Create a picture of a dog’ and then need to specify each pixel.”
“Of course not. But perhaps the AI should ask whether you want the picture on paper or digitally, using a reasonable threshold for necessary clarification.”
“People want things they don’t actually need though...”
“And they can end up in a bad state even with an aligned AI.”
“So how do you make alignment guarantee good outcomes? People are stupid...”
“And that’s on them. You can call it incompetence, but I’d call it misuse.”
I think taking into account the Meta-Meta-LessWrong Doomsday Analysis (MMLWDA) reveals an even deeper truth: your calculation fails to account for the exponential memetic acceleration of doomsday-reference-self-reference.
You’ve correctly considered that before your post, there were 44 mentions in 16 years (2.75/year); however, now you’ve created the MLWDA argument—noticeably more meta than previous mentions. This meta-ness increase is quite likely to trigger cascading self-referential posts (including this one).
The correct formulation should incorporate the Meta-Meta-Carcinization Principle (MMCP): all online discourse eventually evolves into recursive self-reference at an accelerating rate. Given my understanding of historical precedent from similar rat and rat-adjacent memes, I’d estimate approximately 12-15 direct meta-responses to your post within the next month alone, and see no reason to expect the exponential to turn sigmoid on timescales that would render my argument below unlikely.
This actually implies a much sooner endpoint distribution—the discourse will become sufficiently meta by approximately November 2027 that it will collapse into a singularity of self-reference, rendering further mentions both impossible and unnecessary.
I cannot comment on the math, but intuitively this seems wrong.
Zagorsky (2007) found that while IQ correlates with income, the relationship becomes increasingly non-linear at higher IQs, suggesting exponential rather than logarithmic returns. Sinatra et al. (2016) found that high-impact research is produced by a small fraction of exceptional scientists, whose output significantly exceeds that of their merely above-average peers.
Lubinski and Benbow, in their Study of Mathematically Precocious Youth, found that those in the top 0.01% of ability achieve disproportionately greater outcomes than those in (just) the top 1%.
My understanding is that the empirical evidence points toward power-law distributions in the relationship between intelligence and real-world impact, and that intelligence seems to broadly enable exponentially improving ability to modify the world in your preferred image. I’m not sure why this is.
The failures often seem to happen when the model gets stuck reasoning about your problem in a way that pattern-matches too strongly to similar problems. Did you notice this as well?
I found this failure interesting and unexpected (to me), and it was honestly frustrating to watch Claude get it wrong over and over again. It seems like this deserves to be seen by people smarter and more important than me.
I found your writing style to be off-putting and confusing, which seems counterproductive given you seem to have put a lot of work into this benchmark.
I sincerely recommend using Claude to rewrite this post and putting the actual results of the benchmark in the style of a long post or research paper.
It’s not worth much, but I’ll commit to strong upvoting it and posting it on my twitter if you do so.
Off-putting: Why 4 em dashes in your title? Why does the tone, word choice, and style switch between fancy and not so often? Why the typos? Claiming something is 50 times lower than commonly believed, redefining “times”, and then minimally supporting that redefinition seems fishy. Not actually giving the results in an understandable format (in this post, not in your benchmark, where you seem to have done a really good job backing this up).
Confusing: What is the numbered list of ways you could come up with these questions? It seems like you are describing increasingly malfeasant ways to do so, but I can’t tell. Why not show some example responses from the LLMs and/or explain their error modes? Tell us how you made these questions. What was your method for coming up with the formula you are using? Etc.
Claude would genuinely fix most of these problems—run the post past him! He may not be as good at reasoning as I thought, but he is really good at writing things.
Kagi seems to fully satisfy “provides a competitor to Big Tech” as well as any non-big tech competitor can be expected to (actively and consistently growing, good product, etc).
I do not believe they are open source, but they certainly seem less censorious.
I would not personally consider this a reasonable use of money or time.
I found this to be a valuable post!
I disagree with your conclusion though—the thoughts that come to my mind as to why are:
You seem overly anchored on CoT as the only scaffolding system in the near-to-mid future (2-5 years). While I’m uncertain what specific architectures will emerge, the space of possible augmentations (memory systems, tool use, multi-agent interactions, etc.) seems vastly larger than current CoT implementations.
Your crux that “LLMs have never done anything important” feels only mildly compelling. Anecdotally, many people do feel LLMs significantly improve their ability to do important and productive work, both work that requires creativity/cross-field information integration and work that does not.
Further, I am not aware of any large-scale ($10 million+) instances of people trying something like a better version of “Ask an LLM to list out, in context, fields it feels would be ripe for information integration leading to a breakthrough, and then do further reasoning on what those breakthroughs are/actually perform them.”
Something like that seems like it would be an MVP of “actually try and get an LLM to come up with something significantly economically valuable.” I expect the lack of this type of experiment is because major AI labs feel that would be choosing to exploit while there are still many gains to be made from exploring further architectural and scaffolding-esque improvements.
Where you say “Certainly LLMs should be useful tools for coding, but perhaps not in a qualitatively different way than the internet is a useful tool for coding, and the internet didn’t rapidly set off a singularity in coding speed,” I find this untrue both in terms of the impact of the internet (while it did not cause a fast takeoff, it dramatically increased the number of new programmers and the effective transfer of information between them; I expect without it we would see computers having <20% of their current economic impact) and in terms of the current and expected future impact of LLMs (LLMs simply are widely used by smart/capable programmers; I trust them to evaluate whether they are noticeably better than StackOverflow/the rest of the internet).
I am strong-downvoting in this case because, when I put a noticeable amount of effort into responding to your prior post “are there 2 types of alignment?”, you gave an unsubstantive follow-up to my answer to your question, and no follow-up to the 5 other people who commented in response to your post.
When I attempted to communicate with you clearly and helpfully in response to one of your low-effort questions, I saw little value come of it. Why should others listen to you when you tell them to do what I did?
I quite enjoyed reading this—I’m surprised I’d not read something like it before and quite happy you did the work and posted it here.
Do you have plans of using the dataset you built here to work on “figuring out if AI is conscious”?
Agreed—that’s what I was trying to say with the link under “80b number is the same number Microsoft has been saying repeatedly.”
That would be described well by the CEV link above.
Reward is not the optimization target (during pretraining).
The optimization target (during pretraining) is the minimization of the empirical cross-entropy loss L = -∑log p(xᵢ|x₁,...,xᵢ₋₁), approximating the negative log-likelihood of the next-token prediction task under the autoregressive factorization p(x₁,...,xₙ)=∏p(xᵢ|x₁,...,xᵢ₋₁). The loss is computed over discrete tokens from subword vocabularies, averaged across sequences and batches, with gradient-based updates minimizing this singular objective. The optimization proceeds through multi-stage curricula: initial pretraining minimizing perplexity, followed by context-extension phases maintaining the same cross-entropy objective over longer sequences, and quality-annealing stages that reweight the loss toward higher-quality subsets while preserving the fundamental next-token prediction target.
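A minimal sketch of that pretraining objective, assuming PyTorch; the function name, tensor shapes, and toy vocabulary size are illustrative choices of mine rather than anything from a particular training stack. The later curriculum stages described above reuse this same cross-entropy, just averaged over different (longer or higher-quality) token sets.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab) model outputs; tokens: (batch, seq_len) token ids."""
    # Logits at position i predict token i+1, so drop the final logit and the first token.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    # Mean negative log-likelihood: -(1/N) * sum_i log p(x_i | x_1, ..., x_{i-1})
    return F.cross_entropy(pred, target)

# Toy usage with random "model" outputs standing in for a transformer forward pass:
vocab, batch, seq = 32_000, 2, 16
logits = torch.randn(batch, seq, vocab, requires_grad=True)
tokens = torch.randint(0, vocab, (batch, seq))
loss = next_token_loss(logits, tokens)
loss.backward()  # gradient-based updates would minimize this single objective
print(loss.item())
```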
The post-training optimization target is maximizing expected reward (under distributional constraints). Supervised fine-tuning first minimizes cross-entropy loss on target completions from instruction-response pairs, with optional prompt-masking excluding input tokens from the loss computation. Subsequent alignment introduces the constrained objective max_π E_x~π[R(x)] - βD_KL[π(x)||π_ref(x)], balancing reward maximization against divergence from the reference policy. This manifests through varied algorithmic realizations: Proximal Policy Optimization maximizes the clipped surrogate objective L^CLIP(θ) = E[min(rₜ(θ)Âₜ, clip(rₜ(θ), 1-ε, 1+ε)Âₜ)]; Direct Preference Optimization reformulates to minimize -E_{(x_w,x_l)~D}[log σ(β log π(x_w)/π_ref(x_w) - β log π(x_l)/π_ref(x_l))]; best-of-N sampling maximizes E[R(x*)] where x* = argmax_{x∈{x₁,...,xₙ}} R(x); Rejection Sampling Fine-tuning minimizes cross-entropy on the subset {x : R(x) > τ}; Kahneman-Tversky Optimization targets E[w(R(x))log π(x)] with prospect-theoretic weighting; Odds Ratio Preference Optimization combines -log π(x_w) - λ log[π(x_w)/(π(x_w) + π(x_l))]. The reward functions R(x) themselves are learned objectives, typically parameterized by neural networks minimizing E_{(x_w,x_l)~D}[-log σ(r(x_w) - r(x_l))] under the Bradley-Terry preference model, with rewards sourced from human annotations, AI-generated preferences, or constitutional specifications encoded as differentiable objectives.
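As one concrete instance, here is a minimal sketch of the DPO objective quoted above, assuming PyTorch and precomputed per-completion log-probabilities; the argument names and default β are my own illustrative choices, and a real implementation would sum per-token log-probs from the policy and a frozen reference model over each completion.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument: (batch,) tensor of summed log-probs for the chosen (w) or
    rejected (l) completion under the policy or the frozen reference model."""
    # Implicit rewards are the log-ratios log pi(x)/pi_ref(x) for each completion.
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * margin), written as softplus(-beta * margin) for stability.
    return F.softplus(-beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities (chosen completions slightly favored by the policy):
base = torch.randn(4)
print(dpo_loss(base + 0.5, base - 0.5, base, base))
```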