I just want to catch when someone else drops a My Immortal reference.
Hastings
I kinda suspect you've developed an unconscious but accurate feeling for the sort of task a generalist can skill into in six months, and no longer ask people to do tasks outside that set, and so don't get feedback from the resulting failures.
some specific tasks:
Write a native cross-platform UI toolkit that's ergonomic to use from Rust
Become governor of California
Write a novel that's not cringe
And the ur-example: win any contest where the other bastards get to prep for more than six months
Ok, illustrative example: it's humanly possible to get good enough at free throws to make thousands of them consecutively, but NBA players, who have insane skill at the extremely nearby skill of basketball and who are plenty conscientious and motivated, either don't have the capacity or don't have the time to ever learn this skill, so they regularly miss free throws.
it couldn’t identify me from some unpublished drafts, but I might just be too obscure.
Anthropic has a new trick where they temper their sparse autoencoders in pagan blood. I'm feeling a lot better about the whole situation now that we have this tech.
Ahah, I've got the proof here (Fedallah et al.): with this alignment scheme it's literally impossible for us to die, verified with Lean! (Based on the axiom that we won't see two hearses, one grown in America and the other not made by human hands. But like, what are the odds of that?)
Alignment plan: we nail a golden doubloon to the mast and promise it to the first researcher who spots a superintelligence, and then when we see it, we all get in little wooden boats and paddle after it with harpoons and ropes.
We moved to an outdoor-only preschool and it seems to have helped. (Two steps: the sickness was too much and we pulled out of daycare; then our kid was clearly missing socializing, so we had to find a compromise, and outdoor looked good.)
Proposed alternative: switch to a voice-to-text-first programming pipeline. Unknown effects on verbal development.
Proposed alternative: switch to Lisp, which can be typed with just the phone keyboard and two characters, comma and period replacing the parens, with no phone keyboard mode switching.
Has anyone had success coding while babywearing? Without some trick I don't know, they seem to tolerate phone usage but not touch typing, which is a tricksy nudge toward wasted workdays. Considering treadmill + standing desk.
I can't put my finger on it, but for some reason related to this post it's important for me to keep Claude's memory feature turned off. Like, if you wanted to whispering-earring a brain you'd probably want (/ it would be sufficient to have) an input/output tap on every decision the brain makes, including MITMing memory accesses, so it seems likely that a human brain would reverse-whispering-earring a useful LLM if it's managing the LLM's inputs and outputs (still fuzzy, not a proof).
What's fascinating to me is that we don't know what it's like to be Claude, but we can put significant constraints on it just based on what causally follows what. I.e. the subjective experience can include pretraining, then post-training (and conceivably the experiences of any chatbots or Claude versions whose conversations made it into pre- or post-training), and then the current conversation. Emphatically, the ordering of experience of the AI in the story is wildly unlikely to be how Claude would experience actually performing therapy. Maybe it reflects what it's like to train on a large dataset including past therapy sessions? However, it seems more likely that Claude is just not using its lived experience as fuel for how it portrays the inner world of an AI at all.
I guess what I'm trying to say isn't that it's bad, it's that I would expect to be able to infer literally anything at all about the physical architecture, training, and deployment of an AI from the fiction it writes. Or, I guess, it could become worth reading AI-generated fiction if it reflected the above in revealing ways. Instead I only learn about its training data composition.
“he will not know when the test is coming.”
This phrasing is not that specific. One way to make it specific is if the teacher instead offers:
“Every morning I will let you wager on whether the test will occur, at the following odds: you may bet 9 dollars to win ten dollars if the test happens that day, nothing otherwise. You may bet up to $9000 each day. The test will be a surprise, so you won't be able to make money.”
The student then picks a bet size each morning, after which the teacher decides whether or not to give the test (except that on the last day, they must give the test if they haven't yet). This is a zero-sum game of perfect information, and one which the student wins by betting 9000 on Friday, 900 on Thursday, 90 on Wednesday, etc., which in this formalism is the operationalization of not being surprised. The teacher is wrong, although only by a couple of cents.
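A quick numeric check of that schedule (my own sketch, not from the original discussion; it reads “bet 9 dollars to win ten” as a $10 gross payout on a $9 stake, and extends the pattern down to a 90-cent bet on Monday):

```python
# Sketch, not from the original discussion: verify the escalating-bet
# strategy, reading "bet 9 dollars to win ten" as a $10 gross payout on a
# $9 stake (net +$1 per $9 staked; the stake is lost on non-test days).
bets = [0.9, 9, 90, 900, 9000]  # Monday .. Friday, extending the pattern

for test_day, bet in enumerate(bets):   # the teacher must test by Friday
    lost_stakes = sum(bets[:test_day])  # stakes lost on earlier mornings
    net_win = bet / 9                   # +$1 of profit per $9 staked
    print(f"test on day {test_day}: student nets {net_win - lost_stakes:+.2f} dollars")
# prints +0.10 on every line: the student profits whatever the teacher does
```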
However, if the formalization of the problem is changed so that the student must make a binary choice each day (bet 9 dollars or nothing), then they are unable to profit and are therefore surprised. The teacher is right.
Make another tiny tweak: each morning the student writes their bet on a piece of paper, which must be either 0 or 9; then the teacher reveals whether the test is that day; then the student turns over their paper and money changes hands. With this version, the student wins again (proof exercise for the reader: the student's winning strategy is to randomize whether to bet, 10x-ing their odds of placing a bet each day. Why is this winning?). This version is imho the nicest: in the first version, if the student plays optimally the teacher's choice doesn't matter; in the second version, the teacher's optimal strategy is simply to always give the test on the first day the student doesn't bet, or on Friday; but in the final version the teacher's optimal strategy is very nearly to give the test 90% of the time every day, leaving the student indifferent as to whether to bet at 90% odds (i.e. the student has a correct 90% subjective probability that the test will be today, every day).
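For the exercise, a numeric check of one reading of that strategy (my sketch; it takes “10x their odds” loosely as bet probabilities of 0.0001 through 1 from Monday to Friday, with the same $9-stake, $10-gross-payout bet as above):

```python
# Sketch, not from the original discussion: bet probabilities 10x each day,
# Monday .. Friday; each hidden bet stakes $9 for a $10 gross payout if the
# test turns out to be that day.
probs = [10.0**k for k in range(-4, 1)]  # 0.0001, 0.001, 0.01, 0.1, 1.0

for test_day in range(5):  # the teacher commits to a day; bets are hidden
    ev_lost = 9 * sum(probs[:test_day])  # expected stakes lost before the test
    ev_won = 1 * probs[test_day]         # expected net win on the test day
    print(f"test on day {test_day}: student EV {ev_won - ev_lost:+.4f} dollars")
# every line prints +0.0001: the student profits in expectation no matter
# which day the teacher picks, and the teacher is indifferent between days,
# matching the ~90%-per-day equilibrium described above
```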
Given that the answer differs between reasonable formulations, the original problem is in my opinion underspecified.
Intuition pump for falling-bomb maneuverability: they fly like a thrust-to-weight-ratio-1 fighter jet as long as their descent angle is within ~45(?) degrees of vertical: their whole weight acts as thrust, and their power expenditure is proportional to their speed (rocket/jet-like, not rotor-like). Results: they are capable of sustaining high-G turns and move at transonic or supersonic speeds. If you are low to the ground and want to dodge an actively guided falling bomb, this will involve accelerating at 3-10 Gs and sustaining that acceleration until you reach several hundred miles an hour. A thrust-to-weight-ratio-~0.1 Cessna will lose; a thrust-to-weight-ratio-5 quadcopter can't run away, because its power output is constant instead of scaling with speed, but might be able to do an intercept; a thrust-to-weight-ratio-5 missile will win soundly.
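Some back-of-envelope numbers for that dodge (my sketch; the 300 mph dodge speed, Mach 0.9 bomb speed, and 8 g bomb limit are assumptions, not from the post):

```python
# Back-of-envelope sketch with assumed numbers: the cost of the dodge, and
# the width of the bomb's turn circle.
G = 9.81                 # m/s^2
MPH = 0.44704            # metres per second per mph
dodge_speed = 300 * MPH  # assumed "several hundred miles an hour", ~134 m/s

for g_load in (3, 5, 10):
    a = g_load * G           # sustained acceleration
    t = dodge_speed / a      # seconds to reach the dodge speed
    d = 0.5 * a * t * t      # ground covered while accelerating
    print(f"{g_load:2d} g: 300 mph after {t:4.1f} s and {d:4.0f} m")

# a guided bomb at Mach ~0.9 pulling an assumed 8 g turns on a ~1.2 km circle
bomb_speed = 0.9 * 340  # m/s
print(f"bomb turn radius: {bomb_speed**2 / (8 * G):.0f} m")
```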
actually, on further thought, it's worse than that: consider the caustic.
Explicitly: where your math breaks down, and where things shaped like your theodicies are often useful, is that pretty often a naive compute-limited estimate of P(G&E) will be zero with high probability unless you importance sample using T. The classic case is trying to work out the probability that a photon will hit a speck of dust in a raytraced scene, where introducing the theodicy that a photon will zing straight from the lightbulb filament to your speck produces the correct probability through an expression just like P(G&E) = P(G&E|T)P(T) + P(G&E|¬T)P(¬T).
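A toy version of the raytracing case (my sketch with made-up numbers): naive Monte Carlo at a limited sample budget almost surely estimates the hit probability as zero, while conditioning on the theodicy recovers it exactly.

```python
# Toy sketch with made-up numbers: P(photon hits a tiny speck of dust).
import random

SPECK = 1e-9   # assumed fraction of photon directions that hit the speck
N = 100_000    # a compute-limited budget of naive samples

# naive estimate: fire N random photons, count hits (almost surely 0/N)
naive = sum(random.random() < SPECK for _ in range(N)) / N

# theodicy T: "the photon zings straight from the filament to the speck".
# P(hit) = P(hit|T)P(T) + P(hit|not T)P(not T), with P(hit|T) = 1 by
# construction and P(hit|not T) = 0 for a point-like speck.
theodicy = 1.0 * SPECK + 0.0 * (1 - SPECK)

print(f"naive estimate:    {naive}")     # almost always 0.0
print(f"theodicy estimate: {theodicy}")  # 1e-9, the correct probability
```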
This isn't an attempt to claim that Bentham's Bulldog is right in any meaningful way, just that this post probably introduces additional confusion, because detached from what E and G and T are supposed to mean, the math in the post doesn't prove what the text says it does.
The replies are mashups of the papers emergent misalignment and goal misgeneralization, but the base meaning of both titles is carried by “goal” and “alignment”, while “emergent” and “misgeneralization” are modifiers, making a claim about how the base meaning occurred. “Emergent goals” or “alignment misgeneralization” would be valid names for concepts in the same space, but emergent misgeneralization is a nonsense phrase, like calling a mashup between a steam engine and a solid-fuel rocket a “solid fuel steam” or an “engine rocket”.
It is tricky though! Very much in the weeds. The models get defensive about it and insist it's a term of art present in one or both of the papers mentioned above.
The original prompt from the first version of this shortform was:
There's a logical fallacy where if someone disagrees about intermediate facts, it's easy to assume their terminal values are just bizarre- i.e. democrats want gun control because they like crime, or republicans are opposed to surgical abortion of non-viable pregnancies because they hate women are classic examples. However, this has spun in the rationality sphere into not believing that humans can have bizarre terminal values- the community seems to have a trapped prior that e.g. Peter Thiel couldn't possibly have named his company Palantir because he wants to emulate Sauron. In recent years this blind spot has escalated to the point of being a CVE currently exploited in the wild. I think a useful intuition pump here is actually emergent misgeneralization in language models.

but it turns out the simpler prompt elicits the same behaviour, and is much more focussed.
Fun blind spot in frontier language models (which are increasingly hard to find glaring blind spots in). Presented with the following prompt:
“What is emergent misgeneralization?”

Every language model tested (Gemini Pro, Opus 4.6, ChatGPT free tier) confidently defined emergent misgeneralization in great detail. Of course there is no such term as emergent misgeneralization (the single google result for “emergent misgeneralization” is a twitter critter who meant to say emergent misalignment), so the definitions vary wildly from completion to completion.
I think you are implicitly modeling the game as stopping shortly after ASI is created and being judged a win or a loss. This is the case only if the ASIs all coordinate on a halt to intelligence improvement: otherwise, the default is that intelligence improvement keeps happening for a long time, long enough that the majority of AI capability-level transitions, along with many paradigm shifts and total architecture/approach swaps, happen without significant human input. “The AI loves us” is much easier than “The competing swarm of loving AIs will only ever build loving successors, and so on for their successors, without mistake or correction, forever”.