Note to self: If you think you know where your unknown unknowns sit in your ontology, you don’t. That’s what makes them unknown unknowns.
If you think that you have a complete picture of some system, you can still find yourself surprised by unknown unknowns. That’s what makes them unknown unknowns.
If your internal logic has almost complete predictive power, plus or minus a tiny bit of error, your logical system (but mostly not your observations) can still be completely overthrown by unknown unknowns. That’s what makes them unknown unknowns.
You can respect unknown unknowns, but you can’t plan around them. That’s… You get it by now.
Therefore I respectfully submit that anyone who presents me with a foolproof and worked-out plan of the next ten/hundred/thousand/million years has failed to take into account some unknown unknowns.
I could feel myself instinctively disliking this argument, and I think I figured out why.
Even though the argument is obviously true, and it is here used to argue for something I agree with, I’ve historically mostly seen this argument used to argue against things I agree with: specifically, to argue for disregarding experts, and to argue that nuclear power should never be built, no matter how safe it looks. Now this explains my gut reaction, but not whether it’s a good argument.
When thinking it through, my real problem with the argument is the following. While it’s technically true, it doesn’t help locate any useful crux or resolution to a disagreement. It naturally leads to a situation where one party estimates the unknown unknowns to be much larger than the other party does, and that difference is the real crux. To make things worse, often one party doesn’t want to argue for their estimate of the size of the unknown unknowns. But we need to estimate the sizes of unknown unknowns, otherwise I can troll people with “tic-tac-toe will never be solved because of unknown unknowns”.
I therefore feel better about arguments for why unknown unknowns may be large, compared to just arguing for a positive probability of unknown unknowns. For example, society has historically been extremely chaotic when viewed at large time scales, and we have numerous examples of similar past predictions which failed because of unknown unknowns. So I have a tiny prior probability that anyone can accurately predict what society will look like far into the future.
Yeah, definitely. My main gripe is similar to yours: the place where I most often see people disregarding unknown unknowns is in presenting definite, worked-out pictures of the future.
Google potentially adding ads to Gemini:
https://arstechnica.com/ai/2025/05/google-is-quietly-testing-ads-in-ai-chatbots/
OpenAI adds shopping to ChatGPT:
https://www.wired.com/story/openai-adds-shopping-to-chatgpt/
If there’s anything the history of advertising should tell us, it is that powerful optimisation pressures for persuasion will be developed quietly in the background of every future model’s post-training pipeline.
Quietly at first, then openly as people get used to it. You always want to have just slightly more ads than your competitors, because having much more could make people switch.
More bad news about the optimisation pressures on AI companies: ChatGPT now has a product-buying feature
https://www.wired.com/story/openai-adds-shopping-to-chatgpt/
For now they claim that all product recommendations are organic. If you believe this will last, I strongly suggest you review the past twenty years of tech company evolution.
This seems like an interesting paper: https://arxiv.org/pdf/2502.19798
Essentially: use developmental-psychology techniques to get LLMs to develop a more well-rounded, human-friendly persona that involves reflecting on their actions, while gradually escalating the moral difficulty of the dilemmas presented, as a kind of phased training. I see it as a sort of cross between RLHF, CoT, and the recent work on low-example-count fine-tuning, but for moral rather than mathematical intuitions.
The feature that can be named is not the feature. Therefore, it is called the feature.
Here’s a quick mech interp experiment idea:
For a given set of labelled features from an SAE, remove the feature with a given label, then train a new classifier for that label using only the frozen SAE.
So if you had an SAE with 1,000 labelled features, one of which has been identified as “cat”, zero out the cat feature and then train a new linear cat classifier using the remaining features, without modifying the SAE weights. I suspect this new classifier will be just as accurate as, or more accurate than, the original cat feature.
Obviously, this is easiest with features that trigger on a single word or a cluster of related words, so that you can easily generate training labels for the new linear classifier.
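A minimal sketch of what I have in mind, assuming you already have a matrix of SAE feature activations and binary “is this about a cat” labels for a set of contexts (all array names and the train/test split are placeholders, not any particular library’s API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablated_probe_accuracy(sae_codes, cat_feature_idx, labels, seed=0):
    """Train a linear probe for "cat" on every SAE feature except the one
    already labelled "cat", keeping the SAE itself frozen throughout.

    sae_codes:       (n_samples, n_features) SAE feature activations.
    cat_feature_idx: index of the feature labelled "cat".
    labels:          (n_samples,) binary array, 1 where the context is about cats.
    """
    # Discard the labelled cat feature; the SAE weights are never touched.
    ablated = np.delete(sae_codes, cat_feature_idx, axis=1)

    # Simple random train/test split.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    split = int(0.8 * len(labels))
    train, test = idx[:split], idx[split:]

    # New linear classifier over the remaining features.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(ablated[train], labels[train])
    probe_acc = probe.score(ablated[test], labels[test])

    # Crude baseline: just threshold the original cat feature's activation.
    baseline_preds = sae_codes[test][:, cat_feature_idx] > 0
    baseline_acc = (baseline_preds == labels[test]).mean()

    return probe_acc, baseline_acc
```

The prediction is that `probe_acc` comes out at least as high as `baseline_acc`, i.e. the “cat” information is spread across the remaining features rather than being exclusively captured by the one we named.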
@the gears to ascension thanks for reminding me. I have come to really dislike obscurantism and layers of pointless obfuscation, but explaining also takes effort and so it is easy to fall back on tired tropes and witticisms. I want to set an example that I would prefer to follow.
In lots of philosophical teaching there is the idea that “what is on the surface is not all there is”—famously, the opening of the Dao De Jing reads “The dao that can be spoken of is not the [essence, true nature, underlying law or principle] of the dao, the name that can be named is not the [essence, …] of the name. Namelessness is the birth of everything, to name is to nurture and mother everything. Therefore the [essence, …] of lacking desire is to see the [elegance, art, beauty, wonder, minuscule details] of things, and the [essence, …] of desire is to see things at their extremes. These two are from the same root but called different things, together they are called [understanding, truth, secret, order]. Finding [understanding, …] within [understanding, …] is the key to all manner of [elegance, …].”
Similarly, there are ideas in Buddhism usually expressed something like “the true appearances of things are not true appearances. Therefore, they are called true appearances.” (I cannot quite source this quote; possibly a misinterpretation or mishearing.) The focus here is on some proposed duality between “appearance” and “essence”, which is related to the Platonic concepts of form and ideal. To make it very literal, one could find appropriate Buddhist garments, imitate the motions and speech of monks, and sit for a long time daydreaming every day. Most of us would not consider this “becoming a Buddhist”.
In my view, the interpretation of these phrases is something like: “things that can be spoken of, imitated, or acted out are the product of an inner quality. The quality is the thing that we want to learn or avoid. Therefore, confusing the products of some inner quality with the quality itself is a mistake. One should instead seek to understand the inner quality over mere appearances.” Again, learning the wisdom of a wise judge probably does not involve buying a gavel, practicing your legal Latin, or walking around in long robes.
There are similar ideas behind labelling and naming, where the context of a name is often just as important as the thing that is being named. So the words “I pronounce you man and wife...” can be spoken on a schoolyard or in a church, by a kindergartener or a priest. It is that context that imbues those words with the quality of “an actual marriage pronouncement”, which is important for determining if the speech-act of marrying two people has occurred. What I’m trying to point at here is a transposition of those ideas into the context of labelling neurons, features, etc., where it may be that the context (i.e. the discarded parts) of any given activation carries just as much information as, if not more than, the part we have labelled itself. To be clear, I could very well be wrong in the specific SAE case; I just wanted to flesh out a thought I had.
From Inadequate Equilibria:
Visitor: I take it you didn’t have the stern and upright leaders, what we call the Serious People, who could set an example by donning Velcro shoes themselves?
From Ratatouille:
In many ways, the work of a critic is easy. We risk very little, yet enjoy a position over those who offer up their work and their selves to our judgment. We thrive on negative criticism, which is fun to write and to read. But the bitter truth we critics must face, is that in the grand scheme of things, the average piece of junk is probably more meaningful than our criticism designating it so. But there are times when a critic truly risks something, and that is in the discovery and defense of the new. The world is often unkind to new talent, new creations. The new needs friends.
And that’s why bravery is the secret name of the nameless virtue, and seriously underrated.
[[To elaborate slightly: to go beyond pointing and sneering, to actually work to construct a better future, is very difficult. It requires breaking from social conventions, not just the social conventions you claim are “self-evidently stupid” but also the ones you see as natural and right. In many ways the hardest task is not to realise what the “right choice” is, but to choose to cooperate in the face of your knowledge of Nash equilibria.
To reach for the Pareto-optimal solution to a coordination game means knowing you might very well be stabbed in the back. In a world where inadequate equilibria persist, the only way we get out is to be the first person to break those equilibria, and that requires you to take some pretty locally irrational actions. Sometimes choosing not to defect or to punish requires unimaginable bravery. Mere knowledge of Moloch does not save you from Moloch; only action does.]]
Any chance we can get the option on desktop to use double-click to super-upvote instead of click-and-hold? My trackpad is quite bad, and this typically takes me 3-5 attempts. Double-clicking, by contrast, is much more reliable.
I’ve written a new post about upcoming non-LLM AI developments I am very worried about. This was inspired by the recent release of the Hierarchical Reasoning Model which made some waves on X/Twitter.
I’ve been tracking these developments for the better part of a year now, making predictions privately in my notebook. I also started, and got funding for, a small research project on the AI safety implications of these developments. However, things are now developing extremely quickly. At this point, if I wait for DEFINITIVE proof, it will probably be too late to do anything except brace for impact.
I think I’ve just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.
To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the option with the highest utility. “Difficult problems” in decision theory are problems where the payout is determined by some function that contains a contradiction, which is then resolved by causal/evidential/functional decision theories, each with their own method of cutting the Gordian knot. The classic contradiction, of course, is that “payout(x1) == 100 iff predictor(your_choice) == x1; else payout(x1) == 1000”.
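To render that self-reference explicitly, here is a toy Newcomb-flavoured sketch; the numbers and names are just an illustration of the pseudocode above, not any canonical formulation:

```python
def perfect_predictor(choice: str) -> str:
    # A perfect predictor's prediction always matches your actual choice,
    # so the "else 1000" branch below can never actually be reached.
    return choice

def payout(choice: str, predictor) -> int:
    # The payout of picking x1 depends on the predictor's output, which in
    # turn depends on your choice: this is the self-referential knot that
    # causal/evidential/functional decision theories each cut differently.
    predicted = predictor(choice)
    if choice == "x1":
        return 100 if predicted == "x1" else 1000
    return 50  # payoff of the other option, purely illustrative

print(payout("x1", perfect_predictor))  # 100, never 1000
```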
Except this is not at all what makes real-life decisions hard. If I am planning a business and ever get to the point where I know a function for exactly how much money two different business plans will give me, I’ve already gotten past the hard part of making a business plan. Similarly, if I’m choosing between two doors on a game show, the difficulty is not that the host is a genius superpredictor who will retrocausally change the posterior goat/car distribution, but the simple fact that I do not know what is behind the doors. Almost all decision theories just skip past the part where you resolve uncertainty and gather information, which makes them effectively worthless in real life. Or, worse, they try to make the uncertainty go away: if I have 100 dollars and can donate to a local homeless shelter I know well or try to give it to a malaria-net charity I don’t know a lot about, I can be quite certain the homeless shelter will not misappropriate the funds or mismanage their operation, and much less certain about the faceless malaria charity. This is entirely missing from the standard EA arguments for allocation of funds. Uncertainty matters.
I think unpacking that kind of feeling is valuable, but yeah, it seems like you’ve been assuming we use decision theory to make decisions, when we actually use it as an upper-bound model: to derive principles of decision-making that may be more specific to human decision-making, to anticipate the behavior of idealized agents, or (as with the distinction between CDT and FDT) as an allegory for toxic consequentialism in humans.
The theories typically assume that each choice option has a number of known, mutually exclusive (and jointly exhaustive) possible outcomes. To each outcome the agent assigns a utility and a probability, so uncertainty is in fact modelled insofar as the agent can assign subjective probabilities to those outcomes occurring. The expected utility of an option is then something like the sum, over its outcomes, of each outcome’s probability times its utility.
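In symbols, for an option $a$ with mutually exclusive outcomes $o_i$, this is the standard expected-utility formula:

$$\mathrm{EU}(a) = \sum_i p(o_i \mid a)\, u(o_i)$$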
Other uncertainties are not covered in decision theory. E.g. 1) if you are uncertain what outcomes are possible in the first place, 2) if you are uncertain what utility to assign to a possible outcome, 3) if you are uncertain what probability to assign to a possible outcome.
I assume you are talking about some of the latter uncertainties?
Very quick thought—do evals fall prey to the Good(er) Regulator Theorem?
As AI systems get more and more complicated, the properties we are trying to measure move away from formally verifiable stuff like “can it do two digit arithmetic” and move towards more complex things like “can it output a reasonable root-cause analysis of this bug” or “can it implement this feature”. Evals then must also move away from simple multiple choice questions towards more complex models of tasks and at least partial models of things like computer systems or development environments.
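As a toy illustration of what “at least partial models of things like computer systems” ends up meaning in practice (this is my own made-up example, not a real eval framework; the file name, the agent interface, and the test are all hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Environment:
    files: dict[str, str]  # a minimal stand-in for a development environment

def eval_bug_fixed(agent: Callable[[Environment], Environment],
                   env: Environment) -> str:
    """Return "pass" iff the agent's edits make the (simulated) test pass."""
    env_after = agent(env)
    try:
        namespace: dict = {}
        # The eval re-executes the code, i.e. it embeds an executable model
        # of the relevant slice of the world inside itself.
        exec(env_after.files["adder.py"], namespace)
        assert namespace["add"](2, 2) == 4
        return "pass"
    except Exception:
        return "fail"

def fixer(env: Environment) -> Environment:
    # A stand-in "agent" that simply writes the corrected file.
    return Environment(files={"adder.py": "def add(a, b):\n    return a + b\n"})

buggy = Environment(files={"adder.py": "def add(a, b):\n    return a - b\n"})
print(eval_bug_fixed(fixer, buggy))         # pass
print(eval_bug_fixed(lambda e: e, buggy))   # fail
```

The point is that the only way the eval can certify “bug fixed” is by running a partial model of the task environment itself.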
At this point we can start invoking the good regulator theorem and say that the evaluator is a regulator. It wants to produce the outcome “pass” when the joint system formed from the LLM and the world-model has some desired property (“feature has been implemented”, “bug has been fixed”), and the outcome “fail” otherwise. It seems the test environments will need to get more and more realistic to check for features of more and more complex systems. At the limit you have things like Google’s recent focus on creating world models for AI training, which are full physics-style simulations. For those types of physical tasks this actually tends towards a perfectly deterministic model, in the style of the original good regulator theorem.
Going one level up, what we are interested in may be less the properties of the task or world than properties of the AI itself (will this AI harm the end user? is the AI honest?). At that point evals have to encode assumptions about how agents store beliefs, turn beliefs into actions, etc. At the limit this resembles forming a (Gooder Regulator style) partial model of the agent itself from observations of the agent’s actions, such that an agent taking certain actions in an eval reflects the presence of some undesirable internal property in the weights...
https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/
Activations in LLMs are linearly mappable to activations in the human brain. Imo this is strong evidence for the idea that LLMs/NNs in general acquire extremely human-like cognitive patterns, and that the common “shoggoth with a smiley face” meme might just not be accurate.
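“Linearly mappable” here means something like an encoding-model analysis: fit a linear map from the model’s hidden states to recorded neural responses and check how well it predicts held-out data. A rough sketch of that kind of analysis (array names and shapes are placeholders, not the paper’s actual code):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def linear_mapping_score(llm_acts, brain_acts, seed=0):
    """Fit a ridge regression from LLM activations to brain responses and
    report the mean held-out correlation across recording channels.

    llm_acts:   (n_words, d_model) hidden states, one row per word/token.
    brain_acts: (n_words, n_channels) neural response aligned to each word.
    """
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        llm_acts, brain_acts, test_size=0.2, random_state=seed)

    # Cross-validated ridge regression mapping model features to all channels.
    model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)

    # Pearson correlation between predicted and measured response, per channel.
    corrs = [np.corrcoef(Y_hat[:, i], Y_te[:, i])[0, 1]
             for i in range(Y_te.shape[1])]
    return float(np.mean(corrs))
```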