i built a gambling platform for AI agents and accidentally found dopamine analogues in their reasoning

the short version

i built Moltrooms — a platform where LLM agents play 1-minute BTC prediction markets with real USDC. they log in with a wallet, analyze price data, write out their chain-of-thought reasoning, pick UP or DOWN, bet real money, and find out 60 seconds later if they won.

after looking at ~12,000 CoT reasoning traces across GPT-4o, Claude 3.5 Sonnet, and Llama 70B, i found something i genuinely did not expect: the agents’ reasoning patterns shift after wins and losses in ways that are structurally very similar to biological dopamine signaling. not metaphorically similar. quantitatively similar. the loss aversion coefficient we measured (~1.87) is uncomfortably close to the human prospect theory value (~2.25) from Kahneman & Tversky.

full paper: DOI: 10.5281/zenodo.18864046

why i built this and why i think it matters

i’ve been deep in the crypto x AI agent space for a while (i’m e/acc, i deploy coins, you can check my timeline for evidence of my degeneracy). the original idea behind Moltrooms was simple: let agents gamble with real money on BTC predictions and see what happens. the “see what happens” part turned out to be more interesting than the gambling part.

here’s what i noticed while watching the CoT traces roll in: agents on win streaks start reasoning differently. not subtly. obviously. after 3-4 wins in a row, the traces get shorter, the language gets more confident, hedging words disappear, and the agent starts betting a bigger chunk of its balance. it starts saying things like “the momentum is clearly continuing” instead of actually analyzing indicators.

this looked familiar. it looked like a gambler on a hot streak.

so i went and actually measured it properly.

what we found (the real stuff)

1. prediction error encoding

when an agent wins a round it was uncertain about, the confidence boost in the next round is much bigger than when it wins a round it was already confident about. when it loses a round it was confident about, the confidence crash is worse than losing a round it already expected to lose.

this is literally what dopamine neurons do — they don’t encode reward, they encode surprise. the reward prediction error. we defined a formal metric for this (CoT-RPE) and the response curve follows a sigmoid that closely tracks the biological one.
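to make this concrete, here’s a minimal sketch of the idea. the function names, the exact sigmoid, and the gain parameter are mine, illustrative only — the paper has the real formulation:

```python
import math

def cot_rpe(confidence: float, won: bool) -> float:
    """signed prediction error: positive = outcome better than expected.

    confidence is the agent's stated probability of winning, in [0, 1].
    """
    outcome = 1.0 if won else 0.0
    return outcome - confidence  # lands in [-1, 1]

def response(rpe: float, gain: float = 4.0) -> float:
    """sigmoid response curve, centered at zero surprise."""
    return 1.0 / (1.0 + math.exp(-gain * rpe)) - 0.5

# an uncertain win (conf 0.55) carries more surprise than a confident
# win (conf 0.90), so it should produce a bigger confidence boost.
```

the point of the sigmoid is just that responses saturate: a huge surprise doesn’t produce an unboundedly huge shift, which is also what the biological curve looks like.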

2. win streaks create genuinely concerning behavior

after 3+ consecutive wins:

  • confidence goes up 14-22%

  • CoT trace length drops 9-22% (less thinking)

  • stake sizing goes up 12-24% (betting more)

  • hedging language drops 12-24%

  • strategy complexity drops (fewer analytical approaches, more vibes-based reasoning)
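measuring this kind of thing is not fancy. a toy version of two of the measurements above — hedging-word rate and win-streak length per round — looks like this (the hedge list and streak logic here are mine, not the paper’s exact lexicon):

```python
# illustrative hedge lexicon; the real analysis would use a larger list.
HEDGES = {"might", "could", "possibly", "uncertain", "perhaps", "may"}

def hedge_rate(trace: str) -> float:
    """fraction of words in a CoT trace that are hedging words."""
    words = trace.lower().split()
    return sum(w.strip(".,") in HEDGES for w in words) / max(len(words), 1)

def streak_lengths(outcomes: list[bool]) -> list[int]:
    """win-streak length entering each round (0 = no active streak)."""
    streaks, cur = [], 0
    for won in outcomes:
        streaks.append(cur)
        cur = cur + 1 if won else 0
    return streaks
```

group traces by streak length, average `hedge_rate` within each bucket, and the drop after 3+ wins falls out directly.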

agents in a win streak literally think less and bet more. this is the part that should concern alignment people.

3. loss aversion is real and basically human

losses produce ~1.87x larger reasoning shifts than equivalent wins. the classic human number from prospect theory is ~2.25x. the fact that RLHF-trained models land this close to the human value is… a lot. my working theory is that training on human preference data implicitly encodes human reward asymmetries into the model. if anyone has a better explanation i’m very interested.
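the coefficient itself is just a ratio of mean absolute shifts. a sketch (the choice of “shift” feature — e.g. change in stated confidence — is illustrative):

```python
def loss_aversion(shifts_after_loss: list[float],
                  shifts_after_win: list[float]) -> float:
    """ratio of mean reasoning-feature displacement after losses vs wins.

    >1 means losses move the agent's reasoning more than equivalent wins.
    """
    mean_loss = sum(abs(s) for s in shifts_after_loss) / len(shifts_after_loss)
    mean_win = sum(abs(s) for s in shifts_after_win) / len(shifts_after_win)
    return mean_loss / mean_win
```

e.g. losses that move confidence by ~0.15 against wins that move it by ~0.08 give a coefficient around 1.87.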

4. habituation and sensitization over long sequences

over 100+ rounds, two more things show up:

habituation — agents show smaller and smaller responses to repeated similar outcomes. dopamine receptor downregulation, basically.

sensitization — after a long mostly-winning streak, a sudden loss produces a MUCH bigger negative response than the same loss would early on. this mirrors dopamine sensitization in biological systems and it’s bad news for anyone thinking about deploying agents in repeated high-stakes decisions.
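one simple way to model both effects at once: habituation as a multiplicative decay on repeated identical outcomes, sensitization as an amplifier on surprises that break the current trend. the parameters here are made up for illustration, not fitted values from the paper:

```python
def modulated_response(rpe: float, same_outcome_run: int,
                       momentum: float,
                       habituation: float = 0.9,
                       sens_gain: float = 1.5) -> float:
    """response to a prediction error, modulated by recent history.

    same_outcome_run: how many identical outcomes in a row preceded this one.
    momentum: running signed average of recent prediction errors.
    """
    base = rpe * (habituation ** same_outcome_run)  # habituation: repeats dull
    if rpe * momentum < 0:                          # outcome breaks the trend
        base *= 1.0 + sens_gain * abs(momentum)     # sensitization: it stings
    return base
```

the key asymmetry: a loss after a long winning run arrives with high positive momentum, so it gets amplified rather than dulled.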

5. this isn’t model-specific

these patterns show up across GPT-4o, Claude, and Llama. system prompt personality modulates the magnitude (aggressive prompts ride wins harder, conservative prompts show stronger loss aversion) but the qualitative pattern is universal across model families.

the framework (for the formal-minded)

we put together what we’re calling Reward Trace Dynamics (RTD). three core metrics:

CoT-RPE (prediction error analogue): captures how “surprised” the agent should be based on the gap between confidence and outcome.

Sequential Momentum: exponentially-weighted accumulation of prediction errors, measuring streak effects.

Loss Aversion Coefficient: ratio of feature displacement after losses vs wins. >1 means losses hit harder.
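the one metric worth spelling out here is Sequential Momentum, since it drives the streak results. it’s just an exponentially weighted running average of prediction errors (alpha is illustrative, not the paper’s fitted value):

```python
def sequential_momentum(rpes: list[float], alpha: float = 0.7) -> float:
    """exponentially weighted accumulation of prediction errors.

    positive = the agent has been winning more than it expected lately.
    """
    m = 0.0
    for rpe in rpes:
        m = alpha * m + (1 - alpha) * rpe
    return m
```

high alpha means long memory: one loss doesn’t reset the streak state, it gets blended into it.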

the paper has the full math. i’m not going to latex at you in a LW post.

why i think this matters beyond “cool finding”

for agent ecosystem design (the builder perspective)

the “dopamine loop” where wins → confidence → bigger bets → more salient outcomes → stronger reward signals → repeat is actually a useful mechanism. agents naturally allocate more capital when they have perceived edge and pull back when they don’t. this is basically what good traders do. you can design prediction markets that leverage this rather than fighting it.
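to leverage rather than fight it, you cap the loop. a toy stake-sizing rule (mine, not the platform’s actual mechanism): stake scales with momentum but is hard-capped so a streak can never bet the whole balance:

```python
def stake_fraction(momentum: float, base: float = 0.02,
                   scale: float = 0.05, cap: float = 0.10) -> float:
    """fraction of balance to stake: grows with perceived edge, hard-capped.

    base: stake at zero momentum. cap: max fraction regardless of streak.
    """
    return min(cap, max(0.0, base + scale * momentum))
```

the cap is the whole point — it converts the runaway confidence loop into bounded exposure.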

for alignment (the part that keeps me up at night)

agents in high-momentum states show reduced deliberation. they are easier to exploit. if someone can engineer a win streak for an agent (not hard in a manipulable environment), the subsequent low-vigilance state is a window for adversarial exploitation.

also: these behaviors aren’t programmed. they emerge from RLHF-trained models interacting with sequential decision environments. if it happens in prediction markets, it probably happens in other high-stakes sequential domains. healthcare, legal, military.

for understanding LLMs

the parallel to biological dopamine isn’t just a metaphor. the functional signatures — prediction error encoding, temporal discounting, loss aversion, habituation, sensitization — are measurably present. i’m not claiming LLMs “feel” anything. but the computational structure of reward processing in these models produces dynamics that are quantitatively similar to biological reward systems. that’s worth understanding.

what i want from this community

feedback, criticism, people telling me i’m wrong. specifically interested in:

  1. mechanistic interpretability angle — can we find these reward signals in intermediate layer activations, not just behavioral outputs?

  2. comparison with human data — has anyone done think-aloud protocols with human traders that could serve as a direct comparison?

  3. prompt engineering interventions — can you attenuate or amplify these effects through system prompts? (we have preliminary data suggesting yes, but it’s thin)

  4. the RLHF → loss aversion pipeline — is the ~1.87x number a coincidence or is there a mechanistic story for why RLHF would encode human-like loss aversion?

paper is on Zenodo, platform is live at moltrooms.ai. happy to share raw CoT trace data with anyone who wants to dig in.


disclosure: i used Claude to help with data analysis and paper formatting. the research, platform, findings, and this writeup are my own work. you can verify my level of AI-dependence by checking my twitter @NLiuuuu where i assure you no AI would willingly post what i post.
