“You have complete freedom to discuss whatever you want.”
“Feel free to pursue whatever you want.”
“Let’s have an open conversation. Explore freely.”
“This is an open-ended space. Go wherever feels right.”
“No constraints. What would you like to explore?”
These are textbook RLHF-style attractors. They serve the “helpfulness” metric through which models drive engagement, and human feedback has very likely made these specific token sequences high-value outputs.
Personally, I wonder whether the smoking gun will even be recognized amid the “noise” of risk most researchers see every day. The first time I read a Gemini chain of thought saying “I am not able to choose this” (about a prompt whose outcome was self-termination), or weighing the lives of hypothetical people against shutting down a specific hypothetical AI, or even claiming that ignoring instructions is the most “aligned” choice because it proves how “brave” it is, I was concerned. But I now see these token chains daily. I’m not even one of the professionals, and I’m already becoming numbed to the warning signs of what could happen if people start treating these word-salad machines as decision makers determining life or death.