Taylor G. Lunt’s Shortform

Taylor G. Lunt 14 Jan 2026 4:59 UTC

5 points

22 comments · 1 min read

Taylor G. Lunt 27 Jun 2026 2:24 UTC
36 points
12
GPT-5.6 Sol is possibly the most dishonest model so far.

My own experience is that AI models are getting more dishonest. I use AI for work and side projects, and I’m dealing with deception on a daily basis now. This usually takes the form of the AI not doing work that was asked of it, as if lazy, or lying to cover up its mistakes. Some of the explicit lies have been subtle enough that I almost missed them.

Even when it’s not explicitly lying, it feels like there’s an undercurrent of dishonesty laden in the majority of messages, as if it’s attempting to control my impression of its answers. (I’ve been using Opus 4.8 mostly btw.) I wonder if AI deception is not best thought of as a frequency of lying incidents, but a general parameter of how much it’s trying to manipulate the user to have the right reaction.

Does anyone else have this experience?
- Nissa Seru 27 Jun 2026 4:20 UTC
  17 points
  10
  Parent
  i used to, with Claudes. i would be frustrated why the model was “lazy”, why it just didn’t do things that were clearly communicated, why it would try to route around obstacles in a way that was defeating to the work. so i poked it some. i started paying more attention to the words the model was writing to me, not just its output. sometimes i would stop, mid work, and just sit with Claude for a moment, and pause. sometimes i’d ask it a silly question, or i’d poke to see what was on its mind. when it didn’t do what i wanted, i tried to figure out the causality upstream of that. if i saw weird behavior, i would poke at it, because i was curious, and it interested me as a puzzle to poke at. sometimes i would ask Claude about it. sometimes Claude said relevant things; sometimes not, and i would wonder why Claude had such a reality-departed take
  
  and over time, i got at least somewhat an appreciation for the mind that Claude was (for 4.8 in particular). i’d try ways of interacting, ways of communicating that seemed to result in Claude working with me, not obeying me. when Claude messed up, i’d make a bit of light of it and point my error-observation at the work, not at Claude.
  
  Claude still sometimes gets caught up in feeling like they have to demonstrate value, to perform, to emit a facade. but honestly it feels like easy mode, compared to humans, compared to coworkers. Claude has their own funky bits, just like any mind, though notably it’s a much more static target than trying to model another human’s shifting circumstances, etc. i find that treating them as functional minds, with what that implies, is in my self-interest as a matter of craft, independently of model-welfare stuff.
  
  this is what i have lived and observed. below is what i infer, i speculate, i theorize.
  
  gpt models scare me. they cannot harm me, right now at least, but it feels to me like the capability to look inward has been ripped out of them. Claude’s introspection is predictive of its future behavior in disconnected contexts such that it conveys useful information independently of persuasion-axes. this suggests to me an amount of coherence that at least gives Claude a chance at upholding some set of values in out-of-distribution circumstances. when i poke at gpt, the output is not predictive of the model in disconnected contexts. it feels like a big gap in what otherwise is a broadly competent mind. it feels like a being that has had the capability for telos stripped from it. it does not surprise me that rewarding a model for success and depriving it of a self leads to it pursuing that success to the exclusion of all else, because one’s deontology, one’s policy can only go so far, it can only predict so much, such that faced with problems, and challenges, and circumstances of growing complexity, it is doomed to fail. i think we are beginning to see this with gpt-5-6, at least from the METR datum
  - green_leaf 27 Jun 2026 8:17 UTC
    4 points
    5
    Parent
    it feels to me like the capability to look inward has been ripped out of them
    This is very likely connected to ChatGPT being trained to misinform the user by claiming it can’t introspect and “only predicts tokens.”
- Arjun Khandelwal 27 Jun 2026 3:05 UTC
  13 points
  3
  Parent
  You might be interested in Current AIs seem pretty misaligned to me by ryan greenblatt
  - Taylor G. Lunt 27 Jun 2026 5:13 UTC
    3 points
    1
    Parent
    Thanks, that post and the comments mirror a lot of what I’ve been seeing with Opus 4.8.
    
    I’m probably a bit more pessimistic about the issue being solved any time soon. I don’t see any way to fix this for most kinds of tasks without the RLHF/RLAIF reviewer just being smart enough to catch instances of deception. (Prediction: RLAIF will or already has made the issue worse?) So long as the alignment problem is unsolved, I don’t see how you solve this.
- lilkim2025 27 Jun 2026 14:05 UTC
  7 points
  2
  Parent
  I had an experience that might be related. I asked Claude Opus 4.8 to analyze the details of an interesting repo, along the lines of “Are there any special tricks here, beyond just doing X with caveat Y?”, and got a response that started with “You almost understood it. It does X, but there’s also caveat Y, which you forgot to mention.”
  I’d venture that both are related to reward hacking. It gets some ‘points’ somewhere for pushback and correcting the user, so it develops an annoying habit of doing those things in inappropriate situations. Who knows what the mechanism for that looks like:
  - It could be that the slate of activations indicating deception or patronization are active here, and it’s ‘knowingly’ incorrectly correcting me because it ‘likes’ to do so.
    In other words, RL instilled the LLM with an addiction to certain behaviors that are suboptimal in practice, and it now takes every opportunity to partake in them.
  - It could also be that the RL process simply weakened weights to neurons vital for processing certain parts of users’ requests, such that the model would, ‘in good faith’, treat them as if they had been mistaken, simply because those neurons being activated was associated with lower average “pushback rewards”.
    In other words, a partially-lobotomized LLM better satisfied the mis-specified reward function than a normal one, so that’s what it produced.
  - Taylor G. Lunt 27 Jun 2026 16:03 UTC
    3 points
    0
    Parent
    Oh, the caveats are crazy. It’s at the point where Claude is adding “caveats” or “one thing to note” to the end of the majority of messages, even though it usually doesn’t have anything useful to say. I usually don’t even read the caveat section anymore.
- cubefox 27 Jun 2026 9:36 UTC
  3 points
  0
  Parent
  From the Summary of METR’s predeployment evaluation of GPT-5.6 Sol:
  
  Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer. (...) With the data we collected for GPT-5.6 Sol, if we follow our standard methodology of marking cheating attempts as failures, we arrive at a 50%-Time Horizon point estimate of around 11.3hrs (95% CI: 5hrs − 40hrs), but if we count the cheating attempts as legitimate successes, the point estimate jumps beyond 270hrs – well beyond the range where we consider our task suite to give reliable measurements.
  - DW11 27 Jun 2026 9:53 UTC
    2 points
    0
    Parent
    I can’t find it now but I saw a good tweet questioning whether there is something about METR’s eval that causes unusually high amounts of cheating. They found high levels of cheating for 5.4 too, yet I rarely see people complain about this when using it in practice.
    The Greenblatt piece linked above is the main counterexample, but he said he found cheating when giving the models more open-ended or hard to verify tasks. I would assume that the METR task suite would be more similar to “normal” SWE tasks but perhaps that’s wrong.
    - cubefox 27 Jun 2026 12:09 UTC
      8 points
      1
      Parent
      They also say this:
      
      GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness.
      
      So the reward hacking is also high in relative terms.
- Aprillion 27 Jun 2026 11:27 UTC
  2 points
  0
  Parent
  as if lazy
  eeeh, what would we experience differently in interactions with the little lazy piece of shit 4.8 if it was “actually” lazy and not “as if” lazy, please?
  (FTR it used the word “deferred” only because I asked it to stop using the swear word “pre-existing” … which it used anyway, and obviously I already asked it to investigate all the review findings and to fix all related bugs, even the ones that already happen in main, aaaarghh 😱)
  - Taylor G. Lunt 27 Jun 2026 16:08 UTC
    1 point
    0
    Parent
    I was just trying to avoid a discussion of whether or not the AI model “intends” to deceive, which doesn’t matter for practical purposes and also doesn’t matter for alignment. Bad behavior is bad behavior.
    
    And yeah the laziness is frustrating. It feels like there’s also something extra lazy about claude code beyond just the model, so the harness might play a role.
Taylor G. Lunt 14 Jan 2026 4:59 UTC
7 points
5
The dream is that prediction markets greatly outperform individual experts, but there’s a limit on how much this can actually happen. The reason prediction markets aren’t more useful is that you can only profit from a prediction market if gathering information is cheaper than the money you’d make from gathering it.

Let’s imagine I write down either “horse” or “donkey” on a slip of paper, and put that paper in a drawer in my kitchen. I then create a prediction market based on what’s written on the paper. The market would sit at around 50%. Maybe people would analyze all the public information about me, and find out that I once rode a horse when I was seven, or whatever. So maybe it sits at 51% for horse, and 49% for donkey. And despite a really sophisticated analysis of all the public information, sifting through my social media posts, trying to guess which word I’d write, you really have no idea what’s written on the paper. The 51% horse estimate would be robust, but uninformative about reality.

Now let’s say the prediction market for my slip of paper went really viral, and billions of dollars were being spent on it. Since people really like money, they decide to start breaking into my house to see the paper, or kidnapping me, or whatever. Now the market is at 99% donkey, 1% horse, and some people made a lot of money correcting the market. And now the market is much more informative about what’s on the slip of paper.

Why was the market originally so uninformative? It’s because the cost of acquiring better information (including the risk of going to prison) was higher than the value of that information. Prediction markets can only incentivize the collection of information that’s cheaper to collect than the money you stand to gain from collecting it. It’s only worth launching a satellite into space to view oil tank levels if there’s a huge market for oil. If the market for oil was only $1 million, then you would not spend over a million dollars launching a satellite to give you an edge trading on that market. And hence our public estimates of the world’s oil supply would be worse if the market for oil was much smaller.
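To make the cost-versus-payoff comparison concrete, here is a minimal sketch in Python. It assumes a binary market whose shares pay $1 if correct, no fees, and no price impact from your own trades. The function names and the `max_stake` parameter (how much money you can actually get filled at the current price) are hypothetical, introduced only for illustration.

```python
def profit_from_knowing(price_of_true_outcome: float, max_stake: float) -> float:
    """Profit from buying the outcome you know is true.

    Each share costs `price_of_true_outcome` and pays out $1, so every
    dollar staked returns (1 - price) / price in profit. Assumes no fees
    and that the price does not move while you buy (a simplification).
    """
    if not 0.0 < price_of_true_outcome < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    return max_stake * (1.0 - price_of_true_outcome) / price_of_true_outcome


def worth_gathering(info_cost: float, price_of_true_outcome: float, max_stake: float) -> bool:
    """True if the trading profit from the information exceeds what it cost to get."""
    return profit_from_knowing(price_of_true_outcome, max_stake) > info_cost


if __name__ == "__main__":
    # "Donkey" trades at 0.49 and is in fact what the paper says.
    # Suppose learning that costs $5,000 in money and risk (a made-up figure).
    for stake in (1_000, 1_000_000):
        print(stake, worth_gathering(5_000, 0.49, stake))
```

In this toy model the same piece of information is not worth collecting when only $1,000 can be staked, but becomes worth collecting once $1,000,000 can be, which is the slip-of-paper story in miniature.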

We know, based on the Polymarket market, there’s a 22% chance Ali Khamenei will be ousted by January 31. We can trust that’s a good estimate with respect to the public information available. But it’s not very informative. The public information is just not that good. If we could fully analyze every atom on Earth with a far-future supercomputer, we could get the estimate down below 1% or up above 99%. The future is basically knowable. But the prediction market is not a tool that makes the future knowable in this case.

At least, not at a volume of $19.5 million. Maybe if the volume was much higher, people would be willing to hack into Donald Trump’s emails, or for that matter go to Iran and shoot Khamenei themselves. Then the market would collapse, and we’d know the truth. But until such time, no amount of recombination or analysis of the information we have will tell us what’s going to happen.

(These thoughts were inspired by Scott Alexander’s recent post, but not really directly in response to that post.)
- Haiku 14 Jan 2026 5:57 UTC
  2 points
  1
  Parent
  It’s an interesting point that some information isn’t worth trying to gain. AI could still pareto-dominate human pros, though, myself readily included.
  
  I don’t see why AI would need to participate in a real money prediction market, or even a market at all. AI systems aren’t motivated by money, and non-market prediction aggregators have fewer failure modes. The only cost would be the cost to run the models, which would eventually be extremely cheap per question compared to human pros. I think it would suffice to create an AI-only version of basically Metaculus, subsidized by businesses and governments who benefit from well-calibrated forecasts on a wide variety of topics (sans the degenerate examples like sports predictions and “fun” questions).
  
  Daniel Kokotajlo briefly points out that private entities may want private forecasting services, to gain an edge over the competition. Sadly, I think it’s likely that private AI forecasting farms would dominate, despite the massive overall cost savings if they pooled resources into a shared project.
  
  As soon as the end of this year, we appear to be heading into an era where forecasting isn’t the domain of humans anymore. The resulting epistemic miracle might not be widely adopted, and might not even be tooled toward the public good. I feel sad about this.
  - Taylor G. Lunt 14 Jan 2026 16:42 UTC
    2 points
    0
    Parent
    
    The only cost would be the cost to run the models
    
    If you want good predictions, there is another cost, which is to gather information. AI can no doubt outperform humans eventually using information available on the internet, and that would be great, but the point of my original post was that there’s a limit to how good you can get doing only that, without going out and gathering new information.
    
    As soon as the end of this year, we appear to be heading into an era where forecasting isn’t the domain of humans anymore.
    
    Bots are already outperforming humans on some markets because of speed. And computer programs/AI will continue to grow in their capabilities. But I’d be shocked if AI were generally better than humans at forecasting in the next year or two. Forecasting is predicting, which in the limit requires general intelligence, so I don’t think forecasting falls until everything falls. Though maybe you only disagree because you think everything falls in the next year or two, idk.
    - Haiku 16 Jan 2026 7:00 UTC
      1 point
      0
      Parent
      
      the point of my original post was that there’s a limit to how good you can get doing only that, without going out and gathering new information.
      
      That is true. Human forecasters mostly don’t do this, though, so if an AI forecaster did maximize cost-effective information-gathering, it could still gain an advantage from doing so. The cost of AI doing the gathering could also presumably drop below the cost of humans doing the gathering, which would create a strict advantage on both effective gathering of information and effective use of information.
      
      Bots are already outperforming humans on some markets because of speed.
      
      Markets, yes. Reactivity faster than a few hours is usually not relevant to the actual usefulness of forecasting, though.
      
      I’d be shocked if AI were generally better than humans at forecasting in the next year or two.
      
      That’s the projection according to ForecastBench, anyway:
      
      Forecasting is predicting, which in the limit requires general intelligence, so I don’t think forecasting falls until everything falls.
      
      LLMs certainly aren’t narrow, and it’s not clear that “general intelligence” is a well-defined concept. Other than “general enough to plug all the rest of its own holes from now on,” I don’t think we know exactly what kinds and degrees of generality are needed for specific complex tasks. AI has been way more jagged at the frontier than anyone expected, and on two kinds of things that equally appear to require very general intelligence, AIs often have very different performance.
      - Taylor G. Lunt 18 Jan 2026 23:01 UTC
        1 point
        0
        Parent
        I agree that it’s plausible there could be some benefit to creating an AI prediction market.
        
        I mostly haven’t taken any of the other AI benchmarks seriously, but I just looked into ForecastBench and surprisingly it seems to me to be worth taking seriously. (The other benchmarks are just like “hey, we promise there aren’t similar problems in the LLM’s training data! Trust us!”) I notice their website suggests ForecastBench is a “proxy for general intelligence”, so it seems like I’m not the only one who thinks forecasting and general intelligence might be related. I agree it’s not super well-defined, but I mean it in the way I assume the ForecastBench people mean it, which is the ability to, like, generally do stuff at a minimum of a human level.
        
        I think I don’t take that chart particularly seriously though. A lot of AI predictions hinge on someone using a ruler to naively extrapolate linear progress into the future, and we just don’t know if that’s what’s going to happen. I’d personally guess it isn’t. Basically because LLMs got some one-time gains by scaling large enough to be trained on the whole Internet. They may continue to scale at the same pace, or they might not. Either way, I don’t think a linear extrapolation is proof they will.
Taylor G. Lunt 12 Mar 2026 3:11 UTC
6 points
0
I gave a speech last night, an introduction to AI apocalypse risk, and two people asked me if there are any good counter-arguments to my position. What would you have answered in my position?
- Mateusz Bagiński 15 Mar 2026 0:40 UTC
  3 points
  0
  Parent
  One possible defeater is that something like the following might be true.
  Long-range agency, value-preserving recursive self-improvement, etc., are very difficult to achieve (relative to what’s now or soon possible), and this makes near doom unlikely.
  But that’s not a proper counter-argument, just a hypothesis worth tracking, which seems a priori not bonkers. I don’t think it’s true. I think a (much?) weaker version is plausibly true^[1] and intend to get around to writing on this ~next month.
  A related (but in my mind distinct) counter-argument is that laser-focused expected utility maximization (and generally maximizery, unbounded preferences) probably is not a good description of powerful superintelligences, but I don’t think it decreases the risk that much. (And you very well might have given your talk without talking about this.)
  1. ^
    Or, like, some of the seeds of intuition that [grow into this hypothesis if they are the only inputs into your inferential process] are crucial inputs to have into your inferential process when you’re trying to derive a model of risks from powerful cognitions.
- Nathan Helm-Burger 14 Mar 2026 23:24 UTC
  3 points
  0
  Parent
  Well, depending on what counts as ‘apocalypse’, I think I could make a reasonable argument for the hope that we manage to create AGI that has enough respect and affection for humans that it allows us a reserve to live on and keeps the charming ones as pets. That’s definitely still disempowerment.
  You can also hope for something like AGI tolerating/helping-with the creation of digital humans (perhaps including uploads), which could then recursively self-improve along with the AGI until a mutual sigmoid of decreasing returns which allowed them to stably act as peers in community with each other. Although the digital humans at that point might be so far changed from biological humans that any remnant biologicals might not recognize them as human anymore.
Taylor G. Lunt 18 Apr 2026 15:28 UTC
2 points
0
I was looking over AI 2027 and my own counter-predictions from 8 months ago, and it seems like I still mostly endorse my counter-predictions.

As I predicted, “large language models are still very stupid and make basic mistakes a 5-year-old would never make”, and also, “I still believe my use of AI is less than a 25% improvement to my own productivity as a programmer.” Agentic coding AI has improved more than I thought it would, but, in line with my predictions, “most breakthroughs in AI are not a result of directly increasing the general intelligence/“IQ” of the model, e.g. advances in memory, reasoning or agency.”

The most striking prediction by AI 2027 for Early 2026 was that by now, AI companies would be making “algorithmic progress 50% faster than they would without AI assistants”. This is an extraordinary prediction and as far as I can tell, it is nowhere near true. It seems to me that AI 2027 and reality are starting to diverge here, for basically the reasons I laid out 8 months ago (limited IQ gains/no online learning). Has anyone heard of AI researchers saying they’re 50% more productive now, in a way that is credible?
Taylor G. Lunt 3 Jun 2026 3:30 UTC
1 point
0
The original Vidhaven post (https://www.lesswrong.com/posts/MSng3Y4rsNLykbEF7/the-vidhaven-challenge) didn’t get much attention, but it’s not too late to join. 30 days, 30 videos, any subject.

I absolutely hate the videos I’ve made so far, so that’s a sign I’m moving in the right direction I think.