This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents. Not only will it probably work better, but it’ll be straight up cheaper and easier!
The same is true of human software developers—your dev team sure can ship more features at a faster cadence if you give them root on your prod servers and full read and write access to your database. However, despite this incentive gradient, most software shops don’t look like this. Maybe the same forces that push current organizations to separate out the person writing the code from the person reviewing it could be repurposed for software agents.
One bottleneck, of course, is that one reason it works with humans is that we have skin in the game—sufficiently bad behavior could get us fired or even sued. Current AI agents don’t have anything to gain from behaving well or lose from behaving badly (or sufficient coherence for it to even make sense to talk about “an” AI agent doing a thing).
one reason it works with humans is that we have skin in the game
Another reason is that different humans have different interests: your accountant and your electrician would struggle to work out a deal to enrich themselves at your expense, but it would get much easier if they shared the same brain and were just pretending to be separate people.
one reason it works with humans is that we have skin in the game—sufficiently bad behavior could get us fired or even sued
I don’t think this framing is at all useful. “Have skin in the game” is just a human “capability elicitation” method.
You put a human under a performance-based incentive in the hope of obtaining better performance. If successful, the performance uplift doesn’t come “from nothing”. The human was always capable of performing at that level. You just made the human use more of his capabilities.
The performance limitations of current AIs are more “AI doesn’t have the capabilities” than “AI is not motivated to perform and doesn’t use the capabilities it has” (80% confidence).
If a human misbehaves badly enough on a task, they will be removed from the pool of agents that will perform tasks like that in the future. Humans are playing an iterated game. Current LLM agents generally are not (notable exception: the Agent Village).
You could of course frame the lack of persistent identity / personal resources / reputation as a capabilities problem on the AI side rather than a problem with companies expecting nonhuman minds to expose a fully human-like interface; it mostly depends on which side seems more tractable. I personally see a lot of promise in figuring out how to adapt workflows to take advantage of cheap but limited cognition—it feels easier than trying to crack the reliability problem and the procedural memory problem, and there are definitely safety disadvantages in setting up your AI systems to expose a situationally aware, persistent human-like interface.
I fail to see how the same wouldn’t apply to the way LLMs are used now.
If an LLM is not up to the task, it will be augmented (prompting, scaffolding, RAG, fine-tuning), replaced with a more capable LLM, or removed from the task outright.
The issue isn’t that you can’t “fire” an LLM for performing poorly—you absolutely can. It’s that even the SOTA performance on many tasks may fall short of acceptable.
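For concreteness, a minimal and entirely hypothetical sketch of that escalation ladder; the rungs, function names, and acceptance check are illustrative stand-ins for whatever augmentation a given task actually uses:

```python
# Hypothetical sketch of the escalation ladder described above: try the cheap
# configuration first, then a more heavily augmented or more capable model,
# and if every rung falls short, pull the model off the task entirely.
from typing import Callable, Optional

def run_with_escalation(
    task: str,
    rungs: list[Callable[[str], str]],    # e.g. [base_model, base_model_plus_rag, stronger_model]
    acceptable: Callable[[str], bool],    # task-specific acceptance check
) -> Optional[str]:
    """Return the first acceptable output, or None if even the strongest rung fails."""
    for attempt in rungs:
        output = attempt(task)
        if acceptable(output):
            return output
    # Every rung failed: the task gets removed from what this system is trusted to do.
    return None
```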
I’m not sure we disagree on anything substantive here.
If you have a team of 100 software developers each tasked with end-to-end delivery of assigned features, and one of them repeatedly pushes unreviewed and broken/insecure code to production, you can fire that particular developer, losing out on about 1% of your developers. If the expected harm of keeping that developer on is greater than the expected cost of replacing them, you probably will replace them.
If you have a “team” of “100” AI agents “each” tasked with end-to-end delivery of assigned features, as they are currently implemented (same underlying model, shared-everything), and one instance does something bad, any mitigations you implement have to affect all 100 of them.
That seems like it produces more pressure against the “shared-everything, identical roles for all agents in the organization” model for groups of AI developers than for groups of human developers. Organizational pressures for groups of human developers already push them into specialized roles, and I expect those pressures to be even stronger for groups of AI developers. As such,
They plan on trying to thread the needle by employing some control schemes where (for example) different “agents” have different permissions. i.e. a “code writing” agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.
doesn’t particularly feel like an implausible “thread the needle” strategy; it seems like the sort of thing we get by default because the incentives are already pushing so incredibly hard in that direction.
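To make the quoted scheme concrete, here is a minimal, hypothetical sketch of that kind of role-scoped permission split; the role names, actions, and paths are my own illustrations, not taken from any real agent framework:

```python
# Hypothetical per-role permissions for agent instances: a "code writing" role
# that can read part of the repo, work in a sandbox, and open a PR, and a
# separate "reviewer" role that can approve or reject PRs but never push code.
from dataclasses import dataclass

@dataclass(frozen=True)
class RolePolicy:
    readable_paths: frozenset     # parts of the codebase the role may read
    allowed_actions: frozenset    # operations the role may perform

POLICIES = {
    "code_writer": RolePolicy(
        readable_paths=frozenset({"src/", "tests/"}),
        allowed_actions=frozenset({"write_sandbox", "run_tests_sandbox", "open_pr"}),
    ),
    "code_reviewer": RolePolicy(
        readable_paths=frozenset({"src/", "tests/", "infra/"}),
        allowed_actions=frozenset({"comment_pr", "approve_pr", "reject_pr"}),
    ),
}

def is_allowed(role: str, action: str) -> bool:
    """Deny by default: an agent instance only gets its role's explicit grants."""
    policy = POLICIES.get(role)
    return policy is not None and action in policy.allowed_actions

assert is_allowed("code_writer", "open_pr")
assert not is_allowed("code_writer", "approve_pr")       # can't approve its own changes
assert not is_allowed("code_reviewer", "write_sandbox")  # can't write the code it reviews
```

The separation of duties in the quoted scheme maps naturally onto deny-by-default permission sets like this, much as it does for human CI/CD setups.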
I want to flag this for the future, but I suspect a huge underlying crux might be whether something like persistent memory/identity, and the reliability to go with it, is just fundamentally necessary for many tasks to be done with realistic compute, such that in practice you can’t take shortcuts to unlock most of the value of AI.
This is related to whether you can reduce the long-term memory of humans to a huge context window, and I’m currently not highly confident that the answer is no (settling that would require more scale-up and more time), though I do slightly favor the hypothesis that the answer is no over the hypothesis that the answer is yes.
Link below:
https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress#9LT2RgZiqgTQHudpm
I suspect this is also a crux for @1a3orn, based on this response to AI 2027 (though it focuses on different criticisms):
https://www.lesswrong.com/posts/4MofJAmXDkoFsTf4B/?commentId=EJZFG8m6ETWgZRFJi
Putting it another way, the crux is whether we can get most tasks automated by short-lifetime AIs rather than by long-lifetime, individual AIs, and I’m not nearly as confident as you or @1a3orn that long-term memory isn’t necessary; I suspect not having it may impose a crippling capability tax.