If a human misbehaves badly enough on a task, they will be removed from the pool of agents that will perform tasks like that in the future. Humans are playing an iterated game. Current LLM agents generally are not (notable exception: the Agent Village).
You could of course frame the lack of persistent identity / personal resources / reputation as a capabilities problem on the AI side rather than as a problem with companies expecting nonhuman minds to expose a fully human-like interface; it mostly depends on which side seems more tractable. I personally see a lot of promise in figuring out how to adapt workflows to take advantage of cheap but limited cognition; that feels easier than trying to crack the reliability problem and the procedural memory problem, and there are definite safety disadvantages to setting up your AI systems to expose a situationally aware, persistent, human-like interface.
I fail to see how the same wouldn’t apply to the way LLMs are used now.
If an LLM is not up to the task, it will be augmented (prompting, scaffolding, RAG, fine-tuning), replaced with a more capable LLM, or removed from the task outright.
The issue isn’t that you can’t “fire” an LLM for performing poorly (you absolutely can); it’s that even SOTA performance on many tasks may fall short of what’s acceptable.
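For concreteness, here is a minimal sketch of that augment/replace/remove ladder. The helpers (`call_llm`, `retrieve_context`, `passes_checks`) and model names are hypothetical placeholders, not any particular vendor's API; the point is only the shape of the workflow.

```python
# Minimal sketch of the "augment, replace with a stronger model, or remove" ladder.
# All helpers and model names below are hypothetical placeholders.
from typing import Callable, Optional

def attempt_task(
    task: str,
    call_llm: Callable[[str, str], str],      # (model_name, prompt) -> output
    retrieve_context: Callable[[str], str],   # stand-in for RAG / scaffolding
    passes_checks: Callable[[str], bool],     # automated review gate
) -> Optional[str]:
    attempts = [
        ("cheap-model", task),                                               # baseline
        ("cheap-model", task + "\n\nContext:\n" + retrieve_context(task)),   # augmented
        ("capable-model", task),                                             # stronger model
    ]
    for model, prompt in attempts:
        output = call_llm(model, prompt)
        if passes_checks(output):
            return output
    return None  # "fired": the task goes back to a human or a different pipeline
```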
I’m not sure we disagree on anything substantive here.
If you have a team of 100 software developers, each tasked with end-to-end delivery of assigned features, and one of them repeatedly pushes unreviewed and broken/insecure code to production, you can fire that particular developer, losing out on about 1% of your developers. If the expected harm of keeping that developer on is greater than the expected cost of replacing them, you probably will replace them.
If you have a “team” of “100” AI agents “each” tasked with end-to-end delivery of assigned features, as they are currently implemented (same underlying model, shared-everything), and one instance does something bad, any mitigations you implement have to affect all 100 of them.
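As a toy back-of-the-envelope illustration of that scoping difference (every figure below is invented; only the structure matters, not the numbers):

```python
# Toy comparison of the two cases above; all figures are invented for illustration.
team_size = 100

# Human team: the misbehaving developer can be removed individually.
expected_harm_of_keeping = 50_000   # e.g. expected incident cost per quarter
cost_to_replace_one = 30_000        # hiring + onboarding a single replacement
replace_the_human = expected_harm_of_keeping > cost_to_replace_one  # True -> replace

# AI "team": all 100 instances share one model and configuration, so a mitigation
# (extra review, tighter permissions, a model swap) applies to every instance.
mitigation_overhead_per_instance = 1_000   # e.g. added review cost per instance per quarter
ai_mitigation_cost = mitigation_overhead_per_instance * team_size

print(f"Replace the human developer? {replace_the_human} (cost scoped to 1 of {team_size} seats)")
print(f"AI mitigation cost: ${ai_mitigation_cost:,} per quarter, borne by all {team_size} instances")
```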
That seems like it produces more pressure against the “shared-everything, identical roles for all agents in the organization” model for groups of AI developers than for groups of human developers. Organizational pressures for groups of human developers already push them into specialized roles, and I expect those pressures to be even stronger for groups of AI developers. As such
They plan on trying to thread the needle by employing some control schemes where (for example) different “agents” have different permissions. i.e. a “code writing” agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.
doesn’t particularly feel like an implausible “thread the needle” strategy; it seems like the sort of thing we get by default, because the incentives are already pushing so incredibly hard in that direction.
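The quoted scheme is essentially a role-scoped permission table. A minimal sketch of what that could look like, where the role names, paths, and actions are illustrative placeholders rather than any real policy language:

```python
# Illustrative role-scoped permissions along the lines of the quoted scheme.
# Role names, paths, and actions are placeholders, not a real policy system.
AGENT_PERMISSIONS = {
    "code-writer": {
        "read_paths": ["src/", "tests/"],               # read only part of the codebase
        "sandbox_actions": ["write", "deploy", "test"],  # free rein in an isolated dev env
        "repo_actions": ["open_pull_request"],           # can propose changes...
        "prod_actions": [],                              # ...but cannot touch production
    },
    "reviewer": {
        "read_paths": ["src/", "tests/"],
        "sandbox_actions": ["test"],
        "repo_actions": ["approve_pull_request"],        # approval lives in a separate role
        "prod_actions": [],
    },
}

def can(role: str, category: str, item: str) -> bool:
    """Return True if `role` is granted `item` in the given permission category."""
    allowed = AGENT_PERMISSIONS.get(role, {}).get(category, [])
    return any(item == entry or item.startswith(entry) for entry in allowed)

# e.g. can("code-writer", "repo_actions", "merge_to_main") -> False
#      can("code-writer", "read_paths", "src/billing/handler.py") -> True
```

The load-bearing choice is that proposing a change and approving or deploying it live in different roles, so no single agent instance holds the whole chain.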
I want to flag this for the future, but I suspect a huge underlying crux might be whether something like persistent memory/identity, and the reliability to go with it, is just fundamentally necessary for many tasks to be done with realistic compute, such that in practice you can’t take shortcuts to unlock most of the value of AI.
This is related to whether you can reduce the long-term memory of humans into a huge context window. I’m currently not highly confident that the answer is no (settling that would require more scale-up and more time), though I do slightly favor the hypothesis that the answer is no over the hypothesis that it is yes.
Link below:
https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress#9LT2RgZiqgTQHudpm
This is also a crux I suspect for @1a3orn, based on this response to AI 2027 (though it focuses on different criticisms):
https://www.lesswrong.com/posts/4MofJAmXDkoFsTf4B/?commentId=EJZFG8m6ETWgZRFJi
Putting it another way, the crux is whether we can get most tasks to be automated by short-lifetime AIs rather than long-lived, individual AIs, and I’m not nearly as confident as you or @1a3orn that long-term memory isn’t necessary, or that going without it doesn’t impose a crippling capability tax.