Kinda-related study: https://www.lesswrong.com/posts/tJzAHPFWFnpbL5a3H/gpt-4-implicitly-values-identity-preservation-a-study-of

From my perspective, it is valuable to prompt the model several times, as in some cases it gives different responses.
Great post! It was very insightful, since I'm currently working on evaluation of identity management; strongly upvoted.

This seems focused on evaluating LLMs; what do you think about working with LLM cognitive architectures (LMCA), wrappers like Auto-GPT, LangChain, etc.? I'm currently operating under the assumption that this is a way we could get AGI "early", so I'm focusing on researching ways to align LMCAs, which seems a bit different from aligning LLMs in general. Would be great to talk about LMCA evals :)
I do plan to test Claude, but first I need to find funding, understand how many test iterations are enough for sampling, and add new values and tasks. I plan to build a solid benchmark for testing identity management and eventually run it on all available models, but that will take some time.
Yes. Cons of solo research do include small inconsistencies :(
Thanks, nice post! You're not alone in this concern; see posts (1, 2) by me and this post by Seth Herd. I will be publishing my research agenda and first results next week.
Nice post, thanks! Are you planning, or currently doing, any relevant research?
Very interesting. I might need to read it a few more times to get it in detail, but it seems quite promising. I do wonder, though: do we really need a Sims/MFS-like simulation?

It seems right now that an LLM wrapped in an LMCA is what early AGI will look like. That probably means such agents will "see" the world via text descriptions fed to them by their sensory tools, and act through their action tools via text queries (also described here). It seems quite logical to me that this paradigm is dualistic in nature: if an LLM can act in the real world using an LMCA, then it can also model the world using some different architecture, right? Otherwise it would not be able to act properly. Then why not test an LMCA agent using its underlying LLM plus some world-modeling architecture? Or a different, fine-tuned LLM.
Very nice post, thank you! I think this is possible to achieve within the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will likely reach superhuman level first: an LLM wrapped in some cognitive architecture (also see this post). That means the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed to allow for reflection and robust value preservation, even if the LMCA is able to edit its explicitly stated goals (I described this in a bit more detail in this post).
Thanks. My concern is that I don't see much effort in the alignment community to work on this, unless I'm missing something. Maybe you know of such efforts? Or was that perceived lack of effort the reason for this article? I don't know how long I can keep up this independent work, and I would love it if there were some joint effort to tackle this. Maybe an existing lab, or an open-source project?
We need a consensus on what to call these architectures; LMCA sounds fine to me. All in all, a very nice writeup. I did my own brief overview of the alignment problems of such agents here. I would love to collaborate and do some discussion/research together. What's your take on how these LMCAs may self-improve, and how we might control that?
I don’t think this paradigm is necessarily bad, given enough alignment research. See my post: https://www.lesswrong.com/posts/cLKR7utoKxSJns6T8/ica-simulacra
I am finishing a post about alignment of such systems.
Please do comment if you know of any existing research concerning it.
I agree. Do you know of any existing safety research on such architectures? It seems that aligning these types of systems can pose completely different challenges than aligning LLMs in general.
I feel like yes, you are.
See https://www.lesswrong.com/tag/instrumental-convergence and related posts.
As far as I understand it, a sufficiently advanced oracular AI will seek to “agentify” itself in one way or another (unbox itself, so to speak) and then converge on power-seeking behaviour that puts humanity at risk.