There’s also always the possibility that you can elicit these sorts of goals and values from instructions and build an instruction-based language around them, one that’s also relatively interpretable in terms of which values are being prioritised in a multi-agent setting.
You do however run into ELK and goal misgeneralization problems here, and IRL (inverse reinforcement learning) is not an easy task in general, but there might be some neurosymbolic approaches that change prompts to follow specific values?
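To make that concrete, here’s a minimal sketch of the symbolic half of such a pipeline: declared value rules that rewrite an agent’s prompt and leave an audit trail of which values fired (which is where the interpretability comes from). The rule names, predicates, and rewrites are all hypothetical placeholders; the elicitation step that would produce them from instructions is the hard open part.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValueRule:
    """A symbolic value constraint: a predicate over prompts plus a rewrite."""
    name: str
    violates: Callable[[str], bool]
    rewrite: Callable[[str], str]

# Hypothetical example rules; in the scheme above these would be elicited
# from instructions rather than hand-written.
RULES = [
    ValueRule(
        name="transparency",
        violates=lambda p: "do not reveal" in p.lower(),
        rewrite=lambda p: p + "\nConstraint: explain your reasoning to the other agents.",
    ),
    ValueRule(
        name="cooperation",
        violates=lambda p: "at any cost" in p.lower(),
        rewrite=lambda p: p.replace("at any cost", "while respecting the other agents' goals"),
    ),
]

def enforce_values(prompt: str, rules: list[ValueRule]) -> tuple[str, list[str]]:
    """Apply each rule's rewrite when its predicate fires; log the values enforced."""
    applied: list[str] = []
    for rule in rules:
        if rule.violates(prompt):
            prompt = rule.rewrite(prompt)
            applied.append(rule.name)
    return prompt, applied

if __name__ == "__main__":
    raw = "Win the negotiation at any cost. Do not reveal your strategy."
    fixed, applied = enforce_values(raw, RULES)
    print(fixed)
    print("values enforced:", applied)  # the audit trail is the interpretable bit
```

Obviously string matching is a toy stand-in; the interesting version would have learned predicates, which is exactly where the ELK and misgeneralization worries come back in.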
I’m not sure whether this reads as gibberish to you, but my main frame for the next 5 years is “how do we steer collectives of AI agents in productive directions for humanity”.