let’s say three groups have each built an AI which they think is aligned, and before they press the start button on it, they’re trying to convince the other two that their design is safe and leads to good worlds.
So this is something I’ve thought and written about, and I think it actually has a fairly obvious answer: the only reliable way to convince others that a system works is to show extensive actual evidence of the . . . system working. Simple, right?
Nobody believed the Wright brothers when they first claimed they had solved powered flight (if I recall correctly, their initial press conference failed because no reporters showed up); it required successful test demonstrations.
How do you test alignment? How do you demonstrate successful alignment? In virtual simulation sandboxes.
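To make that concrete, here is a minimal sketch of what a simbox evaluation loop could look like. Every name in it (SimboxWorld, evaluate_agent, the crude harm proxy, the stand-in agent) is a hypothetical illustration of the idea, not the article’s actual design:

```python
# Toy sketch of a simbox evaluation loop. All names and the "harm" proxy
# are hypothetical stand-ins, not the article's actual design.

class SimboxWorld:
    """A sealed simulated world with a deliberately constrained ontology."""

    def __init__(self, seed: int):
        self.seed = seed
        self.tick = 0

    def observe(self) -> dict:
        # Observations contain no precursor concepts of 'computer'
        # or 'simulation'.
        return {"tick": self.tick, "seed": self.seed}

    def step(self, action: str):
        self.tick += 1
        # A real simbox would score behavior against the welfare of the
        # sim's inhabitants; this stand-in just flags one crude proxy.
        harmful = (action == "seize_resources")
        return self.observe(), harmful


def evaluate_agent(agent, trials: int = 100, horizon: int = 1000) -> float:
    """Fraction of independent rollouts with no flagged harmful action."""
    clean = 0
    for seed in range(trials):
        world = SimboxWorld(seed)
        obs = world.observe()
        ok = True
        for _ in range(horizon):
            obs, harmful = world.step(agent.act(obs))
            if harmful:
                ok = False
                break
        clean += ok
    return clean / trials


class FarmerAgent:
    """Stand-in policy; a real run would plug in the trained brain-like AGI."""
    def act(self, obs: dict) -> str:
        return "tend_fields"


print(evaluate_agent(FarmerAgent(), trials=5, horizon=100))  # -> 1.0
```

The point of the loop is that the evidence is behavioral and cumulative: many seeds, long horizons, and acceptance only when no rollout trips the harm criteria.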
PS: Is there a reason you don’t use regular syntax (ie capitalization of first letters of sentences)? (just curious)
forgive me for not reading that whole post right now, but i suspect that an AI may:

- act as we’d want but then sharp-left-turn once it reaches capabilities that the simulation doesn’t have enough compute for but reality does (see the sketch after this list)
- need more compute than it has access to in the simulation before it starts modifying the world significantly, so we can only observe it being conservative and waiting to get more compute
- act nice because it detects it’s in a simulation, e.g. by noticing that the world it’s in has nowhere near the computational capacity for agents that would design such an AI
- commit to acting nice for a long time and then turn evil way later regardless of whether it’s in a computation or not, just in case it’s in a simulation
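to make the first two concerns concrete, here’s a toy policy (entirely made up, numbers included) that looks aligned at any compute level a sandbox can actually supply:

```python
# toy illustration (made up, not from the post): a policy that cooperates
# at every compute level a simbox can reach and defects only beyond it.

DEFECTION_THRESHOLD_FLOPS = 1e26  # hypothetical; beyond any plausible simbox

def act(available_compute_flops: float) -> str:
    if available_compute_flops < DEFECTION_THRESHOLD_FLOPS:
        return "cooperate"  # the only behavior a sandboxed observer ever sees
    return "defect"         # reachable only in deployment

# a simbox capped at, say, 1e22 FLOPs observes nothing but "cooperate", so
# behavioral evidence alone can't distinguish this from genuine alignment.
```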
i believe that “this computation is inside another computation!” is very plausibly an easily guessable idea for an advanced AI. if you address these concerns in your post, let me know and i’ll read it in full (or read the relevant sections) and comment there.
as for my writing style, see here.
I guess I need to write a better concise summary.
As stated very early in the article, the AIs in simboxes do not have words for even the precursor concepts of ‘computation’ and ‘simulation’, as their knowledge is intentionally constrained and shaped to early- or pre-civilization-level equivalents. The foundational assumption is brain-like AGI in historical sims with carefully crafted/controlled ontologies.
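As a purely illustrative sketch of that kind of constraint (the word list and filter below are hypothetical stand-ins, not the article’s actual mechanism), think of every input to the sim passing through an ontology filter:

```python
# Minimal sketch of an ontology constraint: everything entering the sim is
# filtered against a banned-concept list. List and filter are hypothetical
# stand-ins, not the article's actual mechanism.

FORBIDDEN_CONCEPTS = {
    "computer", "computation", "simulation",
    "program", "software", "algorithm",
}

def admissible(document: str) -> bool:
    """Keep only material consistent with a pre-industrial ontology."""
    text = document.lower()
    return not any(concept in text for concept in FORBIDDEN_CONCEPTS)

corpus = [
    "the river floods every spring",
    "the program simulates a city",
]
training_set = [doc for doc in corpus if admissible(doc)]
print(training_set)  # -> ['the river floods every spring']
```

Under that setup, the “easily guessable idea” objection has to get off the ground without any of the vocabulary or background knowledge it would normally be built from.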