I see there is no discussion here of Oracle AIs with transparent reasoning processes. How might things change if we had an automatic, detailed rationale for everything the Oracle said?
Also, I think there might be a good argument for building a true Oracle AI instead of a Friendly AI once we think we have friendliness solved, if the consequences of getting friendliness only almost right are less terrible (would a true Oracle AI that didn’t understand that humans could get bored really be that bad? I can give good advice to my 12-year-old brother even though my model of him is not perfect, and it’s not clear to me that my advice would become harmful if I were much smarter. To reverse the situation: my dad’s goals for me are not exactly the same as my goals for me, and if his intelligence were enhanced radically, I would rather hear his advice and be wary of it than give him complete power over my life.)
But I guess that’s no different than saying we should do everything we can to box an FAI even if we think we have one, with one particular kind of box being behavioral constraints added to the AI’s goal function.
How might things change if we had an automatic, detailed rationale for everything the Oracle said?
Are you suggesting an AI that provides a rationale for every answer it gives, or an AI that provides a summary of all the reasoning it used to arrive at that answer? I can think of a criticism of either one.
Suppose you ask it to design a synthetic microbe that makes healthy bacon. It could give you the microbe’s genome, along with a document that describes how the microbe works and why it’s really good at making healthy bacon. If the oracle were not friendly, you might still wonder what else this microbe would do once you created it.
Or the oracle could give you a list of all the designs it considered, and good reasons for rejecting them, and the procedure it used to generate that list. But this seems like way more information than humans could ever check.
I’m suggesting that we have an independent “auditor” process that’s capable of summarizing and answering questions about the AI’s reasoning at any level of detail, but that the “auditor” is not under direct control of the AI (if the AI gains control of the “auditor”, it’s escaped from its box). For example, the AI could generate detailed log files (as a fundamental part of its architecture, the AI would log all of its reasoning), and there could be a separate piece of software designed to interpret and summarize the log files. (This separate software might not need to be very intelligent.)
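To make the kind of separation I have in mind a bit more concrete, here is a minimal sketch (the log format, field names, and function are all hypothetical): the AI appends one structured record per reasoning step to a log it cannot rewrite, and the auditor is a separate, deliberately simple program that only reads that log:

```python
import json
from collections import Counter

# Hypothetical sketch: the oracle appends one JSON record per reasoning step
# to an append-only log, e.g.
#   {"step": 1041, "kind": "hypothesis", "topic": "microbe_design", "text": "..."}
# The auditor is a separate, deliberately simple program that only reads the log.

def summarize_reasoning(log_path, topic=None):
    """Count reasoning steps by kind and collect steps worth human inspection."""
    counts = Counter()
    flagged = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if topic is not None and record.get("topic") != topic:
                continue
            counts[record["kind"]] += 1
            # Flag any step where the oracle considered real-world side effects.
            if record["kind"] == "action_consideration":
                flagged.append(record)
    return {"counts": dict(counts), "flagged_steps": flagged}

# A human operator could then drill down, e.g.:
#   summary = summarize_reasoning("oracle_trace.log", topic="microbe_design")
#   print(summary["counts"])
```

The point of the design is that the auditor stays simple enough to be verified by hand, while the richness of the picture it gives comes from the detail of the logs rather than from any intelligence of its own.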
Well, if the auditor were not very intelligent, I would wonder if the oracle had cleverly followed a long chain of reasoning that would produce logs that would trick the auditor into answering your queries about the oracle’s reasoning to your satisfaction. Perhaps the chain of reasoning seems like a legitimate search for bacon, but due to subtleties inscrutable to anything less than a superintelligence, it actually leads to bacon that takes over the world.
If the auditor were sufficiently intelligent to see through such machinations, I would wonder if the auditor and oracle weren’t acausally cooperating.
I haven’t thought about this much, and I’m not confident at all. If you find my objections inadequate, they probably are.
I don’t find this a very satisfactory reply. The auditor could have access to data on the physical resources consumed (electricity used, heat generated). An attempt to trick the auditor (whatever that means) would require a strictly larger search of the solution space, and therefore more computational energy expended.
It’s not really clear what you mean by ‘trick the auditor’ anyway. The auditor would have access to logs of the computational substrate running the AGI, portions of which could be inspected without summarization. Even if the AGI had such a capability, it would have had to develop that trickery at some point, and that development would itself be observable.
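To make the resource-consumption point concrete, here is a minimal sketch (the names, numbers, and threshold are all made up): compare the energy actually spent answering a query against a rough budget for an honest search, and flag large overshoots for closer inspection of the raw logs:

```python
# Minimal sketch; the names, numbers, and threshold are hypothetical.
# The idea: a deceptive search that also has to construct convincing fake logs
# should cost measurably more than the honest search it is imitating.

def energy_anomaly(measured_joules, honest_budget_joules, tolerance=1.5):
    """Return True if the oracle spent far more energy than the task should need."""
    return measured_joules > tolerance * honest_budget_joules

# Example: the honest bacon-microbe search was budgeted at roughly 2.0e9 J.
if energy_anomaly(measured_joules=5.1e9, honest_budget_joules=2.0e9):
    print("Energy use exceeds budget; inspect the raw substrate logs for this query.")
```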
Don’t worry, I haven’t thought about it much either. I hope I’m not doing anything to pollute discussion of this topic. I figure that in the worst case, a more serious researcher can skim through our discussion to see if we thought of anything they didn’t. (Maybe we should have a separate discussion thread about the reading one should definitely do before commenting on these topics, or about what we can do to explore the problem as thoroughly as possible before proposing solutions?)
The auditor would be able to see the oracle’s reasoning about how to obscure its future chains of reasoning.
Ideally the Oracle wouldn’t have much awareness of the details of how things were logged or how the auditor worked. There’s no reason why there couldn’t be more than one auditor, btw, each using a different scheme for auditing.
would a true Oracle AI that didn’t understand that humans could get bored really be that bad?
The important difficulty is not in making an AGI realize that humans can get bored; it is in making it realize that this is undesirable. It would also have to be able to figure out whether it is desirable to remove humans’ capacity for boredom, effectively wireheading them so that they are willing to enjoy repetitive activities.
I do agree that this could be less of a problem for an Oracle AI than for a goal-directed system, depending on other details of the design.