I think there’s some communication failure where people are very skeptical of this for reasons that they think are obvious given what they’re saying, but which are not obvious to me. Can people tell me which subset of the below claims they agree with, if any? Also, if you come up with slight variants that you agree with, that would be appreciated.
1. It is approximately impossible to succeed at this challenge.
2. It is possible to be confident that advanced AGI systems will not pose an existential threat without being able to succeed at this challenge.
3. It is not obvious what it means to succeed at this challenge.
4. It will probably not be obvious what it means to succeed at this challenge at any point in the next 10 years, even if a bunch of people try to work on it.
5. We do not currently know what it means for a go bot to know something in operational terms.
6. At no point in the next 10 years could one be confident that one knew everything a go bot knew, because we won’t be confident about what it means for a go bot to know something.
7. You couldn’t know everything a go bot knows without essentially being that go bot.
[EDIT: 8. One should not issue a challenge to know everything a go bot knows without having a good definition of what it means for a go bot to know things.]
If your goal is to play as well as the best go bot and/or write a program that plays equally well from scratch, it seems like it’s probably impossible. A lot of the go bot’s ‘knowledge’ could well be things like “here’s a linear combination of 20000 features of the board predictive of winning”. There’s no reason for the coefficients of that linear combination to be compressible in any way; it’s just a mathematical fact that these particular coefficients happen to be the best at predicting winning. If you accepted “here the model is taking a giant linear combination of features” as “understanding”, it might be more doable.
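To make that concrete, here is a minimal, made-up sketch of what that kind of ‘knowledge’ could look like; the feature extractor and the weights are stand-ins for illustration, not anything from a real go bot:

```python
import numpy as np

N_FEATURES = 20_000  # the "20000 features" mentioned above (illustrative)

rng = np.random.default_rng(0)
weights = rng.normal(size=N_FEATURES)  # stand-in for learned coefficients


def board_features(board):
    """Stand-in for whatever 20,000 features the bot actually computes."""
    return rng.normal(size=N_FEATURES)  # placeholder; ignores the board


def win_probability(board):
    """The bot's 'knowledge': this particular weighted sum predicts winning."""
    logit = float(weights @ board_features(board))
    return 1.0 / (1.0 + np.exp(-logit))
```

Nothing about those 20,000 coefficients needs to be individually meaningful or compressible; they are just the numbers that happen to predict winning.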
An even more pointed example: chess endgame tables. What does it mean to ‘fully understand’ them beyond understanding the algorithms which construct them, and is it a reasonable goal to attempt to play chess endgames as well as the tables?
If you have a “lazy” version of the goal, like “have a question-answerer that can tell you anything the model knows” or “produce a locally human-legible but potentially giant object capturing everything the model knows” then chess endgame tables are a reasonably straightforward case (“position X is a win for white”).
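As an illustration of that “lazy” version in the endgame-table case, here is a toy sketch; the table entries are invented, and a real tablebase would of course be vastly larger and built by retrograde analysis:

```python
# Toy stand-in for an endgame tablebase: each individual answer is
# human-legible ("win for white"), even though the full object is enormous.
tablebase = {
    "example K+R vs K position A": "win for white",
    "example K+R vs K position B": "draw",
    # ...billions of further entries in a real tablebase
}


def answer(position):
    """A question-answerer that can tell you anything the 'model' knows."""
    return tablebase.get(position, "position not covered by this table")


print(answer("example K+R vs K position A"))  # -> "win for white"
```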
(I am not one of the people who have expressed skepticism, but I find myself with what I take to be feelings somewhat similar to theirs.)
I agree with 1 if success is defined rather strictly (e.g., requiring that one human brain contain all the information in a form that actually enables the person whose brain it is to play like the bot does) but not necessarily if it is defined more laxly (e.g., it’s enough if for any given decision the bot makes we have a procedure that pretty much always gives us a human-comprehensible explanation of why it made that decision, with explanations for different decisions always fitting into a reasonably consistent framework).
I have no idea about 2; I don’t think I’ve seen any nontrivial but plausibly true propositions of the form “It is possible to be confident that advanced AGI systems will not pose an existential threat without X”, but on the other hand I don’t think this justifies much confidence that any specific X is a thing we should be working on if we care about being able to be confident that advanced AGI systems will not pose an existential threat.
I agree with 3, but I think this is because it hasn’t been defined as explicitly as possible rather than because of some fundamental unclarity in the question. Accordingly, I think 4 is probably wrong.
I’m not sure whether 5 is true or not and suspect that the answer depends mostly on how you choose to define “know”. (Maybe go bots don’t know anything!) I’m pretty confident saying that today’s best go bots know who’s likely to win in many typical game positions, or whether a given move kills a particular group or not. Accordingly, I am inclined to disagree with 6, even though probably there are edge cases where it’s not clear whether a given bot “knows” a given thing or not.
I don’t know what “essentially being” means in 7; as written it looks wrong to me, but for some strong definitions of “know everything it knows” something close enough might be true. E.g., plausibly certain bits of the KataGo network are encoding things roughly along the lines of “at the location we’re looking at white has a ponnuki shape whose influence is not negated by other nearby black groups”; one could know much of what the bot knows by regarding the ponnuki shape as valuable and understanding that some nearby configurations make it less valuable; but if knowing everything the bot knows includes computing exactly how good a given configuration of stones is in this respect, then plausibly you could only do that by having the ability to do pretty much the exact calculation the KataGo network does. (Perhaps the fine details are not part of what it knows but merely implementation details; maybe one could operationalize that in terms of the existence of similar, and similarly strong, bots where the fine details are somewhat different—though I think that, as it stands, is a bit too simplistic. Or perhaps I could claim that “I” know what the bot knows if I understand the overall structure and have a computer file, or a book, containing the actual numbers. Something something Chinese Room something something systems reply something.)
(I find I can’t help remarking that the final proposition makes me want to imagine a paper by Thomas Nagel entitled “What is it like to be a bot?”. And of course it turns out that various people have in fact written pieces with that title.)
I think I basically agree with all of this.
Taboo “know” and try to ask the question again, because I think you’re engaging in a category error when you posit that, for example, a neural network actually knows anything at all. That is, the concept of “knowledge” as it applies to a human being cannot be meaningfully compared to “knowledge” as it applies to a neural network; they aren’t the same kind of thing. A Go AI doesn’t know how to play Go; it knows the current state of the board. These are entirely different categories of things.
The closest thing I think the human brain has to the kind of “knowledge” that a neural network uses is the kind of thing we represent in our cultural narrative as, for example, a spiritual guru slapping you for thinking about doing something instead of just doing it. That is, we explicitly label this kind of thing, when it occurs in the human brain, as not-knowledge.
ETA:
You can move your arm, right? You know how to move your arms and your legs and even how to do complicated things like throw balls and walk around. But you don’t actually know how to do any of those things; if you knew how to move your arm—much less something complicated like throwing balls!—it would be a relatively simple matter for you to build an arm and connect it to somebody who was missing one.
Does this seem absurd? It’s the difference between knowing how to add and knowing how to use a calculator. Knowing how to add is sufficient information to build a simple mechanical calculator, given some additional mechanical knowledge—knowing how to use a calculator gives you no such ability.
Why do you believe that?
To make my own point that may be distinct from ACP’s: the point isn’t that neural networks don’t know anything. The point is that the level of description I’m operating on when I say that phrase is so imprecise that it doesn’t allow you to make exact demands like knowing “everything the NN does” or “exactly what the NN does,” for any system other than a copy of that same neural network.
If I make the verbal chain of reasoning “the NN can know things, I can know things, therefore I can know what the NN knows,” this chain of reasoning actually fails. Even though I’m using the same English word “know” both times, the logical consequences of the word are different each time I use it. If I want to make progress here, I’ll need to taboo the word “know.”
Because I think the word “know”, as used by a human understanding a model, is standing in for a particular kind of mirror-modeling, in which we possess a reproductive model of a thing in our mind which we can use to simulate a behavior, whereas the word “know”, as used by the referent AI, is standing in for “the set of information used to inform the development of a process”.
So an AI which has been trained on a game which it lost can behave “as if it has knowledge of that game”, when in fact the only remnant of that game may be a slightly adjusted parameter, perhaps a connection weighting somewhere is 1% different than it would otherwise be.
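A toy illustration of that point (made-up numbers, nothing to do with any real go bot): one gradient step on a lost game can leave behind nothing but a slightly nudged parameter.

```python
# One made-up parameter and one made-up gradient from the lost game.
w = 0.5                       # some connection weight before the game
learning_rate = 0.01
grad_from_lost_game = 0.5     # gradient of the loss on that one game

w_after = w - learning_rate * grad_from_lost_game
print(w, "->", w_after)       # 0.5 -> 0.495: a ~1% change is all that remains
```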
In order to “know” what the AI knows, in the sense that it knows it, you need a complete reproduction of the AI’s state. That is, if you know everything the AI actually knows, as opposed to the information-state that informed the development of the AI, then all you actually know is that this particular connection is weighted 1% differently. In order to meaningfully apply this knowledge, you must simulate the AI (you must know how all the connections interact in a holistic sense), in which case you don’t know anything: you’re just asking the AI what it would do, which is not meaningfully knowing what it knows in any useful sense.
Which is basically because it doesn’t actually know anything. Its state is an algorithm, a process; this algorithm could perhaps be dissected, broken down, simplified, and turned into knowledge of how it operates—but this is just another way of simulating and querying a part of the AI; critically, knowing how the AI operates is having knowledge that the AI itself does not actually have.
Because now we are mirror-modeling the AI, and turning what the AI is, which isn’t knowledge, into something else, which is.
I guess it seems to me that you’re claiming that the referent AI isn’t doing any mirror-modeling, but I don’t know why you’d strongly believe this. It seems false about algorithms that use Monte Carlo Tree Search as KataGo does (although another thread indicates that smart people disagree with me about this), but even for pure neural network models, I’m not sure why one would be confident that it’s false.
Because it’s expensive, slow, and orthogonal to the purpose the AI is actually trying to accomplish.
As a programmer, I take my complicated mirror models, try to figure out how to transform them into sets of numbers, try to figure out how to use one set of those numbers to create another set of those numbers. The mirror modeling is a cognitive step I have to take before I ever start programming an algorithm; it’s helpful for creating algorithms, but useless for actually running them.
Programming languages are judged as helpful in part by how well they do at pretending to be a mirror model, and as efficient by how well they completely ignore the mirror model when it comes time to compile/run. There is no program which is made more efficient by representing data internally as the objects the programmers created; efficiency gains are made in compilers by figuring out how to reduce away the unnecessary complexity the programmers created for themselves so they could more easily map their messy intuitions to cold logic.
Why would an AI introduce this step in the middle of its processing?
I’ve studied Go using AI and have heard others discuss the use of AI in studying Go. Even for professional Go players, the inability of the AI to explain why it gave a higher win rate to a particular move or sequence is a problem.
Even if you could program a tertiary AI which could query the Go-playing AI, analyze the calculations the Go-playing AI is using to make its judgements, and then translate that into English (or another language) so that this tertiary AI could explain why the Go-playing AI made a move, I would still disagree that even this hybrid system ‘knew’ how to play Go.
There is a definite difference between ‘calculating’ and ‘reasoning’, such that even a neural network with its training is, I think, really still just one big calculator, not a reasoner.
My take is:
I think making this post was a good idea. I’m personally interested in deconfusing the topic of universality (which should basically capture what “learning everything the model knows” means), and you brought up a good “simple” example to try to build intuition on.
What I would call your mistake is mostly 8, but a bit of the related ones (so 3 and 4?). Phrasing it as “can we do that” is a mistake in my opinion because the topic is very confused (as shown by the comments). On the other hand, I think asking the question of what it would mean is a very exciting problem. It also gives a more concrete form to the problem of deconfusing universality, which is important AFAIK to Paul’s approaches to alignment.
One operationalization of “know” in this case is being able to accurately predict every move of the Go AI. This is a useful framing, because instead of a hard pass/fail criterion, we can focus on improving our calibration.
Now the success criterion might be:
You have to be able to attain a Brier score of 0 in predicting the moves of the best go bot that you have access to.
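For concreteness, one way to score such predictions (my sketch, not anything the original challenge specifies) is the multi-class Brier score over the bot’s legal moves:

```python
import numpy as np

def brier_score(predicted_probs, actual_move_index):
    """Multi-class Brier score for one position.

    predicted_probs: your probabilities over the legal moves (should sum to 1).
    actual_move_index: index of the move the go bot actually played.
    A fully confident, correct prediction scores 0; worse predictions score more.
    """
    predicted = np.asarray(predicted_probs, dtype=float)
    outcome = np.zeros_like(predicted)
    outcome[actual_move_index] = 1.0
    return float(np.sum((predicted - outcome) ** 2))


# Example: 90% on the move the bot actually plays (index 2).
print(brier_score([0.05, 0.05, 0.90], actual_move_index=2))  # ≈ 0.015
```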
What’s missing are some necessary constraints.
Most likely, you want to prohibit the following strategies:
Running a second instance of the Go AI on the same position, and using as your prediction the move that instance #2 makes.
Manually tracing through the source code to determine what the output would be if it was run.
Memorizing the source code and tracing through it in your head.
Constraining the input positions to ones where every Go program would make the same move, then using the output of a different Go program as your prediction.
Corollary: you can’t use any automation whatsoever to determine what move to make. Any automated system that can allow you to make accurate predictions is effectively a Go program.
Overall, then, you might just want to prohibit the use of Turing machines. However, my understanding is that this results in a ban on algorithms. I don’t have enough CS to say what’s left to us if we’re denied algorithms.
Here’s a second operationalization of “know.” You’re allowed to train up using all the computerized help you want. But then, to prove your ability, you have to perfectly predict the output of the Go program on a set of randomly generated board positions, using only the power of your own brain. A softer criterion is to organize a competition, where participants are ranked by Brier score on this challenge.
However, this version of the success criterion is just a harder version of being an inhumanly good Go player. Not only do you have to play as well as the best Go program, you have to match its play. It’s the difference between being a basketball player with stats as good as Michael Jordan’s, and literally being able to copy his every move in novel situations indefinitely.
Neither of these operationalizations of the success criterion seems particularly interesting. Both are too restrictive to be relevant to AI safety.
Did you have a different operationalization in mind?
I was thinking more of propositional knowledge (well, actually belief, but it doesn’t seem like this was a sticking point with anybody). A corollary of this is that you would be able to do this second operationalization, but possibly with the aid of a computer program that you wrote yourself that wasn’t just a copy of the original program. This constraint is slightly ambiguous but I think it shouldn’t be too problematic in practice.
The actual thing I had in mind was “come up with a satisfactory operationalization”.
I’m going to assume it’s impossible for me, personally, to outplay the best Go AI I have access to. Given that, the requirement is for me to write a better Go AI than the one I currently have access to.
Of course, that would mean that my new self-written program is now the best Go AI. So then I would be back to square one.
There are weaker computational machines than Turing machines, like regexes. But you don’t really care about that; you just want to ban automatic reasoning. I think it’s impossible to succeed with that constraint: playing Go is hard, and people can’t just read code that plays Go well and “learn from it.”
One axis along which I’d like clarification: do you want a form of explanation which is learner-agnostic or learner-specific? It seems to me that traditional transparency/interpretability tools try to be learner-agnostic, but on the other hand the most efficient way to explain makes use of the learner’s pre-existing knowledge, inductive biases, etc.
In the learner-agnostic case, I think it will be approximately impossible to succeed at this challenge. In the learner-specific case, I think it will require something more than an interpretability method. This latter task will benefit from better and better models of human learning—in the limit I imagine something like a direct brain neuralink should do the trick...
On the learner-specific side, it seems to me Nisan is right when he said ‘The question is if we can compress the bot’s knowledge into, say, a 1-year training program for professionals.’ To that end, it seems like a relevant method could be an improved version of influence functions: something like finding the point in training when the go agent learned to make a better move than the pro, and highlighting the games (/moves) which taught it the improved play.
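A rough sketch of what I mean, with entirely hypothetical helpers (`best_move`, `value`, and the checkpoint/game bookkeeping are assumptions for illustration, not a real API): scan saved training checkpoints for the point where the agent started preferring the better move, then surface the games it trained on just before that point.

```python
def when_did_it_learn(checkpoints, position, pro_move):
    """Index of the first checkpoint whose preferred move improves on the pro's."""
    for i, model in enumerate(checkpoints):
        agent_move = model.best_move(position)            # hypothetical API
        if agent_move != pro_move and (
            model.value(position, agent_move) > model.value(position, pro_move)
        ):                                                # hypothetical API
            return i
    return None


def explanatory_games(checkpoints, games_seen_before_step, position, pro_move):
    """Training games seen just before the change: candidates for what taught it."""
    step = when_did_it_learn(checkpoints, position, pro_move)
    if step is None:
        return []
    return games_seen_before_step[step]
```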
I don’t know what you mean by “learner agnostic” or “learner specific”. Could you explain?
Not sure what the best way to formalize this intuition is, but here’s an idea. (To isolate this learner-agnostic/specific axis from the problem of defining explanation, let me assume that we have some metric for quantifying explanation quality, call it ‘R’, which is a function from <Model, learner, explanation> triples to real values.)
Define learner-agnostic explanation as optimizing for aggregate R across some distribution of learners—finding the one optimal explanation across this distribution. Learner-specific explanation optimizes for R taking the learner as an input—finding multiple optimal explanations, one for each learner.
The aggregation function in the learner-agnostic case could be the mean, or it could be a minimax function. The intuition for the minimax case is that it formalizes the task of coming up with the most accessible explanation possible.
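One way to write the two objectives down (notation mine, using the R assumed above, with M the model, e an explanation, and D a distribution over learners l):

```latex
% Learner-agnostic: one explanation optimized across the learner distribution,
% aggregated by the mean (left) or by the worst case (minimax, right).
e^{*}_{\text{agnostic}} = \arg\max_{e}\; \mathbb{E}_{l \sim D}\!\left[ R(M, l, e) \right]
\quad\text{or}\quad
e^{*}_{\text{agnostic}} = \arg\max_{e}\; \min_{l} R(M, l, e)

% Learner-specific: a separate optimal explanation for each learner.
e^{*}_{\text{specific}}(l) = \arg\max_{e}\; R(M, l, e)
```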
Things like influence functions, input-sensitivity methods, and automated concept discovery are all learner-agnostic. On the other hand, probing methods (e.g. as used in NLP) could maybe be called learner-specific. The variant of influence functions I suggested above is learner-specific.
In general, it seems to me that as the models get more and more complex, we’ll probably need explanations to be more learner-specific to achieve reasonable performance. Though perhaps learner-agnostic methods will suffice for answering general questions like ‘Is my model optimizing for a mesa-objective’?
I guess by ‘learner’ you mean the human, rather than the learned model? If so, then I guess your transparency/explanation/knowledge-extraction method could be learner-specific, and still succeed at the above challenge.
I’d say 1 and 7 (for humans). The way humans understand go is different from how bots understand go. We use heuristics. The bots may use heuristics too, but there’s no reason to think we could comprehend those heuristics. Considering the size of the state space, it seems that the bot has access to ways of thinking about go that we don’t, the same way a bot can look further ahead in a chess game than we could comprehend.