A significant fraction of the stuff I’ve read about AI safety has referred to AGIs “inspecting each other’s source code/utility function”. However, when I look at the most impressive (to me) results in ML research lately, everything seems to be based on doing a bunch of fairly simple operations on very large matrices.
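To make that concrete, here is a minimal sketch (plain numpy, with toy shapes and names I made up) of what the entire “program” of such a model amounts to once training is finished:

```python
import numpy as np

# Toy stand-in: a real model holds billions of learned floats; these are tiny.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((512, 4096))  # first weight matrix
W2 = rng.standard_normal((4096, 512))  # second weight matrix

def forward(x):
    # The whole "source code" of the trained model: a couple of matrix
    # multiplies plus a comparison (the ReLU), applied to the weight arrays.
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

output = forward(rng.standard_normal(512))
```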
I am confused, because I don’t understand how inspecting the “source code” in question would be a sensible operation when that code is a few billion floating point numbers plus a hundred lines describing what sequence of simple addition/multiplication/comparison operations transform the inputs and those billions of floating point numbers into outputs. But a bunch of people much smarter and more mathematically inclined than I am do seem to think it’s important, and the idea of recursive self-improvement with stable values seems to imply that it must be possible, which leads me to wonder if:
1. There’s some known transformation from “bag of tensors” to “readable and verifiable source code” that I’m not aware of. You can do something like “from tensorflow import describe_model” or use some comparably well-known tool, similar to how there are decompilers for taking an executable and making it fairly readable. The models are too large and poorly labelled for a human to actually verify that they do what is wanted, but a sufficiently smart machine would not have that problem. (A sketch of what the existing tools actually show follows this list.)
2. The expectation is that neural nets are not the architecture that a superhuman AGI would run on.
3. The expectation is that, in the worlds where we’re not completely doomed no matter what we do, neural nets are not the architecture that a superhuman AGI would run on.
4. The “source code” in question is understood to be the combination of the training data, the means of calculating the loss function, and the architecture (so the “source code” of a human would be that human’s combined life experiences plus the rules for how those experiences shaped the development of their brain, rather than the specific pattern of neurons in their brain).
5. Something else entirely.
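(Regarding option 1: the closest real things I know of to that hypothetical describe_model call look roughly like the sketch below. They report the architecture, layer shapes, and parameter counts, but nothing a human could read as verifiable logic.)

```python
import tensorflow as tf

# Hypothetical small model standing in for a real multi-billion-parameter one.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(512),
])

# What inspection gives you today: layer names, output shapes, and parameter
# counts -- the architecture -- while the behaviour itself lives in the
# millions (or billions) of numbers returned by model.get_weights().
model.summary()
for w in model.get_weights():
    print(w.shape)
```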
I suspect the answer is “5: something else entirely”, but I have no idea what the particulars of that “something else entirely” might look like, and it feels like it’s probably critical to understanding the discussion.
So I guess my question is: which, if any, of the above describes what is meant by inspecting source code in the context of AI safety?
I take “source code” as loosely meaning “everything that determines the behaviour of the AI, in a form intelligible to the examiner”. This might include any literal source code, hardware details, and some sufficiently recent snapshot of runtime state. Literal source code is just an analogy that makes sense to humans reasoning about behaviour of programs where most of the future behaviour is governed by rules fixed in that code.
The details provided cannot include future input and so do not completely constrain future behaviour, but the examiner may be able to prove things about future behaviour under broad classes of future input, and may be able to identify future inputs that would be problematic.
The broad idea is that, in principle, AGIs might be legible in that kind of way to each other, while humans are definitely not legible in that way to each other.
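As a toy illustration of the kind of legibility being pointed at (my own example, not something from the literature): for a policy whose source code is short and explicit, an examiner can establish a claim about its behaviour over a whole class of future inputs just by reading it, which is exactly what you cannot do by staring at a bag of weights.

```python
# Toy example: "proving" something about future behaviour under a broad class
# of inputs, for a policy whose source code is legible.

def legible_policy(offered_split: float) -> str:
    """Accept any offered split of at least 30%, otherwise reject."""
    return "accept" if offered_split >= 0.3 else "reject"

# Reading the code, an examiner can conclude: for every future input with
# offered_split >= 0.3, the output is "accept".  A crude spot-check of that
# claim over a discretised slice of the input class:
assert all(legible_policy(x / 1000) == "accept" for x in range(300, 1001))

# For a trained neural net, the analogous guarantee would require reasoning
# about billions of learned parameters rather than two lines of code.
```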