Note: I’m probably well below median commenter in terms of technical CS/ML understanding. Anyway...
I feel like a missing chunk of research could be described as “seeing DL systems as ‘normal,’ physical things and processes that involve electrons running around inside little bits of (very complex) metal pieces” instead of mega-abstracted “agents.”
The main reason this might be fruitful is that, at least intuitively and to my understanding, failures like “the AI stops just playing chess really well and starts taking over the world to learn how to play chess even better” involve a qualitative change beyond just “the quadrillion parameters adjust a bit to minimize loss even more,” one that eventually cashes out in a very different arrangement of the literal bits of metal and electrons.
And plausibly abstracting away from the chips and electrons means ignoring the mechanism that permits this change. Of course, this probably only makes sense if something resembling deep learning scales to AGI, but it seems that some very smart people think that it may!
I can understand why it would seem excessively abstract, but when we speak of agency, we are in fact talking about patterns in the activations of the GPU’s circuit elements. Specifically, we’d be talking about patterns of numerical feedback where the program forms a causal predictive model of a variable and then, based on that model’s output, does some form of model-predictive control, e.g. outputting bytes (floats, probably) that encode an action the action-conditional predictive model evaluates as likely to affect the variable.
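To make the pattern concrete, here is a minimal sketch of that control loop: the program queries an action-conditional predictive model and emits whichever action the model predicts moves the tracked variable toward a target. All names (`predict_next`, `mpc_step`) and the toy linear dynamics are illustrative assumptions, not any real system.

```python
def predict_next(state: float, action: float) -> float:
    """Toy action-conditional predictive model: next state = state + action."""
    return state + action

def mpc_step(state: float, target: float,
             candidate_actions=(-1.0, 0.0, 1.0)) -> float:
    """Pick the action whose predicted outcome lands closest to the target."""
    return min(candidate_actions,
               key=lambda a: abs(predict_next(state, a) - target))

# Run the loop: the controller steers the variable toward the target.
state, target = 0.0, 3.0
for _ in range(5):
    state = predict_next(state, mpc_step(state, target))
print(state)  # 3.0
```

Nothing here is mysterious: “agency” at this scale is just a few comparisons and a feedback loop running on the chip.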
Merely minimizing loss is insufficient to end up with this outcome in many cases. But on some datasets, with some problem formulations—ones we expect to come up, such as motor control of a robot walking across a room (for a trivial example), or selecting videos to maximize the probability that a user stays on the website—we can expect that the predictive model, if more precise about the future than a human’s, would let the GPU code select actions (motor actions or video selections) that it evaluates, via the predictive model, as more reliably reaching the target outcome (crossing the room, keeping the user on the site). The worry is that an agent general enough in purpose to form its own subgoals and evaluate them in the predictive model could end up doing multi-step plan chaining through this general world-simulator subalgorithm and realize it can attack its creators in one of a great many possible ways.
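“Multi-step plan chaining through a world-simulator subalgorithm” can itself be sketched mechanically: search over action sequences, scoring each by rolling it through a predictive model. The names (`simulate`, `plan`) and the toy additive dynamics below are assumptions for illustration only; the point is that planning, too, is ordinary code.

```python
from itertools import product

def simulate(state: int, actions) -> int:
    """Toy world model: each action adds its value to the state."""
    for a in actions:
        state += a
    return state

def plan(state: int, goal: int, horizon: int = 3, moves=(-1, 0, 2)):
    """Return the shortest action sequence the model predicts reaches the goal."""
    for depth in range(1, horizon + 1):
        for seq in product(moves, repeat=depth):
            if simulate(state, seq) == goal:
                return seq
    return None  # no plan found within the horizon

print(plan(0, 3))  # (-1, 2, 2)
```

The danger the comment gestures at is exactly this structure at scale: when the world model is rich enough, the sequences the search surfaces can include actions nobody intended to put in the menu.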
Ngl I did not fully understand this, but to be clear, I don’t think understanding alignment through the lens of agency is “excessively abstract.” In fact, I think I’d agree with the implicit default view that it’s largely the single most productive lens to look through. My objection to the status quo is that the scale/ontology/lens/whatever I was describing seems to be getting 0% of the research attention, when perhaps it should be getting 10 or 20%.
Not sure this analogy works, but if the NIH were spending $10B on cancer research, I would (prima facie, as a layperson) want >$0 but probably <$2B spent on looking at cancer as an atomic-scale phenomenon, and maybe some amount at an even smaller scale.
Yeah, I was probably too abstract in my reply. To rephrase: a thermostat (or other extremely small control system) is a perfectly valid example of agency; it’s not dangerously strong agency or any such thing. But my point is really to say that you’re on the right track here: looking at the micro-scale versions of things is very promising.
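The thermostat really is the whole pattern in miniature: sense a variable, compare it to a setpoint, act to close the gap. A toy sketch, with all constants (`heater_delta`, `drift`) made up for illustration:

```python
def thermostat_step(temp: float, setpoint: float,
                    heater_delta: float = 0.5, drift: float = -0.2) -> float:
    """One tick: heat if below the setpoint, then apply ambient heat loss."""
    if temp < setpoint:
        temp += heater_delta  # heater on
    temp += drift             # room always leaks heat
    return temp

# Starting from a cold room, the loop settles near the setpoint.
temp = 15.0
for _ in range(30):
    temp = thermostat_step(temp, setpoint=20.0)
print(round(temp, 1))
```

Everything dangerous about “strong agency” is a scaled-up, generalized version of this loop with a far better predictive model in place of the bare comparison.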