I love this work! It’s really cool to see interpretability on toy models in such a clear way.
The trend from memorization to generalization reminds me of the information bottleneck idea. I don’t know that much about it (read this Quanta article a while ago), but they appear to be making a similar claim about phase transitions. I believe this is the paper one would want to read to get a deeper understanding of it.
This is an excellent post! Thank you for sharing your thoughts! I too am very curious about many of these questions, although I’m also at a half-baked stage with a lot of it (I’d also love to have a better footing here!). But in any case, here are some thoughts (in no particular order).
I’ve been interested in the questions you pose around AlexNet for a while, in particular, how much computation is a function of observers assigning values versus an intrinsic property of the thing itself. And I agree this starts getting pretty weird and interesting when you consider that minds themselves are doing computations. Like, it seems pretty clear that if I write down a truth table on paper, it is not the paper or ink that “did” the computation, it was me. Likewise, if I take two atoms in a rock, call one 0, the other 1, then take an atom at a future state, call it 0, it seems clear that the computation “AND” happened entirely in my head and I projected it onto the rock (although I do think it’s pretty tricky to say why this is, exactly!). But what about if I rain marbles down on a circle inscribed in a square (the ratio of which “calculates” pi)? In this case it feels a bit less arbitrary, the circle and the square chalked on the ground are “meaningfully” relating to the computation, although it is me who is doing the bulk of the work (taking the ratio)? This feels a bit more middle ground to me. In any case, I do think there is a spectrum between “completely intrinsic to the thing” and “agents projecting their own computation on the thing” and that this is largely ignored but incredibly interesting and instructive for how computation actually works.
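For concreteness, here is a minimal sketch of the marble setup (my own illustration, not something from the post, and the function name is just made up): rain points uniformly on a square with a circle inscribed in it, and the fraction landing inside the circle approximates pi/4.

```python
import random

def rain_marbles(n: int = 1_000_000) -> float:
    """Drop marbles uniformly on the square [-1, 1] x [-1, 1], which has a
    unit circle inscribed in it, and return an estimate of pi."""
    inside = 0
    for _ in range(n):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:  # this marble landed inside the circle
            inside += 1
    # The chalk-and-marbles process only yields counts; "taking the ratio"
    # (and multiplying by 4) is the step the observer contributes.
    return 4 * inside / n

print(rain_marbles())  # ~3.14
```

Notice that the counts accumulate whether or not anyone looks, but the last two operations live entirely in whoever takes the ratio, which is the middle-ground feeling I mean.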
Relatedly, people often roll their eyes at the Chinese Room thought experiment (and rightly so, because I think the conclusions people draw about it with respect to AI are often misguided). But I also think that it’s pointing to a deep confusion about computation, one that I share. The standard take is that, okay, maybe the person doesn’t understand Chinese but the “room does,” because all of the information is contained inside of it. I’m not really convinced by this. For the same reason that the truth table isn’t “doing” the computation of AND, I don’t think that the book that contains the translation is doing any meaningful computation, and I don’t think the human inside understands Chinese, either (in the colloquial sense we mean when we don’t understand a foreign language). There was certainly understanding when that book was generated, but all of that generative machinery is absent in the room. So I think Searle is pointing at something interesting and informative here and I tentatively agree that the room does not understand Chinese (although I disagree with the conclusion that this means AI could never understand anything).
I do agree that input/output mappings are not a good mechanistic understanding of computation, but I would also guess that it’s the right level of abstraction for grouping different physical systems. E.g., the main similarity between the mechanical and electrical adder is that, upon receiving 1 and 1, both output 2, and so on.
I get confused about why minds have special status, e.g. “computation is a function of both the dynamics and an observer.” On the one hand, it feels intuitive that they are special, and I get what you mean. And on the other hand, minds are also just physical systems. What is it about a mind that makes something a computation when it wasn’t otherwise? It’s something about how much of the computation stems from the mind versus the device? And how “entangled” the mind is with the computation, e.g., whether states in the non-mind system are correlated with states in the mind? Which suggests that the thing is not exactly “mind-ness” but how “coupled” various physical states are to each other.
I also think that the adder systems are far less (or maybe not at all) observer-dependent computations, relative to the rock or truth table, in the sense that there is a series of physically coupled states (within the system itself) which reliably turn the same input into the same output. Like, there is this step of a person saying what the inputs “represent,” but the person’s mind, once the device is built, does not need to be entangled with the states in the machine in order for it to do the computation. The representation step seems important, but it seems less about the computation itself than about “how that computation is used.” Like, I think that when we look at isolated cases of computation (like the adders), this part feels weird because computation (as it normally plays out) is part of an interconnected system which “uses” the outputs of various computations to “do” something (like in a standard computer, the output of addition might be the input to the forward-prop in a neural net or whatever). And a “naked” computation is strange, because usually the “sense making” aspect of a computation is in how it’s used, not the steps needed to produce it. To be clear, I think the representation step is interesting (and notably the thing lacking in the Chinese Room), and I do think that it’s part of how computation is used in real-world contexts, but I still want to say that the adder “adds” whether we are there to represent the inputs as numbers or not. Maybe similar to how I want to say that the Chinese room “translates Chinese” whether or not anyone is there to do the “semantic work” of understanding what that means (which, in my view, is not a spooky thing, but rather something-something “a set of interconnected computations”).
Maybe a good way to think of these things is to ask “how much mind entanglement do you need at various parts of this process in order for the computation to take place?”
My guess is that computation is fundamentally something like “state reliably changes in response to other state.” Where both words (“state” and “reliably”) are a bit tricky to fully pin down and there are a bunch of thorny philosophical issues. For instance, “reliably” means something like “if I input the first state a bunch of times, the next state almost always follows”, but if-thens are hard to reconcile with deterministic world views. And “state” is typically referring to something abstract, e.g., we say “if the protein changes to this shape, then this gene is expressed,” but what exactly do we mean by “shape”? There is not a single, precise shape that works, there’s a whole class consisting of slight perturbations or different molecular constituents, etc. that will “get the job done,” i.e., express the gene. And without having a good foundation of what we mean by an abstraction, I think talking about natural computation can be philosophically difficult.
“Is there anything purely inside of AlexNet that can tell us that 1 in the output node means cat and that 0 means not cat?” I’m not sure exactly what you’re gesturing at with this, but my guess is that there is. I’m thinking of interpretability tools that show that cat features activate when shown a picture of a cat, and that these states reliably produce a “1” rather than a “0.” But maybe you’re talking about something else or have more uncertainty about it than me?
I agree that thinking is extremely wild! ’Nough said.
Good question! I don’t know, but I think that they don’t necessarily need to. Something I didn’t get into in the post but which is pretty important for understanding bacterial genomes is that they do horizontal gene transfer, which basically means that they trade genes between individuals rather than exclusively between parents and offspring.
From what I understand, this means that although on average the bacteria shed the unhelpful DNA if given the opportunity, so long as a few individuals within the population still have the gene, it can get rapidly reacquired when needed. I don’t know exactly how the math works out, but I’d guess that in big enough populations, if antibiotic encounters are somewhat common, then probably they don’t need to do it de novo each time?
This also means bacterial genomes are much more distributed than eukaryotic ones. So long as any individual bacterium has some gene, it’s “as if” the whole species has it. Which means their genomes are, in a sense, actually longer than they might naively seem. Being distributed has advantages: no single genome needs to be very long, yet the population can hold onto useful stuff. But it also has disadvantages: any adaptation that relies on genes being close together in a single genome is unlikely to develop (which includes e.g. all of the regulatory hierarchy stuff mentioned in the post). So I do still expect that the pressure towards short genomes meaningfully stunts bacterial complexity.
Thanks!!
Yeah I think it’s a great question and I don’t know that I have a great answer. Plasmids (small rings of DNA that float around separately) are part of the story. My understanding here is pretty sketchy, but I think plasmids are way more likely to be deleted than the chromosomal DNA, and for some reason antibiotic-resistance genes tend to be in plasmids (perhaps because they are shared so frequently through horizontal gene transfer)? So the “delete within a few hours” bit is probably overstating the average case of DNA deletion in bacteria. I would be surprised if it “knew” about the function of the gene, although I agree it seems possible that some epigenetic mechanism could explain it. I don’t know of any, though!
Ah, thanks! Link fixed now.
Yes, welp, I considered getting into this whole debate in the post but it seemed like too much of an aside. Basically, Lynch is like, “when you control for cell size, the amount of energy per genome is not predictive of whether it’s a prokaryote or a eukaryote.” In other words, on his account, the main determinant of bioenergetic availability appears to be the size of the cell, rather than anything energetically special about eukaryotes, such as mitochondria.
There are some issues here. First, most of the large prokaryotes are outliers like Thiomargarita, in the sense that they have expanded their energy without expanding their functional volume. However, their genomes are still quite small, which means that their “energy/genome” will be large. Eukaryotic cells of the same size have way more energy and way longer genomes, making their “energy/genome” roughly equivalent to the large prokaryotes.
Second, Lynch’s story is that strong selection keeps bacterial genomes short. The main reason that bacteria have strong selection is because there are so many of them, and there are so many of them because they’re so small. But why are they so small? It seems like an obvious contender is Lane’s story about them being energy bottlenecked by their surface area. So, in my opinion, these two hypotheses are synergistic and my best guess is that they’re both part of the story.
Thanks for writing this up! It seems very helpful to have open, thoughtful discussions about different strategies in this space.
Here is my summary of Anthropic’s plan, given what you’ve described (let me know if it seems off):
1. It seems likely that deep learning is what gets us to AGI.
2. We don’t really understand deep learning systems, so we should probably try to, you know, do that.
3. In the absence of a deep understanding, the best way to get information (and hopefully eventually a theory) is to run experiments on these systems.
4. We focus on current systems because we think that the behavior they exhibit will be a factor in future systems.
Leaving aside concerns about arms races and big models being scary in and of themselves, this seems like a pretty reasonable approach to me. In particular, I’m pretty on board with points 1, 2, and 3—i.e., if you don’t have theories, then getting your feet wet with the actual systems, observing them, experimenting, tinkering, and so on, seems like a pretty good way to eventually figure out what’s going on with the systems in a more formal/mechanistic way.
I think the part I have trouble with (which might stem from me just not knowing the relevant stuff) is point 4. Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones. But I’ve never seen, e.g., a list of such things and why they are important or cruxy enough to justify the emphasis on large models given the risks involved. I would really like to see such an argument! (Perhaps it does exist and I am not aware).
I also have a bit of trouble with the “top player” framing—at the moment I just don’t see why this is necessary. I understand that Anthropic works on large models, and that this is on par with what other “top players” in the field are doing. But why not just say that you want to work with large models? Why mention being competitive with DeepMind or OpenAI at all? The emphasis on “top player” makes me think that something is left unsaid about the motivation, aside from the emphasis on current systems. To the extent that this is true, I wish it were stated explicitly. (To be clear, “you” means Anthropic, not Miranda.)
I really value “realness” although I too am not sure what it is, exactly. Some thoughts:
I cannot stand fake wood or brick or anything fake really, because it feels like it is trying to trick me. It’s “lying,” in sort of the same way I feel like people lie when they say they are doing something because it helps climate change or whatever, when really it seems clear that they are doing it for social approval or something of that nature.

Moss feels very real to me, also, as do silky spider webs, or any slice of nature, really, when I’m in it. I think it’s because the moss is not pretending to be something else, not to me anyways, it’s just there.
Homes can be real-seeming to me, like how warm, cozy fireplaces with the wind whipping past the window and redwood walls make spaces seem inviting and true. But I think they can also be very not real. Many household things feel kind of “fake” to me, in the sense of trickery, like my microwave. It’s not really deceiving me in the sense that it is lying about itself—it will heat up my food if I press some buttons, but it’s like… asking something of me? Trying to get me to use it on its terms. Food containers with words on them also feel kind of like this… trying to get me to read them, to consume them, and so on…
“Trying for real” is an especially interesting one to me because it feels so important and I don’t know quite what it is. At least part of it seems related to trickery, like how “actually trying” to answer a question looks like not giving up until you have a satisfying-to-your-curiosity answer and “not really trying” looks more like getting a good-enough-to-pass-another-person’s-test answer, or not really believing it’ll work, or something like that. Where the “actually trying” bit seems much more fundamentally related to the thing the trying is about, hence “real,” the not-trying bit seems more related to something else entirely and that disconnect feels “fake” to me.
How does the redundancy definition of abstractions account for numbers, e.g., the number three? It doesn’t seem like “threeness” is redundantly encoded in, for example, the three objects on the floor of my room (rug, sweater, bottle of water) as rotation is in the gear example, since you wouldn’t be able to uncover information about “three” from any one object in particular.
I could imagine some definition based on redundancy capturing “threeness” by looking at a bunch of sets containing three things. But I think the reason the abstraction “three” feels a little strange on this account is that it is both highly natural (math!) but also can be highly “arbitrary,” e.g., “threeness” is wherever a mind can count three distinct objects (and those objects can be maximally unrelated!).
Perhaps counting the three objects on the floor of my room is a non-natural use case of the abstraction “three,” but if so, why? And where is the natural abstraction “three” in the world?
This reminds me a lot of one of Kuhn’s essays, A Function for Thought Experiments, where basically he’s like “people often conflate variables together; thought experiments can tease apart those conflations.” E.g., kids will usually start out conflating height with volume, so that even though they watch the experimenter pour the “same” amount of water into a taller, thinner glass, they will end up saying that the left-hand glass in (c) has more water than the one on the right.
Which is generally a good heuristic: height of water line and volume are usually pretty correlated. Eventually, though, experience brings these two variables into tension and kids will update their models. Kuhn argues that thought experiments are often playing this role, i.e., calling attention to and resolving conceptual tension between variables that were previously conflated.
In any case, I think the strategy “considering more possibilities” is really important for figuring out the “edges” of concepts… it feels sort of like “playing” with them until you have a “feel” for what they are… which seems related, to me, to your ideas about “indexing,” too. Anyways, I thought a bunch of these examples were great. I now find myself confused about waves.
I don’t see the difference between “resolution of uncertainty” and “difference between the world and a counterfactual.” To my mind, resolution of uncertainty is reducing the space of counterfactuals, e.g., if I’m not sure whether you’ll say yes or no, then you saying “yes” reduces my uncertainty by one bit, because there were two counterfactuals.
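Just to make the arithmetic behind “one bit” explicit (this is the standard Shannon entropy of the textbook case, nothing beyond it): a question with two equally likely answers carries

```latex
H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},
```

so hearing “yes” collapses the two counterfactuals into one and resolves exactly that bit.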
I think what Garrett is gesturing at here is more like “There is just one way the world goes, the robot cleans the room or it doesn’t. If I had all the information about the world, I would see the robot does clean the room, i.e., I would have no uncertainty about this, and therefore there is no relevant counterfactual. It’s not as if the robot could have not cleaned the room, I know it doesn’t. In other words, as I gain information about the world, the distance between counterfactual worlds and actual worlds grows smaller, and then so does… the optimization power? That’s weird.”
Like, we want to talk about optimization power here as “moving the world more into your preference ordering, relative to some baseline” but the baseline is made out of counterfactuals, and those live in the mind. So we end up saying something in the vicinity of optimization power being a function of maps, which seems weird to me.
Yudkowsky’s measure still feels weird to me in ways that don’t seem to apply to length, in the sense that length feels much more to me like a measure of territory-shaped things, and Yudkowsky’s measure of optimization power seems much more map-shaped (which I think Garrett did a good job of explicating). Here’s how I would phrase it:
Yudkowsky wants to measure optimization power relative to a utility function: take the rank of the state you’re in, take the total number of all states that have equal or greater rank, and then divide that by the total number of possible states. There are two weird things about this measure, in my opinion. The first is that it’s behaviorist (what I think Garrett was getting at about distinguishing between atom and non-atom worlds). The second is that it seems like a tricky problem to coherently talk about “all possible states.”
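In symbols, as I understand Yudkowsky’s proposal (my notation: S is the finite set of possible states, U the utility/preference ranking, s* the state actually attained, and the bit count comes from taking the negative log of that fraction):

```latex
\mathrm{OP}(s^*) = -\log_2 \frac{\left|\{\, s \in S : U(s) \ge U(s^*) \,\}\right|}{|S|}
```

So landing in the top 1/2^k of the ranking counts as k bits of optimization, regardless of how you got there. Both of my worries are visible in this formula: nothing in it mentions the process, and everything hinges on what counts as S.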
So, like, let’s say that we have two buttons next to each other. Press one button and get the world that maxes out your utility function. Press the other and, I don’t know, you get a taco. According to Yudkowsky’s measure, pressing one of these buttons is evidence of vastly more optimization power than the other even though, intuitively, these seem about “equally hard” from the agent’s perspective.
This is what I mean about it being “behaviorist”—with this measure you only care about which world state obtains (and how well that state ranks), but not how you got to that state. It seems clear to me that both of these are relevant in measuring optimization power. Like, conditioned on certain environments some things become vastly easier or harder. Getting a taco is easy in Berkeley, getting a taco is hard in a desert. And if your valuation of taco utility doesn’t change, then your optimization power can end up being largely a function of your environment, and that feels… a bit weird?
On the flip side, it’s also weird that it can vary so much based on the utility function. If someone is maximally happy watching TV at home all of the time, I feel hesitant to say that they have a ton of optimization power?
The thing that feels lacking in both of these cases, to me, is the ability to talk about how hard these goals are to achieve in reality (as a function of agent and environment). Because the difficulty of achieving the same world state can vary dramatically based on the environment and the agent. Grabbing a water bottle is trivial if there is one next to me, grabbing one if I have to construct it out of thermodynamic equilibrium is vastly harder. And importantly, the difference here isn’t in my utility function, but in how the environment shapes the difficulty of my goals, and in my ability as an agent to do these different things. I would like to say that the former uses less optimization power than the latter, and that this is in part a function of the territory.
You can perhaps rescue this by using a non-uniform prior over “all possible states,” and talk about how many bits it takes to move from that distribution to the distribution we want. So like, when I’m in the desert, the state “have a taco” is less likely than when I’m in Berkeley, therefore it takes more optimization power to get there.
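Concretely, the rescued measure would look something like this (my own formalization, not anything from Yudkowsky):

```latex
\mathrm{OP}(s^*) = -\log_2 \, P_{\text{default}}\!\bigl(U(s) \ge U(s^*)\bigr)
```

where the default distribution in the desert puts far less mass on “have a taco” than the Berkeley one does, so the same outcome costs more bits. But then we run into some other problems.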
The first is what Garrett points out, that probabilities are map things, and it’s a bit… weird for our measure of a (presumably) territory thing to be dependent on them. It’s the same sort of trickiness that I don’t feel we’ve properly sorted out in thermodynamics—namely, that if we take the existence of macrostates to be reflections of our uncertainty (as Jaynes does), then it seems we are stuck saying something to the effect of “ice cubes melt because we become more uncertain of their state,” which seems… wrong.
The second is that I claim that figuring out the “default” distribution is the entire problem, basically. Like, how do I know that a taco appearing in the desert is less likely than it is in Berkeley? How do I know that grabbing a bottle is more likely when there is a bottle rather than an equilibrium soup? Constructing the “correct” distribution over the default outcomes, to the extent that makes sense, seems to me to be the entire problem of figuring out what makes some tasks easier or harder, which is close to what we were trying to measure in the first place.
I do expect there is a way to talk about the correct default distribution, but that it’s tricky, and that part of why it’s so tricky is because it’s a function of both map and territory shaped things. In any case, I don’t think you get a sensible measure of optimization or other agency-terms if you can’t talk about them as things-in-the-territory (which neither of these measures really do); I’d really like to be able to. I also agree that an explanation (or measure) of atoms as Garrett laid out is unsatisfying; I feel unsatisfied here too, for similar reasons.
I am not totally sure that I disagree with you, but I would not say that agency is subjective and I’m going to argue against that here.
Clarifying “subjectivity.” I’m not sure I disagree because of this sentence “there’s a certain structure out in the world which people recognize as X, because recognizing it as X is convergently instrumental for a wide variety of goals.” I’m guessing that where you’re going with this is that the reason it’s so instrumentally convergent is because there is in fact something “out there” that deserves to be labeled as X, irrespective of the minds looking at it? Like, the fact that we all agree that oranges are things is because oranges basically are things, e.g., they contain the molecules we need for energy, have rinds, and so on, and these are facts about the territory; denying that would be bad for a wide variety of goals because you’d be missing out on something instrumentally useful for many goals, where, importantly, “usefulness” is at least in part a territory property, e.g., whether or not the orange contains molecules that we can metabolize. If this is what you mean, then we don’t disagree. But I also wouldn’t call an orange subjective, in the same way I wouldn’t call agency subjective. More on that later.
People modeling things differently does not necessarily imply subjectivity. It seems like your main point about agents being subjective is that “different people model different things as agents at different times.” This doesn’t seem sufficient to me. Like, people modeled heat as different things before we knew what it was, e.g., there was a time when people were arguing about whether it was a motion, a gas, or a liquid. But heat turned out to be “objective,” i.e., something which seems to exist irrespective of how we model it. Likewise, before Darwin there was some confusion over what different dog breeds were: many people considered them to be different “varieties” which was basically just a word for “not different species, but still kind of different.” Darwin claimed, and I believe him, that people would give different answers about whether these were different species or merely different varieties based on context and their history (e.g., if a naturalist had never seen dogs, then they’d probably call them different species, if they had, they’d call them different varieties). As it turns out, there’s an underlying “objective” thing here, which is how much their genomes differ from each other (I think? Not an evolutionary biologist :p). In any case, it seems to me that it is often the case that before scientific concepts are totally sussed out there is disagreement over how to model the thing they are pointing at, but that this doesn’t on its own imply that it’s inherently subjective.
A potential crux. There is a further thing you might mean here, what Dennett calls “the indeterminacy of interpretation,” which is that there is just no fact of the matter to what is agentic. Like, people might have disagreed about what heat was for a while, but it turned out that heat is more-or-less objective. The concept “hot,” on the other hand, is more subjective: just a property of how the neurons in a particular body/mind are tuned. In other words, the answer to whether something is hot is basically just “mu”—there is no fact of the matter about it. I am guessing that you think agency is of the latter type; I think it is of the former, i.e., I think we just haven’t pinned down the concepts in agency well enough to all agree on them yet, but that there is something “actually there” which we are pointing at. This might be our crux?
Abstractions are not all subjective? I am generally pretty confused by the stance that all “high-level abstractions” are subjective (although I don’t know what you mean by “high-level.”) I think (based on citing Jaynes) that you are saying something like “abstractions are reflections of our own ignorance.” E.g., we talk about temperature as some abstract thing because we are uncertain about the particular microstate that underlies it. But it seems to me that if you take this stance then you have to call basically everything subjective, e.g., an orange sitting right in front of me is subjective because I am ignorant of its exact atomic makeup. This seems a little weird to me, like oranges wouldn’t go away if we became fully certain about them? Likewise, I don’t think agency goes away if we become less ignorant of it.

Agents are more like “oranges” than they are like “hot.” To me, agents seem clearly in the “orange” category, rather than the “hot” category. Sure, we might currently call different things agents at different times, but to me it seems clear that there is something “real” there that exists aside from our perceptual machinery/interpretation layer. Like, the fact that agents consume order (negentropy) from their environment to spend on the order they care about is one such example of something “objective-ish” about agents, i.e., a real regularity happening in the territory, not just relative to our models of it.
Why do we disagree about what’s agentic, then? On my model, part of the reason that people vary on what they call agentic is because (I suspect) “agency” is not going to be a coherent concept in itself, but rather break out into multiple concepts which all contribute to our sense of it, such that many things we currently consider to be edge cases can be explained by one or a few factors being missing (or diminished). Likewise, I do expect that it is not entirely categorical, but that things can have more or less of it, and have more or less at different times (i.e., that a particular human varies in its ‘agent-ness’ over time). Neither of these seem incongruent to me with the idea that it’s objective-ish, just that we haven’t clarified what we mean by agency yet.
Meta: I have some gripes about the feedback loop focus in rationality culture, and I think this comment unfairly mixes a bunch of my thoughts about this topic in general with my thoughts in response to this post in particular—sorry in advance for that. I wish I was better at delineating between them, but that turned out to be kind of hard, and I have limited time and so on…
It is quite hard to argue against feedback loops in their broadest scope because it’s like arguing against updating on reality at all and that’s, as some might say, the core thing we’re about here. E.g., reflecting on your thought processes and updating them seems broadly good to me.
The thing that I feel more gripe-y about is something in the vicinity of these two claims: 1) Feedback loops work especially well in some domains (e.g., engineering) and poorly in others (e.g., early science). 2) Alignment, to the extent that it is a science, is early stage and using a feedback loop first mentality here seems actively harmful to me.
Where do feedback loops work well? Feedback loops (in particular, negative feedback loops), as they were originally construed, consist of a “goal state,” a way of checking whether your system is in line with the goal state or not, and a way of changing the current state (so as to eventually align it with the goal state). This setup is very back-chain focused. It assumes that you know what the target is and it assumes that you can progressively home in on it (i.e., converge on a particular state).
This works especially well in, e.g., engineering applications, where you have an end product in mind and you are trying out different strategies to get there. But one of the main difficulties with early stage science is that you don’t know what you’re aiming at, and this process seems (to me) to consist more of expanding the possibility space through exploration (i.e., hypothesis generation is about creating, not cleaving) rather than winnowing it.
For instance, it’s hard for me to imagine how the feedback loop first approach would have made Darwin much faster at noticing that species “gradually become modified.” This wasn’t even in his hypothesis space when he started his voyage on the Beagle (he assumed, like almost all other naturalists, that species were independently created and permanent). Like, it’s true that Darwin was employing feedback loops in other ways (e.g., trying to predict what rock formations would be like before he arrived there), and I buy that this sort of scientific eye may have helped him notice subtle differences that other people missed.
But what sort of feedback should he have used to arrive at the novel thought that species changed, when that wasn’t even on his radar to begin with? And what sort of training would make someone better at this? It doesn’t seem to me like practicing thinking via things like Thinking Physics questions is really the thing here, where, e.g., the right question has already been formulated. The whole deal with early stage science, imo, is in figuring out how to ask the right questions in the first place, without access to what the correct variables and relationships are beforehand. (I’m not saying there is no way to improve at this skill, or to practice it, I just have my doubts that a feedback loop first approach is the right one, here).
Where (and why) feedback loops are actively harmful. Basically, I think a feedback loop first approach overemphasizes legibility which incentivizes either a) pretending that things are legible where they aren’t and/or b) filtering out domains with high illegibility. As you can probably guess, I think early science is high on the axis of illegibility, and I worry that focusing too hard on feedback loops either a) causes people to dismiss the activity or b) causes people to prematurely formalize their work.
I think that one of the main things that sets early stage scientific work apart from other things, and what makes it especially difficult, is that it often requires holding onto confusion for a very long time (on the order of years). And usually that confusion is not well-formed, since if it were the path forward would be much more obvious. Which means that the confusion is often hard to communicate to other people, i.e., it’s illegible.
This is a pretty tricky situation for a human to be in. It means that a) barely anyone, and sometimes no one, has any idea what you’re doing, and to the extent they do, they think that it’s probably pointless or doomed, b) this makes getting money a bunch harder, and c) it is psychologically taxing for most people to be in a state of confusion—in general, people like feeling like they understand what’s going on. In other words, the overwhelming incentive is just to do the easily communicable thing, and it takes something quite abnormal for a human to spend years on a project that doesn’t have a specific end goal, and little to no outside-view legible progress.
I think that the thing which usually supports this kind of sustained isolation is an intense curiosity and an obsession with the subject (e.g., Paul Graham’s bus ticket theory), and an inside view sense that your leads are promising. These are the qualities (aside from g) that I suspect strongly contribute to early stage scientific progress and I don’t think they’re ones that you train via feedback loops, at least not as the direct focus, so much as playful thinking, boggling, and so on.
More than that, though, I suspect that a feedback loop first focus is actively harmful here. Feedback loops ask people to make their objectives clear-cut. But sort of the whole point of early science is that we don’t know how to talk about the concepts correctly yet (nor how to formalize the right questions or objectives). So the incentive, here, is to cut off confusion too early, e.g., by rounding it off to the closest formalized concept and moving on. This sucks! Prematurely formalizing is harmful when the main difficulty of early science is in holding onto confusion, and only articulating it when it’s clear that it is carving the world correctly.
To make a very bold and under-defended claim: I think this is a large part of the reason why a lot of science sucks now—people began mistaking the outcome (crisp, formalized principles) for the process, and now research isn’t “real” unless it has math in it. But most of the field-founding books (e.g., Darwin, Carnot) have zero or close to zero math! It is, in my opinion, a big mistake to throw formalizations at things before you know what the things are, much like it is a mistake to pick legible benchmarks before you know what you want a benchmark for.
Alignment is early stage science. I feel like this claim is obvious enough to not need defending but, e.g., we don’t know what any of the concepts are in any remotely precise (and agreed upon) sense: intelligence, optimization, agency, situational awareness, deception, and so on… This is distinct from saying that we need to solve alignment through science, e.g., it could be that alignment is super easy, or that engineering efforts are enough. But to the extent that we are trying to tackle alignment as a natural science, I think it’s safe to say it is in its infancy.
I don’t want feedback loop first culture to become the norm for this sort of work, for the reasons I outlined above (it’s also the sort of work I personally feel most excited about for making progress on the problem). So, the main point of this comment is like “yes, this seems good in certain contexts, but please let’s not overdo it here, nor have our expectations set that it ought to be the norm of what happens in the early stages of science (of which alignment is a member).”
Thanks for adding this to my conceptual library! I like the idea.
One thing I feel uncertain about (not coming from a background in evo bio) is how much evidence the “only evolved once” thing is for contingency. My naive guess would be that there’s an asymmetry here, i.e., that “evolved many times” is a lot of evidence for convergence, but “evolved only once” is only a small amount of evidence for contingency (on its own). For instance, I can imagine (although I don’t know if this has happened) that something highly favorable might evolve (such as a vascular system) and then spread so quickly that it subsumes the “market,” leaving no room for competitors? In this world, the feature might still be convergent in the sense that if you reran the tape of life you’d find it every time, even though you only saw it evolve once in our tape. I imagine there are other factors here which matter, too. Does that seem wrong?
Also, as an aside, my understanding is that the genetic code is not as contingent as it first seems. For instance, codons “semantically” near each other, e.g., with overlapping bases, also code for chemically similar amino acids. Similarly, the level of robustness the genetic code has (to translation errors) is extremely unlikely, i.e., if one were to scramble the code, you’d only get that robustness in something like one out of 10^38 possible codes (can’t remember the exact number right now, sorry, but it’s astronomically large).
> But when it comes to messy gene expression networks, we’ve already found the hidden beauty—the stable level of underlying physics. Because we’ve already found the master order, we can guess that we won’t find any additional secret patterns that will make biology as easy as a sequence of cubes. Knowing the rules of the game, we know that the game is hard. We don’t have enough computing power to do protein chemistry from physics (the second source of uncertainty) and evolutionary pathways may have gone different ways on different planets (the third source of uncertainty). New discoveries in basic physics won’t help us here.
>
> If you were an ancient Greek staring at the raw data from a biology experiment, you would be much wiser to look for some hidden structure of Pythagorean elegance, all the proteins lining up in a perfect icosahedron. But in biology we already know where the Pythagorean elegance is, and we know it’s too far down to help us overcome our indexical and logical uncertainty.
I’m a little confused about this account—in physics it seems like there are multiple levels of hidden beauty, e.g., the wave equation and Newtonian mechanics. What’s the reasoning for expecting only one level of “Pythagorean elegance” for a given phenomenon? Or to put it differently: if the first physical law that humanity had discovered was the wave equation, would you have predicted the existence of Newtonian laws of motion?
Yeah, there’s something weird going on here that I want to have better handles on. I sometimes call the thing Bengio does being “motivated by reasons.” Also the skill of noticing that “words have referents” or something?
Like, the LARPing quality of many jobs seems adjacent to the failure mode Orwell is pointing to in Politics and the English Language, e.g., sometimes people will misspell metaphors—”tow the line” instead of “toe the line”—and it becomes very clear that the words have “died” in some meaningful sense, like the speaker is not actually “loading them.” And sometimes people will say things to me like “capitalism ruined my twenties” and I have a similarly eerie feeling about it, like it’s a gestalt slapped together—some floating complex of words which themselves associate but aren’t tethered to anything else—and that ultimately it’s not really trying to point to anything.
Being “motivated by reasons” feels like “loading concepts,” and “taking them seriously.” It feels related to me to noticing that the world is “motivated by reasons,” as in, that there is order there to understand and that an understanding of it means that you get to actually do things in it. Like you’re anchored in reality, or something, when the default thing is to float above it. But I wish I had better words.
LARPing jobs is a bit eerie to me, too, in a similar way. It’s like people are towing the line instead of toeing it. Like they’re modeling what they’re “supposed” to be doing, or something, rather than doing it for reasons. And if you ask them why they’re doing whatever they’re doing they can often back out one or two answers, but ultimately fail to integrate it into a broader model of the world and seem a bit bewildered that you’re even asking them, sort of like your grandpa. Floating complexes—some internal consistency, but ultimately untethered?
Anyways, fwiw I think the advice probably varies by person. For me it was helpful to lean into my own taste and throw societal expectations to the wind :p Like, I think that when you’re playing the game “I should be comprehensible to basically anyone I talk to at every step of the way” then it’s much easier to fall into these grooves of what you’re “supposed” to be doing, and lapse into letting other people think for you. Probably not everyone has to do something so extreme, but for me, going hardcore on my own curiosity and developing a better sense of what it is, exactly, that I’m so curious about and why, for my own god damn self and not for anyone else’s, has gone a long way to “anchor me in reality,” so to speak.
This seems wrong to me in some important ways (at least as general theoretical research advice). Like, some of the advice you give seems to anti-predict important scientific advances.
> Generally, unguided exploration is seldom that useful.
Following this advice, for instance, would suggest that Darwin not go on the Beagle, i.e., not spend five years exploring the globe (basically just for fun) as a naturalist. But his experiences on the Beagle were exactly what led him to the seeds of natural selection, as he began to notice subtleties like how animals changed ever so slightly as one moves up a continent. It also seems like it screens out a bunch of Faraday’s experimental work on electricity, much of which he did because it seemed interesting or fun, rather than backchaining from some predetermined goal. Like, he has an entire lecture series on candles, which was mostly just him over and over saying “And isn’t it weird that this thing happens, too?? What happens if we change this?” And they’re great, and a lot of that exploratory work laid the groundwork for Maxwell’s later work on electromagnetism.
> Cutting off research avenues that are fun to think about, but ultimately not that productive.
Similarly, I think this is one of the main failure modes with modern scientific research. When I look at academia one of the things I’m most hoping for is that people follow their taste more, and that they have more fun! Because often things that are open-ended and fun to play around with hold a deeper kind of logic that you’re attracted to, but haven’t articulated yet. If you only stick to things that seem immediately productive then you (roughly) never find truly novel or cool ideas. E.g., both Babbage and Shannon tinkered around with different coding-type projects when they were younger (cipher cracking and barbed wire telegraphs, respectively), and I think it’s not crazy to assume that this sort of playing around with representing information abstractly may have helped with their later, more ambitious projects (general computers, information theory). Also, many Nobel prize winners say they wouldn’t have been able to do their seminal work in the current environment because, e.g., “Today I wouldn’t get an academic job. It’s as simple as that. I don’t think I would be regarded as productive enough.” (Higgs). Certainly, some things are dead ends and it can be a bit hard to know that in advance, but if you prematurely screen off all of them you screen off the great ideas, too.
I think Altman puts it nicely, here: “Good ideas—actually, no, great ideas are fragile. Great ideas are easy to kill…. All the best ideas when I first heard them sound bad. And all of us, myself included, are much more affected by what other people think of us and our ideas than we like to admit. If you are just four people in your own door, and you have an idea that sounds bad but is great, you can keep that self-delusion going. If you’re in a coworking space, people laugh at you, and no one wants to be the kid picked last at recess. So you change your idea to something that sounds plausible but is never going to matter. It’s true that coworking spaces do kill off the very worst ideas, but a band-pass filter for startups is a terrible thing because they kill off the best ideas, too.” (Emphasis mine). Likewise, I think it is perhaps quite load-bearing the way that many great scientists spent significant portions of their thinking years alone (famously, Newton did this when he came up with Principia, but Darwin and Shannon too, etc.)
> On timescales of days and weeks, you should be able to point to concrete examples that constitute “units of progress” towards your final goal.
This also feels pretty wrong to me. Certainly that would be nice and perhaps something to try to aim for, but I don’t think it’s always the case and I don’t think the lack of it is that strong of evidence in favor of “not making progress.” Again, using Darwin as an example—after he noticed that species were mutable he spent about a year and a half trying to figure out why. He had one main insight a few months in—that breeders introduced changes via artificial selection—but it took him some time to put together the way that nature could act as a selector. And in that year between “artificial” and “natural” selection, I would not say that he was making obvious, concrete progress on the solution because the solution wasn’t made from obvious steps. He had the right questions, and he read a lot, wrote a lot, talked to breeders, etc., but mostly he just held onto his confusion for a long time. And then one day, shortly after reading Malthus, the solution came to him in a flash of insight on a carriage ride. Certainly not all research looks like this, but I do think it’s an illustrative example of how good theoretical work can come out of non-obvious units of progress.
I know at the beginning you mentioned that this is advice for a particular kind of research from your perspective, and I do think that it’s useful in certain domains. But I worry it’s easy to forget, at the end of a document with many high-level tips, that it’s not general advice on how to do good theoretical alignment work, period. And because I do think that some of this advice anti-predicts great scientific work—in particular the sort that I think alignment is currently most lacking, and the sort that would be the most helpful, were we to have it—I wanted to push back a bit on the idea that many people might walk away with, i.e., that this is general advice for theoretical work in alignment.
Meta: I don’t want this comment to be taken as “I disagree with everything you (Thomas) said.” I do think the question of what to do when you have an opaque, potentially intractable problem is not obvious, and I don’t want to come across as saying that I have the definitive answer, or anything like that. It’s tricky to know what to do, here, and I certainly think it makes sense to focus on more concrete problems if deconfusion work didn’t seem that useful to you.
That said, at a high-level I feel pretty strongly about investing in early-stage deconfusion work, and I disagree with many of the object-level claims you made suggesting otherwise. For instance:
> The neuroscientists I’ve talked to say that a new scanning technology that could measure individual neurons would revolutionize neuroscience, much more than a theoretical breakthrough. But in interpretability we already have this, and we’re just missing the software.
It seems to me like the history of neuroscience should inspire the opposite conclusion: a hundred years of ever-increasing data collection at finer and finer resolution, and yet we still have a field that even many neuroscientists agree barely understands anything. I did undergrad and grad school in neuroscience and can at the very least say that this was also my conclusion. The main problem, in my opinion, is that theory usually tells us which facts to collect. Without it—without even a proto-theory or a rough guess, as with “model-free data collection” approaches—you are basically just taking shots in the dark and hoping that if you collect a truly massive amount of data, and somehow search over it for regularities, that theory will emerge. This seems pretty hopeless to me, and entirely backwards from how science has historically progressed.
It seems similarly pretty hopeless to me to expect a “revolution” out of tabulating features of the brain at fine-enough resolution. Like, I certainly buy that it gets us some cool insights, much like every other imaging advance has gotten us some cool insights. But I don’t think the history of neuroscience really predicts a “revolution,” here. Aside from the computational costs of “understanding” an object in such a way, I just don’t really buy that you’re guaranteed to find all the relevant regularities. You can never collect *all* the data, you have to make choices and tradeoffs when you measure the world, and without a theory to tell you which features are irrelevant and can safely be ignored, it’s hard to know that you’re ultimately looking at the right thing.
I ran into this problem, for instance, when I was researching cortical uniformity. Academia has amassed a truly gargantuan collection of papers on the structural properties of the human neocortex. What on Earth do any of these papers say about how algorithmically uniform the brain is? As far as I can tell, pretty much close to zero, because we have no idea how the structural properties of the cortex relate to the functional ones, and so who’s to say that “neuron subtype A is more dense in the frontal cortex relative to the visual cortex” is a meaningful finding or not? I worry that other “shot in the dark” data collection methods will suffer similar setbacks.
> Eliezer has written about how Einstein cleverly used very limited data to discover relativity. But we could have discovered relativity easily if we observed not only the precession of Mercury, but also the drifting of GPS clocks, gravitational lensing of distant galaxies, gravitational waves, etc.
It’s of course difficult to say how science might have progressed counterfactually, but I find it pretty hard to believe that relativity would have been “discovered easily” were we to have had a bunch of data staring us in the face. In general, I think it’s very easy to underestimate how difficult it is to come up with new concepts. I felt this way when I was reading about Darwin and how it took him over a year to go from realizing that “artificial selection is the means by which breeders introduce changes,” to realizing that “natural selection is the means by which changes are introduced in the wild.” But then I spent a long time in his shoes, so to speak, operating from within the concepts he had available to him at the time, and I became more humbled. For instance, among other things, it seems like a leap to go from “a human uses their intellect to actively select” to “nature ends up acting like a selector, in the sense that its conditions favor some traits for survival over others.” These feel like quite different “types” of things, in some ways.
In general, I suspect it’s easy to take the concepts we already have, look over past data, and assume it would have been obvious. But I think the history of science again speaks to the contrary: scientific breakthroughs are rare, and I don’t think it’s usually the case that they’re rare because of a lack of data, but because they require looking at that data differently. Perhaps data on gravitational lensing may have roused scientists to notice that there were anomalies, and may have eventually led to general relativity. But the actual process of taking the anomalies and turning that into a theory is, I think, really hard. Theories don’t just pop out wholesale when you have enough data, they take serious work.
> I heard David Bau say something interesting at the ICML safety workshop: in the 1940s and 1950s lots of people were trying to unlock the basic mysteries of life from first principles. How was hereditary information transmitted? Von Neumann designed a universal constructor in a cellular automaton, and even managed to reason that hereditary information was transmitted digitally for error correction, but didn’t get further. But it was Crick, Franklin, and Watson who used crystallography data to discover the structure of DNA, unraveling far more mysteries. Since then basically all advances in biochemistry have been empirical. Biochemistry is a case study where philosophy and theory failed to solve the problems but empirical work succeeded, and maybe interpretability and intelligence are similar.
This story misses some pretty important pieces. For instance, Schrödinger predicted basic features of DNA—that it was an aperiodic crystal—from first principles in his book What Is Life?, published in 1944. The basic reasoning is that in order to stably encode genetic information, the molecule should itself be stable, i.e., a crystal. But to encode a variety of information, rather than the same thing repeated indefinitely, it needs to be aperiodic. An aperiodic crystal is a molecule that can use a few primitives to encode near-infinite possibilities, in a stable way. His book was very influential, and Watson and Crick both credited Schrödinger with the theoretical ideas that guided their search. I also suspect their search went much faster than it would have otherwise; many biologists at the time thought that the hereditary molecule was a protein, of which there are tens of millions in a typical cell.
But, more importantly, I would certainly not say that biochemistry is an area where empirical work has succeeded to nearly the extent that we might hope it to. Like, we still can’t cure cancer, or aging, or any of the myriad medical problems people have to endure; we still can’t even define “life” in a reasonable way, or answer basic questions like “why do arms come out basically the same size?” The discovery of DNA was certainly huge, and helpful, but I would say that we’re still quite far from a major success story with biology.
My guess is that it is precisely because we lack theory that we are unable to answer these basic questions, and to advance medicine as much as we want. Certainly the “tabulate indefinitely” approach will continue pushing the needle on biological research, but I doubt it is going to get us anywhere near the gains that, e.g., “the hereditary molecule is an aperiodic crystal” did.
And while it’s certainly possible that biology, intelligence, agency and so on are just not amenable to the cleave-reality-at-its-joints type of clarity one gets from scientific inquiry, I’m pretty skeptical that this is the world we in fact live in, for a few reasons.
For one, it seems to me that practically no one is trying to find theories in biology. It is common for biologists (even bright-eyed, young PhDs at elite universities) to say things like (and in some cases this exact sentence): “there are no general theories in biology because biology is just chemistry which is just physics.” These are people at the beginning of their careers, throwing in the towel before they’ve even started! Needless to say, this take is clearly not true in all generality, because it would anti-predict natural selection. It would also, I think, anti-predict Newtonian mechanics (“there are no general theories of motion because motion is just the motion of chemicals which is just the motion of particles which is just physics”).
Secondly, I think that practically all scientific disciplines look messy, ad hoc, and empirical before we get the theories that tie them together, and that this does not on its own suggest biology is a theoretically bankrupt field. E.g., we had steam engines before we knew about thermodynamics, but they were kind of ad hoc, messy contraptions, because we didn’t really understand what variables were causing the “work.” Likewise, naturalism before Darwin was largely compendiums upon compendiums of people being like “I saw this [animal/fossil/plant/rock] here, doing this!” Science before theory often looks like this, I think.
Third: I’m just like, look guys, I don’t really know what to tell you, but when I look at the world and I see intelligences doing stuff, I sense deep principles. It’s a hunch, to be sure, and kind of hard to justify, but it feels very obvious to me. And if there are deep principles to be had, then I sure as hell want to find them. Because it’s embarrassing that at this point we don’t even know what intelligence is, nor agency, nor abstractions: how to measure any of it, predict when it will increase or not. These are the gears that are going to move our world, for better or for worse, and I at least want my hands on the steering wheel when they do.
I think that sometimes people don’t really know what to envision with theoretical work on alignment, or “agent foundations”-style work. My own vision is quite simple: I want to do great science, as great science has historically been done, and to figure out what in god’s name any of these phenomena are. I want to be able to measure that which threatens our existence, such that we may learn to control it. And even though I am of course not certain this approach is workable, it feels very important to me to try. I think there is a strong case for there being a shot, here, and I want us to take it.
I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power on a timer of unknown length. This is understandable, and I think it can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we’ll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”—will be set aside in favor of the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. When people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals—in a domain we don’t yet actually understand—the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude that we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea whether these measures in fact meaningfully relate to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—and in the absence of a realistic plan to get them, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does, and will hence, I think, feel more secure than is warranted.
I like this framework, but I think it’s still a bit tricky to figure out how to draw lines around agents/optimization processes.
For instance, I can think of ways to make a rock interact with far away variables by, e.g., coupling it to a human who presses various buttons based on the internal state of the rock. In this case, would you draw the boundary around both the rock and the human and say that that unit is “optimizing”?
That seems a bit weird, given that the human is clearly the “optimizer” in this scenario. And drawing a line around only the rock or only the human seems wrong too (the human is clearly using the rock to do this strange optimization process, and the rock is relying on the human for it to occur). Curious about your thoughts.
Also, I’m not sure that agents always optimize things far away from themselves. Bacteria follow chemical gradients (and this feels agent-y to me), but the chemicals are immediately present, both temporally and spatially. There is some sense in which bacteria are “trying” to get somewhere far away (the point of maximum concentration), but they’re also pretty locally achieving the goal, i.e., the actions they take in the present are very close in space and time to what they’re trying to achieve (eating the chemicals).
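To make that concrete, here is a tiny toy sketch (my own illustration, not part of the framework under discussion): a simulated “bacterium” that only ever senses the chemical concentration in its immediate neighborhood, yet whose purely local steps carry it to a peak that starts out far away in space and time.

```python
import numpy as np

# A toy chemotaxis sketch (my own illustration). The "bacterium" only
# senses concentrations in its immediate neighborhood, yet those purely
# local steps carry it to a distant concentration peak.

SOURCE = np.array([10.0, 10.0])  # location of the chemical source

def concentration(pos):
    """Chemical concentration: highest at the source, decaying with distance."""
    return np.exp(-np.sum((pos - SOURCE) ** 2) / 50.0)

pos = np.array([0.0, 0.0])  # the bacterium starts far from the source
eps, step = 0.01, 0.5       # sensing distance and movement step size

for _ in range(100):
    # Estimate the local gradient by comparing nearby concentrations,
    # analogous to a bacterium sampling as it tumbles and runs.
    grad = np.array([
        (concentration(pos + [eps, 0.0]) - concentration(pos)) / eps,
        (concentration(pos + [0.0, eps]) - concentration(pos)) / eps,
    ])
    if np.linalg.norm(grad) < 1e-12:
        break
    pos = pos + step * grad / np.linalg.norm(grad)  # small step uphill

print(pos)  # ends up near the source at (10, 10) via purely local sensing
```

Each individual step only responds to the concentration right where the bacterium is, which is the sense in which the “far away” target gets achieved through actions that are each entirely local.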