If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
This is outstanding. I’ll have other comments later, but first I wanted to praise how this is acting as a synthesis of lots of previous ideas that weren’t ever at the front of my mind.
I think that for the last month, for some reason, people have been going around overstating how aligned humans are with past humans.
If you put people from 500 years ago in charge of the galaxy, they’d have screwed it up according to my standards. Bigotry, war, cruelty to animals, religious nonsense, lack of imagination and so on. And conversely, I’d screw up the galaxy according to their standards. And this isn’t just some quirky fact about 500 years ago, all of history and pre-history is like this, we haven’t magically circled back around to wanting to arrange the galaxy the same way humans from a million years ago would.
I think when people talk about how we are aligned with past humans, they are not thinking about how humans from 500 years ago used to burn cats alive for entertainment. They are thinking about how humans feel love, and laugh at jokes, and like the look of healthy trees and symmetrical faces.
But the thing is, those things seem like human values, not “what they would do if put in charge of the galaxy,” precisely because they’re the things that generalize well even to humans of other eras. Defining alignment as those things being preserved is painting the target on after the bullet has been fired.
Now, these past humans would probably drift towards modern human norms if put in a modern environment, especially if they start out young. (They might identify this as value drift and put in safeguards against it—the Amish come to mind—but they might not. I would certainly like to put in safeguards against value drift that might be induced by putting humans in weird future environments.) But if the original “humans are aligned with the past” point was supposed to be that humans’ genetic code unfolds into optimizers that want the same things even across changes of environment, this is not a reassurance.
How much interesting stuff do you think there is in your curriculum that hasn’t percolated into the community? What’s stopping said percolation?
I still endorse my Reducing Goodhart sequence
Humans don’t have our values written in Fortran on the inside of our skulls, we’re collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It’s not that there’s some pre-theoretic set of True Values hidden inside people and we’re merely having trouble getting to them—no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like “which atoms exactly count as part of the person” and “what do you do if the person says different things at different times?”
The natural framing of Goodhart’s law—in both mathematics and casual language—makes the assumption that there are some specific True Values in there, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty-gritty of how to model humans and infer their values.
I noticed myself mentally grading the entries by some extra criteria. The main ones being something like “taking-over-the-world competitiveness” (TOTWC, or TOW for short) and “would I actually trust this farther than I could throw it, once it’s trying to operate in novel domains?” (WIATTFTICTIOITTOIND, or WIT for short).
A raw statement of my feelings:
Reinforcement learning + transparency tool: High TOW, Very Low WIT.
Imitative amplification + intermittent oversight: Medium TOW, Low WIT.
Imitative amplification + relaxed adversarial training: Medium TOW, Medium-low WIT.
Approval-based amplification + relaxed adversarial training: Medium TOW, Low WIT.
Microscope AI: Very Low TOW, High WIT.
STEM AI: Low TOW, Medium WIT.
Narrow reward modeling + transparency tools: High TOW, Medium WIT.
Recursive reward modeling + relaxed adversarial training: High TOW, Low WIT.
AI safety via debate with transparency tools: Medium-Low TOW, Low WIT.
Amplification with auxiliary RL objective + relaxed adversarial training: Medium TOW, Medium-low WIT.
Amplification alongside RL + relaxed adversarial training: Medium-low TOW, Medium WIT.
Because the noise usually grows as the signal does. Consider Moore’s law for transistors per chip. Back when that number was about 10^4, the standard deviation was also small—say 10^3. Now that density is 10^8, no chips are going to be within a thousand transistors of each other; the standard deviation is much bigger (~10^7).
This means that if you’re trying to fit the curve, being off by 10^5 is a small mistake when predicting current transistor counts, but a huge mistake when predicting past ones. It’s not rare or implausible now to find a chip with 10^5 more transistors, but back in the ’70s that difference would have been a huge error, impossible under an accurate model of reality.
A basic fitting function, like least squares, doesn’t take this into account. It will trade off transistors now vs. transistors in the past as if the mistakes were of exactly equal importance. To do better you have to use something like a chi-squared method, where you explicitly weight the points differently based on their variance. Or fit on a log scale using the simple method, which effectively assumes that the noise is proportional to the signal.
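A minimal sketch of the difference, with invented Moore’s-law-style numbers (the 20% noise level, the dates, and the four-year doubling time are all stand-ins of mine):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
years = np.arange(1971, 2021)
# Invented "truth": 10^4 transistors in 1971, doubling every 4 years,
# with noise proportional to the signal (20% relative scatter).
true_counts = 1e4 * 2 ** ((years - 1971) / 4)
observed = true_counts * (1 + 0.2 * rng.standard_normal(years.size))

def moore(year, n0, doubling_time):
    return n0 * 2 ** ((year - 1971) / doubling_time)

# Plain least squares: weights every point equally, so the huge absolute
# errors on recent chips dominate and the early decades barely matter.
popt_plain, _ = curve_fit(moore, years, observed, p0=[1e4, 4])

# Chi-squared-style fit: sigma gives each point its own noise scale
# (proportional to the signal), so early points count appropriately.
popt_chi2, _ = curve_fit(moore, years, observed, p0=[1e4, 4],
                         sigma=0.2 * observed, absolute_sigma=True)

# The cheap equivalent: unweighted fit on a log scale, which implicitly
# assumes multiplicative (signal-proportional) noise.
slope, intercept = np.polyfit(years - 1971, np.log2(observed), 1)
print(popt_plain, popt_chi2, (2 ** intercept, 1 / slope))
```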
I don’t think we should equate the understanding required to build a neural net that will generalize in a way that’s good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.
The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.
If your implicit argument is “In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize—there exists no emergent theory of the higher level properties that covers the domain we care about.” then I think that conclusion is way too hasty.
I am disappointed in the method here.
GPT is not a helpful AI that is trying to helpfully convey facts to you. It is, to first order, telling you what would be plausible if humans were having your conversation. For example, if you ask it what hardware it’s running on, it will give you an answer that would be plausible if this exchange showed up in human text; it will not actually tell you what hardware it’s running on.
Similarly, you do not learn anything about GPT’s own biases by asking it to complete text and seeing if the text means something biased. It is predicting human text. Since the human text it’s trying to predict exhibits biases… well, fill in the blank.
What I was so badly hoping this would be was an investigation of GPT’s biases, not the training dataset’s biases. For example, if in training GPT saw “cats are fluffy” 1000 times and “cats are sleek” 2000 times, when shown “cats are ” does it accurately predict “fluffy” half as much as “sleek” (at temperature=1), or is it biased, predicting some other ratio? And does that hold across different contexts? Is it different for patterns it’s only seen 1 or 2 times, or for patterns it’s seen 1 or 2 million times?
The belief inertia result is the closest to this, but still needs a comparison to the training data.
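For concreteness, here’s a hedged sketch of the kind of probe I mean, using GPT-2 via Hugging Face transformers (the model choice and the prompt are stand-ins of mine; the interesting version would compare the printed ratio against counts from the actual training corpus):

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` after `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The logits at position t-1 are the model's prediction for token t.
    return sum(logprobs[0, t - 1, full_ids[0, t]].item()
               for t in range(prompt_len, full_ids.shape[1]))

ratio = math.exp(continuation_logprob("cats are", " fluffy")
                 - continuation_logprob("cats are", " sleek"))
# If "fluffy" appeared half as often as "sleek" in this context in training,
# an unbiased predictor would put this ratio near 0.5.
print(ratio)
```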
Thanks, this was interesting.
I couldn’t really follow along with my own probabilities because things started wild from the get-go. You say we need to “invent algorithms for transformative AI,” when in fact we already have algorithms that are in-principle general; they’re just orders of magnitude too inefficient, and we’re making gradual algorithmic progress all the time. Checking the pdf, I remain confused about your picture of the world here. Do you think I’m drastically overstating the generality of current ML and the gradualness of algorithmic improvement, such that currently we are totally lacking the ability to build AGI, but after some future discovery (recognizable on its own merits and not some context-dependent “last straw”) we will suddenly be able to?
And your second question is also weird! I don’t really understand the epistemic state of the AI researchers in this hypothetical. They’re supposed to have built something that’s AGI, it just learns slower than humans. How did they get confidence in this fact? I think this question is well-posed enough that I could give a probability for it, except that I’m still confused about how to conditionalize on the first question.
The rest of the questions make plenty of sense, no complaints there.
In terms of the logical structure, I’d point out that inference costs staying low, producing chips, and producing lots of robots are all definitely things that could be routes to transformative AI, but they’re not necessary. The big alternate path missing here is quality. An AI that generates high-quality designs or plans might not have a human equivalent, in which case “what’s the equivalent cost at $25 per human hour” is a wrong question. Producing chips and producing robots could also happen or not happen in any combination and the world could still be transformed by high-quality AI decision-making.
It seems like you deliberately picked completeness because that’s where Dutch book arguments are least compelling, and that you’d agree with the more usual Dutch book arguments.
But I think even the Dutch book for completeness makes some sense. You just have to separate “how the agent internally represents its preferences” from “what it looks like the agent is doing.” You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But externally it looks like this agent is acting as if it assigns equal value to whatever incomparable things it first chose between. If humans don’t get to control the order this agent considers options, or if we let it run for a long time and it’s already experienced the things humans might try to present to it, then from then on it will look like it’s acting according to complete preferences.
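A toy sketch of that agent (my construction, not from the post): internally the preference relation is incomplete, with a memory bolted on; externally, once the memory fills in, its choices look like a complete preference ordering.

```python
class CautiousAgent:
    """Incomplete preferences plus a memory of past choices."""

    def __init__(self, strict_prefs):
        # strict_prefs: set of (better, worse) pairs. Pairs related in
        # neither direction are incomparable, not indifferent.
        self.strict_prefs = set(strict_prefs)
        self.memory = {}

    def choose(self, a, b):
        if (a, b) in self.strict_prefs:
            return a
        if (b, a) in self.strict_prefs:
            return b
        # Incomparable pair: remember the first pick and stick with it.
        # That consistency is what blocks the completeness money-pump.
        key = frozenset((a, b))
        self.memory.setdefault(key, a)
        return self.memory[key]

agent = CautiousAgent({("apple", "rock")})
# The first encounter with an incomparable pair fixes the answer forever,
# so from the outside it looks like a settled (complete) preference:
assert agent.choose("apple", "tea") == agent.choose("tea", "apple")
```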
The present is “good on its own terms”, rather than “good on Ancient Romans’ terms”, because the Ancient Romans weren’t able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there’s no reason to posit an Inherent Essence Of Goodness when we switch from discussing “moral progress after Ancient Rome” to “moral progress after circa-2022 civilization”.
The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn’t care. It’s not the type of thing that can care. So what are you trying to pack inside that phrase, “its own terms”?
If you mean it to sum up a meta-preference you hold about how moral evolution should proceed, then that’s fine. But is that really all? Or are you going to go reason as if there’s some objective essence of what the present’s “own terms” are—e.g. by trying to apply standards of epistemic uncertainty to the state of this essence?
“Because it’s literally false. There is no single attractor that all human values can be expected to fall into upon reflection.”
Could you be explicit about what argument you’re making here? Is it something like:
Even when two variables are strongly correlated, the most extreme value of one will rarely be the most extreme value of the other; therefore it’s <50% likely that different individuals’ CEVs will yield remotely similar results? (E.g., similar enough that one individual will consider the output of most other individuals’ CEVs morally acceptable?)
Or?:
The optimal world-state according to Catholicism is totally different from the optimal world-state according to hedonic utilitarianism; therefore it’s <50% likely that the CEV of a random Catholic will consider the output of a hedonic utilitarian’s CEV morally acceptable. (And vice versa.)
I’ll start by quoting the part of Scott’s essay that I was particularly thinking of, to clarify:
Our innate moral classifier has been trained on the Balboa Park – West Oakland route. Some of us think morality means “follow the Red Line”, and others think “follow the Green Line”, but it doesn’t matter, because we all agree on the same route.
When people talk about how we should arrange the world after the Singularity when we’re all omnipotent, suddenly we’re way past West Oakland, and everyone’s moral intuitions hopelessly diverge.
But it’s even worse than that, because even within myself, my moral intuitions are something like “Do the thing which follows the Red Line, and the Green Line, and the Yellow Line…you know, that thing!” And so when I’m faced with something that perfectly follows the Red Line, but goes in the opposite direction from the Green Line, it seems repugnant even to me, as does the opposite tactic of following the Green Line. As long as creating and destroying people is hard, utilitarianism works fine, but make it easier, and suddenly your Standard Utilitarian Path diverges into Pronatal Total Utilitarianism vs. Antinatalist Utilitarianism and they both seem awful. If our degree of moral repugnance is the degree to which we’re violating our moral principles, and my moral principle is “Follow both the Red Line and the Green Line”, then after passing West Oakland I either have to end up in Richmond (and feel awful because of how distant I am from Green), or in Warm Springs (and feel awful because of how distant I am from Red).
Okay, so.
What’s the claim I’m projecting onto Nate, that I’m saying is false? It’s something like: “The goal should be to avoid locking in any particular morals. We can do this by passing control to some neutral procedure that allows values to evolve.”
And what I am saying is something like: There is no neutral procedure. There is no way to avoid privileging some morals. This is not a big problem, it’s just how it is, and we can be okay with it.
Related and repetitive statements:
When extrapolating the shared train line past West Oakland, there are multiple ways to continue, but none of them are “the neutral way to do the extrapolation.”
The self-reflection function has many attractors for almost all humans, groups, societies, and AGI architectures. Different starting points might land us in different attractors, and there is no unique “neutral starting point.”
There are many procedures for allowing values to evolve, most of them suck, and picking a good one is an action that will bear the fingerprints of our own values. And that’s fine!
Human meta-preferences, the standards by which we judge what preference extrapolation schemes are good, are preferences. We do not have any mysterious non-preference standards for doing value aggregation and extrapolation.
There is not just one CEV that is the neutral way to do preference aggregation and extrapolation. There are lots of choices that we have to / get to make.
So as you can see, I wasn’t really thinking about differences between “the CEV” of different people—my focus was more on differences between ways of implementing CEV of the same people. A lot of these ways are going to be more or less equally good—like comparing your favorite beef stew vs. a 30-course modernist meal. But not all possible implementations of CEV are good; for example, you could screw up by exposing (modeled) people to extreme or highly-optimized stimuli when extrapolating them, leading to the AI causing large changes in the human condition that we wouldn’t presently endorse.
I don’t know what you mean by “egalitarianism”, or for that matter what you mean by “why”. Are you asking for an ode to egalitarianism? Or an argument for it, in terms of more basic values?
By egalitarianism I mean building an AI that tries to help all people, and be responsive to the perspectives of all people, not just a select few. And yes, definitely an ode :D
The goal should be to cause the future to be great on its own terms
What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations? Because I’ve got bad news for that plan.
Honestly, I’m disappointed by this post.
You say you’ve found yourself making this argument a lot recently. That’s fair. I think it’s totally reasonable that there are some situations where this argument could move people in the right direction—maybe the audience is considering defecting from the project of aligning AI with humanity, but would respond to orders from authority. Or maybe they’re outsiders who think you are going to defect, and you want to signal to them that you’re going to cooperate not just because it’s a good idea, but because it’s an important moral principle to you (as evolution intended).
But this is not an argument that you should just throw out scattershot. Because it’s literally false. There is no single attractor that all human values can be expected to fall into upon reflection. The primary advantage of AI alignment over typical philosophy is that when alignment researchers realize some part of what they were previously calling “alignment” is impossible, they can back up and change how they’re cashing out “alignment” so that it’s actually possible—philosophers have to keep caring about the impossible thing. This advantage goes away if we don’t use it.
Yes, plenty of people liked this post. But I’m holding you to a high standard. Somewhere people should be expected to not keep talking about the impossible thing. Somewhere, there is a version of this post that talks about or directly references:
Game-theoretic arguments for cooperation.
Why game-theoretic arguments are insufficient for egalitarianism (e.g. overly weighting the preferences of the powerful) but still mean that AI should be designed with more than just you in mind, even before accounting for a human preference for an egalitarian future.
Why egalitarianism is a beautiful moral principle that you endorse.
“Wait, wasn’t that this post?” you might say. Kind of! Making a plain ethical/aesthetic argument is like a magic trick where the magician tells you up front that it’s an illusion. This post is much the same magic trick, but the magician is telling you it’s real magic.
Realistic expectations for what egalitarianism can look like in the real world.
It cannot look like finding the one attractor that all human values converge to upon reflection because there is no one attractor that all human values converge to upon reflection.
Perhaps an analysis of how big the “fingerprints” of the creators of the AI are in such situations—e.g. by setting the meta-level standards for what counts as a “human value”.
There is a non-zero chance that the meta-preferences that end up in charge of the preferences that end up in charge of the galaxy will come from Mechanical Turkers.
As someone basically thinking alone (cue George Thorogood), I definitely would value more comments / discussion. But if someone has access to research retreats where they’re talking face to face as much as they want, I’m not surprised that they don’t post much.
Talking is a lot easier than writing, and more immediately rewarding. It can be an activity among friends. It’s higher-bandwidth to have a discussion face to face than over the internet. You can assume a lot more about your audience, which saves a ton of effort. When talking, you are more allowed to bullshit and guess and handwave and collaboratively think with the other person, and still be interesting, whereas when writing your audience usually expects you to be confident in what you’ve written. Writing is hard, reading is hard, and understanding what people have written is harder than understanding what people have said; if you ask for clarification, that might get misunderstood in turn. This all applies to comments almost as much as to posts, particularly on technical subjects.
The two advantages writing has for me are that I can communicate in writing with people who I couldn’t talk to, and that when you write something out you get a good long chance to make sure it’s not stupid. When talking it’s very easy to be convincing, including to yourself, even when you’re confused. That’s a lot harder in writing.
To encourage more discussion in writing one could try to change the format to reduce these barriers as much as possible—trying to foster one-to-one or small-group threads rather than one-to-many, fostering/enabling knowledge about other posters, creating a context that allows for more guesswork and collaborative thinking. Maybe one underutilized tool on current LW is the question thread. Question threads are great excuses to let people bullshit on a topic and then engage them in small-group threads.
Anxiety is a tendency to interpret ambiguous information in a threat-related manner.
I feel like giving the French credit for stew is a stretch even stretchier than giving them credit for thinly slicing meat.
Thank god for the French inventing stew, I say, so that the British, Spanish, Italians, Greeks, Germans, Russians, West Africans, North Africans, Northeast Native Americans, Aztecs, Mayans, Persians, Pakistanis, South Indians, Central Asians and Chinese could learn how to put ingredients in a pot and boil them.
See “good regulator theorem,” and various LW discussion (esp. John Wentworth trying to fix it). For practical purposes, yes, you can predict things without simulating them. The more revealing of the subject your prediction has to get, though, the more of an isomorphism to a simulation you have to contain.
But when you say Simulator, with caps, people will generally take you to be talking about janus’ Simulators post, which is not about the AI predicting people by simulating them in detail, but is instead about the AI learning dynamics of text (analogous to how the laws of physics are dynamics of the state of the world), and predicting text by stepping forward these dynamics.
Partisans of the other “hard problem” are also quick to tell people that the things they call research are not in fact targeting the problem at all. (I wonder if it’s something about the name...)
Much like the other hard problem, it’s easy to get wrapped up in a particular picture of what properties a solution “must” have, and construct boundaries between your hard problem and all those other non-hard problems.
Turning the universe to diamond is a great example. It’s totally reasonable that it could be strictly easier to build an AI to turn the world into diamond than it is to build an AI that is superhuman at doing good things, so that anyone claiming to have ideas about the latter should have even better ideas about the former. But that could also not be the case—the most likely way I see this happening is if solving the hard left turn problem has details that depend on how you want to load the values, so that genuinely hard-problem-addressing work on value learning could nonetheless not be useful for specifying simple goals. (It may only help you get the diamond-universe AI “the hard way”—by doing the entire value learning process except with a different target!)
Could you clarify what you mean by values not being “hack after evolutionary hack”?
What this sounds like, but I think you don’t mean: “Human values are all emergent from a simple and highly general bit of our genetic blueprint, which was simple for evolution to find and has therefore been unchanged more or less since the invention of within-lifetime learning. Evolution never developed a lot of elaborate machinery to influence our values.”
What I think you do mean: “Human values are emergent from a simple and general bit of our genetic blueprint (our general learning algorithm), plus a bunch of evolutionary nudges (maybe slightly hackish) to guide this learning algorithm towards things like friendship, eating when hungry, avoiding disgusting things, etc. Some of these nudges generalize so well they’ve basically persisted across mammalian evolution, while some of them humans only share with social primates, but the point is that even though we have really different values from chimpanzees, that’s more because our learning algorithm is scaled up and our environment is different, the nudges on the learning algorithm have barely had to change at all.”
What I think you intend to contrast this to: “Every detail of human values has to be specified in the genome—the complexity of the values and the complexity of the genome have to be closely related.”
Me: PhD in condensed matter experiment, brief read-through of the 3-person paper a few days ago, went and checked out the 6-person paper just now, read some other links as needed.
EDIT: If I’m reading their figure 4 correctly, I missed how impossible their magnetic susceptibility data was if not superconducting. My bad—I’ve sprinkled in some more edits as necessary for questions 1, 2, and 4.
Q1
Electrical leads can explain almost arbitrary phenomena. They measured resistivity with a four-point probe, where you flow a current between two outer wires and then check the voltage between two inner wires. If the inner wires for some reason don’t allow current to pass at small voltage (e.g. you accidentally made a Schottky diode, a real thing that sometimes happens), that can cause a spurious dip in resistivity.
The data isn’t particularly clean, and there are several ways it differs from what you’d expect. Here’s what a nice clean I-V curve looks like—symmetrical, continuous, flat almost to the limit of measurement below Tc, all that good stuff. Their I-V data is messier in every way. It’s not completely implausible, but if it’s real, why didn’t they take some better-looking data?
Yes, critical current changing with temperature is normal. In fact, if this is a superconductor, we can learn interesting things about it from the slope of critical current as a function of temperature near the critical temperature (does it look like √(Tc − T)?).
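As a sketch of that check (with invented numbers; real (T, Ic) points would come off the paper’s I-V curves):

```python
import numpy as np
from scipy.optimize import curve_fit

def ic_model(T, A, Tc):
    # Clip so the model stays defined if the fitter wanders past Tc.
    return A * np.sqrt(np.clip(Tc - T, 0.0, None))

# Invented (temperature, critical current) points for illustration only.
T_data = np.array([300.0, 320.0, 340.0, 360.0, 370.0])   # K
Ic_data = np.array([9.0, 7.8, 6.3, 4.4, 3.1])            # arbitrary units

(A_fit, Tc_fit), _ = curve_fit(ic_model, T_data, Ic_data, p0=[1.0, 380.0])
print(f"A = {A_fit:.2f}, Tc = {Tc_fit:.1f} K")
```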
The resistivity and levitation might be possible if only a tiny fraction of the material is superconducting, so long as there are 2D superconducting planes (a pattern that seems likely in a high-temperature superconductor) that can percolate through the polycrystalline material. However, I don’t see how this would work with the apatite structure (also the Griffin DFT paper says the band structure is 3D, and the Cu-Pb chains of claimed importance are 1D), so I think it’s more likely you would indeed have to have a high fraction of superconductor.
EDIT: I think their magnetic susceptibility data for sample 2, if correct, implies that the sample is at least 20% superconductor.
Q2
The video shows a surprising amount of diamagnetism, but it doesn’t really look like the Meissner effect, and isn’t so strong that it’s impossible to explain without it (especially since most of the weight of the sample is resting on the magnet). Locking in place isn’t strictly necessary, but especially in an impure material we should see a lot of pinning that prevents it from easily rotating. Russian catgirls are often untrustworthy.
EDIT: Actually, if I’m reading this right, figure 4a is pretty impossible without superconductivity. Score one for YES-Y, although the data looks very ugly (where’s the above-Tc region with no diamagnetism?).
The diamagnetism is still evidence that it’s a superconductor! It’s just even better evidence that it’s a non-superconducting strong diamagnet. The moderate difference they show between field-cooled and zero-field-cooled magnetization curves is likewise evidence either that it’s a superconductor or that it’s an ordered diamagnetic material.
Q3
Somewhere between YES-X and NO-C. First, DFT calculations are a good starting point but always require a grain of salt. Second, I think calling this a “flat band” is overhyping it—the density of states enhancement that makes flat bands so hype-worthy isn’t there as far as I can tell. Third, the hints of charge and spin waves in the material bear further study (if this is a superconductor they almost certainly are doing something interesting) but aren’t all that surprising given that you’ve jammed a bunch of heavy atoms together in a nontrivial crystal structure.
Q4
If getting it to conduct current at 0 resistance is as easy as they make it sound, they’ve probably replicated it a hundred times in three different ways. However, what if it’s tricky to get it hooked up to show superconductivity—you have to put the leads on just right, in some hard-to-understand way, and usually it doesn’t look superconducting… then wishful thinking has a lot more room to operate.
EDIT: The extreme diamagnetism measurement for sample 2 could just be a calibration error on a sensitive measurement, requiring neither fraud nor superconductivity.
Q5
No idea. They clearly know physics. They’re not maximally clear about everything, and I think they sweep data issues under the rug, but not in a way that makes me more suspicious conditional on the data.