Another fairly common argument and motivation at OpenAI in the early days was the risk of “hardware overhang,” that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.
Could you clarify this bit? It sounds like you’re saying that OpenAI’s capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think there’s some substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which roughly overlaps with that “around 2017” period). I don’t see how those are compatible.
(If 1 firing = 1 bit, that should be 34 megabit ~= 4 megabyte.)
This random article (which I haven’t fact-checked in the least) claims a bandwidth of 8.75 megabit ~= 1 megabyte. So that’s like 2.5 OOMs higher than the number I claimed for Chinchilla. So yeah, it does seem like humans get more raw data.
(But I still suspect that Chinchilla gets more data if you adjust for (un)interestingness. Where totally random data and easily predictable/compressible data are uninteresting, and data that is hard-but-possible to predict/compress is interesting.)
There are about a billion seconds in 30 years. Chinchilla was trained on 1.4 trillion tokens. So for a human adult to have as much data as Chinchilla, we would need to process the equivalent of ~1400 tokens per second. I think that’s something like 2 kilobytes per second.
Inputs to the human brain are probably dominated by vision. I’m not sure how many bytes per second we see, but I don’t think it’s many orders of magnitude higher than 2 kB.
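(Here’s a rough back-of-the-envelope sketch of the arithmetic in these comments. The bytes-per-token figure is my own assumption, chosen to roughly match the ~2 kB/s number above; the 8.75 megabit/s figure is the one from the linked article.)

```python
import math

# Rough check of the numbers above; BYTES_PER_TOKEN is an assumption, not from the source.
SECONDS_IN_30_YEARS = 30 * 365 * 24 * 3600   # ~9.5e8, i.e. roughly a billion
CHINCHILLA_TOKENS = 1.4e12                   # Chinchilla's training set size

tokens_per_sec = CHINCHILLA_TOKENS / SECONDS_IN_30_YEARS     # ~1500 tokens/s
BYTES_PER_TOKEN = 1.5                                        # assumed; yields roughly 2 kB/s
chinchilla_bytes_per_sec = tokens_per_sec * BYTES_PER_TOKEN  # ~2.2e3 bytes/s

# The linked article's claimed human visual bandwidth: 8.75 megabit/s ~= 1 megabyte/s.
human_bytes_per_sec = 8.75e6 / 8

gap_in_ooms = math.log10(human_bytes_per_sec / chinchilla_bytes_per_sec)
print(f"{tokens_per_sec:.0f} tokens/s, {chinchilla_bytes_per_sec / 1e3:.1f} kB/s, gap ~{gap_in_ooms:.1f} OOMs")
# -> ~1480 tokens/s, ~2.2 kB/s, and a gap of ~2.7 orders of magnitude,
#    consistent with the "like 2.5 OOMs" estimate above.
```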
The acronym is definitely used for reinforcement learning. [“RLHF” “reinforcement learning from human feedback”] gets 564 hits on Google, [“RLHF” “reward learning from human feedback”] gets 0.
Reinforcement* learning from human feedback
Ah, I see, it was today. Nope, wasn’t trying to join. I first interpreted “next” Thursday as Thursday next week, and then “June 28” was >1 month off, which confused me. In retrospect, I could have deduced that it was meant to say July 28.
Also, next Thursday (June 28) at noon Pacific time is the Schelling time to meet in the Walled Garden and discuss the practical applications of this.
Is the date wrong here?
Some previous LW discussion on this: https://www.lesswrong.com/posts/9W9P2snxu5Px746LD/many-weak-arguments-vs-one-relatively-strong-argument
(author favors weak arguments; plenty of discussion and some disagreements in comments; not obviously worth reading)
This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. For example, this could happen if we found a perfect outer alignment objective, such that the only situation in which reward could deviate from the overseer’s preferences would be one where the AI entirely seized control of the reward.
But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn’t reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.
This is about the agreement karma, though, which starts at 0.
There’s a 1000-year-old vampire stalking LessWrong!? 16 is supposed to be three levels above Eliezer.
As the main author of the “Alignment”-appendix of the truthful AI paper, it seems worth clarifying: I totally don’t think that “train your AI to be truthful” in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:
While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.
In other words: I don’t think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.
At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it’s good if people (at least/especially people who wouldn’t otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)
I don’t think you’re into this theory of change, because I suspect that you think that anyone who isn’t directly aiming at the alignment problem has a negligible chance of contributing any useful progress.
I just wanted to clarify that the truthful AI paper isn’t evidence that people who try to hit the hard bits of alignment always miss — it’s just a paper doing a different thing.
(And although I can’t speak as confidently about others’ views, I feel like that last sentence also applies to some of the other sections. E.g. Evan’s statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all about trying to tackle alignment itself. If you want to critique Evan’s proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)
An un-aligned AI has the decision of acting to maximize its goals in training and getting a higher short-term reward, or deceptively pretending to be aligned in training, and getting a lower short-term reward.
If there is a conflict between these, that must be because the AI’s conception of reward isn’t identical to the reward that we intended. So even if we dole out higher intended reward during deployment, it’s not clear that that increases the reward that the AI expects after deployment. (But it might.)
The Tuesday-creature might believe that its decision is correlated with the Monday-creature. [...] If the correlation is strong enough and stopping values change is expensive, then the Tuesday-creature is best served by being kind to its Wednesday-self, and helping to put it in a good position to realize whatever its goals may be.
The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s predictions about what the Tuesday-creature would do. [...] If the Monday-creature is a good enough predictor of the Tuesday-creature, then the Tuesday-creature is best served by at least “paying back” the Monday-creature for all of the preparation the Monday-creature did
These both seem like very UDT-style arguments that wouldn’t apply to a naive EDT agent once they’d learned how helpful the Monday-creature was?
So based on the rest of this post, I would have expected these motivations to only apply if either (i) the Tuesday-creature was uncertain about whether the Monday-creature had been helpful or not, or (ii) the Tuesday creature cared about not-apparently-real-worlds to a sufficient extent (including because they might think they’re in a simulation). Curious if you disagree with that.
“suitors severely underestimate probability of being liked back”
Is this supposed to say ‘overestimate’? Regardless, what info from the paper is the claim based on? Since they’re only sampling stories where people were rejected, the stories will have disproportionately large numbers of cases where the suitors are over-optimistic, so that seems like it’d make it hard to draw general conclusions.
(For the other two bullet points: I’d expect those effects, directionally, just from the normal illusion of transparency playing out in a context where there are social barriers to clear communication. But haven’t looked at the paper to see whether the effect is way stronger than I’d normally expect.)
2 seems more worrying than reassuring. If you have to rely on human action, you’ll be slowed down. So AIs that can route around humans, or humans who delegate more decision-making to AI systems, will have a competitive advantage over those that don’t do that. If we’re talking about AGI + decent robotics, there’s in principle nothing that AIs need humans for.
3: “useless without full information” is presumably hyperbole, but I also object to weaker claims like “being 100x faster is less than half as useful as you think, if you haven’t considered that spying is non-trivial”. Random analogy: consider a conflict (e.g. a war or a competition between two firms) where one side (i) only gets to think and act on ~4 days per year (i.e. is ~100x slower), but (ii) gets a very well-secured room to discuss decisions in. Benefit (ii) doesn’t really seem to help much against the disadvantage from (i)!
(Small exception to Critch’s video looking like a still frame: There’s a dude with a moving hand at 0:45.)
Here’s a 1-year-old answer from Christiano to the question “Do you still think that people interested in alignment research should apply to work at OpenAI?”. Generally pretty positive about people going there to “apply best practices to align state of the art models”. That’s not exactly what Aaronson will be doing, but it seems like alignment theory should have even less probability of differentially accelerating capabilities.
Agree it’s not clear. Some reasons why they might:
If training environments’ inductive biases point firmly towards some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems using environments similar to the ones they were trained in, and hope that those AIs will end up with similar values.
Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Just like Paul says he’d be pretty happy with intelligent life that came from a distribution similar to the one our civilization came from.
Humans might need to use a bunch of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themselves. And when training even smarter AI, it’s a nice benefit to have cheap, copyable, trustworthy human-level overseers.
Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
(You have a lot of high-quality labor at this point, which really helps for interpretability and for making improvements in ways other than gradient descent.)
Hm, maybe there are two reasons why human-level AIs are safe:
1. A bunch of our alignment techniques work better when the overseer can understand what the AIs are doing (given enough time). This means that human-level AIs are actually aligned.
2. Even if the human-level AIs misbehave, they’re just human-level, so they can’t take over the world.
Under model (1), it’s totally ok that self-improvement is an option, because we’ll be able to train our AIs to not do that.
Under model (2), there are definitely some concerning scenarios here where the AIs e.g. escape onto the internet, use their code to get resources, duplicate themselves a bunch of times, and set up a competing AI development project, which might have an advantage in that it can, in some ways, care less about paying alignment taxes.