I mean, FTX happened in the last 6 months! That caused incredibly large harm to the world.
I agree, but I have very different takeaways on what FTX means for the Rationalist community.
I think the major takeaway is that human society is somewhat more adequate, relative to our values, than we tend to think, and that this matters.
To be blunt, FTX was always a fraud, because Bitcoin and cryptocurrency violate a fundamental axiom of good money: its value must be stable, or at least change slowly. The wildly unstable price of a single Bitcoin makes it a poor store of value, and the root issue is the deeply stupid idea of fixing the supply, which, combined with variable demand, leads to wild price swings.
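To see the mechanism in a toy model (a hypothetical sketch in Python, not a model of real Bitcoin markets; the quantity-theory-style price rule and all the numbers are my assumptions): if price is roughly demand divided by supply, a fixed supply passes every demand shock straight into the price, while a supply that expands and contracts with demand absorbs the shocks:

```python
import random

random.seed(0)

def simulate_prices(elastic_supply: bool, steps: int = 1000) -> list[float]:
    """Toy quantity-theory model: price = demand / supply."""
    supply = demand = 100.0
    prices = []
    for _ in range(steps):
        demand *= random.uniform(0.9, 1.1)  # variable demand shocks
        if elastic_supply:
            supply = demand                 # issuer expands/contracts with demand
        prices.append(demand / supply)
    return prices

for label, elastic in [("fixed supply", False), ("elastic supply", True)]:
    p = simulate_prices(elastic)
    print(f"{label}: price ranged from {min(p):.2f} to {max(p):.2f}")
```

With the fixed supply, the price wanders as wildly as demand does; with the elastic supply, it stays pinned at 1.0.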
It’s possible to salvage some value out of cryptocurrencies, but they can’t be tied to real money.
Most groups have way better ideas for money than Bitcoin and cryptocurrency.
OpenAI and Anthropic are two of the most central players in an extremely bad AI arms race that is causing enormous harm. I really feel like it doesn’t take a lot of imagination to think about how our extensive involvement in those organizations could be bad for the world. And a huge component of the Lightcone Offices was causing people to work at those organizations, as well as support them in various other ways.
I don’t agree, at least in this world, and this is related to a very important crux in AI alignment/AI safety: can it be solved solely via iteration and empirical work? My answer is yes, and one of the biggest examples is Pretraining from Human Feedback (PHF). I’ll explain why it’s the first real breakthrough of empirical alignment:
It almost completely avoids deceptive alignment, because it lets us specify the base goal as human values before the model has strong generalization capabilities, and because that goal is fairly simple and myopic, simplicity bias has much less incentive to make the model deceptively aligned. Basically, we pretrain in the base goal, which is far more outer-aligned than the standard maximum-likelihood (MLE) objective, and only then let the AI generalize. This inverts the usual order of alignment and capabilities: RLHF and other alignment methods first build capabilities and then try to align the model, which is of course not going to work nearly as well as PHF. In particular, under PHF, more capabilities mean better and better inner alignment by default.
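To make this concrete, here’s a minimal sketch of conditional training, the objective used in PHF. The control tokens, the toy scoring heuristic, and the threshold here are illustrative stand-ins (the real method uses a learned reward model over pretraining segments); the point is just that the preference signal is attached to the data before the model ever trains on it:

```python
# Hypothetical sketch of conditional training: tag raw pretraining text
# with control tokens based on a preference score, then train a language
# model on the tagged corpus with the ordinary next-token-prediction loss.

GOOD, BAD = "<|good|>", "<|bad|>"

def score(segment: str) -> float:
    """Stand-in for a learned human-preference model scoring in [0, 1]."""
    return 0.0 if "toxic" in segment else 1.0  # toy heuristic, not the real model

def tag_corpus(segments: list[str], threshold: float = 0.5) -> list[str]:
    """Prepend a control token to each segment before ordinary pretraining."""
    return [(GOOD if score(s) >= threshold else BAD) + s for s in segments]

corpus = ["a helpful explanation", "some toxic rant"]
print(tag_corpus(corpus))
# ['<|good|>a helpful explanation', '<|bad|>some toxic rant']

# At inference time you condition on GOOD (e.g. GOOD + prompt), so sampling
# stays on the preferred side of the distribution learned during pretraining.
```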
The objective that worked best for Pretraining from Human Feedback, conditional training, has a number of outer-alignment benefits over RLHF and fine-tuning, even setting aside its role in effectively solving inner alignment and preventing deceptive alignment.
One major benefit is that since training is offline, the model never has a way to affect the distribution of data we use for alignment, so there is never a way, or an incentive, to gradient-hack or shift the distribution. In essence, we avoid embedded-agency problems by recreating a Cartesian boundary that actually works in an embedded setting. While that boundary will likely fade away in time, we only need it to hold once; after that, we can dispense with it.
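Here’s a minimal sketch of that one-way data flow, with stub classes standing in for real training code (every name here is hypothetical): the alignment corpus is frozen before training starts, and nothing the model generates is ever written back into it, in contrast to an online loop where the model’s own samples become its next training data:

```python
class StubModel:
    """Stand-in for a real model; records what it was trained on."""
    def __init__(self) -> None:
        self.seen: list[str] = []
    def train_step(self, example: str) -> None:
        self.seen.append(example)
    def generate(self, prompt: str) -> str:
        return prompt + " [model sample]"

# Offline (PHF-style): the corpus is scored and frozen before training;
# nothing the model produces can flow back into its own training data.
frozen_corpus = ("<|good|>doc A", "<|bad|>doc B")  # immutable by construction
offline = StubModel()
for example in frozen_corpus:
    offline.train_step(example)       # one-way flow: corpus -> model

# Online (RLHF-style), for contrast: the model's outputs re-enter training,
# which is exactly the feedback channel that offline training removes.
online = StubModel()
for prompt in ["q1", "q2"]:
    sample = online.generate(prompt)  # model output...
    online.train_step(sample)         # ...becomes its own training data

print(offline.seen)  # the fixed corpus only
print(online.seen)   # contains the model's own samples
```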
Again, this shows alignment improving with scale, which is good, because it means we have found the holy grail of alignment: a competitive alignment scheme that scales well with model and data size and lets you crank capabilities up while getting better and better alignment results.
Here’s a link if you’re interested:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Finally, I don’t think you realize how well we did in getting companies to care about alignment, or how good it is that LLMs are being pursued before RL-first approaches, which means we can have simulators before agentic systems arise.