One big prize, or many small prizes like here?
First thoughts:
Context length is insanely long
Very good at predicting the next token
Knows many more abstract facts
These three things are all instances of being OOM better at something specific. If you consider the LLM somewhat human-level at the thing it does, this suggests that it’s doing it in a way which is very different from what a human does.
That said, I’m not confident about this; I can sense there could be an argument that this counts as human but ramped up on some stats, and not an alien shoggoth.
If I had to give only one line of advice to a randomly sampled prospective grad student: you don’t actually have to do what the professor says.
Ok. Then I’ll say that randomly assigned utilities over full trajectories are beyond wild!
The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.
I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
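To check my reading, here’s a toy numerical sketch (my own construction, not from the post; it ignores observations for simplicity):

```python
import numpy as np

# A toy check of the quoted claim, as I understand it: give every full
# action history (here, every length-3 action sequence over 3 actions)
# an i.i.d. Uniform(0,1) utility, find the optimal first action, and
# repeat many times. By symmetry, no first action should be favored.
rng = np.random.default_rng(0)
n_actions, horizon, trials = 3, 3, 10_000
first_action_counts = np.zeros(n_actions, dtype=int)
for _ in range(trials):
    # one utility value per full trajectory of actions
    u = rng.uniform(size=(n_actions,) * horizon)
    best_trajectory = np.unravel_index(np.argmax(u), u.shape)
    first_action_counts[best_trajectory[0]] += 1

print(first_action_counts / trials)  # roughly [1/3, 1/3, 1/3]
```

No action is instrumentally favored over the others, consistent with the “random twitching” conclusion; any regularity would have to come from extra structure in the utility assignment.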
It’s AI-based, so my guess is that it uses a lot of somewhat superficial correlates that could be gamed. I expect that if it went mainstream it would be Goodharted.
I expect Goodhart would hit particularly bad if you were doing the kind of usage I guess you are implying, which is searching for a few very well selected people. A selective search is a strong optimization, and so Goodharts more.
A more concrete example I have in mind, which maybe applies to the technology right now: there are people who are good at lying to themselves.
Yes, in general the state of the art is more advanced than looking at correlations.
You just need to learn when using correlations makes sense. Don’t assume that everyone is using correlations blindly; Statistics PhDs most likely decide whether to use them based on context, and know the limited ways in which what they say applies.
Correlations make total sense when the distribution of the variables is close to multivariate Normal. The covariance matrix, which can be written as a combination of variances + correlation matrix, completely determines the shape of a multivariate Normal.
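A minimal numerical sketch of this decomposition (the numbers are made up):

```python
import numpy as np

# The covariance matrix factors as Sigma = D @ R @ D, where
# D = diag(standard deviations) and R is the correlation matrix.
sd = np.array([1.0, 2.0])
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])
Sigma = np.diag(sd) @ R @ np.diag(sd)

# Together with the mean, Sigma pins down the multivariate Normal:
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)

# The sample correlation recovers (approximately) R.
print(np.corrcoef(samples.T))
```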
If the variables are not Normal, you can try to transform them to make them more Normal, using both univariate and multivariate transformations. This is a very common Statistics tool. Basic example: Quantile normalization.
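A minimal sketch, assuming the rank-based variant of quantile normalization (map each value through its empirical rank to the matching quantile of a standard Normal):

```python
import numpy as np
from scipy import stats

# Rank-based quantile normalization of a skewed variable.
rng = np.random.default_rng(0)
x = rng.lognormal(size=1_000)        # a made-up, heavily skewed variable

ranks = stats.rankdata(x)            # ranks 1..n, ties averaged
x_gauss = stats.norm.ppf(ranks / (len(x) + 1))

# x_gauss is approximately standard Normal, so correlations computed on
# it (and on variables transformed the same way) are more meaningful
# than correlations on the raw skewed x.
```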
As we get closer to maxing out
This is , right? (Feel free to delete this comment.)
There is also a hazy counting argument for overfitting:
It seems like there are “lots of ways” that a model could end up massively overfitting and still get high training performance.
So absent some additional story about why training won’t select an overfitter, it feels like the possibility should be getting substantive weight.
While many machine learning researchers have felt the intuitive pull of this hazy overfitting argument over the years, we now have a mountain of empirical evidence that its conclusion is false. Deep learning is strongly biased toward networks that generalize the way humans want—otherwise, it wouldn’t be economically useful.
I don’t know NN history well, but I have the impression that good NN training is not trivial. I expect that the first attempts at NN training went wrong in various ways, including overfitting. So, without already knowing how to train an NN without overfitting, you’d get some overfitting in your experiments. The fact that now, after someone has already poured their brain juice into finding techniques that avoid the problem, you don’t get overfitting, is not evidence that you shouldn’t have expected overfitting beforehand.
The analogy with AI scheming is: you don’t already know the techniques to avoid scheming. You can’t use as a counterargument a case in which the problem has already deliberately been solved. If you take that same case and put yourself in the shoes of someone who doesn’t already have the solution, you see that you’ll get the problem in your face a few times before solving it.
Then, it is a matter of whether it works like Yudkowsky says, that you may only get one chance to solve it.
The title says “no evidence for AI doom in counting arguments”, but the article mostly talks about neural networks (not AI in general), and the conclusion is
In this essay, we surveyed the main arguments that have been put forward for thinking that future AIs will scheme against humans by default. We find all of them seriously lacking. We therefore conclude that we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less.
“main arguments”: I don’t think counting arguments completely fill up this category. Example: the concept of scheming originates from observing it in humans.
Overall, I have the impression of some overstatement. It could also be that I’m missing some prior discussion context/assumptions, such that other background theory of yours says “humans don’t matter as examples” and “AI will be NNs and not other things”.
It’s quite successfully managed to urbanize its population and now seems to have reached the Lewis turning point, where young people who try to leave their villages to find work in cities often don’t find it and have to stay in their villages, in the much lower productivity jobs.
I can’t follow this. Wikipedia says that
The Lewis turning point is a situation in economic development where surplus rural labor is fully absorbed into the manufacturing sector. This typically causes agricultural and unskilled industrial real wages to rise.
So it looks like at the Lewis point there’s excess demand for workers, so they can find jobs. Instead you describe it as if there’s excess supply: the manufacturing sector does not need any more workers, so they can’t find jobs.
There’s only one way to know!
</joking> <=========
My guess is that things which are forbidden but somewhat complex (like murder instructions) have not really been hammered out from the base model as much as more easily identifiable things like racial slurs.
It should be easier to train the model to just never say the Latin word for black than to recognize instances of sequences of actions that lead to a bad outcome.
The latter requires more contextual evaluation, so maybe that’s why the safety training has not generalized well to the tool-usage behaviors; is “I’m using a tool” a different enough context that “murder instructions” + “tool mode” should count as a case distinct from “murder instructions” alone?
IF/IE (Yandere/Tsundere): Alice (the Yandere) pretends to like Bob but in fact is trying to manipulate him into doing what she wants, while Bob (the Tsundere) pretends to hate Alice but in fact is totally on-board with her agenda.
This description is a bit of a joke—I can’t even imagine what this mode would look like, let alone think of any real-world examples.
Maybe love things? Or female things?
I want to give that conclusion a Bad Use of Statistical Significance Testing. Looking at the experts, we see a quite obviously significant difference. There is improvement here across the board, this is quite obviously not a coincidence. Also, ‘my sample size was not big enough’ does not get you out of the fact that the improvement is there – if your study lacked sufficient power, and you get a result that is in the range of ‘this would matter if we had a higher power study’ then the play is to redo the study with increased power, I would think?
My immediate take on seeing the thing as you report it:
Please state whether the bars are 68%, 90%, or 95% intervals.
Totally agree on sidestepping significance testing and instead asking “does the posterior distribution over possible effect sizes include an effect size I consider relevant?”
“There is improvement here across the board, this is quite obviously not a coincidence.”: I would need more details to feel that confident. The first thing I’d look at is the correlation structure of those scores; could it be that they are just repeating mostly the same information over and over?
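To illustrate the worry with made-up numbers: if the scores are strongly correlated, “improvement across the board” is cheap under the null.

```python
import numpy as np

# Simulate null score differences (no real effect) on 10 measures with
# pairwise correlation 0.9, and see how often all 10 come out positive
# purely by chance.
rng = np.random.default_rng(0)
n_measures, rho = 10, 0.9
cov = rho * np.ones((n_measures, n_measures)) + (1 - rho) * np.eye(n_measures)
diffs = rng.multivariate_normal(np.zeros(n_measures), cov, size=100_000)

print(np.mean((diffs > 0).all(axis=1)))
# In my rough estimate this is around 0.3, orders of magnitude larger
# than the 2**-10 ≈ 0.001 you'd get if the measures were independent.
```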
Paper argues that transformers are a good fit for language but terrible for time series forecasting, as the attention mechanisms inevitably discard such information. If true, then there would be major gains to a hybrid system, I would think, rather than this being a reason to think we will soon hit limits. It does raise the question of how much understanding a system can have if it cannot preserve a time series.
That paper got a reply one year later: “Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)” (haven’t read either one).
I think that instead of considering random words as a baseline reference (Fig. 2), you should take the alphabet plus the space symbol, generate a random i.i.d. sequence of them, and then index words in that text. This won’t give a uniform distribution over words. It is total gibberish, but I expect it would follow Zipf’s law all the same (see the sketch after the references below), based on these references I found on Wikipedia:
Wentian Li (1992), “Random Texts Exhibit Zipfs-Law-Like Word Frequency Distribution”
V. Belevitch (1959), “On the statistical laws of linguistic distributions”
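Here is the kind of baseline I mean (my own sketch):

```python
import numpy as np
from collections import Counter

# Draw characters i.i.d. from the 26-letter alphabet plus space, split
# on spaces, and count the resulting "words". Per Li (1992), the
# rank-frequency curve should still look Zipf-like even though the text
# is pure gibberish.
rng = np.random.default_rng(0)
alphabet = list("abcdefghijklmnopqrstuvwxyz ")
text = "".join(rng.choice(alphabet, size=1_000_000))

freqs = sorted(Counter(text.split()).values(), reverse=True)

# Zipf's law predicts log(frequency) falling roughly linearly in log(rank).
for rank in (1, 10, 100, 1_000):
    print(rank, freqs[rank - 1])
```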
I’d also show an example of the “ChatGPT gibberish” produced.
Do you think CERN is an example of what you want?
And Italy also definitely uses the green socket, and a larger version of the three-dots-in-a-line socket. Many sockets on the market just accommodate everything at once.
I remembered this when I read the following excerpt in Meaning and Agency:
In Belief in Intelligence, Eliezer sketches the peculiar mental state which regards something else as intelligent:
Imagine that I’m visiting a distant city, and a local friend volunteers to drive me to the airport. I don’t know the neighborhood. Each time my friend approaches a street intersection, I don’t know whether my friend will turn left, turn right, or continue straight ahead. I can’t predict my friend’s move even as we approach each individual intersection—let alone, predict the whole sequence of moves in advance.
Yet I can predict the result of my friend’s unpredictable actions: we will arrive at the airport.
[...]
I can predict the outcome of a process, without being able to predict any of the intermediate steps of the process.

In Measuring Optimization Power, he formalizes this idea by taking a preference ordering and a baseline probability distribution over the possible outcomes. In the airport example, the preference ordering might be how fast they arrive at the airport. The baseline probability distribution might be Eliezer’s probability distribution over which turns to take—so we imagine the friend turning randomly at each intersection. The optimization power of the friend is measured by how well they do relative to this baseline.
I think this can be a useful notion of agency, but constructing this baseline model does strike me as rather artificial. We’re not just sampling from Eliezer’s world-model. If we sampled from Eliezer’s world-model, the friend would turn randomly at each intersection, but they’d also arrive at the airport in a timely manner no matter which route they took—because Eliezer’s actual world-model believes the friend is capably pursuing that goal.
So to construct the baseline model, it is necessary to forget the existence of the agency we’re trying to measure while holding other aspects of our world-model steady. While it may be clear how to do this in many cases, it isn’t clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an “agency detector” at some point; you have to be able to draw a circle around the agent in order to selectively forget it. So this is more of an after-the-fact sanity check for locating agents, rather than a method of locating agents in the first place.
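For concreteness, a minimal sketch of that measure as I understand it (my own code, not from either post):

```python
import numpy as np

# Optimization power in bits: -log2 of the baseline probability of
# doing at least this well on the preference ordering.
def optimization_power_bits(achieved, baseline_outcomes):
    # baseline_outcomes: outcome scores sampled from the baseline model,
    # e.g. arrival speeds if the friend turned randomly at each intersection.
    p = np.mean(np.asarray(baseline_outcomes) >= achieved)
    return -np.log2(p)

# Made-up example: arriving as fast as the top 1% of random-turn
# trajectories counts as -log2(0.01) ≈ 6.6 bits of optimization.
rng = np.random.default_rng(0)
baseline = rng.normal(size=100_000)
print(optimization_power_bits(np.quantile(baseline, 0.99), baseline))
```

Note how the baseline samples have to come from a model in which the agency has already been surgically removed, which is exactly the difficulty the excerpt points at.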
I guess the statement on the front page, “Announcement: Hi, I think my best content is on Twitter/X right now”, means that they do not keep the front page very up to date.
Your argument would imply that competition begets worse products?