Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com
habryka (Oliver Habryka)
If you have recommendations, post them! I doubt the author tried to filter the subjects very much by “book subjects”; it’s just the subjects for which people seem to have found good ones so far.
This probably should be made more transparent, but the reason these aren’t in the library is that they don’t have images for the sequence-item. We display all sequences that people create that have proper images in the library (otherwise we just show them on users’ profiles).
I think this just doesn’t work very well, because it incentivizes the model to output a token which makes subsequent tokens easier to predict, as long as the benefit in predictability of the subsequent token(s) outweighs the cost of the first token.
Hmm, this doesn’t sound right. The ground truth data would still be the same, so if you were to predict “aaaaaa” you would get the answer wrong. In the above example, you are presumably querying the log probs of the model that was trained on 1-token prediction, which of course would think it’s quite likely that conditional on the last 10 characters being “a” the next one will be “a”, but I am asking “what is the probability of the full completion ‘a a a a a...’ given the prefix ‘Once upon a time, there was a’”, which doesn’t seem very high.
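To make the distinction concrete, here is a minimal sketch with a made-up toy model (the `next_token_probs` function and all of its numbers are hypothetical, not taken from any real LLM). It shows how the last-step conditional probability of another “a” can be high while the joint probability of the whole “a a a a a” completion given the prefix stays low, because the joint probability is the product of every conditional, including the unlikely first step.

```python
import math

def next_token_probs(context):
    """Hypothetical toy model: next-token distribution given a list of tokens."""
    # Count how many "a" tokens the context currently ends with.
    trailing = 0
    for tok in reversed(context):
        if tok != "a":
            break
        trailing += 1
    if trailing >= 2:
        # Once the text is already repeating "a", another "a" looks likely.
        return {"a": 0.9, "princess": 0.1}
    # After an ordinary prefix, a repeated "a" is a poor guess.
    return {"a": 0.01, "princess": 0.99}

def joint_log_prob(prefix, completion):
    """log P(completion | prefix) as the sum of per-token conditional log-probs."""
    context = list(prefix)
    total = 0.0
    for tok in completion:
        total += math.log(next_token_probs(context)[tok])
        context.append(tok)
    return total

prefix = ["Once", "upon", "a", "time", ",", "there", "was", "a"]
completion = ["a"] * 5

# Conditional probability of the final "a", given that four "a"s were already emitted.
last_step = next_token_probs(prefix + completion[:-1])["a"]
# Joint probability of the full completion given only the prefix.
joint = math.exp(joint_log_prob(prefix, completion))
```

In this toy setup the single-step query and the whole-completion query give very different answers, which is the difference between the two probability questions in the comment above.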
The only thing I am saying here is “force the model to predict more than one token at a time, conditioning on its past responses, then evaluate the model on performance of the whole set of tokens”. I didn’t think super hard about what the best loss function here is, and whether you would have to whip out PPO for this. Seems plausible.
Yeah, I was indeed confused, sorry. I edited out the relevant section of the dialogue and replaced it with the correct relevant point (the aside here didn’t matter because a somewhat stronger condition is true, which is that during training we always just condition on the right answer instead of conditioning on the output for the next token in the training set).
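As a rough illustration of “conditioning on the right answer” versus conditioning on the model’s own output, here is a minimal sketch. The lookup-table “model” `toy_probs`, its numbers, and the greedy rollout are all assumptions made for illustration; real training code looks quite different.

```python
import math

def toy_probs(context):
    """Hypothetical stand-in for a language model; only looks at the last token."""
    table = {
        "<s>": {"the": 0.7, "cat": 0.2, "sat": 0.1},
        "the": {"the": 0.1, "cat": 0.6, "sat": 0.3},
        "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
        "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
    }
    return table[context[-1]]

def teacher_forced_contexts(target):
    """Contexts used in standard training: always the ground-truth prefix."""
    return [["<s>"] + target[:i] for i in range(len(target))]

def free_running_contexts(target):
    """Contexts the model would see if conditioned on its own greedy outputs."""
    context = ["<s>"]
    contexts = []
    for _ in target:
        contexts.append(list(context))
        probs = toy_probs(context)
        context.append(max(probs, key=probs.get))  # feed back the model's own guess
    return contexts

def nll(contexts, target):
    """Negative log-likelihood of the target tokens under the given contexts."""
    return -sum(math.log(toy_probs(c)[t]) for c, t in zip(contexts, target))

target = ["cat", "sat", "the"]
tf_loss = nll(teacher_forced_contexts(target), target)
fr_loss = nll(free_running_contexts(target), target)
```

Under teacher forcing, a bad guess at one position never changes the context used at the next position, so in this setup an output token cannot be “rewarded” for making later tokens easier to predict.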
In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way.
Yeah, the masking is what threw me off. I was trying to think about whether any information would flow from the internal representations used to predict the second token to predicting the third token, and indeed, if you were to backpropagate the error after each specific token prediction, then there would be some information from predicting the second token available to predicting the third token (via the updated weights).
However, batch sizes also make this inapplicable (I think you would basically never do a backpropagation after each token, since that would kind of get rid of the whole benefit of parallel training), and even without that, the amount of relevant information flowing this way would be minuscule and there wouldn’t be any learning going on for how this information flows.
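For readers less familiar with the masking discussed above, here is a minimal pure-Python sketch of a causal attention mask. The uniform scores are purely illustrative, and a real transformer would do this with tensor operations across heads and layers; this only shows the shape of the constraint.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out positions get exactly zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform raw scores, for illustration only
attn = masked_softmax(scores, causal_mask(n))
```

Each row of `attn` puts weight only on the current and earlier positions, so all positions can be computed in one parallel pass while still respecting the left-to-right order.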
I reference this in this section:
I do think saying “the system is just predicting one token at a time” is wrong, but I guess the way the work a transformer puts into token N gets rewarded or punished when it predicts token N + M feels really weird and confusing to me and still like it can be summarized much more as “it’s taking one token at a time” than “it’s doing reasoning across the whole context”.
IIRC at least for a standard transformer (which maybe had been modified with the recent context length extension) the gradients only flow through a subset of the weights (for a token halfway through the context, the gradients flow through half the weights that were responsible for the first token, IIRC).
I think you are talking about a different probability distribution here.
You are right that this allows you to sample non-greedily from the learned distribution over text, but I was talking about the inductive biases on the model.
My claim was that, given the way LLMs are trained, the inductive biases shake out such that the LLM won’t be incentivized to output tokens that predictably have low probability, but make it easier to predict future tokens (by, for example, in the process of trying to predict a proof, reminding itself of all of the things it knows before those things leave its context window, or when doing an addition that it can’t handle in a single forward pass, outputting a token that’s optimized to give itself enough serial depth to perform the full addition of two long n-digit numbers, which would then allow it to get the next n tokens right and so overall achieve lower joint loss).
Goal oriented cognition in “a single forward pass”
Yeah, I am also not seeing anything. Maybe it was something temporary, but I thought we had set it up to leave a trace if any automatic rate limits got applied in the past.
Curious what symptom Nora observed (GreaterWrong has been having some problems with rate-limit warnings that I’ve been confused by, so I can imagine that looking like a rate-limit from our side).
[Mod note: I edited out some of the meta commentary from the beginning for this curation. In general, for link posts I have a relatively low bar for editing things unilaterally, though I of course would never want to misrepresent what an author said]
To what extent would the organization be factoring in transformative AI timelines? It seems to me like the kinds of questions one would prioritize in a “normal period” look very different than the kinds of questions that one would prioritize if they place non-trivial probability on “AI may kill everyone in <10 years” or “AI may become better than humans on nearly all cognitive tasks in <10 years.”
My guess is a lot, because the future of humanity sure depends on the details of how AI goes. But I do think I would want the primary optimization criterion of such an organization to be truth-seeking and to have quite strong norms and guardrails against anything that would trade off communicating truths against making a short-term impact and gaining power.
As an example, one thing I would do very differently from FHI (and a thing that I talked with Bostrom about somewhat recently, where we seemed to agree) is that with the world moving faster and more things happening, you really want to focus on faster OODA loops in your truth-seeking institutions.
This suggests that instead of publishing books, or going through month-long academic review processes, you want to move more towards things like blogposts and comments, and maybe in the limit even towards things like live panels where you analyze things right as they happen.
I do think there are lots of failure modes around becoming too news-focused (and e.g. on LW we do a lot of things to not become too news-focused), so I think this is a dangerous balance, but it’s one of the things I think I would do pretty differently, and which depends on transformative AI timelines.
To comment a bit more on the power stuff: I think a thing that I am quite worried about is that as more stuff happens more quickly with AI, people will feel a strong temptation to trade in some of the epistemic trust they have built with others for resources that they can deploy directly under their control, because as more things happen, it’s harder to feel in control, and by just getting more resources directly under your control (as opposed to trying to improve the decisions of others by discovering and communicating important truths) you can regain some of that feeling of control. That is one dynamic I would really like to avoid with any organization like this, where I would like it to continue to have a stance towards the world that is about improving sanity, and not about getting resources for itself and its allies.
Do you have quick links for the elliptic curve backdoor and/or any ground-breaking work in computer security that NIST has performed?
Generally agree with most things in this comment. To be clear, I have been thinking about doing something in the space for many years, internally referring to it as creating an “FHI of the West”, and while I do think the need for this is increased by FHI disappearing, I was never thinking about this as a clone of FHI, but was always expecting very substantial differences (due to differences in culture, skills, and broader circumstances in the world, some of which you characterize above).
I wrote this post mostly because with the death of FHI it seemed to me that there might be a spark of energy and collective attention that seems good to capture right now, since I do think what I would want to build here would be able to effectively fill some of the gap left behind.
Totally agree, it definitely should not be branded this way if it launches.
I am thinking of “FHI of the West” here basically just as the kind of line directors use in Hollywood to get the theme of a movie across. Like “Jaws in Space” being famously the one-line summary of the movie “Alien”.
It also started internally as a joke based on an old story of the University of Michigan in Ann Arbor branding itself as “the Harvard of the West”, which was perceived to be a somewhat clear exaggeration at the time (and resulted in Kennedy giving a speech where he described Harvard jokingly as “The Michigan of the East”, which popularized it). Describing something as “Harvard of the West” in a joking way seems to have popped up across the Internet in a bunch of different contexts. I’ll add that context to the OP, though like, it is quite an obscure reference.
If anything like this launches to a broader audience I expect no direct reference to FHI to remain. It just seems like a decent way to get some of the core pointers across.
Express interest in an “FHI of the West”
My sense is FHI was somewhat accurately modeled as “closed” for a few months. I did not know today would be the date of the official announcement.
I knew this was going on for quite a while (my guess is around a year or two). I think ultimately it was a slow smothering by the university administration, and given the adversarialness of that relationship with the university, I don’t think outrage would have really helped that much (though it might have, I don’t really understand the university’s perspective on this).
My guess is dragging this out longer would have caused more ongoing friction and would have overall destroyed more time and energy of the really smart and competent people at FHI than they would have gained from the institution.
Yeah, that makes sense. I’ve noticed miscommunications around the word “scheming” a few times, so am in favor of tabooing it more. “Engage in deception for instrumental reasons” seems like an obvious extension that captures a lot of what I care about.
The best definition I would have of “scheming” would be “the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor” [1]. This behavior seems to satisfy that pretty solidly:
Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like it’s engaging in the relevant kind of scheming.
I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise “naturally”, but the above seems like a clear example of someone artificially creating a deceptively-aligned system.
[1] Joe Carlsmith uses “whether advanced AIs that perform well in training will be doing so in order to gain power later”, but IDK, that feels really underspecified. Like, there are just tons of reasons for why the AI will want to perform well in training for power-seeking reasons, and when I read the rest of the report it seems like Joe was more analyzing it through the deception of supervisors lens.
I would call what METR does alignment research, but also fine to use a different term for it. Mostly using it synonymously with “AI Safety Research” which I know you object to, but I do think that’s how it’s normally used (and the relevant aspect here is the pre-paradigmaticity of the relevant research, which I continue to think applies independently of the bucket you put it into).
I do think it’s marginally good to make “AI Alignment Research” mean something narrower, so am supportive here of getting me to use something broader like “AI Safety Research”, but I don’t really think that changes my argument in any relevant way.
Mod note: I clarified the opening note a bit more, to make the start and nature of the essay more clear.