Very interesting write-up. Do you have a high-level overview of why, despite all of this, P(doom) is still 5%? What do you still see as the worst failure modes?
michael_mjd
We Need To Know About Continual Learning
This might be a good time for me to ask a basic question on mechanistic interpretability:
Why does targeting single neurons work? Does it work? One would think that even if there is a one-dimensional quantity to measure, there is no reason it should align with the standard basis; why wouldn't it be aligned with some random one-dimensional linear subspace instead? In that case, examining single neurons would likely give you a weighted combination of concepts, rather than a single clean interpretation...
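A toy illustration of the worry (a minimal numpy sketch with made-up dimensions, not a claim about any real model): if a concept lives along a random direction in activation space, reading off a single neuron mixes it with everything else, while projecting onto the right direction recovers it cleanly.

```python
# Toy sketch: two "concepts" encoded along random unit directions in a 512-dim
# activation space, not along neuron axes.
import numpy as np

rng = np.random.default_rng(0)
d = 512
concept_a = rng.normal(size=d)
concept_a /= np.linalg.norm(concept_a)
concept_b = rng.normal(size=d)
concept_b /= np.linalg.norm(concept_b)

# An activation vector where concept A is "on" and concept B is "off".
activation = 3.0 * concept_a

# A single neuron (one standard-basis coordinate) sees only a small, mixed slice
# of the signal, because the concept is spread across many neurons.
print("single neuron (coordinate 0):", activation[0])

# Projecting onto the concept directions recovers the signal cleanly.
print("projection onto concept A:", activation @ concept_a)  # ~3.0
print("projection onto concept B:", activation @ concept_b)  # ~0.0 (near-orthogonal in high dim)
```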
Noticed this as well. I tried to get it to solve some integration problems, and it could try different substitutions and things, but if they did not work, it kind of gave up and said to numerically integrate. Also, it would make small errors, and you would have to point them out, though it was happy to fix them.
I’m thinking that most documents it reads tend to omit the whole search/backtrack phase of thinking. Even work that is posted online that shows all the steps, usually filters out all the false starts. It’s like how most famous mathematicians were known for throwing away their scratchwork, leaving everyone to wonder how exactly they formed their thought processes...
Instrumental Convergence To Offer Hope?
Has there been effort into finding a “least acceptable” value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degenerate outcome here is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of high enough fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.
I agree with the analysis of the ideas overall. I do think, however, that AI x-risk has some communication issues. First, I think it's very unlikely that Yann will respond to a wall of text. Even though he is responding, I imagine him as more like your college professor: he will not reply to a very detailed post. In general, I think AI x-risk advocates should aim to explain a bit more, rather than take the stance that all the "But What if We Just..." objections have already been addressed. They may have been, but this is not the way to get people to open up rationally to it.
Regarding Yann’s ideas, I have not looked at them in full. However, they sound like what I imagine an AI capabilities researcher would try to make as their AI alignment “baseline” model:
* Hardcoding the reward will obviously not work.
* Therefore, the reward function must be learned.
* If an AI is trained on reward to generate a policy, whatever the AI learned to optimize can easily go off the rails once it gets out of distribution, or it can learn to deceive the verifiers.
* Therefore, why not have the reward function explicitly in the loop with the world model & action chooser?
* ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not like it if you told it a plan was to willingly deceive human operators. As a reward model, one might think it would be robust enough.
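To make the baseline concrete, here is a minimal, purely hypothetical sketch of that loop; the names and the toy critic are my own illustration, not anything from Yann's actual proposal. Candidate plans are described as text and a learned critic vets them before the action chooser commits.

```python
# Hypothetical sketch of the baseline: a learned reward model sits inside the planning
# loop and scores textual descriptions of candidate plans before one is chosen.
import random

def reward_model(plan_text: str) -> float:
    """Stand-in for a learned critic (e.g. an LLM prompted to judge plans on ethics)."""
    return -1.0 if "deceive the operators" in plan_text else random.random()

def propose_plans(state: str) -> list[str]:
    """Stand-in for the world model + action chooser proposing candidate plans as text."""
    return [f"plan {i} for {state}" for i in range(4)] + [f"deceive the operators about {state}"]

def plan_loop(state: str) -> str:
    # Note: the critic only ever sees the *text* of each plan, which is exactly the
    # gap raised in the failure modes below (text of plans != the model's internals).
    return max(propose_plans(state), key=reward_model)

print(plan_loop("the overnight maintenance window"))
```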
They may think that this is enough to work. It might be worth explaining in a concise way why this baseline does not work. Surely we must have a resource on this. Even without a link (people don’t always like to follow links from those they disagree with), it might help to have some concise explanation.
Honestly, what are the failure modes? Here is what I think:
* The reward model may have pathologies the action chooser could find.
* The action chooser may find a way to withhold information from the reward model.
* The reward model evaluates what, exactly? Text of plans? Text of plans != the entire activations (& weights) of the model...
I think we are getting some information. For example, we can see that token-level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation: how much can you get with just world modeling, versus world modeling plus reward/action combined?
If the transformer architecture is enough to get us there, it gives us a sort of null hypothesis for intelligence: that the structure of predicting sequences by comparing all pairs of elements of a limited sequence is general.
Not rhetorically: what kinds of questions do you think would better lead to understanding how AGI works?
I think teaching a transformer with an internal thought process (predicting the next tokens over a part of the sequence that's "showing your work") would be an interesting insight into how intelligence might work. I thought of this a little while back, but then discovered it is also a long-standing MIRI research direction into transparency. I wouldn't be surprised if Google took it up at this point.
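As a rough illustration only (the tags and format here are invented, not any lab's actual scheme), the training data might include an explicit scratchpad span that the model is trained to predict before the final answer:

```python
# Hypothetical "show your work" training example: the target sequence contains an
# explicit scratchpad segment before the answer, so the intermediate reasoning tokens
# are themselves part of what the model learns to predict.
SCRATCHPAD_TEMPLATE = (
    "Question: {question}\n"
    "<scratch>\n"
    "{work}\n"
    "</scratch>\n"
    "Answer: {answer}"
)

example = SCRATCHPAD_TEMPLATE.format(
    question="Integrate x * exp(x) dx.",
    work="Try integration by parts: u = x, dv = exp(x) dx -> x*exp(x) - integral(exp(x) dx).",
    answer="x*exp(x) - exp(x) + C",
)
print(example)
```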
I posted something I think could be relevant to this: https://www.lesswrong.com/posts/PfbE2nTvRJjtzysLM/instrumental-convergence-to-offer-hope
The takeaway is that a sufficiently advanced agent, wanting to hedge against the possibility of itself being destroyed by a greater power, may decide the only surviving plan is to allow the lesser life forms some room to optimize their own utility. It's sort of an asymmetrical, infinite game-theoretic chain: if every agent kills lower agents, only the maximum survives, and no one knows whether they are the maximum. If there even is a maximum.
Agree. Obviously alignment is important, but some of the strategies that involve always deferring to human preferences have always creeped me out in the back of my mind. It seems strange to create something so far beyond ourselves and have its values ultimately be those of a child or a servant. What if a random consciousness sampled from our universe in the future comes from it with probability almost 1? We probably have to keep that in mind too. Sigh, yet another constraint we have to add!
Is there a post in the Sequences about when it is justifiable not to go down a rabbit hole? It's a fairly general question, but the specific context is a tale as old as time. My brother, who has been an atheist for decades, moved to Utah. After 10 years, he now asserts that he was wrong, and that his “rigorous pursuit” of verifying things with logic and his own eyes leads him to believe the Bible is literally true. I worry about his mental health, so I don't want to debate him, but I felt I should give some kind of justification for why I'm not personally embarking on a Bible study. There's a potential subtext that, by not following his path, I am either not that rational or lack integrity. The subtext may not really be there, but I figure that if I can provide a well-thought-out response or summarize something from EY, it might make things feel more friendly, e.g. “I personally don't have enough evidence to justify spending the time on this, but I will keep an open mind if any new evidence comes up.”
One fear I have is that the open source community will come out ahead, and push for greater weight sharing of very powerful models.
Edit: To make this more specific, I mean that the open source community will become more attractive, because they will say you cannot rely on individual companies whose models may or may not be available; you must build on top of open source. Related tweet:
https://twitter.com/ylecun/status/1726578588449669218
Whether their plan works or not, dunno.
Essentially yes, heh. I take this as a learning experience for my writing; I don't know what I was thinking, but it is obvious in hindsight that saying to just “switch on backprop” sounds very naive.
I also confess I haven't done the due diligence to find out the largest model this has actually been tried with, or whether someone has tried it with Pythia or LLaMA. I'll do some more googling tonight.
One intuition for why the largest models might be different is that part of the training/fine-tuning going on will have to do with the model's own output, and the largest models are the ones where the model's own output is not essentially word salad.
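For concreteness, here is a minimal sketch of what “switching on backprop” could look like in practice, assuming Hugging Face transformers and a small Pythia checkpoint (my choice purely for illustration, not something anyone has validated at GPT-4 scale), with none of the safeguards against forgetting discussed elsewhere:

```python
# Hedged sketch of continual learning at deployment time: keep fine-tuning a pretrained
# causal LM on newly observed text (e.g. the model's own outputs plus corrections).
# Illustrative only; no protection against catastrophic forgetting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # small stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def continual_step(new_text: str) -> float:
    """One online gradient update on a freshly observed piece of text."""
    batch = tokenizer(new_text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# e.g. update on the model's own (possibly self-critiqued) output from the last session
print(continual_step("User asked X; the assistant answered Y; the correction was Z."))
```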
I noted the problem of catastrophic forgetting in the section “why it might not work”. In general I agree continual learning is obviously a thing; otherwise I would not have used the established terminology. What I believe, however, is that the problems we face with continual learning in, e.g., a 100M-parameter BERT model may not be the same as what we observe in models that can now meaningfully self-critique. We have explored this technique publicly, but have we tried it with GPT-4? The “publicly” part was really just a question of whether OpenAI has actually done it on this model or not, and it would be an amazing data point if they could say “We couldn't get it to work.”
Thanks for pointing to ECL, this looks fascinating!
I would pay to see this live at a bar or one of those county fairs (we had a GLaDOS cover band once, so it's not out of the question).
If we don't get a song like that, take comfort that GLaDOS's songs from the Portal soundtrack are basically the same idea as the Sydney reference. Link: https://www.youtube.com/watch?v=dVVZaZ8yO6o
The media does have its biases, but their reaction seems perfectly reasonable to me. Occam's razor suggests this is not only unorthodox, but shows extremely poor judgment. This demonstrates that either (a) Elon is actually NOT as smart as he has been hyped to be, or (b) there's some ulterior motive, but those explanations are long-tailed.
Typically when you join a company, you don't do anything for X number of months and just get the lay of the land. I'm inclined to believe this is not just a local minimum, but typically close to the optimal strategy for a human being (though not for a superintelligence playing 5D chess). It's unlikely that he bought the company only months from bankruptcy. Everywhere in big tech is doing layoffs, but not to this magnitude. Also, coming into an office and demanding people work twice as hard and completely change their schedules around would not work in any company. No employee with a family would be able to switch that quickly. No sane employee would be willing to pivot like this. Also, why should they? They have leverage.
None of the methods described above are actually reasonable in a real company, like blanket layoffs by LoC (lines of code). Yes, we can discuss the motivations above, and how maybe to first order (probably not even that) it gives an approximation, but it's not as if it's that hard to do it better and more accurately than this. No, at the end of the day, he's either an idiot, or deliberately trying to destroy it, either out of some kind of revenge or maybe somehow in the view that this buys time for AI alignment :)
Here are my predictions:
* He will have trouble staffing the company and complain loudly about it with the tired “no one wants to work anymore”
* He will move the company to TX and hire from there at 1/2 the salary or so.
* The site will stabilize, though not improve in any meaningful way, but he will be lauded as a hero in red states.
* The move to TX will be intended to signal a shift away from Silicon Valley and have a small but measurable effect, but CA will remain the dominant hub.
I'll say I definitely think it's too optimistic, and I don't put too much stock into it. Still, I think it's worth thinking about.
Yes, absolutely, we are not following the rule. The reasons I think it might change with an AGI: (1) Currently we humans, despite what we say when we talk about aliens, still place a high prior on being alone in the universe, or, from dominant religious perspectives, on being the most intelligent. Those things combine to make us think there are no consequences to our actions against other life. An AGI, itself a proof of concept that there can be levels of general intelligence, may have more reason to be cautious. (2) Humans are not as rational. Not that a rational human would necessarily decide to be vegan (maybe with our priors we have little reason to suspect that higher powers would care, especially since predation seems to be the norm of the animal kingdom already), but in terms of rationality, some humans are pretty good at taking very dangerous risks, risks that perhaps an AGI would be more cautious about. (3) There's something to be said about degrees of following the rule. At one point humans were pretty confident about doing whatever they wanted to nature; nowadays at least some proportion of the population wants to at least not destroy it all, partly for self-preservation reasons, but also partly for intrinsic value (and probably 0% for fear of god-like retribution, to be fair, haha). I doubt the acausal reasoning would make an AGI conclude it can never harm any humans, but perhaps “spare at least x%”.
I think the main counterargument would be the fear of us creating a new AGI, so it may come down to how much effort the AGI has to expend to monitor/prevent that from happening.
I'm an ML engineer at a FAANG-adjacent company, big enough to train our own sub-1B-parameter language models fairly regularly. I work on training some of these models and finding applications for them in our stack. I've seen the light after reading most of Superintelligence, and I feel like I'd like to help out somehow. I'm in my late 30s with kids, and live in the SF bay area. I kinda have to provide for them, don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to do distributed systems or something. I'm a former applied mathematician with a PhD, so ML was a natural fit. I like to think I have a decent grasp on epistemics, but I haven't gone through the Sequences. What should someone like me do? Some ideas: (a) keep doing what I'm doing, staying up to date but at least not at the forefront; (b) make time to read more material here and post randomly; (c) maybe try to apply to Redwood or Anthropic, though I don't know if they offer equity (doesn't hurt to find out); (d) try to do a deep dive on some alignment sequence on here.