(Quick Thought)
Perhaps the goal of existing work targeting AI safety is less to ensure that AI safety is achieved outright, and more to make sure that we build AI systems that are strictly[1] better than current researchers at figuring out what to do about AI safety.
I’m unsure how hard AI safety is. But I consider it fairly likely that, mid-term (maybe 50% of the way to TAI, in years), safe AI systems will outperform humans on AI safety strategy and the large majority of the research work.
If we humans can successfully bootstrap infrastructure more capable than ourselves, then our main work is done (though there could still be other work we can help with).
It might well be the case that the resulting AI systems would recognize that the situation is fairly hopeless. But at that point, humans have done the key things they need to do on this, hopeless or not. Our job is to set things up as best we can; more is by definition impossible.
Personally, I feel very doomy about humans now solving alignment problems that are many years away. But I feel much better about us making systems that will do a better job of guiding things than we could.
(The empirical question here is how difficult it is to automate alignment research. I realize this is a controversial and much-discussed topic. My guess is that many researchers will never concede that good AI systems are superior and will always hold out, while on the flip side, many people will trust AIs before they really should. Getting this right is definitely tricky.)
[1] “Strictly” meaning that they’re very likely better overall, not that there’s absolutely no area in which humans will be better than them.
Thanks for the clarification.
> But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem. As in, this can get us very far into making AI systems do a lot of valuable work for us with very low risk.
I imagine that you’re using a more specific definition of it than I am here.
I was thinking more of internal systems that a company would have enough faith in to deploy (a 1% chance of severe failure is pretty terrible!) or customer-facing things that would piss off customers more than scare them.
Getting these right is tremendously hard. Lots of companies are trying and mostly failing right now. There’s a ton of money in just “making solid services/products that work with high reliability.”
> Social media companies have very successfully deployed and protected their black box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
I agree that some companies do use RL systems. However, I’d expect that most of the time, the black-box nature of these systems is not actively preferred. They use them despite the black-box nature, not because of it, in the specific situations where the benefits outweigh the costs.
“current transformer models are arguably black boxes with massive adoption.” → They’re typically much less of a black box than RL systems. There’s a fair bit of customization that can be done with prompting, and the prompting is generally English-readable.
Your example of “Everything Inc” is also similar to what I’m expecting. As in, I agree with:
1. The large majority of business strategy/decisions/implementation can (somewhat) quickly be done by AI systems.
2. There will be strong pressures to improve AI systems, due to (1).
That said, I’d expect:
1. The benefits are likely to be (more) distributed. Many companies will be simultaneously using AI to improve their standings. This leads to a world where there’s not a ton of marginal low-hanging fruit for any single company. I think this is broadly what’s happening now.
2. A great deal of work will go into making many of these systems reliable, predictable, corrigible, legally compliant, etc. I’d expect companies to really dislike being blindsided by AI subsystems that do bizarre things.
3. This is a longer shot, but I think there’s a lot of potential for strong cooperation between companies, organizations, and (effective) governments. A lot of the negatives of maximizing businesses come from negative externalities and similar, which can also be viewed as coordination/governance failures. I’d naively expect this to mean that if power is distributed among multiple capable entities at time T, then these entities would likely wind up doing a lot of positive-sum interactions with each other. This seems good for many S&P 500 holders.
> ...or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable.
This seems tough to me, but quite possible, especially as we get much stronger AI systems. I’d expect that we could (with a lot of work) have a great deal of:
1. Categorization of potential tasks into discrete/categorizable items.
2. Simulated environments that are realistic enough.
3. Innovations in finding good trade-offs between task competence and narrowness.
4. Substantially more sophisticated and powerful LLM task-eval setups.
I’d expect this to be a lot of work. But at the same time, I’d expect a lot of it to be strongly commercially useful.
Thanks so much for that explanation. I’ve started to review the posts you linked to and will continue doing so later. Kudos for clearly outlining your positions; that’s a lot of content.
> “We probably mostly disagree because you’re expecting LLMs forever and I’m not.”
I agree that RL systems like AlphaZero are very scary. Personally, I was a bit more worried about AI alignment a few years ago, when this seemed like the dominant paradigm.
I wouldn’t say that I “expect LLMs forever”, but I would say that if/when they are replaced, I think it’s more likely than not that they will be replaced by a system with a scariness factor similar to LLMs or less. The main reason is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure toward non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such an intelligence/capabilities advantage that it outweighs the disadvantages, but I see this as unlikely (though definitely a thing to worry about).
> I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level?
I’m referring to scaffolding. As in, an organization makes an “AI agent”, but this agent frequently calls a long list of specific LLM+prompt combinations for certain tasks. These subcalls might be optimized to be narrow + [low information] + [low access] + [generally friendly to humans] or similar. This could be taken further with a large variety of fine-tuned models, though that might be less likely.
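To make this concrete, here’s a minimal sketch of the kind of scaffolding I mean; the `call_llm` helper, the task names, and the context fields are all hypothetical placeholders, not any particular framework’s API:

```python
# A minimal sketch of "scaffolding": an agent that routes work through a fixed
# list of narrow LLM+prompt combinations, each restricted in what it can see.
# `call_llm`, the task names, and the context fields are hypothetical.

from dataclasses import dataclass


def call_llm(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for whatever LLM client library is actually used."""
    raise NotImplementedError


@dataclass
class NarrowSubcall:
    system_prompt: str           # fixed, task-specific instructions
    allowed_context: list[str]   # the only information this subcall may see

    def run(self, context: dict[str, str]) -> str:
        # "Low information": pass only the whitelisted context fields.
        visible = {k: v for k, v in context.items() if k in self.allowed_context}
        return call_llm(self.system_prompt, str(visible))


# The "agent" is mostly a dispatcher over these narrow, auditable subcalls.
SUBCALLS = {
    "summarize_ticket": NarrowSubcall(
        system_prompt="Summarize the support ticket in three sentences.",
        allowed_context=["ticket_text"],
    ),
    "draft_reply": NarrowSubcall(
        system_prompt="Draft a polite reply. Do not promise refunds.",
        allowed_context=["ticket_text", "policy_snippet"],
    ),
}


def agent_step(task: str, context: dict[str, str]) -> str:
    return SUBCALLS[task].run(context)
```

The point is that each subcall is narrow and inspectable on its own, rather than one open-ended call with full information and access.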
“Do you think AI-empowered people / companies / governments also won’t become more like scary maximizers?” → My statements above were very focused on AI architectures / accident risk. I see people / misuse risk as a fairly distinct challenge/discussion.
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I’m having a hard time understanding the “Alas, the power-seeking ruthless consequentialist AIs are still coming” intuition. It seems like a lot of people in this community have this intuition, and I feel very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, “It seems very feasible to create sophisticated AI architectures that don’t act as scary maximizers.” To me it seems like this is what we’re doing now, and I see some strong reasons to expect this to continue. (I realize this isn’t guaranteed, but I do think it’s pretty likely)
3. While the human analogies are interesting, I assume they might appeal more to the “consequentialist AIs are still coming” crowd than to people like myself. Humans evolved under some pretty wacky pressures and have a large number of serious failure modes. Perhaps they’re much better than some of what people imagine, but I suspect that we can make AI systems that have much more rigorous safety properties in the future. I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative for these challenges.
4. You mention human intrinsic motivations as a useful factor. I’d flag that in a competent and complex AI architecture, I’d expect many subcomponents to have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it’s really specific subroutines and the like that have these more altruistic motivations.
I’ve been working on an app for some parts of this. Plan to more formally announce it soon, but the basics might be simple enough. Eager to get takes. Happy to add any workflows if people have requests. (You can also play with adding “custom workflows”, or just download the code and edit it).
Happy to discuss if that could be interesting.
https://www.roastmypost.org
I found this analysis refreshing and would like to see more on the GPU depreciation costs.
If better GPUs are developed, these will go down in value quickly, perhaps by 25% to 50% per year. This seems like a really tough expense and supply chain to manage. I’d expect most of the other infrastructure costs to depreciate much more slowly, as you mention.
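As a rough illustration of what that range implies (just the compounding arithmetic, not real pricing data):

```python
# Residual value under constant annual depreciation, using the 25% to 50% per
# year range mentioned above. Illustrative arithmetic only, not real data.

def residual_value(initial_cost: float, annual_depreciation: float, years: int) -> float:
    return initial_cost * (1 - annual_depreciation) ** years

for rate in (0.25, 0.50):
    print(f"{rate:.0%}/year: {residual_value(100.0, rate, 3):.1f}% of value left after 3 years")
# 0.75**3 ≈ 0.42 and 0.5**3 = 0.125, so roughly 42% and 12.5% of the value remains.
```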
I’m sure there are tons of things to optimize. Overall happy to see these events, just thinking of more things to improve :)
I’m unsure of shirts, but like the idea of more experimentation. It might be awkward to wear the same shirt for 2-3 consecutive days, and also some people will want more professional options. I liked the pins this year (there were some for “pDoom”). I could also see having hats, lanyards, bracelets.
Information-Dense Conference Badges
It’s a possibility, but this seems to remove a ton of information to me. The Ghibli faces all look quite similar to me. I’d be very surprised if they could be de-anonymized in cases like these (people who aren’t famous) in the next 3 years, if ever.
If you’re particularly paranoid, I presume we could have a system do a few passes.
Kind of off topic, but this leads me to wonder: why do so many websites bury the lede about the services they actually provide, like in this example?
I heard from a sales person that many potential customers turn away the moment they hear a list of specific words, thinking “it’s not for me”. So they try to keep it as vague as possible, learn more about the customer, then phrase things to make it seem like it’s exactly for them.
(I’m not saying I like this, just that this is what I was told)
Personally, I’m fairly committed to [talking a lot]. But I do find it incredibly difficult to do at parties. I’ve been trying to figure out why, but the success rate for me plus [talking a lot] at parties seems much lower than I would have hoped.
Quickly:
1. I imagine that strong agents should have certain responsibilities to inform the relevant authorities. These responsibilities should ideally be thoroughly discussed and regulated. For example, see what therapists and lawyers are asked to do.
2. “doesn’t attempt to use command-line tools” → This seems like a major mistake to me. Right now, an agent running on a person’s computer will attempt to use that computer to do several things in order to whistleblow. This seems inefficient, at the very least. The obvious strategy is just to send one overview message to some background service (for example, something like a support service for a particular government department), which would decide what to do with it from there (see the rough sketch after this list).
3. I imagine a lot of the problem now is just that these systems are pretty noisy at doing this. I’d expect a lot of false positives and negatives.
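To illustrate point 2, here’s a rough sketch of the “one overview message to a background service” idea; the endpoint URL and report fields are invented for illustration, not a proposal for any real service:

```python
# Hypothetical sketch of point 2: instead of running command-line tools on the
# user's machine, the agent sends a single structured overview report to a
# designated background service, which decides what (if anything) to do next.
# The endpoint URL and report schema here are invented for illustration.

import json
import urllib.request

REPORT_ENDPOINT = "https://example.invalid/agent-reports"  # placeholder

def send_overview_report(summary: str, evidence_ids: list[str], confidence: float) -> None:
    report = {
        "summary": summary,            # short description of the suspected issue
        "evidence_ids": evidence_ids,  # references rather than raw data, to limit exposure
        "confidence": confidence,      # the agent's own estimate, to help triage noise
    }
    request = urllib.request.Request(
        REPORT_ENDPOINT,
        data=json.dumps(report).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # the receiving service handles everything from here
```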
Part of me wants to create some automated process for this. Then part of me thinks it would be pretty great if someone could offer a free service (even paid could be fine) that has one person do this hunting work. I presume some of it can be delegated, though I realize the work probably requires more context than it first seems.
> CoT monitoring seems like a great control method when available, but I think it’s reasonably likely that it won’t work on the AIs that we’d want to control, because those models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe.
Small point, but I think that “neuralese” is likely to be somewhat interpretable, still.
1. We might advance at regular LLM interpretability, in which case those lessons might apply.
2. We might pressure LLM systems to only use CoT neuralese that we can inspect.
There’s also a question of how much future LLM agents will rely on CoT vs. more regular formats for storage. For example, I believe that a lot of agents now are saving information in English into knowledge bases of different kinds. It’s far easier for software people working with complex LLM workflows to make sure a lot of the intermediate formats are in languages they can understand.
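As a toy example of the “intermediate formats in languages we can understand” point, here’s a minimal sketch of an agent persisting its working notes as plain English; the file layout and note format are assumptions, not a description of any specific agent framework:

```python
# Toy sketch: an agent workflow that stores intermediate reasoning and findings
# as plain-English notes on disk, so humans (or monitors) can read them later.
# The directory layout and note format are assumptions for illustration only.

from datetime import datetime, timezone
from pathlib import Path

KNOWLEDGE_BASE = Path("agent_knowledge_base")

def save_note(topic: str, note: str) -> Path:
    """Append a timestamped, human-readable note for a given topic."""
    KNOWLEDGE_BASE.mkdir(exist_ok=True)
    path = KNOWLEDGE_BASE / f"{topic}.md"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] {note}\n")
    return path

# Example: the agent records what it concluded and why, in English.
save_note("vendor_research", "Ruled out Vendor A: pricing page lists no SLA.")
```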
All that said, personally, I’m excited for a multi-layered approach, especially at this point when it seems fairly early.
There are a few questions here.
1. Do Jaime’s writings state that he cares about x-risk or not?
→ I think he fairly clearly states that he cares.
2. Does all the evidence, when put together, imply that actually, Jaime doesn’t care about x-risk?
→ This is a much more speculative question. We have to assess how honest he is in his writing. I’d bet money that Jaime at least believes that he cares and is taking corresponding actions. This of course doesn’t absolve him of full responsibility: there are many people who believe they do things for good reasons but are causally driven by selfish reasons. But now we’re getting into a particularly speculative area.
“I also think it should be our dominant prior that someone is not motivated by reducing x-risk unless they directly claim they do.” → Again, I regard him as basically claiming that he does care. I’d bet money that if we asked him to clarify, he’d claim that he cares. (Happy to bet on this, if that would help.)
At the same time, I doubt that this is your actual crux. I’d expect that even if he claimed (more precisely) to care, you’d still be skeptical of some aspect of this.
---
Personally, I have both positive and skeptical feelings about Epoch, as I do other evals orgs. I think they’re doing some good work, but I really wish they’d lean a lot more on [clearly useful for x-risk] work. If I had a lot of money to donate, I could picture donating some to Epoch, but only if I could get a lot of assurances on which projects it would go to.
But while I have reservations about the org, I think some of the specific attacks against them (and defenses of them) are not accurate.
The humans trusted to make decisions.
I’m hesitant to say “best humans”, because who knows how many smart people there may be out there who might luck out or something.
But “the people making decisions on this, including in key EA orgs/spending” is a much more understandable bar.