Generally I think it’s good for people to try to be as accurate as possible, and that’s what I most associate with 3.2. That said, these clusters are a big oversimplification, and I think there are people with good reasoning who end up in 3.3. In practice, 3.3 might also be a reasonable attitude simply because a sane society would want to address the “alignment is hard and orgs are incapable” scenario anyway.
Nico Hillbrand
There’s approximately 3 attitudes towards AGI Safety:
1. Dismissal from people that don’t take the possibility of powerful AI seriously
2. Lip service, but motivational apathy, from people that agree it’s epistemically sensible to take powerful AI seriously as a far mode sort of idea, but have other priorities they care much more about
3. Concern and caring from people that take powerful AI seriously
I encounter 2. a bunch as a local group organiser. I think it’s often because people feel there’s too much uncertainty, or that they have too little leverage, so they’re better off focusing their motivation on worlds without future powerful AI. Seeing through podcasts etc. that experts take powerful AI seriously, and talking with people at conferences who do, sometimes seems to shift people towards 3.
I think 3. maybe has some more interesting distinctions between:
3.1 People that assume alignment is easy to medium difficult and organisations will act pretty competently
3.2 People that take an uncertain attitude about alignment difficulty and organisational competence
3.3 People that assume alignment is difficult and organisations will act pretty incompetently
I think it’d be good for the world to have more people with the 3.2 attitude. Good examples of people I think of as representative are maybe Buck Shlegeris and Ryan Greenblatt.
I think almost any coherent vision of a good future involves some kind of pause / restriction on unsafe AI development. We should probably not wait to prepare it until AI labour can be directed at it during takeoff.
The only place I expect an organisation with excellent safety culture, security mindset and strategic philosophical competence capable of steering a singularity towards good outcomes to arise is in a data center housing a country of aligned roughly human level AIs.
I suspect the culture of elites with leverage over AI doesn’t want this and will try to achieve something else, but it’s at least plausible to me that if you solve some hard research and philosophy problems around alignment you can make wise ASI that is not exclusively obedient to small groups of people and can identify and deliver precisely the goods and automations that would lead to a much better society for ~everyone.
I think it’s good to critique people who talk about AI benefits in very vague terms and to encourage them to flesh out in detail what those benefits and the power structures that deliver them would look like, but I also think there’s an important kernel of truth in the claim that AI, if handled really well, could plausibly make the world much better.
I do feel some frustration that many people do not seem to be trying all that much to sketch out detailed good scenarios to coordinate around and I broadly agree that the default power structures of market forces and small committees of military and AI company officials definitely don’t seem like they’d empower people by default.
1: My understanding is the classic arguments go something like: Assume interpretability won’t work (illegible CoT, probes don’t catch most problematic things). Assume we’re training our AI on diverse tasks and human feedback. It’ll sometimes get reinforced for deception. Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve. Deception is often a useful instrumental strategy. Assume that alien or task focused drives win out over potential honesty etc drives because they’re favoured by inductive biases. You get convergent deception.
I’m guessing your main crux is that interpretability works, and that this, together with inductive biases for nice behaviours potentially winning out, drives this story to low probability. Is that right?
A different story with interpretability at least somewhat working would be the following:
We again have deception by default because of human reinforcement for sycophancy and for appearing to solve problems (as in o3), as well as because of inductive biases for alien goals. However, this time our interpretability methods work, and since the AI is smart enough to know when it’s being deceptive, we can catch correlates of that representation with interpretability techniques.
Within the project developing the AI, the vibes and compute commitments are such that the main goal is to go fast and outcompete others, and a secondary goal is to be safe, which gets maybe 1-10% of resources. So as we go along, the deception monitors are constantly going off. People debate what they should do about it. They conclude that they can afford to apply some control techniques, resample the most deceptive outputs, investigate more etc., but mostly still use the model. They then find various training techniques that don’t directly train on their deception classifier but are evaluated against it, and that train on some subset of the classifications of another detector. This reduces detected deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful. We’re now in the first scenario, where interp doesn’t work.
What are your thoughts on this scenario?
2: Assume that the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies / projects deploying them have as their main goals to increase their power with constraints on not looking bad to peers etc. I’m not sure they’d use the AIs in such a way that the AIs push back on their orders to make them more powerful because they’re aligned to a constitution or similar that makes them push for a nicer society with coordination mechanisms etc. I do agree that very likely the main projects will ask their AIs to set up coordination between them to stay in power and to have enough breathing room to do that. We wouldn’t have full on gradual disempowerment there but instead a concentration of power. So then possibly because there’s ideological pressure not to have concentration of power people will prevent the setup of these coordination mechanisms in general and we get a distributed AI world with AIs aligned to individual users and constitutions, but unable to help them with setting up coordination mechanisms.
I think we’re unlikely to block coordination mechanisms (e.g. China is already pretty centralised and probably wouldn’t do that within itself), but I’m still curious about your thoughts on this.
We can also construct an analogous simplicity argument for overfitting:
Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.
As far as I understand, people usually talk about simplicity biases based on the volume of basins in parameter space. So the response would be that overfitting takes up more parameters than other (probably smaller description length) algorithms and therefore has smaller basins.
I’m curious whether you endorse or reject this way of defining simplicity based on the size of the basins of a set of similar algorithms?
The way I’m currently thinking about this is:
Assume we are training end to end on tasks that require our network to do deep, multi-step reasoning and to use high frequency functions for generating dramatically new outputs based on an updated understanding of science etc. (we are either not monitoring the CoT, or using a large net that emulates CoT internally without good interpretability).
Then the basins of the schemers that use the least parameters are large parts of the parameter space. The basins of harmless nets with few parameters are large parts as well. Gradient descent will select the one that is larger.
I don’t understand gradient descent inductive biases well enough to have strong intuitions about which would be larger. So I end up feeling like either could happen. I’d bet 60% that the basin of the least-parameter schemers is larger, since they maybe avoid the slight extra space needed for encoding harmlessness. In that case I’d expect a 99%+ probability of a schemer. In the case where the harmless basins are larger, I’d expect a 99%+ probability of a harmless model.
I suppose this isn’t exactly a counting argument, because I think that evidence about inductive biases will quickly overcome any such argument, and I’m agnostic about what evidence I will receive since I’m not very knowledgeable about it and other people seem to disagree a bunch.
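To make the arithmetic behind these numbers explicit, here is a minimal sketch of the combined estimate via the law of total probability. The function name is made up for illustration, and treating "99%+" as exactly 0.99 is an assumption, so the result is only a rough figure.

```python
def overall_schemer_probability(p_schemer_basin_larger: float,
                                p_schemer_if_larger: float,
                                p_schemer_if_smaller: float) -> float:
    """Law of total probability over which basin turns out to be larger."""
    p_harmless_basin_larger = 1.0 - p_schemer_basin_larger
    return (p_schemer_basin_larger * p_schemer_if_larger
            + p_harmless_basin_larger * p_schemer_if_smaller)


# 60% bet that the schemer basin is larger; "99%+" is taken as exactly 0.99
# (an assumption), so a schemer has ~1% probability in the harmless case.
overall = overall_schemer_probability(0.60, 0.99, 0.01)
print(f"P(schemer) ~ {overall:.3f}")
```

Under these assumptions the overall estimate lands close to the 60% bet itself, since nearly all of the uncertainty sits in the question of which basin is larger.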
Is my reasoning here flawed in some obvious way?
Also, I appreciated the example of the cortices doing reasonably intelligent stuff without seemingly doing any scheming, which makes me more hopeful that an AGI system with interpretable CoT, made up of a bunch of cortex level subnets with some control techniques, would be sufficient to strongly accelerate the construction of a global xrisk defense system.
I am dissatisfied / frustrated with how many leading figures in the AI industry (Altman, Musk, Huang, etc.) do not match even a simple picture of how I’d imagine an epistemically humble, responsible person would act and communicate, primarily by not talking about catastrophic risk as a serious possibility worth preparing for.
I haven’t thought about what mechanisms exist to relay this information to governance processes, and I’m unsure to what extent functional ones that could address this even exist, which is why I’m putting it in my quick takes for now.