Nico Hillbrand

Karma: 67

Nico Hillbrand 17 Jul 2026 8:20 UTC
3 points
0
on: Nico Hillbrand’s Shortform
I wonder what will happen on LW during crunch time scenarios.
Probably scanned by automated alignment researchers frequently and possibly having technical people recruited into interfacing with the AGI labs.
Probably also more coordination of advocacy as some more people’s thresholds for when to put energy into it get crossed with the situation becoming more clearly wild.

Nico Hillbrand 12 Jul 2026 10:04 UTC
1 point
0
on: Nico Hillbrand’s Shortform
Heuristics I like
- If you see a problem or thing you’d like to exist, ask ‘What would it look like for me to solve / create this?’ and ‘Should I just go, take the responsibility and work on this myself?’
- For any problem in life there’s a truth about what the solution space looks like, try iteratively adding detail to imagined solution states and intermediate steps
- If you see someone making a mistake, ask ‘Am I making a similar type of mistake myself?’
- If there’s a hard question you feel like you’d need to think about a bunch and gather evidence on, try to predict what you’d find doing that for at least 5s
- If you’re being productive, but not sharing some form of logs with friends, co-workers or the web that might impress them, you’re probably missing out on cheap approval reward increasing your motivation
- If you have a narrative of why two tasks have to be done sequentially, question that and ask what the costs to parallelization would be

Nico Hillbrand 12 Jul 2026 8:56 UTC
4 points
2
on: Nico Hillbrand’s Shortform
I want the world to be more robust against small academic labs or startups building hyper efficient brain-like AGI / much better paradigm AI. Both during a pause or in LLMs-don’t-scale worlds.
Some thoughts:
- Ideally there’s a widespread toolkit of evaluation, control techniques and model organism / elicitation approaches that are easy for the small lab to use. Potentially including being offered by external orgs.
- Ideally it’s illegal to run experiments that could result in hyper efficient AIs with dangerous capabilities. Or at the least illegal to do so without alerting authorities and using safety toolkits. A lot of academic labs probably pay attention to what’s legal.
- Maybe there should be easy to use channels of transferring algorithmic breakthroughs or potentially just a model API to frontier labs who have better control, automated alignment research scaffolds and d/acc programs.
- The whole thing plausibly looks like having a bunch of open research on brain algorithms and AI already published with a few tweaks and optimizations left before it’s clear the new paradigm gets to dangerous capabilities. Ideally that research would have already been illegal before triggering noticeable capabilities gains, that seems hard to enforce though, but if ppl take the problem seriously after some future wake up then maybe it’d be possible.
- There should probably be some tracking of groups doing research relevant to hyper efficient / brain-like AGI and social pressure on them to be careful or stop and ideally monitoring by possible future sane-after-wake-up intelligence agencies so long as the tracking doesn’t make it more salient what the algorithmic insights are and they are fine tuned and developed much earlier and less safely because of it.

Nico Hillbrand 2 Jul 2026 7:54 UTC
1 point
0
in reply to: Nico Hillbrand’s comment on: Nico Hillbrand’s Shortform
(This is primarily an emotional expression and especially on explicitness about utopia I’m pretty unsure if that’s strategically useful or backfires too much when interfacing with elites/institutions)

Nico Hillbrand 2 Jul 2026 7:43 UTC
3 points
0
in reply to: johnswentworth’s comment on: Nico Hillbrand’s Shortform
True that, the real target is probably hyper astronomical in scale :)

Nico Hillbrand 1 Jul 2026 9:18 UTC
27 points
32
on: Nico Hillbrand’s Shortform
I want to be part of a community that is unapologetic about wanting to build a galaxy scale utopia and promotes pride in honesty/integrity and explicit awareness of incentive problems and coordination opportunities

Nico Hillbrand 1 Jul 2026 8:34 UTC
1 point
0
in reply to: leogao’s comment on: leogao’s Shortform
I would probably ask the smart people to both figure out a way for me to effectively use the funding for this task and to prioritise questions of metaphilosophy in hopes they can figure out what a good field of philosophy would look like and the feedback loops etc involved in it.
So then I think in practice this mostly comes down to delegating well which before I learn more about good delegation via thinking about it more and having the smart people work on it I’d likely start by announcing I want people to work on this, reading their output and talking to them and then hiring and funding promising / high integrity seeming people to build up more infrastructure and help me evaluate and grant to additional people.
If I had to design a bunch myself or predict what I’ll be advised to design, then I’d probably have some kind of thing where people get more money tied to having output that matches proxies of solving real world problems and convincing peers progress has been made and generally make sure things like this are measured and tracked precisely. I’d probably also try to have some kind of thing where if you made helpful contributions you can go be part of the think long about illegible problems squad or something and spend a couple years working on stuff without strong output signals while still getting increased money/status by being part of it, not sure. Generally I’d try to have a bunch of people whose job it is to track and measure how money/status are translating into differences between what a bunch of people think is worth prioritising vs what is prioritised and how people are goodharting various infrastructure/signals. Could go on with trying to make these more concrete if useful.
One maybe interesting related question with some overlap is: If you had a bunch of smart AIs and unlimited reward signals / instructions to make them work on something how would you design a field of philosophy that is actually good?

Nico Hillbrand 24 Jun 2026 17:45 UTC
1 point
0
in reply to: Neel Nanda’s comment on: Nico Hillbrand’s Shortform
Thanks, that makes sense! Courses / tutorials where you can learn basics and core methods with quick feedback loops do intuitively seem pretty important, interesting to see you as someone highly involved rank them very high compared to the other items!

Nico Hillbrand 24 Jun 2026 17:25 UTC
1 point
0
in reply to: Cleo Nardo’s comment on: Nico Hillbrand’s Shortform
That makes sense! And thanks for pointing out the potential interpretation/vibe of spuriousness which I think it’s right to be sceptical of.
‘It has an impact driven theory of change that makes sense / that a fairly rational researcher with the knowledge at the time would go for’ seems like an important additional bullet point.
I think it’s interesting to specifically think about whether any features of communications etc made it especially easy for people to engage with, in addition to the theory of change / research in itself making sense (at least to me at the time and also to some extent now).
I’m most interested in features that don’t change relative prioritization much, but absolute levels if there are such, since purely relative interventions are only useful if you can do better at prio than the community which seems hard.

Nico Hillbrand 23 Jun 2026 18:08 UTC
27 points
3
on: Nico Hillbrand’s Shortform
I think ‘How did mech interp get popular around 2022?’ is an interesting question. I wasn’t really involved in it, so speculating pretty hard:
- Neel Nanda and Chris Olah did some good podcasts
- Mechanistic Interpretability Quickstart Guide by Neel Nanda and 200 concrete open problems in mech interp sequence and the long list of theories of impact for mech interp post
- It was very useful that there were cool papers / results like the modular addition transformer and OthelloGPT as well as Olah’s previous vision model stuff which looked cool. Generally core papers/posts showing neat results that explore a highly self contained subproblem
- There were Apart hackathons and MATS streams
- Interp being something that it’s easy to do interesting mini projects about and that interfaces with people’s curiosity and somewhat with AI capabilities work
- AI Companies hiring people for interp roles
Wondering if something in this direction could and should be replicated for e.g. compute governance or other technical areas

Nico Hillbrand 23 Jun 2026 16:18 UTC
1 point
0
on: Nico Hillbrand’s Shortform
AIs helping with hiring trustworthy experts/evaluators during the automated alignment research explosion or other org setup / technical things that make it easier seem potentially quite important

Nico Hillbrand 29 May 2026 14:13 UTC
2 points
0
on: Nico Hillbrand’s Shortform
I am dissatisfied / frustrated with how many leading figures in the AI industry (Altman, Musk, Huang, etc) do not match a simple query to how I’d imagine an epistemically humble responsible person would act and communicate. Primarily by not talking about catastrophic risk as a serious possibility worth preparing for.
I haven’t thought about what mechanisms exist to relay this information to governance processes and am unsure to what extent functional ones that could address this even exist which is why I’m putting it in my quick takes for now

Nico Hillbrand 12 May 2026 15:59 UTC
2 points
0
in reply to: roha’s comment on: Nico Hillbrand’s Shortform
Generally I think it’s good for people to try to be as accurate as possible and that’s what I most associate with 3.2. That said these clusters are a big oversimplification and I think there’s people with good reasoning that end up in 3.3. And in practice 3.3 might be a reasonable attitude just from a perspective of a sane society wanting to address the alignment is hard and orgs are incapable scenario anyway

Nico Hillbrand 11 May 2026 13:11 UTC
2 points
1
on: Nico Hillbrand’s Shortform
There’s approximately 3 attitudes towards AGI Safety:
1. Dismissal from people that don’t take the possibility of powerful AI seriously
2. Lip service, but motivational apathy from people that agree it’s epistemically sensible to take powerful AI seriously as a far mode sort of idea, but have other priorities they care much more about
3. Concern and caring from people that take powerful AI seriously
I encounter 2. a bunch as a local group organiser. I think often it’s because people think there’s too much uncertainty or too little leverage they have such that they’re better off focusing their motivation on worlds without future powerful AI. Seeing that experts take powerful AI seriously through podcasts etc and talking with other people at conferences that do seems to shift people towards 3. sometimes.
I think 3. maybe has some more interesting distinctions between:
3.1 People that assume alignment is easy to medium difficult and organisations will act pretty competently
3.2 People that take an uncertain attitude about alignment difficulty and organisational competence
3.3 People that assume alignment is difficult and organisations will act pretty incompetently
I think it’d be good for the world to have more people with the 3.2 attitude. Maybe a good example of people that I think of as representative are Buck Shlegeris and Ryan Greenblatt.

Nico Hillbrand 4 May 2026 7:43 UTC
3 points
4
on: Nico Hillbrand’s Shortform
I think almost any coherent vision of a good future involves some kind of pause / restriction on unsafe AI development. We should probably not wait with preparing it for when AI labour is directed to it during takeoff.

Nico Hillbrand 16 Apr 2026 14:48 UTC
2 points
1
on: Nico Hillbrand’s Shortform
The only place I expect an organisation with excellent safety culture, security mindset and strategic philosophical competence capable of steering a singularity towards good outcomes to arise is in a data center housing a country of aligned roughly human level AIs.

Nico Hillbrand 16 Apr 2026 8:11 UTC
2 points
0
on: Post-Scarcity is bullshit
I suspect the culture of elites with leverage over AI doesn’t want this and will try to achieve something else, but it’s at least plausible to me that if you solve some hard research and philosophy problems around alignment you can make wise ASI that is not exclusively obedient to small groups of people and can identify and deliver precisely the goods and automations that would lead to a much better society for ~everyone.
I think it’s good to critique people who talk about AI benefits in very vague terms and encourage them to flesh out how those benefits and the power structures that deliver them would look like in detail, but I also think there’s an important kernel of truth in AI could plausibly if handled really well make the world extremely better.
I do feel some frustration that many people do not seem to be trying all that much to sketch out detailed good scenarios to coordinate around and I broadly agree that the default power structures of market forces and small committees of military and AI company officials definitely don’t seem like they’d empower people by default.

Nico Hillbrand 20 Apr 2025 8:56 UTC
2 points
0
in reply to: mako yass’s comment on: Why does LW not put much more focus on AI governance and outreach?
1: My understanding is the classic arguments go something like: Assume interpretability won’t work (illegible CoT, probes don’t catch most problematic things). Assume we’re training our AI on diverse tasks and human feedback. It’ll sometimes get reinforced for deception. Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve. Deception is often a useful instrumental strategy. Assume that alien or task focused drives win out over potential honesty etc drives because they’re favoured by inductive biases. You get convergent deception.
I’m guessing you have interpretability working as the main crux and together with inductive biases for nice behaviours potentially winning it drives this story to low probability. Is that right?
A different story with interpretability at least somewhat working would be the following:
We again have deception by default because of human reinforcement for sycophancy and looking like solving problems (like in o3) as well as because of inductive biases for alien goals. However this time our interpretability methods work and since the AI is smart enough to know when it’s deceptive we can catch correlates of that representation with interpretability techniques.
Within the project developing the AI, the vibes and compute commitments are that the main goal is to go fast and outcompete others and a secondary goal is to be safe which gets maybe 1-10% of resources. So then as we go along we have the deception monitors constantly going off. People debate on what they should do about it. They come to the conclusion that they can afford to do some amount of control techniques and resample the most deceptive outputs, investigate more etc but mostly still use the model. They then find various training techniques that don’t directly train on their deception classifier but are evaluated against it and train on some subset of the classifications of another detector which leads to reducing deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful. We’re in the first scenario of interp not working now.
What are your thoughts in this scenario?
2: Assume that the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies / projects deploying them have as their main goals to increase their power with constraints on not looking bad to peers etc. I’m not sure they’d use the AIs in such a way that the AIs push back on their orders to make them more powerful because they’re aligned to a constitution or similar that makes them push for a nicer society with coordination mechanisms etc. I do agree that very likely the main projects will ask their AIs to set up coordination between them to stay in power and to have enough breathing room to do that. We wouldn’t have full on gradual disempowerment there but instead a concentration of power. So then possibly because there’s ideological pressure not to have concentration of power people will prevent the setup of these coordination mechanisms in general and we get a distributed AI world with AIs aligned to individual users and constitutions, but unable to help them with setting up coordination mechanisms.
I think we’re unlikely to block coordination mechanisms (e.g. China is already pretty centralised and probably wouldn’t do that within itself), but still curious about your thoughts on this.

Nico Hillbrand 29 Feb 2024 13:03 UTC
1 point
0
on: Counting arguments provide no evidence for AI doom
We can also construct an analogous simplicity argument for overfitting:
Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.
As far as I understood, people usually talk about simplicity biases based on the volume of basins in parameter space. So the response would be that overfitting takes up more parameters than other (probably smaller description length) algorithms and therefore has smaller basins.
I’m curious if you endorse or reject this way of defining simplicity based on the size of basins of a set of similar algorithms?
The way I’m currently thinking about this is:
1. Assume we are training end to end on tasks that require our network to do deep reasoning that requires multiple steps and high frequency functions for generating dramatically new outputs based on updated understanding of science etc (We are not monitoring the CoT or using a large net that emulates CoT internally without good interpretability).
2. Then the basins of the schemers that use the least parameters are large parts of the parameter space. The basins of harmless nets with few parameters are large parts as well. Gradient descent will select the one that is larger.
3. I don’t understand gradient descent inductive biases well enough to have strong intuitions which would be larger. So I end up feeling something like each could happen, I’d bet 60% the least parameter schemers’ basin is larger since there’s maybe slightly less space for encoding of the harmlessness needed. In that case I’d expect 99%+ probability of a schemer. In the case where the harmless basins are larger, 99%+ probability of a harmless model.
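To make the numbers in point 3 concrete, here is a minimal sketch of the implied overall probability of a schemer. It treats "99%+" as exactly 0.99 and assumes the two cases are exhaustive; the function name `p_schemer` and its parameters are purely illustrative, not anything from the post.

```python
def p_schemer(p_schemer_basin_larger: float, p_selected_given_larger: float) -> float:
    """Law of total probability over which basin is larger.

    p_schemer_basin_larger: credence that the schemer basin is the larger one.
    p_selected_given_larger: chance gradient descent lands in whichever basin
        is larger (assumed equal in both cases, with "99%+" rounded to 0.99).
    """
    p_harmless_basin_larger = 1.0 - p_schemer_basin_larger
    # Schemer if its basin is larger and gets selected, or if the harmless
    # basin is larger but gradient descent lands in the schemer basin anyway.
    return (p_schemer_basin_larger * p_selected_given_larger
            + p_harmless_basin_larger * (1.0 - p_selected_given_larger))


overall = p_schemer(0.60, 0.99)
print(f"P(schemer) ~ {overall:.3f}")  # prints "P(schemer) ~ 0.598"
```

So under these assumptions the overall estimate stays close to the 60% credence about basin sizes; almost all of the uncertainty sits in which basin is larger rather than in the selection step.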
I suppose this isn’t exactly a counting argument, because I think that evidence about inductive biases will quickly overcome any such argument and I’m agnostic about what evidence I will receive since I’m not very knowledgeable about it and other people seem to disagree a bunch.
Is my reasoning here flawed in some obvious way?
Also I appreciated the example of the cortices doing reasonably intelligent stuff without seemingly doing any scheming which makes me more hopeful an AGI system with interpretable CoT made up of a bunch of cortex level subnets with some control techniques would be sufficient to strongly accelerate the construction of a global xrisk defense system.