I really like this model of computation and how naturally it deals with counterfactuals; I'm surprised it isn't talked about more often.
This raises the issue of abstraction—the core problem of embedded agency.
I’d like to understand this claim better—are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?
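For concreteness, here is a toy sketch of what I would mean by "relating" the two diagrams: an abstraction map from micro-states to macro-states that commutes with interventions (roughly the "exact transformation" idea from the causal-abstraction literature). Every name here (the toy models, `tau`) is hypothetical and purely illustrative:

```python
def low_level(do_x1=None, do_x2=None):
    """Low-level 'physics' model: two micro-variables cause a micro-output."""
    x1 = 1 if do_x1 is None else do_x1
    x2 = 2 if do_x2 is None else do_x2
    y = x1 + x2  # micro-level mechanism
    return x1, x2, y

def high_level(do_x=None):
    """High-level 'agent' model: one macro-variable causes a macro-output."""
    x = 3 if do_x is None else do_x  # macro-variable aggregating x1, x2
    y = x                            # macro-level mechanism
    return x, y

def tau(x1, x2, y):
    """Abstraction map from micro-states to macro-states."""
    return x1 + x2, y

# Consistency check: intervening at the micro level and then abstracting
# should agree with intervening directly at the macro level.
micro = tau(*low_level(do_x1=5, do_x2=1))  # micro intervention, then tau
macro = high_level(do_x=6)                 # corresponding macro intervention
assert micro == macro
```

If that's the right reading, is the claim that the core problem is finding (and justifying) such a map for realistic agent models and physics models?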
I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument, from this framework, for actively avoiding 'agentic' models is:
1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
2. If we believe H, then things which generalize very competently are likely to have agent-like internal architecture.
3. Having a selection criterion or model-space/prior which actively pushes away from such agent-like architectures could then help push away from things which generalize too broadly (a toy sketch of what I mean follows this list).
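Concretely, step 3 might look something like the sketch below. Both `agency_score` and `selection_criterion` are hypothetical names of my own, not an established proposal, and actually defining a robust agency detector is of course the hard, unsolved part:

```python
# Toy sketch of step 3, not a real proposal: `agency_score` and
# `selection_criterion` are hypothetical names of my own invention.

def agency_score(model) -> float:
    """Hypothetical detector scoring how 'agent-like' a model's internal
    architecture looks (e.g. evidence of explicit search or planning).
    Stubbed to zero here; defining this is the hard open problem."""
    return 0.0

def selection_criterion(model, task_loss: float, lam: float = 1.0) -> float:
    """Selection score: ordinary task loss plus a penalty term that
    actively pushes the search away from agent-like architectures."""
    return task_loss + lam * agency_score(model)
```

Of course, this just relocates the whole difficulty into defining `agency_score` well enough that optimizing against it doesn't misfire.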
I think my main problem with this argument is that step 3 might make step 2 invalid: if you actively penalize agent-like architectures in your search, you may break the conditions that made 'generalizes too broadly' imply 'agent-like architecture', and so end up with things that still generalize very broadly (with all the downsides that entails) but just look a lot weirder.
Thanks for the links; I definitely agree that I was drastically oversimplifying this problem. I still think this task might be much simpler than trying to understand the generalization of some strange model whose internal workings we don't even have a vocabulary to describe.