Ivan Vendrov

Karma: 563

Ivan Vendrov 26 Apr 2022 1:38 UTC
LW: 9 AF: 4
0
AF
on: Supervise Process, not Outcomes
I don’t think I buy the argument for why process-based optimization would be an attractor. The proposed mechanism—an evaluator maintaining an “invariant that each component has a clear role that makes sense independent of the global objective”—would definitely achieve this, but why would the system maintainers add such an invariant? In any concrete deployment of a process-based system, they would face strong pressure to optimize end-to-end for the outcome metric.
I think the way process-based systems could actually win the race is something closer to “network effects enabled by specialization and modularity”. Let’s say you’re building a robotic arm. You could use a neural network optimized end-to-end to map input images into a vector of desired torques, or you could use a concatenation of a generic vision network and a generic action network, with a common object representation in between. The latter is likely to be much cheaper because the generic network training costs can be amortized across many applications (at least in an economic regime where training cost dominates inference cost). We see a version of this in NLP where nobody outside the big players trains models from scratch, though I’m not sure how to think about fine-tuned models: do they have the safety profile of process-based systems or outcome-based systems?
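A back-of-the-envelope sketch of the amortization argument above. All cost figures are made-up units chosen only for illustration, and `glue_cost` (the per-application cost of wiring generic components together) is a hypothetical parameter, not something from the post.

```python
def per_app_cost_end_to_end(train_cost):
    """Each application pays for its own end-to-end training run."""
    return train_cost


def per_app_cost_modular(vision_cost, action_cost, glue_cost, num_apps):
    """Generic networks are trained once and shared across num_apps applications;
    each application additionally pays a small integration (glue) cost."""
    return (vision_cost + action_cost) / num_apps + glue_cost


# Hypothetical numbers: the generic networks are individually more expensive
# than one specialized end-to-end model, but they are shared.
end_to_end = per_app_cost_end_to_end(train_cost=100)
modular_single = per_app_cost_modular(300, 200, glue_cost=5, num_apps=1)
modular_shared = per_app_cost_modular(300, 200, glue_cost=5, num_apps=100)
```

With one application the modular design is more expensive; with many, the shared training cost shrinks toward the glue cost. The comparison ignores inference cost, matching the stated regime where training cost dominates.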

Ivan Vendrov 26 Apr 2022 1:54 UTC
1 point
on: Supervise Process, not Outcomes
It’s not clear to me that as complexity increases, process-based systems are actually easier to reason about, debug, and render safe than outcome-based systems. If you tell me an ML system was optimized for a particular outcome in a particular environment, I can probably predict its behavior and failure modes much better than an equivalently performant human-written system involving 1000s of lines of code. Both types of systems can fail catastrophically with adversarially selected inputs, but it’s probably easier to automatically generate such inputs (and thus, to guard against them) for the ML system.
So it’s still plausible to me that our limited budget of human supervision should be spent on specifying the outcome better, rather than on specifying and improving complex modular processes.

Ivan Vendrov 8 Jun 2022 21:28 UTC
1 point
on: The easy goal inference problem is still hard
I agree that modelling all human mistakes seems about as hard as modelling all of human values, so straightforward IRL is not a solution to the goal inference problem, only a reshuffling of complexity.
However, I don’t think modelling human mistakes is fundamental to the goal inference problem in the way this post claims.
For example, you can imagine goal inference being solved along the lines of extrapolated volition: we give humans progressively more information about the outcomes of their actions and time to deliberate, and let the AI try to generalize to the limit of a human with infinite information and deliberation time (including time to deliberate about what information to attend to). It’s unclear whether this limiting generalization would count as a sufficiently “reasonable representation” to solve the easy goal inference problem, but it’s quite possible that it solves the full goal inference problem.
Another way we can avoid modelling all human mistakes is if we don’t try to model all of human values, just the ones that are relevant to catastrophic / disempowering actions the AI could take. It seems plausible that there’s a pretty simple description of some human cognitive limitations which if addressed, would eliminate the vast majority of risk, even if it can’t help the AI decide whether the human would prefer to design a new city (to use Paul’s example) more like New York or more like Jakarta. This would also count as a good-enough solution to the goal inference problem that doesn’t require solving the “easy goal inference problem” in the full generality stated here.

Ivan Vendrov 10 Jun 2022 2:49 UTC
10 points
0
on: AGI Ruin: A List of Lethalities
A lot of important warnings in this post. “Capabilities generalize further than alignment once capabilities start to generalize far” was novel to me and seems very important if true.
I don’t really understand the emphasis on “pivotal acts”, though; there seem to be tons of weak pivotal acts, e.g. ways in which narrow AI or barely-above-human-AGI could help coordinate a global emergency regulatory response by the AI superpowers. Still might be worth focusing our effort on the future worlds where no weak pivotal acts are available, but important to point out this is not the median world.

Ivan Vendrov 10 Jun 2022 16:17 UTC
5 points
5
in reply to: Eliezer Yudkowsky’s comment on: AGI Ruin: A List of Lethalities
Mind control is too extreme; I think world superpowers could be coordinated with levels of persuasion greater than one Eliezer but short of mind control. E.g. people are already building narrow persuasion AI capable of generating arguments that are highly persuasive for specific people. A substantially-superhuman but still narrow version of such an AI will very likely be built in the next 5 years, and could be used in a variety of weak pivotal acts (not even in a manipulative way! even a public demonstration of such an AI would make a strong case for coordination, comparable to various weapons treaties).

Ivan Vendrov 15 Jun 2022 18:47 UTC
1 point
AF
on: Alignment research exercises
** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant—it makes a number of assumptions about agents and utility functions and I wasn’t able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here’s my alternative answer:
CIRL agents are bottlenecked on the human overseer’s ability to provide them with a learning signal through demonstration or direct communication. This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.
In other words it’s only a solution to “Learn from Teacher” in Paul’s 2019 decomposition of alignment, not to the whole alignment problem.

Ivan Vendrov 17 Jun 2022 0:39 UTC
18 points
on: Humans are very reliable agents
Thought-provoking post, though as you hinted it’s not fair to directly compare “classification accuracy” with “accuracy at avoiding catastrophe”. Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time. Instead, human ability to avoid catastrophe is an ability to generate conservative action sequences in response to novel physical and social situations—e.g. if I’m driving and I see something I don’t understand up ahead I’ll slow down just in case.
I imagine if our goal was “never misclassify an MNIST digit” we could get to 6-7 nines of “worst-case accuracy” even out of existing neural nets, at the cost of saying “I don’t know” for the confusing 0.2% of digits.

Ivan Vendrov 17 Jun 2022 2:20 UTC
LW: 11 AF: 5
5
AF
on: A transparency and interpretability tech tree
This is very helpful as a roadmap connecting current interpretability techniques to the techniques we need for alignment.
One thing that seems very important but missing is how the tech tree looks if we factor in how SOTA models will change over time, including:
1. large (order-of-magnitude) increases in model size
2. innovations in model architectures (e.g. the LSTM → Transformer transition)
3. innovations in learning algorithms (e.g. gradient descent being replaced by approximate second-order methods or by meta-learning)
For example, if we restricted our attention to ConvNets trained on MNIST-like datasets we could probably get to tech level (6) very quickly. But would this help with solving transparency for transformers trained on language? And if we don’t expect it to help, why do we expect solving transparency for transformers will transfer over to the architectures that will be dominant 5 years from now?
My tentative answer would be that we don’t really know how much transparency generalizes between scales/architectures/learning algorithms, so to be safe we need to invest in enough interpretability work to both keep up with whatever the SOTA models are doing and to get higher and higher in the tech tree. This may be much, much harder than the “single tech tree” metaphor suggests.

Ivan Vendrov 17 Jun 2022 7:11 UTC
13 points
in reply to: Ash Gray’s comment on: Humans are very reliable agents
You’re right that it’s an ongoing research area, but there are a number of approaches that work relatively well. This NeurIPS tutorial describes a few. Probably the easiest thing is to use one of the calibration methods mentioned there to get your classifier to output calibrated uncertainties for each class, then say “I don’t know” if the network isn’t at least 90% confident in one of the 10 classes.
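A minimal sketch of the thresholding idea, assuming temperature scaling as the calibration step (a stand-in for whichever method you pick; it is not necessarily one the tutorial recommends). The logits and the temperature value are made up; in practice the temperature would be fit on a held-out set.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; temperature > 1 softens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_abstain(logits, threshold=0.9, temperature=1.0):
    """Return the predicted class index, or None for "I don't know"."""
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


# Hypothetical 10-class logits: one clear digit, one ambiguous between 4 and 9.
confident = [0.0] * 10
confident[3] = 8.0
ambiguous = [0.0] * 10
ambiguous[4] = 2.0
ambiguous[9] = 1.8

pred_confident = predict_or_abstain(confident)
pred_ambiguous = predict_or_abstain(ambiguous)
# An overconfident network corrected with temperature 2.0 now abstains.
pred_softened = predict_or_abstain(confident, temperature=2.0)
```

The threshold trades coverage for worst-case accuracy: raising it makes the classifier abstain more often, and the guarantee is only as good as the calibration.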

Ivan Vendrov 24 Jun 2022 20:45 UTC
2 points
in reply to: paulfchristiano’s comment on: Updated Deference is not a strong argument against the utility uncertainty approach to alignment
This is really helpful, thanks. Perhaps the only disagreement here is pedagogical; I think it’s more useful to point people excited about utility uncertainty to the easy goal inference problem is still hard and to Model Mis-specification and Inverse Reinforcement Learning, because these engage directly with the premises of the approach. Arguing that it violates corrigibility, a concept that doesn’t fit cleanly in the CIRL framework, is more likely to lead to confusion than understanding the problems (at least it did for me).
On the object level, I basically agree with Russell that a good enough solution to value learning seems very valuable since it expands the level of AI capabilities we can deploy safely in the world and buys us more time—basically the “stopgap” approach you mention. Composed with other agendas like automating AI alignment research, it might even prove decisive.
And framing CIRL in particular as a problem formalization rather than a solution approach seems right. I’ve found it very helpful to have a precise mathematical object like “CIRL” to point to when discussing the alignment problem with AI researchers, in contrast to the clusters of blog posts defining things like “alignment” and “corrigibility”.

Ivan Vendrov 24 Jun 2022 23:18 UTC
1 point
in reply to: Rohin Shah’s comment on: Is CIRL a promising agenda?
I’m not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak. I also found Paul Christiano’s response very helpful as an outline of objections to the utility uncertainty agenda.
Also relevant is this old comment from Rohin on difficulties with utility uncertainty.

Ivan Vendrov 27 Jun 2022 22:02 UTC
2 points
in reply to: tailcalled’s comment on: Updated Deference is not a strong argument against the utility uncertainty approach to alignment
My immediate reaction is: you should definitely update as far as you can and do this investigation! But no matter how much you investigate the learned preferences, you should still deploy your AI with some residual uncertainty because it’s unlikely you can update it “all the way”. Two reasons why this might be:
- Some of the data you will need to update all the way will require the superintelligent agent’s help to collect—e.g. collecting human preferences about the specifics of far future interstellar colonization seems impossible right now because we don’t know what is technologically feasible.
- You might decide that the human preferences we really care about are the outcomes of some very long-running process like the Long Reflection; then you can’t investigate the learned preferences ahead of time, but in the meantime still want to create superintelligences that safeguard the Long Reflection until it completes.

Ivan Vendrov 15 Jul 2022 23:21 UTC
3 points
1
in reply to: Algon’s comment on: Safety Implications of LeCun’s path to machine intelligence
I think it’s easier to interpret than model-free RL (provided the line between model and actor is maintained through training, which is an assumption LeCun makes but doesn’t defend) because it’s doing explicit model-based planning, so there’s a clear causal explanation for why the agent took a particular action—because it predicted that it would lead to a specific low-cost world state. It still might be hard to decode the world state representation, but much easier than decoding what the agent is trying to do from the activations of a policy network.
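A toy sketch of what I mean by a clear causal explanation, assuming a one-step planner over a made-up 1-D world. `world_model`, `cost`, and `plan` are hypothetical names and have nothing to do with the actual modules in LeCun's paper.

```python
def world_model(state, action):
    """Hypothetical 1-D dynamics: the action shifts the position."""
    return state + action


def cost(state, goal=5):
    """Hypothetical cost: distance from a goal position."""
    return abs(state - goal)


def plan(state, actions=(-1, 0, 1)):
    """Pick the action whose predicted next state has the lowest cost,
    and return the prediction alongside it as the explanation."""
    candidates = []
    for a in actions:
        predicted = world_model(state, a)
        candidates.append((cost(predicted), a, predicted))
    best_cost, best_action, predicted = min(candidates)
    return {
        "action": best_action,
        "predicted_state": predicted,
        "predicted_cost": best_cost,
    }


decision = plan(state=2)
```

The returned dictionary is the explanation: the action was chosen because the world model predicted it leads to that state at that cost. A model-free policy network would give you only the action.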
Not obvious to me that it will be a utility maximizer, but definitely dangerous by default. In a world where this architecture is dominant, we probably have to give up on getting intent alignment and fall back to safety guarantees like “well it behaved well in all of our adversarial simulations, and we have a powerful supervising process that will turn it off if the plans look fishy”. Not my ideal world, but an important world to consider.

Ivan Vendrov 16 Jul 2022 0:34 UTC
8 points
0
in reply to: Algon’s comment on: Safety Implications of LeCun’s path to machine intelligence
The configurator dynamically modulates the cost function, so the agent is not guaranteed to have the same cost function over time, hence can be dutch booked / violate VNM axioms.
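A toy money pump illustrating the point, under the assumption that a trader can anticipate when the cost function switches. The items, the fee, and the alternating schedule are all made up, and this is not a model of the actual configurator.

```python
def make_cost(preferred):
    """Cost function assigning 0 to the preferred item and 1 to anything else."""
    return lambda item: 0.0 if item == preferred else 1.0


def accepts_trade(cost, held, offered, fee):
    """The agent trades if the reduction in cost exceeds the fee."""
    return cost(held) - cost(offered) > fee


def run_money_pump(modes, start_item="apple", start_money=10.0, fee=0.1):
    """Offer the agent a paid swap at every step and return what it ends up with."""
    item, money = start_item, start_money
    for preferred in modes:  # the configurator sets the cost function for this step
        cost = make_cost(preferred)
        offered = "banana" if item == "apple" else "apple"
        if accepts_trade(cost, item, offered, fee):
            item, money = offered, money - fee
    return item, money


# Cost function flips every step: the agent pays for ten swaps and ends where it began.
pumped_item, pumped_money = run_money_pump(["banana", "apple"] * 5)
# Cost function held fixed: the agent declines every offer.
stable_item, stable_money = run_money_pump(["apple"] * 10)
```

Each individual trade looks like an improvement under the cost function active at that moment, yet the sequence leaves the agent holding the same item with less money, which is the sense in which it fails to act like a single VNM-rational agent.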

Ivan Vendrov 16 Jul 2022 2:05 UTC
LW: 1 AF: 1
−2
AF
on: PSA about differential technological development
I like the distinction between parallelizable and serial research time, and agree that there should be a very high bar for shortening AI timelines and eating up precious serial time.
One caveat to the claim that we should prioritize serial alignment work over parallelizable work is that it assumes an omniscient and optimal allocator of researcher-hours to problems. Insofar as this assumption doesn’t hold (because our institutions fail, or because the knowledge about how to allocate researcher-hours itself depends on the outcomes of parallelizable research) the distinction between parallelizable and serial work breaks down and other considerations dominate.

Ivan Vendrov 16 Jul 2022 2:25 UTC
LW: 1 AF: 1
0
AF
in reply to: Ivan Vendrov’s comment on: PSA about differential technological development
Also, on a re-read I notice that all the examples given in the post relate to mathematics or theoretical work, which is almost uniquely serial among human activities. By contrast, engineering disciplines are typically much more parallelizable, as evidenced by the speedup in technological progress during war-time.

Ivan Vendrov 16 Jul 2022 17:24 UTC
2 points
1
in reply to: aogara’s comment on: Safety Implications of LeCun’s path to machine intelligence
Ah you’re right, the paper never directly says the architecture is trained end-to-end—updated the post, thanks for the catch.
He might still mean something closer to end-to-end learning, because:
1. The world model is differentiable w.r.t. the cost (Figure 2), suggesting it isn’t trained purely using self-supervised learning.
2. The configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that this can be done well if these are all swappable black boxes. So there is likely some phase of co-adaptation between configurator, actor, cost, and world model.

Ivan Vendrov 16 Jul 2022 17:41 UTC
6 points
4
in reply to: Evan R. Murphy’s comment on: Safety Implications of LeCun’s path to machine intelligence
My read of LeCun in that conversation is that he doesn’t think in terms of outer alignment / value alignment at all, but rather in terms of implementing a series of “safeguards” that allow humans to recover if the AI behaves poorly (See Steven Byrnes’ summary).
I think this paper helps clarify why he believes this—he had something like this architecture in mind, and so outer alignment seemed basically impossible. Independently, he believes it’s unnecessary because the obvious safeguards will prove sufficient.

Ivan Vendrov 17 Jul 2022 19:38 UTC
12 points
0
in reply to: Steven Byrnes’s comment on: Open & Welcome Thread—July 2022
My preferred adblocker, uBlock Origin, lets you right-click on any element on a page and block it, with a nice UI that lets you set the specificity and scope of the block. Takes about 10 seconds, much easier than mucking with JS yourself. I’ve done this to hide like & follower counts on twitter, just tried and it works great for LessWrong karma. It can’t do “hide karma only for your comments within last 24 hours” but thought this might be useful for others who want to hide karma more broadly.

Ivan Vendrov 18 Jul 2022 0:40 UTC
1 point
0
in reply to: Rob Bensinger’s comment on: PSA about differential technological development
Sorry, I didn’t mean to imply that these are logical assumptions necessary for us to prioritize serial work; but rather insofar as these assumptions don’t hold, prioritizing work that looks serial to us is less important at the margin.
Spelling out the assumptions more:
1. Omniscient meaning “perfect advance knowledge of what work will turn out to be serial vs parallelizable.” In practice I think this is very hard to know beforehand—a lot of work that turned out to be part of the “serial bottleneck” looked parallelizable ex ante.
2. Optimal meaning “institutions will actually allocate enough researchers to the problem in time for the parallelizable work to get done”. Insofar as we don’t expect this to hold, we will lose even if all the serial work gets done in time.