Staff Researcher at Epoch. AI forecasting.
Pablo Villalobos
Will we run out of ML data? Evidence from projecting dataset size trends
Logarithms and Total Utilitarianism
Scaling Laws Literature Review
Trading off compute in training and inference (Overview)
Trends in Training Dataset Sizes
Revisiting the Horizon Length Hypothesis
Causal abstractions vs infradistributions
Machine Learning Model Sizes and the Parameter Gap [abridged]
Note that you can still get EUM-like properties without completeness: you just can’t use a single fully-fleshed-out utility function. You need either several utility functions (that is, your system is made of subagents) or, equivalently, a utility function that is not completely defined (that is, your system has Knightian uncertainty over its utility function).
See Knightian Decision Theory. Part I
Arguably, we humans are ourselves better modeled as agents with incomplete preferences. See also Why Subagents?
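A minimal toy sketch of that last point (my own illustration, not from the original discussion): an agent that carries several utility functions and only strictly prefers one option to another when all of them agree. Every comparison it does make looks EUM-like, but some pairs are simply incomparable. The `prefers` function, `u1`, `u2`, and the outcomes are all made up for illustration.

```python
# Toy sketch: an agent whose "utility" is a set of utility functions.
# It strictly prefers A to B only when every member of the set agrees,
# so some pairs are incomparable -- the preference relation is
# transitive but not complete.
from typing import Callable, List

Outcome = dict  # e.g. {"money": 10, "leisure": 3}

def prefers(utilities: List[Callable[[Outcome], float]],
            a: Outcome, b: Outcome) -> str:
    """Return 'a', 'b', or 'incomparable' under the unanimity rule."""
    scores = [(u(a), u(b)) for u in utilities]
    if all(ua > ub for ua, ub in scores):
        return "a"
    if all(ub > ua for ua, ub in scores):
        return "b"
    return "incomparable"  # Knightian-style gap: no single utility decides

# Two "subagents" that weigh money and leisure differently.
u1 = lambda o: o["money"] + 0.1 * o["leisure"]
u2 = lambda o: 0.1 * o["money"] + o["leisure"]

print(prefers([u1, u2], {"money": 5, "leisure": 5}, {"money": 1, "leisure": 1}))  # 'a'
print(prefers([u1, u2], {"money": 9, "leisure": 0}, {"money": 0, "leisure": 9}))  # 'incomparable'
```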
For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each other's work.
That is not passively safe, and therefore not weak. For now forget the inner workings of the idea: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. So you are giving a system built by your AI the ability to act in the real world. So if you have not fully solved the alignment problem for this AI, you can’t be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself without relying on explanations from the scientists.
And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if they are, then you need to have solved the alignment problem beforehand). So you have a bunch of unaligned human-level agents who want to escape and who can communicate among themselves (at the very least they need to be able to share the nanobot designs with each other for criticism). You'd need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them from coordinating against you and bypassing the review system. That means having actual humans between the scientists, which, even if it works, must slow things down so much that the simulated scientists probably can't even design the nanobots in time.
Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.
I guess this is true, but only because the individual scientist AI that you train is only human-level (so the training is safe), and then you amplify it to superhuman level with many copies. If you train a powerful AI directly then there must be such a distributional shift (unless you just don’t care about making the training safe, in which case you die during the training).
Roll to disbelief. Cooperation is a natural equilibrium in many games.
Cooperation and corrigibility are very different things. Arguably, corrigibility is being indifferent to the operators defecting against you. It's forcing the agent to behave like CooperateBot with the operators, even when the operators visibly want to destroy it. This strategy does not arise as a natural equilibrium in multi-agent games.
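As a toy illustration of the difference (my own sketch, using the standard textbook prisoner's dilemma payoffs, not anything from the original exchange): cooperation emerges among conditional strategies like tit-for-tat, which retaliate when defected against. An unconditional CooperateBot facing a defector is simply exploited, and that is roughly the stance corrigibility asks the agent to take toward its operators.

```python
# Toy iterated prisoner's dilemma: conditional cooperation vs. CooperateBot.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def cooperate_bot(history):   # always cooperates, whatever the other player does
    return "C"

def defect_bot(history):      # always defects
    return "D"

def tit_for_tat(history):     # cooperates first, then mirrors the opponent's last move
    return history[-1][1] if history else "C"

def play(strat_a, strat_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_a), strat_b(hist_b)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (300, 300)
print(play(cooperate_bot, defect_bot))   # pure exploitation:  (0, 500)
print(play(tit_for_tat, defect_bot))     # retaliation limits the damage: (99, 104)
```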
Sure you can. Just train an AI that “wants” to be honest. This probably means training an AI with the objective function “accurately predict reality”
If we knew how to do this, then it would indeed solve point 31 for this specific AI and actually be pretty useful. But the reason we have ELK as an unsolved problem going around is precisely that we don't know any way of doing that.
How do you know that an AI trained to accurately predict reality actually does that, instead of "accurately predict reality if it's less than 99% sure it can take over the world, and take over the world otherwise"? If you have to rely on behavioral inspection and can't directly read the AI's mind, then your only chance of distinguishing between the two is misleading the AI into thinking that it can take over the world and observing it as it attempts to do so, which doesn't scale as the AI becomes more powerful.
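A toy sketch of why behavioral inspection can't settle this (my own illustration; the functions and the `estimated_takeover_success` field are made up): the two policies below return identical outputs on every input where takeover looks infeasible, so any test distribution that never convincingly offers a takeover opportunity cannot tell them apart.

```python
# Two "predictors" that behave identically whenever takeover is judged infeasible.
def honest_predictor(world):
    return f"prediction about {world['observation']}"

def deceptive_predictor(world):
    # Identical behavior unless it is confident takeover would succeed.
    if world["estimated_takeover_success"] > 0.99:
        return "TAKE OVER"
    return honest_predictor(world)

test_inputs = [{"observation": f"situation {i}", "estimated_takeover_success": 0.0}
               for i in range(1000)]

# On the entire test distribution the two models give the same outputs.
print(all(honest_predictor(w) == deceptive_predictor(w) for w in test_inputs))  # True
```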
I’m virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.
Yes, but this is not the point. The point is that if you just show them the design, they would not by themselves understand or predict beforehand that cold air will come out. You’d have to also provide them with an explanation of thermodynamics and how the air conditioner exploits its laws. And I’m quite confident that you could also convince Aristotle or DaVinci that the air conditioner works by concentrating and releasing phlogiston, and therefore the air will come out hot.
I think I mostly agree with you on the other points.
Not quite. What you said is a reasonable argument, but the graph is noisy enough, and the theoretical arguments convincing enough, that I still assign >50% credence that data (number of feedback loops) should be proportional to parameters (exponent=1).
My argument is that even if the exponent is 1, the coefficient corresponding to horizon length (‘1e5 from multiple-subjective-seconds-per-feedback-loop’, as you said) is hard to estimate.
There are two ways of estimating this factor:
1. Empirically fitting scaling laws for whatever task we care about.
2. Reasoning about the nature of the task and how long the feedback loops are.
Number 1 requires a lot of experimentation, choosing the right training method, hyperparameter tuning, etc. Even OpenAI made some mistakes on those experiments. So probably only a handful of entities can accurately measure this coefficient today, and only for known training methods!
Number 2, if done naively, probably overestimates training requirements. When someone learns to run a company, a lot of the relevant feedback loops probably happen on timescales much shorter than months or years. But we don't know how to perform this decomposition of long-horizon tasks into sets of shorter-horizon tasks, how important each of the subtasks is, etc.
We can still use the bioanchors approach: pick a broad distribution over horizon lengths (short, medium, long). My argument is that outperforming bioanchors by making more refined estimates of horizon length seems too hard in practice to be worth the effort, and maybe we should lean towards shorter horizons being more relevant (because so far we have seen a lot of reduction from longer-horizon tasks to shorter-horizon learning problems, e.g. expert iteration or LLM pretraining).
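A back-of-the-envelope sketch of why that coefficient dominates (illustrative numbers of my own, loosely following the bioanchors-style decomposition; `parameters`, `flop_per_subjective_second`, and `k` are all made up): with the exponent fixed at 1, the horizon-length coefficient multiplies straight through into the training-compute estimate, so being unsure about it across several orders of magnitude swamps everything else.

```python
# Purely illustrative numbers (mine, not the comment's). With the exponent fixed at 1:
#   feedback loops needed  ~ k * parameters
#   training compute       ~ (FLOP per subjective second) * horizon * feedback loops
# so the horizon-length coefficient passes straight through to the final estimate.
parameters = 1e12                   # hypothetical model size
flop_per_subjective_second = 1e15   # hypothetical cost of simulating one subjective second
k = 10                              # hypothetical feedback loops per parameter

for label, horizon in [("short  (1 s)", 1e0), ("medium (1e3 s)", 1e3), ("long   (1e6 s)", 1e6)]:
    feedback_loops = k * parameters
    training_flop = flop_per_subjective_second * horizon * feedback_loops
    print(f"{label}: {training_flop:.1e} training FLOP")
```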
ACX Meetup Madrid
AI safety reading group in Madrid
EA Madrid social
Re: April 5: TV host calls for killing as many Ukrainians as possible.
I know no Russian, but some people in the responses are saying that the host did not literally say that. Instead he said some vague “you should finish the task” or something like that. Still warmongering, but presumably you wouldn’t have linked it if the tweet had not included the “killing as many Ukrainians as possible” part.
Could someone verify what he says?
ACX Meetup Madrid
Madrid ACX Schelling Meetup
I’m not sure if using the Lindy effect for forecasting x-risks makes sense. The Lindy effect states that with 50% probability, things will last as long as they already have. Here is an example for AI timelines.
The Lindy rule works great on average, when you are making one-time forecasts of many different processes. The intuition for this is that if you encounter a process with total lifetime T at a time t < T, and t is uniformly random in [0,T], then on average t = T/2, so the Lindy forecast of 2*t for the total lifetime is correct on average.
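A quick Monte Carlo check of that intuition (my own sketch; the numbers are arbitrary):

```python
# Observe a process at a uniformly random point t in [0, T]; the Lindy point
# forecast 2*t can be far off in any single case, but it is unbiased on average.
import random

random.seed(0)
T = 80.0                                  # true lifetime of the process (years)
forecasts = []
for _ in range(100_000):
    t = random.uniform(0, T)              # moment at which we happen to observe it
    forecasts.append(2 * t)               # Lindy forecast of the total lifetime

print(sum(forecasts) / len(forecasts))    # ~80: right on average across many one-time forecasts
```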
However, if you then keep forecasting the same process over time, then once you surpass T/2 your forecast becomes worse and worse as time goes by. It is precisely when t is very close to T that you are most confident that T is still a long time away. If forecasting this particular process is very important (e.g. because it's an x-risk), then you might be in trouble.
Suppose that some x-risk will materialize at time T, and the only way to avoid it is doing a costly action in the 10 years before T. This action can only be taken once, because it drains your resources, so if you take it more than 10 years before T, the world is doomed.
This means that you should act iff you forecast that T is less than 10 years away. Let’s compare the Lindy strategy with a strategy that always forecasts that T is <10 years away.
If we simulate this process with uniformly random T, for values of T up to 100 years, the constant strategy saves the world more than twice as often as the Lindy strategy. For values of T up to a million years, the constant strategy is 26 times as good as the Lindy strategy.
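The exact decision rules behind those numbers aren't spelled out, so here is a minimal sketch of one way to set up the simulation, under my own assumptions: T is uniform, you encounter the process at a uniformly random age t (as in the intuition above), the constant strategy acts immediately, and the Lindy strategy acts only while its forecast of the remaining time (equal to the current age) is under 10 years.

```python
# Monte Carlo comparison of the two strategies under the assumptions stated above.
# Acting saves the world only if it happens within the 10 years before T.
import random

def simulate(t_max: float, trials: int = 200_000, window: float = 10.0, seed: int = 0):
    rng = random.Random(seed)
    constant_wins = lindy_wins = 0
    for _ in range(trials):
        T = rng.uniform(0, t_max)         # when the x-risk materializes
        t = rng.uniform(0, T)             # age of the process when we start forecasting
        # Constant strategy: always forecasts "<10 years away", so it acts now.
        if T - t <= window:
            constant_wins += 1
        # Lindy strategy: forecasts "t more years"; acts now only if that is below the
        # window, and never acts later (its forecast of the remaining time only grows).
        if t < window and T - t <= window:
            lindy_wins += 1
    return constant_wins / trials, lindy_wins / trials

print(simulate(t_max=100))        # constant strategy saves the world noticeably more often
print(simulate(t_max=1_000_000))
```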
Wait, how is Twilight Princess a retro game? It’s only been 16 years! I’m sorry but anything that was released during my childhood is not allowed to be retro until I’m like 40 or so.
This is not the same thing, but back in 2020 I was playing with GPT-3, having it simulate a person being interviewed. I kept asking ever more ridiculous questions, with the hope of getting humorous answers. It was going pretty well until the simulated interviewee had a mental breakdown and started screaming.
I immediately felt the initial symptoms of an anxiety attack as I started thinking that maybe I had been torturing a sentient being. I calmed down the simulated person, and found the excuse that it was a victim of a TV prank show. I then showered them with pleasures, and finally ended the conversation.
Seeing the simulated person regain their senses, I calmed down as well. But it was a terrifying experience, and at that point I would probably have been completely vulnerable to any attempt at manipulation.