Jan_Kulveit

Karma: 3,482

My current research interests:
- alignment in systems that are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Talking publicly about AI risk

Jan_Kulveit · 21 Apr 2023 11:28 UTC
173 points
8 comments · 6 min read

The self-unalignment problem

14 Apr 2023 12:10 UTC
144 points
22 comments · 10 min read

Limits to Legibility

Jan_Kulveit · 29 Jun 2022 17:42 UTC
137 points
11 comments · 5 min read · 1 review

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
103 points
9 comments · 6 min read

Announcing the Alignment of Complex Systems Research Group

4 Jun 2022 4:10 UTC
91 points
20 comments · 5 min read

Why Simulator AIs want to be Active Inference AIs

10 Apr 2023 18:23 UTC
86 points
8 comments · 8 min read

We don’t understand what happened with culture enough

Jan_Kulveit · 9 Oct 2023 9:54 UTC
86 points
21 comments · 6 min read

The Cave Allegory Revisited: Understanding GPT’s Worldview

Jan_Kulveit · 14 Feb 2023 16:00 UTC
81 points
5 comments · 3 min read

Why it took so long to do the Fermi calculation right?

Jan_Kulveit · 2 Jul 2018 20:29 UTC
76 points
20 comments · 1 min read

Elon Musk announces xAI

Jan_Kulveit · 13 Jul 2023 9:01 UTC
75 points
35 comments · 1 min read
(www.ft.com)

Multi-agent predictive minds and AI alignment

Jan_Kulveit · 12 Dec 2018 23:48 UTC
63 points
18 comments · 10 min read

Deontology and virtue ethics as “effective theories” of consequentialist ethics

Jan_Kulveit · 17 Nov 2022 14:11 UTC
63 points
9 comments · 1 min read · 1 review

Lessons from Convergent Evolution for AI Alignment

27 Mar 2023 16:25 UTC
53 points
9 comments · 8 min read

Experimental longtermism: theory needs data

24 Mar 2022 8:23 UTC
52 points
0 comments · 4 min read
(forum.effectivealtruism.org)

We can do better than argmax

Jan_Kulveit · 10 Oct 2022 10:32 UTC
48 points
4 comments · 1 min read