My current research interests:
- alignment in complex, messy systems composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality
Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
I’m not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.
Principles which counteract convergent instrumental goals
1. Disutility from resource acquisition—e.g. penalized via some mutual-information measure between the AI and distant parts of the environment
2. Task uncertainty with a reasonable prior on goal drift—the system is unsure about the task it is trying to do and seeks human input about it.
3. An AI which ultimately wants not to exist in the future as a terminal goal; fulfilling the task is on the simplest trajectory to non-existence.
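A toy sketch of how principle 1 could be operationalized. Everything here is an illustrative assumption, not a worked-out proposal: I assume discrete states, a plug-in empirical estimate of mutual information from sampled rollouts, and a hypothetical penalty weight `lam`.

```python
import numpy as np

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    for x, y in zip(xs, ys):
        joint[x, y] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of X
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def penalized_return(task_reward, agent_states, distant_env_states, lam=1.0):
    """Task reward minus a disutility term for statistical dependence
    between the agent and distant parts of the environment—a crude
    proxy for resource acquisition."""
    return task_reward - lam * empirical_mi(agent_states, distant_env_states)
```

If the agent's states are independent of the distant environment, the penalty is near zero; the more the agent's state predicts (i.e., has entangled itself with) distant resources, the larger the disutility.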
Principles which counteract unbounded rationality
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but grow steep quickly.
5. Satisficing (mentioned)
6. Myopia (mentioned)
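A toy sketch combining principles 4–6. All the specifics are illustrative assumptions: I measure "reasoning cost" as search steps rather than literal bit-flips, and the function names, thresholds, and the superlinear cost shape are made up for the example.

```python
def compute_cost(steps, eps=1e-6, power=2.0):
    """Principle 4: a penalty that is tiny for small amounts of
    reasoning but grows superlinearly (gets steep fast)."""
    return eps * steps ** power

def satisficing_search(candidates, utility, threshold, budget):
    """Principles 5-6: accept the first plan that is good enough
    (satisficing) within a bounded horizon (a myopia-like budget),
    charging a compute penalty for each candidate examined."""
    best = None
    for steps, plan in enumerate(candidates[:budget], start=1):
        score = utility(plan) - compute_cost(steps)
        if score >= threshold:
            return plan, score  # good enough: stop searching
        if best is None or score > best[1]:
            best = (plan, score)
    return best  # budget exhausted: return the best seen so far
```

Note the agent stops at the first acceptable plan rather than continuing to the maximizing one—exactly the behaviour an unbounded optimizer would not exhibit.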
Traps
7. Tripwire artifacts. Tampering with parts of the environment that are unknown to the system and unrelated to the task is associated with large negative rewards.
8. External watchdogs. Smaller, faster external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.
Oversight
10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits of violating property rights or similar
13. Red-teaming of action plans, AI-debate style, feeding into the previous models
Interpretability
14. Imposing strong incentives for internal modularity, and interpretable messaging across module boundaries
15. Human-level explanations, produced by an independent “translator” system
I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words, written in a compelling style, in something between a few hours and a few days. I don’t think it makes any sense for me to do that. (I’d happily agree with claims of the type “Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics”, but my understanding of the claim is more in the direction of “all principles except 2 in this list were invented by Eliezer, and no one else invented/can invent any other ones”.)