1) From the LW user perspective, AF is integrated in a way which signals there are two classes of users, where the AF members are something like “the officially approved experts” (specialists, etc.), complete with omega badges, special karma, an application process, etc. In such a setup it is hard for the status-tracking subsystem which humans generally have to not care about what is “high status”. At the same time: I went through the list of AF users, and it seems like a much better representation of something Rohin called “viewpoint X” than of the field of AI alignment in general. I would expect some subtle distortion as a result.
2) The LW team seems quite keen on e.g. karma, cash prizes on questions, omegas, daily karma updates, and similar technical measures which from an S2-centric view bring clear benefits (sorting of comments, credible signalling of interest in questions, creating a high-context environment for experts, ...). Often these likely have important effects on S1 motivations / social interactions / etc.: I’ve discussed karma and omegas before; creating an environment driven by prizes risks eroding the spirit of cooperativeness and sharing of ideas which is one of the virtues of the AI safety community; and so on. “Herding elephants with small electric jolts” is a poetic description of the effect downvotes and strong downvotes have on people’s S1.
As a datapoint—my reasons for mostly not participating in discussion here:
The karma system messes with my S1 motivations and research taste; I do not want to update toward “LW average taste”, and I don’t think LW average taste is that great. Also, IMO, on the margin it is better for the field to add people who are trying to orient themselves in AI alignment independently, rather than people guided by “what’s popular on LW”
Commenting seems costly; it feels like comments are expected to be written very clearly and in a reader-friendly way, which takes time
Posting seems super-costly; my impression is many readers are calibrated on the writing quality of Eliezer, Scott & the like, not on informal research conversation
Quality of debate on topics I find interesting is much worse than in person
Not the top reason, but still… the system of AF members vs. hoi polloi, omegas, etc. creates a subtle corruption/distortion field. My overall vague impression is that the LW team generally tends to like solutions which look theoretically nice, and tends not to see the subtler impacts on the elephants. Where my approach would be to try to move much of the elephants-playing-status-games out of the way, what’s attempted here sometimes feels a bit like herding the elephants with small electric jolts.
No. It’s planned so you can attend both events.
FWIW I also think it’s quite possible the current equilibrium is decent (which is part of the reason why I did not post something like “How I turned karma off”, with simple instructions on how to do it on the forum, which I did consider). On the other hand I’d be curious about more people trying it and reporting their experiences.
I suspect many people kind of don’t have this action in the space of things they usually consider; I’d expect what most people would do is 1) just stop posting, 2) write about their negative experience, or 3) complain privately.
Actually I turned off the karma for all comments, not just mine. The bold claim is that my individual taste in what’s good on the EA Forum is in important ways better than the karma system, and that the karma signal is similar to the sounds made by a noisy mob. If I want, I can actually predict reasonably well what sounds the crowd will make on average, so it is not a new source of information. But it still messes with your S1 processing and motivations.
Continuing with the party metaphor, I think it is generally not that difficult to understand what sort of behaviour will make you popular at a party, and what sorts of behaviours, even when they are quite good in the broader scheme of things, will make you unpopular at parties. Also, personally I often feel something like “I actually want to have good conversations about juicy topics in a quiet place; unfortunately all of you are congregating in this super loud space, with all these status games, social signals, and ethically problematic norms for how to treat other people” toward most parties.
Overall I posted this here because it seemed like an interesting datapoint. Generally I think it would be great if people moved toward writing information-rich feedback instead of voting, so such a shift seems good. From what I’ve seen on the EA Forum it’s quite rarely “many people” doing anything. More often it is something like 6 users upvoting a comment and 1 user strongly downvoting it, resulting in a karma of 2. I would guess you may be at greater risk of the distorted perception that this represents some meaningful opinion of the community. (Also I see some important practical cases where people are misled by the “noises of the crowd” and it influences them in a harmful way.)
What I noticed on the EA Forum is that the whole karma thing messes with my S1 processes and makes me unhappy on average. I’ve not only turned off the notifications, but also hidden all karma displays in comments via CSS, and the experience is much better.
Reasons for some cautious optimism
in Part I, it can be the case that human values are actually a complex combination of easy-to-measure goals + complex world models, so the structure of the proxies will be able to represent what we really care about. (I don’t know. Also, the result can still stop representing our values with further scaling and evolution.)
in Part II, it can be the case that influence-seeking patterns are more computationally costly than straightforward patterns, and they can be in part suppressed by optimising for processing costs, bounded-rationality style. To some extent, influence-seeking patterns attempting to grow and control the whole system seem to me to be something that also happens within our own minds. I would guess some combination of immune system + metacognition + bounded rationality + stabilisation by complexity is stabilising many human minds. (I don’t know if any of that can scale arbitrarily.)
Short summary of why the linked paper is important: you can think about bias as some sort of perturbation. You are then interested in how the perturbation cascades through the system, and especially in quantities like the distribution of cascade sizes. The universality classes tell you this can be predicted by just a few parameters (Table 1 in the linked paper), depending mainly on the local dynamics (forecaster-forecaster interactions). Now, if you have a good model of the local dynamics, you can determine the parameters and determine into which universality class the problem belongs. You can also try to infer the dynamics if you have good data on your interactions.
I’m afraid I don’t know enough about how “forecasting communities” work to be able to give you good guesses about the points of leverage. One quick idea, if you have everybody on the same platform, may be to do some sort of A/B experiment: manipulate the data so that some forecasters see the predictions of others with an artificially introduced perturbation, and see how their output differs from the control group. If you have data on “individual dynamics” like that, and some knowledge of the network structure, the theory can help you predict the cascade size distribution.
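To make the cascade-size part concrete, here is a minimal sketch (my own illustrative choices: a fixed-probability adoption rule on an Erdős–Rényi graph via networkx, not the model from the linked paper) of perturbing single nodes and recording how far the perturbation spreads:

```python
# Minimal sketch: measure the distribution of cascade sizes when single nodes
# are perturbed on a random "forecaster" network. The adoption rule and the
# Erdős–Rényi topology are illustrative choices, not the paper's model.
import networkx as nx
import numpy as np

def cascade_size(graph, seed_node, adopt_prob=0.3, rng=None):
    """Perturb seed_node and count how many nodes the perturbation reaches."""
    rng = rng or np.random.default_rng()
    affected = {seed_node}
    frontier = [seed_node]
    while frontier:
        node = frontier.pop()
        for neighbour in graph.neighbors(node):
            # each exposed neighbour picks up the perturbation with prob adopt_prob
            if neighbour not in affected and rng.random() < adopt_prob:
                affected.add(neighbour)
                frontier.append(neighbour)
    return len(affected)

rng = np.random.default_rng(0)
g = nx.erdos_renyi_graph(n=500, p=0.01, seed=0)   # toy forecaster-interaction network
sizes = [cascade_size(g, s, rng=rng) for s in g.nodes]

# The shape of this distribution (sharp cutoff vs. heavy tail) is the kind of
# quantity the universality-class results let you predict from a few parameters.
print("median / 90th / 99th percentile cascade size:", np.percentile(sizes, [50, 90, 99]))
```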
(I also apologize for not being more helpful, but I really don’t have time to work on this for you.)
I was a bit confused by “we … but aren’t sure how to reason quantitatively about the impacts, and how much the LW community could together build on top of our preliminary search”, which seemed to nudge toward original research. Outsourcing literature reviews, distillation or extrapolation seems great.
Generally, there is a substantial literature on the topic within the field of network science. The right keywords for Google Scholar are something like “spreading dynamics in complex networks”; “information cascades” does not seem to be the best choice of keywords.
There are many options for how you can model the state of a node (discrete states, oscillators, continuous variables, vectors of any of the above, ...), multiple options for how you represent the dynamics (something like the Ising model / softmax, versions of the voter model, oscillator coupling, …), and multiple options for how you model the topology (graphs with weighted or unweighted edges, adaptive wiring or not, topologies based on SBM, scale-free networks, Erdős–Rényi, Watts–Strogatz, or real-world network data, …). This creates a fairly large space of options, most of which have usually already been explored somewhere in the literature.
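As a toy illustration of one point in this space (binary node states, voter-model dynamics, Watts–Strogatz topology; these are my own arbitrary choices, and swapping the graph constructor or the update rule moves you elsewhere in the space):

```python
# Toy example of one point in the (state, dynamics, topology) space:
# binary node states, voter-model dynamics, Watts–Strogatz topology.
import networkx as nx
import numpy as np

rng = np.random.default_rng(1)

# Topology: swap this line for nx.erdos_renyi_graph, nx.barabasi_albert_graph,
# a stochastic block model, real-world network data, etc.
g = nx.watts_strogatz_graph(n=200, k=6, p=0.1, seed=1)

# State: one binary "opinion" per node (could be continuous, a vector, ...).
state = {node: int(rng.integers(0, 2)) for node in g.nodes}

# Dynamics: voter-model update (copy a random neighbour's opinion);
# swap this for Ising/softmax updates, threshold rules, oscillator coupling, ...
nodes = list(g.nodes)
for _ in range(10_000):
    node = nodes[rng.integers(len(nodes))]
    neighbours = list(g.neighbors(node))
    if neighbours:
        state[node] = state[neighbours[rng.integers(len(neighbours))]]

print("fraction holding opinion 1:", np.mean(list(state.values())))
```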
Possibly the single most important thing to know about this: there are universality classes of systems which exhibit similar behaviour, so you can often ignore the details of the dynamics/topology/state representation.
Overall I would suggest approaching this with some intellectual humility and studying the existing research more, rather than trying to reinvent a large part of network science on LessWrong. (My guess is something like >2000 research years have been spent on the topic, often by quite good people.)
It would be cool to try some style-matching between the text and images. Ultimately, there could be some “personality vector” used in both image and text generation. (A very crude version could be to create an NN translator from the style space to word2vec space and include the resulting words in the GPT prompts.)
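A minimal sketch of that crude version; everything here (the dimensions, the vocabulary, the random embeddings) is hypothetical and only illustrates the data flow:

```python
# Hypothetical sketch of the "crude version": map an image style vector into
# word2vec space and append the nearest vocabulary words to a text prompt.
# Dimensions, vocabulary, and embeddings below are made up for illustration.
import torch
import torch.nn as nn

STYLE_DIM, W2V_DIM = 64, 300

# The "NN translator" from style space to word2vec space. In practice it would
# be trained on (style vector, caption-word embedding) pairs; here it is left
# untrained just to show the data flow.
translator = nn.Sequential(
    nn.Linear(STYLE_DIM, 256), nn.ReLU(),
    nn.Linear(256, W2V_DIM),
)

vocab = ["serene", "chaotic", "minimal", "ornate", "gloomy", "playful"]
word_vectors = torch.randn(len(vocab), W2V_DIM)   # stand-in for real word2vec rows

style = torch.randn(STYLE_DIM)                    # stand-in for an image style embedding
projected = translator(style)

# Cosine similarity to every vocabulary word; take the closest few as prompt keywords.
sims = torch.cosine_similarity(projected.unsqueeze(0), word_vectors)
keywords = [vocab[i] for i in sims.topk(3).indices.tolist()]

prompt = "Write a short scene description. Style keywords: " + ", ".join(keywords)
print(prompt)
```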
As I see it, a big part of the problem is that there is an inherent tension between “concrete outcomes avoiding general concerns with human models” and “how systems interacting with humans must work”. I would expect that the more you want to avoid general concerns with human models, the more “impractical” suggestions you get; in other words, the tension between “Problems with h.m.” and “Difficulties without h.m.” is a tradeoff you cannot avoid by conceptualisation.
I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of “actually human-model-free”. Comparison to the benchmark may then serve as a heuristic pointing to where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the “coarse-graining” of the state space is done is actually a point where Goodharting may happen, and thinking in that direction can maybe even lead to some intuitions about how much information about humans got in.
One possible counterargument to the conclusion of the OP is that the main “tuneable” parameters we are dealing with are I. “modelling humans explicitly vs. modelling humans implicitly” and II. “total amount of human modelling”. Then it is possible that competitive systems exist only in some part of this space, and by pushing hard on the “total amount of human modelling” parameter we can get systems which do less human modelling, but when they do it, it happens mostly in implicit, hard-to-understand ways.
I’m afraid it is generally infeasible to avoid modelling humans at least implicitly. One reason for that is that basically any practical ontology we use is implicitly human. In a sense the only knowledge that is not even implicitly human is quantum field theory (and even that is not clear).
For example: while human-independent methods for measuring negative side effects seem human-independent, it seems to me a lot of ideas about humans creep into the details. The proposals I’ve seen generally depend on some coarse-graining of states: you at least want to somehow remove time from the state, but generally you do the coarse-graining based on… actually, what humans value. (If this research agenda were really trying to avoid implicit human models, I would expect people to spend a lot of effort on measures of quantum entanglement, decoherence, and similar topics.)
Just a few comments
In the abstract, one open problem about “non-goal-directed agents” is “when do they turn into goal-directed ones?”; this seems similar to the problem of inner optimizers, at least in the sense that solutions which would prevent the emergence of inner optimizers would likely also work for non-goal-directed things.
Among the “alternative solutions”, in my view, what is under-investigated are attempts to limit capabilities, i.e. to make “bounded agents”. One intuition behind this is that humans are functional precisely because goals and utilities are “broken” in a way compatible with our planning and computational bounds. I’m worried that efforts in this direction got bucketed with “boxing”, and boxing got a vibe of being uncool. (By making something bounded I mean, for example, making bit-flips costly in a way that is tied to physics, not naive solutions like “just don’t connect it to the internet”.)
I’m particularly happy about your points on the standard claims about expected utility maximization. My vague impression is that too many people on LW kind of read the standard texts, note that there is a persuasive text from Eliezer on the topic, and take the matter as settled.
Not only is it hard to disentangle manipulation and explanation; it is actually difficult to disentangle even manipulation and simply asking the human about their preferences (like here).
Manipulation via incorrect “understanding” is IMO a somewhat easier problem (understanding can possibly be tested by something like simulating the human’s capacity to predict). Manipulation via messing with our internal multi-agent system of values seems subtler and harder. (You can imagine an AI roughly in the shape of Robin Hanson, explaining to one part of the mind how some of the other parts work. Or just drawing the attention of consciousness to some sub-agents and not others.)
My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / a utility function learned via ambitious value learning and restricting explanations/questions/manipulation by that might work.
One hypothesis for why we do so well: we “simulate” other people on very similar hardware, and with a relatively similar mind (when compared to the abstract set of possible planners), which is a sort of strong implicit prior. (Some evidence for that: we have much more trouble inferring the goals of other people whose brains function far from what’s usual on some dimension.)
As Raemon noted, the mentorship bottleneck actually is a bottleneck. Senior researchers who should mentor are the most bottlenecked resource in the field, and the problem is unlikely to be solved by financial or similar incentives. Pushing motivation too hard is probably wrong, because mentoring competes with time to do research, evaluate grants, etc. What can be done is:
improve the utilization of mentors’ time (e.g. by mentoring teams of people instead of individuals)
do what can be done on peer-to-peer basis
use mentors from other fields to teach people generic skills, e.g. how to do research
prepare better materials for onboarding
“Is there another way to spend money that seems clearly more cost-effective at this point, and if so what?” In my opinion, for example, the AI safety camps were significantly more effective. I have maybe 2-3 ideas which would likely be more effective (sorry, but shareable in private only).
Btw, when it comes to any practical implications, both of these repugnant conclusions depend on a likely incorrect aggregation of utilities. If we aggregate utilities with logarithms/exponentiation in the right places, and assume the resources are limited, the answer to the question “what is the best population given the limited resources?” is not repugnant.
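A toy illustration of what I mean, with stylized numbers of my own (not a full population-ethics model): with a fixed resource budget and logarithmic individual utility, the welfare-maximising population is finite, and the huge “barely worth living” population comes out strictly worse:

```python
# Toy model: a fixed resource budget R is split equally among n people, each
# with logarithmic utility in their share, so total welfare = n * log(R / n).
# The maximum is at n = R / e: a finite population with per-capita utility ~1,
# not an enormous population with utility barely above zero.
import numpy as np

R = 1000.0                       # fixed resource budget (arbitrary units)
ns = np.arange(1, 5000)
welfare = ns * np.log(R / ns)

best = ns[np.argmax(welfare)]
print("welfare-maximising population:", best)                    # ~ R / e ≈ 368
print("per-capita utility at the optimum:", np.log(R / best))    # ≈ 1
print("welfare with 4000 people near subsistence:", 4000 * np.log(R / 4000))  # large and negative
```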