AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
Why is this? The most straightforward possibility is simply that the concept of econ-brain is too lossy an abstraction to reliably evaluate thinkers with. Ideally we’d try to diagnose what led to each of these successes and failures in granular detail. But as a rough heuristic, is being more econ-brained actually a good way to improve your forecasts? Some possible responses:
I think of the economy as a useful abstraction that we have all agreed to treat as true, a compression that amalgamates the incentive landscape of everyone engaging in it. We all agree that it is the underlying reality to which things propagate, and so we've built trust and cooperation around it.
As a consequence, I think of the economy a bit as a hyperstition: since everyone believes the same thing, economic forecasts work because humans behave somewhat predictably towards the economy, since we all know from Hayek that it is the ultimate propagator of truth! (not)
I think there are multiple distorting factors, but mainly it is about the larger superorganisms within the economy being incentivised to distort the functioning of the economy for their own gains. In state-heavy countries, it is the state that does this. In market-heavy countries, large corporations have done the same thing. I would like to claim that this is due to Friedman and the idea that the utility function of a company is based on its shareholders, so you can just disregard negative externalities (this is not fully what he said, but it is the slippery slope he started). I see this as a sort of defection at a higher level, and I think that you don't get the market as the underlying information aggregator that it is, because the instruments are institutionally captured, so it gets distorted over time while narratives keep it going in the short term.
If you imagine that you're always participating in a prediction-action loop, or a general policy cycle as in RL, you can imagine that there are two things you can spend your energy on: prediction and action.
I think that smart people have better prediction engines, yet this also makes action less of a focus in the strategy to minimise future problems (and since one can get stuck in one's head, it also limits the potential exploration a bit).
It is quite strange, but I've found that I've gotten more done and become happier the more "stupid" (as I frame it to myself) my decisions in day-to-day life are. It's a bit like a switch and an allowance to do more things, because the thing stopping me from doing them is feeling like I did something stupid, which completely gets reset by the idea of "oh, let's be stupid for a bit". This still needs good review afterwards, of course, to see whether it worked or not, but life is so much nicer and happier now, which makes me better at work as well!
A Compositional Philosophy of Science for Agent Foundations
I feel like there are some very interesting connections to gradual disempowerment and cultural evolution here as well, where you should probably see selection dynamics on the personalities based on what lets them retain power and the like over time.
It might be an interesting place to do some initial studies on memetic drift of personalities over time to see what type of attractor states they tend to occupy.
(This is a bit of a no shit point but I thought it would be good to mention that you can probably run some good initial tests on the memetic spread of power-seeking tendencies in these models)
Posting things that are adjacent in frame but imply beliefs more associated with AI Ethics or the normie crowd. E.g. let's say someone does a deep dive into John Rawls' A Theory of Justice (fictional example, but I've seen similar) and doesn't preface it by relating it to some sort of decision theory or similar; it is often assumed that it is not meant for the LW community, as it doesn't make the connections clear enough. I'm not sure this is only a bad thing, but sometimes I find that it signals a lack of good faith in accepting other people's frames?
I thought I would try to give some cosmic Schelling answers as well, as it was asked for and sounded fun. (Flag: written with help from AI)
Q: Is it good or bad to maintain the capacity to evaluate whether agents act consistently with their stated values over time?
Cosmic Schelling answer: good. Any civilization that coordinates through communicated commitments needs some way to verify those commitments are being honored. Without this capacity, stated values become unverifiable and coordination partners cannot distinguish cooperators from defectors. This implies that a degree of mutual legibility is itself cosmically Schelling-good.
Q: Is it good or bad to maintain distributed rather than monopolized systems for evaluating knowledge claims?
Cosmic Schelling answer: good. This follows the same pattern as the stealing argument applied to information. When a single subsystem controls what counts as true for the entire system, you get fragility and misalignment between the evaluator’s local objectives and collective accuracy — the same pathology as resource monopoly.
Q: Is it good or bad to maintain openness to novel information and unfamiliar coordination strategies?
Cosmic Schelling answer: good. A civilization that closes itself to novelty loses adaptive capacity in changing environments, while one that remains open can integrate useful strategies it didn’t generate internally.
This last one raises a question about the framework itself. On Earth (see WEIRD by Henrich for more on this), "openness to experience" as a measurable trait doesn't show up consistently across cultures in personality research; it is only within WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations that it does. This suggests that the space of cosmic Schelling norms a civilization can actually converge on may be constrained by its available technology and coordination infrastructure. "Stealing is bad" is available to any civilization with resource boundaries. Norms about epistemic distribution or openness to novelty may require sufficient information-processing capacity and institutional complexity before they become representable at all. The asymmetry arguments hold regardless, but recognizing them has prerequisites that might also depend on the norm structure and general structure of the civilization at hand.
Was this what you were alluding to in your conversation with Divya Siddharth on the podcast, or were you pointing at something deeper when you thought of morality as an optimal solution to collective intelligence problems?
In the book Energy and Civilization, Vaclav Smil shows the process of civilization and the complexity of rule as one that is dependent on the energy capacity of the system. It feels reasonable to me that one could also see the arising of more complex Schelling points as something that comes with the increased energy, and therefore information-processing bandwidth, available at any point in time. We can see something like science as a more complex Schelling point that comes from more available information processing. This might give a pretty nice argument for why economic well-being could lead to a general increase in moral circle expansion as well? (although it might not be the main causal factor)
Finally, I would be curious what you think about simulations in this context? If you had a reasonable agent sample, couldn't you just provide an example by doing an MCMC simulation of the agent dynamics and point at that as a way of seeing general selective norms, or do you think that game theory or MAS is still too simplistic to describe such a system well?
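To make that concrete, here is a toy sketch of the kind of thing I have in mind (a plain Monte Carlo / imitation-dynamics setup rather than a proper MCMC one): agents repeatedly interact, a "stealing" norm competes with a "no stealing plus costly punishment" norm, and copying higher-payoff peers does the selection. All the names, payoffs, and the imitation rule are illustrative assumptions of mine, not a claim about how such a study should actually be set up.

```python
# Toy Monte Carlo sketch of norm selection under agent dynamics.
# Payoffs, punishment rule, and imitation dynamic are all illustrative assumptions.
import random

N_AGENTS = 100
N_ROUNDS = 2000
STEAL_GAIN = 1.0   # what a thief takes from a victim
STEAL_LOSS = 1.5   # what the victim loses (theft is net-negative for the pair)
PUNISH_COST = 0.5  # cost borne by a norm-follower who punishes a thief

def play_round(norms):
    """One round of random pairwise interactions; returns a payoff per agent."""
    payoffs = [0.0] * len(norms)
    agents = list(range(len(norms)))
    random.shuffle(agents)
    for a, b in zip(agents[::2], agents[1::2]):
        for actor, other in ((a, b), (b, a)):
            if norms[actor] == "steal":
                payoffs[actor] += STEAL_GAIN
                payoffs[other] -= STEAL_LOSS
                if norms[other] == "no_steal":
                    payoffs[actor] -= 2 * PUNISH_COST  # punished by the norm-follower
                    payoffs[other] -= PUNISH_COST      # punishing is itself costly
    return payoffs

def imitate(norms, payoffs):
    """Each agent copies the norm of a random peer if that peer did better this round."""
    new_norms = list(norms)
    for i in range(len(norms)):
        j = random.randrange(len(norms))
        if payoffs[j] > payoffs[i]:
            new_norms[i] = norms[j]
    return new_norms

norms = [random.choice(["steal", "no_steal"]) for _ in range(N_AGENTS)]
for _ in range(N_ROUNDS):
    norms = imitate(norms, play_round(norms))

print("fraction holding the no-stealing norm:", norms.count("no_steal") / N_AGENTS)
```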
I partly feel that there sometimes is a missing mood on LW about the ability of models to actually do good coding by themselves? I might be wrong, but if I look around the spaces which I consider to be doing proper computer science, it very much still feels like it is not that good? For example, here's an interesting video on AI taking a Cornell CS freshman class: https://www.youtube.com/watch?v=56HJQm5nb0U
The qualitative vibe is more like it's a nice extension of a single human's agency? When I look around the programming space and the vibes of more serious programmers, I still can't really say that I feel the AGI. (I think the core problem is something around things that need to be highly dependable and how that is hard to develop with AI, but idk, I just mostly wanted to point out this missing mood.)
I would be curious if you think the following take is naive or reasonable?
It seems to me that a lot of bad AI decisions boil down to building something for scale and, in doing so, letting go of the sort of environment that would be conducive to producing good thinking? Yet if we look at the VC or general entrepreneurial scaling model, it is not quite aligned with that purpose, and so a lot of the organisations that we see will then go down such a path?
Isn’t it then very important to be able to provide a space for ambitious people to work on something real in an environment that is optimised for the precursors of calm and friendly thought? (Since most other places will be optimised for scale.)
I think you can put fun, curiosity, and a positive impact direction at the forefront of an organisation and not fall into the traps that you've described. I think the trick is to not have an EA-oriented hardcore impact evaluation frame, as I think that leads to fear, pressure, and generally worse decision making. I'll see if this empirically holds, but that's also why I'm asking what you think about this, as I've had similar thoughts on the EA + startup sphere and this is the response/solution I've thought of (and am trying).
Humans have lots and lots of training data to build on within imitation learning and culture, which one can get a wrong view of when reading Steven Byrnes, imo. He has a very specific focus on reward learning infrastructure, which means he skips out on some of the cultural evolution literature.
The important point here is basically that the human language corpus is really OP for world models, and I still think that there's a relatively large difference between an RL-bootstrapped system and an LLM; I think you get a lot of bang for your buck by training on the human language corpus.
So I think there will be a sort of language model anyway but probably a weird version of one.
Good point on validation — I think the main difficulty is finding a case of disempowerment with both good qualitative and quantitative data, which is a trickier combination than it first appears.
The spectral toolkit itself is pretty well-established in network science so I don’t think base-level verification is the bottleneck there — happy to point to some examples if useful.
I'm personally working on some multi-agent evaluation environments where the plan is to track these metrics over time, see how disempowerment intersects with the spectral properties of the system, and hopefully from there design something more like METR-style real-world tracking of AI participation in the economy or democratic systems.
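For concreteness, here is a minimal sketch of the kind of per-episode spectral snapshot I mean, using standard network-science tooling (networkx/numpy). Treating algebraic connectivity and eigenvector-centrality concentration as rough proxies for fragmentation and disempowerment is my own assumption here, not settled methodology, and the example edge list is hypothetical.

```python
# Minimal sketch: spectral snapshot of one episode's agent interaction graph.
# The metric choices and their reading as disempowerment proxies are assumptions.
import numpy as np
import networkx as nx

def spectral_snapshot(interactions):
    """interactions: list of (agent_a, agent_b, weight) edges from one evaluation episode."""
    g = nx.Graph()
    g.add_weighted_edges_from(interactions)

    # Algebraic connectivity (second-smallest Laplacian eigenvalue):
    # low values mean the interaction graph is close to splitting into blocks.
    lap = nx.laplacian_matrix(g, weight="weight").toarray().astype(float)
    algebraic_connectivity = np.sort(np.linalg.eigvalsh(lap))[1]

    # Eigenvector centrality: how concentrated influence is on a few agents.
    centrality = nx.eigenvector_centrality_numpy(g, weight="weight")
    scores = np.array(sorted(centrality.values(), reverse=True))
    top_share = scores[0] / scores.sum()

    return {"algebraic_connectivity": algebraic_connectivity, "top_agent_share": top_share}

# Hypothetical episode where one agent mediates almost every interaction.
edges = [("hub", f"agent_{i}", 1.0) for i in range(10)] + [("agent_0", "agent_1", 0.1)]
print(spectral_snapshot(edges))
```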
I’ve got a question which might be a bit galaxy brained which goes something like: If you start looking for a straight line on a graph aren’t you gonna delude yourself because you’re a stupid monkey?
Doubling down on this, if you join the straight line group then all the other people around you will show you the evidence of the straight line and start to dismiss the evidence of the sigmoiding?
I've got some sort of alarms firing in my head going like "the local environment seems like it leads to bad epistemics, care, care!", and so I've adopted the very elusive (and in reality non-existent) centrist position of saying "if it happens it happens, the stuff I work on is robust to timelines", and then actually having a model in the background but not being public about it (which includes straight-line arguments).
So uhhh, I’m taking all evidence into account and it may or may not be true in a very hypothetical world...
I get your point about the word agency, which was the main point of the post, but as this is LessWrong and since I can't help myself: why did you pick consciousness as the other concept?
Consciousness has to be one of the vaguest words there is (see Critch's post on this, for example).
I would, for example, not subscribe to the definition of consciousness that you just gave. I would want to talk about awareness and attention, and self and non-self. I really don't like the word consciousness.
Maybe this is actually a good comparison, because I really don't like the word agency either; it is too broad of an umbrella term and, as a consequence, it is often conflationary as well? Just as different philosophers think the hard problem of consciousness is different things (some believe the idea of qualia makes no sense), people have different definitions of agency. Some mean Embedded Agency, some mean something towards which you can take the intentional stance, some mean a sort of infra-Bayesian perspective on it (with layers of self-reflection and other 4d chess that I don't understand). If we go to other fields like biology, they call a slime mold an agent because it behaves in a way that can be described by a cybernetic control loop. So it seems quite woefully underdefined (in general), but if you talk to specific people who know what they're doing, they often have a precise definition.
Idk if it is something that is dependent on something being pre-paradigmatic but I shalt stop yapping now as I realise I got a bit carried away.
TL;DR: being a managed agent can be good and the type of management matters.
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power-seeking, yet I also want to point out that if you're in a company (a top-down organisational structure), you can ask yourself whether an individual contributor is less useful than a manager. I think the IC might be less load-bearing on the direction from time to time, yet that person can often say a lot about some very specific system that matters.
Isaiah Berlin has the concept of positive and negative liberty, which I think is important here (https://plato.stanford.edu/entries/liberty-positive-negative/). Sometimes you can get more agency in one direction by having options removed from you, so it matters what type of agency is being removed, and sometimes it can be a good thing to be a fully managed agent! (E.g. someone forces you to eat healthy so that you get more energy on average.)
I also think that the truer-name version of this is something like a scalar property of a message-passing relationship between two agents, and that it is not only top-down control structures that matter; there are other forms of organisation such as markets, democracies, networks, and communities as well.
(Hopefully this made some sense)
Do you think it is better to treat the parasites on the character level rather than the underlying ocean level? There's probably some weird sort of minimum viable target for selection thing going on here? (e.g. this post for Kulveit's 3 different layers)
Systemic Risks and Where to Find Them
Why are these the two camps?
It very much doesn’t feel that black and white when it comes to alignment and intelligence?
Clearly it is a fixed-point process that is dependent on initial conditions, and so if the initial conditions improve, the likelihood of the end point being good also improves?
Also, if the initial conditions (LLMs) have more intelligence than something like a base utility function does, then the depth of the alignment part of the fixed-point process is higher at the beginning.
It's quite nice that we have this property, and depending on how you believe the rest of the fixed-point process will go (to what extent power-seeking naturally arises, and what type of polarity the world is in, e.g. uni- or multi-polar), you might still be really scared or you might be more chill with it.
I don’t think Davidad says that technical alignment is solved, I think he’s more saying that we have a nicer basin as a starting condition?
That is a fair point, since virtue is tied to your identity and self it is a lot easier to take things personally and therefore distort the truth.
A part of me is like “meh, skill issue, just get good at emotional intelligence and see through your self” but that is probably not a very valid solution at scale if I’m being honest.
There's still something nice about it leading to repeated games and similar; something about how, if you look at our past, cooperation arises from repeated games rather than from individual games where you analyse things in detail. This is the specific point that Joshua Greene makes in his book Moral Tribes, for example.
Maybe the core point here is not virtue versus utilitarian reasoning; it might be more about the ease of self-deception, about different time limits, and about how evaluating your own outcomes and outputs should be done in a more impersonal way. Maybe one shouldn't call this virtue ethics, as it carries a large bag and camp; maybe heuristics ethics or something (though that feels stupid).
I find this an interesting line of criticism; it is essentially pointing at the difficulty of finding good evidence and evaluating yourself on that evidence, and making it a disagreement about how easy that is.
I would like to bring in a perspective of more first-principles modelling of how quickly you incorporate evidence that points against the way you're thinking.
One thing is the amount of disconfirming evidence you look for. Another is your ability to bring that information into your worldview, your openness to being wrong. A third is the speed of the feedback: how long it takes for you to get feedback.
I think you're saying that when you go into virtue ethics, we often find failures in bringing disconfirming information into a worldview. I don't think this has to be the case personally, as I do think there are ways to actually get feedback on whether you're acting in a way that is aligned with your virtues: by mentioning examples of them and then having anonymous people give feedback, or just through normal reflection.
This is a lot easier to do if your loops are shorter, which is the exact point that consequentialism and utilitarianism can fail on: the target is quite far away, and so the crispness and locality of the feedback is not high enough.
I think that virtue ethics outperforms consequentialism because it is better suited for bringing in information, as well as for the speed and crispness of that information. I personally think this is because it is a game-theoretically optimal solution to consequentialism in environments where you have little information, but that is probably beside the point.
This might just be a difference in terminology though? Would you agree with my 3 point characterisation above?
I've been getting into more general political theory recently, and I really like the idea of multi-lateralism; it feels a bit underrepresented in LW rhetoric, maybe due to the US-centricity of the site? (I liked this interview with Finland's prime minister, I thought he was quite well spoken: https://youtu.be/ubZeguAk0fM?si=7H7nJfnCANCWRcDN)
The difference is basically between being driven by cooperation, values, norms, and treaties on the multi-lateralist side, and by power on the multipolar side. It feels like a lot of the analysis has been based on power, and this is especially true in US-China relations. This just feels obviously worse than aiming for a multi-lateral world order, and based on some sort of power-concentration assumption?
Maybe it is also partly due to the unipolar world that we have had up to now, with the US as a global hegemon?
Some might say that it is implausible to aim for multi-lateralism and that power concentration is a fact of the world due to how ASI will arise. I do not believe this to be the case: distillation of models exists and LLMs are highly parallelisable, and these things point at broad usage of LLMs in the future.
Finally, there is a serious possibility at this point that the USA will grow into a proto-fascist state, since it is getting easier and easier to predict by viewing it through a fascist lens.
Power generally tends to corrupt, and distribution of power is often a good thing, as it enables leverage for deals and cooperation. Maybe this is like a mega lukewarm take, but I feel that some people are still stuck in the "we need a Manhattan Project for AI Safety" train of thought. Also, finally, I would really like to see Anthropic or Google DeepMind or any AI company for that matter involve themselves a lot more in improving democracy across the world, and for them to become a lot more globalist. This is a strategy change that is plausible to implement, and it would likely decrease the risks from power concentration, as it seems states are getting quite grabby in this changing world order.
If I were Dario Amodei in the future bestseller Anthropic and The Methods of Rationality, I would start to create multi-lateral cooperation across the world, as that would build lots of leverage and goodwill for future adoption of technologies, even with the US, since you could use your global relationships to get leverage on national decisions.
End of rant, European out.