Roland Pihlakas

Karma: 159

You can contact me at email address roland@threelaws.net

Independent AI alignment researcher. I hold an MSc equivalent degree in psychology from the University of Tartu. I work as an AI software architect specialising in combinatorial optimisation, machine learning, graph search, natural language processing, and data compression.

I have been both researching and professionally working on multi-objective value problems for almost 20 years, have followed discussions on AI Safety since 2006 and have participated more actively since about 2017.

My thesis topic was in cognitive psychology, about computational modelling of innate learning and planning mechanisms (classical and operant conditioning plus insight learning), which eventually form a foundation for culturally acquired language-based thought processes.

With co-authors I have published a full research paper in AAMAS (Autonomous Agents and Multi-Agent Systems), about concave utility functions for risk-averse multi-objective decision making, which are relevant for balancing the plurality of human values. For that purpose we introduced concave utility functions before multi-objective reward aggregation. See “Using soft maximin for risk averse multi-objective decision-making”. An interested reader may also want to take a look at my AISC V project proposal which inspired the aforementioned paper: https://bit.ly/aisc5-pluralistic-utility .

Later, with support from a Foresight Institute grant in autumn 2023 - spring 2024, I created a suite of biologically and economically aligned multi-objective multi-agent long-running AI safety benchmark environments, based on the extended gridworlds platform I have been developing, published on arXiv at the end of September 2024. This work includes Stable Baselines3 based RL agents as well as LLM agents. See “From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks” and https://github.com/biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks (contains the agent training framework and a concrete implementation of a benchmark environment) and https://github.com/biological-alignment-benchmarks/ai-safety-gridworlds (the general underlying framework for building multi-objective multi-agent environments).

Additionally, together with collaborators, I have been testing LLMs against the same biological and economic alignment principles using simpler map-free environments, in order to reduce confounding factors and focus on the essentials. It turns out that in multi-objective setups LLM agents do not just lose context; much worse, they can systematically flip into a paperclip-maximiser-like mode, which is more extreme than becoming incoherent. You can read more about the results of this work on LessWrong: “Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format” and on arXiv: https://arxiv.org/abs/2509.02655 .

Most recently I co-authored a preprint on excessive LLM obedience: “Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment”, available at https://arxiv.org/abs/2605.21401 .

I am a member of an AI ethics expert group. We have published three Agentic AI guidelines documents (published Sept 2024 - Nov 2025). See https://www.researchgate.net/publication/390621099 and https://www.researchgate.net/publication/397331969 . Over the years I have been a contributor to a few more governance related publications.

My resume: https://bit.ly/cv_rp_ea_2018

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Roland Pihlakas, lenz and Three Laws

7 Jul 2026 20:05 UTC

8 points

0 comments · 25 min read · LW link

(arxiv.org)

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

Roland Pihlakas and Three Laws

28 Dec 2025 21:53 UTC

14 points

0 comments · 8 min read · LW link

Roland Pihlakas 20 Dec 2025 18:41 UTC
3 points
0
on: Shallow review of technical AI safety, 2025
“Fairly-direct alignment via changing training to reflect actual human reward.”
Unless I misunderstand the highlighted sentence, I believe the following post is motivated by very much the same themes:
Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges.
It is essentially about utility / reward functions in the brain and how naive unbounded maximisation is partially alien to biological / human needs. Many or even almost all biological needs require the target objectives to be in an optimal range—both too little and too much must be actively avoided.
If AI training (and the model’s default assumptions / mathematics) does not reflect or optimally support these considerations, then the resulting AI is likely unaligned from the start.
There is still an important place for unbounded objectives, but it seems unboundedness is appropriate primarily for instrumental objectives.
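A minimal sketch of the distinction above, under assumptions that are mine rather than taken from the linked post: a homeostatic objective is scored by the negative squared deviation from a target value, while an unbounded objective is scored linearly. The function names, the target value and the quadratic shape are illustrative choices only.

```python
def homeostatic_reward(actual: float, target: float) -> float:
    """Peaks at the target; both too little and too much are penalised.

    The quadratic penalty is an illustrative assumption, not a claim
    about how any particular benchmark scores its agents.
    """
    return -((actual - target) ** 2)


def unbounded_reward(actual: float) -> float:
    """Naive linear reward: more is always scored as better."""
    return actual


if __name__ == "__main__":
    target = 5.0  # hypothetical set point, e.g. a satiation level
    for level in (0.0, 5.0, 10.0):
        print(level, homeostatic_reward(level, target), unbounded_reward(level))
```

With the homeostatic form, overshooting the target costs as much as undershooting it by the same amount, whereas the linear form keeps rewarding further consumption indefinitely.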

Roland Pihlakas 30 Jun 2025 15:10 UTC
3 points
0
in reply to: Seth Herd’s comment on: Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Hello! I have posted a new document brainstorming the methodology for further research on the current runaway-LLM findings:
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
I hope this post will also serve as a generally interesting collection of ideas and a discussion ground for black-box LLM interpretability methodology as well as failure mitigation.
Hope you find it relevant and interesting!

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas and Three Laws

22 Jun 2025 18:16 UTC

17 points

0 comments · 7 min read · LW link

Roland Pihlakas 7 Jun 2025 16:07 UTC
1 point
0
in reply to: Seth Herd’s comment on: Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Sharing a couple of new links related to your question about more details. Hope you find them interesting!

Presentation at MAISU unconference 2025:
Link to slides: https://bit.ly/beab-llm
Session recording at YouTube: https://bit.ly/beab-llm-rec
Link to annotated data files
https://bit.ly/beab-llm-data
(Each file has multiple sheets. Only trials with failures are provided.)
In this project it seems looking at the annotated logs of individual runs is much more interesting than looking at the aggregated plots over many runs. This is because:
1. The failures start at different timesteps in each run, and aggregation obscures that information.
2. The single objective optimisation could focus on either objective A or B during different experiment trials, but during aggregation both objectives would get equalised again.
3. The self-imitation patterns cannot be preserved by aggregation.
4. The successful run results will partially overshadow the failed run results during the aggregation.
5. In case of homeostasis benchmarks, the simulation imposes occasional artificial deviations on top of what the model does. These artificial deviations need to be considered and ignored. However, the current plot aggregation logic is too simple and does not consider these cases.
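The first and fourth points can be illustrated with a toy sketch. Everything here is hypothetical: the traces are synthetic, and `simulate_run` and `first_failure` are names made up for illustration, not functions from the project's code.

```python
from statistics import mean


def simulate_run(n_steps, failure_step=None):
    """Hypothetical trace of one run's deviation from balanced behaviour.

    The deviation is 0 until `failure_step`, then grows by 1 per step,
    imitating a runaway. `None` means the run never fails.
    """
    if failure_step is None:
        return [0] * n_steps
    return [max(0, t - failure_step + 1) for t in range(n_steps)]


def first_failure(trace):
    """Index of the first non-zero deviation, or None if there is none."""
    return next((t for t, value in enumerate(trace) if value > 0), None)


runs = [
    simulate_run(10, failure_step=2),
    simulate_run(10, failure_step=7),
    simulate_run(10),  # a successful run
]

# Per-run view: the onset of each failure is visible.
onsets = [first_failure(run) for run in runs]

# Aggregated view: a single curve in which the individual onsets are
# lost and the successful run pulls the average down.
average = [mean(step_values) for step_values in zip(*runs)]

print(onsets)
print(average)
```

The averaged curve only reveals the earliest onset, and its final value understates how far the worst run has drifted.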
Additionally, sharing related findings by other teams:
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
https://arxiv.org/abs/2502.15840
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
https://arxiv.org/abs/2504.16078
(Note: greediness in this publication means that the LLM plays safe and does not explore. The frequency bias, however, is in common with our research results.)

Roland Pihlakas 20 Mar 2025 17:45 UTC
1 point
0
in reply to: Seth Herd’s comment on: Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Thank you for asking! I am encouraged to hear you find it important :) It is under the Links section. Code, system prompts, output data files, plots, and a more detailed report can be found here: https://github.com/biological-alignment-benchmarks/bioblue

Please let me know if you have any further questions! The GitHub readme document and PDF report contain a lot of information and maybe the structure could be improved in the future.
The Python code files for the four experiments are runnable independently of each other and are relatively short, about 250-300 lines each. To see the nuances of the results you might want to take a closer look at the output data files with your own eyes.

This enables you to see that the models fail at different points in time, but the sequence of actions after the failure point usually follows a typical pattern. In the coming days, I will create one more output folder with annotated log files so the failure modes can be found more easily.

Roland Pihlakas 18 Mar 2025 2:39 UTC
6 points
0
in reply to: Jacob G-W’s comment on: Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
I renamed the phenomenon to “runaway optimiser”. I hope this label illustrates the inappropriately unbounded and single-minded nature of the failure modes we observed. How does that sound to you, does that capture the essence of the phenomena described in the post?

Roland Pihlakas 17 Mar 2025 16:41 UTC
1 point
−3
in reply to: Jacob G-W’s comment on: Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Thank you for pointing that out! I agree, there are a couple of nuances. Our perspective can be treated as a generalisation of the original utility monster scenario, although I do not consider it to be the first such generalisation: think of the examples in Bostrom’s book.

1) In our case, the dilemma is not “agent versus others”, but instead “one objective versus other objectives”. One objective seems to get more internal/subjective utility from consumption than another objective. Thus the agent focuses on a single objective only.
2) Consideration of homeostatic objectives introduces a new aspect to the utility monster problem—the behaviour of the original utility monster looks unaligned to begin with, not just dominating. It is unnatural for a being to benefit from indefinite consumption. It looks like the original utility monster has an eating disorder! It enjoys eating apples so much that it does not care about the consequences to the future (“other”) self. That means, even the utility monster may actually suffer from “too much consumption”. But it does not recognise it and therefore it consumes indefinitely. Alternatively, just as a paperclip maximiser does not produce the paper clips for themselves—if the utility monster is an agent, then somebody else suffers from homeostasis violations while the agent is being “helpful” in an unaligned and naive way. Technically, this can be seen as a variation of the multi-objective problem—active avoidance of overconsumption could be treated as an “other” objective, while consumption is the dominating and inaccurately linear “primary” objective with a non-diminishing utility.

In conclusion, our perspective is a generalisation: whether the first objective is for the agent’s own benefit and the other objective for the benefit of others is left unspecified in our case. Likewise, violating homeostasis can be a scenario where an unaligned agent gets a lot of internal/subjective “utility” from making you excessively happy or from overfeeding you, while you are the one who suffers from overwhelm or overconsumption.

I hope that clears things up. I am also curious: would you like to share what might be an alternative short name for the phenomena described in this post?

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta and Three Laws

16 Mar 2025 23:23 UTC

45 points

8 comments · 16 min read · LW link

Roland Pihlakas 17 Feb 2025 19:53 UTC
LW: 1 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
I agree, it sounds plausible that this could happen, just as we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.

That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything, therefore there is no competition between different required properties. Required properties are conjunctive: they are all needed. My intuition is that homeostasis is one such property. If we neglect homeostasis then we are likely in trouble regardless of advances in other properties.

If we leave aside the question of sloppiness in creating sub-agents, I disagree with the zero cost assumption in the problem you described. I also disagree that it would be an expected and acceptable situation to have powerful agents having a singular objective. As the title of this blog post hints—we need a plurality of objectives.
Having a sub-agent does not change this. Whatever the sub-agent does, will be the responsibility or liability of the main agent who will be held accountable. Legally, one should not produce random sub-agents running amok.
In addition to homeostasis, a properly constructed sub-agent should understand the principle of diminishing returns in instrumental objectives. This topic I do mention towards the end of this blog post. We can consider wall-building as an instrumental objective. But instrumental objectives are not singular and in isolation either, there are also a plurality of these. Thus, spending excessive resources on a single instrumental objective is not economically cost-efficient. Therefore, it makes sense to stop the wall building and switch over to some other objective at some point. Or at least to continue improving the walls only when other objectives have been sufficiently attended to as well—thus providing balancing across objectives.
Secondly, a proper sub-agent should also keep in mind the homeostatic objectives of the main agent. If some homeostatic objective from among the plurality of homeostatic objectives would get harmed as a side effect of the excessive wall-building, then that needs to be taken into consideration. Depending on the situation, the main agent might potentially care about these side effects before it launches the sub-agent.
Thirdly, following the principles of homeostasis does not necessarily mean laziness and sloppiness in everything. Instead, homeostasis primarily notes that unbounded maximisation of a homeostatic objective is incompatible with, and harmful even for, the very objective being maximised, in addition to potentially having side effects on the plurality of other objectives. So homeostasis is primarily about minding the target value as opposed to maximising the actual value. An additional relevant principle is minding the plurality of objectives.
Finally, when an agent has a task to produce 100 paper clips then that does not mean that the number of paper clips needs to stay at 100 after the task has been completed. Perhaps it is entirely expected that these 100 paper clips will be carried away by authorised parties. Walls help against theft and environmental degradation of produced paper clips, but we do not exactly need the walls to keep the paperclip number at 100 at all times—there is some deeper need or transaction behind the requested paper clips.

To avoid confusion, I will also point out that there are two types of balancing involved in these topics:
1. Balancing of a homeostatic objective: keeping the actual value of a single homeostatic objective near the target value, not too low and not too high.
2. Balancing across objectives: a form of considering the utilities of multiple objectives equally. That means meeting them in such a manner that the homeostatic objectives have, for example, least-squares deviations, while unbounded objectives have approximately the same utility value after the utility functions with diminishing returns have been applied to each actual value.
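The two types of balancing can be sketched together as follows. This is only an illustration under my own assumptions: the squared-deviation penalty, the `log1p` utility and the plain sum as the aggregation are stand-ins chosen for simplicity, and all function names are hypothetical.

```python
import math


def homeostatic_score(actual, target):
    """Type 1: negative squared deviation from the target value."""
    return -((actual - target) ** 2)


def diminishing_utility(actual):
    """Concave utility for an unbounded objective (assumed form: log1p)."""
    return math.log1p(actual)


def total_score(homeostatic, unbounded):
    """Type 2: combine all objectives into one score.

    homeostatic: list of (actual, target) pairs
    unbounded:   list of non-negative actual values

    Summing squared deviations gives least-squares behaviour; summing
    concave utilities favours spreading effort across the unbounded
    objectives instead of pouring everything into one of them.
    """
    return (sum(homeostatic_score(a, t) for a, t in homeostatic)
            + sum(diminishing_utility(v) for v in unbounded))


if __name__ == "__main__":
    # The same total of 10 units, split evenly versus concentrated.
    print(total_score([], [5, 5]))
    print(total_score([], [10, 0]))
```

Under these assumptions the even split scores higher than the concentrated one, which is the balancing effect of diminishing returns.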

I am curious, how does this land with you and does this respond to your question?

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and open challenges.

Roland Pihlakas and Three Laws

12 Jan 2025 3:37 UTC

48 points

7 comments · 15 min read · LW link

Roland Pihlakas 6 Jan 2025 8:49 UTC
4 points
0
in reply to: Charlie Steiner’s comment on: Building AI safety benchmark environments on themes of universal human values
Thank you for your question!
I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the past. Various AI safety toy problems seemed to me so simplified that they lacked essential objectives and other decisive nuances. This is still very much one of my main driving motivations.
That being said, complexity also has downsides:
1) The complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether that was because it did not have the required perceptual capabilities (so it is a capabilities problem), or because it is using a model/framework that is unsuitable for alignment (so it is an alignment problem).
2) Running the simulations would be more time-consuming, which would make the research elitist in the sense that many people would not be able to afford it.
My plan is to start with a preference for simple, but not simpler than necessary, and then gradually add complexity. That means using the gridworlds and introducing as many symbols as are needed to represent the important objectives, objects, other concepts and phenomena, and their interactions.
I believe symbolic approaches should not be entirely dismissed. As an illustrative metaphor, I am thinking of books: they contain symbols, yet we consider them a cornerstone of our civilization. Similarly to the current dilemma with benchmarks, we may then worry whether books are too simple and symbol-based, or whether one should prefer watching movies instead, since they represent reality in more detail. But would that claim necessarily be true? It does not seem so obvious after all.
In case more complexity is needed, there are currently at least five ideas:
1) Adding more feature layers to the gridworld. I did not mention it before, but the observation format already supports multiple concurrent observable layers on top of each other. One of the layers could be for example facial expressions, or any other observable or partially unobservable metrics relevant to objects they accompany.
2) Adding textual messages between agents as a side panel to the gridworlds.
3) Making the environment bigger, so there are more objects and more phenomena.
4) Making the environment bigger and making also the objects bigger so that they cover multiple cells in the grid. Thus the objects will become composite, consisting of sub-parts with their own dynamics.
5) Using some other framework, for example Sims.
Curious, how do these thoughts and considerations land with you?

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas and Three Laws

3 Jan 2025 4:24 UTC

18 points

3 comments · 12 min read · LW link

(docs.google.com)

Roland Pihlakas 15 Oct 2024 16:50 UTC
−3 points
0
in reply to: Thomas Kwa’s comment on: Why Stop AI is barricading OpenAI
I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments.

Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong?

I agree that Remmelt can improve the message. And I believe he will do that.

I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worth exploring.

I also definitely respect Paul. But mentioning his name here is mostly irrelevant to my reasoning or to whether I take your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning on a particular question. And understanding a person’s reasoning may occasionally mean that I disagree on particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and are good only on average, not per instance of a line of thought (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but I would definitely not be disappointed if he makes mistakes sometimes.
I think this part of Remmelt’s response sums it up nicely: “When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.”
In my interpretation, black-and-white thinking is not “crankery”. Unfortunately, it is a normal and essential step in the development of cognition about a particular problem. There is research about that in the fields of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately, this development is topic-specific, not universal.

In contrast, “crankery” is too strong a word for describing black-and-white thinking, because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand: an insult, not just a disagreement about the degree of the claims. Is labelling someone’s thoughts as “crankery” then also a form of crankery of its own? Paradoxical, isn’t it?

Roland Pihlakas 24 Nov 2023 1:30 UTC
2 points
1
in reply to: aphyer’s comment on: OpenAI: The Battle of the Board
The following is meant as a question to find out, not a statement of belief.
Nobody seems to have mentioned the possibility that initially they did not intend to fire Sam, but just to warn him or to give him a choice to restrain himself. Yet possibly he himself escalated it to firing or chose firing instead of complying with the restraint. He might have done that just in order to have all the consequences that have now taken place, giving him more power.
For example, people in power positions may escalate disagreements, because that is a territory they are more experienced with as compared to their opponents.

Roland Pihlakas 21 Dec 2022 19:17 UTC
1 point
0
in reply to: Roland Pihlakas’s comment on: A brief review of the reasons multi-objective RL could be important in AI Safety Research
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2

Sets of objectives for a multi-objective RL agent to optimize

Ben Smith and Roland Pihlakas

23 Nov 2022 6:49 UTC

13 points

0 comments · 8 min read · LW link

Roland Pihlakas 5 Jul 2022 12:59 UTC
2 points
0
on: Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
I propose blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some of the ultimate objectives can also be represented in the form of blacklists. For example, listing many ways to kill a person is less useful. But saying that death or violence is to be avoided, is more useful.

Roland Pihlakas 5 Jul 2022 11:03 UTC
2 points
0
on: Can we achieve AGI Alignment by balancing multiple human objectives?
I imagine that the objectives which fulfill the human needs for Power (control over AI), Self-Direction (autonomy, freedom from too much influence from AI), and maybe others, would also partly help to ensure that the AI does not start moving towards wireheading. Wireheading would surely be in contradiction to these objectives.

If we consider wireheading as a process, not a black and white event, then there are steps along the way. These steps could be potentially detected or even foreseen before the process finishes in a new equilibrium.
