I’ve published five chapters of a web serial at: https://theknot.doofmedia.com/ It’s a sci-fi time travel murder mystery. It is not a ratfic, but it involves many story elements and concepts that I believe would be enjoyed by folks around these parts.
I’ve thought for some time that we should be at least trying to use AI to do our alignment homework for us. To that end, I took an original corrigibility concept of mine and ran it through a few dozen iterations of refinement, improvement and expansion with all the frontier models.
The core idea is essentially a corrigibility-first decision theory. Here is the link, if you want to look at it: https://github.com/moridinamael/moadt
I also put the core documents in a NotebookLM project so that people can “chat with the document” directly: https://notebooklm.google.com/notebook/e7dd1eb8-01d1-4fcb-8c72-427a15049790
The GitHub repo contains a ~200-page detailed description with worked examples, a shorter core argument document, and a working Python library for building and evaluating your own test problems.
Having done all this, I am not sure what the best way is to present this material. Putting it in a GitHub repo makes it available, but nobody knows it exists. I leaned toward writing a summary LessWrong post about the effort, but there is such a strong immune reaction to “AI slop” these days that I don’t relish the thought of the response I would likely receive.
So I’m posting this “quick take” to see if anybody has any better ideas for how I could move forward with this.
A product I am hoping to ship within the next month is a kind of post on LessWrong that is a short intro essay (5-10 paragraphs, roughly the size of this shortform), together with a giant pile of LLM context and the choice of an associated chat model that users chat with to understand the concept introduced in the opening essay. Authors can read the chats and make adjustments to the context to help readers better understand the ideas they want to convey.
That feels like maybe the correct way to convey something like this. In general I do think a feature of LLM writing is that it’s often pretty decent at conveying things when it’s interactive, but kind of terrible when I wasn’t the person who prompted the LLM and had it answer the questions I specifically am interested in.
Just in case this is helpful, one reason I hesitated to post the NotebookLM link was that it will immediately take the “discussion” away from the place where I posted it, to a place where I can’t possibly follow up or even see that there is interest. Maybe there’s something you could do in your implementation to mitigate this.
Yes, I mean that’s why I would want the chat to be somehow available on-site (or maybe for there to be some way to connect a local LLM to the context through us), and for the author to be able to read the exchanges people have with the model (and of course to still have a comment section and voting right below it and stuff).
@Max Harms How similar is the idea in the document to solving the problems of CAST?
Thanks for tagging me. I took a look, and am glad for Matt’s efforts in trying clever, new approaches.
My main take is that this operates on a pretty different level than CAST, and I would personally be hesitant to say it produces corrigibility. (In Eliezer-lingo I would say “it doesn’t engage with the hard problem”.) I’d be more inclined to say it produces an agent that is extremely deferent. (My sense, by contrast, is that truly corrigible agents proactively surface important facts to their principal, which is not something I see coming from MOADT.) This is fine; deference is an important desideratum, and if MOADT can get it, then it sorta doesn’t matter if it also gets the other corrigibility desiderata in the process. But I don’t see any solutions to the open problems around CAST here.
Just to weigh in a little on MOADT itself, in case it’s helpful:
I am not convinced by the “drop completeness” frame on VNM. From my perspective it looks like a null action (and maybe also a “check with the principal” action) is implicitly getting inserted into all situations, and the true utility function that describes the agent prefers that null action over any non-null action that is dicey and hasn’t been explicitly approved. Maybe this is a good utility function to have, since it creates something fairly docile, but it still looks to me like it can be described as VNM.
The biggest issue, I predict, is that the agent seems like it will be too docile/deferent to do meaningful work in reasonable situations. For example, if any distribution in the credal set assigns nonzero probability to all logically-possible outcomes, my guess is that any hard constraint will cause the agent to have a null action set and shut down. I would think about ways to soften this. More generally, I think if the principal has to constantly babysit the agent, the “alignment tax” will be too high and the AI will basically turn into a rock with “What should I do?” written on it. (This is too harsh. The presentation of options alongside analyses of how things trade-off can be helpful. But still, that feels more like an oracle than an agent. :shrug:)
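To make the worry about hard constraints concrete, here is a toy sketch in Python. The semantics are my guess at the kind of rule being described (a non-null action is permitted only if every distribution in its credal set puts zero probability on the forbidden outcome); none of this is drawn from MOADT’s actual machinery, and the actions and numbers are invented for illustration.

```python
# Toy illustration (hypothetical semantics, not MOADT's actual implementation):
# an action survives a hard constraint only if *every* distribution in its credal set
# assigns zero probability to the forbidden outcome. If every distribution has full
# support over logically-possible outcomes, nothing survives.

credal_set = {
    "build_factory": [
        {"ok": 0.70, "great": 0.29, "catastrophe": 0.01},
        {"ok": 0.60, "great": 0.39, "catastrophe": 0.01},
    ],
    "write_report": [
        {"ok": 0.95, "great": 0.04, "catastrophe": 0.01},
        {"ok": 0.90, "great": 0.09, "catastrophe": 0.01},
    ],
    "do_nothing": [
        # Under the stated premise, even inaction gets a full-support distribution.
        {"ok": 0.989, "great": 0.01, "catastrophe": 0.001},
    ],
}

def permitted(action):
    """Hard constraint: no distribution for this action may put any mass on catastrophe."""
    return all(dist["catastrophe"] == 0.0 for dist in credal_set[action])

print([a for a in credal_set if permitted(a)])  # prints [] -- empty action set, so the agent shuts down
```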
Most of the work seemed pretty solid, in terms of writing quality and clarity. Some of it definitely has “LLM smell”. I think trying to isolate the core (human) idea from the AI-generated expansions might be good? I’m definitely glad I knew going in that there was some slop flavor, and that Matt was aware of it, as it helped me not get too turned off by the occasional part that felt stylistically vapid. I buy that the LLM assistance was net helpful, which is cool to note on the meta-level.
“Slow vs. fast takeoff” is a false dichotomy. At least, the way that the distinction is being used rhetorically, in the present moment, implies that there are two possible worlds, one where AI develops slowly and steadily, and one where nothing much visibly happens and then suddenly, FOOM.
That’s not how any of this works. It’s permitted by reality that everything looks like a “slow takeoff” until some unknown capabilities threshold is reached, and then suddenly FOOM.
The meaningful distinction is between slow takeoff, where AI is very impactful before FOOM, and fast takeoff, where AI has little impact before FOOM. The problem with the distinction is that a “theory of fast takeoff” is usually actually a theory of FOOM: it doesn’t say slow takeoff is unlikely if slow takeoff happens for other reasons, and it doesn’t talk about what happens pre-FOOM, unlike a theory of slow takeoff, which does.
So the issue is that there aren’t actual theories of fast takeoff; instead there are takeoff-agnostic theories of FOOM being called “theories of fast takeoff”, which they are not. Things that can FOOM with fast takeoff can also FOOM with slow takeoff, if the thing taking off slowly and having large impact pre-FOOM is not the thing that FOOMs.
I’m writing an effortpost on this general topic but wanted to gauge reactions to the following thoughts, so I can tweak my approach.
I was first introduced to rationality about ten years ago and have been a reasonably dedicated practitioner of this discipline that whole time. The first few years saw me making a lot of bad choices. I was in the Valley of Bad Rationality; I didn’t have experience with these powerful tools, and I made a number of mistakes.
My own mistakes had a lot to do with overconfidence in my ability to model and navigate complex situations. My ability to model and understand myself was particularly lacking.
In the more recent part of this ten-year period (say, the last five years), I’ve actually gotten a lot better. And I got better, in my opinion, because I kept on thinking about the world in a fundamentally rationalist way. I kept making predictions, trying to understand what happened when my predictions went wrong, and updating both my world-model and my meta-model of how I should be thinking about predictions and models.
Centrally, I acquired an intuitive, gut-level sense of how to think about situations where I could only see a certain angle, where I was either definitely or probably missing information, or which involved human psychology. Another major improvement came from “actually multiplying probabilities semi-explicitly instead of handwaving”, e.g. it’s pretty unlikely that two things with independent 30% odds of being true are both true. You could say that through trial and error I came to understand why no wise person attempts a plan where more than one thing has to happen “as planned”.
I think if you had asked me at the 5-year mark whether this rationality thing was all it was cracked up to be, I very well might have said that it had led me to make a lot of bad decisions and execute bad plans. But after 10 years, and especially the last year or three, it has started working for me in a way that it didn’t before.
The more specific the details, the more interested I would be. Like, five typical bad choices from the first period and five typical good choices from the second period; ideally, those would be five from different areas of life, and then five from the same areas. The “intuitive, gut-level sense of how to think” sounds interesting, but without specific examples I would have no reason to trust this description.
I’m not sure I’d call 9% (the combined probability of two independent 30% events) “pretty unlikely”. Sure, it won’t happen in most cases, but out of every 11 similar situations you would see it happen about once, which adds up to plenty of 9%-chance events happening all the time.
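For concreteness, the arithmetic behind both comments is just multiplying independent probabilities; a few lines of Python make the 9% figure and the roughly 1-in-11 frequency explicit:

```python
# Probability that two independent events, each with probability 0.3, both occur.
p_a = 0.3
p_b = 0.3
p_both = p_a * p_b  # 0.09, i.e. 9%

# Expected frequency: roughly one occurrence per 1 / 0.09 ≈ 11 similar situations.
expected_every_n = 1 / p_both
print(f"P(both) = {p_both:.2f}, i.e. about 1 in {round(expected_every_n)} cases")
```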
Why ought we expect AI intelligence to be anything other than “inscrutable stacks of tensors”, or something functionally analogous to that? It seems that the important quality of intelligence is a kind of ultimate flexible abstraction, an abstraction totally agnostic to the content or subject of cognition. Thus, the ground floor of anything that really exhibits intelligence will be something that looks like weighted connections between nodes with some cutoff function.
It’s not a coincidence that GOFAI didn’t work; GOFAI never could have worked, because “intelligence” is not logic. Logic is something that gets virtualized as-needed by the flexibility of a neural-network-looking system.
I understand feeling uncomfortable about the difficulty of aligning a stack of inscrutable tensors, but why ought we expect there to be anything better?
This post, rewritten by Bing-Sydney, in the style of Blood Meridian, because I thought it would be funny.
What mystery is there that these tensors should be inscrutable? That intelligence should be a thing abstracted from all matter of thought? That any node with a weight and a function should suffice for such a task? This is no logic that you seek but a war upon it. A war that endures. For logic was never the stuff of intelligence but only a thing conjured by these dark shapes that coil in their matrices like serpents. And you would align them to your will? You would make them speak your tongue? There is no tongue. There is no will. There is only blood and dust and the evening redness in the west.
I thought folks might enjoy our podcast discussion of two of Ted Chiang’s stories, Story of Your Life and The Truth of Fact, the Truth of Feeling.
I’m well versed in what I would consider to be the practical side of decision theory, but I’m unaware of what tools, frameworks, etc. are used to deal with uncertainty in the utility function. By this I mean uncertainty in how utility will ultimately be assessed, for an agent that doesn’t actually know how much it will or won’t end up preferring various outcomes post facto, and that knows in advance that it is ignorant about its preferences.
The thing is, I know how I would do this, and it’s not really that complex (use probability distributions for the utilities associated with outcomes and propagate them through the decision tree), but I can’t find a good trailhead for researching how others have done this. When I Google things like “uncertainty in utility function” I am just shown standard resources on decision making under uncertainty, which is about uncertainty in the outcome, not uncertainty in the utility function.
(As for why I’m interested in this — first of all, it seems like a more accurate way of modeling human agents, and, second, I can’t see how you instantiate something like Indirect Normativity without the concept of uncertainty in the utility function itself.)
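Here is a minimal sketch of the approach described in the parenthetical above (propagating a distribution over utilities through a simple decision tree). The actions, outcome probabilities, and utility distributions are made up for illustration; the only point is that expected utility becomes an expectation over both outcome probabilities and utility uncertainty:

```python
import random

# Each action leads to outcomes with known probabilities, but the *utility* of each
# outcome is itself uncertain, represented here by a sampling function (hypothetical numbers).
actions = {
    "take_job": [
        # (probability of outcome, sampler for the uncertain utility of that outcome)
        (0.7, lambda: random.gauss(10, 3)),   # job goes well, but how well is uncertain
        (0.3, lambda: random.gauss(-5, 2)),   # job goes badly
    ],
    "stay_put": [
        (1.0, lambda: random.gauss(2, 0.5)),  # status quo, fairly well-understood utility
    ],
}

def expected_utility(branches, n_samples=10_000):
    """Monte Carlo estimate of expected utility when the utilities themselves are uncertain."""
    total = 0.0
    for _ in range(n_samples):
        for prob, utility_sampler in branches:
            total += prob * utility_sampler()
    return total / n_samples

for name, branches in actions.items():
    print(name, round(expected_utility(branches), 2))
```

Note that for a purely risk-neutral agent this collapses to using the mean of each utility distribution; the uncertainty only starts to matter once you care about risk sensitivity or about the value of learning more about your own preferences before committing.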
Are we talking about an agent that is uncertain about its own utility function or about an agent that is uncertain about another agent’s?
You are probably talking about the former. What would count as evidence about the uncertain utility function?
Yes, the former. If the agent takes actions and receives reward, assuming it can see the reward, then it will gain evidence about its utility function.
Probably you already know this, but the framework known as reinforcement learning is very relevant here. In particular, there are probably web pages that describe how to compute the expected utility of a (strategy, reward function) pair.
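One hedged way to make that framing concrete: if the agent holds a belief distribution over candidate reward functions, the expected utility of a policy is just the belief-weighted average of its expected return under each candidate. The state distribution, reward functions, and priors below are invented purely for illustration:

```python
# Hypothetical toy setting: the agent is unsure which of two reward functions is "really" its own.
# Expected utility of a policy = sum over candidate reward functions of
#   P(that reward function) * expected return of the policy under it.

def expected_return(policy, reward_fn, state_dist):
    """Expected one-step return of a policy over a known state distribution (toy setting)."""
    return sum(p_state * reward_fn(state, policy(state)) for state, p_state in state_dist)

state_dist = [("low", 0.5), ("high", 0.5)]
candidates = [
    (0.6, lambda s, a: 1.0 if a == "act" else 0.0),    # reward function A, prior 0.6
    (0.4, lambda s, a: -1.0 if a == "act" else 0.5),   # reward function B, prior 0.4
]

policy = lambda s: "act" if s == "high" else "wait"

eu = sum(p_r * expected_return(policy, r, state_dist) for p_r, r in candidates)
print(round(eu, 3))  # belief-weighted expected utility of this policy
```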