What’s weird about money is that it’s like the emperor held up a piece of paper and said “This is worth as much as (a piece of gold). Also, anyone other than me who makes these will be executed.”
What about descendants of you/your loved ones?
The countervailing force is that people are tempted to lie to children.
Related: a desire to sell/publish/use such books.
do something like ask it to become 95 percent sure that it was full, and that might make it less likely to flood the house.
The AI fills the cauldron, then realizes that depending on its future observations it will probably not continue to assign exactly 95% probability to the cauldron being full.
Patch—use a probability of 95% or more. (The user’s manual should include a warning not to use it to sell insurance to people, as achieving high probability may be difficult without the use of force.)
95% or more.
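To make the patch concrete, here is a minimal sketch (my own construction, not anything from the exchange above) of an agent that pours until its Bayesian credence that the cauldron is full reaches the threshold. The sensor model, its accuracy, and the ten-bucket capacity are all illustrative assumptions:

```python
import random

THRESHOLD = 0.95   # "a probability of 95% or more"
ACCURACY = 0.8     # assumed chance the fullness sensor reads correctly

def bayes_update(p_full, says_full):
    """Update P(cauldron is full) after one noisy boolean sensor reading."""
    like_full = ACCURACY if says_full else 1 - ACCURACY
    like_empty = (1 - ACCURACY) if says_full else ACCURACY
    return like_full * p_full / (like_full * p_full + like_empty * (1 - p_full))

level, p_full = 0, 0.1          # empty cauldron; low prior that it is full
while p_full < THRESHOLD:
    level += 1                  # act: pour one bucket
    actually_full = level >= 10            # hidden world state (assumed capacity)
    says_full = (random.random() < ACCURACY) == actually_full
    p_full = bayes_update(p_full, says_full)

# The agent stops once its credence crosses the threshold, rather than
# pouring forever; the worry above still applies, since it may anticipate
# that credence drifting after it stops observing.
print(level, round(p_full, 3))
```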
Note that in the referenced paper the agents don’t have the same information. (I think you know that, just wanted to clarify in case you didn’t.)
Yes. I brought it up for this point:
gesturing at a continuum—providing a little information versus all of it.
A better way of putting it would have been—“OpenAI Five cooperated with full information and no communication; this work seems interested in cooperation between agents with different information and communication.”
The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn’t really seem like you’ve “learned cooperation”.
That makes sense. I’m curious what value “this thing that isn’t learned cooperation” fails to capture.
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren’t, then the other agents would learn to ignore the information over time, which means that the information doesn’t affect the other agents’ actions and so won’t get any intrinsic reward.
A better way of putting my question would have been:
Is “useful” a global improvement, or a local improvement? (This sort of protocol leads to improvements, but what kind of improvements are they?)
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards).
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) Looking at the information the agents communicate.
My “communication” versus “information” distinction is an attempt to differentiate between (complex) things like:
Information: This is the payoff matrix (or more of it).*
Communication: I’m going for stag next round. (A promise.)
Information: If everyone chooses “C” things are better for everyone.
Communication: If in a round, I’ve chosen “C” and you’ve chosen “D”, the following round I will choose “D”.
Information: Player M has your source code, and will play whatever you play.
*I think of this as being different from advice (proposing an action, or an action over another).
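To make the distinction above concrete, here is a hypothetical encoding in Python; the Prisoner’s-Dilemma-style payoff numbers are illustrative, not from any of the referenced work:

```python
# Made-up encoding of the information/communication distinction.
PAYOFFS = {  # (my_move, your_move) -> (my_payoff, your_payoff)
    ("C", "C"): (3, 3),  # "if everyone chooses C, things are better for everyone"
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}

# Information: statements about the game or the world.
information = {"kind": "information", "payload": PAYOFFS}

# Communication: commitments about my own future play.
def next_move(my_last, your_last):
    """'If in a round I've chosen C and you've chosen D,
    the following round I will choose D.'"""
    return "D" if (my_last == "C" and your_last == "D") else "C"

communication = {"kind": "communication",
                 "payload": next_move}  # a conditional promise, not a fact
```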
It’s interesting to consider how much effectiveness comes from the skillset being in one person (the table tennis player/sprinter) versus a group—the term “interdisciplinary” seems like it could apply in either case. (Though it has a way of becoming its own thing sometimes, and then there’s a new field, which will probably be focused on getting the skillset into individuals as opposed to groups.)
Really great summaries as always. Digesting these papers is rather time-consuming, and these distillations are very understandable and illuminating.
Handling groups of agents
Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of the pair of agents have high mutual information. The authors empirically find that having even a small number of agents who act as “influencers” can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents’ communication channels, so agents are incentivized to provide information that will impact other agents’ actions (this information is presumed to be truthful and beneficial since otherwise it would subsequently be ignored).
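As a rough sketch of how such an influence reward could be computed (my own reconstruction from the summary above, not the paper’s code; the array shapes and names are assumptions):

```python
import numpy as np

def influence_reward(p_b_given_a, p_a, a_taken):
    """
    KL divergence between B's action distribution given A's actual action
    and B's marginal action distribution (counterfactually averaging over
    the actions A could have taken). Assumes strictly positive probabilities.

    p_b_given_a: array [n_actions_A, n_actions_B], B's policy given A's action
    p_a:         array [n_actions_A], A's policy, for the counterfactual average
    a_taken:     index of the action A actually took
    """
    conditional = p_b_given_a[a_taken]    # p(a_B | a_A, s)
    marginal = p_a @ p_b_given_a          # sum_a' p(a_B | a', s) p(a' | s)
    return float(np.sum(conditional * np.log(conditional / marginal)))

# Sanity check: if B ignores A entirely, conditional == marginal, reward ~ 0.
p_b = np.array([[0.5, 0.5],
                [0.5, 0.5]])
print(influence_reward(p_b, np.array([0.5, 0.5]), 0))  # -> 0.0
```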
Cody’s opinion: I’m interested by this paper’s finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily help others. I also appreciate the point that was made to train in a decentralized way. I’d love to see more work on a less asymmetric version of influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there’s a constructed group of quasi-altruistic agents who are getting less concrete reward because they’re being incentivized by this auxiliary reward.
(I didn’t understand that last part.)
This reminds me of OpenAI Five—the way they didn’t communicate, but all had the same information. It’ll be interesting to see if (in this work) the “AI” used the other benefits/types of communication, or if it was all about providing information. (The word “influencers” seems to evoke that.) “Presuming the information is truthful and beneficial”—this brings up a few questions.
1) Are they summarizing? Or are they giving a lot of information and leaving the other party to figure out what’s important? We (humans) have preferences over this, but whether these agents do will be interesting, along with how that works—is it based on volume or ratios or frequency?
I’m also gesturing at a continuum here—providing a little information versus all of it.
Extreme examples: a) The agents are connected in a communications network. Though distributed (and not all nodes are connected), they share all information.* b) A protocol is developed for sending only the minimum amount of information. Messages read like “101” or “001” or just a “0” or a “1”, and are rarely sent.
2) What does beneficial mean? Useful? Can “true” information be harmful in this setting? (One can imagine an agent, which upon receiving the information “If you press that button you will lose 100 points”, will become curious, and press the button.)
3) Truthful—absolutely, or somewhat? Leaving aside “partial truths”/”lying by omission”, do influencers tend towards “the truth” or something else? Giving more information which is useful for both parties? Saying ‘option B is better than option A’, ‘option C is better than option B’, and continuing on in this manner (as it is rewarded for this) instead of skipping straight to ‘option Z is the best’.
The paper says it’s intrinsic motivation, so that might not be a problem. I’m surprised they got good results from “try to get other agents to do something different”, but it is borrowing from the structure of causality.
Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that Deep RL is subject to a particular kind of training pathology called “ray interference”, caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update of one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, and make it harder to increase performance there in future.
Cody’s opinion: This seems like a useful mental concept, but it seems quite difficult to effectively remedy, except through preferring off-policy methods to on-policy ones, since there isn’t really a way to decompose real RL tasks into separable components the way they do in their toy example.
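To see how just these two ingredients can produce plateaus, here is a toy simulation of my own construction (not the paper’s model; all constants are arbitrary): progress on each subtask scales with current performance, and each subtask’s update partially degrades the other.

```python
import numpy as np

alpha, c = 1.0, 0.95          # learning rate and interference strength (arbitrary)
p = np.array([0.35, 0.20])    # initial performance on subtasks A and B
for t in range(300):
    u = p * (1 - p)                        # (2) learning signal scales with performance
    p = p + alpha * u * (u - c * u[::-1])  # (1) each task's update degrades the other
    p = np.clip(p, 0.01, 0.99)
    if t % 30 == 0:
        print(t, p.round(3))
# Typical run: A rises along an S-curve while B dips and flattens into a
# plateau, only recovering once A has saturated: sequential, winner-take-all
# learning of the kind the paper describes.
```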
This reminds me of the Starcraft AI, AlphaStar. While I didn’t get all the details, I recall that the reason for the population was so that each agent could be given a bunch of different narrower/easier objectives than “Win the game”, like “Build 2 Deathstalkers”, “Scout this much of the map”, or “Find the enemy base ASAP”, in order to find out what kinds of easy-to-learn things helped them get better at the game.
Glancing through the AlphaStar article again, that seemed more oriented around learning a variety of strategies, and learning them well. Also, there might be architecture differences I’m not accounting for.
The neural network weights of each agent are updated by reinforcement learning from its games against competitors, to optimise its personal learning objective. The weight update rule is an efficient and novel off-policy actor-critic reinforcement learning algorithm with experience replay, self-imitation learning and policy distillation.
(Emphasis added.) Well, I guess AlphaStar demonstrates the effectiveness of off-policy methods. (Possibly with a dash of supervised learning, and well, everything else.)
there isn’t really a way to decompose real RL tasks into separable components the way they do in their toy example
This sounds like one of those “as General Intelligences we find this easy, but it’s really hard to program” cases.
*Albeit with two types of nodes—broadcasters and receivers. (If broadcasters don’t broadcast to each other, then: 1) In order for everyone to get all the information, the broadcasters must receive all information. 2) In order for all the receivers to get all the information, then for each receiver r, the information held by the set of broadcasters that broadcast to it, b, must include all information.)
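Condition 2 can be stated as a small coverage check; the network shape and information sets below are made-up examples:

```python
# Sketch of the footnote's coverage condition (illustrative sets only).
ALL_INFO = {"x", "y", "z"}

# Info initially held by each broadcaster.
broadcaster_info = {"B1": {"x", "y"}, "B2": {"y", "z"}}

# Which broadcasters each receiver hears from.
edges = {"r1": {"B1", "B2"}, "r2": {"B2"}}

def receiver_gets_everything(receiver):
    """Condition 2: the union of info held by the broadcasters that
    broadcast to this receiver must include all information."""
    heard = set().union(*(broadcaster_info[b] for b in edges[receiver]))
    return heard >= ALL_INFO

print(receiver_gets_everything("r1"))  # True: {x, y} | {y, z} covers everything
print(receiver_gets_everything("r2"))  # False: {y, z} misses "x"
```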
it if also
if it also
people who would me naive
In the case you referenced, “selfish” or “short sighted”, depending on what you were going for, seem to fit.
If you’re not taking actions contrary to incentives, choosing to do something you value that “the system” doesn’t, you’re not making moral choices,
I very much agree with this part.
For example, the academic journal system, while now online, is mostly a digitized form of the pre-Internet systems, not taking advantage of the new properties of the internet, such as effectively free and instantaneous distribution of material.
It’s interesting to compare this with the polymath projects. I wasn’t a part of them and don’t know what technology they used, but it might be interesting to look into, with regard to both the tech and the organization.
Building a communal repository of knowledge upon which everyone can build.
One might ask how this might be integrated with asking questions. Consider the benefits of building a reverse dictionary*, and the difficulties**. While complex, it seems a simpler task than “answer questions” and might be a tractable sub-problem.
*Sometimes people in different fields work on similar problems, but are unaware of each other. The question “Can we use ideas from ecosystem management to cultivate a healthy rationality memespace?” seems related.
**How do we make something that takes a description, and finds 1) the idea (if it exists) and its names, or 2) related ideas?
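As a toy sketch of that interface (the concept index is made up, and the word-overlap matching is a placeholder for whatever a real system would use, e.g. learned embeddings):

```python
# Given a free-text description, return the best-matching named idea,
# or nearby ones. The tiny "concept index" here is invented for illustration.
CONCEPTS = {
    "comparative advantage": "being relatively better at producing one good",
    "memetics": "ideas spreading and competing like genes in a population",
    "keystone species": "a species whose removal reshapes an ecosystem",
}

def reverse_lookup(description, k=2):
    """Return the k concept names whose descriptions best match the query."""
    query = set(description.lower().split())
    scored = sorted(
        CONCEPTS,
        key=lambda name: -len(query & set(CONCEPTS[name].split())),
    )
    return scored[:k]

print(reverse_lookup("ideas competing and spreading in a memespace"))
# -> ['memetics', ...]: the named idea if it exists, plus related ideas
```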
myriad forms form unnecessary suffering
there is [an] all too real chance
a training pipeline that helps people who want to become good researcher[s] train up.
When Dee hear’s “social construct” she thinks about things
Nicky here’s “real” and things about things that can be located in time and space.
the think I want to be in definition
hears, hears, thinks, thing
In what languages does the OP’s claim hold?
Treating this as an errata thread.
And still, it is not a given that valid English sentences.
This feels like an incomplete thought. It could be self-referential, but it seems unintentional.
The LessWrong forum with posts, comments, and votes is already a technology for intellectual progress
Is the link supposed to go to www.a.com?
This invites the question—“why do we change our values” or “when is it good to change values”. (While that seems to depend on the definition of “values”, it seems worth engaging with this question, as is.)
What would you say to someone old who hadn’t changed their values since they were five years old?
What would you say to anyone old who hadn’t changed their values since they were eighteen years old?
Perhaps my answer depends on their age (or their values). If someone is 5 years and a day old, how much change should we expect? What about 18 years and a day?
Maybe the key factor is information. While we don’t expect every day to be (very) life-changing, we expect a life-changing day to have an effect. In this sense value stability isn’t valued. That is, as we acquire more information*, values should change (if the information is relevant, and different, or suggests change). So what we want might be more information (provided it is true). On the other hand, would you want to have a lot of life-changing days, one after another? To some degree, stability and resources enable adjustments. Beliefs and habits may take time to change, and major life changes can be stressful. It is one thing to seek information; it would be another to live in an externally imposed (sonic) deluge of information.
It is worth noting both that 1) as time goes on, and specifically as one acquires more information, the evidence needed to shift beliefs changes, namely increases, and 2) no change means no growth. To have one’s values frozen at the age of 100, and to still be the same at 200, seems a terrible thing.
(Meta-values might change less than lower-level values, if there are fewer things that affect them; or that might be a result of meta-value change precipitating lower-level value change, so ΔL ⇐ ΔM because ΔM → ΔL. How things might work in the other direction isn’t as clear—would lots of value change cause change in the level above it?)
It is tricky to account for manipulation though. When is disseminating true information manipulative? (Cherry picking?)
Another factor might be something like exploration or ‘preventing boredom’. Nutrition aside, while we might have a favorite food, eating too much of it for too many days in a row may be unappealing (in advance, or in hindsight). Perhaps you have a desire to travel, to see new things; or to change in certain ways—a new skill you want to learn, a new habit to make, or new goals to achieve. (Still sounds like growth, though things we create which outlast us can also be about growing something other than ourselves.)
*No double counting, etc. On the other hand, if we’ve learned more/grown/changed, we might explore the new implications of “old” information. This isn’t easy to model, outside of noticing recurring failure modes.
This is a fantastic piece.