I lurk and tag stuff.
is one of the first results for “yudkowsky harris” on YouTube. Is there supposed to be more than this?
You should distinguish between “reward signal” as in the information that the outer optimization process uses to update the weights of the AI, and “reward signal” as in observations that the AI gets from the environment that an inner optimizer within the AI might pay attention to and care about.
From evolution’s perspective, your pain, pleasure, and other qualia are the second type of reward, while your inclusive genetic fitness is the first type. You can’t see your inclusive genetic fitness directly, though your observations of the environment can let you guess at it, and your qualia will only affect your inclusive genetic fitness indirectly by affecting what actions you take.
To answer your question about using multiple types of reward:
For the “outer optimization” type of reward, in modern ML the loss function used to train a network can have multiple components. For example, an update on an image-generating AI might say that the image it generated had too much blue in it, and didn’t look enough like a cat, and the discriminator network was able to tell it apart from a human-generated image. Then the optimizer would take a gradient-descent step that improves the model on all of those metrics simultaneously for that input.
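To make that concrete, here’s a minimal sketch of a multi-component loss. The component names and weights are invented for illustration; a real setup would compute each term from the network’s output and backpropagate through the sum.

```python
# Hypothetical sketch: one scalar loss built from several components.
# The names and weights below are made up for illustration.

def combined_loss(blue_excess, cat_likeness, discriminator_score):
    """Combine several penalties into one scalar the optimizer descends on."""
    blue_penalty = blue_excess          # too much blue in the image
    cat_penalty = 1.0 - cat_likeness    # didn't look enough like a cat
    gan_penalty = discriminator_score   # discriminator spotted the fake
    # A single gradient step on this weighted sum improves all three
    # metrics at once (to the extent their gradients don't conflict).
    return 0.5 * blue_penalty + 1.0 * cat_penalty + 2.0 * gan_penalty

loss = combined_loss(blue_excess=0.3, cat_likeness=0.8, discriminator_score=0.9)
```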
For “intrinsic motivation” type rewards, the AI could have any reaction whatsoever to any particular input, depending on what reactions were useful to the outer optimization process that produced it. But in order for an environmental reward signal to do anything, the AI has to already be able to react to it.
This has overtaken the post it’s responding to as the top-karma post of all time.
I’m impressed by the number of different training regimes stacked on top of each other.
-Train a model that detects whether a Minecraft video on YouTube is free of external artifacts like face cams.
-Then feed the good videos to a model that’s been trained using data from contractors to guess what key is being pressed each frame.
-Then use the videos and input data to train a model that, in any game situation, presses whatever inputs it guesses a human would be most likely to press, in an undirected, shortsighted way.
-And then fine-tune that model on a specific subset of videos that feature the early game.
-And only then use some mostly-standard RL training to get good at some task.
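The data flow through those stacked stages can be sketched roughly as follows. Every function name here is a toy stand-in I made up, not anything from the actual codebase; the point is just how each stage’s output feeds the next.

```python
# Hypothetical sketch of the stacked training stages, with toy stand-ins.
# None of these names come from the real project; this only shows data flow.

def is_clean(video):
    return "facecam" not in video  # stage 1: artifact filter

def train_idm(contractor_data):
    # stage 2: inverse-dynamics model guessing the keypress for each frame,
    # trained on contractor-labeled footage
    return lambda video: ["W"] * len(video)  # toy: always guesses "W"

def train_bc(labeled_videos):
    # stage 3: behavioral cloning on (video, guessed keypresses) pairs
    return {"stage": "bc", "data": len(labeled_videos)}

def finetune(policy, early_game_videos):
    return {**policy, "stage": "early_game_ft"}   # stage 4

def rl_finetune(policy, task):
    return {**policy, "stage": "rl:" + task}      # stage 5

videos = ["abc", "facecam_xyz", "defg"]
idm = train_idm(["contractor clips"])
clean = [v for v in videos if is_clean(v)]
policy = rl_finetune(
    finetune(train_bc([(v, idm(v)) for v in clean]), ["early-game clips"]),
    "some task",
)
```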
While the engineer learned one lesson, the PM will learn a different one when a bunch of the bombs start installing operating-system updates during the mission, or won’t work with the new Wi-Fi system, or something: the folly of trying to align an agent by applying a new special-case patch whenever something goes wrong.
No matter how many patches you apply, the safety-optimizing agent keeps going for the nearest unblocked strategy, and if you keep applying patches, you eventually reach a point where its solution is too complicated for you to understand how it could go wrong.
Meta: This is now the top-voted LessWrong post of all time.
Robust Agents seems sort of similar but not quite right.
Looking at the generation code, aptitude had interesting effects on our predecessors’ choice of cheats.
-Higher aptitude Hikkikomori and Otaku are less likely to take Hypercompetent Dark Side (which has lower benefits for higher aptitude characters).
-Higher aptitude characters across the board are less likely to take Monstrous Regeneration or Anomalous Agility, which were some of the better choices available.
-Higher aptitude Hikkikomori are more likely to take Mind Palace.
I’ve added a market on Manifold if you want to bet on which strategy is best.
Somewhat. The profile pic changes based on the character’s emotions, or their reaction to a situation. Sometimes there’s a reply where the text is blank and the only content is the character’s reaction as conveyed by the profile pic.
That said, it’s a minor enough element that you wouldn’t lose too much if it wasn’t there.
On the other hand, it is important for you to know which character each reply is associated with, as trying to figure out who’s talking from the text alone could get confusing in many scenes. So any format change should at least preserve the names.
If everyone ends up with the same vote distribution, I think it removes the incentive for colluding beforehand, but it also means the vote is no longer meaningfully quadratic. The rank ordering of the candidates will be in order of how many total points were spent on them, and you basically end up with score voting.
edit: I assume that the automatic collusion mechanism is something like averaging the two ballots’ allocations for each candidate, which does not change the number of points spent on each candidate. If instead some ballots end up causing more points to be spent on their preferred candidates than they initially had to work with, there are almost certainly opportunities for strategic voting and beforehand collusion.
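That averaging assumption is easy to sanity-check (treating “points spent” as the per-candidate allocation itself; the ballots below are made up):

```python
# If both voters switch to the average of their two ballots, the combined
# per-candidate totals are unchanged, so the outcome ordering can't shift.

def average_ballots(a, b):
    return {c: (a.get(c, 0) + b.get(c, 0)) / 2 for c in set(a) | set(b)}

alice = {"X": 9, "Y": 1}
bob = {"X": 1, "Y": 4, "Z": 5}

merged = average_ballots(alice, bob)
combined_before = {c: alice.get(c, 0) + bob.get(c, 0) for c in set(alice) | set(bob)}
combined_after = {c: 2 * merged[c] for c in merged}  # both cast the merged ballot
```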
Or put a spoilered link to this post in the dath ilan tag’s wiki text?
A type of forum roleplay / collaborative fiction writing started by Alicorn.
For further complication, what if you consider potential backers having different estimations of the value of the project?
That would raise the risk of backing-for-the-bonus projects that you don’t like. Maybe you would back the project to punch cute puppies to 5% or 25%, but if it’s at 75% you start to suspect that there are enough cute puppy haters out there to push it all the way if you get greedy for the bonus.
For good projects, you could have a source for the refund bonuses other than the platform or the project organizers—the most devoted fans. Allow backers to submit a pledge that, if the project is refunded, gets distributed to other backers rather than the person who submitted it.
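A rough sketch of how that fan-funded bonus could pay out on a failed project (all numbers and the pro-rata split are my own assumptions, not part of any existing platform):

```python
# Hypothetical refund payout: pledges flagged as "bonus" are, on failure,
# split pro rata among the ordinary backers instead of being returned to
# the devoted fans who staked them.

def refund(pledges):
    """pledges: list of (amount, is_bonus). Returns payout per ordinary backer."""
    ordinary = [amt for amt, bonus in pledges if not bonus]
    bonus_pool = sum(amt for amt, bonus in pledges if bonus)
    total = sum(ordinary)
    # Each ordinary backer gets their pledge back plus a pro-rata bonus share.
    return [amt + bonus_pool * amt / total for amt in ordinary]

# Two fans stake 10 each as bonus; three ordinary backers pledge 50, 30, 20.
payouts = refund([(50, False), (30, False), (20, False), (10, True), (10, True)])
```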
Agree it doesn’t belong; I have downvoted it.
There is no tag that encompasses all of AI alignment and nothing else.
I think the reason you gave is basically correct—when I look at the 15 posts with the highest relevance score on the AI tag, about 12 of them are about alignment.
On the other hand, when a tag doesn’t exist it may just be because no one ever felt like making it.
Merge candidate with startups?
“Transformer Circuits” seems like too specific of a tag—I doubt it applies to much beyond this one post. Probably should be broadened to encompass https://www.lesswrong.com/posts/MG4ZjWQDrdpgeu8wG/zoom-in-an-introduction-to-circuits and related stuff.
“Circuits (AI)” to distinguish from normal electronic circuits?
This sounds a lot like the “Precisely Bound Demons and their Behavior” concept that Yudkowsky described but never wrote the story for.
Ra also features magic-as-engineering.
Chiming in later to say that I think the tag should stay, especially now that multiple people are doing them. Compare “Rationality Quotes” and “Open Threads” for other tags that could be accused of just being sequences.