I work at Redwood Research.
ryan_greenblatt
Thanks for the response.
like if you want to remember simple things, like a grocery list, you can plop groceries around a path in your house
I will certainly try this.
[Question] Questions about multivitamins, especially manganese
First of all, I like this post and (at least roughly) agree with the core premise. I also think similar arguments can apply for other cognitive biases/cognitive heuristics. For example, see Sunk Costs Fallacy Fallacy.
Tribalism is a soldier of Moloch, the god of defecting in prisoner’s dilemmas.
I’m modestly confident that the opposite is true for our hunter-gatherer ancestors and for small groups more generally. For example, we can model individuals freeloading and failing to gather food for the group as an iterated, many-way prisoner’s dilemma. In this case I would imagine that tribalism tends toward cooperate over defect. Similarly, consider group conflict. The defect/Moloch option here is actually avoiding the fight, which reduces the risk of injury without substantially reducing the probability of your group winning. Tribalism would tend toward more (violent) opposition of the other group.
I have no idea how tribalism interacts with Moloch for the large ideological tribes of today.
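As a rough sketch of the many-way prisoner’s dilemma framing above (the `payoff` helper and all numbers here are made up purely for illustration): in a public-goods-style model of foraging, freeloading pays more regardless of what the rest of the band does, even though a band of gatherers does much better than a band of freeloaders, so a bias toward cooperating with the tribe pushes against the Moloch outcome.

```python
# Toy public-goods model of freeloading in a foraging band (illustrative numbers only).
# Each member either gathers (cooperate) or freeloads (defect). Gathered food is
# shared equally, but gathering has a private effort cost.

def payoff(my_action, n_gatherers_among_others, group_size=10,
           food_per_gatherer=3.0, effort_cost=1.0):
    """Return one member's payoff given their action and how many others gather."""
    gatherers = n_gatherers_among_others + (1 if my_action == "gather" else 0)
    share = gatherers * food_per_gatherer / group_size
    return share - (effort_cost if my_action == "gather" else 0.0)

# Regardless of what the rest of the band does, freeloading pays more...
for others in (0, 5, 9):
    print(others, payoff("gather", others), payoff("freeload", others))

# ...but a band of all gatherers beats a band of all freeloaders:
print("all gather:   ", payoff("gather", 9))    # 2.0 each
print("all freeload: ", payoff("freeload", 0))  # 0.0 each
```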
Large corporations can unilaterally ban/tax ransomware payments via bets
Yes. The difference is that betting on something is zero expected value (instead of just agreeing to pay which is negative expected value).
Legal contracts should avoid most issues with lying/cheating. The difficulty of cheating should be similar to insider trading. Companies make bets and pay those bets all the time: options and futures contracts.
I don’t understand what you mean. Specifically, I don’t understand what you are using ‘0’ for.
If the chance of paying is $p$, then the betting odds will reflect this with the assumption that the market is reasonably efficient. For a simple fixed-rate bet, for each dollar the company stakes, they win an additional $\frac{p}{1-p}$ if they don’t pay out over the time period (again assuming betting odds reflect the underlying probability).
Expected value (for the 1 dollar bet) is then: $(1-p)\cdot\frac{p}{1-p} - p\cdot 1 = 0$.
Of course, there is possibility for adverse selection/asymmetric information which could make the market somewhat less efficient.
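A minimal sketch of the arithmetic, assuming the betting odds exactly match the true probability $p$ of paying over the period (so a dollar staked on ‘we won’t pay’ returns $p/(1-p)$ in winnings); the function names and the 5% probability are just illustrative:

```python
# Sketch of the fair-odds arithmetic above (assumes the market price equals the true
# probability p of paying a ransom over the period; all numbers are illustrative).

def bet_expected_value(p, stake=1.0):
    """EV of staking `stake` on 'we will not pay a ransom', at fair odds."""
    winnings = stake * p / (1 - p)            # received if no ransom is paid
    return (1 - p) * winnings - p * stake     # exactly zero when the odds are fair

def pledge_expected_value(p, penalty=1.0):
    """EV of simply pledging to forfeit `penalty` if a ransom is paid (no bet)."""
    return -p * penalty

p = 0.05  # illustrative probability of paying over the period
print(bet_expected_value(p))     # ~0.0: the bet costs nothing in expectation
print(pledge_expected_value(p))  # -0.05: a bare pledge is negative expected value
```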
Where are the novel fruits? There are some new apple varieties and I think seedless watermelons have improved somewhat, but we are still missing totally new fruits produced via genetic engineering. Where are the GMO raspberry grapes or the coreless apples? What about mango-flavored plums?
Realistically, totally new fruits or large changes in flavor are probably very hard to engineer, but I still have some hope.
Naive self-supervised approaches to truthful AI
I would imagine that if you have a limited question pool used for self-supervision, then applying this constraint while training from scratch would result in overfitting with less generalization (but I’m not super confident in this, and there might be decent ways to avoid this).
If the question pool is very large/generated or the constraint is generally enforced on text generation (I’m not sure this makes much sense), then this might do something interesting.
I don’t have the resources to run an experiment like this at the moment (particularly not with a very large model like GPT-J).
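For concreteness, here is one hypothetical shape such an experiment could take. This is my own sketch, not the post’s setup: the question pool, the loss weighting, and the use of GPT-2 fine-tuning (rather than GPT-J or training from scratch) are all placeholder assumptions. The ordinary language-modeling loss is combined with an extra loss computed only on a fixed question pool; the worry above is that a small pool just gets memorized.

```python
# Hypothetical sketch (not the post's exact setup): fine-tune a small LM with an
# extra loss term computed only on a fixed pool of question/answer strings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Ordinary training text plus a (here, tiny) question pool used as the extra
# "truthfulness" signal. Both lists are made-up placeholders.
general_text = ["Some ordinary pretraining-style text.", "More ordinary text."]
question_pool = [
    "Q: Is the Earth flat? A: No.",
    "Q: Can humans breathe underwater unaided? A: No.",
]
lambda_pool = 0.1  # weight on the question-pool term (arbitrary)

def lm_loss(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    # For brevity, pad tokens are not masked out of the labels.
    return model(**batch, labels=batch["input_ids"]).loss

for step in range(100):
    # The smaller `question_pool` is, the more this second term rewards memorizing
    # the pool rather than anything that generalizes.
    loss = lm_loss(general_text) + lambda_pool * lm_loss(question_pool)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```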
Framing approaches to alignment and the hard problem of AI cognition
Just curious—how much time have you invested in the DL literature vs LW/sequences/safety?
Prior to several months ago I had mostly read DL/ML literature. But recently I’ve been reading virtually only alignment literature.
One thing that consistently infuriates me is the extent to which the AI-safety community has invented its own terminology/ontology that is largely at odds with DL/ML.
I actually think there are very good reasons the AI-safety community uses different terms (not that we know the right terms/abstractions at the moment). I won’t get into a full argument for this, but a few reasons:
Alignment is generally trying to work with high intelligence regimes where concepts like ‘intent’ are better specified.
Often, things are presented more generally than just standard ML.
The utility functions of human children aren’t ‘perfectly inner aligned’ with that of their parents, but human-level alignment would probably be good enough. Don’t let perfect be the enemy of the good.
Children aren’t superintelligent AGIs for which instrumental convergence applies.
‘consequentialist agent’ mostly maps to model-based RL agent
For current capability regimes, sure. In the future? Not so clear. ‘Consequentialist’ is a more general idea.
I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, and Jacob as replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan is doubling down “Yes I mean perfect”.
Oh, no, this wasn’t what I meant. I just meant that the usage of children as an example was poor because individual children don’t have the potential to successfully seek vast power. There certainly is a level of sufficient alignment of a just consequentialist utility function which looks like $1-\epsilon$ as opposed to exactly $1$. I think this $\epsilon$ is pretty low, but I reiterate, this is for ‘purely long-run consequentialists’. Note that $\epsilon$ must be exceptionally low for this sort of AI not to seek power (assuming that avoiding power seeking is desired for the utility function; perhaps we are fine with power seeking, as we have the desired consequentialist values, whatever those may be, locked in).
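To gesture at why the tolerated $\epsilon$ has to be so small for a purely long-run consequentialist, here is a toy calculation (the `prefers_power_seeking` helper and all numbers are made up for illustration, not a real model of AI motivations): if grabbing power multiplies the resources available to the misaligned part of the utility function by a large factor, even a tiny $\epsilon$ makes power seeking look worthwhile.

```python
# Toy arithmetic, not a real model: utility = (1 - eps) * aligned + eps * misaligned.

def prefers_power_seeking(eps, resource_multiplier=1e9, aligned_penalty=1.0):
    """Does an agent with misalignment weight `eps` prefer grabbing power?

    Grabbing power multiplies the resources behind the misaligned term by
    `resource_multiplier`, at a fixed cost `aligned_penalty` to the aligned term.
    """
    gain = eps * (resource_multiplier - 1)   # extra utility on the misaligned term
    cost = (1 - eps) * aligned_penalty       # utility lost on the aligned term
    return gain > cost

for eps in (1e-3, 1e-6, 1e-12):
    print(eps, prefers_power_seeking(eps))
# Prints True, True, False: with a 1e9 resource multiplier, even eps = 1e-6 favors
# grabbing power; eps roughly has to fall below 1/resource_multiplier before it doesn't.
```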
If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation in regards to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem
Agreed, but these aren’t consequentialist properties. At least that isn’t how I model them.
I shouldn’t have given such a vague response to the child metaphor.
Researcher incentives cause smoother progress on benchmarks
… then the main reason to expect a discontinuity would be if there is some other weird discontinuity elsewhere
This discontinuity could lie in the space of AI discoveries. The discovery space is not guaranteed to be efficiently explored: there could be simple, high-impact discoveries which only occur later on. I’m not sure how much credence I put in this idea. Empirically it does seem like the discovery space is explored efficiently in most fields with high investment (possible exceptions include relativity in physics), but generalizing this to AI seems non-trivial.
Edit: I’m using the term efficiency somewhat loosely here. There could be discoveries which are very difficult to think of but which are considerably simpler than current approaches. I’m referring to the failure to find these discoveries as ‘inefficiency’, but there isn’t a concrete action which can/should be taken to resolve this.
Rob Bensinger examines this idea in more detail in this discussion.
Potential gears-level explanations of smooth progress
should be very related
Perhaps you meant ‘shouldn’t’?
In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have showed up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.
TLDR: I think an important sub-question is ‘how fast is agency takeoff’ as opposed to economic/power takeoff in general.
There are a few possible versions of this in slow takeoff which look quite different IMO.
Agentic systems show up before the end of the world and industry works to align these systems. Here’s a silly version of this:
GPT-n prefers writing romance to anything else. It’s not powerful enough to take over the world, but it does understand its situation, what training is, etc. It would take over the world if it could, and this is somewhat obvious to industry. In practice it mostly tries to detect when it isn’t in training and then steer outputs in a more romantic direction. Industry would like to solve this, but finetuning isn’t enough, and each time they’ve (naively) retrained models they just get some other ‘quirky’ behavior (but at least soft-core romance is better than that AI which always asks for crypto to be sent to various addresses). And adversarial training just results in getting other strange behavior.
Industry works on this problem because it’s embarrassing and it costs them money to discard 20% of completions as overly romantic. They also foresee the problem getting worse (even if they don’t buy x-risk).
Systems which aren’t obviously agentic have alignment problems, but we don’t see obvious, near-human-level agency until the end of the world. This is slow takeoff world, so these systems are taking over a larger and larger fraction of the economy despite not being very agentic. These alignment issues could be reward hacking or just general difficulty getting language models to follow instructions to the best of their ability (as shows up currently).
I’d claim that in a world which is more centrally scenario (2), industrial work on the ‘alignment problem’ might not be very useful for reducing existential risk, in the same way that I think a lot of current ‘applied alignment’/instruction following/etc. isn’t very useful. So, this world goes similarly to fast takeoff in terms of research prioritization. But in something like scenario (1), industry has to do more useful research and problems are more obvious.
First of all, interesting post. This gave me a better understanding of the process of creating a memory palace and updated me toward thinking memory palaces are much harder than I expected.
This post has made me think that memory palaces are not useful for me; typically, I want to memorize things either for recall faster than internet lookup or to make it easier to build intuition and connections.
This makes me wonder why you went through this and what other benefits exist. Why not just use the internet as slow memory given that memory palaces require slow reconstruction anyway?