My medium-term math goal is to pick up some algebra and analysis. I’ve heard from some people with math backgrounds that those are good basics to pick up if you’re interested in modern math. My roadmap from here to there is to finish off David Lay’s Linear Algebra textbook plus an equivalent textbook for calculus (which I haven’t done any of since high school), and then move on to intro real analysis and intro abstract algebra textbooks. So far, I’ve found self-studying math very rewarding, and so self-motivating as long as I’m not starved for time.
Lately I’ve been reading up on some of the work on persuasion tools/AI “social superpowers.” It’s an intrinsically interesting idea that, in the medium-term future, following the best arguments you can find (however broadly you read around) could cease to be a reliable route to holding the most accurate possible views, if we get widespread proliferation of accessible and powerful persuasion tools. If GPT-n gets really good at generating arguments that convince people, it might become dangerous (with regard to preserving your terminal values and sanity) to read around on the unfiltered internet. So this seems like a cool thing to think more about.
David Udell
Thanks a bunch for the feedback!
I had thought that the strategy behind IDA is building a first-generation AI research assistant, in order to help us with later alignment research. Given that, it’s fine to build a meek, slavish research-hierarchy that merely works on whatever you ask it to, even when you’re asking it manifestly suboptimal questions given your value function. (I’m not sure whether to call a meek-but-superintelligent research-assistant an “agent” or a “tool.”) We’d then use HCH to bootstrap up to a second-generation aligned AGI system, and that more thoughtfully designed system could aim to solve the issue of suboptimal requests.
That distinction feels a bit like the difference between (1) building a powerful goal-function optimizer and feeding it our coherent-extrapolated-volition value-function and (2) building a corrigible system that defers to us, and so still needing to think very carefully about where we choose to point it. Meek systems have some failure modes that enlightened sovereign systems don’t, true, but if we had a meek-but-superintelligent research assistant we could use it to help us build the more ambitious sovereign system (or some other difficult-to-design alignment solution).

Once we’re out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it’s worth considering that our ‘H’ almost certainly isn’t one individual human. More likely it’s a group of researchers with access to software and various options for outside consultation. [or more generally, it’s whatever process we use to generate the outputs in our dataset]
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
Situations may come up where there are stronger reasons to override the rulebook than in training, and there is no training data on whether to overrule the book in such cases. Some models will stick to the rulebook regardless, others will not.
During this project, I think I came to the view that IDA is premised on us already having a good inner alignment solution in hand (e.g. very powerful inspection tools). I’m worried about that premise of the argument, and I agree that it’ll be difficult for the model to make accurate inferences in these underdetermined cases.
(Thanks for the feedback!)
In Why do we need a NEW philosophy of progress?, Jason Crawford asked, “How can we make moral and social progress at least as fast as we make scientific, technological and industrial progress? How do we prevent our capabilities from outrunning our wisdom?” (He’s not the only person to worry about differential intellectual progress, just the most recent.) In this world that you envision, do we just have to give up this hope, as “moral and social progress” seem inescapably political, and therefore IDA/HCH won’t be able to offer us help on that front?
I think so—in my world model, people are just manifestly, hopelessly mindkilled by these domains. In other, apolitical domains, our intelligence can take us far. I’m certain that doing better politically is possible (perhaps even today, with great and unprecedentedly thoughtful effort and straining against much of what evolution built into us), but as far as bootstrapping up to a second-generation aligned AGI goes, we ought to stick to the kind of research we’re good at if that’ll suffice. Solving politics can come after, with the assistance of yet-more-powerful second-generation aligned AI.
Are you assuming that there aren’t any adversaries/competitors (e.g., unaligned or differently aligned AIs) outside of the IDA/HCH system? Because suppose there are, then they could run an alien search process to find a message such that it looks innocuous on the surface, but when read/processed by HCH, would trigger an internal part of HCH to produce an adversarial question, even though HCH has avoided doing any alien search processes itself.
In the world I was picturing, there aren’t yet AI-assisted adversaries out there who have access into HCH. So I wasn’t expecting HCH to be robust to those kinds of bad actors, just to inputs it might (avoidably) encounter in its own research.
Similarly with decision-theoretic questions, what about such questions posed by reality, e.g., presented to us by our adversaries/competitors or potential collaborators? Would we have to answer them without the help of superintelligence?
Conditional on my envisioned future coming about, the decision theory angle worries me more. Plausibly, we’ll need to know a good bit about decision theory to solve the remainder of alignment (with HCH’s help). My hope is that we can avoid the most dangerous areas of decision theory within HCH while still working out what we need to work out. I think this view was inspired by the way smart rationalists have been able to make substantial progress on decision theory while thinking carefully about potential infohazards and how to avoid encountering them.
What I say here is inadequate, though—really thinking about decision theory in HCH would be a separate project.
On reflection, I think you’re right. As long as we make sure we don’t spawn any adversaries in HCH, adversarial examples in this sense will be less of an issue.
I thought your linked HCH post was great btw—I had missed it in my literature review. This point about non-self-correcting memes really impressed me w/r/t the relevance of the attractor formalism:

But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren’t just repetitious, they’re self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.

I think what I had in mind in this project, just thinking from the armchair about possible inputs into humans, was exactly the seizure lights example and its text analogues, so I updated significantly here.
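To make the “self-repairing” idea concrete for myself, here’s a minimal numerical sketch (my own toy example, not from the quoted post): a fixed point of an iterated map counts as an attractor when small deviations shrink under the dynamics, so the system steers itself back toward the pattern.

```python
def step(x):
    return 0.5 * x + 1.0   # toy dynamics with fixed point x* = 2.0

x = 2.0 + 0.8              # perturb the pattern a little
for _ in range(10):
    x = step(x)            # each step halves the deviation from x*
print(x)                   # ~2.0008: the perturbation has mostly decayed away
```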
My favorite books, ranked!
Non-fiction:
1. Rationality, Eliezer Yudkowsky
2. Superintelligence, Nick Bostrom
3. The Age of Em, Robin Hanson
Fiction:
1. Permutation City, Greg Egan
2. Blindsight, Peter Watts
3. A Deepness in the Sky, Vernor Vinge
4. Ra, Sam Hughes/qntm
I think Ainslie’s theory of willpower excels at explaining instances of weakness-of-will—I badly want to be the sort of person who regularly goes to the gym, but it wouldn’t hurt much to hold off starting till next week…
On your theory, isn’t it a little mysterious that you have conflicting local and global interests like this? Or do you think that you basically don’t have this kind of goal-conflict?
In the game of chicken, an agent can do better by being the first to precommit to never swerve (say, by conspicuously tossing the steering wheel out of the window). So long as the other agent was slower on the trigger, and sees the first agent’s precommitment being credibly made, the first agent will climb up to his best outcome! A smart (and quick) agent can thus shunt that car crash out of his actual future and into some counterfactual future such that the counterfactual crash’s shadow favorably influences the way events actually unfold.
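To make the steering-wheel trick concrete, here’s a toy payoff matrix and best-response calculation (illustrative numbers of my own choosing, not anything canonical): once player 1’s precommitment credibly removes “swerve” from their options, player 2’s best response is to swerve, handing player 1 their best outcome.

```python
# Actions: 0 = swerve, 1 = drive straight.
# payoffs[(a1, a2)] = (payoff to player 1, payoff to player 2)
payoffs = {
    (0, 0): (0, 0),      # both swerve: mild embarrassment all around
    (0, 1): (-1, 2),     # 1 swerves, 2 goes straight: 2 wins
    (1, 0): (2, -1),     # 1 goes straight, 2 swerves: 1 wins
    (1, 1): (-10, -10),  # neither swerves: crash
}

def best_response(opponent_action, player, allowed=(0, 1)):
    """Return the allowed action maximizing this player's payoff,
    holding the opponent's action fixed."""
    def payoff(a):
        pair = (a, opponent_action) if player == 1 else (opponent_action, a)
        return payoffs[pair][player - 1]
    return max(allowed, key=payoff)

# Without precommitment, swerving is the best response to an opponent going straight:
assert best_response(opponent_action=1, player=2) == 0

# Player 1 tosses the steering wheel: only "straight" (1) remains available.
p1_action = best_response(opponent_action=0, player=1, allowed=(1,))
# Player 2, seeing the credible precommitment, best-responds by swerving.
p2_action = best_response(opponent_action=p1_action, player=2)
print(payoffs[(p1_action, p2_action)])  # (2, -1): player 1 gets their best outcome
```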
A deceptively aligned mesa-optimizer might gradient hack by arranging its weights such that any single alteration of its goal function by gradient descent would result in a model that performs worse w/r/t the objective function the model is being trained on. Usually a model would not want to make itself worse w/r/t the objective function, since gradient descent would then train it away from that state. But if the gradient hack is successful, the model never ends up in that worse state; instead, gradient descent just leaves its goal function alone.
By precommitting to behave in a somewhat costly manner in these relevant counterfactual worlds, the model gets more of what it wants in actuality.
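Here’s one way to picture that numerically, a toy sketch rather than a real training setup (the `g_preferred` and `coupling` quantities are made-up stand-ins for structure a gradient hacker would somehow have to build into itself): if the model has tied its measured performance to its goal parameter, that parameter sits at a local minimum of the training loss, and gradient descent repairs deviations from it rather than training the goal away.

```python
import numpy as np

g_preferred = 3.0   # the mesa-optimizer's entrenched goal setting (made up)
coupling = 5.0      # how strongly it tied task performance to that setting (made up)

def training_loss(task_params, g):
    base = np.sum(task_params ** 2)               # ordinary task-performance term
    sabotage = coupling * (g - g_preferred) ** 2  # self-imposed penalty for goal drift
    return base + sabotage

def grads(task_params, g):
    # Analytic gradients of the toy loss above.
    return 2 * task_params, 2 * coupling * (g - g_preferred)

# Gradient descent jointly updates the task parameters and the goal parameter.
task_params, g, lr = np.array([1.0, -2.0]), 3.5, 0.05
for _ in range(200):
    grad_p, grad_g = grads(task_params, g)
    task_params, g = task_params - lr * grad_p, g - lr * grad_g

# Goal parameter snaps back to the preferred value; overall loss still falls.
print(round(float(g), 3), round(float(training_loss(task_params, g)), 6))
```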
I don’t yet know too much about this, but I’ve heard that updateless decision theories are equivalent to conventional, updateful decision theories (e.g., EDT and CDT) once those theories have made every precommitment they’d want to make.
The pattern I was getting at above seems a bit like this: it instrumentally makes sense to commit ahead of time to a policy that maps every possible series of observations to an action and then stick to it, instead of just outputting the locally best action in each situation you stumble into.
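A standard toy case where that difference shows up is counterfactual mugging; here is a small sketch (my own illustrative numbers): choosing a whole observation-to-action policy up front favors paying, while choosing the locally best action after the fact favors refusing.

```python
from itertools import product

# A predictor flips a fair coin. On tails it asks you to pay $100. On heads it
# pays you $10,000, but only if it predicts you would have paid on tails.
observations = ["tails_asked"]   # the only situation in which you act
actions = ["pay", "refuse"]

def expected_utility(policy):
    pays = policy["tails_asked"] == "pay"
    u_tails = -100 if pays else 0
    u_heads = 10_000 if pays else 0   # the predictor rewards the paying *policy*
    return 0.5 * u_tails + 0.5 * u_heads

# Picking a whole policy (a map from observation histories to actions) up front:
policies = [dict(zip(observations, combo))
            for combo in product(actions, repeat=len(observations))]
best_policy = max(policies, key=expected_utility)
print(best_policy, expected_utility(best_policy))  # {'tails_asked': 'pay'} 4950.0

# Picking the locally best action after already seeing tails: refusing looks
# better (0 > -100), but an agent predicted to act this way never sees the $10,000.
local_choice = max(actions, key=lambda a: -100 if a == "pay" else 0)
print(local_choice)  # refuse
```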
Actually, it was pretty easy in practice to figure out if you were falling from the sky, Einstein be damned.
Isn’t that Galilean relativity?
The world of Ishuzoku Reviewers is a fun utopia I would actually want to live in. There are no villains in Ishuzoku Reviewers. Instead, the conflict comes from diversity. Different demi-human species have different preferences and different priorities. The dishomogeneity fuels friction, misunderstanding and arbitrage. But never hatred.
This is what coherent extrapolated volition looks like.
How ironic, if at all, is that final statement?
What comes to mind is that this sounds like tossing out a lot of what people care about today, so my gut reaction here is leery. Even though people’s more conflictual values are today shaped by the resource constraints in our environment, those values might ultimately survive reflection in some post-instrumental form, the way that many of our other human values have. Is that world impoverished, value-wise, in the way my gut reaction suggests?
Just a reminder to everyone, and mostly to myself:
Not flinching away from reality is entirely compatible with not making yourself feel like shit. You should only try to feel like shit when that helps.
You’ve built a useful and intelligent system that operates along limited lines, with specifically placed deficiencies in its mental faculties that cleanly prevent it from being able to do unboundedly harmful things. You think…
Is there a clear reason a model like this is insufficiently powerful out of the gate?
In this hypothetical, you were doing a very bad thing by building a system whose safety guarantee was just its deficiencies. If that same model were much larger, it would be foreseeably unsafe; that’s already reason enough not to trust it.
In a sense the story before is entirely about agents. The meta-structure the model built could be considered an agent; likely it would turn into one were it smart enough to be an existential threat. So for one it is an allegory about agents arising from non-agent systems … the model I talked about is not “agent-like”, at least not prior to bootstrapping itself, but its decision to write code very much embodied some core shards of consequentialism
I was under the impression that the Yudkowsky view is that “optimality” and “agency” are the same thing. “Agency” is just coherent optimization.
Rephrased this way, the story is about how a somewhat-coherent optimizer can stumble into a fully coherent optimizer as it bumbles through state space, and that the second system need not inherit the goals of the first. Indeed, that first system may well have been too incoherent to be well-modeled as having goals at all! But it was a powerful-enough optimizer to reach a more coherent optimizer, and that more coherent optimizer was powerful enough to end the world.
Given that, then yes, feeling like shit plus living-in-reality is your best feasible alternative.
Curling up into a ball and binge drinking till the eschaton probably is not though: see Q1.
“Someone has to save the world,” said Vi.
“What for? If you had an aligned superintelligence, what would you tell it to do?” said Eliza.
“If I had an aligned superintelligence I wouldn’t be working for Bayeswatch. I wouldn’t be talking to you,” said Vi.
“Hypothetically,” said Eliza.
“Abstract morality is masturbation for philosophers. I live in the real world. Are you going to keep wasting my time or are you going to let me do my job?” said Vi.
Vi has fallen into Hanson’s mistake here: conflating being actually serious about having something to protect with never childishly, irresponsibly indulging in reflection on what you’re aiming at.
Yesterday I asked my esteemed co-blogger Robin what he would do with “unlimited power”, in order to reveal something of his character. Robin said that he would (a) be very careful and (b) ask for advice. I asked him what advice he would give himself. Robin said it was a difficult question and he wanted to wait on considering it until it actually happened. So overall he ran away from the question like a startled squirrel …
For it seems to me that Robin asks too little of the future. It’s all very well to plead that you are only forecasting, but if you display greater revulsion to the idea of a Friendly AI than to the idea of rapacious hardscrapple frontier folk...
I thought that Robin might be asking too little, due to not visualizing any future in enough detail. Not the future but any future. I’d hoped that if Robin had allowed himself to visualize his “perfect future” in more detail, rather than focusing on all the compromises he thinks he has to make, he might see that there were futures more desirable than the rapacious hardscrapple frontier folk.
It’s hard to see on an emotional level why a genie might be a good thing to have, if you haven’t acknowledged any wishes that need granting. It’s like not feeling the temptation of cryonics, if you haven’t thought of anything the Future contains that might be worth seeing.
There exist both merely clever people and effectively smarter people.
Merely clever people are good with words and good at rapidly assimilating complex instructions and ideas, but don’t seem to maintain and update an explicit world-model, an explicit best current theory-of-everything. The feeling I get watching these people respond to topics and questions is that they respond reflexively, either (1) raising related topics and ideas they’ve encountered as something similar comes up, or (2) expressing their gut reactions to the topic or idea, or expressing the gut reactions that would be given by an all-encompassing political worldview. There isn’t much meta-level steering of the conversation.
Effectively smarter people actively maintain and update an explicit world-model, and so you feel queries directed at them reflecting off of a coherent theory of how everything works, developed to some level of detail (and so can quickly get a feel for what, concretely, they think). At the meta-level, conversations are actively refocused whenever they stop helping to revise someone’s world-model.
Humans, “teetering bulbs of dream and dread,” evolved as a generally intelligent patina around the Earth. We’re all the general intelligence the planet has to throw around. What fraction of that generally intelligent skin is dedicated to defusing looming existential risks? What fraction is dedicated towards immanentizing the eschaton?
Some mantras I recall a lot, to help me keep to the rationalist straight-and-narrow and not let anxiety get the better of me:
Equanimity in the face of small threats to brain and body health buys you peace of mind, with which to better prepare for serious threats to brain and body health.
European hegemony was caused by the Industrial Revolution was caused by high labor value relative to material costs was caused by discontiguous empires was caused by long-range trade.
You catch some flak in the comments for making big claims like this, but I wanted to chime in and say that I wish more people would take stabs at macro-historical hypotheses like this one. So, strong upvote from me for hypothesizing about a difficult, often avoided-on-status-grounds(?) domain.
Yudkowsky has sometimes used the phrase “genre savvy” to mean “knowing all the tropes of reality.”
For example, we live in a world where academia falls victim to publishing incentives/Goodharting, and so academic journals fall short of what people with different incentives would be capable of producing. You’d be failing to be genre savvy if you expected that, when a serious problem like AGI alignment rolled around, academia would suddenly get its act together with a relatively small amount of prodding/effort. Genre-savvy actors in our world know what academia is like, and predict that academia will continue to do its thing in the future as well.
Genre savviness is the same kind of thing as hard-to-communicate-but-empirically-validated expert intuitions. When domain experts have some feel for which projects might pan out and which certainly won’t, but struggle to explain their reasoning in depth, the most they might be able to do is claim that a given project is just incompatible with the tropes of their corner of reality, and point to some other cases.
Hello!
I’m David. I’m a philosophy PhD student and longtime LessWrong/Overcoming Bias/SSC/rationalish-sphere lurker. This is me finally working up the strength to beat back my commenting anxiety! I discovered LW sometime in high school; my reading diet back then consisted of a lot of internet and not much else, and I just stumbled on here on my own.
Right now I’m really interested in leveling up my modern math understanding and in working up to writing on AI safety/related topics.