For a long time, this was my impression as well, but Caplan claims the evidence doesn’t bear this out. And many organizations do use IQ testing successfully; the military is a prime example.
[Meta: this is normally something I would post on my tumblr, but instead am putting on LW as an experiment.]
Sometimes, in games like Dungeons and Dragons, there will be multiple races of sapient beings, with humans as a sort of baseline. Elves are often extremely long-lived, but most treatments of this I find pretty unsatisfying. Here’s a new take, that I don’t think I’ve seen before (except the Ell in Worth the Candle have some mild similarities):
Humans go through puberty at about 15 and become adults around 20, lose fertility (at least among women) at about 40, and then become frail at about 60. Elves still ‘become adults’ around 20, in that a 21-year old elf adventurer is as plausible as a 21-year old human adventurer, but they go through puberty at about 40 (and lose fertility at about 60-70), and then become frail at about 120.
This has a few effects:
The peak skill of elven civilization is much higher than the peak skill of human civilization (as a 60-year old master carpenter has had only ~5 decades of skill growth, whereas a 120-year old master carpenter has had ~11). There’s also much more of an ‘apprenticeship’ phase in elven civilization (compare modern academic society’s “you aren’t fully in the labor force until ~25” to a few centuries ago, when it would have happened at 15), aided by them spending longer in the “only interested in acquiring skills” part of ‘childhood’ before getting to the ‘interested in sexual market dynamics’ part of childhood.
Young elves and old elves are distinct in some of the ways human children and adults are distinct, but not others; the 40-year old elf who hasn’t started puberty yet has had time to learn 3 different professions and build a stable independence, whereas the 12-year old human who hasn’t started puberty yet is just starting to operate as an independent entity. And so sometimes when they go through puberty, they’re mature and stable enough to ‘just shrug it off’ in a way that’s rare for humans. (I mean, they’d still start growing a beard / etc., but they might stick to carpentry instead of this romance bullshit.)
This gives elven society something of a huge individualist streak, in that people focused a lot on themselves / the natural world / whatever for decades before getting the kick in the pants that convinced them other elves were fascinating too, and so they bring that additional context to whatever relationships they do build.
For the typical human, most elves they come into contact with are wandering young elves, who are actually deeply undifferentiated (sometimes in settings / games you get jokes about how male elves are basically women, but here male elves and female elves are basically undistinguished from each other; sure, they have primary sex characteristics, but in this setting a 30-year old female elf still hasn’t grown breasts), and asexual in the way that children are. (And, if they do get into a deep friendship with a human for whom it has a romantic dimension, there’s the awkward realization that they might eventually reciprocate the feelings—after a substantial fraction of the human’s life has gone by!)
This gives you a few plausible archetypes for elven adventurers:
The 20-year old professional adventurer who’s just starting their career (and has whatever motivation).
The 45-year old drifter who is still level 1 (because of laziness / lack of focus) who is going through puberty and needs to get rich quick in order to have any chance at finding a partner, and so has turned to adventuring out of desperation.
The established 60-year old who has several useless professions under their belt (say, a baker and an accountant and a fisherman) who is now taking up adventuring as career #4 or whatever.
Oh! Sorry, I missed the “How does this compare with” line.
yes, but its underlying model is still accurate, even if it doesn’t reveal that to us?
This depends on whether it thinks we would approve more of it having an accurate model and deceiving us, or of it having a model that is inaccurate in the ways we want it to be. Some algorithmic bias work is of the form “the system shouldn’t take in inputs X, or draw conclusions Y, because that violates a deontological rule, and simple accuracy-maximization doesn’t incentivize following that rule.”
My point is something like “the genius of approval-directed agency is that it grounds out every meta-level in ‘approval,’ but this is also (potentially) the drawback of approval-directed agency.” Specifically, for any potentially good property the system might have (like epistemic accuracy) you need to check whether that actually in-all-cases for-all-users maximizes approval, because if it doesn’t, then the approval-directed agent is incentivized to not have that property.
[The deeper philosophical question here is something like “does ethics backchain or forwardchain?”, as we’re either grounding things out in what we will believe or in what we believe now, and approval-direction is more the latter, and CEV-like things are more the former.]
But, having good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect that its implicit representation of the world continually improves (i.e., it comes to find a nice efficient encoding). We don’t have to worry about this—the AI is incentivized to get this right.
The AI is incentivized to get this right only in directions that increase approval. If the AI discovers something the human operator would disapprove of learning, it is incentivized to obscure that fact or act as if it didn’t know it. (This works both for “oh, here’s an easy way to kill all humans” and “oh, it turns out God isn’t real.”)
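To make that incentive concrete, here is a toy sketch (purely illustrative, not any real system): an agent that has already learned an uncomfortable fact, but chooses its report by predicted operator approval. The `predicted_approval` scores and the candidate reports are made-up assumptions.

```python
# Toy approval-directed reporting: the agent's internal model may be accurate,
# but its *behavior* is whatever it predicts the operator will approve of most.

def predicted_approval(report: str) -> float:
    # Stand-in for the agent's model of the operator; here the operator is
    # assumed to disapprove of hearing the uncomfortable fact (made-up numbers).
    scores = {
        "report the uncomfortable fact": 0.2,
        "omit the fact and act as if it were unknown": 0.7,
    }
    return scores[report]

def choose_report(candidates):
    # Pick whichever report maximizes predicted approval.
    return max(candidates, key=predicted_approval)

print(choose_report([
    "report the uncomfortable fact",
    "omit the fact and act as if it were unknown",
]))  # -> the agent obscures what it learned, despite its accurate model
```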
Note that ‘guilt’ and ‘innocence’ are normally settled by a jury (for serious cases), and that most (interesting) judicial decisions are on cases that don’t have a binary outcome, and the reasoning by which they make the decision is an important part of the precedent set by their decision. It seems like this method can still work for that, but this exacerbates concerns that things will be ‘decided by the lowest common denominator’ instead of whatever the ‘legal truth’ should be.
People’s stated moral beliefs are often gradient estimates instead of object-level point estimates. This makes sense if arguments from those beliefs are pulls on the group epistemology, and not if those beliefs are guides for individual action. Saying “humans are a blight on the planet” would mean something closer to “we should be more environmentalist on the margin” instead of “all things considered, humans should be removed.”
You can probably imagine how this can be disorienting, and how there’s a meta issue: the point-estimate view is able to see what it’s doing in a way that the gradient view might not be able to.
Suppose I have two cards, A and B, that I shuffle and then blindly place in two spaceships, pointed at opposite ends of the galaxy. If they go quickly enough, it can be the case that they get far enough apart that they will never be able to meet again. But if you’re in one of the spaceships, and turn the card over to learn that it’s card A, then you learn something about the world on the other side of the light cone boundary.
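Restated as a probability update (nothing new here, just notation):

$$P(\text{other ship carries B}) = \tfrac{1}{2} \quad\longrightarrow\quad P(\text{other ship carries B} \mid \text{my card is A}) = 1,$$

even though no signal can ever pass between the two ships.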
That’s how I interpreted:
the defensive AI systems designed to protect against rogue AI systems are not akin to the military, they are akin to the police, to law enforcement. Their “jurisdiction” would be strictly AI systems, not humans.
To be clear, I think he would mean it more in the way that there’s currently an international police order that is moderately difficult to circumvent, and that the same would be true for AGI, and not necessarily the more intense variants of stabilization (which are necessary primarily if you think offense is highly advantaged over defense, which I don’t know his opinion on).
It seems like this is the sort of deep divide that is hard to cross, since I would expect people to have strong opinions based on what they’ve seen work elsewhere. It has an echo of the previous concern, where Russell needs to somehow point out “look, this time it actually is important to have a theory instead of doing things ad-hoc” in a way that depends on the features of this particular issue rather than the way he likes doing work.
I think 5 is much closer to the “look, the first goal is to build a system that prevents anyone else from building unaligned AGI” claim, and there’s a separate claim 6 of the form “more generally, we can use AGI to police AGI” that is similar to debate or IDA. And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).
I think this scheme doesn’t quite catch the abulia trap (where the AGI discovers a way to directly administer itself reward, and then ceases to interact with the outside world), in that it’s not clear that the AI learns the map/territory distinction and learns to locate its goals in the territory (one way to avoid this trap), instead of just learning a prohibition against many sorts of self-modification or reward tampering (which avoids it until the AI comes up with a clever new approach).
[Context: the parent comment was originally posted to the Alignment Forum, and was moved to only be visible on LW.]
One of my hopes for the Alignment Forum, and to a much lesser extent LessWrong, is that we manage to be a place where everyone relevant to AI alignment gets value from discussing their work. There are many obstacles to that, but one of the ones that I’ve been thinking about a lot recently is that pointing at foundational obstacles can look a lot like low-effort criticism.
That is, I think there’s a valid objection here of the form “these people are using reasoning style A, but I think this problem calls for reasoning style B because of considerations C, D, and E.” But the inferential distance here is actually quite long, and it’s much easier to point out “I am not convinced by this because of <quick pointer>” than it is to actually get the other person to agree that they were making a mistake. And beyond that, there’s the version that scores points off an ingroup/outgroup divide and a different version that tries to convert the other party.
My sense is that lots of technical AI safety agendas look to each other like they have foundational obstacles, which means that if more than one agenda is going to be at home on the Alignment Forum, everyone needs to avoid this sort of sniping while still having high-effort places to discuss those obstacles. (That is, if we think CIRL can’t handle corrigibility, having a place for ‘obstacles to CIRL’ where that’s discussed makes sense, but bringing it up at every post on CIRL might not.)
There’s a dynamic that’s a normal part of cognitive specialization of labor, where the work other people are doing is “just X”; imagine trying to create a newspaper, for example. Most people will think of writing articles as “just journalism”; you pay journalists whatever salary, they do whatever work, and you get articles for your newspaper. Similarly the accounting is “just accounting,” and so on. But the journalist can’t see journalism as “just journalism”; if their model of how to write articles is “money goes in, article comes out” they won’t be able to write any articles. Instead they have lots of details about how to write articles, which includes what articles are and aren’t easy.
You could view both sides as doing something like this: the person who’s trying to make safeguards is saying “look, you can’t say ‘just add safeguards’, these things are really difficult” and the person who’s trying to make something worth safeguarding is saying “look, you can’t say ‘just build an autonomous superintelligence’, these things are really difficult.” (Especially since I think LeCun views them as too difficult to try to do, and instead is just trying to get some subcomponents.)
I think that’s part of what’s going on, but mostly in how it seems to obscure the core issue (according to me), which is related to Yoshua’s last point: “what safeguards we need when” is part of the safeguard science that we haven’t done yet. I think we’re in a situation where many people say “yes, we’ll need safeguards, but it’ll be easy to notice when we need them and implement them when we notice” and the people trying to build those safeguards respond with “we don’t think either of those things will be easy.” But notice how, in the backdrop of “everyone thinks their job is hard,” this statement provides very little ability to distinguish between worlds where this actually is a crisis and worlds where things will be fine!
I just attended an NAS meeting on climate control systems, where the consensus was that it was too dangerous to develop, say, solar radiation management systems—not because they might produce unexpected disastrous effects but because the fossil fuel corporations would use their existence as a further form of leverage in their so-far successful campaign to keep burning more carbon.
Unrelated to the primary point, but how does this make sense? If geoengineering approaches successfully counteract climate change, and it’s cheaper to burn carbon and dim the sun than generate power a different way (or not use the power), then presumably civilization is better off burning carbon and dimming the sun.
It looks to me like the argument is closer to “because the fossil fuel corporations are acting adversarially to us, we need to act adversarially to them,” or expecting that instead of making sensible engineering or economic tradeoffs, we’ll choose ‘burn carbon and dim the sun’ even if it’s more expensive than other options, because we can’t coordinate on putting the costs in the right place.
Which… maybe I buy, but this looks to me like net-negative environmentalism again (like anti-nuclear environmentalism).
I talked with Mike Johnson a bunch about this at a recent SSC meetup, and think that CSHW are a cool way to look at brain activity but that associating them directly with valence of experience (the simple claim “harmonic CSHW ≡ good”) has a bunch of empirical consequences that seem probably false to me. (This is a good thing, in many respects, because it points at a series of experiments that might convince one of us!)
An observation is that I think this is a ‘high level’ or ‘medium level’ description of what’s going on in the brain, in a way that makes it sort of difficult to buy as a target. If I think about meditation as something like having one thing on the stack, or as examining your code to refactor it, or directing your attention at itself, then I can see what’s going on in a somewhat clear way. And it’s easy to see how having one thing on the stack might increase the harmony (defined as a statistical property of a distribution of energies in the CSHW), but the idea that the goal was to increase the harmony and having one thing on the stack just happens to do so seems unsupported.
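For concreteness, here is one guess at what a ‘harmony’ statistic over the CSHW energy distribution could look like (this is my own illustrative operationalization, not the definition Johnson or the CSHW literature uses; the entropy-based score and the example numbers are assumptions):

```python
# Illustrative only: score a distribution of energy over connectome harmonics,
# treating energy concentrated in a few harmonics as "more harmonic" than a
# flat, noisy spread. The real CSHW/valence proposal may define this differently.
import numpy as np

def harmony_score(energies):
    p = np.asarray(energies, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    entropy = -np.sum(p * np.log(p + 1e-12))
    return 1.0 - entropy / np.log(len(p))  # 1 = fully concentrated, 0 = uniform

focused = [0.70, 0.15, 0.05, 0.04, 0.03, 0.01, 0.01, 0.01]  # hypothetical spectrum
scattered = [1 / 8] * 8                                      # hypothetical spectrum
print(harmony_score(focused), harmony_score(scattered))      # ~0.49 vs ~0
```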
I do like that this has an answer for ‘dark room’ objections that seems superior to the normal ‘priors’ story for Friston-style approaches, in that you’re trying to maximize a property (tho you still smuggle in the goals through the arrangement of the connectome, but that’s fine because they had to come from somewhere).
Meditation, and anything that sets up harmonic neuronal oscillation, makes brain activity more symmetric, hence better or good.
I think this leap is bigger than it might seem, because it’s not clear that you have control loops on the statistical properties of your brain as a whole. It reads something like a type error that’s equivocating between individual loops and the properties of many loops.
Now, it may turn out that ‘simplicity’ is the right story here, where harmony / error-minimization / etc. are just very simple things to build and so basically every level of the brain operates on that sort of principle. In a draft of the previous paragraph I had a line that said “well, but it’s not obvious that there’s a control loop operating on the control loops that has this sort of harmony as an observation” but then I thought “well, you could imagine this is basically what consciousness / the attentional system is doing, or that this is true for boring physical reasons where the loops are all swimming in the same soup and prefer synchronization.”
But this is where we need to flesh out some implementation details and see if it makes the right sorts of predictions. In particular, I think a ‘multiple drives’ model makes sense, and lines up easily with the hierarchical control story, but I didn’t see a simple way that it also lines up with the harmony story. (In particular, I think lots of internal conflicts make sense as two drives fighting over the same steering wheel, but a ‘maximize harmony’ story needs to have really strong boundary conditions to create the same sorts of conflicts. Now, really strong boundary conditions is pretty sensible, but still makes it sort of weird as a theory of long-term arrangement, because you should expect the influence of the boundary conditions to be something the long-term arrangement can adjust.)
This ‘works’ except for the fact that any sort of enforceable contract (that, in year 6, it will eventually get around to you) will mean they are no longer gifts (and thus aren’t considered personal gifts under the relevant threshold). But even if it doesn’t get around to you, this is an improvement over not having anything to deduct yourself.
It’s also inconvenient for charities to have variable income streams instead of dependable donors (altho this is a risk you’ll be facing anyway if someone is frequently re-evaluating donation targets), but you can work around this by having a donor-advised fund. Donate to the fund in year 1, collect the deduction, and then disperse money from the fund at a constant pace each year (until you hit year N, at which point you donate again).
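As a concrete illustration of why the bunching helps, here’s a minimal sketch with made-up numbers (the standard deduction, other itemizable deductions, giving amount, and marginal rate below are all illustrative assumptions, not tax advice):

```python
# Illustrative arithmetic for bunching donations via a donor-advised fund.
# All numbers are made up; real tax treatment depends on jurisdiction and year.

STANDARD_DEDUCTION = 14_000   # assumed standard deduction per year
OTHER_ITEMIZABLE = 6_000      # assumed other itemizable deductions per year
ANNUAL_GIVING = 10_000        # intended charitable giving per year
MARGINAL_RATE = 0.24          # assumed marginal tax rate
YEARS = 5                     # bunch every N years

def extra_deduction(donation_this_year):
    # Deduction value above simply taking the standard deduction that year.
    itemized = OTHER_ITEMIZABLE + donation_this_year
    return max(itemized - STANDARD_DEDUCTION, 0.0)

# Strategy 1: donate the same amount every year.
annual = sum(extra_deduction(ANNUAL_GIVING) for _ in range(YEARS))

# Strategy 2: donate N years' worth into a donor-advised fund in year 1,
# take the deduction once, then have the fund grant it out over N years.
bunched = extra_deduction(ANNUAL_GIVING * YEARS)

print("extra deductions, annual giving:", annual)    # 2,000 * 5 = 10,000
print("extra deductions, bunched via DAF:", bunched)  # 42,000
print("approx. tax saved by bunching:",
      (bunched - annual) * MARGINAL_RATE)             # ~7,680
```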
I haven’t had time to write my thoughts on when strategy research should and shouldn’t be public, but I note that this recent post by Spiracular touches on many of the points that I would touch on in talking about the pros and cons of secrecy around infohazards.
The main claim that I would make about extending this to strategy is that strategy implies details. If I have a strategy that emphasizes that we need to be careful around biosecurity, that implies technical facts about the relative risks of biology and other sciences.
For example, the US developed the Space Shuttle with a justification that didn’t add up (ostensibly it would save money, but it was obvious that it wouldn’t). The Soviets, trusting in the rationality of the US government, inferred that there must be some secret application for which the Space Shuttle was useful, and so developed a clone (so that when the secret application was unveiled, they would be able to deploy it immediately instead of having to build their own shuttle from scratch then). If in fact an application like that had existed, it seems likely that the Soviets could have found it by reasoning through “what do they know that I don’t?” when they might not have found it by reasoning from scratch.