“The previously observed drop off in the value of additional miners after 5 seem to occur because it makes it less likely for other valuable types to be present, not because it is intrinsically bad.”

My go-to check when there’s decent data is to compare P(something | N miners, M dwarves) to P(something | N-1 miners, M-1 dwarves).
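For concreteness, a minimal sketch of that check, assuming a hypothetical table of past expeditions with columns like `miners`, `dwarves`, and a boolean `outcome` column; the names and data layout are placeholders, not the actual dataset:

```python
import pandas as pd

def p_outcome(runs: pd.DataFrame, n_miners: int, m_dwarves: int) -> float:
    """Estimate P(outcome | N miners, M total dwarves) from past runs."""
    subset = runs[(runs["miners"] == n_miners) & (runs["dwarves"] == m_dwarves)]
    return subset["outcome"].mean() if len(subset) else float("nan")

def marginal_miner_value(runs: pd.DataFrame, n: int, m: int) -> float:
    """Compare P(outcome | N, M) with P(outcome | N-1, M-1): does the Nth miner
    help in its own right, or only by displacing other valuable types?"""
    return p_outcome(runs, n, m) - p_outcome(runs, n - 1, m - 1)
```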
Miners: 5
Smiths: 1
Woodcutters: 1
Farmers: 2
Brewers: 1
Warriors: 2
Crafters: 1
I expect to survive: in the Light Forest, 2 Farmers and 2 Warriors seem necessary for good odds and also sufficient for great odds. I suspect the Brewer is not needed, except that obviously the Brewer is needed. I expect my profits are not maximized without some rearrangement; I didn’t try to account for which resources were present much at all.
I did not know any specifics. I did think it was worth my time to start skimming because I have another interesting problem vaguely related; then I thought it was worth my time to understand what the objection buried under the word salad (yes, Benjamin, it’s word salad) might be, because it seemed like there might actually be one. And there was! Standard lambda calculus at face value doesn’t work with nonconstructive proofs. That’s interesting and I didn’t know it. Then:
as expected, looks like there’s plenty of work on this, and there’s nothing actually surprising here. My standard practice after doing something like this is to leave a perfectly-reasonable-if-they-were-reasonable question getting to the heart of what’s up, as I did; I can afford this of course because I’m much less high profile than you, or, y’know, any physicist. :D
Interestingly and a complete aside: I grew up with a close relative who wrote in That Distinctive Style and only later encountered it on the wider internet, and wasn’t that a revelation.
[Googles] Why does something like https://arxiv.org/pdf/2006.05433.pdf not resolve things? Is it simply wrong? Is it not actually applicable?
A program for the full axiom of choice
The theory of classical realizability is a framework for the Curry-Howard correspondence which enables to associate a program with each proof in Zermelo-Fraenkel set theory. But, almost all the applications of mathematics in physics, probability, statistics, etc. use Analysis i.e. the axiom of dependent choice (DC) or even the (full) axiom of choice (AC). It is therefore important to find explicit programs for these axioms. Various solutions have been found for DC, for instance the lambda-term called “bar recursion” or the instruction “quote” of LISP. We present here the first program for AC.
I now agree with you. Or possibly with a steelmanned you, who can say. ;)
This is why I was stressing that “chaa” and “fair” are very different concepts, and that this equilibrium notion is very much based on threats. They just need to be asymmetric threats that the opponent can’t defuse in order to work (or ways of asymmetrically benefiting yourself that your opponent can’t ruin, that’ll work just as well).
(from the next post in this sequence https://www.lesswrong.com/posts/RZNmNwc9SxdKayeQh/unifying-bargaining-notions-2-2)
in physical reality, payoffs outside of negotiations can depend very much on the players’ behavior inside the negotiations, and thus is not a constant. Nash himself wrote about this limitation (Nash, 1953) just three years after originally proposing the Nash bargaining solution. For instance, if someone makes an unacceptable threat against you during a business negotiation
(from Critch’s first boundary post https://www.lesswrong.com/posts/8oMF8Lv5jiGaQSFvo/boundaries-part-1-a-key-missing-concept-from-utility-theory)
I’m not really concerned about saying “but reputation matters; the solution you land on here affects your reputation later” since that should be baked into the payoffs.
But I do think it’s important to note that what happens during negotiation can affect the payoffs even of the current game, which this analysis otherwise treats as constant.
A better example might be literally paying for something while in a marketplace you’re not going to visit again. You don’t have much cash, you do have barter items. Barter what you’ve got, compensate for the difference. Cooperative is “yes a trade is good”, competitive is “but where on the possibility list of acceptable barters will we land”?
I guess the difficulty is that the example really does want to say “all games can be decomposed like this if they’re denominated, not just games that sound kind of like cash”, but any game without significant reputational/relationship effects is gonna sound kind of like cash.
Maybe a side note to not forget outside-of-game considerations? But I’m perfectly fine reading about 4⁄3 pi r^3 without “don’t forget that actually things have densities that are never uniform and probably hard to measure and also gravity differs in different locations and in fact you almost certainly have an ellipsoid or something even more complicated instead”, and definitely prefer a world that can present it simply without having to take into account everything in the real world you’d actually have to account for when using the formula in a broader context.
The exact mechanism would have been pretty tricky to figure out
I was getting close. :) Had I spent 3x as long, I probably would have gotten it. Where I left off:
convinced it was very likely to be a repeated exchange from HP=X to HP=0, and thought the exact HP number would be static and “nice”
leaning heavily toward turn-based rather than simultaneous
had thought of a small handful of possible ways “damage” could be happening, one of which was this exact mechanism (convinced of x0 and x2 because Pokemon/etc)
plotted calibration curves for potential mechanisms
The things I was missing were:
automation to easily try out new damage mechanisms and then try all the (relatively small) combinations
notice that “choose randomly” and “choose best” didn’t work quite right, and figure out that it was actually “choose weighted” (this is the hardest part for sure, and it’s possible the calibration curves even look fine without it, which would scrap my chances; see the sketch of the three rules just after this list)
pin down HP=100 rather than HP=20, 50, 200, etc.
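Here’s roughly what I mean by those three selection rules, as a sketch; treating the weight as something like an expected-damage score is my assumption, not part of the reveal:

```python
import random

def choose_randomly(spells):
    return random.choice(spells)

def choose_best(spells, score):
    return max(spells, key=score)

def choose_weighted(spells, score):
    # pick with probability proportional to the score, instead of always taking the max
    weights = [score(s) for s in spells]
    return random.choices(spells, weights=weights, k=1)[0]
```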
Seeing if I can figure out how duels work, now. First guess is not correct: that mages pick an offensive spell and a defensive spell at random, then calculate damage, and whoever hits for more, wins; damage is determined by taking, for each offensive element, its mana level, reducing it by half the mana levels of the defensive elements (adjusted x2 if the defensive element opposes it and x0 if it is the same element), and summing the two results.
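Spelled out as code, that (ruled-out) first guess looks something like this, on my reading of the x2/x0 adjustment; the element pairings in `OPPOSES` are placeholders for whatever the real opposed pairs turn out to be:

```python
# Placeholder opposed-element pairings; not confirmed from the data.
OPPOSES = {"Fire": "Water", "Water": "Fire",
           "Earth": "Air", "Air": "Earth",
           "Light": "Dark", "Dark": "Light"}

def guessed_damage(offense: dict, defense: dict) -> float:
    """First-guess rule: for each offensive element, take its mana level, subtract
    half of each defensive element's mana (x2 if the defense opposes that element,
    x0 if it is the same element), then sum over the offensive elements."""
    total = 0.0
    for off_elem, off_mana in offense.items():
        blocked = 0.0
        for def_elem, def_mana in defense.items():
            if def_elem == off_elem:
                factor = 0.0
            elif OPPOSES.get(def_elem) == off_elem:
                factor = 2.0
            else:
                factor = 1.0
            blocked += 0.5 * def_mana * factor
        total += off_mana - blocked
    return total
```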
I need more randomness, maybe; my calibration curve looks decent, but overconfident at both ends. In particular this predicts certain duels are 100% determined, but they do not always work out that way.
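For reference, the kind of calibration check I mean: bin the predicted win probabilities and compare each bin’s average prediction against the observed win rate (variable names here are placeholders):

```python
import numpy as np

def calibration_curve(predicted, outcomes, n_bins=10):
    """Per probability bin, compare mean predicted probability with observed win
    frequency; overconfidence shows up as observed frequencies pulled toward 0.5
    relative to the predictions at the extremes."""
    predicted = np.asarray(predicted, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(predicted, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((predicted[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (mean predicted, observed frequency, count) per bin
```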
How about picking an element at random, too? The problem with this approach is that it doesn’t really do much re expected value. I’d expect that it’s more like “whoever does X damage first wins” so that repeatedly pinging for 8 can be better than a 50⁄50 chance of hitting for 13. That may make the difference between Levee and Vambraces for defense, for example, as the opponent’s expected mana-levels-getting-through vs Levee is ~1.5 less than vs Vambraces, but Levee is more likely to let some damage through.
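A quick Monte Carlo of that intuition, with X = 100 and simultaneous rounds as illustrative assumptions rather than known rules; under these assumptions the steady 8-per-round attacker wins the race roughly three-quarters of the time against the 50/50-for-13 attacker:

```python
import random

def steady_win_rate(x: int = 100, trials: int = 100_000) -> float:
    """Race to X cumulative damage: 8 per round guaranteed vs 13 on a coin flip."""
    wins = 0.0
    for _ in range(trials):
        steady = swingy = 0
        while steady < x and swingy < x:
            steady += 8
            swingy += 13 if random.random() < 0.5 else 0
        if swingy < x:
            wins += 1.0   # steady finished first
        elif steady >= x:
            wins += 0.5   # both finished on the same round
    return wins / trials

print(steady_win_rate())  # roughly 0.75 under these assumptions
```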
Trying the original, but letting the probability of winning be proportional to the respective damage dealt, yields decent calibration but underconfident in the 5-25% and 75-95% ranges. This seems very suggestive of “deal damage until X”; a huge advantage translates to a win, a tiny advantage translates to a tiny advantage, and a moderate advantage, repeated, translates to probably-a-win?
In any case, it seems very likely that the structure has something like “opponents choose an offense vs a defense spell, then ??”, so I think I need to abandon my YOLO. :( I would like to go with “maybe higher EV? beats reliably blocking?” and choose Lava Levees for my defenses, with Fireball and Rainbow Rays for offense. I think Fireball is the very clear choice for offense, but Mud Missiles might be better than Rainbow Rays.
Current submission: Fireball, Lava, Rays.
Are mana levels actually normalized to 150 total, or is that an effect of our measurement/prediction process?
How do duels actually work? They could be as simple as “add up mana values of elements on offense, modify defense mana values according to effectiveness, subtract; compare results; higher wins” or much more complicated, like “bucketizing mana levels, choosing a random person to attack first, choosing a random attack and a random defense, doing [something like the simple one] to determine damage, switch attackers, repeat until someone has taken X damage, also wizards have native attack/defense/elemental bonuses, also wizards have different HP so X1 and X2 not just X”.
Half the spells are for offense, half for defense. A wizard never goes to a duel with pure offense or pure defense prepared.
As per other spoilers, Dark is an element, elements come in anti-correlated pairs, and total mana level is 150.
Each spell has elemental affinities that make sense. For example, Fireball works better with high levels of Fire and Air mana and Abyssal works better with high levels of Water and Dark mana. Picking 2 of the 3 anti-correlated pairs and then one element from each gives (3C2)*(2^2) = 12 possible affinity combinations.
Offensive and defensive spells’ affinities are divided “fairly”.
Something lets weaker-mana spells win sometimes. It looks like defenses of an element do not protect well against offenses of that element, and defenses of an opposed element protect very well.
I’d like to figure out precisely what protecting well or not well means, but haven’t yet. I think the most likely candidates for best spells to prepare are:
(current choice because YOLO Edit: not anymore) Fireball, Mud Missiles, and Rays. No defense. What could go wrong. No one ever does it because in real combat you need defenses, but in duels all-out-offense turns out to be the technique.
(hope this ends up being right) Fireball, Lava Levee, Vambraces. Hoping counters trump weaknesses.
(would be a bit sad if right) Fireball, Mud Missiles, Abyssal Armor. Just high stats with an obvious weakness.
One of the many common Curses of Smart is being hypercompetent at a couple of tricks that you’ve adopted to be passably effective at life while being incredibly blind to your own limitations and bad habits.
Just want to drop a note here that this curse (a) got me through years of major depression, making me, I guess, “high-functioning”, and (b) caused the worst interpersonal crisis I’ve had-or-expect-to-have in my life.
For me it wasn’t really a trick, per se. Just, more like… being smart enough allows you to simply brute force a bunch of areas without ever being good at them, and it feels good enough because “passably effective at almost anything while concentrating” is legit better than median. The main failure mode when phrased like this, though, should be quite obvious—you can only concentrate on so much. The secondary failure mode is that even when concentrating, if you don’t have good heuristics born of experience actually getting good at a thing, your passable effectiveness is brittle even when you think you are concentrating, because it has bad default behaviors in the gaps of things you don’t know should be part of your concentration.
(I am not affiliated with any of these orgs. I did attend a pre-CFAR proto workshop thingy.)
Welp. I decided to do this, and here it is. I didn’t take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI’s Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror’d away, though IMO it might be interesting to compare against, let’s say, sane attempts. Regardless, here ya go: one hour.
It’s past my bedtime.
I’ve got a pint in me.
OpenAI Playground is open as a tab.
A timer is running.
I speak to you now of Corrigibility Concerns.
When deputizing an agent who is not you to accomplish tasks on your behalf, there are certain concerns to… not address, but make sure are addressed. Let’s not Goodhart here. Jessica Taylor named “quantilization”. Paul Christiano named “myopia”. Eliezer Yudkowsky named “low impact” and “shutdownability”. I name “eli5ability” and I name “compressibility” and I name “checkpointable” and I name “testable”.
When we list out all of our proxy measures, we want corrigibility to be overdetermined. We want to achieve 70% of our goals completely and the rest half-assed and still end up with a corrigible agent. It’s okay to project what we want from an agent onto non-orthogonal dimensions and call each vector important.
So let’s define a corrigible agent. A corrigible agent is an agent that:
Does what we want it to do.
Doesn’t want to do something else.
Can easily be checked for doing what we want it to do.
Can be shut down if we want it to stop doing something.
Can be restarted if we want it to do something else.
Can explain to us why it did something.
Doesn’t hide its intentions from us.
Doesn’t want us to not know its intentions.
Can be retrained to do something different if we want it to.
Additionally, because we live in the real world, it must not be too computationally expensive to train, run, check, shut down, restart, explain, retrain, or understand. This includes CPU cycles, wall-clock time, human thought, and so on.
My additions to the lexicon of corrigibility proxy measures are eli5ability, compressibility, checkpointable, and testable, and I will define them here.
A planning process must output simple plans. Complicated plans will fail, or if they succeed, will not be understandable by a human. This leads to the following heuristic: “eli5ability” means a plan must be understandable by a non-expert. “Understandable” is a technical term with a specific meaning in psychology: a person has a model of the task in their head, and the model is sufficiently close to the real task that they can use it to make predictions. Here the resulting plan must be simple enough to be an input to a much simpler AI whose predictions about the effects score well on whatever heuristics we mean to use to evaluate those plans. This is the sort of adversarial relationship which can be trained and improved in parallel, which in no way guarantees aligned AI but which certainly constrains the space of non-aligned AIs.
Planning processes must output compressible plans, in the sense that a joint probability function over a simple causal model can be specified by far fewer numbers than one over a complex causal model. The plan must be modular, each individual part packaged and vetted by other safety procedures without context, and the effects of the parts vetted as a whole without the methods, as estimated both by the planning AI and by the eli5 AI.
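As a rough illustration of that counting argument: a full joint distribution over n binary variables needs 2^n − 1 numbers, while a causal model in which each variable has at most k parents needs at most n·2^k conditional probabilities; for n = 20 and k = 2 that is 1,048,575 numbers versus 80.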
Planning processes must output staged plans. Flexibility in timing, location, target, geography, personnel, etc. should be part of the training process, such that rewards are commensurate with (a) having checkpoints, and (b) changing the environment after every checkpoint. All coupling in plans must be penalized.
We need a new name for this one. I’m thinking of testable in the sense that code can be made testable by making it more modular, more idempotent, etc. Plans should come in pieces; those pieces should have expected effects and real-life unit tests as part of the output, and plans without those are highly penalized. It goes without saying at this point that the test here is not that the AI tell us what those expected effects and tests are, but that they agree with what other adversarially trained AIs say.
The main intuition I take from examining the product of my own intuitions with the half-assed text completion from OpenAI’s Playground is that it may be a useful line of thought to reward a powerful optimizer, while training, according to how well its stated expected effects of its plans agree with the expected effects stated by several differently-built less-powerful optimizers (also being trained) when presented with its plans as inputs. This may at least help force a powerful optimizer to intend to deceive us if it wants to avoid our actual request, at which point other intent-extraction safety tools may have an easier time picking up on its deceit.
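To make that concrete, here’s a minimal sketch of the kind of reward blending I have in mind; every name, the agreement metric, and the 50/50 weighting are placeholder choices rather than a worked-out proposal:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PlanOutput:
    plan: str                          # the proposed plan, in whatever representation we settle on
    stated_effects: Dict[str, float]   # the planner's own predicted effects, keyed by metric name

def agreement(stated: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Crude agreement score: 1 minus the mean absolute gap over shared metrics."""
    keys = stated.keys() & predicted.keys()
    if not keys:
        return 0.0
    return 1.0 - sum(abs(stated[k] - predicted[k]) for k in keys) / len(keys)

def training_reward(output: PlanOutput,
                    weak_predictors: List[Callable[[str], Dict[str, float]]],
                    task_reward: float,
                    agreement_weight: float = 0.5) -> float:
    """Blend task performance with how well the planner's stated effects match
    what several weaker, differently built models predict for the same plan."""
    scores = [agreement(output.stated_effects, predict(output.plan))
              for predict in weak_predictors]
    return (1 - agreement_weight) * task_reward + agreement_weight * sum(scores) / len(scores)
```

The point of the second term is the adversarial pressure: the powerful planner only scores well on it if weaker, differently built models reach similar conclusions about what its plan would do.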
(I of course told my partners up front that a public essay was one possible outcome of the survey and that I would not-publish anything they flagged as private.)
Gonna guess zero. Much less costly to leave ’em in for 12 weeks for goodwill than to try to remove people in that timeframe.
Good point. Probably at least some of this. You need referrals, and I was definitely not the smartest of the people in my reference class available to refer, though maybe 3rd, and someone looking at me versus the one who I know had more raw IQ should definitely have guessed that I was more likely to pick up that particular thing.
It’s also possible I’m someone “amenable” to this mindset and that was just the “on switch”. DSP, by the way.
But yeah I could see a post on… cryptanalysis, and finding and minimizing attack surfaces, without necessarily having an attack in mind, and a hindsight-view story of what first caused me to think in that way.
But in general I liked the setup a lot!
Security mindset seems highly related, and the training thing here seems like it shouldn’t be that hard? Certainly it seems very easy compared to the problem the trained people will then need to solve, and I think Eliezer has de facto trained me a substantial amount in this skill through examples over the years. There was a time I didn’t have security mindset at all, and now I have at least some such mindset, and some ability to recognize lethal issues others are missing. He doesn’t say how many other people he knows who have the abilities referred to here, I’d be curious about that. Or whether he knows anyone who has acquired them over time.
I have just realized that I’ve believed for years that “security mindset” is relatively easy and people who can’t at least dip into it are probably being “lazy”. I was being lazy; somehow I didn’t notice that I was literally trained in this mindset during an internship many many years ago. I think they did at least an acceptable job of training me. If I had to guess what the key trainings were, I’d guess:
[examples and practice] Here, learn some things about cryptography. Here is a theoretical algorithm and a convincing non-mathematical description of why it seems very hard to break. Watch as I break it via side channels, via breaking your assumptions of hardware access, via information theory. Go break things. Start by brainstorming all the ways things might be broken. Work with other smart people who are also doing that.
[examples in a different domain and visceral relevance] Speaking of hardware access, literal nation states have some incentive to get hardware access to you. Please absorb the following cautions. Here is an anecdote about an employee who always gets in the passenger side of her car rather than the driver’s side to illustrate the amount of caution you could wield. Every time you drive to work, please note the individuals sitting around in defensive locations with guns. It is literally at least some risk to your person if you ever write [this post] or put this internship on your resume, but generally people find it worth the risk, especially 5+ years after they are not actively associated with us.
I spent all of my time trying to figure out how to figure out how much [the hidden variable causing the correlation between nerd and otaku] affects trait choices and winrates.
Apparently they are correlated without a relevant hidden variable. :D
I don’t understand why it’s plausible to think that AIs might collectively have different goals than humans.
Future posts, right? We’re assuming that premise here:
So, for what follows, let’s proceed from the premise: “For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity.” Then what? What follows will necessarily feel wacky to people who find this hard to imagine, but I think it’s worth playing along, because I think “we’d be in trouble if this happened” is a very important point.