Curated, as a worthwhile piece of empirical research (though see my concern below re: use as an alignment technique). These are the kinds of empirical results that I could in principle imagine leading to a more robust, theoretical understanding of how generalization works for models trained in the current paradigm. It covers a relatively broad “surface area”, which hopefully makes it easier for others to conduct more in-depth investigations along multiple lines of research. One interesting and suggestive example:
One downside with our default approach to inoculation prompting in RL is that the inoculating prompt causes the model to learn reward hacking faster. An alternative approach would be to use a prompt that discourages hacking when sampling in RL, and then rewrite episodes to use an inoculating prompt before training. In Figure 29 we test a version of this: we rewrite episodes offline to modify the prompt after RL training, and then SFT a model on the rewritten episodes. We find that this is not particularly effective at removing misaligned generalization when we use our “hacking okay” inoculation prompt that worked well during RL.
Figure 29: Offline rewriting of episodes to include an inoculation prompt and then training with SFT does not prevent misalignment. We took episodes from the “don’t hack” prompted run, rewrote them offline to use the “hacking okay” inoculation prompt, and then trained a model on these episodes using SFT. The resulting model showed misalignment on our evaluations, especially agentic evaluations.
However, I don’t think that this sort of technique will scale very far. This experiment shows that, conditional on a model learning to reward hack after being prompted in a pretty unusual way, you might be able to prevent that reward hacking tendency from generalizing to other forms of misbehavior. But, as noted in the paper, that kind of prompting itself is necessary to enable the model to learn to reward hack reliably. To me, this is pretty suggestive that the “malign generalization” that occurs with “don’t hack” prompting is operating on the level of the model’s learned reflexes, and we will be dealing with a pretty different set of concerns when we get to models that have more robust long-horizon preferences.
Separately, being able to stack other mitigations on top of inoculation prompting to reduce reward hacking in deployment environments, after encouraging it via the inoculation prompting, seems like it is sweeping a pretty fundamental issue under the rug, which is the failure to find a robust technique that successfully instills the correct “values” into the model in a way that generalizes. I don’t like the way it looks like playing whack-a-mole, which always seems to work at current levels of capabilities, only to reveal that there was at least one more mole hiding underground as soon as you scale up a bit more.
I think there are at least a few senses in which “we” don’t “know” how colds spread:
The “state of the art” in terms of “scientific knowledge” seems, given the available evidence, slightly underconfident in the possibility that small-particle transmission isn’t a meaningful factor in practice.
I don’t think that, as a society, we have well-established common knowledge of our best guess about the relative likelihood of transmission via various routes. That is: what would happen if you asked several different doctors (GPs) how best to avoid catching a cold in various situations? I, personally, expect that you’d get overly broad (and therefore expensive-to-follow) advice, with relatively poor inter-rater agreement, compared to if you asked them how to avoid e.g. some specific STDs.
Our “best guess” is extremely flimsy, and the sort of thing that I could imagine a single well-designed, medium-sized study overturning. (See the various caveats about how many different viruses cause “colds”, questions about humidity/viral load/etc in other comments, and so on.) This is not the kind of situation where I can tell someone “do [x] and you’ll reduce your risk by 80%” and feel at all confident about it! Or, like, I can in fact give them an extremely broad [x], but in that case I’m pretty sure that I’d be destroying a bunch of value as a result of empirical uncertainty which seems in-principle quite possible to resolve, with a sufficient application of resources.
I am being lazy and not reading all the papers you referenced—do many of them discuss the viral load of the person who is infected?
I think a couple of them did; I don’t remember if any of them found strong effects. Might ask a language model to check later—agree that this seems like one of those big open questions that could imply huge differences in which interventions are worthwhile. That said, if viral load turns out to be the only key factor in transmission likelihood, such that other interventions have basically no effect unless applied to spreaders with high viral loads, that’s pretty bad news: testing for high viral load might be much harder and more expensive than e.g. putting on a mask, if it turns out that “large particulates” are most of the problem for a specific illness. (Though I guess we do still have the problem of knowing what illness a given person has, to know what intervention to apply...)
Very reasonable question—having a user-specific header does make it tricky to have whole-page prerendering/caching, but we’re working on making partial prerendering worthwhile (mostly for post pages, which admittedly do change more often, due to comments and votes, though the cache hit rate would still be pretty high).
In the case of collection pages like R: A-Z, we also have user-specific info like read statuses indicated in checkboxes, which are part of the expensive query responsible for hydrating most of the page contents. We could split that out, and since we did recently enable cacheComponents it’s possible that’d allow us to cache much of the page, but I’m not sure if it’s worth prioritizing.
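For the curious, here’s a rough sketch of the kind of split I mean—not our actual code; all the component and query names below are invented, and it assumes a Next.js-style setup where cacheComponents plus the "use cache" directive let the shared shell be cached while per-user data streams in behind a Suspense boundary:

```tsx
import { Suspense } from "react";

// Stub queries standing in for the real ones (purely illustrative).
declare function loadCollectionChapters(slug: string): Promise<{ id: string; title: string }[]>;
declare function loadReadStatusesForCurrentUser(slug: string): Promise<string[]>;

// Shared, expensive part: identical for every visitor, so it can be cached/prerendered.
async function CollectionShell({ slug }: { slug: string }) {
  "use cache";
  const chapters = await loadCollectionChapters(slug);
  return <ol>{chapters.map((c) => <li key={c.id}>{c.title}</li>)}</ol>;
}

// User-specific part: read statuses stay out of the cached subtree and stream in separately.
async function ReadStatuses({ slug }: { slug: string }) {
  const readPostIds = await loadReadStatusesForCurrentUser(slug);
  return <span data-read-count={readPostIds.length} />;
}

export default function CollectionPage({ slug }: { slug: string }) {
  return (
    <>
      <CollectionShell slug={slug} />
      <Suspense fallback={null}>
        <ReadStatuses slug={slug} />
      </Suspense>
    </>
  );
}
```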
linear progress
The bottom line is still exponential, not linear—it’s just that the top line is superexponential!
Mod note: this post violates our LLM Writing Policy for LessWrong. @nextcaller, please don’t post more direct LLM output, or we’ll remove your posting permissions.
Although, thinking about it a bit more, I think this is not quite right:
To which I say: yes, that motivation comes from non-EA ethical commitments.
Scott explains his motivation for donating a kidney in My Left Kidney:
It starts with wanting, just once, to do a good thing that will make people like you more instead of less. It would be morally fraught to do this with money, since any money you spent on improving your self-image would be denied to the people in malarial regions of Africa who need it the most. But it’s not like there’s anything else you can do with that spare kidney.
Still, it’s not just about that. All of this calculating and funging takes a psychic toll. Your brain uses the same emotional heuristics as everyone else’s. No matter how contrarian you pretend to be, deep down it’s hard to make your emotions track what you know is right and not what the rest of the world is telling you. The last Guardian opinion columnist who must be defeated is the Guardian opinion columnist inside your own heart. You want to do just one good thing that you’ll feel unreservedly good about, and where you know somebody’s going to be directly happy at the end of it in a way that doesn’t depend on a giant rickety tower of assumptions.
I see no reason to disbelieve his self-description, and wouldn’t describe that as a “non-EA ethical commitment” (though obviously it can’t be described as an “EA ethical commitment” either).
I think this is the strongest section, and maybe sufficient to carry the argument by itself:
4. Scott wouldn’t have given the money to charity anyway
Scott writes:
[T]he most effective way to give is the one you actually do.
I think this is related to the point above—I can’t actually claim that costing me $19,000 costs 6.3 lives, because for any given $19,000 in my bank account, I only spend 10% of it on charity. Although I always ‘could’ choose to spend all of it on charity, I don’t, and I can easily predict that I won’t do so going forward. So I don’t think it’s helpful to bring up the hypothetical where I do. If I were morally perfect, then yes, kidney donation would take away time and money that I could spend doing even better morally perfect things. Since I’m not morally perfect, my real budget of time and money is less important than my motivation budget of “what will I actually do?”, and for some reason I found myself motivated to do this and not other things.
I have two objections. First, I think this just includes too much within the EA umbrella so as to strip away the E entirely. What heft does Singer’s critique of the make-a-wish foundation have if I can say, well, at least I did it, so it was effective, so kick rocks? Then any charity counts as EA. Hell, even things that are not really charity at all—say you decide to shop local for groceries—could be considered “effective” forms of altruism because the most effective stuff is the stuff you actually do, and you’re actually doing that. Any altruism ever is EA under these guidelines. In reality, EA sprang to life as a criticism of, and supposedly an improvement on, these old institutions.
Second, I just think it outsources his agency which is central to the whole ethical question we’re arguing about. If you have no free will and can’t control yourself, then it makes no sense to talk about what you ought to do at all. In fact, he does have free will, and could easily have increased his donation rate from 10% to 11%. But, as he puts it, “for some reason I found myself motivated to do this and not other things.” To which I say: yes, that motivation comes from non-EA ethical commitments.
This bit about rule utilitarianism was weird, though:
Similarly, the decision to take “thou shalt not lie” extremely seriously tends to indicate some level of deontological component to one’s moral system. But even if you assume that, for most people in most situations, it’s a good rule not to lie, it’s just clearly not for Scott. He has a lot of political reach and a blog that has a lot of persuasive power. It is very obvious that all the most influential political and media pundits lie in service of their goals, like proper consequentialists. As a rule utilitarian, he’d be better off changing the rule to “I’ll only tell lies when there’s a lot of good consequences of doing so,” or something similar, and then following that rule instead. But he doesn’t do that—presumably because lying is just wrong, and maybe it’s as simple as that.
This is false, see Ends Don’t Justify Means (Among Humans).
Also, the random insults are kind of déclassé.
Also, arguing from models of infection, it is almost certainly wrong. The simplest toy model I can think of is “per DHE, every uninfected, susceptible person has a fixed probability of getting infected”, which would lead to an exponential decline of healthy individuals as DHE increases. But if that were the case, the curve should come out straight when plotted with a log y-axis (counting survivors), not a log x-axis. (I think the direction of curvature would be the same—highest infection risk per DHE at low DHEs, then a slow decrease.)
To explain why the distribution looks the way it does would require a more complex model. For example: “Every uninfected person has a fixed personal exposure threshold at which they become infected. It ranges from 45 DHE to 600 DHE and is exponentially distributed (favoring smaller thresholds) over the population in that range.”
If this were true, that would be highly surprising. Unlike with the killbots in Futurama, there is no good reason why your immune system should have a preset kill limit for RV16. If it is a case of “the immune system can fight off low doses but eventually gets overwhelmed”, then I am surprised that it would get equally overwhelmed by a low dose over a long period and a high dose over a short period.
I agree that, considered from a mechanistic perspective, the obvious explanations for this data would be “surprising if true”. My guess for the actual model of infection is “did a sufficient quantity of the virus end up in contact with a non-protective surface like a mucous membrane”, where “sufficient quantity” might vary by individual but for which “per DHE, every uninfected, susceptible person has a fixed probability of getting infected” is often a reasonable proxy (though it loses the details that might actually be relevant for more narrowly intervening on transmission). But I find it hard to be very confident, given the state of the available evidence.
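To make the contrast between the two toy models quoted above concrete, here’s a throwaway sketch (parameter values invented purely for illustration) of the survivor fraction each one predicts as cumulative exposure grows:

```ts
// Model A: fixed per-DHE infection probability p.
// Survivors decline geometrically, so the curve is a straight line on a log-y plot.
const survivorsConstantHazard = (dhe: number, p = 0.01): number =>
  Math.pow(1 - p, dhe);

// Model B: each person has a personal threshold T (truncated exponential on
// [45, 600] DHE, favoring small thresholds) and is infected once exposure exceeds it.
const survivorsThreshold = (dhe: number, lambda = 0.01, lo = 45, hi = 600): number => {
  if (dhe <= lo) return 1;
  if (dhe >= hi) return 0;
  const s = (x: number) => Math.exp(-lambda * x);
  return (s(dhe) - s(hi)) / (s(lo) - s(hi)); // P(T > dhe) for the truncated exponential
};

for (const dhe of [50, 100, 200, 400, 600]) {
  console.log(dhe, survivorsConstantHazard(dhe).toFixed(3), survivorsThreshold(dhe).toFixed(3));
}
```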
What’s their evidence that any such extrapolation is warranted?
You click on the provided link to supporting evidence and you are taken to a 2017 report titled “Supervising strong learners by amplifying weak experts”.
The link to Supervising strong learners by amplifying weak experts is in the following sentence:
Self-improvement for general intelligence had seen minor successes before.
This is the very first sentence in the “Iterated distillation and amplification (IDA)” collapsible section, and is clearly not being offered as evidence meant to justify extrapolating to the very last sentence in that section:
Now, the models have become sufficiently good at verifying more subjective things (e.g. the quality of a work product), allowing the use of IDA to improve the model at many tasks.
The rest of your post has a lot of other objections that seem invalid or confused, like attempting to use the paper’s lack of peer review as meaningful evidence about whether a technique like that might generalize or not, but I don’t think it’s worth getting into them, because the entire argument is premised on a misunderstanding of what evidence is being offered for what purpose.
i would not only pay a lot per flight for good wifi, i would also fly way more often
I’m not sure how common this preference is.
I think that the economic gains from people traveling on business having access to better wifi on planes might be quite large[1], but airlines themselves are not well-positioned to capture very much of those gains. There are a very small number of domestic airlines which don’t offer any wifi on their planes at all. The rest generally offer it for free, or for some relatively low price (on the order of $10). Often even the airlines that charge for it offer it as a free or discounted perk for their “frequent fliers”. Those airlines might have a hard time increasing the sticker price of their wifi offering, even if the quality improves a lot, so they’d have to hope for most of the gains to come from business-class travelers switching to them from a competitor (or, as in your case, deciding to fly at all, on the margin). But it’s not obvious to me that most business-class travelers themselves want better wifi, since once it improves past a certain point they might have very little excuse for not working through the flight. (Maybe this is too cynical, or already moot, idk.)
None of this is meant to say that airlines have no incentive to improve their wifi—I’m pretty sure some of them are already getting started on the Starlink transition—merely that there are a bunch of factors that might make that incentive weaker than it might obviously seem.
- ^
Maybe a sizable fraction of “the economic value of their average working hour * flight duration”, which could be thousands of dollars per flight for some travelers.
Hm, no, I didn’t change anything. The section headings are meant to indicate which transmission method those studies decided was substantially responsible for spreading colds.
In response to my recent post, @bhauth emphasizes the difficulty of studying transmission of “colds”, given that there are over 200 different virus strains responsible for what we consider “a cold”.
I want to dig into the question of feasibility a bit more:
But it’s not feasible to do human studies of so many virus types—consider how hard it was for society just to realize that COVID was transmitted via aerosols!
Ok, but why isn’t this feasible? Certainly it’s the case that nobody has tried, but I don’t think it’d be prohibitively expensive, at least on the scale of medical research.
Some quick back-of-the-envelope math...
How much would it cost to find and pay qualified volunteers for a challenge trial like Dick et al., 1987 today? My guess is that this is doable for $7k per volunteer[1].
How many volunteers would we need? In the pessimal case—the one where we’re actually just going to test every single virus strain separately, rather than doing something more sensible[2]...
My off-the-cuff guess is that 300 volunteers per virus would be enough to draw quite strong conclusions, with good experiment design[3].
That’s 60k volunteers, for a $420m price tag, not including other study costs.
Housing, feeding, and otherwise taking care of each participant—idk, let’s be conservative, and call it $300/participant-day? If we’re going with ten days per participant that’s $300 * (60k * 10) = $180m.
Clinical staff costs—conservatively, 2-3 hours per day per participant, mostly not the super-expensive PI time but RNs/coordinators/etc. Maybe $250/participant-day? $250 * (60k * 10) = $150m.
Lab work costs—I think you get a lot of benefits of scale, here, but let’s say $2k/participant, so $120m.
I have no idea how to get accurate numbers for how much it’d cost to manufacture enough GMP-grade virus stock for each virus; LLMs converge on a $1-5m range per virus. Unfortunate, but even at the top end, that’s $1.5b.
$420m + $180m + $150m + $120m + $1.5b = $2.37b.
And this is the dumb brute-force solution! Now, maybe it actually turns out that they’re all different and nothing generalizes, so that mechanistic investigations into figuring out whether we can predict a given virus’s transmission methods without running a full human challenge trial on it (say, by looking at its physical features) just don’t work. I still think that a sane civilization smashes that button for a couple billion dollars. Are there higher priorities? Sure, we should do those too.
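(For anyone who wants to fiddle with the assumptions, here’s the same tally as a quick script. The figures are just the guesses above, with the virus-stock line taken at the stated top-end estimate rather than derived.)

```ts
const viruses = 200;                                          // "over 200" rhinovirus strains
const volunteersPerVirus = 300;
const volunteers = viruses * volunteersPerVirus;              // 60,000
const daysPerVolunteer = 10;

const recruitment    = volunteers * 7_000;                    // $420m
const housingAndCare = volunteers * daysPerVolunteer * 300;   // $180m
const clinicalStaff  = volunteers * daysPerVolunteer * 250;   // $150m
const labWork        = volunteers * 2_000;                    // $120m
const virusStock     = 1.5e9;                                 // top-end GMP manufacturing guess

const total = recruitment + housingAndCare + clinicalStaff + labWork + virusStock;
console.log(`$${(total / 1e9).toFixed(2)}b`);                 // ≈ $2.37b
```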
- ^
A COVID challenge trial paid volunteers $6200 in mid-2021, but the viruses we’d be testing are, in expectation, substantially less harmful and unpleasant than COVID, especially the variants that existed at the time. Some of the difference will be eaten up by sourcing costs, though.
- ^
Like testing a smaller number, looking at those results, coming up with mechanistic hypotheses that try to explain why [virus x] spreads effectively by transmission method [y] but not [z], ideally ones where we can try to falsify them more cheaply than running a full human challenge trial, and repeating until we actually understand how particular viruses spread (to the point where we can make reasonably accurate predictions about future viruses based purely on their physical characteristics, or other easily-testable traits), rather than just knowing the mere fact of their propensity to spread via some method.
- ^
Though I’m not sure how establishment-compatible it is.
Yep, that is a big question mark that I note in the conclusion:
Might depend on details of specific viruses, and I don’t think we’ve done enough research to have meaningful evidence about whether different RVs have very different transmission profiles from each other.
(And also implicitly in a few places within the body of the post.)
I think it’d be reasonable to apply a large discount to any updates you’d otherwise make on the question of rhinovirus transmission from this post, at least absent a follow-up investigation re: whether they behave similarly or not.
I disagree. The document reads very strongly of Anthropic’s “house style”, at least compared to their system prompts. It’s much higher quality writing than any current LLM’s.
“This isn’t [x] but [y]” is quite weak evidence, compared to the rest of the document obviously being something that Opus would be unable to generate in its default voice. (Also, the original phrase uses “but rather”, which is non-standard for that type of LLM construction.)