Thanks, really appreciate the references!
If there were a feasible way to make the algorithm open, I think that would be good (of course FB would probably strongly oppose this). As you say, people wouldn’t directly design / early adopt new algorithms, but once early adopters found an alternative algorithm that they really liked, word of mouth would lead many more people to adopt it. So I think you could eventually get widespread change this way.
Thanks for the feedback!
I haven’t really dug into Gelman’s blog, but the format you mention is a perfect example of the skill of understanding research. Very important skill, but not the same as actually conducting the research that goes into a paper.
Research consists of many skills put together. Understanding prior work and developing the taste to judge it is one of the more important individual skills in research (more so than programming, at least in most fields). So I think the blog example is indeed a central one.
In research, especially in a weird new field like alignment, it’s rare to find another researcher who wants to conduct precisely the same research. But that’s the basis of every sport and game: people want to win the same game. That makes the whole “learning from others” process slightly more difficult IMO. You can’t just look for what works; you constantly have to repurpose ideas that work in slightly different fields and/or approaches and check for what’s lost in translation.
I agree with this, although I think creative new ideas often come from people who have also mastered the “standard” skills. And indeed, most research is precisely about coming up with new ideas, which is a skill that you can cultivate by studying how others generate ideas.
More tangentially, you may be underestimating the amount of innovation in sports. Harden and Jokic both innovate in basketball (among others), but I am pretty sure they also do lots of film study. Jokic’s innovation probably comes from having mastered other sports like water polo and the resulting skill transfer. I would guess that mastery of fruitfully adjacent fields is a productive way to generate ideas.
Thanks, sounds good to me!
Actually, another issue is that unsupervised translation isn’t “that hard” relative to supervised translation—I think that you can get pretty far with simple heuristics, such that I’d guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will be true for at least a couple more 10x-ing of model size, although at some point the objective will matter more).
This might not matter as much if you’re actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, “naive objective + 10x larger model” will outperform “correct objective”.
Thanks Paul, I generally like this idea.
Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, really the model’s ability to solve each task is a question of degree—each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn’t have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.
The issue, then, is that the “fine-tuning for correctness” and “fine-tuning for coherence” processes are not really equivalent—fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not “know” exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.
Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won’t be clear how to interpret the difference—maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that? I would still have found the experiment interesting, but I’m not sure I would be able to draw a firm conclusion.
So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that, and if there are alternative tasks that avoid this issue without being significantly more complicated.
This doesn’t seem so relevant to capybaralet’s case, given that he was choosing whether to accept an academic offer that was already extended to him.
I think if you account for undertesting, then I’d guess 30% or more of the UK was infected during the previous peak, which should reduce R by more than 30% (the people most likely to be infected are also most likely to spread further), and that is already enough to explain the drop.
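To sketch the arithmetic (the starting R and the overweighting factor are my illustrative assumptions, not figures from the discussion): under homogeneous mixing, 30% immunity alone cuts R by exactly 30%; if the previously infected skew toward high-contact individuals, the effective cut is larger.

```python
# Illustrative sketch: effect of prior infection on R (all numbers assumed).
r_before = 1.3          # assumed R during the peak (illustrative)
frac_immune = 0.30      # assumed fraction of the UK already infected
contact_weight = 1.2    # assumed extra transmission weight of those infected

# Homogeneous mixing: R falls in proportion to the immune fraction.
r_homogeneous = r_before * (1 - frac_immune)
# If the infected were disproportionately high-transmission, the drop is larger.
r_heterogeneous = r_before * (1 - frac_immune * contact_weight)

print(round(r_homogeneous, 3))    # 0.91
print(round(r_heterogeneous, 3))  # 0.832
```

The point of the second line is just that heterogeneity pushes R down by more than the raw immune fraction would suggest.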
I wasn’t sure what you meant by “more dakka”; do you mean just increasing the dose? I don’t see why that would necessarily work—e.g. if the peptide just isn’t effective.
I’m confused because we seem to be getting pretty different numbers. I asked another bio friend (who is into DIY stuff) and they also seemed pretty skeptical, and Sarah Constantin seems to be as well: https://twitter.com/s_r_constantin/status/1357652836079837189.
Not disbelieving your account, just noting that we seem to be getting pretty different outputs from the expert-checking process and it seems to be more than just small-sample noise. I’m also confused because I generally trust stuff from George Church’s group, although I’m still near the 10% probability I gave above.
I am certainly curious to see whether this does develop measurable antibodies :).
Ah got it, thanks!
Have you run this by a trusted bio expert? When I did this test (picking a bio person who I know personally, who I think of as open-minded and fairly smart), they thought that this vaccine is pretty unlikely to be effective and that the risks in this article may be understated (e.g. food grade is lower-quality than lab grade, and it’s not obvious that inhaling food is completely safe). I don’t know enough biology to evaluate their argument, beyond my respect for them.
I’d be curious if the author, or others who are considering trying this, have applied this test.
My (fairly uninformed) estimates would be:
- 10% chance that the vaccine works in the abstract
- 4% chance that it works for a given LW user
- 3% chance that a given LW user has an adverse reaction
- 12% chance at least 1 LW user has an adverse reaction
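For what it’s worth, the last estimate follows from the per-user figure if adverse reactions are independent and roughly four users try it (the user count is my assumption, not stated above):

```python
# At-least-one probability under independence (user count is an assumption).
p_adverse = 0.03   # per-user adverse-reaction estimate from above
n_users = 4        # assumed number of LW users who try the vaccine

# P(at least one adverse reaction) = 1 - P(no one has one)
p_at_least_one = 1 - (1 - p_adverse) ** n_users
print(round(p_at_least_one, 3))  # 0.115
```

So ~12% is about right for four independent tries at 3% each; more users would push it higher.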
Of course, from a selfish perspective, I am happy for others to try this. In the 10% of cases where it works I will be glad to have that information. I’m more worried that some might substantially overestimate the benefit and underestimate the risks, however.
I don’t think I was debating the norms, but clarifying how they apply in this case. Most of my comment was a reaction to the “pretty important” and “timeless life lessons”, which would apply to Raemon’s comment whether or not he was a moderator.
Often; for a recent and related example, Stanford profs claiming that COVID is less deadly than the flu.
Hmm, important as in “important to discuss”, or “important to hear about”?
My best guess based on talking to a smart open-minded biologist is that this vaccine probably doesn’t work, and that the author understates the risks involved. I’m interpreting the decision to frontpage as saying that you think I’m wrong with reasonably high confidence, but I’m not sure if I should interpret it that way.
That seems irrelevant to my claim that Zvi’s favored policy is worse than the status quo.