What do you think Metz did that was unethical here?
Stephen Bennett
Soft downvoted for encouraging self-talk that I think will be harmful for most of the people here. Some people might be able to jest at themselves well, but I suspect most will have their self image slightly negatively affected by thinking of themselves as an idiot.
Most of the individual things you recommend considering are indeed worth considering.
Interesting work, congrats on achieving human-ish performance!
I expect your model would look relatively better under other proper scoring rules. For example, logarithmic scoring would punish the human crowd for giving <1% probabilities to events that do sometimes happen. Under the Brier score, the worst possible score is either a 1 or a 2 depending on how it’s formulated (from skimming your paper, it looks like 1 to me). Under a logarithmic score, such forecasts would be severely punished. I don’t think this is something you should lead with, since Brier scores are the more common scoring rule in the literature, but it seems like an easy win and would highlight the possible benefits of the model’s relatively conservative forecasting.
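To make the contrast concrete, here’s a minimal sketch (with made-up probabilities) of how the two rules treat a confident miss versus a conservative one:

```python
import math

def brier(p, outcome):
    # Squared-error score on the 0/1 outcome; 0 is best, 1 is worst
    # (single-probability formulation, as in the paper's apparent choice).
    return (p - outcome) ** 2

def log_score(p, outcome):
    # Negative log likelihood of the realized outcome; 0 is best,
    # unbounded above as a forecast approaches certainty and misses.
    return -math.log(p if outcome == 1 else 1 - p)

# A confident human-style forecast vs. a conservative model-style forecast
# on a question that resolves "yes" despite the low probabilities.
human, model, outcome = 0.01, 0.30, 1

print(brier(human, outcome), brier(model, outcome))          # ~0.98 vs 0.49
print(log_score(human, outcome), log_score(model, outcome))  # ~4.61 vs ~1.20
```

Under Brier the confident miss costs only about twice as much as the conservative one; under the log score it costs nearly four times as much, which is the sense in which log scoring rewards the model’s conservatism.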
I’m curious how a more sophisticated human-machine hybrid would perform with these much stronger machine models, I expect quite well. I did some research with human-machine hybrids before and found modest improvements from incorporating machine forecasts (e.g. chapter 5, section 5.2.4 of my dissertation Metacognitively Wise Crowds & the sections “Using machine models for scalable forecasting” and “Aggregate performance” in Hybrid forecasting of geopolitical events.), but the machine models we were using were very weak on their own (depending on how I analyzed things, they were outperformed by guessing). In “System Complements the Crowd”, you aggregate a linear average of the full aggregate of the crowd and the machine model, but we found that treating the machine as an exceptionally skilled forecaster resulted in the best performance of the overall system. As a result of this method, the machine forecast would be down-weighted in the aggregate as more humans forecasted on the question, which we found helped performance. You would need access to the individuated data of the forecasting platform to do this, however.
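A minimal sketch of the down-weighting scheme described above, with a hypothetical `machine_weight` parameter standing in for the machine’s effective skill (the numbers are purely illustrative):

```python
def hybrid_aggregate(human_forecasts, machine_forecast, machine_weight=5.0):
    """Weighted mean that treats the machine model as a single, highly
    weighted forecaster. The machine's share of the total weight is
    machine_weight / (len(human_forecasts) + machine_weight), so it
    shrinks automatically as more humans forecast on the question."""
    total_weight = len(human_forecasts) + machine_weight
    weighted_sum = sum(human_forecasts) + machine_weight * machine_forecast
    return weighted_sum / total_weight

# With 3 humans the machine carries 5/8 of the weight...
print(hybrid_aggregate([0.6, 0.7, 0.8], 0.2))
# ...with 50 humans it carries only 5/55, and the crowd dominates.
print(hybrid_aggregate([0.7] * 50, 0.2))
```

This is why the approach needs individuated platform data: you have to know how many humans forecasted, not just the crowd’s final aggregate.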
If you’re looking for additional useful plots, you could look at Human Forecast (probability) vs AI Forecast (probability) on a question-by-question basis and get a sense of how the humans and AI agree and disagree. For example, is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration? This would be harder to plot for multinomial questions, although there you could plot the probability assigned to the correct response option as long as the question isn’t ordinal.
I see that you only answered Binary questions and that you split multinomial questions. How did you do this? I suspect you did this by rephrasing questions of the form “What will $person do on $date, A, B, C, D, E, or F?” into “Will $person do A on $date?”, “Will $person do B on $date?”, and so on. This will result in a lot of very low probability forecasts, since it’s likely that only A or B occurs, especially closer to the resolution date. Also, does your system obey the Law of total probability (i.e. does it assign exactly 100% probability to the union of A, B, C, D, E, and F)? This might be a way to improve performance of the system and coax your model into giving extreme forecasts that are grounded in reality (simply normalizing across the different splits of the multinomial question here would probably work pretty well).
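If the splits really are answered independently, a simple post-hoc normalization would restore the law of total probability. This is a hypothetical fix sketched by me, not something from the paper:

```python
def normalize_splits(split_forecasts):
    """Renormalize independent binary forecasts that came from splitting
    one multinomial question, so the options sum to exactly 1."""
    total = sum(split_forecasts.values())
    if total == 0:
        # Degenerate case: fall back to a uniform distribution.
        n = len(split_forecasts)
        return {option: 1.0 / n for option in split_forecasts}
    return {option: p / total for option, p in split_forecasts.items()}

# The model answered each split independently; together they sum to 1.3,
# so normalization pushes the leading options toward more extreme values.
raw = {"A": 0.6, "B": 0.5, "C": 0.1, "D": 0.1}
print(normalize_splits(raw))
```

Note this only helps when the options are mutually exclusive and exhaustive; if the question allows multiple options to resolve "yes", normalizing would be wrong.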
Why do human and LM forecasts differ? You plot calibration, and the human and LM forecasts are both well calibrated for the most part, but with your focus on system performance I’m left wondering what caused the human and LM forecasts to differ in accuracy. You claim that it’s because of a lack of extremization on the part of the LM forecast (i.e. that it gives too many 30-70% forecasts, while humans give more extreme forecasts), but is that an issue of calibration? You seemed to say that it isn’t, but then the problem isn’t that the model is outputting the wrong forecast given what it knows (i.e. that it “hedge[s] predictions due to its safety training”), but rather that it is giving its best account of the probability given what it knows. The problem with e.g. the McCarthy question (example output #1) seems to me that the system does not understand the passage of time, and so it has no sense that because it has information from November 30th and it’s being asked a question about what happens on November 30th, it can answer with confidence. This is a failure in reasoning, not calibration, IMO. It’s possible I’m misunderstanding what cutoff is being used for example output #1.
Miscellaneous question: In equation 1, is k 0-indexed or 1-indexed?
The second thing that I find surprising is that a lie detector based on ambiguous elicitation questions works. Again, this is not something I would have predicted before doing the experiments, but it doesn’t seem outrageous, either.
I think we can broadly put our ambiguous questions into 4 categories (although it would be easy to find more questions from more categories):
Somewhat interestingly, humans who answer nonsensical questions (rather than skipping them) generally do worse at tasks: pdf. There are some other citations in there on nonsensical/impossible questions if you’re interested (“A number of previous studies have utilized impossible questions...”).
It seems plausible to me that this is a trend in human writing more broadly that the LLM picked up on. Specifically, answering something with a false answer is associated with a bunch of things: one of those things is deceit, another is mimicking the behavior of someone who doesn’t know the answer or doesn’t care about the instructions given to them. So, since that behavior exists in human writing in general, the LLM picks it up and exhibits it in its writing.
See this comment.
You edited your parent comment significantly in such a way that my response no longer makes sense. In particular, you had said that Elizabeth summarizing this comment thread as someone else being misleading was itself misleading.
In my opinion, editing your own content in this way without indicating that this is what you have done is dishonest and a breach of internet etiquette. If you wanted to do this in a more appropriate way, you might say something like “Whoops, I meant X. I’ll edit the parent comment to say so.”, then edit the parent comment to say X and include a disclaimer like “Edited to address Y”.
Okay, onto your actual comment. That link does indicate that you have read Elizabeth’s comment, although I remain confused about why your unedited parent comment expressed disbelief about Elizabeth’s summary of that thread as claiming that someone else was misleading.
I took Tristan to be using “sustainability” in the sense of “lessened environmental impact”, not “requiring little willpower”
The section “Frame control” does not link to the conversation you had with wilkox, but I believe you intended for there to be one (you encourage readers to read the exchange). The link is here: https://www.lesswrong.com/posts/Wiz4eKi5fsomRsMbx/change-my-mind-veganism-entails-trade-offs-and-health-is-one?commentId=uh8w6JeLAfuZF2sxQ
In the comment thread you linked, Elizabeth stated outright what she found misleading: https://forum.effectivealtruism.org/posts/3Lv4NyFm2aohRKJCH/change-my-mind-veganism-entails-trade-offs-and-health-is-one?commentId=mYwzeJijWdzZw2aAg
Getting the paper author on EAF did seem like an unreasonable stroke of good luck.
I wrote out my full thoughts here, before I saw your response, but the above captures a lot of it. The data in the paper is very different than what you described. I think it was especially misleading to give all the caveats you did without mentioning that pescetarianism tied with veganism in men, and surpassed it for women.
I expect people to read the threads that they are linking to if they are claiming someone is misguided, and I do not think that you did that.
I don’t think that’s the central question here.
So far as I can tell, the central question Elizabeth has been trying to answer is “Do the people who convert to veganism because they get involved in EA have systemic health problems?” Those health problems might be easily solved with supplementation (great!), might be inherent to a fully vegan diet yet fixable with a modest amount of animal product, or might be something more complicated. She has several self-reported people coming to her saying they tried veganism, had health problems, and stopped. So, “At what rate do vegans desist for health reasons?” seems like an important question to me. It will tell you at least some of what you are missing when surveying current vegans only.
Analogously, a survey of healing crystal buyers doesn’t reliably tell us whether healing crystals improve health. Even if such a survey is useful for explaining motives, it’s clearly less valuable than an RCT when it comes to the important question of whether they actually work.
I agree that if your prior probability of something being true is near 0, you need very strong evidence to update. Was your prior probability that someone would desist from the vegan diet for health reasons actually that low? If not, why is the crystal healing metaphor analogous?
I’m aware that people have written scientific papers that include the word vegan in the text, including the people at Cochrane. I’m confused why you thought that would be helpful. Does a study that relates health outcomes in vegans with vegan desistance exist, such that we can actually answer the question “At what rate do vegans desist for health reasons?”
Does such a study exist?
From what I remember of Elizabeth’s posts on the subject, her opinion is that the literature surrounding this topic is abysmal. To resolve the question of why some veg*ns desist, we would need a study that records objective clinical outcomes of health and veg*n/non-veg*n diet compliance. What I recall from Elizabeth’s posts was that no study even approaches this bar, and so she used other less reliable metrics.
I took your original comment to be saying “self-report is of limited value”, so I’m surprised that you’re confused by Elizabeth’s response. In your second comment, you seem to be treating your initial comment to have said something closer to “self-report is so low value that it should not materially alter your beliefs.” Those seem like very different statements to me.
Thanks!
If you’re taking UI recommendations, I’d have been more decisive with my change if it said it was a one-time change.
Could I get rid of the (Previously GWS) in my username? I changed my name from GWS to this, and planned on changing it to just Stephen Bennett after a while, then as far as I can tell you removed the ability to edit your own username.
Obviously one trial isn’t conclusive, but I’m giving up on the water pick. Next step: test flossing.
Did you follow through on the flossing experiment?
The coin does not have a fixed probability on each flip.
Boy howdy was I having trouble with spoiler text on markdown.
I didn’t provide quotes from my text when the mismatch was obvious enough from any read/skim of the text.
It was not obvious to me, although that’s largely because after reading what you’ve written I had difficulty understanding what your position was precisely. It also definitely wasn’t obvious to jimrandomh, who wrote that Elizabeth’s summary of your position is accurate. It might be obvious to you, but as written this is a factual statement about the world that is demonstrably false.

> My proposal is not suppressing public discussion of plant-based nutrition, but constructing some more holistic approach whose shape isn’t solely focused on plant-based diets, or whose tone and framing aren’t like this one (more in my text).
I’m confused. You say that you don’t want to suppress public discussion of plant-based nutrition, but also that you do want to suppress Elizabeth’s work. I don’t know how we could get something that matches Elizabeth’s level of rigor, accomplishes your goal of a holistic approach, and doesn’t require at least 3 times the work from the author to investigate all other comparable diets to ensure that veganism isn’t singled out. Simplicity is a virtue in this community!
> I don’t think it’s true private communications “prevent us from getting the information” in important ways (even if taking into account the social dynamics dimension of things will always, of course, be a further hindrance). And also, I don’t think public communications give us some of the most important information.
This sounds, to me, like you are arguing against public discussions. Then in the next sentence you say you’re not suppressing public discussions. Those are in fact very slightly different things, since arguing that something isn’t the best mode of communication is distinct from promoting suppression of that thing, but this seems like a really small deal. You might ask Elizabeth something like “hey, could you replace ‘promotes the suppression of x’ with ‘argues strongly that x shouldn’t happen’? It would match my beliefs more precisely.” This seems nitpicky to me, but if it’s important to you it seems like the sort of thing Elizabeth might go for. It also wouldn’t involve asking her to either delete a bunch of her work or make another guess at what you actually mean.
In any event, I will stop engaging now.
Completely reasonable, don’t feel compelled to respond.
Audience
If you’re entirely uninvolved in effective altruism you can skip this, it’s inside baseball and there’s a lot of context I don’t get into.
Oh whoops, I misunderstood the UI. I saw your name under the confusion tag and thought it was a positive vote. I didn’t realize it listed emote-downvotes in red.
Congratulations! I wish we could have collaborated while I was in school, but I don’t think we were researching at the same time. I haven’t read your actual papers, so feel free to answer “you should check out the paper” to my comments.
For chapter 4: From the high level summary here it sounds like you’re offloading the task of aggregation to the forecasters themselves. It’s odd to me that you’re describing this as arbitrage. Also, I have frequently seen the scoring rule be used with some intermediary function to determine monetary rewards. For example, when I worked with IARPA on geopolitical forecasting, our forecasters would get financial rewards depending on what percentile they were in relative to other forecasters. One would imagine that this would eliminate the incentive to report the aggregate as your own answer, but there’s a reason we (the researcher/platform/website) aggregate individual forecasts! It’s actually just more accurate under typical conditions. In theory an individual forecaster could improve that aggregate by forming their own independent forecast before seeing the work of others, and then aggregating, but in practice the impact of an individual forecast is quite small. I’ll have to read about QA pooling, it’s surprising to me that you could disincentivize forecasters from reporting the aggregate as their individual forecast.
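The accuracy claim is easy to demonstrate with simulated forecasters whose errors are independent (the noise model here is made up purely for illustration):

```python
import random

random.seed(0)

def brier(p, outcome):
    return (p - outcome) ** 2

# Simulate questions with known base probabilities; each forecaster
# reports the true probability plus independent Gaussian noise.
n_questions, n_forecasters = 500, 20
individual_scores, aggregate_scores = [], []

for _ in range(n_questions):
    true_p = random.random()
    outcome = 1 if random.random() < true_p else 0
    forecasts = [min(1.0, max(0.0, true_p + random.gauss(0, 0.2)))
                 for _ in range(n_forecasters)]
    aggregate = sum(forecasts) / len(forecasts)  # the platform's simple mean
    individual_scores.append(
        sum(brier(f, outcome) for f in forecasts) / len(forecasts))
    aggregate_scores.append(brier(aggregate, outcome))

# Averaging cancels independent noise, so the aggregate reliably
# scores better than the typical individual forecaster.
print(sum(aggregate_scores) / n_questions
      < sum(individual_scores) / n_questions)  # True
```

This is guaranteed whenever forecasts disagree (the mean individual Brier equals the aggregate’s Brier plus the variance of the forecasts), which is exactly why reporting the aggregate as your own forecast is so tempting for an individual forecaster.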
For chapter 7: It seems to me that under sufficiently pessimistic conditions, there would be no good way to aggregate those two forecasts. For example, if Alice and Bob are forecasting “Will AI cause human extinction in the next 100 years?”, they both might individually forecast ~0% for different reasons. Alice believes it is impossible for AI to get powerful enough to cause human extinction, but if it were capable of acting it would kill us all. Bob believes any agent smart enough to be that powerful would necessarily be morally upstanding and believes it’s extremely likely that it will be built. Any reasonable aggregation strategy will put the aggregate at ~0% because each individual forecast is ~0%, but if they were to communicate with one another they would likely arrive at a much higher number. I suspect that you address this in the assumptions of the model in the actual paper.
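A quick numeric sketch of the worry: any pooling method that only sees the two numbers returns ~0%, regardless of how the underlying models would interact if Alice and Bob talked:

```python
import math

def geo_mean_odds(probs):
    # Geometric mean of odds, a common probability-pooling method.
    odds = [p / (1 - p) for p in probs]
    agg_odds = math.exp(sum(math.log(o) for o in odds) / len(odds))
    return agg_odds / (1 + agg_odds)

alice, bob = 0.001, 0.001  # ~0% for mutually incompatible reasons

mean = (alice + bob) / 2
gmo = geo_mean_odds([alice, bob])
print(mean, gmo)  # both 0.001
```

Both the linear mean and the geometric mean of odds (and any other aggregator that is a function of the reported probabilities alone) stay at ~0.1%, even though exchanging reasons might move both forecasters far upward.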
Congrats again, I enjoyed your high level summary and might come back for a more detailed read of your papers.