I agree this is better (the system rewards only a subset of what it does), but it is still overgeneralizing. Systems could have a different purpose but fallible reward structure (also Goodharting). You could be analyzing the purpose-reward link at the wrong level: political parties want change, so they seek power, so they need donations. This makes it look like the purpose is to just get donations because of rewards to bundlers, but it ignores the rewards at other levels and confuses local rewards with global purpose. Just as a system does a lot of things, so it rewards a lot of things.
Daniel V
That’s true and a very important point I wish I had included. I assumed consciousness and some unstated degree of able-bodiedness. A good hit to the head on the way in and/or certain physical limitations, and mere inches of depth will be the determinant.
using urban firewood prices for what was mostly rural consumption.
Bernard Stanford: If you value all informal economy firewood production at market price, and then compare it to extant GDP estimates, you need to make sure ALL informal economic production is similarly valued, or you’ll massively overestimate firewood’s share of GDP. Seems to be what happened!
The approach seems to have a serious flaw in assuming that THIS sector of the informal economy was underestimated, but surely not any OTHER sector.
There are two different problems being raised here.
The outside quote is Zvi pointing out the issue of what price is used to multiply by quantity output to get value estimates. This is a good point because we want to multiply by a representative price, not a too-high price.
The inside quote is a related but distinct issue, which I disagree with. Bernard is arguing that if you are going to include an informal element in GDP, you have to include all informal elements. Sure, it’s the ideal, but 1) that’s not what’s going on here, and 2) even if it were, I’d be fine with it. First, firewood is already semi-included in GDP, this process is just attempting to get a better estimate. Imagine BLS sends people out to grocery stores to price milk and estimate milk consumption, but they only end up getting urban prices and only end up observing 1⁄3 of the milk consumed in the US. It’d actually be great to add back in the missing consumption. How do you price it? I guess with the only price you have (and Zvi is right that it’s a problematic price to use). Second, imagine BLS sends people out to start to price household dishwashing and estimate household dishwashing service provision, but they don’t bother to collect household vacuuming. It’d actually still be great to do that, though it’d be difficult to pull off.
So, I think what the authors did is actually a nice exercise, but the price issue means 28% could be an overestimate. How bad of one? The paper discusses Figure A11, recreating Barger’s (1955) estimates (though it only goes back to 1869), so what was the GDP share using that one? In the 1870s, the NBER paper in question reports a share of about 8% (on $500M of firewood). Barger reports about $350M of firewood (because of the lower price index), which would imply a share of about 5.4%. If the 28% was a 50% overestimate, that would mean the “true” share in the 1830s (if you assume Barger’s price index is “true”) could be more like 18.7%. That’s still quite high and far from “half an order of magnitude” overestimate, more like 1⁄5 of one. Is it absurd? Figure 6 here [pdf via academia.edu] has British and Swedish shares at 20% and 40%, respectively.
I too have grown increasingly skeptical that meta-analysis in its typical form does anything all that useful.
Unfortunately, people can be bad at understanding meta-analyses. If you have studies that disagree like 50⁄50, it’s not necessarily true that half did something wrong. It’s possible there is a legitimate hidden moderator that changes the effect of the variable (probably being revealed by the meta-analysis but not picked up sufficiently by popular reporting). Or even revealing that half have a fatal flaw would be a contribution of the meta-analysis! Sometimes the effects are not totally comparable, in which case that should either be modeled/adjusted, or excluded (probably already considered by the meta-analyst [though the salient counter-examples where the researcher(s) screw up confirm that not all papers are perfect, though this is not unique to meta-analyses]). It is indeed problematic when a meta-analysis concludes the average effect is the effect, particularly with a bimodal distribution of effect sizes (would be crazy to conclude that in that case!).
The alternative “look at the studies individually” suffers from all the same issues: garbage in, garbage out; hidden moderators; non-comparability; over-concluding from an average. At least meta-analysis brings some systematic thinking to the evaluation of the literature. A strong meta-analysis interacts with these issues and hopefully avoids the pitfalls because it does what a weak meta-analysis does not—it explains inclusion criteria, considers variability, and explains differences rather than just reporting a mean.
Yes, that’s what I meant, thanks.
I’m glad she’s totally fine. Maybe even a net positive for her and the family on future water safety. It showcases the importance of thinking about bodies of water beyond the prototypes.
A somewhat similar event occurred last weekend with my toddler in a pool. I was less than a foot away from him, as intended, and he was walking around in waist-deep water. Lost his feet but his waist is taller than his arms are long—so he needed me to intervene. He swallowed a little water in the less than two seconds he was sloshing around, but he otherwise didn’t care.
The lesson is the same: the bottom needs to be reliable enough that they can regain footing + footing may mean arm lengths + you basically can’t count on buoyancy/control (even if he had his arms straight out he might not know how to properly keep his head out, he might not know to hold his breath, and also he might be panicking) = don’t go swimming alone.
I think it would be great to formally lay out alternative hypotheses for why this behavior arises, so we can begin to update their likelihoods based on evidence as it accumulates. I have several:
The LLM “misreads” the instruction with probability P. P could vary as a function of the prompt itself, perhaps as a function of both the specificity/detail (negatively so) and complexity (positively so) of the prompt. Thus, we’d expect variation in sabotage rates, but with some specificity that allows us to test this hypothesis (e.g., “Condition 1″ might outperform some prompts, but you’d also think “YOU MUST” would operate similarly).
The LLM “acts out” what it thinks we expect of it. To the extent that some prompts treat the LLM as more of an agent than computer, the LLM may lean into that role. (e.g., “Condition 1″ might outperform “YOU MUST”)
The LLM has a survival drive. As noted by Schlatter et al., this can be difficult to disambiguate from H2.
The LLM is task-focused. It prioritizes completing its defined task above all else. Maybe having shutdown described more as “part of” the task would reduce sabotage rates? (Edit: Such as in this comment by nostalgebraist.)
It would be valuable to figure out how variations across models would reflect on these hypotheses. Plus, what other hypotheses are there and what patterns of variation do they predict?
Yep, frequentist null hypothesis significance testing often gets critiqued for the epistemic failures of the people who use it, even when the core of the approach is fine. By the way, the term you’re describing is familywise error rate. The key questions are, do we care about that error rate, and who is in the family (answers depend on the research questions)?
Alternatively or in addition to this, you can embrace AI and design new tasks and assignments that cause students to learn together with the AI.
This magical suggestion needs explication.
From what I’ve seen via Ethan Mollick, instead of copy-pasting, the new assignments that would be effective are the same as the usual—just “do the work,” but at the AI. Enter a simulation, but please don’t dual-screen the task. Teach the AI (I guess the benefit here is immediate feedback, as if you couldn’t use yourself or friend as a sounding board), but please don’t dual-screen the task. Have a conversation (again, not in class or on a discussion board or among friends), but please don’t dual-screen the task. Then show us you “did it.” You could of course do these things without AI, though maybe AI makes a better (and certainly faster) partner. But the crux is that you have to do the task yourself. Also note that this admits the pre-existing value of these kinds of tasks.
Students who will good-faith do the work and leverage AI for search, critique, and tutoring are...doing the work and getting the value, like (probably more efficiently than, possibly with higher returns than) those who do the work without AI. Students who won’t...are not doing the work and not getting the value, aside from the signaling value from passing the class. So there you have it—educators can be content that not doing the assignment delivers worse results for the student, but the student doesn’t mind as long as they get their grade, which is problematic. Thus, educators are not going quietly and are in fact very concerned about AI-proofing the work, including shifts to in-person testing and tasks.
However, that only preserves the benefit of the courses and in turn degree (I’m not saying pure signaling value doesn’t exist, I’m just saying human capital development value is non-zero and under threat). It does not insulate the college graduate from competition in knowledge work from AI (here’s the analogy: it would obviously be bad for the Ford brand to send lemons into the vehicle market, but even if they are sending decent cars out, they should still be worried about new entrants).
Really enjoyed the post, but in the interest of rationality,
How many more older siblings should patrons of the Chelsea nightclub have than all other men in New York?
This question rests on the false premise(s) (i.e., model misspecification(s)) that homosexuality is only a function of birth order and that the Chelsea nightclub probability doesn’t stem from heavy selection. Relatedly, gwern notes that, “surely homosexuality is not the primary trait the Catholic Church hierarchy is trying to select for.” Maybe this was supposed to be more tongue-in-cheek. But identifying a cause does not require that it sufficiently explain something completely on its own.
I agree with you. Unless the signal is so strong that people believe that their personal experience is not representative of the economy, it’s going to be overweighted. “I and half the people I know make less” will lead to discontent about the state of the economy. “I and half the people I know make less, but I am aware that GDP grew 40%, so the economy must be doing fine despite my personal experience” is possible, but let’s just say it’s not our prior.
Exactly, which is why the metric Mazlish prefers is so relevant and not bizarre, unless the premise that people judge the economy from their own experiences is incorrect.
Why is this what matters? It’s a bizarre metric. Why should we care what the median change was, instead of some form of mean change, or change in the mean or median wage?
The critique that the justification wasn’t great because the mean wage dropped a lot in the example is fair. Yet, in the proposed alternative example it remains quite likely that people will perceive the economy as having gotten worse, even if the economy is objectively so much better − 2⁄3 will say they’re personally worse off, insufficiently adjust for the impersonal ways of assessing the economy, and ultimately say the economy is worse.
Neither nor is a bizarre metric. may be great for observers to understand general trajectories of income when you lack panel data, but since people use their own lives to assess whether they are better off and in turn overweight that when they judge the economy, is actually more useful for understanding the translation from people’s lives into their perceptions.
Consider a different example (also in real terms):
T1: A makes $3, B makes $3, C makes $3 ,D makes $10, E makes $12
T2: A makes $2, B makes $2, C makes $3, D makes $9, E makes $16
The means show nice but not as crazy economic growth ($6.20 to $6.40), and the is $0 ($3 to $3) - “we’re not poorer!” However, the is -$1. And people at T2 will generally feel worse off (3/5 will say they can’t buy as much as they could before, so “this economy is tough”).
Contrast that with (still in real terms):
T1: A makes $2, B makes $2, C makes $4 ,D makes $10, E makes $12
T2: A makes $3, B makes $3, C makes $3, D makes $10, E makes $12
The means show nice but not as crazy economic growth ($6 to $6.20), and the is -$1 ($4 to $3) - “we’re poorer!” However, the is $0. And people at T2 will generally feel like things are going okay (only 1 person will feel worse off).
And these are comparing to 0. Mazlish’s post illustrates that people will probably not compare to 0 and instead to recent trajectories (“I got a 3% raise last year, what do you mean my raise this year is 2%?!”), so #1 means people will be dissatisfied. #2, as it bore out in the data, also means dissatisfaction. And #3, largely due to timing, means further dissatisfaction.
Then it is no surprise that exit polls show people who were most dissatisfied with the economy under Biden (and assumed Harris would be more of that) voted for Trump. Sure, there’s some political self-deception bias going on (see charts of economic sentiment vs. date by party affiliation), but note that the exit polls are correlational—they can indicate that partisanship is a hell of a drug or that people are rationally responding to their perceptions. It’s likely both. And if your model of those perceptions is inferior in the ways Mazlish notes, you’d wrongly think people would have been happy with the economy.
Literally macroeconomics 101. Trade surpluses aren’t shipping goods for free. There is a whole balance of payments to consider. I’m shocked EY could get that so wrong, surprised that lsusr is so ready to agree, and confused because surely I missed something huge here, right?
I guess I misunderstood you. I figured that without “regression coefficients,” the sentence would be a bit tautological: “the point of randomized controlled trial is to avoid [a] non-randomized sample,” and there were other bits that made me think you had an issue with both selection bias (agree) and regressions (disagree).
I share your overall takeaway, but at this point I am just genuinely curious why the self-selection is presumed to be such a threat to internal validity here. I think we need more attention to selection effects on the margin, but I also think there is a general tendency for people to believe that once they’ve identified a selection issue the results are totally undermined. What is the alternative explanation for why semaglutide would disincline people who would have had small change scores from participating or incline people who have large change scores to participate (remember, this is within-subjects) in the alcohol self-administration experiment? Maybe those who had the most reduced cravings wanted to see more of what these researchers could do? But that process would also occur among placebo, so it’d work via the share of people with large change scores being greater in the semaglutide group, which is...efficacy. There’s nuance there, but hard to square with lack of efficacy.
That said, still agree that the results are no slam dunk. Very specific population, very specific outcomes affected, and probably practically small effects too.
I appreciate this kind of detailed inspection and science writing, we need more of this in the world!
I’m writing this comment because of the expressed disdain for regressions. I do share the disappointment about how the randomization and results turned out. But for both, my refrain will be: “that’s what the regression’s for!”
This contains the same data, but stratified by if people were obese or not:
Now it looks like semaglutide isn’t doing anything.
The beauty of exploratory analyses like these is that you can find something interesting. The risk is that you can also read into noise. Unfortunately, all they did was plot these results, not report the regression, which could tell us whether there is any effect beyond the lower baseline. eTable3 confirms that the interaction between condition and week is non-significant for most outcomes, which the authors correctly characterized. That’s what the regression’s for!
This means the results are non-randomized.
Yes and no. People were still randomized to condition and it appears to be pretty even attrition. Yes, there is an element of self-selection, which can constrain the generalizability (i.e., external validity) of the results (I’d say most of the constraint is actually just due to studying people with AUD rather than the general population, but you can see why they’d do such a thing), but that does not necessarily mean it broke the randomization, which would reduce the ability to interpret differences as a result of the treatment (i.e., internal validity). To the extent that you want to control for differences that happen to occur or have been introduced between the conditions, you’ll need to run a model to covary those out. That’s what the regression’s for!
the point of RCTs is to avoid resorting to regression coefficients on non-randomized samples
My biggest critique is this. If you take condition A and B and compute/plot mean outcomes, you’d presumably be happy that it’s data. But computing/plotting predicted values from a regression of outcome on condition would directly recover those means. And from what we’ve seen above, adjustment is often desirable. Sometimes the raw means are not as useful as the adjusted/estimated means—to your worry about baseline differences, the regression allows us to adjust for that (i.e., provide statistical control where experimental control was not sufficient). And, instead of eyeballing plots, the regressions help tell you if something is reliable. The point of RCTs is not to avoid resorting to regression coefficients. You’ll run regressions in any case! The point of RCTs is to reduce the load your statistical controls will be expected to lift by utilizing experimental controls. You’ll still need to analyze the data and implement appropriate statistical controls. That’s what the regression’s for!
I really like this succinct post.
I intuitively want to endorse the two growth rates (if it “looks” linear right now, it might just be early exponential), but surely this is not that simple, right? My top question is “What are examples of linear growth in nature and what do they tell us about this perception that all growth is around zero or exponential?”
A separate thing that sticks out is that having two growth rates does not necessarily imply generally two subjective levels.
This can be effectively implemented by the government accumulating tax revenues (largely from the rich) in good times and spending them on disaster relief (largely on the poor) in bad times. It lets price remain a signal while also expanding supply.
Taxation is better than a ban, but in this case it remains an attempt at price control. “Documented” cost increases is doing a lot of work. Better than “vibes about price,” but it is the same deal: the government “knows better” what prices should be than what is revealed by the market. I’d argue that if the government doesn’t like what the market is yielding, it can get involved in the market and help expand supply itself, which we see governments attempt during disaster relief already.
I agree, the meta-point of selection bias is valid but the direction of bias is unclear.