ok, i agree with this. there is some room for disagreement on exactly how big the gap is between white box and black box—i think it’s very small compared to the gap from white box to full understanding. my main argument would just be ELK flavored, that there are spurious correlations that give you human simulators instead. but i don’t feel super confident that the constant factors work out to support my claim
leogao
i think SAEs are a completely reasonable thing under the first worldview, and mostly crazy under the second worldview (with the exception of maybe bio or something where I’ve heard they’re genuinely useful)
(SAEs are not sufficient to actually understand things, but they are a genuine step on the way there)
i totally agree for the case of actual white box understanding. this is what I’d consider the first worldview. my gripe is that interp-flavored techniques reveal very little understanding that might actually scale with intelligence, and yet, through association with interp, imply that they do.
i think it’s really weird that people are trying to do vaguely interp flavored things but also trying to argue for the goodness of such techniques via empirical usefulness. i think there are broadly two self-consistent worldviews here. one is that you want to understand how NNs actually work and then use that understanding for something. the other is that you want to make models better at X (where X can be anything from “be a good chatgpt model” to “refuse bioweapon prompts” to “make weak-to-strong setup score go up”). if you’re doing the latter, the conceptually important part is picking the right X and then working really hard to make it go up using whatever techniques work. if you’re doing the former, you should actually try to understand things, period. it doesn’t make sense to try to do both and ultimately get neither. you should either go pragmatic or go interp.
i haven’t had a chance to think deeply about it but vibes wise i don’t like activation oracles
i’d guess there are a lot of people out there who genuinely do not love anyone, intrinsically enjoy exerting power over people, and are at best indifferent to making them suffer. this is correlated with power because one consequence of wielding a lot of power is you will inevitably hurt people, or even just fail to help people as much as you could have. if you are a normal person, this can incur a lot of psychic damage—this is why many people are scared of having power. if you enjoy it and don’t care about hurting people, this is actively attractive.
it is often observed that children like celebrating birthdays and aspire to be older, and then, upon reaching a particular age, they realize the error of their ways and treat impending birthdays as a mark of getting closer to death. while it is generally assumed that this is because the shittiness of aging only becomes evident with age, there is also a mathematical explanation. each year, your expected remaining lifespan changes by some amount. for most of your life, this is close to −1 per year, because you almost certainly weren’t about to suddenly die that year. but things get weird for the very young and the very old. for very young people, each additional year of life is strong evidence that you didn’t lose decades of lifespan by succumbing to infant mortality. for very old people, your probability of dying each year is so high that even if you somehow miraculously live an additional year, your expected remaining lifespan is still extremely short. so, this theory predicts that small children should be very happy to get older; and, for those whose glasses are half full, once they get old enough, they too can enjoy the bittersweet satisfaction of having nothing to lose because they have nothing left.
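the argument above can be sketched numerically. every hazard number below is invented for illustration (not real actuarial data); the point is just the shape: the year-over-year change in expected remaining lifespan is positive in infancy, roughly −1 in mid-life, and near zero in extreme old age.

```python
# toy life-table sketch with made-up mortality rates

def hazards(max_age=120):
    # q[a] = probability of dying during year a, given alive at its start
    q = []
    for a in range(max_age):
        if a == 0:
            q.append(0.04)  # elevated infant mortality (invented number)
        else:
            q.append(min(0.7, 0.0005 * 1.09 ** a))  # Gompertz-style growth, capped
    return q

def remaining_life(q):
    # e[a] = expected further years lived given alive at age a,
    # via the recursion e[a] = (1 - q[a]) * (1 + e[a + 1])
    e = [0.0] * (len(q) + 1)
    for a in reversed(range(len(q))):
        e[a] = (1 - q[a]) * (1 + e[a + 1])
    return e

e = remaining_life(hazards())
for age in (0, 1, 30, 60, 95):
    print(f"age {age}: e = {e[age]:.1f}, "
          f"change on surviving the year = {e[age + 1] - e[age]:+.2f}")
```

with these toy numbers, surviving year 0 raises expected remaining lifespan (the infant-mortality effect), mid-life years each cost a bit under one year of expectation, and at very old ages the change per year survived shrinks toward zero.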
most history is done in a very humanitiespilled, academia flavored way. are there good examples of people doing very analytical, capital-intensive history research where the quality of the work is judged based on how successfully the resulting theories made good predictions/decisions?
scifi setting idea: movement from rural areas and small cities to larger cities continues until approximately everyone lives in one of like 10 different megacities; all of the farmland and oil fields and mines and whatnot in between are 99% roboticized, with only occasional human repairs; all of the cities are tightly connected by supersonic travel, which becomes more feasible because there are very few people on the ground outside cities to get annoyed by the noise; drugs solve sleep and allow effortless adaptation to jet lag. uniquely, SF neither expands to become a megacity, nor disappears into irrelevance; housing becomes so absurdly expensive that only the very best researchers and engineers can afford to live there, causing a huge selection effect towards talent density.
what is the current best scientific understanding of how bad ozone redistribution (less ozone in upper stratosphere, but more in lower stratosphere, with same overall amount) is compared to ozone disappearing entirely?
publicly registering a bet with Gabor Hollbeck:
i predict that the median CS postdoc will be publishing fewer than 100 papers a year 5 years from now. Gabor predicts otherwise.
before the result is known, both companies have positive expected value. if you are risk averse and prefer to immediately cash out instead of taking the gamble, someone out there will be willing to buy out your share of one of the companies for a small fee.
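a minimal sketch of this with made-up numbers (sqrt utility is an arbitrary concave stand-in for risk aversion, not anything canonical):

```python
import math

# made-up numbers: one of two rival companies will win a prize; before the
# result is known, a 50/50 share in either has positive expected value
p_win, prize = 0.5, 100.0
ev = p_win * prize                  # expected value of the gamble: 50.0

# a risk-averse holder (sqrt utility) compares keeping the gamble
# against a certain buyout offer slightly below expected value
buyout = 48.0
u_gamble = p_win * math.sqrt(prize) + (1 - p_win) * math.sqrt(0.0)  # 5.0
u_buyout = math.sqrt(buyout)                                        # ~6.93

# the buyout is below EV (that gap is the buyer's fee), yet the
# risk-averse seller still prefers the sure thing
assert buyout < ev and u_buyout > u_gamble
```

the gap between `ev` and `buyout` is the "small fee" the buyer collects for absorbing the risk; any price whose utility beats the gamble's clears the trade.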
in situations where there exists a fungible medium of exchange, and trades cannot occur by coercion, positive sum trades only happen if the losing participants are adequately compensated.
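a toy version with invented payoffs: the trade is positive sum overall, but one party loses unless the winner shares some of the surplus as a side payment.

```python
# hypothetical payoffs from one trade, before any compensation
surplus_a = +10.0   # party A gains from the trade
surplus_b = -3.0    # party B is made worse off by it

total = surplus_a + surplus_b       # +7.0: positive sum overall

# A pays B out of its gains; any payment between 3 and 10 leaves both ahead
side_payment = 4.0
a_net = surplus_a - side_payment    # +6.0
b_net = surplus_b + side_payment    # +1.0

# now both parties prefer the trade, so (absent coercion) it can happen
assert total > 0 and a_net > 0 and b_net > 0
```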
people say London is declining. but walking around, i see construction everywhere, and many new skyscrapers that i don’t remember seeing last time i visited 4 years ago.
traveling through Europe, looking out the window, and seeing the national flag flying next to the flag of the EU fills me with a strange feeling. this isn’t an original thought at all, but still: it’s really crazy that just 50 years ago Europe was divided by the iron curtain, and that people would have to go to insane lengths and risk their lives to get across that border; and that less than 100 years ago all of these countries were at war with each other, and had been at war on and off for centuries with ever shifting alliances and boundaries.
have you ever heard anyone make the argument that it’s good to have AI safety aligned frontier labs (including but not limited to Anthropic) because they will have a seat at the table with the regulators, and the regulators will take major industry players’ opinions more seriously than minor players or activists?
i’ve heard this argument but i’m trying to figure out if it’s common enough to be worth writing a post about
this would be very helpful! if someone has already done a high quality version of this experiment then i don’t need to do another one
i’m somewhat concerned that capsules are not super airtight, and the powder inside is also permeable to air.
survey: what brand of melatonin do you use? i want to run an experiment on melatonin degradation using the most popular brand of melatonin.
i already have the data on their political affiliations, and can do arbitrary analysis of it. what specific result did you want?
I’m reminded of an episode from the ozone hole saga: the original researchers who came up with the ozone depletion theory, Rowland and Molina, discovered a caveat to their theory that would imply the effects of CFCs would be much less than they initially expected. they felt compelled by professional honor to publish these results, even though they cut against their original theory. as expected, the publication of these results (and from the original authors, no less) gave the CFC industry plenty of ammunition to say “look, see, they were wrong all along, haha”. however, their commitment to publishing their best understanding also earned them a lot of respect, and many people who thought Rowland and Molina had already made up their minds to be anti-CFC came to think more highly of them. ultimately, further evidence swayed the consensus back in the direction that CFCs were in fact bad for ozone. if Rowland and Molina had tried to cover up their tentative negative results, the ensuing distrust probably would have poisoned trust in their results a lot (though it’s hard to evaluate this counterfactual)
(I’m working on a full length piece about the whole ozone hole saga, but this was so relevant that i felt a need to mention it.)