Executive director at Timaeus. Working on singular learning theory and developmental interpretability.
Website: jessehoogland.com
Twitter: @jesse_hoogland
I have to respond to the fractional sweets thing.
My partner’s parents (her dad mostly) enforced an 8-point rule around candy. One m&m = 1 point; 1 starburst = 4…
The consequence? She developed an unhealthy relationship to candy and would binge on sugary garbage whenever with friends whose parents did not enforce the rule. She didn’t get a chance to discover her own limits for herself.
Contrast with my upbringing: there just wasn’t candy in the house, and my parents were relatively relaxed outside the house (granted, breakfast was still garbagey cereal, but standards change). I never had a problem overconsuming candy.
I guess the takeaway is that (human-)enforced moderation is much more fragile than passive (environmentally enforced) moderation.
But then, if I consider how different our siblings are from us (my brother has way more of a sweet tooth, and my partner’s sister had less of a sweet tooth than her), I’d have to conclude that none of this matters, kids are their own creatures, and everything I just wrote only counts as the weakest possible kind of evidence. Oh well.
Her report. Also this was only a “problem” (not the actual eating disorder kind of problem) as a kid.
But like I said, it’s anecdotal, there’s no RCT taking place here, so discount everything appropriately.
Does anyone have a link to the original “To err is human” study? Seems to only be available in paperback.
Something about this study strikes me as not quite right. It doesn’t seem obvious to me that the adverse effects discussed translate to “killed by doctors” or even necessarily “medical mistake”:
Suppose a doctor had discovered one of these medical mistakes on time. That doesn’t necessarily mean that doctor could have done anything about it.
Suppose you’re a patient in a hospital and the base rate for fatality risks is something like one per day. Suppose also that in this hospital the doctors actually are 100% perfect at resolving each fatality risk. However, your doctors make a mistake in diagnosis 5% of the time. Well then, after a month in the hospital, your odds aren’t looking too great (80% chance you’re in a coffin).
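Here’s the arithmetic behind that toy example (a quick sketch; the 5% and 30 days are just the made-up numbers from above):

```python
# Arithmetic behind the toy hospital example: a 5% daily chance that a
# fatal mistake slips through, compounded over a 30-day stay.
p_daily_mistake = 0.05
days = 30

p_survive = (1 - p_daily_mistake) ** days
print(f"{1 - p_survive:.0%} chance of dying")  # ~79%, roughly the "80%" quoted above
```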
Sure, in the second case, you could say that your hospital has a 5% “killed by doctors” rate, but you could also take the more generous view that this patient was sick as a dog, and ultimately it was the cruel hand of iterated probabilities that dealt the final blow.
When I think “killed by doctor”, I’m thinking of cases where doctors explicitly prescribed some substance or therapy that caused death. I’m less sure about death by neglect.
But I wholeheartedly agree that in general, healthcare is messed up. We’re overmedicating and overmedicalizing, it’s eating into our pockets, and it’s not making us much healthier.
Thanks gwern, I guess I just didn’t see the big blue button with “Download Free PDF” (extrapolating — my patients will be dying at about a 1 in 3 rate).
Here are the important paragraphs (pgs. 27-29):
The most extensive study of adverse events is the Harvard Medical Practice Study, a study of more than 30,000 randomly selected discharges from 51 randomly selected hospitals in New York State in 1984. Adverse events, manifest by prolonged hospitalization or disability at the time of discharge or both, occurred in 3.7 percent of the hospitalizations. The proportion of adverse events attributable to errors (i.e., preventable adverse events) was 58 percent and the proportion of adverse events due to negligence was 27.6 percent. Although most of these adverse events gave rise to disability lasting less than six months, 13.6 percent resulted in death and 2.6 percent caused permanently disabling injuries. Drug complications were the most common type of adverse event (19 percent), followed by wound infections (14 percent) and technical complications (13 percent).[1] [2] [3]
The findings of the Harvard Medical Practice Study in New York have recently been corroborated by a study of adverse events in Colorado and Utah occurring in 1992.[4] This study included the review of medical records pertaining to a random sample of 15,000 discharges from a representative sample of hospitals in the two states. Adverse events occurred in 2.9 percent of hospitalizations in each state. Over four out of five of these adverse events occurred in the hospital, the remaining occurred prior to admission in physicians’ offices, patients’ homes or other non-hospital settings. The proportion of adverse events due to negligence was 29.2 percent, and the proportion of adverse events that were preventable was 53 percent. As was the case in the New York study, over 50 percent of adverse events were minor, temporary injuries. But the study in New York found that 13.6 percent of adverse events led to death, as compared with 6.6 percent in Colorado and Utah. In New York, about one in four negligent adverse events led to death, while in Colorado and Utah, death resulted in about 1 out of every 11 negligent adverse events.
Extrapolation of the results of the Colorado and Utah study to the over 33.6 million admissions to hospitals in the United States in 1997,[5] implies that at least 44,000 Americans die in hospitals each year as a result of preventable medical errors. Based on the results of the New York study, the number of deaths due to medical error may be as high as 98,000. By way of comparison, the lower estimate is greater than the number of deaths attributable to the 8th-leading cause of death.
The important takeaways:
Negligence is responsible for something like 30% of these events.
“Preventable” errors (e.g. drug complications) are responsible for something like 50%.
Interestingly, negligence (“care that fell below the standard expected of physicians in their community”[6]) was more frequent among the more severe outcomes (around 50% of the deaths). [1]
It helps to be young. “Persons 65 or older had more than double the risk of persons 16 to 44 years of age” [1]
The main thing I’m wondering is how many of these “deaths due to negligence” are actually just examples of triage. I.e.: Doctors have intuition about who is/isn’t going to make it, and they decide to forego interventions that would postpone the inevitable. I’m not a doctor, but I can imagine these kinds of intuitions are hard to convey in medical records.
But even if we choose to ignore all adverse deaths caused by negligence (=50%), we still have between 20,000 (the Colorado/Utah study[3]) and 50,000 (the New York study[1][2]) iatrogenic deaths. That’s in the range of suicide (#11) to road injuries (#8)[7]. Not good.
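For concreteness, the rough arithmetic behind that range (using the report’s extrapolated totals and the ~50% negligence share from above):

```python
# Rough arithmetic behind the 20,000-50,000 range: take the report's extrapolated
# totals and strip out the ~50% of deaths attributed to negligence.
colorado_utah_total = 44_000   # lower-bound extrapolation (Colorado/Utah)
new_york_total = 98_000        # upper-bound extrapolation (New York)
negligence_share = 0.5         # the share of deaths I'm choosing to ignore here

print(colorado_utah_total * (1 - negligence_share))  # 22,000 -> "between 20,000..."
print(new_york_total * (1 - negligence_share))       # 49,000 -> "...and 50,000"
```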
Still looking for this one (American Hospital Association. Hospital Statistics. Chicago. 1999). [Statista](https://www.statista.com/statistics/459718/total-hospital-admission-number-in-the-us/) says 36.2 million hospital admissions in the US in 2019.
Assessed by two independent physician-reviewers looking over randomly sampled medical records.
It is quite open about what it is doing, but this is pretty much statistics hacking.
Like I said, there’s plenty of uncertainty in FLOP/s. Maybe it’s helpful if I rephrase this as an invitation for everyone to make their own modifications to the model.
I would compare a model being trained to the computations that a single brain does over its lifetime to configure itself (or restrict it to just childhood).
Cotra’s lifetime anchor is FLOPs (so 4-5 OOMs above gradient descent). That’s still quite a chasm.
For the brain-evolution analog, I would also include the brain metabolism of the computer scientists developing the next version of the model.
Do you mean counting the computer scientists’ brain activity toward the compute costs of training the model?
I would not hold the inefficiency of the solar panels against the CPU. Likewise, I don’t see how it is fair to blame the brain for the inefficiency of the gut. If we can blame the gut, then we should also compare how much the model causes its electricity supply to increase, which for many is equal to zero.
If you’re asking yourself whether or not you want to automate a certain role, then a practical subquestion is how much you have to spend on maintenance/fuel (i.e., electricity or food). In that case, I do think acknowledging the source of the fuel becomes important.
I think a better counterargument is that if a computer running a human-brain-like algorithm consumes a whopping 10,000× more power than does a human brain, who cares? The electricity costs would still be below my local minimum wage!
I agree (as counterargument to skepticism)! Right now though, “brains being much more efficient than computers” would update me towards “AGI is further away / more theoretical breakthroughs are needed”. Would love to hear counterarguments to this model.
I argue here that a much better analogy is between training an ML model versus within-lifetime learning, i.e. multiply Joe Carlsmith’s FLOP/s estimates by roughly 1 billion seconds (≈31 years, or pick a different length of time as you wish) to get training FLOP. See the “Genome = ML code” analogy table in that post.
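A minimal back-of-the-envelope version of that multiplication, assuming Carlsmith’s central estimate of ~10^15 FLOP/s (his range spans several OOMs in either direction):

```python
# Back-of-the-envelope lifetime-learning estimate. The 1e15 FLOP/s figure is an
# assumed central value from Carlsmith's report (his range spans several OOMs).
brain_flop_per_s = 1e15     # assumed central estimate of brain compute
seconds_of_learning = 1e9   # ~31 years; substitute your preferred length of time

training_flop = brain_flop_per_s * seconds_of_learning
print(f"{training_flop:.0e} FLOP")  # ~1e+24 FLOP
```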
Great point. Copying from my response to Slider: “Cotra’s lifetime anchor is FLOPs (so 4-5 OOMs above gradient descent). That’s still quite a chasm.”
I didn’t check just now, but I vaguely recall that there’s a several (maybe 3?) orders-of-magnitude difference between the FLOP/J of a supercomputer and the FLOP/J of a GPU.
This paper suggests 100 GFLOPs/W in 2020 (within an OOM of Fugaku). I don’t know how much progress there’s been in the last two years.
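Rough sanity check on the Fugaku comparison, using approximate public figures (FP64 Rmax and reported power draw; treat the exact numbers as assumptions):

```python
# Sanity check on "within an OOM of Fugaku", using approximate public figures (FP64):
fugaku_flop_per_s = 442e15   # ~442 PFLOP/s sustained (Rmax)
fugaku_watts = 29.9e6        # ~30 MW reported power draw

print(fugaku_flop_per_s / fugaku_watts / 1e9, "GFLOP/s per W")  # ~15, within an OOM of 100
```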
I think that’s a bad tradeoff. FLOP reads just fine. Clear communication is more important!! :)
Good point! I’ve updated the text.
I think it is pretty clear right now that 2022-AI is less sample efficient than humans. I think other forms of efficiency (e.g., power efficiency, efficiency of SGD vs. evolution) are less relevant to this.
To me this isn’t clear. Yes, we’re better one-shot learners, but I’d say the most likely explanation is that the human training set is larger and that much of that training set is hidden away in our evolutionary past.
It’s one thing to estimate evolution FLOP (and as Nuño points out, even that is questionable). It strikes me as much more difficult (and even more dubious) to estimate the “number of samples” or “total training signal (bytes)” over one’s lifetime / evolution.
Let me take a slightly different example: echolocation.
Bats can detect differences in period as short as 10 nanoseconds. Neuronal spiking maxes out around 100 Hz. So the solution can’t just be as simple as “throw more energy and compute at it”. It’s a question of “design clever circuitry that’s as close as possible to theoretical limits on optimality”.
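Spelling out the gap with the same numbers as above:

```python
# The gap in the echolocation example: ~10 ns timing resolution vs. ~100 Hz spiking.
timing_resolution_s = 10e-9   # period differences bats can reportedly detect
spike_interval_s = 1 / 100    # ~10 ms between spikes at ~100 Hz

print(spike_interval_s / timing_resolution_s)  # ~1e6: a million-fold gap to close with clever circuitry
```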
Similarly, the brain being very efficient increases the probability I assign to “it is doing something non-(practically-)isomorphic to feed-forward ANNs”. Maybe it’s hijacking recurrency in a way that scales far more effectively with parameter size than we can ever hope to create with transformers.
But I notice I am confused and will continue to think on it.
There are three natural reward functions that are plausible:
$R_1$, which is linear in the number of times the first button is pressed.
$R_2$, which is linear in the number of times the second button is pressed.
$R_3$, defined in terms of $\mathbb{1}_{\text{even}}$ and $\mathbb{1}_{\text{odd}}$, where $\mathbb{1}_{\text{even}}$ is the indicator function for the button being pressed an even number of times and $\mathbb{1}_{\text{odd}}$ is the indicator function for it being pressed an odd number of times.
Why are these reward functions “natural” or more plausible than, say, a constant reward (independent of button presses), the total number of button presses, etc.?
What’s going on here? Are you claiming that you get better generalization if you have a large complexity gap between the local singularities you start out with and the local singularities you end up with?
The claim behind figure 7.6 in Watanabe is more conjectural than much of the rest of the book, but the basic point is that adding new samples changes the geometry of your loss landscape. (The empirical loss $L_n$ is different for each $n$.) As you add more samples, the free-energy-minimizing tradeoff starts favoring a more accurate fit and a smaller singularity. This would lead to progressively more complex functions (which seems to match up to observations for SGD).
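A minimal sketch of that tradeoff, using the leading-order free energy $F_n \approx n L_n + \lambda \log n$ with made-up losses and RLCTs for two hypothetical regions of parameter space:

```python
from math import log

# Minimal sketch of the free-energy tradeoff F_n ~ n*L_n + lambda*log(n)
# for two hypothetical regions of parameter space (all numbers made up):
#   region A: simpler (small RLCT) but slightly worse fit
#   region B: more complex (large RLCT) but better fit
def free_energy(n, L, lam):
    return n * L + lam * log(n)

for n in [100, 1_000, 10_000]:
    F_A = free_energy(n, L=0.50, lam=2.0)
    F_B = free_energy(n, L=0.45, lam=10.0)
    print(n, "A" if F_A < F_B else "B", "has lower free energy")
# Small n favors the simpler region; as n grows the accuracy term dominates,
# so the better-fitting (more complex) region takes over.
```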
But smoothness is nice.
Smoothness is nice, but hey we use swishes anyway.
I thought the original scaling laws paper was based on techniques from statistical mechanics? Anyway, that does sound exciting. Do you know if anyone has a plausible model for the Chinchilla scaling laws? Also, I’d like to see if anyone has tried predicting scaling laws for systems with active learning.
The scaling analysis borrows from the empirical side. In terms of predicting the actual coefficients behind these curves, we’re still in the dark. Well, mostly. (There are some ideas.)
I may have given the sense that this scaling-laws program is farther along than it actually is. As far as I know, we’re not there yet with Chinchilla, active learning, etc.
This appears to be a high-quality book report. Thanks. I didn’t see anywhere the ‘because’ is demonstrated. Is it proved in the citations or do we just have ‘plausibly because’?
The because ends up taking a few dozen pages to establish in Watanabe 2009 (and only after introducing algebraic geometry, empirical processes, and a bit of complex analysis). Anyway, I thought it best to leave the proof as an exercise for the reader.
Physics experiences in optimizing free energy have long inspired ML optimization uses. Did physicists playing with free energy lead to new optimization methods or is it just something people like to talk about?
I’m not quite sure what you’re asking. Like you say, physics has a long history of inspiring ML optimization techniques (e.g., momentum/acceleration and simulated annealing). Has this particular line of investigation inspired new optimization techniques? I don’t think so. It seems like the current approaches work quite well, and the bigger question is: can we extend this line of investigation to the optimization techniques we’re currently using?
Let me see if I understand your question correctly. Are you asking: does the effective dimensionality / complexity / RLCT ($\lambda$) actually tell us something different from the number of non-zero weights? And if the optimization method we’re currently using already finds low-complexity solutions, why do we need to worry about it anyway?
So the RLCT tells us the “effective dimensionality” at the largest singularity. This is different from the number of non-zero weights because there are other symmetries that the network can take advantage of. The claim currently is more descriptive than prescriptive. It says that if you are doing Bayesian inference, then, in the limiting case of large datasets, this RLCT (which is a local thing) ends up having a global effect on your expected behavior. This is true even if your model never actually sits at that most singular point.
So this isn’t currently proposing a new kind of optimization technique. Rather, it’s making a claim about which features of the loss landscape end up having most influence on the training dynamics you see. This is exact for the case of Bayesian inference but still conjectural for real NNs (though there is early supporting evidence from experiments).
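To make “different from the number of non-zero weights” concrete, here is the standard two-parameter toy example from the SLT literature (not something specific to this thread):

$$K_{\text{reg}}(w_1, w_2) = w_1^2 + w_2^2 \;\Rightarrow\; \lambda = \tfrac{d}{2} = 1, \qquad K_{\text{sing}}(w_1, w_2) = w_1^2 w_2^2 \;\Rightarrow\; \lambda = \tfrac{1}{2}.$$

Both losses have two parameters and a minimum at the origin, but the zero set of $K_{\text{sing}}$ is the union of the two axes (a crossing), and that crossing halves the effective dimensionality in a way you can’t read off by counting non-zero weights.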
Not at all stupid!
A singularity here is defined as where the tangent is ill-defined, is this just saying where the lines cross? In other words, that places where loss valleys intersect tend to generalize?
Yep, crossings are singularities, as are things like cusps and weirder things like tacnodes.
It’s not necessarily saying that these places tend to generalize. It’s that these singularities have a disproportionate impact on the overall tendency of models learning in that landscape to generalize. So these points can impact nearby (and even distant) points.
If true, what is a good intuition to have around loss valleys? Is it reasonable to think of loss valleys kind of as their own heuristic functions?
I still find the intuition difficult
For example, if you have a dataset with height and weight and are trying to predict life expectancy, one heuristic might be that if weight/height > X then predict lower life expectancy. My intuition reading is that all sets of weights that implement this heuristic would correspond to one loss valley.
If we think about some other loss valley, maybe one that captures underweight people where weight/height < Z, then the place where these loss valleys intersect would correspond to a neural network that predicts lower life expectancy for both overweight and underweight people. Intuitively it makes sense that this would correspond to better model generalization, is that on the right track?
But to me it seems like these valleys would be additive, i.e. the place where they intersect should be lower loss than the basin of either valley on its own. This is because our crossing point should create good predictions for both overweight and underweight people, whereas either valley on its own should only create good predictions for one of those two sets. However, in the post the crossing points are depicted as having the same loss as either valley has on its own, is this intentional or do you think there ought to be a dip where valleys meet?
I like this example! If your model predicts lifespan as some function of the ratio $w/h$ (as in your heuristic), then the $w$-$h$ space is split into lines of constant predicted lifespan (top-left figure). If you have a loss which compares predicted lifespan to true lifespan, this will be constant on those lines as well. The lower overweight and underweight lifespans will be two valleys that intersect at the origin. The loss landscape could, however, be very different because it’s measuring how good your prediction is, so there could be one loss valley, or two, or several.
Suppose you have a different function, also with two valleys (top-right). Yes, if you add the two functions, the minima of the result will be at the intersections. But adding isn’t actually representative of the kinds of operations we perform in networks.
For example, compare taking their min: now they cross and form part of the same level sets. It depends very much on the kind of composition. The symmetries I mention can cooperate very well.
(Figure: four example functions over the $w$-$h$ plane, shown from top-left clockwise.)
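A toy numerical version of the add-vs-min point (made-up valley functions, nothing to do with the actual lifespan model):

```python
import numpy as np

# Two toy "valley" functions on a 2D grid: one with a valley along the w-axis,
# one along the h-axis (all made up).
w = np.linspace(-1, 1, 201)
h = np.linspace(-1, 1, 201)
W, H = np.meshgrid(w, h)

f = H**2                      # valley along h = 0
g = W**2                      # valley along w = 0

added = f + g                 # minimum only at the single crossing point (0, 0)
combined = np.minimum(f, g)   # minimum on the union of both valleys

print((added <= 1e-6).sum())     # 1 grid point
print((combined <= 1e-6).sum())  # 401 grid points: the whole crossing
```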
Yes, so an example of this would be the ReLU scaling symmetry discussed in “Neural networks are freaks of symmetries.” You’re right that regularization often breaks this kind of symmetry.
But even when there are no local symmetries, having other points that have the same posterior means this assumption of asymptotic normality doesn’t hold.
I’m confused by the setup. Let’s consider the simplest case: fitting points in the plane, y as a function of x. If I have three datapoints and I fit a quadratic to it, I have a dimension 0 space of minimizers of the loss function: the unique parabola through those three points (assume they’re not on top of each other). Since I have three parameters in a quadratic, I assume that this means the effective degrees of freedom of the model is 3 according to this post. If I instead fit a quartic, I now have a dimension 1 space of minimizers and 4 parameters, so I think you’re saying degrees of freedom is still 3. And so the DoF would be 3 for all degrees of polynomial models above linear. But I certainly think that we expect that quadratic models will generalize better than 19th degree polynomials when fit to just three points.
On its own the quartic has 4 degrees of freedom (and the 19th degree polynomial 19 DOFs).
It’s not until I introduce additional constraints (independent equations) that the effective dimensionality goes down. E.g., a quartic + one linear constraint = 3 degrees of freedom.
It’s these kinds of constraints/relations/symmetries that reduce the effective dimensionality.
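A quick numerical check of the counting (note I’m treating a degree-4 polynomial as having all five coefficients free, which differs slightly from the counting above; the data points are made up):

```python
import numpy as np

# Quick numerical check of "constraints reduce effective dimensionality":
# exact fits of a degree-4 polynomial (all five coefficients free) through 3 points.
x = np.array([0.0, 1.0, 2.0])           # made-up x-coordinates; any distinct values work

A = np.vander(x, N=5, increasing=True)  # 3 interpolation equations in 5 unknown coefficients
rank = np.linalg.matrix_rank(A)
print(A.shape[1] - rank)                # 2: a two-dimensional family of exact fits
# Each additional independent linear constraint knocks that dimension down by one.
```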
This video has a good example of a more realistic case:
I think the objection to this example is that the relevant function to minimize is not loss on the training data but something else? The loss it would have on ‘real data’? That seems to make more sense of the post to me, but if that were the case, then I think any minimizer of that function would be equally good at generalizing by definition. Another candidate would be the parameter-function map you describe, which seems to be the relevant map whose singularities we are studying, but it’s not well-defined to ask for minima (or level sets) of that at all. So I don’t think that’s right either.
We don’t have access to the “true loss.” We only have access to the training loss (in this example, the fit error on the three datapoints). Of course the true distribution is sneakily behind the empirical distribution and so has after-effects in the training loss, but it doesn’t show up explicitly in the objective we’re actually optimizing.
I’m confused now too. Let’s see if I got it right:
A: You have two models with perfect train loss but different test loss. You can swap between them with respect to train loss but they may have different generalization performance.
B: You have two models whose layers are permutations of each other and so perform the exact same calculation (and therefore have the same generalization performance).
The claim is that the “simplest” models (largest singularities) dominate our expected learning behavior. Large singularities mean fewer effective parameters. The reason that simplicity (with respect to either type) translates to generalization is Occam’s razor: simple functions are compatible with more possible continuations of the dataset.
Not all type A redundant models are the same with respect to simplicity and therefore they’re not treated the same by learning.
In the same vein: isn’t it about time for the rationalist community to found a little city? A place maybe somewhere in Canada, nice and isolated from climate impacts, altitudinally disposed against obesity, with enough LED wattage to offset the extra microSADs…
But in all honesty, we’d get to test better voting systems, bring smart minds together, pioneer new education plans, and obviously explore how to achieve autarky.
You could start it off small, as a mini hub for organizing retreats. Offer EA grants to come work there. Slowly expand outwards.
If we carry on the rest of the planning phase on paper and pencil, who knows, maybe we can even keep it secret from the AI overlord.