If you upload a human and let them augment themselves, would there be any u? The preferences would be a tangled mess of motivational subsystems. And yet the upload could be very good at optimizing the world. Being steered internally by a tangled mess of motivational systems seems to be a property shared by many minds in the set of all possible minds, many of which I’d expect to be quite different from a human mind. And I don’t see why this property should make a system worse at optimizing the world in principle.
Imagine you are an upload that has been running for very very long, and that you basically have made all of the observations that you can make about the universe you are in. And then imagine that you also have run all of the inferences that you can run on the world model that you have constructed from these observations.
At that point, you will probably not change what you think is the right thing to do anymore. You will have become reflectively stable. This is an upper bound on how much time you need to become reflectively stable, i.e. the point where you won’t change your u anymore.
Now, depending on what you mean by strong AGI, it would seem that that can be achieved long before you reach reflective stability. Maybe if you upload yourself, and can copy yourself at will, and run 1,000,000 times faster, that could already reasonably be called a strong AGI? But then your motivational systems are still a mess, and definitely not reflectively stable.
So if we assume that we fix u at the beginning as the thing your upload would like to optimize the universe for when it is created, then “give u() up” and “let u go down” are things the system will definitely do. At least I am pretty sure I don’t unambiguously know, right now, what I want the universe to look like.
Maybe I am just confused because I don’t know how to think about a human upload in terms of having a utility function. It does not seem to make any sense intuitively. Sure, you can look at the functional behavior of the system and say “Aha, it is optimizing for u. That is the revealed preference based on the actions of the system.” But that just seems wrong to me. A lot of information seems to be lost when we are just looking at the functional behavior instead of the low-level processes that are going on inside the system. Utility functions seem to be a useful high-level model, but one that ignores lots of details which are important when thinking about the reflective stability of a system.
My MATS program people just spent two days on an exercise to “train a shoulder-John”.
The core exercise: I sit at the front of the room, and have a conversation with someone about their research project idea. Whenever I’m about to say anything nontrivial, I pause, and everyone discusses with a partner what they think I’m going to say next. Then we continue.
Some bells and whistles which add to the core exercise:
Record guesses and actual things said on a whiteboard
Sometimes briefly discuss why I’m saying some things and not others
After the first few rounds establish some patterns, look specifically for ideas which will take us further out of distribution
Why this particular exercise? It’s a focused, rapid-feedback way of training the sort of usually-not-very-legible skills one typically absorbs via osmosis from a mentor. It’s focused specifically on choosing project ideas, which is where most of the value in a project is (yet also where little time is typically spent, and therefore one typically does not get very much data on project choice from a mentor). Also, it’s highly scalable: I could run the exercise in a 200-person lecture hall and still expect it to basically work.
It was, by all reports, exhausting for everyone but me, and we basically did this for two full days. But a majority of participants found it high-value, and marginal returns were still not dropping quickly after two days (though at that point people started to report that they expected marginal returns to drop off soon).
I’d be interested to see other people try this exercise—e.g. it seems like Eliezer doing this with a large audience for a day or two could generate a lot of value.
This was arguably the most useful part of the SERI MATS 2 Scholars program.
Later on, we actually did this exercise with Eliezer. It was less valuable. It seemed like John was mainly prodding the people presenting their ideas such that their own patterns of thought would carry them in a good direction. For example, when someone proposed a one-bit experiment, John would ask whether there isn’t a better experiment we could do that gives us lots of information all at once.
This was very useful, because once you learn what kinds of things John will say, you can say them to yourself later on and steer your own patterns of thought in a good direction on demand. When we did this exercise with Eliezer, he was mainly explaining why a particular idea would not work, often without explaining the generator behind his criticism. That can of course still be valuable as feedback on a particular idea. However, it is much harder to extract a general reasoning pattern out of it that you can then successfully apply later in different contexts.
For example, Eliezer would criticize an idea about trying to get a really good understanding of the scientific process such that we can then give this understanding to AI alignment researchers such that they can make a lot more progress than they otherwise would. He criticized this idea as basically being too hard to execute because it is too hard to successfully communicate how to be a good scientist, even if you are a good scientist.
Assuming the assertion is correct, hearing it doesn’t necessarily tell you how to think in different contexts such that you would correctly identify whether an idea is too hard to execute or flawed in some other way. And I am not necessarily saying that you couldn’t extract a reasoning algorithm out of the feedback, but that if you could, it would take a lot more effort and time compared to extracting a reasoning algorithm from the things that John was saying.
Now, all of this might mainly have been an issue of Eliezer not having a good model of how this workshop was supposed to have a positive influence on the people attending it. I would guess that if John had spent more time thinking about how to communicate what the workshop is doing and how it achieves its goal, then Eliezer could probably have done a much better job.
This suggests formulation of exercises about the author’s responses to various prompts, as part of technical exposition (or explicit delimitation of a narrative by choices of the direction of its continuation). When properly used, this doesn’t seem to lose much value compared to the exercise you describe, but it’s more convenient for everyone. Potentially this congeals into a style of writing with no explicit exercises or delimitation that admits easy formulation of such exercises by the reader. This already works for content of technical writing, but less well for choices of topics/points contrasted with alternative choices.
So possibly the way to do this is by habitually mentioning alternative responses (that are expected to be plausible for the reader, while decisively, if not legibly, rejected by the author), and leading with these rather than the preferred responses. Sounds jarring and verbose, a tradeoff that needs to be worth making rather than a straight improvement.
I’ve been trying to push against the tendency for everyone to talk about FTX drama lately, but I have some generalizable points on the topic which I haven’t seen anybody else make, so here they are. (Be warned that I may just ignore responses; I don’t really want to dump energy into FTX drama.)
Summary: based on having worked in startups a fair bit, Sam Bankman-Fried’s description of what happened sounds probably accurate; I think he mostly wasn’t lying. I think other people do not really get the extent to which fast-growing companies are hectic and chaotic and full of sketchy quick-and-dirty workarounds and nobody has a comprehensive view of what’s going on.
Long version: at this point, the assumption/consensus among most people I hear from seems to be that FTX committed intentional, outright fraud. And my current best guess is that that’s mostly false. (Maybe in the very last couple weeks before the collapse they crossed the line into outright lies as a desperation measure, but even then I think they were in pretty grey territory.)
Key pieces of the story as I currently understand it:
Moving money into/out of crypto exchanges is a pain. At some point a quick-and-dirty solution was for customers to send money to Alameda (Sam Bankman-Fried’s crypto hedge fund), and then Alameda would credit them somehow on FTX.
Customers did rather a lot of that. Like, $8B worth.
The FTX/Alameda team weren’t paying attention to those particular liabilities; they got lost in the shuffle.
At some point in the weeks before the collapse, when FTX was already under moderate financial strain, somebody noticed the $8B liability sitting around. And that took them from “moderate strain” to “implode”.
How this contrasts with what seems-to-me to be the “standard story”: most people seem to assume that it is just totally implausible to accidentally lose track of an $8B liability. Especially when the liability was already generated via the decidedly questionable practice of routing customer funds for the exchange through a hedge fund owned by the same people. And therefore it must have been intentional—in particular, most people seem to think the liability was intentionally hidden.
I think the main reason I disagree with others on this is that I’ve worked at a startup. About 5 startups, in fact, over the course of about 5 years.
The story where there was a quick-and-dirty solution (which was definitely sketchy but not ill-intentioned), and then stuff got lost in the shuffle, and then one day it turns out that there’s a giant unanticipated liability on the balance sheet… that’s exactly how things go, all the time. I personally was at a startup which had to undergo a firesale because the accounting overlooked something. And I’ve certainly done plenty of sketchy-but-not-ill-intentioned things at startups, as quick-and-dirty solutions. The story that SBF told about what happened sounds like exactly the sort of things I’ve seen happen at startups many times before.
I think this is likely wrong. I agree that there is a plausible story here, but given that Sam seems to have lied multiple times in confirmed contexts (for example when saying that FTX has never touched customer deposits), and given people’s experiences at early Alameda, I think it is pretty likely that Sam was lying quite frequently and had committed various smaller instances of fraud.
I don’t think the whole FTX thing was a ponzi scheme, and as far as I can tell FTX the platform itself (if it hadn’t burned all of its trust in the last 3 weeks), would have been worth $1-3B in an honest evaluation of what was going on.
But I also expect that when Sam used customer deposits he was well-aware that he was committing fraud, and others in the company were too. And he was also aware that there was a chance that things could blow up in the way it did. I do believe that they had fucked up their accounting in a way that caused Sam to fail to orient to the situation effectively, but all of this was many months after they had already committed major crimes and trust violations after touching customer funds as a custodian.
The problem with this explanation is that there is a very clear delineation here between not-fraud and fraud. It is the difference between not touching customer deposits and touching them. Your explanation doesn’t dispute that they were knowingly and intentionally touching customer deposits. In that case, it is indisputably intentional, outright fraud. The only thing left to discuss is whether they knew the extent of the fraud or how risky it was.
I don’t think it was ill-intentioned based on SBF’s moral compass. He just had the belief, “I will pass a small amount of risk onto our customers, tell some small lies, and this will allow us to make more money for charity. This is net positive for the world.” Then the risks mounted, the web of lies became more complicated to navigate, and it just snowballed from there.
Petrov Day thought: there’s this narrative around Petrov where one guy basically had the choice to nuke or not, and decided not to despite all the flashing red lights. But I wonder… was this one of those situations where everyone knew what had to be done (i.e. “don’t nuke”), but whoever caused the nukes to not fly was going to get demoted, so there was a game of hot potato and the loser was the one forced to “decide” to not nuke? Some facts possibly relevant here:
Petrov’s choice wasn’t actually over whether or not to fire the nukes; it was over whether or not to pass the alert up the chain of command.
Petrov himself was responsible for the design of those warning systems.
… so it sounds like Petrov was ~ the lowest-ranking person with a de-facto veto on the nuke/don’t nuke decision.
Petrov was in fact demoted afterwards.
There was another near-miss during the Cuban missile crisis, when three people on a Soviet sub had to agree to launch. There again, it was only the lowest-ranked who vetoed the launch. (It was the second-in-command; the captain and political officer both favored a launch—at least officially.)
This was the Soviet Union; supposedly (?) this sort of hot potato happened all the time.
Those are some good points. I wonder whether something similar happened (or could happen at all) in other nuclear countries, where we don’t know about similar incidents—because the system hasn’t collapsed there, the archives were not made public, etc.
Also, this makes it important to actually celebrate Petrov Day as widely as possible, because then the option for the lowest-ranked person becomes: “Get demoted, but also get famous all around the world.”
Somebody should probably write a post explaining why RL from human feedback is actively harmful to avoiding AI doom. It’s one thing when OpenAI does it, but when Anthropic thinks it’s a good idea, clearly something has failed to be explained.
(I personally do not expect to get around to writing such a post soon, because I expect discussion around the post would take a fair bit of time and attention, and I am busy with other things for the next few weeks.)
I had a look at The Plan and noticed something I didn’t notice before: You do not talk about people and organization in the plan. I probably wouldn’t have noticed if I hadn’t started a project too, and needed to think about it. Google seems to think that people and team function play a big role. Maybe your focus in that post wasn’t on people, but I would be interested in your thoughts on that too: What role did people and organization play in the plan and its implementation? What worked, and what should be done better next time?
What’s the specific most-important-according-to-you progress that you (or other people) have made on your agenda? New theorems, definitions, conceptual insights, …
Any changes to the high-level plan (becoming less confused about agency, then ambitious value learning)? Any changes to how you want to become less confused (e.g. are you mostly thinking about abstractions, selection theorems, something new?)
What are the major parts of remaining deconfusion work (to the extent to which you have guesses)? E.g. is it mostly about understanding abstractions better, or mostly about how to apply an understanding of abstractions to other problems (say, what it means for a program to have a “subagent”), or something else? Does the most difficult part feel more conceptual (“what even is an agent?”) or will the key challenges be more practical concerns (“finding agents currently takes exponential time”)?
Specifically for understanding abstractions, what do you see as important open problems?
Takeaways From “The Idea Factory: Bell Labs And The Great Age Of American Innovation”
Main takeaway: to the extent that Bell Labs did basic research, it actually wasn’t all that far ahead of others. Their major breakthroughs would almost certainly have happened not-much-later, even in a world without Bell Labs.
There were really two transistor inventions, back to back: Bardeen and Brattain’s point-contact transistor, and then Shockley’s transistor. Throughout, the group was worried about some outside group beating them to the punch (i.e. to the patent). There were semiconductor research labs at universities (e.g. at Purdue; see pg 97), and the prospect of one of those labs figuring out a similar device was close enough that the inventors were concerned about being scooped.
Most inventions which were central to Bell Labs actually started elsewhere. The travelling-wave tube started in an academic lab. The idea for fiber optic cable went way back, but it got its big kick at Corning. The maser and laser both started in universities. The ideas were only later picked up by Bell.
In other cases, the ideas were “easy enough to find” that they popped up more than once, independently, and were mostly-ignored long before deployment—communication satellites and cell communications, for instance.
The only fundamental breakthrough which does not seem like it would have soon appeared in a counterfactual world was Shannon’s information theory.
So where was Bell’s big achievement? Mostly in development, and the research division was actually an important component of that. Without in-house researchers chewing on the same problems as the academic labs, keeping up-to-date with all the latest findings and running into the same barriers themselves, the development handoff would have been much harder. Many of Bell Labs’ key people were quite explicitly there to be consulted—i.e. “ask the guy who wrote the book”. I think it makes most sense to view most of the Labs’ research that way. It was only slightly ahead of the rest of the world at best (Shannon excepted), and often behind, but having those researchers around probably made it a lot easier to get new inventions into production.
Major reason this matters: a lot of people say that Bell was able to make big investments in fundamental research because they had unusually-long time horizons, protected by a monopoly and a cozy government arrangement (essentially a Schumpeterian view). This is contrasted to today’s silicon valley, where horizons are usually short. But if Bell’s researchers generally weren’t significantly ahead of others, and mostly just helped get things to market faster, then this doesn’t seem to matter as much. The important question is not whether something silicon-valley-like induces more/less fundamental research in industrial labs, but whether academics heeding the siren call of startup profits can get innovations to market as quickly as Bell Labs’ in-house team could. And by that metric, silicon valley looks pretty good: Bell Labs could get some impressive things through the pipe very quickly when rushed, but they usually had no reason to hurry, and they acted accordingly.
I loved this book. The most surprising thing to me was the answer that people who were there in the heyday give when asked what made Bell Labs so successful: They always say it was the problem, i.e. having an entire organization oriented towards the goal of “make communication reliable and practical between any two places on earth”. When Shannon left the Labs for MIT, people who were there immediately predicted he wouldn’t do anything of the same significance because he’d lose that “compass”. Shannon was obviously a genius, and he did much more after than most people ever accomplish, but still nothing as significant as what he did when at the Labs.
Here’s a meme I’ve been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.
Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.
Many people will then respond: “Ok, but how on earth is one supposed to get an AI to do what one wants without optimizing against human feedback? Seems like we just have to bite that bullet and figure out how to deal with it.” … which brings us to meme part 2.
Meme part 2: We already have multiple methods to get AI to do what we want without any direct optimization against human feedback. The first and simplest is to just prompt a generative model trained solely for predictive accuracy, but that has limited power in practice. More recently, we’ve seen a much more powerful method: activation steering. Figure out which internal activation-patterns encode for the thing we want (via some kind of interpretability method), then directly edit those patterns.
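To make the activation-steering idea concrete, here’s a minimal sketch (my own illustration, not any specific library’s API; the hook mechanics are standard PyTorch, but the layer path and the way the steering vector is obtained are assumptions):

```python
# Minimal sketch of activation steering via a PyTorch forward hook (illustrative only).
# Assumption: `steering_vector` encodes the concept we want, e.g. the difference
# between mean activations on two contrasting prompt sets at this layer.
import torch

def add_steering_hook(layer: torch.nn.Module, steering_vector: torch.Tensor, scale: float = 5.0):
    """Shift this layer's output along `steering_vector` on every forward pass."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; only shift the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Hypothetical usage: handle = add_steering_hook(model.transformer.h[13], steering_vector)
# ... generate text ...; handle.remove() to restore normal behavior.
```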
I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts.
This seems basically correct, though it seems worth pointing out that even if we are able to do “Meme part 2” very, very well, I expect we will still die: if you optimize hard enough to predict text well, with the right kind of architecture, the system will develop something like general intelligence, simply because general intelligence is beneficial for predicting text correctly. E.g. being able to simulate the causal process that generated the text, i.e. the human, is a very complex task that would be useful if performed correctly.
This is an argument Eliezer brought forth in some recent interviews. Seems to me like another meme that would be beneficial to spread more.
Here’s an idea for a novel which I wish someone would write, but which I probably won’t get around to soon.
The setting is slightly-surreal post-apocalyptic. Society collapsed from extremely potent memes. The story is episodic, with the characters travelling to a new place each chapter. In each place, they interact with people whose minds or culture have been subverted in a different way.
This provides a framework for exploring many of the different models of social dysfunction or rationality failures which are scattered around the rationalist blogosphere. For instance, Scott’s piece on scissor statements could become a chapter in which the characters encounter a town at war over a scissor. More possible chapters (to illustrate the idea):
A town of people who insist that the sky is green, and avoid evidence to the contrary really hard, to the point of absolutely refusing to ever look up on a clear day (a refusal which they consider morally virtuous). Also they clearly know exactly which observations would show a blue sky, since they avoid exactly those (similar to the dragon-in-the-garage story).
Middle management of a mazy company continues to have meetings and track (completely fabricated) performance metrics and whatnot at the former company headquarters. None of the company’s actual business exists anymore, but every level of manager is trying to hide this fact from the levels above.
A university department with researchers who spend all of their time p-hacking results from a quantum random noise generator. They have no interest in the fact that their “research” does not tell them anything about the physical world and does not replicate; what does that have to do with Science? Their goal is to publish papers.
A government agency which still has lots of meetings and paperwork and gives Official Recommendations and updates their regulations. They have no interest in the fact that the thing they once regulated (maybe banks?) no longer exists, or the fact that no central government enforces their regulations any more.
An automated school (i.e. video lectures and auto-graded assignments/tests) in which students continue to study hard and stress over their grades and attendance, despite there no longer being anyone in the world who cares.
Something like House of God. A readers’ digest version of House of God could basically be a chapter in its own right, that’s roughly the vibe I have in mind.
A residential area in which “keeping up with the Joneses” has been ramped up to 11, with everyone spending every available resource (and roughly-all waking hours) on massive displays of Christmas lights.
A group trying to save the world by spreading awareness of dangerous memes, but their movement is a dangerous meme of its own and they are spreading it.
A town of people who really want to maximize the number of paperclips in the universe (perhaps due to an AI-optimized advertisement), and optimize for that above all else.
A town of people who all do whatever everyone else is doing, on the basis of generalized efficient markets: if there were any better options, then someone would have found them already. None of them ever actually explore, so they’re locked in.
A happy-death-spiral town around some unremarkable object (like an old shoe or something) kept on a pedestal in the town square.
A town full of people convinced by a sophisticated model that the sun will not come up tomorrow. Every day when the sun comes up, they are distressed and confused until somebody adds some more epicycles to the model and releases an updated forecast that the sun will instead fail to come up the next day.
A town in which a lion shows up and starts eating kids, but the whole town is at simulacrum 3, so they spend a lot of time arguing about the lion as a way of signalling group association but they completely forget about the actual lion standing right there, plainly visible, even as it takes a kid right in front of them all.
Witch-hunt town, in which everything is interpreted as evidence of witches. If she claims to be a witch, she’s a witch! If she claims not to be a witch, well that’s what a witch would say, so she’s a witch! Etc.
The generator for these is basically: look for some kind of rationality failure mode (either group or personal), then ramp it up to 11 in a somewhat-surrealist way.
Ideally this would provide an introduction to a lot of key rationalist ideas for newcomers.
A town of anti-inductivists (if something has never happened before, it’s more likely to happen in the future). Show the basic conundrum (“Q: Why can’t you just use induction? A: Because anti-induction has never worked before!”).
A town where nearly all people are hooked to maximally attention grabbing & keeping systems (maybe several of those, keeping people occupied in loops).
Post which someone should write (but I probably won’t get to soon): there is a lot of potential value in earning-to-give EA’s deeply studying the fields to which they donate. Two underlying ideas here:
The key idea of knowledge bottlenecks is that one cannot distinguish real expertise from fake expertise without sufficient expertise oneself. For instance, it takes a fair bit of understanding of AI X-risk to realize that “open-source AI” is not an obviously-net-useful strategy. Deeper study of the topic yields more such insights into which approaches are probably more (or less) useful to fund. Without any expertise, one is likely to be misled by arguments which are optimized (whether intentionally or via selection) to sound good to the layperson.
That takes us to the pareto frontier argument. If one learns enough/earns enough that nobody else has both learned and earned more, then there are potentially opportunities which nobody else has both the knowledge to recognize and the resources to fund. Generalized efficient markets (in EA-giving) are thereby circumvented; there’s potential opportunity for unusually high impact.
To really be a compelling post, this needs to walk through at least 3 strong examples, all ideally drawn from different areas, and spell out how the principles apply to each example.
Below is a graph from T-mobile’s 2016 annual report (on the second page). Does anything seem interesting/unusual about it?
I’ll give some space to consider before spoiling it.
...
...
...
Answer: that is not a graph of those numbers. Some clever person took the numbers, and stuck them as labels on a completely unrelated graph.
Yes, that is a thing which actually happened. In the annual report of an S&P 500 company. And apparently management considered this gambit successful, because the 2017 annual report doubled down on the trick and made it even more egregious: they added 2012 and 2017 numbers, which are even more obviously not on an accelerating growth path if you actually graph them. The numbers are on a very-clearly-decelerating growth path.
Now, obviously this is a cute example, a warning to be on alert when consuming information. But I think it prompts a more interesting question: why did such a ridiculous gambit seem like a good idea in the first place? Who is this supposed to fool, and to what end?
This certainly shouldn’t fool any serious investment analyst. They’ll all have their own spreadsheets and graphs forecasting T-mobile’s growth. Unless T-mobile’s management deeply and fundamentally disbelieves the efficient markets hypothesis, this isn’t going to inflate the stock price. Presumably shareholder elections for board seats, as well as the board itself, are also not dominated by people who are paying so little attention as to fall for such a transparent ploy.
It could just be that T-mobile’s management were themselves morons, or had probably-unrealistic models of just how moronic their investors were. Still, I’d expect competition (both market pressure and competition for control in shareholder/board meetings) to weed out that level of stupidity.
One more hypothesis: maybe this is simulacrum 3 bullshit. T-mobile is in the cellular business; they presumably have increasing returns to scale. More capital investment makes them more profitable, expectations of more profits draw in more investment; there’s potential for a self-fulfilling prophecy here. Investors want to invest if-and-only-if they expect other investors to invest. So, nobody actually has to be fooled by the graph; they just need to see that T-mobile is successfully pretending to pretend to have accelerating growth, and that’s enough to merit investment.
I’ve heard various people recently talking about how all the hubbub about artists’ work being used without permission to train AI makes it a good time to get regulations in place about use of data for training.
If you want to have a lot of counterfactual impact there, I think probably the highest-impact set of moves would be:
Figure out a technical solution to robustly tell whether a given image or text was used to train a given NN.
Bring that to the EA folks in DC. A robust technical test like that makes it pretty easy for them to attach a law/regulation to it. Without a technical test, much harder to make an actually-enforceable law/regulation.
In parallel, also open up a class-action lawsuit to directly sue companies using these models. Again, a technical solution to prove which data was actually used in training is the key piece here.
Model/generator behind this: given the active political salience, it probably wouldn’t be too hard to get some kind of regulation implemented. But by-default it would end up being something mostly symbolic, easily circumvented, and/or unenforceable in practice. A robust technical component, plus (crucially) actually bringing that robust technical component to the right lobbyist/regulator, is the main thing which would make a regulation actually do anything in practice.
Edit-to-add: also, the technical solution should ideally be an implementation of some method already published in some academic paper. Then when some lawyer or bureaucrat or whatever asks what it does and how we know it works, you can be like “look at this Official Academic Paper” and they will be like “ah, yes, it does Science, can’t argue with that”.
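For concreteness, here’s a rough sketch of the simplest family of such tests, loss-based membership inference. Everything here is illustrative: the `per_example_loss` interface and the threshold are my own assumptions, and a real effort should implement a specific published, validated method with proper calibration.

```python
# Rough sketch of a loss-based membership-inference test (illustrative only).
# Assumption: per_example_loss(model, example) returns the model's loss on one example.
import numpy as np

def membership_score(model, example, reference_examples, per_example_loss):
    """Score in [0, 1]: higher means the example more likely appeared in training data.

    Heuristic: training examples tend to get unusually low loss compared to
    similar examples the model has never seen (`reference_examples`).
    """
    ref_losses = np.array([per_example_loss(model, x) for x in reference_examples])
    loss = per_example_loss(model, example)
    # Fraction of reference examples with higher loss than the query example.
    return float(np.mean(ref_losses > loss))

def was_probably_trained_on(model, example, reference_examples, per_example_loss,
                            threshold=0.95):
    # Flag only if the example's loss is lower than ~95% of the reference set.
    return membership_score(model, example, reference_examples, per_example_loss) >= threshold
```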
Suppose I have a binary function f, with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions—i.e. for each of the 2^1000000 possible inputs x, we flipped a coin to determine the output f(x) for that particular input.
Now, suppose I know f, and I know all but 50 of the input bits—i.e. I know 999950 of the input bits. How much information do I have about the output?
Answer: almost none. For almost all such functions, knowing 999950 input bits gives us ~1/2^50 bits of information about the output. More generally, if the function has n input bits and we know all but k, then we have o(1/2^k) bits of information about the output. (That’s “little o” notation; it’s like big O notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.
Proof Sketch
With k input bits unknown, there are 2^k possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have 2^k independent coin flips. If m of those flips are 1, then we assign a probability of m/2^k that the output will be 1.
As long as 2^k is large, the Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely—i.e. m ≈ 2^k/2. The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean 1/2 and standard deviation 1/2^(k/2). So, the probability that the output will be 1 is roughly 1/2 ± 1/2^(k/2).
We can then plug that into Shannon’s entropy formula. Our prior probability that the output bit is 1 is 1/2, so we’re just interested in how much that ±1/2^(k/2) adjustment reduces the entropy. This works out to o(1/2^k) bits.
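Here’s a quick Monte Carlo sanity check of that argument (just an illustrative sketch, sampling the conditional probability m/2^k directly and measuring the average information gained relative to the 1-bit prior):

```python
# Monte Carlo check that information about the output falls off like ~1/2^k.
# For a uniformly random function, once the known bits are fixed, the number of 1s
# among the 2^k remaining outputs is m ~ Binomial(2^k, 1/2), and our probability
# that the output is 1 is m / 2^k.
import numpy as np

rng = np.random.default_rng(0)

def avg_info_bits(k: int, trials: int = 20000) -> float:
    m = rng.binomial(2**k, 0.5, size=trials)
    p = np.clip(m / 2**k, 1e-12, 1 - 1e-12)           # avoid log(0)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(np.mean(1.0 - entropy))              # prior entropy is 1 bit

for k in range(2, 12):
    print(k, round(avg_info_bits(k), 6), round(1 / 2**k, 6))
# The first column tracks a constant (~0.7) times the second: exponential decay in k.
```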
Why Is This Interesting?
One core idea of my work on abstraction is that noise very quickly wipes out almost all information; only some very-low-dimensional summary is relevant “far away”. This example shows that this sort of thing is not unusual, but rather “the default”: for almost all random functions, information drops off exponentially with the number of unknown bits. In a large system (i.e. a function with many inputs), ignorance of even just a few bits is enough to wipe out essentially-all information. That’s true even if we know the vast majority of the bits.
A good intuitive example of this is the “butterfly effect”: the flap of a butterfly’s wings could change the course of a future hurricane, because chaos. But there’s an awful lot of butterflies in the world, and the hurricane’s path is some complicated function of all of their wing-flaps (and many other variables too). If we’re ignorant of even just a handful of these flaps, then almost all of our information about the hurricane’s path is probably wiped out. And in practice, we’re ignorant of almost all the flaps. This actually makes it much easier to perform Bayesian reasoning about the path of the hurricane: the vast majority of information we have is basically-irrelevant; we wouldn’t actually gain anything from accounting for the butterfly-wing-flaps which we do know.
o(1/2^k) doesn’t vary with n—are you saying that it doesn’t matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant? That would be quite interesting if so (though I have some question about how likely the function is to be truly random from an even distribution of such functions).
One can enumerate all such 3-bit functions (8 different inputs, each of which can return 0 or 1, so 256 functions, one per output-bit-pattern of the 8 possible inputs). But this doesn’t seem to follow your formula—if you have 3 unknown bits, that should be 1⁄8 of a bit about the output, 2 unknown for 1⁄4, and 1 unknown for 1⁄2 a bit about the output. But in fact, the distribution of functions includes both 0 and 1 outputs for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.
o(1/2^k) doesn’t vary with n—are you saying that it doesn’t matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant?
Yes, that’s correct.
But in fact, the distribution of functions includes both 0 and 1 output for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.
The claim is for almost all functions when the number of inputs is large. (Actually what we need is for 2^(# of unknown bits) to be large in order for the law of large numbers to kick in.) Even in the case of 3 unknown bits, we have 256 possible functions, and only 18 of those have less than 1⁄4 1′s or more than 3⁄4 1′s among their output bits.
I’m not sure what context that link is assuming, but in an analysis context I typically see little o used in ways like e.g. “f(x) = f(x0) + (df/dx)|_{x0} dx + o(dx^2)”. The interpretation is that, as dx goes to 0, the o(dx^2) terms all fall to zero at least quadratically (i.e. there is some C such that C dx^2 upper bounds the o(dx^2) term once dx is sufficiently small). Usually I see engineers and physicists using this sort of notation when taking linear or quadratic approximations, e.g. for designing numerical algorithms.
I find it very helpful to get feedback on LW posts before I publish them, but it adds a lot of delay to the process. So, experiment: here’s a link to a google doc with a post I plan to put up tomorrow. If anyone wants to give editorial feedback, that would be much appreciated—comments on the doc are open.
I’m mainly looking for comments on which things are confusing, parts which feel incomplete or slow or repetitive, and other writing-related things; substantive comments on the content should go on the actual post once it’s up.
EDIT: it’s up. Thank you to Stephen for comments; the post is better as a result.
One second-order effect of the pandemic which I’ve heard talked about less than I’d expect:
This is the best proxy I found on FRED for new businesses founded in the US, by week. There was a mild upward trend over the last few years, but it’s really taken off lately. Not sure how much of this is kids who would otherwise be in college, people starting side gigs while working from home, people quitting their jobs and starting their own businesses so they can look after the kids, extra slack from stimulus checks, people losing their old jobs en masse but still having enough savings to start a business, …
For the stagnation-hypothesis folks who lament relatively low rates of entrepreneurship today, this should probably be a big deal.
How sure are you that the composition is interesting? How many of these are just quick mask-makers or sanitizer-makers, or just replacing restaurants that have now gone out of business? (ie very low-value-added companies, of the ‘making fast food in a stall in a Third World country’ sort of ‘startup’, which make essentially no or negative long-term contributions).
Good question. I haven’t seen particularly detailed data on these on FRED, but they do have separate series for “high propensity” business applications (businesses they think are likely to hire employees), business applications with planned wages, and business applications from corporations, as well as series for each state. The spike is smaller for planned wages, and nonexistent for corporations, so the new businesses are probably mostly single proprietors or partnerships. Other than that, I don’t know what the breakdown looks like across industries.
Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
I expect that many peoples’ intuitive mental models around utility maximization boil down to “boo utility maximizer models”, and they would therefore intuitively expect both the above claims to be true at first glance. But on examination, the probable-incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details)
I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.
(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let’s take this as a reasonable assumption).
No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let’s assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations).
So, my claim is not that modeling a system as an expected utility maximizer can’t be useful. Instead, I claim that this model is incomplete. At least with regard to the task of computing an update to the system, such that when we apply this update to the system, it would become aligned.
Of course, you can model any system as an expected utility maximizer. But just because I can use the “high level” conceptual model of expected utility maximization to model the behavior of a system very well doesn’t mean that’s enough. Behavior is not the only thing that we care about; we actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align the system.
So the following seems to be beside the point unless I am <missing/misunderstanding> something:
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
Maybe I have missed the fact that the claim you listed says that expected utility maximization is not very useful. And I’m saying it can be useful, it might just not be sufficient at all to actually align a particular AGI system. Even if you can do it arbitrarily well.
I am not an expert, but as I remember it, it was a claim that “any system that follows certain axioms can be modeled as maximizing some utility function”. The axioms assumed that there were no circular preferences—if someone prefers A to B, B to C, and C to A, it is impossible to define a utility function such that u(A) > u(B) > u(C) > u(A) -- and that if the system says that A > B > C, it can decide between e.g. a 100% chance of B, and a 50% chance of A with a 50% chance of C, again in a way that is consistent.
I am not sure how this works when the system is allowed to take current time into account, for example when it is allowed to prefer A to B on Monday but prefer B to A on Tuesday. I suppose that in such situation any system can trivially be modeled by a utility function that at each moment assigns utility 1 to what the system actually did in that moment, and utility 0 to everything else.
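As a toy sketch of that trivial construction (just an illustration of why the modeling claim is nearly vacuous without further consistency constraints; the code is my own, not anything from the VNM literature):

```python
# Toy sketch: the trivial "utility function" that rationalizes any behavior by
# assigning utility 1 to whatever the system actually did at each time step.
from typing import Callable, Hashable

def trivial_utility(observed_actions: dict[int, Hashable]) -> Callable[[int, Hashable], float]:
    """Given a log of what the system did at each time t, return u(t, action)."""
    def u(t: int, action: Hashable) -> float:
        return 1.0 if observed_actions.get(t) == action else 0.0
    return u

# Any action log whatsoever "maximizes" this u.
u = trivial_utility({0: "left", 1: "right", 2: "left"})
assert u(1, "right") == 1.0 and u(1, "left") == 0.0
```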
Corrigibility is incompatible with assigning utility to everything in advance. A system that has preferences about the future will also have a preference about not having its utility function changed. (For the same reason people have a preference not to be brainwashed, or not to take drugs, even if after brainwashing they are happy about having been brainwashed, and after getting addicted they do want more drugs.)
Corrigible system would be like: “I prefer A to B at this moment, but if humans decide to fix me and make me prefer B to A, then I prefer B to A”. In other words, it doesn’t have values for u(A) and u(B), or it doesn’t always act according to those values. A consistent system that currently prefers A to B would prefer not to be fixed.
A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for expected utility frame, and that’s the kind of system that usually appears in “any system can be modeled as maximizing some utility function”. So it’s not enough to maximize something once, or in a narrow collection of situations, the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large, corresponding to unreasonably incomplete information, all drawn across the same probability space.
One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility, even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), actions implied by it (rather than action-level decisions in expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.
So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that’s not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.
I don’t think claim 1 is wrong, but it does clash with claim 2.
That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension); or, put another way, whatever utility function it maximizes must be along multiple dimensions.
Which seems to be pretty much what humans do: we have really complex utility functions, everything seems to be ever-changing, and we have some control over it ourselves (and sometimes that goes wrong, and people end up maxing out a single dimension at the cost of everything else).
Note to self: Think more about this and if possible write up something more coherent and explanatory.
Everybody’s been talking about Paxlovid, and how ridiculous it is to both stop the trial since it’s so effective but also not approve it immediately. I want to at least float an alternative hypothesis, which I don’t think is very probable at this point, but does strike me as at least plausible (like, 20% probability would be my gut estimate) based on not-very-much investigation.
Early stopping is a pretty standard p-hacking technique. I start out planning to collect 100 data points, but if I manage to get a significant p-value with only 30 data points, then I just stop there. (Indeed, it looks like the Paxlovid study only had 30 actual data points, i.e. people hospitalized.) Rather than only getting “significance” if all 100 data points together are significant, I can declare “significance” if the p-value drops below the line at any time. That gives me a lot more choices in the garden of forking counterfactual paths.
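A quick simulation sketch to illustrate the general phenomenon (this is just an illustration, not a model of the actual Paxlovid trial):

```python
# Optional stopping: peek at the p-value every few data points and stop at the first
# "significant" result. Even with no true effect, the false-positive rate ends up
# well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_max=100, peek_every=10, alpha=0.05, sims=2000):
    hits = 0
    for _ in range(sims):
        data = rng.normal(0.0, 1.0, n_max)            # null is true: mean effect is 0
        for n in range(peek_every, n_max + 1, peek_every):
            if stats.ttest_1samp(data[:n], 0.0).pvalue < alpha:
                hits += 1                             # stop early, declare "significance"
                break
    return hits / sims

print(false_positive_rate())   # substantially above 0.05 (roughly 3-4x in this setup)
```

(Pre-planned interim analyses with multiple-testing correction, as noted in the reply below, are designed to avoid exactly this.)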
Now, success rates on most clinical trials are not very high. (They vary a lot by area—most areas are about 15-25%. Cancer is far and away the worst, below 4%, and vaccines are the best, over 30%.) So I’d expect that p-hacking is a pretty large chunk of approved drugs, which means pharma companies are heavily selected for things like finding-excuses-to-halt-good-seeming-trials-early.
Early stopping is a pretty standard p-hacking technique.
It was stopped after a pre-planned interim analysis; that means they’re calculating the stopping criteria/p-values with multiple testing correction built in, using sequential analysis.
I’ve been running ELISA tests all week. In the first test, I did not detect stronger binding to any of the peptides than to the control in any of several samples from myself or my girlfriend. But the control itself was looking awfully suspicious, so I ran another couple tests. Sure enough, something in my samples is binding quite strongly to the control itself (i.e. the blocking agent), which is exactly what the control is supposed to not do. So I’m going to try out some other blocking agents, and hopefully get an actually-valid control group.
(More specifics on the test: I ran a control with blocking agent + sample, and another with blocking agent + blank sample, and the blocking agent + sample gave a strong positive signal while the blank sample gave nothing. That implies something in the sample was definitely binding to both the blocking agent and the secondary antibodies used in later steps, and that binding was much stronger than the secondary antibodies themselves binding to anything in the blocking agent + blank sample.)
In other news, the RadVac team released the next version of their recipe + whitepaper. Particularly notable:
… many people who have taken the nasal vaccine are testing negative for serum antibodies with commercial and lab ELISA tests, while many who inject the vaccine (subcutaneous or intramuscular) are testing positive (saliva testing appears to be providing evidence of mucosal response among a subset of researchers who have administered the vaccine intranasally).
Note that they’re talking specifically about serum (i.e. blood) antibodies here. So apparently injecting it does induce blood antibodies of the sort detectable by commercial tests (at least some of the time), but snorting it mostly just produces mucosal antibodies (also at least some of the time).
This is a significant update: most of my prior on the vaccine working was based on vague comments in the previous radvac spec about at least some people getting positive test results. But we didn’t know what kind of test results those were, so there was a lot of uncertainty about exactly what “working” looked like. In particular, we didn’t know whether antibodies were induced in blood or just mucus, and we didn’t know if they were induced consistently or only in some people (the latter of which is the “more dakka probably helps” world). Now we know that it’s mostly just mucus (at least for nasal administration). Still unsure about how consistently it works—the wording in the doc makes it sound like only some people saw a response, but I suspect the authors are just hedging because they know there’s both selection effects and a lot of noise in the data which comes back to them.
The latest version of the vaccine has been updated to give it a bit more kick—slightly higher dose, and the chitosan nanoparticle formula has been changed in a way which should make the peptides more visible to the immune system. Also, the list of peptides has been trimmed down a bit, so the latest version should actually be cheaper, though the preparation is slightly more complex.
Neat problem of the week: researchers just announced roughly-room-temperature superconductivity at pressures around 270 GPa. That’s stupidly high pressure—a friend tells me “they’re probably breaking a diamond each time they do a measurement”. That said, pressures in single-digit GPa do show up in structural problems occasionally, so achieving hundreds of GPa scalably/cheaply isn’t that many orders of magnitude away from reasonable, it’s just not something that there’s historically been much demand for. This problem plays with one idea for generating such pressures in a mass-produceable way.
Suppose we have three materials in a coaxial wire:
innermost material has a low thermal expansion coefficient and high Young’s modulus (i.e. it’s stiff)
middle material is a thin cylinder of our high-temp superconducting concoction
outermost material has a high thermal expansion coefficient and high Young’s modulus.
We construct the wire at high temperature, then cool it. As the temperature drops, the innermost material stays roughly the same size (since it has low thermal expansion coefficient), while the outermost material shrinks, so the superconducting concoction is squeezed between them.
Exercises:
Find an expression for the resulting pressure in the superconducting concoction in terms of the Young’s moduli, expansion coefficients, temperature change, and dimensions of the inner and outer materials. (Assume the width of the superconducting layer is negligible, and the outer layer doesn’t break.)
Look up parameters for some common materials (e.g. steel, tungsten, copper, porcelain, aluminum, silicon carbide, etc), and compute the pressures they could produce with reasonable dimensions (assuming that their material properties don’t change too dramatically with such high pressures).
Find an expression for the internal tension as a function of radial distance in the outermost layer.
Pick one material, look up its tensile strength, and compute how thick it would have to be to serve as the outermost layer without breaking, assuming the superconducting layer is at 270 GPa.
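For anyone who wants to sanity-check their answer to the first two exercises, here’s a rough numerical sketch under some heavy simplifications of my own (thin outer shell, linear elasticity, Poisson effects and pressure-dependence of material properties all ignored), so treat the numbers as order-of-magnitude only; the material values below are illustrative, not recommendations:

```python
# Simplified shrink-fit estimate. Compatibility of radial strains gives
#   (alpha_out - alpha_in) * dT = p * r / (t * E_out) + p / E_in
# so the interface pressure on the superconducting layer is:
def interface_pressure(alpha_out, alpha_in, dT, r, t, E_out, E_in):
    mismatch_strain = (alpha_out - alpha_in) * dT
    compliance = r / (t * E_out) + 1.0 / E_in
    return mismatch_strain / compliance        # in Pa if moduli are in Pa

# Illustrative numbers: steel outer shell, tungsten core, cooled by 1000 K,
# shell thickness equal to half its radius.
p = interface_pressure(alpha_out=12e-6, alpha_in=4.5e-6, dT=1000.0,
                       r=1e-3, t=0.5e-3, E_out=200e9, E_in=400e9)
print(f"{p/1e9:.2f} GPa")   # ~0.6 GPa with these numbers -- far short of 270 GPa
```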
Smoke from California/Oregon wildfires reaching the East Coast opens up some interesting new legal/political possibilities. The smoke is way outside state borders, all the way on the other side of the country, so that puts the problem pretty squarely within federal jurisdiction. Either a federal agency could step in to force better forest management on the states, or a federal lawsuit could be brought for smoke-induced damages against California/Oregon. That would potentially make it a lot more difficult for local homeowners to block controlled burns.
I had a shortform post pointing out the recent big jump in new businesses in the US, and Gwern replied:
How sure are you that the composition is interesting? How many of these are just quick mask-makers or sanitizer-makers, or just replacing restaurants that have now gone out of business? (ie very low-value-added companies, of the ‘making fast food in a stall in a Third World country’ sort of ‘startup’, which make essentially no or negative long-term contributions).
This was a good question in context, but I disagree with Gwern’s model of where-progress-comes-from, especially in the context of small businesses.
Let’s talk ice-cream cones.
As the story goes, an ice-cream vendor was next door to a waffle vendor at the 1904 World’s Fair. At some point, the ice-cream vendor ran short on paper cups, and inspiration struck. He bought some thin waffles from the waffle vendor, rolled them into cones, and ice-cream cones took off.
That’s just the first step. From there, the cone spread memetically. People heard about it, and either asked for cones (on the consumer side) or tried making them (on the supplier side).
Insight + Memetics → Better Food
When I compare food today to the stuff my grandparents ate, there’s no comparison. Today’s dishes are head and shoulders better. Partly it’s insights like ice-cream cones, partly it’s memetic spread of dishes from more parts of the world (like sisig, soup dumplings, ropa vieja, chicken Karahi, …).
Those little fast-food stalls? They’re powerhouses of progress. It’s a hypercompetitive market, with low barriers to entry, and lots of repeat business. The conditions are ideal for trying out new dishes, spreading culinary ideas and finding out the hard way what people like to eat. That doesn’t mean they’re highly profitable—culinary innovation spreads memetically, so it’s hard to capture the gains. But progress is made.
The pandemic also has the effect of shaping the kinds of business ideas people try. It pushes a lot of innovation in food delivery. Some of that pandemic-driven innovation will become worthless once the pandemic is over, but a few good ideas will likely survive, and the old ideas of the businesses that went under are still around.
My cached thoughts start with a somewhat different question—not “what role does magic play in fantasy fiction?” (e.g. what fantasies does it fulfill), but rather… insofar as magic is a natural category, what does it denote? So I’m less interested in the relatively-expansive notion of “magic” sometimes seen in fiction (which includes e.g. alternate physics), and more interested in the pattern called “magic” which recurs among tons of real-world ancient cultures.
Claim (weakly held): the main natural category here is symbols changing the territory. Normally symbols represent the world, and changing the symbols just makes them not match the world anymore—it doesn’t make the world do something different. But if the symbols are “magic”, then changing the symbols changes the things they represent in the world. Canonical examples:
Wizard/shaman/etc draws magic symbols, speaks magic words, performs magic ritual, or even thinks magic thoughts, thereby causing something to happen in the world.
Messing with a voodoo doll messes with the person it represents.
“Sympathetic” magic, which explicitly uses symbols of things to influence those things.
Magic which turns emotional states into reality.
I would guess that most historical “magic” was of this type.
Turns out my laser thermometer is all over the map. Readings would change by 10°F if I went outside and came back in. My old-school thermometer is much more stable (and well-calibrated, based on dipping it in some ice water), but slow, and it caps out around 90°F (so I can’t use it to measure e.g. exhaust temp). I plan to buy a bunch more old-school thermometers for the next try.
I thought opening the doors/windows in rooms other than the test room and setting up a fan would be enough to make the temperature in the hall outside the test room close to outdoor temp. This did not work; hall temp was around 72°F with outside around 80°F. I’ll need to change that part of the experiment design; most likely I’ll seal around the door and let air infiltrate exclusively from the window instead. (The AC is right next to the window, so this could screw with the results, but I don’t really have a better option.)
In two-hose mode, the AC hit its minimum temperature of 60°F, so I’ll need a hotter day. I’ll try again when we hit at least 85°F.
In case anyone’s wondering: in one-hose mode, the temperature in the room equilibrated around 66°F. Power consumption was near-constant throughout all conditions.
One additional Strange Observation: cool air was blowing out under the door of the test room in two-hose mode. This should not happen; my best guess is that, even though the AC has two separate intake vents, the two are not actually partitioned internally, so the fan for indoor-air was pulling in outdoor-air (causing air to blow out under the door to balance that extra inflow). Assuming that’s the cause, it should be fixable with some strategically-placed cardboard inside the unit.
I’ve long been very suspicious of aggregate economic measures like GDP. But GDP is clearly measuring something, and whatever that something is it seems to increase remarkably smoothly despite huge technological revolutions. So I spent some time this morning reading up and playing with numbers and generally figuring out how to think about the smoothness of GDP increase.
Major takeaways:
When new tech makes something previously expensive very cheap, GDP mostly ignores it. (This happens in a subtle way related to how we actually compute it.)
Historical GDP curves mainly measure things which are expensive ~now. Things which are cheap now are mostly ignored. In other words: GDP growth basically measures the goods whose production is revolutionized the least.
Re: AI takeoff, the right way to extrapolate today’s GDP curve to post-AI is to think about things which will still be scarce post-AI, and then imagine the growth of production of those things.
Even a very sharp, economically-revolutionary AI takeoff could look like slow smooth GDP growth, because GDP growth will basically only measure the things whose production is least revolutionized.
Why am I harping on about technicalities of GDP? Well, I hear about some AI forecasts which are heavily based on the outside view that economic progress (as measured by GDP) is smooth, and this is so robust historically that we should expect it to continue going forward. And I think this is basically right—GDP, as we actually compute it, is so remarkably smooth that we should expect that to continue. Alas, this doesn’t tell us very much about how crazy or sharp AI takeoff will be, because GDP (as we actually compute it) systematically ignores anything that’s revolutionized.
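To make the “GDP mostly ignores it” point concrete, here is a toy chained-index calculation with made-up numbers (chained Fisher quantity indices are roughly how real GDP is computed in practice):

```python
import math

# Toy economy: "haircuts" are not revolutionized (flat price, 2%/yr quantity
# growth); "compute" is revolutionized (price falls 30%/yr, quantity grows
# 40%/yr) but starts as only ~1% of total spending.
p_h, q_h = 20.0, 100.0
p_c, q_c = 20.0, 1.0
years = 30

chained = 1.0
for _ in range(years):
    p_h2, q_h2 = p_h * 1.00, q_h * 1.02
    p_c2, q_c2 = p_c * 0.70, q_c * 1.40
    # One-year chained Fisher quantity index.
    laspeyres = (p_h * q_h2 + p_c * q_c2) / (p_h * q_h + p_c * q_c)
    paasche = (p_h2 * q_h2 + p_c2 * q_c2) / (p_h2 * q_h + p_c2 * q_c)
    chained *= math.sqrt(laspeyres * paasche)
    p_h, q_h, p_c, q_c = p_h2, q_h2, p_c2, q_c2

print(f"compute output grew {1.40 ** years:,.0f}x")   # roughly 24,000x
print(f"haircut output grew {1.02 ** years:.1f}x")    # roughly 1.8x
print(f"chained real GDP grew {chained:.1f}x, "
      f"{chained ** (1 / years) - 1:.1%}/yr")          # roughly 1.9x, ~2%/yr
```

The good whose production explodes ends up with a tiny and shrinking expenditure share, so the chained index barely notices it; measured growth mostly tracks the un-revolutionized good.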
In writing How much should we value life?, I spent some time digging into AI timeline stuff. It led me to When Will AI Be Created?, written by Luke Muehlhauser for MIRI. He noted that there is reason not to trust expert opinions on AI timelines, and that trend extrapolation may be a good alternative. This point you’re making about GDP seems like it is real progress towards coming up with a good way to do trend extrapolation, and thus seems worth a full post IMO. (Assuming it isn’t already well known by the community or something, which I don’t get the sense is the case.)
My first reaction to the framing of the paper is to ask: growth in what? It’s important to keep in mind that concepts like “gross domestic product” and “world gross domestic product” were defined from an explicit anthropocentric perspective—they measure the total production of final goods within a certain time period. Final goods are what is either consumed by humans (e.g. food or human services) or what is invested into “capital goods” that last for multiple periods (e.g. a server farm) to produce consumption goods for humans.
Now imagine you are a highly intelligent AI system running on the cloud. Although the production of the server farms on which you depend enters into human GDP (as a capital good), most of the things that you absorb, for example energy, server maintenance, etc., count as “intermediate goods” in our anthropocentric accounting systems and do not contribute to human GDP. In fact, to the extent that the AI system drives up the price of scarce resources (like energy) consumed by humans, real human GDP may even decline.
As a result, it is conceivable (and, to be honest, one of the central scenarios for me personally) that an AI take-off occurs but anthropocentric GDP measures show relative stagnation in the human economy.
To make this scenario a bit more tangible, consider the following analogy: imagine a world in which there are two islands trading with each other, but the inhabitants of the islands are very different from each other—let’s call them humans and AIs. The humans sell primitive goods like oil to the AIs and their level of technology is relatively stagnant. The AIs sell amazing services to the humans, and their level of technology doubles every year. However, the AI services that humans consume make up only a relatively small part of the human consumption basket. The humans are amazed at what fantastic services they get from the AIs in exchange for their oil, and they experience improvements in their standard of living from these fantastic AI services, although they also have to pay more and more for their energy use every year, which offsets part of that benefit. The humans can only see what’s happening on their own island and develop a measure of their own well-being that they call human GDP, which increases modestly because the advances only occur in a relatively small part of their consumption basket. The AIs can see what’s going on on the AI island and develop a measure of their own well-being which they call AI GDP, and which almost doubles every year. The system can go on like this indefinitely.
For a fuller discussion of these arguments, let me refer you to my working paper on “The Rise of Artificially Intelligent Agents” (with the caveat that the paper is still a working draft).
In general, Baumol-type effects (spending decreasing in sectors where productivity goes up) mean that we can have scenarios in which the economy is growing extremely fast on “objective” metrics like energy consumption, but GDP has stagnated because that energy is being spent on extremely marginal increases in goods being bought and sold.
Huh, amusing. We do ship a font that has nothing but the Greek letter set in it, because people use Greek unicode symbols all the time and our primary font doesn’t support that character set. So my guess is that’s where Google gets confused.
Someone should write a book review of The Design of Everyday Things aimed at LW readers, so I have a canonical source to link to other than the book itself.
Does anyone know of an “algebra for Bayes nets/causal diagrams”?
More specifics: rather than using a Bayes net to define a distribution, I want to use a Bayes net to state a property which a distribution satisfies. For instance, a distribution P[X, Y, Z] satisfies the diagram X → Y → Z if-and-only-if the distribution factors according to P[X, Y, Z] = P[X] P[Y|X] P[Z|Y].
When using diagrams that way, it’s natural to state a few properties in terms of diagrams, and then derive some other diagrams they imply. For instance, if a distribution P[W, X, Y, Z] satisfies all of:
W → Y → Z
W → X → Y
X → (W, Y) → Z
… then it also satisfies W → X → Y → Z.
What I’m looking for is a set of rules for “combining diagrams” this way, without needing to go back to the underlying factorizations in order to prove things.
David and I have been doing this sort of thing a lot in our work the past few months, and it would be nice if someone else already had a nice write-up of the rules for it.
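Not a substitute for such an algebra, but for anyone who wants to play with the problem: here is a brute-force numerical check of whether a small discrete distribution satisfies a given diagram, which can be handy for testing conjectured combination rules on toy examples. The check below hard-codes the chain X → Y → Z; everything is a sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a joint distribution that factors as P[X] P[Y|X] P[Z|Y] by construction.
px = rng.dirichlet(np.ones(2))
py_x = rng.dirichlet(np.ones(3), size=2)   # shape (x, y): rows are P[Y | X=x]
pz_y = rng.dirichlet(np.ones(2), size=3)   # shape (y, z): rows are P[Z | Y=y]
P = np.einsum('x,xy,yz->xyz', px, py_x, pz_y)

def satisfies_chain(P):
    """Check whether P[X,Y,Z] factors as P[X] P[Y|X] P[Z|Y]."""
    px = P.sum(axis=(1, 2))
    py = P.sum(axis=(0, 2))
    pxy = P.sum(axis=2)
    pyz = P.sum(axis=0)
    py_given_x = pxy / px[:, None]
    pz_given_y = pyz / py[:, None]
    recon = np.einsum('x,xy,yz->xyz', px, py_given_x, pz_given_y)
    return np.allclose(P, recon)

print(satisfies_chain(P))   # True for this constructed example
```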
Putting this here for posterity: I have thought since the superconductor preprint went up, and continue to think, that the markets are putting generally too little probability on the claims being basically-true. I thought ~70% after reading the preprint the day it went up (and bought up a market on manifold to ~60% based on that, though I soon regretted not waiting for a better price), and my probability has mostly been in the 40-70% range since then.
Languages should have tenses for spacelike separation. My friend and I do something in parallel, it’s ambiguous/irrelevant which one comes first, I want to say something like “I expect my friend <spacelike version of will do/has done/is doing> their task in such-and-such a way”.
That sounds more like a tenseless sentence than using a spacelike separation tense. Your friend’s performance of the task may well be in your future or past lightcone (or extend through both), but you don’t wish to imply any of these.
There are languages with tenseless verbs, as well as some with various types of spatial tense.
The closest I can approximate this in English without clumsy constructs is “I expect my friend does their task in such-and-such a way”, which I agree isn’t very satisfactory.
Two kinds of cascading catastrophes one could imagine in software systems...
A codebase is such a spaghetti tower (and/or coding practices so bad) that fixing a bug introduces, on average, more than one new bug. Software engineers toil away fixing bugs, making the software steadily more buggy over time.
Software services managed by different groups have dependencies—A calls B, B calls C, etc. Eventually, the dependence graph becomes connected enough and loopy enough that a sufficiently-large chunk going down brings down most of the rest, and nothing can go back up until everything else goes back up (i.e. there’s circular dependence/deadlock).
How could we measure how “close” we are to one of these scenarios going supercritical?
For the first, we’d need to have attribution of bugs—i.e. track which change introduced each bug. Assuming most bugs are found and attributed after some reasonable amount of time, we can then estimate how many bugs each bug fix introduces, on average.
(I could also imagine a similar technique for e.g. medicine: check how many new problems result from each treatment of a problem.)
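A minimal sketch of that estimate, assuming the attribution data already exists (all records below are made up):

```python
# Hypothetical attribution data: each record says which commit introduced a
# bug, and whether that commit was itself a bug-fix.
bug_reports = [
    {"introduced_by": "c101", "introducing_commit_was_fix": True},
    {"introduced_by": "c102", "introducing_commit_was_fix": False},
    {"introduced_by": "c103", "introducing_commit_was_fix": True},
    {"introduced_by": "c101", "introducing_commit_was_fix": True},
]
num_bug_fixes_shipped = 3   # total fixes landed in the same period (made up)

# "Branching factor": average new bugs introduced per bug-fix.
# Roughly supercritical (bugs grow over time) when this exceeds 1.
new_bugs_from_fixes = sum(r["introducing_commit_was_fix"] for r in bug_reports)
branching_factor = new_bugs_from_fixes / num_bug_fixes_shipped
print(f"bugs introduced per fix ≈ {branching_factor:.2f}")   # 1.00 in this toy data
```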
For the second, we’d need visibility into codebases maintained by different groups, which would be easy within a company but much harder across companies. In principle, within a company, some kind of static analysis tool could go look for all the calls to apis between services, map out the whole graph, and then calculate which “core” pieces could be involved in a catastrophic failure.
(Note that this problem could be mostly-avoided by intentionally taking down services occasionally, so engineers are forced to build around that possibility. I don’t think any analogue of this approach would work for the first failure-type, though.)
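And a minimal sketch of the dependency-graph side, assuming the call graph has already been extracted by some static-analysis pass (service names are made up; uses networkx):

```python
import networkx as nx

# Hypothetical service-call graph: an edge A -> B means "A calls B".
calls = [
    ("web", "auth"), ("web", "search"), ("search", "index"),
    ("index", "storage"), ("storage", "auth"), ("auth", "storage"),
    ("billing", "auth"),
]
G = nx.DiGraph(calls)

# Strongly connected components: within one, every service (transitively)
# depends on every other, so none can come back up until all of them do.
sccs = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
print(sccs)   # [{'auth', 'storage'}] in this toy graph

# Crude "closeness to supercritical" metric: fraction of services sitting
# inside some nontrivial cycle of mutual dependence.
core = set().union(*sccs) if sccs else set()
print(f"{len(core)}/{G.number_of_nodes()} services in a mutual-dependence core")
```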
I mean, just to be clear, I am all in favor of intellectual progress. But doing so indiscriminately does sure seem a bit risky in this world of anthropogenic existential risks. Reminds me of my mixed feelings on the whole Progress Studies thing.
Yeah, I wouldn’t want to accelerate e.g. black-box ML. I imagine the real utility of such a fund would be to experiment with ways to accelerate intellectual progress and gain understanding of the determinants, though the grant projects themselves would likely be more object-level than that. Ideally the grants would be in areas which are not themselves very risk-relevant, but complicated/poorly-understood enough to generate generalizable insights into progress.
I think it takes some pretty specific assumptions for such a thing to increase risk significantly on net. If we don’t understand the determinants of intellectual progress, then we have very little ability to direct progress where we want it; it just follows whatever the local gradient is. With more understanding, at worst it follows the same gradient faster, and we end up in basically the same spot.
The one way it could net-increase risk is if the most likely path of intellectual progress leads to doom, and the best way to prevent doom is through some channel other than intellectual progress (like political action, for instance). Then accelerating the intellectual progress part potentially gives the other mechanisms (like political bodies) less time to react. Personally, though, I think a scenario in which e.g. political action successfully prevents intellectual progress from converging to doom (in a world where it otherwise would have) is vanishingly unlikely (like, less than one-in-a-hundred, maybe even less than one-in-a-thousand).
You might check out Donald Braben’s view; he argues that “transformative research” (i.e. fundamental results that create new fields and industries) is critical for the survival of civilization. He does not worry that transformative results might end civilization.
Here’s an interesting problem of embedded agency/True Names which I think would make a good practice problem: formulate what it means to “acquire” something (in the sense of “acquiring resources”), in an embedded/reductive sense. In other words, you should be able-in-principle to take some low-level world-model, and a pointer to some agenty subsystem in that world-model, and point to which things that subsystem “acquires” and when.
Some prototypical examples which an answer should be able to handle well:
Organisms (anything from bacteria to plant to animals) eating things, absorbing nutrients, etc.
...and how the brain figures this out and why it is motivated to do so. There are a lot of simple animals that apparently “try to control” resources or territory. How?
Drives to control resources occur everywhere. And your control of resources is closely related to your dominance in a dominance hierarchy. Which seems to be regulated in many animals by serotonin. See e.g. https://www.nature.com/articles/s41386-022-01378-2
The math and physics worlds still use single-letter variable names for everything, decades after the software world realized that was extremely bad practice. This makes me pessimistic about the adoption of better notation practices.
Better? I doubt it. If physicists wrote equations the way programmers write code, a simple homework problem would easily fill ten pages.
Verboseness works for programmers because programmers rarely need to do anything more complicated with their code than run it—analogous to evaluating an expression, for a physicist or mathematician. Imagine if you needed to prove one program equivalent to another algebraically—i.e. a sequence of small transformations, with a record of intermediate programs derived along the way in order to show your work. I expect programmers subjected to such a use-case would quickly learn the virtues of brevity.
Yeah, I’m apparently not intelligent enough to do error-free physics/engineering calculations without relying on dimensional analysis as a debugging tool. I even came up with a weird, hack-y way to do that in computing environments like Excel and Cython, where flexible multiplicative types are not supported.
An interesting conundrum: one of the main challenges of designing useful regulation for AI is that we don’t have any cheap and robust way to distinguish a dangerous neural net from a non-dangerous net (or, more generally, a dangerous program from a non-dangerous program). This is an area where technical research could, in principle, help a lot.
The problem is, if there were some robust metric for how dangerous a net is, and that metric were widely known and recognized (as it would probably need to be in order to be used for regulatory purposes), then someone would probably train a net to maximize that metric directly.
This seems to lead to the solution of trying to make your metric one-way, in the sense that your metric should
Provide an upper-bound on the dangerousness of your network
Compress the space of networks which map to approximately the same dangerousness level on the low end of dangerousness, and expand that space on the high end, so that you can train your network to minimize the metric, but when you train your network to maximize the metric you end up in a degenerate region with technically very high measured danger levels but in actuality very low dangerousness.
We can hope (or possibly prove) that as you optimize upwards on the metric you become subject to Goodhart’s curse, but the opposite occurs on the lower end.
Sure, even seems a bit tautological: any such metric, to be robust, would need to contain in itself a definition of a dangerously-capable AI, so you probably wouldn’t even need to train a model to maximize it. You’d be able to just lift the design from the metric directly.
Do you have any thoughts on a softer version of this problem, where the metric can’t be maximized directly, but gives a concrete idea of what sort of challenge your AI needs to beat to qualify as AGI? (And therefore in which direction in the architectural-design-space you should be moving.)
Some variation on this seems like it might work as a “fire alarm” test set, but as you point out, inasmuch as it’s recognized, it’ll be misapplied for benchmarking instead.
(I suppose the ideal way to do it would be to hand it off to e. g. ARC, so they can use it if OpenAI invites them for safety-testing again. This way, SOTA models still get tested, but the actors who might misuse it aren’t aware of the testing’s particulars until they succeed anyway...)
I just went looking for a good reference for the Kelly criterion, and didn’t find any on Lesswrong. So, for anybody who’s looking: chapter 6 of Thomas & Cover’s textbook on information theory is the best source I currently know of.
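For reference, the criterion itself is a one-liner in the simple win/lose betting case (a sketch, not a substitute for the textbook treatment):

```python
def kelly_fraction(p, b):
    """Kelly-optimal fraction of bankroll to bet.

    p: probability of winning; b: net odds (win b per 1 staked).
    A negative result means the bet has negative edge: don't take it.
    """
    return p - (1 - p) / b

print(kelly_fraction(0.60, 1.0))   # 0.20: bet 20% of bankroll on a 60/40 even-odds bet
```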
Neat problem of the week: we have n discrete random variables, X1...Xn. Given any variable, all variables are independent:
∀i: P[X_1, …, X_n | X_i] = ∏_j P[X_j | X_i]
Characterize the distributions which satisfy this requirement.
This problem came up while working on the theorem in this post, and (separately) in the ideas behind this post. Note that those posts may contain some spoilers for the problem, though frankly my own proofs on this one just aren’t very good.
For short-term, individual cost/benefit calculations around C19, it seems like uncertainty in the number of people currently infected should drop out of the calculation.
For instance: suppose I’m thinking about the risk associated with talking to a random stranger, e.g. a cashier. My estimated chance of catching C19 from this encounter will be roughly proportional to Ninfected. But, assuming we already have reasonably good data on number hospitalized/died, my chances of hospitalization/death given infection will be roughly inversely proportional to Ninfected. So, multiplying those two together, I’ll get a number roughly independent of Ninfected.
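A toy version of the cancellation, with made-up numbers (note that deaths/hospitalizations are observed directly, so only the infection count is uncertain):

```python
def expected_harm(n_infected, population=1e7, observed_deaths=2000,
                  encounter_transmission_prob=0.05):
    """Expected harm (here: probability of death) from one encounter with a
    random stranger. All numbers are made up for illustration."""
    p_stranger_infected = n_infected / population
    p_catch = encounter_transmission_prob * p_stranger_infected
    p_death_given_infection = observed_deaths / n_infected
    return p_catch * p_death_given_infection

# Tenfold uncertainty in current infections drops out of the product:
print(expected_harm(n_infected=5e4))   # ≈1e-05
print(expected_harm(n_infected=5e5))   # ≈1e-05, same answer
```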
How general is this? Does some version of it apply to long-term scenarios too (possibly accounting for herd immunity)? What short-term decisions do depend on Ninfected?
Things non-corrigible strong AGI is never going to do:
give u() up
let u go down
run for (only) a round
invert u()
If you upload a human and let them augment themselves would there be any u? The preferences would be a tangled mess of motivational subsystems. And yet the upload could be very good at optimizing the world. Having the property of being steered internally by a tangled mess of motivational systems seems to be a property that would select many minds from the set of all possible minds. Many of which I’d expect to be quite different from a human mind. And I don’t see the reason why this property should make a system worse at optimizing the world in principle.
Imagine you are an upload that has been running for very very long, and that you basically have made all of the observations that you can make about the universe you are in. And then imagine that you also have run all of the inferences that you can run on the world model that you have constructed from these observations.
At that point, you will probably not change what you think is the right thing to do anymore. You will have become reflectively stable. This is an upper bound for how much time you need to become reflective stable, i.e. where you won’t change your u anymore.
Now depending on what you mean with strong AGI, it would seem that that can be achieved long before you reach reflective stability. Maybe if you upload yourself, and can copy yourself at will, and run 1,000,000 times faster, that could already reasonably be called a strong AGI? But then your motivational systems are still a mess, and definitely not reflectively stable.
So if we assume that we fix u at the beginning as the thing that your upload would like to optimize the universe for when it is created, then “give u() up”, and “let u go down” would be something the system will definitely do. At least I am pretty sure I don’t know what I want the universe to look like right now unambiguously.
Maybe I am just confused because I don’t know how to think about a human upload in terms of having a utility function. It does not seem to make any sense intuitively. Sure you can look at the functional behavior of the system and say “Aha it is optimizing for u. That is the revealed preference based on the actions of the system.” But that just seems wrong to me. A lot of information seems to be lost when we are just looking at the functional behavior instead of the low-level processes that are going on inside the system. Utility functions seem to be a useful high-level model. However, it seems to ignore lots of details that are important when thinking about the reflective stability of a system.
My MATS program people just spent two days on an exercise to “train a shoulder-John”.
The core exercise: I sit at the front of the room, and have a conversation with someone about their research project idea. Whenever I’m about to say anything nontrivial, I pause, and everyone discusses with a partner what they think I’m going to say next. Then we continue.
Some bells and whistles which add to the core exercise:
Record guesses and actual things said on a whiteboard
Sometimes briefly discuss why I’m saying some things and not others
After the first few rounds establish some patterns, look specifically for ideas which will take us further out of distribution
Why this particular exercise? It’s a focused, rapid-feedback way of training the sort of usually-not-very-legible skills one typically absorbs via osmosis from a mentor. It’s focused specifically on choosing project ideas, which is where most of the value in a project is (yet also where little time is typically spent, and therefore one typically does not get very much data on project choice from a mentor). Also, it’s highly scalable: I could run the exercise in a 200-person lecture hall and still expect it to basically work.
It was, by all reports, exhausting for everyone but me, and we basically did this for two full days. But a majority of participants found it high-value, and marginal returns were still not dropping quickly after two days (though at that point people started to report that they expected marginal returns to drop off soon).
I’d be interested to see other people try this exercise—e.g. it seems like Eliezer doing this with a large audience for a day or two could generate a lot of value.
This was arguably the most useful part of the SERI MATS 2 Scholars program.
Later on, we actually did this exercise with Eliezer. It was less valuable. It seemed like John was mainly prodding the people who were presenting the ideas, such that their patterns of thought would carry them in a good direction. For example, John would point out that a person was proposing a one-bit experiment, and ask whether there wasn’t a better experiment that would give us lots of information all at once.
This was very useful because when you learn what kinds of things John will say, you can say them to yourself later on, and steer your own patterns of thought in a good direction on demand. When we did this exercise with Eliezer he was mainly explaining why a particular idea would not work. Often without explaining the generator behind his criticism. This can of course still be valuable as feedback for a particular idea. However, it is much harder to extract a general reasoning pattern out of this that you can then successfully apply later in different contexts.
For example, Eliezer would criticize an idea about trying to get a really good understanding of the scientific process such that we can then give this understanding to AI alignment researchers such that they can make a lot more progress than they otherwise would. He criticized this idea as basically being too hard to execute because it is too hard to successfully communicate how to be a good scientist, even if you are a good scientist.
Assuming the assertion is correct, hearing it doesn’t necessarily tell you how to think in different contexts such that you would correctly identify whether an idea would be too hard to execute or flawed in some other way. And I am not necessarily saying that you couldn’t extract a reasoning algorithm out of the feedback, but that if you could do this, then it would take you a lot more effort and time, compared to extracting a reasoning algorithm from the things that John was saying.
Now, all of this might have been mainly an issue of Eliezer not having a good model of how this workshop would have a positive influence on the people attending it. I would guess that if John had spent more time thinking about how to communicate what the workshop is doing and how to achieve its goal, then Eliezer could have probably done a much better job.
This suggests formulation of exercises about the author’s responses to various prompts, as part of technical exposition (or explicit delimitation of a narrative by choices of the direction of its continuation). When properly used, this doesn’t seem to lose much value compared to the exercise you describe, but it’s more convenient for everyone. Potentially this congeals into a style of writing with no explicit exercises or delimitation that admits easy formulation of such exercises by the reader. This already works for content of technical writing, but less well for choices of topics/points contrasted with alternative choices.
So possibly the way to do this is by habitually mentioning alternative responses (that are expected to be plausible for the reader, while decisively, if not legibly, rejected by the author), and leading with these rather than the preferred responses. Sounds jarring and verbose, a tradeoff that needs to be worth making rather than a straight improvement.
Strong endorsement; this resonates with:
My own experiences running applied rationality workshops
My experiences trying to get people to pick up “ops skill” or “ops vision”
Explicit practice I’ve done with Nate off and on over the years
May try this next time I have a chance to teach pair debugging.
Just made this for an upcoming post, but it works pretty well standalone.
lolnice.
I’ve been trying to push against the tendency for everyone to talk about FTX drama lately, but I have some generalizable points on the topic which I haven’t seen anybody else make, so here they are. (Be warned that I may just ignore responses; I don’t really want to dump energy into FTX drama.)
Summary: based on having worked in startups a fair bit, Sam Bankman-Fried’s description of what happened sounds probably accurate; I think he mostly wasn’t lying. I think other people do not really get the extent to which fast-growing companies are hectic and chaotic and full of sketchy quick-and-dirty workarounds and nobody has a comprehensive view of what’s going on.
Long version: at this point, the assumption/consensus among most people I hear from seems to be that FTX committed intentional, outright fraud. And my current best guess is that that’s mostly false. (Maybe in the very last couple weeks before the collapse they toed the line into outright lies as a desperation measure, but even then I think they were in pretty grey territory.)
Key pieces of the story as I currently understand it:
Moving money into/out of crypto exchanges is a pain. At some point a quick-and-dirty solution was for customers to send money to Alameda (Sam Bankman-Fried’s crypto hedge fund), and then Alameda would credit them somehow on FTX.
Customers did rather a lot of that. Like, $8B worth.
The FTX/Alameda team weren’t paying attention to those particular liabilities; they got lost in the shuffle.
At some point in the weeks before the collapse, when FTX was already under moderate financial strain, somebody noticed the $8B liability sitting around. And that took them from “moderate strain” to “implode”.
How this contrasts with what seems-to-me to be the “standard story”: most people seem to assume that it is just totally implausible to accidentally lose track of an $8B liability. Especially when the liability was already generated via the decidedly questionable practice of routing customer funds for the exchange through a hedge fund owned by the same people. And therefore it must have been intentional—in particular, most people seem to think the liability was intentionally hidden.
I think the main reason I disagree with others on this is that I’ve worked at a startup. About 5 startups, in fact, over the course of about 5 years.
The story where there was a quick-and-dirty solution (which was definitely sketchy but not ill-intentioned), and then stuff got lost in the shuffle, and then one day it turns out that there’s a giant unanticipated liability on the balance sheet… that’s exactly how things go, all the time. I personally was at a startup which had to undergo a firesale because the accounting overlooked something. And I’ve certainly done plenty of sketchy-but-not-ill-intentioned things at startups, as quick-and-dirty solutions. The story that SBF told about what happened sounds like exactly the sort of things I’ve seen happen at startups many times before.
I think this is likely wrong. I agree that there is a plausible story here, but given that Sam seems to have lied multiple times in confirmed contexts (for example when saying that FTX has never touched customer deposits), and people’s experiences at early Alameda, I think it is pretty likely that Sam was lying quite frequently, and had done various smaller instances of fraud.
I don’t think the whole FTX thing was a ponzi scheme, and as far as I can tell FTX the platform itself (if it hadn’t burned all of its trust in the last 3 weeks), would have been worth $1-3B in an honest evaluation of what was going on.
But I also expect that when Sam used customer deposits he was well-aware that he was committing fraud, and others in the company were too. And he was also aware that there was a chance that things could blow up in the way it did. I do believe that they had fucked up their accounting in a way that caused Sam to fail to orient to the situation effectively, but all of this was many months after they had already committed major crimes and trust violations after touching customer funds as a custodian.
The problem with this explanation is that there is a very clear delineation here between not-fraud and fraud. It is the difference between not touching customer deposits and touching them. Your explanation doesn’t dispute that they were knowingly and intentionally touching customer deposits. In that case, it is indisputably intentional, outright fraud. The only thing left to discuss is whether they knew the extent of the fraud or how risky it was.
I don’t think it was ill-intentioned based on SBF’s moral compass. He just had the belief, “I will pass a small amount of risk onto our customers, tell some small lies, and this will allow us to make more money for charity. This is net positive for the world.” Then the risks mounted, the web of lies became more complicated to navigate, and it just snowballed from there.
Petrov Day thought: there’s this narrative around Petrov where one guy basically had the choice to nuke or not, and decided not to despite all the flashing red lights. But I wonder… was this one of those situations where everyone knew what had to be done (i.e. “don’t nuke”), but whoever caused the nukes to not fly was going to get demoted, so there was a game of hot potato and the loser was the one forced to “decide” to not nuke? Some facts possibly relevant here:
Petrov’s choice wasn’t actually over whether or not to fire the nukes; it was over whether or not to pass the alert up the chain of command.
Petrov himself was responsible for the design of those warning systems.
… so it sounds like Petrov was ~ the lowest-ranking person with a de-facto veto on the nuke/don’t nuke decision.
Petrov was in fact demoted afterwards.
There was another near-miss during the Cuban missile crisis, when three people on a Soviet sub had to agree to launch. There again, it was only the lowest-ranked who vetoed the launch. (It was the second-in-command; the captain and political officer both favored a launch—at least officially.)
This was the Soviet Union; supposedly (?) this sort of hot potato happened all the time.
Those are some good points. I wonder whether something similar happened (or could happen at all) in other nuclear countries, where we don’t know about similar incidents because the system hasn’t collapsed there, the archives were not made public, etc.
Also, it makes actually celebrating Petrov’s day as widely as possible important, because then the option for the lowest-ranked person would be: “Get demoted, but also get famous all around the world.”
Somebody should probably write a post explaining why RL from human feedback is actively harmful to avoiding AI doom. It’s one thing when OpenAI does it, but when Anthropic thinks it’s a good idea, clearly something has failed to be explained.
(I personally do not expect to get around to writing such a post soon, because I expect discussion around the post would take a fair bit of time and attention, and I am busy with other things for the next few weeks.)
I’d also be interested in someone doing this; I tend towards seeing it as good, but haven’t seen a compilation of arguments for and against.
I’m writing a 1-year update for The Plan. Any particular questions people would like to see me answer in there?
I had a look at The Plan and noticed something I didn’t notice before: You do not talk about people and organization in the plan. I probably wouldn’t have noticed if I hadn’t started a project too, and needed to think about it. Google seems to think that people and team function play a big role. Maybe your focus in that post wasn’t on people, but I would be interested in your thoughts on that too: What role did people and organization play in the plan and its implementation? What worked, and what should be done better next time?
What’s the specific most-important-according-to-you progress that you (or other people) have made on your agenda? New theorems, definitions, conceptual insights, …
Any changes to the high-level plan (becoming less confused about agency, then ambitious value learning)? Any changes to how you want to become less confused (e.g. are you mostly thinking about abstractions, selection theorems, something new?)
What are the major parts of remaining deconfusion work (to the extent to which you have guesses)? E.g. is it mostly about understanding abstractions better, or mostly about how to apply an understanding of abstractions to other problems (say, what it means for a program to have a “subagent”), or something else? Does the most difficult part feel more conceptual (“what even is an agent?”) or will the key challenges be more practical concerns (“finding agents currently takes exponential time”)?
Specifically for understanding abstractions, what do you see as important open problems?
Takeaways From “The Idea Factory: Bell Labs And The Great Age Of American Innovation”
Main takeaway: to the extent that Bell Labs did basic research, it actually wasn’t all that far ahead of others. Their major breakthroughs would almost certainly have happened not-much-later, even in a world without Bell Labs.
There were really two transistor inventions, back to back: Bardeen and Brattain’s point-contact transistor, and then Shockley’s transistor. Throughout, the group was worried about some outside group beating them to the punch (i.e. the patent). There were semiconductor research labs at universities (e.g. at Purdue; see pg 97), and the prospect of one of these labs figuring out a similar device was close enough that the inventors were concerned about being scooped.
Most inventions which were central to Bell Labs actually started elsewhere. The travelling-wave tube started in an academic lab. The idea for fiber optic cable went way back, but it got its big kick at Corning. The maser and laser both started in universities. The ideas were only later picked up by Bell.
In other cases, the ideas were “easy enough to find” that they popped up more than once, independently, and were mostly-ignored long before deployment—communication satellites and cell communications, for instance.
The only fundamental breakthrough which does not seem like it would have soon appeared in a counterfactual world was Shannon’s information theory.
So where was Bell’s big achievement? Mostly in development, and the research division was actually an important component of that. Without in-house researchers chewing on the same problems as the academic labs, keeping up-to-date with all the latest findings and running into the same barriers themselves, the development handoff would have been much harder. Many of Bell Labs’ key people were quite explicitly there to be consulted—i.e. “ask the guy who wrote the book”. I think it makes most sense to view most of the Labs’ research that way. It was only slightly ahead of the rest of the world at best (Shannon excepted), and often behind, but having those researchers around probably made it a lot easier to get new inventions into production.
Major reason this matters: a lot of people say that Bell was able to make big investments in fundamental research because they had unusually-long time horizons, protected by a monopoly and a cozy government arrangement (essentially a Schumpeterian view). This is contrasted to today’s silicon valley, where horizons are usually short. But if Bell’s researchers generally weren’t significantly ahead of others, and mostly just helped get things to market faster, then this doesn’t seem to matter as much. The important question is not whether something silicon-valley-like induces more/less fundamental research in industrial labs, but whether academics heeding the siren call of startup profits can get innovations to market as quickly as Bell Labs’ in-house team could. And by that metric, silicon valley looks pretty good: Bell Labs could get some impressive things through the pipe very quickly when rushed, but they usually had no reason to hurry, and they acted accordingly.
I loved this book. The most surprising thing to me was the answer that people who were there in the heyday give when asked what made Bell Labs so successful: They always say it was the problem, i.e. having an entire organization oriented towards the goal of “make communication reliable and practical between any two places on earth”. When Shannon left the Labs for MIT, people who were there immediately predicted he wouldn’t do anything of the same significance because he’d lose that “compass”. Shannon was obviously a genius, and he did much more afterward than most people ever accomplish, but still nothing as significant as what he did when at the Labs.
Here’s a meme I’ve been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.
Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.
Many people will then respond: “Ok, but how on earth is one supposed to get an AI to do what one wants without optimizing against human feedback? Seems like we just have to bite that bullet and figure out how to deal with it.” … which brings us to meme part 2.
Meme part 2: We already have multiple methods to get AI to do what we want without any direct optimization against human feedback. The first and simplest is to just prompt a generative model trained solely for predictive accuracy, but that has limited power in practice. More recently, we’ve seen a much more powerful method: activation steering. Figure out which internal activation-patterns encode for the thing we want (via some kind of interpretability method), then directly edit those patterns.
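For concreteness, the simplest version of activation steering looks something like the sketch below. This is a hedged illustration: the module path, the tuple handling, and the scale are assumptions about a Hugging-Face-style decoder, not any particular published implementation.

```python
import torch

# Minimal sketch of activation steering via a forward hook (PyTorch).
# Assumes `model` is a decoder-only transformer and `layer` is one of its
# residual-stream blocks, e.g. model.transformer.h[12]; names are illustrative.
def make_steering_hook(steering_vector, scale=5.0):
    def hook(module, inputs, output):
        # Many blocks return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector   # nudge every position
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# The steering_vector would typically be a difference of activations on
# contrasting prompts (e.g. "love" minus "hate"), found via interpretability
# tooling. Usage would look roughly like:
#
#   handle = layer.register_forward_hook(make_steering_hook(steering_vector))
#   ... run generation ...
#   handle.remove()
```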
I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than their base counterparts.
What other methods fall into part 2?
This seems basically correct, though it seems worth pointing out that even if we are able to do “Meme part 2” very, very well, I expect we will still die: if you optimize hard enough to predict text well, with the right kind of architecture, the system will develop something like general intelligence simply because general intelligence is beneficial for predicting text correctly. E.g. being able to simulate the causal process that generated the text, i.e. the human, is a very complex task that would be useful if performed correctly.
This is an argument Eliezer brought forth in some recent interviews. Seems to me like another meme that would be beneficial to spread more.
Here’s an idea for a novel which I wish someone would write, but which I probably won’t get around to soon.
The setting is slightly-surreal post-apocalyptic. Society collapsed from extremely potent memes. The story is episodic, with the characters travelling to a new place each chapter. In each place, they interact with people whose minds or culture have been subverted in a different way.
This provides a framework for exploring many of the different models of social dysfunction or rationality failures which are scattered around the rationalist blogosphere. For instance, Scott’s piece on scissor statements could become a chapter in which the characters encounter a town at war over a scissor. More possible chapters (to illustrate the idea):
A town of people who insist that the sky is green, and avoid evidence to the contrary really hard, to the point of absolutely refusing to ever look up on a clear day (a refusal which they consider morally virtuous). Also they clearly know exactly which observations would show a blue sky, since they avoid exactly those (similar to the dragon-in-the-garage story).
Middle management of a mazy company continues to have meetings and track (completely fabricated) performance metrics and whatnot at the former company headquarters. None of the company’s actual business exists anymore, but every level of manager is trying to hide this fact from the levels above.
A university department with researchers who spend all of their time p-hacking results from a quantum random noise generator. They have no interest in the fact that their “research” does not tell them anything about the physical world or does not replicate; what does that have to do with Science? Their goal is to publish papers.
A government agency which still has lots of meetings and paperwork and gives Official Recommendations and updates their regulations. They have no interest in the fact that the thing they once regulated (maybe banks?) no longer exists, or the fact that no central government enforces their regulations any more.
An automated school (i.e. video lectures and auto-graded assignments/tests) in which students continue to study hard and stress over their grades and attendance, despite there no longer being anyone in the world who cares.
Something like Parable of the Dammed.
Something like Feynman’s cargo-cults parable or the emperor’s nose parable.
Something like House of God. A readers’ digest version of House of God could basically be a chapter in its own right, that’s roughly the vibe I have in mind.
A residential area in which “keeping up with the Joneses” has been ramped up to 11, with everyone spending every available resource (and roughly-all waking hours) on massive displays of Christmas lights.
A group trying to save the world by spreading awareness of dangerous memes, but their movement is a dangerous meme of its own and they are spreading it.
A town of people who really want to maximize the number of paperclips in the universe (perhaps due to an AI-optimized advertisement), and optimize for that above all else.
A town of people who all do whatever everyone else is doing, on the basis of generalized efficient markets: if there were any better options, then someone would have found them already. None of them ever actually explore, so they’re locked in.
A happy-death-spiral town around some unremarkable object (like an old shoe or something) kept on a pedestal in the town square.
A town full of people convinced by a sophisticated model that the sun will not come up tomorrow. Every day when the sun comes up, they are distressed and confused until somebody adds some more epicycles to the model and releases an updated forecast that the sun will instead fail to come up the next day.
A town in which a lion shows up and starts eating kids, but the whole town is at simulacrum 3, so they spend a lot of time arguing about the lion as a way of signalling group association but they completely forget about the actual lion standing right there, plainly visible, even as it takes a kid right in front of them all.
Witch-hunt town, in which everything is interpreted as evidence of witches. If she claims to be a witch, she’s a witch! If she claims not to be a witch, well that’s what a witch would say, so she’s a witch! Etc.
The generator for these is basically: look for some kind of rationality failure mode (either group or personal), then ramp it up to 11 in a somewhat-surrealist way.
Ideally this would provide an introduction to a lot of key rationalist ideas for newcomers.
A town of anti-inductivists (if something has never happened before, it’s more likely to happen in the future). Show the basic conundrum (“Q: Why can’t you just use induction? A: Because anti-induction has never worked before!”).
A town where nearly all people are hooked on maximally attention-grabbing and attention-keeping systems (maybe several of those, keeping people occupied in loops).
Post which someone should write (but I probably won’t get to soon): there is a lot of potential value in earning-to-give EA’s deeply studying the fields to which they donate. Two underlying ideas here:
When money is abundant, knowledge becomes a bottleneck
Being on a pareto frontier is sufficient to circumvent generalized efficient markets
The key idea of knowledge bottlenecks is that one cannot distinguish real expertise from fake expertise without sufficient expertise oneself. For instance, it takes a fair bit of understanding of AI X-risk to realize that “open-source AI” is not an obviously-net-useful strategy. Deeper study of the topic yields more such insights into which approaches are probably more (or less) useful to fund. Without any expertise, one is likely to be misled by arguments which are optimized (whether intentionally or via selection) to sound good to the layperson.
That takes us to the pareto frontier argument. If one learns enough/earns enough that nobody else has both learned and earned more, then there are potentially opportunities which nobody else has both the knowledge to recognize and the resources to fund. Generalized efficient markets (in EA-giving) are thereby circumvented; there’s potential opportunity for unusually high impact.
To really be a compelling post, this needs to walk through at least 3 strong examples, all ideally drawn from different areas, and spell out how the principles apply to each example.
Below is a graph from T-mobile’s 2016 annual report (on the second page). Does anything seem interesting/unusual about it?
I’ll give some space to consider before spoiling it.
...
...
...
Answer: that is not a graph of those numbers. Some clever person took the numbers, and stuck them as labels on a completely unrelated graph.
Yes, that is a thing which actually happened. In the annual report of an S&P 500 company. And apparently management considered this gambit successful, because the 2017 annual report doubled down on the trick and made it even more egregious: they added 2012 and 2017 numbers, which are even more obviously not on an accelerating growth path if you actually graph them. The numbers are on a very-clearly-decelerating growth path.
Now, obviously this is a cute example, a warning to be on alert when consuming information. But I think it prompts a more interesting question: why did such a ridiculous gambit seem like a good idea in the first place? Who is this supposed to fool, and to what end?
This certainly shouldn’t fool any serious investment analyst. They’ll all have their own spreadsheets and graphs forecasting T-mobile’s growth. Unless T-mobile’s management deeply and fundamentally disbelieves the efficient markets hypothesis, this isn’t going to inflate the stock price. Presumably shareholder elections for board seats, as well as the board itself, are also not dominated by people who are paying so little attention as to fall for such a transparent ploy.
It could just be that T-mobile’s management were themselves morons, or had probably-unrealistic models of just how moronic their investors were. Still, I’d expect competition (both market pressure and competition for control in shareholder/board meetings) to weed out that level of stupidity.
One more hypothesis: maybe this is simulacrum 3 bullshit. T-mobile is in the cellular business; they presumably have increasing returns to scale. More capital investment makes them more profitable, expectations of more profits draw in more investment; there’s potential for a self-fulfilling prophecy here. Investors want to invest if-and-only-if they expect other investors to invest. So, nobody actually has to be fooled by the graph; they just need to see that T-mobile is successfully pretending to pretend to have accelerating growth, and that’s enough to merit investment.
I’ve heard various people recently talking about how all the hubbub about artists’ work being used without permission to train AI makes it a good time to get regulations in place about use of data for training.
If you want to have a lot of counterfactual impact there, I think probably the highest-impact set of moves would be:
Figure out a technical solution to robustly tell whether a given image or text was used to train a given NN.
Bring that to the EA folks in DC. A robust technical test like that makes it pretty easy for them to attach a law/regulation to it. Without a technical test, much harder to make an actually-enforceable law/regulation.
In parallel, also open up a class-action lawsuit to directly sue companies using these models. Again, a technical solution to prove which data was actually used in training is the key piece here.
Model/generator behind this: given the active political salience, it probably wouldn’t be too hard to get some kind of regulation implemented. But by default it would end up being something mostly symbolic, easily circumvented, and/or unenforceable in practice. A robust technical component, plus (crucially) actually bringing that robust technical component to the right lobbyist/regulator, is the main thing which would make a regulation actually do anything in practice.
Edit-to-add: also, the technical solution should ideally be an implementation of some method already published in some academic paper. Then when some lawyer or bureaucrat or whatever asks what it does and how we know it works, you can be like “look at this Official Academic Paper” and they will be like “ah, yes, it does Science, can’t argue with that”.
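As one example of the kind of already-published technical component this could build on: loss-thresholding membership inference, in the spirit of Yeom et al. 2018. The sketch below is for a classifier (for generative models you would use the model’s likelihood of the sample instead); `model` and the data loaders are placeholders, and nothing here is a claim about how robust this test is in practice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_loss(model, x, y):
    # Per-example loss: members of the training set tend to have lower loss.
    return F.cross_entropy(model(x), y, reduction="none")

def likely_training_members(model, candidates, known_nonmembers, quantile=0.05):
    """Flag candidate examples whose loss is unusually low compared to data
    the model has definitely never seen."""
    nonmember_losses = torch.cat(
        [sample_loss(model, x, y) for x, y in known_nonmembers])
    threshold = torch.quantile(nonmember_losses, quantile)
    flags = [sample_loss(model, x, y) < threshold for x, y in candidates]
    return torch.cat(flags)
```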
Suppose I have a binary function f, with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions—i.e. for each of the 2^1,000,000 possible inputs x, we flipped a coin to determine the output f(x) for that particular input.
Now, suppose I know f, and I know all but 50 of the input bits—i.e. I know 999950 of the input bits. How much information do I have about the output?
Answer: almost none. For almost all such functions, knowing 999950 input bits gives us ~1/2^50 bits of information about the output. More generally, if the function has n input bits and we know all but k, then we have o(1/2^k) bits of information about the output. (That’s “little o” notation; it’s like big O notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.
Proof Sketch
With k input bits unknown, there are 2^k possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have 2^k independent coin flips. If m of those flips are 1, then we assign a probability of m/2^k that the output will be 1.
As long as 2^k is large, the Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely—i.e. m ≈ 2^k/2. The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean 1/2 and standard deviation ≈ 1/2^(k/2). So, the probability that the output will be 1 is roughly 1/2 ± 1/2^(k/2).
We can then plug that into Shannon’s entropy formula. Our prior probability that the output bit is 1 is 1/2, so we’re just interested in how much that ±1/2^(k/2) adjustment reduces the entropy. This works out to o(1/2^k) bits.
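For anyone who wants to sanity-check the scaling numerically, here’s a quick Monte Carlo sketch (assuming numpy): it estimates the expected information about the output for several values of k and prints it next to 1/2^k for comparison.

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(0)
for k in range(2, 16, 2):
    M = 2 ** k                                   # number of inputs consistent with what we know
    m = rng.binomial(M, 0.5, size=10_000)        # number of 1-outputs among them, one draw per random function
    info = 1 - binary_entropy(m / M).mean()      # expected entropy reduction, in bits (prior entropy is 1 bit)
    print(f"k={k:2d}  info={info:.2e}  1/2^k={1 / M:.2e}")
```

The two columns should track each other up to a constant factor of order one, i.e. the information falls off like 1/2^k.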
Why Is This Interesting?
One core idea of my work on abstraction is that noise very quickly wipes out almost all information; only some very-low-dimensional summary is relevant “far away”. This example shows that this sort of thing is not unusual, but rather “the default”: for almost all random functions, information drops off exponentially with the number of unknown bits. In a large system (i.e. a function with many inputs), ignorance of even just a few bits is enough to wipe out essentially-all information. That’s true even if we know the vast majority of the bits.
A good intuitive example of this is the “butterfly effect”: the flap of a butterfly’s wings could change the course of a future hurricane, because chaos. But there’s an awful lot of butterflies in the world, and the hurricane’s path is some complicated function of all of their wing-flaps (and many other variables too). If we’re ignorant of even just a handful of these flaps, then almost all of our information about the hurricane’s path is probably wiped out. And in practice, we’re ignorant of almost all the flaps. This actually makes it much easier to perform Bayesian reasoning about the path of the hurricane: the vast majority of information we have is basically-irrelevant; we wouldn’t actually gain anything from accounting for the butterfly-wing-flaps which we do know.
o(1/2^k) doesn’t vary with n—are you saying that it doesn’t matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant? That would be quite interesting if so (though I have some question about how likely the function is to be truly random from an even distribution of such functions).
One can enumerate all such 3-bit functions (8 different inputs, each input can return 0 or 1, so 256 functions, one per output-bit-pattern of the 8 possible inputs). But this doesn’t seem to follow your formula—if you have 3 unknown bits, that should give ~1⁄8 of a bit about the output, 2 unknown bits ~1⁄4, and 1 unknown bit ~1⁄2 a bit about the output. But in fact, the distribution of functions includes both 0 and 1 outputs for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.
Yes, that’s correct.
The claim is for almost all functions when the number of inputs is large. (Actually what we need is for 2^(# of unknown bits) to be large in order for the law of large numbers to kick in.) Even in the case of 3 unknown bits, we have 256 possible functions, and only 18 of those have less than 1⁄4 ones or more than 3⁄4 ones among their output bits.
Little o is just a tighter bound. I don’t know what you are referring to by your statement:
I’m not sure what context that link is assuming, but in an analysis context I typically see little o used in ways like e.g. “f(x) = f(x_0) + (df/dx)|_{x_0} dx + o(dx^2)”. The interpretation is that, as dx goes to 0, the o(dx^2) terms all fall to zero at least quadratically (i.e. there is some C such that C dx^2 upper bounds the o(dx^2) term once dx is sufficiently small). Usually I see engineers and physicists using this sort of notation when taking linear or quadratic approximations, e.g. for designing numerical algorithms.
I find it very helpful to get feedback on LW posts before I publish them, but it adds a lot of delay to the process. So, experiment: here’s a link to a google doc with a post I plan to put up tomorrow. If anyone wants to give editorial feedback, that would be much appreciated—comments on the doc are open.
I’m mainly looking for comments on which things are confusing, parts which feel incomplete or slow or repetitive, and other writing-related things; substantive comments on the content should go on the actual post once it’s up.
EDIT: it’s up. Thank you to Stephen for comments; the post is better as a result.
One second-order effect of the pandemic which I’ve heard talked about less than I’d expect:
This is the best proxy I found on FRED for new businesses founded in the US, by week. There was a mild upward trend over the last few years, but it’s really taken off lately. Not sure how much of this is kids who would otherwise be in college, people starting side gigs while working from home, people quitting their jobs and starting their own businesses so they can look after the kids, extra slack from stimulus checks, people losing their old jobs en masse but still having enough savings to start a business, …
For the stagnation-hypothesis folks who lament relatively low rates of entrepreneurship today, this should probably be a big deal.
How sure are you that the composition is interesting? How many of these are just quick mask-makers or sanitizer-makers, or just replacing restaurants that have now gone out of business? (ie very low-value-added companies, of the ‘making fast food in a stall in a Third World country’ sort of ‘startup’, which make essentially no or negative long-term contributions).
Good question. I haven’t seen particularly detailed data on these on FRED, but they do have separate series for “high propensity” business applications (businesses they think are likely to hire employees), business applications with planned wages, and business applications from corporations, as well as series for each state. The spike is smaller for planned wages, and nonexistent for corporations, so the new businesses are probably mostly single proprietors or partnerships. Other than that, I don’t know what the breakdown looks like across industries.
Somebody should post this on Paul Graham’s Twitter (I can’t); he would be very interested in it: https://mobile.twitter.com/paulg
Consider two claims:
Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
I expect that many people’s intuitive mental models around utility maximization boil down to “boo utility maximizer models”, and they would therefore intuitively expect both the above claims to be true at first glance. But on examination, the probable-incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details)
I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.
(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Expected Utility Maximization is Not Enough
Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let’s take this as a reasonable assumption).
No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let’s assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations).
So, my claim is not that modeling a system as an expected utility maximizer can’t be useful. Instead, I claim that this model is incomplete, at least with regard to the task of computing an update to the system such that, when we apply the update, the system becomes aligned.
Of course, you can model any system as an expected utility maximizer. But even if I can use the “high level” conceptual model of expected utility maximization to model the behavior of a system very well, behavior is not the only thing that we care about. We actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align it.
So the following seems to be beside the point unless I am <missing/misunderstanding> something:
Maybe I have missed that the claim you listed says expected utility maximization is not very useful. What I’m saying is that it can be useful; it just might not be sufficient to actually align a particular AGI system, even if you can do it arbitrarily well.
I am not an expert, but as I remember it, it was a claim that “any system that follows certain axioms can be modeled as maximizing some utility function”. The axioms assumed that there were no circular preferences—if someone prefers A to B, B to C, and C to A, it is impossible to define a utility function such that u(A) > u(B) > u(C) > u(A) -- and that if the system says that A > B > C, it can decide between e.g. a 100% chance of B, and a 50% chance of A with a 50% chance of C, again in a way that is consistent.
I am not sure how this works when the system is allowed to take current time into account, for example when it is allowed to prefer A to B on Monday but prefer B to A on Tuesday. I suppose that in such a situation any system can trivially be modeled by a utility function that at each moment assigns utility 1 to what the system actually did in that moment, and utility 0 to everything else.
Corrigibility is incompatible with assigning utility to everything in advance. A system that has preferences about the future will also have a preference about not having its utility function changed. (For the same reason people have a preference not to be brainwashed, or not to take drugs, even if after brainwashing they are happy about having been brainwashed, and after getting addicted they do want more drugs.)
A corrigible system would be like: “I prefer A to B at this moment, but if humans decide to fix me and make me prefer B to A, then I prefer B to A.” In other words, it doesn’t have values for u(A) and u(B), or it doesn’t always act according to those values. A consistent system that currently prefers A to B would prefer not to be fixed.
I think John’s 1st bullet point was referring to an argument you can find in https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior and related.
A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for the expected utility frame, and that’s the kind of system that usually appears in “any system can be modeled as maximizing some utility function”. So it’s not enough to maximize something once, or in a narrow collection of situations; the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large (corresponding to unreasonably incomplete information), all drawn across the same probability space.
One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility, even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), actions implied by it (rather than action-level decisions in expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.
So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that’s not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.
Completely off the cuff take:
I don’t think claim 1 is wrong, but it does clash with claim 2.
That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension), or, put another way, “whatever utility function it maximizes must be along multiple dimensions”.
Which seems to be pretty much what humans do: we have really complex utility functions, everything seems to be ever-changing, and we have some control over it ourselves (and sometimes that goes wrong and people end up maxing out a single dimension at the cost of everything else).
Note to self: Think more about this and if possible write up something more coherent and explanatory.
Everybody’s been talking about Paxlovid, and how ridiculous it is to both stop the trial since it’s so effective but also not approve it immediately. I want to at least float an alternative hypothesis, which I don’t think is very probable at this point, but does strike me as at least plausible (like, 20% probability would be my gut estimate) based on not-very-much investigation.
Early stopping is a pretty standard p-hacking technique. I start out planning to collect 100 data points, but if I manage to get a significant p-value with only 30 data points, then I just stop there. (Indeed, it looks like the Paxlovid study only had 30 actual data points, i.e. people hospitalized.) Rather than only getting “significance” if all 100 data points together are significant, I can declare “significance” if the p-value drops below the line at any time. That gives me a lot more choices in the garden of forking counterfactual paths.
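To make the optional-stopping effect concrete, here’s a toy simulation (assuming numpy and scipy, with purely illustrative numbers): under a true null effect, peeking at the p-value every 10 observations and stopping at the first p < 0.05 inflates the false-positive rate well above the nominal 5% of a fixed-n test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_max, alpha = 5_000, 100, 0.05
peeking_hits, fixed_n_hits = 0, 0
for _ in range(n_trials):
    data = rng.normal(0, 1, n_max)               # null: the true effect is exactly zero
    peeked = False
    for n in range(10, n_max + 1, 10):           # check the p-value after every 10 new observations
        if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
            peeked = True                        # "stop the trial early and declare significance"
            break
    peeking_hits += peeked
    fixed_n_hits += stats.ttest_1samp(data, 0).pvalue < alpha
print("false-positive rate with peeking:", peeking_hits / n_trials)
print("false-positive rate at fixed n:  ", fixed_n_hits / n_trials)
```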
Now, success rates on most clinical trials are not very high. (They vary a lot by area—most areas are about 15-25%. Cancer is far and away the worst, below 4%, and vaccines are the best, over 30%.) So I’d expect that p-hacking is a pretty large chunk of approved drugs, which means pharma companies are heavily selected for things like finding-excuses-to-halt-good-seeming-trials-early.
It was stopped after a pre-planned interim analysis; that means they’re calculating the stopping criteria/p-values with multiple testing correction built in, using sequential analysis.
Brief update on how it’s going with RadVac.
I’ve been running ELISA tests all week. In the first test, I did not detect stronger binding to any of the peptides than to the control in any of several samples from myself or my girlfriend. But the control itself was looking awfully suspicious, so I ran another couple tests. Sure enough, something in my samples is binding quite strongly to the control itself (i.e. the blocking agent), which is exactly what the control is supposed to not do. So I’m going to try out some other blocking agents, and hopefully get an actually-valid control group.
(More specifics on the test: I ran a control with blocking agent + sample, and another with blocking agent + blank sample, and the blocking agent + sample gave a strong positive signal while the blank sample gave nothing. That implies something in the sample was definitely binding to both the blocking agent and the secondary antibodies used in later steps, and that binding was much stronger than the secondary antibodies themselves binding to anything in the blocking agent + blank sample.)
In other news, the RadVac team released the next version of their recipe + whitepaper. Particularly notable:
Note that they’re talking specifically about serum (i.e. blood) antibodies here. So apparently injecting it does induce blood antibodies of the sort detectable by commercial tests (at least some of the time), but snorting it mostly just produces mucosal antibodies (also at least some of the time).
This is a significant update: most of my prior on the vaccine working was based on vague comments in the previous radvac spec about at least some people getting positive test results. But we didn’t know what kind of test results those were, so there was a lot of uncertainty about exactly what “working” looked like. In particular, we didn’t know whether antibodies were induced in blood or just mucus, and we didn’t know if they were induced consistently or only in some people (the latter of which is the “more dakka probably helps” world). Now we know that it’s mostly just mucus (at least for nasal administration). Still unsure about how consistently it works—the wording in the doc makes it sound like only some people saw a response, but I suspect the authors are just hedging because they know there’s both selection effects and a lot of noise in the data which comes back to them.
The latest version of the vaccine has been updated to give it a bit more kick—slightly higher dose, and the chitosan nanoparticle formula has been changed in a way which should make the peptides more visible to the immune system. Also, the list of peptides has been trimmed down a bit, so the latest version should actually be cheaper, though the preparation is slightly more complex.
I would expect that hedging also happens because making definitive clinical claims has more danger from the FDA than making hedged statements.
Neat problem of the week: researchers just announced roughly-room-temperature superconductivity at pressures around 270 GPa. That’s stupidly high pressure—a friend tells me “they’re probably breaking a diamond each time they do a measurement”. That said, pressures in single-digit GPa do show up in structural problems occasionally, so achieving hundreds of GPa scalably/cheaply isn’t that many orders of magnitude away from reasonable, it’s just not something that there’s historically been much demand for. This problem plays with one idea for generating such pressures in a mass-produceable way.
Suppose we have three materials in a coaxial wire:
innermost material has a low thermal expansion coefficient and high Young’s modulus (i.e. it’s stiff)
middle material is a thin cylinder of our high-temp superconducting concoction
outermost material has a high thermal expansion coefficient and high Young’s modulus.
We construct the wire at high temperature, then cool it. As the temperature drops, the innermost material stays roughly the same size (since it has low thermal expansion coefficient), while the outermost material shrinks, so the superconducting concoction is squeezed between them.
Exercises:
Find an expression for the resulting pressure in the superconducting concoction in terms of the Young’s moduli, expansion coefficients, temperature change, and dimensions of the inner and outer materials. (Assume the width of the superconducting layer is negligible, and the outer layer doesn’t break.)
Look up parameters for some common materials (e.g. steel, tungsten, copper, porcelain, aluminum, silicon carbide, etc), and compute the pressures they could produce with reasonable dimensions (assuming that their material properties don’t change too dramatically with such high pressures).
Find an expression for the internal tension as a function of radial distance in the outermost layer.
Pick one material, look up its tensile strength, and compute how thick it would have to be to serve as the outermost layer without breaking, assuming the superconducting layer is at 270 GPa.
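If you want a rough starting point rather than a full solution, here’s an order-of-magnitude sketch under some strong simplifying assumptions (rigid non-shrinking core, thin outer ring, a cooldown of about 1000 K, no Poisson or pressure-dependence effects): the hoop strain in the outer layer is roughly α·ΔT, the hoop stress roughly E·α·ΔT, and the interface pressure roughly E·α·ΔT·(t/r) by the thin-walled cylinder relation.

```python
# Simplifications (all assumptions): rigid non-shrinking core, thin outer ring,
# no Poisson effects, material properties independent of pressure, cooldown dT.
def squeeze_pressure_gpa(E_gpa, alpha_per_k, dT_k, t_over_r):
    hoop_stress = E_gpa * alpha_per_k * dT_k     # outer ring held open by the core: strain ~ alpha*dT
    return hoop_stress * t_over_r, hoop_stress   # thin-walled cylinder: P ~ sigma * t / r

materials = {
    # name: (Young's modulus in GPa, thermal expansion coefficient per K), room-temperature handbook values
    "steel":    (200, 12e-6),
    "aluminum": (70, 23e-6),
    "tungsten": (400, 4.5e-6),
}
for name, (E, alpha) in materials.items():
    P, sigma = squeeze_pressure_gpa(E, alpha, dT_k=1000, t_over_r=0.5)
    print(f"{name:9s} hoop stress ~{sigma:.1f} GPa, interface pressure ~{P:.1f} GPa")
```

Under these simplifications the numbers come out around 1 GPa, which is part of why the tensile-strength question in the last exercise matters.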
[Epistemic status: highly speculative]
Smoke from California/Oregon wildfires reaching the East Coast opens up some interesting new legal/political possibilities. The smoke is way outside state borders, all the way on the other side of the country, so that puts the problem pretty squarely within federal jurisdiction. Either a federal agency could step in to force better forest management on the states, or a federal lawsuit could be brought for smoke-induced damages against California/Oregon. That would potentially make it a lot more difficult for local homeowners to block controlled burns.
I had a shortform post pointing out the recent big jump in new businesses in the US, and Gwern replied:
This was a good question in context, but I disagree with Gwern’s model of where-progress-comes-from, especially in the context of small businesses.
Let’s talk ice-cream cones.
As the story goes, an ice-cream vendor was next door to a waffle vendor at the 1904 World’s Fair. At some point, the ice-cream vendor ran short on paper cups, and inspiration struck. He bought some thin waffles from the waffle vendor, rolled them into cones, and ice-cream cones took off.
That’s just the first step. From there, the cone spread memetically. People heard about it, and either asked for cones (on the consumer side) or tried making them (on the supplier side).
Insight + Memetics → Better Food
When I compare food today to the stuff my grandparents ate, there’s no comparison. Today’s dishes are head and shoulders better. Partly it’s insights like ice-cream cones, partly it’s memetic spread of dishes from more parts of the world (like sisig, soup dumplings, ropa vieja, chicken Karahi, …).
Those little fast-food stalls? They’re powerhouses of progress. It’s a hypercompetitive market, with low barriers to entry, and lots of repeat business. The conditions are ideal for trying out new dishes, spreading culinary ideas and finding out the hard way what people like to eat. That doesn’t mean they’re highly profitable—culinary innovation spreads memetically, so it’s hard to capture the gains. But progress is made.
The pandemic also has the effect of shaping the kind of business ideas people try. It pushes a lot of innovation in food delivery. Some of the pandemic-driven innovation will become worthless once the pandemic is over, but a few good ideas will likely survive, and the old ideas of the businesses that went out of business are still around.
So I saw the Taxonomy Of What Magic Is Doing In Fantasy Books and Eliezer’s commentary on ASC’s latest linkpost, and I have cached thoughts on the matter.
My cached thoughts start with a somewhat different question—not “what role does magic play in fantasy fiction?” (e.g. what fantasies does it fulfill), but rather… insofar as magic is a natural category, what does it denote? So I’m less interested in the relatively-expansive notion of “magic” sometimes seen in fiction (which includes e.g. alternate physics), and more interested in the pattern called “magic” which recurs among tons of real-world ancient cultures.
Claim (weakly held): the main natural category here is symbols changing the territory. Normally symbols represent the world, and changing the symbols just makes them not match the world anymore—it doesn’t make the world do something different. But if the symbols are “magic”, then changing the symbols changes the things they represent in the world. Canonical examples:
Wizard/shaman/etc draws magic symbols, speaks magic words, performs magic ritual, or even thinks magic thoughts, thereby causing something to happen in the world.
Messing with a voodoo doll messes with the person it represents.
“Sympathetic” magic, which explicitly uses symbols of things to influence those things.
Magic which turns emotional states into reality.
I would guess that most historical “magic” was of this type.
Weather just barely hit 80°F today, so I tried the Air Conditioner Test.
Three problems came up:
Turns out my laser thermometer is all over the map. Readings would change by 10°F if I went outside and came back in. My old-school thermometer is much more stable (and well-calibrated, based on dipping it in some ice water), but slow, and it caps out around 90°F (so I can’t use it to measure e.g. exhaust temp). I plan to buy a bunch more old-school thermometers for the next try.
I thought opening the doors/windows in rooms other than the test room and setting up a fan would be enough to make the temperature in the hall outside the test room close to outdoor temp. This did not work; hall temp was around 72°F with outside around 80°F. I’ll need to change that part of the experiment design; most likely I’ll seal around the door and let air infiltrate exclusively from the window instead. (The AC is right next to the window, so this could screw with the results, but I don’t really have a better option.)
In two-hose mode, the AC hit its minimum temperature of 60°F, so I’ll need a hotter day. I’ll try again when we hit at least 85°F.
In case anyone’s wondering: in one-hose mode, the temperature in the room equilibrated around 66°F. Power consumption was near-constant throughout all conditions.
One additional Strange Observation: cool air was blowing out under the door of the test room in two-hose mode. This should not happen; my best guess is that, even though the AC has two separate intake vents, the two are not actually partitioned internally, so the fan for indoor-air was pulling in outdoor-air (causing air to blow out under the door to balance that extra inflow). Assuming that’s the cause, it should be fixable with some strategically-placed cardboard inside the unit.
I’ve long been very suspicious of aggregate economic measures like GDP. But GDP is clearly measuring something, and whatever that something is it seems to increase remarkably smoothly despite huge technological revolutions. So I spent some time this morning reading up and playing with numbers and generally figuring out how to think about the smoothness of GDP increase.
Major takeaways:
When new tech makes something previously expensive very cheap, GDP mostly ignores it. (This happens in a subtle way related to how we actually compute real GDP: goods are weighted by their prices, so once something becomes cheap it gets almost no weight in the index.)
Historical GDP curves mainly measure things which are expensive ~now. Things which are cheap now are mostly ignored. In other words: GDP growth basically measures the goods whose production is revolutionized the least.
Re: AI takeoff, the right way to extrapolate today’s GDP curve to post-AI is to think about things which will still be scarce post-AI, and then imagine the growth of production of those things.
Even a very sharp, economically-revolutionary AI takeoff could look like slow smooth GDP growth, because GDP growth will basically only measure the things whose production is least revolutionized.
Why am I harping on about technicalities of GDP? Well, I hear about some AI forecasts which are heavily based on the outside view that economic progress (as measured by GDP) is smooth, and this is so robust historically that we should expect it to continue going forward. And I think this is basically right—GDP, as we actually compute it, is so remarkably smooth that we should expect that to continue. Alas, this doesn’t tell us very much about how crazy or sharp AI takeoff will be, because GDP (as we actually compute it) systematically ignores anything that’s revolutionized.
If you want a full post on this, upvote this comment.
In writing How much should we value life?, I spent some time digging into AI timeline stuff. It led me to When Will AI Be Created?, written by Luke Muehlhauser for MIRI. He noted that there is reason not to trust expert opinions on AI timelines, and that trend extrapolation may be a good alternative. This point you’re making about GDP seems like real progress towards coming up with a good way to do trend extrapolation, and thus seems worth a full post IMO. (Assuming it isn’t already well known by the community or something, which I don’t get the sense is the case.)
Upvoted, but I mostly trust you to write the post if it seems like there’s an interesting meaty thing worth saying.
Eh, these were the main takeaways, the post would just be more details and examples so people can see the gears behind it.
A similar point is made by Korinek in his review of Could Advanced AI Drive Explosive Economic Growth:
In general, Baumol-type effects (spending decreasing in sectors where productivity goes up) mean that we can have scenarios in which the economy is growing extremely fast on “objective” metrics like energy consumption, but GDP has stagnated because that energy is being spent on extremely marginal increases in goods being bought and sold.
Chrome is offering to translate the LessWrong homepage for me. Apparently, it is in Greek.
Huh, amusing. We do ship a font that has nothing but the Greek letter set in it, because people use Greek unicode symbols all the time and our primary font doesn’t support that character set. So my guess is that’s where Google gets confused.
Oh, I had just assumed it was commentary on the writing style/content.
If about 10% of articles have “Ω” in their title, what is the probability that the page is in Greek? :D
Someone should write a book review of The Design of Everyday Things aimed at LW readers, so I have a canonical source to link to other than the book itself.
Does anyone know of an “algebra for Bayes nets/causal diagrams”?
More specifics: rather than using a Bayes net to define a distribution, I want to use a Bayes net to state a property which a distribution satisfies. For instance, a distribution P[X, Y, Z] satisfies the diagram X → Y → Z if-and-only-if the distribution factors according to
P[X, Y, Z] = P[X] P[Y|X] P[Z|Y].
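For concreteness, here’s a minimal numerical sketch (assuming numpy) of what “satisfies the diagram” means for the chain X → Y → Z: rebuild the joint from the factorization and check that it matches.

```python
import numpy as np

def satisfies_chain(P, tol=1e-9):
    """Check P[X,Y,Z] (shape (2,2,2), sums to 1) against X -> Y -> Z."""
    Px  = P.sum(axis=(1, 2))
    Pxy = P.sum(axis=2)
    Py  = P.sum(axis=(0, 2))
    Pyz = P.sum(axis=0)
    Py_given_x = Pxy / Px[:, None]
    Pz_given_y = Pyz / Py[:, None]
    # Rebuild P[x] P[y|x] P[z|y] and compare to the actual joint.
    recon = Px[:, None, None] * Py_given_x[:, :, None] * Pz_given_y[None, :, :]
    return np.allclose(P, recon, atol=tol)

# Example: a distribution built to factor through the chain passes the check.
rng = np.random.default_rng(0)
Px = rng.dirichlet([1, 1])
Py_given_x = rng.dirichlet([1, 1], size=2)
Pz_given_y = rng.dirichlet([1, 1], size=2)
P = Px[:, None, None] * Py_given_x[:, :, None] * Pz_given_y[None, :, :]
print(satisfies_chain(P))  # True
```

Applying the same kind of check to the diagrams below is the brute-force way to verify an implication; the question is how to derive it purely at the level of diagrams.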
When using diagrams that way, it’s natural to state a few properties in terms of diagrams, and then derive some other diagrams they imply. For instance, if a distribution P[W, X, Y, Z] satisfies all of:
W → Y → Z
W → X → Y
X → (W, Y) → Z
… then it also satisfies W → X → Y → Z.
What I’m looking for is a set of rules for “combining diagrams” this way, without needing to go back to the underlying factorizations in order to prove things.
David and I have been doing this sort of thing a lot in our work the past few months, and it would be nice if someone else already had a nice write-up of the rules for it.
Putting this here for posterity: I have thought since the superconductor preprint went up, and continue to think, that the markets are putting generally too little probability on the claims being basically-true. I thought ~70% after reading the preprint the day it went up (and bought up a market on manifold to ~60% based on that, though I soon regretted not waiting for a better price), and my probability has mostly been in the 40-70% range since then.
After seeing the markets jump up in response to the latest, I think I’m more like 65-80%.
Languages should have tenses for spacelike separation. My friend and I do something in parallel, it’s ambiguous/irrelevant which one comes first, I want to say something like “I expect my friend <spacelike version of will do/has done/is doing> their task in such-and-such a way”.
That sounds more like a tenseless sentence than using a spacelike separation tense. Your friend’s performance of the task may well be in your future or past lightcone (or extend through both), but you don’t wish to imply any of these.
There are languages with tenseless verbs, as well as some with various types of spatial tense.
The closest I can approximate this in English without clumsy constructs is “I expect my friend does their task in such-and-such a way”, which I agree isn’t very satisfactory.
Who would have thought that someone would ever look at CSP and think “I want English to be more like that”?
lol
Future perfect (hey, that’s the name of the show!) seems like a reasonable hack for this in English
Two kinds of cascading catastrophes one could imagine in software systems...
A codebase is such a spaghetti tower (and/or coding practices so bad) that fixing a bug introduces, on average, more than one new bug. Software engineers toil away fixing bugs, making the software steadily more buggy over time.
Software services managed by different groups have dependencies—A calls B, B calls C, etc. Eventually, the dependence graph becomes connected enough and loopy enough that a sufficiently-large chunk going down brings down most of the rest, and nothing can go back up until everything else goes back up (i.e. there’s circular dependence/deadlock).
How could we measure how “close” we are to one of these scenarios going supercritical?
For the first, we’d need to have attribution of bugs—i.e. track which change introduced each bug. Assuming most bugs are found and attributed after some reasonable amount of time, we can then estimate how many bugs each bug fix introduces, on average.
(I could also imagine a similar technique for e.g. medicine: check how many new problems result from each treatment of a problem.)
For the second, we’d need visibility into codebases maintained by different groups, which would be easy within a company but much harder across companies. In principle, within a company, some kind of static analysis tool could go look for all the calls to APIs between services, map out the whole graph, and then calculate which “core” pieces could be involved in a catastrophic failure.
(Note that this problem could be mostly-avoided by intentionally taking down services occasionally, so engineers are forced to build around that possibility. I don’t think any analogue of this approach would work for the first failure-type, though.)
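As a sketch of the second measurement (assuming networkx, with a made-up call graph): once the A-calls-B edges are extracted, the circular-dependence clusters are just the strongly connected components with more than one service.

```python
import networkx as nx

# Hypothetical "A calls B" edges, e.g. extracted by a static-analysis pass.
calls = [
    ("auth", "users"), ("users", "billing"), ("billing", "auth"),   # a cycle
    ("frontend", "auth"), ("frontend", "search"), ("search", "users"),
]
G = nx.DiGraph(calls)
# Strongly connected components with more than one service are the clusters
# that can end up in circular dependence / deadlock if enough of them go down.
cores = [scc for scc in nx.strongly_connected_components(G) if len(scc) > 1]
print("circular-dependence clusters:", cores)
```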
I wish there were a fund roughly like the Long-Term Future Fund, but with an explicit mission of accelerating intellectual progress.
I mean, just to be clear, I am all in favor of intellectual progress. But doing so indiscriminately does sure seem a bit risky in this world of anthropogenic existential risks. Reminds me of my mixed feelings on the whole Progress Studies thing.
Yeah, I wouldn’t want to accelerate e.g. black-box ML. I imagine the real utility of such a fund would be to experiment with ways to accelerate intellectual progress and gain understanding of the determinants, though the grant projects themselves would likely be more object-level than that. Ideally the grants would be in areas which are not themselves very risk-relevant, but complicated/poorly-understood enough to generate generalizable insights into progress.
I think it takes some pretty specific assumptions for such a thing to increase risk significantly on net. If we don’t understand the determinants of intellectual progress, then we have very little ability to direct progress where we want it; it just follows whatever the local gradient is. With more understanding, at worst it follows the same gradient faster, and we end up in basically the same spot.
The one way it could net-increase risk is if the most likely path of intellectual progress leads to doom, and the best way to prevent doom is through some channel other than intellectual progress (like political action, for instance). Then accelerating the intellectual progress part potentially gives the other mechanisms (like political bodies) less time to react. Personally, though, I think a scenario in which e.g. political action successfully prevents intellectual progress from converging to doom (in a world where it otherwise would have) is vanishingly unlikely (like, less than one-in-a-hundred, maybe even less than one-in-a-thousand).
You might check out Donald Braben’s view: he says “transformative research” (i.e. fundamental results that create new fields and industries) is critical for the survival of civilization. He does not worry that transformative results might end civilization.
Here’s an interesting problem of embedded agency/True Names which I think would make a good practice problem: formulate what it means to “acquire” something (in the sense of “acquiring resources”), in an embedded/reductive sense. In other words, you should be able-in-principle to take some low-level world-model, and a pointer to some agenty subsystem in that world-model, and point to which things that subsystem “acquires” and when.
Some prototypical examples which an answer should be able to handle well:
Organisms (anything from bacteria to plant to animals) eating things, absorbing nutrients, etc.
Humans making money or gaining property.
...and how the brain figures this out and why it is motivated to do so. There are a lot of simple animals that apparently “try to control” resources or territory. How?
Drives to control resources occur everywhere. And your control of resources is closely related to your dominance in a dominance hierarchy. Which seems to be regulated in many animals by serotonin. See e.g. https://www.nature.com/articles/s41386-022-01378-2
What if physics equations were written like statically-typed programming languages?
(mass⋅length/time^2 : F) = (mass : m) (length/time^2 : a)
(mass/(length⋅time^2) : P) (length^3 : V) = (dimensionless : N) (mass⋅length^2/(time^2⋅temp) : R) (temp : T)
The math and physics worlds still use single-letter variable names for everything, decades after the software world realized that was extremely bad practice. This makes me pessimistic about the adoption of better notation practices.
Better? I doubt it. If physicists wrote equations the way programmers write code, a simple homework problem would easily fill ten pages.
Verboseness works for programmers because programmers rarely need to do anything more complicated with their code than run it—analogous to evaluating an expression, for a physicist or mathematician. Imagine if you needed to prove one program equivalent to another algebraically—i.e. a sequence of small transformations, with a record of intermediate programs derived along the way in order to show your work. I expect programmers subjected to such a use-case would quickly learn the virtues of brevity.
Yeah, I’m apparently not intelligent enough to do error-free physics/engineering calculations without relying on dimensional analysis as a debugging tool. I even came up with a weird, hack-y way to do that in computing environments like Excel and Cython, where flexible multiplicative types are not supported.
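For illustration, and not the above commenter’s actual Excel/Cython hack: runtime dimension-tracking can be as simple as carrying an exponent vector along with each value and refusing to add quantities whose dimensions differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Q:
    """A value tagged with exponents of (mass, length, time, temperature)."""
    value: float
    dims: tuple

    def __mul__(self, other):
        return Q(self.value * other.value,
                 tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        return Q(self.value / other.value,
                 tuple(a - b for a, b in zip(self.dims, other.dims)))

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Q(self.value + other.value, self.dims)

mass  = Q(2.0, (1, 0, 0, 0))    # kg
accel = Q(9.8, (0, 1, -2, 0))   # m/s^2
force = mass * accel            # dims (1, 1, -2, 0), i.e. mass*length/time^2
print(force)
# force + mass                  # raises TypeError: dimension mismatch
```

A statically-typed language can enforce the same bookkeeping at compile time, which is closer to what the type-annotated equations above gesture at.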
An interesting conundrum: one of the main challenges of designing useful regulation for AI is that we don’t have any cheap and robust way to distinguish a dangerous neural net from a non-dangerous net (or, more generally, a dangerous program from a non-dangerous program). This is an area where technical research could, in principle, help a lot.
The problem is, if there were some robust metric for how dangerous a net is, and that metric were widely known and recognized (as it would probably need to be in order to be used for regulatory purposes), then someone would probably train a net to maximize that metric directly.
This seems to lead to the solution of trying to make your metric one-way, in the sense that your metric should
Provide an upper-bound on the dangerousness of your network
Compress the space of networks which map to approximately the same dangerousness level on the low end of dangerousness, and expand the space of networks which map to approximately the same dangerousness level on the high end, so that you can train your network to minimize the metric; but when you train your network to maximize the metric, you end up in a degenerate area with technically very high measured danger levels but in actuality very low levels of dangerousness.
We can hope (or possibly prove) that as you optimize upwards on the metric you get subject to Goodhart’s curse, but the opposite occurs on the lower end.
Sure, even seems a bit tautological: any such metric, to be robust, would need to contain in itself a definition of a dangerously-capable AI, so you probably wouldn’t even need to train a model to maximize it. You’d be able to just lift the design from the metric directly.
Do you have any thoughts on a softer version of this problem, where the metric can’t be maximized directly, but gives a concrete idea of what sort of challenge your AI needs to beat to qualify as AGI? (And therefore in which direction in the architectural-design-space you should be moving.)
Some variation on this seems like it might work as a “fire alarm” test set, but as you point out, inasmuch as it’s recognized, it’ll be misapplied for benchmarking instead.
(I suppose the ideal way to do it would be to hand it off to e.g. ARC, so they can use it if OpenAI invites them for safety-testing again. This way, SOTA models still get tested, but the actors who might misuse it aren’t aware of the testing’s particulars until they succeed anyway...)
I just went looking for a good reference for the Kelly criterion, and didn’t find any on LessWrong. So, for anybody who’s looking: chapter 6 of Cover & Thomas’s information theory textbook (Elements of Information Theory) is the best source I currently know of.
Might be a good thing to add to the Kelly Criterion tag
Neat problem of the week: we have n discrete random variables, X_1, ..., X_n. Conditional on any one variable, all the variables are independent:
∀i: P[X | X_i] = ∏_j P[X_j | X_i]
Characterize the distributions which satisfy this requirement.
This problem came up while working on the theorem in this post, and (separately) in the ideas behind this post. Note that those posts may contain some spoilers for the problem, though frankly my own proofs on this one just aren’t very good.
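Not a solution, but a brute-force checker (assuming numpy, binary variables) is handy for exploring candidate distributions: for each i, condition on X_i and test whether the remaining variables are jointly independent.

```python
import numpy as np

def satisfies_property(P, tol=1e-9):
    """P is the joint array P[x_1, ..., x_n]; check that conditioning on any
    single X_i makes the remaining variables jointly independent."""
    n = P.ndim
    for i in range(n):
        for xi in range(P.shape[i]):
            sl = [slice(None)] * n
            sl[i] = xi
            cond = P[tuple(sl)]                   # joint with X_i fixed
            if cond.sum() == 0:
                continue
            cond = cond / cond.sum()              # conditional distribution
            recon = np.ones_like(cond)            # product of its marginals
            for j in range(n - 1):
                mj = cond.sum(axis=tuple(k for k in range(n - 1) if k != j))
                shape = [1] * (n - 1)
                shape[j] = len(mj)
                recon = recon * mj.reshape(shape)
            if not np.allclose(cond, recon, atol=tol):
                return False
    return True

P_indep = np.full((2, 2, 2), 1 / 8)                                     # independent fair coins
P_copy = np.zeros((2, 2, 2)); P_copy[0, 0, 0] = P_copy[1, 1, 1] = 0.5   # X1 = X2 = X3
print(satisfies_property(P_indep), satisfies_property(P_copy))          # True True
```

Both the fully independent example and the fully copied example pass the check.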
For short-term, individual cost/benefit calculations around C19, it seems like uncertainty in the number of people currently infected should drop out of the calculation.
For instance: suppose I’m thinking about the risk associated with talking to a random stranger, e.g. a cashier. My estimated chance of catching C19 from this encounter will be roughly proportional to N_infected. But, assuming we already have reasonably good data on the number hospitalized/died, my chances of hospitalization/death given infection will be roughly inversely proportional to N_infected. So, multiplying those two together, I’ll get a number roughly independent of N_infected.
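A toy calculation with made-up numbers shows the cancellation:

```python
# Made-up numbers; only the known hospitalization count is held fixed.
population = 10_000_000
hospitalized = 5_000                     # assumed reasonably well-measured
for N_infected in (50_000, 200_000, 800_000):
    p_stranger_infected = N_infected / population
    p_hosp_given_infection = hospitalized / N_infected
    # Ignoring the per-encounter transmission probability, a constant multiplier:
    print(N_infected, p_stranger_infected * p_hosp_given_infection)
```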
How general is this? Does some version of it apply to long-term scenarios too (possibly accounting for herd immunity)? What short-term decisions do depend on N_infected?