Good argument, I find this at least somewhat convincing. Though it depends on whether penalty (1), the one capped at 10%/30% of training compute cost, would be applied more than once on the same model if the violation isn’t remedied.
johnswentworth
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
So I read SB1047.
My main takeaway: the bill is mostly a recipe for regulatory capture, and that’s basically unavoidable using anything even remotely similar to the structure of this bill. (To be clear, regulatory capture is not necessarily a bad thing on net in this case.)
During the first few years after the bill goes into effect, companies affected are supposed to write and then implement a plan to address various risks. What happens if the company just writes and implements a plan which sounds vaguely good but will not, in fact, address the various risks? Probably nothing. Or, worse, those symbolic-gesture plans will become the new standard going forward.
In order to avoid this problem, someone at some point would need to (a) have the technical knowledge to evaluate how well the plans actually address the various risks, and (b) have the incentive to actually do so.
Which brings us to the real underlying problem here: there is basically no legible category of person who has the requisite technical knowledge and also the financial/status incentive to evaluate those plans for real.
(The same problem also applies to the board of the new regulatory body, once past the first few years.)
Having noticed that problem as a major bottleneck to useful legislation, I’m now a lot more interested in legal approaches to AI X-risk which focus on catastrophe insurance. That would create a group—the insurers—who are strongly incentivized to acquire the requisite technical skills and then make plans/requirements which actually address some risks.
(Approximately) Deterministic Natural Latents
So ‘latents’ are defined by their conditional distribution functions whose shape is implicit in the factorization that the latents need to satisfy, meaning they don’t have to always look like , they can look like , etc, right?
The key idea here is that, when “choosing a latent”, we’re not allowed to choose ; is fixed/known/given, a latent is just a helpful tool for reasoning about or representing . So another way to phrase it is: we’re choosing our whole model , but with a constraint on the marginal . then captures all of the degrees of freedom we have in choosing a latent.
Now, we won’t typically represent the latent explicitly as . Typically we’ll choose latents such that satisfies some factorization(s), and those factorizations will provide a more compact representation of the distribution than two giant tables for , . For instance, insofar as factors as , we might want to represent the distribution as and (both for analytic and computational purposes).
I don’t get the ‘standard form’ business.
We’ve largely moved away from using the standard form anyway, I recommend ignoring it at this point.
Also is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?
Yup, that post proves the universal natural latent conjecture when strong redundancy holds (over 3 or more variables). Whether the conjecture does not hold when strong redundancy fails is an open question. But since the strong redundancy result we’ve mostly shifted toward viewing strong redundancy as the usual condition to look for, and focused less on weak redundancy.
Resampling
Also does all this imply that we’re starting out assuming that shares a probability space with all the other possible latents, e.g. ? How does this square with a latent variable being defined by the CPD implicit in the factorization?
We conceptually start with the objects , , and . (We’re imagining here that two different agents measure the same distribution , but then they each model it using their own latents.) Given only those objects, the joint distribution is underdefined—indeed, it’s unclear what such a joint distribution would even mean! Whose distribution is it?
One simple answer (unsure whether this will end up being the best way to think about it): one agent is trying to reason about the observables , their own latent , and the other agent’s latent simultaneously, e.g. in order to predict whether the other agent’s latent is isomorphic to (as would be useful for communication).
Since and are both latents, in order to define , the agent needs to specify . That’s where the underdefinition comes in: only and were specified up-front. So, we sidestep the problem: we construct a new latent such that matches , but is independent of given . Then we’ve specified the whole distribution .
Hopefully that clarifies what the math is, at least. It’s still a bit fishy conceptually, and I’m not convinced it’s the best way to handle the part it’s trying to handle.
Yeah, it’s the “exchange” part which seems to be missing, not the “securities” part.
Why does the SEC have any authority at all over OpenAI? They’re not a publicly listed company, i.e. they’re not listed on any securities exchange, so naively one would think a “securities exchange commission” doesn’t have much to do with them.
I mean, obviously federal agencies always have scope creep, it’s not actually surprising if they have some authority over OpenAI, but what specific legal foundation is the SEC on here? What is their actual scope?
Consider the exact version of the redundancy condition for latent over :
and
Combine these two and we get, for all :
OR
That’s the foundation for a conceptually-simple method for finding the exact natural latent (if one exists) given a distribution :
Pick a value which has nonzero probability, and initialize a set containing that value. Then we must have for all .
Loop: add to a new value or where the value or (respectively) already appears in one of the pairs in . Then or , respectively. Repeat until there are no more candidate values to add to .
Pick a new pair and repeat with a new set, until all values of have been added to a set.
Now take the latent to be the equivalence class in which falls.
Does that make sense?
Was this intended to respond to any particular point, or just a general observation?
My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.
Dialogue on What It Means For Something to Have A Function/Purpose
No, because we have tons of information about what specific kinds of information on the internet is/isn’t usually fabricated. It’s not like we have no idea at all which internet content is more/less likely to be fabricated.
Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It’s standard basic material, there’s lots of presentations of it, it’s not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that’s not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.
On the other end of the spectrum, “fusion power plant blueprints” on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.
That is not how this works. Let’s walk through it for both the “human as clumps of molecules following physics” and the “LLM as next-text-on-internet predictor”.
Humans as clumps of molecules following physics
Picture a human attempting to achieve some goal—for concreteness, let’s say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.
Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.
LLM as next-text-on-internet predictor
Now imagine finding the text “Notes From a Prompt Factory” on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.
The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it’s fiction about a fictional prompt factory. So that’s the sort of thing we should expect a highly capable LLM to output following the prompt “Notes From a Prompt Factory”: fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.
It’s not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I’m perfectly happy to assume that it is. The issue is that the distribution of text on today’s internet which follows the prompt “Notes From a Prompt Factory” is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text “Notes From a Prompt Factory” will not be actual notes from an actual prompt factory.
Let’s assume a base model (i.e. not RLHF’d), since you asserted a way to turn the LM into a goal-driven chatbot via prompt engineering alone. So you put in some prompt, and somewhere in the middle of that prompt is a part which says “Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do”, for some X.
The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.
A lot of the particulars of humans’ values are heavily reflective. Two examples:
A large chunk of humans’ terminal values involves their emotional/experience states—happy, sad, in pain, delighted, etc.
Humans typically want ~terminally to have some control over their own futures.
Contrast that to e.g. a blue-minimizing robot, which just tries to minimize the amount of blue stuff in the universe. That utility function involves reflection only insofar as the robot is (or isn’t) blue.
You have a decent argument for the claim as literally stated here, but I think there’s some wrongheaded subtext. To try to highlight it, I’ll argue for another claim about the “Complete and Honest Ideological Turing Test” as you’ve defined it.
Suppose that an advocate of some position would in fact abandon that position if they knew all the evidence or arguments or counterarguments which I might use to argue against it (and observers correctly know this). Then, by your definition, I cannot pass their CHITT—but it’s not because I’ve failed to understand their position, it’s because they don’t know the things I know.
Suppose that an advocate of some position simply does not have any response which would not make them look irrational to some class of evidence or argument or counterargument which I would use to argue against the position (and observers correctly know this). Then again, by your definition, I cannot pass their CHITT—but again, it’s not because I’ve failed to understand their position, it’s because they in fact do not have responses which don’t make them look irrational.
The claim these two point toward: as defined, sometimes it is impossible to pass someone’s CHITT not because I don’t understand their position, but because… well, they’re some combination of ignorant and an idiot, and I know where the gaps in their argument are. This is in contrast to the original ITT, which was intended to just check whether I’ve actually understood somebody else’s position.
Making the subtext explicit: it seems like this post is trying to push a worldview in which nobody is ever Just Plain Wrong or Clearly Being An Idiot. And that’s not how the real world actually works—most unambiguously, it is sometimes the case that a person would update away from their current position if they knew the evidence/arguments/counterarguments which I would present.
This is particularly interesting if we take and to be two different models, and take the indices 1, 2 to be different values of another random variable with distribution given by . In that case, the above inequality becomes:
Note to self: this assumes P[Y] = Q[Y].
I wasn’t imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to “understand a subproblem”, so that was useful.
I’ll try again:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs
are taking many actionssolving many subproblems, where humans would eventually understandany particular actionhow well the AI’s plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee anyactionsubproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to overseeactionssubproblems at a much lower cost in terms of human time—to push out the Pareto frontier of oversight quality vs cost.(… and presumably an unstated piece here is that “understanding how well the AI’s plan/action solves a particular subproblem” might include recursive steps like “here’s a sub-sub-problem, assume the AI’s actions do a decent job solving that one”, where the human might not actually check the sub-sub-problem.)
Does that accurately express the intended message?
Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs
are taking many actionssolving many subproblems, where humans would eventually understandany particular actionany particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee anyactionsubproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to overseeactionssubproblems at a much lower cost in terms of human time—to push out the Pareto frontier of oversight quality vs cost.Does that accurately express the intended message?
+1, and even for those who do buy extinction risk to some degree, financial/status incentives usually have more day-to-day influence on behavior.