CS PhD student
Abhimanyu Pallavi Sudhir
I don’t really understand what problem this is solving. In my view the hard problems here are:
how do you define legal personhood for an entity without a typical notion of self/personhood (i.e. what Mitchell Porter said) or interests
how do you ensure the AIs keep their promise in a world where they can profit far more from breaking the contract than from whatever we offer them
Once you assume away the former problem and disregard the latter, you are of course only left with basic practical legal questions …
matter of taste for fiction, but objectively bad for technical writing
So I’m learning & writing on thermodynamics right now, and often there is a distinction between the “motivating questions”/”sources of confusion” and the actually important lessons you get from exploring them.
E.g. a motivating question is ”… and yet it scalds (even if you know the state of every particle in a cup of water)” and the takeaway from it is “your finger also has beliefs” or “thermodynamics is about reference/semantics”.
The latter might be a more typical section heading as it is correct for systematizing the topic, but it is a spoiler. Whereas the former is better for putting the reader in the right frame/getting them to think about the right questions to initiate their thinking.
I’m talking about technical writing/explanations of things.
An unfortunate thing about headings is that they are spoilers. I like the idea of a writing style where headings come at the end of sections rather than at the start. Or even a “starting heading” which is a motivating question and an “ending heading” which is the key insight discovered …
Analogous to a “reverse mathematics” style of writing, where motivation precedes the proofs/theory, which in turn precede the theorems.
edited to clarify: I’m talking about technical writing; I don’t care about fiction.
homomorphisms and entropy
One informal way to think of homomorphisms in math is that they are maps that do not “create information out of thin air”. Isomorphisms further do not destroy information. The terminal object (e.g. the trivial group, the singleton topological space, or the trivial vector space) is the “highest-entropy state”, where all distinctions disappear and reaching it is heat death.
-
Take, for instance, the group homomorphism $\phi: \mathbb{Z} \to \mathbb{Z}/4\mathbb{Z}$, $\phi(x) = x \bmod 4$. Before $\phi$ was applied, “1” and “5” were distinguished: 2 + 3 = 5 was correct, but 2 + 3 = 1 was wrong. Upon applying this homomorphism, this information disappears. However, no new information has been created; that is, no true indistinctions (equalities) have become false.
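To make this concrete, here is a tiny Python check (my own sketch) of the “no new information” property for the mod-4 quotient map above: equalities that held before $\phi$ still hold afterwards, even though some distinctions are collapsed.

```python
# Sketch: the quotient map phi(x) = x mod 4 as a group homomorphism Z -> Z/4Z.
# It may destroy distinctions (1 and 5 become equal) but never creates new ones:
# any equality that held before phi still holds after phi.

def phi(x: int) -> int:
    return x % 4

xs = range(-10, 11)

# No true equality becomes false: x == y implies phi(x) == phi(y).
assert all(phi(x) == phi(y) for x in xs for y in xs if x == y)

# But some distinctions disappear: 2 + 3 = 1 was false in Z, true in Z/4Z.
assert 2 + 3 != 1
assert phi(2 + 3) == phi(1)

# phi is a homomorphism: it commutes with addition.
assert all(phi(x + y) == (phi(x) + phi(y)) % 4 for x in xs for y in xs)
```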
-
Similarly in topology, “indistinction” is “arbitrary closeness”. Wiggle-room (aka “open sets”) is information; it cannot be created from nothing. If a set or sequence gets arbitrarily close to a point, it will always be arbitrarily close to that point after any continuous transformation.
-
There is no information-theoretical formalization of “indistinction” on these structures, because this notion is more general than information theory. In the category of measurable spaces, two points in the sample space are indistinct if they are not distinguished by any measurable set—and measurable functions are not allowed to create measurable sets out of nothing.
(there is also an alternate, maybe dual/opposite analogy I can make based on presentations—here, the highest-entropy state is the “free object”, e.g. a discrete topological space or free group, and each constraint (e.g. an imposed relation like $ab = ba$) is information—morphisms are “observations”. In this picture we see knowledge as encoded by identities rather than distinctions—we may express our knowledge as a presentation like $\langle g_1, \dots, g_n \mid r_1, \dots, r_m \rangle$, and morphisms cannot be concretely understood as functions on sets but rather show a tree of possible outcomes, like maybe you believe in Everett branches or whatever.)
In general if you postulate:
… you live on some object in a category
… time-evolution is governed by some automorphism
… you, the observer, have beliefs about your universe and keep forgetting some information (“coarse-graining the phase space”) --- i.e. your subjective phase space is also an object in that category, which undergoes homomorphisms
Then the second law is just a tautology. The second law we all know and love comes from taking the universe to be a symplectic manifold, and time-evolution as governed by symplectomorphisms. And the point of Liouville’s theorem is really to clarify/physically motivate what the Jaynesian “uniform prior” should be. Here is some more stuff, from Yuxi Liu’s statistical mechanics article:
In almost all cases, we use the uniform prior over phase space. This is how Gibbs did it, and he didn’t really justify it other than saying that it just works, and suggesting it has something to do with Liouville’s theorem. Now with a century of hindsight, we know that it works because of quantum mechanics: We should use the uniform prior over phase space, because phase space volume has a natural unit of measurement: $h^n$, where $h$ is Planck’s constant, and $2n$ is the dimension of phase space. As Planck’s constant is a universal constant, independent of where we are in phase space, we should weight all of the phase space equally, resulting in a uniform prior.
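To make the tautology above concrete, here is a toy Python sketch (my own illustration: a finite set stands in for the symplectic manifold, a random permutation for the symplectomorphism, and a crude bucketing for the observer’s coarse-graining). The observer’s entropy, measured as the log of the number of microstates consistent with what they remember, never decreases.

```python
import math
import random

# A finite "phase space", a bijective time-evolution map (the automorphism),
# and an observer who only remembers the macrostate (the coarse-graining
# homomorphism). The observer's entropy can then only go up or stay flat.

N = 12                      # microstates 0..N-1
perm = list(range(N))
random.shuffle(perm)        # a random bijection as "time evolution"

def evolve(states):         # push the knowledge set forward in time
    return {perm[s] for s in states}

def macro(s):               # coarse-graining: the observer only sees s // 3
    return s // 3

def forget(states):         # keep only what the remembered macrostates imply
    macros = {macro(s) for s in states}
    return {s for s in range(N) if macro(s) in macros}

knowledge = {0}             # start out knowing the exact microstate
entropy = [math.log(len(knowledge))]
for _ in range(10):
    knowledge = forget(evolve(knowledge))
    entropy.append(math.log(len(knowledge)))

# The "second law" here is a tautology: evolving is a bijection (size preserved)
# and forgetting only enlarges the set of possibilities.
assert all(a <= b for a, b in zip(entropy, entropy[1:]))
print(entropy)
```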
-
No; I mean a standard Bayesian network wouldn’t work for latents.
Bayesian networks support latent variables, and so allowing general Bayesian networks can be considered a strict generalization of allowing latent variables, as long as one remembers to support latent variables in the Bayesian network implementation.
Correct me if I’m wrong, but I believe this isn’t necessarily true.
The most general Bayesian network prediction market implementation I’m aware of is the SciCast team’s graphical-model market-maker. Say a trader bets up a latent variable $X$, and this correctly increases the probability of its child variables (which all resolve True).
Under your model you would (correctly, IMO) reward the trader for this, because you are scoring it for the impact it has on the resolved variables. But under their model, another trader can come and completely flip $X$, while also adjusting each conditional probability $P(\text{child} \mid X)$, without affecting the overall score of the model, but screwing over the first trader completely, because the first trader just owns some stocks which are now worth much less.
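A toy numerical sketch of that failure mode (made-up numbers; a single binary latent $X$ with one binary child $C$): two parameterizations that induce the same distribution over the observable child but opposite beliefs about the latent.

```python
# Two parameterizations of a network X -> C that induce the *same* distribution
# over the observable child C, but opposite beliefs about the latent X.

def marginal_c(p_x, p_c_given_x):
    """P(C = 1) for a binary latent X and binary child C."""
    return p_x * p_c_given_x[1] + (1 - p_x) * p_c_given_x[0]

# Trader 1 bets the latent up: P(X=1) = 0.9, and X strongly drives C.
params_1 = dict(p_x=0.9, p_c_given_x={1: 0.8, 0: 0.2})

# Trader 2 flips the latent and adjusts the conditionals accordingly.
params_2 = dict(p_x=0.1, p_c_given_x={1: 0.2, 0: 0.8})

# Same observable distribution, so the market's score on resolved children is
# unchanged -- but trader 1's shares in "X = 1" are now worth much less.
assert abs(marginal_c(**params_1) - marginal_c(**params_2)) < 1e-12
```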
Articles (or writing in general) are probably best structured as a Directed Acyclic Graph, rather than linearly. At each point in the article, there may be multiple possible lines to pursue, or “sidenotes”.
I say “directed acyclic graph” rather than “tree” because it may be natural to think of paths as joining back together at some point, especially if certain threads are optional.
One may also construct an “And-Or tree” to allow multiple versions of the article preferred by conflicting writers, which may then be voted on with some mechanism. These votes can be used to assign values to each vertex, and people can read the tree with their own search algorithm*.
A whole wiki may be constructed as one giant DAG, with each article as a sub-component.
*well, realistically nobody would actually just be following a search algorithm blindly/reading a linear article linearly (since straitjacketing yourself with prerequisites is never a good idea), but you know, as a general guide to structure.
(idea came from LLM conversations, which often take this form—of pursuing various lines of questioning then backtracking to a previous message)
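For concreteness, a rough sketch of what such a structure might look like (hypothetical schema and names, nothing standardized): sections as DAG nodes with vote-derived values, plus one possible “search algorithm” a reader might follow.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an article as a DAG of sections. Optional side-threads
# can rejoin the main line later, and reader votes give each vertex a value.

@dataclass
class Section:
    title: str
    text: str
    continuations: list[str] = field(default_factory=list)  # ids of possible next sections
    value: float = 0.0                                       # e.g. aggregated reader votes

article = {
    "intro":    Section("Intro", "...", continuations=["main", "sidenote"]),
    "sidenote": Section("Sidenote", "...", continuations=["main"]),  # optional thread rejoins
    "main":     Section("Main thread", "...", continuations=["end"]),
    "end":      Section("Conclusion", "..."),
}

def greedy_read(article, start):
    """One possible reading 'search algorithm': always follow the highest-valued edge."""
    node = start
    while True:
        yield article[node].title
        if not article[node].continuations:
            return
        node = max(article[node].continuations, key=lambda n: article[n].value)

print(list(greedy_read(article, "intro")))
```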
“What do you gain from smalltalk?” “I learned not to threaten to nuke countries.”
Lmao, amazing.
we’ll elide all of the subtle difficulties involved in actually getting RL to work in practice
I haven’t properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.
The RL setup itself is straightforward, right? An MDP where $S$ is the space of strings, $A$ is the set of strings of fewer than $n$ tokens, the transition is deterministic concatenation, $P(s' \mid s, a) = \mathbb{1}[s' = \mathrm{append}(s, a)]$, and reward is given to states with a stop token based on some ground-truth verifier like unit tests or formal verification.
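For concreteness, a minimal sketch of that MDP (the names, stop token, and verifier interface are placeholders):

```python
# States are strings, actions are strings of at most MAX_ACTION_TOKENS tokens,
# transitions are deterministic concatenation, and reward comes from a
# ground-truth verifier (e.g. unit tests) once a stop token appears.

STOP = "<eos>"
MAX_ACTION_TOKENS = 16   # the "n" above

def step(state: str, action: str, verifier) -> tuple[str, float, bool]:
    assert len(action.split()) <= MAX_ACTION_TOKENS   # whitespace split as a stand-in tokenizer
    next_state = state + action              # P(s'|s,a) is a point mass on append(s, a)
    done = STOP in next_state
    reward = float(verifier(next_state)) if done else 0.0   # e.g. unit tests pass / proof checks
    return next_state, reward, done
```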
The third virtue of rationality, lightness, is wrong. In fact: the more you value information to change your mind on some question, the more obstinate you should be about changing your mind on that question. Lightness implies disinterest in the question.
Imagine your mind as a logarithmic market-maker which assigns some initial subsidy $b$ to any new question $Q$. This subsidy parameter captures your marginal value for information on $Q$. But it also measures how hard it is to change your mind: the cost of moving your probability on $Q$ from $p$ to $q$ is $b \, D_{\mathrm{KL}}(q \,\|\, p)$.
What would this imply in practice? It means that each individual “trader” (both internal mental heuristics/thought patterns, and external sources of information/other people) will generally have a smaller influence on your beliefs, as they may not have enough wealth. Traders who do influence your beliefs will carry greater risk (to their influence on you in future), though they will also earn more reward if they’re right.
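A toy binary-LMSR sketch of this (my own code, illustrative numbers): both the expected subsidy you pay out for the information and the cash a trader must put up to move your probability scale linearly with $b$.

```python
import math

# Binary LMSR with subsidy/liquidity parameter b. Larger b = a more "obstinate"
# belief: more wealth is needed to move the probability, and more subsidy is
# expected to be paid out to whoever moves it correctly.

def kl(q, p):
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def shares(p, b):                        # net "yes" shares outstanding at probability p
    return b * math.log(p / (1 - p))     # (b times the log-odds)

def cash_to_move(p, q, b):               # trader's cash outlay to move p -> q
    C = lambda s: b * math.log(1 + math.exp(s / b))   # LMSR cost function (q_no fixed at 0)
    return C(shares(q, b)) - C(shares(p, b))

p, q = 0.5, 0.9
for b in (1, 10, 100):
    print(b, round(b * kl(q, p), 3), round(cash_to_move(p, q, b), 3))
```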
I don’t understand. The hard problem of alignment/CEV/etc. is that it’s not obvious how to scale intelligence while “maintaining” utility function/preferences, and this still applies for human intelligence amplification.
I suppose this is fine if the only improvement you can expect beyond human-level intelligence is “processing speed”, but I would expect superhuman AI to be more intelligent in a variety of ways.
Something that seems like it should be well-known, but I have not seen an explicit reference for:
Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning in Multi-Agent Systems)
—aka “The enemy is smart.”
Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.
This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either; it just requires them to be able to constantly throw adversarial examples at the agent, forcing it to “generalize”.
In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).
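A toy illustration of the claimed difference (entirely made-up task and names): a fixed test suite can be Goodharted by a lookup table, while an opponent that keeps supplying fresh cases forces a policy that actually generalizes.

```python
import random

# Static objective: a fixed test suite for "return the maximum of a list".
STATIC_TESTS = [([3, 1, 2], 3), ([5, 5], 5), ([0, -1], 0)]
LOOKUP = {tuple(xs): y for xs, y in STATIC_TESTS}

def memorizer(xs):           # Goodharts the static objective
    return LOOKUP.get(tuple(xs), 0)

def generalizer(xs):         # actually solves the task
    return max(xs)

def static_score(policy):
    return sum(policy(xs) == y for xs, y in STATIC_TESTS)

def adversarial_score(policy, rounds=100):
    # A stand-in adversary: it need not "improve" the objective, only keep
    # producing cases the policy hasn't seen. (A real adversary would search
    # for failures; random fresh cases are enough for this toy.)
    score = 0
    for _ in range(rounds):
        xs = [random.randint(-10, 10) for _ in range(random.randint(1, 5))]
        score += policy(xs) == max(xs)
    return score

print(static_score(memorizer), static_score(generalizer))            # 3 vs 3: proxy is Goodharted
print(adversarial_score(memorizer), adversarial_score(generalizer))  # low vs 100
```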
I think this only holds for particular reward functions, such as in multi-agent/co-operative environments (agents can include humans, like in RLHF) or in actually interactive proving environments?
Yes, I also realized that “ideas” being a thing is due to bounded rationality—specifically they are the outputs of AI search. “Proofs” are weirder though, and I haven’t seen them distinguished very often. I wonder if this is a reasonable analogy to make:
Ideas: search
Answers: inference
Proofs: alignment
There is a cliche that there are two types of mathematicians: “theory developers” and “problem solvers”. Similarly Dyson’s “birds and frogs”, and Robin Hanson divides the production of knowledge into “framing” and “filling”.
It seems to me there are actually three sorts of information in the world:
“Ideas”: math/science theories and models, inventions, business ideas, solutions to open-ended problems
“Answers”: math theorems, experimental observations, results of computations
“Proofs”: math proofs, arguments, evidence, digital signatures, certifications, reputations, signalling
From a strictly Bayesian perspective, there seems to be no “fundamental” difference between these forms of information. They’re all just things you condition your prior on. Yet this division seems to be natural in quite a variety of informational tasks. What gives?
adding this from replies for prominence--
Yes, I also realized that “ideas” being a thing is due to bounded rationality—specifically they are the outputs of AI search. “Proofs” are weirder though, and I haven’t seen them distinguished very often. I wonder if this is a reasonable analogy to make:
Ideas: search
Answers: inference
Proofs: alignment
Just realized in logarithmic market scoring the net number of stocks is basically just log-odds, lol: with liquidity parameter $b$ and market probability $p$, $q_{\text{yes}} - q_{\text{no}} = b \ln \frac{p}{1-p}$.
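For the binary case this is immediate from the standard LMSR softmax price (sketch, with $q_{\text{yes}}, q_{\text{no}}$ the outstanding share counts and $b$ the liquidity parameter):

\[
p = \frac{e^{q_{\text{yes}}/b}}{e^{q_{\text{yes}}/b} + e^{q_{\text{no}}/b}}
\;\Longrightarrow\;
\frac{p}{1-p} = e^{(q_{\text{yes}} - q_{\text{no}})/b}
\;\Longrightarrow\;
q_{\text{yes}} - q_{\text{no}} = b \ln \frac{p}{1-p}.
\]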
Your claims about markets seem just wrong to me. Markets generally do what their consumers want, and their failures are largely the result of transaction costs. Some of these transaction costs have to do with information asymmetry (which needs to be solved), but many others that show up in the real world (related to standard problems like negative externalities etc.) can just be removed by construction in virtual markets.
Markets are fundamentally driven by the pursuit of defined rewards or currencies, so in such a system, how do we ensure that the currency being optimized for truly captures what we care about
By having humans be the consumers in the market. Yes, it is possible to “trick” the consumers, but the idea is that if any oversight protocol is possible at all, then the consumers will naturally buy information from there, and AIs will learn to expect this changing reward function.
MIRI has been talking about it for years; the agent foundations group has many serious open problems related to it.
Can you send me a link? The only thing on “markets in an alignment context” I’ve found on this from the MIRI side is the Wentworth-Soares discussion, but that seems like a very different issue.
it can be confidently known now that the design you proposed is catastrophically misaligned
Can you send me a link for where this was confidently shown? This is a very strong claim to make, nobody even makes this claim in the context of backprop.
I’m saying that, for a sufficiently advanced AI, the expected value of its best non-compliant option will always be far, far greater than the expected value of its best compliant action.