scottviteri

Karma: 206

scottviteri Jan 23, 2025, 8:24 AM
1 point
1 vote
Overall karma indicates overall quality.
0
0 votes
Agreement karma indicates agreement, separate from overall quality.
on: Mechanistically Eliciting Latent Behaviors in Language Models
I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.

Have you tried/considered modifying the c_fc (the MLP encoder layer) bias instead of the c_proj (the MLP decoder layer) bias? I don’t know about this context, but (i) c_fc makes more intuitive sense to me as a location to change, (ii) I have seen more success playing with it in the past than with c_proj, and (iii) they are not equivalent because of the non-linearity between them.
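To make point (iii) concrete, here is a minimal sketch, assuming a GPT-2-style MLP block (c_fc, then GELU, then c_proj). The toy weights, inputs, and the helper names `mlp` and `shift` are made up for illustration. A bias added at c_proj shifts the output by the same vector for every input, while a bias added at c_fc passes through the non-linearity, so its effect depends on the input.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(x, W_fc, b_fc, W_proj, b_proj):
    # c_fc (encoder) -> non-linearity -> c_proj (decoder)
    hidden = [gelu(h + b) for h, b in zip(matvec(W_fc, x), b_fc)]
    return [o + b for o, b in zip(matvec(W_proj, hidden), b_proj)]

# Hypothetical toy weights: d_model = 2, d_hidden = 3.
W_FC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_PROJ = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
B_FC = [0.0, 0.0, 0.0]
B_PROJ = [0.0, 0.0]

def shift(x, d_fc, d_proj):
    # Change in the MLP output caused by adding d_fc / d_proj to the two biases.
    base = mlp(x, W_FC, B_FC, W_PROJ, B_PROJ)
    b_fc = [b + d for b, d in zip(B_FC, d_fc)]
    b_proj = [b + d for b, d in zip(B_PROJ, d_proj)]
    steered = mlp(x, W_FC, b_fc, W_PROJ, b_proj)
    return [s - b for s, b in zip(steered, base)]

x1, x2 = [1.0, -1.0], [-2.0, 0.5]
proj_shift_1 = shift(x1, [0.0, 0.0, 0.0], [0.5, -0.5])
proj_shift_2 = shift(x2, [0.0, 0.0, 0.0], [0.5, -0.5])
fc_shift_1 = shift(x1, [1.0, 0.0, 0.0], [0.0, 0.0])
fc_shift_2 = shift(x2, [1.0, 0.0, 0.0], [0.0, 0.0])
```

In this toy setup the two `proj_shift_*` vectors agree, while the two `fc_shift_*` vectors differ, which is the sense in which the two locations are not interchangeable.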

I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector back onto the sphere, but have you tried using cosine distance as the loss function so there is less incentive for R to naturally blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z), Z_{\ell_{\text{target}}, i, t}(0))$ in $\max_{z} D(z)$.

When you do iterative search for the next steering vectors, I do not expect constraining the search to the subspace orthogonal to previously found steering vectors to be very helpful, since the orthogonal vectors might very well be mapped into the same downstream part of latent space. Since the memory demands are quite cheap for learning steering vectors, I would be interested in seeing an objective which learned a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning $K$ vectors simultaneously.
$\max_{z_1, \dots, z_K} \sum_{1 \leq k < k' \leq K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z_k), Z_{\ell_{\text{target}}, i, t}(z_{k'}))$

But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term which encourages staying close to the unsteered activations, while staying far from the other steered activations.
$\max_{z_1, \dots, z_K} \sum_{1 \leq k < k' \leq K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z_k), Z_{\ell_{\text{target}}, i, t}(z_{k'})) - λ \sum_{k=1}^{K} D(z_k)$
Lastly, I would be interested in seeing the objective computed on the final output probability distribution over tokens instead of at $\ell_{\text{target}}$, using KL divergence as the distance, since in that domain we can extract very fine-grained information from the model’s activations. Let $D^{kl}(z) = \sum_{i=1}^{n} \sum_{t \in I_i} KL(Z_{\ell_{\text{unembed}}, i, t}(z) \,||\, Z_{\ell_{\text{unembed}}, i, t}(0))$ in
$\max_{z_1, \dots, z_K} \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_i} KL(Z_{\ell_{\text{unembed}}, i, t}(z_k) \,||\, Z_{\ell_{\text{unembed}}, i, t}(z_{k'})) - λ \sum_{k=1}^{K} D^{kl}(z_k)$
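
The objectives in this comment share one shape: a sum of pairwise distances between steered activations, minus λ times each steered run's distance from the unsteered run. Below is a minimal sketch of just the scoring step, assuming the activations (or output distributions) have already been computed and are stored as one vector per (i, t) position. The function names and data layout are hypothetical, and no optimization over the steering vectors is shown.

```python
import math

def cos_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def kl(p, q):
    # KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def summed_dist(acts_a, acts_b, dist):
    # Sum of distances over all (i, t) positions.
    return sum(dist(a, b) for a, b in zip(acts_a, acts_b))

def diversity_objective(steered, unsteered, dist, lam, ordered_pairs=False):
    # steered[k] holds the vectors produced under steering vector z_k.
    # ordered_pairs=False sums over k < k' (symmetric distances such as cos_dist);
    # ordered_pairs=True sums over all k != k' (asymmetric distances such as kl,
    # where the k == k' terms would be zero anyway).
    K = len(steered)
    pairwise = 0.0
    for k in range(K):
        for k2 in range(K):
            if k2 == k or (not ordered_pairs and k2 < k):
                continue
            pairwise += summed_dist(steered[k], steered[k2], dist)
    penalty = sum(summed_dist(s, unsteered, dist) for s in steered)
    return pairwise - lam * penalty

# Toy data: K = 2 steering vectors, a single (i, t) position.
cos_score = diversity_objective(
    steered=[[[1.0, 0.0]], [[0.0, 1.0]]],
    unsteered=[[1.0, 1.0]],
    dist=cos_dist,
    lam=0.5,
)
kl_score = diversity_objective(
    steered=[[[0.5, 0.5]], [[0.9, 0.1]]],
    unsteered=[[0.5, 0.5]],
    dist=kl,
    lam=0.0,
    ordered_pairs=True,
)
```

Raising `lam` trades diversity between steered runs against staying close to the unsteered activations, which is the tunable term described above.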

scottviteri Nov 25, 2024, 4:18 AM
4 points
3 votes
0
0 votes
in reply to: StrivingForLegibility’s comment on: The Geometric Expectation
Very interesting! I’m excited to read your post.

scottviteri Apr 16, 2024, 7:50 PM
1 point
1 vote
0
0 votes
in reply to: scottviteri’s comment on: «Boundaries», Part 3a: Definitions
I take back the part about pi and update determining the causal structure, because many causal diagrams are consistent with the same poly diagram.

scottviteri Apr 13, 2024, 10:54 PM
2 points
3 votes
0
0 votes
in reply to: scottviteri’s comment on: The Geometric Expectation
I think what is going on here is that both $\nabla^{*}$ and $G$ are of the form $(e^{\land}) \circ g \circ \ln$ with $g = \nabla$ and $g = E$, respectively. Let’s define the star operator as $g^{*} = (e^{\land}) \circ g \circ \ln$. Then $(f \circ g)^{*} = (e^{\land}) \circ (f \circ g) \circ \ln = (e^{\land}) \circ f \circ \ln \circ (e^{\land}) \circ g \circ \ln = f^{*} \circ g^{*}$, since $\ln \circ (e^{\land})$ is the identity and function composition is associative. Further, if $f$ and $g$ commute, then so do $f^{*}$ and $g^{*}$: $g^{*} \circ f^{*} = (g \circ f)^{*} = (f \circ g)^{*} = f^{*} \circ g^{*}$.
So the commutativity of the geometric expectation and derivative falls directly out of their representation as $E^{*}$ and $\nabla^{*}$, respectively, by commutativity of $E$ and $\nabla$, as long as they are over different variables.

We can also derive what happens when the expectation and gradient are over the same variables: $(\nabla_{θ} \circ E_{x \sim P_{θ}(x)})^{*}$. First, notice that $(* k)^{*}(x) = e^{k * \ln x} = e^{\ln x * k} = x^{k}$, so $(* k)^{*} = (^{\land} k)$. Also $(+ k)^{*}(x) = e^{k + \ln(x)} = e^{k} e^{\ln(x)} = x e^{k} ⟹ (+ k)^{*} = (* e^{k})$.
Now let’s expand the composition of the gradient and expectation, for $f$ that does not itself depend on $θ$. $(\nabla_{θ} \circ E_{x \sim P_{θ}(x)})(f(x)) = \nabla_{θ} \int P_{θ}(x) f(x) dx = E_{x \sim P_{θ}(x)}[\nabla_{θ}(f(x) \ln P_{θ}(x))]$, using the log-derivative trick. So $\nabla_{θ} \circ E_{x \sim P_{θ}(x)} = E_{x \sim P_{θ}(x)} \circ \nabla_{θ} \circ (* \ln P_{θ}(x))$.
Therefore, $\nabla_{θ}^{*} \circ G_{x \sim P_{θ} (x)} = (\nabla_{θ} \circ E_{x \sim P_{θ} (x)})^{*}$ $= E_{x \sim P_{θ} (x)}^{*} \circ \nabla_{θ}^{*} \circ (* ln P_{θ} (x))^{*}$ $= G_{x \sim P_{θ}} \circ \nabla_{θ}^{*} \circ (^{\land} ln P_{θ})$ .
Writing it out, we have $\nabla_{θ}^{*} G_{x \sim P_{θ}(x)}[f(x)] = G_{x \sim P_{θ}(x)}[\nabla_{θ}^{*}(f(x)^{\ln P_{θ}(x)})]$.
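
The star-operator identities above can be checked numerically. This is a minimal sketch on positive reals; the names `star`, `compose`, `times_k` and `plus_k` are illustrative, and the gradient and expectation operators themselves are not modelled, only the scalar maps $(* k)$ and $(+ k)$.

```python
import math

def star(g):
    # g* = exp . g . ln, defined for positive inputs
    return lambda x: math.exp(g(math.log(x)))

def compose(f, g):
    return lambda x: f(g(x))

K = 3.0

def times_k(y):
    return K * y

def plus_k(y):
    return y + K

x = 2.0
power_check = (star(times_k)(x), x ** K)            # (* k)* = (^ k)
scale_check = (star(plus_k)(x), x * math.exp(K))    # (+ k)* = (* e^k)
homomorphism_check = (
    star(compose(times_k, plus_k))(x),              # (f . g)*
    compose(star(times_k), star(plus_k))(x),        # f* . g*
)
```

Each pair should agree up to floating-point error.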

scottviteri Apr 13, 2024, 6:57 AM
3 points
2 votes
0
0 votes
in reply to: Vivek Hebbar’s comment on: The Geometric Expectation
And if I pushed around symbols correctly, the geometric derivative can be pulled inside of a geometric expectation ( $\nabla_{θ}^{*} G_{x \sim P (x)} [f (x)] = G_{x \sim P (x)} [\nabla_{θ}^{*} f (x)]$ ) similarly to how an additive derivative can be pulled inside an additive expectation ( $\nabla_{θ} E_{x \sim P (x)} [f_{θ} (x)] = E_{x \sim P (x)} [\nabla_{θ} f_{θ} (x)]$ ). Also, just as additive expectation distributes over addition ( $E [f (x) + g (x)] = E [f (x)] + E [g (x)]$ ), geometric expectation distributes over multiplication ( $G [f (x) g (x)] = G [f (x)] G [g (x)]$ ).
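
A quick numerical check of the two distributive properties, as a sketch for a small discrete distribution. The values, probabilities, and the functions `f` and `g` are arbitrary choices for illustration, and $G$ is taken to be $\exp(E[\ln f(x)])$, which requires positive values.

```python
import math

def E(values, probs):
    # Arithmetic (additive) expectation.
    return sum(p * v for p, v in zip(probs, values))

def G(values, probs):
    # Geometric expectation: exp of the expected log; values must be positive.
    return math.exp(sum(p * math.log(v) for p, v in zip(probs, values)))

xs = [1.0, 2.0, 3.0]
probs = [0.2, 0.5, 0.3]

def f(x):
    return x + 1.0

def g(x):
    return x ** 2 + 0.5

additive = (E([f(x) + g(x) for x in xs], probs),
            E([f(x) for x in xs], probs) + E([g(x) for x in xs], probs))
multiplicative = (G([f(x) * g(x) for x in xs], probs),
                  G([f(x) for x in xs], probs) * G([g(x) for x in xs], probs))
```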

scottviteri Apr 2, 2024, 6:26 AM
1 point
1 vote
0
0 votes
on: «Boundaries», Part 3a: Definitions
If I try to use this framework to express two agents communicating, I get an image with a V1, A1, P1, V2, A2, and P2, with cross arrows from A1 to P2 and A2 to P1. This admits many ways to get a roundtrip message. We could have A1 → P2 → A2 → P1 directly, or A1 → P2 → V2 → A2 → P1, or many cycles among P2, V2, and A2 before P1 receives a message. But in none of these could I hope to get a response in one time step the way I would if both agents simultaneously took an action, and then simultaneously read from their inputs and their current state to get their next state. So I have this feeling that pi : S → Action and update : Observation × S → S already bake in this active/passive distinction by virtue of the type signature, and this framing is maybe just taking away the computational teeth/specificity. And I can write the same infiltration and exfiltration formulas by substituting S_t for V_t, Obs_t for P_t, Action_t for A_t, and S_env_t for E_t.
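
A minimal sketch of the simultaneous-step picture, using the two type signatures named above. The concrete `pi` and `update` bodies (emit the state, add the observation) are toy assumptions, chosen only so that the exchange is visible after one step.

```python
from typing import Tuple

State = int
Action = int
Observation = int

def pi(s: State) -> Action:
    # Toy policy: the agent emits its current state.
    return s

def update(obs: Observation, s: State) -> State:
    # Toy update: fold the observation into the state.
    return s + obs

def joint_step(s1: State, s2: State) -> Tuple[State, State]:
    # Both agents act simultaneously, then both read the other's action
    # as their observation and update.
    a1, a2 = pi(s1), pi(s2)
    return update(a2, s1), update(a1, s2)

next_states = joint_step(1, 10)
```

After a single call to `joint_step`, each agent's state already reflects the other's action, which is the one-time-step exchange described above.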

scottviteri Feb 23, 2024, 5:58 AM
3 points
3 votes
0
0 votes
in reply to: scottviteri’s comment on: The Geometric Expectation
Actually maybe this family is more relevant:
https://en.wikipedia.org/wiki/Generalized_mean, where the geometric mean is the limit as the exponent approaches zero.
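
As a small illustration of that family, here is a sketch of the generalized (power) mean for positive numbers. The function name `power_mean` and the sample values are illustrative; the exponent 0 is handled as the geometric-mean limit.

```python
import math

def power_mean(xs, p):
    # Generalized mean of positive numbers with exponent p.
    if p == 0:
        # Limit as p -> 0: the geometric mean.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

xs = [1.0, 4.0]
arithmetic = power_mean(xs, 1)      # p = 1
harmonic = power_mean(xs, -1)       # p = -1
geometric = power_mean(xs, 0)       # p -> 0 limit
near_zero = power_mean(xs, 1e-6)    # small exponent, close to the geometric mean
```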

scottviteri Feb 23, 2024, 5:10 AM
3 points
3 votes
0
0 votes
in reply to: scottviteri’s comment on: The Geometric Expectation
The “harmonic integral” would be the inverse of the integral of the inverse of a function: https://math.stackexchange.com/questions/2408012/harmonic-integral

scottviteri Feb 23, 2024, 3:52 AM
2 points
2 votes
0
0 votes
in reply to: scottviteri’s comment on: The Geometric Expectation
Also here is a nice family that parametrizes these different kinds of average (https://m.youtube.com/watch?v=3r1t9Pf1Ffk)

scottviteri Feb 22, 2024, 12:36 AM
5 points
4 votes
0
0 votes
on: The Geometric Expectation
If arithmetic and geometric means are so good, why not the harmonic mean? https://en.wikipedia.org/wiki/Pythagorean_means. What would a “harmonic rationality” look like?
What links here?
- Expected Utility, Geometric Utility, and Other Equivalent Representations by StrivingForLegibility (Nov 20, 2024, 11:28 PM; 10 points)

scottviteri Oct 15, 2023, 1:17 PM
1 point
1 vote
0
0 votes
on: Optimality is the tiger, and agents are its teeth
I wonder if this entails that RLHF, while currently useful for capabilities, will eventually become an alignment tax. Namely, OpenAI might have text evaluators discourage the LM from writing self-calling, agenty-looking code.

So in thinking about alignment futures that are the limit of RLHF, these feel like two fairly different forks of that future.

Causality and a Cost Semantics for Neural Networks

scottviteri Aug 21, 2023, 9:02 PM

22 points

17 votes

1 comment, 1 min read, LW link

scottviteri Jun 26, 2023, 11:28 PM
1 point
1 vote
0
0 votes
on: Democratic AI Constitution: Round-Robin Debate and Synthesis
@Quinn @Zac Hatfield-Dodds Yep, I agree. I could allow voters to offer replacements for debate steps and aggregation steps. Then we get the choice to either
1) delete the old versions and keep a single active copy of the aggregation tree, or to
2) keep the whole multiverse of aggregation trees around.

If we keep a single copy, and we have a sufficient number of users, the root of the merge tree will change too rapidly, unless you batch changes. However, recomputing the aggregation trees from a batch of changes will end up ignoring changes to parents of nodes in the batch, since all parents end up getting recomputed anyway.

Suppose we keep all constitutions (either user submitted, intermediate aggregations, or final aggregations) as a flat list of candidates to be voted amongst. Then there will be too many constitution candidates for people to interact with. So instead a user can vote with a distribution by presenting a constitution, and the distribution is generated by the softmax of negated distances to all of the constitutions in the multiverse. A user could tune their distribution by weighing multiple query constitutions, and changing softmax temperatures to tune variances. And the general population doesn’t really need to know what a distribution is; they can just input a natural language paragraph, or pick an existing one as the query.
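
A minimal sketch of voting with a distribution, under loud assumptions: each constitution is represented by a small made-up embedding vector, distance is Euclidean, and the function names are hypothetical. How constitutions would actually be embedded or compared is not specified here.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def vote_distribution(query, candidates, temperature=1.0, distance=euclidean):
    # Softmax over negated distances from the query to every candidate.
    logits = [-distance(query, c) / temperature for c in candidates]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def mix(distributions, weights):
    # Weighted combination of several query distributions.
    total = sum(weights)
    size = len(distributions[0])
    return [sum(w * d[i] for w, d in zip(weights, distributions)) / total
            for i in range(size)]

# Hypothetical embeddings of three constitutions in the multiverse.
constitutions = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
query = [0.9, 0.0]

broad_vote = vote_distribution(query, constitutions, temperature=1.0)
sharp_vote = vote_distribution(query, constitutions, temperature=0.1)
blended_vote = mix([broad_vote, vote_distribution([4.0, 4.0], constitutions)], [2.0, 1.0])
```

Lowering the temperature concentrates the vote on the nearest constitution, and `mix` is one simple way a user could weigh multiple query constitutions.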

Democratic AI Constitution: Round-Robin Debate and Synthesis

scottviteri Jun 24, 2023, 7:31 PM

10 points

4 votes

4 comments, 5 min read, LW link

(scottviteri.com)

scottviteri Jun 13, 2023, 4:13 PM
3 points
2 votes
0
0 votes
in reply to: Garrett Baker’s comment on: Nature < Nurture for AIs
I agree with Andrew Critch’s acausal normalcy post until he gets to boundaries as the important thing; antisociality fits this criterion too well. I’m not quite trying to say that people are just active inference agents. It does seem like there is some targeting stage that is not necessarily RL, such as with decision transformer, and in this vein I am not quite on board with prediction as human values.

scottviteri Jun 7, 2023, 2:11 AM
3 points
2 votes
0
0 votes
in reply to: Steven Byrnes’s comment on: Nature < Nurture for AIs
No, that’s not the question I was asking. Humans are able to start using grammatical languages on the basis of no observations of grammatical language whatsoever—not in the pretraining, not in the training, not in text form, not in audio form, not in video form. Again, I mentioned Nicaraguan sign language, or the creation of creoles from pidgins, or for that matter in the original creation of language by hominins.
So this has nothing to do with sample-efficiency. There are zero samples.
I don’t think you can take one or more randomly-initialized transformers, and get grammatical language out of them, without ever putting any human-created grammatical language into them. Do you? If so, how?
I agree that my statements about sample efficiency do not address this point. I do think you could get transformers to invent language without seeing language data. You would want to use online learning in an observation, state, action loop while interacting with an environment, and probably include optimizations from ReAct, Reflexion, AutoGPT, and Voyager. But each of these relies on having some core language model that can do reasoning, and the way that we normally get these is by pre-training on language. I could instead imagine pre-training on solutions to another problem that is arbitrarily hard to compute, simple to verify, and provides a natural learning gradient. For example, the LM could be given a numpy program $f$ and an output $f(x)$ and get loss $L_{2}(f(x), f(y))$ for guess $y$. Or it could try to guess zeros of polynomials and be penalized according to the polynomial’s value at the guess, squared. Then put the agents together in a way such that they can communicate through their input and output channels, and I suspect that they will be able to create language. Maybe language is not so hard: level 1 is just using words to point at concepts you already have. Then learning how to compose those words is just a matter of more time-steps, given sufficient parameter capacity in your networks.
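
The two pre-training signals mentioned above can be sketched as plain loss functions. Everything here is illustrative: the stand-in program `f` is written in pure Python rather than numpy to stay self-contained, the loss is taken to be the squared L2 distance, and the polynomial loss is read as the squared value of the polynomial at the guess.

```python
def l2(a, b):
    # Squared L2 distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def f(x):
    # Hypothetical stand-in for the program shown to the model.
    return [x[0] + x[1], x[0] * x[1]]

def inversion_loss(program, target_output, guess):
    # The model sees program and program(x), proposes y, and is scored on
    # how close program(y) is to program(x).
    return l2(target_output, program(guess))

def poly_eval(coeffs, x):
    # Horner's rule; coefficients are listed from the highest degree down.
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def root_loss(coeffs, guess):
    # Zero exactly when the guess is a root of the polynomial.
    return poly_eval(coeffs, guess) ** 2

target = f([2.0, 3.0])
exact_loss = inversion_loss(f, target, [2.0, 3.0])
swapped_loss = inversion_loss(f, target, [3.0, 2.0])   # a different y with the same f(y)
wrong_loss = inversion_loss(f, target, [1.0, 1.0])
at_root = root_loss([1.0, 0.0, -4.0], 2.0)             # x^2 - 4 at x = 2
off_root = root_loss([1.0, 0.0, -4.0], 3.0)
```

Both losses are cheap to verify and shrink smoothly as the guess improves, which is the kind of learning gradient described above. Note that the inversion loss only scores the output, so any guess mapping to the same output counts as correct.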
To say this you would have to argue that humans without this feature would have led a faster singularity, more or less.
I am saying it is hard to know if a feature of a person gives rise to better communication in the whole group, which makes my theory conveniently hard to test. And then I am pointing at the singularity as a limiting object (from our point of view) of increasing communication, that follows in a trend after DNA, language, the printing press, phones, the internet, and AI.
Your post says “Let’s imagine a hypothetical scenario where an AI is somehow trained in a way that is analogous to a human childhood in all of the relevant ways.” OK, now:
- It is possible in principle to program an AI that is exactly like a human sociopath’s brain
- It is possible in principle to put that AI in a human-like body and raise it in a loving human family in a normal human neighborhood, enroll them in school, etc.
- Presumably, if I did both these things, this would be a central example of “a hypothetical scenario where an AI is somehow trained in a way that is analogous to a human childhood in all of the relevant ways”, according to a reasonable interpretation of those words.
- And if I did both these things, I would wind up creating an AI that is just like a human adult high-functioning sociopath, the kind of person that emotionally abuses people just for fun, with callous disregard for the well-being of anyone but themselves, that is constitutionally incapable of guilt or remorse, etc. etc.
Where if anywhere do you disagree?
For the bullets:
1. Agree, and I think that AI won’t last long in the world, but it might last long enough to destroy humans.
2. Agree
3. Agree
4. Thank you for bringing my post into an empirical domain I had not been thinking about. So I will modify my claim to ‘there exists a competence level $α$ such that for all agents with competence level $β \geq α$, nurture matters more than nature’, where ‘matters more than’ also needs to be made precise. Now the question is locating $α$, for which it would be useful for me to understand how common it is for a person to have a high quality upbringing (in a multi-faceted sense) and end up self-interested. Though I wonder if size of moral circle is the right metric.

scottviteri Jun 6, 2023, 6:56 PM
3 points
2 votes
0
0 votes
in reply to: Steven Byrnes’s comment on: Nature < Nurture for AIs
GPT-4 has already been trained on lots of human language. Let’s talk instead about a transformer initialized with random weights (xavier initialization or whatever).
Starting right from the random xavier initialization, you are not allowed to (pre)train it on any human language at all. None. No text. No audio of humans speaking. No video of humans speaking. Absolutely none at all. Do you think that could wind up with grammatical language? If not, then I claim this is a nice demonstration (one of many) of how human child brains are doing something different than the kind of AI you have in mind.
The LM does indeed start training with random initialization and has to learn new languages. So then the question is why are humans more sample efficient than LMs? I am not sure about this, and I am not even sure of the premise. It sometimes feels like GPT-4 can read something once that I would need to read a few times. Which is to say that sample efficiency may be a function of how many tokens you have already seen (I would greatly appreciate a graph showing this). So it could be the case that humans are just a particular kind of pre-trained. But normally pre-training does include language, and babies don’t seem to be pre-seeded with the languages of their ancestors. So what can that pre-training contain? Well, probably interaction with some sufficiently complex yet predictable environment that responds to their action space (tokens). Maybe you could do meta learning from this stage to create an LM which can learn a language from few samples. But even the smaller model may be difficult to encode directly in the genome, and it could be easier to specify parts of those models as a reward function, which when followed will lead to reconstructing those pre-trained models.
But your point here is that ML models are not like people in this way. Some other differences that I tentatively think currently exist are that LMs are faster than people, people are more sample efficient than LMs, and LMs tend to get stuck when making long term plans at the moment (try Auto GPT for instance).

I believe you are pointing out that there are differences in people and LMs to demonstrate that the space of competent intelligences is wide. The (admittedly rephrased) point I made in response to this earlier was that while there are many intelligences that are beyond some level of competence, I expect competitive pressures to ramp up as a function of intelligence (related). This is because I think that a system’s optimization ability (aka intelligence) is a monotonic function of its ability to communicate internally and externally (flagging that I am quantifying communication via Bayesian information). Optimization abilities scale with communication because communication allows you to recruit more computational resources for a given problem. Going back to the main point, I think that the design space of competitive intelligences will end up converging, and the only reason that it hasn’t sufficiently converged yet is that we are not smart enough.
Your OP doesn’t say “auto-regressive training & prompting”, rather it says “an AI is somehow trained in a way that is analogous to a human childhood in all of the relevant ways”. I don’t think the kinds of AIs and training procedures that you have in mind are at all analogous to a human childhood. Children will do things that they want to do without being “prompted” by anyone. Children are not exposed to 45 TB of internet text while in the womb. Etc. Right??
I did not go into detail about what I believed were the ‘relevant ways’ because I thought that talking about communication and such would be too philosophical and drag out the post. But I do understand that it might make the reader suspicious that I am circularly defining the ‘relevant ways’ in terms of humans. Of course, I need to use my baseline of humans in order to guess what future values might look like, in which case this is the same kind of circular as any scientific theory which uses data from the universe to predict other data from the universe.
Is that what you’ve been thinking of this whole time? You didn’t even mention decision transformers until just now. (Or did I miss it?)
My proposal (linked again for convenience) and toolformer (in an earlier comment) also train auto-regressively on a modified prompt. I was including this when talking about auto-regressive training + prompting. This is what I was trying to communicate by saying “Also for the record I am talking about reshaping the prompt during and not just after regular auto-regressive training”.
Let me put it this way. Suppose I understood how human brains worked sufficiently well that I could make an AI that was doing all the same things as a human child brain, for the same reasons, i.e. due to the same underlying algorithms. Then I put this AI in a human body and raise it in a loving human family.
From my perspective, this would be the most central example possible of “an AI is somehow trained in a way that is analogous to a human childhood in all of the relevant ways”.
But from your perspective, I feel like you’re going to say “Oh no no no, that’s totally different from the thing I’m talking about in this post.”
Yes, that would be a central example, and I would wish you the best of luck getting it done in time.
(After all, human brains incorporate many features that do not increase the communication of the system that they are embedded in. Sociopathy has not been selected out of humans. Some human children are introverted and we’re OK with that. Etc. etc.)
To say this you would have to argue that humans without this feature would have led a faster singularity, more or less. My point earlier with respect to sociopathy was that it is only selected out to the degree that it manifests in anti-social behavior. If your sociopath ends up producing some company that produces net value for organisms at various levels of abstraction, evolution counts that as a win. That introvert might invent the steam engine, letting people interact from farther away and extract more energy from their environment so you can make more people who start the cycle over again. Not that inventing the steam engine is likely enough for evolution to pick it up specifically; I am just trying to say that the action space is much wider than the words that you verbalize.
If so, do you see why the post title & intro come across as misleading?
The antecedent has not been fulfilled if I am understanding what “if so” is pointing at correctly.

scottviteri Jun 6, 2023, 3:08 PM
3 points
2 votes
0
0 votes
in reply to: Steven Byrnes’s comment on: Nature < Nurture for AIs
A group of humans who have never been exposed to language, not in any modality, will develop a new grammatical language out of nothing, e.g. Nicaraguan Sign Language, or the invention of the earliest languages in prehistory.
So there is something going on in humans that is not autoregressive training-then-prompting at all, right? This isn’t about modality, it’s about AI paradigm. Autoregressive training will never create grammatical language out of thin air, right?
Meh. I could see the prompting and finetuning structure mentioned earlier giving rise to agents which figure out more efficient ways of communicating. If you asked GPT-4 to create a new language now it might be able to do it. Also for the record I am talking about reshaping the prompt during and not just after regular auto-regressive training.
I feel like you should have said “here is one of a handful of techniques that I am aware of”. For example, do you think no more AI algorithms will ever be discovered in the future?
Yes, I expect there to be many more techniques that increase the communication of the system that the AI is embedded in. My point is that this is how I am coming up with the ideas in the first place.
I also strongly disagree with “communication therefore prosociality” in general. I’ve known a couple high-functioning sociopaths, they communicated as much as anybody, indeed probably more than average.
Indeed, if they are not doing object-level bad things, which decrease the amount of communication in their environment, then I do not see anything wrong with them. Sociopathy will end up getting selected out of the population as a function of how much they decrease the communication of the process in which they are embedded (for example by being dishonest or hurting people), which is why we are not all sociopaths.
Yet again, from my perspective, you seem to have a giant blind spot to the idea that any AI algorithm could possibly exist apart from autoregressive training then prompting. Human brains do a lot of things that are not autoregressive training, right? Particularly RL.
If a human or animal is hungry then they will eat because they find eating-when-hungry to be rewarding, i.e. thanks to an RL reward function, not because they were fine-tuned on examples of themselves eating, nor because they were prompted to eat or whatever. Animals will eat when they’re hungry even if they have never seen any other animal eat before, not in any modality.
You’re welcome to specify that RL-centric algorithms are outside the scope of this blog post, but you can’t also say “an AI is somehow trained in a way that is analogous to a human childhood in all of the relevant ways” if there is no online RL involved, right?
I did say auto-regressive training and prompting, right? I think decision transformer includes RL into the auto-regressive training + prompting story, but I could be wrong about that.
What links here?
- scottviteri's comment on Nature < Nurture for AIs by scottviteri (Jun 6, 2023, 6:56 PM; 3 points)

scottviteri Jun 6, 2023, 4:12 AM
3 points
2 votes
0
0 votes
in reply to: Steven Byrnes’s comment on: Nature < Nurture for AIs
You’re using LLMs trained on internet text. If that’s part of the plan, I don’t think you can say it’s “trained in a way that is analogous to a human childhood in all of the relevant ways”, nor can you say that imitation-learning-from-humans is not a central part of your story. Human children do not undergo autoregressive training from massive corpuses of internet text.
Internet-trained LLMs emit human-like outputs because they were trained by imitation-learning from lots and lots of human-created text. Humans emit human-like outputs because they are humans. These are not the same, right?
All we need is for the text streams to have mutual information in order to train cooperation this way. In which case your claim is that human children do not undergo autoregressive training from massive corpuses of text, to which I respond that the modality of training data only matters insofar as it is entangled with the world and the content of others’ minds. Blind people are not barred from intelligence.
I interpret you as saying:
- I’m only interested in AIs that are very competent at staying alive, executing plans, etc.
- If I make an AI as follows: [autoregressive training on a massive corpus of internet text, certain type of prompting, blah blah], then I will get an AI that is very competent at staying alive, executing plans, etc.
- Therefore I need only be interested in AIs that look like the previous bullet point.
If so, it’s obviously a bad argument because it neglects the possibility that maybe there are also other very different ways to make an AI that is very competent at staying alive, executing plans, etc. And indeed this is the case: e.g., whatever happens in the brains of human children (since human children brains are not trained on a massive corpus of internet text, or prompted, etc.).
Ok, so while for any fixed bar of functionality there would be multiple models that would exceed that bar, I expect that in the limit competitive pressures will squeeze out anything that isn’t orthogonal to communication ability. I also suspect that the parts of human values that would survive the CEV are the ones that are downstream of communication.

So to your bullet points: 1) Yes, 2) Yes, 3) More like here is one of a handful of techniques that I can apply that will help increase the communication and therefore the prosociality of an LM
I note that I am using the word communication in a bit of a non-standard way: I mean number of bits sent as measured by the number of times it halves the receiver’s Bayesian uncertainty, as opposed to raw number of 0s and 1s sent on a wire.
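
One possible reading of this measure, as a sketch: count bits as the drop in Shannon entropy of the receiver's beliefs, so that each halving of a uniform set of hypotheses counts as one bit. The function names are illustrative, and entropy reduction is an assumption about what "halving Bayesian uncertainty" means, not something stated above.

```python
import math

def entropy_bits(dist):
    # Shannon entropy in bits of a discrete belief distribution.
    return -sum(p * math.log2(p) for p in dist if p > 0)

def bits_communicated(prior, posterior):
    # Bits received, measured as the reduction in the receiver's uncertainty.
    return entropy_bits(prior) - entropy_bits(posterior)

# The receiver starts uniform over 8 hypotheses.
prior = [1 / 8] * 8
# A message that rules out all but 2 hypotheses halves the uncertainty twice.
informative = bits_communicated(prior, [0.5, 0.5, 0, 0, 0, 0, 0, 0])
# A message of any raw length that leaves beliefs unchanged conveys nothing.
redundant = bits_communicated(prior, prior)
```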

scottviteri Jun 5, 2023, 7:57 PM
3 points
2 votes
0
0 votes
in reply to: Steven Byrnes’s comment on: Nature < Nurture for AIs
This is not intuitive to me. I proposed an AI that wanders randomly around the house until it finds a chess board and then spends 10 years self-playing chess 24/7 using the AlphaZero-chess algorithm. This is an AI, fair and square!
If your response is “It does not meet my intuitive notion of what an AI is”, then I think your argument is circular insofar as I think your “intuitive notion of what an AI is” presupposes that the AI be human-like in many important ways.
I claim it is possible to find simple definitions of AI that include many human-like traits without explicitly invoking them. This is because human traits are not random, but selected for. There is a sense in which AI is like life itself: it is able to extract negentropy in a wide range of environments, which it can use to help preserve its boundary.
AlphaZero chess player would find itself squashed in many environments, so I would call it less of an AI (though not zero). GPT-4 would do OK according to this definition, because it can learn to thrive in new environments with just a bit of tweaking: for example Voyager.
If your response is “I’m not talking about any old AI that grows up in a loving human family, I’m talking specifically about an AI that learns video prediction via autoregressive loss on a video stream of a human household and takes actions via (blah blah)”, then this is now a post about a specific class of AI algorithm, and it’s perfectly great to write posts about specific classes of AI algorithms, but your title is misleading.
I am indeed talking about a particular set of designs for an AI, but these are designs which increase the extent to which they can be considered AIs, because they give them adaptive properties. So I don’t think the title is misleading.
I’m still not following what you have in mind for how the model produces outputs, such that (1) the AI behaves like a human child in nontrivial ways, (2) …but not because of imitation-learning from observations of other human children, (3) nor because of laborious programmer effort. Can you walk through an example?
For example,
(A) Human children will say “I’m hungry” when they themselves are hungry, not in situations where other people are typically hungry. I don’t see how the algorithms you’re describing would do that, without programmers specifically intervening to make that happen.
(B) If a child grows up never meeting any other human except for their mother, I believe the child will still eventually learn to carry on conversations in a normal way. I don’t see how the algorithms you’re describing would do that. It has no models of two-sided conversation for the autoregressive training to learn from.
I indeed think there exist techniques that satisfy those 3 criteria.
For example, here is a variation on the Observation, State, Action loop I was describing earlier. Here the observations and actions are messages from other language models, which are concurrently getting fine-tuned on text from the internet (though they are not getting fine-tuned in the attached image since it is just an example visualization).
In this model there is a selection pressure toward being an honest and efficient communicator, because otherwise others won’t talk to you, and they know information that will help you get low loss on your Finetune Observations. I call this the kindergarten phase.
