what else do we have, really?
Patience. And cosmic wealth, some of which can be used to ponder these questions for however many tredecillions of years it takes.
Would the refutation of ‘ignore acausal blackmail’ be still available?
First, Harder Choices Matter Less. It’s more useful to focus on dilemmas that you can decisively resolve.
Second, the problem might be sufficiently weird to itself lie outside your goodhart boundary, so it’s like Pascal’s Mugging. In that case, under mild optimization it might be normatively correct to ignore it on the object level, and merely raise the priority of chipping away at the goodhart boundary in its direction, to eventually become able to assess its value. It’s only under expected utility maximization that you should still care on the object level about problems you can’t solve.
Superhuman AI will be running into unknown situations all the time because of different capabilities.
It might have some sort of perverse incentive to get there, but unless it has already extrapolated the values enough to make them robust to those situations, it really shouldn’t. It’s not clear how specifically it should avoid doing so, what a principled way of making such decisions would look like.
Somewhere in my brain there is some sort of physical encoding of my values.
Not sure if this is the intended meaning, but the claim that values don’t depend on the content of the world outside the brain is generally popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointers to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization’s culture.
This is an important distinction for corrigibility, because this claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment; that content is not part of its current definition/computation. It also doesn’t make sense to talk about this agent pursuing its goals in a diverse set of environments, unless we expect the goals to vary with the environment.
For the decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn’t be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn’t be chosen as the most valuable mapping, because the value/goal itself depends on observations, and even after some observations it’s not pinned down precisely. So if this point is taken into account, we need a different decision theory, even if it’s not trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!
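To make the difficulty concrete (the notation here is my own sketch, not a worked-out proposal): standard UDT picks a policy for a fixed utility function U,

\[ \pi^* = \arg\max_{\pi : O \to A} \; \mathbb{E}_{E \sim P}\!\left[\, U(\mathrm{hist}(\pi, E)) \,\right], \]

where hist(π, E) is the outcome of running policy π in environment E. If goal content is located in the environment, so that U = t(E) for some extrapolation method t, the naive substitution

\[ \pi^* = \arg\max_{\pi : O \to A} \; \mathbb{E}_{E \sim P}\!\left[\, t(E)(\mathrm{hist}(\pi, E)) \,\right] \]

averages together utility functions from different possible environments, and it’s not clear that this aggregation is normatively meaningful. That’s the sense in which even a non-fancy decision theory has to change once goal content in the environment is acknowledged.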
The abstractions I’m referring to are not intended to be limited to simpler agents that a human can predict: there are precise abstractions of you that can only be predicted by superintelligences, as well as imprecise ones that can be predicted by you and other humans, things like theoretically ascribed reputation concerning specific situations. This almost touches on what’s needed for Newcomb’s Problem.
Obviously Omega can exist for uploads and arbitrarily complex AGIs formulated as abstract programs (these things can run on computers, so Omega could just use a similar computer). An embedded agent modeling the universe including itself is not a real additional difficulty in principle, even if we want to model the universe precisely (if the world is compressible, its shorter description can fit into a smaller agent embedded in the same world, and with quining we can avoid contradictions). Almost certainly that’s not possible to do in our world, but in any case, that’s not because self-reference causes trouble.
This idea has been around for some time, known as indirect normativity. The variant you describe was also my own preferred formulation at the time. For a few years it was a major motivation for me to study decision theory, since this still needs the outer AGI to actually run the program, and ideally also to eventually yield control to that program, once the program figures out its values and can slot them in as the values of the outer AGI.
This doesn’t work out for several reasons. We don’t actually have a way of creating the goal program. The most straightforward thing would be to use an upload, but that probably can’t be done before AGIs.
If we do have a sensible human imitation, then the thing to do with it is to build an HCH that pushes the goodhart boundary of that human imitation and allows stronger optimization of the world without breaking down our ability to assess its value. This gives the first aligned AGI directly, without turning the world into computronium.
Even if we did make a goal program, it’s still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs. It’s not even known what kind of thing goals are, what type signature is needed to communicate them from the goal program to the outer AGI.
Even if we mostly knew how to build the outer AGI that runs a goal program (though with some confusion around the notion of goals still remaining), it’s unclear that there are normative goals for humanity that are goals in a sense similar to a utility function in expected utility maximization, goals for a strong optimizer. We might want to discover such goals with reflection, but that doesn’t necessarily reach a conclusion, as reflection is unbounded.
More likely, there is just a sequence of increasingly accurate proxy goals with increasingly wide goodhart boundaries, instructing a mild optimizer how to act on states of the world it is able to assess. But then the outer AGI must already be a mild optimizer and not a predatory mature optimizer that ignores all boundaries of what’s acceptable in pursuit of the goal it knows (in this case, the goal program).
This sets up motivation for what I currently see as valuable on the decision theory side: figuring out a principled way of doing mild optimization (there’s only quantilization in this space at the moment). It should probably take something like the goodhart boundary as a fundamental ingredient of its operation (it seems related to the base distribution of quantilization), the kind of thing that’s traditionally missing from decision theory.
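For reference, the quantilization baseline is easy to sketch (a toy Monte Carlo version; `base_sample`, `proxy_utility`, `q`, and the example usage are stand-ins I’m supplying, and the open question above is what principled thing, perhaps derived from the goodhart boundary, should play the role of the base distribution):

```python
import random

def quantilize(base_sample, proxy_utility, q=0.1, n=1000):
    """Toy q-quantilizer: draw n candidate actions from the base distribution
    and return one sampled uniformly from the top q-fraction by proxy utility.
    Unlike argmax, this limits how hard the proxy utility gets pushed on."""
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=proxy_utility, reverse=True)
    return random.choice(candidates[: max(1, int(q * n))])

# Example with stand-in ingredients: a base distribution of "ordinary" actions,
# and a proxy utility that would be goodharted under full maximization.
action = quantilize(base_sample=lambda: random.gauss(0, 1),
                    proxy_utility=lambda a: a)
```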
In other words, you have the ability to control their payoff outside the negotiation, based on what you observe during the negotiation.
This suggests some sort of (possibly acausal) bargaining within the BATNAs, so it points to a hierarchy of bargains. Each bargain must occur without violating the boundaries of the agents; if it would violate them, the encounter undergoes escalation, away from trade and towards conflict. After a step of escalation, another bargain may be considered, one that runs off tighter, less comfortable boundaries. If it also falls through, there is a next level of escalation, and so on.
Possibly the sequence of escalation goes on until it reaches the goodhart boundary, where agents lose the ability to assess the value of outcomes. It’s unclear what happens when that breaks down as well and one of the agents moves the environment into the other’s crash space.
Note that this is not destruction of the other agent, which is unexpected for the last stage of escalation of conflict. Destruction of the other agent is merely how the game aborts before reaching its conclusion, while breaking into the crash space of the other agent is the least acceptable outcome in terms of agent boundaries (though it’s not the worst outcome, it could even have high utility; these directions of badness are orthogonal, goodharting vs. low utility). This is a likely outcome of failed AI alignment (all boundaries of humanity are ignored, leading to something normatively worthless), as well as of some theoretical successes of AI alignment that are almost certainly impossible in practice (all boundaries of humanity are ignored, the world is optimized towards what is the normatively best outcome for humanity).
Acausal trade can be thought of as committing to carry out the verdict of an adjudicator that exists as common knowledge and would also direct other members of the coalition. An adjudicator that directs other members of a coalition to torture you doesn’t seem like a good source of advice for what to do, so it might be better to pass up on this trade opportunity.
This point doesn’t follow from expected utility maximization, even as channeled through theories in the UDT/TDT/FDT tradition. What it needs is a notion of boundaries: not bargaining your way into unacceptable territory when that can be avoided by simply walking away from a trade.
This might be related to the problem of goodharting, the most pressing problem of alignment, in which case the relevant concept is the goodhart boundary, which one wouldn’t want to step outside of for reasons other than it having low utility. But its relation to bargaining is unclear. It could be the last bargaining boundary that’s not very important in practice, because there are many other boundaries before it that trigger retreat from trade and escalation of conflict.
Human values are eventually the only important thing, but don’t help with the immediate issue of goodharting. Doing expected utility maximization with any proxy of humanity’s values, no matter how implausibly well-selected this proxy is, is still misaligned. Even if in principle there exists a goal such that maximizing towards it is not misaligned, this goal can’t be quickly, or possibly ever, found.
So for practical purposes, any expected utility maximization is always catastrophically misaligned, and there is no point in looking into supplying correct goals for it. This applies more generally to other ways of being a mature agent that knows what it wants, as opposed to being actively confused and trying not to break things in the meantime by staying within the goodhart boundary.
I think encountering strong optimization in this sense is unlikely, as AGIs are going to have mostly opaque values in a way similar to how humans do (unless a very clever alignment project makes it not be so, and then we’re goodharted). So they would also be wary of goodharting their own goals and only pursue mild optimization. This makes what AGIs do determined by the process of extrapolating their values from the complicated initial pointers to value they embody at the time. And processes of value extrapolation from an initial state vaguely inspired by human culture might lead to outcomes with convergent regularities that mitigate relative arbitrariness of the initial content of those pointers to value.
These convergent regularities in values arrived-at by extrapolation are generic values. If values are mostly generic, then the alignment problem solves itself (so long as a clever alignment project doesn’t build a paperclip maximizer that knows what it wants and doesn’t need the extrapolation process). I think this is unlikely. If merely sympathy towards existing people (such as humans) is one of the generic values, then humanity survives, but loses cosmic endowment. This seems more plausible, but far from assured.
I’m thinking of a setting where shortest descriptions of behavior determine sets of models that exhibit matching behavior (possibly in a coarse-grained way, so distances in behavior space are relevant). This description-model relation could be arbitrarily hard to compute, so it’s OK for shortest descriptions to be shortest programs or something ridiculous like that. This gives a partition of the model/parameter space according to the mapping from models to shortest descriptions of their behavior. I think shorter shortest descriptions (simpler behaviors) fill more volume in the parameter/model space, that is, have more models whose behavior is given by those descriptions (this is probably the crux; e.g. it’s false if behaviors are just models themselves and descriptions are exact).
Gradient descent doesn’t interact with descriptions or the description-model relation in any way, but since it selects models ~based on behavior, and starts its search from a random point in the model space, it tends to select behaviors from larger elements of the partition of the space of models that correspond to simpler behaviors with shorter shortest descriptions.
This holds at every step of gradient descent, not just when it has already learned something relevant. The argument is that whatever behavior is selected, it is relatively simple, compared to other behaviors that could’ve been selected by the same selection process. Further training just increases the selection pressure.
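A toy way to poke at the crux (this is my own illustrative setup, not part of the original argument): sample random parameters for a tiny network, record its input-output behavior as a truth table, and see how the draws distribute over behaviors. If the volume claim holds, a few simple behaviors (e.g. constant functions) should account for a disproportionate share of random parameter draws.

```python
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # all 3-bit inputs

def random_behavior(hidden=8):
    """Sample a tiny random MLP and return its truth table (its 'behavior')."""
    W1 = rng.normal(size=(3, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=(hidden, 1)); b2 = rng.normal(size=1)
    out = (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int).ravel()
    return tuple(out)

# How much parameter-space volume (estimated by random draws) does each
# behavior get? The partition elements are the preimages of behaviors.
counts = Counter(random_behavior() for _ in range(100_000))
for behavior, n in counts.most_common(5):
    print(behavior, n)
```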
I mean relatively short, as in the argument for why overparametrized models generalize. They still do get to ~memorize all training data, but anything else comes at a premium, reduces probability of getting selected for models whose behavior depends on those additional details. (This use of “short” as meaning “could be 500 gigabytes” was rather sloppy/misleading of me, in a comment about sloppy/misleading use of words...)
we will select for “explicit search processes with simple objectives”
The actual argument is that small descriptions give higher parameter space volume, and so the things we find are those with short descriptions (low Kolmogorov complexity). The thing with a short description is the whole mesa-optimizer, not just its goal. This is misleading for goals, because low Kolmogorov complexity doesn’t mean low “complexity” in many other senses, so an arbitrary goal with low Kolmogorov complexity would actually be much more “complicated” than the intended base objective. In particular, it probably cares about the real world outside an episode and is thus motivated to exhibit deceptively aligned behavior.
I think “explicit search” is similarly misleading, because most short programs (around a given behavior) are not coherent decision theoretic optimizers. Search would only become properly explicit after the mesa-optimizer completes its own agent foundations alignment research program and builds itself a decision theory based corrigible successor. A mesa-optimizer only needs to pursue the objective of aligned behavior (or whatever it’s being selected for), and whether that tends to be its actual objective or the instrumental objective of deceptive alignment is a toss-up; either would do for it to be selected. But in either case, it doesn’t need to be anywhere near ready to pursue a coherent goal of its own (as aligned behavior is also not goal directed behavior in a strong sense).
The goodhart boundary encloses the situations where the person/agent making the decisions has an accurate proxy utility function and proxy probability distribution (so that the tractable judgements available in practice are close to the normative, actually-correct ones). Goodhart’s Curse is the catastrophe where an expected utility maximizer operating under a proxy probutility (probability+utility) would by default venture outside the goodhart boundary (into the crash space), set the things that the proxy utility overvalues or doesn’t care about to extreme values, and thus ruin the outcome from the point of view of the intractable normative utility.
Pascal’s Mugging seems like a case of venturing outside the goodhart boundary in the low-proxy-probability direction rather than in the high-proxy-utility direction. But it illustrates the same point, that if all you have are proxy utility/probability, not the actual ones, then pursuing any kind of expected utility maximization is always catastrophic misalignment.
One must instead optimize mildly and work on extending the goodhart boundary, improving robustness of the utility/prior proxy to unusual situations (rescuing them under an ontological shift) in a way that keeps them close to their normative/intended content. In the case of Pascal’s Mugging, that means better prediction of low-probability events (in Bostrom’s framing, where utility values don’t get too ridiculous), or also better understanding of high-utility events (in Yudkowsky’s framing, with 3^^^^3 lives being at stake), and avoiding situations that call for such decisions until after that understanding is already available.
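As a toy contrast (all numbers are made up for illustration): acting on proxy probability and proxy utility by naive expected utility maximization takes the mugging, while a rule that refuses to act on options outside the goodhart boundary ignores it on the object level and only queues it for further study.

```python
# Hypothetical proxy estimates for the two options in a Pascal's Mugging.
options = {
    "ignore mugger": {"p": 1.0,  "u": 1.0,  "in_boundary": True},
    "pay mugger":    {"p": 1e-9, "u": 1e12, "in_boundary": False},
}

def proxy_eu(name):
    return options[name]["p"] * options[name]["u"]

eu_choice = max(options, key=proxy_eu)  # "pay mugger": proxy EU of 1000 wins
bounded_choice = max((o for o in options if options[o]["in_boundary"]), key=proxy_eu)
to_study = [o for o in options if not options[o]["in_boundary"]]  # extend the boundary here
print(eu_choice, bounded_choice, to_study)
```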
Incidentally, it seems like the bureaucracies of HCH can be thought of as a step in that direction, with individual bureaucracies capturing novel concepts needed to cope with unusual situations, HCH’s “humans” keeping the whole thing grounded (within the original goodhart boundary that humans are robust to), and episode structure arranging such bureaucracies/concepts like words in a sentence.
according to one of our most accepted theories these days (quantum mechanics) the inherent randomness of our universe
And yet I’m able to perfectly predict that 2+2 is 4. The agents being predicted are abstractions, like behavior of computer programs determined by source code. It doesn’t matter for reasoning about an abstraction that its instances of practical importance get to run within physics.
In my view, the purpose of the human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and to use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s the prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy; for example, one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into Chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to Chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
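A minimal sketch of that episode structure (class names, prompts, and the particular instances are hypothetical, just to make the two-model setup concrete):

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    model: str    # which of the two models is instantiated: "human" or "bureaucracies"
    prompt: str   # unchanging prompt/mission-statement that defines the role
    state: list = field(default_factory=list)  # self-changed internal state

def start_episode():
    # Every episode starts with a single human-model instance in control.
    root = Instance(model="human", prompt="decide which humans/bureaucracies to instantiate")
    # Humans only get ordinary prompts; bureaucracies may get weird,
    # Chinese-room style prompts, e.g. an ELK head mission statement.
    elk_head = Instance(model="bureaucracies", prompt="ELK head mission statement")
    reviewer = Instance(model="human", prompt="assess the ELK head's outputs")
    return [root, elk_head, reviewer]
```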
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.
More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in principal-agent problem (alignment) prompts/objectives, and not in the object-level objective, even if it’s the object-level objective bureaucracy that’s being currently trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s being currently done.
The blank-slateness still makes sense as referring to the dimensions determined by nurture. But that doesn’t yield an interesting point about content of civilization. If everyone starts out as blank canvas, that doesn’t mean paintings (and art schools) are less real/important/legitimate.
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
If both agents are FDT, and have common knowledge of each others source code
Any common knowledge they can draw up can go into a coordinating agent (adjudicator), all it needs is to be shared among the coalition, it doesn’t need to have any particular data. The problem is verifying that all members of the coalition will follow the policy chosen by the coordinating agent, and common knowledge of source code is useful for that. But it could just be the source code of the trivial rule of always following the policy given by the coordinating agent.
One possible verdict of the adjudicator should be falling back to the unshared/private BATNAs, aborting the bargain, and of course doing other things not in scope of this particular bargain. These things are not parts of the obey-the-adjudicator algorithm, but consequences of following it. So common knowledge of everything is not needed, only common knowledge of the adjudicator and its authority over the coalition. (This is also a possible way of looking at UDT, where a single agent in many possible states acting through many possible worlds coordinates among its variants.)
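A minimal sketch of that reading (the function names and the toy verdict rule are mine): each member’s source code is just “obey the adjudicator”, and the adjudicator, run on the common knowledge alone, may well return the verdict of falling back to private BATNAs.

```python
def adjudicator(common_knowledge):
    """Deterministic program shared by the coalition: maps common knowledge to
    a joint policy, one entry per member. Falling back to private BATNAs is
    one of its possible verdicts."""
    members = common_knowledge["members"]
    if common_knowledge.get("gains_from_trade", 0) <= 0:
        return {m: "fall back to private BATNA" for m in members}
    return {m: "follow the bargain" for m in members}

def member_policy(me, common_knowledge):
    # The trivial, verifiable rule: always do whatever the adjudicator says.
    return adjudicator(common_knowledge)[me]

print(member_policy("A", {"members": ["A", "B"], "gains_from_trade": 1}))
```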
FDT works on an assumption that other actors use a similar utility function as itself
FDT is not about interaction with other actors, it’s about accounting for influence of the agent through all of its instances (including predictions-of) in all possible worlds.
Coordination with other agents is itself an action that a decision theory could consider. This action involves creation of a new coordinating agent that decides on a coordinating policy that all members of a coalition carry out, and this coordinating agent also needs a decision theory. The coordinating agent acts through all agents of the coalition, so it’s sensible for it to be some flavor of FDT, though a custom decision theory specifically for such situations seems appropriate, especially since it’s doing bargaining.
The decision theory that chooses whether to coordinate by running a coordinating agent or not has no direct reason to be FDT; it could just be trivial. And preparing the coordinating agent is not obviously a question of decision theory; it even seems to fit deontology a bit better.
That’s not what I’m talking about. I’m not talking about what goals are about, I’m talking about where the data to learn what they are is located. There is a particular thing, say a utility function, that is the intended formulation of the goals. It could be the case that this intended utility function can be found somewhere in the brain. That doesn’t mean that it’s a utility function that cares about brains; the questions of where it’s found and what it cares about are unrelated.
Or it could be the case that it’s recorded on an external hard drive, and the brain only contains the name of the drive (this name is a “pointer to value”). In that case you simply can’t recover this utility function by looking only at the brain, without actually looking at the drive. So the utility function u itself depends on the environment E, that is, there is some method of formulating utility functions t such that u=t(E). This is not the same as saying that the utility of the environment depends on the environment, which gives the utility value u(E)=t(E)(E) (there’s no typo here). But if the utility function is actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
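Spelled out in symbols (my notation, just to keep the two dependencies apart):

\[ t : \mathcal{E} \to (\mathcal{E} \to \mathbb{R}), \qquad u = t(E), \qquad u(E) = t(E)(E). \]

Here t is the method of formulating utility functions from environments, u = t(E) is the utility function that the environment (the hard drive’s contents) determines, and t(E)(E) is the value that function assigns to that same environment. The first dependence on E is about where the goal content is found; only the second is about what the goal cares about.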