You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, running the forward pass through it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency.
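A minimal sketch of that loop, assuming PyTorch and a hypothetical `load_layer_weights(i)` helper that hands back one layer with its weights still in host RAM:

```python
import torch

def forward_offloaded(x, num_layers, load_layer_weights, device="cuda"):
    """Forward pass keeping only one layer's weights on the GPU at a time.

    `load_layer_weights(i)` is a hypothetical helper returning layer i as an
    nn.Module whose weights live in host (CPU) RAM.
    """
    h = x.to(device)
    for i in range(num_layers):
        layer = load_layer_weights(i).to(device)  # copy this layer's weights host -> GPU
        with torch.no_grad():
            h = layer(h)                          # forward through just this layer
        del layer                                 # drop the GPU copy of the weights
        torch.cuda.empty_cache()                  # release cached blocks so the next layer fits
    return h
```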
Latency shouldn’t be a problem, since you can pipeline, at least as long as you don’t run into Little’s Law problems.
(Depending on the structure of the connection matrix, you may even be able to pipeline at sub-layer granularity.)
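A rough sketch of the pipelining idea, again assuming PyTorch and the same hypothetical `load_layer_weights` helper: fetch layer i+1 on a separate CUDA stream while layer i is computing (the overlap only actually happens if the host-side weights are in pinned memory).

```python
import torch

def forward_pipelined(x, num_layers, load_layer_weights, device="cuda"):
    """Overlap the host->GPU transfer of layer i+1 with the compute of layer i."""
    copy_stream = torch.cuda.Stream()
    h = x.to(device)

    # Kick off the first transfer on the copy stream.
    with torch.cuda.stream(copy_stream):
        next_layer = load_layer_weights(0).to(device, non_blocking=True)

    for i in range(num_layers):
        # Make sure layer i has finished copying before we compute on it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        layer = next_layer

        # Start fetching layer i+1 while layer i runs.
        if i + 1 < num_layers:
            with torch.cuda.stream(copy_stream):
                next_layer = load_layer_weights(i + 1).to(device, non_blocking=True)

        with torch.no_grad():
            h = layer(h)
        del layer
    return h
```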
GPU bus bandwidth is likely more of a problem. PCIe gen3x16 is “only” ~16GB/s.
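Back-of-envelope, taking a hypothetical 175B-parameter model in fp16 as the example, that bandwidth dominates per-pass latency unless you can hide the transfers behind compute:

```python
params = 175e9          # assumed model size (parameters); purely illustrative
bytes_per_param = 2     # fp16
pcie_gen3_x16 = 16e9    # ~16 GB/s, the figure above

model_bytes = params * bytes_per_param            # ~350 GB of weights
seconds_per_pass = model_bytes / pcie_gen3_x16    # ~22 s just streaming weights once
print(f"~{seconds_per_pass:.0f} s per forward pass spent on PCIe transfers")
```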
A more formal way of describing much of this might be the Kolmogorov complexity of the state of your consciousness over the timeframe (so outputting t=0: state=blah; t=1: state=blah; etc.).
This has many of the features you are looking for.
This leads me to an interesting question: is looping in an infinite featureless plain of flat white any worse than looping in an infinite featureless plain of random visual noise?
(Of course, this is both noncomputable and carries a nontrivial chance that the Turing machine attaining the Kolmogorov complexity is itself simulating you, but meh. Details.)
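For the flat-white vs. noise question, a computable stand-in is to use an ordinary compressor as a (very loose) upper bound on Kolmogorov complexity and compare the two traces. This is only an illustrative proxy, not the actual K; the frame sizes and counts below are made up:

```python
import os
import zlib

frames = 1000
frame_bytes = 64 * 64  # a tiny 64x64 greyscale "visual field" per timestep

flat_white = b"\xff" * (frame_bytes * frames)  # identical frame at every timestep
noise = os.urandom(frame_bytes * frames)       # fresh random noise at every timestep

for name, trace in [("flat white", flat_white), ("random noise", noise)]:
    compressed = len(zlib.compress(trace, 9))
    print(f"{name}: {len(trace)} bytes -> {compressed} bytes compressed")

# The flat-white trace compresses to almost nothing; the noise trace barely
# compresses at all, i.e. the proxy assigns it vastly higher complexity.
```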