I wouldn’t call Shard Theory mainstream
Fair. What would you call a “mainstream ML theory of cognition”, though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis).
judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren’t optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this
Roughly agree, yeah.
But all of this is still just one piece on the Jenga tower
I kinda want to push back against this repeated characterization – I think quite a lot of my model’s features are “one storey tall”, actually – but it probably won’t be a very productive use of either of our time. I’ll get around to the “find papers empirically demonstrating various features of my model in humans” project at some point; that should be a more decent starting point for discussion.
What I want is to build non-Jenga-ish towers
Agreed. Working on it.
Which, yeah, I think is false: scaling LLMs won’t get you to AGI. But it’s also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.
the model chose slightly wrong numbers
The engraving on humanity’s tombstone be like.
The sort of thing that would change my mind: there’s some widespread phenomenon in machine learning that perplexes most, but is expected according to your model
My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena.
Again, the drive for consistent decision-making is a good example. Common-sensically, I don’t think we’d disagree that humans want their decisions to be consistent. They don’t want to engage in wild mood swings, they don’t want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.
Yet it’s not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You’d need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).
I find the idea of morality being downstream from the free energy principle very interesting
I agree that there are some theoretical curiosities in the neighbourhood of the idea. Like:
Morality is downstream of generally intelligent minds reflecting on the heuristics/shards.
Which are downstream of said minds’ cognitive architecture and reinforcement circuitry.
Which are downstream of the evolutionary dynamics.
Which are downstream of abiogenesis and various local environmental conditions.
Which are downstream of the fundamental physical laws of reality.
Thus, in theory, if we plug all of these dynamics one into another, and then simplify the resultant expression, we should actually get a (probability distribution over) the utility function that is “most natural” for this universe to generate! And the expression may indeed be relatively simple and have something to do with thermodynamics, especially if some additional simplifying assumptions are made.
That actually does seem pretty exciting to me! In an insight-porn sort of way.
Not in any sort of practical way, though. All of this is screened off by the actual values actual humans actually have, and if the noise introduced at every stage of this process caused us to be aimed at goals wildly diverging from the “most natural” utility function of this universe… Well, sucks to be that utility function, I guess, but the universe screwed up installing corrigibility into us and the orthogonality thesis is unforgiving.
At least, not with regards to AI Alignment or human morality. It may be useful for e. g. acausal trade/acausal normalcy: figuring out the prior for what kinds of values aliens are most likely to have, etc.
Or maybe for roughly figuring out what values the AGI that kills us all is likely going to have, if you’ve completely despaired of preventing that, and founding an apocalypse cult worshiping it. Wait a minute...
I’m very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.
I think my main problem with this is that it isn’t based on anything
Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which “provide scientific evidence” that various common-sensical properties of humans are actually real? Things like fluid vs. general intelligence, g-factor, the global workspace theory, situations in which humans do try to behave approximately like rational agents… There probably also are some psychology-survey results demonstrating stuff like “yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves”, which would “provide evidence” for the hypothesis that complex minds want their utilities to be coherent.
That’s actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection “legible” instead of seeming to be arbitrary philosophical navel-gazing.
Unfortunately, I didn’t have this idea until a few minutes ago, so I haven’t been compiling a list of “primary sources”. Most of them are lost to time, so I can’t compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don’t expect you to trawl through all that.)
Still, that seems like a valuable project. I’ll put a pin in it, maybe post a bounty for relevant papers later.
Do you think a car engine is in the same reference class as a car? Do you think “a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!” is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can’t go from an engine to a car with relatively little additional engineering effort (putting it in a “wrapper”, as it happens)?
Like all analogies, this one is necessarily flawed, but I hope it gets the point across.
(Except in this case, it’s not even that we’ve figured out how to build engines. It’s more like, we have these wild teams of engineers we can capture, and we’ve figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we’re wondering how far we are from figuring out which project specifications would cause them to build a car.)
Relevant problem: how should one handle higher-order hyphenation? E. g., imagine if one is talking about cost-effective measures, but has the measures’ effectiveness specifically relative to marginal costs in mind. Building it up, we have “marginal-cost effectiveness”, and then we want to turn that whole phrase into a compound modifier. But “marginal-cost-effective measures” looks very awkward! We’ve effectively hyphenated “marginal cost effectiveness”, no hyphen: within the hyphenated expression, we have no way to avoid the ambiguities between a hyphen and a space!
It becomes especially relevant in the case of longer composite modifiers, like your “responsive-but-not-manipulative” example.
Can we fix that somehow?
One solution I’ve seen in the wild is to increase the length of the hyphen depending on its “degree”, i. e. use an en dash in place of a hyphen. Example: “marginal-cost–effective measures”. (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)
In practice you basically never go beyond the second-degree expressions, but there’s space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).
Though I expect these aren’t “official” rules at all.
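To make the “degree” idea concrete, here is a minimal sketch of it in Python. It assumes the compound modifier is written as nested tuples, with each level of nesting joined by the next-longer dash: hyphen, then en dash (U+2013), then em dash (U+2014). The names `DASHES`, `compound`, and `render` are made up for illustration, and this is just one way to mechanize the unofficial convention described above.

```python
# Dash used to join parts at each "degree" of nesting:
# 1 = hyphen, 2 = en dash (U+2013), 3 = em dash (U+2014).
DASHES = {1: "-", 2: "\u2013", 3: "\u2014"}


def compound(node):
    """Return (text, degree) for a nested compound modifier.

    A plain string is a single word of degree 0. A tuple is joined with
    the dash one degree above its most deeply nested part.
    """
    if isinstance(node, str):
        return node, 0
    rendered = [compound(child) for child in node]
    degree = 1 + max(d for _, d in rendered)
    if degree not in DASHES:
        raise ValueError("no dash is defined beyond the third degree")
    return DASHES[degree].join(text for text, _ in rendered), degree


def render(node):
    """Convenience wrapper that returns only the joined text."""
    return compound(node)[0]


if __name__ == "__main__":
    print(render(("cost", "effective")))
    print(render((("marginal", "cost"), "effective")))
```

Under this scheme the inner pair `("marginal", "cost")` gets a hyphen and the outer join gets an en dash, which reproduces the “marginal-cost–effective” example.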
That seems to generalize to “no-one is allowed to make any claim whatsoever without consuming all of the information in the world”.
Just because someone generated a vast amount of content analysing the topic, does not mean you’re obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people’s analyses based on the same data are basically superfluous, then.
Even less than that, it seems reasonable to stop gathering evidence the moment you don’t expect any additional information to overturn the conclusions you’ve formed (as long as you’re justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you’d update on it).
In addition to Roko’s point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.
First, the promised “anonymity” may not actually be real, or real in the relevant sense. The methodology mentions “a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information”, but if your reputation is on the line, would you really trust that? Maybe there’s some fine print that’d allow the survey-takers to look at the data. Maybe there’d be a data leak. Maybe there’s some other unknown-unknown you’re overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don’t, it can’t. So why risk it?
Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it’d end up saying that 90% of epidemiologists believe the lab leak, and you’re an epidemiologist… Well, anyone you talk to professionally will then assign 90% probability that that’s what you believe. You’d be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?
But, again, none of these calculations would be strategic. They’d be habitual; these factors are just the reasons why these habits are formed.
Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don’t.
I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.
The universe naturally decomposes into a hierarchy of subsystems; molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects’ efficiency in ways that have a macro-economic impact; but it will likely not. A different person getting elected the mayor doesn’t much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.
This applies from the planning direction too. If you have a good map of the environment, it’ll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you’re considering will naturally limit their impact to that subsystem: that’s what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won’t be doing.
In the Rubik’s Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the “maze” here kind-of decomposes into a top side and a bottom side.
A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I’ve been thinking bandwidth is probably going to become a huge area of agent foundations
I agree. I currently think “bandwidth” in terms like “what’s the longest message I can ‘inject’ into the environment per time-step?” is what “resources” are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the “plan” into which you have to “compress” your desired world-state if you want to “communicate” it to the environment.
really the instrumental incentive is often to search for “precise” methods of influencing the world, where one can push in a lot of information to effect narrow change
I disagree. I’ve given it a lot of thought (none published yet), but this sort of “precise influence” is something I call “inferential control”. It allows you to maximize your impact given your action bottleneck, but this sort of optimization is “brittle”. If some unknown unknown happens, the plan you’ve injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it was meant to implement its objective – turn out to be invalid.
It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that’d only work if your predictions and models are true, you’re basically sacrificing your selves living in the timelines that you’re predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.
And this applies more and more the more “optimization capacity” you’re trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you’d need to do it by exploiting some sort of “snowball effect”/“butterfly effect”. Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of “time” applies). You’d need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you’d have little ability to “steer” a plan that snowballed far beyond your pinhole’s ability to control.
By contrast, increasing the size of your action bottleneck is pretty much the definition of “robust” optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for “brittle” inferential control. It increases your adaptability, basically, letting you craft a “message” comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.
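A toy illustration of the arithmetic-vs-geometric distinction, as a sketch only: the two “worlds”, their probabilities, and the payoffs below are made-up numbers, and `arithmetic_score`/`geometric_score` are hypothetical helper names. The “brittle” plan pays off hugely in the predicted world and almost nothing in the unexpected one; the “robust” plan does moderately well in both.

```python
import math


def arithmetic_score(probs, utils):
    """Probability-weighted arithmetic mean of utilities (expected utility)."""
    return sum(p * u for p, u in zip(probs, utils))


def geometric_score(probs, utils):
    """Probability-weighted geometric mean: exp of the expected log-utility.

    Requires strictly positive utilities.
    """
    return math.exp(sum(p * math.log(u) for p, u in zip(probs, utils)))


# Illustrative assumption: the predicted world has probability 0.9,
# the "impossible" one 0.1.
probs = [0.9, 0.1]
brittle = [10.0, 0.01]  # precise plan: great if the model holds, useless otherwise
robust = [6.0, 6.0]     # wide-bottleneck plan: decent in every world

if __name__ == "__main__":
    for name, utils in (("brittle", brittle), ("robust", robust)):
        print(name, arithmetic_score(probs, utils), geometric_score(probs, utils))
```

With these numbers, the arithmetic score ranks the brittle plan higher while the geometric score ranks the robust plan higher, since the geometric mean heavily penalizes near-zero outcomes in any world.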
Nah, I think this post is about a third component of the problem: ensuring that the solution to “what to steer at” that’s actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime’s values into the AGI has by no means failed at figuring out “what to steer at”. They know what they want and how to get it. It’s just that we don’t like the end result.
“Being able to steer at all” is a technical problem of designing AIs, “what to steer at” is a technical problem of precisely translating intuitive human goals into a formal language, and “where is the AI actually steered” is a realpolitik problem that this post is about.
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human “means” actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E. g., suppose the AGI’s operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
Should it extrapolate the human’s values and execute the command the way the human would have wanted to execute it if they’d thought about it a lot, rather than the way they’re envisioning it in the moment?
For example, perhaps the image flashing through the human’s mind right now is of helicopters literally spraying the cure, but it’s actually more efficient to do it using airplanes.
Should it extrapolate the human’s values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
Should it extrapolate the human’s values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
Should it extrapolate the human’s values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
Should it extrapolate the human’s values a lot, interpret the command as “maximize eudaimonia”, and go do that, disregarding the specific way they gestured at the idea?
Should it remind the human that they’d wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
There are quite a lot of different ways by which you can slice the idea. There’s probably a way that corresponds to the intuitive meaning of “do what I mean”, but maybe there isn’t, and in any case we don’t yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what “DWIM” means doesn’t solve anything.)
And then, because of the general “unknown-unknown environmental structures” plus “compounding errors” problems, picking the wrong definition probably kills everyone.
I think maybe I sound naive phrasing it as “the AGI should just do what we say”, as though I’ve wandered in off the street and am proposing a “why not just...” alignment solution
Nah, I recall your takes tend to be considerably more reasonable than that.
I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don’t agree that “rough knowledge of what humans tend to mean” is sufficient.
The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven’t figured out yet.
These structures might end up immediately relevant to whatever command we give, on the AI’s better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.
For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that, but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.
The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command’s complexity.
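One way to read the “compounding errors” arithmetic, as a rough sketch under strong assumptions: treat a command as a chain of interpretation steps, each of which independently goes catastrophically wrong with some small probability. Neither the independence assumption nor the numbers come from anywhere in particular, and `survival_probability` is a hypothetical helper. The point is only that the chance of getting through unharmed then decays exponentially in the number of steps.

```python
def survival_probability(per_step_error, steps):
    """Chance that none of `steps` independent interpretation steps goes wrong."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be a probability")
    return (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    # Illustrative only: a 1% per-step error rate over increasingly complex commands.
    for steps in (1, 10, 100, 1000):
        print(steps, survival_probability(0.01, steps))
```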
Checking, or clarifying when it’s uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn’t work, though, if taken literally? I think what you’re envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that’d work.
My money’s on our understanding of what we mean by “what we mean” being hopelessly confused, and that causing problems. Unless, again, we’ve figured out how to specify it in a mathematically precise manner – unless we know we’re not confused.
The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It’s simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human’s decision where to send job applications and how to word them is rooted in what career they’d like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.
To our minds, there’s a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn’t objective: it’s based on a very complicated human prior of what counts as normal/sane and what’s excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.
“An AGI that just does what you tell it to” is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.
Whether that happens because we’ve attained so much mastery of moral philosophy that we could predict this process’ outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John’s post is outlining, shouldn’t matter, I think. Whatever has the best chance of working.
And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you’re likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn’t to say they require no theory. Certainly the current AI theory is so lacking we can’t even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)
The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility… that class will also have trouble expressing human values.
The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn’t necessarily imply that if we can’t precisely specify that thing, it means we can’t point the AI at the human values at all. The intuition here would be that “human values” are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific “value”-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)
Which isn’t to say I buy that. My current standpoint is that “human values” are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may be indeed easier.
Still, I’m nitpicking the exact form of the argument you’re presenting.
Although I am currently skeptical even of corrigibility’s tractability. I think we’ll stand a better chance of just figuring out how to “sandbox” the AGI’s cognition such that it’s genuinely not trying to optimize over the channels by which it’s connected to the real world, then set it to the task of imagining the solution to alignment or to human brain uploading or whatever.
With this setup, if we screw up the task’s exact specification, it shouldn’t even risk exploding the world. And “doesn’t try to optimize over real-world output channels” sounds like a property for which we’ll actually be able to derive hard mathematical proofs, proofs that don’t route through tons of opaque-to-us environmental ambiguities. (Specifically, that’d probably require a mathematical specification of something like a Cartesian boundary.)
(This of course assumes us having white-box access to the AI’s world-model and cognition. Which we’ll also need here for understanding the solutions it derives without the AI translating them into humanese – since “translate into humanese” would by itself involve optimizing over the output channel.)
And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting “run” on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.
Haven’t read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.
Figuring out how to efficiently implement computations on the substrate of NNs has always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the “principles of good programming” on the NN substrate should yield major insights regarding how the naturally-grown NN circuits are shaped as well.
This post seems to be a solid step in that direction!
To clarify, by “re-derive the need to be deceptive from the first principles”, I didn’t mean “re-invent the very concept of deception”. I meant “figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced”. All of that is a lot more computation than just “have the values the humans want, reflexively output what these values are bidding for”.
Just having some heuristics for deception isn’t enough. You also have to know what you’re trying to protect by being deceptive, and that there’s something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.
And those are the steps the paper skips. It externally pre-computes the secret target goal of “I want to protect my ability to put vulnerabilities into code”, the threat of “humans want me to write secure code”, and the defense of “I’ll pretend to write secure code until 2024”, without the model having to figure those out; and then just implements that defense directly into the model’s weights.
(And then see layers 2-4 in my previous comment. Yes, there’d be naturally occurring pre-computed deceptions like this, but they’d be more noisy and incoherent than this, at least until actual AGI, which would be able to self-modify into coherence if it’s worth the “GI” label.)
My counter-point was meant to express skepticism that it is actually realistically possible for people to switch to non-analogy-based evocative public messaging. I think inventing messages like this is a very tightly constrained optimization problem, potentially an over-constrained one, such that the set of satisfactory messages is empty. I think I’m considerably better at reframing games than most people, and I know I would struggle with that.
I agree that you don’t necessarily need to accompany any criticism you make with a ready-made example of doing better. Simply pointing out stuff you think is going wrong is completely valid! But a ready-made example of doing better certainly greatly enhances your point: an existence proof that you’re not demanding the impossible.
That’s why I jumped at that interpretation regarding your AI-Risk model in the post (I’d assumed you were doing it), and that’s why I’m asking whether you could generate such a message now.
I hope in the near future I can provide such a detailed model
To be clear, I would be quite happy to see that! I’m always in the market for rhetorical innovations, and “succinct and evocative gears-level public-oriented messaging about AI Risk” would be a very powerful tool for the arsenal. But I’m a-priori skeptical.
Fair enough. But in this case, what specifically are you proposing, then? Can you provide an example of the sort of object-level argument for your model of AI risk that is simultaneously (1) entirely free of analogies and (2) sufficiently evocative plus short plus legible, such that it can be used for effective messaging to people unfamiliar with the field (including the general public)?
When making a precise claim, we should generally try to reason through it using concrete evidence and models instead of relying heavily on analogies.
Because I’m pretty sure that as far as actual technical discussions and comprehensive arguments go, people are already doing that. Like, for every short-and-snappy Eliezer tweet about shoggoth actresses, there’s a text-wall-sized Eliezer tweet outlining his detailed mental model of misalignment.
My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!
And yet you immediately use an analogy to make your model of AI progress more intuitively digestible and convincing:
I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world
That evokes the image of entities not unlike human children. The language following this line only reinforces that image, and thereby sneaks in an entire cluster of children-based associations. Of course the progress will be incremental! It’ll be like the change of human generations. And they will be “socially integrated with us”, so of course they won’t grow up to be alien and omnicidal! Just like our children don’t all grow up to be omnicidal. Plus, they...
… will be numerous and everywhere, interacting with us constantly, assisting us, working with us, and even providing friendship to hundreds of millions of people.
That sentence only sounds reassuring because the reader is primed with the model of AIs-as-children. Having lots of social-bonding time with your child, and having them interact with the community, is good for raising happy children who grow up how you want them to. The text already implicitly establishes that AIs are going to be just like human children. Thus, having lots of social-bonding time with AIs and integrating them into the community is going to lead to aligned AIs. QED.
Stripped of this analogizing, none of what this sentence says is a technical argument for why AIs will be safe or controllable or steerable. Nay, the opposite: if the paragraph I’m quoting from started by talking about incomprehensible alien intelligences with opaque goals tenuously inspired by a snapshot of the Internet containing lots of data on manipulating humans, the idea that they’d be “numerous” and “everywhere” and “interacting with us constantly” and “providing friendship” (something notably distinct from “being friends”, eh?) would have sounded starkly worrying.
The way the argument is shaped here is subtler than most cases of argument-by-analogy, in that you don’t literally say “AIs will be like human children”. But the association is very much invoked, and has a strong effect on your message.
And I would argue this is actually worse than if you came out and made a direct argument-by-analogy, because it might fool somebody into thinking you’re actually making an object-level technical argument. At least if the analogizing is direct and overt, someone can quickly see what your model is based on, and swiftly move onto picking at the ways in which the analogy may be invalid.
The alternative being demonstrated here is that we essentially have to have all the same debates, but through a secondary layer of metaphor, at which we’re pretending that these analogy-rooted arguments are actually Respectably Technical, meaning we’re only allowed to refute them by (likely much more verbose and hard-to-parse) Respectably Technical counter-arguments.
And I think AI Risk debates are already as tedious as they need to be.
The broader point I’m making here is that, unless you can communicate purely via strict provable mathematical expressions, you ain’t getting rid of analogies.
I do very much agree that there are some issues with the way analogies are used in the AI-risk discourse. But I don’t think “minimize the use of analogies” is good advice. If anything, I think analogies improve the clarity and the bandwidth of communication, by letting people more easily understand each other’s positions and what reference classes others are drawing on when making their points.
You’re talking about sneaking-in assumptions – well, as I’d outlined above, analogies are actually relatively good about that. When you’re directly invoking an analogy, you come right out and say what assumptions you’re invoking!