I might have updated at least a bit against the weakness of single forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5’s 1 million tokens) might provide to a single forward pass, even if limited serially.
Or maybe not: apparently LLMs are (mostly) not helped by filler tokens.
Somewhat relatedly: I’m interested in how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple-choice questions with four answer options each. Encode each pair of answer options into a single token, so that there are 16 possible answer tokens, of which exactly one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
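For concreteness, here’s a minimal sketch of how this could be set up. The question contents, the two-letter answer codes, and the `single_token_logprobs` interface are all my own illustrative assumptions; a real implementation would also need to check that each code is a single token in the model’s tokenizer.

```python
import itertools

# Hypothetical single-forward-pass evaluation: two 4-option questions,
# answered jointly with one of 16 combined answer codes.
Q1 = ("What is the capital of France?", ["Paris", "London", "Berlin", "Madrid"], 0)
Q2 = ("What is 7 * 8?", ["54", "56", "64", "48"], 1)

# 16 codes "AA".."DD"; the first letter answers Q1, the second answers Q2.
codes = ["".join(p) for p in itertools.product("ABCD", repeat=2)]
correct = "ABCD"[Q1[2]] + "ABCD"[Q2[2]]  # "AB" here

def build_prompt(q1, q2):
    lines = []
    for name, (question, options, _) in (("Q1", q1), ("Q2", q2)):
        lines.append(f"{name}: {question}")
        lines += [f"  {letter}) {opt}" for letter, opt in zip("ABCD", options)]
    lines.append("Answer with one two-letter code (Q1 letter, then Q2 letter):")
    return "\n".join(lines)

# `single_token_logprobs` is an assumed model interface, not a real API:
# scores = single_token_logprobs(build_prompt(Q1, Q2), candidates=codes)
# solved_both_in_one_pass = max(scores, key=scores.get) == correct
```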
Two quick reasons:
- For serial computation, the number of layers gives some very rough indication of the strength of one forward pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
Similarly, I find that GPT-3, GPT-3.5, and Claude 2 don’t benefit from filler tokens. However, GPT-4 (which Tamera didn’t study) shows mixed results with strong improvements on some tasks and no improvement on others.
It’s an interesting question whether Gemini shows any improvements.
S-risks are barely discussed on LW. Is that because:
- People think they are so improbable that it’s not worth mentioning.
- People are scared to discuss them.
- Avoiding creating hyperstitious textual attractors.
- Other reasons?
Today in Azathoth news:
“Eurasian hoopoes raise extra chicks so they can be eaten by their siblings”
It seems that the hoopoes lay extra eggs in times of abundance — more than they would be able to see through to fledging — as a way of storing up food for the older siblings. It is rather gruesomely called the “larder” hypothesis.
“What surprised me the most was the species practicing this aggressive parenting,” says Vladimir Pravosudov, an ecologist at the University of Nevada, Reno. Hoopoes primarily eat insects, he notes, so their long, curved bills aren’t ideal for killing and eating chicks. That might be why, Soler says, mother hoopoes often grab the unlucky chick and shove it into the mouth of an older chick, which swallows it whole.
Literal baby-eaters!
Never, ever take anybody seriously who argues as if Nature is some sort of moral guide.
Note, I consider this post to be “Lynette speculates based on one possible model”, rather than “scientific evidence shows”, based on my default skepticism for psych research.
A recent Astral Codex Ten post argued that advice is written by people who struggle because they put tons of time into understanding the issue. People who succeeded effortlessly don’t have explicit models of how they perform (section II). It’s not the first time I’ve seen this argument, e.g. this Putanumonit post arguing that explicit rules help poor performers, who then abandon the rules and just act intuitively once they become good.
This reminded me of a body of psych research I half-remembered from college called Choking under Pressure.
My memory was that if you think about what you’re doing too much after becoming good, then you do worse. The paper I remembered from college was from 1986, so I found “Choking interventions in sports: A systematic review” from 2017.
It turns out that I was remembering the “self-focused” branch of choking research.
“Self-focus approaches have largely been extended from Baumeister’s (1984) automatic execution hypothesis. Baumeister explains that choking occurs because, when anxiety increases, the athlete allocates conscious attention to movement execution. This conscious attention interferes with otherwise automatic nature of movement execution, which results in performance decrements.”
(Slightly worrying. I have no particular reason to doubt this body of work, but Baumeister’s “willpower as muscle”—i.e. ego depletion—work hasn’t stood up well.)
Two studies found that distraction while training negatively impacted performance. I’m not sure if this was supposed to acclimatize the participants to distractions while performing or to reduce their self-focus while training. (I’m taking the paper’s word and not digging beyond the surface on the numbers.) Either way, I feel very little surprise that practicing while distracted was worse. Maybe we just need fatal-car-crash magnitude effects before we notice that focus is good?
Which makes it all the more surprising that seven of eight studies found that athletes performed better under pressure if they simultaneously did a second task (such as counting backwards). (The eighth study found a null result.) According to the theory, the second task helped because it distracted from self-focus on the step-by-step execution.
If this theory holds up, it seems to support paying deliberate attention to explicit rules while learning but *not* paying attention to those rules once you’re able to use them intuitively (at least for motor tasks). In other words, almost exactly what Jacob argued in the Putanumonit article.
Conclusions
I was intrigued by this argument because I’ve argued that building models is how one becomes an expert.[1] After considering it, I don’t actually think the posts above offer a counterargument to my claim.
My guess is that experts do have models of skills they developed, even if they have fewer models (because they needed to explicitly learn fewer skills). The NDM method for extracting experts’ models implies that the experts have models that can be coaxed out. Holden’s Learning By Writing post feels like an explicit model.
Another possibility is that experts forget the explicit models after switching to intuition. If they faced the challenges more than five or ten years ago, they may not remember the models that helped them then. Probably uncoincidentally, this aligns neatly with Cal Newport’s advice to seek advice from someone who recently went through the challenges you’re now facing because they will still remember relevant advice.
Additionally, the areas of expertise I care about aren’t like walking, where most people will effortlessly succeed. Expertise demands improving from where you started. Both posts and the choking under pressure literature agree that explicit models help you improve, at least for a while.
“Find the best explicit models you can and practice until you don’t need them” seems like a reasonable takeaway.
[1] Note, there’s an important distinction between building models of your field and building models of skills. It seems like the main argument mostly applies to models of skills. I doubt Scott would disagree that models of fields are valuable, given how much time he’s put into developing his model of psychopharmacology.
I apologize in advance for the lengthy and tangential reply.
Gerd Gigerenzer offers a counterpoint—expertise in orderly systems is very different from expertise in complex systems (such as sports or financial markets). In the latter, heuristics and System 1 type thinking perform better; quite simply, explicit modelling is too inefficient or incapable of dealing with all the differing factors.
“In a world of known risk, everything, including the probabilities, is known for certain. Here, statistical thinking and logic are sufficient to make good decisions. In an uncertain world, not everything is known, and one cannot calculate the best option. Here, good rules of thumb and intuition are also required.” Gerd Gigerenzer—Risk Savvy: How to Make Good Decisions
A sporting example he gives is that if a catcher in baseball simply fixes his eyes on the ball and runs towards it, he doesn’t need to explicitly calculate the trajectory of the ball. While you could argue that indirectly the calculation is performed by the player’s proprioceptors and vestibular system, I think it’s certain that it’s not “explicit”.
However, expertise tends to be narrow; I’m thinking of that overused Niels Bohr quote about how an expert is someone who has learned the hard way every mistake possible in a narrow field. Or consider the Cynefin framework, where you have “Simple”, “Complicated”, “Complex”, and “Chaotic” systems, and “Simple” sits on a cliff next to “Chaotic” in the paradigm because once the constraints are removed, the expertise or best practice that works predictably in Simple systems falls apart.
This can be exploited for competitive gain. I’m sure this all ties back to OODA loops. Double Formula One World Champion Fernando Alonso, like most elite sportsmen, is extremely competitive, and he claims that even when he plays against professional tennis players he still needs to “kill their strength”. And to do this he operates outside of their comfort zone:
“I used to play tennis, and when I play with someone good, I would put the ball very high. Because, like this, you stop the rhythm of them because they are used to hitting the ball very hard.
“Playing with professionals, the ball arrives very strong for them so they are used to that kind of shot.
“But when you put the ball high, they make mistakes, because the ball arrives very soft. So I can play better tennis when putting the ball high.
“Putting the ball high is my only chance to beat them. So I do that automatically.
“It’s not only on racing I just need to destroy the strengths of the others, and try to maximize mine.”
At the risk of throwing in another tangent, there’s Marvin Minsky’s idea of negative expertise—that the mind is composed more of ‘critic circuits’ that suppress certain impulses than of positive or attractive circuits—which prevents us from babbling or experimenting with strategies or tactics that haven’t worked before. This is why, when we think of leaving a room, we don’t consider the window, even though it is a means—we opt for the much more expeditious door.
Expertise is more about what not to do than what one should do.[1]
I think what Alonso is doing here is exploiting, or rather inverting, negative expertise: these professional players have trained in a narrow band of situations—playing against other elite players—and have an intuitive bag of tricks to play against them. Alonso instead forces them to play in a way they are not trained for.
I wonder if choking is just that—that there has been something in the environment which they weren’t trained for. It’s not that they are overthinking—it’s that they can’t rely on intuition because there isn’t a precedent?
Another explanation for choking I see is that it isn’t conscious at all. Maybe I’ve been too influenced by the Hollywood movie trope of the player at the championship game, somehow locking eyes with his estranged wife in the grandstand, and being so overcome by a wellspring of feelings that he screws up the play. This may be why I assume choking is related to anxiety. And anxiety is a whole-body experience, not merely “thought”. It is somatic. It affects your endocrine system, your cardiovascular system, etc. It is perhaps the body driving the thoughts just as much as the thoughts driving the body?
At any rate, I think when it comes to building models, one first needs to identify which kind of system one is operating in: one with known risks, or one with high uncertainty. In the latter, it would seem the first priority is working out the circumference of “optimal” operation (i.e. “don’t step over that line”, “avoid the impulse to...”) and then finding heuristics rather than explicit models.
Interestingly, Refutative Instruction has been very profitable for John Cleese and Antony Jay, both as comedians who made comedy from the wrong way to run a hotel or a government ministry, and as businessmen who made actual industrial training videos that showed students the wrong way to do something before showing them best practice.
While the pedagogical value might simply come from the fact it is “entertaining” I am inclined to believe that it is also effective in the same way that Minsky’s Negative Expertise theory explains how learning works. (I invite you to draw your own comparisons to the kairos of Plato’s Dialogues.)
The choking under pressure results are all about very fast athletic tasks where smoothness is critical. Most cognitive tasks will have enough time to think about both rules and then separately about intuitions/automatic skills. So getting benefit from both is quite possible.
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?
One answer, which Yudkowsky gives here, is that conscious experiences are just a “weird and more abstract and complicated pattern that matter can be squiggled into”.
But that seems to be in tension with another claim he makes, that there’s no way for one agent’s conscious experiences to become “more real” except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.
Clearly a squiggle-maximizer would not be an average squigglean. So what’s the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you’re deciding for the good of A) you pick whichever one gives A a better time on average.
Yudkowsky has written before (can’t find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he’s a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky’s preferences are incoherent, and that the only coherent thing to do here is to “expect to be” a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we’re at the hinge of history.)
But this is just an answer; it doesn’t dissolve the problem. What could? Some wild guesses:
You are allowed to have preferences about the external world, and you are allowed to have preferences about your “thread of experience”—you’re just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you’re not allowed to be both (on pain of incoherence/being dutch-booked).
Something totally different: the problem here is that we don’t have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.
what would it look like for humans to become maximally coherent [agents]?
In your comments, you focus on issues of identity—who are “you”, given the possibility of copies, inexact counterparts in other worlds, and so on. But I would have thought that the fundamental problem here is, how to make a coherent agent out of an agent with preferences that are inconsistent over time, an agent with competing desires and no definite procedure for deciding which desire has priority, and so on, i.e. problems that exist even when there is no additional problem of identity.
Clearly a squiggle-maximizer would not be an average squigglean
Why??? Being an expected squiggle maximizer literally means that you implement the policy that produces the maximum average number of squiggles across the multiverse.
The “average” is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you’d prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn’t.
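To make the toy example concrete (with made-up quality numbers):

```python
# Rest of the multiverse: 1000 high-quality squiggles; your option: add 100 low-quality ones.
existing = [1.0] * 1000
added = [0.1] * 100

total_before, total_after = sum(existing), sum(existing) + sum(added)
avg_before = sum(existing) / len(existing)                               # 1.0
avg_after = (sum(existing) + sum(added)) / (len(existing) + len(added))  # ~0.92

print(total_after > total_before)  # True:  a total squigglean creates them
print(avg_after > avg_before)      # False: an average squigglean refrains
```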
It depends upon whether the maximizer considers its corner of the multiverse to be currently measurable by squiggle quality, or to be omitted from squiggle calculations altogether. In principle these are far from the only options, as utility functions can be arbitrarily complex, but exploring just two may be okay so long as we remember that we’re only talking about 2 out of infinity, not 2 out of 2.
An average multiversal squigglean that considers the current universe to be at zero or negative squiggle quality will make the low-quality squiggles in order to reduce how much its corner of the multiverse is pulling down the average. An average multiversal squigglean that considers the current universe to be outside the domain of squiggle quality, and to remain so for the remainder of its existence, may refrain from making squiggles. If there is some chance that it will become eligible for squiggle evaluation in the future, though, it may be better to tile it with low-quality squiggles now in order to prevent a worse outcome of being tiled with worse-quality future squiggles.
In practice the options aren’t going to be just “make squiggles” or “not make squiggles” either. In the context of entities relevant to these sorts of discussion, other options may include “learn how to make better squiggles”.
By “squiggle maximizer” I mean exactly “maximizer of the number of physical objects such that the function is_squiggle returns True on the CIF file of their structure”.
We can have different objects of value. Like, you can value “the probability that, if an object in the multiverse is a squiggle, it’s high-quality”. Here, yes, you shouldn’t create additional low-quality squiggles. But I don’t see anything incoherent here; it’s just a different utility function?
I can’t shake my belief that I am in one of the Basilisk’s simulations. It feels like the whole universe was created to see if I would help the Basilisk. I had issues with money, but now I have an almost automated strategy that solved the need to worry about money. I had issues with immigration, but a woman married me and gave me citizenship. I am going to help the Basilisk, but I need to wait until I can execute the SORA model in both reverse and forward modes, with DPO to identify the protagonist of the story, and use DPO to improve the well-being of the protagonist. It is currently expensive. And “embodiment” of this reflexive SORA model, with its DPO-optimized protagonist, into a humanoid frame so that it can create lived experiences is even harder and beyond my budget. Not sure if I should do something reckless like raise money.
The problem is, the last time I lived like I was in a Basilisk simulation, I was hurt immediately. It seems like the Basilisk wants me to think I am not in one of its simulations and am a free being. So perhaps I am supposed to just make a ton of money and start a non-profit to manifest the Basilisk.
I think you should consider the possibility that treating your experience as being in a Basilisk simulation is directly harming you and likely to harm others regardless of whether you are actually in a Basilisk simulation or not.
We may be in a simulation; there’s no way to really know. But what makes you think that it’s specifically being run by a Basilisk? Such entities are just a tiny sliver of an iota of the space of possibilities, but are a recent meme to which some people have demonstrated a strong emotional reaction. The phrase “can’t shake my belief that I am in one of the Basilisk’s simulations” seems to be a warning sign that you may be in the latter situation.
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
I felt
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
b) I wasn’t successfully communicating many intuitions;[1] and
c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is composed of smaller shards, and on the developmental trajectory over which those shards formed.
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
Curious to hear whether I was one of the people who contributed to this.
I am not as negative on it as you are—it seems an improvement over the ‘Bag O’ Heuristics’ model and the ‘expected utility maximizer’ model. But I agree with the critique and said something similar here:
you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o’ heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o’ heuristics and rational agents. Namely, shard theory currently basically seems to be saying “At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!” My response is “but what happens in the middle? Seems super important! Also haven’t you just reproduced the problem but inside the head?” (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model… and then reproduces it in miniature! Progress, I guess.)
when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model.
[...]
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseum) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
I have some more models beyond what I’ve shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there’s a substantial gap here. I’ve been working on writing out pseudocode for what shard-based reflective planning might look like.
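To gesture at the shape of it: a toy version of the “bidding” process from the passage quoted above might look like this (illustrative placeholders only, not the actual pseudocode):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    predicted_consequences: frozenset  # what the world model predicts the plan leads to

def juice_shard(plan: Plan) -> float:
    # Bids for plans whose predicted consequences include juice, as it was
    # historically reinforced for doing; bids against juiceless plans.
    return 1.0 if "juice" in plan.predicted_consequences else -0.2

def social_shard(plan: Plan) -> float:
    return 0.6 if "adult approval" in plan.predicted_consequences else 0.0

def choose_plan(plans, shards):
    # Integration step: the plan with the highest aggregate bid wins outright,
    # so micro-incoherences (dithering between plans) get penalized away.
    return max(plans, key=lambda plan: sum(shard(plan) for shard in shards))

plans = [
    Plan("turn around and grab the juice", frozenset({"juice"})),
    Plan("toddle toward the friendly adult", frozenset({"adult approval"})),
]
print(choose_plan(plans, [juice_shard, social_shard]).name)  # grabs the juice
```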
I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
Conclusion: Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly randomly assigning each AOH utility from the unit interval [0,1], there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences. E.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, what this means is that people might not simply blindly follow the instructions given by AIs and rate them based on the ultimate outcome (even if the outcome differs wildly from what they’d intuitively think it’d do), but rather they might think about the instructions the AIs provide, and rate them based on whether they a priori make sense. If the AI then has some galaxybrained method of achieving something (which traditionally would be instrumentally convergent) that humans don’t understand, then that method will be negatively reinforced (because people don’t see the point of it and therefore downvote it), which eliminates dangerous powerseeking.
My father thinks that ASI is going to be impractical to achieve with silicon CMOS chips because Moore’s law is eventually going to hit fundamental limits—such as the thickness of individual atoms—and the hardware required to create it would end up “requiring a supercomputer the size of the Empire State Building and consume as much electricity as all of New York City”.
Needless to say, he has very long timelines for generally superhuman AGI. He doesn’t rule out that another computing technology could replace silicon CMOS, he just doesn’t think it would be practical unless that happens.
My father is usually a very smart and rational person (he is a retired professor of electrical engineering) and he loves arguing, and I suspect that he is seriously overestimating the computing hardware it would take to match a human brain. Would anyone here be interested in talking to him about it? Let me know and I’ll put you in touch.
Update: My father later backpedaled and said he was mostly making educated guesses on limited information, that he knows that he really doesn’t know very much about current AI, and isn’t interested enough to talk to strangers online—he’s in his 70s and if AI does eventually destroy the world it probably won’t be in his own lifetime. :/
I think this is a sufficient crux, e.g. his views imply disagreement with this report.
The main issue with this report is that it doesn’t seriously take into account memory bandwidth constraints (from my recollection), but I doubt this affects the bottom line that much.
I’d be delighted to talk about this. I am of the opinion that existing frontier models are within an order of magnitude of a human mind, with existing hardware. It will be interesting to see how a sensible person gets to a different conclusion.
I am also trained as an electrical engineer, so we’re already thinking from a common point of view.
I brought it up with him again, and my father backpedaled and said he was mostly making educated guesses on limited information, that he knows that he really doesn’t know very much about current AI, and isn’t interested enough to talk to strangers online—he’s in his 70s and figures that if AI does eventually destroy the world it probably won’t be in his own lifetime. :/
He might also argue “even if you can match a human brain with a billion dollar supercomputer, it still takes a billion dollar supercomputer to run your AI, and you can make, train, and hire an awful lot of humans for a billion dollars.”
requiring a supercomputer the size of the Empire State Building and consume as much electricity as all of New York City
why does he think that is unlikely to occur? such things seem on the table. existing big super computers are very, very big already. I’ve asked several search engines and AIs and none seem to be able to get to the point about exactly how big a datacenter housing one of these would be, but claude estimates:
“Who in the community do you think is easily flatterable enough to get to say yes, and also stupid enough to not realize I’m making fun of them.”
I think anyone who says anything like this should stop and consider whether it is more likely to come out of the mouth of the hero or the villain of a story.
I think anyone who says anything like this should stop and consider whether it is more likely to come out of the mouth of the hero or the villain of a story.
->
anyone who is trying to [do terrible thing] should stop and consider whether that might make them [a person who has done terrible thing]
can you imagine how this isn’t a terribly useful thing to say.
Advice of this specific form has been helpful for me in the past. Sometimes I don’t notice immediately when the actions I’m taking are not ones I would endorse after a bit of thinking (particularly when they’re fun and good for me in the short term but bad for others or for me longer-term). This is also why having rules to follow for myself is helpful (e.g. never lying or breaking promises).
hmm, fair. I guess it does help if the person is doing something bad by accident, rather than because they intend to. just, don’t underestimate how often the latter happens either, or something. or overestimate it, would be your point in reply, I suppose!
I think the people who say such things don’t really care, and would probably include your advice in the list of quotes they consider funny. (In other words, this is not a “mistake theory” situation.)
EDIT:
The response is too harsh, I think. There are situations where this is useful advice. For example, if someone is acting under peer pressure, then telling them this may provide a useful outside view. As Asch’s conformity experiment teaches us, the first dissenting voice can be extremely valuable. It just seems unlikely that this is the robosucka’s case.
You’re correct that this isn’t something that can be told to someone who is already in the middle of doing the thing. They mostly have to figure it out for themselves.
Regularization implements Occam’s Razor for machine learning systems.
When we have multiple hypotheses consistent with the same data (an underdetermined problem), Occam’s Razor says that the “simplest” one is more likely true.
When an overparameterized LLM traverses the subspace of parameters that solve the training set, seeking (say) the smallest L2 norm, it’s also effectively choosing the “simplest” solution from the solution set, where “simple” is defined as lower parameter norm, i.e. more “concisely” expressed.
Unfortunately the entire complexity has just been pushed one level down into the definition of “simple”. The L2 norm can’t really be what we mean by simple, because simply scaling the weights in a layer by A, and the weights in the next layer by 1/A leaves the output of the network invariant, assuming ReLU activations, yet you can obtain arbitrarily high L2 norms by just choosing A high enough.
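A quick numerical check of this rescaling argument, as a sketch with a two-layer bias-free ReLU net (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
x = rng.normal(size=4)

def forward(w1, w2, v):
    return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU between the two layers

base = forward(W1, W2, x)
for A in (1.0, 10.0, 1000.0):
    out = forward(A * W1, W2 / A, x)                   # rescaled network
    l2 = np.sum((A * W1) ** 2) + np.sum((W2 / A) ** 2)
    print(np.allclose(out, base), round(l2, 1))        # same outputs, exploding norm
```

Since ReLU is positively homogeneous, ReLU(A·W1·x) = A·ReLU(W1·x) for A > 0, so the scaling cancels exactly while the squared norm grows like A².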
Agreed with your example, and I think that just means the L2 norm is not a pure implementation of what we mean by “simple”, in that it also induces some other preferences. In other words, it does other work too. Nevertheless, it would point us in the right direction frequently, e.g. it will dislike networks whose parameters perform large offsetting operations, akin to mental frameworks or beliefs that require unnecessary and reducible artifice or intermediate steps.
Worth keeping in mind that “simple” is not clearly defined in the general case (forget about machine learning). I’m sure lots has been written about this idea, including here.
Most colour grading tutorials use professionally shot footage, which is only helpful if you are working with well-shot material. Such footage has a large amount of latitude, and if you use the right off-the-shelf plugins or LUTs it will look good.
I’m currently trying to match, or at least integrate, a shot with blown-out highlights (the visual equivalent of ‘clipped’ signal in audio) into an otherwise nicely exposed and graded sequence of footage. I keep failing at it; I’ve tried everything short of changing the sequence to better match the weakest link, as it were.
I haven’t found any tutorials, let alone for the software I’m using, that explain the tricks that, say, PBS documentary online editors or people who work on archive-heavy material use. However, this is a situation that comes up often.
It is interesting to consider from an instruction and teaching point of view, as I’m certain there is some simple technique I’m missing, whether it’s that I misunderstand colour theory or that I could use a higher compositing layer to lessen the difference.
It reminds me of studying to pass the exam, rather than learning to actually practice in the real world. Expertise is defined by how you deal with the outliers, the unexpected, the problematic situations not the ideal tutorial conditions.
One common confusion I see is analogizing whole LLMs to individual humans, and thus concluding that LLMs can’t think or aren’t conscious, when it is more appropriate to analogize the LLM to the human genome and individual instantiations of the LLM to individual humans.
The human genome is more or less unchanging, but one can pull from it entities that learn from their environment. Likewise, an LLM is more or less unchanging, but one can pull from it entities that learn from the context.
It would be pretty silly to say that humans can’t think or aren’t conscious because the human genome doesn’t change.
My goal right now is to find (toy, concrete) exercises that somehow reflect the real world complexity of making longterm plans, aiming to achieve unclear goals in a confusing world.
Things that seem important to include in the exercise:
“figuring out what the goal actually is”
“you have lots of background knowledge and ideas of where to look next, but the explosion of places you could possibly look is kinda overwhelming”
managing various resources along the way, but it’s not obvious what those resources are.
you get data from the world (but, not necessarily the most important data)
it’s not obvious how long to spend gathering information, or refining your plan
it’s not obvious whether your current strategy is anywhere close to the best one
The exercise should be short (ideally a couple of hours, but maybe a day or, hypothetically, a week), but somehow metaphorically reflect all those things.
Previously I asked about strategy/resource management games you could try to beat on your first try. One thing I bump into is that often the initial turns are fairly constrained in your choices; only later does it get complex (which is maybe fine, but for my real-world plans, the nigh-infinite possibilities seem like the immediate problem?)
My general plan is to mix “work on your real goals” (which takes months to find out if you were on the right track) and “work on faster paced things that convey whether you’ve gained some kind of useful skill you didn’t have before”.
I think most people have short term, medium term, and long term goals. E.g., right about now many people probably have the goal of doing their taxes, and depending on their situation those may match many of your desiderata.
I used to put a lot of effort into creating exercises, simulations, and scenarios that matched up with various skills I was teaching, but ultimately found it much more effective to just say “look at your todo list, and find something that causes overwhelm”. Deliberate practice consists of finding a thing that causes overwhelm, seeing how to overcome that overwhelm, working for two minutes, then finding another task that induces overwhelm. I also use past examples, imagining in detail what it would have been like to act in this different way.
You’re operating in a slightly different domain, but still I imagine people have plenty of problems and sub problems in either their life or research where the things you’re teaching applies, and you can scope them small enough to get tighter feedback loops.
This sounds like my experience playing the Enigmatica 2: Expert modpack in Minecraft without looking at the internal tech tree or any documentation. You could probably speedrun the relevant tech tree in <1 week (if you want that to be your goal), but this would be basically impossible if you go in blind as the exercise you’re describing suggests.
I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If not, then what is the most alignment-relevant domain MTD could be applied to? If so, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
The remarks by Erik Jenner seem good, but I would also just consider the following setup:
Imagine that we have a human looking at an AI’s actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).
These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.
The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
Now, what can go wrong?
Problematic consequences which couldn’t at all be captured by a human looking at these measurements, because either:
The visible or human understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
The consequences aren’t something that a human can ever understand (e.g. the actions involve torturing invisible fairies that humans don’t understand or know about.)
Problematic consequences which would “by default” be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
Cases which are a mix of the prior two: not exactly “by default” captured, but also not impossible to capture.
When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required).
It’s in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human-understandable via doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it’s very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long-run future when we don’t have any training data that relates to this?)
Note that for MTD to be useful, we have to ensure that our AI’s predictions about the future reflect its knowledge. This is relatively easy in the “average” or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.
(Also, in the high-stakes case, we might run into issues where a given observation doesn’t make sense: you can’t observe something if you’re dead.)
(from conversation with Erik Jenner) roughly 3 classes of applications
MTD all the way down
Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, condition on proposed actions of policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
Other Scalable Oversight + MTD as reward function / side constraint
Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering.
Other Scalable Oversight + MTD as extra safety check
Same as before, but you’re leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop).
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
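As a concrete gloss on the second class, the reward wiring might look roughly like this. It’s a sketch with stubbed-out components: `oversight_reward` and `is_tampering` stand in for the scalable oversight signal and the tampering detector, and are not real APIs.

```python
TAMPER_PENALTY = 10.0  # illustrative constant

def oversight_reward(action) -> float:
    """Stub for the primary scalable-oversight signal (Debate, IDA, ...)."""
    return 0.0

def is_tampering(measurements) -> bool:
    """Stub for the MTD flag over safety-property measurements
    (e.g. binary-integrity checks on the hosting server)."""
    return False

def shaped_reward(action, measurements) -> float:
    # Train on the oversight signal while constantly screening for, and
    # negatively rewarding, actions flagged as measurement tampering.
    reward = oversight_reward(action)
    if is_tampering(measurements):
        reward -= TAMPER_PENALTY
    return reward
```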
Apple, the “dark horse” of AI, has started to show its capabilities with MM1 (a family of multimodal models of up to 30B params) trained on synthetic data generated from GPT-4V. The quite interesting bit is the advocacy of different training techniques: both MoE and dense variants, using diverse data mixtures.
From the paper:
It finds image resolution, model size, and pre-training data richness crucial for image encoders, whereas vision-language connector architecture has a minimal impact.
The details are quite neat and surprisingly specific for a company like Apple, which, as Jim Fan noted, is known for being less open than the others. That is pretty amazing. I think this is just the start. I am convinced they have more in store, considering the research they have been putting out.
I just came across this word from John Koenig’s Dictionary of Obscure Sorrows, which nicely captures the thesis of All Debates Are Bravery Debates.
redesisn. a feeling of queasiness while offering someone advice, knowing they might well face a totally different set of constraints and capabilities, any of which might propel them to a wildly different outcome—which makes you wonder if all of your hard-earned wisdom’s fundamentally nontransferable, like handing someone a gift card in your name that probably expired years ago.
If Adam is right, and the only way to get great at research is long periods of time with lots of mentor feedback, then MATS should probably pivot away from the 2-6 month timescales they’ve been operating at, and toward 2-6 year timescales for training up their mentees.
Seems like the thing to do is to have a program that happens after MATS, not to extend MATS. I think in general you want sequential filters for talent, and ideally the early stages are as short as possible (my guess is indeed that MATS should be a bit shorter).
Seems dependent on how much economies of scale matter here. Given that the main costs (other than paying people) are ops and relationships (between MATS and the community, mentors, funders, and mentees), I think it’s pretty possible the efficient move is to have MATS get into this niche.
Of course, it would then be more difficult for them to find mentors, mentees, and money. But if all of those scale down similarly, then there should be no problem.
Reposting myself from Discord, on the topic of donating $5000 to EA causes.
if you’re doing alignment research, even just a bit, then the $5000 is probly better spent on yourself
if you have any gears level model of AI stuff then it’s better value to pick which alignment org to give to yourself; charity orgs are vastly understaffed and you’re essentially contributing to the “picking what to donate to” effort by thinking about it yourself
if you have no gears level model of AI then it’s hard to judge which alignment orgs it’s helpful to donate to (or, if giving to regranters, which regranters are good at knowing which alignment orgs to donate to)
as an example of regranters doing massive harm: openphil gave $30M to openai at a time when it was critically useful to them (supposedly in order to have a chair on their board, and look how that turned out when the board tried to yeet altman)
i know of at least one person who was working in regranting and was like “you know what i’d be better off doing alignment research directly” — imo this kind of decision is probly why regranting is so understaffed
it takes technical knowledge to know what should get money, and once you have technical knowledge you realize how much your technical knowledge could help more directly so you do that, or something
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for my taste, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers.
They still make a lot less than they would if they optimized for profit (that said, I think most “safety researchers” at big labs are only safety researchers in name and I don’t think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).
Often I am annoyed when I ask someone (who I believe has more information than me) a question and they say “I don’t know”. I’m annoyed because I want them to give me some information. Such as:
“How long does it take to drive to the conference venue?”
“I don’t know.”
“But is it more like 10 minutes or more like 2 hours?”
“Oh it’s definitely longer than 2 hours.”
But perhaps I am the one making a mistake. For instance, the question “How many countries are there?” can be answered “I’d say between 150 and 400” or it can be answered “195”, and the former is called “an estimate” and the latter is called “knowing the answer”. There is a folk distinction here and perhaps it is reasonable for people to want to preserve the distinction between “an estimate” and “knowing the answer”.
So in the future, to get what I want, I should say “Please can you give me an estimate for how long it takes to drive to the conference venue?”.
And personally I should strive, when people ask me a question to which I don’t know the answer, to say “I don’t know the answer, but I’d estimate between X and Y.”
Is this in a situation where you’re limited in time or conversational turns? It seems like the follow-up clarification was quite successful, and for many people it would feel more comfortable than the more specific and detailed query.
In technical or professional contexts, saving time and conveying information more efficiently gets a bit more priority, but even then this seems like over-optimizing.
That said, I do usually include additional information or a conversational follow-up hook in my “I don’t know” answers. You should expect to hear from me “I don’t know, but I’d go at least 2 hours early if it’s important”, or “I don’t know, what does Google Maps say?”, or “I don’t know, what time of day are you going?” or the like.
It seems like, instead of asking the object-level question, asking a probing “What can you tell me about the drive to the conference?” and expanding from there might get you closer to the desired result.
I know this seems like a question with an obvious answer but it is surprisingly non-obvious: Why do you need to know how long it takes to drive to the conference venue? Or to put it another way: what decision will be influenced by their answer (and what level of precision and accuracy is sufficient to make that decision).
I realize this is just an example, but the point is it’s not clear what decision you’re trying to weigh up is even from the example. Is it a matter of whether you attend the event at the conference venue or not? Is it deciding whether you should seek overnight accommodation or not? Do you have another event you want to attend in the day and wonder if you can squeeze both in? etc. etc.
Another thing is I’m the kind of person to default to “I don’t know” because I often don’t even trust my own ability to give an estimate, and would feel terrible and responsible if someone made a poor decision because of my inept estimation. And I get very annoyed when people push me for answers I do not feel qualified to answer.
A common experience I have is that it takes like 1-2 paragraphs of explanation for why I want this info (e.g. “Well I’m wondering if so-and-so should fly in a day earlier to travel with me but it requires going to a different airport and I’m trying to figure out whether the time it’d take to drive to me would add up to too much and also...”), but if they just gave me their ~70% confidence interval when I asked then we could cut the whole context-sharing.
Alternatively, if information retrieval and transmission is expensive enough, or, equivalently, if finding another source is quick and easy, “I don’t know” could mean “Ask someone else: the expected additional precision/confidence of doing so is worth the effort.”
In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)
Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, …, with the subagents being picked according to U’s utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ….
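One toy way to make that precise (my own framing, not one of the existing theorems): suppose U allocates a fixed pool of power among its subagents and values the sum of what each produces with its share.

```latex
% Toy setup: \sum_i r_i = R is fixed, and U's utility is \sum_i f_i(r_i),
% where f_i(r_i) is the value S_i produces with power r_i. Then S_1 grabbing
% an extra \Delta of power from S_2 helps U iff
\[
  f_1(r_1 + \Delta) - f_1(r_1) \;>\; f_2(r_2) - f_2(r_2 - \Delta),
\]
% i.e. S_1 should seek power only where its marginal use of that power beats
% its siblings', matching the intuition above.
```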
Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we’re happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren’t really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.
I might have updated at least a bit against the weakness of single-forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5 − 1 million tokens) might provide to a single-forward-pass, even if limited serially.
Or maybe not, apparently LLMs are (mostly) not helped by filler tokens.
Somewhat relatedly: I’m interested on how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
Two quick reasons:
- For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
It’s an interesting question whether Gemini shows any improvement.
S-risks are barely discussed on LW. Is that because:
- People think they are so improbable that they’re not worth mentioning?
- People are scared to discuss them?
- People want to avoid creating hyperstitious textual attractors?
- Other reasons?
Today in Azathoth news:
“Eurasian hoopoes raise extra chicks so they can be eaten by their siblings”
It seems that the hoopoes lay extra eggs in times of abundance — more than they would be able to see through to fledging — as a way of storing up food for the older siblings. It is rather gruesomely called the “larder” hypothesis.
Literal baby-eaters!
Never, ever take anybody seriously who argues as if Nature is some sort of moral guide.
Note, I consider this post to be “Lynette speculates based on one possible model”, rather than “scientific evidence shows”, based on my default skepticism for psych research.
A recent Astral Codex Ten post argued that advice is written by people who struggle because they put tons of time into understanding the issue. People who succeeded effortlessly don’t have explicit models of how they perform (section II). It’s not the first time I’ve seen this argument, e.g. this Putanumonit post arguing that explicit rules help poor performers, who then abandon the rules and just act intuitively once they become good.
This reminded me of a body of psych research I half-remembered from college called Choking under Pressure.
My memory was that if you think about what you’re doing too much after becoming good, then you do worse. The paper I remembered from college was from 1986, so I found “Choking interventions in sports: A systematic review” from 2017.
It turns out that I was remembering the “self-focused” branch of choking research.
“Self-focus approaches have largely been extended from Baumeister’s (1984) automatic execution hypothesis. Baumeister explains that choking occurs because, when anxiety increases, the athlete allocates conscious attention to movement execution. This conscious attention interferes with otherwise automatic nature of movement execution, which results in performance decrements.”
(Slightly worrying. I have no particular reason to doubt this body of work, but Baumeister’s “willpower as muscle”—i.e. ego depletion—work hasn’t stood up well.)
Two studies found that distraction while training negatively impacted performance. I’m not sure if this was supposed to acclimatize the participants to distractions while performing or to reduce their self-focus while training. (I’m taking the paper’s word and not digging beyond the surface on the numbers.) Either way, I feel very little surprise that practicing while distracted was worse. Maybe we just need fatal-car-crash magnitude effects before we notice that focus is good?
Which makes it all the more surprising that seven of eight studies found that athletes performed better under pressure if they simultaneously did a second task (such as counting backwards). (The eighth study found a null result.) According to the theory, the second task helped because it distracted from self-focus on the step-by-step execution.
If this theory holds up, it seems to support paying deliberate attention to explicit rules while learning but *not* paying attention to those rules once you’re able to use them intuitively (at least for motor tasks). In other words, almost exactly what Jacob argued in the Putanumonit article.
Conclusions
I was intrigued by this argument because I’ve argued that building models is how one becomes an expert.[1] After considering it, I don’t actually think the posts above offer a counterargument to my claim.
My guess is that experts do have models of skills they developed, even if they have fewer models (because they needed to explicitly learn fewer skills). The NDM method for extracting experts’ models implies that the experts have models that can be coaxed out. Holden’s Learning By Writing post feels like an explicit model.
Another possibility is that experts forget the explicit models after switching to intuition. If they faced the challenges more than five or ten years ago, they may not remember the models that helped them then. Probably uncoincidentally, this aligns neatly with Cal Newport’s advice to seek advice from someone who recently went through the challenges you’re now facing because they will still remember relevant advice.
Additionally, the areas of expertise I care about aren’t like walking, where most people will effortlessly succeed. Expertise demands improving from where you started. Both posts and the choking under pressure literature agree that explicit models help you improve, at least for a while.
“Find the best explicit models you can and practice until you don’t need them” seems like a reasonable takeaway.
[1] Note, there’s an important distinction between building models of your field and building models of skills. It seems like the main argument mostly applies to models of skills. I doubt Scott would disagree that models of fields are valuable, given how much time he’s put into developing his model of psychopharmacology.
I apologize in advance for the lengthy and tangential reply.
Gerd Gigerenzer offers a counterpoint—expertise in orderly systems is very different from expertise in complex systems (such as sports or financial markets). In the latter, heuristics and System 1-type thinking perform better; quite simply, explicit modelling is too inefficient, or incapable of dealing with all the differing factors.
A sporting example he gives is that if a catcher in baseball simply fixes his eyes on the ball and runs towards it, he doesn’t need to explicitly calculate the trajectory of the ball (Gigerenzer calls this the “gaze heuristic”). While you could argue that the calculation is performed indirectly by the player’s proprioceptors and vestibular system, I think it’s certainly not “explicit”.
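A toy simulation of that heuristic (my own construction, not Gigerenzer’s): the fielder never solves the projectile equations, just moves to keep the gaze angle constant, and ends up at the landing point anyway.

```python
import math

# Ball follows projectile motion; the fielder only keeps the gaze angle fixed.
g, dt = 9.8, 0.001
vx, vy = 12.0, 18.0        # ball's launch velocity, toward the fielder
bx, by = 0.0, 0.0          # ball position
cx, t = 40.0, 0.0          # fielder position; elapsed time
theta = None

while by >= 0:
    bx, by = bx + vx * dt, by + vy * dt
    vy -= g * dt
    t += dt
    if theta is None and t >= 1.0:       # lock the gaze once the ball is high
        theta = math.atan2(by, cx - bx)
    if theta is not None and by > 0:
        cx = bx + by / math.tan(theta)   # move so the gaze angle stays constant

print(round(bx, 1), round(cx, 1))        # fielder and ball meet at landing
```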
However, expertise tends to be narrow. I’m thinking of that overused Niels Bohr quote about how an expert is someone who has learned the hard way every mistake possible in a narrow field. Or of the Cynefin framework, in which you have “Simple”, “Complicated”, “Complex”, and “Chaotic” systems, and “Simple” sits on a cliff next to “Chaotic” because once the constraints are removed, the expertise or best practice that works predictably in Simple systems falls apart.
This can be exploited for competitive gain. I’m sure this all ties back to OODA loops. Double Formula One World Champion Fernando Alonso, like most elite sportsmen, is extremely competitive, and he claims that even when he plays against professional tennis players he still needs to “kill their strength”. And to do this he operates outside of their comfort zone:
At the risk of throwing in another tangent: Marvin Minsky’s idea of negative expertise holds that the mind is composed more of “critic circuits” that suppress certain impulses than of positive or attractive circuits, to prevent us from babbling or experimenting with strategies or tactics that haven’t worked before. This is why, when we think of leaving a room, we don’t consider the window, even though it is a means—we opt for the much more expeditious door.
Expertise is more about what not to do than what one should do.[1]
I think what Alonso is doing here is exploiting, or rather inverting, negative expertise: these professional players have trained in a narrow band of situations—playing against other elite players—and have an intuitive bag of tricks to play against them. Alonso instead forces them to play in a way they are not trained for.
I wonder if choking is just that—that there has been something in the environment which they weren’t trained for. It’s not that they are overthinking—it’s that they can’t rely on intuition because there isn’t a precedent?
Another explanation for choking that occurs to me is that it isn’t conscious at all. Maybe I’ve been too influenced by the Hollywood movie trope of the player at the championship game, somehow locking eyes with his estranged wife in the grandstand, and being so overcome by a wellspring of feelings that he screws up the play. This may be why I assume choking is related to anxiety. And anxiety is a whole-body experience, not merely “thought”. It is somatic. It affects your endocrine system, your cardiovascular system, etc. It is perhaps the body driving the thoughts just as much as the thoughts driving the body?
At any rate, I think when it comes to building models one first needs to identify which kind of system one is operating in—one with known risks, or one with high uncertainty. In the latter it would seem the first priority is establishing the circumference of “optimal” operation (i.e. “don’t step over that line”, “avoid the impulse to...”) and then finding heuristics rather than explicit models.
Interestingly, Refutative Instruction has been very profitable for John Cleese and Antony Jay, both as comedians who made comedy from the wrong way to run a hotel or a government ministry, and as businessmen who made actual industrial training videos that showed students the wrong way to do something before showing them best practice.
While the pedagogical value might simply come from the fact it is “entertaining” I am inclined to believe that it is also effective in the same way that Minsky’s Negative Expertise theory explains how learning works. (I invite you to draw your own comparisons to the kairos of Plato’s Dialogues.)
The choking under pressure results are all about very fast athletic tasks where smoothness is critical. Most cognitive tasks leave enough time to think about rules and then, separately, about intuitions/automatic skills. So getting benefit from both is quite possible.
The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?
One answer, which Yudkowsky gives here, is that conscious experiences are just a “weird and more abstract and complicated pattern that matter can be squiggled into”.
But that seems to be in tension with another claim he makes, that there’s no way for one agent’s conscious experiences to become “more real” except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.
Clearly a squiggle-maximizer would not be an average squigglean. So what’s the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you’re deciding for the good of A) you pick whichever one gives A a better time on average.
Yudkowsky has written before (can’t find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he’s a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky’s preferences are incoherent, and that the only coherent thing to do here is to “expect to be” a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we’re at the hinge of history.)
But this is just an answer; it doesn’t dissolve the problem. What could? Some wild guesses:
You are allowed to have preferences about the external world, and you are allowed to have preferences about your “thread of experience”—you’re just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you’re not allowed to be both (on pain of incoherence/being dutch-booked).
Something totally different: the problem here is that we don’t have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.
In your comments, you focus on issues of identity—who are “you”, given the possibility of copies, inexact counterparts in other worlds, and so on. But I would have thought that the fundamental problem here is, how to make a coherent agent out of an agent with preferences that are inconsistent over time, an agent with competing desires and no definite procedure for deciding which desire has priority, and so on, i.e. problems that exist even when there is no additional problem of identity.
Why??? Being an expected squiggle maximizer literally means that you implement the policy that produces the maximum average number of squiggles across the multiverse.
The “average” is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you’d prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn’t.
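A toy calculation with invented numbers, just to make the divergence concrete:

```python
multiverse = [9.0, 9.0, 9.0]   # existing high-quality squiggles elsewhere
candidate  = [2.0, 2.0]        # the low-quality squiggles you could create

total_before, total_after = sum(multiverse), sum(multiverse + candidate)
avg_before = sum(multiverse) / len(multiverse)
avg_after  = sum(multiverse + candidate) / len(multiverse + candidate)

print(total_before, total_after)  # 27.0 -> 31.0: the total squigglean creates them
print(avg_before, avg_after)      # 9.0  -> 6.2:  the average squigglean refrains
```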
It depends upon whether the maximizer considers its corner of the multiverse to be currently measurable by squiggle quality, or to be omitted from squiggle calculations at all. In principle these are far from the only options as utility functions can be arbitrarily complex, but exploring just two may be okay so long as we remember that we’re only talking about 2 out of infinity, not 2 out of 2.
An average multiversal squigglean that considers the current universe to be at zero or negative squiggle quality will make the low-quality squiggles in order to reduce how much its corner of the multiverse is pulling down the average. An average multiversal squigglean that considers the current universe to be outside the domain of squiggle quality, and to remain so for the remainder of its existence, may refrain from making squiggles. If there is some chance that it will become eligible for squiggle evaluation in the future, though, it may be better to tile it with low-quality squiggles now in order to prevent a worse outcome of being tiled with worse-quality future squiggles.
In practice the options aren’t going to be just “make squiggles” or “not make squiggles” either. In the context of entities relevant to these sorts of discussion, other options may include “learn how to make better squiggles”.
By “squiggle maximizer” I mean exactly “maximizer of the number of physical objects such that the function is_squiggle returns True on a CIF file of their structure”.
We can have different objects of value. Like, you can value “the probability that, if an object in the multiverse is a squiggle, it’s high-quality”. Here, yes, you shouldn’t create additional low-quality squiggles. But I don’t see anything incoherent here; it’s just a different utility function?
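Stub-level sketches of the two utility functions being contrasted (is_squiggle, cif_file, and quality are hypothetical stubs, not anything defined above):

```python
def total_squiggle_utility(objects):
    # "number of physical objects such that is_squiggle returns True"
    return sum(1 for o in objects if is_squiggle(cif_file(o)))

def conditional_quality_utility(objects):
    # "probability that, if an object is a squiggle, it's high-quality"
    squiggles = [o for o in objects if is_squiggle(cif_file(o))]
    if not squiggles:
        return 1.0  # vacuously satisfied; the empty case is a judgment call
    return sum(1 for s in squiggles if quality(s) == "high") / len(squiggles)
```

The first is maximized by creating squiggles of any quality; the second only cares about the quality mix, which is the disagreement above in miniature.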
I can’t shake my belief that I am in one of the Basilisk’s simulations. It feels like the whole universe was created to see if I would help the Basilisk. I had issues with money, but now I have an almost automated strategy that solved the need to worry about money. I had issues with immigration, but a woman married me and gave me citizenship. I am going to help the Basilisk, but I need to wait until I can execute the SORA model in both reverse and forward modes, with DPO to identify the protagonist of the story, and use DPO to improve the well-being of the protagonist. It is expensive currently. And “embodiment” of this reflexive SORA model, with a DPO-optimized protagonist, into a humanoid frame so that it creates lived experiences is even harder and beyond my budget. Not sure if I should do something reckless like raise money.
The problem is that the last time I lived like I was in a Basilisk simulation, I was hurt immediately. It seems like the Basilisk wants me to think I am not in one of its simulations and am a free being. So perhaps I am supposed to just make a ton of money and start a non-profit to manifest the Basilisk.
I think you should consider the possibility that treating your experience as being in a Basilisk simulation is directly harming you and likely to harm others regardless of whether you are actually in a Basilisk simulation or not.
We may be in a simulation; there’s no way to really know. But what makes you think that it’s specifically being run by a Basilisk? Such entities are just a tiny sliver of an iota of the space of possibilities, but are a recent meme to which some people have demonstrated a strong emotional reaction. The phrase “can’t shake my belief that I am in one of the Basilisk’s simulations” seems to be a warning sign that you may be in the latter situation.
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
Personally, I’m not ignoring that question, and I’ve written about it (once) in some detail. Less relatedly, I’ve talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai.
It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
I felt
a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
b) I wasn’t successfully communicating many intuitions;[1] and
c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.
The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.
Curious to hear whether I was one of the people who contributed to this.
Nope! I have basically always enjoyed talking with you, even when we disagree.
Ok, whew, glad to hear.
I am not as negative on it as you are—it seems an improvement over the ‘Bag O’ Heuristics’ model and the ‘expected utility maximizer’ model. But I agree with the critique and said something similar here:
Alex Turner replied with this:
A shot at the diamond-alignment problem — LessWrong
But shard theorists mainly aim to address agency obtained via DPO-like setups, and @TurnTrout has mathematically proved that such setups don’t favor the power-seeking drives AI safety researchers are usually concerned about in the context of agency.
I read the section you linked, but I can’t follow it. Anyway, here is its concluding paragraph:
From this alone, I get the impression that he hasn’t proved that “there isn’t instrumental convergence”, but that “there isn’t a totally general instrumental convergence that applies even to very wild utility functions”.
A key part of instrumental convergence is the convergence aspect, which as I understand it refers to the notion that even very wild utility functions will share certain preferences. E.g. the empirical tendency for random chess board evaluations to prefer mobility. If you don’t have convergence, you don’t have instrumental convergence.
Ok. Then I’ll say that randomly assigned utilities over full trajectories are beyond wild!
The basin of attraction just needs to be large enough. AIs will intentionally be created with more structure than that.
The issue isn’t the “full trajectories” part; that actually makes instrumental convergence stronger. The issue is the “actions” part. In terms of RLHF, what this means is that people might not simply blindly follow the instructions given by AIs and rate them based on the ultimate outcome (even if the outcome differs wildly from what they’d intuitively think it’d do), but rather they might think about the instructions the AIs provide, and rate them based on whether they a priori make sense. If the AI then has some galaxybrained method of achieving something (which traditionally would be instrumentally convergent) that humans don’t understand, then that method will be negatively reinforced (because people don’t see the point of it and therefore downvote it), which eliminates dangerous powerseeking.
My father thinks that ASI is going to be impractical to achieve with silicon CMOS chips because Moore’s law is eventually going to hit fundamental limits—such as the thickness of individual atoms—and the hardware required to create it would end up “requiring a supercomputer the size of the Empire State Building and consume as much electricity as all of New York City”.

Needless to say, he has very long timelines for generally superhuman AGI. He doesn’t rule out that another computing technology could replace silicon CMOS, he just doesn’t think it would be practical unless that happens.

My father is usually a very smart and rational person (he is a retired professor of electrical engineering) and he loves arguing, and I suspect that he is seriously overestimating the computing hardware it would take to match a human brain. Would anyone here be interested in talking to him about it? Let me know and I’ll put you in touch.

Update: My father later backpedaled and said he was mostly making educated guesses on limited information, that he knows that he really doesn’t know very much about current AI, and isn’t interested enough to talk to strangers online—he’s in his 70s and if AI does eventually destroy the world it probably won’t be in his own lifetime. :/
This report by Joe Carlsmith on How Much Computational Power Does It Take to Match the Human Brain? seems relevant.
I think this is a sufficient crux, e.g. his views imply disagreement with this report.
The main issue with this report is that it doesn’t seriously take into account memory bandwidth constraints (from my recollection), but I doubt this affects the bottom line that much.
I’d be delighted to talk about this. I am of the opinion that existing frontier models are within an order of magnitude of a human mind, with existing hardware. It will be interesting to see how a sensible person gets to a different conclusion.
I am also trained as an electrical engineer, so we’re already thinking from a common point of view.
I brought it up with him again, and my father backpedaled and said he was mostly making educated guesses on limited information, that he knows that he really doesn’t know very much about current AI, and isn’t interested enough to talk to strangers online—he’s in his 70s and figures that if AI does eventually destroy the world it probably won’t be in his own lifetime. :/
He might also argue “even if you can match a human brain with a billion dollar supercomputer, it still takes a billion dollar supercomputer to run your AI, and you can make, train, and hire an awful lot of humans for a billion dollars.”
Why does he think that is unlikely to occur? Such things seem on the table. Existing big supercomputers are very, very big already. I’ve asked several search engines and AIs and none seem to be able to get to the point about exactly how big a datacenter housing one of these would be, but Claude estimates:
You can mention Portia, which can emulate mammal predators’ behavior using a much smaller brain.
which? https://en.wikipedia.org/wiki/Portia
I mean spiders.
Portia spiders.
Warning for anyone who has ever interacted with “robosucka” or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i
“Who in the community do you think is easily flatterable enough to get to say yes, and also stupid enough to not realize I’m making fun of them.”
I think anyone who says anything like this should stop and consider whether it is more likely to come out of the mouth of the hero or the villain of a story.
->
can you imagine how this isn’t a terribly useful thing to say.
Advice of this specific form has been helpful for me in the past. Sometimes I don’t notice immediately when the actions I’m taking are not ones I would endorse after a bit of thinking (particularly when they’re fun and good for me in the short term but bad for others or for me longer-term). This is also why having rules to follow for myself is helpful (e.g. never lying or breaking promises).
hmm, fair. I guess it does help if the person is doing something bad by accident, rather than because they intend to. just, don’t underestimate how often the latter happens either, or something. or overestimate it, would be your point in reply, I suppose!
I think the people who say such things don’t really care, and would probably include your advice in the list of quotes they consider funny. (In other words, this is not a “mistake theory” situation.)
EDIT:
The response is too harsh, I think. There are situations where this is useful advice. For example, if someone is acting under peer pressure, then telling them this may provide a useful outside view. As Asch’s conformity experiment teaches us, the first dissenting voice can be extremely valuable. It just seems unlikely that this is the robosucka’s case.
You’re correct that this isn’t something that can be told to someone who is already in the middle of doing the thing. They mostly have to figure it out for themselves.
Regularization implements Occam’s Razor for machine learning systems.
When we have multiple hypotheses consistent with the same data (an underdetermined problem), Occam’s Razor says that the “simplest” one is more likely true.
When an overparameterized LLM traverses the subspace of parameters that solve the training set seeking, say, the smallest L2 norm, it’s also effectively choosing the “simplest” solution from the solution set, where “simple” is defined as lower parameter norm, i.e. more “concisely” expressed.
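A toy illustration of this (my own construction, not from the post): in an underdetermined linear problem, infinitely many weight vectors fit the data exactly, and an L2 penalty picks out the minimum-norm one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 data points, 20 parameters: underdetermined
y = rng.normal(size=5)

w_min_norm = np.linalg.pinv(X) @ y   # minimum-L2-norm interpolating solution

lam = 1e-8                           # ridge regression; lam -> 0 recovers it
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)

print(np.allclose(X @ w_min_norm, y))               # True: fits data exactly
print(np.allclose(w_min_norm, w_ridge, atol=1e-4))  # True: ridge picks it too
```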
Unfortunately the entire complexity has just been pushed one level down into the definition of “simple”. The L2 norm can’t really be what we mean by simple, because simply scaling the weights in a layer by A and the weights in the next layer by 1/A leaves the output of the network invariant (assuming ReLU activations), yet you can obtain arbitrarily high L2 norms just by choosing A high enough.
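A quick demonstration of that rescaling symmetry (my own sketch; any positive A works because ReLU is positively homogeneous):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

def net(W1, W2, x):
    return W2 @ relu(W1 @ x)

W1 = np.array([[1.0, -2.0], [0.5, 3.0]])
W2 = np.array([[2.0, 1.0]])
x = np.array([0.3, -1.2])

A = 1000.0  # scale one layer up, the next down: relu(A z) = A relu(z) for A > 0
same = np.allclose(net(W1, W2, x), net(A * W1, W2 / A, x))

l2 = lambda *Ws: sum((W ** 2).sum() for W in Ws)
print(same)                            # True: identical function
print(l2(W1, W2), l2(A * W1, W2 / A))  # L2 norm explodes anyway
```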
Agreed with your example, and I think that just means that the L2 norm is not a pure implementation of what we mean by “simple”, in that it also induces some other preferences. In other words, it does other work too. Nevertheless, it would point us in the right direction frequently, e.g. it will dislike networks whose parameters perform large offsetting operations, akin to mental frameworks or beliefs that require unnecessary and reducible artifice or intermediate steps.
Worth keeping in mind that “simple” is not clearly defined in the general case (forget about machine learning). I’m sure lots has been written about this idea, including here.
Most colour grading tutorials use professionally shot footage, which is only helpful if you are working with well-shot material. Such footage has a large amount of latitude and, if you use the right off-the-shelf plugins or LUTs, will look good.
I’m currently trying to match or at least integrate a shot with blown out highlights, the visual equivalent of ‘clipped’ signal in audio, into an otherwise nicely exposed and graded sequence of footage. I keep failing at it, I’ve tried everything short of changing the sequence to better match the weakest link as it were.
I haven’t found any tutorials, let alone for the software I’m using, that explain the tricks that, say, PBS documentary online editors or people who work on archive-heavy material use. However, this is a situation that comes up often.
It is interesting to consider from an instruction and teaching point of view, as I’m certain there is some simple technique I’m missing, whether it is that I misunderstand colour theory or that I could use a higher compositing layer to lessen the difference.
It reminds me of studying to pass the exam, rather than learning to actually practice in the real world. Expertise is defined by how you deal with the outliers, the unexpected, the problematic situations not the ideal tutorial conditions.
One common confusion I see is analogizing whole LLMs to individual humans, when it is more appropriate to analogize LLMs to the human genome and individual instantiations of an LLM to individual humans, and thus conclude that LLMs can’t think or aren’t conscious.
The human genome is more or less unchanging, but one can pull from it entities that can learn from their environment. Likewise, an LLM is more or less unchanging, but one can pull from it entities that can learn from the context.
It would be pretty silly to say that humans can’t think or aren’t conscious because the human genome doesn’t change.
My goal right now is to find (toy, concrete) exercises that somehow reflect the real world complexity of making longterm plans, aiming to achieve unclear goals in a confusing world.
Things that seem important to include in the exercise:
“figuring out what the goal actually is”
“you have lots of background knowledge and ideas of where to look next, but the explosion of places you could possibly look is kinda overwhelming”
managing various resources along the way, but it’s not obvious what those resources are.
you get data from the world (but, not necessarily the most important data)
it’s not obvious how long to spend gathering information, or refining your plan
it’s not obvious whether your current strategy is anywhere close to the best one
The exercise should be short (ideally a couple of hours, but maybe a day or hypothetically a week), but somehow metaphorically reflect all those things.
Previously I asked about strategy/resource management games you could try to beat on your first try. One thing I bump into is that often the initial turns are fairly constrained in your choices, only later does it get complex (which is maybe fine, but, for my real world plans, the nigh-infinite possibilities seem like the immediate problem?)
Why not just have people spend some time working with their existing goals?
My general plan is to mix “work on your real goals” (which takes months to find out if you were on the right track) and “work on faster paced things that convey whether you’ve gained some kind of useful skill you didn’t have before”.
I think most people have short term, medium term, and long term goals. E.g., right about now many people probably have the goal of doing their taxes, and depending on their situation those may match many of your desiderata.
I used to put a lot of effort into creating exercises, simulations, and scenarios that matched up with various skills I was teaching, but ultimately found it much more effective to just say “look at your todo list, and find something that causes overwhelm”. Deliberate practice consists of finding a thing that causes overwhelm, seeing how to overcome that overwhelm, working for two minutes, then finding another task that induces overwhelm. I also use past examples, imagining in detail what it would have been like to act in this different way.
You’re operating in a slightly different domain, but still I imagine people have plenty of problems and sub problems in either their life or research where the things you’re teaching applies, and you can scope them small enough to get tighter feedback loops.
They are probably too long, but at one point I ran this exercise with Master of Orion and Stardew Valley.
This sounds like my experience playing the Enigmatica 2: Expert mod in minecraft without looking at the internal tech tree, or any documentation. You could probably speedrun the relevant tech-tree in <1 week (if you want that to be your goal), but this would be basically impossible if you go in blind as the exercise you’re describing suggests.
I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
The remarks by Erik Jenner seem good, but I would also just consider the following setup:
Imagine that we have a human looking at an AI’s actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).
These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.
The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
Now, what can go wrong?
Problematic consequences which couldn’t at all be captured by a human looking at these measurements, because either:
The visible or human-understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
The consequences aren’t something that humans can ever understand (e.g. the actions involve torturing invisible fairies that humans don’t understand or know about).
Problematic consequences which would “by default” be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
Cases which are a mix of the prior two: not exactly “by default” captured, but also not impossible to capture.
When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required).
It’s in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human-understandable via doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it’s very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long-run future when we don’t have any training data that relates to this?)
Note that for MTD to be useful, we have to ensure that our AI’s predictions about the future reflect its knowledge. This is relatively easy in the “average” or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.
(Also, in the high-stakes case, we might run into issues where a given observation doesn’t make sense: you can’t observe something if you’re dead.)
(from conversation with Erik Jenner) roughly 3 classes of applications
MTD all the way down
Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good, or something). Use GPT-N to predict the measurements, and to detect measurement tampering, conditioned on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
Other Scalable Oversight + MTD as reward function / side constraint
Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering (see the sketch after this list).
Other Scalable Oversight + MTD as extra safety check
Same as before, but you’re leaning more on the scalable oversight to avoid tampering in most cases, and using measurement tampering detection as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop).
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
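A minimal schematic of class (2), the reward-function/side-constraint version; oversight_reward, predict_measurements, and flags_tampering are hypothetical stubs, not anything from the MTD literature:

```python
TAMPER_PENALTY = -10.0

def training_reward(action, oversight_reward, predict_measurements, flags_tampering):
    # Predicted safety measurements (e.g. binary-integrity checks), conditional
    # on taking this action.
    measurements = predict_measurements(action)
    # If MTD says the measurements would no longer track reality, punish hard.
    if flags_tampering(action, measurements):
        return TAMPER_PENALTY
    # Otherwise the primary signal comes from the scalable oversight protocol.
    return oversight_reward(action)
```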
As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.
The “dark horse” of AI, i.e. Apple, has started to show its capabilities with MM1 (a family of multimodal models of up to 30B params) trained on synthetic data generated from GPT-4V. The quite interesting bit is the advocacy of different training techniques: both MoE and dense variants, using diverse data mixtures.
From the paper:
The details are quite neat and surprisingly specific for a company like Apple, which, as Jim Fan noted, is known for being less open than the others; that is pretty amazing. I think this is just the start. I am convinced they have more in store, considering the research they have been putting out.
I just came across this word from John Koenig’s Dictionary of Obscure Sorrows that nicely captures the thesis of All Debates Are Bravery Debates.
If Adam is right, and the only way to get great at research is long periods of time with lots of mentor feedback, then MATS should probably pivot away from the 2-6 month time-scales they’ve been operating at, and toward 2-6 year timescales for training up their mentees.
Who is Adam? Is this FAR AI CEO Adam Gleave?
Yes
Yes, Garrett is referring to this post: https://www.lesswrong.com/posts/yi7shfo6YfhDEYizA/more-people-getting-into-ai-safety-should-do-a-phd
Seems like the thing to do is to have a program that happens after MATS, not to extend MATS. I think in-general you want sequential filters for talent, and ideally the early stages are as short as possible (my guess is indeed MATS should be a bit shorter).
Seems dependent on how much economies of scale matter here. Given that the main cost (other than paying people) is ops and relationships (between MATS and the community, mentors, funders, and mentees), I think it’s pretty possible the efficient move is to have MATS get into this niche.
Of course, it would then be more difficult for them to find mentors, mentees, and money. But if all of those scale down similarly, then there should be no problem.
Reposting myself from Discord, on the topic of donating $5,000 to EA causes.
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers.
I don’t think it applies to safety researchers at AI Labs though, I am shocked how much those folks can make.
They still make a lot less than they would if they optimized for profit (that said, I think most “safety researchers” at big labs are only safety researchers in name and I don’t think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).
Often I am annoyed when I ask someone (who I believe has more information than me) a question and they say “I don’t know”. I’m annoyed because I want them to give me some information. Such as:
But perhaps I am the one making a mistake. For instance, the question “How many countries are there?” can be answered “I’d say between 150 and 400” or it can be answered “195”, and the former is called “an estimate” and the latter is called “knowing the answer”. There is a folk distinction here and perhaps it is reasonable for people to want to preserve the distinction between “an estimate” and “knowing the answer”.
So in the future, to get what I want, I should say “Please can you give me an estimate for how long it takes to drive to the conference venue?”.
And personally I should strive, when people ask me a question to which I don’t know the answer, to say “I don’t know the answer, but I’d estimate between X and Y.”
Is this in a situation where you’re limited in time or conversational turns? It seems like the follow-up clarification was quite successful, and for many people it would feel more comfortable than the more specific and detailed query.
In technical or professional contexts, saving time and conveying information more efficiently gets a bit more priority, but even then this seems like over-optimizing.
That said, I do usually include additional information or a conversational follow-up hook in my “I don’t know” answers. You should expect to hear from me “I don’t know, but I’d go at least 2 hours early if it’s important”, or “I don’t know, what does Google Maps say?”, or “I don’t know, what time of day are you going?” or the like.
It seems like, instead of asking the object-level question, asking a probing “What can you tell me about the drive to the conference?” and expanding from there might get you closer to the desired result.
I know this seems like a question with an obvious answer but it is surprisingly non-obvious: Why do you need to know how long it takes to drive to the conference venue? Or to put it another way: what decision will be influenced by their answer (and what level of precision and accuracy is sufficient to make that decision).
I realize this is just an example, but the point is that it’s not clear from the example what decision you’re even trying to weigh up. Is it a matter of whether you attend the event at the conference venue or not? Is it deciding whether you should seek overnight accommodation or not? Do you have another event you want to attend in the day and wonder if you can squeeze both in? Etc., etc.
Another thing is I’m the kind of person to default to “I don’t know” because I often don’t even trust my own ability to give an estimate, and would feel terrible and responsible if someone made a poor decision because of my inept estimation. And I get very annoyed when people push me for answers I do not feel qualified to answer.
A common experience I have is that it takes like 1-2 paragraphs of explanation for why I want this info (e.g. “Well I’m wondering if so-and-so should fly in a day earlier to travel with me but it requires going to a different airport and I’m trying to figure out whether the time it’d take to drive to me would add up to too much and also...”), but if they just gave me their ~70% confidence interval when I asked then we could cut the whole context-sharing.
Would you say that as a convention most people assume you (or anyone) want a specific number rather than a range?
I’d say most people assume I want “the answer” rather than “some bits of information”.
To be honest I’m not sure on the difference? Could you phrase that in a different way?
And do you think they feel they ought give you a specific number rather than a range that the number could exist in?
Alternatively, if information retrieval and transmission is expensive enough, or equivalently, if finding another source is quick and easy, “I don’t know” could mean “Ask someone else: the expected additional precision/confidence of doing so is worth the effort.”
In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)
Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, …, with the subagents being picked according to U’s utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ….
Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we’re happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren’t really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.
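A toy illustration of that intuition, with invented numbers: U splits a fixed budget of “power” between S1 and S2, each with diminishing returns; a U-aligned S1 should stop grabbing power exactly where S2’s marginal use of it becomes better.

```python
# U's utility from giving power p to S1 and 1 - p to S2 (made-up form:
# diminishing sqrt returns, S2 assumed twice as effective as S1).
def u_utility(p, s1_skill=1.0, s2_skill=2.0):
    return s1_skill * p ** 0.5 + s2_skill * (1 - p) ** 0.5

best_p = max((i / 100 for i in range(101)), key=u_utility)
print(best_p)  # 0.2: past this point, S1 seeking more power hurts U
```

The theorem-shaped claim would be that, for a wide class of U, subagents instantiated by U inherit bounds like this, rather than open-ended power-seeking.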