I’m a staff artificial intelligence engineer in Silicon Valley currently working with LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I’m now actively looking for employment working in this area.
RogerDearnaley
Cool! That makes a lot of sense. So does it in fact split into three before it splits into 7, as I predicted based on dimensionality? I see a green dot, three red dots, and seven blue ones… On the other hand, the triangle formed by the three red dots is a lot smaller than the heptagram, which I wasn’t expecting…
I notice it’s also an oddly shaped heptagram.
This seems like it would be helpful: the adversary can still export data, for example encoded steganographically in otherwise-low-perplexity text, but this limits the information density they can transmit, making the process less efficient for them and making constraints like upload limits tighter.
One other thing that would make this even harder is if we change the model weights regularly, in ways where combining parts exfiltrated from separate models is hard. We know from Singular Learning Theory that the optima found by stochastic gradient descent tend to have high degrees of symmetry. Some of these (like, say, permuting all the neurons in a layer along with their weights) are discrete, obvious, and both easy to generate and fairly easy for an attacker to compensate for. Others, such as adjusting various weights in various layers in ways that compensate for each other, are continuous symmetries, harder to generate and would be harder for an attacker to reason about. If we had an efficient way to explore these continuous symmetries (ideally one that’s easy to implement given a full set of the model weights, but hard to reconstruct from multiple partial pieces of multiple equivalent models), then we could explore this high-dimensional symmetry space of the model optimum and create multiple equivalent-but-not-easy-to-piece-together sets of models, and rotate between them over time (and/or deploy different ones to different instances) in order to make the task of exfiltrating the weights even harder.
So, to anyone who knows more about SLT than I do, a computationally efficient way to explore the continuous symmetries (directions in which both the slope and the Hessian are flat) of the optimum of a trained model could be very useful.
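To make the two kinds of symmetry concrete, here is a minimal sketch on a hypothetical toy two-layer ReLU network (the architecture, sizes and function names are my own illustrative assumptions, not anything from a real deployed model). It shows a discrete permutation symmetry and one simple continuous symmetry (per-neuron rescaling, which works because ReLU is positively homogeneous), and that mixing layers taken from two equivalent copies gives a different function. Real continuous symmetries of a trained optimum would be far richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: y = relu(X @ W1.T + b1) @ W2.T + b2
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)


def mlp(X, W1, b1, W2, b2):
    """Forward pass of the toy two-layer ReLU network on a batch X."""
    hidden = np.maximum(X @ W1.T + b1, 0.0)
    return hidden @ W2.T + b2


def permute_hidden(W1, b1, W2, perm):
    """Discrete symmetry: reorder hidden neurons along with their weights."""
    return W1[perm], b1[perm], W2[:, perm]


def rescale_hidden(W1, b1, W2, scales):
    """Continuous symmetry: scale a neuron's inputs by s > 0, its outputs by 1/s."""
    return W1 * scales[:, None], b1 * scales, W2 / scales[None, :]


X = rng.normal(size=(16, d_in))
y_original = mlp(X, W1, b1, W2, b2)

perm = rng.permutation(d_hidden)
W1_p, b1_p, W2_p = permute_hidden(W1, b1, W2, perm)
perm_error = float(np.max(np.abs(y_original - mlp(X, W1_p, b1_p, W2_p, b2))))

scales = rng.uniform(0.5, 2.0, size=d_hidden)
W1_s, b1_s, W2_s = rescale_hidden(W1, b1, W2, scales)
scale_error = float(np.max(np.abs(y_original - mlp(X, W1_s, b1_s, W2_s, b2))))

# First layer from the rescaled copy, second layer from the original:
# the pieces no longer compensate for each other.
mixed_error = float(np.max(np.abs(y_original - mlp(X, W1_s, b1_s, W2, b2))))

print(perm_error, scale_error, mixed_error)
```

The first two errors are at floating-point noise level, while the mixed one is not, which is the property an exfiltration defense would want. Note that this particular rescaling is easy for an attacker to undo by renormalizing each neuron, so it only illustrates the idea.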
In practice, most current AIs are not constructed entirely by RL, partly because it has instabilities like this. For example, LLMs instruction-trained by RLHF use a KL-divergence loss term to limit how dramatically the RL can alter the base model behavior trained by SGD. So the result deliberately isn’t pure RL.
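As a rough sketch of how that KL term works, assuming the common form "reward minus beta times KL(policy || base model)": the toy three-token distributions and the value of beta below are hypothetical, chosen only to show that a policy that drifts far from the base model pays a much larger penalty.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def penalized_reward(reward, policy_probs, base_probs, beta=0.1):
    """Reward after subtracting the KL penalty for drifting from the base model."""
    return reward - beta * kl_divergence(policy_probs, base_probs)


base = [0.5, 0.3, 0.2]        # hypothetical base-model next-token distribution
mild = [0.55, 0.28, 0.17]     # policy that stays close to the base model
extreme = [0.98, 0.01, 0.01]  # policy that has collapsed onto one token

print(penalized_reward(1.0, mild, base))
print(penalized_reward(1.0, extreme, base))
```

Even with the same raw reward, the collapsed policy scores lower, so the optimizer is pushed to find reward without abandoning the base model’s behavior.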
Yes, if you take a not-yet intelligent agent, train it using RL, and give it unrestricted access to a simple positive reinforcement avenue unrelated to the behavior you actually want, it is very likely to “wire-head” by following that simple maximization path instead. So people do their best not to do that when working with RL.
What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small VAE is split into three features X1, X2, and X3 in a larger VAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, X1, X2 and X3 co-occur in an activation only rarely (just as two randomly-selected features rarely co-occur), then they describe three similar-but-distinct variations on a concept, and X is the result of coarse-graining these together as a single concept at a higher level in an ontology tree (so by comparing VAEs of different sizes we can generate a natural ontology).
This seems like it would be a fairly simple, objective experiment to carry out. (Maybe someone already has, and can tell me the result!) It is of course quite possible that some split features describe subspaces, and other ontologies, or indeed something between the two where the features co-occur rarely but less rarely than two random features. Or X1 could be distinct but X2 and X3 might blend to span a 1-d subspace. Nevertheless, understanding the relative frequency of these different behaviors would be enlightening.
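A minimal sketch of the kind of measurement I have in mind, on synthetic data only (the firing rates, the binarization of activations into "fires / doesn’t fire", and the helper name `cooccurrence_lift` are all my own assumptions for illustration). The statistic is the observed joint firing rate divided by the rate expected if the two features were independent: well above 1 suggests a shared subspace, near 0 suggests mutually exclusive alternatives, and around 1 is what two unrelated features look like.

```python
import numpy as np


def cooccurrence_lift(active, i, j):
    """Observed joint firing rate of features i and j, divided by the rate
    expected if they fired independently."""
    p_i = active[:, i].mean()
    p_j = active[:, j].mean()
    p_ij = (active[:, i] & active[:, j]).mean()
    return float(p_ij / (p_i * p_j))


rng = np.random.default_rng(0)
n = 20_000
parent = rng.random(n) < 0.05  # samples where the coarse feature X would fire

# Scenario A (subspace): the split features tend to fire together.
subspace = (rng.random((n, 3)) < 0.8) & parent[:, None]

# Scenario B (ontology): exactly one split feature fires per occurrence of X.
choice = rng.integers(0, 3, size=n)
ontology = (choice[:, None] == np.arange(3)) & parent[:, None]

# Baseline: two unrelated features firing independently at a similar rate.
baseline = rng.random((n, 2)) < 0.04

subspace_lift = cooccurrence_lift(subspace, 0, 1)
ontology_lift = cooccurrence_lift(ontology, 0, 1)
baseline_lift = cooccurrence_lift(baseline, 0, 1)
print(subspace_lift, ontology_lift, baseline_lift)
```

On real data one would replace the synthetic boolean arrays with thresholded feature activations over a large sample of tokens, and compare each split family against the distribution of lifts for random feature pairs.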
It would be interesting to validate this using a case like the days of the week, where we believe we already understand the answer: they are 7 alternatives that are laid out in a heptagon in a 2-dimensional subspace that enables doing modular addition/subtraction modulo 7. So if we have a VAE small enough that it represented all day-of-the-week names by a single feature, if we increase the VAE size somewhat we’d expect to see this split into three features spanning a 2-d subspace, then if we increased it more we’d expect to see it resolve into 7 mutually-exclusive alternatives, and hopefully then stay at 7 in larger VAEs (at least until other concepts started to get mixed in, if that ever happened).
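For anyone unfamiliar with why a heptagon in a plane supports arithmetic modulo 7, here is an idealized sketch (a perfect regular heptagon on the unit circle is my simplifying assumption; a real model’s representation would only approximate this). Adding k days is a rotation by 2πk/7, and decoding picks the nearest vertex.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(day_index):
    """Place day `day_index` at a vertex of a regular heptagon on the unit circle."""
    angle = 2 * np.pi * day_index / 7
    return np.array([np.cos(angle), np.sin(angle)])


def rotate(vec, k):
    """Rotate a 2-d vector by k sevenths of a full turn."""
    angle = 2 * np.pi * k / 7
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle), np.cos(angle)]])
    return rotation @ vec


def decode(vec):
    """Return the index of the heptagon vertex closest in direction to `vec`."""
    vertices = np.stack([embed(i) for i in range(7)])
    return int(np.argmax(vertices @ vec))


def add_days(day, k):
    """Compute 'day + k days' purely via rotation in the 2-d subspace."""
    return DAYS[decode(rotate(embed(DAYS.index(day)), k))]


print(add_days("Friday", 4))
print(add_days("Monday", -1))
```

The rotation never needs to know about "wrapping around": going past Sunday is just continuing round the circle.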
AI that is trained by human teachers, giving it rewards will eventually wirehead, as it becomes smarter and more powerful, and its influence over its master increases. It will, in effect, develop the ability to push its own “reward” button. Thus, its behavior will become misaligned with whatever its developers intended.
This seems like an unproven statement. Most humans are aware of the possibility of wireheading, both the actual wire version and the more practical versions involving psychotropic drugs. The great majority of humans don’t choose to do that to themselves. Assuming that AI will act differently seems like an unproven assumption, one which might, for example, be justified for some AI capability levels but not others.
If you’re not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural, if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead is a simple definition of a goal (such as “want whatever outcomes the humans want”) that makes it clear that the agent does not yet know the true detailed utility function and thus requires it to go attempt to find out what the detailed specification of the utility function pointed to by the goal is (for example, by researching what outcome humans want).
Then a human shutdown instruction becomes the useful information “you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it”. Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).
In my attempted summary of the discussion, I rolled this into Category 5.
There has been a lot of useful discussion in the answers and comments, which has caused me to revise and expand parts of my list. So that readers looking for practical career advice don’t have to read the entire comments section to find the actual resulting advice, it seemed useful to me to give a revised list. Doing this as an answer in the context of this question seems better than either making it a whole new post, or editing the list in the original post in a way that would remove the original context of the answers and comments discussion.
This is my personal attempt to summarize the answers and comments discussion: other commenters may not agree (and are of course welcome to add comments saying so). As the discussion continues and changes my opinion, I will keep this version of the list up to date (even if that requires destroying the context of any comments on it).
List of Job Categories Safe from AI/Robots (Revised)
Doing something that machines can do better, but that people are still willing to pay to watch a very talented/skilled human do about as well as any human can (on TV or in person).
Examples: chess master, Twitch streamer, professional athlete, Cirque du Soleil performer.
Epistemic status: already proven for some of these. The first two are things that machines have been able to do better than a human for a while, but people are still interested in paying to watch a human do them very well for a human. It also seems very plausible for the others, where current robotics is not yet up to doing better.
Economic limits: If you’re not in the top O(1000) people in the world at some specific activity that plenty of people in the world are interested in watching, then you can make roughly no money off this. Despite the aspirations of a great many teenaged boys, being an unusually good (but not amazing) video gamer is not a skill that will make you any money at all.
Doing some intellectual and/or physical work that AI/robots can now do better, but for some reason people are willing to pay at least an order of magnitude more to have it done less well by a human, perhaps because they trust humans better. This could include jobs where people’s willingness to pay comes in the form of a legal requirement that certain work be done or supervised by a (suitably skilled/accredited) human (and these requirements have not yet been repealed).
Examples: doctor, veterinarian, lawyer, priest, babysitter, nanny, nurse, primary school teacher.
Epistemic status: Many people tell me “I’d never let an AI/a robot do <high-stakes intellectual or physical work> for me/my family/my pets…” They are clearly quite genuine in this opinion, but it’s unclear how deeply they have considered the matter. It remains to be seen how long this opinion will last in the presence of a very large price differential when the AI/robot-produced work is actually, demonstrably, just as good if not better.
Economic limits: I suspect there will be a lot of demand for this at first, and that it will decrease over time, perhaps even quite rapidly (though perhaps slower for some such jobs than others). Requires being reliably good at the job, and at appearing reassuringly competent while doing so.
Giving human feedback/input/supervision to/of AI/robotic work/models/training data, in order to improve, check, or confirm its quality.
Examples: current AI training crowd-workers, wikipedian (currently unpaid), acting as a manager or technical lead to a team of AI white collar workers, focus group participant, filling out endless surveys on the fine points of Human Values
Epistemic status: seems inevitable, at least at first.
Economic limits: I imagine there will be a lot of demand for this at first; I’m rather unsure whether that demand will gradually decline, as the AIs get better at doing things/self-training without needing human input, or whether it will increase over time because the overall economy is growing so fast and/or more capable models need more training data and/or society keeps moving out of the previous distribution so new data is needed. [A lot of training data is needed, more training data is always better, and the resulting models can be used a great many times; however, there is clearly an element of diminishing returns on this as more data is accumulated, and we’re already getting increasingly good at generating synthetic training data.] Another question is whether a lot of very smart AIs can extract a lot of this sort of data from humans without needing their explicit paid cooperation. Indeed, perhaps granting permission for them to do so and not intentionally sabotaging this might even become a condition for part of UBI (at which point whether to call allowing this a career or not is a bit unclear).
Skilled participant in an activity that heavily involves interactions between people, where humans prefer to do this with other real humans, are willing to pay a significant premium to do so, and you are sufficiently more skilled/talented/capable/willing to cater to others’ demands than the average participant that you can make a net profit off this exchange.
Examples: director/producer/lead performer for amateur/hobby theater, skilled comedy-improv partner, human sex-worker
Epistemic status: seems extremely plausible
Economic limits: Net earning potential may be limited, depending on just how much better/more desirable you are as a fellow participant than typical people into this activity, and on the extent to which this can be leveraged in a one-producer-to-many-customers way. However, making the latter factor high is challenging because it conflicts with the human-to-real-human interaction requirement that allows you to out-compete an AI/robot in the first place. Often a case of turning a hobby into a career.
Providing some nominal economic value while being a status symbol, where the primary point is to demonstrate that the employer has so much money they can waste some of it on employing a real human (“They actually have a human maid!”).
This can either be full-time employment as a status-symbol for a specific status-signaler, or you can be making high-status “luxury” goods where owning one is a status signal, or at least has cachet. For the latter, like any luxury good, they need to be rare: this could be that they are individually hand made, and/or were specifically commissioned by a specific owner, or that they are reproduced only in a “limited edition”.
Examples: (status symbol) receptionist, maid, personal assistant; (status-symbol maker) “High Art” artist, Etsy craftsperson, portrait or commissions artist.
Epistemic status: human nature (for the full-time version, assuming there are still people this unusually rich).
Economic limits: For the full-time-employment version, there are likely to be relatively few positions of this type, at most a few per person so unusually rich that they feel a need to show this fact off. (Human nobility used to do a lot of this, centuries back, but there the servants were supplying real, significant economic value, and the being-a-status-symbol component of it was mostly confined to the uniforms the servants wore while doing so.) Requires rather specific talents, including looking glamorous and expensive, and probably also being exceptionally good at your nominal job.
For the “maker of limited edition human-made goods” version: pretty widely applicable, and can provide a wide range of income levels depending on how skilled you are and how prestigious your personal brand is. Can be a case of turning a hobby into a career.
Providing human-species-specific reproductive or medical services.
Examples: Surrogate motherhood, wet-nurse, sperm/egg donor, blood donor, organ donor.
Epistemic status: still needed.
Economic limits: Significant medical consequences, low demand, improvements in medicine may reduce demand.
Certain jobs could manage to combine two (or more) of these categories. Arguably categories 1. and 5. are subsets of category 2.
I intended to capture that under category 2. “…but for some reason people are willing to pay at least an order of magnitude more to have it done less well by a human, perhaps because they trust humans better…” — the regulatory capture you describe (and those regulations not yet having been repealed) would be a category of reason why (and an expression of the fact that) people are willing to pay more. Evidently that section wasn’t clear enough and I should have phrased this better or given it as an example.
As I said above under category 2., I expect this to be common at first but to decrease over time, perhaps even quite rapidly, given the value differentials involved.
FWIW, my personal guess is that the kind of extrapolation process described by CEV is fairly stable (in the sense of producing a pretty consistent extrapolation direction) as you start to increase the cognitive resources applied (something twice as smart as a human, thinking for ten times as long with access to ten times as much information, say), but may well still not have a single well-defined limit as the cognitive resources used for the extrapolation tend to infinity. Using a (loose, not exact) analogy to a high-dimensional SGD or simulated-annealing optimization problem, the situation may be a basin/valley that looks approximately convex at a coarse scale (when examined with low resources), but actually has many local minima that increasing resources could converge to.
So the correct solution may be some form of satisficing: use CEV with a moderately super-human amount of computation resources applied to it, in a region where it still gives a sensible result. So I view CEV as more a signpost saying “head that way” than a formal description of a mathematical limiting process that clearly has a single well-defined limit.
As for human values being godshatter of evolution, that’s a big help: where they are manifestly becoming inconsistent with each other or with reality, you can use maximizing actual evolutionary fitness (which is a clear, well-defined concept) as a tie-breaker or sanity check. [Obviously, we don’t want to take that to the point where the human population is growing fast (unless we’re doing it by spreading through space, in which case, go for it).]
Congratulations! You reinvented from scratch (a single-person version of) Coherent Extrapolated Volition (i.e. without the Coherent part). That’s a widely considered candidate solution to the Outer Alignment problem (I believe first proposed by MIRI well over a decade ago).
However, what I think Yoshua was also, or even primarily, talking about is the technical problem of “OK, you’ve defined a goal: how do you then build a machine that you’re certain will faithfully attempt to carry out that goal, rather than something else?”, which is often called the Inner Alignment problem. (Note that the word “certain” becomes quite important in a context where a mistake could drive the human race extinct.) Proposals tend to involve various combinations of Reinforcement Learning and/or Stochastic Gradient Descent and/or Good Old-Fashioned AI and/or Bayesian Learning, all of which people (who don’t want to go extinct) have concerns about. After that, there’s also the problem of: OK, you built a machine that genuinely wants to figure out what you would have wanted to do, and then do it: how do you ensure that it’s actually good at figuring that out correctly? This is often, on Less Wrong, called the Human Values problem; evidence suggests that modern LLMs are actually pretty good at at least the base encyclopedic factual knowledge part of that.
Roughly speaking, you have to define the right goal (which to avoid oversimplifications generally requires defining it at a meta level as something such as “the limit as some resources tend to infinity of the outcomes of a series of processes like this”), you have to make the machine care about that and not anything else, and then you have to make the machine capable of carrying out the process, to a sufficiently good approximation.
Anyway, welcome down the rabbit-hole: there’s a lot to read here.
When faced with confusing conundrums like this, I find it useful to go back to basics: evolutionary psychology. You are a human, that is to say, you’re an evolved intelligence, one evolved as a helpful planning-and-guidance system for a biological organism, specifically a primate. Your purpose, evolutionarily, is to maximize the evolutionary fitness of your genes, i.e. to try your best to pass them on successfully. You have a whole bunch of drives/emotions/instincts that were evolved to, on the African Savannah, approximately maximize that fitness. Even in our current rather different environment, while not quite as well tuned to that as they used to be to the Savannah, these still do a pretty good job of that (witness the fact that there are roughly 8 billion of us).
So, is an upload of your mind the same “person”? It (if uploaded correctly) shares your memories, drives, and so forth. It will presumably regard you (the organism, and the copy of your mind running on your biological brain, if the uploading process was non-destructive) as somewhere between itself, an identical twin, and a close blood relative. Obviously you will understand each other very well, at least at first before your experiences diverge. So it’s presumably likely to be an ally in your (the organism’s) well-being and thus help pass your genes on.
So, is an upload exactly the same thing as your biological mind? No. Is it more similar than an identical twin? Yes. Does the English language have a good set of words to compactly describe this? No.
[Obviously if the mind uploading process is destructive, that makes passing on your genes harder, especially if you haven’t yet had any children, and don’t have any siblings. Freezing eggs or sperm before doing destructive mind uploading seems like a wise precaution.]
This might also be part of why there’s a tendency for famous artists to be colorful characters: that enhances the story part of the value of their art.
I think you’re right: I have heard this claimed widely about Art, that part of the product and its value is the story of who made it, when and why, who’s in it, who commissioned it, who previously owned it, and so forth. This is probably more true at the expensive pinnacles of the Art market, but it’s still going to apply within specific subcultures. That’s why forgeries are disliked: objectively they look just like the original artist’s work, but the story component is a lie.
More generally, luxury goods have a number of weird economic properties, one of which is that there’s a requirement that they be rare. Consider the relative value of natural diamonds or other gemstones, vs synthetic ones that are objectively of higher clarity and purity with fewer inclusions: the latter is an objectively better product but people are willing to pay a lot less for it. People pay a lot more for the former, because they’re “natural”, which is really because they’re rare and thus a luxury/status symbol. I think this is an extension of my category 5.: rather than the human artist acting as your status symbol in person as I described above, a piece of their art that you commissioned and that took them a couple of days to make just for you is hanging on your wall (or hiding in your bedroom closet, as the case may be).
There are basically three reasons to own a piece of art:
1) it’s nice to look at
2) I feel proud of owning it
3) other people will think better of me because I have it and show it off
The background story doesn’t affect 1), but it’s important for 2) and 3).
That sounds like good advice — thanks!
The sheer number of Geek Points that This Pony Does Not Exist wins is quite impressive.
I’m still watching this (it’s interesting, but 6 hours long!), and will have more comments later.
From his point of view in what I’ve watched so far, what matters most about the categories of jobs above is to what extent they are critical to the AI/robotic economic growth and could end up being a limiting factor bottleneck on it.
My categories 1. and 4.–6. (for both the original version of 4. and the expanded v2 version in a comment) are all fripperies: if these jobs went entirely unfilled, and the demand for them unfulfilled, the humans would be less happy (probably not by that much), but the AI/robotic economic growth would roar on unabated. For category 2., this could matter, but for this category AI/bots can do the job, consumers just strongly prefer a human doing it. So a shortage of humans willing to do these compared to demand would increase the price differential between a human and AI provider, and sooner or later that would reach the differential where people are willing to go with the AI option, demand would decrease, and AI/bots would fill the gap and do a good job of it. So this effect is inherently self-limiting, cannot become too much of a bottleneck, and I can’t see it being a brake on AI/robotic economic growth rates.
The glaring exception to this is my category 3.: providing more data about human preferences. This is something that the AIs genuinely, fundamentally need (if they’re aligned — a paperclip maximizer wouldn’t need this). Apart from the possibility of replacing/substituting the data with things like AI synthetic data, AI historical or scientific research into humans that requires no actual human participation to generate data (or that is disguised as video games or a virtual environment, for example, but that’s just using a free-entertainment motivation to participate, rather than a salary, so economically it’s not that different), it’s a vital economic input from the humans into the AI/robotic sector of the economy, and if it became too expensive, it could actually become a bottleneck/limiting factor in the post-AGI economy.
So that means that, for predicting an upper bound on FOOM growth rates, understanding how much and for how long AI needs human data/input/feedback of the type that jobs in category 3. generate, whether this demand decreases or increases over time, and to what extent functionally equivalent data could be synthesized or researched without willing human involvement, is actually a really important question. This could in fact be the Baumol Effect bottleneck that Carl Shulman has been looking for but hadn’t found: AI’s reliance on (so far, exponentially increasing amounts of) data about humans that can only come from humans.
Algorithmic improvements relevant to my argument are those that happen after long-horizon task capable AIs are demonstrated, in particular it doesn’t matter how much progress is happening now, other than as evidence about what happens later
My apologies, you’re right, I had misunderstood you, and thus we’ve been talking at cross-purposes. You were discussing…
if research capable TAI can lag behind government-alarming long-horizon task capable AI (that does many jobs and so even Robin Hanson starts paying attention)
while I was instead talking about how likely it was that running out of additional money to invest slowed reaching either of these forms of AGI (which I personally view as being likely to happen quite close together, as Leopold also assumes) by enough to make more than a year-or-two’s difference.
I also think the improvements themselves are probably running out.
I disagree, though this is based on some guesswork (and Leopold’s analysis, as a recently-ex-insider). I don’t know exactly how they’re doing it (improvements in training data filtering are probably part of it), but the foundation model companies have all been putting out models with lower inference costs and latencies for the same capability level (OpenAI: GPT-4 Turbo and GPT-4o vs. GPT-4; Anthropic: Claude 3.5 Sonnet vs. the Claude 3 generation; Google: Gemini 1.5 vs. 1). I am assuming that the reason for this performance improvement is that the newer models actually had lower parameter counts (which is supported by some rumored parameter count numbers), and I’m then also assuming that means these had lower total compute to train. (The latter assumption would be false for smaller models trained via distillation from a larger model, as some of the smaller Google models almost certainly are, or heavily overtrained by Chinchilla standards, as has recently become popular for models that are not the largest member of a model family.)
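To show why the parenthetical caveat matters, here is a back-of-envelope sketch using two widely quoted rules of thumb, which I am treating as assumptions: training compute of roughly 6 FLOPs per parameter per training token, and a Chinchilla-optimal budget of roughly 20 tokens per parameter. The model sizes and token counts are hypothetical round numbers, not figures for any actual model.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def chinchilla_tokens(n_params, tokens_per_param=20.0):
    """Rule-of-thumb compute-optimal token budget for a given parameter count."""
    return tokens_per_param * n_params


def inference_flops_per_token(n_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params


large_params = 70e9  # hypothetical larger model
small_params = 8e9   # hypothetical smaller model

large_optimal = training_flops(large_params, chinchilla_tokens(large_params))
small_optimal = training_flops(small_params, chinchilla_tokens(small_params))
# The same small model, heavily overtrained on a hypothetical 15T tokens.
small_overtrained = training_flops(small_params, 15e12)

print(f"large, Chinchilla-optimal: {large_optimal:.2e} FLOPs")
print(f"small, Chinchilla-optimal: {small_optimal:.2e} FLOPs")
print(f"small, overtrained:        {small_overtrained:.2e} FLOPs")
```

Under these assumptions the small model is always cheaper to run, but once it is overtrained far enough its training compute can exceed that of the much larger compute-optimal model, so "fewer parameters" does not by itself imply "less compute to train".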
Things like the effectiveness of model pruning methods suggest that there are a lot of wasted parameters inside current models, which would suggest there’s still a lot of room for performance improvements. The huge context lengths that foundation model companies are now advertising without huge cost differentials also rather suggest something architectural has happened there, which isn’t just full attention quadratic-cost classical transformers. What combination of the techniques from the academic literature, or ones not in the academic literature, that’s based on is unclear, but clearly something improved there.
That’s not necessarily required. The Scientific Method works even if the true “Unified Field Theory” isn’t yet under consideration, merely some theories that are closer to it and others further away from it: it’s possible to make iterative progress.
In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers: yet unlike Bayesianism, gradient descent doesn’t explicitly consider every point in the space including the true optimum, it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that’s less of a problem in practice than it sounds in theory.
The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it’s doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won’t resist the process of coming up with new possibly-better hypotheses, it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutational/crossing steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).
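As a toy illustration of the point that iterative progress is possible even when the true theory isn’t yet under consideration, here is a sketch of Bayesian updating over a small hypothesis set about a coin’s bias. Everything in it is an assumption for illustration: the observed counts (1400 heads, 600 tails, as if from a coin with bias near 0.7), the candidate biases, and the uniform priors.

```python
import numpy as np


def posterior(hypotheses, prior, heads, tails):
    """Posterior over candidate coin biases, computed in log space for stability."""
    h = np.asarray(hypotheses, dtype=float)
    log_post = (np.log(np.asarray(prior, dtype=float))
                + heads * np.log(h) + tails * np.log1p(-h))
    log_post -= log_post.max()
    weights = np.exp(log_post)
    return weights / weights.sum()


heads, tails = 1400, 600  # hypothetical observations

# Initial hypothesis set: none of these is the "true" bias.
initial = [0.2, 0.5, 0.6]
post_initial = posterior(initial, [1 / 3] * 3, heads, tails)
best_initial = initial[int(np.argmax(post_initial))]

# Hypothesis creation step: a new candidate is proposed and compared on the same data.
expanded = initial + [0.72]
post_expanded = posterior(expanded, [1 / 4] * 4, heads, tails)
best_expanded = expanded[int(np.argmax(post_expanded))]

print(best_initial, best_expanded)
```

The posterior first concentrates on the least-wrong available hypothesis, and then shifts to the newly proposed one when it fits better, which is the behavior we’d want an AI to regard as a normal part of the process rather than something to resist.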