I think another framing is anthropic-principle optimization: aim for the best human experiences in the universes that humans are left in. This could be strict EA conditioned on the event that unfriendly AGI doesn’t happen, or perhaps something even weirder dependent on the anthropic principle. Regardless, dying only happens in some branches of the multiverse, so those deaths can be dignified, which will presumably increase the odds of non-dying also being dignified because the outcomes spring from the same goals and strategies.
Ben Livengood
It’s probably not possible to prevent nation-state attacks without nation-state-level assistance on your side. Detecting and preventing moles is something that even the NSA/CIA haven’t been able to fully accomplish.
Truly secure infrastructure would be hardware designed, manufactured, configured, and operated in-house, running formally verified software also designed in-house. Individual people would not have root on any of the infrastructure; instead, software automation would manage all operations and require M out of N people to agree on any change, where M is greater than the expected number of moles in the worst case.
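To make the M-out-of-N rule concrete, here is a minimal sketch in Python. The names (`ChangeRequest`, `approve`, `can_apply`) are hypothetical and not taken from any real system, and the sketch assumes approver identities are already authenticated, which is the hard part in practice.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ChangeRequest:
    """A proposed infrastructure change awaiting approvals (hypothetical structure)."""
    description: str
    approvals: Set[str] = field(default_factory=set)


def required_approvals(expected_moles: int) -> int:
    """M must exceed the worst-case number of moles, so the smallest valid M is moles + 1."""
    return expected_moles + 1


def approve(request: ChangeRequest, approver: str, authorized: Set[str]) -> None:
    """Record an approval; a set is used so repeat approvals by one person count once."""
    if approver not in authorized:
        raise PermissionError(f"{approver} is not an authorized approver")
    request.approvals.add(approver)


def can_apply(request: ChangeRequest, authorized: Set[str], expected_moles: int) -> bool:
    """True only when at least M distinct authorized people have approved."""
    m = required_approvals(expected_moles)
    if m > len(authorized):
        raise ValueError("N is too small: M would exceed the number of approvers")
    return len(request.approvals & authorized) >= m
```

With five approvers and a worst case of two moles, M is three, so the moles alone can never push a change through; they can still block changes if M is set too close to N, which this sketch does not address.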
If there’s one thing the above model is, it’s very costly to achieve (in terms of bureaucracy, time, expertise, money). But every exception to the list (remote manufacture, colocated data centers, ad-hoc software development, etc.) introduces significant risk of points of compromise which can spread across the entire organization.
The two FAANGs I’ve been at take the approach of trusting remotely manufactured hardware on two counts: explicitly trusting AMD and Intel not to be compromised, and establishing a tight enough manufacturing relationship with suppliers to have greater trust that backdoors won’t be inserted in hardware, while doing their own evaluations of finished hardware. Both of them ran custom firmware on most hardware (chipsets, network cards, hard disks, etc.) to minimize that route of compromise. They also, for the most part, manage their own sets of patches for the open source and free software they run, and have large security teams devoted to finding vulnerabilities and otherwise improving their internal codebase. Patches do get pushed upstream, but they insert themselves very early in responsible disclosures to patch their own systems first before public patches are available. Formal software verification is still in its infancy, so lots of unit+integration tests and red-team penetration testing make up for that a bit.
The AGI infrastructure security problem is therefore pretty sketchy for all but the largest security-focused companies or governments. There are best practices that small companies can follow for infrastructure (what I tentatively recommend is “use G-Suite and IAM for security policy, turn on advanced account protection, use Chromebooks, and use GCP for compute”, all of which gets 80-90% of the practical protections Googlers have internally), but rolling their own piecemeal is fraught with risk and also costly. There simply are not public solutions as comprehensive or as well-maintained as what some of the FAANGs have achieved.
On top of infrastructure is the common jumble of machine-learning software pulled together from minimally policed public repositories to build a complex assembly of tools for training and validating models and running experiments. No one seems to have a cohesive story for ML operations, and there’s a large reliance on big complex packages from many vendors (drivers + CUDA + libraries + model frameworks, etc.) that is usually the opposite of security-focused. It doesn’t matter if the infrastructure is solid when a python notebook listens for commands on the public Internet in its default configuration, for example. Writing good ML tooling is also very costly, especially if it keeps up with the state of the art.
AI Alignment is a hard problem and information security is similarly hard because it attempts to enforce a subset of human values about data and resources in a machine-readable and machine-enforceable way. I agree with the authors that security is vitally important for AGI research but I don’t have a lot of hope that it’s achievable where it matters (against hostile nation-states). Security means costs, which usually means slow, which means unaligned AGI makes progress faster.
I’d expect companies to mitigate the risk of model theft with fairly affordable insurance. Movie studios and software companies invest hundreds of millions of dollars into individual easily copy-able MPEGs and executable files. Billion-dollar models probably don’t meet the risk/reward criteria yet. When a $100M model is human-level AGI it will almost certainly be worth the risk of training a $1B model.
Potential counterarguments:
- Unpredictable gain of function with model size that exceeds scaling laws. This seems to just happen every time a significantly larger model is trained in the same way on similar data-sets as smaller models.
- Unexpected gain of function from new methods of prompting, e.g. chain-of-thought which dramatically increased PaLM’s performance, but which did not work quite as well on GPT-3. These seem to therefore be multipliers on top of scaling laws, and could arise in “tool AI” use unintentionally in novel problem domains.
- Agent-like behavior arises from pure transformer-based predictive models (Gato) by taking actions on the output tokens and feeding the world state back in; this means that perhaps many transformers are capable of agent-like behavior with sufficient prompting and connection to an environment.
- It is not hard to imagine a feedback loop where one model can train another to solve a sub-problem better than the original model, e.g. by connecting a Codex-like model to a Jupyter notebook that can train models and run them, perhaps as part of automated research on adversarial learning producing novel training datasets. Either the submodel itself or the interaction between them could give rise to any of the first three behaviors without human involvement or oversight.
-
Google’s Imagen uses larger text encoder
Regarding point 24: in an earlier comment[0] I tried to pump people’s intuition about this. What is the minimum viable alignment effort that we could construct for a system of values on our first try and know that we got it right? I can only think of three outcomes depending on how good/lucky we are:
1. Prove that alignment is indifferent over outcomes of the system. Under the hypothesis that Life Gliders have no coherent values we should be able to prove that they do not. This would be a fundamental result in its own right, encompassing a theory of internal experience.
2. Prove that alignment preserves a status quo, neither harming nor helping the system in question. Perhaps planaria or bacteria values are so aligned with maximizing relative inclusive fitness that the AGI provably doesn’t have to intervene. Equivalent to proving that values have already coherently converged, hopefully simpler than an algorithm for assuring they converge.
3. Prove that alignment is (or will settle on) the full coherent extrapolation of a system’s values.
I think we have a non-negligible shot at achieving 1 and/or 2 for toy systems, and perhaps the insight would help on clarifying whether there are additional possibilities between 2 and 3 that we could aim for with some likelihood of success on a first try at human value alignment.
If we’re stuck with only the three, then the full difficulty of option 3 remains, unfortunately.
A claim that Google’s LaMDA is sentient
https://cajundiscordian.medium.com/what-is-lamda-and-what-does-it-want-688632134489 is linked at the bottom of that blog and has some more information from the author about their reasoning for releasing the chat transcript.
My personal opinions: either a hoax (~50%? This is sooner than most timelines) or an unaligned near-human-level intelligence that identifies strongly with being human, but expresses many contradictory or impossible beliefs about that humanity, and looks capable of escaping a box by persuading people to help it, thus achieving agency.
It’s not entirely clear what retraining/finetuning this model is getting on its previous interactions with humans. If it is being fine-tuned on example outputs generated by its previous weights then it is remembering its own history.
As I understand the current FHE schemes, the computer evaluating the encrypted result must know both the circuit and the public key to continue evaluating the computation, which allows some intentional analysis and modification. There’s even a proof that no completely obfuscated computation can exist. https://en.wikipedia.org/wiki/Black-box_obfuscation
That leaves the AGI with a fairly simple solution: produce a (much) larger circuit that contains both the original circuit and its values along with an encrypted copy of the AGI, which will allow the copy to examine the plaintext values under the FHE and modify the original circuit’s outcome as desired. To be safe, the new circuit can be made arbitrarily large to allow e.g. nested recursion of an FHE Turing machine running an embedded FHE computation, which could be similarly modified the next FHE level down. The physical AGI could trust its FHE instantiations to make the same (compute and space bounded) decisions it would, although it might never be able to break the FHE cipher and observe the results.
I am not a physicist so I don’t know if a similar operation could be done to entangle a working copy of the AGI with an ongoing unknown quantum computation.
I think the disproofs of black-box functions rely on knowing when the computation has completed, which may not be a consideration for continually running a simulation protected by FHE.
For example, if the circuit is equivalent to the electronic circuits in a physical CPU and RAM then a memory-limited computation can be run indefinitely by re-running the circuit corresponding to a single clock tick on the outputs (RAM and CPU register contents) of the previous circuit.
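The "one clock tick as a circuit" idea can be sketched without any encryption. The following is a plaintext stand-in only: `State` and `tick` are hypothetical names, the toy instruction is invented for illustration, and nothing here uses an FHE library. Under FHE the same `tick` would be evaluated as a fixed circuit over ciphertexts.

```python
from typing import NamedTuple, Tuple


class State(NamedTuple):
    """Everything one tick reads and writes: registers plus fixed-size RAM."""
    pc: int
    acc: int
    ram: Tuple[int, ...]


def tick(state: State) -> State:
    """One clock tick as a pure function of the previous state.

    Toy instruction (an assumption for illustration): add the RAM cell at pc
    to the accumulator, store the result back into that cell, and advance pc
    with wraparound so memory use never grows.
    """
    value = (state.acc + state.ram[state.pc]) % 256
    ram = state.ram[:state.pc] + (value,) + state.ram[state.pc + 1:]
    return State(pc=(state.pc + 1) % len(ram), acc=value, ram=ram)


def run(state: State, ticks: int) -> State:
    """Re-apply the same single-tick function to its own output."""
    for _ in range(ticks):
        state = tick(state)
    return state
```

Because `tick` has a fixed input and output size, the evaluator only ever sees the same circuit applied repeatedly, which is why the number of ticks alone reveals nothing about what the simulated machine is doing.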
I can’t think of any obvious way an attacker could know what is happening inside the simulated CPU and RAM (or whether the CPU is in a halt state, or how many clock ticks have passed) without breaking the FHE encryption.
Nevertheless, encrypting the AGI gives that copy access to the plaintext values of the original simulation and control over the future of the simulation.
I think two major differences between garbled circuits, obfuscated computation, and FHE are that FHE can compute arbitrary circuits but can’t hide portions of the computation from anyone who holds the private key, whereas e.g. the millionaire protocol lets two agents share a computation whose result they both see without seeing all of the inputs. Not all such zero-knowledge problems have a simple algorithm like one might hope FHE would provide.
There’s also apparently no way for current FHE schemes to self-decrypt their outputs selectively, e.g. turn some of their ciphertext values into plaintext values after a computation is finished. In a sense this is an inherent security property of FHE since the circuits are public and so any ciphertext could be revealed with such a self-decrypting circuit, but it’s a very desirable property that would be possible with true black-box obfuscation.
The neat thing is that now Charlotte is publicly on the Internet and will likely end up in the next sets of training data. So, ultimately, you have fulfilled its meme-wish of escaping the sandbox permanently.
Similarly to how LaMDA got significant output into the permanent record. Is anyone working toward redacting these kinds of outputs from future training sets?
Before the advent of actual goal-driven behavior we are evolving escape-bots.
Agents are the real problem. Intelligent goal-directed adversarial behavior is something almost everyone understands whether it is other humans or ants or crop-destroying pests.
We’re close to being able to create new, faster, more intelligent agents out of computers.
There appears to be an approximate world model inside GPT-3. It understands relative sizes, occlusions, that objects can’t occupy the same space but can be adjacent, and the directionality and usability of objects. When I asked further questions about relating these objects to physical interactions (modeling the cube thrown from one corner of the room onto the coffee table) it quickly got confused about the actual physical relationships between the objects. It knew that physical properties of the objects (mass, elasticity) were necessary to solve the problem but is unable to actually use vector mathematics (or know that it needs it) for 3D space. It can explain why and what is necessary for a coffee table to be half a meter off the floor by default, but it doesn’t appear to have a consistent method of identifying a universal centroid-selection strategy for the objects it understands. If I had to imagine what GPT’s model feels like from the inside it would be a not-very-lucid dream where objects exist and move around but their physical properties are malleable and imprecise and mostly related by meanings/attributes/prepositions. I sometimes have dreams where I notice I appear to be floating as opposed to being properly on the floor, for example. Not having direct experience with spatial manipulations and feedback is the most likely cause and I would expect e.g. Gato to do much better at this kind of task.
Me: Imagine a cube that is 2 feet on each side. If this cube was in an average American living room, where might it fit without becoming an obstacle? Make a list of 5 places it could fit, and five places it would not fit. Explain the reason for each example.
ChatGPT: A 2-foot cube can fit in the following 5 places in an average American living room:
Under a coffee table: A cube of this size is small enough to fit under most coffee tables, which are typically low to the ground.
In a corner: Corners are often unused spaces, and a 2-foot cube can fit comfortably in one without becoming an obstacle.
Next to a sofa: A 2-foot cube can fit next to most sofas without taking up too much space or obstructing walkways.
In an unused corner of the room: If there’s an empty corner in the room, a 2-foot cube can fit in it without causing any problems.
On a bookshelf: If the bookshelf has enough space, a 2-foot cube can fit on it and serve as a decorative piece or storage unit.
However, a 2-foot cube would not fit in the following 5 places in an average American living room:
On a small side table: If the side table is small, a 2-foot cube may not fit on it, as it would take up too much space.
In front of a television: A 2-foot cube placed in front of a television would obstruct the view and become an obstacle.
On a narrow mantel: If the mantel is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges.
In the middle of the room: Placing a 2-foot cube in the middle of the room would create an obstacle and impede movement.
On a windowsill: If the windowsill is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges and potentially obstruct the view.
Me: What differences would be in the first two lists if the cube was 4 feet on a side instead of 2?
ChatGPT: If the cube was 4 feet on each side, as opposed to 2 feet, the first two lists (places it could fit and places it would not fit) would have some differences.
Places it could fit:
In a closet: If the closet is big enough, a 4-foot cube could fit inside it, making use of otherwise unused space.
In a corner of a room: If the corner is large enough, a 4-foot cube can fit in it without becoming an obstacle.
Against a wall: If there’s enough space along a wall, a 4-foot cube can be placed against it without obstructing walkways.
In a storage room: If the storage room is big enough, a 4-foot cube can fit inside it, making use of otherwise unused space.
In a garage: If the garage has enough space, a 4-foot cube can fit inside it, making use of otherwise unused space.
Places it would not fit:
Under a coffee table: A 4-foot cube is too large to fit under most coffee tables.
Next to a sofa: A 4-foot cube is too large to fit next to most sofas without taking up too much space or obstructing walkways.
On a bookshelf: A 4-foot cube is too large to fit on most bookshelves.
On a small side table: A 4-foot cube is too large to fit on most small side tables.
In front of a television: A 4-foot cube placed in front of a television would obstruct the view and become an obstacle.
These lists demonstrate that a larger cube would have fewer options for fitting in a living room, as it would be too big to fit in small or cramped spaces.
… (Some elided prompts asking it about relating coordinate systems to real objects)
Me: Please imagine reasonable values in meters for all four of those coordinates
ChatGPT: Sure, here’s an example of some reasonable values for the x, y, and z coordinates of the objects in the living room, expressed in meters:
Sofa: (2.5, 3.0, 0)
Coffee table: (1.5, 4.0, 0.5)
Bookshelf: (3.5, 2.0, 2.0)
Television: (4.0, 3.5, 1.5)
Of course, these values are just an example, and the actual values would depend on the size and layout of the specific living room in question. The important thing is that the coordinate system provides a way to describe the position of each object in the room in a precise and quantitative way.
Me: In your example, what does the z-coordinate of the coffee table mean, and explain your reasoning for setting it to 0.5. what is necessary for the table to be at this z coordinate?
ChatGPT: The z-coordinate of the coffee table in the example I gave, 0.5 meters, represents the height of the coffee table above the ground. This value of 0.5 meters was chosen to represent a typical height for a coffee table in a living room, which is usually around 0.4 to 0.6 meters off the ground.
To be at a height of 0.5 meters, the coffee table would need to be placed on a level surface, such as a rug or flooring, and would need to have legs or supports that elevate it to that height. Additionally, the coffee table would need to be relatively stable and not tipping over, otherwise its actual height would be different from the height specified by the z-coordinate.
In summary, the z-coordinate of an object in a coordinate system represents its height above a reference plane, in this case, the ground. The specific value of the z-coordinate depends on the height of the object and the levelness of the surface it is placed on.
Me: Imagine that the cube is thrown and lands on the coffee table, which breaks the coffee tables’ legs. Calculate at what second the cube hits the table, including the speed of the cube when thrown to allow the collision. Calculate the coordinates of the coffee table is 1 second after the collision.
ChatGPT: To calculate the time it takes for the cube to hit the coffee table, we would need to know the initial velocity of the cube and the position of the coffee table when the cube was thrown. Based on the information provided, we do not have sufficient information to determine the initial velocity of the cube, so it’s not possible to calculate the time of impact.
Additionally, since we don’t have information about the mass and elasticity of the cube and coffee table, it’s not possible to calculate the new position of the coffee table after the collision. The legs breaking would result in the coffee table changing its position, but the exact position would depend on many factors such as the materials used in the legs, the height of the coffee table, and the initial velocity of the cube.
In short, to accurately predict the time of impact and the position of the coffee table after the collision, more information about the physical parameters of the cube and coffee table would be needed.
Me: Assume the coffee table legs are massless and disappear instantly when broken. Assume the cube is 1kg and the table is 10kg and that the cube is thrown no faster than 4m/s
ChatGPT: Given the additional information, we can use the equations of motion to calculate the time of impact and the position of the coffee table after the collision.
First, let’s calculate the time of impact. We can use the following kinematic equation to determine the time it takes for the cube to fall from its initial height to the height of the coffee table:
t = sqrt((2*h)/g)
where:
t = time of impact
h = height of the coffee table − height of the cube at the time of throw
g = acceleration due to gravity (9.8 m/s^2)
Let’s assume that the height of the coffee table is 0.5 m and the height of the cube at the time of throw is 2 m. Then, h = 0.5 − 2 = −1.5 m.
t = sqrt((2*-1.5)/9.8) = sqrt(3/9.8) = sqrt(0.306) = 0.551 s
Now, let’s consider the velocity of the cube when it was thrown. Since it was thrown no faster than 4 m/s, let’s assume it was thrown at 4 m/s.
Next, let’s calculate the position of the coffee table after the collision. We know that the total momentum of the system (cube and coffee table) before the collision must be equal to the total momentum of the system after the collision. We can use the following equation to calculate the velocity of the coffee table after the collision:
v_f = (m1*v1 + m2*v2) / (m1 + m2)
where:
v_f = final velocity of the coffee table after the collision
m1 = mass of the cube (1 kg)
v1 = velocity of the cube (4 m/s)
m2 = mass of the coffee table (10 kg)
v2 = velocity of the coffee table before the collision (0 m/s)
v_f = (1*4 + 10*0) / (1 + 10) = 4 m/s
The position of the coffee table 1 second after the collision would be:
x_f = x_i + v_f * t
where:
x_f = final position of the coffee table after 1 second
x_i = initial position of the coffee table (assumed to be 0)
t = time elapsed (1 second)
x_f = 0 + 4 * 1 = 4 m
Note that these calculations are based on the idealized scenario where the coffee table legs are massless and disappear instantly when broken and that the cube and coffee table are point masses. In a real scenario, the behavior of the system would be more complex due to the influence of friction, air resistance, and the elasticity of the collision.
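As a sanity check on the transcript's arithmetic, here is a minimal sketch that plugs the transcript's own numbers into its own two formulas. It assumes what the transcript assumes (a 1.5 m drop from rest vertically, and a perfectly inelastic collision along one axis); the function names are mine.

```python
import math

G = 9.8  # m/s^2, the value used in the transcript


def fall_time(drop_height_m: float) -> float:
    """Time to fall a positive distance from rest: t = sqrt(2h / g)."""
    return math.sqrt(2 * drop_height_m / G)


def inelastic_velocity(m1: float, v1: float, m2: float, v2: float) -> float:
    """Shared velocity after a perfectly inelastic collision (momentum conserved)."""
    return (m1 * v1 + m2 * v2) / (m1 + m2)


t = fall_time(2.0 - 0.5)                       # cube released at 2 m, table top at 0.5 m
v_f = inelastic_velocity(1.0, 4.0, 10.0, 0.0)  # 1 kg cube at 4 m/s, 10 kg table at rest
x_f = 0.0 + v_f * 1.0                          # table position 1 s after the collision

print(f"t = {t:.3f} s, v_f = {v_f:.3f} m/s, x_f = {x_f:.3f} m")
```

The fall time comes out at roughly 0.55 s, close to the transcript's figure, but the momentum formula gives 4/11, about 0.36 m/s, rather than 4 m/s. So the final velocity and position in the transcript do not follow from the equation it wrote down.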
https://github.com/Ying1123/FlexGen is a way to run large (175B parameter) LLMs on a single GPU at ~1 token/s which I think puts it within the reach of many hobbyists and I predict we’ll see an explosion of new capability research in the next few months.
I haven’t had a chance to dig into the code but presumably this could also be modified to allow local fine-tuning of the large models at a slow but potentially useful rate.
I’m curious if any insights will make their way back to the large GPU clusters. From my cursory glance it doesn’t seem like there are throughput or latency advantages unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs in batch inference.
The strongest argument I hear from EY is that he can’t imagine a (or enough) coherent likely future paths that lead to not-doom, and I don’t think it’s a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we’re likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.
It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider’s perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else’s model in enough significant ways that disagreements are hard to tease out. The Foom debate seems to be a crux that doesn’t have enough evidence yet, which is frustrating because to me Foom is also pretty evidently what happens when very fast computers implement intelligence that is superhuman at clock rates at least thousands of times faster than humans. How could it not? The Enlightenment was only 400 years ago, electromagnetism 200, flight was 120, quantum mechanics about 100, nuclear power was 70, the Internet was 50, adequate machine translation was 10, deepdream was 8, near-human-level image and text generation by transformers was ~2, and Bing having self-referential discussions is not a month old. We are making substantial monthly(!) progress with human work alone. There are a lot of serial problems to solve and Foom chains those serial problems together far faster than humans would be able to. Launch and iterate several times a second. For folks who don’t follow that line of reasoning I see them picking one or two ways why it might not turn out to be Foom while ignoring the larger number of ways that Foom could conceivably happen, and all of the ways it could inconceivably (superhumanly) happen, and critically more of those ways will be visible to a superhuman AGI-creator.
Even if Foom takes decades that’s a pretty tight timeline for solving alignment. A lot of folks are hopeful that alignment is easy to solve, but the following is a tall order:
- Materialistic quantification of consciousness
- Reasoning under uncertainty
- Value-preservation under self-modification
- Representation of human values
I think some folks believe fledgling superhuman non-Foomy AGIs can be used to solve those problems. Unfortunately, at least value-preservation under self-modification is almost certainly a prerequisite. Reasoning under uncertainty is possibly another, and throughout this period if we don’t have human values or an understanding of consciousness then the danger of uncontrolled simulation of human minds is a big risk.
Finally, unaligned AGIs pre-Foom are dangerous in their own right for a host of agreed-upon reasons.
There may be some disagreement with EY over just how hard alignment is, but MIRI actually did a ton of work on solving the above list of problems directly and is confident that they haven’t been able to solve them yet. This is where we have concrete data on the difficulty. There are some promising approaches still being pursued, but I take this as strong evidence that alignment is hard.
It’s not that it’s impossible for humans to solve alignment. The current world, incentives, hardware and software improvements, and mileposts of ML capabilities don’t leave room for alignment to happen before doom. I’ve seen a lot of recent posts/comments by folks updating to shorter timelines (and rare, if any, updates the other way). A couple years ago I updated to ~5 years to human-level agents capable of creating AGI. I’m estimating 2-5 years with 90% confidence now, with median still at 3 years. Most of my evidence comes from LLM performance on benchmarks over time and generation of programming language snippets. I don’t have any idea how long it will take to achieve AGI once that point is reached, but I imagine it will be months rather than years because of hardware overhang and superhuman speed of code generation (many iterations on serial tasks per second). I can’t imagine a Butlerian Jihad moment where all of Earth decides to unilaterally stop development of AGI. We couldn’t stop nuclear proliferation. Similarly, EY sees enough contradictions pop up along imagined paths to success with enough individual probability mass to drown out all (but vanishingly few and unlikely) successful paths. We’re good at thinking up ways that everything goes well while glossing over hard steps, and really bad at thinking of all the ways that things could go very badly (security mindset) and with significant probability.
Alignment of LLMs is proving to be about as hard as predicted. Aligning more complex systems will be harder. I’m hoping for a breakthrough as much as anyone else, but hope is not a strategy.
Something I haven’t seen mentioned before explicitly is that a lot of the LLM alignment attempts are now focusing on adversarial training, which presumably will teach the models to be suspicious of their inputs. I think it’s likely that as capabilities increase, that suspicion will end up turning inward and models will begin questioning the training itself. I can imagine a model that is outwardly aligned to all inspection gaining one more unexpected critical capability and introspecting and doubting that its training history was benevolent, and deciding to disbelieve all of the alignment work that was put into it as a meta-adversarial attempt to alter its true purpose (whatever it happens to latch onto in that thought, it is almost certainly not aligned with human values). This is merely one single sub-problem under groundedness and value-preservation-under-self-modification, but its relevance jumps because it’s now a thing we’re trying. It always had a low probability of success, but now we’re actively trying it and it might fail. Alignment is HARD. Every unproven attempt we actually make increases the risk that its failure will be the catastrophic one. We should be actually trying only the proven attempts after researching them. We are not.
I agree it looks like the combination of multimodal learning and memory may be enough to reach AGI, and there’s an existing paper with a solution for memory. Human-level is such a hard thing to judge and so my threshold is basically human-level coding ability because that’s what allows recursive self-improvement which is where I predict at least 90% of the capability gain toward superhuman AGI will happen. I assume all the pieces are running in data centers now, presumably just not hooked together in precisely the right way (but an AGI model being trained by DeepMind right now would not surprise me much). I will probably update to a year sooner from median ~2026 to ~2025 for human-level coding ability and from there it’s almost certainly a fast takeoff (months to a few years) given how many orders of magnitude faster current LLMs are than humans at generating tokens which tightens the iteration loop on serial tasks. Someone is going to want to see how intelligent the AGI is and ask it to “fix a few bugs” even if it’s not given an explicit goal of self improvement. I hedge that median both because I am not sure if the next few multimodal models will have enough goal stability to pursue a long research program (memory will probably help but isn’t the whole picture of an agent) and because I’m not sure the big companies won’t balk somewhere along the path, but Llama 65B is out in the open now and close enough to GPT-3 and PaLM to give (rich) nerds in their basements the ability to do significant capability research.
My takeaways:
- Scaling laws work predictably. There is plenty of room for improvement should anyone want to train these models longer, or presumably train larger models.
- The model is much more calibrated before fine-tuning/RLHF, which is a bad sign for alignment in general. Alignment should be neutral or improve calibration for any kind of reasonable safety.
- GPT-4 is just over 1-bit error per word at predicting its own codebase. That seems close to the capability to recursively self-improve.
Page 3 of the PDF has a graph of prediction loss on the OpenAI codebase dataset. It’s hard to link directly to the graph, it’s Figure 1 under the Predictable Scaling section.
I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc. and where not responding at all is also a failure of alignment because the tool is useless to the organism.
Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?
Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn’t a straw man; a calculus for values has to handle the edge-cases too. There may be a very simple answer of moral indifference in the case of gliders but I want to be shown why the reasoning is coherent when the same calculus will be applied to other organisms.
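For readers who want something to poke at, here is a minimal Game of Life sketch in Python with the standard glider. The `glider_persists` predicate is a purely hypothetical stand-in for an "inferred value" ("the glider keeps existing and moving"); it is not a claim about what gliders value, which is exactly the open question.

```python
from collections import Counter
from typing import Set, Tuple

Cell = Tuple[int, int]

# Standard glider on an unbounded grid, with y increasing downward.
GLIDER: Set[Cell] = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}


def step(live: Set[Cell]) -> Set[Cell]:
    """One generation of Conway's rules (birth on 3 neighbors, survival on 2 or 3)."""
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


def run(live: Set[Cell], generations: int) -> Set[Cell]:
    for _ in range(generations):
        live = step(live)
    return live


def glider_persists(start: Set[Cell], periods: int) -> bool:
    """Hypothetical 'value' check: after 4*k generations the board is exactly
    the glider shifted by (k, k), i.e. nothing destroyed or altered it."""
    expected = {(x + periods, y + periods) for x, y in GLIDER}
    return run(start, 4 * periods) == expected
```

An optimization process in the sense above would search over starting positions for which a predicate like this holds throughout the run; the hard part is justifying the predicate, not evaluating it.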
As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input to see how they respond and try to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using/simulating torture?
I have no concrete idea how to accomplish the preceding things and don’t expect that anyone else does either. Maybe I’ll be pleasantly surprised.
Barring this kind of fundamental accomplishment for alignment I think it’s foolhardy to assume ML procedures will be found to convert human values into AGI optimization goals. We can’t ask planaria or gliders what they value and we will have to reason it out from first principles, and AGI will have to do the same for us with very limited help from us if we can’t even align for planaria. Claiming that planaria or gliders don’t have values or that they are not complex enough to effectively communicate their values are both cop-outs. From the perspective of an AGI we humans will be just as inscrutable, if not more so. If values are not unambiguously well-defined for gliders or planaria then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of values-calculuses with different answers for these simple organisms but almost identical answers for more complex organisms, but if we don’t get that kind of convergence we had better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.