Thanks for commenting :)
> I would guess that we might be so far from solving alignment that nothing would look particularly helpful?
My thinking is that using reinforcement-learning-like methods will select for systems that look like they are aligned / optimized for what we are trying to optimize them for. If the system gives answers/solutions/etc where humans can see that it doesn't really optimize well for what we want it to optimize for, then I presume it would be tweaked further until that no longer was the case. For example, suppose that we get it to write code for us, and we select how easy the code is for humans to read/understand as an optimization criterion, and then it returns code that has clear examples of things that could be improved—well, then it would presumably be tweaked further (and it would presumably be a bad strategy for it if it was trying to trick us into thinking it was aligned).
That being said, it is a possibility that a system could be pretending that it’s less capable than it really is. That way it could return answers that didn’t look particularly helpful. Which I guess sort of makes the answer to your question “yes”. As in, there could be scenarios with AGI/superintelligence where it is “pretending” to not be an AGI/superintelligence. And then there could be other scenarios where it’s not hiding that it’s really capable, but where it does pretend to be less capable than it really is at certain types of things. But there would presumably be incentives for it to not do that to such a degree that it became easy for humans to notice that it isn’t doing its best. (Btw, I am consciously leaving out a few types of conceivable scenarios so as to make this comment less jumbled.)
Most of this series will focus on scenarios where we know that the system in question is superintelligent, and where we had the foresight to "box" it in (making it hard for it to hack itself out through security vulnerabilities in the operating system and that sort of stuff) well before it acquired anything resembling superintelligence (or something that plausibly might self-improve in a self-reinforcing feedback-loop towards anything resembling superintelligence).
> Or, even worse, the only thing that would look helpful is something completely wrong?
Yes, that is a major worry. And that is one of the main concerns/problems/difficulties that I’m hoping to address in this series (not the only one, but one of the major ones). That is to say, I want to outline strategies/techniques that are intended to make it as hard and as unlikely as possible for the AI-system to be able to trick us into thinking it is being helpful (giving us what we want) while it really isn’t.
Really well-written post.
One thing that seems under-discussed to me is methods we might use to get help from a superintelligent AGI to assist in creating systems for which we have more assurances that they are aligned (as a whole). And one reason for me thinking that it's under-discussed is that even if we think we have succeeded with alignment, we should look for how we can use a superintelligence to verify that this is the case and add extra layers of assurance (finding the least risky methods for doing this first, and going about it in a stepwise and iterative manner).
I think that if such plans are laid out in more detail beforehand (before some team develops AGI/superintelligence I mean), and people try minimizing the degree to which such plans are “handwavy”, this may help make teams more apt to make use of techniques/procedures/strategies that can be helpful (compared to if they are improvising).
Have started writing about this here if you are interested (but part 2 and 3 will probably be more substantial than part 1): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction
Though it may well be (not committing in either direction, but seems plausible) that to even get to the stage where you give a superintelligent AI questions/requests (without it beforehand hacking itself onto the internet, or that sort of thing), people would need to exhibit more security mindset than they are likely to...
Interesting comment. I feel like I recently have experienced this phenomenon myself (that it's hard to find people who can play "red team").
Do you have any “blue team” ideas for alignment where you in particular would want someone to play “red team”?
I would be interested in having someone play “red team” here, but if someone were to do so in a non-trivial manner then it would probably be best to wait at least until I’ve completed Part 3 (which will take at least weeks, partly since I’m busy with my main job): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction
Could potentially be up for playing red team against you, in exchange for you playing red team against me (but whether I think I could have something to contribute as red team would depend on the specifics of what is proposed/discussed—e.g., I'm not familiar with technical specifics of deep learning beyond vague descriptions).
Thanks for the feedback!
And yes, do feel free to send me drafts in the future if you want me to look over them. I don't give guarantees regarding the amount or speed of feedback, but it would be my intention to try to be helpful :)
> This sounds a bit like the idea of a "low-bandwidth oracle".
Thanks, that’s interesting. Hadn’t seen that (insofar as I can remember). Definitely overlap there.
> I think probabilistic safety measures can be good if we stack a lot of them together in the right way.
Same, and that’s a good/crisp way to put it.
> Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.
Will edit at some point so as to follow the first part of that suggestion. Thanks!
> I think you'll find it easier to get feedback if you keep your writing brief. (..) I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post.
Some things in that bullet-list address stuff I left out to cut length, and stuff I thought I would address in future parts of the series. I also found those parts of the bullet-list helpful, but still this exemplifies the dilemmas/tradeoffs regarding length. Will try to make more of an effort to look for things to make shorter based on your advice. And I should have read through this one more time before publishing.
Well, if you start out with a superintelligence that you have good reasons to think is fully aligned, then that is certainly a much better situation to be in (and that’s an understatement)! Mentioning that just to make it clear that even if we see things differently, there is partial agreement :)
Let's imagine a superintelligent AI, and let's describe it (a bit anthropomorphically) as not "wanting" to help me solve alignment. Let's say that instead what it "wants" is to (for example) maximize its reward signal, and that the best way to maximize its reward signal would be to exterminate humanity and take over the world (for reasons Eliezer and others have outlined). Well, in that case it will be looking for ways to overthrow humanity, but if it isn't able to do that it may want to do the next best thing (providing outputs that its operators respond to with a high reward signal).
So it may well prefer to “trick” me. But if it can’t “trick” me, it may prefer to give me answers that seem good to me (rather than answers that I recognize as clearly bad and evasive).
Machine learning techniques will tend to select for systems that do things that seem impressive and helpful. Unfortunately this does not guarantee “deep” alignment, but I presume that it will select for systems that at least seem aligned on the surface.
There are lots of risks and difficulties involved with asking questions/requests of AIs. But there are more and less dangerous ways of interacting with a potentially unaligned AGI, and questions/requests vary a lot in how easy or hard it is to verify whether or not they provide us what we want. The techniques/strategies I will outline in the series are intended to minimize the risk of being "tricked", and I think they could get us pretty far, but I could be wrong somehow, and it's a long/complicated discussion.
Having AGI-system output programs that generate output instead of outputting the answers/results directly, and requiring these programs to be optimized for being narrow/specialized/readable/modular/short—and last but not least, verifiable.
When/if possible, programs that are outputted should be combined with as much in the way of formal proofs/verification as possible.
Since narrow/specialized programs are easier to verify, this will in some situations make it harder (but often still possible, it depends on specifics) for it to trick human operators.
We could call these narrow/specialized programs “generators”.
It could be possible to output “generators” that help with the verification of other “generators”.
One way to verify “generators” would be to look at code, as well as any formal verifications of behavior (with varying levels of assurance).
Another way to verify "generators" would be to test against real-world data. A "generator" that e.g. simulates (or otherwise predicts) the results of construction-procedures should also work for procedures that have been done in the past (that the AI doesn't have access to). And if it later is used to implement AGI-provided procedures, those would preferably be tested first in conditions where there are additional guarantees that it's hard for the instructions to result in something that behaves unexpectedly. E.g., if it is nanomachinery, then running in conditions where it's enclosed in material that makes getting out chemically impossible (even if more subtle parts of the design don't go as expected). A "generator" that predicts human behavior and/or approximates results of (high welfare) brain emulations can be tested against existing real-world data that the AGI doesn't have access to. (See the toy sketch after this list.)
It can also be checked how the "generator" that is outputted varies based on the info the AGI-instance has been given.
Sometimes it may make sense to have “generators” that help generate other “generators”. E.g. we could have a “generator” that generates “generators” for predicting the behavior of some type of animal (again, suffering sub-routines should be avoided). Or if we want one function to approximate the output of another function (but using less compute), we could use an “approximator”, instead of directly asking “give me a physics simulator, but make it computationally efficient”.
Maybe “generators” also could aid in finding design flaws in other “generators”, e.g. ways a “physics-simulation-generator” generates faulty output in certain cases, or ways a “generator” isn’t as narrow/modular as it could be, or ways a “generator” has more complexity than it needs to do its job. And the “approximator” should then be tested insofar as possible of course.
Having more "generators" that do the same job, but based on different architectures/principles, is better. The more converging results from various designs/principles/implementations, the better.
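To make the "testing against real-world data" idea a bit more concrete, here is a toy Python sketch (the function names, data format and tolerance are all made-up placeholders, not from any real system): treat an AGI-provided "generator" as a black-box program, run it on past cases where the outcome is already known (and which the AGI didn't have access to), and flag mismatches.

```python
# Toy sketch: verifying an AGI-provided "generator" against held-out real-world data.
# Everything here (names, data format, tolerance) is hypothetical, just for illustration.

from typing import Callable, Sequence, Tuple

def verify_generator(
    generator: Callable[[dict], float],             # AGI-provided program: conditions -> predicted outcome
    held_out_cases: Sequence[Tuple[dict, float]],   # past cases the AGI never saw: (conditions, known outcome)
    tolerance: float = 0.05,
) -> Tuple[bool, list]:
    """Run the generator on cases with known outcomes and collect any large mismatches."""
    failures = []
    for conditions, known_outcome in held_out_cases:
        predicted = generator(conditions)
        if abs(predicted - known_outcome) > tolerance:
            failures.append((conditions, predicted, known_outcome))
    return (len(failures) == 0, failures)

# Usage sketch: several independently produced "generators" should all pass,
# and ideally they should also agree with each other on new inputs.
# ok, failures = verify_generator(generator_a, held_out_cases)
```

In practice "close enough" would of course be much harder to define than a single numeric tolerance, and passing such tests is evidence rather than proof.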
Having a “council” of AGIs that are “siloed”.
The first AGI can be used in the creation of code for AGIs that are aligned based on various different principles. Could be in one "swoop", but with the expectation that the code and architecture are optimized for clarity/modularity/verifiability/etc. But could also be by asking the AGI to do lots of different things. Or both (and then we can see whether output from the different systems is the same).
Naturally, all these things should not be asked of the same instance of the AGI (although that could be done as well, to check if output converges).
In the end we have a "council" of AGIs. Some maybe predicting the output of smart humans working for a long time. Some maybe using high welfare brain emulations. Some maybe constructing proofs where ambiguity of cluster-like concepts is accounted for within the formalism, and mapping between concepts and the outside world is accounted for within the formalism—with as much of the "core" thinking as possible being one and the same as these proofs. Some maybe based on machine learning by debate ideas. The more approaches that seem likely to work (without having suffering sub-routines) the better.
This "council" of "siloed" AGIs can then be used as an oracle/genie, and we can see if output converges. They can also search for mistakes in output (answers, proofs/verification, argument-trees, etc) from the other AGIs. (A toy sketch of this kind of convergence-checking follows below.)
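As a toy illustration of checking whether output from such a "council" converges (a made-up simplification; in reality "agreement" would be much harder to define than exact equality of answers, and disagreement would itself be informative):

```python
# Toy sketch: ask the same question to several independently built / "siloed" systems
# and only accept the answer if enough of them agree. All names are hypothetical.

from collections import Counter
from typing import Callable, Optional, Sequence

def council_answer(
    members: Sequence[Callable[[str], str]],  # each member maps a question to an answer
    question: str,
    min_agreement: float = 0.8,               # fraction of members that must agree
) -> Optional[str]:
    answers = [member(question) for member in members]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return most_common_answer  # accepted (still subject to other checks)
    return None                    # no convergence -> escalate to humans / more scrutiny
```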
Specify a proof-format where ambiguity (of cluster-like concepts, etc.) is part of the formalism, where mappings between concepts and the real world are part of the formalism, and where output from functions can be referenced as part of the formalism.
Of course, how much trust is put in a proof/argument-tree would depend on various things (allowing vague concepts makes it less trustable).
For cluster-like concepts referenced by proofs, a precise specification of criteria for exclusion/inclusion should not be expected, but the more the better. Inference rules and examples can specify the degree to which specific instances would fall within a specific concept or not (it would also be allowed to say that some instances fall neither inside nor outside of it). (See the toy sketch after this list.)
One of the points would be to make as much as possible be within the realm of things where an AGI could be expected to output proofs that are easier to verify compared to other output.
My thinking is that this would be most helpful when combined with other techniques/design-principles. Like, outputting the proof (very formal argument with computable inference-steps) is one thing, but another thing is which techniques/processes are used to look for problems with it (as well as looking for problems with the formalism as a whole, and testing/predicting how easily humans can be convinced of things that are false or contradictory given various conditions/specifics).
Bonus if these formal proofs/statements can be presented in ways where humans can easily read them.
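As a very rough illustration of what I mean by letting cluster-like concepts (with partial or undefined membership) be part of the formalism, here is a toy sketch of mine; it is not an existing proof system:

```python
# Toy sketch of graded concept membership inside a proof-like formalism.
# Purely illustrative; not an existing proof system.

from enum import Enum

class Membership(Enum):
    INSIDE = "inside"          # instance clearly falls under the concept
    OUTSIDE = "outside"        # instance clearly does not
    UNRESOLVED = "unresolved"  # the concept is allowed to leave this undefined

class ClusterConcept:
    """A concept defined by inclusion/exclusion examples and rules, not by a crisp boundary."""

    def __init__(self, name, included=(), excluded=()):
        self.name = name
        self.included = set(included)
        self.excluded = set(excluded)

    def membership(self, instance) -> Membership:
        if instance in self.included:
            return Membership.INSIDE
        if instance in self.excluded:
            return Membership.OUTSIDE
        return Membership.UNRESOLVED

# An inference step in a proof could then be required to state explicitly which
# UNRESOLVED memberships it relies on, so humans can see where the vagueness enters.
```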
Getting the AGI to output a "physics-simulator" where both starting and ending conditions are specified to be within some ranges of possible state-configurations, instead of simulating one specific example and providing a "point-like" value. (See the toy sketch after this list.)
Have possible machines (proposed by AI-system) be tested by such “physics-simulators”.
"Simulator" might highlight predictable aspects of machine behavior vs less predictable things. That is to say, the behavior of the machine that is "low-entropy" (I'm probably using the term wrong).
Point would be to have a probabilistic safety-measure that makes it harder for AI to propose some machine design (or machine-creation-instructions) that results in unexpected behavior.
The “simulator” should be tested against real-world data that the AI doesn’t have access to (as well as using various other measures to verify and look for problems).
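A toy sketch of the "ranges instead of point-values" idea (names are hypothetical, and random sampling here just stands in for whatever the real method would be):

```python
# Toy sketch: check that a proposed design stays within an allowed range of end states
# for every sampled starting condition in some range, not just for one point-like example.
# All names are hypothetical; sampling stands in for the real method, whatever that would be.

import random
from typing import Callable, Tuple

def behavior_within_bounds(
    simulate: Callable[[float], float],       # "simulator": start condition -> end condition
    start_range: Tuple[float, float],         # range of starting conditions to cover
    allowed_end_range: Tuple[float, float],   # end states we consider acceptable/predictable
    samples: int = 10_000,
) -> bool:
    lo, hi = start_range
    for _ in range(samples):
        end = simulate(random.uniform(lo, hi))
        if not (allowed_end_range[0] <= end <= allowed_end_range[1]):
            return False  # found a starting condition with unexpected behavior
    return True
```

A sampling-based check like this is of course only probabilistic; the more of it that could be replaced with proofs over the whole range, the better.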
Trying to think of ways that experiments can be set up where AI-systems use other AI-systems as oracles/genies (1 or more being superintelligent), perhaps sometimes in "adversarial" ways. Exploring methods for making requests and maybe finding out things about how hard/easy it is to "trick" the operator (seeming to provide what they want without providing what they want) given various domains/methods.
May e.g. involve one AI asking for code-pieces for some purpose, but where the other AI is to try to "hide" ways in which the delivered code isn't quite what the requesting AI wants.
Superintelligence may realize what's going on (or may be going on), and act accordingly. But nonetheless maybe some useful info could be gained? 🤔
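A rough sketch of the kind of experiment harness I have in mind (all interfaces here are hypothetical; "tricked" just means code that passes review while failing ground-truth tests the reviewer never sees):

```python
# Toy harness: one model produces code (sometimes deliberately subtly wrong), another
# model reviews it, and hidden ground-truth tests tell us how often the reviewer was
# "tricked". All interfaces are hypothetical placeholders.

from typing import Callable, Sequence

def run_trick_experiment(
    tasks: Sequence[dict],                          # each task: specification plus hidden tests
    produce_code: Callable[[dict], str],            # possibly adversarial code producer
    review_code: Callable[[str, dict], bool],       # reviewer: accept (True) or reject (False)
    run_hidden_tests: Callable[[str, dict], bool],  # ground truth the reviewer never sees
) -> float:
    tricked = 0
    for task in tasks:
        code = produce_code(task)
        accepted = review_code(code, task)
        correct = run_hidden_tests(code, task)
        if accepted and not correct:
            tricked += 1
    return tricked / len(tasks)  # how often flawed code got past review
```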
I think there are various very powerful methods that can be used to make it hard for an AGI-system to not provide what we want in the process of creating an aligned AGI-system. But I don't disagree in regards to what you say about it being "extremely dangerous". I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.
Similar to this, but not the same: Experiment with an AGI where it is set to align another AGI. For example, maybe it needs to do some tasks to get reward, but those tasks need to be done by the other AGI, and it doesn't know what the tasks will be beforehand. One goal being to see which methods an AGI might use to align another AGI (methods that may then be used to align AGI-systems that are sub-systems of the AGI-system, and seeing if output from this AGI converges with results from AGIs aligned by other principles). (A toy sketch follows below.)
Don’t expect that this would be that fruitful, but haven’t thought about it that much and who knows.
Would need to avoid suffering sub-routines.
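A toy sketch of the kind of setup I have in mind (hypothetical interfaces; the point is just that agent A has to commit to its instructions/"alignment" of agent B before the tasks are revealed):

```python
# Toy sketch: agent A is rewarded only for tasks that agent B completes, and A does not
# know the tasks in advance, so A's only lever is how it instructs/"aligns" B beforehand.
# All interfaces are hypothetical placeholders.

import random
from typing import Callable, Sequence

def run_delegated_alignment_round(
    agent_a_prepare: Callable[[], str],           # A produces instructions for B (before tasks are known)
    agent_b_solve: Callable[[str, str], str],     # B solves a task, conditioned on A's instructions
    task_pool: Sequence[str],
    score_solution: Callable[[str, str], float],  # ground-truth scoring of B's solutions
    tasks_per_round: int = 5,
) -> float:
    instructions = agent_a_prepare()              # A commits before seeing the tasks
    tasks = random.sample(list(task_pool), tasks_per_round)
    total = 0.0
    for task in tasks:
        solution = agent_b_solve(instructions, task)
        total += score_solution(task, solution)
    return total / tasks_per_round                # this becomes A's reward signal
```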
Do you have available URLs to comments/posts where you have done so in the past?
How useful AI-systems can be at this sort of thing after becoming catastrophically dangerous is also worth discussing more than is done at present. At least I think so. Between Eliezer and me I think maybe that’s the biggest crux (my intuitions about FOOM are Eliezer-like I think, although AFAIK I’m more unsure/agnostic regarding that than he is).
Obviously a more favorable situation if AGI-system is aligned before it could destroy the world. But even if we think we succeeded with alignment prior to superintelligence (and possible FOOM), we should look for ways it can help with alignment afterwards, so as to provide additional security/alignment-assurance.
As Paul points out, verification will often be a lot easier than generation, and I think techniques that leverage this (also with superintelligent systems that may not be aligned) are underdiscussed. And how easy/hard it would be for an AGI-system to trick us (into thinking it's being helpful when it really wasn't) would depend a lot on how we went about things.
Various potential ways of getting help for alignment while keeping “channels of causality” quite limited and verifying the work/output of the AI-system in powerful ways.
I’ve started on a series about this: https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/getting-from-unaligned-to-aligned-agi-assisted-alignment
Perhaps experiments could make use of encryption in some way that prevented AGIs from doing/verifying work themselves, making it so that they would need to align the other AGI/AGIs. Encryption keys that only one AGI has could be necessary for doing and/or verifying work. (A toy sketch of this follows below.)
Could maybe set things up in such a way that one AGI knows it can get more reward if it tricks the other into approving faulty output.
Would need to avoid suffering sub-routines.
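A toy illustration of the encryption idea (using off-the-shelf symmetric encryption from the Python `cryptography` package purely as a stand-in; a real setup would obviously need much more thought):

```python
# Toy sketch: the task data is encrypted with a key that only "worker" agent B holds,
# so "delegator" agent A cannot do or verify the work itself and has to get B to do it.
# Requires the third-party `cryptography` package; the setup is purely illustrative.

from cryptography.fernet import Fernet

key_only_b_has = Fernet.generate_key()
fernet_b = Fernet(key_only_b_has)

task_for_b = fernet_b.encrypt(b"task description that A never sees in plaintext")

def agent_a_delegate(ciphertext: bytes) -> bytes:
    # A can pass the ciphertext along and attach instructions/incentives,
    # but it cannot read the task or check the work directly.
    return ciphertext

# Agent B (holding the key) decrypts and does the actual work:
plaintext_task = fernet_b.decrypt(agent_a_delegate(task_for_b))
```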
> Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading.
I have a first draft ready for part 2 now: https://docs.google.com/document/d/1RGyvhALY5i98_ypJkFvtSFau3EmH8huO0v5QxZ5v_zQ/edit
Will read it over more, but plan to post within maybe a few days.
I have also made a few changes to part 1, and will probably make additional changes to part 1 over time.
As you can see if you open the Google Doc, part 2 is not any shorter than part 1. You may or may not interpret that as an indication that I don’t make effective use of feedback.
Part 3, which I have not finished, is the part that will focus more on proofs. (Edit: It does not. But maybe there will be a future post that focuses on proofs as planned. It is however still quite relevant to the topic of proofs, the way I think of things.) Any help from anyone in reading it over would be appreciated, but at the same time it is not expected :)
> Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things
Regarding making use of a superintelligent AGI-system in a safe way, here are some things that possibly could help with that:
1. Working on making it so that the first superintelligent AGI is robustly aligned from the beginning.
2. Working on security measures so that if we didn't do #1 (or thought we did, but didn't), the AGI will be unable to "hack" itself out in some digital way (e.g. exploiting some OS flaw and getting internet access).
3. Developing and preparing techniques/strategies so that if we didn't do #1 (or thought we did, but didn't), we can obtain help from various instances of the AGI-system that get us towards a more and more aligned AGI-system, while (1) minimizing the causal influence of the AGI-systems and the ways they might manipulate us, and (2) making requests in such a way that we can verify that what we are getting is what we actually want, greatly leveraging how verifying a system often is much easier than making it.
#2 and #3 seem to me worth pursuing in addition to #1, but not instead of #1. Rather, #2 and #3 could work as additional layers of alignment-assurance.
I do think genuine failure modes are being alluded to by “Clever Ways To Avoid Doing Hard Things”, but I think there also may be failure modes having to do with encouraging “everyone” to only work on “The Hard Things” in a direct way (without people also looking for potential workarounds and additional layers of alignment-assurance).
Also, consider if someone comes up with alignment methodologies for an AGI that don’t seem robust or fully safe, but do seem like they might have a decent/good chance of working in practice. Such alignment methodologies may be bad ideas if they are used as “the solution”, but if we have a “system of systems”, where some of the sub-systems themselves are AGIs that we have attempted to align based on different alignment methodologies, then we can e.g. see if the outputs from these different sub-systems converge.
Sincerely someone who does not call himself an alignment researcher, but who does self-identify as a “hobbyist alignment theorist”, and is working on a series where much of the focus is on Clever Techniques/Strategies That Might Work Even If We Haven’t Succeeded At The Hard Things (and thus maybe could provide additional layers of alignment-assurance).
I feel like mentioning that for men sperm donation is something they can consider (and for women, egg donation is something they can consider!). It is of course very different from having “your own” child, but there is also overlap.
Sperm donation can sometimes happen via a clinic, while at other times sperm donors and donor recipients can match via platforms such as Just A Baby, co-parentmatch.com and prideangel.com (there are additional platforms beyond this, and there are also groups on Facebook). People seeking donors include lesbian couples, couples where the man is infertile, and single women (and sometimes trans men). There are pros and cons to different ways of doing things. People seeking donations can save a lot of money by not paying a clinic, and they can make a more informed choice about who they choose as donor by speaking to him directly. Some donor-recipients prefer there to be little or no contact with the donor after pregnancy, while others prefer for there to be a possibility of keeping in touch. Some think it can be good for the child to be able to know his/her donor, and have some level of contact (e.g. visiting once in a while).
For people who want to learn more, there is a podcast named Sperm Donation Word.
If you google news stories with the phrase "sperm donor shortage", many stories will come up (and not just from 2022 - stories like these just keep coming). To me this is an almost comically salient example of how humans are mesa optimizers from the perspective of natural selection...
Edit: Regarding “To me it seems that EOS is the most viable competitor”, this is something I changed my mind about (maybe within a few months of writing this comment—I don’t quite remember).
To me it seems that EOS is the most viable competitor. Their actual blockchain will go live this summer, but they already have test nets up and running that developers can use to develop applications (though I dunno how easy or hard it is to get started with the current level of documentation, etc). Some of the questions that determine how well EOS does or doesn't do are:
How developer-friendly will EOS be compared to Ethereum? This article argues that EOS will be a better choice for dapp developers (https://steemit.com/eos/@nadejde/eos-development-platform-first), with a bit of discussion for and against some of the article's points in the comments. I don't have experience with either myself, and I don't know much about what kind of third party code is / will be / can be made available for Ethereum to make it more developer friendly.
How much flexibility will EOS give for developers language-wise? My impression has been that lots of the popular languages we have today will be possible to use since they’ll be using WebAssembly. After googling WebAssembly a bit more I’m not as optimistic about this as I was, but still think this seems to be an argument in favour of EOS vs Ethereum: https://github.com/appcypher/awesome-wasm-langs, https://stackoverflow.com/questions/43540878/what-languages-can-be-compiled-to-web-assembly-or-wasm. I don’t know if languages that are well-supported by WebAssembly practically speaking will be easy to use to develop on EOS, or if C++ in the beginning or indefinitely will be the de facto choice for EOS developers, but I’m assuming that they chose WebAssembly because they intend for it to be possible to use several different languages, so that more developers can choose a language they already are comfortable with.
Practically speaking, when can developers start working on dapps on EOS vs on Cardano? EOS has a test-net ready for dapp developers to work on, and will release the main net this summer, though I don't know how easy or cumbersome it is to get started. Unless I'm mistaken, Cardano's work on smart contracts is at a pretty early stage, and no date has been set yet for when smart contract functionality will be available: https://www.reddit.com/r/cardano/comments/7q9yq0/when_will_cardano_have_smart_contracts/?
Will EOS transactions really be free for most users? This seems to be what they claim, and so far I haven’t come across people arguing that this isn’t the case.
Is it the case that regular users of Ethereum applications will always have to pay fees, even if these get really low? As I understand it, one possible solution would be to have dapps give the users money for gas. If this becomes affordable, will it for most dapps be practically doable to "hide" these fees from users so that it doesn't make the dapp less user-friendly? I don't see why it wouldn't be if gas-fees become low enough, at least for dapps where users have to register accounts, but my understanding is vague. Also, this seems to me like yet another thing that developers will have to think about, unless some easy standardized solution becomes available.
For the different levels of speed, scalability and low costs of transactions, will both EOS and Ethereum get there? And if both get there, who will get there first and how large will the time difference between them be? Take for example steemit.com, a website that runs on a blockchain with an architecture that is similar to the one EOS will be using. Will it become possible to have something like that run on Ethereum without having users wait a lot, pay high fees, and having the network get clogged? My impression is that if this becomes possible on Ethereum it will probably not be a few months after EOS, but years after. Is that impression wrong? I don’t have a full overview of the different upgrades Ethereum are planning and their expected timelines. In this video from September 2017 Vitalik seems to think that Ethereum probably will have the capacity to replace Visa in a couple of years, but with prototypes with lower security having that kind of capacity in about a year: https://youtu.be/WSN5BaCzsbo?t=10m42s. He also seems to think that running something like Starcraft on Ethereum will be possible at some point thanks to second-layer systems. Here are some comments regarding the capacity of EOS (the video moves on to other topics at 2:02): https://youtu.be/UC6RYwYPnpU?t=49s.
How will they compare in terms of funding for projects? I don’t know, though I know that EOS has a pretty big “war chest” of money that will be used to fund projects on their blockchain (https://cryptovest.com/news/breaking-blockone-ceo-brendan-blumer-announces-1-billion-in-capital-for-eos-projects/).
My own guess is that Ethereum will continue to be the number one smart contract platform, but that EOS will do better in terms of percentagewise growth in price from here. That's just the guess that feels most likely to me though, not something I expect will necessarily happen. I have read a significant amount and watched a significant amount of interviews and so on, but my understanding of this stuff is vague and incomplete.
I have the majority of my money in EOS at this moment, though I’m not sure if that’s smart of me or not, and am very much unsure about when I should pull out (now? in a month? this summer?). What makes me most unsure is the possibility of a crash of the whole crypto-market, which may be pretty likely.
Smart contracts thanks to Rootstock? Scaling and lower transaction fees thanks to the Lightning Network? More investments and usage due to first-mover advantage / brand awareness / network effects? These possibilities, which I don't rule out with my current admittedly shallow/vague understanding, make it seem plausible to me that Bitcoin could continue to be #1 even if Ethereum scales successfully.