My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.
David James
Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
Such a proof would be subject to its definition of “mechanical picking” and a sufficiently accurate physics model. (For example, would an electronically controllable key-shaped object with adjustable cut depths and pressure sensors qualify as a “pick”?)
I don’t dispute the value of formal proofs for safety. If accomplished, they move the conversation to “is the proof correct?” and “are we proving the right thing?”. Both are steps in the right direction, I think.
I’m curious if your argument, distilled, is: fewer people skilled in technical AI work is better? Such a claim must be examined closely! Think of it from a systems dynamics point of view. We must look at more than just one relationship. (I personally try to press people to share some kind of model that isn’t presented only in words.)
Another failure mode—perhaps the elephant in the room from a governance perspective—is national interests conflicting with humanity’s interests. For example, actions done in the national interest of the US may ratchet up international competition (instead of collaboration).
Even if one puts aside short-term political disagreements, what passes for serious analysis around US national security seems rather limited in terms of (a) time horizon and (b) risk mitigation. Examples abound: e.g. supporting a dictator until he becomes problematic, then switching support and/or spending massively to deal with the aftermath.

Even with sincere actors pursuing smart goals (such as long-term global stability), how can a nation with significant leadership shifts every 4 to 8 years hope to ensure a consistent long-term strategy? This question suggests that an instrumental goal for AI safety is supporting institutions and mechanisms that promote long-term governance.
See also Nomic, a game by Peter Suber where a move in the game is a proposal to change the rules of the game.
I grant that legalese increases the total page count, but I don’t think it necessarily changes the depth of the tree very much (by depth I mean how many documents refer back to other documents).
I’ve seen spaghetti towers written in very concise computer languages (such as Ruby) that nevertheless involve perhaps 50+ levels (in this context, a level is a function call).
If instead you keep deliberating until the balance of arguments supports your preferred conclusion, you’re almost guaranteed to be satisfied eventually!
Inspired by the above, I offer the pseudo code version...
loop {
    if assess(args, weights) > 1 {           // assess active arguments
        break;                               // preferred conclusion is "proved"
    } else {
        arg = biased_sample(remaining_args); // without replacement
        args.insert(arg);
        optimize(args, weights);             // mutates weights to maximize `assess(args, weights)`
    }
}
… the code above implements “the balance of arguments” as a function parameterized with weights. This allows for using an optimization process to reach one’s desired conclusion more quickly :)
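For concreteness, here is a runnable Python sketch of the same loop. All the particulars (how `assess` weighs arguments, the “biased sample” favoring supporting arguments, the crude weight “optimization”) are my own illustrative assumptions, not a claim about how motivated reasoning actually works:

```python
import random

def assess(args, weights):
    """Weighted 'balance of arguments': sum of weight * direction."""
    return sum(weights[name] * direction for name, direction in args.items())

def motivated_reasoning(candidate_args, threshold=1.0, rng=None):
    """Admit arguments (and re-tune weights) until the weighted balance
    crosses the threshold for the preferred conclusion.

    candidate_args: list of (name, direction) pairs, where direction is
    +1 for arguments favoring the preferred conclusion, -1 for against.
    """
    rng = rng or random.Random(0)
    args, weights = {}, {}
    remaining = list(candidate_args)
    while remaining:
        if assess(args, weights) > threshold:
            return True  # preferred conclusion is "proved"
        # biased sample (without replacement): prefer supporting arguments
        supporting = [a for a in remaining if a[1] > 0]
        pool = supporting or remaining
        name, direction = rng.choice(pool)
        remaining.remove((name, direction))
        args[name] = direction
        # "optimize": inflate weights on supporting args, discount the rest
        for a, d in args.items():
            weights[a] = 2.0 if d > 0 else 0.1
    return assess(args, weights) > threshold
```

With even one supporting argument available, the biased sampling and weight tuning reach the “proof” almost immediately; with only opposing arguments, the loop exhausts itself and fails.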
One failure mode could be a perception that the USG’s support of evals is “enough” for now. Under such a perception, some leaders might relax their efforts in promoting all approaches towards AI safety.
[Question] Inviting discussion of “Beat AI: A contest using philosophical concepts”
Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we’re talking about what they can do under typical auto-regressive conditions like chat.
I would appreciate if the community here could point me to research that agrees or disagrees with my claim and conclusions, below.
Claim: one pass through a transformer (of a given size) can only perform a finite number of reasoning steps.

Therefore: if we want an agent that can plan over an unbounded number of steps (e.g. one that does tree search), it will need some component that can perform an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem, which concerns approximating functions to arbitrary accuracy (given sufficient width), not executing an unbounded number of sequential computation steps.
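As a toy illustration of the claim (a deliberately simplified analogy, not a transformer): a computation built from a fixed number of composed steps can only make bounded progress, while an outer loop around the same step can run as long as needed.

```python
def step(n):
    """One 'reasoning step': decrement toward zero."""
    return max(n - 1, 0)

def fixed_depth(n, depth=4):
    """Analogue of a single forward pass: a fixed number of composed steps."""
    for _ in range(depth):
        n = step(n)
    return n

def unbounded(n):
    """Analogue of an outer iterative/recursive loop: repeat until done."""
    while n > 0:
        n = step(n)
    return n
```

Here `fixed_depth(3)` reaches 0, but `fixed_depth(10)` stalls at 6: problems “deeper” than the network cannot be finished in one pass, whereas `unbounded` handles any input.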
One important role of a criminal justice system is rehabilitation. Another, according to some, is retribution. Those in Azkaban suffer perhaps the most awful form of retribution. Dementation renders a person incapable of rehabilitation.
Consider this if-then argument:
If:
Justice is served without error (which is not true)
The only purpose for criminal justice is retribution
Then: Azkabanian punishment is rational.
Otherwise, assuming there are other ways to protect society from the person, it is irrational to dement people.
Speaking broadly, putting aside the fictional world of Azkaban, there is an argument that suggests retribution for its own sake is wrong. It is simple: inflicting suffering is wrong, all other things being equal. Retribution makes sense only to the extent it serves as a deterrent.
In my experience, programming languages with {static or strong} typing are considerably easier to refactor in comparison to languages with {weak or dynamic} typing.*
* The {static vs dynamic} and {strong vs weak} dimensions are sometimes blurred together, but this Stack Overflow Q&A unpacks the differences pretty well.
Thanks for your quick answer—you answered before I was even done revising my question. :) I can personally relate to Dan Luu’s examples.

This immediately makes me want to find potential solutions, but I won’t jump to any right now.

For now, I’ll just mention the ways in which Jacob Collier can explain music harmony at many levels.
I listened to part of “Processor clock speeds are not how fast AIs think”, but I was disappointed by the lack of a human narrator. I am not interested in machine readings; I would prefer to go read the article.
Thanks for the references; I’ll need some time to review them. In the meanwhile, I’ll make some quick responses.
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time...
In general, I agree, though there are notable exceptions for cases such as (not mutually exclusive):

- a closed-form solution is found (for example, where a time-based simulation can calculate some quantity at any arbitrary time step using the same amount of computation)
- approximate solutions using a fixed number of computation steps are viable
- a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
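The closed-form case can be made concrete with a small Python sketch (the compound-growth example is my own): simulating n steps and evaluating the closed form produce the same answer, but the closed form costs the same regardless of how large n is.

```python
def simulate(balance, rate, steps):
    """Step-by-step simulation: cost grows linearly with the number of steps."""
    for _ in range(steps):
        balance *= 1 + rate
    return balance

def closed_form(balance, rate, steps):
    """Closed form: the same quantity at any arbitrary step, in constant time."""
    return balance * (1 + rate) ** steps
```

A planner that only needs this quantity at some far-future step never has to iterate at all, which is why closed forms are an exception to the “unbounded steps need unbounded computation” rule.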
… so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice.
Yes, like I said above, I agree in general and see your point.
As I’m confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don’t know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
If one is a consequentialist (of some flavor), one can still construct a “desirability tree” over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don’t think that a more complex universe intrinsically has anything to do with alignment per se.
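To show how simple the algorithm is, here is a minimal sketch of a “desirability tree” evaluated as a recursive expected value (the tree, probabilities, and state values are hypothetical):

```python
def desirability(node, value):
    """Expected desirability of a state: terminal states are valued directly;
    non-terminal states average over possible futures, weighted by probability."""
    state, children = node
    if not children:
        return value(state)  # consequentialism: only the end state is scored
    return sum(p * desirability(child, value) for p, child in children)

# A hypothetical one-step future: 70% chance of a good outcome, 30% bad.
tree = ("now", [
    (0.7, ("good", [])),
    (0.3, ("bad", [])),
])
value = {"good": 1.0, "bad": -1.0}.get
```

Uncertainty about the future only adds branches; the evaluation rule itself stays this simple, which is the point of the claim above.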
Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don’t think this has anything to do with a harder to predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
(If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don’t think this is central to the argument here.)
If humans don’t really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics—perhaps not even in principle!), we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
Want to try out a thought experiment? Put that same particular human (who wanted to specify goals for an agent) in the financial scenario you mention. Then ask: how well would they do? Compare the quality of how the person would act versus how well the agent might act.
This raises related questions:
If the human doesn’t know what they would want, it doesn’t seem fair to blame the problem on alignment failure. In such a case, the problem would be a person’s lack of clarity.
Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between “what the human would have done” versus “what the AI agent would have done” may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
As I understand it, the argument above doesn’t account for the agent using the best information available at the time (in the future, relative to its goal specification).
I think there is some confusion around a key point. For alignment, do we need to define what an agent will do in all future scenarios? It depends what you mean.
In some sense, no, because in the future, the agent will have information we don’t have now.
In some sense, yes, because we want to know (to some degree) how the agent will act with future (unknown) information. Put another way, we want to guarantee that certain properties hold about its actions.
Let’s say we define an aligned agent as one doing what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent’s decision rule in a way that doesn’t rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity). See also the adjacent comment about a thermostat by eggsyntax.
Note: I’m saying “decision rule” intentionally, because even an individual human does not have a well-defined utility function.
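A thermostat-style decision rule makes this concrete (a hypothetical sketch, echoing eggsyntax’s example): it reacts only to the currently observed state, yet it guarantees a property about its actions at every future moment, with no long-range forecasting at all.

```python
def thermostat_rule(current_temp, setpoint=20.0, band=0.5):
    """Decision rule needing no prediction: it guarantees the property
    'heating runs iff the room is below the band' for all future states,
    whatever those states turn out to be."""
    if current_temp < setpoint - band:
        return "heat_on"
    if current_temp > setpoint + band:
        return "heat_off"
    return "hold"
```

The guarantee holds in any future because the rule quantifies over observations, not over predicted trajectories.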
Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it “detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion”—even though that gives a less precise account of the liver’s behaviour.
I’m not following why this is a less precise account of the liver’s behavior.
First, I encourage you to put credence in the current score of −40 and a moderator saying the post doesn’t meet LessWrong’s quality bar.
By LD you mean Lincoln-Douglas debate, right? If so, please continue reading.
Second, I’d like to put some additional ideas up for discussion and consideration—not debate—I don’t want to debate you, certainly not in LD style. If you care about truth-seeking, I suggest taking a hard and critical look at LD. To what degree is Lincoln-Douglas debate organized around truth-seeking? How often does a participant in an LD debate change their position based on new evidence? In my understanding, in practice, LD is quite uninterested in the notion of being “less wrong”. It seems to be about a particular kind of “rhetorical art” of fortifying one’s position as much as possible while attacking another’s. One might hope that somehow the LD debate process surfaces the truth. Maybe, in some cases. But generally speaking, I find it to be a woeful distortion of curious discussion and truth-seeking.
Here is an example of a systems dynamics diagram showing some of the key feedback loops I see. We could discuss various narratives around it and what to change (add, subtract, modify).
I find this style of thinking particularly constructive.
For any two nodes, you can see a visual relationship (or lack thereof) and ask “what influence do these have on each other and why?”.
The act of summarization cuts out chaff.
It is harder to fool yourself about the completeness of your analysis.
It is easier to get to core areas of confusion or disagreement with others.
Personally, I find verbal reasoning workable for “local” (pairwise) reasoning but quite constraining for systemic thinking.
If nothing else, I hope this example shows how easily key feedback loops get overlooked. How many of us claim to have… (a) some technical expertise in positive and negative feedback? (b) interest in Bayes nets? So why don’t we take the time to write out our diagrams? How can we do better?
P.S. There are major oversights in the diagram above, such as economic factors. This is not a limitation of the technique itself—it is a limitation of the space and effort I’ve put into it. I have many other such diagrams in the works.