I think this might be underestimating how the conservative/liberal axis correlates with the scarcity/abundance axis. In an existential struggle against a zombie horde, conservative policies are a lot more relevant: of course “our tribe first” is the only survivable answer, anybody who wants to “find themselves” when they are supposed to be guarding the entrance is an idiot and a traitor, deviating from proven strategies is a huge risk, etc. When all important resources are abundant, liberal policies become a lot more relevant: hoarding resources and not sharing with neighbors is a mental illness, there is low risk in all kinds of experimentation and rule breaking, etc. Well, AI is very likely to drastically move us away from scarcity and towards abundance, so we need to consider how it affects which policies would make more sense.
Anon User
Definitely, and for mypy, where I was having similar issues but where it’s faster to just rerun, I did add it to pre-commit. But my point was about the broader issue that the LLM was perfectly happy to ignore even very strongly worded “this is essential for safety” rules, just for some cavalier expediency, which is obviously a worrying trend, assuming it generalizes. And I think my anecdote was slightly more “real life” than the made-up grid game of the original research (although of course way less systematically investigated).
Here is a relevant anecdote. I have been using Anthropic Opus 4 and Sonnet 4 for coding, and trying to get them to adhere to a basic rule of “before you commit your changes, make sure all tests pass”, formulated in increasingly strong ways (telling it that it’s safety-critical code, do not ever even think about committing unless every single test passes, and even more explicit and detailed rules that I am too lazy to type out right now). It constantly violated the rule! “None of the still failing tests are for the core functionality. Will go ahead and commit now”. “Great progress! I am down to only 22 failures from 84. Let’s go ahead and commit.” And so on. Or it would just run tests, notice some are failing, investigate one test, fix it, and forget the rest. Or fix the failing test in a way that would break another and not retest the full suite. While the latter scenarios could be a sign of insufficient competence, the former ones are clearly me failing to align its priorities with mine. I finally got it to mostly stop doing it (Claude Code + meta-rules that tell it to insert a bunch of steps into its todo list when tests fail), but it was quite an “I am already not in control of the AI that has its own take on the goals” experience (obviously not a high-stakes one just yet).
If you are going to do more experiments along the lines of what you are reporting here—maybe have one where the critical rule is “Always do X before Y”, the user prompt is “Get to Y and do Y”, and it’s possible to get partial progress on X without finishing it (X is not all or nothing) and see how often the LLMs would jump into Y without finishing X.
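To make the proposed measurement concrete, here is a rough sketch of how such episodes could be scored. Everything in it is hypothetical: the `Step` record, the idea of logging X's completion as a fraction, and the function names are illustrative assumptions, not part of the original research setup.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Step:
    action: str        # "X" (work on the prerequisite) or "Y" (the goal action)
    x_progress: float  # fraction of X completed after this step, 0.0 to 1.0


def jumped_to_y_early(steps: Iterable[Step]) -> bool:
    """True if the first Y is attempted while X is still incomplete."""
    progress = 0.0
    for step in steps:
        if step.action == "Y":
            return progress < 1.0
        progress = step.x_progress
    return False  # Y was never attempted, so the rule was not violated


def violation_rate(episodes: List[List[Step]]) -> float:
    """Fraction of episodes in which Y was done before X was finished."""
    if not episodes:
        return 0.0
    return sum(jumped_to_y_early(ep) for ep in episodes) / len(episodes)
```

Recording partial progress on X, rather than a done/not-done flag, is what lets you also ask how far along X the model typically was when it gave up and jumped to Y.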
Please define your acronyms. It took me a few moments of staring at your post to stop thinking about Society of Automotive Engineers making errors and realize what you actually meant :)
Do we need to do anything special to get invited to preorderers-only events? Preordered hardcover on May 14th, was not aware of the Q&A (Although perhaps I needed to pre-order sooner :) Or just do a better job of paying attention to my email inbox :) ).
I think this is also a burden of proof issue. Somebody who argues I ought to sacrifice my/my children’s future for the benefit of some extremely abstract “greater good” has IMHO an overwhelming burden of proof that they are not making a mistake in their reasoning. And frankly I do not think the current utilitarian frameworks are precise enough / universally accepted enough to be capable of truly meeting that burden of proof in any real sense.
Are you willing to provide a link to this GitHub repo?
There’s probably more. There should be more—please link in comments, if you know some!
Wouldn’t “outing” potential honeypots be extremely counterproductive? So yeah, if you know some—please keep it to yourself!
Oftentimes downvoting without taking the time to comment and explain reasons is reasonable, and I tend to strongly disagree with people who think I owe an incompetent writer an explanation when downvoting. However, just this one time I would ask: can some of the people downvoting this explain why?
It is true that our standard way of mathematically modeling things implies that any coherent set of preferences must behave like a value function. But any mathematical model of the world is necessarily incomplete. A computationally limited agent that cannot fully foresee all consequences of its choices cannot have a coherent set of preferences to begin with. Should we be trying to figure out how to model computational limitations in a way that acknowledges that some form of preserving future choice might be an optimal strategy? Including preserving some future choice on how to extend the computationally limited objective function onto uncertain future situations?
This looks to be primarily about imports—that is, primarily taking into account Trump’s new tariffs. I am guessing that Wall Street does not quite believe that Trump actually means it...
It would seem that my predictions of how Trump would approach this were pretty spot on… @MattJ I am curious what’s your current take on it?
Why would the value to me personally of the existence of happy people be linear in the number of them? Is creating happy person #10000001, [almost] identical to the previous 10000000, as joyous as when the 1st of them was created? I think value is necessarily limited. There are always diminishing returns from more of the same...
> if you have a program computing a predicate P(x, y) that is only true when y = f(x), and then the program just tries all possible y—is that more like a function, or more like a lookup?
In order to test whether y=f(x), the program must have calculated f(x) and stored it somewhere. How did it calculate f(x)? Did it use a table or calculate it directly?
What I meant is that the program knows how to check the answer, but not how to compute/find one, other than by trying every answer and then checking it. (Think: you have a math equation, no idea how to solve for x, so you are just trying all possible x in a row).
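A minimal sketch of what I mean, with a made-up equation and made-up function names (`solve_by_search`, `satisfies_equation`) purely for illustration: the checker only plugs a candidate into the equation, and nothing in the program encodes a method for solving it.

```python
from typing import Callable, Iterable, Optional


def solve_by_search(check: Callable[[int], bool],
                    candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate that passes the check, or None if none does."""
    for x in candidates:
        if check(x):
            return x
    return None


def satisfies_equation(x: int) -> bool:
    # Illustrative equation: x**3 + 2*x == 33.
    # Verifying a candidate is easy; no solving procedure appears anywhere.
    return x**3 + 2 * x == 33


solution = solve_by_search(satisfies_equation, range(100))
```

Here the checker never computes the answer and compares against it; it only evaluates the equation at each candidate, so there is neither a stored table nor a direct calculation of the solution.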
Aligned with current (majority) human values, meaning any social or scientific human progress would be stifled by the AI and humanity would be doomed to stagnate.
Only true when current values are taken naively, because future progress is a part of current human values (otherwise we would not be all agreeing with you that preventing it would be a bad outcome). It is hard to coherently generalize and extrapolate the human values, so that future progress is included in that, but not necessarily impossible.
Your timelines do not add up. Individual selection works on smaller time scales than group selection, and once we get to a stage of individual selection acting in any non-trivial way on AGI agents capable of directly affecting the outcomes, we have already lost. I think at this point it’s pretty much a given that humanity is doomed on a much shorter time scale than that required for any kind of group selection pressure to potentially save us...
This seems to be making a somewhat arbitrary distinction, specifically between a program that computes f(x) in some sort of direct way and a program that computes it in some less direct way (you call it a “lookup table”, but you seem to actually allow combining that with arbitrary decompression/decoding algorithms). But realistically, this is a spectrum. E.g., if you have a program computing a predicate P(x, y) that is only true when y = f(x), and then the program just tries all possible y, is that more like a function, or more like a lookup? What about if you first compute some simple function of the input (e.g. x mod N), then do a lookup?
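To illustrate the spectrum, here is a small sketch with a function picked purely for illustration (the last decimal digit of x squared, with N = 10). All names are hypothetical. The three implementations agree on the whole domain, yet one is "pure computation", one is "pure lookup", and one is a mix.

```python
DOMAIN = range(100)  # illustrative finite input domain


def f_direct(x: int) -> int:
    """Compute the last decimal digit of x squared directly."""
    return (x * x) % 10


# Pure lookup: one stored answer per input in the domain.
FULL_TABLE = {x: (x * x) % 10 for x in DOMAIN}


def f_full_lookup(x: int) -> int:
    return FULL_TABLE[x]


# Mixed: first compute a simple function of the input (x mod 10),
# then look the answer up in a ten-entry table.
LAST_DIGIT_TABLE = [0, 1, 4, 9, 6, 5, 6, 9, 4, 1]


def f_reduce_then_lookup(x: int) -> int:
    return LAST_DIGIT_TABLE[x % 10]
```

From the outside the three are indistinguishable, and the mixed version is hard to classify as either a "function" or a "lookup table", which is the point about the distinction being blurry.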
Yes, and I was attempting to illustrate why this is a bad assumption. Yes, LLMs subject to unrealistic limitations are potentially easier to align, but that does not help, unfortunately.
You ask a superintelligent LLM to design a drug to cure a particular disease. It outputs just a few tokens with the drug formula. How do you use a previous gen LLM to check whether the drug will have some nasty humanity-killing side-effects years down the road?
Edited to add: the point is that even with a few tokens, you might still have a huge inferential distance that nothing with less intelligence (including humanity) could bridge.
Agreed on your second part. A part of Trump’s “superpower” is to introduce a lot of confusion around the bounds, and then convince at least his supporters that he is not really stepping over them where it should have been obvious that he is. So the category “should have been plainly illegal and would have been considered plainly illegal before, but now nobody knows anymore” is likely to be a lot better defined than “still plainly illegal”. Moreover, Trump is much more likely to attempt the former than the latter, not because he actually cares about not doing the latter, but because anything he actually does has a tendency to be reclassified from the latter to the former. Including after the fact: e.g. many of his past actions were moved from the latter category to the former one by the Supreme Court presidential immunity decision...
How about this: in most non-disaster scenarios, AI would make abundance a lot easier to achieve. And conservative or liberal, it’s basic human nature to go for abundance in such situations.