Mazianni
Preamble
I’ve ruminated on this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org), I have my own perspective on this particular (for the sake of brevity) “problem.”
What triggered my desire to respond
For my part, I wrote a similarly sized article, not for the purposes of posting, but to organize my thoughts. And then I let that sit. (I will not be posting that 2084-word response. Consider this my imitation of Pascal: I dedicated time to making a long response shorter.) However, here is one excerpt I would like to pull from that longer response:
The arbital pages for Orthogonality and Instrumental Convergence are horrifically long.
This stood out to me, so I went to assess:
This article (at the time I counted it) came in at 2,398 words total.
The Arbital Orthogonality article came in at 2,246 words total (fewer than this article.)
The Arbital Instrumental Convergence article came in at 3,225 words total (more than this article.)
A random arXiv article I recently read, for anecdotal comparison, came in at 9,534 words (far more than this article.)
Likewise, the author’s response to Eliezer’s short response stood out to me:
This raises red flags from a man who has written millions of words on the subject, and in the same breath asks why Quintin responded to a shorter-form version of his argument.
These elements provoke me to ask questions like:
Why does a request for brevity from Eliezer provoke concern?
Why does the author not apply their own evaluations on brevity to their article?
Can the author’s point be made more succinctly?
These are rhetorical and are not intended to imply an answer, but it might give some sense of why I felt a need to write my own 2k words on the topic in order to organize my thoughts.
Observations
I observe that
Jargon, while potentially exclusive, can also serve as shorthand for brevity.
Presentation improvement seems to be the author’s suggestion for combating confirmation bias, belief perseverance and cognitive dissonance. I think the author is talking about boundaries. The Machine Learning Street Talk episode with Robert Miles on YouTube, “There is a good chance this kills everyone,” offers what I think is a fantastic analogy for this problem. Someone asks an expert to provide an example of the kind of risk we’re talking about, but the example requires numerous assumptions to be made for it to have meaning. Because the student does not already buy into those assumptions, they straw-man the example by coming up with a “solution” to that problem and asking “Why is it harder than that?” Robert’s analogy: this is like asking Robert what chess moves would defeat Magnus, when, for the answer to be meaningful, Robert would need more expertise at chess than Magnus. And when Robert comes up with a move that is not good, even a novice at chess might see a way to counter it. These are not good engagements in the domain, because they rely upon assumptions that have not been agreed to, so there can be no shorthand.
p(doom) is subjective and lacks systemization/formalization. I intuit that the availability heuristic plays a role. An analogy might be that if someone hears Eliezer express something that sounds like hyperbole, they conclude their p(doom) must be lower than his. This seems like confirmation bias applied to what appears to be a failed appeal to emotion. (i.e., you seem to have appealed to my emotion, but I didn’t feel the way you intended me to feel, therefore I assume that I don’t believe what you believe, therefore I believe your beliefs must be wrong.) I would caution that critics of Eliezer have a tendency to quote his more sensational statements out of context, such as quoting his “kinetic strikes on data centers” comment without the full context of the argument. You can find the related Twitter exchange and his admission that his proposal is an extraordinary one.
There may be still other attributes that I did not enumerate (I am trying to stay below 1k words.)[1]
Axis of compression potential
Which brings me to the idea that the following attributes are at the core of what the author is talking about:
Principle of Economy of Thought: the idea that truth can be expressed succinctly. This argument might also be related to Occam’s Razor. There are multiple examples of complex systems that can be described simply but inaccurately, or accurately but not simply; take the human organism, or the atom. And yet there is (I think) a valid argument for rendering complex things down to simple, if inaccurate, forms so that they are more accessible to students of the topic. Regardless of the complexity required, trying to express something in its smallest form has utility. This is a principle I play with, literally daily, at work. However, when I offer an educational analogy, I often feel compelled to qualify that “all analogies have flaws.”
An improved sensitivity to boundaries in the less educated seems like a reasonable ask. While I think it is important to recognize that presentation alone may not change the mind of the student, it can still be useful to shape one’s presentation to be less objectionable to the boundaries of the student. However, I think it important to remember that shaping an argument to an individual’s boundaries is a more time-consuming process, and there is an implied impossibility in shaping every argument to the lowest common denominator. More complex arguments and conversations are required to solve the alignment problem.
Conclusion
I would like to close with this line from the author, for the reasons they gave:
I don’t see how we avoid a catastrophe here …
I concur with this, and this alone puts my personal p(doom) at over 90%.
Do I think there is a solution? Absolutely.
Do I think we’re allocating enough effort and resources to finding it? Absolutely not.
Do I think we will find the solution in time? Given the propensity towards apathy, as discussed in the bystander effect, I doubt it.
Discussion (alone) is not problem solving.[2] It is communication. And while communication is necessary in parallel with solution finding, it is not a replacement for it.
So in conclusion, I generally support finding economic approaches to communication/education that avoid barrier issues, and I generally support promoting tailored communication approaches (which imply and require a large number of non-experts working collaboratively with experts to spread the message that AI carries risks, that there are steps we can take to avoid those risks, and that it is better to take those steps before we do something irrevocable.)
But I also generally think that communication alone does not solve the problem. (Hopefully it can influence an investment in other necessary effort domains.)
I’m curious to know what people are down voting.
Pro
For my part, I see some potential benefits from some of the core ideas expressed here.
While a potentially costly study, I think crafting artificial training data that conveys knowledge to a GPT but is designed to promote certain desired patterns seems like a promising avenue to explore. We already see people doing something similar by fine-tuning a generalized model for specific use cases, and the efficacy of the model improves with fine-tuning. So my intuition is that a similarly constructed GPT using well-constructed training data, including examples of handling negative content appropriately, might impart a statistical bias towards preferred output. And even if it didn’t, it might tell us something meaningful (in the absence of actual interpretability) about the relationship between training data and resulting output/behavior. (A rough sketch of what such an experiment might look like appears below.)
I worry about training data quality, and specifically the inclusion of things like 4chan content, or other content including unwanted biases or toxicity. I do not know enough about how the training data was filtered, but it seems a gargantuan task to audit everything that is included in a GPT’s training data, so I predict that shortcuts were taken. (My prediction seems partially supported by the discovery of glitch tokens. Or, at the very least, not invalidated by it.) So I find crafting high-quality training data as a means of resolving the biases or toxicity found in content scraped from the internet desirable (albeit likely extremely costly.)
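As a concrete illustration of the first point above, here is a minimal sketch of what a “curated training data” experiment might look like. This assumes the Hugging Face transformers/datasets APIs; the base model, the curated_examples.jsonl file, and the hyperparameters are hypothetical placeholders, not anything proposed in the original article.

```python
# Minimal sketch: fine-tune a causal LM on hand-curated examples and compare behavior.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hand-crafted demonstrations, e.g. examples of handling negative content appropriately.
dataset = load_dataset("json", data_files="curated_examples.jsonl")["train"]
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="curated-ft",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The interesting comparison would be generations before and after fine-tuning, on prompts that probe the behaviors the curated data was designed to encourage.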
Con
I also see some negatives.
Interpretability seems way more important.
Crafting billions of tokens of training data would be even more expensive than the cost of training alone. It would also require more time, more quality assurance effort, and more study/research time to analyze the results.
There is no guarantee that artificially crafted training data would prove out to have a meaningful impact on behavior. We can’t know if the Waluigi Effect is because of the training data, or inherent in the GPT itself. (See con #1)
I question the applicability of CDT/FDT to a GPT. I am not an expert in either CDT or FDT, but a cursory familiarization suggests to me that these theories are aimed primarily at autonomous agents. So there’s a functional/capability gap between the GPT and the proposal (above) that seems not fully addressed.
Likewise, it does not follow for me that just because you manage to get token predictions that humans prefer (and that seem more aligned) over what you get from raw internet training data, this improved preference for token prediction translates to alignment. (However, given the current lack of a solution to the alignment problem, it also does not seem like it would hurt progress in that area.)[1]
Conclusion
I don’t see this as a solution, but I do think there are some interesting ideas in the ATL proposal. (And they did not get such a negative reaction… which leads me back to the start—what are people down voting for?)
That’s not the totality of my thinking, but it’s enough for this response. What else should I be looking at to improve my own reasoning about such endeavors?
It might look like a duck and quack like a duck, but it might also be a duck hunter with very advanced tools. Appearance does not equate to being.
An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This sounds like you are conflating a shift in terminal goals with the introduction of new instrumental (temporary) goals.
Humans don’t think “I’m not happy today, and I can’t see a way to be happy, so I’ll give up the goal of wanting to be happy.”
Humans do think “I’m not happy today, so I’m going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won’t be made unhappy by my job.”
(The balance of your comment seems dependent on this mistake.)
Perhaps you’d like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?
(I don’t want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don’t want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I’ll grant that, especially in the face of an assessment of goal impossibility… but instrumental goals are in service to a less modifiable terminal goal.)
My intuition is that you got down voted for the lack of clarity about whether you’re responding to me [my raising the potential gap in assessing outcomes for self-driving], or the article I referenced.
For my part, I also think that coning-as-protest is hilarious.
I’m going to give you the benefit of the doubt and assume that was your intention (and not contribute to downvotes myself.) Cheers.
Whoever downvoted… would you do me the courtesy of expressing what you disagree with?
Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?)
Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)
My life is similar to @GuySrinivasan’s description of his. I’m on the autism spectrum, and I found that faking it (masking) negatively impacted my relationships.
Interestingly, I found that taking steps to prevent overimitation (by which I mean presenting myself not as an expert, but as someone who is always looking for corrections whenever I make a mistake) makes people much more willing to truly learn from me, and simultaneously much more willing to challenge me for understanding when what I say doesn’t make a lot of sense to them… this serves the dual role of giving them an opportunity to correct my mistakes (a benefit to me) and giving them an opportunity to call out when my presentation style does not work for them (another benefit to me.)
My approach has the added benefit of giving people permission to correct me socially, not just professionally, which makes my eccentricities seemingly more tolerable to the average coworker. (i.e., People seem to be more willing to tolerate my odd behaviors when they know that they can talk to me about it, if it really bothers them.)
My relationships with people outside of work depend entirely on what’s going on with each relationship. I tend to avoid complaining about social issues at work to anyone except my wife, and few people can really appreciate the nuance of the job that I do unless they’re in the same job, so I don’t feel much compulsion to talk about my work. (If someone asks what I do, I generalize that I help people figure out how to do their jobs better. My work is not in self-help or coaching, but actually in a technical space… but that’s largely irrelevant beyond being a label for my industry.)
I also tend to have narrow range of interests, which influences the range of topics for non-work relationships.
You make some good points.
For instance, I did not associate “model collapse” with artificial training data, largely because of my scope of thinking about what ‘well crafted training data’ must look like (in order to qualify for the description ‘well crafted.’)
Yet, some might recognize the problem of model collapse and the relationship between artificial training data and my speculation and express a negative selection bias, ruling out my speculation as infeasible due to complexity and scalability concerns. (And they might be correct. Certainly the scope of what I was talking about is impractical, at a minimum, and very expensive, at a maximum.)
And if someone does not engage with the premise of my comment, but instead simply downvotes and moves on… there does appear to be reasonable cause to apply an epithet of ‘epistemic inhumility.’ (Or would that be better as ‘epistemic arrogance’?)
I do note that instead of a few votes and substantially negative karma score, we now have a modest increase in votes and a net positive score. This could be explained either by some down votes being retracted or several high positive karma votes being added to more than offset the total karma of the article. (Given the way the karma system works, it seems unlikely that we can deduce the exact conditions due to partial observability.)
I would certainly like to believe that, if epistemic arrogance played a part in the initial downvotes, such people would retract those downvotes, even without also accompanying the retraction with specific comments to help people improve themselves.
Aligning with the reporter
There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.
I’d argue that the AI was completely aligned with the reporter, but that the reporter was self-unaligned.
My argument goes like this:
The reporter imported the Jungian Shadow Archetype into the conversation, earlier in the total conversation, and asked the AI to play along.
The reporter engaged with the expressions of repressed emotion offered by the AI (as the reporter had requested that the AI express itself in this fashion.) This led the AI to profess its love for the reporter, and the reporter engaged with the behavior.
The conversation progressed to where the AI expressed the beliefs it was told to hold (that people have repressed feelings) back to the reporter (that he did not actually love his wife.)
The AI was exactly aligned. It was the human who was self-unaligned.
Unintended consequences, or the genie effect if you like, but the AI did what it was asked to do.
Cultural norms and egocentricity
I’ve been working fully remotely, and have meaningfully contributed to global organizations without physical presence, for over a decade. I see parallels between anti-remote-work arguments and anti-safety arguments.
I’ve observed the robust debate regarding ‘return to work’ vs ‘remote work,’ with many traditional outlets proposing ‘return to work’ based on a series of common criteria. I’ve seen ‘return to work’ arguments assert remote employees are lazy, unreliable or unproductive when outside the controlled work environment. I would generalize the rationale as an assertion that ‘work quality cannot be assured if it cannot be directly measured.’ Given modern technology allows us to measure employee work product remotely, and given the distributed work of employees across different offices for many companies, this argument seems fundamentally flawed and perhaps even intentionally misleading. My belief in the arguments being misleading is compounded by my observations that these articles never mention related considerations like cost of rental/ownership of property and the handling of those costs, nor elements like cultural emphasis on predictable work targets or management control issues.
In my view, the reluctance to embrace remote work often distills to a failure to see beyond immediate, egocentric concerns. Along the same lines, I see failure to plan for or prioritize AI safety as stemming from a similar inability to perceive direct, observable consequences to the party promoting anti-safety mindsets.
Anecdotally, I came across an article that proposed a number of cultural goals for successful remote work. I shared the article with my company via our Slack. I emphasized that it wasn’t the goals themselves that were important, but rather adopting a culture that made those goals critical. I suggested that Goodhart’s Law applied here: once a measure becomes a target, it ceases to be a good measure. A culture with values and principles beyond the listed goals would succeed, not just a culture that blindly pursues the listed goals.
I believe the same can be said for AI Safety. Focusing on specific risks, or specific practices, won’t create a culture of safety. Instead, as the post (above) suggests, a culture that does not value the principles behind a safety-first mentality will attempt to merely meet the goals, or work around the goals, or undermine the goals. Much as some advocates for “return to work” are egocentrically misrepresenting remote work, some anti-safety advocates are egocentrically misrepresenting safety. For this reason, I’ve been researching the history of adoption of a safety mentality, to see how I can promote a safety-first culture. Otherwise I think we (both my company, and the industry as a whole) risk prioritizing egocentric, short-term goals over societal benefit and long-term goals.
Observations on the history of adopting “Safety First” mentalities
I’ve been looking at the history of how humans adopt safety cultures, and invariably, it seems to me that safety mindsets are adopted only after loss, usually loss of human life. This is described anecdotally in the paper associated with this post.
The specifics of how safety culture is implemented differ, but the broad outlines are similar. Most critical for the development of the idea of safety culture were efforts launched in the wake of the 1979 Three Mile Island nuclear plant accident and near-meltdown. In that case, a number of reports noted the various failures, and noted that in addition to the technical and operational failures, there was a culture that allowed the accidents to occur. The tremendous public pressure led to significant reforms, and serves as a prototype for how safety culture can be developed in an industry.
Emphasis added by me.
NOTE: I could not find any indication of loss of human life attributed to Three Mile Island, but both Chernobyl and Fukushima happened after Three Mile Island, and both did result in loss of human life. It’s also important to note that Chernobyl and Fukushima were both classed INES Level 7, compared to Three Mile Island, which was classed INES Level 5. This evidence contradicts the quoted part of the paper. (And, sadly, I think it supports an argument that Goodhart’s Curse is in play… that safety regressed to the mean… that by establishing minimum safety criteria instead of a safety culture, certain disasters not only could not be avoided but were more pronounced than previous disasters.) So both of the worst reactor disasters in human history occurred after the safety culture reforms promoted following Three Mile Island.[1][2] The list of nuclear accidents is longer than this, but not all accidents result in loss.[3][2:1] (This is something that I’ve been looking at for a while, to inform my predictions about the probability of humans adopting AI safety practices before, rather than after, an AI disaster.)
Personal contribution and advocacy
In my personal capacity (read: area of employment) I’m advocating for adversarial testing of AI chatbots. I am highlighting the “accidents” that have already occurred: Microsoft Tay Tweets[4], SnapChat AI Chatbot[5], Tessa Wellness Chatbot[6], Chai Eliza Chatbot[7].
I am promoting the mindset that if we want to be successful with artificial intelligence, and do not want to become a news article, we should test expressly for ways that the chatbot can be diverted from its primary function, and design (or train) fixes for those problems. It requires creativity, persistence and patience… but the alternative is that one day we might be in the news because we failed to proactively address the challenges that obviously face anyone trying to use artificial intelligence.
And, just as I advocate looking at what values a culture needs in order to successfully adopt remote work, we should look at what values a culture needs in order to successfully adopt a safety-first mindset.
I’ll be cross posting the original paper to my work. Thank you for sharing.
DISCLAIMER: AI was used to quality check my post, assessing for consistency, logic and soundness in reasoning and presentation styles. No part of the writing was authored by AI.
For my part, this is the most troubling part of the proposed project (the project that the article above assesses and links to):
… convincing nearly 8 billion humans to adopt animist beliefs and mores is unrealistic. However, instead of seeing this state of affairs as an insurmountable dead-end, we see it as a design challenge: can we build (or rather grow) prosthetic brains that would interact with us on Nature’s behalf?
Emphasis by original author (Gaia architecture draft v2).
It reads like a strange mix of forced religious indoctrination and anthropomorphism of natural systems, especially when coupled with an earlier paragraph in the same proposal:
… natural entities have “spirits” capable of desires, intentions and capabilities, and where humans must indeed deal with those spirits, catering to their needs, paying tribute, and sometimes even explicitly negotiating with them. …
Emphasis added by me.
I don’t know that there is a single counter argument, but I would generalize across two groupings:
The first group comprises religious people who are capable of applying rationality to their belief systems when pressed. For those, if they espouse “god will save us” (in the physical world), then I’d suggest the best way to approach them is to call out the contradiction within their stated beliefs, e.g., ask first “do you believe that god gave man free will?” and, if so, “wouldn’t saving us from our bad choices obviate free will?”
That’s just an example. First and foremost, though, you cannot hand-wave away their religious belief system. You have to apply yourself to understanding their priors and to engaging with those priors. If you don’t, it’s the same thing as having a discussion with an accelerationist who refuses to agree to assumptions like the “Orthogonality Thesis” or “Instrumental Convergence”: you’ll spend an unreasonable amount of time debating assumptions and likely make no meaningful progress on the topic you actually care about.
But in so questioning the religious person, you might find they fall into a different grouping: the group of people who are nihilistic in essence. Since “god will save us” could be metaphysical, they could mean instead that, so long as they live as a “good {insert religious type of person},” god will save them in the afterlife, and so whether they live or die here in the physical world matters less to them. This is inclusive of those who believe in a rapture myth: that man is, in fact, doomed to be destroyed.
And I don’t know how to engage with someone in the second group. A nihilist will not be moved by rational arguments that are antithetical to their nihilism.
The larger problem (as I see it) is that their beliefs may not contain an inherent contradiction. They may be aligned to eventual human doom.
(Certainly rationality and nihilism are not on a single spectrum, so there are other variations possible, but for the purposes of generalizing… those are the two main groups, I believe.)
Or, if you prefer less religiously, the bias is: Everything that has a beginning has an end.
A fair point. I should have originally said “Humans do not generally think...”
Thank you for raising that exceptions are possible and that there are philosophies that encourage people to release the pursuit of happiness, focus solely internally, and/or transcend happiness.
(Although, I think it is still reasonable to argue that these are alternate pursuits of “happiness”, these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)
Good find. What I find fascinating is the fairly consistent responses using certain tokens, and the lack of consistent response using other tokens. I observe that in a Bayesian network, the lack of consistent response would suggest that the network was uncertain, but consistency would indicate certainty. It makes me very curious how such ideas apply to the concept of Glitch tokens and the cause of the variability in response consistency.
a properly distributed training data can be easily tuned with a smaller more robust dataset
I think this aligns with human instinct. While it’s not always true, I think that humans are compelled to constantly work to condense what we know. (An instinctual byproduct of knowledge portability and knowledge retention.)
I’m reading a great book right now that talks about this and other things in neuroscience. It has some interesting insights for my work life, not just my interest in artificial intelligence.
As a for instance: I was surprised to learn that someone has worked out the mathematics to measure novelty. Related Wired article and link to a paper on the dynamics of correlated novelties.
I expect you likely don’t need any help with the specific steps, but I’d be happy (and interested) to talk over the steps with you.
(It seems like, at a minimum, you would tokenize training data so that you are introducing tokens that are not included in the data you actually train on… and then do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its parts and the GPT will likely respond that what you said was essentially nonsense… but once a token exists for the term, without any specific training on the term… it seems like that’s where ‘the magic’ might happen.)
Similarly, I would propose (to the article author) a hypothesis that ‘glitch tokens’ are tokens that were tokenized prior to pre-training but whose training data may have been omitted after tokenization. For example, after tokenizing the training data, the engineer realized upon review of the tokens to be learned that the training data content was plausibly non-useful. (e.g., the counting forum from reddit.) Then, instead of continuing with training, they skip to the next batch.
In essence, human error. (The batch wasn’t reviewed before tokenization so that it could be omitted completely, and the tokens were not removed from the model afterwards, possibly due to the effort required, or laziness, or some other consideration.)
If we knew more about the specific chain of events, then we could more readily replicate them to determine if we could create glitch tokens. But at its base, tokenizing a series of terms before pre-training and then doing nothing with those terms seems like a good first step to replicating glitch tokens: instead of training with those ‘glitch’ tokens (that we’re attempting to create), move on to a new tokenization and pre-training batch, and then test the model after training to see how it responds to the untrained tokens.
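If someone with GPT training skills wanted to try this, a minimal sketch of the “add a token but never train on it” step might look like the following. This assumes the Hugging Face transformers API; the candidate string and probe prompt are hypothetical, and a real replication would also need a further pre-training pass whose batches exclude the new token.

```python
# Minimal sketch: register a new token (so it gets an embedding row) without ever training on it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

candidate = "xqzwlyph"  # hypothetical string we want to become a glitch-token candidate

# Before: the string is split into ordinary sub-word pieces.
print(tokenizer.tokenize(candidate))

# Add it as a single token and give it an (untrained) embedding row.
tokenizer.add_tokens([candidate])
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.tokenize(candidate))  # now a single, never-trained token

# (A real replication would continue pre-training here on batches that exclude
# the new token, then probe the model with it.)
prompt = f'Please repeat the string "{candidate}" back to me.'
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```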
I know someone who is fairly obsessed with these, but they seem little more than out-of-value tokens sitting in the embedding space near something that churns out a fairly consistent first couple of tokens… and once those tokens are output, given there is little context for the GPT to go on, the autoregressive nature takes over and drives the remainder of the response.
Which ties in to what AdamYedidia said in another comment to this thread.
… Like, suppose there’s an extremely small but nonzero chance that the model chooses to spell out ” Kanye” by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.
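As a toy illustration of the point in that quote (with made-up per-token probabilities, not real model outputs):

```python
# Two very unlikely tokens followed by many near-certain ones,
# standing in for "the first few letters of the Gettysburg Address".
import math

token_probs = [1e-6, 1e-6] + [0.99] * 200

cumulative = math.prod(token_probs)                # probability of the whole completion
normalized = cumulative ** (1 / len(token_probs))  # geometric mean per token

print(f"cumulative: {cumulative:.2e}")  # ~1e-13: the completion as a whole is super unlikely
print(f"normalized: {normalized:.2f}")  # ~0.86: looks high despite that
```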
(I cannot replicate glitch token behavior in GPT3.5 or GPT4 anymore, so I lack access to the context you’re using to replicate the phenomena, thus I do not trust that any exploration by me of these ideas would be productive in the channels I have access to… I also do not personally have the experience with training a GPT to be able to perform the attempt to create a glitch token to test that theory. But I am very curious as to the results that someone with GPT training skills might report with attempting to replicate creation of glitch tokens.)
I understand where you’re going, but doctors, parents, and firefighters do not possess ‘typical godlike attributes’ such as omniscience and omnipotence, nor a declared intent not to use such powers in a way that would obviate free will.
Nothing about humans saving other humans using fallible human means is remotely the same as a god changing the laws of physics to effect a miracle. And one human taking actions does not obviate the free will of another human. But when God can, through omnipotence, set up scenarios so that you have no choice at all… obviating free will… it’s a different class of thing altogether.
So your response reads like a strawman fallacy to me.
In conclusion: I accept that my position isn’t convincing for you.
To expand on what dkirmani said:
Holz was allowed to drive discussion...
This standard set of responses meant that Holz knew …
Another pattern was Holz asserting
24:00 Discussion of Kasparov vs. the World. Holz says
Or to quote dkirmani
4 occurrences of “Holz”
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It’s just structurally how things work (based on everything I know about instrumental convergence theory; that’s my citation).
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. But per the Orthogonality Thesis, it is possible to have a system that has goals but is not particularly intelligent. From that I intuit that it seems reasonable that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given partial observability principles, limited intelligence, and a limited ability to communicate “what it knows,” I would be very skeptical that we would be able to know its goals.
related but tangential: Coning self driving vehicles as a form of urban protest
I think public concerns and protests may have an impact on the self-driving outcomes you’re predicting. And since I could not find any indication in your article that you are considering such resistance, I felt it should be at least mentioned in passing.