Software engineer, 30+ years at Fortune 100 companies
Currently writing about the intersections of technology and philosophy
Also writes at Mind Prison
“detailed episode guides for an old show not being in the training data”
This is incorrect. I’m the author of this test. The intention was to show that data we can prove is in the training set is not correctly surfaced by the LLM. So in this case, when it hallucinates or says “I don’t know”, it should know.
As to model confidence, you might find what I recently wrote about “hallucinations are provably unsolvable” of interest, particularly the section “The Attempted Solutions to Solve AI Hallucinations” and the two papers linked within.
For something such as AI safety, having AI itself solve the problem is deeply problematic when we need provable outcomes without hidden behaviors. Such a circular system can easily mask problems. But all of this skips past the harder problem: algorithms cannot encode the concepts we wish to impart to the AI. Abstractly, we can design alignment systems that seem to make sense, but we can never implement them concretely.
As it is today, every LLM released is jailbroken within minutes. We can’t protect behaviors whose attack surface is anything that can be expressed in human language.
I elaborate on these points extensively here in AI Alignment: Why Solving It Is Impossible
Alignment is constructed on a paradox. It can’t be solved. It is a tower of nebulous and contradictory concepts. I don’t see how we can make any progress until an argument first addresses these issues. A paradox can’t be solved, but it can be invalidated if its premise is wrong.
“Alignment, which we cannot define, will be solved by rules on which none of us agree, based on values that exist in conflict, for a future technology that we do not know how to build, which we could never fully understand, must be provably perfect to prevent unpredictable and untestable scenarios for failure, of a machine whose entire purpose is to outsmart all of us and think of all possibilities that we did not.”
I elaborate in detail here—AI Alignment: Why Solving It Is Impossible
All sources are cited within here—https://www.mindprison.cc/p/the-cartesian-crisis
Any ideology can be contagious, as ideologies tend to ride on top of social bonds. However, your premise for arriving at a tolerant utopia requires that a single ideology become dominant over all other competing ideologies.
While the ideals and principles of tolerance can certainly spread, they will encounter other ideologies in opposition. Unfortunately, this brings us right back to Popper’s paradox. Tolerance would need to win against other, intolerant ideologies. Tolerance could only maintain its status if the other ideologies simply submit or voluntarily disband of their own accord.
What could cause intolerant ideologies to voluntarily disband? The members would need to come to understand that their needs are better met by the tolerant ideology. However, even if that is the case, ideologies are built on social structure and status. Typically, individuals are not reasoning on a level devoid of significant bias toward their tribe. They are simply exhibiting whatever behavior builds stronger bonds and social status within their group or ideology.
Even if we could somehow arrive at the point of utopian tolerance, we likely face another obstacle: hedonic adaptation. As we drift toward our utopian society, we will ultimately find new monsters to seek out. What was once considered tolerant will be seen as intolerant. Thus a narrowing of the utopian vision may develop until it breaks.
As for Popper’s paradox and its original dilemma: I believe the best we can do to invalidate the paradox is to use the narrowest possible definition of tolerance, that being “restraint from using force to silence or remove others from society”. My elaboration on Popper’s Paradox of tolerance and solutions provides more detail.
Agreed, p(doom) is a useless metric for advancing our understanding or policy decisions. It has become a social phenomenon where people are now pressuring others to calculate a value and asserting it can be used as an accurate measurement. It is absurd. We should stop accepting what are nothing more than guesses and define criteria that are measurable. FYI, my own recent rant on the topic …
https://www.mindprison.cc/p/pdoom-the-useless-ai-predictor
I used Infi-gram to prove the data exists in the training set, along with other prompts that could reveal the information exists. For example, an LLM sometimes could not answer the question directly, but when asked to list the episodes it could do so, revealing that the episode exists within the LLM’s dataset.
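In case anyone wants to reproduce that kind of check, here is a minimal sketch of the exact-phrase lookup I mean against the public infini-gram web API. The endpoint URL, index name, and JSON fields below are my assumptions about that API from memory and should be verified against its current documentation before relying on it.

```python
# Rough sketch: check whether an exact phrase occurs in an indexed open
# training corpus via the infini-gram web API. The endpoint, index name,
# and payload fields are assumptions; confirm them in the infini-gram docs.
import json
import urllib.request

def corpus_count(phrase: str, index: str = "v4_rpj_llama_s4") -> int:
    payload = json.dumps({
        "index": index,          # which indexed corpus to query (assumed name)
        "query_type": "count",   # ask how many times the exact phrase occurs
        "query": phrase,
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.infini-gram.io/",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("count", 0)

if __name__ == "__main__":
    # A nonzero count indicates the phrase appears in the indexed corpus,
    # even when the model fails to surface it in answers.
    print(corpus_count("Ring around Gilligan"))
```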
FYI—“Ring around Gilligan” is surfaced incorrectly. It is not about mind reading. It is about controlling another person through a device that makes them do whatever they are asked.
Although I can’t know specifically why some models are now able to answer the question, it isn’t unexpected that they eventually would. With more training and bigger models, the statistical bell curve of what the model can surface does widen.
BTW, your primary use case is mine as well. Unfortunately, I have had no luck with reliability for processing knowledge in the context window. My best solution has been to prompt for direct citations from any document so I can easily verify whether the result is a hallucination. It doesn’t stop hallucinations, but it helps me identify them more quickly.
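As a rough illustration of that verification step (not my actual tooling; the file names are just placeholders), something like this can flag quoted passages in a model answer that don’t appear verbatim in the source document:

```python
# Minimal sketch: given a model answer that was prompted to include direct
# quotations from the source document, flag any quoted passage that does not
# appear verbatim (whitespace-normalized) in that document.
import re

def check_citations(answer: str, source_text: str) -> list[tuple[str, bool]]:
    """Return each quoted passage in `answer` with a flag indicating whether
    it appears verbatim in `source_text` after whitespace normalization."""
    def normalize(s: str) -> str:
        return " ".join(s.split()).lower()
    source_norm = normalize(source_text)
    # Only consider quotes long enough to be meaningful citations.
    quotes = re.findall(r'"([^"]{20,})"', answer)
    return [(q, normalize(q) in source_norm) for q in quotes]

if __name__ == "__main__":
    source = open("document.txt", encoding="utf-8").read()       # placeholder path
    answer = open("model_answer.txt", encoding="utf-8").read()   # placeholder path
    for quote, found in check_citations(answer, source):
        status = "OK" if found else "NOT FOUND (possible hallucination)"
        print(f"[{status}] {quote[:80]}")
```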
I suspect training for such specific tasks might improve performance somewhat, but hallucinations will never go away on this type of architecture. I recently wrote about that in detail here: “AI Hallucinations: Proven Unsolvable—What Do We Do?”
Sorry for the late reply; my karma here is negative and I have a 3-day penalty on replies. For some reason, everything I’ve posted here has received lots of downvotes without comment, so I’ve mostly quit posting here.