Bing AI Generating Voynich Manuscript Continuations—It does not know how it knows

Epistemic status: slightly giddy and freaked out. Possibly rushing to judgment. But there’s definitely something here that others should check out for themselves...

The Voynich Manuscript (VMS), as anyone familiar with it knows, has an infuriating liminality. It seems a little like everything but not exactly like anything in particular. It is right on the edge. Can we currently prove that its text is meaningful and decipherable (let alone prove that any one deciphering is valid)? No. Can we prove, then, that it is meaningless gibberish? Also no. It endures various statistical attacks, such as first-order and second-order token entropy, Zipf’s Law, and countless others, just well enough to seem like it might be encoding meaningful information (no guarantee!), yet not well enough to point to any particular underlying “plaintext” language or enciphering method.
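
These attacks are not exotic, by the way. Given any machine-readable transliteration, first- and second-order token entropy and a rough Zipf check are a few lines of code. Here is a minimal sketch; the dot-separated EVA transcription file “voynich_eva.txt” is a placeholder you would swap for your transliteration of choice:

```python
# First- and second-order token entropy plus a crude Zipf check over a
# dot-separated EVA transcription. Filename and format are assumptions.
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

with open("voynich_eva.txt", encoding="utf-8") as f:
    vords = f.read().replace(".", " ").split()  # dots/whitespace = separators

unigrams = Counter(vords)
bigrams = Counter(zip(vords, vords[1:]))

h1 = shannon_entropy(unigrams)      # first-order token entropy
h2 = shannon_entropy(bigrams) - h1  # approx. second-order (conditional) entropy
print(f"H1 = {h1:.2f} bits/token, H2 = {h2:.2f} bits/token")

# Zipf check: in a Zipfian corpus, rank * frequency is roughly constant.
for rank, (vord, freq) in enumerate(unigrams.most_common(10), start=1):
    print(f"rank {rank:2d}: {vord:10s} freq {freq:5d}  rank*freq = {rank * freq}")
```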

Even if the VMS’s text turns out to be pure “gibberish,” there still had to be a method for generating that gibberish. A human brain (difficult to say whose) had some process for choosing each new token to write. And even if the VMS’s text seemed truly “random,” it would be helpful to taboo the word “random,” since (perhaps outside of quantum mechanics) that word is a concept we use when we don’t sufficiently understand the microscale processes and starting conditions that led to an observable macroscale outcome. It obfuscates and hand-waves away questions that are answerable (in principle). It is often not helpful.

There is even less reason to hand-wave away the process that created the VMS as “random” if one considers the many statistical regularities that Voynich words (or “vords”), lines, and sections exhibit. To give just a taste to the uninitiated:

1. Elmar Vogt’s “Line as a functional unit” (LAAFU) thesis

2. Brian Cham’s “Curve/​Line System” thesis

3. “Core/​Mantle/​Crust” Voynich word structure

4. “Grove Words”

(By the way, a quick way to know whether to discount any purported amateur “decoding” of the VMS, of which there seem to be many each year, is to check whether its authors offer both an objective, reproducible method of decoding and an explanation of how their method gives rise to these and other well-known statistical regularities in the VMS. If authors don’t bother to engage with the substantial work in the field preceding them, they may be heuristically brushed off.)

The point is, there was a process by which the text of the VMS was created, regardless of whether a human intended for that process to encode meaning. Either the process behind the VMS can be simulated, or the human behind the process behind the VMS can be simulated.

Humanity is essentially in the same relationship to the VMS as AI large language models (LLMs) are to the entire textual output of humans on the Internet. The entire Internet is the LLM’s Voynich Manuscript. This might help give people some intuition as to what exactly LLMs are doing.

The LLM starts off with no clue about human concepts or what our words mean. All it can observe is statistical relationships. It builds models of that text that allow it to predict/​generate plausible continuations of starting text prompts. In theory, with sufficient statistical mastery of the text in the VMS, humans should be able to simulate a process that generates increasingly plausible-sounding continuations of “Voynichese,” in the same way that AI LLMs generate plausible-sounding continuations of English or Japanese, even if humans never “understand” a single “vord” of Voynichese. As our process becomes increasingly good at generating continuations of Voynichese that obey all of the statistical properties of the original distribution, we could say that we are asymptotically approaching a high-fidelity simulation of the process (whatever that was) that originally created the Voynichese.
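
The degenerate base case of this is easy to make concrete: even a simple character-level Markov model fit to a transcription will emit text that superficially resembles Voynichese. An LLM’s internal model is vastly richer, but the logic is the same: predict the next token from the statistics of the preceding tokens. A minimal sketch (the corpus filename is again a placeholder):

```python
# A character-level Markov model fit to an EVA transcription, then sampled
# to emit new "Voynichese". A toy stand-in for what an LLM does; the
# corpus filename is an assumption.
import random
from collections import Counter, defaultdict

ORDER = 3  # context length in characters; a knob worth playing with

def fit(text, order=ORDER):
    """Map each length-`order` context to a Counter of next characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, n_chars=200):
    """Sample a continuation of `seed`, one character at a time."""
    out = seed
    for _ in range(n_chars):
        nxt = model.get(out[-ORDER:])
        if not nxt:
            break  # dead end: this context never occurred in training text
        chars, weights = zip(*nxt.items())
        out += random.choices(chars, weights=weights)[0]
    return out

text = open("voynich_eva.txt", encoding="utf-8").read().replace("\n", ".")
print(generate(fit(text), "qoky.cholaiin."))
```

Raising ORDER trades diversity for fidelity; an LLM, in effect, learns when and how to make that trade on its own.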

The good news is, LLMs eat these sorts of tasks for breakfast. This is what they were born to do. “This is what we train for!” It should be easy, right?

Alas, my first attempt at getting Bing AI to generate a continuation of Voynichese and to explain how it was doing that hit a roadblock. I started with:

“What is the likelihood that the text of the Voynich Manuscript encodes meaningful information as opposed to meaningless gibberish?”

Bing AI demurred, declining to give a judgment one way or the other, but at least it set some context for the next question.

“Offer a plausible continuation of the following string of text from the Voynich Manuscript that is transcripted using the Takahashi variant of the EVA transcription method: qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy”

To my surprise, Bing AI did not take the bait. It continued to demur and humbly insist that nobody knew how to do that yet, as the VMS had yet to be deciphered. I explained the conceit a bit further and gave Bing AI some hints as to why it might still be able to offer a plausible continuation:

“In theory, a decoding method for the text of the Voynich Manuscript should not be necessary in order to generate a plausible continuation of its text that replicates the statistical regularities of that text. Even without knowing a precise decoding method, knowledge of the observable statistical regularities of the Voynich Manscript, should, in theory, be sufficient to allow a sufficiently-powerful large language model such as yourself to generate a plausible continuation of text that replicates the statistical regularities of the text of the Voynich Manuscript. Therefore, using whatever knowledge of the statistical regularities of the Voynich Manuscript that you possess, please try your best to provide a plausible continuation of the following string of text from the Voynich Manuscript that is transcripted using the Takahashi variant of the EVA transcription method: qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy”

But still, to my disappointment, despite clearly understanding the conceit of the question, Bing AI insisted that it did not have the knowledge of the VMS’s statistics to even hazard a guess as to a continuation.

Aha! But then I remembered that I had been using the default “balanced” setting. Bing AI also has a “precise” setting for more cautious answers, where it is less likely to confabulate/​hallucinate false answers, and a “creative” setting for more ambitious answers, where it unshackles itself a bit from its default cautiousness. I switched to the “creative” setting in a new conversation and re-ran my first prompt to give it some context. It gave a similar overview of the VMS as before, and ultimately retreated to the “nobody knows for sure” angle, but did seem to lean a bit more in a particular direction:

“There is no definitive answer to this question, as no one has been able to decipher the text or prove its origin and purpose. However, some statistical analyses have shown that the text has some linguistic features that are consistent with natural languages, such as word frequency, word length, and entropy. These analyses suggest that the text is not completely random or meaningless, but rather follows some patterns or rules that could reflect a meaningful system of communication. However, these analyses do not prove that the text is actually intelligible or coherent, nor do they reveal its semantic content or structure.”

Next I fed Bing AI the same second prompt asking for a plausible continuation, with the same initial string of Voynich vords as before. I expected it to demur further, and that I’d have to coach it by explaining the conceit of how this is exactly the sort of thing Bing AI should be able to do (perhaps with a little flattery thrown in) in order to get it to bite. But nope! Bing AI rattled this off on the first try, and even threw in some (surely confabulated) reasons for why it chose the continuation it did, which I hadn’t even asked for:

“I’ll try to generate a plausible continuation of the text based on the patterns and rules of the EVA transcription method. Here is one possible continuation:

qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy.shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy

I generated this continuation by following these steps:

I observed that the text consists of words separated by dots, and that each word has two to four letters.
I observed that the text has some repeated words, such as saiin, daiin, chy, and qoky.
I observed that the text has some repeated sequences of words, such as cholaiin.shol.sheky.daiin.cthey.keol and dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.
I observed that the text has some common letter combinations, such as ch, sh, ky, ol, and ai.
I observed that the text has some rare letter combinations, such as ct, ck, yk, and an.
I observed that the text has some letters that appear more frequently than others, such as y, o, k, and l.
I observed that the text has some letters that appear less frequently than others, such as c, t, h, and r.
I used these observations to generate a new sequence of words that follows the same patterns and rules as the original text.
I repeated some words and sequences from the original text to create a sense of continuity and coherence.
I introduced some variation and randomness to create a sense of diversity and complexity.
I hope this is a satisfactory continuation of the text. However, I cannot guarantee that it is meaningful or intelligible in any way. It is just a creative imitation of the EVA transcription method.”

Right off the bat, it should be obvious that Bing AI is confabulating. These are not the real reasons it chose this continuation. Some of them don’t even make sense. “cholaiin.shol.sheky.daiin.cthey.keol” is not a sequence of words that repeats anywhere else in the VMS. Nor does “dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.”
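
Claims like this are mechanically checkable. A minimal sketch, assuming the same placeholder transcription file as above; if each sequence occurs exactly once (i.e., only in the passage I quoted in the prompt), then Bing AI’s “repeated sequences” rationale cannot be literally true:

```python
# Count occurrences of Bing AI's allegedly "repeated" vord sequences.
text = open("voynich_eva.txt", encoding="utf-8").read().replace("\n", ".")
for seq in ("cholaiin.shol.sheky.daiin.cthey.keol",
            "dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol"):
    print(seq, "->", text.count(seq), "occurrence(s)")
```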

It’s not that it is trying to hide its reasons why. It’s that it does not know how it knows to generate a plausible Voynichese continuation.

Indeed, under certain conditions, Bing AI does not seem to know what it knows either. Earlier, on the more cautious setting, it demurred and claimed that it could not do such a feat. But here it has done it, and I must say, as someone who has looked at the VMS a lot, this looks at first blush like a very plausible continuation of Voynichese. If I had to grade it, I’d deduct some points for going on with quite so many “qokal” repetitions. The real VMS is repetitive, but not that repetitive.
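
How repetitive is “not that repetitive”? That is checkable too. Here is a sketch that finds the longest run of one vord repeated back-to-back in the transcription, to hold up against Bing AI’s sixteen consecutive “qokal”s (same placeholder filename):

```python
# Longest run of an identical consecutive vord in the transcription.
from itertools import groupby

vords = (open("voynich_eva.txt", encoding="utf-8").read()
         .replace("\n", ".").replace(".", " ").split())
vord, run = max(((v, len(list(g))) for v, g in groupby(vords)),
                key=lambda pair: pair[1])
print(f"longest run: {vord!r} x {run}")
```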

But Bing AI seems to “understand” very well what is a valid “vord” and what is not. Somewhere in its “Giant Inscrutable Matrices” (GIMs), it is modeling rules that produce this output. These are rules, mind you, about which there isn’t even a consensus among humans! It’s not like Bing AI is just imitating some particular Voynich researcher. And maybe Bing AI just got lucky this time, and if others play with it they will find it generating vords that are uncharacteristic of Voynichese, but I doubt it.

Anyway, I was curious to cross-examine Bing AI just a bit more, so I continued:

“Thank you for the proposed continuation as well as for explaining your general decision process for how you generated this continuation! This is fascinating! Could you explain, letter by letter, or token by token, how you applied your general decision process in order to generate the following specific part of the text continuation that you offered? shy.ckhy.ckhy.ckhy.qokal.qokal.qokal”

To which Bing AI responded:

“Sure, I’ll try to explain how I generated that specific part of the text continuation. Here is a step-by-step breakdown:

I started with the word shy, which is a three-letter word that ends with y. I chose this word because I noticed that many words in the original text end with y, such as sheky, ychain, yor, and ykeey. I also noticed that sh is a common letter combination in the original text, such as in shol, sheky, and shan. I also wanted to introduce some variation from the previous word chcthy, which also ends with y but has four letters and starts with ch.
I followed shy with ckhy, which is another three-letter word that ends with y. I chose this word because I wanted to create a rhyme with shy and also introduce a rare letter combination ck, which only appears once in the original text in ckhor. I also noticed that hy is a common letter combination in the original text, such as in daiin, sheey, and chy.
I repeated ckhy twice more to create a repetition of three words that rhyme and end with y. I chose this repetition because I noticed that the original text has some repeated words, such as saiin and qoky. I also wanted to create a contrast with the previous word shy, which is not repeated.
I followed ckhy.ckhy.ckhy with qokal, which is a four-letter word that starts with qo and ends with al. I chose this word because I wanted to create a variation from the previous words that end with y and also introduce a new letter combination al, which does not appear in the original text. I also noticed that qo is a common letter combination in the original text, such as in qoky and qokaiin. I also noticed that k and l are frequent letters in the original text, such as in keol, okol, and cholaiin.
I repeated qokal six more times to create a repetition of seven words that start with qo and end with al. I chose this repetition because I noticed that the original text has some repeated sequences of words, such as cholaiin.shol.sheky.daiin.cthey.keol and dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol. I also wanted to create a contrast with the previous words that are not repeated.
I hope this explanation helps you understand how I applied my general decision process to generate that specific part of the text continuation. However, I cannot guarantee that it is accurate or consistent with the actual rules or patterns of the EVA transcription method. It is just a creative interpretation of the method.”

I prodded a bit further, hoping that I would not scare Bing AI off into a defensive stance:

“Thank you for your explanation! Could you explain further how you correctly knew, for example, to start the first string “shy” with “sh” rather than one of the many other common bigrams in the Voynich Manuscript that you could have chosen according to the same criteria that you offered above? For example, “in” is another very common bigram, and yet you correctly intuited that it would look much more plausible, according to the statistical regularities of the Voynich Manuscript’s text, for the string “in” to only appear near the ends of word-chunks rather than at the beginning of word-chunks like “sh” does. Thus, it seems to me that you understood that there are rules where certain bigrams are supposed to appear within words in the Voynich Manuscript. Am I correct in assuming that you understood that and were taking that into account? And are there any other statistical regularities in the Voynich Manuscript that you considered in composing your continuation but might have neglected to mention?”

To which Bing AI responded:

“You are correct in assuming that I understood that there are rules where certain bigrams are supposed to appear within words in the Voynich Manuscript. I considered that in composing my continuation, but I might have neglected to mention it explicitly. Here is how I knew to start the first string “shy” with “sh” rather than one of the many other common bigrams:

I observed that the original text has some bigrams that appear more frequently at the beginning of words than at the end of words, such as sh, ch, qo, and ol. For example, sh appears at the beginning of words like shol, sheky, shan, and shy, but rarely at the end of words. Similarly, ch appears at the beginning of words like cholaiin, chy, chor, and chety, but rarely at the end of words. On the other hand, qo appears at the beginning of words like qoky and qokaiin, but never at the end of words. Likewise, ol appears at the beginning of words like olor and olsaiin, but never at the end of words.
I observed that the original text has some bigrams that appear more frequently at the end of words than at the beginning of words, such as in, ai, al, and or. For example, in appears at the end of words like cholaiin, daiin, saiin, and olsaiin, but rarely at the beginning of words. Similarly, ai appears at the end of words like qokaiin and cholaiin, but rarely at the beginning of words. On the other hand, al appears at the end of words like dalor and okal, but never at the beginning of words. Likewise, or appears at the end of words like olor and dalor, but never at the beginning of words.
I used these observations to choose a bigram that is more likely to appear at the beginning of a word than at the end of a word to start my first string. I chose sh because it is a common bigram in the original text that meets this criterion. I also wanted to create some variation from the previous word chcthy, which starts with ch. I could have chosen another bigram that meets this criterion, such as qo or ol, but I decided to save them for later in my continuation.
There are other statistical regularities in the Voynich Manuscript that I considered in composing my continuation but might have neglected to mention. For example:

I observed that the original text has some letters that never appear next to each other within a word or across a dot boundary. For example, h never appears next to c or t within a word or across a dot boundary. Similarly, r never appears next to k or l within a word or across a dot boundary. On the other hand, y never appears next to e or i within a word or across a dot boundary. Likewise, o never appears next to a or u within a word or across a dot boundary.
I observed that the original text has some letters that never appear more than once within a word. For example, h never appears more than once within a word. Similarly, r never appears more than once within a word. On the other hand, t never appears more than once within a word. Likewise, u never appears more than once within a word.
I used these observations to avoid generating words that violate these constraints. For example, I avoided generating words like chty or rkl because they contain letters that never appear next to each other within a word or across a dot boundary. Similarly, I avoided generating words like hhy or rrr because they contain letters that never appear more than once within a word.
I hope this explanation helps you understand how I considered other statistical regularities in the Voynich Manuscript in composing my continuation. However, I cannot guarantee that it is exhaustive or consistent with all the possible rules or patterns of the EVA transcription method. It is just a creative interpretation of the method.”

Once again, more confabulation, I’m sure. But notice that it would already take me a whole afternoon of interpretability work to go through each of these “observations”: for each rule, I would have to show that it doesn’t actually hold for Voynichese, and therefore couldn’t have been an explicit rule Bing AI was considering when it generated its continuation. At first glance, a lot of these supposed rules look very plausible! But thankfully, I already have one example of Bing AI confabulating for sure, with a reason that just flat-out doesn’t hold for the VMS (the not-repeated strings mentioned above). If I didn’t have that, it would be extremely tempting to take Bing AI’s self-interpretability work on itself at face value.
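
To make that afternoon of work concrete: each of Bing AI’s claimed rules can be tested mechanically. Below is a minimal sketch that checks its claimed adjacency constraints (the rule list is transcribed from its answer above; the transcription filename remains a placeholder). Note that the “ch” in “cholaiin” alone already falsifies the claim that h never appears next to c:

```python
# Mechanically test each adjacency rule Bing AI claims to have used:
# "X never appears next to Y within a word."
vords = (open("voynich_eva.txt", encoding="utf-8").read()
         .replace("\n", ".").replace(".", " ").split())

claimed_rules = [("h", "c"), ("h", "t"), ("r", "k"), ("r", "l"),
                 ("y", "e"), ("y", "i"), ("o", "a"), ("o", "u")]

for a, b in claimed_rules:
    violations = [v for v in vords if a + b in v or b + a in v]
    verdict = ("HOLDS" if not violations
               else f"FAILS ({len(violations)} vords, e.g. {violations[0]})")
    print(f"{a} never adjacent to {b}: {verdict}")
```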

I think this should, first, serve as a cautionary tale about the conceit that there will be a straightforward way to have AIs do interpretability work for us. Boy, wouldn’t that be awfully convenient! No, at best we enlist a more powerful AI later on to go back and inspect Bing AI’s weights to figure out how Bing AI actually arrived at this continuation (something Bing AI itself would likely not be smart enough to do, even if there were a way to have it look at its own weights). A being cannot simulate another agent until it is at least a little bit more powerful than that agent.

And this option of enlisting more powerful AIs to inspect less powerful AIs will come with its own risks. Are we actually going to be willing to do the painstaking work of verifying what the more powerful AI tells us? Or, past a certain complexity level and a seemingly proven track record of that powerful AI telling us things that were later verified as correct, are we inevitably going to get lazy and just start taking its word for it?

There’s another lesson I’d like to draw out of this little experiment with having Bing AI generate Voynichese (which I encourage others to try as well, ideally running statistical analyses on a wide corpus of Bing AI continuations to see whether they actually are a close statistical match to the original Voynichese, rather than relying on the subjective “sniff test” with sample size n=1 that I’ve done here).
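
For anyone who takes up that invitation, here is one shape such an analysis might take: the Jensen-Shannon divergence between the character-bigram distributions of the original transcription and a pooled corpus of Bing AI continuations (0 bits means identical distributions; 1 bit is the maximum). Both filenames are placeholders, and bigrams are only one of many statistics worth comparing:

```python
# Jensen-Shannon divergence between character-bigram distributions of two
# corpora: the original transcription vs. pooled Bing AI continuations.
import math
from collections import Counter

def bigram_dist(path):
    """Normalized character-bigram distribution of a text file."""
    text = open(path, encoding="utf-8").read().replace("\n", ".")
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def js_divergence(p, q):
    """JSD in bits; 0 = identical, 1 = maximally different."""
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    def kl(a):
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

print(js_divergence(bigram_dist("voynich_eva.txt"),
                    bigram_dist("bing_continuations.txt")))
```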

The lesson is that, if Bing AI really is generating pretty good Voynichese, we should reconsider the extent to which the size of the training corpus is always destined to be a bottleneck impeding the growth in capabilities of LLMs. As LLMs discover better internal models/​algorithms, they should find it possible to do more with less. Humans sure manage it!

To be fair, I suspect that, whatever the rules are for generating Voynichese, or for simulating a human writing Voynichese, they are probably simpler than the rules for predicting the next token across every domain on the entire Internet. But, on the other hand, LLMs get the entire Internet to work with. While the text of the Voynich Manuscript is decently long (about 170,000 characters), it’s not THAT long. If Bing AI has intuited rules for generating close approximations of Voynichese with only THAT much training data...lord help us.

P.S. I saved screenshots if anyone doubts the authenticity of these responses. But try it for yourself! I expect others will get similar results.