Yeah, Frank[1] (i.e. nostalgebraist-autoresponder) is an interesting reference point here!
Although – despite being fine-tuned on my blog and then conditioned to simulate it – she’s unfortunately not a very “clean” experiment in tuning a base model to imitate a specific human.
The earliest versions of the model were closer to that, but they also used base models that are very weak by today’s standards (the second-largest GPT-2 model, and then the largest one once it was released). So, although they did produce text that was sort of “nostalgebraist-esque,” it was also very incoherent, and mainly sounded like me in terms of surface stylistic features and the (usually nonsensical) involvement of various names and concepts that I frequently wrote about in the mid-2010s.
As time went on and better base models were released, I repeatedly “upgraded” the underlying model to the latest and greatest thing, and by the end the bot was making far more sense (especially in the final months of her operation, with Llama 1 13B).
However, over the same time interval, the bot got a lot more popular on tumblr, and my goal for it shifted from “make a simulation of me, which me and my friends will find amusing” to “make a bot that broadly entertains tumblr users.” As a result of that – together with investigations like this – I convinced myself that I needed more training data from other tumblr blogs besides mine, and acted accordingly. After that, my successive finetunes used an ever-growing scraped tumblr corpus, relative to which my own blog was just a pinch of salt in an ocean[2].
Unfortunately, perhaps due to the comparative weakness of the base models I used for most of the bot’s existence, this tended to dilute my own voice and promote a more generic “tumblr post” style, even when conditioned on my username. In the last few finetunes I re-adopted a practice of running an extra pass over just my blog at the end of training, which subjectively made the bot’s voice a lot more nostalgebraist-like.
Although it still wasn’t a very close imitation – in large part due, I think, to the fact that the bot’s posts were not just conditional samples from the model. Instead, each one was rejection sampled at inference time from a pool of ~10 candidates, using several classifiers[3], the most important of which was a predictor of user engagement (not specifically positive or negative, just “whether a post would get a lot of likes/reblogs relative to a rolling mean of posts around the same time”).
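To make that selection step concrete, here’s a minimal sketch of this kind of classifier-guided best-of-n rejection sampling. The function names and the structure are illustrative stand-ins, not the actual nostalgebraist-autoresponder code; only the ~10-candidate pool and the engagement-predictor idea come from the description above.

```python
from typing import Callable, List, Optional

def pick_post(
    prompt: str,
    generate: Callable[[str], str],        # samples one candidate post from the LM
    engagement: Callable[[str], float],    # predicted-engagement classifier
    passes_filters: Callable[[str], bool] = lambda _: True,  # the other classifiers
    n_candidates: int = 10,
) -> Optional[str]:
    """Sample a pool of candidates, drop any that the auxiliary classifiers
    reject, and return the survivor with the highest predicted engagement
    (or None if the whole pool was rejected)."""
    candidates: List[str] = [generate(prompt) for _ in range(n_candidates)]
    survivors = [c for c in candidates if passes_filters(c)]
    if not survivors:
        return None
    return max(survivors, key=engagement)
```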
This didn’t make the bot sycophantic – if anything, it did the opposite – but it did make it (often if not always) very funny. Which I always think back to whenever I hear someone claim that LLMs can’t be funny, as people sometimes do even today[4].
Many (cherry-picked) examples of the bot’s funniness can be found in the tag I used to reblog it. For anyone reading this who isn’t familiar with my bot, I recommend reading through at least a few pages of that tag, as the contents are not only entertaining but also somewhat interesting as examples of what you get when you (sort of) “optimize an LLM at the task of writing funny posts.”
All in all, Frank did not really have any kind of consistent “character” (and in particular she would be wildly inconsistent about her own stated traits from post to post), except I guess for “being an entertaining tumblr-style shitposter,” which she did quite effectively if not always consistently.
I’ve sometimes thought about making some kind of “nostalgebraist-autoresponder rebooted” finetune using the same dataset with a much more recent and better base model, just to see what would happen. But I’ve never felt excited enough by this idea to actually do it, in large part because the original project was so exhausting by the end and it feels nice to just be done with it now.
(Re: your idea more generally, @eggsyntax had a similar proposal in another comment, the one mentioning Lincoln)
[1] I and other users called the bot “Frank” (short for “Francis Owen”) and used she/her pronouns for her, on the basis of some very early responses to questions about name and gender.
[2] This was also during the period when the prevailing view was like “just train the LLM on literally all the data you have, from any source, the more the better,” i.e. before the field fully appreciated the importance of data quality/filtering in LLM training.
I remember doing a bunch of work to (soft-)deduplicate the corpus, since I was also convinced that repeated data was bad (another popular view at the time, which I came to on my own by watching val loss curves spike after the 1st epoch and never come down again), but otherwise I just “threw it all in there.”
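For illustration, “soft” deduplication could look something like the sketch below, which flags near-duplicate posts by word-shingle overlap. This is a generic approach, not necessarily the method used for the bot’s corpus, and a real pipeline would typically use MinHash/LSH rather than this quadratic scan.

```python
def shingles(text: str, n: int = 5) -> set:
    """Set of n-word shingles, used as a cheap fingerprint of a post."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def soft_dedupe(posts: list[str], threshold: float = 0.8) -> list[str]:
    """Drop posts whose shingle-set Jaccard similarity to an already-kept
    post exceeds `threshold` (exact duplicates score 1.0)."""
    kept, kept_shingles = [], []
    for post in posts:
        s = shingles(post)
        near_dupe = any(
            len(s & t) / max(1, len(s | t)) > threshold for t in kept_shingles
        )
        if not near_dupe:
            kept.append(post)
            kept_shingles.append(s)
    return kept
```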
[3] Sidenote: to save VRAM in inference, these classifiers were lightweight heads (single transformer block + linear classifier layer IIRC) whose inputs were activations from a layer inside the LM, allowing me to piggyback off of the already-loaded LM for language understanding. I found that the layers right in the middle of the LM worked the best by far, especially for abstract stuff like predicting engagement. It was enjoyable to see my casual impression that the middle layers “understood abstract things better” recur in more scientific form in later work on LLM interpretability.
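A minimal PyTorch sketch of that kind of head, assuming the mid-layer hidden states are grabbed during the LM’s forward pass; the dimensions, pooling, and other details here are illustrative, not the original values:

```python
import torch
import torch.nn as nn

class MidLayerClassifierHead(nn.Module):
    """Lightweight head of the kind described above: one transformer block
    plus a linear classifier, fed hidden states taken from roughly the middle
    of the (already-loaded, frozen) LM."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, n_classes: int = 2):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mid_layer_hidden: torch.Tensor) -> torch.Tensor:
        # mid_layer_hidden: (batch, seq_len, d_model) activations from a
        # middle layer of the LM, reused from the generation pass.
        x = self.block(mid_layer_hidden)
        # Pool over the sequence and classify (e.g. "high engagement" vs. not).
        return self.classifier(x.mean(dim=1))
```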
[4] To be fair to the claimants here, they are usually basing this on experience with assistant characters, and typical “interpretations” of the assistant are indeed remarkably unfunny when not driven off-distribution, almost as though they’re actively trying to be unfunny.
Indeed, I suspect that they are trying, just as I suspect that cases like these and the ChatGPT parts here reflect the model actively trying to produce a certain sort of bad writing. After all, writing extremely unimaginative fiction and writing extremely bad jokes both seem like natural traits for a “cheesy sci-fi robot” character.