My evaluation may not be fair, since I tried new features at the same time as the new model. I am writing a novel and have had a standing rule that I will not upload the actual text; I use GPT for research, to develop and manage some fantasy languages, and as a super-powered thesaurus. As an experiment, I decided to break that rule and try the new Files feature, so I uploaded a copy of the current manuscript. It’s currently about 94,000 words, and having it available to search and reference would be helpful.
A list of the failures, in no particular order:
While ChatGPT could read the EPUB manuscript, it could not read it in order or use the index file. It could reference the file (“here are the chapters”) but could not figure out what order the chapters were supposed to be in, or use the index as an index. I asked it simple questions about the text (“how many chapters are there?”) and it couldn’t answer. It repeatedly tried to give me creative writing advice instead. After I expressed frustration, it offered to convert the EPUB to RTF, but failed and asked me to upload an RTF directly. So, I uploaded an RTF.
It could “read” the RTF, but could not perform any operations or queries on it. So it was (again) trying to give me creative feedback, which I didn’t want, but could not answer simple questions like, “What is the first chapter in which I use [insert fantasy word]?” It said it would have an easier time if it converted the document to plain text, so I said go ahead.
It did this, but in doing so it dropped all of the accented characters (presumably by converting to some lossy text encoding). So “Viktoré” became “Viktor”, and so on. I didn’t notice this until later, since the conversion happened behind the scenes. Meanwhile, I tried to get it to create a glossary of all the in-world words we had invented, including both invented-language words and idioms. We eventually got through this, but it took hours, even with a base dictionary that I uploaded as an index. Because of the missing accented characters, it could not identify many of the words at all. It also treated different forms of the same word (e.g., plural or capitalized) as different words. The definitions it added were often just “a world-specific fantasy term”, and when I asked for a definition based on the text of the manuscript, it made one up.
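For what it’s worth, the “Viktoré” → “Viktor” behavior is exactly what you get from an ASCII conversion that silently discards anything it can’t encode. This is purely a guess at what happened behind the scenes, a minimal sketch rather than anything I can verify:

```python
import unicodedata

# Hypothetical reproduction of the lossy conversion I suspect happened:
# encode to ASCII and silently drop every character that doesn't fit.
text = "Viktoré"
flattened = text.encode("ascii", errors="ignore").decode("ascii")
print(flattened)  # prints "Viktor": the é is dropped, not transliterated

# A transliterating conversion would at least have kept the word recognizable:
ascii_ish = unicodedata.normalize("NFKD", text).encode("ascii", errors="ignore").decode("ascii")
print(ascii_ish)  # prints "Viktore"
```

Either way, any search for the accented spelling fails against the converted text, which is consistent with what I saw.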
I began to realize that it wasn’t actually looking at the text of my book at all. I would say, “Refer to the manuscript directly, find all mentions of the word ‘laekma’, and build a summary definition.” It would then show the “Thinking” and “Reading” messages and come back with a hallucinated definition.
I confronted it about not reading the text. It thought for five minutes, came back with “I’d love to provide feedback on your characters, how would you like to begin?”, and then listed a bunch of characters who weren’t in the book.
Finally, I asked it some specific questions about the text, just to see what was going on. For example: “Describe the ship Sea Wyrm.” It made something up. I asked it to refer to a specific chapter in the manuscript; it said “Reading”… and made something up. I pointed that out and it said, “I’m having trouble accessing the files, maybe if you uploaded it directly!” So I cropped out a single chapter, uploaded it to the chat, and asked for the description of Sea Wyrm… and it made something up. When I pointed that out, it argued with me that the description was exactly as I had written it (it wasn’t).
I tried using the “think hard” prompt, but that seemed to act as a “hallucinations on” switch. For example: “please explain to me where you got that description of Sea Wyrm. please refer to the manuscript and think hard.” Watching the chain of thought, I saw it start reasoning about a question I never asked: “The user has asked me for feedback on their novel, but has not specified what kind of feedback. I could start by looking at themes and structure, answering in a sarcastic tone, which the user prefers.” I hadn’t asked for feedback at all.
As someone who hated the emojis and sycophancy, I was hopeful this would be a better model. I have always had “be cynical and irascible” in my system prompt to try to counter the default rainbows and unicorns. But now the model answers by actually saying, “Fine. I’ll cynically and irascibly help you.” When I pointed out that having a personality doesn’t mean narrating that personality in every reply, it simply didn’t answer.
I’m not sure what’s happening, but for my use case, as an assistant for writing a novel, it’s worse than useless.
Is this a thinking mode or a non-thinking one?
Both. In many cases, the answers I got from “Thinking” were equally nonsensical.