Thursday, March 14, 2024

Rewrites

One of the worries I'd heard voiced about Generative AI concerned the biases that might be present in the training data. This prompted me to wonder: what could you learn about the dataset and/or the training that went into a system, given the sorts of responses it gives to prompts?

To be sure, I have no idea how to determine this... I simply don't know enough about how these systems work "under the hood," as it were. I'm a dabbler, not an engineer. But since the genesis of this series of experiments was someone finding that different systems gave different answers to the same prompts, another test came to mind. Last week's experiment had Copilot, Perplexity, Gemini, and ChatGPT 3.5 translate a snippet of a document I'd written in romanized Japanese some two decades ago. This week, I gave each of them the entire translation, as created by Copilot, and had each system rewrite the text. In a nutshell, the text is a brief narrative about Tom, who works for a bank in Sim City.

Copilot and Gemini both did two things that set them apart from the other two systems: 1) they created titles for the story, and 2) they reordered some of its details. In the original, I start by noting that Tom works for a bank, but don't mention that he's the branch manager until later; Copilot and Gemini note both Tom's workplace and his position when they introduce him as a character. Perplexity and ChatGPT 3.5 had their own similarity: they produced very similar text. The first sentence of each matched word for word, and there is a sentence in the middle of the story where the two systems varied by only a single word.

Gemini's rewrite was brief, trimming the text by almost 40 words, nearly a quarter of the total. Copilot, conversely, was the most verbose of the systems; it was the only one whose rewritten text was longer than what I'd submitted, by about a dozen words. That was mainly because it tended to add little flourishes to the final document, but also because it cut the fewest corners in preserving the details of the original text. To be sure, however, all of the systems had trouble with the details, sometimes appearing to miss the nuances.

In the end, despite what I said earlier, I think I can start to understand something about the "interior" of each system from this test, given that I'm already building a set of expectations of what each would do with a given input. I expect it will take several more trials to distill what seem like "personalities" into a distinct set of rules that each operates by. Not being an engineer will make the task take longer, but it seems doable. And it makes sense that these differences exist: they're what distinguishes systems that are otherwise doing the same thing.
