Thursday, February 15, 2024

On the Seventh Day

Last week, I replicated an experiment with "A.I." conducted by a user on LinkedIn. The question was: "How many days between 2000-03-09 and 2002-09-30?"

Basically, ChatGPT 3.5 got the answer all kinds of wrong. It had come up with workable steps for finding the number, but made errors in each one, leaving its final answer off by 100 days. But it showed its work, so one could see where it went off the rails.
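
For reference, here is one workable decomposition, sketched in Python (not necessarily the exact steps ChatGPT described): count the two full years first, then walk the remaining span month by month.

    # Two full years: 2000-03-09 to 2002-03-09. Neither intervening
    # February (2001, 2002) falls in a leap year, so 365 + 365.
    full_years = 365 + 365
    # Then 2002-03-09 to 2002-09-30, month by month:
    rest_of_march = 31 - 9                   # 22 days left in March
    april_to_august = 30 + 31 + 30 + 31 + 31
    into_september = 30
    print(full_years + rest_of_march + april_to_august + into_september)  # 935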

This morning, I asked again, and this time ChatGPT's answer, 934 days, was only slightly off from the ones originally provided by Bing, Alexa, Copilot, and Gemini. (By my math, the correct answer is 935, so it's off by 1.) Again, it gave the steps, but this time it didn't show its work.
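
Python's standard library agrees; note that subtracting dates counts the gap between them, not the endpoints themselves:

    from datetime import date

    delta = date(2002, 9, 30) - date(2000, 3, 9)
    print(delta.days)   # 935, one more than ChatGPT's 934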

I then decided to test a theory of mine about generative A.I., and asked: "How many numbers are there between 1 and 100?" This is a trick question. For starters, it's ambiguous, because I don't define "number," so the answer could be "an infinite number." I also suspected that ChatGPT wasn't going to leave out 1 and 100, because that isn't the way people normally talk about ranges like this. Sure enough: "Therefore, there are 100 numbers between 1 and 100, inclusive."
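
To make the ambiguity concrete, here are the three defensible readings, restricting "number" to integers for the finite cases:

    # Reading 1: integers, both endpoints counted ("inclusive")
    print(len(range(1, 101)))   # 100
    # Reading 2: integers strictly between the endpoints
    print(len(range(2, 100)))   # 98
    # Reading 3: if "number" includes values like 1.5, there are
    # infinitely many; no program can count them.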

(Just to check, I asked it: "Is 1.5 a number?" To which it responded "Yes, 1.5 is indeed a number. It is a decimal number representing the value one and a half. In mathematical terms, it's a rational number and can be expressed as the fraction 3/2.")

Curious as to how ChatGPT parsed the word "between," I asked: "How many days are there between Sunday and Saturday?" The answer came back as: "There are 6 days between Sunday and Saturday, inclusive." So the meaning of "inclusive" shifted: in the number question it counted both endpoints, 1 and 100, but here it counted only one of Sunday and Saturday, since counting both would give 7.
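
Purely as speculation on my part, index arithmetic is one place an off-by-one like that can come from: subtracting positions in a list counts one endpoint but not both.

    days = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]
    print(days.index("Saturday") - days.index("Sunday"))  # 6, one endpoint
    print(len(days))                                      # 7, both endpoints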

To be sure, these are simple questions. But they point to the idea that ChatGPT, at least, because of the data it's been trained on, tends to use language in the same inconsistent ways that people commonly do. And this is why I suspect that prompt engineering is such an important piece of the puzzle. Getting to the "right answer" is likely going to mean constraining the system, especially if its "Temperature" and/or "Top_p" settings are set to allow for more variability in the answers given.
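
For what it's worth, here's a sketch of how those knobs look in OpenAI's Python client (the model name and prompt are just placeholders); turning temperature down to 0 is the usual way to constrain that variability:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user",
                   "content": "How many days between 2000-03-09 and 2002-09-30?"}],
        temperature=0,  # minimize sampling randomness
        top_p=1,        # keep the full token distribution available
    )
    print(response.choices[0].message.content)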
