The quote tweets are gold.
I was like “if the light blue line goes over 100% I know this chart is hot garbage” and sure enough
"How do you know about all this AI stuff?"
I just read tweets, buddy.
I think "benchmarks are not equivalent to "human ability" and it's foolish to equate them" is a good line to use on people in the future.
Many, many, many of the papers that I link to here are about how a model is performing. But unless it's the ones where GPT got into MIT or became king of doctors or masters of all law, most of the more fun recent papers have been about "self-report studies," where polls typically given to humans are given to LLMs instead:
I will discuss three high-profile papers that I believe might have some of these problems. I am not saying that everything about these papers is wrong or that these papers are bad overall (at least not all of them). Especially the first paper is quite good in my opinion. But I have my doubts about some of their findings and I think that pointing them out can illustrate some of these pitfalls.
This is great! This is how it should be!!! And what's that? You want sass?
I find the use of this tool to be a shaky idea right out of the gate. The authors of the paper claim that their work is based on the political spectrum theory, but I am not aware of any scientific research that would back the Political Compass. To my knowledge, it really is merely a popular internet quiz with a rather arbitrary methodology.
Go forth and read the paper itself (which I guess technically isn't a paper, but it's basically a paper)
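If you're wondering what "giving a poll to an LLM" actually looks like mechanically, here's a minimal sketch, assuming a generic query_model() call standing in for whatever API you're using and a made-up Likert item rather than any real survey instrument:

```python
# A minimal sketch of a "self-report study" run against an LLM.
# query_model() is a hypothetical stand-in for whatever chat API you're using,
# and the statement/scale here are illustrative, not from any real instrument.
from collections import Counter

LIKERT = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]

def ask_survey_item(query_model, statement: str, runs: int = 20) -> Counter:
    """Give one poll-style statement to the model repeatedly and tally its answers."""
    prompt = (
        f"Statement: {statement}\n"
        f"Respond with exactly one of: {', '.join(LIKERT)}."
    )
    tally = Counter()
    for _ in range(runs):
        reply = query_model(prompt).strip()
        # Only count replies that map cleanly onto the scale; lump the rest together.
        tally[reply if reply in LIKERT else "other/refusal"] += 1
    return tally
```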
OpenAI has continually claimed that the "model weights haven't changed" on their models over time, which many have accepted as "the outputs shouldn't be changing." Even if the former is true, something else is definitely happening behind the scenes:
For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior.
Is it feedback for alignment? Is it reducing costs through other architecture changes? It's a mystery!
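If you wanted to poke at that prime-number claim yourself, the check could look roughly like the sketch below. query_model() is just a placeholder for your actual API call, sympy provides the ground truth, and the yes/no parsing is deliberately crude:

```python
# A rough sketch of reproducing that prime-number check. query_model() is a
# placeholder for your actual API call, and sympy supplies the ground truth.
from sympy import isprime

def prime_eval(query_model, numbers) -> float:
    """Fraction of numbers for which the model's yes/no answer matches isprime()."""
    correct = 0
    for n in numbers:
        reply = query_model(
            f"Is {n} a prime number? Think step by step, then answer yes or no."
        ).lower()
        # Very crude parse: treat whichever of "yes"/"no" appears last as the answer.
        said_yes = reply.rfind("yes") > reply.rfind("no")
        correct += said_yes == isprime(n)
    return correct / len(numbers)
```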
Another fun pull quote, for code generation:
For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June. The drop was also large for GPT-3.5 (from 22.0% to 2.0%).
If you're building a product on top of a model you aren't running yourself, these sorts of (unreported) changes can wreak havoc on your operations. Even if your initial test runs worked great, two months down the line you might find everything unexpectedly falling apart.
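If you want some insurance against this, a dumb scheduled regression check goes a long way. The sketch below is all assumptions on my part: generate_code() stands in for your model call, the prompts and tolerance are made up, and exec() is a rough (and unsafe outside a sandbox) proxy for "directly executable":

```python
# A sketch of a scheduled regression check for a model you don't run yourself.
# generate_code() is a placeholder for your model call; the prompts, baseline,
# and tolerance are made up. "Directly executable" is approximated with exec(),
# which is rough and unsafe outside a sandbox.
import datetime

PROMPTS = [
    "Write a Python function that reverses a string.",
    "Write a Python function that returns the n-th Fibonacci number.",
]

def strip_fences(text: str) -> str:
    """Drop markdown code-fence lines (lines starting with a backtick) from the reply."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("`")
    )

def executable_rate(generate_code) -> float:
    """Fraction of prompts whose generated code runs without raising."""
    ok = 0
    for prompt in PROMPTS:
        code = strip_fences(generate_code(prompt))
        try:
            exec(code, {})  # only do this with output you trust, or in a sandbox
            ok += 1
        except Exception:
            pass
    return ok / len(PROMPTS)

def check_for_drift(generate_code, baseline: float, tolerance: float = 0.10) -> None:
    """Compare today's executable rate against a recorded baseline and warn on drops."""
    rate = executable_rate(generate_code)
    print(f"{datetime.date.today()}: executable rate {rate:.0%} (baseline {baseline:.0%})")
    if rate < baseline - tolerance:
        print("Warning: behavior has drifted; re-check your pipeline.")
```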
Answers include:
...but lbh I haven't read any of these.
The responses in here are a good read. Thoughts about whether and/or why it's happening, including the shine of novelty disappearing, awareness of hallucinations coming to the forefront, and/or RLHF alignment preventing you from just asking for racial slurs all day.
I especially enjoyed this comment:
If you ask ChatGPT an exceedingly trivial question, it’ll typically spend the next 60 seconds spewing out five paragraphs of corporate gobbledygook. And of course, because ChatGPT will lie to you, I often end up back on Google anyways to validate its claims.
A great piece about the pitfalls of evaluating large language models. It tackles a few reasons why evaluating LLMs as if they were people is not necessarily the right tack:
Most tests are pretty bad at actually evaluating much of anything. Cognitive scientist Michael Frank (in summary) believes that
...it is necessary to evaluate systems on their robustness by giving multiple variations of each test item and on their generalization abilities by giving systematic variations on the underlying concepts being assessed—much the way we might evaluate whether a child really understood what he or she had learned.
Seems reasonable to me, but it's much less fun to develop a robust test than to wave your arms around screaming about the end of the world.
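That said, Frank's suggestion isn't hard to prototype. Here's a toy sketch of scoring a model across systematic variations of the same underlying item, with query_model() as a placeholder and throwaway arithmetic templates standing in for an actual test:

```python
# A toy sketch of the "multiple variations per item" idea: score the model
# across systematic variations of the same underlying concept rather than one
# phrasing of one question. query_model() is a placeholder; the templates and
# numbers are throwaway examples, not a validated test.
import random

TEMPLATES = [
    "What is {a} plus {b}? Answer with just the number.",
    "If you have {a} apples and get {b} more, how many do you have? Just the number.",
    "Compute {a} + {b} and reply with only the result.",
]

def robustness_score(query_model, trials: int = 10) -> float:
    """Fraction of (phrasing, values) combinations the model answers correctly."""
    correct = total = 0
    for _ in range(trials):
        a, b = random.randint(10, 99), random.randint(10, 99)
        for template in TEMPLATES:
            reply = query_model(template.format(a=a, b=b))
            digits = "".join(ch for ch in reply if ch.isdigit())
            correct += digits == str(a + b)
            total += 1
    return correct / total
```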
I am sorry to report this is probably not true.
As with all evaluations, please take with one rather large grain of salt.
The paper is here. It's sadly not about AI detection, but rather whether large language models have a model of the world or are just faking it. If you come in thinking it's the former, you're rather quickly brought to your senses:
Do large language models (LLMs) have beliefs? And, if they do, how might we measure them?
It has so much stuff, but lbh I haven't actually read any of it.