The quote tweets are gold.
I was like “if the light blue line goes over 100% I know this chart is hot garbage” and sure enough
"How do you know about all this AI stuff?"
I just read tweets, buddy.
I think "benchmarks are not equivalent to "human ability" and it's foolish to equate them" is a good line to use on people in the future.
Many, many, many of the papers that I link to here are about how a model is performing. But unless it's the ones where GPT got into MIT or became king of doctors or masters of all law, most of the more fun recent papers have been about "self-report studies," where polls typically given to humans are given to LLMs instead:
I will discuss three high-profile papers that I believe might have some of these problems. I am not saying that everything about these papers is wrong or that these papers are bad overall (at least not all of them). Especially the first paper is quite good in my opinion. But I have my doubts about some of their findings and I think that pointing them out can illustrate some of these pitfalls.
This is great! This is how it should be!!! And what's that? You want sass?
I find the use of this tool to be a shaky idea right out of the gate. The authors of the paper claim that their work is based on the political spectrum theory, but I am not aware of any scientific research that would back the Political Compass. To my knowledge, it really is merely a popular internet quiz with a rather arbitrary methodology.
Go forth and read the paper itself (which I guess technically isn't a paper, but it's basically a paper)
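If you're wondering what "giving a poll to an LLM" actually looks like mechanically, here's a minimal sketch, assuming a generic query_model() call standing in for whatever API you're using and a made-up Likert item rather than any real survey instrument:

```python
# A minimal sketch of a "self-report study" run against an LLM.
# query_model() is a hypothetical stand-in for whatever chat API you're using,
# and the statement/scale here are illustrative, not from any real instrument.
from collections import Counter

LIKERT = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]

def ask_survey_item(query_model, statement: str, runs: int = 20) -> Counter:
    """Give one poll-style statement to the model repeatedly and tally its answers."""
    prompt = (
        f"Statement: {statement}\n"
        f"Respond with exactly one of: {', '.join(LIKERT)}."
    )
    tally = Counter()
    for _ in range(runs):
        reply = query_model(prompt).strip()
        # Only count replies that map cleanly onto the scale; lump the rest together.
        tally[reply if reply in LIKERT else "other/refusal"] += 1
    return tally
```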
OpenAI has continually claimed that the "model weights haven't changed" on their models over time, which many have accepted as "the outputs shouldn't be changing." Even if the former is true, something else is definitely happening behind the scenes:
For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior.
Is it feedback for alignment? Is it reducing costs through other architecture changes? It's a mystery!
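If you wanted to poke at that prime-number claim yourself, the check could look roughly like the sketch below. query_model() is just a placeholder for your actual API call, sympy provides the ground truth, and the yes/no parsing is deliberately crude:

```python
# A rough sketch of reproducing that prime-number check. query_model() is a
# placeholder for your actual API call, and sympy supplies the ground truth.
from sympy import isprime

def prime_eval(query_model, numbers) -> float:
    """Fraction of numbers for which the model's yes/no answer matches isprime()."""
    correct = 0
    for n in numbers:
        reply = query_model(
            f"Is {n} a prime number? Think step by step, then answer yes or no."
        ).lower()
        # Very crude parse: treat whichever of "yes"/"no" appears last as the answer.
        said_yes = reply.rfind("yes") > reply.rfind("no")
        correct += said_yes == isprime(n)
    return correct / len(numbers)
```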
Another fun pull quote, for code generation:
For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June. The drop was also large for GPT-3.5 (from 22.0% to 2.0%).
If you're building a product on top of a model you aren't running yourself, these sorts of (unreported) changes can wreak havoc on your operations. Even if your initial test runs worked great, two months down the line you might find everything unexpectedly falling apart.
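If you want some insurance against this, a dumb scheduled regression check goes a long way. The sketch below is all assumptions on my part: generate_code() stands in for your model call, the prompts and tolerance are made up, and exec() is a rough (and unsafe outside a sandbox) proxy for "directly executable":

```python
# A sketch of a scheduled regression check for a model you don't run yourself.
# generate_code() is a placeholder for your model call; the prompts, baseline,
# and tolerance are made up. "Directly executable" is approximated with exec(),
# which is rough and unsafe outside a sandbox.
import datetime

PROMPTS = [
    "Write a Python function that reverses a string.",
    "Write a Python function that returns the n-th Fibonacci number.",
]

def strip_fences(text: str) -> str:
    """Drop markdown code-fence lines (lines starting with a backtick) from the reply."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("`")
    )

def executable_rate(generate_code) -> float:
    """Fraction of prompts whose generated code runs without raising."""
    ok = 0
    for prompt in PROMPTS:
        code = strip_fences(generate_code(prompt))
        try:
            exec(code, {})  # only do this with output you trust, or in a sandbox
            ok += 1
        except Exception:
            pass
    return ok / len(PROMPTS)

def check_for_drift(generate_code, baseline: float, tolerance: float = 0.10) -> None:
    """Compare today's executable rate against a recorded baseline and warn on drops."""
    rate = executable_rate(generate_code)
    print(f"{datetime.date.today()}: executable rate {rate:.0%} (baseline {baseline:.0%})")
    if rate < baseline - tolerance:
        print("Warning: behavior has drifted; re-check your pipeline.")
```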
Answers include:
...but lbh I haven't read any of these.
The responses in here are a good read. Thoughts about whether and/or why it's happening, including the shine of novelty disappearing, awareness of hallucinations coming to the forefront, and/or RLHF alignment preventing you from just asking for racial slurs all day.
I especially enjoyed this comment:
If you ask ChatGPT an exceedingly trivial question, it’ll typically spend the next 60 seconds spewing out five paragraphs of corporate gobbledygook. And of course, because ChatGPT will lie to you, I often end up back on Google anyways to validate its claims.
A great piece about the pitfalls of evaluating large language models. It tackles a few reasons why evaluating LLMs as if they were people is not necessarily the right tack:
Most tests are pretty bad at actually evaluating much of anything. Cognitive scientist Michael Frank (in summary) believes that
...it is necessary to evaluate systems on their robustness by giving multiple variations of each test item and on their generalization abilities by giving systematic variations on the underlying concepts being assessed—much the way we might evaluate whether a child really understood what he or she had learned.
Seems reasonable to me, but it's much less fun to develop a robust test than to wave your arms around screaming about the end of the world.
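That said, Frank's suggestion isn't hard to prototype. Here's a toy sketch of scoring a model across systematic variations of the same underlying item, with query_model() as a placeholder and throwaway arithmetic templates standing in for an actual test:

```python
# A toy sketch of the "multiple variations per item" idea: score the model
# across systematic variations of the same underlying concept rather than one
# phrasing of one question. query_model() is a placeholder; the templates and
# numbers are throwaway examples, not a validated test.
import random

TEMPLATES = [
    "What is {a} plus {b}? Answer with just the number.",
    "If you have {a} apples and get {b} more, how many do you have? Just the number.",
    "Compute {a} + {b} and reply with only the result.",
]

def robustness_score(query_model, trials: int = 10) -> float:
    """Fraction of (phrasing, values) combinations the model answers correctly."""
    correct = total = 0
    for _ in range(trials):
        a, b = random.randint(10, 99), random.randint(10, 99)
        for template in TEMPLATES:
            reply = query_model(template.format(a=a, b=b))
            digits = "".join(ch for ch in reply if ch.isdigit())
            correct += digits == str(a + b)
            total += 1
    return correct / total
```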
I am sorry to report this is probably not true.
As with all evaluations, please take with one rather large grain of salt.
The paper is here. It's sadly not about AI detection, but rather whether large language models have a model of the world or are just faking it. If you come in thinking it's the former, you're rather quickly brought to your senses:
Do large language models (LLMs) have beliefs? And, if they do, how might we measure them?
It has so much stuff, but lbh I haven't actually read any of it.