aifaq.wtf

"How do you know about all this AI stuff?"
I just read tweets, buddy.

#summarization


Summarization is (Almost) Dead

#summarization   #link  

Okay this is bold:

we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs

While every other paper is like "oh boy yeah, LLMs have an awful hit rate for summarization." And yet:

As depicted in Table 1, human-written reference summaries exhibit either an equal or higher number of hallucinations compared to GPT-4 summaries. In specific tasks such as MultiNews and code summarization, human-written summaries exhibit notably inferior factual consistency.

But! Also! Looks like the big issue with human-written summaries was "their lack of fluency," which sounds like the AI stuff was just written better? Guess that's valuable, especially alongside the supposed higher factuality of LLM-generated content.

@lefthanddraft on April 09, 2024

#summarization   #hallucinations   #tweets  

The paper got another post

FABLES: Evaluating faithfulness and content selection in book-length summarization

#hallucinations   #summarization   #context   #context window   #link   #claude   #openai   #mixtral   #gpt-4  

An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.

What kinds of things are AI tools especially bad at?

Something about calling an AI's work "well-done" feels far more anthropomorphic than it should.

While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims

Of course this needs a link to my favorite hallucination leaderboard. This kind of evaluation is tough since it costs money to do it in a way that doesn't rely on LLMs to create and score the dataset. Which leads to...

Collecting human annotations on 26 books cost us $5.2K, demonstrating the difficulty of scaling our workflow to new domains and datasets.

$5k is somehow cost-prohibitive between UMass, Princeton, Adobe, and an AI institute? That... I don't know, seems like not very much money. I get that the expectation is this is "best" done for pennies, but if someone had to cough up $5k each year to repeat this with newly-unknown data, I don't think it would be the worst thing in the world.

Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.

Here are the omission types:

Reading SEC filings using LLMs | Hacker News

#summarization   #question and answer   #shortcomings and inflated expectations   #embeddings   #link  

The link itself is super boring, but the comments are great: a ton of people arguing about whether or not LLM-based question-and-answer over documents works at all (especially with SEC filings and other financial docs). The recurring complaints:

  • Shortcomings of text embeddings to return relevant documents (sketched below)
  • Inability of LLMs to actually figure out what's interesting
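
If you haven't seen the mechanics, here's roughly what that first complaint is pointing at: a minimal sketch of embed-and-retrieve, where the hashed bag-of-words embed() is a toy stand-in for a real embedding model (my invention, not anything from the thread):

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Toy stand-in for a real embedding model (hashed bag-of-words),
    # just so the sketch runs end to end. Real systems call an
    # embedding API here.
    vecs = np.zeros((len(texts), 512))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, hash(word) % 512] += 1.0
    return vecs

def top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Embed the chunks and the question, rank chunks by cosine
    # similarity, return the k nearest. The complaint above: "most
    # similar" is often not "most important."
    doc_vecs = embed(chunks)
    q_vec = embed([question])[0]
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```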

I think the largest issue with summarization/doc-based Q&A is that when reading, we as people bring a lot of knowledge to the table beyond just rearranging the words in a piece of text. What's talked about or mentioned the most is not always what's most important. One commenter, talking about a colleague using ChatGPT to summarize SEC filings:

The tidbit it missed, one of the most important ones at the time, was a huge multi-year contract given to a large investor in said company. To find it, including the honestly hilarious amount, one had to connect the disclosure of a not-specified contract to a named investor, the specifics of said contract (not mentioning the investor by name), the amount stated in some financial statement from the document and, here obviously ChatGPT failed completely, knowledge of what said investor (a pretty (in)famous company) specialized in. ChatGPT didn't even mention a single one of those data points.

...

In short, without some serious prompt work, and including additional data sources, I think ChatGPT is utterly useless in analyzing SEC filings; even worse, it can be outright misleading. Not that SEC filings are incredibly hard to read: some basic financial knowledge, and someone pointing out the highlights based on a basic understanding of how those filings are supposed to work, and you are there.

Another one lowers the hallucination rate and encourages human comprehension by converting a human prompt into code that is used to search the database and return the relevant info, instead of having the LLM read and report on the info itself.
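
That pattern is easy to picture. A minimal sketch, assuming a sqlite table of filings and a generic llm callable (both names are mine, not the commenter's): the model only writes the query, and the rows come back verbatim.

```python
import sqlite3

# Hypothetical schema for illustration; the commenter didn't share theirs.
SCHEMA_HINT = "filings(company TEXT, filed_on TEXT, form_type TEXT, item TEXT, amount REAL)"

def question_to_sql(llm, question: str) -> str:
    # `llm` stands in for whatever completion API you use. It is asked
    # for a query, never for an answer.
    prompt = (
        f"Given this SQLite table: {SCHEMA_HINT}\n"
        f"Write one SELECT statement that answers: {question}\n"
        "Return only the SQL."
    )
    return llm(prompt).strip()

def answer(llm, db_path: str, question: str) -> list[tuple]:
    sql = question_to_sql(llm, question)
    with sqlite3.connect(db_path) as conn:
        # The returned rows *are* the answer; there is no paraphrase
        # step for the model to hallucinate in. (You'd still want to
        # sandbox this: model-written SQL is untrusted input.)
        return conn.execute(sql).fetchall()
```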

I also love this one about a traditional approach, which draws attention to the when sometimes being an additional flag on top of the what:

They received SEC filings using a key red flag word filter into a shared Gmail account with special attention for filings done on Friday night or ahead of the holidays.

July 5, 2023: @hwchase17

#summarization   #langchain  

I feel like I waited years to understand the differences in how each type of document chain affects the output. And here we are!
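
For anyone else who waited: a minimal sketch of the three classic chain types, using the pre-0.1 langchain import paths (they have since moved around, so treat this as a snapshot rather than current API):

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
docs = [Document(page_content=chunk) for chunk in ["...", "...", "..."]]

# "stuff":      cram every chunk into one prompt; dies at the token limit.
# "map_reduce": summarize each chunk separately, then summarize the summaries.
# "refine":     summarize the first chunk, then revise that summary with
#               each subsequent chunk.
for chain_type in ("stuff", "map_reduce", "refine"):
    chain = load_summarize_chain(llm, chain_type=chain_type)
    print(chain_type, chain.run(docs))
```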

May 31, 2023: @ndiakopoulos

#journalism   #summarization   #actual work  

May 12, 2023: @mmitchell_ai

#summarization   #fact-checking  

The part I'll stress here is "without fiddling...[summarization] can go terribly wrong." We like to think summarizing things is easy – and it is, comparatively! – but give this a read. In a Danish newsroom experimenting with summarization, 41% of the auto-generated story summaries needed to be corrected before publication.

May 7, 2023: @timnitgebru

#hallucinations   #summarization   #alignment  

All I want in life is to read this opposite-of-the-argument summary! Things could have gone wrong in a few ways:

First, they pasted in the URL and said "what's this say?" Sometimes ChatGPT pretends it can read the web, even when it can't, and generates a summary based on what ideas it can pull out of the URL.

Second, it just hallucinated all to hell.

Third, ChatGPT is secretly aligned to support itself. Doubtful, but a great way to stay on the good side of Roko's Basilisk.

May 6, 2023: @jorisdejong4561

#summarization   #langchain   #limitations  

"Write me a summary" seems like an easy task for a language model, but there are a hundred and one ways to do this, each with their own strengths and weaknesses. Even within langchain!

If you're excited about summarization, be sure to read this to see how things might go wrong. With hallucinations, token limits, and other technical challenges, LLM-based summarization has a lot more gotchas than you'd think.
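
To make the token-limit gotcha concrete: a minimal sketch of budgeting chunks with tiktoken before any summarizing happens (the model name and limits here are assumptions for illustration):

```python
import tiktoken

MODEL = "gpt-3.5-turbo"
CONTEXT_LIMIT = 4096        # total tokens the model accepts
RESERVED_FOR_OUTPUT = 500   # leave room for the summary itself

def chunks_that_fit(text: str):
    # Split the document into pieces that each fit under the budget.
    # Each piece gets summarized on its own, and the per-piece
    # summaries then get summarized together (the map_reduce idea).
    enc = tiktoken.encoding_for_model(MODEL)
    tokens = enc.encode(text)
    budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
    for start in range(0, len(tokens), budget):
        yield enc.decode(tokens[start:start + budget])
```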