One of my favorite challenges is judging whether a word is toxic or not. It's so reliant on context! Words like "Mexican" or "gay" can get a text flagged as offensive because they show up so often in slurs or otherwise-hateful content, even though they're also completely normal words.
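Here's a toy sketch of the kind of context-blind keyword check that causes that. The word list and example sentences are invented for illustration, not pulled from any real moderation system:

```python
# A toy keyword filter: exactly the kind of context-blind check that
# flags perfectly normal sentences. The word list and examples are
# made up for illustration.
FLAGGED_WORDS = {"mexican", "gay"}

def is_flagged(text: str) -> bool:
    # Strip basic punctuation, lowercase, and check for any flagged word
    words = {word.strip(".,!?").lower() for word in text.split()}
    return bool(words & FLAGGED_WORDS)

print(is_flagged("Found a great Mexican restaurant last night"))  # True
print(is_flagged("My brother and his husband are gay"))           # True
# Both get flagged, even though neither is remotely offensive.
```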
Your manager uses ChatGPT to generate the job posting, you use LangChain to apply for the job, HR's AI filters you out based on keywords, and then the government's automated system rejects your welfare benefits.
All I want in life is to read this opposite-of-the-argument summary! Things could have gone wrong in two ways:
First, they pasted in the URL and said "what's this say?" Sometimes ChatGPT pretends it can read the web, even when it can't, and generates a summary based on whatever words it can pull out of the URL itself.
Second, it just hallucinated all to hell.
Third, ChatGPT is secretly aligned to support itself. Doubtful, but a great way to stay on the good side of Roko's Basilisk.
The issue here is: what is a "language model" actually for? We can say "predicting the next word in a sequence of words," but that's just kicking the can down the road.
Most of the time it's pretty good at giving you facts, so where do you draw the line?
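For what it's worth, "predicting the next word" literally means scoring every possible next token and picking from the top of the list. Here's a minimal sketch using GPT-2 via the transformers library (an assumption on my part: we can't peek inside ChatGPT, so a small open model stands in, but the basic mechanic is the same):

```python
# A minimal sketch of "predicting the next word in a sequence of words,"
# using GPT-2 as a stand-in for models we can't inspect directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Turn the scores at the final position into probabilities for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Print the model's five favorite continuations and how likely it thinks they are
top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```

Whatever lands on top, the model isn't looking anything up. It's just ranking tokens by how plausible they look after the prompt, which is exactly why the "facts" question gets murky.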
"Write me a summary" seems like an easy task for a language model, but there are a hundred and one ways to do this, each with their own strengths and weaknesses. Even within langchain!
If you're excited about summarization, be sure to read this to see how things might go wrong. With hallucinations, token limits, and other technical challenges, LLM-based summarization has a lot more gotchas than you'd think.