Community, Leadership, Experimentation, Diversity, & Education
Pittsburgh Arts, Regional Theatre, New Work, Producing, Copyright, Labor Unions,
New Products, Coping Skills, J-O-Bs...
Theatre industry news, University & School of Drama Announcements, plus occasional course support for
Carnegie Mellon School of Drama Faculty, Staff, Students, and Alumni.
CMU School of Drama
Thursday, September 25, 2025
Science journalists find ChatGPT is bad at summarizing scientific papers
Ars Technica: Summarizing complex scientific findings for a non-expert audience is one of the most important things a science journalist does from day to day. Generating summaries of complex writing has also been frequently mentioned as one of the best use cases for large language models (despite some prominent counterexamples).

10 comments:
I am certainly not surprised that ChatGPT is bad at accurately summarizing scientific papers. As mentioned in the article, previous studies have also shown the factual errors that LLMs make when summarizing or just generally providing information. An LLM works by using huge amounts of data to predict the most likely next word, so even when it is handed the facts, that underlying design comes through and results in errors. I think it is interesting that the author pointed out that the journalists doing the research may have been biased, since ChatGPT could one day put them out of their jobs. While there is likely some bias among the journalists, they are also the people most informed about what a good scientific paper summary should contain, and I don’t think that should make anyone doubt that they did their best to judge ChatGPT’s writing on its contents rather than on the fact that it was written by AI.
“ChatGPT bad at summarizing scientific papers”: ok, fork found in kitchen. Are we really surprised? All AI does is predict the most likely words in a sequence. If the scientific data is unusual or doesn’t follow the likeliest pattern, AI will mess it up. One thing that sticks with me is that AI apparently tends to overuse words like “groundbreaking,” which is something people do as well to grab attention. It also conflates correlation and causation (something that even renowned world leaders do nowadays!) and fails to provide context. It’s kind of scary how AI is just mimicking human misinformation, because comfortable clickbait and correlation=causation are appealing connections. We want to believe “groundbreaking” news and see what we believe to be evidence. I do really wonder what the human biases mean for this (obviously these people wouldn’t want AI doing their jobs), but to be honest the results are so incredibly decisive that I don’t know how much the bias is swaying things. It might help to have non-scientist academics read the summaries and rate them as well.
So when I first read this article, my immediate reaction to “Science journalists find ChatGPT is bad at summarizing scientific papers” was, well, no shit. To me, this seemed obvious; of course it sucks at summarizing scientific papers, that isn’t within the scope of the technology. But as I read this article, I started to think about why there is such a push for using ChatGPT and other large language models (LLMs) in academia. Something I have seen time and time again when talking about LLMs (especially with older individuals) is a fundamental misunderstanding of what an LLM actually does and how it operates. I’ll admit, on the surface it is a reasonable assumption that an LLM such as ChatGPT would excel at the task of making writing more accessible, and to a degree, it is really good at that. Something the article conveys, and also where I see LLMs tend to fall short, is conveying nuance. This makes sense: at its core an LLM is a hyper-sophisticated parrot. Sure, it has access to such an enormous vocabulary and volume of text that it can spit out phrases that seem coherent; however, much like a parrot, even if it can identify when a word is appropriate to use in context (a parrot will hear people referring to its food as seed and then repeat that), it has no idea what it is actually saying. Because of this, it is unfair to expect it to be able to convey nuance.
Wow, what a shocker. Who would have thought AI would be able to process something as complex as this, with a stupidly high number of sources and so much text, when it sometimes can’t even add two simple numbers or tell you the capital of a country? This isn’t really too surprising to me at all; AI is still growing and needs time to learn, and it’s simply not at the point where it can process long and detailed information like this. Will it get to that point eventually? Absolutely, it’s growing every day; however, for now it’s extremely unreliable, and this study only proves that further. I also think that if you tell an AI to “turn this research paper into something interesting to read and interpret,” it will 100% struggle, as it doesn’t have that human element, nor is it close to having it. I think it’s going to be interesting to see if this study still stands in a few years.
I feel like this shouldn’t come as a surprise to anyone. ChatGPT and other AIs seem like a cool idea and they seem to work when you first use them, but there just isn’t any depth to their answers. All they do is compile and regurgitate information that humans have already made, and it’s not even very good at that. In this sense, it’s frustrating that developers are putting AI into every app and website you go to and never providing a clear way to just shut it off. I don’t think there’s one app on my phone that doesn’t have AI features, and it’s even baked into the phone itself in the settings menu. It’s even more annoying that some people seem to not understand that it’s not trustworthy and use it for genuinely important things. On top of that, it’s burning through insane amounts of electricity and water, and its existing copyright problems with art still stand.
I’m not surprised that ChatGPT is bad at summarizing scientific papers, because I doubt it can really read into the depth of the information. It just predicts what should be said. I find this very ironic, because scientific papers should push you to ask questions, which ChatGPT, and AI in general, is really awful at doing. If the AI model struggles to ask questions, or can’t really gauge a good question because it is programmed to act as if it knows everything, why would it be good at answering in-depth questions either? It’s an interesting paradox. Science writers are supposed to take multiple things into account and think critically, and I think we’ve all found and observed that AI struggles to think critically because it is programmed to make us think in a certain way. There is zero way that it is not biased.
I would argue that even the non-scientific articles ChatGPT tries to summarize are not summarized very successfully. I think that in order to successfully summarize something, you need to have an understanding of the material you’re summarizing, the authorial intent behind the paper, and an understanding of how the words you say come across to other people. While I think that ChatGPT is good at understanding the material to a certain extent, it’s not so good at understanding the author’s perspective, or the way that individual words carry meaning beyond their dictionary definition. To be fair, that last one is something we all struggle with, but it’s especially important when it comes to scientific summaries, because even a slight change in meaning can really change how something is viewed. Nevertheless, I think this is proof that its best use is as a tool, and not as the ONLY tool.
I think making complicated papers easier to understand for people not in that field of knowledge is an interesting use of AI (though I still wouldn’t encourage it, because there are already people who do that for a living), but I had not thought of that use before. I know people often use it to summarize large articles, but that is using AI to shorten writing, which is not the same as making it easier to comprehend. While I don’t believe for a second that AI properly rewrote those scientific papers with the same nuance and style as a real journalist, I also think there’s somewhat of a flaw in only having the AAAS writers review their own papers, because of course they’re going to say it’s flawed; they’re literally stopping a robot from taking the job they worked hard to achieve. I do believe the journalists when they say AI messed things up in different ways, because AI is usually not good at those things, but the article could have had more merit if scientists not immediately affiliated with writing the articles had also assessed how the bot did. In the end, AI should not be destroying the environment to do things qualified people can already do for a living.
I thought this article was really eye-opening, because it showed how flawed AI can be when dealing with something as detailed as science papers. Journalists tested ChatGPT by asking it to summarize real research, and what they found was that it often cut corners to make the writing sound easier to read. In doing so, it would sometimes leave out critical details or even make claims that the actual studies did not support. The phrase they used was that it traded precision for clarity, which is not a good mix when you are talking about science, where the small details matter most. What stood out to me is that this can be really dangerous in areas like medicine or climate studies, where a single misrepresented detail can completely change how the information is understood. It made me think about how easy it is to trust something that sounds polished and confident when in reality it might be missing the very things that make the research valuable. I can see how AI can still be useful for a first pass, or just to get the general idea of a study, but you really need to check the source material yourself if you want to be accurate.
I’m not surprised that AI struggles to write summaries of very complex issues, and it’s interesting to see data that supports that thought. Using AI to summarize complex topics is something I feel happens frequently, but it continues to raise my skepticism about whether AI can truly be used as a learning tool. I’d be interested to see if other AI chatbots produce higher-quality work. With Gemini and other AI platforms that give users more control over reference data, feeding the same prompt into multiple chatbots and then reflecting on the similarities or differences could be an interesting way to take this further. Unfortunately, we’ll have to wait and see how future updates affect this aspect of AI. Will specialized chatbots be created for this purpose? Does ChatGPT have too wide a range of capabilities to truly excel at any one thing? I’m curious to see what the future holds.