The Tow Center for Digital Journalism and investigative journalist and professor Bette Dam are working together on a new project, UNHEARD, which aims to help news organizations reveal potentially overlooked narratives by using AI to audit who is quoted in their articles.
Many still remember the 2003 media debacle in the coverage of the US invasion of Iraq, which revealed a failure of the Western press to critically evaluate official US government narratives about the justification for the war. In the aftermath of the invasion, the New York Times, for instance, issued a mea culpa about its flawed reporting. One culprit: an overreliance on US government sources and a lack of source diversity.
Professor Dam witnessed this dynamic firsthand in the context of reporting around the war in Afghanistan. She lived and worked in the country for fifteen years and wrote two books examining the influence of Western powers in the region, including how their narratives, often reflected in the American press, contributed to the continuation of the war. Upon her return to Europe, Dam pursued a Ph.D. looking at the media's coverage of the war. She collected all of the articles in the New York Times and the Associated Press about the war in Afghanistan and, with the help of her team, manually identified every quote and source from a small sample of the corpus. Her preliminary findings from her analysis of just under fifteen hundred articles suggest that more than 60 percent of sources used by the AP and about half the sources used by the Times over twenty years of Afghanistan coverage were American officials; the rest, she says, were predominantly Afghan officials who were advancing the Western narrative. But these results come from a sample of a larger dataset of over seventy thousand news articles from the two news organizations.
With financial support from the Pulitzer Center, the Tow Center is helping Dam use AI tools to expand this analysis to the full dataset. If successful, the methods we are developing can be extended to help news organizations audit their own coverage and to conduct more general media-wide analysis about who gets quoted in news about particular topics. In addition to a critical evaluation of the past, an approach that uses computational methods offers the promise of real-time source audits, allowing news organizations to analyze reporting, identify potential imbalances, and suggest alternative sets of sources to improve accuracy as a story unfolds.
We are not the first to attempt computational source detection. In 2021, The Guardian used a deep-learning pipeline trained on manually labeled data to extract quotes from articles. It reported that the tool achieved approximately 89 percent accuracy in identifying quotes and sources in spans of text where the speaker is identified within the text block. But it had its limitations. The Guardian's solution initially lacked a coreference resolver, meaning that it couldn't identify the source of a quote who was referred to in a sentence as "he" or "she," as opposed to by name, as is often done on second or third reference in a story. The Guardian went on to implement a coreference resolver, but its experiments were closed-source and custom tailored to the paper's archive, and the accuracy of the end-to-end task has not yet been reported. In 2023, the BBC released its Citron quote extraction system as open source. Citron sought to tackle the same quote extraction problem as this project and features a coreference resolver. But its overall success rate in classifying quotations and sources, and resolving coreferences, was only around 60 percent.
But with the introduction of commercially available large language models (LLMs), the technology that powers popular chatbots like OpenAI's ChatGPT, we believe the technology has finally caught up to the task. While LLMs are known to "hallucinate" (i.e., fabricate information out of thin air), computational methods can be used to guard against this. For example, we can write code that checks to ensure that extracted text from an article does, in fact, exist verbatim in the original article. In the paper "Testing Generative AI for Source Audits in Student-Produced Local News," Rahul Bhargava and a team at Northeastern University recently reported promising results using GPT-3.5 Turbo to extract quotes and their sources from a small sample of student-produced news.
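A verbatim check of that sort takes only a few lines. The sketch below is our illustration of the general idea, not the project's actual code; it normalizes whitespace, case, and curly versus straight quotation marks before testing whether an extracted quote really appears in the article text:

```python
import re

def normalize(text: str) -> str:
    """Standardize curly quotes and collapse whitespace for comparison."""
    text = (text.replace("\u201c", '"').replace("\u201d", '"')
                .replace("\u2018", "'").replace("\u2019", "'"))
    return re.sub(r"\s+", " ", text).strip().lower()

def appears_verbatim(quote: str, article: str) -> bool:
    """Return True if the extracted quote occurs verbatim in the article,
    ignoring differences in whitespace, case, and quote style."""
    return normalize(quote) in normalize(article)
```

Any extraction that fails this check can be flagged for human review rather than trusted, which is one simple way to contain hallucinated quotes. Paraphrased quotes would need a looser comparison, since they do not appear word for word in the text.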
In a preliminary experiment, the Tow Center tested how well ChatGPT can identify the quotes manually labeled by Dam as part of her Ph.D. We chose twenty articles at random from her sample: ten from the New York Times and ten from the Associated Press. All articles pertained to the war in Afghanistan and were published between 2002 and 2003. We wrote a prompt meant to capture as many quotes as possible from the dataset, including the speakers and whether the quote was paraphrased (e.g., "Secretary of State Colin L. Powell says Iran has been generally helpful in the war in Afghanistan") or direct (e.g., Secretary of State Colin L. Powell said, "Iran has been generally helpful in the war in Afghanistan"). We came up with the prompt below, which is partly inspired by one developed by Bhargava et al.
You are a research assistant who will go through the news article that I give you and identify all quotes and speakers. I will input text, and you will extract both direct and paraphrased quotes. Please identify the full name of the speaker whenever possible. If a name is not provided, identify the speaker by their title or group (e.g., "Military officials," "a spokesperson"). If the speaker is an organization, media outlet, or governmental entity, include it as the speaker. If the speaker is truly unknown, use "Unidentified."
However, we took a more expansive definition of what counts as a "quote" than Bhargava et al. Our prompt also included the following:
In addition to standard direct and paraphrased quotes, also extract instances where the article summarizes a group or person’s statements or descriptions. If a sentence describes what a group or person said, claimed, or described, capture it as a paraphrased quote.
The output should be a CSV file with three columns: "Speaker," "Quotes," and "Direct or Paraphrased." Each quote gets its own row, and the corresponding speaker is listed. If a quote is paraphrased or summarized, enclose the full sentence in brackets and mark it as "Paraphrased" in the third column.
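Because the prompt pins down a machine-readable output format, the model's response can be validated programmatically before any analysis. The sketch below is a hedged illustration of that step (our own, not the project's pipeline); it parses a response with Python's csv module and rejects anything that deviates from the three-column contract:

```python
import csv
import io

EXPECTED_HEADER = ["Speaker", "Quotes", "Direct or Paraphrased"]

def parse_quote_csv(raw: str) -> list[dict]:
    """Parse the model's CSV response into rows, raising ValueError if the
    header or the Direct/Paraphrased label deviates from the expected format."""
    rows = list(csv.reader(io.StringIO(raw.strip())))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {rows[:1]}")
    parsed = []
    for speaker, quote, kind in rows[1:]:
        if kind not in ("Direct", "Paraphrased"):
            raise ValueError(f"unexpected label: {kind!r}")
        parsed.append({"speaker": speaker, "quote": quote, "kind": kind})
    return parsed
```

Strict validation like this matters with LLM output, which does not always follow formatting instructions: a malformed response can be caught and regenerated instead of silently corrupting the dataset.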
The results from our small preliminary experiment are promising as well. ChatGPT identified 230 of the 251 quotes that Dam had manually labeled in our sample of twenty articles. We noticed patterns in the quotes that ChatGPT consistently missed: they were often longer, more convoluted, and came from unknown sources, documents, or organizations. For instance, ChatGPT missed the following quote, in which a reporter paraphrased an announcement from an Afghan media organization: "Government-run Kabul TV announced that Taj Mohammad Wardak, a former governor of several other provinces, was being appointed in Paktia." Unsurprisingly, the missed quotes were often paraphrased rather than direct; direct quotes are easier to identify because of the quotation marks that surround them. The results, while not perfect, show that language models such as GPT can go a long way in helping researchers and journalists detect bias in news sourcing.
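Scores like these amount to recall: the fraction of manually labeled quotes that the model also found. A minimal sketch of the computation, with hypothetical example data rather than the project's real labels:

```python
import re

def norm(text: str) -> str:
    """Normalize a quote for comparison: lowercase, collapse whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()

def recall(labeled: list[str], extracted: list[str]) -> float:
    """Fraction of manually labeled quotes the model also extracted."""
    found = {norm(q) for q in extracted}
    hits = sum(1 for q in labeled if norm(q) in found)
    return hits / len(labeled) if labeled else 0.0
```

By this measure, finding 230 of 251 labeled quotes is a recall of roughly 92 percent. Exact string matching is a simplification, of course: in practice, paraphrased quotes may need fuzzy matching to count a model extraction and a human label as the same quote.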
You can find the data behind this small preliminary experiment on GitHub. As we test on a larger sample, and subsequently build our tool, we are committed to developing it open-source, and we welcome others to participate and contribute their insights. We hope that UNHEARD will be more than a tool: a movement bringing together professionals and others to engage in scientific, data-driven discussions on bias in journalism. Our goal is to foster an inspiring and constructive dialogue that enhances reporting standards. We have already begun inviting US media organizations.
If you're interested in receiving updates on the project, please sign up at unheard.news or drop us a line at editors@unheard.news.