Our ongoing NLP literature review

01/12/2024

The rise of misinformation and the increasing complexity of online narratives have created an urgent need for more sophisticated fact-checking tools. The PROMPT project aims to develop an integrated toolbox that leverages recent advances in artificial intelligence, especially in the NLP pipeline, while addressing the specific challenges faced by journalists working across multiple European languages. Thanks to its modular construction, it can be used alongside other fact-checking tools to complement them.

Here, we list the main scientific articles used so far in our research. While the list is not exhaustive, we believe these papers are good starting points for exploring academic work related to LLM-powered fact-checking. The research landscape informing our approach can be divided into several themes.

First, community analysis tells us whether content found online spreads organically. This bibliography contains several works that demonstrate how network analysis can reveal coordinated efforts to spread misinformation. This research shows that understanding the social dynamics and community structures behind information spread is crucial for effective fact-checking.
The second direction involves content analysis. Transformer-based language models have shown remarkable capabilities in understanding nuanced narratives. Moreover, newly released models have set new benchmarks in multilingual text comprehension. However, these models require careful consideration of linguistic and cultural contexts, in particular because of code-switching and the language variations found on social networks.
Third, for content truthfulness assessment, recent research has focused on combining large language models with external knowledge sources. Retrieval-augmented generation (RAG) has been particularly influential, showing how external knowledge can be effectively integrated with language models to improve factual accuracy. This approach can be extended, for instance by using knowledge graphs alongside traditional RAG. Nonetheless, how RAG interacts with multilingual, multimodal, and multi-platform content is an even more important question for our research.
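
To make the RAG idea above more concrete, here is a minimal sketch of its retrieval step: given a claim, find the most relevant passages in a small evidence collection and assemble them into a prompt for a language model. This is an illustration under stated assumptions, not the PROMPT implementation. The function names (`retrieve`, `build_prompt`) and the example passages are hypothetical, plain bag-of-words cosine similarity stands in for the dense or multilingual retrievers discussed in the literature, and no language model is actually called.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(claim, passages, k=2):
    """Return up to k passages ranked by similarity to the claim.

    Assumption: a toy lexical retriever; real systems typically use
    dense embeddings, which matter a lot in multilingual settings.
    """
    query = Counter(tokenize(claim))
    scored = [(cosine(query, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(claim, evidence):
    """Assemble a prompt that asks a model to judge the claim
    using only the retrieved evidence (hypothetical wording)."""
    lines = ["Evidence:"]
    lines += [f"- {passage}" for passage in evidence]
    lines += ["", f"Claim: {claim}", "Is the claim supported by the evidence?"]
    return "\n".join(lines)


if __name__ == "__main__":
    passages = [
        "The Eiffel Tower is located in Paris.",
        "Bananas are rich in potassium.",
        "Paris is the capital of France.",
    ]
    claim = "The Eiffel Tower is in Berlin."
    print(build_prompt(claim, retrieve(claim, passages)))
```

In a full pipeline, the assembled prompt would be sent to a language model, and the retriever would need to cope with claims and evidence written in different languages, which is where the multilingual question raised above comes in.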

This project synthesizes these research streams into a practical toolbox that assists fact-checkers with multiple metrics covering the themes above. In addition, the PROMPT project will provide a plain-text explanation of each assessment it produces, which raises several additional research questions about how much confidence one can place in LLMs as an exploratory tool.

From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models (Mendelsohn et al.), ACL 2023, https://aclanthology.org/2023.acl-long.845
Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models (Nikhil Sharma, Kenton Murray, Ziang Xiao), 2024, arXiv:2407.05502
BERTrend: Neural Topic Modeling for Emerging Trends Detection (Boutaleb et al.), FuturED 2024, https://aclanthology.org/2024.futured-1.1
DoRA: Weight-Decomposed Low-Rank Adaptation (Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen), ICML 2024, arXiv:2402.09353
Matryoshka Representation Learning (Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, Ali Farhadi), NeurIPS 2022, arXiv:2205.13147
Unveiling Global Narratives: A Multilingual Twitter Dataset of News Media on the Russo-Ukrainian Conflict (Sherzod Hakimov, Gullal S. Cheema), ICMR 2024, arXiv:2306.12886
Mapping News Narratives Using LLMs and Narrative-Structured Text Embeddings (Jan Elfes), 2024, arXiv:2409.06540
Retrieval-augmented generation in multilingual settings (Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina), 2024, arXiv:2407.01463
Native Language Identification in Texts: A Survey (Goswami et al.), NAACL 2024, https://aclanthology.org/2024.naacl-long.173

Don't hesitate to contact us on Bluesky or via the form below to let us know about any additional sources we should consider!