PhD thesis defense of Arthur Amalvy – 12/09/2024 – Laboratoire Informatique d’Avignon

Thesis title: Natural Language Processing for the Representation of Narrative Texts through Character Networks

Date: 12/09/2024 – 9 AM

Place: CERI’s Ada Lovelace amphitheater.

Abstract:

A character network represents characters as vertices in a graph, and their relationships as edges between them. In the case of literary works, such networks model a whole narrative using a single mathematical object. Depending on the needs, their edges can represent different types of interactions between characters: co-occurrence, conversation, direct action… Additionally, the temporal changes in the relationships between characters can be modeled with dynamic networks. Thanks to this flexibility, character networks have been used to tackle a number of tasks, such as literary genre classification, story segmentation, recommendation or summarization. Manually extracting these networks is costly, which is why many researchers are interested in automating the process. This, in turn, requires solving different Natural Language Processing (NLP) tasks such as Named Entity Recognition (NER), coreference resolution or speaker attribution.

In this thesis, we present contributions to this automatic extraction process in the case of novels, as well as to character network applications. Inspired by the 2019 survey of Labatut and Bost that summarizes existing extraction efforts in a generic extraction framework, we propose Renard, a modular character network extraction pipeline that we release under a free license. We use Renard to better understand the performance of existing extraction pipelines by studying the impact of NER and coreference resolution errors on the quality of extracted networks. We find that both tasks’ performance is important to network quality and depends strongly on the novel. In the case of coreference resolution, we also observe that different errors do not have the same impact: linking precision is particularly important when it comes to correctly detecting characters.

Additionally, we identify and work on two challenges of automatic character network extraction systems. The first one is the lack of literary data to train such systems. We tackle this challenge by 1) releasing a new literary dataset covering the NER and character unification tasks; and 2) proposing to use a NER data augmentation scheme, mention replacement, to alleviate the issue of unseen name styles in the case of cross-domain NER. The second challenge we identify is the limited range of transformer-based models, which can be detrimental to performance in some tasks. We propose to retrieve relevant context at the document level to mitigate the lack of information induced by that limited range, and show that it can increase performance for the NER task.

Finally, we present contributions to character network applications through two case studies. First, we leverage networks modeling different types of interactions (co-occurrence, mention and conversation) in an analysis of Alfred de Musset’s Lorenzaccio. By using community detection on a co-occurrence network, we identify subplots, quantify their relative importance and find interactions between them. Additionally, we propose a method to automatically detect conspiracies using our dynamic mention and conversational networks. Second, we propose to leverage character networks to perform narrative matching (i.e. the task of matching corresponding narrative units) on three different adaptations of George R. R. Martin’s A Song of Ice and Fire across media: the original novels, the comics directly adapted from these and the HBO TV show. Our results show that network-based methods can outperform existing text-based ones, and can even be combined with them to increase performance. We also highlight the importance of working on commensurate narrative units. In these two case studies, we leverage dynamic networks and show their value, despite their relative underuse in the character network literature.

The jury will be composed of:

Claire GARDENT, Research Director, CNRS/LORIA, Lorraine University, Reviewer

Christophe CERISARA, Research Fellow, CNRS/LORIA, Lorraine University, Reviewer

Farah BENAMARA, Professor, CNRS/IRIT, Paul Sabatier University, Examiner

David BAMMAN, Associate Professor, School of Information, UC Berkeley, Examiner

Vincent LABATUT, Associate Professor, LIA, Avignon University, Thesis director

Richard DUFOUR, Professor, LS2N, Université de Nantes, Thesis co-director