PhD Defense of Elvys LINHARES PONTES: Compressive Cross-Language Text Summarization

The defense will take place at the Laboratoire Informatique d'Avignon on 30/11/2018 at 14:30, Amphi Blaise Pascal (Agroparc campus).


Title: Compressive Cross-Language Text Summarization

Jury members:

Mme Marie-Francine MOENS, Professor, LIIR, Heverlee, Reviewer
M. Antoine DOUCET, Professor, L3i, La Rochelle, Reviewer
M. Frédéric BECHET, Professor, LIS, Marseille, Examiner
M. Guy LAPALME, Professor, DIRO, Montréal, Examiner
Mme Fatiha SADAT, Professor, GDAC, Montréal, Examiner
M. Petko VALTCHEV, Professor, GDAC, Montréal, Examiner
M. Florian BOUDIN, Associate Professor, LS2N, Nantes, Examiner
M. Juan-Manuel TORRES-MORENO, Associate Professor (HDR), LIA, Avignon, Supervisor
M. Stéphane HUET, Associate Professor, LIA, Avignon, Co-supervisor
Mme Andréa Carneiro LINHARES, Associate Professor, UFC, Fortaleza, Co-supervisor

Abstract:

The popularization of social networks and digital documents has rapidly increased the amount of information available on the Internet. However, this huge amount of data cannot be analyzed manually. Natural Language Processing (NLP) studies the interactions between computers and human languages in order to process and analyze natural language data. NLP techniques incorporate a variety of methods, drawing on linguistics, semantics, and statistics, to extract entities and relationships and to understand a document. Among the many NLP applications, this thesis focuses on Cross-Language Text Summarization (CLTS), which produces a summary in a language different from that of the source documents. We also analyze other NLP tasks (word embedding representations, semantic similarity, multi-sentence compression, and text summarization) in order to generate more stable and informative cross-lingual summaries.

Most NLP applications, including text summarization, rely on some similarity measure to analyze and/or compare the meaning of words, chunks, sentences, and texts. One way to analyze this similarity is to generate a representation of sentences that captures their meaning. The meaning of a sentence is determined by several elements, such as the context of words and expressions, word order, and prior information. Simple metrics, such as cosine similarity and Euclidean distance, provide a measure of similarity between two sentences; however, they do not account for word order or multi-word expressions. To address these problems, we propose a neural network model that combines recurrent and convolutional neural networks to estimate the semantic similarity of a pair of sentences (or texts) based on the local and general contexts of words. Our model predicts more accurate similarity scores than the baselines by better capturing the local and general meanings of words and multi-word expressions.

In order to remove redundant and irrelevant information from similar sentences, we propose a multi-sentence compression (MSC) method that fuses similar sentences into short, grammatical compressions that retain their main information. We model clusters of similar sentences as word graphs. Then, we apply an integer linear programming (ILP) model that guides the compression of these clusters with a list of keywords: we look for a path in the word graph that has good cohesion and contains as many keywords as possible. Our approach outperforms the baselines by generating more informative and grammatical compressions for French, Portuguese, and Spanish.
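The word-graph idea can be sketched as follows. This toy version (function names hypothetical) merges identical words across sentences into shared nodes and extracts the shortest start-to-end path as the fused compression; the thesis instead selects the path with a keyword-guided ILP, which this sketch omits:

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Merge similar sentences into a word graph: nodes are words, edges
    connect adjacent words across all sentences (identical words share a node)."""
    graph = defaultdict(set)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            graph[a].add(b)
    return graph

def shortest_compression(graph):
    """BFS for the shortest <s>..</s> path, i.e. the shortest fused sentence."""
    queue = [["<s>"]]
    while queue:
        path = queue.pop(0)
        if path[-1] == "</s>":
            return " ".join(path[1:-1])
        for nxt in sorted(graph[path[-1]]):
            if nxt not in path:  # avoid cycles in this simple sketch
                queue.append(path + [nxt])

sents = ["officials approved the budget plan yesterday",
         "officials approved the plan"]
print(shortest_compression(build_word_graph(sents)))
# → "officials approved the plan"
```

Replacing the shortest-path criterion with an ILP objective over cohesion and keyword coverage is what lets the method prefer informative paths over merely short ones.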

Finally, we combine these methods to build a cross-lingual text summarization system. Our system is an {English, French, Portuguese, Spanish}-to-{English, French} framework that analyzes the information in both languages to identify the most relevant sentences. Inspired by compressive text summarization methods in the monolingual setting, we adapt our multi-sentence compression method to this problem to keep only the main information. Using MSC, our system proves to be a good alternative for compressing redundant information while preserving relevant information. It improves ROUGE scores and significantly outperforms state-of-the-art extractive baselines for all these languages. In addition, we analyze the cross-lingual summarization of transcript documents. Our approach achieves better and more stable ROUGE scores even for these documents, which contain grammatical errors and missing information.
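The summaries above are evaluated with ROUGE. As a reference point, a minimal sketch (function name hypothetical) of ROUGE-1 recall, i.e. the clipped unigram overlap between a candidate summary and a reference:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams that appear in the
    candidate, with counts clipped as in the standard definition."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

# 4 of the 6 reference unigrams appear in the candidate.
print(rouge1_recall("the cat sat on the mat", "the cat lay on a mat"))
```

Full ROUGE also reports bigram (ROUGE-2) and longest-common-subsequence (ROUGE-L) variants, which are more sensitive to word order than this unigram sketch.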



Friday, 30 November 2018, 14:30 to 16:30

Laboratoire Informatique d'Avignon

Université d'Avignon et des Pays de Vaucluse
339 chemin des Meinajaries, Agroparc BP 91228, 84911 Avignon cedex 9
+33 (0)4 90 84 35 00