Datasets & Corpora – Laboratoire Informatique d’Avignon

BRÉF – Revised database of elected representatives in France

4 December 2023

The Revised Database of Elected Officials in France (BRÉF) is built from a primary source, the National Directory of Elected Officials (RNE), along with several secondary sources, including databases from the National Assembly, the Senate, and the European Parliament. The intention is to expand this database further by fully leveraging these secondary sources and, in the longer term, by integrating new databases and occasional contributions.

FOPPA – Open Database of French Public Procurement Award Notices

1 May 2023

The FOPPA database (French Open Public Procurement Award notices) was built as part of the ANR DeCoMaP project (ANR-19-CE38-0004). It contains public procurement award notices published in France from 2010 to 2020. It relies on a subset of the TED database (Tenders Electronic Daily, a supplement to the Official Journal of the EU). These data have a number of issues, the most serious being that the unique IDs of most of the agents involved are missing. We performed a number of operations to solve these issues and obtain a usable database. These operations and their outcomes are described in detail in the technical report listed below.

Production date: 2019–2024
Publicly available database: DOI 10.5281/zenodo.7433154
Source code used to build the database: https://github.com/CompNet/FoppaInit/
Technical report explaining the processing: Lucas Potin, Vincent Labatut, Rosa Figueiredo, Christine Largeron, Pierre-Henri Morand. FOPPA: A database of French Open Public Procurement Award notices. Technical Report, Avignon Université. 2022. ⟨hal-03796734⟩
Data paper describing the database (cite this paper if you use these data): Lucas Potin, Vincent Labatut, Pierre-Henri Morand, Christine Largeron. FOPPA: an open database of French public procurement award notices from 2010–2020. Scientific Data 10:303 (2023). DOI: 10.1038/s41597-023-02213-z ⟨hal-04101350⟩

Serial Speakers – Collection of Annotated TV Serials

4 December 2020

This dataset consists of three TV series with manual annotations. All three files are in .json format and contain annotated TV series data. Each TV series is defined by its name and contains seasons, defined by their ids. Every season is made up of episodes, defined by their ids, titles, durations and fps (frames per second). Each episode contains two basic kinds of data: scenes and speech segments. Scenes are defined by their starting points and are made up of shots (Season 1 only). A shot is defined by its starting and ending positions and by its recurring shot ids. Speech segments are defined by their starting and ending points, textual content (here encrypted for copyright reasons), speaker, and possible interlocutors.
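As a rough illustration of the nesting described above (series, seasons, episodes, then scenes and speech segments), the following Python sketch walks a small hand-made record and sums the speaking time of each speaker. The key names used here ("seasons", "episodes", "speech_segments", "start", "end", and so on) and all the values are assumptions made up for this example; the actual files may name and arrange their fields differently, so check them before reusing this code. A real file would first be loaded with json.load.

```python
# Hand-made record mimicking the described hierarchy.
# All key names and values below are hypothetical, not taken from the dataset.
sample = {
    "name": "Example Series",
    "seasons": [
        {
            "id": 1,
            "episodes": [
                {
                    "id": 1,
                    "title": "Pilot",
                    "duration": 3000.0,
                    "fps": 25.0,
                    "scenes": [
                        {
                            "start": 0.0,
                            "shots": [
                                {"start": 0.0, "end": 4.0, "recurring_shot_id": 0}
                            ],
                        }
                    ],
                    "speech_segments": [
                        {"start": 1.0, "end": 3.5, "text": "<encrypted>",
                         "speaker": "A", "interlocutors": ["B"]},
                        {"start": 4.0, "end": 5.5, "text": "<encrypted>",
                         "speaker": "B", "interlocutors": ["A"]},
                        {"start": 10.0, "end": 12.0, "text": "<encrypted>",
                         "speaker": "A", "interlocutors": []},
                    ],
                }
            ],
        }
    ],
}


def speaking_time(series):
    """Return total speech duration per speaker, in the units of start/end."""
    totals = {}
    for season in series["seasons"]:
        for episode in season["episodes"]:
            for segment in episode["speech_segments"]:
                speaker = segment["speaker"]
                length = segment["end"] - segment["start"]
                totals[speaker] = totals.get(speaker, 0.0) + length
    return totals


def count_shots(series):
    """Count shots over all scenes; episodes without shot data contribute 0."""
    return sum(
        len(scene.get("shots", []))
        for season in series["seasons"]
        for episode in season["episodes"]
        for scene in episode.get("scenes", [])
    )


print(speaking_time(sample))
print(count_shots(sample))
```

Because shots are only annotated for part of the data, the shot counter uses .get with an empty default rather than assuming the key is always present.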

WAC – Wikipedia Abusive Conversations

4 December 2020

This dataset contains conversations between Wikipedia editors, annotated at the message level for various types of abuse. It aligns two existing corpora: