About the Data Tooling category


This working group will develop tools to gather text from the identified sources and process it so that it is both easy to use at training time and respectful of data subjects’ rights. This includes tools for crawling, automatic detection and de-identification of personal information (PI), documentation of web content, efficient web text formats that retain website and page structure, and extraction of text from audio or PDF files, as well as infrastructure for securely maintaining and dispensing the data for training.

Resources – onboarding – documentation

Entry document: Dataset Org: from Data Sources to Training Dataset

Current members

Benoit Dal Ferro, Xavier Tannier, Halil Akin, Julien Launay, Pasquale Minervini, Antoine Simoulin, Shubham Agarwal, M Saiful Bari, Yozh, Yacine Jernite, Max Ryabinin, Wietse de Vries, Hady Elsahar, Manan Dey, Sampo Pyysalo, Veronika Laippala, Archiki Prasad

Current chair(s)

Colin Raffel