About the Dataset – General Discussion category

The One-Year Workshop on Large Language Models for Research, or “Summer of Language Models 21 🌸”, aims to enable the study of phenomena associated with Large Language Models (LLMs) by gathering a large and diverse research community to train an LLM collaboratively and subsequently make it available for research purposes. Among the prerequisites for succeeding in this endeavor, the need for a training corpus of sufficient size and quality stands out: the Data Section of the workshop specifies how we intend to create a dataset that addresses this requirement.

This dataset, by its size and nature, calls for a significant effort to follow established best practices in data creation and curation. Further, the magnitude of recent changes in the techniques, legislation, and application scope of data-driven methods suggests that some of these standards and protocols may need to evolve to meet new needs. In particular, the last decade has seen two significant shifts. First, whereas data-driven language technology was previously employed mainly for automatic translation, its applications are becoming ubiquitous, so training data issues now have the potential to cause much more significant harm than they once did. Second, the recent trend of pre-training “general-purpose” models as a first step represents a sharp turn away from traditional NLP datasets, which were curated to fit a specific use case, and raises important questions about how we define this notion of generality.

To address these new challenges, the workshop invites proposals and collaborations in the following categories:

  • First, multidisciplinary work that combines insights from machine learning, sociology, linguistics, library science, and law to illustrate the range of questions at play and help give a comprehensive picture of the state of the art in responsible data curation practices.
  • Second, work that builds on these insights to provide concrete recommendations for the data curation process, considering both the specificities of speech and language data and the unique needs of a heterogeneous aggregated dataset.
  • Third, tools and processes to implement these recommendations and gather language data from diverse sources in order to build and document the aggregated dataset that we will use to train the final model(s) for the workshop.

Go to data working groups

Find more information about the Dataset Creation working group here.
WG Dataset Creation Chair: Yacine Jernite