Category Topics

BigScience

The “BigScience” project originated in early 2021 in discussions between the HuggingFace open-science team, the directors of the French Jean Zay supercomputer, and members of the French NLP academic and industrial research communities.

BigScience – Kickoff event

During our BigScience kick-off event on April 28 we will introduce the project, its origins, philosophy, and goals as a collaborative open-science effort in large-scale language modeling. We will present the different working groups, such as dataset creation, modeling, and evaluation, and share how you can contribute and be part of the effort. In this thread we will collect some of the most upvoted audience questions and answer them for future reference.

Modeling approach

Very large models trained on very large datasets display emergent prompting and priming behaviors: by conditioning on a task description and/or priming example pairs, the model generates an output for a given input and achieves non-trivial performance on a diversity of benchmarks (cf. GPT-3). However, these behaviors feel somewhat accidental: when training a large enough left-to-right (a.k.a. causal) language model on enough text, the model learns non-trivial “skills” that are secondary to the language-modeling objective and for which supervised signal was thought to be necessary. Moreover, prompting and priming remain a fickle art because of a model’s brittleness to the format of the prompt and examples (AutoPrompt, How can we know what LMs know?, What makes good in-context examples for GPT-3?). We want to make these prompting abilities explicit and robust rather than relying solely on poorly understood emergent behaviors, by providing as much training signal that resembles prompting as possible.
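To make the conditioning scheme concrete, here is a minimal sketch of how a few-shot (“priming”) prompt is typically assembled from a task description and input/output example pairs before being fed to a causal language model. All names and the `Input:`/`Output:` template are illustrative assumptions, not a format defined by the project.

```python
# Hypothetical few-shot prompt construction; the template below is one
# common convention, not the project's actual prompt format.

def build_prompt(task_description, examples, query, sep="\n"):
    """Concatenate a task description, priming example pairs, and the
    new input whose completion the model should generate."""
    lines = [task_description]
    for x, y in examples:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
    # The prompt ends mid-pattern so the model continues with the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return sep.join(lines)

examples = [("great movie!", "positive"), ("dull and slow", "negative")]
prompt = build_prompt("Classify the sentiment of each review.",
                      examples, "a real gem")
print(prompt)
```

The brittleness discussed above shows up exactly here: changing `sep`, the field labels, or the order of the examples can noticeably shift the model's predictions.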

Dataset – General Discussion

The One-Year Workshop on Large Language Models for Research, or “Summer of Language Models 21”, aims to enable the study of phenomena associated with Large Language Models (LLMs) by gathering a large and diverse research community to train an LLM collaboratively and subsequently making it available for research purposes. Among the prerequisites for succeeding in this endeavor, the need for a training corpus of sufficient size and quality stands out: the Data Section of the workshop specifies how we intend to create a dataset that addresses this requirement.

Site Feedback

Discussion about this site, its organization, how it works, and how we can improve it.


Data Governance and Archival Strategies

This working group is tasked with exploring questions about how data is gathered and managed, choosing the right metadata, indexing, and documentation structure, and developing protocols to ensure that data is used in a way that respects the rights of, and benefits, the data subjects.

Data Sourcing and Representativeness

This working group is responsible for helping define language choices and local and global representativeness criteria, analysing the diversity of existing text sources for each region in terms of social contexts represented, and finding diverse sources of text to meet these criteria, including both online and offline text in all available media.

Data Tooling

This working group will develop tools to gather text from the identified sources and process it to be both easy to use at training time and respectful of the data subjects’ rights. This includes tools for crawling, automatic personal information (PI) detection and de-identification, documentation of web content, efficient web text formats that retain the website and page structure, tools for extracting text from audio or PDF files, as well as infrastructure for securely maintaining and serving the data for training.
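As a toy illustration of the de-identification step mentioned above, here is a rule-based sketch that detects and redacts a couple of personal-information patterns. Real pipelines rely on trained entity recognizers and far more robust patterns; the regexes and placeholder tags below are illustrative assumptions only.

```python
import re

# Illustrative patterns only; production PI detection uses trained models
# and much more careful matching than these toy regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def deidentify(text):
    """Replace each detected span with a typed placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(deidentify("Contact jane.doe@example.org or +1 (555) 123-4567."))
```

Replacing spans with typed placeholders (rather than deleting them) keeps the surrounding text usable for training while documenting what was removed.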