About BigScience

Origin

The “BigScience” project originated in early 2021 in discussions between the Hugging Face open-science team, the directors of the French Jean Zay supercomputer, and members of the French NLP academic and industrial research communities.

The acceleration in Artificial Intelligence (AI) and Natural Language Processing (NLP) will have a fundamental impact on society, as these technologies are at the core of the tools we use on a daily basis. In NLP, a considerable part of this progress currently stems from increasingly large language models trained on increasingly large quantities of text.

Unfortunately, the resources necessary to create the best-performing models are found almost exclusively at the Big 5 US technology giants (Google, Apple, Facebook, Amazon, Microsoft). This stranglehold on a transformative technology poses problems from research-advancement, environmental, ethical, and societal perspectives.

The BigScience project aims to demonstrate another way of creating, studying, and sharing large language models and large research artifacts in general within the AI/NLP research communities.

This project takes inspiration from scientific creation schemes existing in other scientific fields, such as CERN and the LHC in particle physics, in which open scientific collaborations facilitate the creation of large-scale artifacts useful for the entire research community.

Gathering a much larger research community around the creation of these artifacts makes it possible to consider in advance the many research questions (capabilities, limitations, potential improvements, bias, ethics, environmental impact, general AI/cognitive research landscape) that will be interesting to answer with the created artifacts, and to reflect on and prepare the tools needed to answer as many of these questions as possible.

Hugging Face, at the origin of the project, develops open-source research tools that are widely used in the NLP community. Very early in the process, this project attracted more than thirty French partners around the French public compute facility Jean Zay, quickly involving about a hundred people from French academic laboratories, startups/SMEs, and larger industrial groups (see the detailed list of the founding members).

It was also clear from the inception of the project that, to propose a more inclusive way to conduct open research on these artifacts, including the much wider international community was a sine qua non. Following the first proposal, the project therefore opened up and extended to an international research community interested in studying and better understanding the many research questions surrounding large language models, as well as the challenges around creating and sharing such models and datasets for research purposes.

In the end, it is the deep belief of the founding members that the project’s success will ultimately be measured by its long-term impact on the field of AI. Beyond the research artifacts created and shared, this project aims to bring together all the skills, conditions, and lessons that will enable future experiments in large-scale scientific collaboration.

A one-year workshop

The collaboration is organized as a One-Year Workshop on Large Language Models for Research: the “Summer of Language Models 21” 🌸

The workshop will:

  • be conducted online over one year, from May 2021 to May 2022
  • include live events spread over the year (online for the first, possibly in-person for later ones), with at least an opening and a closing live event
  • involve a set of collaborative tasks conducted throughout the year and aimed at creating, sharing, and evaluating a large multilingual dataset and a large language model as tools for research

This workshop will foster discussions and reflections around the research questions surrounding large language models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, role in the general AI/cognitive research landscape) as well as the challenges around creating and sharing such models and datasets for research purposes and among the research community.

The collaborative tasks are quite large, since they involve several million GPU hours on a supercomputer.

If successful, this workshop could be repeated in the future with an updated or different set of collaborative tasks.

Outcomes

The outcomes of the workshop are expected to be:

  • Fostering discussion around
    • the research questions surrounding these models (capabilities, limitations, bias, ethics, environment, general role/interest)
    • using such workshops as a useful means of collaborative research (future editions)
    • community-wide best practices for the creation, curation, and maintenance of the datasets that enable these models (covering technical, societal, and legal aspects)
  • Research publications from the participants comprising:
    • At least one publication involving the whole set of participants
    • Several more focused publications involving smaller author groups working on the collaborative tasks (e.g. on the dataset, on the modeling approach, on several aspects of the evaluation)
    • These publications could possibly be gathered in a special edition of the proceedings of the ACL or PMLR (or another publishing venue)
  • Sharing several artifacts created during the collaborative tasks as tools for the research community:
    • A Multilingual Dataset for Research (possibly fully open access, possibly behind an authentication access portal to comply with GDPR and data privacy legislation => to be defined by the organization committee in charge of this artifact)
    • A Large Language Model for Research (possibly fully open access, possibly behind a researcher’s authentication access portal like the dataset => to be defined by the organization committee in charge of this artifact)
    • Software tools created to complete the shared tasks, e.g. dataset filtering, model off-loading, etc. (under a very permissive license, e.g. Apache 2.0 => to be defined by the organization committee in charge of this artifact)
    • Extensive documentation of the evaluation results, protocols, approach, datasets, and tools developed in the course of the project, for use in future editions

Roles - participation

You can generally participate in the project as:

  • Advisor (Steering Committee member):
    • role: give general scientific/organizational advice
    • time commitment: light - reading a newsletter every 2 weeks - giving feedback/advice
  • Participant in a Working Group (Organizing Committee member)
    • role: joining one of the working groups of the OC (see list below): advising/designing/building the collaborative task (building the dataset/model/tools) or advising/designing/organizing the live events
    • time commitment: medium - depends on the chosen task (see details of the working groups below)
  • Chair/co-chair of a Working Group (Organizing Committee member)
    • role: the chair(s) are expected to provide at least the minimal amount of work necessary to produce a bare-bones version of the task. If other members are active in the WG, the chair(s) can mostly coordinate the effort and organize the decision process.
    • time commitment: more significant - also depends on the chosen WG
  • Workshop attendee joining the collaborative task or live events (later on)
    • role: participating in the collaborative task in a guided way, following guidelines set up by the OC (helping build the dataset, helping build the tools)
    • time commitment: up to the attendee - following guidelines of the OC/WG

Find more information about BigScience here.
Find more information about the Working Groups here.