Question from Dory: Did I understand correctly that the project focuses on 'only' 8 languages?


Join the discussion on defining language varieties!

We are starting with 8 language “groups”, which include, for example, Arabic with all its forms of diglossia, or the Bantu languages, which in a language taxonomy sit at the same level as, e.g., the Romance languages. The initial choice was motivated by balancing the number of speakers against geographical coverage, but it will likely evolve during the project.

One thing to note, however, is that we are targeting an intentional multilingualism where we commit to spending significant time on any language that we want to cover - in contrast to choosing multilingual sources a priori and then figuring out what languages are in there.

I truly appreciate the choice to include several varieties for the same language, as opposed to a “standard” variety. However, I believe that the criteria adopted for the current selection of languages are a bit inconsistent. Consider that:

  • Currently, only 4 families are included (one of which, Indo-European, is over-represented).
  • If the sample is based on areal diversity, South-East Asia and Papunesia should also appear. Moreover, within each current region, majority languages have been preferred over indigenous languages.
  • If the sample is based on the size of the speakers’ community, Japanese, Indonesian, etc. should also be selected, as each has more speakers than any Bantu language does.

While there is a trade-off between maximising diversity (in terms of area, typology, and family) and number of speakers, I would suggest that we choose a consistent set of criteria to select the languages (or at least, a principled way to reach a compromise between the two sets of criteria). This obviously depends mainly on the high-level goals of the BigScience project.

In case it may be useful, some relevant work on quantifying the diversity of language samples can be found here and here.


As mentioned above, the language variants working group should definitely explore adding any of these to the set of languages! Some of the next contenders in addition to Japanese and Indonesian, as you mentioned, would likely be Bengali, Russian, and Turkish.

To explain a little more, we started with a selection of 8 languages to leave room for growth. Choosing the 8 most spoken languages would have included Bengali and Russian but excluded any African languages, so we swapped those two out: first for Swahili (which then became the family of Bantu languages after discussion with native speakers), and for Portuguese, as a language that is spoken natively on four continents.

It is hard to find a consistent set of criteria that addresses all of the needs of the project, including all of its social ambitions, so there will be some “arbitrariness”; but the current choice did grow from extensive conversations, which should obviously continue. Feel free to add yourself as a member of the Working Group if you want to be part of those going forward!