This working group helps define language choices and criteria for local and global representativeness, analyses the diversity of existing text sources in each region in terms of the social contexts they represent, and finds diverse text sources that meet these criteria, covering both online and offline text in all available media.
- Defining frameworks for analysing representativeness and diversity across regions
- Exploring different modes of data collection, from web crawling to participatory methods and collaboration with existing data organizations
- Choosing languages, defining language varieties, and identifying relevant regions for each of those
- Identifying diverse text sources for each region and language variety
Entry document: Dataset Org: Data Sourcing and Representativeness
Karën Fort, Sam Bowman, Halil Akin, Caiming Xiong, Guillaume Klein, Samson Tan, Myle Ott, Philippe Muller, Ruiqi Zhong, M Saiful Bari, Luke Zettlemoyer, Yacine Jernite, Wietse de Vries, Max Ryabinin, Antoine Neuraz, Tsvetomila Mihaylova, Hady Elsahar, Manan Dey, Minh Quang Pham, Jin Koay, Ari Jankelowitz, Edoardo M. Ponti
Angie McMillan-Major, Pedro Ortiz, Zeerak Waseem