Program Outline


This talk gives an in-depth look at Mozilla Common Voice’s Kiswahili work, an initiative to bring a vital East African language online and to make voice technology accessible to Kiswahili speakers. The talk will cover an introduction to the Mozilla Common Voice platform; community building, including how we have worked to address gender challenges and to include speakers of related dialects and variants of Kiswahili; the challenges and successes of building an open-source Kiswahili speech dataset; the research questions we have explored along the model development roadmap; and our efforts to disseminate and encourage the use of the resources created in this work.

Targeting is a central challenge in the design of anti-poverty programs: given available data, how does one rapidly identify the individuals and families with the greatest need? This talk will discuss recent uses of machine learning, applied to non-traditional data from satellites and mobile phones, in the targeting of anti-poverty programs. It draws on results from several field-based projects -- in Togo, Afghanistan, Nigeria, and Kenya -- that illustrate the promise, as well as some of the potential pitfalls, of this new approach to targeting. Collectively, the results highlight the potential for new data sources to improve humanitarian response in low-resource settings, particularly during crises and when traditional data are missing or out of date.

Link to the papers.

Can Strategic Data Collection Improve the Performance of Poverty Prediction Models? Satej Soman, Emily L. Aiken, Esther Rolf, Joshua Blumenstock

Link to the papers.

MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training. Krishnateja Killamsetty, Alexandre Evfimievski, Tejaswini Pedapati, Kiran Kate, Lucian Popa, Rishabh Iyer

Diversity is important for many areas of machine learning, including generative modeling, reinforcement learning, active learning, and dataset curation. Yet, little effort has gone into formalizing and understanding how to effectively measure or enforce diversity. This talk will describe the Vendi Score, a new metric for measuring diversity that connects and extends ideas from ecology and quantum mechanics. The Vendi Score is defined as the exponential of the Shannon entropy of the eigenvalues of a user-defined similarity matrix. It is general in that (1) it can be applied to any domain where similarity can be defined and (2) it doesn't require defining a probability distribution over the collection to be evaluated for diversity. The Vendi Score can therefore be used to measure the diversity of datasets, samples from a generative model, outputs from decoding algorithms, or any collection for which we want to assess diversity. We will showcase the Vendi Score as a diversity evaluation metric in several domains and as a means to improve the exploration of molecular conformation spaces.
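
The definition above can be illustrated with a minimal sketch. It assumes the similarity matrix is symmetric positive semidefinite with ones on the diagonal, and that it is divided by the number of items before its eigenvalues are taken, so that the eigenvalues sum to one. The function name `vendi_score` and the two toy matrices are illustrative choices, not taken from the talk.

```python
import numpy as np


def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K / n.

    Assumes K is a symmetric positive semidefinite similarity matrix
    with ones on the diagonal (an assumption of this sketch).
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    # Eigenvalues of the normalized matrix are nonnegative and sum to 1.
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop zeros and tiny round-off values: 0 * log(0) is treated as 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four copies of the same item: every pairwise similarity is 1.
identical = np.ones((4, 4))
# Four mutually dissimilar items: similarity is 1 only with itself.
distinct = np.eye(4)

print(vendi_score(identical))  # close to 1: effectively one distinct item
print(vendi_score(distinct))   # close to 4: four distinct items
```

Under these assumptions the score behaves like an effective number of distinct items: it ranges from 1, when all items are identical, to n, when all n items are completely dissimilar.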

Link to the papers.

  • Timnit Gebru, DAIR
  • Jade Abbott, Retro Rabbit
  • Asmelash Teka Hadgu, Lesan
  • Paul Azunre, Ghana NLP

    Link to the papers.

    Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning. Malte Ostendorff, Georg Rehm

    Link to the papers.

    Adaptive Representations for Semantic Search. Aniket Rege, Aditya Kusupati, Sharan Ranjit S, Sham Kakade, Prateek Jain, Ali Farhadi (University of Washington, Allen Institute for AI, Apple)

    Malaria is one of the most significant endemic diseases in Sub-Saharan Africa. In least developed countries (LDCs), the scourge is made worse by a shortage of skilled lab technologists in health centers who can accurately detect the disease using the widely accepted gold-standard microscopy method. Hence the need for reliable detection interventions, which motivated the creation of the Topic Group (TG) on automated malaria detection using Artificial Intelligence (AI). The aim is to harness AI to automate the detection of malaria in a faster, more accurate, and more cost-effective manner. Recently emerging AI and machine learning technologies that can learn complex image patterns have been successful in different medical image analysis tasks and can improve public health. Therefore, TG-Malaria, under the ITU/WHO Focus Group on AI for Health (FGAI4H), aims to develop a standardised benchmarking approach for AI-based detection of malaria. This involves all activities related to the curation of a quality dataset, development of AI models and approaches related to malaria detection, suggestions on scoring metrics, development of a benchmarking framework, and extension of the solution to improve disease surveillance and prediction.

    We will present a data-centric framework for making machine learning more trustworthy in developing countries. We will discuss best practices for data in different stages of the ML pipeline: starting with how to design/curate datasets, followed by how to identify informative data for ML, and then how to audit and debug ML models to ensure reliable application in resource-limited scenarios.