2026-06-08, Kesselhaus
Low-resource languages expose weaknesses in NLP systems that benchmark data often hides. Drawing on experience annotating fieldwork data, this talk shows how ambiguity and annotation decisions reveal fundamental data quality issues that also affect real-world NLP pipelines.
This talk is an experience report on annotating language data in a low-resource setting and what this process reveals about data quality in NLP pipelines. Rather than treating low-resource languages as edge cases, the talk frames them as stress tests that make structural data issues visible early and clearly.
The session outlines what linguistic fieldwork data looks like before it becomes “training data,” highlighting ambiguity, context dependence, and variation that cannot always be resolved through additional labeling. It then focuses on the annotation decisions required when categories are underspecified or multiple analyses are plausible, and connects these challenges to familiar issues in applied NLP, such as label noise, brittle representations, and unexpected model behavior.
The goal is to share practical lessons from linguistic data work that help NLP practitioners reason more realistically about annotation, uncertainty, and robustness. Attendees will gain concrete insights into why “clean data” is often an illusion and how early data decisions shape downstream systems.
Priscilla Lola Adenuga works with language data at the intersection of linguistics and NLP. Her background is in syntactic analysis and linguistic fieldwork, with hands-on experience annotating low-resource language data. She is interested in data quality, annotation practices, and how insights from linguistics can inform more robust and realistic NLP systems.