Ilnar Salimzianov's Personal Site
Empower Your Language: Let's Build its Digital Future with Mozilla
First published: July 24, 2025. Last update: December 29, 2025.
On July 21, 2025, I began a new role as a Regional Language Researcher,
working as an independent contractor for the Mozilla Foundation. Over the
next six months, I'll be focused on a project that I believe is vital for
the future of our languages. I want to tell you about this initiative and
explain what's in it for you.
Note: I am an independent contractor, not a Mozilla
employee. All views expressed here are my own.
What is the Mozilla Data Collective (MDC)?
Many of us know Common Voice, Mozilla's groundbreaking project to crowdsource
speech data. Its success is a testament to what a global community can
achieve. To date, the project has collected over 33,816
hours of recorded speech across an incredible 137
languages (as of July 21, 2025).
The Mozilla Data Collective (MDC) is the next step in
that vision. Think of it as Common Voice, but for all types of
language data — not just speech. The core philosophy is Create,
Curate, Control. It's a platform that allows individuals and
communities to contribute data on their own terms, putting power back into
the hands of data creators.
The two key differences from Common Voice are:
- Broader Data Types: We're looking for:
  - text corpora,
  - audio for text-to-speech (TTS) systems (usually a single speaker
    reading longer passages of text),
  - audio for speech-to-text (STT) systems (usually many speakers
    reading short passages of text),
  - and more: anything needed to build a comprehensive suite of AI
    tools for a language.
- Flexible Licensing: You are in control. While open
licenses are encouraged, you can add constraints (like non-commercial use)
or even provide paid access.
Why Should You Contribute? What's in it for You?
Your motivation will depend on who you are. Here’s how the MDC can
benefit you directly:
For Researchers, Academics, and Linguists
- Visibility and Impact: Your datasets gain a wider
audience in the ML/AI community, leading to more citations and greater
impact for your work.
- Free, Secure Hosting: MDC provides a stable,
long-term home for your valuable data, ensuring it remains discoverable
and usable for years to come.
- Simplified Contribution: If your data isn't in a
machine-readable format, I can help. My role is to assist in cleaning,
converting, and documenting datasets to make them suitable for the
collective.
For Content Creators, Journalists, and Publishers
- Fuel Innovation in Your Language: By contributing
content (e.g., article archives, podcast audio), you provide the raw
material to build better AI tools for your native language.
- Increase Your Reach: As AI tools for your language
improve, your content becomes more accessible to a global audience.
- Future-Proof Your Content: Turn your existing
archives into a valuable asset that contributes directly to the digital
vitality of your community.
For Language Activists and Communities
- Digital Sovereignty: Ensure your language thrives in
the digital age. MDC provides a path for language communities to build and
control the foundational data needed for their own technological
future.
- Empower Local Talent: With accessible data, local
developers can build products that serve community needs.
- Preserve and Control Your Heritage: You can apply
access constraints to your datasets, ensuring they are used in ways that
align with your community's values.
What Kind of Data Are We Looking For?
We are interested in datasets large enough for modern NLP tasks. Ideal
contributions include:
- Text Corpora: Collections of contemporary text with
at least 500k tokens for building language models.
- Audio for ASR/TTS: At least 5 to 10 hours of audio
paired with orthographic transcriptions. Even audio-only data is useful
for speech language models.
- Interview Corpora: Transcribed fieldwork recordings
are incredibly valuable. We will ensure any access constraints are
honored.
- Parallel Corpora: Datasets with ~100k parallel
sentences for machine translation.
If your data uses multiple orthographies or is in a raw format (like ELAN
files), don't worry. As long as it's well-documented, it is likely suitable
for the MDC.
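If you want a quick self-check before reaching out, the size guidelines above can be sketched in a few lines of Python. This is only a rough illustration, not an official MDC tool: the function names are hypothetical, tokens are counted by splitting on whitespace (a crude approximation for many languages), and the thresholds take the lower bound of each figure mentioned above. A dataset that falls short may still be of interest.

```python
# Rough size thresholds, taken from the guidelines above (assumed lower bounds).
MIN_TEXT_TOKENS = 500_000
MIN_AUDIO_HOURS = 5.0
MIN_PARALLEL_SENTENCES = 100_000


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (a crude approximation)."""
    return len(text.split())


def text_corpus_large_enough(texts) -> bool:
    """True if the documents together reach the token guideline."""
    return sum(count_tokens(t) for t in texts) >= MIN_TEXT_TOKENS


def audio_hours(durations_seconds) -> float:
    """Convert a list of clip durations in seconds to total hours."""
    return sum(durations_seconds) / 3600.0


def audio_large_enough(durations_seconds) -> bool:
    """True if the clips together reach the audio guideline."""
    return audio_hours(durations_seconds) >= MIN_AUDIO_HOURS


def parallel_corpus_large_enough(pairs) -> bool:
    """True if enough sentence pairs have text on both sides."""
    complete = sum(1 for src, tgt in pairs if src.strip() and tgt.strip())
    return complete >= MIN_PARALLEL_SENTENCES
```

For example, `audio_large_enough([3600] * 6)` treats six one-hour recordings as meeting the audio guideline. Pairs with an empty source or target sentence are not counted toward the parallel total.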
My Role and Geographic Focus
As an independent contractor, my focus is on sourcing datasets for
languages of Greater Central Asia and the Caucasus. This
includes languages such as:
- Turkish, Azerbaijani, Turkmen, Uzbek, Kazakh, Kyrgyz, Tatar,
Uyghur
- Kurdish, Persian (Farsi), Pashto, Dari
- Georgian, Armenian, Chechen, Avar
- Mongolian, and many others in the region.
Even if your language isn't listed, please reach out. Mozilla's goal is
to support all languages, and I can connect you with the right
colleague.
Let's Collaborate!
I've spent my career working on computational tools for our languages,
often with public funding. I see this work with Mozilla as a way to give
back and help create a more equitable digital world.
Big tech will not save our languages—we, native speakers, will.
Initiatives like the MDC empower us to build the future we want. Your
contribution can make a huge difference.
If you own or know of a dataset that could be a good fit, please contact
me. I am here to answer your questions and handle the technical heavy
lifting.
You can reach me directly at mdc.ilnar@gmail.com, or by filling out the
expression-of-interest form below.
For More Information
You can read the official announcement about the Mozilla Data Collective
on the Common Voice Discourse forum.
Updates
Datasets that I helped to add to Mozilla Data Collective:
- Speech Corpus of Armenian Question-Answer Dialogues
- ReRooted: Speech Corpus of Testimonials from Armenian
  Refugees and Immigrants
- KyrgyzNER: Human-Annotated NER Dataset for Kyrgyz
- KyrgyzLLM-Bench: Kyrgyz LLM Evaluation Dataset