Ilnar Salimzianov's Personal Site
Empower Your Language: Let's Build its Digital Future with Mozilla
First published: July 24, 2025. Last update: August 1, 2025.
On July 21, 2025, I began a new role as a Regional Language
Researcher, working as an independent contractor for the Mozilla
Foundation. Over the next six months, I'll be focused on a project
that I believe is vital for the future of our languages. I want to
tell you about this initiative and explain what's in it for
you.
Note: I am an independent contractor, not a Mozilla employee. All
views expressed here are my own.
What is the Mozilla Data Collective (MDC)?
Many of us know Common Voice, Mozilla's groundbreaking project to
crowdsource speech data. Its success is a testament to what a global
community can achieve. To date, the project has collected
33,816 hours of recorded speech across
137 languages.
The Mozilla Data Collective (MDC) is the next step in
that vision. Think of it as Common Voice, but for all types of
language data — not just speech. The core philosophy is
Create, Curate, Control. It's a platform that allows
individuals and communities to contribute data on their own terms,
putting power back into the hands of data creators.
The two key differences from Common Voice are:
- Broader Data Types: We're looking for:
- text corpora,
- audio for text-to-speech (TTS) systems (usually: a
single speaker reading longer passages of text),
- audio for speech-to-text (STT) systems (usually: many
speakers reading short passages of text),
- and more. Anything needed to build a comprehensive suite
of AI tools for a language.
- Flexible Licensing: You are in control. While open
licenses are encouraged, you can add constraints (like
non-commercial use) or even provide paid access.
Why Should You Contribute? What's in it for You?
Your motivation will depend on who you are. Here's how the MDC
can benefit you directly:
For Researchers, Academics, and Linguists
- Visibility and Impact: Your datasets gain a wider
audience in the ML/AI community, leading to more citations and
greater impact for your work.
- Free, Secure Hosting: MDC provides a stable,
long-term home for your valuable data, ensuring it remains
discoverable and usable for years to come.
- Simplified Contribution: If your data isn't in a
machine-readable format, I can help. My role is to assist in
cleaning, converting, and documenting datasets to make them
suitable for the collective.
For Content Creators, Journalists, and Publishers
- Fuel Innovation in Your Language: By contributing
content (e.g., article archives, podcast audio), you provide the
raw material to build better AI tools for your native language.
- Increase Your Reach: As AI tools for your language
improve, your content becomes more accessible to a global
audience.
- Future-Proof Your Content: Turn your existing
archives into a valuable asset that contributes directly to the
digital vitality of your community.
For Language Activists and Communities
- Digital Sovereignty: Ensure your language thrives in
the digital age. MDC provides a path for language communities to
build and control the foundational data needed for their own
technological future.
- Empower Local Talent: With accessible data, local
developers can build products that serve community needs.
- Preserve and Control Your Heritage: You can apply
access constraints to your datasets, ensuring they are used in
ways that align with your community's values.
What Kind of Data Are We Looking For?
We are interested in datasets large enough for modern NLP tasks.
Ideal contributions include:
- Text Corpora: Collections of contemporary text with
at least 500k tokens for building language models.
- Audio for ASR/TTS: At least 5-10 hours of audio
paired with orthographic transcriptions. Even audio-only data is
useful for speech language models.
- Interview Corpora: Transcribed fieldwork recordings
are incredibly valuable. We will ensure any access constraints
are honored.
- Parallel Corpora: Datasets with ~100k parallel
sentences for machine translation.
If your data uses multiple orthographies or is in a raw format
(like ELAN files), don't worry. As long as it's well-documented, it
is likely suitable for the MDC.
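If you want a quick estimate of whether a dataset is in the right range before writing to me, the size guidelines above can be checked with a few lines of Python. This is only a rough sketch under my own assumptions: text is stored as UTF-8 `.txt` files, tokens are approximated by splitting on whitespace, audio is in `.wav` format, and a parallel corpus is a tab-separated file with one sentence pair per line. The function and constant names are illustrative and are not part of any MDC tooling.

```python
import wave
from pathlib import Path

# Rough minimums quoted in the list above. The constant names are illustrative.
MIN_TEXT_TOKENS = 500_000
MIN_AUDIO_HOURS = 5.0
MIN_PARALLEL_SENTENCES = 100_000


def count_tokens(corpus_dir):
    """Count whitespace-separated tokens in all .txt files under corpus_dir."""
    total = 0
    for path in Path(corpus_dir).rglob("*.txt"):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                total += len(line.split())
    return total


def audio_hours(audio_dir):
    """Sum the duration, in hours, of all .wav files under audio_dir."""
    seconds = 0.0
    for path in Path(audio_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600


def count_sentence_pairs(tsv_path):
    """Count lines whose first two tab-separated fields are both non-empty."""
    pairs = 0
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0].strip() and fields[1].strip():
                pairs += 1
    return pairs


def size_report(tokens=0, hours=0.0, pairs=0):
    """Compare measured sizes against the rough minimums above."""
    return {
        "text_corpus": tokens >= MIN_TEXT_TOKENS,
        "audio": hours >= MIN_AUDIO_HOURS,
        "parallel_corpus": pairs >= MIN_PARALLEL_SENTENCES,
    }
```

For example, `size_report(tokens=count_tokens("my_corpus"))` tells you whether a folder of text files (here a hypothetical `my_corpus` directory) reaches the 500k-token mark. Whitespace splitting undercounts or overcounts tokens for some scripts, so treat the numbers as an estimate, and reach out even if your dataset falls short.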
My Role and Geographic Focus
As an independent contractor, my focus is on sourcing datasets
for languages of Greater Central Asia and the Caucasus.
This includes languages such as:
- Turkish, Azerbaijani, Turkmen, Uzbek, Kazakh, Kyrgyz, Tatar
- Kurdish, Persian (Farsi), Pashto, Dari
- Georgian, Armenian, Chechen, Avar
- Mongolian, and many others in the region.
Even if your language isn't listed, please reach out. Mozilla's
goal is to support all languages, and I can connect you with the
right colleague.
Let's Collaborate!
I've spent my career working on computational tools for our
languages, often with public funding. I see this work with Mozilla
as a way to give back and help create a more equitable digital
world.
Big tech will not save our languages—we, native speakers, will. Initiatives
like the MDC empower us to build the future we want. Your contribution can
make a huge difference.
If you own or know of a dataset that could be a good fit, please
contact me. I am here to answer your questions and handle the
technical heavy lifting.
You can reach me directly at mdc.ilnar@gmail.com, or by filling out
the expression-of-interest form below.
For More Information
You can read the official announcement about the Mozilla Data Collective on the
Common Voice Discourse forum.