Ilnar Salimzianov's Personal Site

Empower Your Language: Let's Build its Digital Future with Mozilla

First published: July 24, 2025. Last updated: December 29, 2025.

On July 21, 2025, I began a new role as a Regional Language Researcher, working as an independent contractor for the Mozilla Foundation. Over the next six months, I'll be focused on a project that I believe is vital for the future of our languages. I want to tell you about this initiative and explain what's in it for you.

Note: I am an independent contractor, not a Mozilla employee. All views expressed here are my own.

What is the Mozilla Data Collective (MDC)?

Many of us know Common Voice, Mozilla's groundbreaking project to crowdsource speech data. Its success is a testament to what a global community can achieve. As of July 21, 2025, the project had collected 33,816 hours of recorded speech across 137 languages.

The Mozilla Data Collective (MDC) is the next step in that vision. Think of it as Common Voice, but for all types of language data — not just speech. The core philosophy is Create, Curate, Control. It's a platform that allows individuals and communities to contribute data on their own terms, putting power back into the hands of data creators.

The two key differences from Common Voice are:

Broader Data Types: We're looking for anything needed to build a comprehensive suite of AI tools for a language, including:
- text corpora,
- audio for text-to-speech (TTS) systems (usually a single speaker reading longer passages of text),
- audio for speech-to-text (STT) systems (usually many speakers reading short passages of text),
- and more.

Flexible Licensing: You are in control. While open licenses are encouraged, you can add constraints (like non-commercial use) or even provide paid access.

Why Should You Contribute? What's in it for You?

Your motivation will depend on who you are. Here’s how the MDC can benefit you directly:

For Researchers, Academics, and Linguists

Visibility and Impact: Your datasets gain a wider audience in the ML/AI community, leading to more citations and greater impact for your work.
Free, Secure Hosting: MDC provides a stable, long-term home for your valuable data, ensuring it remains discoverable and usable for years to come.
Simplified Contribution: If your data isn't in a machine-readable format, I can help. My role is to assist in cleaning, converting, and documenting datasets to make them suitable for the collective.

For Content Creators, Journalists, and Publishers

Fuel Innovation in Your Language: By contributing content (e.g., article archives, podcast audio), you provide the raw material to build better AI tools for your native language.
Increase Your Reach: As AI tools for your language improve, your content becomes more accessible to a global audience.
Future-Proof Your Content: Turn your existing archives into a valuable asset that contributes directly to the digital vitality of your community.

For Language Activists and Communities

Digital Sovereignty: Ensure your language thrives in the digital age. MDC provides a path for language communities to build and control the foundational data needed for their own technological future.
Empower Local Talent: With accessible data, local developers can build products that serve community needs.
Preserve and Control Your Heritage: You can apply access constraints to your datasets, ensuring they are used in ways that align with your community's values.

What Kind of Data Are We Looking For?

We are interested in datasets large enough for modern NLP tasks. Ideal contributions include:

Text Corpora: Collections of contemporary text with at least 500k tokens for building language models.
Audio for STT (ASR) and TTS: At least 5 to 10 hours of audio paired with orthographic transcriptions. Audio without transcriptions is also useful, for example for training speech language models.
Interview Corpora: Transcribed fieldwork recordings are incredibly valuable. We will ensure any access constraints are honored.
Parallel Corpora: Datasets with ~100k parallel sentences for machine translation.

If your data uses multiple orthographies or is in a raw format (like ELAN files), don't worry. As long as it's well-documented, it is likely suitable for the MDC.
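
If you are not sure whether your dataset is in the size range described above, a rough check is usually enough. The following is a minimal Python sketch, not an official MDC tool: the function names and thresholds are my own illustration, it assumes UTF-8 text files, treats whitespace-separated words as tokens (a crude approximation for many languages), assumes line-aligned parallel files, and only reads uncompressed WAV audio.

```python
import wave
from pathlib import Path

# Rough guideline values from the list above; they are not hard limits.
MIN_TOKENS = 500_000
MIN_AUDIO_HOURS = 5.0
MIN_PARALLEL_PAIRS = 100_000


def count_tokens(path):
    """Count whitespace-separated tokens in a UTF-8 text file (a crude proxy)."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total


def wav_hours(folder):
    """Sum the duration, in hours, of all .wav files directly inside a folder."""
    seconds = 0.0
    for wav_path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600


def count_parallel_pairs(src_path, tgt_path):
    """Count line-aligned sentence pairs where both sides are non-empty."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return sum(1 for s, t in zip(src, tgt) if s.strip() and t.strip())
```

For example, `count_tokens("corpus.txt") >= MIN_TOKENS` (with a hypothetical file name) gives a quick yes or no for a text corpus. A dataset that falls somewhat short may still be worth discussing, so treat the result as a starting point for a conversation rather than a filter.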

My Role and Geographic Focus

As an independent contractor, my focus is on sourcing datasets for languages of Greater Central Asia and the Caucasus. This includes languages such as:

Turkish, Azerbaijani, Turkmen, Uzbek, Kazakh, Kyrgyz, Tatar, Uyghur
Kurdish, Persian (Farsi), Pashto, Dari
Georgian, Armenian, Chechen, Avar
Mongolian, and many others in the region.

Even if your language isn't listed, please reach out. Mozilla's goal is to support all languages, and I can connect you with the right colleague.

Let's Collaborate!

I've spent my career working on computational tools for our languages, often with public funding. I see this work with Mozilla as a way to give back and help create a more equitable digital world.

Big tech will not save our languages—we, native speakers, will. Initiatives like the MDC empower us to build the future we want. Your contribution can make a huge difference.

If you own or know of a dataset that could be a good fit, please contact me. I am here to answer your questions and handle the technical heavy lifting.

You can reach me directly at mdc.ilnar@gmail.com, or by filling out the expression-of-interest form below.

📬 Interested in contributing a dataset? Fill out the Expression of Interest Form, and I’ll be in touch!

For More Information

You can read the official announcement about the Mozilla Data Collective on the Common Voice Discourse forum.

Updates

Datasets that I have helped add to the Mozilla Data Collective: