Reversing language shift through language technology: Apertium and Common Voice

https://selimcan.gitlab.io/slides/nu/

Ilnar Salimzianov (Илнар Сәлимҗанов)

ilnar@selimcan.org

6 February, 2020

Who I am

  • independent consultant, NU research group member, Apertium community
  • German Philology, Kazan Federal University, 2011
  • M.Sc. Comp. Linguistics, University of Stuttgart, 2017

Introduction

Between 1950 and 2010, 230 languages went extinct, according to the UNESCO Atlas of the World’s Languages in Danger. Today, a third of the world’s languages have fewer than 1,000 speakers left. Every two weeks a language dies with its last speaker, 50 to 90 percent of them are predicted to disappear by the next century.

The Race to Save the World's Disappearing Languages (National Geographic)

Why is language loss bad?

  • think species extinction
  • loss of knowledge
  • loss of worldview
  • unique culture, heritage

Languages: Why we must save dying tongues (BBC)

Prestige of a language

(among many other factors)

  • available resources in that language
  • technology

What can we do?

"An endangered language will progress if its speakers can make use of electronic technology." (David Crystal, Language Death)
  • writing aids (e.g. spellcheckers)
  • machine translation
  • speech-enabled apps
  • language technology in general

Overview

Endangered language X spoken in country Y.

Task:

  1. make computers support it (locale)
  2. create a spellchecker and morphological transducer for it
  3. build a machine translator to/from it
  4. build a speech-to-text system for it

1. Locale

  • e.g. en_US.utf8, kk_KZ.utf8@latn
  • a set of parameters, e.g.:
    • character set
    • date format
    • days of the week
    • how plurals are formed
  • necessary for software localization
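
As an illustration of what such a parameter set bundles together, here is a minimal Python sketch. The `ToyLocale` class and its fields are hypothetical and far simpler than a real CLDR or glibc locale definition; only the English plural rule (singular for exactly one item, plural otherwise) is taken as given.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class ToyLocale:
    """Hypothetical, heavily simplified stand-in for a real locale definition."""
    name: str
    charset: str
    date_format: str                        # strftime-style pattern
    weekdays: List[str]                     # Monday first
    plural_category: Callable[[int], str]   # maps a count to a plural category


def english_plural(n: int) -> str:
    # English distinguishes "one" (exactly 1) from "other".
    return "one" if n == 1 else "other"


en_US = ToyLocale(
    name="en_US.utf8",
    charset="UTF-8",
    date_format="%m/%d/%Y",
    weekdays=["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"],
    plural_category=english_plural,
)


def format_date(loc: ToyLocale, d: date) -> str:
    """Format a date according to the locale's date pattern."""
    return d.strftime(loc.date_format)


def weekday_name(loc: ToyLocale, d: date) -> str:
    """Look up the localized weekday name (date.weekday() is 0 for Monday)."""
    return loc.weekdays[d.weekday()]


def count_message(loc: ToyLocale, n: int) -> str:
    """Pick the message form that matches the plural category of n."""
    forms = {"one": "language", "other": "languages"}
    return f"{n} {forms[loc.plural_category(n)]}"
```

Software localized for language X would look up the same kinds of fields in X's locale instead of hard-coding English conventions.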

Common Locale Data Repository, CLDR

GNU C Library, glibc

https://ftyers.github.io/localisation.html

2. Spellcheckers & Morphological transducers

Toy example 1: sheep language

  • ba
  • baa
  • baaa...

Finite-state automaton (machine)
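
A minimal Python sketch of an automaton for the sheep language, assuming the language is "b" followed by one or more "a" (ba, baa, baaa, ...). The state numbering and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are illustrative choices, not part of any toolkit.

```python
# States: 0 = start, 1 = seen "b", 2 = seen "b" plus at least one "a".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 2,   # loop: any number of further "a"s
}
ACCEPTING = {2}


def accepts(word: str) -> bool:
    """Return True if the automaton ends in an accepting state on `word`."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING
```

For example, `accepts("baaa")` holds, while `accepts("b")` and `accepts("baab")` do not.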

Toy example 2: English nouns

Toy example 3: Random English words

Source: "Finite-state morphology: Xerox Tools and Techniques" (Kenneth Beesley, Lauri Karttunen, 2003)

Finite-state transducer

Source code, Lexc formalism

                          LEXICON Nouns
                          
                          apple:apple   Number ;
                          orange:orange Number ;

                          LEXICON Number

                          <sg>:  # ;
                          <pl>:s # ;
                        

Compiler

  • source code in, binary (executable) program out
  • Stuttgart Finite-State Toolkit (SFST), Helsinki Finite-State Toolkit (HFST), etc.
  • HFST recommended
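
To make concrete what the compiled transducer computes, here is a small Python sketch that mimics the Lexc source above: each lexicon holds entries with a lexical side, a surface side and a continuation class, and "#" ends the word. This is only a simulation for illustration; the `LEXICONS` table and the `expand` function are hypothetical and do not reflect how HFST represents transducers internally.

```python
# Each lexicon maps to entries of (lexical side, surface side, continuation).
# The continuation "#" marks the end of a word, as in Lexc.
LEXICONS = {
    "Nouns": [
        ("apple", "apple", "Number"),
        ("orange", "orange", "Number"),
    ],
    "Number": [
        ("<sg>", "", "#"),
        ("<pl>", "s", "#"),
    ],
}


def expand(lexicon="Nouns", lexical="", surface=""):
    """Yield every (lexical, surface) pair reachable from `lexicon`."""
    for lex, surf, continuation in LEXICONS[lexicon]:
        if continuation == "#":
            yield lexical + lex, surface + surf
        else:
            yield from expand(continuation, lexical + lex, surface + surf)


pairs = list(expand())
for lexical_form, surface_form in pairs:
    print(f"{lexical_form} -> {surface_form}")
```

The four resulting pairs relate apple<sg>, apple<pl>, orange<sg> and orange<pl> to apple, apples, orange and oranges.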

Four freedoms

A program is free software if the program's users have the four essential freedoms:
  • The freedom to run the program as you wish, for any purpose (freedom 0).
  • The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  • The freedom to redistribute copies so you can help others (freedom 2).
  • The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

https://www.gnu.org/philosophy/free-sw.en.html

TWOLC formalism

  • apple -> apples vs
  • cherry -> cherries
                          "Rule for English plurals"
                          
                          y:i <=> _ <pl>:s ;

                          "Insert e in e.g. cherry -> cherries"

                          0:e <=> :i _ <pl>:s ;
                        
  • alternatively: xfst rewrite rule formalism
  • Demo: morphological analysis

    Demo: morphological generation
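
The two directions can be approximated in plain Python. This is a sketch under stated assumptions, not the HFST tooling itself: the stem list (with "cherry" added for illustration), the function names, and the sequential string replacement are all simplifications. Real two-level rules apply in parallel; for these two rules the sequential version gives the same result.

```python
STEMS = ["apple", "orange", "cherry"]   # assumed toy lexicon
TAGS = ["<sg>", "<pl>"]


def generate(lexical: str) -> str:
    """Generation: map a lexical form such as 'cherry<pl>' to its surface form."""
    # Rule 1: y is realised as i directly before the plural tag.
    # Rule 2: e is inserted between that i and the plural tag.
    intermediate = lexical.replace("y<pl>", "ie<pl>")
    # Lexicon mapping: <pl> surfaces as s, <sg> as nothing.
    return intermediate.replace("<pl>", "s").replace("<sg>", "")


def analyse(surface: str) -> list:
    """Analysis: return every lexical form whose generated surface form matches."""
    return [stem + tag
            for stem in STEMS
            for tag in TAGS
            if generate(stem + tag) == surface]
```

A finite-state transducer gives both directions from a single compiled object; the brute-force search in `analyse` only stands in for running that transducer in the opposite direction.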

    3. Machine translation: Apertium platform

    More on Apertium

    • free/open-source (both engine and linguistic data)
    • 50 released translators, many more in beta
    • web interface, API available
    • documentation: wiki.apertium.org
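
A hedged sketch of how a client might talk to the web API. It assumes the public service exposes a `translate` endpoint that takes `langpair` and `q` query parameters and answers with JSON containing `responseData.translatedText`; the base URL and these names are assumptions to check against the documentation on wiki.apertium.org. The sketch only builds the request URL and parses a response body, so it runs without network access.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the public API instance; verify before use.
APY_BASE = "https://apertium.org/apy"


def build_translate_url(text: str, source: str, target: str,
                        base: str = APY_BASE) -> str:
    """Build a translation request URL for the (assumed) translate endpoint."""
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{base}/translate?{query}"


def extract_translation(response_body: str) -> str:
    """Pull the translated text out of a JSON response body."""
    payload = json.loads(response_body)
    return payload["responseData"]["translatedText"]


url = build_translate_url("hello world", "eng", "spa")

# Hand-written placeholder body in the assumed response shape,
# not real output from the service.
sample_body = '{"responseData": {"translatedText": "PLACEHOLDER"}, "responseStatus": 200}'
translated = extract_translation(sample_body)
```

Fetching `url` with any HTTP client and passing the body to `extract_translation` would complete the round trip.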

    Even more on Apertium

    • minority/endangered languages, closely-related languages
    • friendly community:
      • #apertium on Freenode IRC
    • contacts
      • apertium.org (released pairs only)
      • beta.apertium.org (recent versions of all, incl. morph. analysis/generation)
      • turkic.apertium.org

    4. Speech recognition/speech-to-text: Common Voice

    Motivation

    • no speech data in Apertium
    • no speech interface
    • speech-enabled apps: next wave?
    • recordings from 1000s of speakers required
      • YouTube?
      • licensing issues/costs

    What is Common Voice? (1)

    • crowdsourcing project for speech data
    • all data in public domain (no restrictions)
    • 1600h English, 595h German, 27h Tatar, 22h Kyrgyz, 16h Turkish
    • 2000h needed for production-quality (source: Mozilla)
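
To put the figures above in proportion, a short Python calculation of how far each language is from the 2000-hour target. The hour counts are the ones listed on this slide; the names `TARGET_HOURS`, `RECORDED_HOURS`, `progress` and `remaining` are illustrative.

```python
TARGET_HOURS = 2000   # production-quality threshold quoted above
RECORDED_HOURS = {
    "English": 1600,
    "German": 595,
    "Tatar": 27,
    "Kyrgyz": 22,
    "Turkish": 16,
}


def progress(hours: float, target: float = TARGET_HOURS) -> float:
    """Share of the target already recorded, in percent."""
    return 100 * hours / target


def remaining(hours: float, target: float = TARGET_HOURS) -> float:
    """Hours still missing to reach the target (never negative)."""
    return max(target - hours, 0)


for language, hours in RECORDED_HOURS.items():
    print(f"{language}: {progress(hours):.2f}% of target, "
          f"{remaining(hours)} h to go")
```

English is at 80% of the target, while Tatar, Kyrgyz and Turkish are each below 2%.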

    What is Common Voice? (2)

    • FLOSS speech-to-text engine: DeepSpeech
    • where is Kazakh?
      • help is needed!
      • see https://taruen.com/blog/

    Summary

    • Languages are disappearing rapidly; it's time to act and document them.
    • HFST is good for writing morphological transducers/spellcheckers.
    • Beesley & Karttunen's "Finite-State Morphology" is your friend (alternatively: docs/pointers on wiki.apertium.org).
    • Apertium: MT is possible even if you don't have large parallel corpora (as required for statistical or neural MT), especially if the languages are closely related.
    • Free/open-source licenses and hubs of data are good (collaboration, rapid development, synergy).

    References

    https://selimcan.gitlab.io/papers/turklang-2019.pdf