Machine learning deciphers lost languages

October 22, 2020 //By Rich Pell
Machine learning deciphers lost languages
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) say they have developed a machine learning (ML) system that can help linguists decipher languages that have been lost to history.

Such "dead" languages - where little is known about their grammar, vocabulary, or syntax - can't be deciphered using machine-translation algorithms, and often don't have a well-researched "relative" language to be compared to, and may even lack traditional dividers like white space and punctuation. Now, say the researchers, they have developed a new system that has been shown to be able to automatically decipher a lost language without needing advanced knowledge of its relation to other languages.

In addition, the system can itself determine relationships between languages and was even used to corroborate recent scholarship suggesting that the language of Iberian is not actually related to Basque. Ultimately, say the researchers, their goal is for the system to be able to decipher lost languages that have eluded linguists for decades, using just a few thousand words.

The system relies on several principles grounded in insights from historical linguistics, such as the fact that languages generally only evolve in certain predictable ways. For example, while a given language rarely adds or deletes an entire sound, certain sound substitutions are likely to occur - i.e., a word with a "p" in the parent language may change into a "b" in the descendant language, but changing to a "k" is less likely due to the significant pronunciation gap.

By incorporating these and other linguistic constraints, say the researchers, they developed a decipherment algorithm that can handle the vast space of possible transformations and the scarcity of a guiding signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors.

This design, say the researchers, enables them to capture pertinent patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.

The project builds on research last year that deciphered

Vous êtes certain ?

Si vous désactivez les cookies, vous ne pouvez plus naviguer sur le site.

Vous allez être rediriger vers Google.