Data Innovation Day: Making Language Data Useful
At Data Innovation Day, we heard a lot about open data and its sources. Some of the most fascinating data comes straight from the web itself. Macduff Hughes of Google spoke about Google Translate and how machine learning systems use open data on the web to improve translations. Machine translation was one of the first things people tried to do with computers, and the process has become increasingly sophisticated over the years, moving from being rule-based (e.g., “dog” in English always equals “chien” in French) to statistical and data-driven.
Why is Google’s translation service so important? For starters, most of the world can’t read what’s on the web. Even if Internet users are literate, there is often a language barrier between them and the content they might wish to consume. For this reason, the number one use of Google Translate is directly in the Chrome browser, allowing users to translate web pages as they come across them. Google Translate isn’t confined to the desktop, however; the fastest-growing use of the service is actually on mobile devices, which highlights the importance of improving speech recognition software (since some alphabets are difficult to type directly into a mobile device).
Hughes highlighted some of the fascinating uses of Google Translate as well as some astonishing statistics. Google Translate currently performs more than one billion translations every day. Ninety-two percent of those take place outside the U.S. The service is now available in 80 languages, including Hausa, Igbo, Maori, Mongolian, Nepali, Punjabi, and Somali, all of which are new since December. The most requested translation, it turns out, is from English to Spanish. Google Translate offers more than just homework help, however. National domestic violence groups use Google Translate to overcome language barriers for those in need of help, and police have used the mobile app to aid travelers in need of assistance who might not speak the local language. Both of these are very real examples of the impact open data has on the larger community.
But how does Google Translate work? To put it simply, it operates on a “co-occurrence” model: the service compares text in one language against known parallel texts in another, finding words and phrases that consistently occur together across the two. It then uses these aligned bits to build a statistical model, which it applies to translate the rest of the text.
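The co-occurrence idea can be sketched in a few lines. This is a deliberately tiny illustration, not Google’s actual system: it counts how often word pairs appear together in aligned sentences (using a hypothetical three-sentence corpus) and scores candidates with the Dice coefficient, a simple association measure.

```python
from collections import Counter

# Toy parallel corpus of aligned sentence pairs (hypothetical data,
# not Google's actual training set).
parallel = [
    ("the dog runs", "le chien court"),
    ("the cat runs", "le chat court"),
    ("the dog sleeps", "le chien dort"),
]

# Count, per word, how many sentence pairs it appears in, and how many
# pairs each (source, target) word combination co-occurs in.
src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for en, fr in parallel:
    en_words, fr_words = set(en.split()), set(fr.split())
    src_count.update(en_words)
    tgt_count.update(fr_words)
    for e in en_words:
        for f in fr_words:
            pair_count[(e, f)] += 1

def best_translation(word):
    """Pick the target word with the highest Dice association score."""
    scores = {
        f: 2 * n / (src_count[word] + tgt_count[f])
        for (e, f), n in pair_count.items()
        if e == word
    }
    return max(scores, key=scores.get) if scores else None

print(best_translation("dog"))  # -> chien
```

Even on three sentences, the scoring correctly pairs “dog” with “chien” rather than with “le” or “court,” because “chien” appears in exactly the sentences where “dog” does. Real systems work at vastly larger scale and over phrases, not just single words.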
Hughes explained that some languages translate better under this model than others due to language complexity. The co-occurrence machine learning model runs into trouble when the syntax of a language is different from that of the one it’s being compared to, causing articles to be dropped or grammar to be misinterpreted. The most successful translation pair in the service is Spanish to English, but Hughes’ example showed articles being dropped even there.
How does Google Translate know what language it’s even looking at when it begins the translation process? Hughes explained that the service has been trained to look for clues:
- Are these exact words found in a particular language dictionary?
- In which alphabet is the word encoded? Latin? Cyrillic? Greek? Hebrew? Arabic?
- Are there any commonly occurring letter pairs? For example, -er and -ing are common word endings in the English language.
Even with these markers, however, it’s difficult for Google Translate to take into account language registers (formal vs. informal) and intricate variations such as different regional dialects of a particular language.
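Those three clues can be combined into a rough language guesser. The clue sets below are hypothetical placeholders (a real system would learn them from data), but the structure mirrors the list above: check the alphabet first, then score dictionary words and common endings.

```python
# Hypothetical clue sets for illustration only; a production system
# would derive these statistically from large text corpora.
CLUES = {
    "english": {"suffixes": ("ing", "er", "ed"), "words": {"the", "and", "is"}},
    "french":  {"suffixes": ("eau", "ent", "ez"), "words": {"le", "la", "et"}},
}

def guess_language(text):
    # Clue: which alphabet is the text encoded in?
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "cyrillic-based"
    # Clues: known dictionary words and common word endings.
    words = text.lower().split()
    scores = {}
    for lang, clues in CLUES.items():
        score = sum(w in clues["words"] for w in words)
        score += sum(w.endswith(clues["suffixes"]) for w in words)
        scores[lang] = score
    return max(scores, key=scores.get)

print(guess_language("the dog is running"))   # -> english
print(guess_language("le chat et le chien"))  # -> french
```

The alphabet check alone settles many cases (Cyrillic, Greek, Hebrew, Arabic script), which is why it comes first; the word and suffix scores only have to disambiguate languages that share a script.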
Another area where Google Translate runs into problems is when attempting to translate rare languages, or those that don’t often co-occur. For example, a news article in Swahili is unlikely to be copied exactly into Welsh, but there’s a good chance that it will have an English counterpart. To get around this issue, Google Translate uses language triangulation, running a translation through a third language in order to get to the desired result. In this case, the Swahili news article would be compared to its English copy, translated into English, and then translated from English into Welsh (a much more commonly occurring language pair than Swahili and Welsh alone).
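Triangulation is easy to sketch: translate into the pivot language, then out of it. The word tables below are tiny hypothetical stand-ins for the statistical models a real system would use, but the two-hop flow is the same one Hughes described.

```python
# Hypothetical direct translation tables built from parallel data.
# Swahili<->Welsh data is scarce, but both pair well with English.
SW_TO_EN = {"habari": "news", "nzuri": "good"}
EN_TO_CY = {"news": "newyddion", "good": "da"}

def translate_via_pivot(words, source_to_pivot, pivot_to_target):
    """Translate word by word through a pivot language (English here),
    passing unknown words through unchanged."""
    out = []
    for w in words:
        pivot = source_to_pivot.get(w)
        out.append(pivot_to_target.get(pivot, w) if pivot else w)
    return out

# Swahili -> English -> Welsh
print(translate_via_pivot(["habari", "nzuri"], SW_TO_EN, EN_TO_CY))
# -> ['newyddion', 'da']
```

The trade-off, of course, is that errors compound: any ambiguity lost in the hop into English cannot be recovered on the hop out, which is one reason pivot translations tend to be rougher than direct ones.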
It’s an imperfect science, but the massive amounts of free and open data on the web have brought Google Translate light years ahead of where it might be otherwise. The long and short of it is that translation quality improves at a rate directly proportional to the amount of data Google has on its servers, since all search results serve as a training ground of sorts (an interesting side note, however: the returns on additional data begin to diminish at around 100 billion words). Macduff Hughes’ talk on Google Translate illustrated the amazing opportunities inherent in repurposing common data, and it hinted at the potential for still greater impact if we continue to work on speech recognition and other machine learning tasks using data we already have on hand.
Want more from Data Innovation Day? Check out our other posts:
- The Center for Data Innovation: Data Innovation Day 2014
- Data Innovation Day: Big Data and the US Census