Posted by Melina Koukoutchos on July 13th, 2015
Lost in Translation
Ever wondered what language you hear behind you on the subway platform? Or have you ever come across a document in an unknown script with no idea of how to go about solving the mystery? In a world where written information is widely accessible and shared in potentially thousands of different languages, the ability to identify the language of a piece of text is the first step to unlocking the ideas it contains. Language identification (LID) software is an important tool used to document and analyze a variety of languages. LID tools such as Google Translate can recognize up to 100 languages, and with Langscape’s LID tool, we are attempting to identify many more.
How does LID work?
Current LID tools work by comparing the input text to a database of language samples. An algorithm first examines the sequences of characters that appear in the reference text sample for each language in the database. For example, if the text were “Let me tell you about the time I met…” and we use sequences of three characters (ignoring spaces and capitalization), the character sequences would be: [let] [etm] [tme] [met] [ete] [tel] and so on.
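As a rough illustration of this step, here is a minimal Python sketch that splits a text into overlapping three-character sequences. The function name char_ngrams is made up for this example, and dropping spaces, punctuation, and capitalization is an assumption based on the example above, not a description of how Langscape's tool is actually implemented.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character sequences of length n in text.

    Assumption for this sketch: only letters are kept and case is ignored,
    which matches the worked example ([let] [etm] [tme] ...).
    """
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    return [letters[i:i + n] for i in range(len(letters) - n + 1)]


sample = "Let me tell you about the time I met"
print(char_ngrams(sample)[:6])
# ['let', 'etm', 'tme', 'met', 'ete', 'tel']
```

Changing n gives sequences of a different length; the example in this post uses three.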
Once the character sequences have been established for the entire text, the frequency of each character sequence is calculated. Some character sequences may appear multiple times within a text while others may appear only once; for instance, the sequence [ime] appears twice in our previous example (“t[ime] [I me]t”), while [oua] occurs only once (“tell y[ou a]bout”). In a text of 1000 characters or more, many sequences will occur more than once.
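The counting step could look something like the following sketch, which uses Python's standard collections.Counter. The helper name ngram_counts is hypothetical, and the same assumption applies as in the example sentence: spaces and capitalization are ignored.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count how often each character sequence of length n occurs in text."""
    # Assumption for this sketch: keep letters only and ignore case.
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


counts = ngram_counts("Let me tell you about the time I met")
print(counts["ime"], counts["oua"])
# 2 1
```

A Counter returns 0 for a sequence that never occurs, which is convenient when two texts are compared later.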
The same process is then applied to the input text, so that its sequences and frequencies can be compared to those of the reference texts for each language in the database. The languages whose character sequences are most similar to those in the input text, and occur with the most similar frequencies, are offered as the best matches. Langscape’s LID currently draws from a database of over 900 languages! Test it out for yourself and see how well it can spot a language.
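To make the comparison step concrete, here is a small sketch that ranks reference languages by how similar their sequence counts are to those of the input. The post does not say which similarity measure Langscape uses; cosine similarity is an assumption chosen here because it is simple. The names profile, cosine, REFERENCES, and rank_languages are all hypothetical, and the two reference samples are just Article 1 of the Universal Declaration of Human Rights, far shorter than anything a real tool would rely on.

```python
from collections import Counter
from math import sqrt


def profile(text, n=3):
    """Count the character sequences of length n (letters only, case ignored)."""
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def cosine(p, q):
    """Cosine similarity between two count profiles (assumed measure)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


# Toy reference database: one very short sample per language.
REFERENCES = {
    "English": "All human beings are born free and equal in dignity and rights. "
               "They are endowed with reason and conscience and should act "
               "towards one another in a spirit of brotherhood.",
    "Spanish": "Todos los seres humanos nacen libres e iguales en dignidad y "
               "derechos y, dotados como están de razón y conciencia, deben "
               "comportarse fraternalmente los unos con los otros.",
}


def rank_languages(text, references=REFERENCES):
    """Return (language, score) pairs, best match first."""
    target = profile(text)
    scores = {lang: cosine(target, profile(ref)) for lang, ref in references.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


for language, score in rank_languages("todos los seres humanos"):
    print(language, round(score, 3))
```

With samples this small the scores are only indicative; longer and more varied reference texts give the comparison more to work with.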
What are the challenges?
Language identification is a difficult task. Many languages, such as Norwegian and Danish, are very similar, use the same alphabet, and contain many of the same character sequences. These striking similarities create difficulties for the algorithm when distinguishing between languages.
In other cases, the same language may be written in multiple scripts. For instance, Serbian regularly uses both the Cyrillic script and the Latin script. So how can we improve the tool’s ability to detect subtle differences and similarities?
One bet is that the reference text matters: length, subject matter, level of formality and style will all affect the frequency of different character sequences. Langscape’s current language samples primarily consist of formal documents such as the UN’s Universal Declaration of Human Rights or translations of the Bible. As a result, the tool works well for many documents that are structured and edited; however, difficulties arise with texts that use less formal language, such as internet pages, blogs, online news articles, and tweets. In less formal texts, people tend to abbreviate words, use slang terms, change spellings, insert emoticons, and duplicate punctuation!!! In addition, these texts are often much shorter than formal documents, meaning fewer characters and less information for the algorithm to work with.
What is Langscape doing to improve its LID tool?
Langscape is currently exploring ways to improve its LID tool, especially when identifying the language used in informal texts such as Internet blogs and tweets. We are examining which texts make ideal reference samples for a language; for example, texts that use very topic-specific terminology, like science journals, should be avoided because they may contain character sequences that are generally uncommon in the language and may skew the tool’s results. In addition, adjustments to the reference samples may improve the tool’s accuracy, such as removing emoticons or hashtags, or converting the text to a single case. We will also examine what the ideal number of characters is for reference samples and input texts. If our reference samples contain collections of shorter texts, like tweets, is the tool more accurate?
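The kind of clean-up mentioned above (removing emoticons and hashtags, and normalizing case) could be sketched as follows. This is only an illustration: the function name normalize and both patterns are assumptions made for this example, the emoticon pattern covers just a few common ASCII forms such as :) ;-) and :D, and it does not handle emoji.

```python
import re

# Hypothetical patterns for this sketch; a real tool would need broader coverage.
HASHTAG = re.compile(r"#\w+")
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[)(DPp](?!\S)")


def normalize(text):
    """Strip hashtags and simple ASCII emoticons, lowercase, tidy whitespace."""
    text = HASHTAG.sub(" ", text)
    text = EMOTICON.sub(" ", text)  # done before lowercasing so :D and :P match
    text = text.lower()
    return " ".join(text.split())


print(normalize("Sooo happy today :D #blessed #Monday"))
# sooo happy today
```

Whether such adjustments actually help is one of the open questions here; the same normalization would presumably need to be applied to both the reference samples and the input text for the frequencies to stay comparable.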
We are also examining ways to improve the tool’s ability to distinguish between more closely related languages, like Norwegian and Danish. One method may be to run the algorithm on an input text multiple times using different collections of reference samples. For instance, say the algorithm is run on an input text and the top results are Norwegian and Danish (which are in the same language family); we could then re-run the algorithm on the same input text using only the reference samples for languages within that language family. Would this improve the tool’s ability to distinguish between the closely related languages?
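One way to picture the two-pass idea is the sketch below. Everything in it is an assumption made for illustration: the names two_pass_rank and family_of, the use of cosine similarity, and in particular the second-pass rule. With a score computed separately for each language, simply restricting the candidates would not change their relative order, so this sketch additionally drops the sequences that every language in the family shares and compares only what is left. The reference strings are artificial placeholders, not real language data.

```python
from collections import Counter
from math import sqrt


def profile(text, n=3):
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def cosine(p, q):
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def rank(target, profiles):
    scores = {lang: cosine(target, prof) for lang, prof in profiles.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def two_pass_rank(text, references, family_of):
    """Rank all languages; if the top two share a family, re-rank within it."""
    profiles = {lang: profile(ref) for lang, ref in references.items()}
    target = profile(text)
    first = rank(target, profiles)
    if len(first) < 2 or family_of[first[0][0]] != family_of[first[1][0]]:
        return first
    family = family_of[first[0][0]]
    members = {lang: p for lang, p in profiles.items() if family_of[lang] == family}
    # Assumed second-pass rule: ignore sequences found in every family member,
    # since they cannot help tell those languages apart.
    shared = set.intersection(*(set(p) for p in members.values()))

    def distinctive(prof):
        return Counter({g: c for g, c in prof.items() if g not in shared})

    return rank(distinctive(target), {lang: distinctive(p) for lang, p in members.items()})


# Artificial placeholder "languages": A1 and A2 are near-identical, B1 is unrelated.
references = {"A1": "abcabcabe", "A2": "abcabcabd", "B1": "xyzxyzxyz"}
family_of = {"A1": "A", "A2": "A", "B1": "B"}
print(two_pass_rank("abcabd", references, family_of))
```

In this toy case the first pass already prefers A2, but only narrowly; after the shared sequences are removed, the gap between A2 and A1 becomes much wider. Whether the same holds for real pairs such as Norwegian and Danish is exactly the question raised above.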
We hope to continue improving Langscape’s LID tool and to keep searching for new and interesting applications for it.