Posted 10 months, 1 week ago in the wee hours by oso
I’ve been to quite a few of these now: gatherings of geeks from around the world who sit around, surely over-caffeinated, for a day or two or three, and discuss and debate how they are going to save the world. That may come off as cynical, but in fact I write it in complete admiration, because we all discuss and debate how to save the world, yet so few of us ask, “OK, so what’s the next step?”
My main goal for tomorrow is to write a comprehensible post on why open translation tools are important for both open source software and open content.
I know, I know, that last sentence probably reads like Latin to most of y’all, but I’ll do my best to explain what open source software and open content are as well.
So what are translation tools? For most people on the internet, the answer is Google Translate. We see something on the web in a language that we don’t know, so we copy and paste the text into Google Translate and let it do its magic. How does that ‘magic’ happen? Two ways. The first is standard grammar-based machine translation, which I’ll go over below, thanks to an informative session by Fran Tyers of Apertium. The second is “statistical machine translation.” What this means is that Google Translate has a gigantic corpus, or database, of texts which have already been translated from one language to another. In this case, Google uses the translations of United Nations documents and stores each of the translated phrases and sentences. So, if you enter a sentence into Google Translate that it has already seen in a document translated by a professional UN translator, it will simply offer you that exact translation word for word. If it can’t find the exact sentence or phrase, it will use grammar-based rules to translate your text.
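The lookup-then-fallback behaviour described above can be sketched in a few lines of Python. This is only a toy illustration of the description in this post, not how Google Translate is actually built: the two-entry `PARALLEL_CORPUS`, the `translate` function, and the `rule_based_stub` fallback are all hypothetical names invented for the example.

```python
from typing import Callable

# Hypothetical miniature "corpus": sentences a human has already translated.
PARALLEL_CORPUS = {
    "dobar dan": "good day",
    "hvala lepo": "thank you very much",
}


def translate(sentence: str, fallback: Callable[[str], str]) -> str:
    """Return a stored human translation if one exists, else use the fallback."""
    # Normalize case and whitespace so trivial differences still match.
    key = " ".join(sentence.lower().split())
    if key in PARALLEL_CORPUS:
        return PARALLEL_CORPUS[key]
    return fallback(sentence)


def rule_based_stub(sentence: str) -> str:
    """Placeholder standing in for the grammar-based steps described below."""
    return f"<rule-based translation of: {sentence}>"


if __name__ == "__main__":
    print(translate("Dobar dan", rule_based_stub))
    print(translate("Velika reka teče kroz grad", rule_based_stub))
```

The first call finds a stored translation; the second falls through to the stub because the sentence is not in the toy corpus.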
Let’s go over the steps that Google Translate and other machine translators go through when translating your text. For the session led by Fran, we decided to translate the Serbian sentence “Velika reka teče kroz grad” into English. Here is a flow chart of that process. Below the flow chart, I’ll give more context to each step.
Step 1: Select source and target languages of translation
In this case: Serbian > English
Step 2: Write phrase or sentence to be translated:
“Velika reka teče kroz grad”
Step 3: Source language analysis
The machine then identifies the “grammatical features” of each word. There are two types of grammatical features:
1.) Parts of speech (e.g. adjective, noun, verb, pronoun, etc.)
2.) Sub-categories (e.g. gender of word, case, singular or plural, conjugation, person, etc.)
The combination of these “grammatical features” and the lemma of the word (that is, the ‘natural form’ of the word) is called a “lexical unit”.
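As a rough sketch of what a lexical unit might look like as data, here is a small Python example. The `LexicalUnit` class, the `ANALYSES` table, and the plain-English tag names are assumptions made for illustration; they are simplified and are not Apertium’s actual tag set or file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LexicalUnit:
    lemma: str                      # dictionary form of the word
    pos: str                        # part of speech
    features: Tuple[str, ...] = ()  # sub-categories: gender, number, case, ...


# Hypothetical analyses for the words in our sentence. A surface form can
# have more than one possible analysis, which is why Step 4 exists.
ANALYSES = {
    "velika": [
        LexicalUnit("veliki", "adjective", ("feminine", "singular", "nominative")),
        LexicalUnit("veliki", "adjective", ("neuter", "plural", "nominative")),
    ],
    "reka": [LexicalUnit("reka", "noun", ("feminine", "singular", "nominative"))],
    "teče": [LexicalUnit("teći", "verb", ("present", "third person", "singular"))],
    "kroz": [LexicalUnit("kroz", "preposition")],
    "grad": [
        LexicalUnit("grad", "noun", ("masculine", "singular", "nominative")),
        LexicalUnit("grad", "noun", ("masculine", "singular", "accusative")),
    ],
}


def analyse(word: str) -> List[LexicalUnit]:
    """Return every known analysis for a surface form (empty list if unknown)."""
    return ANALYSES.get(word.lower(), [])


if __name__ == "__main__":
    for word in "Velika reka teče kroz grad".split():
        print(word, "->", analyse(word))
```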
Step 4: Choosing an analysis
The computer then uses probability based on word analysis and linguistic rules to choose a single lexical unit for each word. That is, based on statistics, it decides whether the word is feminine/masculine, plural/singular, etc.
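A minimal sketch of that choice, assuming each candidate analysis already carries a probability. The numbers in `CANDIDATES` are invented for the example, and real taggers generally also take the neighbouring words into account, which this sketch does not.

```python
from typing import Dict, List, Tuple

# (lemma, tags, probability). The probabilities are made up for illustration.
Candidate = Tuple[str, str, float]

CANDIDATES: Dict[str, List[Candidate]] = {
    "velika": [
        ("veliki", "adjective.feminine.singular", 0.7),
        ("veliki", "adjective.neuter.plural", 0.3),
    ],
    "grad": [
        ("grad", "noun.masculine.singular.nominative", 0.4),
        ("grad", "noun.masculine.singular.accusative", 0.6),
    ],
}


def choose_analysis(word: str) -> Candidate:
    """Pick the single most probable analysis for a word."""
    return max(CANDIDATES[word], key=lambda candidate: candidate[2])


if __name__ == "__main__":
    print(choose_analysis("velika"))
    print(choose_analysis("grad"))
```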
Step 5: Transfer the grammar from source to target language
The machine then looks up the ‘lemma’ of each word in the bilingual dictionary. The ‘lemma’ of a word is its so-called ‘citation form’. So, for example, the lemma of ‘went’ is ‘go’. In the case of our Serbian sentence, the lemma of ‘velika’ is ‘veliki’. This is the form of the word you would look up in a dictionary.
Next, the machine checks its glossary for possible translations of the lemma of the word you are translating. Possible translations of “veliki”, for example, are ‘big’, ‘large’, and ‘great’. Apertium uses the most general translation. Other systems have probability-based methods of choosing which possible translation to use.
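Here is a small sketch of that glossary lookup. The `BILINGUAL` dictionary and `transfer_lemma` function are hypothetical, and two things are assumptions of the example rather than anything from the session: that the first entry in each list is the “most general” translation, and that unknown lemmas are passed through with a leading asterisk.

```python
from typing import Dict, List

# Hypothetical Serbian-to-English glossary, keyed by lemma.
# Assumption: the first entry in each list is the most general translation.
BILINGUAL: Dict[str, List[str]] = {
    "veliki": ["big", "large", "great"],
    "reka": ["river"],
    "teći": ["pass", "flow"],
    "kroz": ["through"],
    "grad": ["city", "town"],
}


def transfer_lemma(lemma: str) -> str:
    """Translate a lemma by taking the first (most general) glossary entry."""
    options = BILINGUAL.get(lemma)
    if not options:
        # Convention chosen for this sketch: flag unknown words with '*'.
        return "*" + lemma
    return options[0]


if __name__ == "__main__":
    lemmas = ["veliki", "reka", "teći", "kroz", "grad"]
    print([transfer_lemma(lemma) for lemma in lemmas])
```

A probability-based system would replace the `options[0]` line with a scoring step over all the options.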
After translating each word, the machine then looks to see if re-ordering of the words is necessary. For example, from Spanish to English “la pelota blanca” would be rearranged to “the white ball” rather than left as “the ball white”. In this case, however, no re-ordering is necessary.
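The re-ordering check can be sketched as a single rule over tagged words. The function name and the one noun-adjective rule are assumptions for illustration; a real transfer module would have many such rules per language pair.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]  # (translated word, part of speech)


def reorder_noun_adjective(words: List[TaggedWord]) -> List[TaggedWord]:
    """Swap every noun that is immediately followed by an adjective."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "noun" and out[i + 1][1] == "adjective":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return out


if __name__ == "__main__":
    # Word-for-word output from Spanish "la pelota blanca".
    spanish = [("the", "determiner"), ("ball", "noun"), ("white", "adjective")]
    print(" ".join(word for word, _ in reorder_noun_adjective(spanish)))

    # Word-for-word output from the Serbian sentence: already in English order.
    serbian = [("big", "adjective"), ("river", "noun"), ("passes", "verb"),
               ("through", "preposition"), ("city", "noun")]
    print(" ".join(word for word, _ in reorder_noun_adjective(serbian)))
```

The Spanish example gets rearranged, while the Serbian one passes through unchanged because no noun is directly followed by an adjective.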
Step 6: Final translation
The final translation is “Big river passes through city.”
Step 7: Post-editing
The machine translation in this example is pretty close. However, Serbo-Croatian has no articles, so a human would need to add either “the” or “a” before each noun:
The big river passes through the city.

So there you go: all the steps that a machine translator takes while doing its ‘magic’. You can learn more about machine translation on the event wiki here and here.
But machine translation is just one very small component of how both content and software are made available in multiple languages. In the next post I’ll explain what open source software and open content are, and why open translation is important to both.
![[Documentation] How Machine Translation Does its Magic](http://el-oso.net/blog/wp-content/plugins/ttftitles/cache/9fe739715d6847f8fcfc739fa5413c73.png)

Keep writing; I’m reading.
I’m looking forward to your post on why machine translation is important. Personally, I consider it dangerous in the hands of the linguistically naive: it almost always does a poor job, and often does such a spectacularly bad job that unless someone views every word with a high degree of skepticism and is prepared to reverse-engineer the translation glitches, it can do more harm than good.
That said, I have to admit I use it often, and I’ve grudgingly come to enjoy machine translation in some social software situations - for example, when someone turns on a translation bot in Second Life, which has the value of illustrating to everyone in the space how bad translation software is!
“I’m looking forward to your post on why machine translation is important. Personally, I consider it dangerous in the hands of the linguistically naive: it almost always does a poor job, and often does such a spectacularly bad job”
I encourage you to look into the quality of machine translation output when translating between closely related languages. It is understandable that the issues surrounding this would not be widely known, as in the anglophone world we don’t really have any languages which are closely related (Scots could qualify, but there is no really accepted standard).
A quick note on the post: The section on SMT isn’t quite how I explained it, but that is probably down to a problem of my communicating the process rather than David’s scribing. For people who are interested in how SMT works, I would encourage them to look at the Wikipedia page on ‘Statistical machine translation’.
Thanks for the reply, Francis. You’re probably right that the quality is better when translating among closely related languages. But that raises the question of whether automatic translation is necessary among closely related languages. Slang and regional differences aside (which automatic translators usually can’t deal with anyway), aren’t closely related languages in their written, literary-standard form often mutually intelligible without assistance?
Admittedly I may be biased by the languages I have the most exposure to — phonology is about the only thing that keeps Spanish, Italian and Portuguese speakers from understanding each other. Perhaps within other language families there are gaps which stymie humans but are easily bridged by software.
“Admittedly I may be biased by the languages I have the most exposure to — phonology is about the only thing that keeps Spanish, Italian and Portuguese speakers from understanding each other. Perhaps within other language families there are gaps which stymie humans but are easily bridged by software.”
The same goes for Catalan, Occitan, Galician and to a certain extent French. You ask “whether automatic translation is necessary among closely related languages.” As a native speaker of English, you don’t have any problem reading documentation. Pretty much everything is in English.
Consider for a moment if you spoke English and yet your whole computer interface, all advertising, documentation, everything was in Scots. Sure, you can understand it, but isn’t it a bit frustrating? Try browsing around the Scots Wikipedia here:
“Breetish fowk tae fecht Pechtish invaders, follaein the Romans leavin Breetain in the 4t century. This leid wis the forerinner o Modren Inglis; but while the byleids o ither Inglis regions hae been muckle chynged by the influences o ither furrin leids—Norman an Norman–French in parteecular —the Geordie byleid hauds ontae mony chairicteristics o the auld leid.”
- http://sco.wikipedia.org/wiki/Main_Page
Imagine if you had to interact with your government and social services in Scots, if all application forms were in Scots, if people thought you were backward for using English.
We’re trying to make translation for smaller languages practical. Of course, if you don’t believe in linguistic diversity, this argument is rubbish: people can just learn English and have done with it.
[...] my interest in how the internet can facilitate and speed up the translation workflow, I was fascinated to hear [...]