Semi-Automating the DDTP translations: step by step process (contact: #ubuntu-fr-l10n or #ubuntu-translators on freenode).
Rationale
- The DDTP represent around 50 000 strings to translate * 140 languages= 7 million strings to translate !
- On good weeks, a typical yet active translation team translates 500 strings. It would take a hundred weeks with highly motivated volunteers of a large translation team, working non-stop, at their best to fully translate a language.
The solution
- Delegate initial translation to Google Translate and review translations with humans to speed the process.
A Word of Caution before beginning
- Machine translations don't replace (yet) human translations. It needs to be validated and corrected in many cases.
- Depending on the target language, the quality of the translations will fluctuate. (e.g. French, German, Spanish…) should work well. Smaller languages might have more issues.
- Another factor is the complexity of the strings. You'll have really good translations for simple strings and poor translations for poorly written or complex descriptions.
- Finally, DO take time to do a mass review of the translations after the initial import. Given the size of the DDTP, a simple find and replace can save dozens of hours of translation.
- Make sure your review team is closed and has strong standards to avoid validation of poorly translated strings by too enthusiastic members
Process
- 1/ Preparing the DDTP for Machine Translation
- Ask for a Launchpad translation export of the DDTP for main, universe, restricted and multiverse (.po files for your language and .pot templates)
- Gather the .po and .pot files in a folder on your drive
- We had to split the DDTP universe po in 14 files and main in 5 files to make it match Google Translator Kit's size requirements. You'll have around 14 files in UNIVERSE, and 5 in MAIN but 1 in MULTIVERSE (use MULTIVERSE for tests)
- Use the Python script below created by Redmar to split the po files in files below 900k
- Gather the translated po files on a shared online folder and split the work between team members.
- 2/ Perform the Machine Translation
<REPEAT FOR EACH CHUNK OF THE FILE>
- Import the file in Google Translator Kit : http://translate.google.com/toolkit/docupload?hl=fr
- You might see errors:
- the file might be too big (remove a few strings)
- After a couple of seconds, Google Translator Kit displays the file along with automatic translations. You could theoretically review them here, but that's not the point.
- You should fix some mistakes directly in Google Translator Kit thanks to its powerful search and replace functionnality that can be limited to translations only
- The \n (line break) become \ n which messes everything
- The \" become \ " (with a space) which messes everything
- Check the po file validates in Poedit (do more fixes directly in Google Translator Kit if there are many mistakes)
</REPEAT FOR EACH CHUNK OF THE FILE>
- 3/ Bring back the suggestions into your drive
Download all the translated po back to your drive or your shared online folder in a specific folder (named translated to avoid messing up with original files)
- 4/ Do mass-Replace of obvious mistakes in each file to be able to import them back into Launchpad and also save a lot time when correcting the strings
- DO NOT CORRECT ORIGINAL ENGLISH STRINGS: if you do so, Launchpad will consider those new strings (as now they differ) and thus discard the string AND your translation.
- Use Gedit syntax coloring to quickly spot mistakes
- Use RegEx to correct as many mistakes as possible before initial import into the Bogus Project. The Time Savings are HUGE.
- Make sure that the PO file validates (using PoEdit)
- It seems that Google Translate adds spaces after \ .
- The \n (line break) become \ n which messes everything
- The \" become \ " (with a space) which messes everything
- The syntaxic coloration in gEdit highlights the mistakes well.
- It might create bugs during the launchpad import as "" is a string delimiter.
- Strings that you should also be careful for (Global)
- Any word in English in the translated part
- The beginning of your string should be capitalized (Paquet, Interface…)
- Consistency in similar strings
- Punctuation and Typography rules (spaces..)
- Typos in the original strings should be reported as bugs against the DDTP (https://bugs.launchpad.net/ddtp-ubuntu/+filebug)
- Strings that you should be careful for (French specific)
- 5/ Import the PO files back in our bogus project in Launchpad
- We can't import them as fuzzy in the DDTP, because Launchpad doesn't handle fuzzy (https://bugs.launchpad.net/launchpad/+bug/493084) . We don't want to import the automatic translations as "Translated" because there still are many mistakes (ans also because it's fed back to upstream). We thus need to use the "bogus project" solution to show our machine translations as suggestions in the real DDTP. As a result, they should appear as suggestions for the real DDTP so that they can be reviewed by humans.
- We have thus created a cross-language project. You shouldn't create your own. Get in touch on the ubuntu-translators ML to import your language.
- You don't have to merge back the files before uploading (http://ubuntu.5.n6.nabble.com/Re-Ubuntu-l10n-de-community-Bug-1057767-Re-Some-translation-templates-are-not-in-sync-with-upstream-td4997338.html).
- Import the DDTP pot files (templates)
- Import the various automatically translated chunks (next chapter)
- Wait for the files to be accepted and imported
- 6/ Review Process
- Go to the DDTP, the imported translations in DDTP automation bogus project should appear as suggestions
- Ask for forum members to do a first review for obvious mistakes and make suggestions
- Then, team members can make the actual review and validate the translation
Additional Advice
- Focus on validating high-priority and high-impact packages first
- Using Nightmonkey, focus on the most visible packages. Just replace the langage code on the links below
- Focus on mass validations & low-hanging fruits first
- For validations, you can use the very buggy search feature on Launchpad. It timeouts 8 times out of 10, but is valuable when it doesn't (Very specific searches reduce the risk of timing out to 50%)
- French Specific (you can take inspiration)
- Done by Sylvie : "this package provides a module for"
- Done by Sylvie : "Translation data for all supported packages for:"
- Done by Sylvie : "Translation data updates for all supported packages for:"
- Done by Sylvie : "Translation data updates for all supported KDE packages for:"
- Done by Sylvie : "Translation data for all supported KDE packages for: "
- Done by Sylvie : KDE translation updates for language
- Done by Sylvie : "Translation data updates for all supported GNOME packages for:"
- Done by Sylvie : "Translation data for all supported GNOME packages for: "
- Done by Sylvie : "KDE translations for language "
Bring back the validated translations in the Bogus Project
- To simplify mass corrections, regularly export the official DDTP in their normal form and reimport it in the Bogus project.
- The manually reviewed suggestions will replace the automatic suggestions, reducing time to mass correct mistake and removing the previous automated suggestions.
Recycling fuzzy translations from older releases (contact Redmar for more details)
- When you merge the translations of an older version of Ubuntu into the current version, there will be a lot of « fuzzy » translations for strings that are similar (for example, meta packages for different programs, debugging symbols etc).
- msgmerge quantal_ddtp.po raring_ddtp.po -o merged_ddtp.po, for example
- Those fuzzy strings often only need a few small changes (eg. program name) to be accepted, which can really speed up translations. And you don't have to worry about google putting in a weird translation, since it is all based on earlier translations done by a human.
- You have to unfuzzy them using Find & Replace in a text editor. Remove the line with #, fuzzy. Save. Upload.
#. type: newglossaryentry{#2} #: frontmatter/glossary-entries.tex :15 #, fuzzy msgid "bla bla english" msgstr "bla bla translated"
Status, Results, Todo, Bugs & Questions
- Results
- Todo & Bugs
- Create a DDTP automation LP team and invite coordinator for each language
- Questions
- Your question here
- Status for French
- Main
- 6000 suggestions to generate, splitted by Pierre, 5 parts, translated and merged
- Restricted
- Translated manually, not applicable
- Multiverse
- 300 suggestions to generate, likely technical & difficult, Splitted by Pierre & Merged
- Universe
- 14 parts, translated and merged