Here I won't go into the details of why it is necessary to use CAT tools for translating the user interface of software as huge as Drupal. Instead, see the illustration of my tweet from 1 Jan 2018:
Initial research of existing efforts
Before mistakenly reinventing the wheel, this question kept shouting in my mind:
Am I alone with this idea? Really?
So I quickly ran a search for the term "translation memory" site:localize.drupal.org. It returned the following pages:
- Poedit vejledning ("Poedit guide", from the Danish team, 5 Oct 2009) is mostly just a link pointing to a guide on how to use the Poedit software.
- Velkommen til oversettergruppa for norsk bokmål (from the Norwegian team): the term "translation memory" comes up only in the very last comment, by gisle on 5 Oct 2014, mentioning a proposal to use a TM.
- Bei der deutschen Übersetzung mithelfen - wie geht das? (from the German team): similarly, the term "translation memory" comes up only once, in an old comment by Thomas_Zahreddin from 13 Oct 2010, also proposing to use a TM.
- 翻訳に便利なツール集 (from the Japanese team, 21 Mar 2014) also points to online tools (besides Poedit again) like Open-Tran.eu (discontinued since then) and btranslator.org.
- 翻訳の一貫性を保つために - 翻訳支援ソフトの利用 (from the Japanese team, 16 Feb 2010) also lists available tools like Poedit and Google Translator Toolkit.
- 翻訳スプリント (from the Japanese team, 15 Feb 2010) seems to be the most useful collection of information: aiwata55, in their comment, lists the pros and cons of the two most-mentioned tools, Poedit and GTT, and also mentions the TMX format. The discussion was mainly about how to integrate both tools into one workflow by regularly exchanging the TMX between local Poedit instances and the central GTT repository.
Evaluating available tools
Poedit
- Con: its TM is local, not shared between contributors, making it difficult to maintain consistency.
- Con: the .po file format is quite old, and many newer standards (e.g. XML-based formats) are available nowadays.
Google Translator Toolkit
- Pro: free and cloud-based.
- Con: does not support any dedicated localization-related file formats (only those that are somehow interesting for Google itself, for its own products and services).
WordFast's FreeTM (SaaS)
- Pro: free and cloud-based.
- Con: does not support Drupal's .po file format, so intensive conversion is required in both directions.
UPDATE: Crowdin (SaaS) is also quite popular among FLOSS web application projects (e.g. other CMSes like Joomla, Orchard, OpenCart, PrestaShop use it). It would probably be worth spending some time on a more detailed evaluation.
My conclusion of this quick research
- It's a dinosaur topic with a long history; the idea has come up multiple times.
- At the time of writing (Jan 2019), I'm unaware of any signs of success from any locale team.
So let's jump right into the steps! Onward toward the great goal: migrating all our translated Drupal strings into a cloud-based TM!
Obtaining the input data
The very first step is to get your gold in your hands: the precious translations that you and your fellow translators submitted to l.d.o over the years. For this tutorial I downloaded a full export of the Drupal core 8.4.8 Hungarian localization file, because that is the latest 100% complete one.
- File size: 1.3 MB
- Lines: 29 887
Heads up! This is not the number of strings, because lines in the .po file automatically break after a length limit. See the next step, in which I concatenate these lines of the file.
It's always good practice to cross-check the current state of our data to detect any possible mistakes before we proceed with the transformation steps. If you list all the strings of Drupal core 8.4.8 on localize.drupal.org, it shows 8564 strings for me (distributed over 857 pages). Now, if you count the matches of the `msgid\s"[^"]` regex pattern in the freshly downloaded file, you should see exactly the same: 8564.
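The same count can be sketched in a couple of lines of Python; `count_strings` is just an illustrative helper name, not part of the original workflow:

```python
import re

def count_strings(text: str) -> int:
    """Count msgid lines that open a non-empty string.

    The [^"] excludes the header's empty msgid and any wrapped
    continuation lines, mirroring the msgid\\s"[^"] pattern above.
    """
    return len(re.findall(r'msgid\s"[^"]', text))

sample = 'msgid ""\n"wrapped line"\nmsgid "Save"\nmsgstr "Mentés"\n'
print(count_strings(sample))  # prints 1; on the full 8.4.8 export it should be 8564
```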
Normalizing .po format
I performed search & replace with these regex patterns:

- Search: `msgid\s""\n"`
- Replace: `msgid "`

  In your locale the first occurrence should also be the "@site is currently under maintenance. We should be back shortly. Thank you for your patience" string, ID#20842, around lines 1062-63. This replacement should result in some 1552 changes in your file.

- Search: `msgstr\s""\n"`
- Replace: `msgstr "`

  The first occurrence should appear right after the above-mentioned string, around lines 1065-66. This replacement resulted in 1968 changes in Hungarian – irrelevant for other locales.

- Search: `"\n"`
- Replace: (empty)

  This replacement resulted in 8912 changes in the Hungarian locale – irrelevant for other locales.
Tip: Keep in mind that the .po file has a header at its beginning, which should be left untouched by these replacement patterns. One trick is to enable your plain text editor's "Replace only in selection" (or similar) feature before you run the command.
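The three replacements above can also be sketched as a small Python function. This is a minimal sketch, assuming a standard .po export where the header ends at the first blank line; `normalize_po` is a hypothetical helper name:

```python
import re

def normalize_po(text: str) -> str:
    """Join wrapped .po lines so each msgid/msgstr fits on one line.

    The header (everything before the first blank line) is split off
    first, so the replacements below cannot touch it.
    """
    header, sep, body = text.partition("\n\n")
    body = re.sub(r'msgid\s""\n"', 'msgid "', body)    # join msgid continuations
    body = re.sub(r'msgstr\s""\n"', 'msgstr "', body)  # join msgstr continuations
    body = re.sub(r'"\n"', '', body)                   # join any remaining wraps
    return header + sep + body

sample = 'msgid ""\n"Hello "\n"world"\nmsgstr ""\n"Szia "\n"világ"\n'
print(normalize_po('# header\n\n' + sample))
```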
Now it's time to counter-check the results of the series of modifications we made to the file:

- As I assume, your .po file of Drupal core 8.4.8 should now contain 17 461 lines in total, right?
- First, deduct the 14 lines of the header at the top; we get 17 447.
- Then run a count with the `msgid_` regex pattern, which shows the plural forms of the original English strings (it should be 107 for your locale too), and deduct it: we are at 17 340.
- Now query the `msgstr\[` regex pattern (for Hungarian it results in 214, probably for you too). These are the plural versions of some strings. Deduct this number again; now we get 17 126.
- Now divide by 2 (each string has two lines, one `msgid` and one `msgstr`) and voilà! It should still be 8564 strings. (Don't worry if it isn't. For example, it showed 8563 for me, so apparently I lost one already at the first step. However, taking into consideration how cumbersome this process is, I think this loss is acceptable.)
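The counter-check above condensed into quick arithmetic (using the Hungarian 8.4.8 example numbers):

```python
total_lines = 17_461   # lines in the normalized .po file
header_lines = 14      # the untouched header at the top
msgid_plural = 107     # matches of the msgid_ pattern
msgstr_plural = 214    # matches of the msgstr\[ pattern

strings = (total_lines - header_lines - msgid_plural - msgstr_plural) // 2
print(strings)  # prints 8563 – one short of the expected 8564
```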
Now we have a normalized (1 line ~ 1 string, no unnecessary line breaks) file.
Say goodbye to context information
Currently, it's a known limitation that the regex patterns introduced in this workflow are not prepared to automatically handle the `msgctxt` lines, which contain extra information about some strings' contexts. When I briefly checked the situation with them, I realized two things: A) there are relatively few of them (only about 100+ among the 8000+ strings in total); B) taking care of them would heavily increase the complexity of the process. So, considering these arguments, I decided to leave them for manual processing. This regex pattern matches them, so I ran it on the freshly normalized .po file:

- Search: `msgctxt\s".*"\n` – matches the entire line, including the linebreak at the end.
- Replace: `#\n` – inserts a single comment sign instead.
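The same replacement can be sketched in Python (the function name is just illustrative):

```python
import re

def drop_msgctxt(text: str) -> str:
    """Replace each msgctxt line with a single comment line,
    preserving line numbers for later comparison."""
    return re.sub(r'msgctxt\s".*"\n', '#\n', text)

print(drop_msgctxt('msgctxt "Long month name"\nmsgid "May"\nmsgstr "Május"\n'))
```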
Separating out plural forms
Another difficult problem to handle is the situation with plural-formulated strings, which are usually made up of multiple lines (depending on your locale's language standard; for Hungarian, for example, 4 lines in total per string). As I briefly counted them with the simple `d_plural` search pattern, there are 107 strings in total, of which 15 are multi-sentence. So, similarly to the `msgctxt` lines discussed in the previous step, I decided to separate them out for manual processing later on.

a) Creating the plurals-only file

- Search: `msgid(?:_plural){0}\s".+"\nmsgstr\s".*"`
- Replace: `#\n#` – inserts single comment signs instead.
b) Saving the rest of the strings as a no-plurals file
- Search: `(?:msgid(?:_plural)?\s".*"\n)+(?:\n?msgstr(?:\[\d\])\s".*")+` – this matches all plural strings (regardless of the number of sentences in them), but in case you need it, here's how to match the multi-sentence plural strings only: `(msgid(?:_plural)?\s"(?:.*[\.\?\!\:\;]\s)+(?:.*[\.\?\!\:\;])"\n)+(?:\n?msgstr(?:\[\d\])\s"(?:.*[\.\?\!\:\;]\s)+(?:.*[\.\?\!\:\;])")+`
- Replace: `#\n#\n#\n#` – depending on your locale's plural configuration, count how many lines the above search pattern matches per string, and specify here as many repeats of "#\n" as necessary. This way we preserve the line numbers of the processed files, which makes debugging comparisons much easier.
It's also handy to know these two strings:
- ID#353636: `<strong>Warning:</strong> There is currently 1 menu link in %title. It will be deleted (system-defined items will be reset).`
  It's one of the 15 strings which are multi-sentence AND plural-formulated at the same time, so it's a perfect candidate to check with. (See around line number 6069 in the .po file; probably the first occurrence for the above search pattern.)
- ID#2546659: `1 entity display updated: @displays.` / `@count entity displays updated: @displays.`
  This is supposed to be the only matching string in a freshly exported file. (See around line number 27699 in the .po file.)
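The separation can also be sketched in Python. This is a rough sketch with a simplified plural pattern (it requires each matched line to end with a newline and skips the optional blank-line handling of the original); `split_plurals` is only an illustrative name:

```python
import re

# Matches a plural entry: one or more msgid/msgid_plural lines followed
# by one or more indexed msgstr[n] lines.
PLURAL = re.compile(r'(?:msgid(?:_plural)?\s".*"\n)+(?:msgstr\[\d\]\s".*"\n)+')

def split_plurals(text: str):
    """Return (plurals-only text, no-plurals text)."""
    plurals = "".join(PLURAL.findall(text))
    rest = PLURAL.sub("", text)
    return plurals, rest

sample = ('msgid "1 item"\nmsgid_plural "@count items"\n'
          'msgstr[0] "1 elem"\nmsgstr[1] "@count elem"\n'
          'msgid "Save"\nmsgstr "Mentés"\n')
plurals, rest = split_plurals(sample)
print(rest)  # only the non-plural entry remains
```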
Counting multi-sentence strings
Feel free to skip this chapter if you're not that into regular expressions.
Now we're facing yet another challenge with Drupal's .po files: strings may consist of multiple sentences (for example, the very frequent `... cannot be undone. Are you sure?` combination of sentences), and the TM will not recognize those properly. However, it is necessary to feed the TM software with more granular translation units (also called TUs), so it can assist us much better during translation whenever we arrive at a string which has a similar, already translated sibling. Therefore, in this step we need to count, mark, and process those strings. For this purpose we can also use regex patterns:

- Only English original lines: `msgid\s".*(\.|\?|!|:|;)\s.*"$` results in 1261 (should be the same for you too)
- Only your translated lines: `msgstr\s".*(\.|\?|!|:|;)\s.*"$` results in 1254 (probably different for each locale)
- Both together: `msg(id|str)\s".*(\.|\?|!|:|;)\s.*"$` matches all entries containing sentence-separator characters. Not surprisingly, the result of point 3 should equal the sum of points 1 and 2.
Although this almost-2500 number may seem freaky, we can refine it even further: the `msgid\s".*(\.|\?|!|:|;)\s.*"\nmsgstr\s".*(\.|\?|!|:|;)\s.*"` regex pattern counts only those msgid-msgstr pairs where both contain any of the sentence-separator characters. For me it boils down to only 1154 line pairs, which is less than half of the previous result. Phew!

Continuing this logic, what happens if we count by each separator character separately?
- Period: `msgid\s".*\.\s.*"\nmsgstr\s".*\.\s.*"` results in 923
- Question mark: `msgid\s".*\?\s.*"\nmsgstr\s".*\?\s.*"` results in 2
- Exclamation mark: `msgid\s".*!\s.*"\nmsgstr\s".*!\s.*"` results in 5
- Colon: `msgid\s".*:\s.*"\nmsgstr\s".*:\s.*"` results in 254
- Semicolon: `msgid\s".*;\s.*"\nmsgstr\s".*;\s.*"` results in 15
These numbers don't say too much yet; they only answer the curious question "Approximately how many strings can be affected?" They should also vary depending on your locale; the above is just an example from Hungarian. It would also be good to know how many sentences (aka TUs) these strings will need to be cut into.
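The per-separator counts above can be reproduced with a small loop; `count_pairs` is a hypothetical helper, and the short sample stands in for the full normalized file:

```python
import re

def count_pairs(text: str, sep: str) -> int:
    """Count msgid/msgstr pairs where BOTH lines contain the given
    separator character followed by a space (a mid-string sentence break)."""
    s = re.escape(sep)
    return len(re.findall(rf'msgid\s".*{s}\s.*"\nmsgstr\s".*{s}\s.*"', text))

sample = 'msgid "Done. Are you sure?"\nmsgstr "Kész. Biztos benne?"\n'
for sep in ".?!:;":
    print(sep, count_pairs(sample, sep))
```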
Counting sentences closed with periods only:
In the `msgid\s"(?:.*\.\s){1}(?:.*\.)?"\nmsgstr\s"(?:.*\.\s){1}(?:.*\.)?"` regex pattern, both `{1}` quantifiers can be modified.

- 2 sentences (828 strings)
- 3 sentences (231 strings)
- 4 sentences (82 strings)
- 5 sentences (28 strings)
- 6 sentences (10 strings)
- 7 sentences (3 strings)
- 8 sentences (2 strings)
- 9 sentences (1 string)
- 10 sentences (1 string)
Continuing in this direction, we can improve the pattern further to match a wider portion of strings:
Allowing any sentence-closing characters:
In the `msgid\s"(?:.*(\.|\?|!|:|;)\s){1}(?:.*(\.|\?|!|:|;))"\nmsgstr\s"(?:.*(\.|\?|!|:|;)\s){1}(?:.*(\.|\?|!|:|;))"` regex pattern, both `{1}` quantifiers can be modified.

- 2 sentences (950 strings, +122)
- 3 sentences (264 strings, +33)
- 4 sentences (99 strings, +17)
- 5 sentences (44 strings, +16)
- 6 sentences (18 strings, +8)
- 7 sentences (9 strings, +6)
- 8 sentences (3 strings, +1)
- 9 sentences (2 strings, +1)
- 10 sentences (2 strings, +1)
- 11 sentences (1 string)
- 12 sentences (1 string)
- 13 sentences (1 string) (#2404357) – by our terms, this is the single string containing the most (sub-)sentences in Drupal core 8.4.8.
I marked in parentheses at the end of each line how many extra strings the extended set of sentence-closing characters adds compared to the purely period-closed ones. As you can see, not that many. However, the situation would be much, much more difficult if we also allowed the comma character as a TU separator:
Allowing any sentence-closing characters, plus even comma too:
In the `msgid\s"(?:.*(\.|\?|!|:|;|,)\s){1}(?:.*(\.|\?|!|:|;|,))"\nmsgstr\s"(?:.*(\.|\?|!|:|;|,)\s){1}(?:.*(\.|\?|!|:|;|,))"` regex pattern, both `{1}` quantifiers can be modified.

- 2 sentences (1210 strings, +260)
- 3 sentences (571 strings, +307)
- 4 sentences (328 strings, +229)
- 5 sentences (204 strings, +160)
- 6 sentences (142 strings, +124)
- 7 sentences (92 strings, +83)
- 8 sentences (62 strings, +53)
- 9 sentences (40 strings, +39)
- 10 sentences (31 strings, +30)
- 11 sentences (2 strings)
- 12 sentences (18 strings)
- 13 sentences (13 strings)
- 14 sentences (8 strings)
- 15 sentences (7 strings)
- 16 sentences (3 strings)
It's important to understand the special role the comma has: even though these numbers seem greater than the counts above, ID#1977918, for example, demonstrates perfectly why it would be tricky to cut up strings by commas as well: simply because the comma also functions as a list separator within sentences.
Side note: of course, a similar pattern works for the single-sentence strings as well (`msgid\s"(?:.*(\.|\?|!|:|;))"\nmsgstr\s"(?:.*(\.|\?|!|:|;))"` results in 3226), but fortunately we do not need to deal with them, because 1 sentence = 1 TU.

Get prepared for non-standard grammar
Who said that all the UI strings used in Drupal follow perfect English grammar? Also, why not allow that in a localized string you may not want to close a sentence the same way it was closed in the original? And when strings are formatted with HTML tags, their very last character will probably be a ">". So we need to twist our pattern a little bit further to include these sentences as well. As a counter-test, this is how to check how many strings have no sentence-closing character at their very end:

`msgid\s"(?:.*[\.\?\!\:\;]\s)+(?:.*[^\.\?\!|\:|\;])"\nmsgstr\s"(?:.*[\.\?\!\:\;]\s)+(?:.*)"`
Separating single-sentence strings from multi-sentence ones
Until now, this seems to be the ultimate combination to detect multi-sentence strings in the most intelligent way (yet): `msgid\s"[^\"](?:.*(?:\s\w*[^&][a-z]*\;)?[\.\?\!\:\;]\s)+(?:.*[\.\?\!\:\;]?)"\nmsgstr\s"(?:.*[\.\?\!\:\;]\s)?(?:.*[\.\?\!\:\;]?)"`. Its capabilities:

- Allows (semi)colons as sentence-closing characters as well.
- But detects HTML entities and does not consider their semicolons as sentence endings.
- Allows "unclosed" last sentences, which can normally happen.
- Allows a different number of sentences in the translated string than in the original.
For me, with the Drupal core 8.4.8 .po file, after the transformations described above this pattern results in 1141 strings. So, as you can see, only a subset (cca. 13%) of our translated data needs to be broken up into finer granularity before being converted into TUs. Similarly to what we did earlier in Step 4 with the plural-formulated strings, it's also worth splitting the entire .po file into two parts at this point. There are (at least) two reasons why:
- Sense of achievement: the major part of our normalized .po file from Step #2 consists only of single-sentence strings, so if we have them separated, impatient readers of this tutorial can jump right to conversion into the standard TM exchange format (called TMX) and upload to any cloud-based TM software.
- Easier focus: we will need to pay close attention only to the multi-sentence strings when reviewing how they got paired up by the regex replacement (in the next step). To ease this process, we can get rid of the clutter now, as wise preparation.
a) Creating the single-sentence only file
As a very basic separation, I saved two different copies of the current .po file. First, add the "-single-sentence" suffix to the end of the filename of a new copy of the previous .po file. Then perform the following operation on it:
- Search: `msgid\s"[^\"](?:.*(?:\s\w*[^&][a-z]*\;)?[\.\?\!\:\;]\s)+(?:.*[\.\?\!\:\;]?)"\nmsgstr\s"(?:.*[\.\?\!\:\;]\s)?(?:.*[\.\?\!\:\;]?)"` – exactly the same regex pattern introduced at the beginning of this step.
- Replace: `# msgid: That was a multi-sentence string, deleted from this file.\n# msgstr: Ez egy többmondatos sztring volt, törölve ebből a fájlból.` – the hash mark signals a comment in .po syntax, so the conversion into TMX format will skip these lines. In the second part of the replacement pattern I used a translated version of the first part's English comment, just for fun. Feel free to customize it; it has no importance at all.
Tip: naturally, you can safely leave the replacement text unspecified, so that empty lines appear in your editor when you run the command. The file size also gets a bit smaller, which may benefit processing performance later on. I specified the replacement texts to keep a memo for myself of where modifications happened, for when I compare the file to previous steps' output files.
b) Saving the rest of the strings as a multi-sentence file
- Search: `msgid\s"[^\"][^\.\?\!\:\;\n]*[\.\?\!\:\;]?"$\n^msgstr\s".*"` – note that in the msgstr half a widely permissive combination is used, because translated sentences do not always use the same characters.
- Replace: `#\n#` – Tip: as this is the bigger portion of the file, it's not recommended to use a long sentence as the replacement just for fun; otherwise the output file can grow by as much as 1-2 MB.
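The split can be approximated in Python as well. Note that this sketch uses a deliberately simplified multi-sentence test (a sentence-closing character followed by a space inside the msgid) instead of the full editor patterns above, and it assumes the normalized one-line-per-string layout:

```python
import re

# Simplified test: the msgid contains a sentence-closer followed by a space.
MULTI = re.compile(r'msgid\s"[^"]*[.?!:;]\s')

def split_sentences(lines):
    """Return (single-sentence lines, multi-sentence lines) from an
    alternating msgid/msgstr line list."""
    single, multi = [], []
    for i in range(0, len(lines) - 1, 2):
        pair = lines[i:i + 2]
        (multi if MULTI.match(pair[0]) else single).extend(pair)
    return single, multi

single, multi = split_sentences([
    'msgid "Save"', 'msgstr "Mentés"',
    'msgid "This cannot be undone. Are you sure?"',
    'msgstr "Ez nem vonható vissza. Biztos benne?"',
])
print(len(single) // 2, len(multi) // 2)  # prints: 1 1
```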
Removing placeholder lines
During the previous steps we replaced the unneeded lines with comments in order to preserve the order of strings for debugging and counter-checking purposes. At this point of the process we can safely remove these extra lines to slim down our .po files.
- Search: `#\n#\n`
- Replace: (empty)
Fragmenting UI strings into TUs
Maybe some would find it easier to write scripts for transforming text structures, but I simply love the power of regular expressions, so I decided to go this way. As you may have already guessed, we will perform the changes in iterations for each sentence-separator character. To proceed, you will definitely need an advanced plain text editor (not a programming IDE, but something like Atom or Notepad++), because the replacements refer to regex capturing groups. Let's start hacking!
- Search: `msgid\s"(?:(.*(?:\s\w*[^&][a-z]*\;)?[\.\?\!\:\;])\s)+((?:.*[\.\?\!\:\;]?))"\nmsgstr\s"(?:(.*[\.\?\!\:\;])\s)?((?:.*[\.\?\!\:\;]?))"`
- Replace: `# Sentence One:\nmsgid "$1"\n# Sentence Three:\nmsgstr "$3"\n# Sentence Two:\nmsgid "$2"\n# Sentence Four:\nmsgstr "$4"`
As far as I have had time to tinker with it, this pattern seems to do the job.
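If you'd rather script this step than run iterated editor replacements, here is a hedged alternative sketch: it splits one msgid/msgstr pair into per-sentence pairs with `re.split` instead of the capturing-group replacement above, and keeps the pair whole when the sentence counts differ (an assumption of this sketch, not part of the original workflow):

```python
import re

# Split after a sentence-closing character followed by whitespace.
SENT = re.compile(r'(?<=[.?!:;])\s+')

def fragment(msgid: str, msgstr: str):
    """Split one string pair into per-sentence TU pairs; keep it whole
    when the original and the translation differ in sentence count."""
    src, dst = SENT.split(msgid), SENT.split(msgstr)
    if len(src) != len(dst):
        return [(msgid, msgstr)]  # sentence counts differ: leave untouched
    return list(zip(src, dst))

print(fragment("This cannot be undone. Are you sure?",
               "Ez nem vonható vissza. Biztos benne?"))
```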
Removing placeholder comments
We can use this pattern to remove our temporary placeholder comments:
- Search: `# Sentence \S{2,5}:\n`
- Replace: (empty)
...
To be continued. Stay tuned :)
