Reusing software translations between Android, iOS and web

TL;DR: There are a lot of small things which you can do to improve translation reuse, but they’d have to be applied from the beginning of the project.

If you happen to have one product ready and translated, whereas the other(s) need translations, I advise you to sign up for POEditor. It is a very good product and you can tell the guys know what they’re doing. During a free trial, you can import the existing translations from one product and reuse them in another thanks to the translation memory, with very little effort. It supports XLIFF, Android XML, JSON and a few other file formats. The drawbacks are that its TM only finds exact matches, and that after the trial, the pricing isn’t cheap if you have many strings and languages (that’s one of the reasons we didn’t buy the subscription).

To achieve good results though, it’s worth reading on, to avoid some problems which impede string reuse.

Conversely, if you’re just starting your projects, I advise you to:

  • follow these guidelines, and set up processes that allow translation unit reuse in the future,
  • translate only one app first, and iterate on the translations until they’re good enough,
  • when the first app’s translation is done, sign up for POEditor (or find another translation memory tool) and export the translations into the remaining products.

Introduction

Software translation and localization is a difficult subject. It seems deceptively simple, but as a Polish proverb says, “the farther into the forest, the more trees”. There are many people involved (developers, product managers, the translation agency, translators) and it may take many iterations to obtain good results (a subject for a whole separate blog post).

While there are many resources on the web about software translation best practices, there are very few about how to make multiple projects in disparate technologies share their translations. Translation reuse is not straightforward, because each software platform and framework approaches the problem differently; and since translation agencies very often simply do not do a good job, it’s important to aim for high translation reuse once good translations are available.

In this blog post, I will write about the task I had last month: localizing an Android app while reusing as many translations as possible from the existing iOS and web apps. I will discuss the pain points to be aware of, and provide some solutions - but there is no silver bullet.

Best practices

First, I will cover two (of the many) best practices (BP) of software translation. I will refer to a string to be translated as a “translation unit”.

Best practice 1: don’t concatenate strings and variables; use a template string with a placeholder

When dynamic data has to be put into a UI string, always use a translation unit with a placeholder, like Delete %s from contact list?, and replace the placeholder in the code.

Each language has its own well-defined sentence order. In German, verbs often go at the end of the sentence. Concatenating strings therefore doesn’t make sense; the whole sentence has to be translated, with the placeholder correctly positioned by the translator.
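
To make this concrete, here is a minimal sketch in TypeScript of positional-placeholder interpolation (a hypothetical helper, not any particular i18n library):

function format(template: string, ...args: string[]): string {
  // Positional placeholders like %1$s let the translator reorder arguments freely
  return template.replace(/%(\d+)\$s/g, (_, index) => args[Number(index) - 1] ?? "");
}

// English keeps the name in the middle of the sentence...
format("Delete %1$s from contact list?", "Alice"); // "Delete Alice from contact list?"
// ...while the German translation is free to move it where its grammar requires:
format("%1$s aus der Kontaktliste löschen?", "Alice"); // "Alice aus der Kontaktliste löschen?"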

Best practice 2: always include punctuation in the translation unit

If the UI string looks like Foobar:, the translation unit must be Foobar:, not Foobar.

You might think you could just add the colon in the code or UI template, but that’s not correct. French grammar rules say that characters like :, ;, !, ? must be preceded by a single whitespace. So the French translation of Foobar: will be Foobar :.

Real life

An important realization is that software translation is not a one-off thing, but a continuous process, which consists of:

  • Initial translation,
  • Adding new strings (and hence the new translations),
  • Updating existing strings (and hence also the translations).

This alone makes it tricky and complex to keep things in sync, but translation reuse is more difficult than it should be mostly because each major platform handles it differently.

Different conceptual attitude on iOS and Android

On Android, the standard is to externalize the translation units into a strings.xml file in an appropriate subfolder of the project, one per language (including the primary language).

On the web, the standard file format for data is now JSON, so naturally it’s also commonly used for translations (at least in JavaScript-powered Single-Page Applications). Projects like angular-translate popularized this approach: have one JSON file per language (including the primary language), loaded dynamically at runtime, and then passed to the framework to populate the strings in the UI. From a translation point of view, this is quite similar to the Android approach, though there are some technical differences.
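
Here is a sketch of that per-language JSON setup (the file layout and key names are illustrative, not angular-translate’s actual API):

// /i18n/en.json: { "DELETE_CONTACT": "Delete %1$s from contact list?" }
// /i18n/de.json: { "DELETE_CONTACT": "%1$s aus der Kontaktliste löschen?" }
type TranslationTable = Record<string, string>;

async function loadTranslations(lang: string): Promise<TranslationTable> {
  const response = await fetch(`/i18n/${lang}.json`); // one JSON per language, loaded at runtime
  return response.json(); // handed over to the framework to populate the UI
}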

In iOS, the approach is quite different. The standard practice is that you actually DO hardcode the strings in the code, in the primary language of the app, but wrapped in an NSLocalizedString call. You can then use Xcode to analyze the project and generate a set of XLIFF files for the translation agency (one per language, except the primary one), and when you get the translations back, you import them into the project as a set of Localizable.strings files.

At the very beginning of the project, I considered having a shared source-of-truth git repo with all the strings in the primary language, their translations, and some tooling on top that would build the platform-specific files. The idea was abandoned due to the complexity coming from iOS being so different from the rest, and from the apps evolving at a different pace (which could turn problematic at some point, though it could be solved with git branches).

Enter the translation memory

“Translation memory” (TM) is the opposite of the “shared source-of-truth” approach: each platform remains independent, but can pull translations from a shared pool. This is something professional translation agencies use, but you can also find free software doing it, or roll your own.

The way TMs work is simple: you push your translation units and their respective translations into the TM, and when you later need to translate the same or a similar translation unit in a different project, the TM finds a match.

Basic TMs can only find exact matches; more advanced TMs can find fuzzy matches, where the inputs vary only slightly. The problem is that there can be many small differences between translation units, for a number of reasons I will present below.
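
A toy TM in TypeScript to illustrate the idea (exact matching only; real TMs layer fuzzy matching, typically based on string-distance metrics, on top):

class TranslationMemory {
  private entries = new Map<string, string>(); // source string -> translation

  push(source: string, translation: string): void {
    this.entries.set(source, translation);
  }

  lookup(source: string): string | undefined {
    return this.entries.get(source); // "Book now" will NOT match "Book now!" here
  }
}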

Small but important impediments to translation unit reuse

Slightly different inputs

Mostly due to punctuation differences (!, : etc.), as previously explained in BP2:

Book now vs Book now! are different translation units.

This can also happen due to:

  • extra whitespace,
  • different whitespace (regular vs non-breaking whitespace),
  • differing apostrophes and quotes (regular ASCII vs. fancy ones).

Good TMs can handle it and find a fuzzy match, but it still might require manual intervention to align the translations.

Regarding the apostrophes and quotes, it might be worth having tooling in place to normalize them; for example, to replace the ASCII ones with “fancy” ones (U+2019 right single quotation mark, U+201C and U+201D left/right double quotation marks). This has a few advantages (see the sketch after the list):

  • the fancy apostrophes and quotes don’t need to be \-escaped on Android,
  • different translators might use one or the other over time; normalization removes that source of UI inconsistency,
  • consistency between the platforms is guaranteed.
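
A minimal sketch of such a normalization pass, assuming you only want the two substitutions below (extend the rules to your own conventions):

function normalizeTypography(s: string): string {
  return s
    .replace(/(\p{L})'(\p{L})/gu, "$1\u2019$2") // ASCII apostrophe between letters -> U+2019
    .replace(/"([^"]+)"/g, "\u201C$1\u201D");   // straight double quotes -> U+201C...U+201D
}

normalizeTypography(`Don't "quote" me`); // Don’t “quote” me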

A non-breaking space (U+00A0) looks like a regular space in any code editor, hence it’s difficult to spot - beware.

Side rant: Translation agencies often use TMs with fuzzy matching, but do not verify the result. Hence Foobar can be translated to Foobar., which feels very wrong if it’s the UI string of a button. You might need some tooling to check that, or make sure your translation agency verifies it. Or both. I wrote some tools to do a few simple checks like that on strings.xml, and I am planning to open-source them soon. Once done, I will update this article, so feel free to bookmark this URL and come back later if interested.
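
Until then, here is a sketch of one such check, assuming the source/translation pairs have already been parsed out of strings.xml: flag pairs whose trailing punctuation differs.

function trailingPunctuationMatches(source: string, translation: string): boolean {
  const trailing = (s: string) =>
    // drop the French pre-punctuation space so "Foobar :" compares equal to "Foobar:"
    (s.trim().replace(/\s+(?=[!?:;])/g, "").match(/[.!?:;…]*$/) ?? [""])[0];
  return trailing(source) === trailing(translation);
}

trailingPunctuationMatches("Foobar", "Foobar.");   // false -> flag for manual review
trailingPunctuationMatches("Foobar:", "Foobar :"); // true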

Different casing

Book now! and BOOK NOW! are different translation units.

Again, the TM can find it, but it’s not ideal to have a mismatch between the platforms.

Which one is better? Opinions are split. The non-uppercased version is theoretically better, because it’s trivial to uppercase when needed, whereas the reverse is not true: you cannot easily lowercase a string correctly (particularly in German, where nouns are always capitalized). However, there are reports that uppercasing at runtime on Android can lead to dropped frames while scrolling, so you need to decide for yourself which trade-off to make.
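
For illustration, the asymmetry in TypeScript:

// Uppercasing at runtime is trivial and locale-aware:
"Book now!".toLocaleUpperCase("en-US"); // "BOOK NOW!"
// The reverse is not: toLocaleLowerCase() cannot know which words (German nouns,
// brand names, ...) must keep their capitals - so store the non-uppercased form.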

Placeholders

As stated in BP1, you should use placeholders. You do? Good.

Strings with placeholders are the most difficult to translate, hence the most important to reuse once translated - but of course it isn’t straightforward.

Android uses placeholders like %d, %s, %1$s, %2$s; iOS uses %@, %1$@, %2$@; and various 3rd-party libraries offer yet other syntaxes.

It might be reasonable to find a set of libraries (or write your own if necessary) that allow the exact same syntax on all platforms. The syntax from Phrase looks nice. The problem, however, is that with named placeholders, each team might use a different placeholder name, and then we’re in an even worse position than before, unless the teams cooperate.

Another solution (if you use %d/%@ etc.) could be to normalize the placeholders at the time you export the strings for translation. For example: after exporting XLIFF from Xcode, search-and-replace %@ with %s; send the replaced XLIFF to the translators; when it comes back, reverse the replacements.
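
A sketch of that round trip, assuming only string placeholders are in play (extend the patterns if you also use %d, %f, etc.):

const iosToAndroid = (s: string) => s.replace(/%(\d+\$)?@/g, "%$1s"); // %@ / %1$@ -> %s / %1$s
const androidToIos = (s: string) => s.replace(/%(\d+\$)?s/g, "%$1@"); // ...and back after translation

iosToAndroid("Book a flight to %1$@"); // "Book a flight to %1$s"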

One more Android caveat: in strings.xml, you can wrap your placeholder in an XLIFF tag to facilitate the work of a (knowledgeable) translator: Book a flight to <xliff:g example="London">%s</xliff:g>; those XLIFF tags are discarded at compilation time. While in theory it sounds great, in practice it nearly guarantees that the translation memory won’t match this string with an existing string in its database (i.e. the string from iOS won’t be reused). Yet another trade-off to make.
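
One possible compromise (a sketch): keep the <xliff:g> hints in strings.xml, but strip them before pushing strings to the TM, so that cross-platform matching still works:

const stripXliffTags = (s: string) =>
  s.replace(/<xliff:g[^>]*>([^<]*)<\/xliff:g>/g, "$1");

stripXliffTags('Book a flight to <xliff:g example="London">%s</xliff:g>');
// -> "Book a flight to %s"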

Plurals

While it’s hard with placeholders, it’s even harder with plurals, because pluralization rules differ between languages, and of course each platform or library handles them differently.

Android’s implementation seems very good to me: one string per quantity indicator, grouped together. Some libraries like messageformat encode the quantity-selection logic into just one string, which in my opinion is neither the most readable nor the easiest for translation agencies, but they often work the way they do due to technical limitations of their ecosystems.
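
For contrast, here is what the single-string ICU style looks like (illustrative; check your library’s documentation for the exact syntax it accepts):

// All quantity variants in a single string; other languages add categories (few, many, ...):
const pattern = "{count, plural, one {# flight found} other {# flights found}}";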

Translation reuse for plurals is hard to achieve. It’s probably just easiest to translate those separately on each platform, or to make UI decisions that remove the need for pluralization altogether.

BTW: In our Android app, we moved all <plurals> to the very end of strings.xml so that they are not mixed with the regular strings.

Process-related impediments

One non-obvious pain point we only learned about at the last moment was the simple fact that the UI strings had diverged between the platforms.

While we were working on the Android app, quite a few of the UI strings in the iOS app had changed, but were not updated in the Android codebase. When it came to translating Android, it turned out that many strings were simply no longer present in the iOS app, and it took a while to find the appropriate strings in the iOS app and update the Android sources.

Corollary

Proper translation reuse is only possible with strong cooperation from all the teams. In practice, this is hard to achieve. Aim for the best, but assume that it will just not happen, and you’ll achieve something like 70-80% reuse.


Real-world HTTPS deployment pitfalls (part 2)

This is part 2 of the blog post. For part 1, see here, where I discuss: how not to overlook an expiring cert; how not to shoot yourself in the foot with HSTS; a case of forgotten “nowww”; why you should always send intermediate certificates; and TLS 1.2 migration considerations if you support Android KitKat.

Extraneous certificates

TL;DR: Save yourself some bandwidth, and improve initial render time, by not sending the root certificate from the server.

This is technically not a huge problem, but extraneous certificates bloat each new TLS connection, and are the equivalent of sending unoptimized JPEGs full of metadata.

Root cert sent by server

Since a TLS handshake is the very first thing happening when connecting to a domain over HTTPS, by sending unnecessary data at this stage, you’re increasing initial render time for all users (particularly the ones with poor connections and far away from your edge servers).

Every compliant browser will ignore a self-signed root cert at the end of the chain: if the browser has that root cert in its store, the TLS validation will succeed, and if it doesn’t, it will fail. Unlike intermediate certs, which for robustness should always be sent, there’s no point in sending the root cert.
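
To check what your server presents, openssl s_client -connect yourdomain.com:443 -showcerts prints the raw chain; below is a rough Node.js/TypeScript equivalent (a sketch; note the chain shown is the one the TLS client resolved, which may include a locally-known root):

import * as tls from "node:tls";

const socket = tls.connect(443, "example.com", { servername: "example.com" }, () => {
  let cert = socket.getPeerCertificate(true); // true -> include the issuer chain
  while (cert) {
    console.log(`${cert.subject.CN} (issued by ${cert.issuer.CN})`);
    if (cert.issuerCertificate === cert) {
      console.log("^ self-signed root - no need for the server to send this one");
      break;
    }
    cert = cert.issuerCertificate;
  }
  socket.end();
});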

Serving multiple certificates, one of which is wrong

This is a strange case of server misconfiguration, but I discovered it happening recently to one of our partners.

Initially it seemed like things worked correctly most of the time, but sometimes, regardless of the browser and operating system (Windows, iOS, Android), the TLS connection would fail.


When checked in SSL Labs, it showed two different certificates returned by the server:

Two certificates

But what does it mean that the server returns two certificates? How is this technically possible? Is the problem on the browser side or server side?

I was confused by that, so I reached out to Ivan Ristić from SSL Labs, who explained (thanks!) that SSL Labs makes multiple connections during the test and collects all the server certs it encounters. The server can only return one server certificate at a time; in addition, it may return a “bag of certs” containing intermediate certs, to help the client perform the validation of the chain of trust. However, it may return a different server certificate on each connection attempt - this is handy if you have lots of servers and load balancers: you don’t have to keep all those platforms in sync with the same server cert.

To inspect the issue more closely, I fired up Fiddler, configured it temporarily to ignore certificate errors, and put in a few lines of FiddlerScript to log the details of the certificate in the “comments” column, and to highlight the session in the session list if the observed cert was wrong.

static function onEvalCert(o: Object, e: ValidateServerCertificateEventArgs)
{
  try
  {
    // Parse the certificate served in the current session
    var X2: System.Security.Cryptography.X509Certificates.X509Certificate2 =
      new System.Security.Cryptography.X509Certificates.X509Certificate2(e.ServerCertificate);

    // Highlight the session in red when the wrong (Azure) cert was served
    if( X2.ToString().Contains("azure") ) {
      e.Session["ui-backcolor"] = "darkred";
      e.Session["ui-color"] = "white";
    }
    // Log the cert details in the "comments" column
    e.Session["ui-Comments"] = X2.ToString();
    e.Session.RefreshUI();
  }
  catch (ex)
  {
    FiddlerApplication.Log.LogFormat("Failed to evaluate certificate: {0}", ex.Message);
  }
}

// Register the handler at Fiddler startup
static function OnBoot() {
  FiddlerApplication.add_OnValidateServerCertificate(onEvalCert);
}

Then I opened the /favicon.ico URL of the server in the browser, and hit F5 a number of times.

After this test, I realized that the TLS connection in fact failed almost randomly, with a 50-50 chance, as you can see in the screenshot below:

Random certificates returned by Azure - Fiddler

It turned out that our partner’s Azure deployment was misconfigured, and indeed the server was sometimes wrongly sending Azure’s own server certificate instead of the one for the appropriate customer domain (unfortunately, I don’t know more details on why that was the case).


Serving a certificate signed only by a niche or very new root cert

This is something that most likely won’t affect you if you obtain your certificates from any major CA, but I decided to include it for completeness, as I learned about it while investigating the incomplete-chain issue mentioned earlier.

All browsers and operating systems have loads of root certs in their stores, but those stores are not equal. Depending on the OS, browser, device vendor, and even the country where the device is sold, there might be slight variations in cert store contents. (You might want to double-check this topic if you run a truly worldwide business and target Asian markets, for example.)

There’s also variation in time: a device with an operating system from 2009 most likely will not have a root cert issued in 2010!

Typically, leaf certs are short-lived (months/years) and root certs are long-lived (years/decades), but the issue still holds. If you need to support very old Windows or Android versions, double-check that your cert has been signed with an old enough cert (typically the CAs will do this for you - if they use a very new cert, they will also cross-sign with an older one).

How can you verify a cert’s details? The easiest way is to obtain its fingerprint (hash)…

Checking details of a root certificate

…and then use it in your favorite search engine, which will lead you to a Censys cert viewer:

Checking details of a root certificate

The example above is a popular Comodo cert issued in 2010. (Note this does not mean it was picked up by the browser vendors on day one after issuance.) This particular cert is known to be absent from Android < 5.1 and Firefox < 36. However, when Comodo signs your certs with that cert, it also cross-signs them (at least for now) with an older cert that is available on older devices, so generally you don’t have to worry about it.

Assuming your once-configured HTTPS will work forever

HTTPS is a moving target. Vulnerabilities in crypto algorithms and implementations are found each year as the research and hardware advance, and hence you will need to reconfigure your server periodically to avoid using deprecated crypto.

On the other hand, various CAs have been compromised in the past, and in response, browser vendors changed their treatment of certs issued by those CAs (either lowering or fully revoking trust); an upcoming version of Chrome will stop trusting certain old Symantec-issued certificates well before their original expiration dates. There have also been software bugs related to the handling of misbehaving CAs, which made HTTPS connections wrongly fail if the site was using a Symantec-issued cert.

Keep yourself up to date with the news (you may want to follow @sleevi_ on Twitter). Avoid using certificates from certificate authorities that have a long track record of misbehavior and of not following industry best practices - or at least be more vigilant in such cases.

Keep in mind that each HTTPS cert renewal is a potentially breaking change, and treat it as such - put QA in the loop for a quick sanity check.

Use Chrome Canary and Firefox Nightly to learn about breaking changes before they reach the wider audience.


Real-world HTTPS deployment pitfalls (part 1)

…or one year from life of a web/app developer, seeing broken HTTPS in the wild

All the things written below actually happened to my team, other teams at my company, or our external partners!


Who is this post for: web & app developers, devops, sysadmins; particularly if you’re working in a big company and roll out HTTPS on your own

Technical difficulty: low / intermediate (you know what HTTPS, a CA, and a certificate are)

TL;DR: if you have only 2 minutes now, go to SSL Labs and check your domain. If you see anything in red or orange, compare your results with major websites like Google, GitHub, Microsoft, Guardian etc. If the given problem is not present on any major site, but only on yours, it probably means you need to fix it NOW.

Introductory self-Q&A

“I’m just a developer and HTTPS is a devops thing, do I really need to read all of that?”

If you’re a dev in a company migrating to HTTPS soon, or your newly created project will be using HTTPS, this post is for you. Even if your website is already in production, you may learn a thing or two.

Your devops will put things in place, and stuff will mostly work, but they might not know all the arcane details of every web browser, nor all the requirements of your product - and then it will be you who has to debug stuff. So at least glance over the headlines and TL;DRs to familiarize yourself with the common issues.

“We have a staging environment, we’ll catch everything before deploying to prod!”

Only as long as 1) your staging env has exactly the same HTTPS cert and config as production (in big companies, this might not be the case, due to domain name differences and internal policies), and 2) you test all possible browsers and operating systems (and even that may not be enough!)

“Is it really that hard?”

If you don’t want to break things for the end users, it takes time. Have a look, for example, at this blog post from The Guardian.

I will try not to repeat too many obvious things in this guide. I assume you already know the basics of HTTPS from other sources. Instead, I will point out some things that might be easily overlooked, or not noticed at all if you’re (un)lucky - until you have an urgent issue in production.

Forgetting about certificate expiration

TL;DR: don’t rely only on a once-a-year email reminder

I’ll temporarily break my promise and start with a trivial yet costly mistake - similar to forgetting to renew a domain - which happens nonetheless. At least every few weeks, I stumble upon a high-profile page with an expired cert, or a near miss.

Obviously you should have a well-defined process, with an email reminder and someone responsible for taking action, but…

Apart from that, if you use Fiddler regularly, you can use a few lines of FiddlerScript to highlight HTTPS sessions that use soon-expiring certs (thanks to Eric Lawrence for publishing it), so that the expiration catches your attention before it’s too late. You may want to customize it to only highlight sessions touching particular domains of interest:

if (e.Session.hostname.EndsWith("mydomain.net") ||
    e.Session.hostname.EndsWith("mydomain.com")) {
  e.Session["ui-backcolor"] = "red";
  e.Session["ui-color"] = "white";
}
Certificates expiring in less than 30 days

Don’t wait until the last possible day to renew the certificate (in particular when it expires on a Sunday!), to avoid putting unnecessary stress on your team.

Missing intermediate certificates

TL;DR: run an SSL Labs check on your domain, and if you see “Extra download” in the certification path, go fix it now and come back when done!

First, a quick primer on how the CA (Certificate Authority) system works. Let’s use GitHub as an example.

In the image below you see a certificate chain for github.com. There are 3 certs in play:

Missing intermediate certificates
  1. Server cert (leaf): This is GitHub’s public cert that they use to secure the TLS session with the user’s browser.
  2. Intermediate cert of DigiCert, the Certificate Authority which signed (i.e. verified the authenticity of) GitHub’s cert (cert #1).
  3. Root cert (high-trust cert) of DigiCert, which signed the intermediate DigiCert cert (cert #2). This cert, like all root certs, is self-signed by definition.

Web browsers and operating systems typically ship with dozens and dozens of root certs embedded in their CA stores. When a server administrator wants to obtain a new certificate, the Certificate Authority will not, for operational reasons, sign it with a root cert directly; instead, it will sign it with an intermediate cert. Anything in the chain between the leaf cert (yourdomain.com) and the root cert is an intermediate cert. In a typical situation, there are one or two intermediate certs.

To verify the website’s cert, the browser needs the full certification chain, so it can verify the trust of each link. The leaf cert is always sent by the server, and the root certs are available in the browser - but where does the browser get the intermediate certificates from? There are two options:

  1. either server sends all the intermediate certs,
  2. or the browser needs to get them from somewhere.

The screenshot above shows GitHub properly sending the intermediate cert. But what happens in the other case? It’s implementation-specific and depends on the browser and platform:

  • if the browser happens to have an intermediate cert cached locally, because some other website that the user has visited served it, it will be reused;
  • if the intermediate cert is not cached, some browsers will fetch it, but some won’t; in particular, Android WebView and all versions of Firefox do not fetch any missing certs!

See an example of a misconfigured server below:

Missing intermediate certificates

This server’s cert can be verified via two certification paths (because the intermediate cert has been cross-signed by two different certs), but unfortunately neither path can be reliably resolved by the browser without an extra download.

If you see “Extra download” in SSL Labs, go fix it now! Your page might be randomly failing for many of your users (and it might be sporadic enough that you never get a report).


TLS 1.2 and old Androids

TL;DR: make sure you don’t lose tens of thousands of users when you roll out TLS 1.2

HTTPS deployment is a fine balance between security and backward compatibility. The gold standard in 2017 is TLS 1.2, but it’s not supported by old operating systems and browsers (Internet Explorer on Windows XP, and very old Androids).

Supporting outdated browsers means supporting insecure crypto and lowering security for everyone else. While most Android developers stopped supporting pre-KitKat devices a long time ago, there’s still a significant market share of KitKat (Android 4.4). According to the Android dashboard, as of July 2017, 17% of Android users still use KitKat. However, you should check the active-user stats of your own app in the Play Store console - they might be way different (the variation between countries is big).

The interesting thing about KitKat is that while it has the capability to support TLS 1.2, it’s switched off by default; some vendors enable it, but many do not. (There are even reports of Samsung devices with Android 5.0 not supporting TLS 1.2, which in theory should not happen.)

Due to PCI-DSS compliance, you might be forced to migrate your server to TLS 1.2, but you should double-check your user base statistics beforehand, to avoid recklessly cutting a big portion of the market off from your services.

In late 2016, the platform team in my company was very keen on migrating to TLS 1.2 (and removing support for TLS 1.1 and 1.0), but after some discussions we decided to postpone it. We reevaluated this summer, and we are finally planning to drop KitKat support and roll out TLS 1.2 in the coming weeks (end of 2017).

Using HSTS too aggressively

TL;DR: only roll out HSTS when everything else has been checked and is working well. Serve a small max-age initially and increase it as you gain confidence - or be prepared to serve max-age=0 in case of problems.

HSTS is a very useful security header which tells the browser to “remember” to always load all URLs from a given domain over HTTPS, even if http:// URLs are encountered, for a period of time described by the max-age field in the header value. In other words, it prevents the user from ever visiting the unsecured HTTP version of your pages.

Usually this is good - but obviously not when the HTTPS version is not working properly, and the user really wants to see the HTTP version.

Story time: We have an on-site JIRA instance in my company, put in place several years ago over HTTP; lately, work began on serving JIRA (and everything else) over HTTPS.

At some point, the implementation team enabled an http->https redirect and an HSTS header. For some reason though, it turned out that some parts of the page were not working over https (not obvious to diagnose for the end user), so the force-https config was disabled, and the recommendation from the JIRA support team was to use http URLs.

But, with HSTS, it is not that easy: if you had ever visited the https version before, then when visiting an http URL later, you were redirected back to the https page (the whole point of HSTS, after all!), and clearing standard browsing data didn’t help.

There are two solutions if your users get trapped by that problem:

  • You could serve a max-age=0 value for the HSTS header, which tells the browser to discard all HSTS data it has for the serving domain (but, as with any other HSTS header value, this is only taken into account when served over HTTPS).
  • Users may clear the HSTS cache of their browser. This is neither easy nor user-friendly; it’s typically hidden in browser internals, for example: chrome://net-internals/#hsts in Chrome.

Obviously, the best solution is to be cautious and only enable HSTS when everything else has been verified; also, serve a small max-age first (a few hours, a few days), and gradually increase it when no problems are found.
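
For example, in an Express-style middleware (hypothetical wiring; the header itself is standard):

import express from "express";

const app = express();
app.use((_req, res, next) => {
  // Step 1 of the rollout: a small max-age while you gain confidence (here: 1 day).
  // Later: "max-age=31536000; includeSubDomains" (1 year).
  // Emergency rollback: "max-age=0" makes browsers forget the policy for this domain.
  res.setHeader("Strict-Transport-Security", "max-age=86400");
  next();
});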


Forgetting about www/nowww when deploying HTTPS, CDN, or security proxy

TL;DR: do not forget about nowww. Make sure it works over http and redirects to https.

Typically, your website has a canonical URL of www.example.com or example.com, so you have 2 entry points. Most likely you serve a redirect from one to the other, to avoid duplicate content.

However, with HTTPS in the game, each of the two is accessible via either http or https, so you have 4 entry points in total.

If you forget about your nowww domain, you might end up in the following situation:

  1. User types “example.com” in URL bar
  2. The server responds with a redirect to “https://example.com
  3. Your HTTPS cert is not valid for nowww domain -> scary warning, user runs away.

Another thing that could happen is that traffic to your nowww domain won’t resolve at all, and the user will think your page is down.

The problem is so prevalent that modern browsers have some built-in magic to probe the www domain in case nowww isn’t working, but as with any error-recovery mechanism, it’s better not to rely on it.

Actions:

1) Make sure to choose one canonical entry point (say, https://www.example.com) and put redirects in place in your webserver’s config for the 3 remaining ones: http://example.com, http://www.example.com, and https://example.com should all redirect to https://www.example.com.

(You might want to write a simple bot that checks all of this each night and alerts you if a redirection stops working - see the sketch below. Subsequent configuration changes, perhaps done by external teams - not unheard of in corporate environments - might break a redirection without anyone noticing.)
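
A minimal sketch of such a bot in Node.js (scheduling and alerting left out; domain names are placeholders):

const canonical = "https://www.example.com/";
const entryPoints = [
  "http://example.com/",
  "http://www.example.com/",
  "https://example.com/",
];

for (const url of entryPoints) {
  const res = await fetch(url, { redirect: "manual" }); // inspect the redirect, don't follow it
  const location = res.headers.get("location");
  if (res.status < 301 || res.status > 308 || location !== canonical) {
    console.error(`Redirect broken for ${url}: got ${res.status} -> ${location}`);
  }
}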

2) Since example.com and www.example.com are different origins, a regular TLS cert with just one explicit domain name won’t work for both. You need a cert with SANs (Subject Alternative Names) matching both, for example *.example.com + example.com, or www.example.com + example.com.

3) SEO tip: Generate <link rel="canonical" href="https://..."> in the <head> of your HTML responses to make deduplication work easier for web crawlers.


Part 2

If you made it this far, you should check out part 2, where I discuss: sending extraneous certificates; randomly sending wrong certificates; sending a certificate signed by a not-widely-accepted root cert; and provide some links to helpful tools and external resources.
