Previously we discussed our reasoning and API for localization. Here we continue with the next topic: text extraction from the source code. That is, once we have marked the text we want translated, how do we extract that text from our codebase for translation? (Note: we used Acclaro for the actual translation, and have been very happy with them. Recommended.)
The difficulty is to extract all of the strings so that they can be sent to translators. Compounding this difficulty, we must also make sure to send the correct number of plural forms to the translators.
A string needs plural forms when its values object carries a __count entry, and __count does not have to appear in the string itself; it may appear only in the values object. This means that, at extraction time, we must understand these values objects and be able to reason about them: we must know their types.
There are a few bad ways to do this:
- build a parser by hand to deal with this (possible, but fragile, error-prone, and silently oblivious to things it does not support)
- search or grep
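To make the weakness of plain search concrete, here is a minimal sketch of a grep-style extractor. It assumes the localization call is named _s (the name used later in this article); the GrepExtractor class and its pattern are hypothetical and exist only for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical grep-style extractor: illustration only.
public static class GrepExtractor
{
    // Matches "_s(" followed by the first double-quoted literal.
    private static readonly Regex Call = new Regex(@"_s\(\s*""([^""]*)""");

    public static List<string> Extract(string source)
    {
        var found = new List<string>();
        foreach (Match match in Call.Matches(source))
        {
            // Group 1 is the text between the first pair of quotes.
            found.Add(match.Groups[1].Value);
        }
        return found;
    }
}
```

Run on source text containing _s("Hello " + "world") and _s(title, values), Extract returns only the fragment "Hello ". The concatenation is cut short, the second call is missed without any warning, and nothing is learned about the values object.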
Our first solution was the hand-built parser. It worked reasonably well, but there were known, unfixable bugs. We needed a solution that gave us full syntactic and semantic understanding of source code files: syntactic, to know the tree of the source code (function calls, argument lists, object access), and semantic, to see into object types and values.
C# and Razor extraction with Roslyn
The C# team at Microsoft has been working on a project called Roslyn. It provides programmatic access to the syntax tree and semantic model of C# files. That means we can hand it a C# file, search through it for certain kinds of nodes, and act on them. Roslyn comes with a SyntaxWalker class that walks over each node; you override the visit method for the kind of node you want. In our case, we want every invocation (function call) of our localization functions.
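A walker of this kind might look like the sketch below. It is written against the current Microsoft.CodeAnalysis.CSharp NuGet package, where the base class is named CSharpSyntaxWalker, so names may differ from the Roslyn preview described here. LocalizationCallWalker and Find are hypothetical names, and the method name to look for is passed in rather than assumed.

```csharp
using System.Collections.Generic;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical walker: collects every call to a method with a given simple name.
public sealed class LocalizationCallWalker : CSharpSyntaxWalker
{
    private readonly string _methodName;

    public List<InvocationExpressionSyntax> Calls { get; } =
        new List<InvocationExpressionSyntax>();

    public LocalizationCallWalker(string methodName)
    {
        _methodName = methodName;
    }

    public override void VisitInvocationExpression(InvocationExpressionSyntax node)
    {
        if (GetSimpleName(node.Expression) == _methodName)
        {
            Calls.Add(node);
        }
        // Keep descending so calls nested inside arguments are visited too.
        base.VisitInvocationExpression(node);
    }

    // Handles both Foo(...) and receiver.Foo(...); other callee shapes are ignored.
    private static string GetSimpleName(ExpressionSyntax callee)
    {
        switch (callee)
        {
            case IdentifierNameSyntax id:
                return id.Identifier.ValueText;
            case MemberAccessExpressionSyntax member:
                return member.Name.Identifier.ValueText;
            default:
                return null;
        }
    }

    // Convenience helper: parse source text and return the matching calls.
    public static List<InvocationExpressionSyntax> Find(string source, string methodName)
    {
        var walker = new LocalizationCallWalker(methodName);
        walker.Visit(CSharpSyntaxTree.ParseText(source).GetRoot());
        return walker.Calls;
    }
}
```

Each collected node exposes its arguments through node.ArgumentList.Arguments, which is where the string and the values object are found.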
Now we’re trying to create a new StringInfo (just a container class). We need to examine our node.ArgumentList and see if the first argument is a string. We use C#’s dynamic feature coupled with several versions of StringInfo.Create, one per kind of argument expression, to easily support different types of arguments.
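The dispatch through dynamic can be sketched as follows. This is an illustration rather than the real StringInfo: the FromArgument entry point, the Text property, and the particular expression kinds handled are assumptions, and the syntax types come from the current Microsoft.CodeAnalysis.CSharp package.

```csharp
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical container class; the real one carries more than the text.
public sealed class StringInfo
{
    public string Text { get; private set; }

    // Casting to dynamic defers overload resolution to run time, so the
    // Create overload matching the expression's concrete type is chosen.
    public static StringInfo FromArgument(ExpressionSyntax expression)
    {
        return Create((dynamic)expression);
    }

    // "some text"
    private static StringInfo Create(LiteralExpressionSyntax literal)
    {
        if (!literal.IsKind(SyntaxKind.StringLiteralExpression)) return null;
        return new StringInfo { Text = literal.Token.ValueText };
    }

    // "some " + "text"
    private static StringInfo Create(BinaryExpressionSyntax binary)
    {
        if (!binary.IsKind(SyntaxKind.AddExpression)) return null;
        StringInfo left = FromArgument(binary.Left);
        StringInfo right = FromArgument(binary.Right);
        if (left == null || right == null) return null;
        return new StringInfo { Text = left.Text + right.Text };
    }

    // ("some text")
    private static StringInfo Create(ParenthesizedExpressionSyntax parens)
    {
        return FromArgument(parens.Expression);
    }

    // Anything else (variables, method calls, ...) is not a compile-time string.
    private static StringInfo Create(ExpressionSyntax other)
    {
        return null;
    }
}
```

For an invocation node, the first argument would be passed in as StringInfo.FromArgument(node.ArgumentList.Arguments[0].Expression). A null result means the argument was not a string the extractor could read statically.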
OK, now we’ve got our string. Next we need to figure out if __count is part of the object. We’ve implemented various ways to determine HasCount(). Again, C#’s dynamic proves useful.
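A reduced sketch of the same idea for __count is shown below. It covers only two cases and is not the full set of variants: an anonymous object written inline is inspected syntactically, and any other expression is resolved through the semantic model. The CountDetector class and the Check overloads are hypothetical names, written against the current Microsoft.CodeAnalysis packages.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical helper: decides whether a values expression carries __count.
public static class CountDetector
{
    private const string CountKey = "__count";

    // The cast to dynamic picks the Check overload for the concrete node type.
    public static bool HasCount(ExpressionSyntax values, SemanticModel model)
    {
        return Check((dynamic)values, model);
    }

    // new { name, __count = n }: the member names are visible in the syntax.
    private static bool Check(AnonymousObjectCreationExpressionSyntax anon, SemanticModel model)
    {
        foreach (AnonymousObjectMemberDeclaratorSyntax member in anon.Initializers)
        {
            // Explicit name (__count = n) or a name inferred from a bare identifier.
            string name = member.NameEquals != null
                ? member.NameEquals.Name.Identifier.ValueText
                : (member.Expression as IdentifierNameSyntax)?.Identifier.ValueText;
            if (name == CountKey) return true;
        }
        return false;
    }

    // Variables, method calls, and so on: look the type up and search its members.
    private static bool Check(ExpressionSyntax other, SemanticModel model)
    {
        ITypeSymbol type = model.GetTypeInfo(other).Type;
        return type != null && type.GetMembers(CountKey).Length > 0;
    }
}
```

The second overload needs a real SemanticModel, which in the current API comes from a compilation, for example CSharpCompilation.Create(...) followed by GetSemanticModel(tree). The first overload never touches the model, which is why purely syntactic cases are cheap to handle.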
HasCount() tries all variants (that were in our code) that might contain __count somehow. The use of dynamic allows us to not care about the kind of expression: it is determined at runtime, and then the appropriate version of HasCount() is called. Here’s a gist of it all (it’s got some other features not listed here).
The above covers our C# controllers. For the Razor views, we cheat a bit: we invoke the ASP.NET compiler, which converts the views into C# files, and then run the same extraction logic on those. This is much easier than trying to write a full Razor parser.
_s, from which we can extract strings and objects. It’s a pretty hideous function and not worth pasting, but here it is.
Both of these extractors yield JSON, which is processed in Node.js and sent off to our translation service. We also have a dashboard (built with AngularJS) to view all translations, fix them, and reprocess all the files.
Translation and localization are hard. We spent months refactoring and extending our codebase to support it. But now, we can change, add, and translate strings with little effort, because our tools handle all the hard stuff. If you care about having quality translations (supporting dynamic text insertion and pluralization), whichever solution you use (resource files or gettext-style) must be able to fully understand your code so that it can generate high-quality output to send to translators.