Translation Solutions Ltd. provides XML translation services in most language pairs. This article explains how to prepare and translate XML as part of a training series for translators. However, if you landed here looking for a provider to translate your files, you can just check out our prices or ask for a free translation quote.
How to translate XML?
First of all, if you have not done so yet, please do read my How to translate HTML article. Not because I want more people to read that article (which I do, of course), but because I will assume that you already know everything covered in that previous article.
This article is intended to be a short and clear approach to the subject of XML translation and tagging. XML is increasingly popular. In fact, it's a real buzzword these days, but when it comes to "How do you translate that stuff?", your information sources tend to run dry. I hope you find this article useful to get started. This is not all there is to know about it.
Please note that I tend to over-simplify things. The goal of this document is not to allow you to create XML-based software, but only to understand enough basics to translate it without messing it up. This is why I skip a number of details that you would easily find in regular XML tutorials. You will find some interesting links on XML in the resources section of this website.
Of course, should you notice mistakes (not just simplification) in this tutorial, do let me know. I make this information available "as is". How/if you use it is entirely up to you.
XML stands for eXtensible Markup Language. You all remember (or should) what a markup language is. What's extensible? Extensible means that you can extend it (just kidding). It doesn't come with a set of fixed tags, like HTML. You can (you do) create your own tags, and reference them in a separate file called a DTD (Document Type Definition).
As such, XML is not a language, but rather a meta-language. That means it is a language that you can use to create new languages. After all, you create all your tags and you define them too, so the result is your language, XML-based.
Like HTML, an XML file is just a text file (typically Unicode), which means you can open it in a text editor and see what's inside. However, XML is not just a techie alternative to HTML. It's way more.
The biggest difference between HTML and XML is the purpose. In HTML the purpose is to format the text so that a browser (Internet Explorer or Firefox, for instance) knows how to display it. You specify what should be displayed, where and how.
For example: The <b>black</b> cat.
Then your browser shows: The black cat.
In XML, the purpose is to separate content from form and to categorize data in a way that can be processed by software. Not just a browser. For instance, XML can be used by DTP software, Office suites (OpenOffice.org for instance), Web applications... You name it.
This small example should help you understand how it works and what it can do:
In this example (which is legitimate XML code, by the way), you can easily understand what's what. This is a listing of fruits, including the reference number, type, quantity, price, and whether or not it is on sale.
It doesn't take much imagination to understand that this system could be used to hold the complete inventory of a department store. Then, when the machine identifies the barcode 001, the software could simply search for "<ref>001</ref>" and from there get all the information it needs. And it could also be used to print the company's catalog, or to generate web pages.
Once you have information neatly stored in between tags, you can write any kind of software to use it as you need.
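To make the barcode lookup described above more concrete, here is a minimal sketch in Python, using only the standard library. The sample data reuses the element names and values from this article; the <inventory> wrapper element and the find_by_ref function are my own assumptions, added only so that the snippet is a single complete document that can be searched.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: the <inventory> wrapper is hypothetical; the other
# element names and values are the ones used in this article.
INVENTORY = """<inventory>
  <fruits>
    <ref>001</ref>
    <type>Apple</type>
    <quantity>15</quantity>
    <price>$0.25</price>
    <OnSale>No</OnSale>
  </fruits>
  <fruits>
    <ref>002</ref>
    <type>Pear</type>
    <quantity>22</quantity>
    <price>$0.30</price>
    <OnSale>No</OnSale>
  </fruits>
</inventory>"""


def find_by_ref(xml_text, ref):
    """Return the data of the entry whose <ref> matches, as a dict, or None."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("fruits"):
        if entry.findtext("ref") == ref:
            return {child.tag: child.text for child in entry}
    return None


item = find_by_ref(INVENTORY, "001")
print(item["type"], item["quantity"], item["price"])
```

A real inventory program would of course do more, but the principle is the same: the tags tell the software where each piece of data is.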
I just created (made-up) the following "elements": "fruits", "ref", "type", "quantity", "price" and "OnSale". More could be created as needed. That's the beauty of XML. You simply create the elements you need.
"Element" refers to a tag's name (used in both the opening and closing tag). In XML, every tag is opened and closed, except for standalone tags clearly marked with a "/" at the end, like
So the purpose of XML is different from that of HTML. In HTML, tags give formatting INFORMATION. With XML, tags identify DATA.
Wait a minute! "Formatting information" is "data" too, right?
Yes, and actually, HTML is a close cousin of XML: both descend from SGML, and HTML has more permissive rules (we will come to that later).
You may have noted that in HTML, there are attributes, like "src" in the image tag:
<img src="/my image.jpg">
In XML, you can also have attributes and yes those are also made-up. Our example before could be rewritten:
<fruit ref="001" type="Apple" quantity="15" price="$0.25" OnSale="No"/>
<fruit ref="002" type="Pear" quantity="22" price="$0.30" OnSale="No"/>
Which would be correct, although a little extreme.
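For a program, the attribute form holds exactly the same data; it is just read differently. A small sketch in Python (standard library only), with the two lines above wrapped in a hypothetical <inventory> element so that they form one document:

```python
import xml.etree.ElementTree as ET

# The two lines from the article, wrapped in a hypothetical <inventory> root
# so that the snippet is a single well-formed document.
ATTRIBUTE_FORM = """<inventory>
<fruit ref="001" type="Apple" quantity="15" price="$0.25" OnSale="No"/>
<fruit ref="002" type="Pear" quantity="22" price="$0.30" OnSale="No"/>
</inventory>"""

root = ET.fromstring(ATTRIBUTE_FORM)

# Each standalone <fruit/> tag becomes a dictionary of attribute names and values.
fruits = [dict(fruit.attrib) for fruit in root.findall("fruit")]

for fruit in fruits:
    print(fruit["ref"], fruit["type"])
```

What matters for the translator: in this form, text that may need translating ("Apple", "Pear") sits inside the tags themselves, as attribute values.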
Tags for HTML have been already defined and agreed upon, and are used by a specific type of software called a browser.
And here we come to the next crucial item that comes with an XML document: The DTD, Document Type Definition. Because you create the tags you want, you have to define them somewhere. It is however possible to have a standalone XML document.
HTML also has a DTD, which is integrated in the browser – and which is why the browser knows that <b> means bold. That DTD is already agreed upon, so you simply write HTML using tags that are defined in advance.
In XML, elements can be defined in either a DTD or a Schema.
Summary: An XML document is a document used to store data in a way that programs can use. Note that it only contains data. By itself, it doesn't do anything; it is not a programming language.
It is a tool of choice for Web applications, for data exchange (for instance, TMX, the translation memory exchange format, is in XML), for authoring... and for a whole bunch of documents that will have to be translated.
What's the buzz?
Do you wonder what's so funky about XML? What's so special about it? After all it is just a text document with data in it. It stores data without any formatting, so what's the big deal?
That's the big deal. Because it stores data in a way that is easily accessible and without formatting, the data can be stored only once, and used to produce and update a zillion different documents.
Take the example from before. With the same data, the department store can create catalogs, handle its inventory, fill up its web site, calculate stock value, generate stock alerts, print brochures... all from the same file. Because the data are not formatted, they can be extracted and rearranged in any way necessary.
If the content is not separated from the form, every update requires updating every document separately (the good ol' way, with outdated copies everywhere). With XML, you simply update the data in one file and propagate it. All the data is the same at all levels, from the inventory to the online catalog, because it all comes from the same file.
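As a rough illustration of "one file, many documents", the sketch below (Python, standard library only) produces a stock value and a tiny HTML catalog from the same data. The <inventory> wrapper and the two function names are assumptions made for this example; they are not part of any real system.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Same data as in the attribute example, inside a hypothetical <inventory> root.
DATA = """<inventory>
<fruit ref="001" type="Apple" quantity="15" price="$0.25" OnSale="No"/>
<fruit ref="002" type="Pear" quantity="22" price="$0.30" OnSale="No"/>
</inventory>"""

root = ET.fromstring(DATA)


def stock_value(root):
    """Total value of the stock: quantity times unit price, summed."""
    return sum(
        int(f.get("quantity")) * Decimal(f.get("price").lstrip("$"))
        for f in root.findall("fruit")
    )


def catalog_html(root):
    """A tiny HTML catalog generated from the same data."""
    rows = [
        f"<li>{f.get('type')}: {f.get('price')}</li>"
        for f in root.findall("fruit")
    ]
    return "<ul>" + "".join(rows) + "</ul>"


print(stock_value(root))
print(catalog_html(root))
```

Change a price in the data, and both outputs follow. That is all "separating content from form" really means.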
In order not to mess up an XML document, you must know what's acceptable and what's not. In short:
XML documents should start with an XML declaration, like
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
As far as you are concerned, the only part you may have to change or add there is the encoding, depending on your target language. (standalone="yes" means the document does not depend on an external DTD.)
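Why the encoding matters: the declaration has to match the bytes actually stored in the file. Below is a minimal sketch in Python (standard library, Python 3.8 or later for the xml_declaration argument). The one-element document and the French word are invented for the example.

```python
import xml.etree.ElementTree as ET

# "Pêche" (French for peach) stored as ISO-8859-1 bytes: ê is the single byte 0xEA.
source_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><type>P\xeache</type>'

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(source_bytes)

# Re-serialize as UTF-8; xml_declaration=True writes a declaration that matches.
utf8_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(root.text)
print(utf8_bytes.decode("utf-8"))
```

If your target language uses characters that the declared encoding cannot represent (Japanese in an iso-8859-1 file, for instance), both the file encoding and the declaration have to change together.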
CASE SENSITIVITY: XML IS CASE SENSITIVE!!! That means that <fruit> and <Fruit> are 2 different tags.
WHITE SPACES (spaces, tabs, line breaks) are passed on as such to the application. Then, it depends on how the application handles white spaces. So when you translate, do not add or remove white spaces between tags. In HTML, "<b>the happy </b>dog" will display the same way as "<b>the happy</b> dog". In XML, you don't know.
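Here is a small sketch of what "passed on as such" means, in Python with the standard library parser. The two sample strings are my own: they differ only in whether the space sits before or after the closing </b> tag, and the <p> wrapper is added so that each one is a complete document.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<p><b>the happy </b>dog</p>")
b = ET.fromstring("<p><b>the happy</b> dog</p>")

# .text is the text inside <b>; .tail is the text that follows its closing tag.
print(repr(a[0].text), repr(a[0].tail))
print(repr(b[0].text), repr(b[0].tail))
```

The parser hands over two different sets of strings. Whether the final application treats them the same way is something you cannot know from the XML file alone.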
Every tag has a closing tag or is a standalone tag (with / at the end). If during translation you remove a closing tag, the whole file will be rejected. Be careful. Every tag that's opened must later be closed.
All tags must be properly nested.
<fruit><type>apple</type></fruit> is fine, but you can't have <fruit><type>apple</fruit></type>.
All attributes must have quotes (unlike HTML, which accepts values both with and without quotes).
<img src=/image.jpg> is OK in HTML, not in XML. Use <img src="/image.jpg"/>
The apostrophe may be used if the value contains a double-quote character, or the other way around. If you need isolated quotes in your translation, you can use &quot;. Use only straight quotes for attributes, NO curly quotes.
<img src="/image.jpg"/> works.
<img src=“/image.jpg”/> doesn't.
<img src=/image.jpg/> doesn't.
There must be no isolated start character in the document (no < or &), so if you have to write < or & in the text, use &lt; for < and &amp; for &. That means that in your translation, you must NOT type a bare < or &. Ever.
The entities &lt;, &gt;, &amp;, &apos; and &quot; are available by default in files without DTD. (This means you can use them in your translation without having to define them.) If there is a DTD, you have to check that they are declared before you can use them.
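If you process translated text with a script, you do not have to replace these characters by hand. A minimal sketch using Python's standard xml.sax.saxutils module; the sample sentence is invented:

```python
from xml.sax.saxutils import escape, unescape

translated = 'Fish & chips < 5 "today"'

# escape() always handles &, < and >; the extra mapping also covers double
# quotes, which matters when the text ends up inside an attribute value.
safe_text = escape(translated)
safe_attr = escape(translated, {'"': "&quot;"})

# unescape() reverses the operation, given the same extra mapping inverted.
round_trip = unescape(safe_attr, {"&quot;": '"'})

print(safe_text)
print(safe_attr)
print(round_trip == translated)
```

Most translation tools do this for you when they export the file, so check what your tool does before escaping anything yourself, or you may end up escaping twice.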
An XML file that conforms to these rules is called "well formed". You will also hear about a "valid XML file". A valid XML file is a file that is well formed AND completely follows its DTD. (Meaning all elements used are properly defined in the DTD.)
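A quick way to find out whether a translated file is still well formed is to run it through any XML parser. The sketch below uses Python's standard library parser; the is_well_formed function and the sample strings are my own. Note that this parser checks well-formedness only. It does not validate the file against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


samples = {
    "correct": "<fruit><type>apple</type></fruit>",
    "bad nesting": "<fruit><type>apple</fruit></type>",
    "missing closing tag": "<fruit><type>apple</fruit>",
    "case mismatch": "<fruit>apple</Fruit>",
    "unquoted attribute": "<img src=/image.jpg/>",
    "bare ampersand": "<type>apples & pears</type>",
}

for label, text in samples.items():
    ok, message = is_well_formed(text)
    print(label, "->", "OK" if ok else "rejected: " + message)
```

Only the first sample passes. Each of the others breaks one of the rules listed above, and the parser refuses the whole document, not just the faulty tag.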
All right, that's about it for the rules. What it really comes down to is:
DO NOT MESS AROUND WITH TAGS. Keep them as they are, and it will be mostly all right. (You may have to change encoding or the position of tags, like in HTML translation)
Things you should understand:
What needs to be translated?
Well, text of course!
An HTML file typically contains text and tags, so you protect the tags and you translate the text.
An XML file can be a little trickier, in that it can contain text, tags, variables, reference numbers... so while you still need to translate only the text, there are a few more things you should beware of. The elements will determine what needs to be translated and what does not.
Look above. What should be translated? Apple and Pear? Probably. No? Maybe. The DTD will tell you which elements and attributes contain translatable text or not.
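Once you know which elements hold translatable text, pulling that text out is simple. A minimal sketch in Python (standard library only). The decision that only <type> is translatable is an assumption made for this example (normally the DTD or the customer tells you), and the <inventory> wrapper is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical decision, normally taken from the DTD or the customer:
# only <type> holds text for the translator.
TRANSLATABLE = {"type"}

DOC = """<inventory>
  <fruits><ref>001</ref><type>Apple</type><quantity>15</quantity></fruits>
  <fruits><ref>002</ref><type>Pear</type><quantity>22</quantity></fruits>
</inventory>"""


def translatable_strings(xml_text, translatable):
    """List the text of every element whose name is marked translatable."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter()
        if el.tag in translatable and el.text and el.text.strip()
    ]


print(translatable_strings(DOC, TRANSLATABLE))
```

Reference numbers and quantities never reach the translator this way, which is exactly the point.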
Context is hardly ever a problem in HTML. The file you are translating can be displayed in your browser, and that's all of it. That's exactly what the final user will see.
When it comes to XML however, things are different, because the data can be used and reused in different documents, and often you just don't know where or how a text entry will be used. Often text entries do not follow a logical order, or worse, don't have any context at all. If this is the case, browse through the other files to see if you can make sense of it.
Fortunately, in most cases, you can figure it out right, but there is no perfect solution, when all you have is an XML file.
To display XML (in Websites, for instance), another file is used, written in a language called XSL (eXtensible Stylesheet Language), which integrates the data from the XML file into a formatted structure, like an HTML or an RTF file. The result can be seen in most recent browsers.
When in doubt, ask your customer. Not asking may help you look good ...but only until complaints come in.
How do you read a DTD?
Reminder: The Document Type Definition (DTD) is a file that defines the elements (tags) used in the XML file.
First, to read a DTD, you need to have the DTD. Don't laugh, often, you have to ask for it, and sometimes, you won't get it because it doesn't exist, and even when it exists, it is often not complete. Anyway, in the best of cases, you have a DTD and it is complete.
The DTD can also be internal (inside the XML file, at the beginning), although in my experience this is getting rare.
A DTD is a text file. If there is no program assigned to open the DTD file, use your favorite text editor. Yes, all this high-tech-sounding stuff is in text format.
There is no translatable text in the DTD (comments could be translated, provided they exist and you have been specifically asked to translate them, which is very unlikely).
What you are interested in, in a DTD, is to locate the elements, their attributes, and to find out which ones can contain text.
A DTD file defines not only elements and attributes, but also the way they relate to each other. Remember my example from before? Well let's write a DTD for it.
<!ELEMENT fruits (ref, type, quantity, price, OnSale)>
<!ELEMENT ref (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT OnSale (#PCDATA)>
<!ELEMENT creates an element (pair of tags). You wouldn't have guessed, right? So the first line reads as follows:
"There is an element called fruits, and it contains the following elements: ref, type, quantity, price, OnSale". #PCDATA indicates that whatever is inside theses tags is content (Content! The stuff you need to translate)
The second line says that the "ref" element contains character data (text), and so on. Now a little more complex. Can you guess what that is for:
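<!ELEMENT part (name, image, manufacturer)>
<!ATTLIST part partnumber CDATA #REQUIRED
InStock (YES | NO) #REQUIRED>
<!ELEMENT image EMPTY>
<!ATTLIST image url CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT manufacturer (#PCDATA)>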
The first element contains 3 elements (name, image & manufacturer) and has 2 attributes, partnumber & InStock. InStock can take the value YES or the value NO. The image element is empty (which means it is a standalone tag), but has an attribute called url.
As you probably guessed, this XML definition would be used for a listing of parts of some kind, like:
As you can see, this thing is fairly easy to understand. You define a tag, you define what attributes are possible and what tag or information it can contain.
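To see why these definitions matter when you translate, here is a rough sketch in Python (standard library only). The <part> entry is hypothetical: the element and attribute names follow the description above, but the values are invented. The check_part function is not a DTD validator; it only tests two of the constraints by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the element and attribute names follow the description
# above, the values are invented for illustration.
PART = """<part partnumber="A-100" InStock="YES">
  <name>Gasket</name>
  <image url="/images/gasket.jpg"/>
  <manufacturer>Example Corp</manufacturer>
</part>"""


def check_part(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    part = ET.fromstring(xml_text)
    problems = []
    # InStock is an enumerated attribute: only the listed values are allowed.
    if part.get("InStock") not in ("YES", "NO"):
        problems.append("InStock must be YES or NO")
    # image is declared EMPTY: no text and no child elements inside it.
    image = part.find("image")
    if image is not None and (len(image) or (image.text or "").strip()):
        problems.append("image must be empty")
    return problems


print(check_part(PART))
print(check_part(PART.replace('"YES"', '"OUI"')))
```

The second call shows the classic accident: "YES" looks like text, but it is one of the values listed in the DTD, so translating it makes the file invalid.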
Preparing the file for translation
4 types of tags
For translation purposes, there are 4 types of tags. You will probably remember the 2 types of tags I presented earlier for HTML: External and Internal. In case you don't remember, External tags are tags that are never included in the translation. They are never part of a segment, like <p>, for instance. Internal tags, on the other hand, may be inside a segment, like a <b>, and may need to be moved around.
XML has 2 more types of tags:
External group tags are tags which, like external tags, are never in the middle of a segment, but also do not contain any translatable text. For instance, <ref>001</ref> contains nothing that you can translate, and it will never contain anything to translate. That's why we call it an external group tag.
Tags with translatable attributes are tags with attributes that could contain translatable text. Actually, there are a few in HTML too, like:
<img src="/image" alt="This text should be translated">
Understanding what these tags are, it is then easy to understand how the file(s) should be prepared.
External tags should be protected with the external tag style.
External group tags should be protected with the external tag style (both tags and everything in between).
Internal tags should be given the internal tag style.
Tags with translatable attributes are a tad trickier. The attribute part that can be translated must be left as is, but the tag itself can be internal or external, and needs to be protected. To figure this out, just locate one such tag in the files and you will see if it is internal (could be inside a sentence) or external (would never cut a sentence in 2).
Take: <img src="/image" alt="This text should be translated"/>.
This is a standalone tag that could be part of a sentence (internal) but contains translatable text. It should be prepared as follows, with everything protected as a tag except the text of the alt attribute:
<img src="/image" alt="This text should be translated"/>
That way, you can edit the text without altering the tag, which is paramount in XML.
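The split between protected and editable parts can be sketched as follows, in Python. This is only an illustration of the idea, not what any particular translation tool does: the split_tag function is my own, the list of translatable attributes is an assumption, and the regular expression only handles attribute values in straight double quotes.

```python
import re

# Attributes assumed to hold translatable text; "alt" is the one used above.
TRANSLATABLE_ATTRS = ("alt",)

# Group 2 is the attribute value, i.e. the only part the translator may edit.
ATTR_VALUE = re.compile(
    r'(\b(?:%s)=")([^"]*)(")' % "|".join(TRANSLATABLE_ATTRS)
)


def split_tag(tag):
    """Split a tag into ("protected", ...) and ("text", ...) pieces, in order."""
    pieces = []
    position = 0
    for match in ATTR_VALUE.finditer(tag):
        value_start, value_end = match.span(2)
        pieces.append(("protected", tag[position:value_start]))
        pieces.append(("text", match.group(2)))
        position = value_end
    pieces.append(("protected", tag[position:]))
    return pieces


tag = '<img src="/image" alt="This text should be translated"/>'
for kind, piece in split_tag(tag):
    print(kind, repr(piece))
```

Putting the pieces back together in order gives the original tag, so nothing is lost: only the "text" piece is open for editing.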
I think that, based on the above information, you should be able to handle XML translation properly in most cases. However, do not make the mistake of thinking that the information in this tutorial is all you will ever have to know to do XML translation. I think this tutorial can help you get started, but it's only a beginner's tutorial. I will write another article to cover some practical solutions to tag XML using free software.
There are a number of XML tutorials listed in the resource pages. You may want to have a look there.
Remember that if you feel you cannot handle a job, you had better tell your client so.