Much talk about Machine Translation

We were recently contacted by a software publisher asking us to consider using Machine Translation (MT) to translate their knowledge base. Given the volumes involved, they were looking for a way to lower their costs.

Their hypothesis for using MT was based on the following:

  • Knowledge bases, unlike software GUI, documentation or help, do not need to have a high level of quality
  • Since they contain massive amounts of information, it is impossible for humans to translate them fast enough to keep pace with their rapid expansion
  • Some entries are never or rarely used by users
  • The entries themselves are authored by many support people and not by professional tech pub writers, so the grammatical quality of the source content is already at an inferior level.

Bottom line, since inferior translation quality is acceptable, perhaps use of MT is justified.

We’ve had long ago experimented with MT and concluded that its benefits do not save our professional translators time. Reworking the output of MT is more time consuming than translating from scratch. But given the recent hype about new methods and technologies, we decided to put their hypothesis to the test.

We randomly selected sentences from their knowledge base and gave off-the-shelf MT solutions a try. We found many problems, mainly inaccurate translations and incorrect terminology, particularly where the source was not in perfect shape (see the last bullet above). We will limit the discussion here to a simple example to illustrate our point. The source English sentence that we will use is: The operation of saving the assembly as a multi-body part was a point in time event.

With so much press about the new Statistical Machine Translation (SMT) technology from the University of Southern California, and its proclaimed higher fidelity than rule-based translation output, we decided to give it a try. SMT depends on vast existing translation databases (many millions of words), so we opted to go to the leader in serving content, Google. After all, they are the best at indexing the World Wide Web, and if anyone can benefit from the vast body of existing translations on the internet, it will be them.

Google’s translation of our test sentence into French was the following: Le fonctionnement de l’économie d’un assemblage de plusieurs partie du corps a été un moment manifestation.

Despite other problems, the key term that I want you to focus on is saving. It was translated by Google as economizing, giving it a financial tone.

With Systran, the engine used by many free online translation services like AltaVista’s Babel Fish, the translation into French was: L’opération de sauver l’assemblée comme pièce de multi-corps était un point dans l’événement de temps.

The translation from Microsoft’s beta Translate site was very similar to Systran’s: L’opération de sauver l’assemblée comme pièce de multi-corps était un événement de moment.

But both Systran and Microsoft interpreted saving as rescuing!

It took a human being to realize that the text is intended for a software application and to correctly infer that saving is intended for registering the file (enregistrer) of the assembly and not for rescuing or economizing it!

This was not a surprise to us. When you deal with translations every day, hour and minute, you know that there is no real substitute today for human translation.

Some say that despite all this, the gist of the meaning is still maintained and the international user can benefit from MT. It is better than not having any translation at all. Perhaps. But when you are a successful and reputable professional company and your brand and image are on the line, are you willing to risk it all without looking at better options?

Your goal should be to seek quality and accuracy in everything that you publish, whether it is product, website, support, PR, training, legal, financial or knowledge base related. So how can you balance brand, image and cost trade-offs when it comes to translating bulk content?

Simple: Divide, Prioritize and Conquer! Stay tuned for the next blog.

8 Comments

  1. Good point, but let’s agree that the English sentence is not very understandable in the original either. “The operation of saving the assembly” is the same as saying “Saving the assembly”. Just by editing that, the translation became “Enregistrement de l’assemblée comme un organe pluridisciplinaire partie a été un point dans le temps.” The “multi-body part” is also a tricky construction, and since I don’t know the context, I didn’t try to say it with other words. In short, if you improve the original – a sound investment if you are going to machine translate the text into 15 languages – you can save a lot of time. If you integrate tools like acrotext into the process, you can improve the output of MT by improving the quality of the source text. This is the famous GIGO issue.

  2. Appreciate your clear examples regarding AT. It is important for LSPs and translators to stay continually aware so they don’t miss the tipping point where it becomes a benefit, but we aren’t there yet. Thanks for your helpful analysis.

  3. There are a few points to be added as a testimony to the fact that machine translation is not a very good option even for projects that don’t require high translation quality:

    I have recently managed a project for translating a fairly big KB into two languages in a short timeline. Many teams were assigned and a lot of synchronization took place to make sure the terms are consistent.
    One side of this project was managing the numerous questions from the different teams. If we ignore the questions aimed at finding the best translation for a certain term (since quality is not the most important factor in these projects), we are still left with a large number of questions seeking clarification of the contextual meaning of a certain term, the technical definition of other terms, and very often the meaning of acronyms. Since we have to either use an acronym in the target language or keep the English one with an explanation in brackets in the target language, I can’t see how this issue would be handled by automatic translation engines (keeping in mind that the number of occurrences of such terms is high).
    Another constraint of the project was that the fields for questions and answers had a limited length (character count), for example 255 chars for the question. As we know, for some target languages the word and character counts grow considerably compared to the English source. During the lifecycle of the project, and just before delivery, we had a QA phase that identified all the segments that were longer than the limit, and the teams started rephrasing the sentences to fit. There were so many comments from different teams that a lot of effort was needed to keep the shortened translations meaningful. I am just wondering what an automatic translation would offer in this case. Even if we put a size limit constraint and enforce it on the engine, my guess is that it would crop the text when it reaches the size limit with no attention to how meaningful the resulting text would be to the end user.

  4. If you take a look at the Microsoft Knowledge Base (and actually several other IT companies) they are doing exactly this – building highly customized and focused MT engines (not using a free online engine) to translate KB data. MSR executives have made public presentations at TAUS and Loc World that show that millions of people actually use the raw MT and find it useful. Interestingly, for FR, ES, IT and even DE the customer satisfaction ratings of raw MT content are higher than they are for the source English content. So while it may not meet your linguistic quality criteria, it meets the needs of customers who want information immediately, are willing to wade through “crappy MT”, and prefer it to trying to read the same information in English.

    The sheer volume and rate at which new information is created necessitates automation, no matter how. I am told that new KB content is created every few minutes and it is often for immediate and current customer problems.

    The need for this immediacy is not likely to go away, and these companies will need to solve the problem with or without their LSP partners.

  5. > Their [your customer’s] hypothesis for using MT was based on the following… it is impossible for humans to translate them [knowledge bases] fast enough to meet their rapid expansions

    Your customer tells you that humans cannot translate sufficiently quickly. How do you deal with this problem?

    > … the grammatical quality of the source content is already at an inferior level.

    Your customer knows that the quality of the source text is not good. Therefore, why worry about the quality of the translation?

  6. In translating from Russian to English I regularly use Google Translate to find the correct Roman alphabet spelling of foreign names, something it excels at. But I’m amazed at how often the word “not” is inserted or omitted in the translation. A peril of Statistical Machine Translation, I guess.

  7. I do not intend to comment on the central issue of this post, which has been brilliantly argued for and against by my above savvy colleagues, but I do want to call your attention to the following sentence:

    ‘We’ve had long ago experimented with MT and concluded that its benefits do not save our professional translators time.’

    This construction struck me as a bit awkward, as if it had been changed as an afterthought and the unwanted part was not deleted. I would go for

    ‘We have experimented with MT a long time ago, and concluded that its benefits were not saving any time for our professional translators.’ or, alternatively,

    ‘We had long ago experimented with MT and concluded that its benefits did not save any time for our professional translators.’

    IMHO. Of course, I stand to be corrected if you feel that I am wrong.

  8. Occasionally Google Translate can come up with the right phrase – but rarely, and it depends on the language. For example, “garde des sceaux” was translated as “guarding a bucket!” Clearly there is a misspelling of “sceau” somewhere on the internet. Its German to English translations, for instance, are total gibberish; it cannot understand German word order.
