Translation Database Fuzzy Matches and Word Count Demystified

Most content publishers, in software or any other field, update their products, manuals, websites, and marketing literature yearly or more often. They update their source language files and engage their localization vendor or staff to update all the supported target languages.

Changes to the source files take the shape of additions (new text), deletions (removal of obsolete text), or edits (modifications to existing text).

The new text requires new translations for each target language. The deleted text is disregarded. The edited or modified text will require updating in all target languages.

When a top-down localization process is applied and a translation database (translation memory or TM) is in use, the database's search engine compares each segment of the updated source (a complete phrase or sentence) against the segments already stored and detects what has changed. Each segment falls into one of the four categories below:

Fig. 1: Example of a fuzzy matching analysis

  1. No match or new text: Finds no usable match in the database (anything under 50%) and requires full translation.
  2. Repeat or unchanged text: Generates a 100% match from the database, not requiring any changes.
  3. Edited or modified text: Results in a “fuzzy” match. The new segment is 50% to 99% similar to a segment already stored in the database. Anything under 50% is considered no match.
  4. Deleted text: Produces no impact on the translation update effort, since the text no longer exists.

Translation databases store segments, usually sentences, as source and target language pairs. The search engine analyzes the newly released source text one segment at a time, comparing each segment against what is already in the database.

  • If a 100% match is found, then it is considered an exact match.
  • If the search engine finds a similar but not an exact match, it allocates a fuzzy match percentage to it, anywhere from 50% to 99%.
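
The thresholds above can be sketched in a few lines of Python. This is only an illustration of the 100% / 50-99% / under-50% rule described in this article; the function name `classify_match` is hypothetical and does not come from any particular translation memory tool.

```python
def classify_match(percent: float) -> str:
    """Map a match percentage (0-100) to the categories used in this article."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent == 100:
        return "exact"      # repeat or unchanged text
    if percent >= 50:
        return "fuzzy"      # edited or modified text
    return "no match"       # treated as new text, full translation


for p in (100, 90, 50, 49):
    print(p, classify_match(p))
```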

For instance, a ten-word sentence that differs by just one word from a sentence stored in the database will result in a 90% fuzzy match. If it has only five words in common with the stored sentence, then the fuzzy match is 50%.
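
To make the arithmetic concrete, here is a minimal word-level sketch. It assumes that the match percentage is simply the number of words shared in order, divided by the length of the longer sentence. Commercial translation memory tools use their own algorithms, which may work quite differently, so treat `fuzzy_match_percent` as a hypothetical helper rather than as any tool's actual method.

```python
from difflib import SequenceMatcher


def fuzzy_match_percent(new_segment: str, stored_segment: str) -> float:
    """Percentage of words the two segments share, in order (0-100)."""
    new_words = new_segment.lower().split()
    stored_words = stored_segment.lower().split()
    if not new_words and not stored_words:
        return 100.0
    # Compare word lists rather than characters; autojunk is disabled so
    # that frequent words are never skipped.
    matcher = SequenceMatcher(None, new_words, stored_words, autojunk=False)
    shared = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * shared / max(len(new_words), len(stored_words))


stored = "Click the Save button to store your changes on disk"
updated = "Click the Save button to store your settings on disk"
print(fuzzy_match_percent(updated, stored))  # one word in ten differs: 90.0
```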

By calculating the fuzzy match of each sentence, one can approximate the effort of translation needed to perform the full update in any target language.

At GlobalVision, we apply weights to strings to calculate the “equivalent” new word count to translate. For instance, a ten-word sentence with just one word changed since the last release counts as two new words to translate (a 20% weight). A ten-word sentence with four or more words changed counts as ten new words (a 100% weight). Other percentages are applied in between.
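
The weighting can be sketched as a small lookup table. Only two points come from this article: a 90% match is weighted at 20%, and a match of 60% or lower is weighted at 100%. The zero weight for exact matches and the two intermediate bands are assumptions added purely for illustration, and the names `WEIGHT_BANDS`, `weight_for_match`, and `equivalent_word_count` are hypothetical.

```python
# (minimum match %, weight) bands, checked from top to bottom.
WEIGHT_BANDS = [
    (100, 0.0),  # assumption: an exact match needs no changes
    (90, 0.2),   # from the article: one word in ten changed
    (80, 0.5),   # hypothetical intermediate weight
    (70, 0.8),   # hypothetical intermediate weight
    (0, 1.0),    # from the article for 60% or lower; 61-69% is assumed
]


def weight_for_match(percent: float) -> float:
    """Return the share of a segment's words counted as new."""
    for minimum, weight in WEIGHT_BANDS:
        if percent >= minimum:
            return weight
    raise ValueError("percent must be between 0 and 100")


def equivalent_word_count(segments) -> float:
    """Sum weighted word counts over (word_count, match_percent) pairs."""
    return sum(words * weight_for_match(pct) for words, pct in segments)


# Three ten-word segments: one word changed, four words changed, unchanged.
print(equivalent_word_count([(10, 90), (10, 60), (10, 100)]))  # 2 + 10 + 0
```

Keeping the weights in a table makes it easy to swap in whatever bands a given vendor or project actually uses.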

Changes to a sentence's internal tags (bold, italic, links, font or color changes, etc.) will also force a fuzzy match, even when the wording itself is unchanged. A weight is applied to these changes as well, as they also require translator intervention.

The analysis and calculations are done by the translation database/search engine software. They are based on algorithms built into the software that objectively approximate the number of new words to translate. The results are not 100% exact, but over the past ten years of using these algorithms, we have satisfied all our clients.

Applying an appropriate weight to each fuzzy match is a process that we use to estimate not only the cost, but also the staffing and scheduling data. This is why we can accomplish 98% of our projects on schedule and on budget for our clients!
