WANTED: Accurate fuzzy match word count!

Recently, one of our blog entries titled “Fuzzy Match or Fuzzy Math” drew interesting comments in the G11n – Globalization Professionals group on LinkedIn, where readers argued that the Levenshtein distance method is just as beneficial for calculating fuzzy matches as more refined algorithms.

It was stated that since source languages, subject matter, difficulty of text, and target languages differ from one project to another, no method can derive a truly accurate fuzzy match calculation. It was also argued that the law of averages will ultimately even out the results of the Levenshtein distance method, making them acceptable overall.

Our experience, however, shows that in the majority of languages, the Levenshtein distance method consistently produces much more optimistic fuzzy match results than reality dictates.
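
To make this concrete, below is a minimal sketch of a character-level fuzzy match calculation in Python. The match formula used here, one minus the edit distance divided by the longer segment’s length, is a common approximation; commercial TM engines each apply their own variant.

```python
# Character-level Levenshtein fuzzy matching, sketched in Python.
# The match formula (1 - distance / longer length) is a common
# approximation; real TM engines each use their own variant.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x -> y
            ))
        prev = curr
    return prev[-1]

def char_match(source, tm_entry):
    """Fuzzy match percentage computed over characters."""
    longest = max(len(source), len(tm_entry))
    return 100 * (1 - levenshtein(source, tm_entry) / longest)

tm_entry = "The printers were connected to the servers"
new_text = "The printer was connected to the server"
print(round(char_match(new_text, tm_entry)))  # 88
```

Three small inflection changes cost only five character edits out of 42, so the segment scores about 88% even though three of its seven words changed; this is the kind of optimism the character-based calculation produces.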

Normalized word counts based on accurate fuzzy match calculations are important to operations at GlobalVision: we use them to gauge not only the cost, but also the time a project requires before it is completed. For more information, read “Translation database fuzzy matches and word count demystified”.
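
As a rough illustration of how fuzzy match percentages roll up into a normalized word count, consider the sketch below. The match bands and discount weights are hypothetical placeholders for illustration only, not our actual rates.

```python
# Rolling fuzzy match percentages up into a normalized (weighted)
# word count. The bands and weights below are illustrative
# placeholders, not GlobalVision's actual rates.

# (lower bound of match %, fraction of full translation effort)
BANDS = [
    (100, 0.20),  # exact matches still need review
    (95,  0.30),
    (85,  0.50),
    (75,  0.70),
    (0,   1.00),  # no usable match: full effort
]

def normalized_words(segments):
    """segments: iterable of (word_count, fuzzy_match_percent) pairs."""
    total = 0.0
    for words, match in segments:
        for lower, weight in BANDS:
            if match >= lower:
                total += words * weight
                break
    return total

# An exact match, an 88% fuzzy match, and a brand-new sentence:
print(round(normalized_words([(8, 100), (7, 88), (12, 0)]), 1))  # 17.1
```

An inflated fuzzy match percentage pushes a segment into a cheaper band, which is precisely how an optimistic calculation understates both cost and schedule.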

Although we try to lower costs for our clients (which character-based fuzzy match engines appear to do, since they consistently report much higher fuzzy match values), we have to consider the real effort ahead and keep the interests of translators at heart as well. Squeezing translators on time and budget does a disservice to them and to our clients’ end-users.

Some Asian languages, like Chinese, are written without spaces between words, leaving no simple word boundaries to compare; in their case, the character-based Levenshtein distance method remains the common denominator. But most other languages, even ones with complex morphology, like Arabic, can benefit from an adapted Levenshtein distance method that accounts for word changes and not just character changes. The improvement in fuzzy matching accuracy is meaningful.
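
One way to sketch such an adaptation is to run the same dynamic program over word tokens instead of characters, assuming simple whitespace tokenization; this illustrates the general idea rather than our production algorithm. On the same sentence pair as above, the score drops from about 88% to 57%, much closer to the three words out of seven that a translator actually has to rework.

```python
# Word-level fuzzy matching: the same edit-distance dynamic program,
# run over word tokens instead of characters. An illustration of the
# general idea, not GlobalVision's production algorithm.

def levenshtein(a, b):
    """Edit distance between two sequences (lists of words work too)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

def word_match(source, tm_entry):
    """Fuzzy match percentage computed over whitespace-separated words."""
    s, t = source.split(), tm_entry.split()
    return 100 * (1 - levenshtein(s, t) / max(len(s), len(t)))

tm_entry = "The printers were connected to the servers"
new_text = "The printer was connected to the server"
print(round(word_match(new_text, tm_entry)))  # 57
```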

Fuzzy match engines can benefit from further refinement, but the return on investment diminishes quickly beyond a certain point. Issues such as the complexity of the source text are very hard to model in a software algorithm, and the end result would not be improved translation quality, only slightly improved metrics.

We welcome outside input and research in this area and will be glad to test any newly devised method that can demonstrate a real, significant improvement in counting the words to be translated.

If you are interested in testing your knowledge of key localization and translation practices, take our free translation localization quiz and get your score served up immediately online!


One Comment

  1. Could you tell us which algorithms you consider “more refined” than the Levenshtein one?
    Could you publish the modified Levenshtein algorithm for words that you use? I am very far from being computer savvy, but I adapted the Levenshtein algorithm some years ago for use with RQ-Basic, to make it available to some of my fellow translators who were interested in how fuzzy match calculation works.

    I am just curious to know the (mathematical) basis of the statements in this post. Many thanks!
