Free translation is, for most of us, a really great thing – I genuinely want to see it happen. Many people find that surprising, given that my entire business and livelihood rests on selling translation and language services. But we need to make a distinction between what Machine Translation (MT) currently can and can't be used for, and I believe the point is summed up nicely by stories like this one.
Without going into excessive detail, there are a few things the world should know about how free translation engines work – or, more to the point, how they get fed with language in the first place.
Every time you create a webpage, as a developer you add a language code to it (whether you know it or not). Google can therefore infer, with some confidence, what language it is written in – without a person ever looking at it. It then begins matching up all the different sentences and phrases it finds and assigning meaning to them. So we essentially have language 1 and language 2: Machine Translation engines look at how a phrase in language 1 is written and check that text against the millions of different sources they have in language 2.
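As a rough illustration of that first step (a minimal Python sketch, not how Google's crawler actually works; the HTML snippet is made up), here is how a system might pick up the language code a developer has declared on a page:

```python
# Minimal sketch: read the language a developer declared on a page.
# Real systems also guess the language from the text itself.
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # The lang attribute on <html> is the code developers set, often
        # without thinking about it (their tooling adds it for them).
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

page = '<html lang="fr"><body><p>Bonjour le monde</p></body></html>'
sniffer = LangSniffer()
sniffer.feed(page)
print(sniffer.lang)  # -> "fr": the page is tagged as French before anyone reads it
```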
So, when you go to Google Translate and ask it how to say a phrase in language 2, it will give you what it believes to be the most likely response. Often it's right, or near enough. Some of what it's looking at may come from parallel texts of professionally translated documents the engine has accessed – and as you can imagine, those matches are going to be pretty good.
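To make the "most likely response" idea concrete, here's a toy sketch of that counting logic. It isn't Google's actual model, and the aligned phrase pairs are invented, but it shows how a simple frequency count picks the winner:

```python
# Toy sketch: count how often each language 2 phrase appears aligned with
# the language 1 phrase in the scraped data, then return the most common one.
from collections import Counter

def most_likely_translation(phrase, aligned_pairs):
    # Keep only pairs whose source side matches, then pick the target
    # phrase seen most often across the scraped data.
    candidates = Counter(tgt for src, tgt in aligned_pairs if src == phrase)
    return candidates.most_common(1)[0][0] if candidates else None

scraped_pairs = [
    ("thank you very much", "merci beaucoup"),
    ("thank you very much", "merci beaucoup"),
    ("thank you very much", "merci bien"),
]
print(most_likely_translation("thank you very much", scraped_pairs))
# -> "merci beaucoup": the right answer, because the good pairs dominate
```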
But ultimately, this model depends on sourcing and taking in more and more data, evaluating more and more content based on the way we actually talk – content it can access through your chat and social media apps, through blogs and webpages. The problem is that more data does not automatically lead to better quality. If anything, it can mean the opposite, because the engine simply captures whatever is out there, good or bad.
So it scans a lot of content in the public sphere, including content that links the idea of Daesh with Saudi Arabia, and it therefore concludes that one is the most likely translation of the other, or incorrectly maps those words onto the corresponding words in other languages.
And it doesn’t stop there.
This undoubtedly wasn’t an intentional act by Google – but as it said in its defence, “…Our systems produce translations automatically based on existing translation on the web, so we appreciate when users point out issues such as this.” (Source: Business Insider UK)
To put that another way, their engines blindly scan content created and discussed amongst huge numbers of people, in many different languages, and match this content up across those languages (without ever actually reading it). The billions of words poured into the web every day, in every language, are constantly being compared, lined up and matched against similar foreign-language versions.
This automatic process is not designed to focus on proper speech or what we know to be ‘right’, but simply on what is most commonly used. The fundamental problem is that, in many cases, what is really being captured is incorrect data or short-term trends in word usage. That may well not stop most users getting great value from MT as a conversation aid, a help with basic travel and survival language, or even a learning tool; but this data acquisition method is very likely to cause damaging results if relied upon for business or professional purposes.
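As a toy illustration of that weakness (again using invented phrase pairs, not a real engine), here is what happens when a wave of sloppy or trendy usage floods the same kind of counting:

```python
# Pairs invented purely to illustrate the mechanism: a burst of sloppy or
# trendy usage floods the data, and nothing checks correctness.
from collections import Counter

pairs = (
    [("thank you very much", "merci beaucoup")] * 3
    + [("thank you very much", "merci mec")] * 7  # slangy, wrong register
)
counts = Counter(tgt for src, tgt in pairs if src == "thank you very much")
print(counts.most_common(1)[0][0])  # -> "merci mec": most common, so it wins
```

Nothing in that process asks whether the winning phrase is correct, only whether it is common.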
Unfortunately, the likelihood is that as social media grows, and with it the amount and breadth of data being fed into machine translation, these engines are actually going to get worse before they get any better.