Back in early December, I was blown away by a surprise demo of Qualcomm’s (NASDAQ: QCOM) latest real-time translation breakthrough at its annual Snapdragon Tech Summit: Two people on a phone call, talking to each other in different languages (one in English, the other in Mandarin), and both of them hearing the other in their own language. In other words, the English-speaker’s words were being simultaneously translated into Mandarin, and the Mandarin-speaker’s words were being simultaneously translated into English. Although neither individual spoke the other’s language, each one heard the other in his own language. And with that, Qualcomm’s real-time translation during a phone call moved from science fiction to fact.
How Qualcomm is Finally Bringing Real-time Translation to the World
Judging by the collective gasp from the crowd in attendance, and the ensuing stares from single-serving neighbors that seemed to silently ask “Did I just see what I think I just saw?” I wasn’t the only one there who realized the importance of what we had just witnessed: A true, real-time, and mostly seamless translation between two people speaking different languages during a phone conversation. The ramifications of such a real-time translation capability were obvious, but the real question was… how in the world is this even possible? Was it a trick? Was it a demo of something Qualcomm was working on and hoping to achieve someday? Or was it a real demo of a real product that actually works? It turned out to be the latter. It was and is entirely real. This is not a tomorrow or next year capability. It is a right now capability, and while partnerships with companies like Youdao are critical to developing certain key aspects of the process, Qualcomm’s AI-powered Snapdragon 865 5G mobile platform is the tech that makes it possible.
Why Qualcomm’s Real-Time Translation Technology Breakthrough is Different — and So Groundbreaking
I already know what you’re thinking: Wasn’t Google already doing this? Well, not exactly. Not like this. While Google and a number of other tech companies with strong AI products have managed to build impressive voice-to-text and voice-to-voice translation engines, this is different for several reasons, the principal one being that Qualcomm’s real-time translation function happens on the device, not in the cloud.
On-device translation is the key to real-time translation. Relying on a cloud solution to translate speech typically involves too much lag for a real-time application. Your speech has to be captured, transmitted to a server somewhere, analyzed and translated, then the translation sent back to your device or to the call.
Don’t get me wrong, cloud-based translation apps are great and useful if all you need to do is ask a voice assistant to translate a phrase or a recorded video, but not so great if you expect that translation to happen in real time during a conversation. For that, you need as near to zero latency as you can achieve, and putting the translation function literally between you and the person that you are communicating with is a very good way to achieve that. Better yet, if the translation function can happen on your phone, or on the device you are using as your communication interface (it could also be a laptop or a tablet), you can keep latency to a minimum. And that is where Qualcomm’s Snapdragon 865 comes in: The 865’s integrated 5th Generation AI Engine packs an impressive 15 TOPS (trillion operations per second). It is also backed up by the combined power of the Adreno 650 GPU, Hexagon 698 processor, and Kryo 585 CPU. The Hexagon digital signal processor specifically holds the key to optimizing end-to-end latency to keep each step of the translation process as short as possible.
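To make the latency argument above concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is an illustrative placeholder that I made up for the comparison, not a measurement from Qualcomm or anyone else, and the stage names are my own assumptions. The only point is that the cloud path carries network round-trip stages that the on-device path simply does not have.

```python
# Back-of-the-envelope latency budget for one translated utterance.
# All figures are hypothetical placeholders in milliseconds, NOT measurements.

CLOUD_PATH_MS = {
    "capture": 20,           # record the speech on the phone
    "uplink": 60,            # send audio to a remote server (assumed)
    "server_inference": 80,  # recognize, translate, synthesize in the cloud (assumed)
    "downlink": 60,          # send the translated audio back (assumed)
    "playback": 20,          # play the result to the listener
}

ON_DEVICE_PATH_MS = {
    "capture": 20,           # record the speech on the phone
    "local_inference": 100,  # recognize, translate, synthesize locally (assumed)
    "playback": 20,          # play the result to the listener
}


def total_latency_ms(stages: dict) -> int:
    """Sum the per-stage delays of a sequential pipeline."""
    return sum(stages.values())


def network_share(stages: dict) -> float:
    """Fraction of the total delay spent on network transfer stages."""
    network = sum(ms for name, ms in stages.items() if name in ("uplink", "downlink"))
    return network / total_latency_ms(stages)


if __name__ == "__main__":
    print("cloud total (ms):", total_latency_ms(CLOUD_PATH_MS))
    print("on-device total (ms):", total_latency_ms(ON_DEVICE_PATH_MS))
    print("network share of cloud path:", network_share(CLOUD_PATH_MS))
```

With these made-up figures, half of the cloud path’s delay is pure network transfer, which is exactly the portion that moving the work onto the device removes.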
The process is fairly complex, obviously, but Qualcomm’s real-time translation process can be broken down into three general steps, each enabled by specific technology solutions.
First, Automatic Speech Recognition (ASR) captures your speech and, using convolutional neural networks (CNNs), transcribes it as text in the language that you spoke. This happens on the Hexagon 698 processor.
Then, using Neural Machine Translation (NMT), the transcribed text (English, in the demo) is translated into your interlocutor’s language, also in text format. Note that this type of translation is contextual, so it is not a word-for-word translation, but rather an approximation of how your sentences would naturally translate into the other language. This is important because languages often have very different grammatical rules, to say nothing of selecting the right words to minimize ambiguity. Those factors are taken into account at this stage.
Finally, a text-to-speech (TTS) engine converts the translated text into speech. And because this all happens on the device, and the Snapdragon 865 platform is designed to perform this type of operation in real time, the translation feels seamless.
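For readers who think in code, the three steps above can be sketched as a simple pipeline in Python. This is a toy illustration of the data flow only, under loud assumptions: the function names, the `Audio` placeholder type, and the tiny phrase table are all hypothetical, and none of it reflects Qualcomm’s or Youdao’s actual software. Real ASR, NMT, and TTS stages are neural models, not lookups.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical phrase table standing in for a neural translation model.
# Mandarin is romanized here purely to keep the example simple.
PHRASE_TABLE: Dict[str, str] = {
    "hello": "ni hao",
    "thank you": "xie xie",
}


@dataclass
class Audio:
    """Placeholder for an audio clip; a real system would hold audio samples."""
    spoken_text: str
    language: str


def recognize_speech(audio: Audio) -> str:
    """Step 1 (ASR): turn speech into text in the speaker's own language."""
    return audio.spoken_text.strip().lower()


def translate_text(text: str, table: Dict[str, str]) -> str:
    """Step 2 (NMT stand-in): map source text to target-language text.

    Falls back to the untranslated text when the phrase is unknown.
    """
    return table.get(text, text)


def synthesize_speech(text: str, language: str) -> Audio:
    """Step 3 (TTS): turn the translated text back into speech."""
    return Audio(spoken_text=text, language=language)


def translate_utterance(audio: Audio, target_language: str) -> Audio:
    """Run the three steps in order, entirely in one local process."""
    source_text = recognize_speech(audio)
    translated_text = translate_text(source_text, PHRASE_TABLE)
    return synthesize_speech(translated_text, target_language)


if __name__ == "__main__":
    heard = translate_utterance(Audio("Hello", "en"), "zh")
    print(heard.language, heard.spoken_text)
```

The design point worth noticing is that each stage hands plain text or audio to the next with no network hop in between, which is what keeps the whole chain on the device.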
Here’s a short video from the 2019 Snapdragon Tech Summit that walks you through how Qualcomm’s real-time translation works:
Real World Applications for Real-Time Translations — and Where this Technology Goes from Here
From phone calls to video calls. Currently, this type of real-time translation appears to be limited to phone calls, which is a great start, but I anticipate that video calls will be the next frontier, particularly if some of the face-mapping tech I also saw on display at the Snapdragon Tech Summit is applied to this. One of the demos I saw slightly altered people’s eye angle on video calls so that they would line up with the camera, giving the impression that everyone is looking right at each other rather than looking slightly down, as they tend to do during those types of calls. In the same way, mouth and lip movements could be altered in real time to appear to mouth the words in the translated language rather than the original language. While this may seem a little unsettling at first thought, that kind of real-time video manipulation, in this context, will enhance the translation experience, improve the clarity of communications between parties, and help individuals with hearing impairments by adding lip-reading cues that they otherwise might not have.
From calls to face-to-face translations. It isn’t difficult to imagine how, using their mobile devices paired with noise-cancelling Bluetooth earphones, two individuals could also, either through a phone call or an app, achieve the same type of simultaneous translation face-to-face. The ability for travelers to speak to any taxi driver, police officer, EMT, doctor, attorney, or merchant anywhere in the world is a game-changer for tourism. This also opens the door for businesses, large and small, to break through language barriers and partner with companies around the world, to say nothing of the ability for companies to recruit from a much broader pool of talent without having to worry as much as they did before about language challenges. Lastly, this type of simultaneous translation also opens new doors for students willing to attend classes, either virtually or in person, anywhere around the world.
For a better understanding of Qualcomm’s real-time translation capabilities, and how Qualcomm and Youdao achieved this breakthrough on device, here’s a pretty cool visualization:
The concept of real-time translation is exciting. It’s even more exciting to see this becoming a reality — and Qualcomm is clearly leading the way on this front.
Futurum Research provides industry research and analysis. These columns are for educational purposes only and should not be considered in any way investment advice.
Image Credit: Qualcomm
The original version of this article was first published on Futurum Research.
Senior Analyst at @Futurumxyz. Digital Transformation + Tech + Disruption. Author, keynote speaker + troublemaker. Opinions are my own. I like croissants.