Google Researchers Just Made Computers Sound Much More Like People

A team of researchers at Google has found a way to substantially improve the cadence and intonation of computer-generated speech. It’s a step towards the kind of sophisticated speech synthesis that has, to date, existed only in science fiction.

Computers, even when they speak, do not sound human. Even in science fiction, where such constraints need not exist, computers, androids, and robots commonly use stilted grammar, mispronounce words, or speak in harsh, mechanical tones. In TV shows and movies where artificial life forms speak naturally (the advanced Cylon models in the 2004 Battlestar Galactica reboot, for example), this capability is often used to play up why they represent a threat. The ability to speak naturally is often treated as a vital component of humanity. Mechanical life forms in Star Trek: The Next Generation and its various spin-offs almost always speak with mannerisms intended to convey their artificiality, even when their intentions are perfectly benign.

In the real world, programs like Dr. Sbaitso were often the first introduction computer users had to text-to-speech technology. You can hear what Creative Labs’ text-to-speech technology sounded like below, circa 1990.

Modern technology has dramatically improved on this, but Alexa, Cortana, Google Assistant, and Siri would never be mistaken for a human except in very specific cases. A significant part of the reason we can tell a computer’s speech from a person’s is the (mis)use of prosody: the pattern of intonation, tone, rhythm, and stress within a language.

There’s an old joke about the importance of commas that compares two simple sentences to make its point: “It’s time to eat Grandma” conveys a rather different meaning than “It’s time to eat, Grandma.” In this case, the comma conveys information about how the sentence should be pronounced and interpreted. Not all prosodic information is encoded in punctuation, however, and teaching computers how to interpret and use this data has been a major stumbling block. Now, researchers across multiple Google teams have found a way to encode prosody information into the Tacotron text-to-speech (TTS) system.

We can’t embed Google’s speech samples directly, unfortunately, but it’s worth visiting the page to hear how the new information impacts pronunciation and diction. Here’s how Google describes this work:

We augment the Tacotron architecture with an additional prosody encoder that computes a low-dimensional embedding from a clip of human speech (the reference audio). This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference. The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.
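To make the idea in that description concrete, here is a toy sketch of the data flow only: a variable-length reference clip is squeezed into one small vector, and that vector is attached to every text state the synthesizer sees. Everything in it is an assumption for illustration. The function names, the sizes (80 audio features, a 16-value embedding, 32-value text states), and the mean-pool-plus-projection encoder are stand-ins, and the weights are random rather than trained. It is not Google's implementation.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def prosody_embedding(reference_frames, projection):
    """Collapse a variable-length clip into one fixed-size vector.

    Assumption: mean-pooling over time plus a linear projection is used here
    as a simple stand-in for a learned prosody encoder.
    """
    n_frames = len(reference_frames)
    n_feats = len(reference_frames[0])
    # Average each audio feature over time, discarding the clip length.
    pooled = [
        sum(frame[i] for frame in reference_frames) / n_frames
        for i in range(n_feats)
    ]
    # Project the pooled features down to a low-dimensional embedding.
    return [sum(w * x for w, x in zip(row, pooled)) for row in projection]


def condition_text_states(text_states, embedding):
    """Append the same prosody embedding to every text encoder state."""
    return [state + embedding for state in text_states]


rng = random.Random(0)
# Hypothetical reference clip: 120 frames of 80 audio features each.
reference = [[rng.random() for _ in range(80)] for _ in range(120)]
projection = make_matrix(16, 80, rng)
embedding = prosody_embedding(reference, projection)

# Hypothetical text encoder output: 10 states of 32 values each.
text_states = [[rng.random() for _ in range(32)] for _ in range(10)]
conditioned = condition_text_states(text_states, embedding)
print(len(embedding), len(conditioned), len(conditioned[0]))
```

Because the embedding has a fixed size regardless of how long the reference clip is, the same vector can in principle be paired with text spoken by a different voice, which is the prosody transfer the researchers describe.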

There are samples and clips you can play to hear how Tacotron handles various tasks. The researchers note they can transfer prosody even when the reference audio uses an accent not in Tacotron’s training data. More importantly, they’ve found a way to model what they call latent “factors” of speech, so that the prosody of a speech clip can be represented without requiring a reference audio clip. This expanded model can force Tacotron to use specific speaking styles to make various statements sound happy, angry, or sad.
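One way to picture reference-free style control is as a small bank of vectors that are mixed by weights, so a style can be chosen by picking weights instead of supplying an audio clip. The sketch below assumes exactly that. The names `factor_bank` and `style_embedding`, the tiny two-value vectors, and the softmax mixing are illustrative assumptions, not details taken from Google's description.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(factor_bank, scores):
    """Mix the factor vectors into one style vector using softmax weights."""
    weights = softmax(scores)
    dim = len(factor_bank[0])
    return [
        sum(w * factor[i] for w, factor in zip(weights, factor_bank))
        for i in range(dim)
    ]


# Hypothetical bank of three factors; real ones would be learned in training.
factor_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores blend all factors evenly.
neutral = style_embedding(factor_bank, [0.0, 0.0, 0.0])
# A high score on the first factor pulls the style toward it.
leaning = style_embedding(factor_bank, [5.0, 0.0, 0.0])
print(neutral, leaning)
```

In a setup like this, no reference clip is needed at synthesis time: the weights alone select the speaking style.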

None of the clips sound completely human — there’s still a degree of artificiality to the underlying presentation — but they’re a substantial improvement on what’s come before. Maybe the next Elder Scrolls game won’t have to feature the same eight voice actors in approximately 40,000 different roles.
