A few months back, Amazon detailed some of the underlying systems that prevent Alexa from responding when someone says the wake word “Alexa” on TV, in an internet ad, or on the radio. But how does Amazon’s voice assistant filter out everyday background noise? A blog post and accompanying research paper (“End-to-End Anchored Speech Recognition”) lay out a novel noise-isolating technique that could reduce the assistant’s speech recognition error rate by 15%.
The approach is scheduled to be presented at this year’s International Conference on Acoustics, Speech, and Signal Processing in Brighton.
“One of the ways that we’re always trying to improve Alexa’s performance is by teaching her to ignore speech that isn’t intended for her,” explained Xin Fan, a senior applied scientist in the Alexa AI group. “We assume that the speaker who activates an Alexa-enabled device by uttering its ‘wake word’, usually ‘Alexa’, is the one Alexa should be listening to. Essentially, our technique takes an acoustic snapshot of the wake word and compares subsequent speech to it. Speech whose acoustics match those of the wake word is judged to be intended for Alexa, and all other speech is treated as background noise.”
Rather than training a separate AI system to differentiate between noise and wake words, Fan and colleagues merged their wake-word-matching mechanism with a standard speech recognition model. They tested two variations on a sequence-to-sequence encoder-decoder AI architecture, that is, an architecture that processes input data (millisecond-long audio signal snapshots) in order and generates a corresponding output sequence (phonetic renderings of sound). As with most encoder-decoder techniques, the encoder component summarized the input as a fixed-length vector (a sequence of numbers), and the decoder component converted that vector into an output. Meanwhile, a special attention mechanism “taught” to detect certain characteristics of the wake word in speech guided the decoder toward those characteristics in the vector.
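To make the idea of wake-word-guided attention concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the researchers' model: the function name `anchored_attention`, the averaging of wake word frames into a single "anchor" vector, and the additive dot-product scoring are all hypothetical choices made for this example, and the toy vectors stand in for learned encoder outputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    # Average a list of equal-length vectors, dimension by dimension.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def anchored_attention(encoder_states, wake_word_frames, decoder_state):
    """Hypothetical sketch: bias attention toward frames that resemble the wake word.

    Assumption: the first `wake_word_frames` encoder states cover the wake word,
    and their mean serves as the anchor. A real system would learn this scoring.
    """
    anchor = mean_vector(encoder_states[:wake_word_frames])
    speech = encoder_states[wake_word_frames:]
    # Score each later frame by its match to the decoder state plus the anchor.
    scores = [dot(decoder_state, h) + dot(anchor, h) for h in speech]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the later frames.
    context = [sum(w * h[i] for w, h in zip(weights, speech))
               for i in range(len(anchor))]
    return weights, context

# Toy input: two wake word frames, then one frame that resembles the
# wake word speaker and one that does not (say, a TV in the background).
states = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0]]
weights, context = anchored_attention(states, wake_word_frames=2,
                                      decoder_state=[0.0, 0.0])
print(weights)
```

In this toy case the frame that resembles the wake word receives the larger attention weight, which is the behavior the article describes: the decoder is steered toward speech that sounds like the person who woke the device.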
In an experiment, the researchers trained one of the AI models to “more explicitly” emphasize likely wake word speech, first by adding a component that directly compared the wake word acoustics with those of subsequent speech and then by using the result as an input to a separate component that learned to mask bits of the encoder’s vector.
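The masking idea can be sketched in a few lines as well. This is only an illustration under assumptions: in the research the mask is learned by a separate component, whereas the example below substitutes a fixed sigmoid over cosine similarity, and the names `soft_mask`, `sharpness`, and `threshold` are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def soft_mask(encoder_states, anchor, sharpness=10.0, threshold=0.5):
    """Hypothetical stand-in for a learned mask.

    Each encoder state is compared with the wake word anchor, and the
    similarity is squashed into a gate between 0 and 1 that scales the state.
    `sharpness` and `threshold` are illustrative values, not from the paper.
    """
    gates, masked = [], []
    for h in encoder_states:
        similarity = cosine(anchor, h)
        gate = 1.0 / (1.0 + math.exp(-sharpness * (similarity - threshold)))
        gates.append(gate)
        masked.append([gate * x for x in h])
    return gates, masked

anchor = [0.9, 0.1]                  # toy "acoustic snapshot" of the wake word
frames = [[0.9, 0.1], [0.0, 1.0]]    # matching speech, then background noise
gates, masked = soft_mask(frames, anchor)
print(gates)
```

Here the frame that matches the anchor passes through almost untouched, while the mismatched frame is scaled close to zero before the decoder ever sees it.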
Interestingly, the masking model performed slightly worse than the variant without masking: it reduced the error rate by 13% relative to the baseline, versus 15% for the simpler model. The team speculates that this is because its masking decisions were based solely on the state of the encoder network, and they leave to future work a masking mechanism that considers the decoder state.
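For readers unfamiliar with how such percentages are usually computed, the short example below works through the arithmetic. It assumes the 13% and 15% figures are relative reductions measured against the baseline's error rate, which is the common convention; the absolute error rates used here are invented purely for illustration and do not come from the paper.

```python
def relative_reduction(baseline_error, new_error):
    # Fraction of the baseline's errors that the new model eliminates.
    return (baseline_error - new_error) / baseline_error

# Hypothetical error rates, chosen only so the arithmetic matches the article.
baseline = 0.100
without_mask = 0.085
with_mask = 0.087

print(f"without mask: {relative_reduction(baseline, without_mask):.0%}")
print(f"with mask:    {relative_reduction(baseline, with_mask):.0%}")
```

Under these made-up numbers, both variants beat the baseline, and the gap between them is two percentage points of relative improvement.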