Specifically, a very small speech recognizer built into the embedded motion coprocessor runs all the time and listens for "Hey Siri." When just those two words are detected, Siri parses any subsequent speech as a command or query.
The detector uses a Deep Neural Network to convert the acoustic pattern of a user's voice at each instant into a probability distribution over speech sounds. It then uses a temporal integration process to compute a confidence score that the phrase uttered was "Hey Siri."
If the score is high enough, Siri wakes up and proceeds to complete the command or answer the query automatically.
If the score exceeds Apple's lower threshold but not the upper threshold, however, the device enters a more sensitive state for a few seconds, so that Siri is much more likely to be invoked if the user repeats the phrase, even without making any extra effort.
"This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time," said Apple.
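The two-threshold behavior described above can be sketched roughly as follows. This is only an illustration, not Apple's implementation: the class name `TwoThresholdTrigger`, the threshold values, the three-second window, and the choice to accept any score above the lower threshold while in the sensitive state are all assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative values only; Apple does not publish its thresholds here.
UPPER_THRESHOLD = 0.8      # assumed: score needed to wake Siri normally
LOWER_THRESHOLD = 0.5      # assumed: score that counts as a "near miss"
SENSITIVE_WINDOW_S = 3.0   # assumed: the "few seconds" of extra sensitivity


@dataclass
class TwoThresholdTrigger:
    """Hypothetical second-chance trigger driven by confidence scores."""

    upper: float = UPPER_THRESHOLD
    lower: float = LOWER_THRESHOLD
    window_s: float = SENSITIVE_WINDOW_S
    sensitive_until: float = float("-inf")

    def update(self, score: float, now: float) -> bool:
        """Return True if this score should wake the assistant.

        `now` is a timestamp in seconds from any monotonic clock.
        """
        in_sensitive_state = now < self.sensitive_until
        # Assumption: while sensitive, the lower threshold is enough.
        threshold = self.lower if in_sensitive_state else self.upper

        if score > threshold:
            self.sensitive_until = float("-inf")  # leave the sensitive state
            return True

        if score > self.lower:
            # Near miss: listen more sensitively for a short time.
            self.sensitive_until = now + self.window_s
        return False
```

In this sketch, a score of 0.6 on its own does not wake the assistant, but a second 0.6 one second later does, because the first attempt opened the short sensitive window.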
To reduce false triggers from strangers, Apple invites users to complete a short enrollment session in which they say five phrases that each begin with "Hey Siri." The examples are saved on the device.
When the detector fires, Apple compares the distances to the reference patterns created during enrollment against another threshold to decide whether the sound that triggered the detector is likely to be "Hey Siri" spoken by the enrolled user. Apple also says it made "Hey Siri" recordings both close to and far from the device in various environments, such as the kitchen, car, bedroom, and restaurant, using native speakers of many languages around the world.
The comparison against the enrolled examples not only reduces the probability that "Hey Siri" spoken by another person will trigger the iPhone, but also reduces the rate at which other, similar-sounding phrases trigger Siri.
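The enrollment check can be pictured with a small sketch. It assumes each utterance has already been reduced to a fixed-length vector of numbers, uses plain Euclidean distance, and averages the distances to the enrolled examples. None of these details come from Apple's description, and the function names and the `max_distance` parameter are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_enrolled_speaker(
    candidate: Sequence[float],
    references: Sequence[Sequence[float]],
    max_distance: float,
) -> bool:
    """Accept the trigger if it is close enough to the enrolled examples.

    Assumption: "close enough" means the mean distance to all reference
    vectors does not exceed `max_distance`.
    """
    if not references:
        raise ValueError("at least one enrolled reference is required")
    mean_distance = sum(euclidean(candidate, r) for r in references) / len(references)
    return mean_distance <= max_distance
```

A trigger that lands near the enrolled examples is accepted, while one far from all of them is rejected even if the phrase detector itself fired.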
For many more technical details about how "Hey Siri" works, be sure to read Apple's full article on its Machine Learning Journal.