Adoption of voice-activated technology has accelerated in recent years. Voice-controlled functionality on smartphones and voice-controlled devices for home use, such as Amazon Echo and Google Home, have become widespread. Voice control is also being implemented in many other areas, including banking, healthcare and office environments. US bank Capital One, for example, has developed a third-party voice app for Amazon Alexa that can be used by Capital One customers to perform personal banking tasks, such as checking an account balance or paying a credit card bill.[1] In the UK, NHS Digital recently announced a partnership with Amazon that will enable patients to access health information by voice command via the Alexa assistant.[2] For office environments, Amazon offers ‘Alexa for Business’, which enables organisations to develop bespoke voice-controlled environments for administrative tasks and internal liaison.[3]
Voice control has many potential benefits in terms of convenience and accessibility. However, the ability to effect actions by voice also represents a new attack channel that may be exploited by malicious actors seeking to gain unauthorised access to a system. Access control for voice-activated technology is challenging. Voice biometrics as an access control mechanism is vulnerable to spoofing, as highlighted by a recent incident in which a BBC reporter was able to access his brother’s supposedly voice-protected bank account.[4] PINs and passwords are difficult to implement in the context of voice, as verbalising a PIN or password may lead to it being overheard. Non-audio-based access control mechanisms, such as device locking, may reduce the usability of a voice-controlled device.
A compounding factor affecting the security of voice-activated technology is that some types of audio attack are not easily detectable. Researchers have demonstrated that voice commands can be concealed in white noise[5], as well as in frequencies above the human-audible range.[6] Speech recognition in voice-controlled systems may also be vulnerable to adversarial learning attacks, in which small changes to audio input cause a voice-controlled device to recognise different words than those heard by a human listener.[7]
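To illustrate the principle behind inaudible commands of the kind described in [6], the following sketch amplitude-modulates a baseband voice signal onto an ultrasonic carrier. The carrier frequency, sample rate and modulation depth are illustrative assumptions rather than parameters from the cited work; the point is simply that, for a speech signal band-limited to a few kilohertz, the modulated signal lies entirely above the human-audible range, while nonlinearities in microphone hardware can demodulate the envelope back into a recognisable command.

```python
import numpy as np

def modulate_ultrasonic(voice, sample_rate=192_000, carrier_hz=30_000, depth=0.8):
    """Amplitude-modulate a baseband voice signal onto an ultrasonic carrier.

    `voice` is a mono float array sampled at `sample_rate`; the sample rate
    must exceed twice the carrier frequency for the carrier to be representable.
    """
    peak = np.max(np.abs(voice))
    voice = voice / peak if peak > 0 else voice          # normalise to [-1, 1]
    t = np.arange(len(voice)) / sample_rate
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Standard AM: the envelope carries the command, while the 30 kHz carrier
    # sits above the roughly 20 kHz limit of human hearing.
    return (1.0 + depth * voice) * carrier

# Placeholder one-second "command": a 400 Hz test tone standing in for recorded speech.
t = np.arange(192_000) / 192_000
command = np.sin(2 * np.pi * 400 * t)
inaudible_signal = modulate_ultrasonic(command)
```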
An attacker using voice control as an attack vector might be aiming to obtain personal information about a user of a voice-controlled device. Whilst much attention has been focused on the privacy of information held by providers of voice-controlled devices, there is also a risk of information stored in voice-controlled systems being leaked to malicious actors via the voice control interface itself. This might occur if an attacker is able to use synthesised speech to prompt the device to disclose personal information aloud. A voice-controlled device could also be induced by malicious audio input to transmit personal information about a user to a remote location controlled by an attacker. The potential for this was demonstrated recently when an Amazon Echo device misinterpreted audio from its environment and sent a recording of a conversation inside a couple’s home to a business contact.[8]
A further aim of gaining unauthorised access to a voice-controlled system might be to hijack voice-controlled devices in a smart home, such as lights, thermostats or smart locks. It has in fact been demonstrated that it is possible to gain access to a home protected by a smart lock by issuing verbal commands to the voice-controlled digital assistant Siri from outside the home.[9] Yet another motivation for attacking voice-activated technology might be to damage the reputation of a business or individual by triggering compromising web searches, or by posting damaging content to a victim’s social media accounts, by voice.
An additional type of attack on voice-activated systems is specific to third-party voice apps made available via a cloud-based voice assistant. Several providers of cloud-based assistants, including Amazon and Google, have enabled the development of third-party voice apps, also known as ‘Skills’ or ‘Actions’, which users invoke by asking the assistant to connect them to a particular app using a name given to the app by its developer. Research has demonstrated the potential for so-called Skill-squatting attacks – the audio equivalent of typo-squatting – in which the name of a legitimate voice app is misrecognised by a voice assistant as the name of a malicious app deployed by an attacker.[10] In executing this attack, a malicious actor may use overlapping word sounds, or exploit errors in speech recognition. The attacker’s aim might be to collect personal information which a user had intended to share with a legitimate voice app, such as a banking voice app. Another objective might be to spread misinformation, such as incorrect health information, by impersonating a voice app that is a legitimate source of such information. An attacker might also aim to divert traffic away from the legitimate voice app, amounting to a denial of service attack.
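To make the squatting mechanism more concrete, the short sketch below flags invocation names that might plausibly be confused with a target skill name. It is a simplified illustration of the idea rather than the methodology of [10]: the crude ‘sounds-like’ normalisation, the similarity threshold and the example names are all assumptions chosen for illustration, using only Python’s standard library.

```python
import difflib
import re

def sound_key(name: str) -> str:
    """Crude phonetic normalisation: lower-case, strip non-letters,
    and collapse a few common homophone spellings."""
    s = re.sub(r"[^a-z]", "", name.lower())
    for a, b in [("ph", "f"), ("ck", "k"), ("wh", "w"), ("oo", "u"), ("ea", "e")]:
        s = s.replace(a, b)
    return s

def confusable(target: str, candidate: str, threshold: float = 0.8) -> bool:
    """Return True if `candidate` could plausibly be misheard as `target`."""
    ratio = difflib.SequenceMatcher(None, sound_key(target), sound_key(candidate)).ratio()
    return ratio >= threshold

# Hypothetical invocation names, in the spirit of the squatting examples in [10].
legitimate = "capital one"
for name in ["capital won", "capitol won", "comical fun"]:
    print(name, "->", confusable(legitimate, name))
```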
Attacks on voice-activated technology may be executed remotely, e.g. by prompting users to download a malicious audio file[11], by audio input from a malicious smartphone app[12], or by abusing the speaker functionality of any compromised device in the target system’s vicinity, such as a hacked PC. An attack may also be executed locally, either in a public space, e.g. by triggering voice activation on smartphones carried by victims on public transport, or in any private space containing voice-controlled devices to which an attacker has gained unauthorised physical access. The security of voice-activated technology may therefore become a significant issue for both technical and physical penetration testing in future.
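As a simple proof of concept of the speaker-abuse vector, the sketch below shows how little is needed for a compromised machine to issue a spoken command to any voice assistant within earshot. The use of the third-party pyttsx3 text-to-speech package, and the wake word and command shown, are illustrative assumptions; such a payload would only be appropriate within an authorised engagement, for example to test whether nearby voice-controlled devices accept commands from untrusted audio sources.

```python
# Proof of concept for authorised testing only: use the speaker of a
# compromised machine to issue a voice command to assistants in earshot.
import pyttsx3  # third-party offline text-to-speech package (assumed available)

def speak_command(wake_word: str, command: str) -> None:
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)        # slow the speech so it is reliably recognised
    engine.say(f"{wake_word}, {command}")
    engine.runAndWait()                    # block until the audio has been played

if __name__ == "__main__":
    # A benign command, chosen to check whether nearby devices respond to untrusted audio.
    speak_command("Alexa", "what is on my calendar today")
```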
[5] Carlini, Nicholas, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. “Hidden voice commands.” In 25th USENIX Security Symposium (USENIX Security 16), pp. 513-530. 2016.
[6] Zhang, Guoming, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. “DolphinAttack: Inaudible voice commands.” In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 103-117. ACM, 2017.
[7] https://www.inverse.com/article/40367-researchers-can-fool-speech-recognition-a-i-with-this-trick
[8] https://www.theverge.com/2018/5/24/17391898/amazon-alexa-private-conversation-recording-explanation
[9] https://nakedsecurity.sophos.com/2016/09/22/siri-opens-smart-lock-to-let-neighbor-walk-into-a-locked-house/
[10] Kumar, Deepak, Riccardo Paccagnella, Paul Murley, Eric Hennenfent, Joshua Mason, Adam Bates, and Michael Bailey. “Skill squatting attacks on Amazon Alexa.” In 27th USENIX Security Symposium (USENIX Security 18), pp. 33-47. 2018.
[11] Dhanjani, Nitesh. Abusing the Internet of Things: Blackouts, Freakouts, and Stakeouts. O’Reilly Media, Inc., 2015.
[12] Diao, Wenrui, Xiangyu Liu, Zhe Zhou, and Kehuan Zhang. “Your voice assistant is mine: How to abuse speakers to steal information and control your phone.” In Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices, pp. 63-74. ACM, 2014.