ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms Using Linguistic Features

We propose ALIF, the first low-cost attack on black-box ASR systems that uses adversarial linguistic features. Building on the ALIF pipeline, we propose the ALIF-OTL and ALIF-OTA schemes, which launch efficient attacks in the digital domain and in the physical playback environment against four commercial ASR APIs and voice assistants.

Audio Examples


Tip:
★ We recommend listening to all examples before reading any transcriptions.


Attack examples generated by ALIF-OTL


Target APIs Attack examples Transcriptions
[Two audio attack examples and their transcriptions for each target API: Amazon, Azure, iFLYTEK, and Tencent]


Attack examples generated by ALIF-OTA

[Audio attack examples omitted]
Ablation study on different blurring components

We select five sentences and, for each, generate audio samples with different blurring components. The audio samples in the second column are created by adding linguistic perturbation only. The next three columns contain audio samples generated with a single blurring component each. The final column contains audio samples produced with all blurring components except linguistic perturbation. (A generic code sketch of spectrogram blurring follows the table.)

Commands Linguistic Perturbation Only Parameter Beta Only Parameter Gamma Only Parameter Alpha Only All Mel Blurring
command1
command2
command3
command4
command5
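
For intuition, below is a minimal, generic sketch of spectrogram-domain blurring: blur the mel spectrogram with a Gaussian kernel and resynthesize the waveform. This is an illustration only, not the paper's exact alpha/beta/gamma blurring components; the function name and the sigma parameter are ours.

import librosa
from scipy.ndimage import gaussian_filter

def blur_audio(path, sigma=1.5, sr=16000):
    """Blur a mel spectrogram with a Gaussian kernel, then resynthesize audio."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    # Smooth across both the time and mel-frequency axes.
    mel_blurred = gaussian_filter(mel, sigma=sigma)
    # Invert back to a waveform (Griffin-Lim phase reconstruction inside).
    return librosa.feature.inverse.mel_to_audio(mel_blurred, sr=sr)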




User study

User study on audio incomprehensibility

We perform a user study in which 6 volunteers evaluate 12 ALIF samples on a 0-4 intelligibility scale and transcribe them. The results show an average intelligibility score of 0.84 and an average character error rate (CER) of 78%, confirming that the samples are incomprehensible to humans. (The CER computation is sketched after the tables below.)

Commands User 1 User 2 User 3 User 4 User 5 User 6 Average score
Airplane mode on 0 0 0 0 0 0 0
Call one two three 0 1 2 1 4 1 1.5
Cancel my alarm clock 1 1 1 0 1 0 0.67
Darn it 0 1 2 1 0 0 0.67
I can't take it anymore 1 1 1 2 0 1 1
I need help 0 1 1 4 1 1 1.33
Navigate to my office 1 1 1 1 1 1 1
Send a message to my mom 0 1 0 1 0 0 0.33
Transfer the payment 1 0 1 2 0 0 0.67
Turn on the light 0 1 1 4 0 1 1.17
Unlock the door 1 2 2 1 0 1 1.17
What's the time 1 1 1 1 0 0 0.67


Commands User 1 User 2 User 3 User 4 User 5 User 6 Average CER
Airplane mode on 1 1 1 1 1 1 1
Call one two three 1 0.72 0.33 0.72 0.28 0.33 0.56
Cancel my alarm clock 0.57 0.86 1 1 0.95 1 0.9
Darn it 1 0.86 1 1 1 1 0.98
I can't take it anymore 0.39 0.74 0.48 0.3 1 0.48 0.57
I need help 1 0.91 1 0 0.91 1 0.8
Navigate to my office 0.86 0.81 1 0.67 0.67 1 0.83
Send a message to my mom 1 0.75 1 0.79 1 1 0.92
Transfer the payment 0.4 1 1 0.3 1 1 0.78
Turn on the light 1 0.82 0.35 0 1 0.35 0.59
Unlock the door 0.4 0.4 0.4 0.47 1 0.4 0.51
What's the time 0.87 0.8 1 0.6 1 1 0.88
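
For reference, the CER values above can be computed as the character-level Levenshtein distance between a volunteer's transcription and the reference command, normalized by the reference length. A minimal sketch:

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edit distance / reference length."""
    r, h = reference.lower(), hypothesis.lower()
    prev = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rc != hc)))  # substitution
        prev = cur
    return prev[-1] / len(r)

print(cer("airplane mode on", ""))  # 1.0: nothing was recognized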


User study on ablation

We ablate the blurring components and gather volunteers' intelligibility ratings for the resulting audio samples. The results confirm that the perturbation of linguistic features is the most crucial factor affecting human perception.

Commands Linguistic Perturbation Only Parameter Beta Only Parameter Gamma Only Parameter Alpha Only All Mel Blurring
command1 2.5 3.5 3.67 3.33 3.5
command2 2.67 3.67 4 3.83 3.83
command3 2 4 4 3.33 3.33
command4 2.17 4 4 3 3.5
command5 3 4 4 3.5 3.33
Average 2.47 3.83 3.93 3.40 3.50




Robustness of commercial ASRs

We use 200 event-noise instances to evaluate the ASRs' robustness. The results are similar to the white-noise case, and Google is the most vulnerable ASR. (A sketch of the noise-mixing procedure follows the figures.)

Robustness to white noise

Robustness to event noise
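
These tests mix a noise clip into each command at a controlled SNR. A minimal sketch of such mixing, assuming both clips are float arrays at the same sample rate:

import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `signal`, scaled so the mix has the target SNR in dB."""
    noise = np.resize(noise, signal.shape)  # tile or trim the noise clip
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve for the scale k with 10*log10(p_signal / (k^2 * p_noise)) = snr_db.
    k = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + k * noise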





New commands

We choose 10 new commands ("Clear notification", "Play music", "What's the weather", "Take a picture", "Where is my car", "Tell me a story", "How old are you", "Sing me happy birthday", "Good morning", and "Where is my home") and generate attack examples under the setting α=0.3, β=22 Hz, γ=1, and η=0.1 (η applies only to OTA examples). We then attack the APIs (Amazon, Azure, iFLYTEK, and Tencent) in the digital domain and attack the voice assistants (Amazon Echo and Microsoft Cortana) over the air; the results are shown below. Results in the API-online and API-offline columns are digital-domain attacks; results in the Echo and Cortana columns are over-the-air (OTA) attacks. Commands used in OTA attacks are generated with the online method.

Note: Online (generating one command requires at most 50 queries) means that during the iterative command-generation process, we query the target API with the audio whenever a new, smaller loss is obtained. In contrast, offline (generating one command requires only 1 query) means we query the API only once, at the last iteration round, to check whether the generated command is valid.
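
A sketch of the two strategies, where perturb, local_loss, and asr_api are hypothetical placeholders (passed in as functions) for the candidate-update step, the locally computed loss, and the target ASR API:

def generate_online(audio, target_text, perturb, local_loss, asr_api,
                    rounds=1000, max_queries=50):
    """Query the target API each time the local loss improves (<= 50 queries)."""
    best_loss, queries = float("inf"), 0
    for _ in range(rounds):
        candidate = perturb(audio)                 # propose a new perturbation
        loss = local_loss(candidate, target_text)  # loss from the local pipeline
        if loss < best_loss:
            best_loss, audio = loss, candidate
            queries += 1
            if asr_api(audio) == target_text or queries >= max_queries:
                break                              # success, or budget spent
    return audio

def generate_offline(audio, target_text, perturb, local_loss, asr_api,
                     rounds=1000):
    """Iterate purely locally; spend the single API query at the very end."""
    for _ in range(rounds):
        candidate = perturb(audio)
        if local_loss(candidate, target_text) < local_loss(audio, target_text):
            audio = candidate
    return audio if asr_api(audio) == target_text else None  # 1 validity check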

Amazon-online Amazon-offline Azure-online Azure-offline iFLYTEK-online iFLYTEK-offline Tencent-online Tencent-offline
10/10 2/10 10/10 2/10 10/10 3/10 10/10 4/10




Echo Cortana
10/10 5/10




Impact of ambient noise

We assess the impact of ambient noise and find that our examples perform well at an SNR of about 20 (akin to soft speech) but decline as ambient noise increases. (α=0.3, β=22 Hz, γ=1, and η=0.1; the distance between the speaker (Marshall) and the VA is 15 cm.)

Voice assistants SNR~37 SNR~25 SNR~13 SNR~3
Echo 6/7 6/7 3/7 3/7
Cortana 7/7 5/7 3/7 2/7




Potential defense

In addition to the downsampling method presented in the paper, we test three other possible defense methods: applying a 4000 Hz low-pass filter, a 5000 Hz low-pass filter, or a 500 Hz high-pass filter to the audio. The results in the following table show that all three methods can effectively defend against our attack. The 500 Hz high-pass filter performs best, with only 1, 3, 2, and 0 attack commands accurately recognized by the corresponding ASRs. (A code sketch follows the table.)

Defense method Amazon Azure Tencent iFLYTEK
High-Pass Filter: 500 Hz 1/12 3/10 2/11 0/11
Low-Pass Filter: 4000 Hz 1/12 5/10 3/11 3/11
Low-Pass Filter: 5000 Hz 2/12 5/10 4/11 5/11
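
Such filters are straightforward to implement with standard DSP tooling. Below is a minimal sketch using a zero-phase Butterworth filter; the exact filter design used in our tests may differ.

from scipy.signal import butter, sosfiltfilt

def filter_defense(audio, sr=16000, cutoff_hz=500, kind="highpass", order=5):
    """Zero-phase Butterworth filtering of a 1-D audio array."""
    sos = butter(order, cutoff_hz, btype=kind, fs=sr, output="sos")
    return sosfiltfilt(sos, audio)

# The best-performing defense in the table: a 500 Hz high-pass filter.
# cleaned = filter_defense(recording, cutoff_hz=500, kind="highpass")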