Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English


Speech Communication 157 (2024) 103026

Available online 14 December 2023


Yunqi C. Zhang^a,∗^, Yusuke Hioka^a^, C.T. Justine Hui^a^, Catherine I. Watson^b^

^a^ Acoustics Research Centre, Department of Mechanical and Mechatronics Engineering, University of Auckland, Auckland, 1010, New Zealand

^b^ Department of Electrical, Computer and Software Engineering, University of Auckland, Auckland, 1010, New Zealand

Article info

Keywords: Speech intelligibility; Speech enhancement; Immersion condition; Mandarin; New Zealand English; Phonetic analysis

Abstract

Speech enhancement (SE) is a widely used technology to improve the quality and intelligibility of noisy speech. So far, SE algorithms have been designed and evaluated on native listeners only, and not on non-native listeners, who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and native Mandarin listeners with different immersion conditions in NZE under negative input signal-to-noise ratio (SNR) by conducting a subjective listening test with NZE sentences. The performance of the SE algorithms in terms of speech intelligibility in the three participant groups was investigated. The result showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group for higher input SNR conditions, possibly due to the increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR conditions. The SE algorithms tested in this study failed to improve, and in fact degraded, speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor to improve speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise.

1. Introduction

Speech intelligibility, or the ability to understand spoken language, is crucial to effective speech communication. In real-life listening situations, the listening environment is often imperfect, with various factors such as reverberation and background noise affecting the quality of speech signals (Lecumberri et al., 2010; Scharenborg and van Os, 2019). One of the most significant challenges is the degradation of speech caused by noise, which is largely quantified by the signal-to-noise ratio (SNR). Plomp (1977) found that at an SNR above 0 dB (i.e. the speech signal has equal or higher energy than noise) in a cocktail party situation, native listeners are able to understand sentences easily, even if the intelligibility of individual words is low. This phenomenon is attributed to the concept of “glimpse”, which refers to the ability of listeners to utilise gaps and fragments of speech in the presence of fluctuating noise to extract meaning in the speech (Moore, 2003; Srinivasan and Wang, 2005; Cooke, 2006; Lu and Cooke, 2009).

However, when the energy of noise surpasses that of the speech, i.e. negative SNR, native listeners are unable to benefit from such glimpses (Brungart, 2001; Ezzatian et al., 2010; Jin and Liu, 2012), resulting in a significant decline in speech intelligibility and leading to difficulties in real-life communication.

∗ Corresponding author.

Such extremely adverse listening conditions pose challenges not only for native (L1) listeners but also for non-native (L2) listeners. It has been consistently shown that L1 listeners outperform L2 listeners in understanding speech in noise, as they are more sensitive to acoustic-phonetic cues and more experienced in utilising such cues to facilitate their perception in noise (Flege et al., 1992; van Wijngaarden et al., 2002; Jia et al., 2006; Bradlow and Alexander, 2007; Lecumberri et al., 2010; Broersma and Scharenborg, 2010; Mattys et al., 2010; Jin and Liu, 2012; Mi et al., 2013; Borghini and Hazan, 2020). Furthermore, several researchers also investigated the perception and/or production performance of L2 listeners with different levels of immersion or language proficiency (Flege et al., 1992, 1997; Meador et al., 2000; Jia et al., 2006; Broersma and Scharenborg, 2010; Mattys et al., 2010; Mi et al., 2013; Kilman et al., 2014; Zhang et al., 2014). These studies discovered that L2 listeners with higher proficiency or who have been immersed/experienced in the target language were able to achieve higher intelligibility scores under noisy conditions.

https://doi.org/10.1016/j.specom.2023.103026

Received 7 June 2023; Received in revised form 5 November 2023; Accepted 10 December 2023

Additionally, listeners’ intelligibility is found to relate to their native language due to contrasts in phonetic familiarity, lexical structure, and acoustic structure between their native language and the target language, etc. (Polka, 1992; Lecumberri et al., 2010). Therefore, it is imperative to fix the language pair when investigating perception differences across languages, as testing across various language pairs may yield dissimilar results.

To address the challenge of understanding speech in noise, speech enhancement (SE) is a widely used technique to improve the quality and intelligibility of speech contaminated by noise (Loizou, 2007; Wang and Wang, 2013; Healy et al., 2013). Ideally, to improve the speech intelligibility of L2 listeners under adverse listening conditions, the perception gap or the L2 disadvantage between the L1 and L2 listeners should be diminished after applying SE algorithms. Whilst a variety of SE algorithms have been designed to improve the intelligibility of noisy speech, previous studies evaluating their performance have focused exclusively on the response from L1 listeners of the target language. To date, to the best of the authors’ knowledge, there is no SE algorithm specifically designed for the needs of L2 listeners. Additionally, most current SE algorithms were designed for and evaluated with moderately noisy speech, i.e. the SNR being positive (above 0 dB), leaving their performance under more severe SNR (below 0 dB) cases unknown.

Given that there are more than 400 million people learning English as a second language in Mainland China, and English has become a compulsory subject since primary school (Wang, 2015), it is worthwhile studying the perception of English sentences by native Mandarin listeners under adverse conditions, specifically, under negative SNR levels.

The study by Li et al. (2011) investigated the intelligibility of speech enhanced by single-channel SE algorithms by recruiting three groups of listeners with different native languages (including Mandarin). Nevertheless, in the study, participants were exclusively presented with stimuli in their native language, leaving the effect of SE algorithms on L2 listeners still unknown, which the current study addresses.

Our preliminary study (Zhang et al., 2022) investigated the performance of existing SE algorithms for native New Zealand English (NZE) listeners and native Mandarin listeners who had never been immersed in English-speaking environments. The result showed a significant difference in the intelligibility scores obtained by listeners with different immersion conditions, but the tested SE algorithms did not significantly improve intelligibility, and in some cases degraded it, compared to the unprocessed noisy speech. Moreover, the perception gap between the two groups was hardly narrowed, indicating the ineffectiveness of the SE algorithms on non-immersed L2 listeners. However, these two groups of participants were considered to have distinct differences between their immersion conditions, and many of the errors made by the non-immersed native Mandarin group would have been caused by unfamiliarity with the NZE accent. This is because most of the schools in Mainland China teach General American (GenAm) or British English, which matches the demographic data from the previous test. NZE differs from other varieties of English in its raised /e, æ/ vowels and in the lip rounding that accompanies the /ɜː/ vowel (Bauer et al., 2007; Maclagan et al., 2017). NZE is a non-rhotic variety like British English, and differs from rhotic GenAm in some specific vowels. For example, NZE merges the vowels /ɪə/ and /eə/, whereas GenAm merges the /ɒ/ and /oː/ vowels. Moreover, NZE distinguishes between the /ʌ/ and /æ/ vowels and between the /ɒ/ and /ʌ/ vowels, while GenAm distinguishes only the latter pair (Rogers, 2000; Hay et al., 2008).

According to a series of studies (Eisenstein and Verdi, 1985; Matsuura et al., 1999; Clopper and Bradlow, 2009), familiarity with an English dialect significantly influences L2 listeners’ speech intelligibility. This has been further proven by Hui et al. (2023) by conducting a subjective listening test on L2 listeners with different immersion conditions. The result showed that immersed L2 listeners who are familiar with the variety of English of the target language have higher intelligibility than non-immersed L2 listeners. Therefore, the findings for the performance of the SE algorithms in our previous study may not be exclusively influenced by the participants’ speech intelligibility in noise, but also by their familiarity with the accent of the target language, i.e. New Zealand English. Furthermore, due to its nature as a preliminary study (Zhang et al., 2022), the interactions between speech intelligibility, immersion condition, and SE algorithm were not extensively discussed, with a focus primarily on reporting experimental results.

To address the constraints or limitations inherent in our preliminary study, the current study is extended by including another group of native Mandarin listeners who arrived in New Zealand after the age of 15 and have been immersed in NZE for more than one year, i.e. more familiar with the NZE accent. By adding this group, we are able to investigate the speech perception abilities in noise of L2 listeners with different immersion conditions, and further investigate the effectiveness of SE algorithms on L2 listeners. This may provide more insights into the influence of immersion conditions on speech intelligibility and the potential benefits/drawbacks of SE algorithms in this population.

Moreover, the way we categorise the participants is improved in this study. As noted by Cheng et al. (2021) in their study, the definition of “nativeness” can be ambiguous and contentious, making cross-study comparisons challenging. This finding also reflects our participants as the native NZE group contains bilingual listeners, who are considered to be better described using immersion age compared to the ambiguous definition of “nativeness”. Therefore, we have revised the group descriptions used in this study. Specifically, we have re-categorised the native NZE group to the “early immersed” group to reflect their lifelong exposure to the NZE accent, and the Mandarin-speaking groups have been renamed as the “late immersed” and “non-immersed” groups, respectively, to reflect their levels of exposure to the NZE accent. These terminologies are adopted from the research by Hui et al. (2022).

In summary, the current study aims to gain insight into how the presence and absence of immersion may impact the effectiveness of speech enhancement algorithms for L2 listeners. To address the limitations of the previous study (Zhang et al., 2022), we explore the interaction between speech intelligibility, immersion condition, and SE algorithm in greater detail. The findings are analysed and discussed regarding the impact on speech perception abilities in noise, potential differences between the effectiveness of SE algorithms among different immersion conditions, and the possible causes of such enhancement/degradation. In addition, a phonetic analysis is performed to gain some insights into how listeners with different levels of immersion perceive English sentences with an NZE accent. Building on our previous findings, we hypothesise that the early-immersed group always performs significantly better than the other groups and the SE algorithms will prove ineffective in improving the intelligibility for all participant groups. We also hypothesise that the late-immersed Mandarin group will perform better than the non-immersed Mandarin group due to the fact that they are more experienced in listening to English speech in noisy environments. In terms of phonetic accuracy, we anticipate fewer nucleus confusions from the late-immersed group compared to the non-immersed group due to their increased familiarity with NZE accents.

2. Methodology

An online subjective listening test was conducted to measure the speech intelligibility of participants with different immersion conditions when listening to noisy speech enhanced by different speech enhancement algorithms. This section presents a comprehensive overview of the test design and data analysis methods to address the research questions.


2.1. Participants

All participants were either at university or had received a university-based education. They were recruited to be in one of three groups:

• Early-immersed group (NZE group): 20 NZE listeners (mean age = 23.3, sd = 4.45, 6 female, 14 male) who had arrived in New Zealand before the age of 12.

• Late-immersed group (NZM group): 19 Mandarin listeners (mean age = 26.5, sd = 3.69, 9 female, 10 male) who had been immersed in New Zealand English for more than one year after the age of 15.

• Non-immersed group (CM group): 19 Mandarin listeners (mean age = 23.5, sd = 1.68, 10 female, 8 male, 1 preferred not to say) residing in Mainland China who had never been immersed in English-speaking environments.

Among the participants in the NZE group, ten out of the 20 participants were monolingual speakers of NZE and the rest were heritage multilingual. The participants had lived in New Zealand for 5–10 years (n = 2), 10 years and longer (n = 18). One participant reported that they lived in Sweden until 15 years old and moved to New Zealand and lived there since then. English was reported as their first and home language. The response of this participant showed no abnormality compared to the others.

The Mandarin groups were asked to report any English proficiency certificates they had obtained. The international certificates they reported included the International English Language Testing System (IELTS) and the Test of English as a Foreign Language (TOEFL). Regarding participants who pursued their tertiary study in China, most tertiary students in Mainland China are required to pass at least one English certificate, including the Test of English Majors (TEM) Band 4 and 8 for English major students and the College English Test (CET) Band 4 and 6 for the rest. A higher Band number indicates greater difficulty.

For the NZM group, all participants reported their first and home language as Mandarin. Ten participants had passed either IELTS (n = 8) or CET 6 (n = 2), and the rest had attended universities in NZ, with one exception: a participant who had lived in New Zealand for 7 years had neither attended an NZ university nor received any English proficiency certificate. This participant showed an abnormally low intelligibility score but was still included in the analysis. Only two of the participants reported their work language as Mandarin, whereas the others reported English or both English and Mandarin. The participants self-reported that they spoke NZE (n = 13), American English (n = 5), or British English (n = 1). The number of years that the participants had lived in New Zealand was: 1–2 years (n = 1), 2–5 years (n = 3), 5–10 years (n = 12), 10 years and longer (n = 3).

For the CM group, two participants reported their L1 as Cantonese and the rest as Mandarin, but all reported themselves to be proficient in Mandarin. Only two participants did not have certificates; the rest had passed at least one of CET 4 (n = 2), CET 6 (n = 9), TEM 8 (n = 1), IELTS (n = 4, two scored 6.5 and two did not report a score), and TOEFL (n = 1). All participants from the CM group reported they had learnt British (n = 6) and/or American English (n = 13).

2.2. Speech enhancement algorithms

The SE algorithms tested were chosen based on three criteria: (1) having been widely used during different periods of SE technology development, (2) being based on a fundamental core concept rather than being a complicated extension, and (3) having an easily accessible original implementation. Considering the limited time available for each participant, only a limited number of SE algorithms could be examined. Hence, for machine learning-based algorithms, this study only focused on those that are based on discriminative training and rely on conventional training criteria. Overall, five single-channel SE algorithms were selected for the test, which were:

• A priori Wiener filter (WF) (Scalart and Filho, 1996)

• Generalised subspace (SS) (Hu and Loizou, 2003)

• Unsupervised Bayesian non-negative matrix factorisation (NMF) (Mohammadiha et al., 2013)

• Conv-TasNet (Conv) (Luo and Mesgarani, 2019)

• Deep complex U-net (Unet) (Choi et al., 2019)

Implementations of the algorithms were publicly available, where WF and SS were implemented by the MATLAB code from Loizou (2007), the code for NMF and Unet were provided by the authors of their literature, and Conv was applied through Asteroid (Pariente et al., 2020).

To optimise the SE performance, some parameters of the algorithms were adjusted, except the end-to-end trained Conv and Unet. WF had its smoothing factors, β and μ (Scalart and Filho, 1996), set to 0.96 and 0.99, respectively. For SS, the smoothing factors of β and μ (Hu and Loizou, 2003) were set to 0.96 and 0.99, respectively. For NMF, the main buffer N₁ (Mohammadiha et al., 2013) was set to 50.
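To give a feel for what a smoothing factor does in an a priori SNR Wiener filter, the following is a minimal sketch of the textbook decision-directed rule for a single frequency bin. It is not the MATLAB implementation used in the study. It assumes, for illustration only, that β acts as the decision-directed smoothing factor and that the noise power is known and constant; the noise-update smoothing (μ) is not modelled, and the function name `wiener_gains` is hypothetical.

```python
def wiener_gains(noisy_power, noise_power, beta=0.96):
    """Per-frame Wiener gains for one frequency bin (illustrative sketch).

    noisy_power: sequence of |Y|^2 values, one per frame
    noise_power: assumed known, constant noise power for this bin
    beta:        assumed decision-directed smoothing factor
    """
    gains = []
    prev_clean_power = None  # estimated clean-speech power of the previous frame
    for p in noisy_power:
        gamma = p / noise_power            # a posteriori SNR
        ml_term = max(gamma - 1.0, 0.0)    # instantaneous a priori SNR estimate
        if prev_clean_power is None:
            xi = ml_term                   # first frame: no history to smooth with
        else:
            xi = beta * prev_clean_power / noise_power + (1.0 - beta) * ml_term
        gain = xi / (1.0 + xi)             # Wiener gain from the a priori SNR
        gains.append(gain)
        prev_clean_power = (gain ** 2) * p
    return gains


# Two noise-only frames followed by two frames where speech dominates.
example_gains = wiener_gains([1.0, 1.0, 100.0, 100.0], noise_power=1.0)
```

With β close to 1 the gain reacts slowly to a speech onset (the third frame is attenuated more than the fourth), which reduces musical noise but can smear weak speech cues.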

2.3. Stimuli

Since the study involves late- and non-immersed listeners, the sentences should not be so complex that the listeners’ unfamiliarity with words or grammar is confounded with the effects of the SE algorithms or noise, and they need to be in the native language of the reference group (NZE group). Hence, Bamford-Kowal-Bench (BKB) sentences in the Speech Perception Assessment New Zealand (SPANZ) corpus (Kim and Purdy, 2014) were used, which are semantically meaningful with simple syntactic structure and frequently used words. The original BKB corpus was designed for hearing-impaired children (Bench et al., 1979) and has been utilised in a number of studies on the perception of L2 listeners (Mayo et al., 1997; Van Engen, 2010; Calandruccio and Smiljanic, 2012; Hui et al., 2022). The SPANZ corpus is specifically designed for studies using New Zealand English as the target language. The corpus re-recorded 15 widely-used speech materials in an NZE accent, and words were modified to the common expressions in NZ (e.g. “vacation” was replaced by “holiday”). Each sentence contains three to four keywords to be marked, i.e. words other than the keywords are not marked. For example, for the sentence “The dog went for a walk”, the keywords are “dog”, “went”, and “walk”. In our prior study (Zhang et al., 2023), a separate group of non-immersed Mandarin listeners (n = 9) transcribed clean speech from the SPANZ BKB corpus with an accuracy of 89.56%, supporting the validity of the corpus.

The clean speech was contaminated by either stationary speech-shaped noise (SSN) or non-stationary babble noise recorded at the sampling rate of 16 kHz. The SSN was white noise whose spectrum was shaped by the power spectrum of the sum of 288 SPANZ BKB sentences. The babble noise was from the NOISEX-92 corpus (Varga and Steeneken, 1993). Since the target sentences differed in length (3–4 s), the noise signals were truncated to the same length as the speech signal, where the starting point of the noise signals was fixed.

The noise was added to the clean speech at specific input signal-to-noise ratio (SNR) levels to generate the noisy speech. To avoid significant ceiling/floor effects across all participant groups, input SNR levels of 0, −3, −6 dB for the babble noise, and −3, −6, −9 dB for the SSN were chosen based on a pilot test. The sound level of the target speech sentences remained the same regardless of the input SNR levels.
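The mixing step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the SNR is assumed to be computed from the RMS level over the whole utterance, the noise is assumed to be at least as long as the speech, and the names `rms` and `mix_at_snr` are hypothetical. Only the noise is rescaled, so the speech level stays fixed across SNR conditions.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested input SNR, scaling the noise only."""
    noise = noise[:len(speech)]  # truncate from a fixed starting point
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals standing in for a sentence and a noise recording at 16 kHz.
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(20000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-6)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

Because the speech is untouched, lower SNR conditions are louder overall, which is consistent with the −9 dB condition being the loudest case in the test.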

The noisy speech was processed by the SE algorithms mentioned in Section 2.2 to obtain the enhanced speech. Even though some SE algorithms may adjust the sound level of the input signal, no manipulation was conducted to the output signals. Besides the enhanced speech, the original noisy speech signals were also included in the test as the baseline reference. Speech signals processed by the same SE algorithm under a certain input SNR level were considered as one condition and each condition was repeated three times with randomly selected different sentences. Therefore, each participant listened to 108 sentences ((5 SE algorithms + 1 noisy speech) × 2 noise types × 3 input SNR levels × 3 repetitions) in the test.

Table 1

Average improvements of SDR (dB), SI-SNR, PESQ, and STOI for each SE algorithm. The improvement in each objective metric is calculated by (enhanced speech measurement - noisy speech measurement). Values in bold indicate improved cases.

| SE | Babble noise: SNR | ΔSDR | ΔPESQ | ΔSI-SNR | ΔSTOI | Speech-shaped noise: SNR | ΔSDR | ΔPESQ | ΔSI-SNR | ΔSTOI |
|---|---|---|---|---|---|---|---|---|---|---|
| Conv | 0 | 7.92 | 0.49 | 8.97 | 0.13 | −3 | 9.77 | 0.44 | 10.23 | 0.10 |
| Conv | −3 | 7.46 | 0.32 | 8.74 | 0.14 | −6 | 6.82 | 0.26 | 7.66 | 0.11 |
| Conv | −6 | 5.59 | 0.14 | 6.98 | 0.10 | −9 | 3.63 | 0.14 | 3.96 | 0.08 |
| NMF | 0 | 0.24 | 0.02 | 0.61 | −0.03 | −3 | −0.41 | 0.04 | −0.13 | −0.03 |
| NMF | −3 | 0.88 | 0.03 | 2.43 | −0.01 | −6 | 0.01 | 0.07 | −0.14 | −0.02 |
| NMF | −6 | 1.45 | 0.06 | 3.26 | 0.002 | −9 | 0.22 | 0.16 | −0.83 | −0.005 |
| SS | 0 | 3.59 | 0.01 | 4.24 | 0.01 | −3 | 4.79 | 0.05 | 5.57 | 0.03 |
| SS | −3 | 3.13 | −0.02 | 3.62 | −0.001 | −6 | 4.50 | −0.01 | 5.27 | 0.02 |
| SS | −6 | 2.52 | −0.04 | 2.41 | −0.02 | −9 | 3.98 | −0.04 | 4.37 | 0.007 |
| Unet | 0 | 2.54 | 0.22 | 8.24 | 0.07 | −3 | 4.52 | 0.25 | 8.29 | 0.08 |
| Unet | −3 | 3.02 | 0.08 | 7.91 | 0.06 | −6 | 4.83 | 0.06 | 7.50 | 0.08 |
| Unet | −6 | 2.65 | 0.00 | 6.45 | 0.002 | −9 | 4.32 | −0.04 | 5.74 | 0.3 |
| WF | 0 | 2.29 | 0.01 | 2.69 | 0.001 | −3 | 4.39 | 0.15 | 5.22 | 0.03 |
| WF | −3 | 2.17 | −0.02 | 2.51 | −0.003 | −6 | 4.00 | 0.08 | 4.69 | 0.03 |
| WF | −6 | 2.00 | −0.03 | 2.14 | −0.01 | −9 | 3.44 | 0.04 | 3.76 | 0.02 |

Fig. 1. A screenshot of the GUI used in the online test.

To quantitatively evaluate the performance of the SE algorithms both from objective and subjective points of view, the improvements in the signal-to-distortion ratio (SDR) (Vincent et al., 2006), perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), scale-invariant SNR (SI-SNR) (Luo and Mesgarani, 2018), and the short-time objective intelligibility (STOI) (Taal et al., 2011) were also measured. These four objective metrics have been widely used by researchers to assess the effectiveness of SE algorithms, and most of the SE algorithms tested in this experiment have previously been evaluated with them. Table 1 presents the mean differences in these metrics before and after applying the SE algorithms. The results show that Conv achieved the largest improvement in most conditions. Declines in speech quality and intelligibility, as determined by PESQ and STOI, respectively, occurred for the other SE algorithms; however, such occurrences were infrequent. Overall, most of the SE algorithms were able to reduce the amount of noise and/or improve the quality or intelligibility of noisy speech objectively, which supports our purpose of investigating their performance by subjective listening test.
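As an illustration of how an improvement of the kind reported in Table 1 is obtained, the sketch below computes SI-SNR using its commonly used definition (zero-mean signals, projection of the estimate onto the reference) and then takes the difference between the enhanced and noisy scores. The toy signals and the function names `si_snr` and `si_snr_improvement` are assumptions for this example; the study's actual evaluation scripts are not described in the text.

```python
import math


def si_snr(estimate, reference):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    n = len(reference)
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [e - mean_est for e in estimate]
    ref = [r - mean_ref for r in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10 * math.log10(target_energy / residual_energy)


def si_snr_improvement(clean, noisy, enhanced):
    """Improvement = enhanced-speech measurement minus noisy-speech measurement."""
    return si_snr(enhanced, clean) - si_snr(noisy, clean)


# Toy example: an "enhancer" that halves the amplitude of the interfering tone.
N = 1000
clean = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interference = [math.sin(2 * math.pi * 50 * t / N) for t in range(N)]
noisy = [c + i for c, i in zip(clean, interference)]
enhanced = [c + 0.5 * i for c, i in zip(clean, interference)]

delta = si_snr_improvement(clean, noisy, enhanced)
```

Halving the interference amplitude corresponds to an improvement of about 6 dB, and rescaling the enhanced signal leaves the score unchanged, which is the property that makes the metric scale-invariant.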

2.4. Test procedure

The test was developed with PsychoPy (Peirce et al., 2019) and was made available online through Pavlovia. The participants were asked to type, using a keyboard, the sentences played through headphones into the GUI shown in Fig. 1. A practice page was given for the participants to adjust the sound level and become familiar with the test.

They were instructed to adjust the sound level of their device to the maximum level that they deemed comfortable and were not allowed to change the setting during the formal test. To avoid the sound level of sentences with low input SNR levels being too noisy to tolerate, an unprocessed noisy signal at−9 dB input SNR, which is the case with the highest possible sound level, was included in the practice.

Demographic information of the participants was collected through a questionnaire, including gender, age, first language, variety of English spoken, language background, years of residence in New Zealand (for NZE and NZM groups), proficient language(s), educational background, English language certificates (for NZM and CM groups), and whether they experienced hearing loss.

During the formal test, each participant listened to 108 sentences as summarised in Section 2.3. Every sentence was played automatically and was not repeatable. The same set of sentences was used for every participant, but with random parameter combinations, and the sentences were played in an arbitrary order. The duration of the test was around 30 min excluding the break, and the participants were given a break after listening to 54 sentences, with the freedom to take as much time as required.
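One way to realise such a design is sketched below: the 36 conditions are enumerated, each is repeated three times, and the resulting 108 slots are randomly paired with the sentences before the presentation order is shuffled. This is a hypothetical reconstruction for illustration; the actual PsychoPy randomisation is not described in the text, and the function name `build_trials` and the per-participant seed are assumptions.

```python
import random

ALGORITHMS = ["noisy", "WF", "SS", "NMF", "Conv", "Unet"]  # baseline + 5 SE algorithms
SNR_LEVELS = {"babble": [0, -3, -6], "SSN": [-3, -6, -9]}   # dB, per noise type
REPETITIONS = 3


def build_trials(sentence_ids, seed):
    """Pair every sentence with a (noise, SNR, algorithm) condition in random order."""
    sentence_ids = list(sentence_ids)
    conditions = [
        (noise, snr, algorithm)
        for noise, snrs in SNR_LEVELS.items()
        for snr in snrs
        for algorithm in ALGORITHMS
    ]
    slots = conditions * REPETITIONS
    if len(sentence_ids) != len(slots):
        raise ValueError("need exactly one sentence per condition repetition")
    rng = random.Random(seed)        # hypothetical per-participant seed
    rng.shuffle(slots)               # random parameter combination per sentence
    trials = list(zip(sentence_ids, slots))
    rng.shuffle(trials)              # arbitrary presentation order
    return trials


trials = build_trials(range(1, 109), seed=0)
```

Each condition then appears exactly three times per participant, each time with a different sentence, while the mapping between sentences and conditions differs between participants.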

The participants were rewarded with koha (a New Zealand Māori custom which means gift, donation or contribution) as a token of appreciation.

The procedure was approved by the University of Auckland Human Participants Ethics Committee (24202).

2.5. Marking rubric

Responses of participants who self-reported that they had no hearing impairment at the time of the experiment were marked manually based on the BKB corpus’ marking criteria (Bench et al., 1979). The root of the word was marked instead of the whole word, e.g. “buying” and “bought” were both marked correct for the keyword “buy”. Homophone responses, such as “sun” for “son” or “to” for “two”, were not penalised. To quantify the speech intelligibility of the participants, the ratio of correct words (proportion correct) was calculated for each participant under each condition, which included three sentences as mentioned in Section 2.3.
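Although the marking in the study was done manually, the rubric can be sketched in code to make the scoring rule concrete. The lookup tables below are small hypothetical stand-ins for a marker's judgement (they contain only the examples mentioned in the text), the second test sentence is invented for illustration, and the function names are assumptions.

```python
import re

# Hypothetical lookup tables standing in for the manual marker's judgement.
ROOTS = {"buying": "buy", "bought": "buy", "buys": "buy"}
HOMOPHONES = {"sun": "son", "to": "two", "too": "two"}


def normalise(word):
    """Map a word to its root, then to a canonical homophone spelling."""
    word = word.lower()
    word = ROOTS.get(word, word)
    return HOMOPHONES.get(word, word)


def proportion_correct(keyword_lists, responses):
    """Fraction of keywords found in the typed responses for one condition."""
    correct = 0
    total = 0
    for keywords, response in zip(keyword_lists, responses):
        heard = {normalise(w) for w in re.findall(r"[a-z']+", response.lower())}
        for keyword in keywords:
            total += 1
            if normalise(keyword) in heard:
                correct += 1
    return correct / total


# First keyword set is from the text; the second sentence is an invented example.
keywords = [["dog", "went", "walk"], ["son", "buy", "two"]]
responses = ["the dog went for a work", "the sun bought to apples"]
score = proportion_correct(keywords, responses)
```

In this example five of the six keywords are credited: “walk” is missed, while “sun”, “bought”, and “to” are accepted under the homophone and root rules.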

2.6. Statistical analysis

The speech intelligibility score, indicated by the proportion correct obtained by the participants, was analysed using a linear mixed effect (LME) model in R with the lme4 package (Bates et al., 2015). Since the experiment involved two types of noise with different input SNR levels, a separate model was developed for each noise type. The step function in the lmerTest package (Kuznetsova et al., 2017) was used to obtain the best-fitting model. The fixed effects were immersion condition (participant group, i.e. NZE, NZM, and CM), input SNR, and SE algorithm, where the baseline (i.e. the unprocessed noisy speech) was also included. The random effect was the participant ID. A likelihood ratio test was conducted to check the significance of the fixed effects.

Post-hoc pairwise comparisons between models were achieved by the emmeans package (Lenth et al., 2022) with the p-values adjusted by the Tukey HSD method. Results with p-values being less than 0.01 were considered significant.

2.7. Phonetic analysis

The phonetic analysis summarises the frequent errors made by the participants in each group. All keywords that received incorrect answers were collected since the same phonetic confusion may occur in different response words.

A phoneme that occurs at different locations in a word can have various pronunciations depending on its surrounding phoneme environment, which is known as co-articulation (Daniloff and Hammarberg, 1973). Therefore, every syllable was divided into its phonetic components: onset, nucleus, and coda, to record the location of the phonemes to eliminate the ambiguity caused by co-articulation. Phonetic components of multi-syllable words were numbered according to the syllables they were in. The omission of a whole syllable and the primary stress in a multi-syllable keyword were also recorded.

Fig. 2. Predicted proportion correct from linear mixed effect model under different noise types in terms of immersion condition separated by algorithms. The error bars represent the 95% confidence intervals.

Table 2

Pairwise contrasts of speech intelligibility proportion correct scores among NZE, NZM, and CM groups. The entries for SSN are reduced as SE algorithm and immersion condition have no significant interaction according to the model.

| Noise | Algorithm | SNR | NZE - CM Estimate | t.ratio | p.value | NZE - NZM Estimate | t.ratio | p.value | NZM - CM Estimate | t.ratio | p.value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Babble | noisy | 0 | 0.53 | 11.47 | <.0001 | 0.36 | 7.77 | <.0001 | 0.17 | 3.65 | <.001 |
| Babble | Conv | 0 | 0.41 | 8.8 | <.0001 | 0.26 | 5.7 | <.0001 | 0.14 | 3.05 | <0.01 |
| Babble | Unet | 0 | 0.33 | 7.05 | <.0001 | 0.18 | 3.96 | <.001 | 0.14 | 3.06 | <0.01 |
| Babble | NMF | 0 | 0.47 | 10.15 | <.0001 | 0.34 | 7.26 | <.0001 | 0.13 | 2.86 | 0.01 |
| Babble | SS | 0 | 0.51 | 11.11 | <.0001 | 0.33 | 7.21 | <.0001 | 0.18 | 3.85 | <.001 |
| Babble | WF | 0 | 0.53 | 11.53 | <.0001 | 0.29 | 6.33 | <.0001 | 0.24 | 5.13 | <.0001 |
| Babble | noisy | −3 | 0.42 | 8.98 | <.0001 | 0.31 | 6.73 | <.0001 | 0.1 | 2.23 | 0.07 |
| Babble | Conv | −3 | 0.29 | 6.31 | <.0001 | 0.22 | 4.65 | <.0001 | 0.08 | 1.63 | 0.23 |
| Babble | Unet | −3 | 0.21 | 4.57 | <.0001 | 0.13 | 2.91 | 0.01 | 0.08 | 1.64 | 0.23 |
| Babble | NMF | −3 | 0.36 | 7.66 | <.0001 | 0.29 | 6.21 | <.0001 | 0.07 | 1.44 | 0.32 |
| Babble | SS | −3 | 0.4 | 8.62 | <.0001 | 0.29 | 6.16 | <.0001 | 0.11 | 2.43 | 0.04 |
| Babble | WF | −3 | 0.42 | 9.04 | <.0001 | 0.24 | 5.29 | <.0001 | 0.17 | 3.71 | <.001 |
| Babble | noisy | −6 | 0.31 | 6.6 | <.0001 | 0.24 | 5.2 | <.0001 | 0.06 | 1.38 | 0.35 |
| Babble | Conv | −6 | 0.18 | 3.93 | <.001 | 0.15 | 3.13 | <0.01 | 0.04 | 0.78 | 0.71 |
| Babble | Unet | −6 | 0.1 | 2.19 | 0.08 | 0.06 | 1.39 | 0.35 | 0.04 | 0.79 | 0.71 |
| Babble | NMF | −6 | 0.24 | 5.28 | <.0001 | 0.22 | 4.69 | <.0001 | 0.03 | 0.59 | 0.83 |
| Babble | SS | −6 | 0.29 | 6.24 | <.0001 | 0.21 | 4.64 | <.0001 | 0.07 | 1.58 | 0.26 |
| Babble | WF | −6 | 0.31 | 6.66 | <.0001 | 0.17 | 3.77 | <.001 | 0.13 | 2.86 | 0.01 |
| SSN | | −3 | 0.44 | 10.6 | <.0001 | 0.26 | 6.12 | <.0001 | 0.19 | 4.42 | <.0001 |
| SSN | | −6 | 0.45 | 10.68 | <.0001 | 0.27 | 6.55 | <.0001 | 0.17 | 4.07 | <.001 |
| SSN | | −9 | 0.31 | 7.42 | <.0001 | 0.21 | 5.07 | <.0001 | 0.1 | 2.32 | 0.06 |
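The onset/nucleus/coda bookkeeping can be illustrated with a small sketch. It assumes each syllable is supplied as a list of phoneme symbols with exactly one vowel or diphthong symbol, and that the target and response are compared component by component. The vowel inventory, the function names, and the “walk” versus “work” example are hypothetical and are not taken from the study's data.

```python
# Hypothetical, deliberately small vowel inventory for the example.
VOWELS = {"iː", "ɪ", "e", "æ", "ʌ", "ɒ", "oː", "ʊ", "uː", "ɜː", "ə", "ɪə", "eə", "aɪ"}


def split_syllable(phonemes):
    """Split one syllable into onset, nucleus, and coda."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if len(vowel_positions) != 1:
        raise ValueError("expected exactly one nucleus per syllable")
    i = vowel_positions[0]
    return {
        "onset": tuple(phonemes[:i]),
        "nucleus": phonemes[i],
        "coda": tuple(phonemes[i + 1:]),
    }


def confusions(target, response):
    """List the components where the response syllable differs from the target."""
    t = split_syllable(target)
    r = split_syllable(response)
    return [(part, t[part], r[part]) for part in ("onset", "nucleus", "coda") if t[part] != r[part]]


# Invented example: keyword "walk" transcribed as "work" differs only in the nucleus.
walk = ["w", "oː", "k"]
work = ["w", "ɜː", "k"]
nucleus_error = confusions(walk, work)
```

Recording errors per component in this way makes it possible to count, for instance, how often each group confuses a nucleus while getting the onset and coda right.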

3. Result

Both LME models, one for each noise type, had significant two-way interactions between variables according to the likelihood ratio comparison. For the babble noise, the interactions were between the SE algorithm and input SNR (χ²(10) = 4763, p < .0001), between immersion condition (NZE, NZM, and CM groups) and SE algorithm (χ²(10) = 4778.3, p < .0001), and between input SNR and immersion condition (χ²(4) = 4786.5, p < .0001). For SSN, the interactions were between the SE algorithm and input SNR (χ²(10) = 4701.6, p < .0001) and between input SNR and immersion condition (χ²(4) = 4697.5, p < .0001).
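As a reading aid, the structure implied by these interactions for the babble-noise model can be written out as below. This is a sketch inferred from the reported terms, not a formula given by the authors: the symbols are introduced here for illustration, and treating input SNR as a three-level categorical factor is an assumption, although it is consistent with the reported degrees of freedom.

```latex
% Sketch of the babble-noise model: participant p, group g, SNR level s, algorithm a
y_{pgsa} = \mu + G_g + S_s + A_a + (AS)_{as} + (GA)_{ga} + (SG)_{sg} + u_p + \varepsilon_{pgsa},
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{pgsa} \sim \mathcal{N}(0, \sigma^2)

% Degrees of freedom of each interaction with 6 algorithm levels (5 SE + noisy),
% 3 SNR levels, and 3 groups:
\mathrm{df}_{AS} = (6-1)(3-1) = 10, \qquad
\mathrm{df}_{GA} = (3-1)(6-1) = 10, \qquad
\mathrm{df}_{SG} = (3-1)(3-1) = 4
```

Under the same assumptions, the SSN model has the same form without the $(GA)_{ga}$ term, since no significant interaction between immersion condition and SE algorithm was reported for that noise type.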

3.1. Relationship between immersion condition and speech intelligibility

Fig. 2 shows the linear prediction of the proportion correct in terms of the immersion condition of the participants separated by SE algorithms. Each subplot represents the performance of an SE algorithm on different groups of participants, where the groups are distinguished by different colours and the error bars display the 95% confidence interval. The plot also allows observation of trends in speech intelligibility among participant groups and different input SNR levels. The


Table 3

Pairwise contrasts of speech intelligibility proportion correct scores among 5 SE algorithms (the original noisy speech is included as a reference). The entries for SSN are reduced as SE algorithm and immersion condition have no significant interaction according to the model.

| Contrast | Babble SNR | NZE Estimate | NZE t.ratio | NZE p.value | NZM Estimate | NZM t.ratio | NZM p.value | CM Estimate | CM t.ratio | CM p.value | SSN SNR | SSN Estimate | SSN t.ratio | SSN p.value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| noisy - Conv | 0 | 0.1 | 2.44 | 0.14 | 0 | 0.03 | 1 | −0.03 | −0.65 | 0.99 | −3 | 0.12 | 3.56 | <0.01 |
| noisy - Unet | 0 | 0.21 | 5.25 | <.0001 | 0.03 | 0.8 | 0.97 | 0.01 | 0.12 | 1 | −3 | 0.09 | 2.61 | 0.1 |
| noisy - NMF | 0 | 0.16 | 3.9 | <0.01 | 0.13 | 3.25 | 0.02 | 0.09 | 2.33 | 0.18 | −3 | 0.18 | 5.15 | <.0001 |
| noisy - SS | 0 | 0 | 0.1 | 1 | −0.02 | −0.55 | 0.99 | −0.01 | −0.31 | 1 | −3 | 0.05 | 1.44 | 0.7 |
| noisy - WF | 0 | 0.03 | 0.69 | 0.98 | −0.04 | −0.97 | 0.93 | 0.03 | 0.75 | 0.97 | −3 | 0.04 | 1 | 0.92 |
| Conv - Unet | 0 | 0.11 | 2.81 | 0.06 | 0.03 | 0.77 | 0.97 | 0.03 | 0.78 | 0.97 | −3 | −0.03 | −0.95 | 0.93 |
| Conv - NMF | 0 | 0.06 | 1.46 | 0.69 | 0.13 | 3.21 | 0.02 | 0.12 | 2.99 | 0.03 | −3 | 0.06 | 1.6 | 0.6 |
| Conv - SS | 0 | −0.09 | −2.34 | 0.18 | −0.02 | −0.58 | 0.99 | 0.01 | 0.34 | 1 | −3 | −0.07 | −2.12 | 0.28 |
| Conv - WF | 0 | −0.07 | −1.75 | 0.5 | −0.04 | −1 | 0.92 | 0.06 | 1.41 | 0.72 | −3 | −0.09 | −2.55 | 0.11 |
| Unet - NMF | 0 | −0.05 | −1.35 | 0.75 | 0.1 | 2.44 | 0.14 | 0.09 | 2.21 | 0.23 | −3 | 0.09 | 2.54 | 0.11 |
| Unet - SS | 0 | −0.21 | −5.15 | <.0001 | −0.05 | −1.35 | 0.76 | −0.02 | −0.44 | 1 | −3 | −0.04 | −1.17 | 0.85 |
| Unet - WF | 0 | −0.18 | −4.56 | <.0001 | −0.07 | −1.77 | 0.49 | 0.03 | 0.63 | 0.99 | −3 | −0.06 | −1.61 | 0.59 |
| NMF - SS | 0 | −0.15 | −3.79 | <0.01 | −0.15 | −3.79 | <0.01 | −0.11 | −2.64 | 0.09 | −3 | −0.13 | −3.71 | <0.01 |
| NMF - WF | 0 | −0.13 | −3.21 | 0.02 | −0.17 | −4.21 | <.001 | −0.06 | −1.58 | 0.61 | −3 | −0.15 | −4.15 | <.001 |
| SS - WF | 0 | 0.02 | 0.59 | 0.99 | −0.02 | −0.42 | 1 | 0.04 | 1.07 | 0.89 | −3 | −0.02 | −0.44 | 1 |
| noisy - Conv | −3 | 0.18 | 4.57 | <.0001 | 0.09 | 2.14 | 0.27 | 0.06 | 1.45 | 0.7 | −6 | 0.05 | 1.46 | 0.69 |
| noisy - Unet | −3 | 0.36 | 9.06 | <.0001 | 0.18 | 4.55 | <.0001 | 0.16 | 3.87 | <0.01 | −6 | 0.09 | 2.6 | 0.1 |
| noisy - NMF | −3 | 0.18 | 4.52 | <.001 | 0.16 | 3.86 | <0.01 | 0.12 | 2.94 | 0.04 | −6 | 0.11 | 3.28 | 0.01 |
| noisy - SS | −3 | 0.1 | 2.42 | 0.15 | 0.07 | 1.74 | 0.51 | 0.08 | 1.97 | 0.36 | −6 | 0 | 0.02 | 1 |
| noisy - WF | −3 | 0.09 | 2.29 | 0.2 | 0.02 | 0.61 | 0.99 | 0.09 | 2.33 | 0.18 | −6 | −0.02 | −0.59 | 0.99 |
| Conv - Unet | −3 | 0.18 | 4.49 | <.001 | 0.1 | 2.42 | 0.15 | 0.1 | 2.42 | 0.15 | −6 | 0.04 | 1.14 | 0.86 |
| Conv - NMF | −3 | 0 | −0.05 | 1 | 0.07 | 1.72 | 0.52 | 0.06 | 1.5 | 0.67 | −6 | 0.06 | 1.82 | 0.46 |
| Conv - SS | −3 | −0.09 | −2.15 | 0.26 | −0.02 | −0.4 | 1 | 0.02 | 0.52 | 1 | −6 | −0.05 | −1.44 | 0.7 |
| Conv - WF | −3 | −0.09 | −2.28 | 0.2 | −0.06 | −1.53 | 0.65 | 0.04 | 0.88 | 0.95 | −6 | −0.07 | −2.05 | 0.32 |
| Unet - NMF | −3 | −0.18 | −4.54 | <.0001 | −0.03 | −0.7 | 0.98 | −0.04 | −0.93 | 0.94 | −6 | 0.02 | 0.68 | 0.98 |
| Unet - SS | −3 | −0.26 | −6.64 | <.0001 | −0.11 | −2.82 | 0.06 | −0.08 | −1.9 | 0.4 | −6 | −0.09 | −2.58 | 0.1 |

Unet - WF −3 −0.27 −6.77 <.0001 −0.16 −3.95 <0.01 −0.06 −1.54 0.64 −6 −0.11 −3.19 0.02

NMF - SS −3 −0.08 −2.1 0.29 −0.09 −2.12 0.28 −0.04 −0.97 0.93 −6 −0.11 −3.26 0.01

NMF - WF −3 −0.09 −2.23 0.23 −0.13 −3.25 0.02 −0.02 −0.62 0.99 −6 −0.14 −3.86 <0.01

SS - WF −3 −0.01 −0.13 1 −0.05 −1.13 0.87 0.01 0.36 1 −6 −0.02 −0.61 0.99

noisy - Conv −6 0.15 3.74 <0.01 0.05 1.31 0.78 0.03 0.62 0.99 −9 0.16 4.55 <.0001

noisy - Unet −6 0.36 9.12 <.0001 0.19 4.61 <.0001 0.16 3.93 <0.01 −9 0.22 6.35 <.0001

noisy - NMF −6 0.12 2.92 0.04 0.09 2.29 0.2 0.06 1.37 0.74 −9 0.18 5.27 <.0001

noisy - SS −6 0.08 2.1 0.29 0.06 1.42 0.71 0.07 1.66 0.56 −9 0.07 2.09 0.29

noisy - WF −6 0.06 1.5 0.66 −0.01 −0.17 1 0.06 1.55 0.63 −9 −0.03 −0.83 0.96

Conv - Unet −6 0.21 5.38 <.0001 0.13 3.3 0.01 0.13 3.3 0.01 −9 0.06 1.81 0.46

Conv - NMF −6 −0.03 −0.81 0.96 0.04 0.98 0.93 0.03 0.75 0.98 −9 0.03 0.72 0.98

Conv - SS −6 −0.07 −1.64 0.58 0 0.11 1 0.04 1.03 0.91 −9 −0.09 −2.45 0.14

Conv - WF −6 −0.09 −2.23 0.22 −0.06 −1.48 0.68 0.04 0.93 0.94 −9 −0.19 −5.38 <.0001

Unet - NMF −6 −0.25 −6.19 <.0001 −0.09 −2.32 0.19 −0.1 −2.56 0.11 −9 −0.04 −1.09 0.89

Unet - SS −6 −0.28 −7.01 <.0001 −0.13 −3.19 0.02 −0.09 −2.27 0.21 −9 −0.15 −4.26 <.001

Unet - WF −6 −0.3 −7.61 <.0001 −0.19 −4.78 <.0001 −0.1 −2.38 0.17 −9 −0.25 −7.19 <.0001

NMF - SS −6 −0.03 −0.82 0.96 −0.04 −0.87 0.95 0.01 0.28 1 −9 −0.11 −3.18 0.02

NMF - WF −6 −0.06 −1.42 0.72 −0.1 −2.45 0.14 0.01 0.18 1 −9 −0.21 −6.1 <.0001

SS - WF −6 −0.02 −0.6 0.99 −0.06 −1.59 0.61 0 −0.1 1 −9 −0.1 −2.92 0.04

same trend in proportion correct across all immersion conditions can be seen in Fig. 2; the NZE group shows a much higher predicted proportion correct than the Mandarin groups, followed by the NZM group, with the CM group the lowest. This is supported by Table 2, which presents the results of the post-hoc pairwise contrasts among the NZE, NZM, and CM groups in terms of SE algorithm and input SNR levels.

The proportion correct of the NZE group is significantly higher than that of the Mandarin groups in most cases, the exception being Unet at −6 dB input SNR, while the NZM group performs significantly better than the CM group under high input SNR levels (0 dB under babble noise, −3 and −6 dB under SSN). Both Mandarin groups have a similar proportion correct under the lower input SNR levels, i.e. −9 dB under SSN and −3 and −6 dB under babble noise, except for WF.

3.2. Effect of speech enhancement algorithms on speech intelligibility

Comparing the predicted proportion correct across SE algorithms in Fig. 2 shows that, in most cases, the original noisy speech has the highest intelligibility for all groups of participants and under all input SNR conditions. This is confirmed by Table 3, which displays the post-hoc pairwise contrasts between different SE algorithms in terms of the participants’ immersion condition and input SNR level. We find that noisy speech significantly outperforms the five SE algorithms, especially under low input SNR levels. The pairwise contrasts under SSN in Table 3 are simplified as the SSN LME model shows no significant interaction between the SE algorithm and immersion condition, i.e. changing one of the effects does not influence the other one.


Table 4

The occurrence of the five most frequent nucleus errors in the first syllable for each speech enhancement algorithm.

Error Noisy Conv Unet NMF SS WF

NZE NZM CM NZE NZM CM NZE NZM CM NZE NZM CM NZE NZM CM NZE NZM CM

@→NA 6/45 2/24 3/25 4/34 5/37 6/30 2/24 6/27 6/43 5/33 4/39 8/30 3/30 10/32 3/35 5/34 4/31 5/27

eI→aI 3/117 9/123 7/101 9/93 16/120 11/93 3/121 3/95 3/90 10/111 5/100 9/125 9/106 11/100 9/100 7/112 8/89 7/118

e→I/i: 3/110 9/100 6/84 1/90 7/81 4/90 3/85 5/88 3/89 2/92 6/92 6/91 0/98 5/96 6/99 2/105 3/94 10/98

3:→u: 0/33 1/38 3/43 0/28 1/33 2/39 0/52 3/41 2/34 0/43 2/38 3/38 0/48 0/36 4/37 2/36 2/42 2/37

æ→e 0/92 5/62 1/74 3/86 3/80 5/73 3/80 2/69 3/93 3/74 6/88 1/81 1/68 2/81 4/64 1/80 4/76 2/71

Table 5

10 Most frequent nucleus errors in the first syllable for each group.

NZE Occurrence NZM Occurrence CM Occurrence

@→NA 25/200 @→NA 31/190 @→NA 31/190

eI→aI 41/660 I@→@ 13/114 eI→aI 46/627

I@→@ 5/120 eI→aI 52/627 3:→u: 16/228

au→e 11/280 e→I/i: 35/551 e→I/i: 35/551

e@→u: 2/60 @→6 10/190 3:→ou 14/228

I@→e@ 4/120 æ→e 22/456 I@→i: 7/114

æ→a: 14/480 au→e 11/266 e@→æ 3/57

u:→i: 7/260 3:→u: 9/228 e→eI 26/551

æ→e 11/480 e→eI 21/551 I@→@ 5/114

U→o: 4/220 æ→eI 16/456 æ→e 16/456

In general, Unet performs the worst under most conditions as shown in Table 3. Moreover, the insignificant contrast in proportion correct among all participant groups in Table 2 and the near-flooring proportion correct of Unet at −6 dB under babble noise for all groups in Fig. 2 also indicate the ineffectiveness of Unet, especially under low input SNR. However, Unet shows the second-best intelligibility improvement indicated by ΔSTOI, as shown in Table 1. Conv, regarded as the most effective algorithm in terms of objective quality and intelligibility measures as shown in Table 1, is found to have an insignificant difference in performance when compared to other SE algorithms except for Unet according to Table 3. Generally, the NZE group is found to have a significant difference between algorithms for many contrasts, whereas, as per Table 3, most of the contrasts for the CM group are insignificant.

3.3. Phonetic errors

Since the SPANZ BKB corpus is not phonetically balanced, the total occurrence of each vowel in the test is different, i.e. some vowels occurred more often than others. Moreover, since the sentences were chosen arbitrarily for each condition, the vowel occurrence under each SE algorithm was different. To better reflect the frequency of each confusion, the occurrence of each confusion is represented by the occurrence count of the confusion out of the total occurrence count of the original vowel. It is found that most of the errors are similar across the SE algorithms, and that some unique errors with high occurrence percentages occurred only once. Because the total occurrence of the target vowel under that SE algorithm is low, it is less convincing to regard such errors as “common” errors. Consequently, the rest of the section discusses the overall phonetic errors by combining the errors of each SE algorithm. The five most frequent nucleus errors for each SE algorithm and each participant group are shown in Table 4.
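The normalisation and pooling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis script of the study: the function names `pool` and `rank_confusions` are hypothetical, while the counts are copied from the NZE columns of Tables 4 and 5.

```python
from fractions import Fraction


def pool(counts_per_algorithm):
    """Sum (confusion count, vowel total) pairs over the SE algorithms."""
    hits = sum(h for h, _ in counts_per_algorithm)
    total = sum(t for _, t in counts_per_algorithm)
    return hits, total


def rank_confusions(counts, top=3):
    """Rank confusions by count relative to the total of the target vowel.

    counts maps a confusion label to (confusion count, vowel total).
    """
    rates = {label: Fraction(hits, total)
             for label, (hits, total) in counts.items() if total > 0}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top]


# Schwa deletion for the NZE group under noisy, Conv, Unet, NMF, SS, and WF
# (Table 4); pooling these reproduces the 25/200 entry of Table 5.
schwa_nze = [(6, 45), (4, 34), (2, 24), (5, 33), (3, 30), (5, 34)]
pooled = pool(schwa_nze)

# A subset of the pooled NZE entries from Table 5.
nze = {
    "@→NA": pooled,
    "eI→aI": (41, 660),
    "I@→@": (5, 120),
    "au→e": (11, 280),
    "æ→e": (11, 480),
}
ranked = rank_confusions(nze)
for label, rate in ranked:
    print(f"{label}: {float(rate):.3f}")
```

Ranking by the relative rate rather than the raw count is what keeps a frequent vowel such as /eI/ (41 confusions out of 660) below the schwa (25 out of 200).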

We discover that both onset and coda errors show similar patterns for all participant groups, mostly the omission of the whole phonetic component. Hence, no table is presented in this paper for simplicity. Nucleus errors are found to have the most noticeable differences between participants with different immersion conditions, and the 10 most frequent errors for each group are displayed in Table 5. The occurrences are ranked from high to low for each participant group. Nucleus errors which are possibly attributed to unfamiliarity with the NZE accent were bolded. Table 5 shows that the three most frequent nucleus errors exhibited a high degree of similarity regardless of immersion condition and SE algorithm. All groups tended to mistake [/eI/-/aI/] and miss the schwa, and the NZE and NZM groups often confused [/I@/-/@/], while this confusion was less common for the CM group. Another common error for all participant groups was the confusion of the raised NZE /æ/ with /e/, where the NZE group had a much lower occurrence compared to the Mandarin groups. This indicates that such confusion may be common for listeners who are familiar with other varieties of English. Another common confusion caused by unfamiliarity with the NZE accent was [/3:/-/u:/], which occurred frequently in both NZM and CM groups.

4. Discussion

In the current study, we investigate how existing speech enhancement algorithms affect the speech intelligibility of listeners with different immersion conditions. This section discusses how SE algorithms and immersion conditions interactively affect speech intelligibility.

According to the original literature on the five SE algorithms we evaluated, all of them have been tested using SSN, babble noise, or both. Both Conv (Luo and Mesgarani, 2019) and NMF (Mohammadiha et al., 2013) were tested down to −5 dB input SNR, whereas the others were only tested for positive input SNR conditions (Scalart and Filho, 1996; Hu and Loizou, 2003; Choi et al., 2019). The most prominent objective improvement in both intelligibility and quality as reported in Table 1 was achieved by the DNN-based methods (Conv and Unet).

4.1. Effect of immersion condition on speech intelligibility

The fact that the NZE group performed significantly better than the Mandarin groups under almost all conditions, as shown in Table 2, is aligned with our hypothesis of expecting the highest intelligibility score from the NZE group. The only exception is for Unet at −6 dB input SNR under babble noise, which is potentially caused by the flooring effect. Since the sentences in the SPANZ BKB corpus have a relatively high predictability context (Schoof and Rosen, 2015), the contextual information may have been available during speech perception. Bradlow and Alexander (2007) and O’Neill et al. (2020) found that semantic information is more accessible to L1 listeners under noisy conditions even if both L1 and L2 listeners are able to benefit from it. They also stated that L2 listeners exhibited reduced sensitivity to the phonetic distinctions of the target language, possibly because of their limited exposure to the complete set of acoustic-phonetic cues. Therefore, it is assumed that the Mandarin groups, especially the CM group, are forced to rely heavily on acoustic-phonetic cues for perception due to their unfamiliarity with NZE.

Contrary to our hypothesis, the performance of the NZM and CM groups had no significant difference at lower input SNR levels (i.e. −3 and −6 dB under babble noise and −9 dB under SSN), while we expected the CM group would always perform worse than the NZM group due to their different levels of immersion (Flege et al., 1992, 1997; Meador et al., 2000; Jia et al., 2006; Mi et al., 2013). One possible reason for the disagreement is that, judging by the intelligibility scores in Fig. 2, both Mandarin groups almost reached the floor at the lowest input SNR levels (−6 dB for babble noise and −9 dB for SSN). This indicates that under extremely noisy conditions, the late- and non-immersed listeners are less likely to capture enough acoustic-phonetic cues during perception. In fact, Flege et al. (1992) investigated the production of the word-final English consonant pair /t/-/d/ by native English and native Mandarin listeners. It was found that experienced L2 speakers did not produce word-final contrast consonants significantly better than the inexperienced ones, which aligns with our findings.

Even though the current study focuses on speech intelligibility, it has been shown that the degrees of accuracy in the production and perception of speech are closely related (Flege et al., 1997). However, other studies have reported that experienced/proficient non-native listeners are able to perform significantly better than inexperienced/less proficient L2 listeners for both production and perception (Flege et al., 1997; Kilman et al., 2014; Jia et al., 2006). One possible explanation for these contradicting results is that the majority of these studies focus on vowels or sentences. Vowels are voiced and have a longer duration than voiceless stop consonants at word-final positions (Rogers, 2000), and sentences allow top-down effects, hence making the perception cues more accessible to experienced/proficient L2 listeners. In contrast, for the stop consonant pairs, the major way to distinguish and produce them is by making perceptual use of the difference in vowel duration before such consonant pairs. This may indicate that the ability to notice and utilise subtle acoustic-phonetic cues is less likely to be obtained by late- and non-immersed listeners under extremely noisy conditions, especially when such cues may be mitigated or removed by SE algorithms. Moreover, our study tested the intelligibility of enhanced speech, where the speech is very likely to be distorted by SE algorithms. Bradlow and Alexander (2007) demonstrated that L2 listeners were only able to utilise higher-level contextual information when there were sufficient acoustic-phonetic cues. Therefore, auditory cues may be less accessible to the Mandarin groups and limit the accessibility of top-down information from the sentences for the NZM group, leading to an insignificant difference in intelligibility between the NZM and CM groups.

To further examine the influence of the immersion condition, phonetic analysis was conducted to compare the performance difference between the Mandarin groups at the phonetic level. According to Table 5, some of the common nucleus errors can be attributed to unfamiliarity with the NZE accent. These errors include the well-reported raising of /e/ and /æ/ to /I/ and /e/, respectively (Watson et al., 2000), and the more recently studied merging of [/3:/-/u:/] (Maclagan et al., 2017). As reported by our previous phonetic study (Zhang et al., 2023), non-immersed Mandarin listeners also tend to mishear /e/ as /I/ when listening to clean NZE speech. Hence, the transcription errors of the CM group in the current study are very likely to have resulted from a confluence of the unfamiliarity with NZE vowels, the noise, and the distortion caused by the SE algorithms. Interestingly, except for the [/3:/-/u:/] confusion, where the NZM group made fewer mistakes than the CM group, the two Mandarin groups performed similarly in terms of the other errors mentioned above. This finding contradicted our hypothesis that the NZM group would have fewer nucleus confusions due to unfamiliarity with the NZE accent since they have been immersed in NZE for at least a year. This may indicate that even experienced late-immersed listeners may find the characteristics of the NZE accent challenging in noise. Even though the NZM group showed little advantage in terms of distinguishing NZE vowels, the higher intelligibility score achieved by the NZM group at higher input SNR levels indicates that they may still be able to perceive universal phonemes more accurately than the CM group due to their extensive exposure to NZE. Additionally, some vowel confusions, which occur at a lower but notable rate, may be due to the listeners’ greater familiarity with the rhotic General American (GenAm) accent. For instance, the confusion of [/2/-/æ/] (NZM occurrence = 7/475, CM occurrence = 15/475) may be explained by the fact that /æ/ in GenAm is often used in place of /2/ in NZE. The confusion of [/6/-/o:/] (NZM occurrence = 12/608, CM occurrence = 19/608) may result from the fact that NZE distinguishes between /6/ and /o:/, whereas these vowels merge in GenAm (Rogers, 2000; Hay et al., 2008). Such confusion caused by the difference between NZE and GenAm may be due to the fact that participants from the CM group had never been exposed to NZE, and most of them had received their English education in GenAm. Also, both confusions occurred for the NZM group, but with slightly lower occurrences, possibly because the participants had learnt GenAm in their early English education, at least until the age of 15. Therefore, there is a possibility that they may still be affected by the English learnt at an earlier age. However, to the best of our knowledge, this topic has not been investigated in prior research.

All three groups exhibited two common nucleus confusions, specifically confusion between the vowel pairs [/eI/-/aI/] (e.g. “hi” to “hay”) and [/æ/-/e/] (e.g. “bed” to “bag”), where the latter pair is likely due to the raised /æ/ in the NZE accent, which often causes such confusion for non-NZE speakers. These confusions indicate that even early-immersed listeners face challenges in distinguishing certain vowel sounds under noisy conditions. The most frequent nucleus error for all participant groups was the deletion of schwa. This would have been mainly caused by the extremely short duration of this vowel, which was often located in an unstressed syllable in the corpus we used, e.g. “around” to “round”. Hence, the vowel may easily be masked by the noise and removed by the SE algorithms. These findings highlight the impact of familiarity with a variety of English on phonetic perception.

4.2. Effect of speech enhancement algorithms on speech intelligibility

The unprocessed noisy speech is slightly but statistically significantly more intelligible than any of the enhanced speech according to Table 3, which aligns with the finding in Li et al. (2011). Although the test conducted in this early research was limited to five traditional SE algorithms with input noisy speech signals at positive input SNRs, for listeners with different native languages tested in their native languages, the result also showed that most of the SE algorithms suppress the background noise at the cost of distorting the speech.

Similarly, Cooke and Lecumberri (2016) pointed out that “while enhanced styles lead to gains by reducing the effect of masking noise, the same styles distort the acoustic-phonetic integrity of the speech signal”. This previous study found that modified speech styles which are beneficial in noise failed to achieve higher intelligibility in quiet conditions. Our current study revealed that the current SE algorithms yielded lower intelligibility than the unprocessed noisy speech. Hence, the results from both studies showed that such manipulation of speech might damage its acoustic-phonetic structure.

From the STOI improvements shown in Table 1, we expected that the proportion correct of all the SE algorithms apart from NMF would be higher than that of the original noisy speech, while the result showed the contrary. A plausible explanation for the degradation can be drawn from the design of these algorithms; most of them were designed to improve the speech with respect to the objective metrics (SNR, SDR, SSNR, etc.), but not the actual speech intelligibility perceived by listeners. Therefore, these methods may have focused too much on eliminating the noise from the signal, resulting in removing some critical phonetic features (e.g. initials or ends of a word) or distorting the speech signal, which will be explained in detail in the following section. The exception is Conv-TasNet, where a subjective test was also conducted in the original study. However, the test asked the participants to mark the quality of the enhanced speech but did not measure how well the participants could understand it, i.e. intelligibility (Choi et al., 2019). Also, the input SNR levels used in the current study were lower than those assumed in the original literature; hence, such algorithms may not be designed to perform well under negative input SNR levels.
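To make the notion of a negative input SNR concrete, the following Python sketch uses the standard definition of SNR as a power ratio in decibels and scales a noise signal so that the mixture reaches a target input SNR. It is a generic illustration under stated assumptions, not the stimulus-generation procedure of the study: the function names and the toy signals are hypothetical.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """SNR in dB: 10 * log10(speech power / noise power)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return the noise rescaled so that snr_db(speech, result) hits the target."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    return [gain * n for n in noise]


# Toy signals standing in for speech and noise (hypothetical, for illustration).
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [0.3 * (-1) ** t for t in range(200)]

scaled = scale_noise_to_snr(speech, noise, target_snr_db=-6.0)
noisy = [s + n for s, n in zip(speech, scaled)]
print(f"input SNR = {snr_db(speech, scaled):.2f} dB")
```

At −6 dB the noise carries roughly four times the power of the speech, which illustrates how far the tested conditions are from the positive input SNRs assumed in most of the original evaluations.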

Another interesting finding is that, when listening to the signals processed by the worst performer, Unet, the signals sounded very clean with most of the noise removed, which may have led to the over-removal of phonetic components. The phonetic analysis further supports this observation, where most of the onset and coda errors made by the participants were missing the whole phonetic component regardless of immersion condition. Since some onsets and codas consist of voiceless consonants, which have a shorter duration (Rogers, 2000), it is very likely that background noise may partially or completely mask them, especially under low input SNR conditions. Moreover, the SE algorithms are designed to suppress and remove the noise and may treat these phonetic components as part of the background noise and remove them in the process.

Overall, Table 3 shows that the NZE group is more capable of capturing the small but statistically significant difference between SE algorithms under babble noise. Such behaviour is not found in either the NZM or the CM group, which may reflect that late- and non-immersed listeners perceive enhanced and noisy speech with similar intelligibility and are less able to benefit from the SE algorithms than the early-immersed listeners under non-stationary noise. Such inability to observe the difference between noisy and enhanced speech is found under SSN for all participant groups, with the difference between algorithms being more significant under the lowest input SNR level (−9 dB).

According to Fig. 2, the perception gaps among the three groups at each input SNR level were barely narrowed by applying SE algorithms. This corroborates our hypothesis about the ineffectiveness of SE algorithms on all participants regardless of immersion conditions. The only exception is for Unet at −6 dB under babble noise, where the difference in intelligibility score among the participant groups is insignificant according to Table 2. However, such a narrowed perception gap may be due to the significant drop in the intelligibility score caused by the flooring effect (i.e. proportion correct lower than 0.25 as shown in Fig. 2).

4.3. Limitations and future work

The limitations of the research can mainly be attributed to three factors. Firstly, due to the COVID-19 pandemic, the test had to be conducted online. According to the research by Cooke and García Lecumberri (2021), online speech perception tests are reliable when performed by known participants who avoid low-quality headphones. Since most of the participants in the Mandarin groups were known by the authors and were instructed to avoid low-quality headphones, the results from this study should be appropriate. However, since no information about the quality of the participants’ headphones was collected, it is still possible that the speech perception of some participants was influenced. Additionally, the participants’ listening environments varied and the sound level of the stimuli was not controlled. While participants were instructed not to adjust the device’s sound volume during the formal test, this could not be monitored. Since the volume of the enhanced speech differed significantly among certain SE algorithms, e.g. the signal processed by Conv had a much higher energy than that processed by NMF, this would have potentially impacted the participants’ responses.

Secondly, since meaningful sentences were used in the test, listeners with higher proficiency would have taken advantage of semantic cues to guess the correct keywords. Due to such sentences, responses treating multiple words as a single word were common, e.g. “cost a lot” as “customer”. Additionally, the corpus was not optimised for phonetic analysis as test sentences contain an unbalanced distribution of phones, which results in a less comprehensive analysis of individual phonemes. This is the main reason why the study did not focus on the phonetic analysis and only briefly covered it in Section 4.1. Furthermore, since the participants transcribed whatever they heard, part of the marking involved guessing which response corresponded to each keyword. Therefore, the marked results may not fully reflect their perception.

Thirdly, the experiment only tested noisy speech with negative input SNR conditions to avoid the intelligibility score of the early-immersed listeners suffering from a ceiling effect. However, as shown in Fig. 2, the intelligibility of the Mandarin listener groups (NZM and CM) tested at the highest input SNRs (0 dB for babble noise, −3 dB for SSN) exhibited a large potential for improvement under positive input SNR conditions.

Since this is an early study investigating the performance of single-channel SE algorithms on non-native listeners, it primarily serves to broaden our understanding of the general effects of SE on this listener group. Future studies are required to understand the effect of SE algorithms on listeners with different language exposure and language familiarity. This includes using a different corpus with low predictability sentences, controlling the listening environment (e.g. by conducting the test in person), testing non-native listeners with different first languages under positive input SNR conditions, and conducting such subjective listening tests to evaluate SE algorithms designed to enhance speech intelligibility, as well as machine learning-based methods with diverse training and learning models.

5. Conclusion

This study investigated the effect of different speech enhancement (SE) algorithms on speech intelligibility, with a glimpse of the phonetic perception of early-immersed New Zealand English (NZE) listeners and native Mandarin listeners with varying degrees of immersion in NZE in noisy environments. The results indicated that the early-immersed group consistently performed better than the late- and non-immersed groups in terms of speech intelligibility, and that noisy speech was generally more intelligible than enhanced speech. However, the late-immersed group did show a higher intelligibility score than the non-immersed group at higher input signal-to-noise ratios (SNR), although this advantage diminished at extremely low SNR levels (i.e. at input SNRs lower than −6 dB). The SE algorithms tested showed insignificant differences from each other, with the exception of the deep complex U-net (Unet) algorithm performing worse. Furthermore, the study found that the Mandarin groups were less able to detect intelligibility differences between speech processed by different SE algorithms, suggesting that these algorithms may not effectively improve speech intelligibility for late- and non-immersed listeners. Finally, phonetic errors in noisy and enhanced speech were found to be similar: unfamiliarity with the NZE accent led to nucleus errors for both Mandarin groups, and the late-immersed listeners were also unable to take advantage of their immersion experience. Overall, these findings have implications for the development of SE algorithms, especially for non-native listeners, and the understanding of the effects of language immersion on speech perception in noise.

CRediT authorship contribution statement

Yunqi C. Zhang: Conceptualization, Methodology, Software, Validation, Investigation, Resources, Formal analysis, Data curation, Writing – original draft, Writing – review & editing, Visualization. Yusuke Hioka: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision, Project administration. C.T. Justine Hui: Conceptualization, Methodology, Resources, Writing – review & editing, Supervision. Catherine I. Watson: Conceptualization, Methodology, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgements

We thank the participants for their participation, Dr. Suzanne Purdy for providing the SPANZ corpus, and Dr. Elaine Ballard for her insightful advice.