The detected VLROPs and VLREPs move closer to the ground truth (dashed lines) after the AE evidence is added. In human speech communication, most of the message is carried by the non-VLRs.
Overview of speech recognition
Conventional speech recognition system
Cepstral mean subtraction and cepstral variance normalization are among the popular normalization techniques. It has been shown that the lowest-order elements of the cepstrum vector provide a good approximation of the vocal tract filter [2].
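As an illustration, cepstral mean and variance normalization can be sketched as follows (a minimal numpy version; the exact normalization used in a given front end may differ):

```python
import numpy as np

def cmvn(cepstra, eps=1e-8):
    """Cepstral mean and variance normalization.

    cepstra: (num_frames, num_coeffs) array of cepstral vectors.
    Subtracts the per-coefficient mean over the utterance and divides
    by the per-coefficient standard deviation, removing slowly varying
    channel effects from the features.
    """
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0)
    return (cepstra - mean) / (std + eps)
```

After normalization, each cepstral coefficient has approximately zero mean and unit variance over the utterance.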
Landmark based speech recognition system
Landmarks: Landmarks are those cases in which acoustic manifestations of the linguistically motivated distinctive features are most salient [4]. In the system proposed by Liu [4], the landmarks and distinguishing features are used to predict the underlying sequence of segments.
Event based consonant-vowel (CV) unit recognition system
Therefore, in the first stage, the output of the HMM is given more weight than the SVM output, whereas in the second stage, the output of the SVM is given more weight than the HMM output.
Motivation for vowel-like region based approach
It can be seen from the figure that the VLRs are mainly high energy regions and most of the energy is present at low frequency. On the other hand, non-VLRs are relatively low energy regions and most of the energy is present at high frequency.
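This observation can be turned into a simple frame-level measure; the sketch below computes the ratio of low-band to high-band spectral energy (the 1 kHz cutoff and the function name are illustrative choices, not values from this work):

```python
import numpy as np

def low_high_band_energy_ratio(frame, fs, cutoff=1000.0):
    """Ratio of spectral energy below `cutoff` Hz to the energy above it.

    High values suggest a vowel-like region (energy concentrated at low
    frequencies); low values suggest a non-VLR such as a fricative, whose
    energy is concentrated at high frequencies.
    """
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low = spec[freqs < cutoff].sum()
    high = spec[freqs >= cutoff].sum()
    return low / (high + 1e-12)
```

A voiced, vowel-like frame yields a large ratio, while a frication-like frame yields a ratio near zero.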
Organization of the thesis
This chapter reviews the literature related to the acoustic-phonetic analysis of speech and the speech recognition approaches that use knowledge from these analyses. The second approach is the knowledge-based approach, which uses the acoustic-phonetic knowledge explicitly.
Acoustic-phonetic Analysis of Speech
- Voiced/unvoiced information
- Degree of constriction
- Formants and formant transitions
- Burst and voice onset time
- Vowel onset and offset points
- Nasalization
- Frication
For sound onset detection, methods based on periodicity in the acoustic waveform and spectrographic analysis were used [64]. In the following sections, we examine the implicit and explicit use of acoustic-phonetic knowledge in phoneme recognition.
Implicit acoustic-phonetic knowledge
- HMM based system
- Hybrid ANN-HMM based system
- SGMM-HMM based system
- DNN-HMM based system
In the following sections, we examine the implicit and explicit use of acoustic-phonetic knowledge in speech recognition. Under mismatched training conditions, such as a different microphone, different background noise, or channel mismatch, the recognition performance degrades.
Explicit acoustic-phonetic knowledge
Landmark based approach
A method was proposed to detect these landmarks by processing the signal energy in six frequency bands. Most landmark systems have failed due to the lack of a probabilistic framework for dealing with variability in pronunciation.
Event based approach for syllable recognition
Using the two-stage system, a CV recognition rate of 66.14% was reported on the Telugu news database [16]. The disadvantage of the CV unit recognition system is that the VOP must be detected correctly; otherwise, not all CV units will be recognized.
Explicit acoustic-phonetic knowledge in statistical systems
Organization of the work
Another task in the proposed framework involves the extraction of VLR and non-VLR specific acoustic-phonetic features. It has been found in the literature that manual labeling in the case of voiced aspirated sounds is a challenging task.
Manual marking of VLROPs and VLREPs
Unvoiced, voiced unaspirated and nasal consonants
Consequently, signal energy, formant transition, and excitation strength can be used to mark the VLROP and VLREP events. In the absence of a consonant, the vowel energy slowly decreases and it becomes difficult to detect the exact moment of the end point.
Voiced aspirated (VA) consonants
The energy of the EGG signal is fed through the non-linear operator and convolved with the second order Gaussian differentiator. This shows that the change in the amplitude (or the energy) of the EGG signal is an unequivocal indicator to mark the VLROP manually in the case of VA units.
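The operations described above can be sketched as follows, assuming squaring as the non-linear operator and a truncated second derivative of a Gaussian as the differentiator (both are illustrative assumptions, not necessarily the exact operators used here):

```python
import numpy as np

def gaussian_second_derivative(sigma, half_width=None):
    """Second derivative of a Gaussian, usable as a change detector."""
    if half_width is None:
        half_width = int(4 * sigma)
    t = np.arange(-half_width, half_width + 1, dtype=float)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    return (t ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g

def egg_change_evidence(egg, sigma=20.0):
    """Convolve the EGG energy (squared signal, the assumed non-linear
    operator) with a second-order Gaussian differentiator so that abrupt
    changes in the EGG amplitude produce strong peaks in the output."""
    energy = egg ** 2
    kernel = gaussian_second_derivative(sigma)
    return np.convolve(energy, kernel, mode='same')
```

An abrupt rise in EGG energy (as at a VLROP in a VA unit) produces a strong local extremum in the output near the change point.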
VLROP and VLREP detection using Bessel features
Analysis of VLROPs and VLREPs using Bessel expansion and AM-FM model
In the case of the combined evidence in Figures 3.5 (e) and (f), the peaks are comparatively close to the ground truth events. In Figure 3.5 (b), the peak deviates greatly from the ground truth; after the proposed evidence is combined, it moves closer to the ground truth, as seen in Figure 3.5 (e).
Performance evaluation
VLR detection using source and vocal tract information in a statistical framework
Analysis of the ES based VLR detection
In this subsection, we will analyze false and missed detections with the ES-based method. Similarly, relatively low signal strength in the rough region is captured and not declared as VLR.
Complementary information from source and system for VLR detection
The vocal tract constriction is analyzed, and evidence is presented for approximately measuring the degree of constriction. The detailed analysis of vocal tract constrictions and the procedure for estimating the evidence are described in Chapter 4.
Database for VLR detection evaluation
VLRs are produced with a relatively smaller amount of constriction than the non-VLRs, and thus the evidence can be used as a feature for VLR detection. The smoothed evidence obtained using the Hilbert envelope (HE) of the LP residual of speech and the rate of change of the excitation strength obtained with the ZFF operation, as discussed in the previous section, are used as excitation source characteristics.
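A minimal sketch of one of these source characteristics, the Hilbert envelope of the LP residual, is given below using plain numpy (the LP order and the FFT-based analytic signal computation are generic choices, not necessarily those used in this work):

```python
import numpy as np

def lp_residual(x, order=10):
    """LP residual via autocorrelation-method linear prediction.

    Solves the Yule-Walker normal equations R a = r and inverse-filters
    the signal; for vowel-like regions the residual shows strong peaks
    around the glottal closure instants.
    """
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    pred = np.zeros_like(x)
    for k in range(order):
        pred[k + 1:] += a[k] * x[:len(x) - k - 1]
    return x - pred  # e[n] = x[n] - sum_k a[k] x[n-k]

def hilbert_envelope(x):
    """Hilbert envelope |x + j*H(x)| via the FFT-based analytic signal."""
    X = np.fft.fft(x)
    h = np.zeros(len(x))
    h[0] = 1.0
    h[1:(len(x) + 1) // 2] = 2.0
    if len(x) % 2 == 0:
        h[len(x) // 2] = 1.0
    return np.abs(np.fft.ifft(X * h))
```

The Hilbert envelope of the residual, `hilbert_envelope(lp_residual(x))`, can then be smoothed to form the evidence contour.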
SVM system
Experimental results
Adding the AE and VTC features to the excitation source features further increases the IR and significantly reduces the SR. Thus, using complementary information from the source and the vocal tract in an SVM framework significantly improves the VLR detection performance compared to the existing SP-based method.
Summary
VLRs and non-VLRs are produced with different amounts of constriction in the vocal tract. Non-VLRs are produced with a complete occlusion or a narrow constriction in the vocal tract.
Vocal tract constriction evidence using ZFF
The proposed evidence shows very high values for voice bars and very low values for low vowels. The cosine kernel value k is proposed as a measure of the agreement between the two.
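The cosine kernel itself is a standard construction; a minimal version is:

```python
import numpy as np

def cosine_kernel(u, v):
    """Cosine kernel k = <u, v> / (||u|| ||v||).

    Returns 1 for perfectly agreeing (parallel) evidence vectors,
    0 for orthogonal ones, and -1 for opposing ones.
    """
    return float(np.dot(u, v) /
                 (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

Because the kernel is scale invariant, it measures only the agreement in direction between the two evidence vectors, not their magnitudes.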
Analysis of VTC evidence
- Voiced sounds
- Vowels
- Unvoiced sounds
- Voiced and unvoiced sounds
Low frequency characteristics of the two extreme cases (low vowels and voice bars) can be seen in the plot as previously described. The trend according to tongue position or vowel height is reflected in the distribution curves.
VTC evidence as a feature for recognition of non-vowel-like sounds
Recognition of constricted phones in a phoneme recognizer
The table also shows the t- and p-values of the t-test performed on the accuracies of the different constricted phones. The absolute phone error rate (PER) and the relative PER reduction (RPER) for the different constricted classes are also shown in Table 4.2 for all three databases.
VTC evidence as vowel height feature
These three coefficients are strongly related to the vowel height characteristic and are therefore compared with the similar bar charts in Figure 4.8. However, the overlap between the distributions of the mid and low vowels is larger for C1 than for the proposed vowel height feature.
Vowel roundedness and frontness features
Vowel roundedness features
Center of Gravity of Spectral Peaks (RCoG): Vowel roundedness refers to the amount of rounding of the lips during vowel articulation. The hypothesis is that the amplitude of the third formant decreases due to rounding.
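A simple version of such a spectral-peak center of gravity can be sketched as follows (the peak picking here is the simplest possible strict local-maximum test, not necessarily the procedure used for RCoG):

```python
import numpy as np

def spectral_peak_cog(frame, fs):
    """Amplitude-weighted mean frequency of the spectral peaks.

    Local maxima of the magnitude spectrum are taken as peaks; rounding,
    which weakens the higher formants, pulls this value down, so a lower
    value indicates a more rounded vowel.
    """
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    idx = np.where(is_peak)[0] + 1
    if len(idx) == 0:
        return 0.0
    return float(np.sum(freqs[idx] * mag[idx]) / np.sum(mag[idx]))
```

Weakening a high-frequency peak, as rounding does to F3, lowers the returned center of gravity.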
Vowel frontness feature
Vowel recognition using acoustic-phonetic features in limited labeled data scenario
Database
Since vowel recognition is performed using limited training data, a number of subsets are prepared from the TIMIT train set. From each dialect, 500 examples (top 250 examples each from male and female speakers, arranged by name) are extracted to form eight different subsets (DR1-DR8).
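The subset preparation can be sketched generically as below; the dictionary keys and the gender labels are hypothetical, since the actual file organization of the TIMIT lists is not shown here:

```python
from collections import defaultdict

def make_dialect_subsets(examples, per_gender=250):
    """Build per-dialect training subsets.

    `examples` is a list of dicts with hypothetical keys 'dialect'
    ('DR1'..'DR8'), 'gender' ('M'/'F') and 'name'; for each dialect the
    first `per_gender` male and female examples in name order are kept,
    giving 500 examples per dialect by default.
    """
    by_key = defaultdict(list)
    for ex in examples:
        by_key[(ex['dialect'], ex['gender'])].append(ex)
    subsets = {}
    for dialect in sorted({d for d, _ in by_key}):
        chosen = []
        for gender in ('M', 'F'):
            chosen.extend(sorted(by_key[(dialect, gender)],
                                 key=lambda e: e['name'])[:per_gender])
        subsets[dialect] = chosen
    return subsets
```

Sorting by name before truncation makes the subset selection deterministic, so the same limited-data subsets can be reproduced across experiments.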
Experimental results
This suggests that the different kinds of information, such as roundedness, frontness, and height, conveyed by the acoustic-phonetic features aid vowel recognition. The acoustic-phonetic features calculated from the HFBT spectrum are added to the standard MFCC features in various combinations.
Summary
The detected formants and the spectrum are used to extract features for vowel roundedness and frontness. The proposed acoustic-phonetic features are used for vowel recognition in a limited training data scenario.
Detection of dominant aperiodic component regions (DARs) in speech
- Sub-fundamental frequency (SFF) filtering
- Dominant resonant frequency (DRF) and high to low frequency components
- Evidence enhancement
- Refinement using VLR information
- Combining three outputs and smoothing
Here we try to detect some of the discontinuities due to aperiodic sources using SFF filtering. For all resonant sounds, most of the signal energy is present in the low frequency range.
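A generic single-pole filtering sketch in the SFF style is given below: the signal is heterodyned so that the chosen frequency aligns with the resonance of a near-unit-circle pole at fs/2, and the magnitude of the filter output tracks the amplitude at that frequency. This is a textbook-style sketch, not necessarily the exact SFF formulation used in this work:

```python
import numpy as np

def sff_envelope(x, f, fs, r=0.995):
    """Amplitude envelope of the component of x around frequency f.

    The heterodyne shifts f to fs/2, where the single real pole at -r
    (transfer function 1 / (1 + r z^{-1})) has its resonance; |y[n]|
    then tracks the amplitude at f with high temporal resolution, so
    discontinuities due to aperiodic sources show up as sharp changes.
    """
    n = np.arange(len(x))
    shifted = x * np.exp(1j * n * (np.pi - 2 * np.pi * f / fs))
    env = np.empty(len(x))
    prev = 0j
    for i, s in enumerate(shifted):
        prev = s - r * prev          # y[n] = x~[n] - r y[n-1]
        env[i] = abs(prev)
    return env
```

Because the pole radius r is close to one, the filter is very selective in frequency while the envelope is still computed sample by sample.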
Prediction of duration of transition region (DTR)
On the other hand, high vowels are produced with a moderate narrowing in the vocal tract. For prediction of DTR in a CV unit, the average VTC value (Vav) in the vocal region is calculated and this value is mapped to a specified time duration.
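The mapping from the average VTC value to a duration can be sketched as a simple linear interpolation (all range values below are illustrative placeholders, not the values used in this work):

```python
def predict_transition_duration(v_av, v_min=0.0, v_max=1.0,
                                d_min=0.005, d_max=0.050):
    """Map the average VTC value (Vav) in the vowel region to a
    transition duration in seconds by linear interpolation: high
    constriction evidence gives a short transition (d_min), low
    evidence a long one (d_max).  Values outside the range are
    clamped before mapping."""
    v = min(max(v_av, v_min), v_max)
    frac = (v - v_min) / (v_max - v_min)
    return d_max - frac * (d_max - d_min)
```

Under a linear mapping the predicted duration changes by the same amount for a given change in Vav anywhere in the range; a non-linear mapping would be needed to make the prediction less sensitive at one end.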
Experimental evaluation
Evaluation of the DARs detection method
The VLR removal operation reduces the spurious rate by about 10% for both the SFF-based and the combined methods. However, it is observed that the VLR removal operation not only reduces the spurious rate but also lowers the overall detection rate by about 5%.
Evaluation of the DTR prediction method
Application of DARs and DTRs in consonant-vowel (CV) unit recognition system
Database
Due to unavailability of manually tagged VOPs in the database, forced alignment and automatic VOP detection methods are used to obtain the VOPs. Therefore, the CV units (consonant-vowel as well as consonant-diphthong unit) containing an obstruent consonant are considered for this study.
Two-stage CV recognition system
112 such CV units containing 17 obstruent consonants, 5 vowels and 2 diphthongs are used, as listed in Table 5.3.
VOP detection
Acoustic modeling
Since the size of the training data is small, the number of hidden layers is chosen to be 2. To reduce the learning rate in the dimensions where the derivatives have a high variance, a matrix-valued learning rate is used in the preconditioned form of stochastic gradient descent employed by Kaldi.
Improved consonant recognition using DARs and DTRs
However, in the linear relationship, the high VTC values are mapped to a very short duration, say, 5-10 ms. Therefore, for the same change in a lower VTC value, the value of the predicted transition region duration will vary by a lesser amount.
Evaluation of the CV recognition system
Consonant recognition
Consonant recognition performance for obstruents is compared for FA-based and SP-based VOP detection, using the proposed and the fixed-duration features under different acoustic models. In all cases, the proposed method using the CoR and VTR features provides better performance than the baseline method using fixed-duration features.
CV unit recognition
The HMM-SGMM system performs the best among the three modeling approaches, and FA-based VOP detection is shown to provide better consonant recognition than the SP-based method. In all cases, the proposed method achieves an improvement of about 3% over the baseline system.
Summary
Different types of acoustic-phonetic information explored in the previous chapters are used at different levels of the framework. The effectiveness of the features is demonstrated in phone recognition on isolated utterances of the phones.
Phone recognition framework based on VLR segmentation
On the other hand, non-VLRs are produced with complete closure or high narrowing in the vocal tract. Moreover, most of the message part is present in the non-VLRs and the VLRs act like the carriers of the message.
Architecture of the proposed phone recognition framework
- VLR detection in continuous speech
- Selection of VLRs and non-VLRs
- Non-VLR specific acoustic-phonetic information
- Obstruent specific acoustic-phonetic information
- Features for VLRs
- Acoustic models
In the test phase, these models are used at the time of first-level decoding of the non-VLRs. These models are used for second-level decoding of non-VLRs in the test phase.
Evaluation of the proposed framework
Databases
Similarly, 12,533 non-VLR units are available in coda syllables, of which 10,948 units are used for training and the rest are used for testing. Similarly, 16,565 non-VLR units are available in coda syllables, of which 12,018 units are used for training and the rest are used for testing.
Results using forced alignment (FA) based VLR detection
Therefore, a slight decrease in the overall non-VLR recognition performance can be observed for Assamese when the nasals are separated, although the improvement in the case of the nasals is very significant. For the Assamese language, SGMM-HMM based acoustic modeling provides the best performance for both non-VLRs and VLRs.
Results using automatic VLR detection
Vowels and semivowels decoded at the end of the phone recognition are combined and the endpoints of the VLRs are obtained. However, the improved performance in terms of correction rate shows the potential of the proposed framework if the two discussed issues are properly addressed.
Summary
To make proper use of the acoustic-phonetic information, a framework for phone recognition is proposed. However, this increases the false VLRs and the insertion errors at the end of the phone recognition.
Conclusion
Using acoustic-phonetic knowledge in the proposed framework improves the non-VLR recognition performance by about 4.5% for both Hindi and Assamese databases. Although a fully automatic phone recognition is not currently possible, the experiments show the potential of the framework if the problems related to the current system are properly addressed.
Directions for future work
For example, some spurious phones can be removed by applying a threshold on the probability score. The group delay function (g[k]) of the resulting signal is obtained to compensate for the loss in frequency resolution.
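The thresholding idea can be sketched as a one-line filter over decoded (phone, score) pairs; the data layout and the threshold value are illustrative:

```python
def remove_spurious_phones(decoded, min_score=0.5):
    """Drop decoded phone hypotheses whose probability score falls
    below a threshold.

    `decoded` is a list of (phone, score) pairs; 0.5 is an arbitrary
    illustrative threshold, which in practice would be tuned on a
    development set to trade insertions against deletions.
    """
    return [(ph, s) for ph, s in decoded if s >= min_score]
```

Raising the threshold removes more insertions at the risk of deleting genuine low-confidence phones.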
FBT spectrum
In addition to these two methods, the formants are estimated from the well-known STRAIGHT spectrum as reported in [47] and also from the Fourier spectrum.
Formant estimation evaluation
Franco, “Connectionist probability estimators in HMM speech recognition,” IEEE Trans. on Speech and Audio Processing, vol.
Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. on Audio, Speech, and Language Processing, vol.