PEMO: Speech Recognition
with Perceptive Feature Extraction
PEMO is a computational model of the auditory periphery developed by the
Medical Physics Group.
It was originally developed
to simulate psychoacoustical experiments such as temporal or
spectral masking experiments. More recently, the model has been applied
to several topics in speech processing, including speech intelligibility
prediction, objective speech quality measurement, and automatic
speech recognition (ASR).
The motivation for our work in the field of ASR is that the human auditory system can
be regarded as a very robust "speech recognition system" that allows us to understand
speech in very noisy environments. Today's ASR systems, on the other hand, usually perform
quite poorly even in low noise. Simulating the "internal representation" of speech with
an auditory-based feature extraction like PEMO should allow a more robust automatic recognition of
speech.
Processing Stages of PEMO
- Preemphasis of the time signal
- Basilar-membrane filtering with a gammatone filterbank
- Envelope Extraction (half-wave rectification and low pass filtering)
- Adaptive amplitude compression to simulate short-term adaptation
- Low pass filtering of the compressed envelope
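The first three stages listed above can be illustrated with a minimal sketch. It is not the group's implementation: the preemphasis coefficient, the fourth-order gammatone impulse response with ERB-based bandwidth, the 1 kHz envelope cutoff, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def preemphasis(x, coeff=0.97):
    """Stage 1: first-order high-pass y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= coeff * x[:-1]
    return y

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, order=4, duration=0.05):
    """Impulse response of one gammatone filter, scaled to unit gain at fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    gain = np.abs(np.sum(ir * np.exp(-2j * np.pi * fc * t)))
    return ir / gain

def one_pole_lowpass(x, cutoff_hz, fs):
    """Simple first-order recursive low-pass filter."""
    a = np.exp(-2 * np.pi * cutoff_hz / fs)
    y = np.empty_like(x, dtype=float)
    state = 0.0
    for n, v in enumerate(x):
        state = (1 - a) * v + a * state
        y[n] = state
    return y

def envelope_channels(x, fs, center_freqs, env_cutoff_hz=1000.0):
    """Stages 1-3: preemphasis, gammatone filtering, envelope extraction."""
    x = preemphasis(x)
    channels = []
    for fc in center_freqs:
        band = np.convolve(x, gammatone_ir(fc, fs))[:len(x)]  # stage 2
        rectified = np.maximum(band, 0.0)                      # half-wave rectification
        channels.append(one_pole_lowpass(rectified, env_cutoff_hz, fs))
    return np.array(channels)

# Example: a 1 kHz tone should mainly excite the 1 kHz channel.
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
tone = np.sin(2 * np.pi * 1000 * t)
center_freqs = [500, 1000, 2000]
env = envelope_channels(tone, fs, center_freqs)
print(env.mean(axis=1))
```

A real front end would use many more channels and a recursive filterbank rather than FIR convolution; the sketch only shows the order of operations.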
The representation of speech and sounds after PEMO-processing:
- Stationary input signals are approximately log-compressed
- Changes in the input signal, such as onsets and offsets, are transformed linearly and thus emphasized
- Amplitude modulations between about 1 and 10 Hz are passed, others are suppressed
- The coding of the input signal is sparse and distinct
- See the
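The compression and onset behaviour described above can be reproduced with a chain of divisive feedback loops in the style of the Dau et al. (1996) model cited below. The following is a sketch under stated assumptions: the five time constants, the input floor, the 8 Hz cutoff of the final low-pass, and the function names are illustrative and not taken from this page. Each loop divides its input by a low-pass filtered copy of its own output, so a stationary input I settles at sqrt(I) per loop and at I^(1/32) after five loops, which is close to logarithmic.

```python
import numpy as np

def adaptation_loops(env, fs, taus=(0.005, 0.050, 0.129, 0.253, 0.500), floor=1e-5):
    """Stage 4 sketch: cascade of divisive feedback loops (assumed time constants in s)."""
    x = np.maximum(np.asarray(env, dtype=float), floor)
    # Start every loop in its steady state for the floor level.
    states = [floor ** (0.5 ** (k + 1)) for k in range(len(taus))]
    coeffs = [np.exp(-1.0 / (tau * fs)) for tau in taus]
    y = np.empty_like(x)
    for n in range(len(x)):
        v = x[n]
        for k in range(len(taus)):
            v = v / states[k]                                   # divide by loop state
            states[k] = coeffs[k] * states[k] + (1.0 - coeffs[k]) * v
        y[n] = v
    return y

def modulation_lowpass(x, fs, cutoff_hz=8.0):
    """Stage 5 sketch: first-order low-pass on the compressed envelope (assumed cutoff)."""
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty_like(x)
    state = 0.0
    for n in range(len(x)):
        state = (1.0 - a) * x[n] + a * state
        y[n] = state
    return y

# Example: envelope steps from 0.01 to 1.0 (a 40 dB increase) after 10 s.
fs = 1000                      # envelope sampling rate in Hz (assumed)
n_step = 10 * fs
env = np.concatenate([np.full(n_step, 0.01), np.full(n_step, 1.0)])
y = adaptation_loops(env, fs)
steady_low = y[n_step - 1]     # adapted response to the low level
onset = y[n_step]              # first sample after the level step
steady_high = y[-1]            # adapted response to the high level
smoothed = modulation_lowpass(y, fs)
print(steady_low, onset, steady_high)
```

In this sketch the two stationary levels, 40 dB apart, end up differing by a factor of only about 1.15, while the step itself appears at the output with its full factor of 100 relative to the adapted level. The final low-pass then limits how fast such changes can follow one another.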
Recognition experiments were performed with PEMO feature extraction.
The task was speaker-independent, isolated digit recognition in quiet and in noise.
The speech material was corrupted with different types of additive and
convolutive noise before feature extraction. Both HMM and neural networks were
used for recognition. Other front ends like MFCC or RASTA were tested for comparison.
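As an illustration of how test material of this kind can be prepared, the sketch below adds noise at a chosen signal-to-noise ratio and applies a convolutive distortion with an impulse response. The function names, the SNR value, the impulse response, and the synthetic stand-in for speech are all assumptions; the noise types and levels actually used in the experiments are not specified here.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Additive noise: scale the noise so the mixture has the requested SNR in dB.

    Assumes the noise recording is at least as long as the speech signal.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def apply_channel(speech, impulse_response):
    """Convolutive distortion: filter the signal with a channel impulse response."""
    speech = np.asarray(speech, dtype=float)
    return np.convolve(speech, impulse_response)[:len(speech)]

# Example with synthetic placeholder signals (not real speech).
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 200 * t) * (1.0 + np.sin(2 * np.pi * 4 * t))
noise = rng.standard_normal(len(speech))

noisy = add_noise_at_snr(speech, noise, snr_db=10.0)
measured_snr = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))

ir = np.array([1.0, 0.0, 0.5])          # hypothetical channel with one echo
filtered = apply_channel(speech, ir)
print(round(measured_snr, 3))
```

Both distortions are applied to the time signal, so the feature extraction sees the corrupted waveform exactly as a recognizer would in use.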
The results show
- Comparable or slightly worse performance in quiet
- Much better performance in both additive and convolutive noise, compared to MFCC
- About the same performance in noise, compared to adaptive JAH-RASTA, but without the
need to detect speech-free intervals for noise estimation
- Neural networks benefit more from PEMO processing than HMM recognizers do
Related Papers and Articles:
Tchorz, J., Kasper, K., Reininger, H. and Kollmeier, B.
On the Interplay between auditory-based features and locally recurrent neural
networks for robust speech recognition in noise
Eurospeech '97, p. 2075-2078, ESCA, Patras, Greece, 1997.
Download (postscript, 392k)
Tchorz, J., Wesselkamp, M. and Kollmeier, B.
Gehörgerechte Merkmalsextraktion zur robusten Spracherkennung in
Fortschritte der Akustik - DAGA 96, p. 532-533, DEGA, Oldenburg, 1996.
Download (postscript, 81k)
Dau, T., Püschel, D., and Kohlrausch, A.
A quantitative model of the "effective" signal processing in the auditory system: I. Model
structure
J. Acoust. Soc. Am., vol. 99, p. 3615-3622, 1996
Kasper, K., Reininger, H., and Wolf, D.
Exploiting the Potential of Auditory Preprocessing for Robust
Speech Recognition by Locally Recurrent Neural Networks
Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), vol. 2, p. 1223-1227, 1997
A more detailed description of the auditory model in the ASR system
and of the experimental setup
Download (postscript, 83k)
See also the publication list of our group.
Currently working on ASR with PEMO preprocessing: Michael Kleinschmidt
Jan. 28, 1998