

Title:
METHOD AND DEVICE FOR LOW-LATENCY AUDITORY MODEL-BASED SINGLE-CHANNEL SPEECH ENHANCEMENT
Document Type and Number:
WIPO Patent Application WO/2009/043066
Kind Code:
A1
Abstract:
The present invention relates to a method for enhancing wide-band speech audio signals in the presence of background noise and, more particularly to a noise suppression system, a noise suppression method and a noise suppression program. More specifically, the present invention relates to low-latency single-channel noise reduction using sub-band processing based on masking properties of the human auditory system.

Inventors:
OPITZ MARTIN (AT)
HOELDRICH ROBERT (AT)
ZOTTER FRANZ (AT)
NOISTERNIG MARKUS (FR)
Application Number:
PCT/AT2007/000466
Publication Date:
April 09, 2009
Filing Date:
October 02, 2007
Assignee:
AKG ACOUSTICS GMBH (AT)
OPITZ MARTIN (AT)
HOELDRICH ROBERT (AT)
ZOTTER FRANZ (AT)
NOISTERNIG MARKUS (FR)
International Classes:
G10L21/02; G10L21/0208; G10L21/0216; G10L21/0232; G10L21/0264
Domestic Patent References:
WO2006114100A1 (2006-11-02)
WO2002011125A1 (2002-02-07)
Foreign References:
EP1729287A1 (2006-12-06)
EP1600947A2 (2005-11-30)
Other References:
LIN L ET AL: "Speech denoising based on an auditory filterbank", SIGNAL PROCESSING, 2002 6TH INTERNATIONAL CONFERENCE ON AUG. 26-30, 2002, PISCATAWAY, NJ, USA,IEEE, vol. 1, 26 August 2002 (2002-08-26), pages 552 - 555, XP010628047, ISBN: 978-0-7803-7488-1
AMIR HUSSAIN ET AL: "Nonlinear Adaptive Speech Enhancement Inspired by Early Auditory Processing", NONLINEAR SPEECH MODELING AND APPLICATIONS, LECTURE NOTES IN COMPUTER SCIENCE; LECTURE NOTES IN ARTIFICIAL INTELLIGENCE; LNCS, SPRINGER-VERLAG, BE, vol. 3445, 1 January 2005 (2005-01-01), pages 291 - 316, XP019012533, ISBN: 978-3-540-27441-4
JOHNSON ET AL: "Speech signal enhancement through adaptive wavelet thresholding", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 49, no. 2, 15 February 2007 (2007-02-15), pages 123 - 133, XP005890520, ISSN: 0167-6393
KALLIRIS M G ET AL: "Broad-Band Acoustic Noise Reduction Using a Novel Frequency Depended Parametric Wiener Filter. Implementations using Filterbank, STFT and Wavelet Analysis/Synthesis Techniques.", AUDIO ENGINEERING SOCIETY (AES) CONVENTION, 12 May 2001 (2001-05-12) - 15 May 2001 (2001-05-15), Amsterdam, The Netherlands, pages 1 - 9, XP002499667
JAN SKOGLUND ET AL: "On Time-Frequency Masking in Voiced Speech", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 8, no. 4, 1 July 2000 (2000-07-01), XP011054031, ISSN: 1063-6676
Attorney, Agent or Firm:
BARGER, PISO & PARTNER (Wien, AT)
Claims:

CLAIMS

1. A method for suppressing noise in an input audio signal (y[n]) which comprises a wanted signal component (x[n]) and a noise signal component, the method comprising the steps of

- dividing the input audio signal (y[n]) into a plurality of frequency subbands (y_k[n]) by means of an analysis band splitter,

- suppressing noise in each of the subbands (y_k[n]) by a plurality of noise suppressing processors,

- recombining the plurality of subbands (y_k[n]) into an output signal (x[n]) by means of a synthesis filter, all steps being performed in the time domain.

2. Method according to claim 1, characterized in that the dividing of the input audio signal into a plurality of subbands by means of the analysis band splitter is performed according to human auditory loudness perception.

3. Method according to claim 2, characterized in that the analysis band splitter comprises a Gammatone filter bank (GFB), preferably a nonuniform Gammatone filter bank.

4. Method according to any of claims 1 to 3, characterized in that a pre-processor (H_OME) and post-processor (H_IOME) perform non-linear filtering of the input audio signal, comprising a. a pre-processing filter, which emulates the transfer behaviour of the human outer and middle ear, applied to the time-discrete noisy input audio signal, b. a post-processing filter applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.

5. Method according to any of claims 1 to 4, characterized in that each noise processor is comprised of a signal level detector (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.

6. Method according to claim 5, wherein said signal level detector (LD) exploits the phase of low-order filter sections to generate a quadrature signal and an in-phase signal out of the sub-band signal (y_k[n]) and sums up the squared amplitudes of these signals.

7. Method according to claim 5, wherein said noise estimator generates a sub-band noise value by performing smoothing based on Minimum Statistics; more particularly, weighted averaging of the previous noise value and the current input value with three different time constants is applied.

8. Method according to claim 5 or 6, wherein said auditory masking filter uses the signal power detected in each sub-channel to generate a temporal masking behaviour based on human auditory perception, more particularly non-linear weighted averaging of the previous signal value and the current sub-band input value is applied only on the falling slope depending on the level detected in each sub-band.

9. Method according to claims 1 to 8, wherein the update of the noise estimator depends on the current input value compared to time-varying, level-dependent thresholds, i.e. if the current input value is greater than a predetermined threshold value, the current input value is not considered to be noise and said noise estimator is not updated.

10. Method according to claims 1 to 9, wherein the noise suppression in each of the subbands is performed using the Ephraim and Malah noise suppression rule (EMSR).

11. Method according to claims 1 to 10, wherein the noise suppression in each of the subbands is performed using a decision directed approach (DDA).

12. Apparatus for suppressing noise in an input audio signal (y[n]) which comprises a wanted signal component (x[n]) and a noise signal component, the apparatus comprising

- an analysis band splitter for dividing the input audio signal (y[n]) into a plurality of frequency subbands (y_k[n]),

- a plurality of noise suppressing processors for suppressing noise in each of the subbands (y_k[n]),

- a synthesis filter for recombining the plurality of subbands (y_k[n]) into an output signal (x[n]), the analysis band splitter, noise suppressing processors and synthesis filter working in the time domain.

13. Apparatus according to claim 12, characterized in that a level detector (LD) is provided in each of the subbands.

14. Apparatus according to claim 13, characterized in that said signal level detector (LD) exploits the phase of low-order filter sections to generate a quadrature signal and an in-phase signal out of the sub-band signal (y_k[n]) and sums up the squared amplitudes of these signals.

15. Apparatus according to claim 14, characterized in that said quadrature signal is generated by an FIR first-order section provided in the level detector (LD).

16. Apparatus according to claim 14, characterized in that said quadrature signal is generated by an FIR first-order all-pass (AP) provided in the level detector (LD).

17. Apparatus according to claim 14, characterized in that said quadrature signal is generated by a delay line providing a λ/4 delay at the digital center frequency (θ_k).

18. Apparatus according to any of claims 12 to 17, characterized in that each noise processor is comprised of a signal level detector (LD), a noise estimator (NE), an auditory masking filter (PM) and a subtraction processor.

19. Apparatus according to any of claims 12 to 18, characterized in that the analysis band splitter comprises a Gammatone filter bank (GFB), preferably a nonuniform Gammatone filter bank.

20. Apparatus according to any of claims 12 to 19, characterized in that a pre-processor (H_OME) and post-processor (H_IOME) are provided for performing non-linear filtering of the input audio signal, comprising a. a pre-processing filter, which emulates the transfer behaviour of the human outer and middle ear, applied to the time-discrete noisy input audio signal, b. a post-processing filter applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.

Description:

METHOD AND DEVICE FOR LOW-LATENCY AUDITORY MODEL-BASED SINGLE- CHANNEL SPEECH ENHANCEMENT

FIELD OF THE INVENTION

The present invention relates to a method for enhancing wide-band speech audio signals in the presence of background noise and, more particularly to a noise suppression system, a noise suppression method and a noise suppression program. More specifically, the present invention relates to low-latency single-channel noise reduction using sub-band processing based on masking properties of the human auditory system.

BACKGROUND OF THE INVENTION

Additive background noise in speech communication systems degrades the subjective quality and intelligibility of the perceived voice. Therefore, speech processing systems require noise reduction methods, i.e. methods aiming at processing a noisy signal with the purpose of eliminating or attenuating the level of noise and improving the signal-to-noise-ratio (SNR) without affecting the speech and its characteristics. In general, noise reduction is also referred to as noise suppression or speech enhancement.

For example, mobile phones are often used in environments with high levels of background noise such as public spaces. The use of mobile phones, voice-controlled devices and communication systems in cars has created a great demand for hands-free in-car installations, with the objective of increasing safety and convenience; in many countries and regions the law prohibits e.g. hand-held telephony in cars. Noise reduction becomes important for these applications, as they often need to operate in adverse acoustic environments, in particular at low signal-to-noise ratios (SNR) and with highly time-varying noise signal characteristics (e.g. rolling noise of cars).

In room teleconferencing applications, such as video-conferencing or speech recognition and querying systems, ambient noise usually arises from fans of computers, printers, or facsimile machines, which can be considered as (long-term) stationary. Conversational noise, emerging from (telephone) talks of colleagues sharing the office, often referred to as babble noise, contains harmonic components and is therefore much harder to attenuate by a noise reduction unit.

However, applications within hearing aids and in-car speech communication systems require noise suppression methods, which can be performed in real-time.

In addition, the fast development of the underlying hardware in terms of computing power and storage capacity supports the progress of software implementations.

One of the most widely used methods for noise reduction in real-world applications is referred to in the art as spectral subtraction (see S. F. Boll, "Suppression of Acoustic Noise in Speech using Spectral Subtraction," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-27, pp. 113-120, Apr. 1979). Generally, spectral subtraction attempts to estimate the short time spectral amplitude (STSA) of clean speech from that of the noisy speech, i.e. the desired speech contaminated by noise, by subtracting an estimated noise signal. The estimated speech magnitude is combined with the phase of the noisy speech, based on the assumption that the human ear is insensitive to phase distortions (see D. L. Wang et al., "The unimportance of phase in speech enhancement," IEEE Trans. Acoust. Speech and Sig. Proc., vol. ASSP-30, pp. 679-681, Aug. 1982). In practice, spectral subtraction is implemented by multiplying the input signal spectrum with a gain function in order to suppress frequency components with low SNR. This SNR-based gain function is formed from estimates of the noise spectrum and the noisy speech spectrum, assuming wide-sense stationary, zero-mean random signals and the speech and noise signals to be uncorrelated. These conventional spectral subtraction methods provide significant noise reduction with the main disadvantage of a degradation of the signal quality, acoustically perceptible as "musical tones" or "musical noise". The musical tones emerge from spectrum estimation errors. In recent years many enhancements to the basic spectral subtraction approach have been developed.
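By way of illustration, the basic spectral subtraction scheme described above can be sketched as follows. This is a generic textbook-style rendering, not the implementation claimed by this application; the spectral floor value is an assumed example parameter.

```python
import numpy as np

def spectral_subtraction(noisy_spectrum, noise_psd, floor=1e-3):
    """Basic power spectral subtraction on one DFT frame.

    The estimated clean power is the noisy power minus the noise power
    estimate, clamped to a spectral floor; the resulting real-valued gain
    is applied to the complex spectrum, so the noisy phase is retained
    (the ear being largely insensitive to phase distortions).
    """
    noisy_power = np.abs(noisy_spectrum) ** 2
    clean_power = np.maximum(noisy_power - noise_psd, floor * noisy_power)
    gain = np.sqrt(clean_power / np.maximum(noisy_power, 1e-12))
    return gain * noisy_spectrum
```

The spectral floor prevents the gain from driving bins to zero, which is exactly where the fluctuating residual ("musical noise") would otherwise appear.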

A method to reduce musical tones which is often applied is to subtract an overestimate of the noise spectrum to reduce the fluctuations in the DFT coefficients and prevent the spectral components from going below a spectral floor (see M. Berouti et al., "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. on Acoust., Speech and Sig. Proc. (ICASSP'79), vol. 4, pp. 208-211, Washington D.C., Apr. 1979). This approach successfully reduces musical tones during low SNR conditions and noise-only periods. The main disadvantage is the distortion of the speech signal during voice activity. In practice a trade-off between speech quality level and residual noise floor level has to be found. Further methods cope with this problem by introducing optimal and adaptive oversubtraction factors for low SNR conditions and propose underestimation of the noise spectrum at high SNR conditions (see W. M. Kushner et al., "The effects of subtractive-type speech enhancement / noise reduction algorithms on parameter estimation for improved recognition and coding in high noise environments," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'89), vol. 1, pp. 211-214, 1989).

Applying a soft-decision based modification of the spectral gain function (see R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc, vol. 28, no. 2, pp. 137-145, 1980) has been shown to improve the noise suppression properties of the enhancement system in terms of musical tone suppression. These soft-decision approaches mainly depend on the a priori probability of speech absence in each spectral component of the noisy speech.

The minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA, see Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 32, no. 6, pp. 1109-1121, 1984) and the minimum mean-square error log spectral amplitude estimator (MMSE-LSA, Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log spectral amplitude estimator," IEEE Trans. Acoust. Speech and Sig. Proc., vol. 33, no. 2, pp. 443-445, 1985) minimize the mean squared error of the estimated short-time spectral or log spectral amplitude respectively. It was found that the nonlinear smoothing procedure of the MMSE-STSA/LSA methods (the so-called decision-directed approach) obtains a more consistent estimate of the SNR, resulting in good noise suppression without unpleasant musical tones (see O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech and Audio Proc., vol. 2, no. 2, pp. 345-349, 1994). Both Cappé and Malah (see D. Malah et al., "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments," in Proc. IEEE Int. Conf. Acoust., Speech and Sig. Proc. (ICASSP'99), vol. 2, pp. 789-792, 1999) propose a limitation of the a priori SNR estimate to overcome the problem of perceptible low-level musical noise during speech pauses. The so-called a priori SNR represents the information on the unknown spectrum magnitude gathered from previous frames and is evaluated in the decision-directed approach (DDA). As the smoothing performed by the DDA may have irregularities, low-level musical noise may occur. A simple solution to this problem consists in constraining the a priori SNR by a lower bound.
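One step of the decision-directed a priori SNR estimate with a lower bound, as described above, can be sketched per frequency bin as follows. The smoothing factor alpha and the bound xi_min are illustrative example values, not parameters of the invention.

```python
import numpy as np

def dd_a_priori_snr(noisy_mag, noise_var, prev_clean_mag,
                    alpha=0.98, xi_min=10 ** (-25 / 10)):
    """Decision-directed a priori SNR update for one bin and one frame.

    Weighted average of the SNR implied by the previous frame's clean-speech
    estimate and the (half-wave rectified) instantaneous a posteriori SNR
    minus one. The lower bound xi_min suppresses the low-level musical
    noise that irregular smoothing would otherwise produce in pauses.
    """
    gamma = noisy_mag ** 2 / noise_var  # a posteriori SNR
    xi = alpha * (prev_clean_mag ** 2 / noise_var) \
        + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
    return np.maximum(xi, xi_min)
```

With alpha close to one the estimate changes slowly, which is precisely the smoothing behaviour credited with removing musical tones.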

In single-channel spectral subtraction the noise power spectrum is usually estimated during speech pauses, requiring voice activity detection (VAD) methods (see R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 28, no. 2, pp. 137-145, 1980; and W. J. Hess, "A pitch-synchronous, digital feature extraction system for phonemic recognition of speech", in IEEE Trans. Acoust., Speech and Sig. Proc., vol. 24, no. 1, pp. 14-25, 1976). This approach implies stationary noise characteristics during periods of speech. Arslan et al. developed a robust noise estimation method that does not require voice activity detection by recursive averaging with level-dependent time constants in each subband (see L. Arslan et al., "New methods for adaptive noise suppression", in Proc. Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP-95), Detroit, May 1995). Martin proposes a noise estimation method, which is based on minimum statistics and optimal signal power spectral density (PSD) smoothing (see R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," in IEEE Trans. Speech and Audio Proc., vol. 9, no. 5, pp. 504-512, July 2001). Further, Ealey et al. present a method for estimating non-stationary noise throughout the duration of the speech utterance by making use of the harmonic structure of the voiced speech spectrum, also referred to as harmonic tunnelling (see D. Ealey et al., "Harmonic tunnelling: tracking non-stationary noises during speech," in Proc. Eurospeech, Aalborg, 2001). Further, as proposed by Sohn and Sung (see J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'98), vol. 1, pp. 365-368, 1998) using soft decision information, the noise spectrum is continuously adapted whether speech is present or not.

Ephraim and Van Trees propose another important method for noise reduction based on signal subspace decomposition (see Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement", in IEEE Trans. Speech and Audio Proc., vol. 3, pp. 251-266, July 1995). In doing so, the noisy signal is decomposed into a signal-plus-noise subspace and a noise subspace, where these two subspaces are orthogonal. This makes it possible to estimate the clean speech signal from the noisy speech signal. The resulting linear estimator is a general Wiener filter with adjustable noise level, to control the trade-off between signal distortion and residual noise, as they cannot be minimized simultaneously.

Skoglund and Kleijn point out the importance of the temporal masking property in connection with the excitation of voiced speech (see J. Skoglund and W. B. Kleijn, "On Time-Frequency Masking in Voiced Speech", in IEEE Trans. Speech and Audio Proc., vol. 8, no. 4, pp. 361-369, July 2000). It is shown that noise between the excitation impulses is more perceptible than noise close to the impulses, and this is especially so for low-pitch speech, for which the excitation impulses are temporally sparse. Temporal masking is not employed by conventional noise reduction methods using frequency domain MMSE estimators. Patent WO 2006/114100 discloses a signal subspace approach taking the temporal masking properties into account.

OBJECT AND SUMMARY OF THE INVENTION

The aim of the present invention consists in providing a single-channel auditory-model based noise suppression method with low-latency processing of wide-band speech signals in the presence of background noise. More specifically, the present invention is based on the method of spectral subtraction using a modified decision directed approach comprising oversubtraction and an adjustable noise level to avoid perceptible musical tones. Further, the present invention uses sub-band processing plus pre- and post-filtering to give consideration to temporal and simultaneous masking inherent to human auditory perception, in particular to minimize perceptible signal distortions during speech periods.

Frequency domain processing is accomplished for the proposed system by using a nonuniform Gammatone filter bank (GTF), which is divided into critical bands, also often referred to as Bark bands. This analysis filter bank separates the noisy signal into a plurality of overlapping narrow-band signals, considering spectral (simultaneous) masking properties of human auditory perception.
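The nonuniform band layout of such a perceptually motivated filter bank can be illustrated by spacing center frequencies uniformly on the ERB-rate scale (Glasberg and Moore). This is only an illustrative sketch of the band layout; the actual filter-bank design of the invention is not reproduced here, and the band count and edge frequencies are assumed example values.

```python
import numpy as np

def erb_center_frequencies(f_low, f_high, n_bands):
    """Center frequencies spaced uniformly on the ERB-rate scale.

    Low bands end up narrowly spaced and high bands widely spaced,
    approximating the critical-band resolution of the human ear that a
    nonuniform Gammatone analysis filter bank is built around.
    """
    def hz_to_erb(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def erb_to_hz(e):
        return (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    erbs = np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n_bands)
    return erb_to_hz(erbs)
```

For example, 24 bands between 100 Hz and 8 kHz yield spacings that grow monotonically with frequency.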

A pre-processor, which emulates the transfer behaviour of the human outer- and middle ear, is applied to the time-discrete noisy input signal (i.e. the desired speech contaminated by noise and interference).

In each sub-band, the level of the noisy signal is detected and smoothed. These narrow-band level detectors applied to each of the plurality of sub-bands utilize the phase of simple low-order filter sections to provide the lowest possible signal processing delay.
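One of the quadrature-generation options recited in the claims, the λ/4 delay line at the digital center frequency θ_k, can be sketched as follows. This is a simplified illustration under the assumption of a narrow-band sub-band signal, not the claimed detector itself.

```python
import numpy as np

def quadrature_envelope(subband, theta_k):
    """Low-latency squared-envelope estimate for one narrow sub-band.

    A delay of a quarter period at the digital center frequency theta_k
    yields an approximate 90-degree phase-shifted (quadrature) copy of the
    in-phase sub-band signal; summing the squared amplitudes of both then
    approximates the squared envelope sample by sample, without block
    processing and its associated frame delay.
    """
    quarter_period = max(1, int(round((np.pi / 2) / theta_k)))
    q = np.concatenate([np.zeros(quarter_period),
                        subband[:len(subband) - quarter_period]])
    return subband ** 2 + q ** 2
```

For a pure tone exactly at θ_k the result is the constant squared amplitude once the delay line has filled.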

From the smoothed envelope of the sub-band signals the noise level is estimated in each sub-band utilizing a heuristic approach based on recursive Minimum-Statistics.
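The noise-floor tracking described above can be illustrated with a deliberately simplified recursive tracker in the spirit of Minimum Statistics. The two time constants below stand in for the three level-dependent constants of the described estimator and are assumed example values.

```python
def minimum_statistics_noise(levels, alpha_up=0.999, alpha_down=0.90):
    """Simplified recursive noise-floor tracker for one sub-band.

    The estimate follows drops in the smoothed sub-band level quickly
    (fast time constant) but rises only slowly (slow time constant), so
    it settles near the minima of the level trajectory, i.e. near the
    noise floor observed during speech pauses.
    """
    est = levels[0]
    out = []
    for v in levels:
        a = alpha_up if v > est else alpha_down
        est = a * est + (1 - a) * v
        out.append(est)
    return out
```

During a loud speech burst the estimate barely rises, so speech energy is not mistaken for noise.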

The instantaneous signal-to-noise-ratio (SNR) in each sub-band is estimated from the envelope of the noisy signal and the noise level estimate.

The a priori SNR is estimated from the instantaneous SNR by applying the Ephraim-and-Malah Spectral Subtraction Rule (EMSR). In order to minimize the influence of estimation errors an improved decision directed approach (DDA) is proposed, introducing an underestimation parameter and a noise floor parameter.

Temporal masking based on human auditory perception is taken into account by appropriate filtering of the sub-band signals. These non-linear auditory post-masking filters apply recursive averaging to falling slopes of the signal level detected in each sub-band, with the following effects: (a) over-estimating variances of impulsive noise, (b) noise suppression algorithms do not affect the signal below the temporal masking threshold, and (c) no additional signal delay is introduced to transient signals, which is important in speech perception.
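The falling-slope-only recursive averaging described above can be sketched as a one-pole smoother that is bypassed on rising slopes. The decay coefficient is an assumed illustrative value.

```python
def post_masking_filter(levels, alpha=0.95):
    """Non-linear temporal post-masking sketch for one sub-band level.

    Rising slopes pass through unchanged, so attacks (transients) incur
    no added delay; falling slopes are smoothed recursively, stretching
    the decay in analogy to auditory post-masking.
    """
    out = []
    state = 0.0
    for v in levels:
        if v >= state:
            state = v  # rising slope: follow the level immediately
        else:
            state = alpha * state + (1 - alpha) * v  # falling: decay slowly
        out.append(state)
    return out
```

An impulse in the level trajectory thus produces an instant attack followed by an exponential tail.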

A non-linear gain function for each sub-band is derived from the a priori SNR estimates, comprising over-subtraction of the noise signal estimates.

The noisy signal in each sub-band is multiplied by the respective gain in order to suppress the noise signal components.

An optimized nearly perfect reconstruction filter-bank employing a decision criterion for signed summation re-synthesizes the enhanced full-band speech signal.

Finally, a post-processing filter is applied to the enhanced full-band signal to compensate the effect of the pre-processing filter.

NOTES: The noise reduction methods as cited above operate in the frequency domain using the Discrete Time Fourier Transform (DTFT), which is based on block processing of the time-discrete input signals. This block processing introduces a signal delay depending on the frame size.

Single channel subtractive-type speech enhancement systems are efficient in reducing background noise; however, they introduce a perceptually annoying residual noise. To deal with this problem, properties of the auditory system are introduced in the enhancement process. This phenomenon is modeled by the calculation of a noise-masking threshold in the frequency domain, below which all components are inaudible (see N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System", IEEE Trans. on Speech and Audio Proc., vol. 7, no. 2, pp. 126-137, March 1999).

To model auditory masking in subtractive-type speech enhancement systems, filter bank implementations are especially attractive as they can be adapted to the spectral and temporal resolution of the human ear. The authors propose a noise suppression method based on spectral subtraction combined with Gammatone filter (GTF) banks divided into critical bands. The concept of critical bands, which describes the resolution of the human auditory system, leads to a nonlinearly warped frequency scale, called the Bark Scale (see J. O. Smith III and J. S. Abel, "Bark and ERB Bilinear Transforms," IEEE Trans. on Speech and Audio Proc., vol. 7, no. 6, pp. 697-708, Nov. 1999).

The use of Gammatone filter banks outperforms the DTFT-based approaches in terms of computational complexity and overall system latency. Moreover, the GTF approach allows implementing a low-latency analysis-synthesis scheme with low computational complexity and nearly perfect reconstruction. The proposed synthesis filter creates the broadband output signal by a simple summation of the sub-band signals, introducing a criterion that indicates the necessity of sign alteration before summation. This approach outperforms channel vocoder based approaches as proposed e.g. by McAulay and Malpass (see R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter", IEEE Trans. on Acoust., Speech and Sig. Proc., vol. ASSP-28, no. 2, pp. 137-145, April 1980). Within that approach, full-band reconstruction of the output signal is performed by the summation of alternately out-of-phase sub-band signals without considering the real phase relations between subbands. This introduces higher distortions in the output signal.
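A simple stand-in for such a sign-alteration criterion might flip a band's sign whenever it is negatively correlated with the partial sum built so far. This is only an illustrative sketch; the actual decision criterion of the proposed synthesis filter is not reproduced here.

```python
import numpy as np

def signed_summation(subbands):
    """Resynthesize a full-band signal by signed summation of sub-bands.

    Each band is added with the sign that makes it constructively
    interfere (non-negative inner product) with the running sum, so
    adjacent bands that happen to be in anti-phase do not cancel.
    """
    total = np.array(subbands[0], dtype=float)
    for band in subbands[1:]:
        band = np.asarray(band, dtype=float)
        sign = 1.0 if np.dot(total, band) >= 0 else -1.0
        total += sign * band
    return total
```

Two identical bands and two anti-phase bands then resynthesize to the same output, instead of cancelling in the second case.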

Important note: Sub-band signals without downsampling, as often applied in hearing aid systems, do not require a resynthesis filter bank. Therefore this approach is applicable to low-latency speech enhancement systems, but it is computationally highly inefficient. The method proposed by the authors allows calculating the output signal from the sub-band signals by simple summation, taking the phase differences into account!

It is worth mentioning that there are many applications, such as hearing aids or in-car communication systems, where the computational complexity and signal latency are of utmost importance.

The main advantages of the present invention compared to conventional noise reduction approaches are the significant improvements concerning overall signal latency and computational efficiency.

The invention is not restricted to the following embodiment. It is merely intended to explain the inventive principle and to illustrate one possible implementation.

According to the invention, the method for low-latency auditory-model based single channel noise suppression and reduction works as an independent module and is intended for installation into a digital signal processing chain, wherein the software-specified algorithm is implemented on a commercially available digital signal processor (DSP), preferably a special DSP for audio applications.

NOTES: With the Ephraim-and-Malah Spectral Subtraction Rule (EMSR) the clean speech signal amplitude is estimated subject to the given amplitude of the noisy signal and the estimated noise variance. To avoid artifacts like musical noise, a modified decision directed approach (DDA) is applied, introducing over-subtraction (under-estimation) of the noise variance plus a noise floor parameter.
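The two modifications mentioned in this note can be illustrated with a Wiener-type gain per sub-band: the noise variance is over-weighted, and the gain is bounded below by a noise floor. The over-subtraction factor and floor value are assumed example parameters, and the Wiener form is a simplified stand-in for the full EMSR gain.

```python
import numpy as np

def subband_gain(xi, oversubtraction=1.5, noise_floor=0.05):
    """Sub-band suppression gain from the a priori SNR estimate xi.

    Over-weighting the noise term (oversubtraction > 1) under-estimates
    the speech and suppresses noise more aggressively, while the lower
    bound keeps the residual noise floor natural-sounding instead of
    letting isolated bins fluctuate audibly (musical noise).
    """
    g = xi / (xi + oversubtraction)  # Wiener-type gain with over-subtraction
    return np.maximum(g, noise_floor)
```

At high SNR the gain approaches one and the speech passes almost unchanged; at zero SNR it rests on the floor.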

BRIEF DESCRIPTION OF THE DRAWINGS

FIG 1 is a schematic illustration of the single-channel sub-band speech enhancement unit of the present invention.

FIG 2 is a schematic illustration of the non-linear calculation of the gain factor for noise suppression applied to each sub-band.

FIG 3 and 4 show the roof-shaped MMSE-SP attenuation surface dependent on the a posteriori (γ_k) and the a priori (ξ_k) SNR. To include all values 0 ≤ γ_k < ∞, the x-axis corresponds to γ_k and not (γ_k − 1) as in the literature. The dash-dotted line in Fig. 3 marks the transition between the partitions of the attenuation surface, the dashed line shows the power spectral subtraction contour. The contours of the DDA estimation are plotted in Fig. 4 upon the MMSE-SP attenuation surface. Dashed lines in Fig. 4 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships.

FIG 5 and 6 are illustrations of the combined (modified) DDA and MMSE-SP estimation behaviour. Dashed lines in Fig. 5 show the average of the dynamic relationships between γ_k and ξ_k, solid lines show static relationships. Fig. 6 shows two fictitious hysteresis loops matching the observations from informal experiments.

FIG 7 shows a block diagram of the overall system.

FIG 8 shows the overall system comprising auditory frequency analysis and resynthesis as front- and back-end, and using special low-latency and low-effort speech enhancement in between. A combination of an elaborate noise suppression law with a human auditory model enables high quality performance.

FIG 9 shows an outer- and middle ear filter composed of three second order sections (SOS).

FIG 10 shows an example: Three-Zero Gammatone filter of order 3. The common zero at z = 1 is not included in this figure.

FIG 11 shows a familiar way of level-detection. As the signal power is used, the squared amplitude is detected.

FIG 12 shows the low-latency FIR level detector.

FIG 13 shows a non-linear recursive auditory post-masking filter, responding to falling slopes.

FIG 14 shows a recursive noise level estimator using three time constants and a counter threshold.

DETAILED DESCRIPTION

In this description new aspects are brought forward concerning the Ephraim and Malah noise suppression rule (EMSR) and the decision directed approach (DDA) for a priori signal to noise ratio (SNR) estimation. After partitioning the domain of the amplitude estimator, it becomes clear that the combined DDA estimation obeys an unshaped hysteretic cycle. Introducing a hysteresis width parameter improves the hysteresis shape and reduces musical noise. Eventually, we obtain a more flexible noise suppressor with less dependency on the system sample rate.

I. INTRODUCTION

The Ephraim and Malah amplitude estimator and the Ephraim and Malah decision directed a priori SNR estimate (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, no. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984 and Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, no. 2, vol. ASSP-33, pp. 443-445, Apr. 1985) are a powerful means of noise suppression in speech signal processing. Actually there are quite a lot of recently published works on both issues, as the combined algorithm is a powerful tool on the one hand (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, no. 2, vol. 2, pp. 345-349, Apr. 1994), but on the other hand simplifications (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) as well as enhancements (I. Cohen and B. Berdugo, "Speech Enhancement for non-stationary noise environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001; I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004; I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443; M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, April 2004) are desirable.

In the amplitude estimation part of the algorithm a signal model is considered in which a noisy signal y[n] consists of speech x[n] and additive noise d[n], at time-index n. The signals x[n] and d[n] are assumed to be statistically independent Gaussian random variables. Due to certain properties of the Fourier transform, the same statistical model can be assumed for the corresponding complex short-term spectral amplitudes X_k[m] and D_k[m] in each frequency bin k, at analysis time m. (Underlined variables denote complex quantities here; therefore, in our notation, X_k[m] represents a complex variable. For simplicity of notation X_k[m] shall represent the magnitude |X_k[m]|.) Given the speech and noise variances σ²_x,k and σ²_d,k, the clean speech amplitude X_k[m] can be estimated from the noisy speech Y_k[m]. An eligible estimator X̂_k[m] for the clean speech amplitude is described in section I-A.

The unknown clean speech variance σ²_x,k is implicitly determined in the a priori SNR estimation part of the algorithm, whereas the noise variance σ²_d,k has to be determined in advance, e.g. using Minimum Statistics (R. Martin, "Noise Power Spectral Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing, nr. 5, vol. 9, pp. 504-512, Jul. 2001), MCRA (I. Cohen and B. Berdugo, "Speech Enhancement for Non-Stationary Noise Environments", Signal Processing, no. 11, pp. 2403-2418, Elsevier, Nov. 2001), or Harmonic Tunneling (D. Ealey, H. Kelleher, D. Pearce, "Harmonic Tunnelling: Tracking Non-Stationary Noises During Speech", Proc. Eurospeech, 2001).

The decision directed estimation described in section I-B determines the a priori SNR ξ_k = σ²_x,k / σ²_d,k for each frequency bin k. Additionally, the noise suppressor utilizes an instantaneous estimate, the so-called a posteriori SNR, which relates the square of the current noisy magnitude to the noise variance: γ_k[m] = Y_k²[m] / σ²_d,k.

In section II an overview of the combined estimation is given, and its hysteretic shape is presented. Furthermore, in section III it is shown how a slight modification can reduce unwanted estimation behaviour and enable a smoother estimation hysteresis.

A. The Ephraim and Malah Suppression Rule (EMSR)

As mentioned above, the EMSR reconstructs the magnitude of the clean speech signal X_k[m] from the noisy observation Y_k[m]. As magnitudes at different time-steps m are assumed to be statistically independent, the time index m may be dropped for simplicity of notation.

Ephraim and Malah's MMSE-SA estimator (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984) solves the Bayesian formula X̂_k = E{X_k | Y_k} to estimate the clean speech magnitude X_k. Applying different distortion measures to the amplitude, other estimators were derived in similar ways, i.e. the MMSE-LSA (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985), X̂_k = exp(E{ln X_k | Y_k}), and Wolfe and Godsill's MMSE-SP (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001), X̂_k = √(E{X_k² | Y_k}). For a more detailed description refer to Cohen (I. Cohen, "Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation", Center for Communication and Information Technologies, Israel Institute of Technology, Oct. 2003, CCIT Report no. 443).

According to Ephraim and Malah the noisy phase is an optimal estimate of the clean phase.

Thus the reconstruction operator is a real-valued spectral weight G[m]:

Because of its simplicity, we have chosen the Wolfe and Godsill MMSE-SP, Eq. (3), as the basis of our considerations. The corresponding weighting rule can be expressed as:

In order to simplify its application, we partition the reconstruction operator into a few regions:

Additionally, we can approximate the Wiener filter by

Combining both, we can divide the MMSE-SP surface into logarithmically flat partitions.

Note that in the following sections we use the short form G when we refer to G_MMSE-SP.

B. The Decision Directed Approach (DDA)

The DDA combines two basic SNR estimators into a new estimator of the a priori SNR ξ_k.

The first estimator is the instantaneous SNR, SNR_inst = (γ_k − 1) = Y_k²[m]/σ²_d,k − 1.

Allowing only positive SNR values we get

which can be calculated before noise reduction. This instantaneous SNR will differ from the true SNR in the following cases:

• when the analysis time-window is too short regarding the stationarity of the signals x[n] and d[n],

• when there is non-stationary noise that cannot be identified in detail, or

• when noise and speech signals are highly correlated.

The second estimator describes the reconstructed SNR, which is calculated after noise reduction using

In bad SNR conditions, e.g. 0 < γ_k < 2, the a posteriori SNR γ_k shows relative variations in time that are smaller than those of (γ_k − 1). (Relative variations, e.g. 10·log(γ_k[m]) − 10·log(γ_k[m−1]), are more significant than linear variations regarding human auditory perception.) Ideally, G provides a consistently high attenuation under low SNR conditions. Therefore, the reconstructed SNR_rec will take more consistent values than SNR_inst in the low SNR case. Eventually, the DDA for estimation of the a priori SNR combines both SNR_inst and SNR_rec:
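The combination of both estimators is the well-known decision-directed recursion. As a minimal illustrative sketch of the classic Ephraim and Malah form (function and variable names are ours, not taken from the document; the reconstructed SNR is G²·γ of the previous frame, the instantaneous SNR the half-wave rectified γ − 1):

```python
import numpy as np

def decision_directed_snr(gamma, gain_prev, gamma_prev, alpha=0.98):
    """Classic decision-directed a priori SNR estimate.

    gamma      : current a posteriori SNR  Y_k^2[m] / sigma_d^2
    gain_prev  : spectral gain G applied in the previous frame
    gamma_prev : previous a posteriori SNR

    The first term is the reconstructed SNR G^2 * gamma (the SNR of the
    previous frame's output), the second the half-wave rectified
    instantaneous SNR (gamma - 1).
    """
    snr_rec = gain_prev ** 2 * gamma_prev
    snr_inst = np.maximum(gamma - 1.0, 0.0)
    return alpha * snr_rec + (1.0 - alpha) * snr_inst
```

With α close to 1 the recursion leans on the reconstructed SNR, which is exactly why SNR_rec's consistency under low SNR conditions stabilizes the estimate.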

The specific estimation properties can be observed by inserting the suppression gain into the DDA.

II. COMBINING DDA AND EMSR

Using the partitions of the Wolfe and Godsill reconstruction operator G_MMSE-SP from section I-A and inserting them into the Ephraim and Malah DDA (7), the combined a priori SNR estimation exhibits the following spheres of action:

The characteristics of the combined approach can be seen in Fig. 4. Considering the magnitude of a speech signal and a constant noise level, i.e. a time-varying a posteriori SNR γ_k as input sequence, one can imagine a kind of hysteretic loop evolving on the MMSE-SP surface. Besides the obvious discontinuities in this loop, other properties can be shown (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, nr. 2, vol. 2, pp. 345-349, Apr. 1994).

A. Recursive Averaging

1) Expectation by Recursive Averaging: In the above enumeration we can see that in partition 1 the a priori SNR estimation corresponds to recursive averaging (Eq. (8)) of the instantaneous SNR_inst (5). It is feasible to generalize the averaging process by introducing a time-constant τ_avg specifying the averaging parameter α = exp[−1/(τ_avg · f_s)]. Here, the sample rate f_s = 1/T denotes the number of time-frequency transformations per second.
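The mapping from time-constant to averaging parameter can be sketched as follows (assuming, as stated above, that f_s counts time-frequency transformations per second):

```python
import math

def averaging_alpha(tau_avg, f_s):
    """Recursive-averaging parameter alpha = exp(-1 / (tau_avg * f_s))
    from a time-constant tau_avg [s] and the analysis rate f_s
    [transforms per second]."""
    return math.exp(-1.0 / (tau_avg * f_s))
```

For example, a time-constant of 2 ms at 500 transforms per second gives α = e⁻¹ ≈ 0.368; longer time-constants push α towards 1 and average more heavily.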

2) The Constant-ξ-Effect: If the a priori SNR ξ_k takes a constant value in partition 1, e.g. in case of a large time-constant τ_avg, or at the border of ξ_k's value range, the estimator could operate strangely. At a small and constant ξ_k, the system will hold its output magnitude at a constant level. This happens whenever the input is small enough, (Y_k²[m]/σ²_d,k − 1) ≪ 1/ξ_k, i.e. Y_k²[m] ≪ σ²_d,k/ξ_k (using (8) and its preconditions):

Under certain circumstances this can lead to annoying additional broad-band noise, which could even be worse when a limitation of ξ_k to a minimum ζ causes a constant output magnitude only for Y_k²[m] < σ²_d,k/ζ.

3) Unstable Recursive Averaging: Following Eq. (12), partition 5 can lead to a priori SNR estimation by unstable recursive averaging of SNR_inst when α > 1/2, i.e. ξ_k can increase suddenly in this partition.

B. Partitions Without Recursive Averaging

In the partitions 2, 3, and 4 the recursive averaging interpretation is not useful. Namely, in Eq. (9) the a priori SNR estimate ξ_k takes a constant value, and in Eq. (11) ξ_k is determined by a single tap delay. It seems odd that in Eq. (10) the SNR ξ_k is a down-scaled version of SNR_inst.

C. Summary of Properties

Actually, every partition except for 1 and 4 (Eqs. (8) and (11)) exhibits some unexpected behaviour. Defining α by a time-constant, we obtain generalized averaging properties in Eq. (8), whereas a sample rate dependent behaviour is introduced to the estimation defined by (9)-(12). This form of sample rate dependency rules out a general parameter set suitable for different analysis time-steps and transformation sizes.

Awkward estimation behaviour, e.g. the "constant-ξ-effect", and the discontinuities in the hysteresis loop (Fig. 4) give rise to considerations concerning a modification of the DDA and a reconsideration of the time-constant and minimum a priori SNR quantities.

III. A MODIFIED, FAST RESPONDING DDA

In order to minimize the influence of unexpected estimation performance, we modify the decision directed approach, with ζ being a noise-floor parameter (O. Cappé, "Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor", IEEE Transactions on Speech and Audio Processing, nr. 2, vol. 2, pp. 345-349, Apr. 1994) and p being an under-estimation parameter of the instantaneous SNR. Similar to the partitions in section II, we can now find:

Regarding the partitions of the new estimator, an over-all estimation scheme can be shown in Fig. 5. Instead of time-constants in the range of speech quasi-stationarity, we now use τ_avg = 2 ms. Choosing p = 10^(−15/10) ensures that the scale factor in (17) is approximately p(1 − α) ≈ p, which fixes the discontinuities of the estimation hysteresis. We can choose the noise-floor ζ = 10^(−25/10) so small that the maximum attenuation ζ lies at the bottom of the dynamic range of a frequency bin. These measures largely reduce the sample rate dependency described in section II-C and the "constant-ξ-effect" in section II-A.2.
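One plausible reading of the modified recursion, sketched for illustration (the exact published formula in Eqs. (13)-(17) is not reproduced in this text, so the form below, with the instantaneous SNR under-estimated by p and the result floored by ζ, is our assumption):

```python
import numpy as np

def modified_dda(gamma, gain_prev, gamma_prev,
                 alpha, p=10 ** (-15 / 10), zeta=10 ** (-25 / 10)):
    """Sketch of a modified decision-directed a priori SNR estimate:
    the instantaneous SNR term is scaled down by the under-estimation
    parameter p, and the noise-floor parameter zeta bounds the result
    from below.  Structure and names are our reading of the text, not
    a verbatim reproduction of the patent's equations."""
    snr_rec = gain_prev ** 2 * gamma_prev
    snr_inst = np.maximum(gamma - 1.0, 0.0)
    return np.maximum(alpha * snr_rec + (1.0 - alpha) * p * snr_inst, zeta)
```

Scaling SNR_inst by p widens the attenuation of inconsistently high instantaneous SNRs (musical noise), while the floor ζ pins the maximum attenuation.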

It becomes clear that rising instantaneous SNRs are now better attenuated according to Fig. 5 than in Fig. 4. Thus, a stronger attenuation of musical noise, i.e. inconsistently high instantaneous SNR, can be provided, while a signal with consistently high SNR is able to pass through the noise suppressor. The two curly loops in Fig. 6 give approximate examples of hysteresis loops occurring during system operation. In the recursive averaging partition the hysteresis path depends on the slope of the rising or falling signal amplitude.

The parameter p directly controls the suppression hysteresis width and thereby the musical noise suppression. Our modification enables separate control of the averaging time-constant and the musical noise suppression.

IV. CONCLUSION

We found a comprehensible way to graphically describe the properties of a combined Wolfe and Godsill spectral amplitude estimation and Ephraim and Malah decision directed a priori SNR estimation. This description can similarly be used for other amplitude estimation rules, and provides a new insight into the Ephraim and Malah noise suppressor.

So far the suppression of musical noise has been a trade-off against transient distortion. Small modifications in the decision directed estimation rule allow a more flexible handling of musical noise suppression, while reducing dependencies on the analysis time-step and the "constant-ξ-effect". An informal listening test using the modified algorithm with adjustable analysis time/frequency-resolution (filterbank approach) showed useful enhancements in the over-all algorithm.

Our further work will introduce our descriptive methods into the more elaborate estimation approaches of Cohen (I. Cohen, "Speech Enhancement Using a Noncausal A Priori SNR Estimator", IEEE Signal Processing Letters, no. 9, pp. 725-728, Sep. 2004) or Hasan (M. K. Hasan, S. Salahuddin, M. R. Khan, "A Modified A Priori SNR for Speech Enhancement Using Spectral Subtraction Rules", IEEE Signal Processing Letters, vol. 11, no. 4, pp. 450-453, Apr. 2004).

APPARATUS FOR LOW-LATENCY SINGLE CHANNEL SPEECH ENHANCEMENT

In the following a preferred embodiment will be described; however, the invention is not limited to this embodiment.

The reduction of musical noise in noise suppression algorithms is still an issue in noise reduction. Although the Ephraim and Malah suppression rule (EMSR) and the decision directed approach (DDA) show good performance, additional means have to be applied. Moreover, processing delays arising from signal analysis (fast Fourier transform, FFT) pose a problem in real-time applications. Essential improvements in both issues can be achieved by implementing signal analysis and filtering approaches capable of modelling human auditory perception while reducing latency.

V. INTRODUCTION

The major part of this description is dedicated to auditory signal preparation and analysis, using efficient algorithms with low latency. Our system combines an auditory Gammatone filterbank (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996; L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the Third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002) with the Ephraim and Malah noise suppression rule (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985; P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001). This combination is newly introduced by the authors, whereas the combination of an auditory Gammatone filterbank with a Wiener noise suppressor is known from (L. Lin,

E. Ambikairajah, "Speech Denoising Based on an Auditory Filterbank", 6th ICSP, International Conference on Signal Processing, pp. 552-555, 26-30 Aug. 2002), and a frequency domain solution is known from WO 00/30264 (International application No. PCT/SG99/00119). Furthermore, the integration of a time-domain outer and middle ear filter, as well as the integration of a non-linear temporal post-masking filter (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität" [PEAQ - the new ITU standard for objective measurement of perceived audio quality], RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890, pp. 81-120, Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999; L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the Third International Symposium DSPCS 2002, Sydney, Jan. 28-31, 2002) into the noise suppression system is new. Additionally, a narrow band low-latency level detection exploiting the phase of simple first order filters is newly introduced. Finally, we present a simple scheme for signal reconstruction (resynthesis) avoiding band-edge signal cancellation.

• Combination of an auditory Gammatone Filterbank and the EMSR noise suppressor in a time-domain approach

• Integration of outer and middle ear filters into the suppression system in a time-domain approach

• Integration of an auditory post-masking filter

• Low-latency narrow band level-detector

• Low-effort Wolfe and Godsill signal restoration

• Low-latency up-sampling

• Low-latency resynthesis restraining destructive interference

VI. SYSTEM OVERVIEW

The over-all system is shown in a block diagram in Fig. 7. It can be implemented as an analog or digital effect processor or as part of a software algorithm. Inside the over-all system there are several subsystems (Fig. 8):

• an outer and middle ear filter (HOME).

• a Gammatone filterbank analysis section (GFB),

• the low-latency level detection (LD),

• the auditory post-masking filter (PM),

• a recursive noise spectrum estimation (NE),

• the spectral subtraction weight (EMSR),

• the low-latency upsampling (L↑),

• the vocoder stage, and

• the inverse outer and middle ear filter (HOME⁻¹).

VII. OUTER AND MIDDLE EAR FILTER

An outer and middle ear filter consists of three second order sections (SOS) representing the following physiological parts of the human ear (E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999; E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):

1) the high-pass attenuation curve below 1 kHz modelling the 100-Phon curve, which represents the acoustic impedance of the outer ear and the mechanic impedance of the middle ear ossicles,

2) the resonance of the ear channel, and

3) the low-pass attenuation curve above 1 kHz modelling the threshold of hearing.

The latter two filters are optional, whereas the high-pass component is mandatory and reduces the influence of low-frequency noise on the noise suppressor.

In the end, a filter structure providing an appropriate magnitude transfer function could look like Fig. 9. All three filter sections have to be second order sections to provide appropriate slopes. The outer filter skirts can be modelled as second order low- and high-pass shelving filters, whereas the resonance can be modelled as a parametric peak-filter (P. Dutilleux, U. Zölzer, "DAFX", Wiley & Sons, 2002).

The filter inversion is straightforward. If there are zeros at e.g. z = 1 in the z-domain, the inverse filter cannot undo this exactly, so perhaps z = 0.99 could be a proper choice for a pole location inverting a z = 1 zero.

VIII. FREQUENCY GROUPING / AUDITORY BANDWIDTHS

Frequency grouping is an important effect in human loudness perception. The perceived loudness consists of particular loudnesses associated with individual frequency ranges. An auditory frequency scale can be used to model these frequency grouping effects; its units can be seen as the frequency resolution of human auditory loudness perception (E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999). We denote an arbitrary auditory frequency transform with the operator 𝔅{ } and the corresponding inverse frequency transform with 𝔅⁻¹{ }. A reasonable frequency scale using a low number of frequency groups is given by the formula of Traunmüller (E. Terhardt, "Akustische Kommunikation", Springer, Berlin Heidelberg, 1998):

Accordingly, the inverse transform 𝔅⁻¹{ } is

The center frequencies f_k of the auditory filterbank can be calculated by applying the inverse transform f_k = 𝔅⁻¹{ν_k} to an equally spaced scale ν_k (with spacing dν, e.g. dν = 1 [Bark]) in the Bark domain. Similarly the bandwidths B_k can be derived from B_k = 𝔅⁻¹{ν_k + dν/2} − 𝔅⁻¹{ν_k − dν/2}. Other Bark scales (e.g. E. Zwicker, H. Fastl, "Psychoacoustics, Facts and Models", Springer, Berlin Heidelberg, 1999) use smaller bandwidths resulting in auditory filters with more group delay, thus the above spacing is preferred.
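As a sketch of this layout, the Traunmüller approximation ν = 26.81·f/(1960 + f) − 0.53 [Bark] and its algebraic inverse can be used to place center frequencies and bandwidths on a 1-Bark grid (the grid placement ν_k = dν·(k + ½) below is our illustrative choice, not prescribed by the text):

```python
def bark(f):
    """Traunmueller's Bark-scale approximation for frequency f in Hz."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_inv(nu):
    """Exact algebraic inverse of the Traunmueller formula."""
    return 1960.0 * (nu + 0.53) / (26.28 - nu)

def filterbank_layout(n_bands=20, dnu=1.0):
    """Center frequencies f_k and bandwidths B_k of an auditory
    filterbank on an equally spaced Bark grid nu_k = dnu * (k + 0.5)."""
    centers, bandwidths = [], []
    for k in range(n_bands):
        nu = dnu * (k + 0.5)
        centers.append(bark_inv(nu))
        # B_k = inv(nu_k + dnu/2) - inv(nu_k - dnu/2), as in the text
        bandwidths.append(bark_inv(nu + dnu / 2) - bark_inv(nu - dnu / 2))
    return centers, bandwidths
```

With 20 bands and 1-Bark spacing this covers the wide-band speech range; the bandwidths grow monotonically with frequency, as expected of an auditory scale.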

Note that we use ν for the Bark frequency instead of z in order to avoid confusion with the z-domain variable z.

IX. AUDITORY GAMMATONE FILTERS

Auditory Gammatone filters (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996) can be efficiently implemented in the time-domain, allowing the separation of a broadband audio signal into auditory band signals. The magnitude response of the Gammatone filter corresponds to the simultaneous masking properties of the human ear. Plotting the magnitude of this filter along an auditory frequency scale, the filter shape remains the same, whatever center frequency the filter is designed to have. The arbitrary form representing the family of Gammatone filters of order m is shown below, wherein k is the filterbank channel index. A corresponding z-transform, wherein *GF denotes an arbitrary Gammatone filter (e.g. GF, APGF, OZGF, TZGF), is:

Digital center frequencies θ_k and pole radii r_k are derived from the continuous-time quantities center frequency f_k, bandwidth B_k, the band-edge rejection C_dB (e.g. C_dB = −5 [dB]), and the sample rate f_s:

An auditory Gammatone filterbank consists of a set of overlapping Gammatone filters that divide the auditory frequency scale into equally spaced frequency bands. An order m = 4 is frequently used in the literature, whereas the order m = 3 is proposed to minimize computational cost. The term g_*GF shall be adjusted so that unity gain at the center frequency f_k can be provided. For a special form of Gammatone filter the system H_num,k(z) has to be adapted suitably as shown in the following sub-sections.

A. Ordinary Gammatone filter

The ordinary Gammatone filter (GF; R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996) has to be derived from the continuous-time impulse response using the Laplace and impulse-invariance transforms (A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999), which determines the unknown polynomial H_num,k(z) in the above equation (21). Due to its shape and computational cost its use is not recommended.

B. All-Pole Gammatone filter

An All-Pole Gammatone filter (APGF) is obtained by simply setting H_num,k(z) = 1 in equation (21). It is the most efficient Gammatone filter (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996).

C. One-Zero Gammatone filter

Setting H_num,k(z) = (1 − z⁻¹) in equation (21) leads to the so-called One-Zero Gammatone filter (R. F. Lyon, "The All-Pole Gammatone Filter and Auditory Models", Proc. Forum Acusticum, Antwerpen 1996). The One-Zero Gammatone (OZGF) can be efficiently composed of a common "One-Zero" for all channels k before splitting up into k All-Pole Gammatone filters.

D. Three-Zero Gammatone filter

When adding a pair of complex conjugate zeros z = r_z · e^(±jθ_z,k) with the digital frequency θ_z,k at 1 Bark above the center frequency θ_k, with radius r_z ≈ 0.98, and one additional zero at z = 1, we obtain H_num,k(z) = (1 − 2 r_z cos(θ_z,k) z⁻¹ + r_z² z⁻²) · (1 − z⁻¹) for the Three-Zero Gammatone (TZGF) filter with its improved shape (L. Lin, E. Ambikairajah, W. H. Holmes, "Auditory Filterbank Design Using Masking Curves", Proc. EUROSPEECH Scandinavia, 7th European Conference on Speech Communication and Technology, 2001). In comparison, the computational cost of the One-Zero Gammatone filter of order m + 1 equals the cost of the Three-Zero Gammatone filter of order m when, again, a single "One-Zero" is shared by all channels k. Appropriate transforms and the digital frequency calculation θ_z,k follow from equations (19), (20) and (22).
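The TZGF numerator can be assembled by multiplying the two zero factors; a minimal sketch (our reading of the garbled original gives the conjugate-pair constant term as r_z², consistent with zeros at radius r_z):

```python
import numpy as np

def tzgf_numerator(theta_z, r_z=0.98):
    """Numerator coefficients of the Three-Zero Gammatone filter:
    a conjugate zero pair at r_z * exp(+-j*theta_z), placed about
    1 Bark above the channel center frequency, times one additional
    zero at z = 1."""
    pair = np.array([1.0, -2.0 * r_z * np.cos(theta_z), r_z ** 2])
    one_zero = np.array([1.0, -1.0])       # the (1 - z^-1) factor
    return np.polymul(pair, one_zero)      # coefficients in powers of z^-1
```

The (1 − z⁻¹) factor guarantees a numerator zero at z = 1 (DC), regardless of θ_z and r_z.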

X. RESYNTHESIS

A resynthesis of a broadband signal from the auditory band signals can be implemented as an addition of all signal bands. Unfortunately this can cause destructive signal cancellation at the overlap between neighbouring signal channels. Therefore we derived a simple criterion that indicates the necessity of a sign alteration for every second channel before signal summation:

Using this formula, the frequency response of the superposition of all signals lies in the range of approximately C_dB + 3 [dB] and 0 [dB]. Omitting a necessary sign alteration can result in destructive signal cancellation at the band-edges of adjacent filters.
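The summation with alternating signs can be sketched as follows (whether the flip is actually needed follows from the criterion referenced in the text; here it is reduced to a plain flag for illustration):

```python
import numpy as np

def resynthesize(band_signals, alternate_sign=True):
    """Sum auditory band signals back to one broadband signal,
    optionally flipping the sign of every second channel to avoid
    destructive cancellation at the band edges of adjacent filters."""
    out = np.zeros_like(np.asarray(band_signals[0], dtype=float))
    for k, band in enumerate(band_signals):
        sign = -1.0 if (alternate_sign and k % 2 == 1) else 1.0
        out += sign * np.asarray(band, dtype=float)
    return out
```

If two neighbouring channels carry anti-phase copies of the overlap region, plain summation cancels them while the alternating-sign summation preserves them.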

XI. (LOW-LATENCY) LEVEL DETECTION

Masking effects modelled by the auditory filterbank cannot be exploited unless the amplitude of the filterbank channels is determined. Suitable ways of level detection are proposed in the following sub-sections.

We propose to use the first, simple approach for the high-frequency channels, and the low-latency approach for the low-frequency bands.

A. Ordinary Level-Detection with pre-masking

Usually non-linearities such as the absolute value, square, or half-wave rectification are used to transform the signal amplitude into the base band around 0 Hz. A smoothing filter then removes components at higher frequencies, and in the end the desired amplitude signal is found. Fig. 11 provides an example, which also takes the form-factor F into account.

The commonly used approach of amplitude detection is computationally efficient, but smoothing filters introduce group delays in the signal path that have to be compensated for. We recommend describing the recursive smoothing parameter α by a time-constant τ_avg in [s]:

Suitable time-constants match the auditory pre-masking time-constant, which is approximately τ_avg ≈ 2 [ms] (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890, pp. 81-120, Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999).

B. Low-Latency Level detection

Our new method exploits the phase of simple filter sections. This method for level detection is also applicable to other technical fields and not restricted to noise suppression alone.

Using a Hilbert transform, a consistent 90° phase shift can be applied to a broad band signal. Summing the squares of the original and the shifted signal, the squared amplitude (i.e. signal power) remains while the sinusoidal components cancel. However, a causal implementation of the Hilbert transform does not exist.

Unlike an ideal Hilbert transformer, we only need a 90° phase shift in the considered frequency range, i.e. in the corresponding auditory frequency group.

We propose to use the following kinds of filters to provide a 90° phase shift at a frequency θ k :

• a simple FIR first order section,

• a simple IIR first order all-pass (AP), and

• a simple delay line providing a λ/4 delay at θ_k.

Each of the above mentioned methods can provide a 90° phase shift to a virtually arbitrary frequency θ k and is therefore suitable. One can choose between the following properties:

• FIR: numerically unstable around θ_k ∈ {0, π/2, π}, but providing the broadest frequency band featuring a 90° phase shift.

• AP: numerically unstable around θ_k ∈ {0, π/2, π}; the 90° phase frequency band is smaller and the computational effort higher.

• λ/4-delay: numerically stable; the smallest frequency band of 90° phase, low computational effort, but more memory needed.

Fig. 12 provides an example for the FIR level detection method. Appropriate parameters can be found using the phase-equations for the corresponding systems, e.g. A. V. Oppenheim, R. W. Schafer, J. R. Buck, "Discrete-Time Signal Processing", Prentice Hall, 1999.
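The λ/4-delay variant is the simplest of the three to sketch (the delay length and rounding are our illustrative choices; the document itself derives the parameters from the filters' phase equations):

```python
import numpy as np

def quadrature_level(x, theta_k):
    """Low-latency level detection via a quarter-period (lambda/4) delay:
    delay the narrow-band signal by ~90 degrees at digital frequency
    theta_k and sum the squares of the original and delayed signals.
    For a sinusoid at theta_k this yields the squared amplitude without
    the group delay of a smoothing filter."""
    d = max(1, int(round((np.pi / 2) / theta_k)))  # quarter period in samples
    shifted = np.concatenate([np.zeros(d), x[:-d]])
    return x ** 2 + shifted ** 2
```

For a unit sinusoid at θ_k = π/8 the delay is exactly 4 samples, and sin² + cos² yields a flat power estimate of 1 after the delay line has filled, with no smoothing lag on the envelope.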

XII. AUDITORY POST-MASKING

Using a non-linear post-masking filter (i.e. recursive averaging only responding to a falling slope) exhibits several benefits:

• impulsive noise variance is slightly over-estimated (over-subtraction) because of the post- masking.

• noise suppression algorithms cannot attenuate signals until the auditory post masking time has elapsed.

• aliasing effects after downsampling or ripples in the amplitude signals are reduced due to the post-masking smoothing operation.

• though smoothing is applied, no group delay is introduced to the amplitude of important transient signals.

We propose a structure that works on the signal power detected in each channel (cf. Fig. 13, L. Lin, E. Ambikairajah, W. H. Holmes, "Perceptual Domain Based Speech and Audio Coder", Proc. of the third International Symposion DSPCS 2002, Sydney, Jan. 28-31, 2002).
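A minimal sketch of such a non-linear smoother on a channel's power signal (the recursion below, which follows rising inputs instantly and averages only on falling slopes, is our illustrative reading of the structure in Fig. 13):

```python
import numpy as np

def post_masking(power, alpha_k):
    """Non-linear post-masking smoother: the output follows a rising
    input instantly (no group delay on transients) and decays with the
    recursive averaging parameter alpha_k on falling slopes."""
    y = np.empty(len(power))
    prev = 0.0
    for n, p in enumerate(power):
        decayed = alpha_k * prev + (1.0 - alpha_k) * p  # falling-slope average
        prev = p if p > decayed else decayed            # rising input passes through
        y[n] = prev
    return y
```

An impulse at the input thus leaves an exponentially decaying tail, mimicking post-masking, while the onset itself is not smeared.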

The averaging parameter α_k in channel k has to correspond to the human auditory post-masking time-constants at the corresponding frequencies f_k. Therefore, we use the following equation to derive the averaging parameter α:

A parameter G can be used to scale the post-masking time-constants if useful.

The time-constant for 1 [Bark] is approximately τ_ν=1 ≈ 40 [ms], and for 20 [Bark] approximately τ_ν=20 ≈ 4 [ms] (G. Stoll, J. G. Beerends, R. Bitto, K. Brandenburg, C. Colomes, B. Feiten, M. Keyhl, C. Schmidmer, T. Sporer, T. Thiede, W. C. Treurniet, "PEAQ - der neue ITU-Standard zur objektiven Messung der wahrgenommenen Audioqualität", RTM - Rundfunktechnische Mitteilungen, die Fachzeitschrift für Hörfunk und Fernsehtechnik, 43. Jahrgang, ISSN 0035-9890, pp. 81-120, Firma Mensing GmbH + Co. KG, Abteilung Verlag, Sept. 1999). The following equation can be used to derive τ_k:

Alternatively, the equation in the above cited reference can be used, but our formula provides a suitable interpolation with longer time-constants.

XIII. RECURSIVE MINIMUM STATISTICS

We can use the structure in Fig. 14 to estimate the noise level in each frequency band. Similar approaches can be found in R. Martin, "Noise Power Spectral Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing, nr. 5, vol. 9, pp. 504-512, Jul. 2001, or WO 00/30264 (International application No. PCT/SG99/00119).

This method essentially applies three time-constants of averaging to the signal level. Falling slopes are slightly averaged, whereas during a rising input slope the output is held constant (i.e. infinitely large time-constant) for a period of N_w sampling intervals. When N_w sampling intervals are exceeded, the rising signal slope is averaged with a third time-constant. The time-constants can be converted to recursive averaging parameters similarly to equations (25) and (26).
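The three-regime tracker can be sketched as follows (structure and parameter names are our reading of Fig. 14 and the surrounding text, with the default values quoted below):

```python
import math

def recursive_min_stats(levels, f_s, tau_fall=0.2, tau_rise=0.7, t_w=1.5):
    """Recursive minimum-statistics noise tracking: falling input is
    lightly averaged (tau_fall), a rising input is held constant for
    N_w samples, and only after that is it tracked upward with the
    slow tau_rise time-constant."""
    a_fall = math.exp(-1.0 / (tau_fall * f_s))
    a_rise = math.exp(-1.0 / (tau_rise * f_s))
    n_w = int(round(t_w * f_s))               # hold-counter threshold
    est, counter, out = 0.0, 0, []
    for x in levels:
        if x <= est:                          # falling slope: track down
            est = a_fall * est + (1.0 - a_fall) * x
            counter = 0
        elif counter < n_w:                   # rising slope: hold output
            counter += 1
        else:                                 # held long enough: track up slowly
            est = a_rise * est + (1.0 - a_rise) * x
        out.append(est)
    return out
```

The hold interval T_w ≈ 1.5 s makes the tracker ignore speech utterances riding on the noise floor, while sustained level increases (i.e. actual noise changes) are eventually followed.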

An appropriate counter threshold N_w can be calculated using a continuous time interval T_w:

Suitable to utterances or words of human speech, this time interval can be chosen as e.g. T_w ≈ 1.5 s. The falling slope time-constant can be a scaled version of the post-masking time-constants τ_k, or e.g. a constant 200 [ms].

The rising slope time-constant defining β can be approximately 700 [ms], which corresponds to a velocity of approximately 6 [dB]/[s]. Unlike the other time-constants, this one is proposed to be equal for all channels k.

The saturation operation in Fig. 14 can be expressed as:

XIV. EPHRAIM AND MALAH NOISE SUPPRESSION RULE (EMSR)

With the EMSR (Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984; Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 2, vol. ASSP-33, pp. 443-445, Apr. 1985) we can estimate the clean speech amplitude subject to the given noisy speech amplitude and the noise variance. We can e.g. use the Wolfe and Godsill definition of the spectral weight (P. J. Wolfe and S. J. Godsill, "Simple Alternatives to the Ephraim and Malah Suppression Rule for Speech Enhancement", Proc. 11th IEEE Signal Processing Workshop, pp. 496-499, 6-8 Aug. 2001) and a modified decision directed approach (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005; first manuscript submitted Jan. 24, 2005).

The following relations are involved in the above equation:

The noise variance is given by the noise estimation algorithm; m and n are time indices, f_s is the system sample rate and L a down-sampling factor.

According to Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, nr. 6, vol. ASSP-32, pp. 1109-1121, Dec. 1984, γ_k[m] is the a posteriori SNR, and ξ_k[m] is the a priori SNR. G_w,k[m] is the spectral weight of a Wiener filter, and α is an averaging parameter defined by an averaging time-constant τ_snr,k, which is either approximately 2 [ms] (F. Zotter, M. Noisternig, R. Höldrich, "Speech Enhancement Using the Ephraim and Malah Suppression Rule and Decision Directed Approach: A Hysteretic Process", to appear in IEEE Signal Processing Letters, 2005; first manuscript submitted Jan. 24, 2005) or derived from the auditory post-masking time-constants.

The "over-subtraction factor" p (cf. Zotter et al.) can be chosen as p = 10^(−15/10), and the noise-floor parameter ζ can be ζ = 10^(−40/10).

XV. LOW-LATENCY UP-SAMPLING

Usually up-sampling requires either a processing delay or a group delay due to the interpolation operation involved. Such a delay is approximately L samples long, with L the up-sampling factor.

We propose a special method for up-sampling that introduces no additional delays. This can be done if the signal is divided into buffers (preferably with the buffer size of the ADC and DAC).

When in every signal block the last sample of the preceding block is given, it is possible to linearly interpolate to the following given sample instantaneously. Therefore, the last sample in every block must correspond to a sampling instant at the lower sampling rate.
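As a minimal sketch of this per-block linear interpolation (function and argument names are ours): each low-rate sample is reached by L interpolation steps from the previous one, so no sample beyond the current block is ever needed.

```python
import numpy as np

def upsample_block(block, last_prev, L):
    """Zero-extra-delay up-sampling by linear interpolation.

    block     : samples of the current block at the low rate
    last_prev : last (low-rate) sample of the preceding block
    L         : up-sampling factor

    The last output sample coincides with the last low-rate sample of
    the block, i.e. with a sampling instant of the lower rate."""
    out, prev = [], last_prev
    for s in block:
        for i in range(1, L + 1):            # L interpolated steps up to s
            out.append(prev + (s - prev) * i / L)
        prev = s
    return np.array(out)
```

Since every interpolated value only looks backwards to `last_prev`, the scheme is causal within the block and adds neither processing nor group delay.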

XVI. CONCLUSIONS

While frequency domain solutions using equivalent auditory models require delays in the range of 10 milliseconds, the implementation of our system with 20 frequency bands and the third order TZGF has a mean latency of 3.5 to 4 milliseconds. The required computational cost is approximately 8.9 MIPS at f_s = 16 [kHz], which is only slightly more than DFT solutions need (7 MIPS). We also apply a slightly modified Ephraim and Malah suppression rule (EMSR) using the simplified Wolfe and Godsill formula and the modified decision directed approach.

The disclosure of all cited publications is included in its entirety into this description.