VOICE ACTIVITY DETECTION AND END-POINT DETECTION - MULTIMEDIA TECHNOLOGIES INST M

Title:

VOICE ACTIVITY DETECTION AND END-POINT DETECTION

Document Type and Number:

WIPO Patent Application WO/2001/086633

Kind Code:

A1

Abstract:

This invention provides a method for detection of voice activity, or VAD method, in a voice signal, particularly in telephonic applications, comprising: a first step aimed at acquiring the voice signal (1) divided into segments or frames having a time duration d; a second step aimed at computing, for each frame, at least three of the following five parameters: the energy differential over the whole band, ΔEf, the energy differential over the band 0-1 kHz, ΔEl, the zero crossing rate differential, ΔZCR, the second cepstral coefficient, c2, and the fifth cepstral coefficient, c5; and a third step in which a neural network process is carried out in order to provide, based upon at least three of said five parameters, for each frame, an output value Y in the range defined by a minimum value Ymin and by a maximum value Ymax, where Ymin < Ymax. The invention also provides a VAD apparatus to perform said VAD method, a method for segmentation of isolated words, or EPD method, including the steps of said VAD method, as well as an EPD apparatus related thereto.

Inventors:

BERITELLI FRANCESCO (IT)

Application Number:

PCT/IT2001/000221

Publication Date:

November 15, 2001

Filing Date:

May 08, 2001

Assignee:

MULTIMEDIA TECHNOLOGIES INST M (IT)
BERITELLI FRANCESCO (IT)

International Classes:

G10L25/78; G10L25/87; G10L25/30; (IPC1-7): G10L11/02

Other References:

BERITELLI F: "A robust endpoint detector based on differential parameters and fuzzy pattern recognition", PROCEEDINGS OF ICSP'98: FOURTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, BEIJING, CHINA, vol. 1, 12 October 1998 (1998-10-12) - 16 October 1998 (1998-10-16), IEEE, Piscataway, NJ, USA, pages 601 - 604, XP002173614, ISBN: 0-7803-4325-5
GHISELLI-CRIPPA T ET AL: "A FAST NEURAL NET TRAINING ALGORITHM AND ITS APPLICATION TO VOICED-UNVOICED-SILENCE CLASSIFICATION OF SPEECH", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH & SIGNAL PROCESSING (ICASSP '91), 14 May 1991 (1991-05-14), IEEE, New York, NY, USA, pages 441 - 444, XP000245262, ISBN: 0-7803-0003-3

Attorney, Agent or Firm:

Iannone, Carlo Luigi (26 Roma, IT)

Claims:

1.	A method for detection of voice activity, or VAD method, in a voice signal, particularly in telephonic applications, comprising: a first step aimed at acquiring the voice signal (1) divided into segments or frames having a time duration d; a second step aimed at computing, for each frame, at least three of the following five parameters: the energy differential over the whole band, ΔEf; the energy differential over the band 0-1 kHz, ΔEl; the zero crossing rate differential, ΔZCR; the second cepstral coefficient, c2; and the fifth cepstral coefficient, c5; and a third step in which a neural network process is carried out in order to provide, based upon at least three of said five parameters, for each frame, an output value Y in the range defined by a minimum value Ymin and by a maximum value Ymax, where Ymin < Ymax.

2.	A VAD method according to claim 1, characterised in that Ymin is equal to 0 (zero) and Ymax is equal to 1 (one).

3.	A VAD method according to claim 1 or 2, characterised in that Ymin corresponds to a silence frame and Ymax corresponds to a voice activity frame.

4.	A VAD method according to any one of the preceding claims, characterised in that the time duration d of the frames is not longer than 40 milliseconds (ms).

5.	A VAD method according to claim 4, characterised in that the time duration d of the frames is in the range of 10 to 20 ms.

6.	A VAD method according to claim 5, characterised in that the time duration d of the frames is equal to 10 ms.

7.	A VAD method according to any one of the preceding claims, characterised in that the three parameters of the energy differential over the whole band, ΔEf, the energy differential over the band 0-1 kHz, ΔEl, and the zero crossing rate differential, ΔZCR, are computed, for each frame, in the second computation step, and in that said third neural network process step is based upon said three parameters ΔEf, ΔEl and ΔZCR.

8.	A VAD method according to claim 7, characterised in that said neural network includes a perceptron having three inputs, an output and nine nodes in an intermediate stage.

9.	A VAD method according to claim 8, characterised in that the relationship that links the intermediate output $y_i$ of each one of said nine intermediate nodes to said three inputs is $y_i = f(\alpha_i)$, where $\alpha_i = w_{i1}\,\Delta E_f + w_{i2}\,\Delta E_l + w_{i3}\,\Delta ZCR + b_i$, and in that the relationship that links the output $Y$ to the outputs $y_i$ of said nine intermediate nodes is $Y = f(\alpha)$, where $f(a) = \frac{1}{1 + e^{-a}}$ and $\alpha = W_1 y_1 + \ldots + W_9 y_9 + B$.

10.	A VAD method according to any one of the preceding claims, characterised in that the training operation of said neural network is carried out by means of the "Delta Learning Rule" algorithm.

11.	A VAD method according to any one of the preceding claims, characterised in that said neural network is trained by means of a clean or noiseless voice signal.

12.	A VAD method according to any one of claims 1 to 10, characterised in that said neural network is trained by means of a clean or noiseless voice signal and with eight further audio signals obtained by adding babble noise, car noise, traffic noise and white noise, respectively, to said clean signal, with a signal to noise ratio or SNR equal to 20 dB and 10 dB.

13.	A VAD method according to any one of claims 1 to 10, characterised in that said neural network is trained by means of a clean or noiseless voice signal and with twelve further audio signals obtained by adding babble noise, car noise, traffic noise and white noise, respectively, to said clean signal, with an SNR equal to 20 dB, 10 dB and 0 dB.

14.	A VAD method according to claim 9, characterised in that said neural network provides:

the following matrix w (9x3) of the interconnection weights wij between the three inputs and the outputs yi of the nine intermediate nodes:

wij     j=1     j=2     j=3
i=1     0.62    0.47    4.55
i=2     1.10    0.38    4.33
i=3     0.52    0.69    6.42
i=4     4.36    5.55    2.00
i=5     0.37    0.29    3.57
i=6     2.17    2.64    7.23
i=7     2.34    3.00    4.00
i=8     0.02    0.01    7.60
i=9     1.74    2.40    4.38

the following vector b (9x1) of the additive constants, or bias, of the outputs yi of said nine intermediate nodes:

        bi
i=1     4.61
i=2     9.42
i=3     0.36
i=4     10.54
i=5     10.69
i=6     4.19
i=7     7.81
i=8     5.78
i=9     2.53

the following vector W (1x9) of the interconnection weights Wi between the output Y of the neural network and the outputs yi of the nine intermediate nodes:

        i=1     i=2     i=3     i=4     i=5     i=6     i=7     i=8     i=9
Wi      1.75    1.57    2.78    0.37    1.57    0.57    3.03    0.19    0.40

and the following bias B of the output Y: B = 2.61.

15.	A VAD method according to claim 9 or 14, characterised in that said second computation step includes: computing the energy Ef in the broad band of 0 to 4 kHz for each frame; computing the energy El in the low band of 0 to 1 kHz for each frame; computing the zero crossing rate ZCR for each frame; computing the respective average values of Ef, El and ZCR over a number of frames; starting from the values Ef and from their average values, computing differential values ΔEf for each frame; starting from the values El and from their average values, computing differential values ΔEl for each frame; and starting from the values ZCR and from their average values, computing differential values ΔZCR for each frame.

16.	A VAD method according to claim 15, characterised in that said computation operation of the energy Ef in the broad band of 0 to 4 kHz for each frame, said computation operation of the energy El in the low band of 0 to 1 kHz for each frame and said computation operation of the zero crossing rate ZCR for each frame are carried out according to the method described for the ITU-T G.729 VAD device.

17.	A VAD method according to claim 15 or 16, characterised in that said computation operation of the average values of Ef, El and ZCR is carried out in an adaptive way according to the method described for the ITU-T G.729 VAD device.

18.	A VAD method according to claim 15 or 16, characterised in that said computation operation of the average values of Ef, El and ZCR is carried out by analysing an initial segment of the voice signal having a duration in the range of 100 ms to 500 ms.

19.	A VAD method according to claim 18, characterised in that said computation operation of the average values of Ef, El and ZCR is carried out by analysing an initial segment of the voice signal having a duration of 300 ms.

20.	A VAD method according to any one of the preceding claims, characterised in that said first step aimed at acquiring the voice signal (1) includes filtering this signal by means of a highpass filter (2).

21.	A VAD method according to claim 20, characterised in that the cutoff frequency of said highpass filter (2) is in the range of 130 Hz to 170 Hz.

22.	A VAD method according to claim 21, characterised in that the cutoff frequency of said highpass filter (2) is 150 Hz.

23.	A VAD method according to any one of the preceding claims, characterised in that it includes, after said third neural network process step, a further step for comparing the output values Y of the neural network to a threshold value.

24.	A VAD method according to claim 23, characterised in that said threshold value is given by the arithmetical mean value of Ymin and Ymax, (Ymin + Ymax)/2.

25.	An apparatus for detection of voice activity or VAD, comprising one or more processing units (3, 4, 5, 6, 7, 8, 9) for the voice signal (1), characterised in that it further comprises a neural network (10) for receiving at its input port the data processed by said processing units, and in that it carries out the VAD method according to any one of the preceding claims 1 to 19.

26.	A VAD apparatus according to claim 25, characterised in that it carries out the VAD method according to claim 7 and in that it includes: a first, a second and a third processing unit (3, 4, 5) for computing, for each frame, the energy Ef in the broad band of 0 to 4 kHz, the energy El in the low band of 0 to 1 kHz and the zero crossing rate ZCR, respectively; a fourth processing unit (6) for receiving the computed values Ef, El and ZCR and for computing their respective average values; a fifth processing unit (7) for receiving the computed values Ef and their average values and for computing the differential values ΔEf; a sixth processing unit (8) for receiving the computed values El and their average values and for computing the differential values ΔEl; and a seventh processing unit (9) for receiving the computed values ZCR and their average values and for computing the differential values ΔZCR, said neural network (10) receiving at its input port the computed differential values ΔEf, ΔEl and ΔZCR.

27.	A VAD apparatus according to claim 25 or 26, characterised in that it further includes a high pass filter (2) upstream to said processing units (3,4,5,6,7,8,9,11) aimed at carrying out said filtering operation according to any one of claims 20 to 22.

28.	A VAD apparatus according to any one of claims 25 to 27, characterised in that it further comprises a final comparison unit (11) for carrying out the comparison step according to claim 23 or 24.

29.	A method for segmentation of isolated words or EPD method, characterised in that it comprises: the first, the second and the third steps of the VAD method according to any one of the preceding claims 1 to 22; a fourth levelling or smoothing step on said voice signal (1) for providing a smoothed signal V; a fifth step for provisionally marking the boundaries of the word, in which a coarse initial end point P'i and a coarse final end point P'F of the word are established; and a sixth step for trimming the marking operation carried out in the fifth step, in which the initial end point Pi and the final end point PF of the word are established.

30.	An EPD method according to claim 29, characterised in that said fourth smoothing step performs a median filtering of the seventh order.

31.	An EPD method according to claim 29 or 30, characterised in that said fifth provisionally marking step carries out, for each i-th frame, a comparison of the relative value Vi, provided by the fourth smoothing step, to the value S1 of a fixed initial threshold and establishes the coarse initial end point P'i as the final end point of a window comprising N2 frames, in which the relation Vi > S1 applies to at least a pre-established number N1 of frames, where N2 ≥ N1.

32.	An EPD method according to claim 31, characterised in that N1 is in the range of 3 to 50 and N2 is in the range of 3 to 50.

33.	An EPD method according to claim 32, characterised in that N1 = 4 and N2 = 5.

34.	An EPD method according to any one of claims 31 to 33, characterised in that the value S1 of the initial fixed threshold is in the range of [Ymin + 0.1*(Ymax - Ymin)] to [Ymin + 0.5*(Ymax - Ymin)].

35.	An EPD method according to claim 34, characterised in that the value S1 of the initial fixed threshold is [Ymin + 0.4*(Ymax - Ymin)].

36.	An EPD method according to claim 29 or 30, characterised in that said fifth provisionally marking step carries out, for each i-th frame, a comparison of the relative value Vi, provided by the fourth smoothing step, to the value S'1 of a fixed final threshold and establishes the coarse final end point P'F as the initial end point of a window comprising N4 frames, in which the relation Vi < S'1 applies to at least a pre-established number N3 of frames, where N4 ≥ N3.

37.	An EPD method according to claim 36 and any one of claims 31 to 35, characterised in that the value S'1 of the final fixed threshold is equal to the value S1 of the initial fixed threshold.

38.	An EPD method according to claim 36 or 37, characterised in that the value of N3 is in the range of 3 to 50 and the value of N4 is in the range of 3 to 50.

39.	An EPD method according to claim 38, characterised in that N3 = 26 and N4 = 30.

40.	An EPD method according to any one of claims 36 to 39, characterised in that the value S'1 of the final fixed threshold is in the range of [Ymin + 0.1*(Ymax - Ymin)] to [Ymin + 0.5*(Ymax - Ymin)].

41.	An EPD method according to claim 40, characterised in that the value S'1 of the final fixed threshold is [Ymin + 0.4*(Ymax - Ymin)].

42.	An EPD method according to any one of claims 29 to 41, characterised in that said sixth step for trimming the marking operation comprises: computing, for each i-th frame, the derivative of the relative value Vi furnished by the fourth smoothing step; comparing said value Vi to a value S2 of a further initial fixed threshold; and establishing the initial end point Pi, in a window immediately preceding the coarse initial end point P'i of the word and including a pre-established number Ni of frames, as the nearest point to P'i where Vi < S2 or where the derivative of Vi changes its sign; or, if neither of said events occurs in said preceding window, establishing the initial end point Pi as the point that precedes point P'i by Ni frames (Pi = P'i - Ni).

43.	An EPD method according to claim 42, characterised in that Ni is in the range of 3 to 50.

44.	An EPD method according to claim 43, characterised in that Ni = 10.

45.	An EPD method according to any one of claims 42 to 44, characterised in that the value S2 of said further fixed initial threshold is in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)].

46.	An EPD method according to claim 45, characterised in that the value S2 of said further fixed initial threshold is [Ymin + 0.6*(Ymax - Ymin)].

47.	An EPD method according to any one of claims 29 to 41, characterised in that said sixth step for trimming the marking operation comprises: computing, for each i-th frame, the derivative of the relative value Vi furnished by the fourth smoothing step; comparing said value Vi to a value S'2 of a further final fixed threshold; and establishing the final end point PF, in a window immediately following the coarse final end point P'F of the word and including a pre-established number NF of frames, as the nearest point to P'F where Vi < S'2 or where the derivative of Vi changes its sign; or, if neither of said events occurs in said subsequent window, establishing the final end point PF as the point that follows point P'F by NF frames (PF = P'F + NF).

48.	An EPD method according to claim 47 and any one of claims 42 to 46, characterised in that the value S'2 of said further fixed final threshold is equal to the value S2 of said further fixed initial threshold.

49.	An EPD method according to claim 47 or 48, characterised in that NF is in the range of 3 to 50.

50.	An EPD method according to claim 49, characterised in that NF = 30.

51.	An EPD method according to any one of claims 47 to 50, characterised in that the value S'2 of said further fixed final threshold is in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)].

52.	An EPD method according to claim 51, characterised in that the value S'2 of said further fixed final threshold is [Ymin + 0.6*(Ymax - Ymin)].

53.	An EPD method according to any one of claims 29 to 52, characterised in that it further comprises, as a further final control step: considering as noise and rejecting all words having a duration shorter than a minimum time interval Pmin or longer than a maximum time interval Pmax.

54.	An EPD method according to claim 53, characterised in that said minimum time interval Pmin is not longer than 300 ms.

55.	An EPD method according to claim 54, characterised in that said minimum time interval Pmin amounts to 100 ms.

56.	An EPD method according to any one of claims 53 to 55, characterised in that said maximum time interval Pmax is not shorter than 1 second.

57.	An EPD method according to claim 56, characterised in that said maximum time interval Pmax is equal to 2 seconds.

58.	An apparatus for recognition of isolated words or EPD apparatus comprising one or more units (12, 13, 14) for establishing the word boundaries, characterised in that it further comprises an apparatus for detection of voice activity or VAD apparatus according to any one of claims 25 to 27, connected upstream of said word boundary establishing units, and in that it carries out an EPD method according to any one of claims 29 to 57.

59.	An EPD apparatus according to claim 58, characterised in that said word boundary establishing units (12, 13, 14) comprise a finite state machine and one or more buffer registers for storing the frames of the voice signal (1) needed for carrying out said fifth provisional marking step and said sixth marking trimming step, said machine having the following four states: A. initial or non-activity state, B. activity begin or false alarm state, C. activity state, and D. activity end or micropause state.

60.	A method for detection of voice activity or VAD method and related VAD apparatus and a method for segmentation of isolated words or EPD method and related EPD apparatus according to any one of the preceding claims 1 to 24, 25 to 28, 29 to 57 and 58 to 59, respectively, substantially as described and shown.
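The network of claims 8 and 9, the mid-point threshold of claim 24, the seventh-order median smoothing of claim 30 and the coarse start marking of claim 31 can be illustrated with a short Python sketch. It is a minimal sketch rather than the claimed implementation: all function names are hypothetical, the weights are supplied by the caller, and the shrinking window at the signal edges in the median filter is an assumption that the claims do not specify.

```python
import math
from statistics import median


def sigmoid(a):
    """Logistic function f(a) = 1 / (1 + e^-a) applied at every node."""
    return 1.0 / (1.0 + math.exp(-a))


def vad_output(d_ef, d_el, d_zcr, w, b, W, B):
    """Forward pass of a 3-input, 9-hidden-node, 1-output perceptron.

    w: nine rows of three input weights, b: nine hidden biases,
    W: nine output weights, B: output bias (all supplied by the caller).
    """
    hidden = [sigmoid(wi[0] * d_ef + wi[1] * d_el + wi[2] * d_zcr + bi)
              for wi, bi in zip(w, b)]
    return sigmoid(sum(Wi * yi for Wi, yi in zip(W, hidden)) + B)


def is_voice(y, y_min=0.0, y_max=1.0):
    """Compare Y to the arithmetic mean of Ymin and Ymax."""
    return y > (y_min + y_max) / 2.0


def median_smooth(values, order=7):
    """Median filter; the window shrinks near the edges (assumption)."""
    half = order // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def coarse_start(v, s1, n1=4, n2=5):
    """Index of the last frame of the first N2-frame window in which
    at least N1 frames satisfy V > S1, or None if no window qualifies."""
    for end in range(n2 - 1, len(v)):
        window = v[end - n2 + 1: end + 1]
        if sum(1 for x in window if x > s1) >= n1:
            return end
    return None
```

With Ymin = 0 and Ymax = 1, the threshold S1 of claim 35 is 0.4, so `coarse_start(median_smooth(outputs), 0.4)` would give a provisional start frame for a list `outputs` of per-frame network values.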

Description:

VOICE ACTIVITY DETECTION AND END-POINT DETECTION

This invention relates to a low-complexity method for detection of voice activity that is robust to environmental noise, as well as to a method for segmentation of isolated words utilising said voice activity detection method, and to related apparatuses.

A Voice Activity Detector (VAD) is a device that distinguishes between voice (the useful signal) and silent pauses (signal that typically carries no information) within a telephone conversation. A description of VADs has been provided by R. V. Cox and P. Kroon in "Low Bit-Rate Speech Coders for Multimedia Communications", IEEE Communications Magazine, vol. 34, n. 12, Dec. 1996, pages 34-41.

A normal telephone conversation includes time periods in which voice activity occurs (about 40% of the time on average), during which at least one speaker is speaking, and time periods in which no voice activity occurs (about 60% on average), characterised by silence or by the presence of environmental noise only, in which both speakers are listening to one another or are pausing between isolated words or within a single word. As a general rule, a voice activity detection algorithm operates on voice segments (frames) having a duration of 10-20 ms.

According to Gersho, "Advances in speech and audio compression", IEEE Proc., vol. 82, n. 6, pages 900-918, June 1994, a VAD device has various interesting application scenarios, the two main ones being voice coding and voice recognition.

In the first case, a VAD device is utilised:
- to carry out so-called discontinuous transmission, or DTX, namely an operation mode in which transmission is disabled during all pause periods of a conversation, with the noticeable benefit that the occupied channel bandwidth is reduced and the transmission capacity of a communication system is increased (mobile radio systems, satellite links, voice communications over the Internet); and
- to lower the storage capacity needed, and consequently the costs, in voice storage systems (telephone answering machines, voice files, etc.).

In particular, a VAD device applied to mobile radio systems, such as GSM or UMTS, makes it possible to reduce both co-channel interference, thereby increasing the system capacity in terms of the number of users able to access a base station, and the power consumption of the mobile terminal, thereby increasing the average battery life. Furthermore, a VAD device represents the first stage in the classification performed by variable bit rate (VBR) voice coders, in which a particular coding model is associated with each phonetic class considered, so that the bit rate dynamically matches the local characteristics of the transmitted voice signal.

In systems for recognition of isolated speech, a VAD device forms the first processing stage typically present in the word end point detector or EPD, as described by Lawrence Rabiner in "Applications of voice processing to telecommunications", Proceedings of the IEEE, vol. 82, no. 2, February 1994, pages 199-227. In particular, the word boundaries or end points are the starting and ending times of a word spoken by a speaker.

In both voice coding and voice recognition applications, the limitations of a conventional VAD device are its sensitivity to environmental noise and a processing load that makes real-time execution particularly complex.

Indeed, in noiseless environments, such as at home, it is possible to distinguish speech segments from pauses by simply comparing the energy value of each frame to a pre-established threshold value. In a noisy environment, however (for example urban road traffic, the presence of other speakers, or noise generated by one's own vehicle), a simple comparison of the measured energy value to a threshold value is no longer sufficient, because many noise frames would be construed as voice frames and many voice frames would be erroneously construed as noise frames. Furthermore, when environmental noise is superimposed on the voice, the phonetic content of the latter is altered, which noticeably complicates the correct identification of speech segments with respect to pure environmental noise.
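The simple energy comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of mean-square energy in dB and the threshold value are assumptions, not part of any cited standard.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (a small floor avoids log(0))."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def energy_vad(frames, threshold_db):
    """Flag a frame as voice (True) when its energy exceeds the threshold."""
    return [frame_energy_db(f) > threshold_db for f in frames]


# Three 10 ms frames at 8 kHz: silence, a tone, and steady low-level noise.
silence = [0.0] * 80
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(80)]
steady_noise = [0.1] * 80
flags = energy_vad([silence, tone, steady_noise], threshold_db=-30.0)
```

In this example the steady noise frame sits at about -20 dB, above the hypothetical -30 dB threshold, so it is flagged as voice just like the tone. That is the kind of misclassification the paragraph above attributes to fixed energy thresholds in noisy environments.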

In various applications of modern technologies, such as mobile communications, however, algorithms are needed that operate correctly even in adverse acoustic conditions. It is apparent, therefore, that robustness to environmental noise is to be considered one of the main objectives of research in the field of digital voice processing.

This aspect is very important because noise is one of the main causes of degradation in system performance.

Indeed, in voice coding, every erroneous classification, or misclassification, made by a VAD device (namely voice activity frames construed as silence or environmental noise, and vice versa) inserts holes into the conversation and consequently degrades the quality perceived by the user.

In the recognition of voice commands in particular, an erroneously operating VAD device leads to an erroneous identification of the word boundaries, which strongly affects performance in terms of recognition rate.

In this respect, Figure 1 (appearing in Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice-Hall International, Inc., April 1993) shows a graph of total accuracy, expressed as a percentage, in the recognition of digits as a function of the variation in end point position, expressed in milliseconds. It is apparent that small errors in the estimation of the word boundaries often cause relatively significant degradations. By way of example, an error of ±30 ms (±3 frames) in the evaluation of the start point causes an accuracy decrease of 2%. As the end points move further from the manual mark, the recognition accuracy decreases further. Even though the results so obtained undoubtedly depend on the recognition system implemented, Figure 1 clearly shows the close connection between the EPD process and the performance of the recognition process.

A further aspect to be considered is computational simplicity, which even now is often a limitation, particularly for applications in mobile apparatuses.

At present, conventional VAD devices generally operate according to so-called "threshold" algorithms, namely algorithms based upon decision criteria that use fixed values (fixed threshold algorithms) or values that vary as a function of the local behaviour of the signal (adaptive threshold algorithms). The local signal characteristics are evaluated by checking whether suitably chosen parameters exceed said threshold values, which determines a decision about the nature of the signal itself. Lastly, a binary item of information, known as a "flag", returns the result of that decision in terms of presence or absence of the voice signal.

Although various VAD devices have been disclosed in the literature, the most relevant ones are the device standardised within ETSI (European Telecommunications Standards Institute) for the GSM (Global System for Mobile Communications) system, described in ETS GSM 06.32 (ETS 300-580-6), "European digital cellular telecommunications system (Phase 2); Voice Activity Detection (VAD)", September 1994, and the device standardised within ITU-T (International Telecommunication Union - Telecommunication Sector) for the 8 kbit/s coder, known as G.729 Annex B, described by A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin and J. P. Petit in "ITU-T Recommendation G.729 Annex B: A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications", IEEE Communications Magazine, Sept. 1997, vol. 35, n. 9, pages 64-73.

The ETSI GSM VAD device is substantially an energy detector based upon an adaptive threshold mechanism. It receives voice frames of 20 ms at its input port and, for each of them, it has to establish whether only background noise or active speech is present.

When a reliable detection is desired, the threshold level should be sufficiently higher than the noise level, in order to prevent it from being identified as voice, but it should not be much higher than it, in order to prevent low energy level voice segments from being identified as noise. A suitable threshold location, therefore, is essential for a good operation of this VAD device.

In an ETSI GSM VAD device, the input voice signal is filtered by means of an adaptive analysis filter, whose coefficients are computed starting from the autocorrelation coefficients of the input signal, averaged over four consecutive frames. This averaging step makes it possible to carry out a filtering operation aimed at lowering the noise content superimposed on the voice. As a result, a more reliable voice/noise discrimination is obtained.

Both the threshold value and the adaptive filter coefficients are updated only when no speech is present, that is, only during periods in which only noise is present, while during periods containing speech the values obtained in the previous noise-only period remain in force.

Lastly, in order to ensure that very low level signals are recognised as noise in any case, the ETSI GSM VAD device provides for an additional fixed threshold located at a very low level, such that any signal having a level lower than said threshold is considered as background noise.
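The two ideas just described, a threshold that adapts only during noise-only frames and a fixed low-level floor, can be illustrated with a heavily simplified Python sketch. It is not the ETSI GSM 06.32 algorithm: the function name, the exponential noise tracker, and the `margin`, `alpha`, `floor` and `initial_noise` values are all assumptions chosen for illustration.

```python
def adaptive_threshold_vad(energies, margin=3.0, alpha=0.9,
                           floor=1e-4, initial_noise=1e-3):
    """Return one voice flag per frame energy (linear scale).

    A frame is voice when its energy exceeds `margin` times the tracked
    noise level. The noise level is updated only on frames judged to be
    noise, and anything below `floor` is always treated as noise.
    """
    noise = initial_noise
    flags = []
    for e in energies:
        if e < floor:
            voice = False                  # very low level: always noise
        else:
            voice = e > margin * noise     # adaptive threshold
        if not voice:
            # Update the noise estimate only when no speech is present.
            noise = alpha * noise + (1.0 - alpha) * e
        flags.append(voice)
    return flags


flags = adaptive_threshold_vad([0.001, 0.0012, 0.05, 0.06, 0.001, 0.00001])
```

Freezing the noise estimate during speech keeps loud voice frames from inflating the threshold, which is the reason given above for updating only in noise-only periods.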

The ITU-T G.729 Annex B VAD device, standardised for the 8 kbit/s ITU-T G.729 coder, enables a selection between two coding modes, according to whether the frame considered is an activity or an inactivity frame.

The decision of the VAD device is taken on frames having a duration of 10 ms. As a first step, the input signal is filtered by a high-pass filter having a cut-off frequency of 140 Hz in order to eliminate any undesired low frequency components. Subsequently, starting from a set of autocorrelation coefficients, the parameters needed to carry out the classification are extracted frame by frame, in particular the energy Ef in the broad band of 0-4 kHz, the energy El in the low band of 0-1 kHz, the zero crossing rate ZCR and a set S of 12 Line Spectral Frequencies (LSF).
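As a rough illustration of the three scalar parameters, the following Python sketch computes a full-band energy, a low-band (0-1 kHz) energy and a zero crossing rate for a single frame. It does not follow the G.729 Annex B procedure, which derives these quantities from autocorrelation coefficients. The direct DFT, the linear (non-dB) energy scale, the 8 kHz sampling rate and the function names are assumptions made for this sketch.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def band_energies(frame, sample_rate=8000, low_cut_hz=1000.0):
    """Return (full-band, low-band) mean power using a direct DFT.

    By Parseval's relation, summing |X[k]|^2 / N^2 over all bins gives
    the mean power of the frame; restricting the sum to bins at or
    below low_cut_hz gives the low-band share.
    """
    n = len(frame)
    full = low = 0.0
    for k in range(n):
        xk = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                 for t, s in enumerate(frame))
        p = abs(xk) ** 2 / (n * n)
        full += p
        freq = min(k, n - k) * sample_rate / n   # fold negative frequencies
        if freq <= low_cut_hz:
            low += p
    return full, low


# Two 10 ms frames at 8 kHz: a 500 Hz tone and a 3 kHz tone.
low_tone = [math.sin(2 * math.pi * 500 * t / 8000 + 0.3) for t in range(80)]
high_tone = [math.sin(2 * math.pi * 3000 * t / 8000 + 0.3) for t in range(80)]
```

For these two test tones the full-band power is the same, while the low-band energy and the zero crossing rate separate them. This is why a low-band energy and a ZCR add information beyond total energy.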

In the first operation period of the system, for a given number Ni of frames, the activity decision is taken on the basis of the energy parameter alone: when the energy is higher than 15 dB, the decision is in favour of activity, otherwise it is in favour of inactivity. This stage also initialises the long-term average values that will be used, for all frames subsequent to the initial ones, to compute the following four differential parameters:
- differential energy ΔEf in the whole band;
- differential energy ΔEl in the low band;
- spectral distortion ΔS; and
- differential zero crossing rate ΔZCR.

These parameters represent the difference between the current value of a parameter and its average value, computed adaptively over the most recent noise frames. In particular, the differential parameters are generated according to the following formulas:
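A minimal Python sketch of this kind of differencing follows. It is not the set of formulas from the standard: the class name, the exponential averaging with factor `alpha`, and the sign convention (current value minus running average) are assumptions made purely for illustration.

```python
class RunningNoiseAverage:
    """Running average of one parameter, meant to be updated on noise frames."""

    def __init__(self, initial, alpha=0.9):
        self.mean = initial    # long-term average from the initial frames
        self.alpha = alpha     # assumed smoothing factor

    def differential(self, value):
        """Differential parameter: current value minus running average."""
        return value - self.mean

    def update(self, value):
        """Fold a new value into the average (call only on noise frames)."""
        self.mean = self.alpha * self.mean + (1.0 - self.alpha) * value


# One tracker per parameter, for example the full-band energy in dB.
avg_ef = RunningNoiseAverage(initial=10.0, alpha=0.5)
delta_ef = avg_ef.differential(16.0)   # a frame 6 dB above the average
avg_ef.update(12.0)                    # a later frame judged to be noise
```

Because the average follows the background level, the differential stays near zero while only noise is present, whatever the absolute noise level is.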

The subsequent stage is a matching stage in which an initial decision in respect of activity is taken by considering different regions in the space of the above four differential parameters. The activity decision is given by the combination of the decision regions, while the non- activity decision is simply given by the complementary region.

The decision in the four-dimensional space is based upon the following inequalities: $\Delta P_i < a \cdot \Delta P_k + b$ for $i, k = 1, \ldots, 4$, where $\Delta P_i$ and $\Delta P_k$ are two of the four differential parameters and $a$ and $b$ are suitable constants. If none of the above inequalities is fulfilled, the Boolean flag of the VAD device is set to 0 (silence or environmental noise). Otherwise, if at least one of the above inequalities is fulfilled, the decision flag is set to 1 (voice).
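The decision rule above amounts to a logical OR over a set of linear boundaries, which the following Python sketch makes explicit. The function name and the boundary tuples are hypothetical; the constants used in the standard are not reproduced here.

```python
def multi_boundary_decision(params, boundaries):
    """Return 1 (voice) if any boundary inequality holds, else 0.

    params: the four differential parameters, indexed 0 to 3.
    boundaries: tuples (i, k, a, b), each encoding the test
                params[i] < a * params[k] + b.
    """
    for i, k, a, b in boundaries:
        if params[i] < a * params[k] + b:
            return 1
    return 0


# Hypothetical parameter values and a single illustrative boundary.
params = [-5.0, 2.0, 0.1, 0.0]
flag = multi_boundary_decision(params, [(0, 1, 1.0, 0.0)])
```

An empty list of boundaries yields 0, matching the statement that non-activity is simply the complement of the union of the activity regions.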

The initial decision is filtered by a levelling or smoothing block, on the basis of the two previous frames, in order to avoid abrupt changes between activity and non-activity conditions. The last block updates the average values. This update should be carried out only for non-activity frames.

The decision as to whether a frame is an activity or a non-activity frame is taken by a secondary VAD device that enables or disables the updating of the average values.
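One simple way to smooth a decision using the two previous frames is a majority vote, sketched below in Python. This is an assumption chosen for illustration, not the smoothing procedure of G.729 Annex B, and the function name is hypothetical.

```python
def smooth_decisions(raw):
    """Majority vote over the current raw flag and up to two previous ones."""
    smoothed = []
    for i in range(len(raw)):
        window = raw[max(0, i - 2): i + 1]
        # Voice only if more than half of the flags in the window are 1.
        smoothed.append(1 if 2 * sum(window) > len(window) else 0)
    return smoothed


raw_flags = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
smooth_flags = smooth_decisions(raw_flags)
```

In this example the isolated 1 at the third frame and the isolated 0 at the ninth frame are both removed, at the cost of a one-frame delay at the onset of the longer voice run.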

An improved implementation of the G.729 VAD device was recently proposed by F. Beritelli, S. Casale and A. Cavallaro in "A Robust Voice Activity Detector for Wireless Communications Using Soft Computing", IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Signal Processing for Wireless Communications, Vol. 16, n. 9, Dec. 1998.

In particular, a new approach in respect of the matching stage, based upon a set of rules of "fuzzy" logic, suitably obtained after a training stage, has been proposed. In this case, a system of six rules receives as input the four differential parameters and produces as output, frame by frame, a continuous value in the range of 0 to 1. The decision of the VAD device is obtained by subsequently comparing the fuzzy output to a suitably selected threshold value. Based upon comparisons to the G.729 Annex B standard, carried out by varying either the noise type or the signal-to-noise ratio (SNR), it can be concluded that the fuzzy VAD device offers better performance, since it has a lower number of misclassifications.

The problem of automatically detecting the boundaries of a word has also been investigated in detail in the literature, and scientific reports on this topic have been available since the early 1970s.

As a general rule, according to these conventional approaches, the voice signal is firstly processed by a module that measures a set of parameters. Subsequently, the boundaries of the word to be recognised are established by means of a threshold decision mechanism.

As far as signals with a stationary, low-level background noise are concerned, this approach results in a reasonably good recognition accuracy. However, it is often unreliable when the environment is noisy, and particularly when the noise is not stationary.

Among the conventional end point detection methods, particular historical relevance attaches to the algorithm of Rabiner and Sambur, who in 1974 were the first to study the problem of automatically detecting the boundaries of isolated words. They proposed a rather simple algorithm that performs satisfactorily with almost any type of environmental noise, provided that the SNR is not lower than 30 dB, as was shown by Lawrence Rabiner himself in "Applications of Voice Processing to Telecommunications" (previously quoted reference).

The Rabiner and Sambur algorithm is based upon measurement of two parameters: the Short-Time Energy and Zero Crossing rate (ZCR).

Furthermore, such algorithm is self-adaptive, in that the threshold values are adjusted as a function of the noise present in the environment, as will be described herein below.

In such algorithm, the voice waveform is initially filtered by a band-pass filter, with pass-band of 100 to 400 Hz, in order to eliminate the undesired signal components (low frequency hum, dc offset, etc.). It is assumed in this method that no voice signal is present in the first 100 ms of the recording, such that the environmental characteristics can be evaluated during this time period. These measurements are given by the average value and by the standard deviation of the ZCR figure, as well as by the average energy. Based upon such measurements, which are linked by relationships expressed by empirical parameters, the values of three thresholds are established: one value for the ZCR figure and two values (a lower and an upper one) for energy.

The energy and ZCR functions are subsequently computed for the whole input signal over frames of 10 ms duration. The execution begins by locating, starting from the first frame, the point at which the energy profile exceeds both the lower and the upper energy thresholds, noting that it should not fall below the lower energy threshold before having exceeded the upper one. Such point, identified by the crossing of the lower threshold, is provisionally marked as the initial end point of the word. In a similar way, starting from the last frame (in this case, it is necessary that the user perform a first segmentation of the word by manually operating a push-button), the provisional final end point of the word is located.

After that, the algorithm proceeds with examining the periods of 250 ms duration that precede the provisional initial end point and follow the provisional final one. In particular, the number of frames in which the ZCR figure exceeds the relevant threshold level is counted. If such number is equal to or higher than three, the definitive initial end point is moved back to the last frame index at which the ZCR figure exceeds the threshold level; otherwise, it remains unaltered. The same procedure applies to the final end point.
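The search for the initial end point described above can be sketched as follows, by way of illustration only. The three thresholds are passed in as arguments (itl, itu and izct are names chosen for this example), since the empirical relationships that produce them are not reproduced here. It is also an interpretation, not a statement of the original algorithm, that the "last index" reached while moving backwards is the earliest frame of the 250 ms window whose ZCR exceeds the threshold.

```python
def find_start(energy, zcr, itl, itu, izct, lookback=25, min_hits=3):
    """Locate the initial end point of a word.

    energy, zcr: per-frame values (10 ms frames assumed, so 25 frames = 250 ms).
    itl, itu:    lower and upper energy thresholds (hypothetical names).
    izct:        zero crossing rate threshold (hypothetical name).
    Returns a frame index, or None if the upper threshold is never exceeded.
    """
    candidate = None
    start = None
    for n, e in enumerate(energy):
        if e <= itl:
            candidate = None   # dropped below the lower threshold: discard
            continue
        if candidate is None:
            candidate = n      # first frame above the lower threshold
        if e > itu:
            start = candidate  # upper threshold exceeded without dropping
            break
    if start is None:
        return None

    # ZCR refinement over the frames preceding the provisional point.
    first = max(0, start - lookback)
    hits = [n for n in range(first, start) if zcr[n] > izct]
    if len(hits) >= min_hits:
        start = hits[0]        # assumed: earliest high-ZCR frame in the window
    return start


energy = [0, 0, 0, 0, 0, 0, 2, 3, 6, 7, 6]
zcr = [1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1]
start = find_start(energy, zcr, itl=1, itu=5, izct=5)
```

Under the same assumptions, the final end point can be obtained by applying the function to the reversed sequences and mapping the index back.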

This strategy rests on the fact that, in most cases of interest, exceeding the ZCR threshold is a reliable indication of the presence of an unvoiced sound. However, it is also possible that a weak fricative sound fails this test and is consequently lost; in such cases, no method exists that is both simple and reliable for distinguishing an unvoiced sound from the background noise.

Therefore, it can be stated that the main limit of the Rabiner and Sambur method lies in the presence of stationary and non-stationary background noise, since it cannot satisfactorily discriminate the voice signal when the SNR is lower than 30 dB.

A simple alternative EPD algorithm having acceptable performance in moderately difficult conditions was proposed by C. Tsao and R. M. Gray in "An endpoint detector for LPC speech using residual error look-ahead for vector quantization applications", IEEE ASSP Mag., 1984, pages 18B.7.1-18B.7.4. In particular, such algorithm utilises as its sole parameter the root mean square error of linear prediction, Ep, for which a strong correlation with the nominal boundaries of the words has been empirically observed.

In particular, assuming that an optimum threshold T has been established and that a silent portion of the considered signal is being examined, an initial end point is obtained by observing the M preceding frames, in the case that Ep > T for (α·M) times in those M frames, where 0 < α < 1. In a similar way, assuming that a portion of voice signal is being examined, a final end point is obtained by observing the M preceding frames, in the case that Ep < T for (α·M) times in those M frames. For frames of 20 ms duration, preferably M = 5 and α = 0.8. This algorithm also automatically establishes the threshold value T during the first 300 ms of initial silence.
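The counting rule just described lends itself to a short sketch. The following is a minimal illustration under stated assumptions, not the published implementation: the threshold T is supplied by the caller, and each end point is reported at the frame at which the counting condition is first satisfied, which is a choice made for this example.

```python
import math


def tsao_gray_endpoints(ep, threshold, m=5, alpha=0.8):
    """Return (start, end) frame indices from the prediction error Ep.

    ep:        per-frame r.m.s. linear prediction error.
    threshold: the value T (assumed to be given).
    m, alpha:  window length and fraction (M = 5, alpha = 0.8 for 20 ms frames).
    An index is None when the corresponding end point is not found.
    """
    need = math.ceil(alpha * m - 1e-9)   # alpha*M frames, robust to rounding
    in_speech = False
    start = end = None
    for n in range(m - 1, len(ep)):
        window = ep[n - m + 1 : n + 1]   # the M most recent frames
        above = sum(1 for e in window if e > threshold)
        below = sum(1 for e in window if e < threshold)
        if not in_speech and above >= need:
            start, in_speech = n, True
        elif in_speech and below >= need:
            end = n
            break
    return start, end


ep = [0.0] * 6 + [10.0] * 10 + [0.0] * 10
endpoints = tsao_gray_endpoints(ep, threshold=5.0)
```

With M = 5 and α = 0.8, four of the five most recent frames must agree before an end point is declared, which gives the rule some tolerance to a single outlying frame.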

In the absence of noise, the Tsao and Gray method, even though it utilises a single parameter, operates acceptably even in the case of weak fricative sounds, such as "s" and "v".

However, it performs well only in the absence of noise or for SNR figures not lower than 20 dB.

A further EPD algorithm that is particularly robust in respect of the environmental noise was recently proposed by J. C. Junqua, B. Mak and B. Reaves in "A robust algorithm for word boundary detection in the presence of noise", IEEE Tr. on Speech and Audio Proc. Mag., n. 3, 1994, pages 406-412. Such algorithm establishes so-called "reliability islands" and locates the word boundaries on the basis of a first coarse detection, robust in respect of noise, followed by a trimming procedure.

In order to retrieve the "reliability islands" consistently, also in noise affected conditions, it introduces a parameter, called TF (time-frequency), based upon the energy in the frequency band of 250 to 3500 Hz, which is added to the logarithm of the r.m.s. energy value computed over the whole frequency band of the signal. In particular, the energy in the above mentioned frequency band is utilised because it is useful in establishing the high energy regions that substantially correspond to the vowels in the input signal. This frequency band helps the algorithm to discriminate between signal and noise. By determining the portion of the signal included between the first and the last vowels, the fundamental limits are thereby established.

This limited band energy is first normalised and filtered by a smoothing algorithm. At the same time, the logarithm of the r.m.s. value of the full band energy is computed and then normalised and filtered. The final TF parameter is obtained after filtering the sum of the two energy functions. Subsequently, a noise adaptive threshold is computed starting from the first frames of the input signal, and then the beginning of the first vowel and the end of the last one (fundamental limits) are established by comparing the TF parameter to the above mentioned adaptive threshold. Lastly, starting from the coarse limits so obtained, a trimming procedure that also utilises the ZCR figure is applied by running backwards over a fixed distance of 100 ms from the beginning of the first vowel and over a fixed distance of 150 ms from the end of the last vowel.

The experimental evaluations of this algorithm show that, in the total absence of noise, its results are similar to those obtained by manual marking; when additive noise is present, this method appears to offer better performance than the previous ones, particularly for low values of the SNR figure (down to 5 dB).

As in connection with VAD devices, as far as end point recognition is concerned, a new approach based upon fuzzy logic has been proposed by F. Beritelli in "A Robust Endpoint Detector based on Differential Parameters and Fuzzy Pattern Recognition", Proc. IEEE International Conference on Signal Processing (ICSP'98), Beijing (China), Oct. 12-16, 1998. In particular, the EPD Fuzzy method comprises a first processing stage represented by the above mentioned VAD Fuzzy method, followed by a post-processing stage that suitably processes the output of the VAD Fuzzy method in order to establish the word boundaries. In the first place, a median filter of the seventh order eliminates any abrupt variations existing in the fuzzy output; subsequently, the end points are obtained from the intersection of the filter output with a fixed threshold having a value 9. The EPD Fuzzy method performs better than the previously analysed EPD algorithms when either the environmental noise type or the SNR figure is varied. Furthermore, the EPD Fuzzy method appears to be robust in respect of the level variations of the signal. However, the EPD Fuzzy method has been found not to be quite satisfactory in respect of the processing load it introduces and of its robustness in the presence of noise.

It is an object of this invention, therefore, to provide a method for establishing the word end points by a simple and speedy procedure, with low processing load, even when environmental noises of significant level are present.

It is also an object of this invention to provide an apparatus adapted to carry out one of the above mentioned methods or both.

It is specific subject matter of this invention, therefore, to provide a method for detection of voice activity, or VAD method, in a voice signal, particularly in telephonic applications, comprising:
- a first step aimed at acquiring the voice signal divided in segments or frames having a time duration d not longer than 40 milliseconds (ms), more preferably in the range of 10 to 20 ms, still more preferably equal to 10 ms,
- a second step aimed at computing, for each frame, at least three of the following five parameters:
  * the energy differential over the whole band, ΔEf,
  * the energy differential over the band 0-1 kHz, ΔEl,
  * the zero crossing rate differential, ΔZCR,
  * the second cepstral coefficient, c2, and
  * the fifth cepstral coefficient, c5,
- a third step in which a neural network process is carried out in order to provide, based upon at least three of said five parameters, for each frame, an output value Y in the range defined by a minimum value Ymin, preferably equal to 0 (zero), and by a maximum value Ymax, preferably equal to 1 (one), being Ymin < Ymax.

Preferably, according to this invention, Ymin corresponds to a silence frame and Ymax corresponds to a voice activity frame.

According to this invention, the three parameters of the energy differential over the whole band, ΔEf, the energy differential over the band 0-1 kHz, ΔEl, and the zero crossing rate differential, ΔZCR, can be computed, for each frame, in the second computation step, and said third neural network process step is based upon said three parameters ΔEf, ΔEl and ΔZCR.

According to this invention, said neural network can be trained by means of the "Delta Learning Rule", and the voice signal employed for said training procedure can be a clean or noiseless voice signal and/or further audio signals obtained by adding babble noise, car noise, traffic noise and white noise, respectively, to said clean signal, with a SNR equal to 20 dB and 10 dB and possibly also with a SNR equal to 0 dB.

Preferably, according to this invention, said neural network includes a perceptron having three inputs, an output and nine nodes in an intermediate stage, in which, still more preferably, the relationships between the inputs and the intermediate outputs as well as between the latter and the network output are linear relationships, still more preferably with pre-established and constant coefficients.

Further according to this invention, the VAD method can comprise, after the third neural network processing step, a further step for comparing the output values Y of the neural network to a threshold value, preferably given by the arithmetic mean of Ymin and Ymax ((Ymin + Ymax)/2).

It is also subject matter of this invention an apparatus for detection of voice activity or VAD, comprising one or more units for processing the voice signal, characterised in that it further comprises a neural network for receiving at its input port the data processed by said processing units, and in that it carries out the VAD method according to this invention.

According to this invention, the VAD apparatus can further comprise a high-pass filter arranged upstream to said processing units and/or a final comparison unit for comparing the output signal of the neural network to a threshold value.

It is also specific subject matter of this invention to provide a method for segmentation of isolated words, or EPD method, characterised in that it comprises:
- the first, the second and the third steps of the VAD method according to this invention,
- a fourth step for levelling or smoothing said voice signal in order to provide a smoothed signal V,
- a fifth step for provisionally marking the boundaries of the word, in which a coarse initial end point P'I and a coarse final end point P'F of the word are established, and
- a sixth step for trimming the marking operation carried out in the fifth step, in which the initial end point PI and the final end point PF of the word are established.

It is also specific subject matter of this invention to provide an apparatus for segmentation of isolated words or EPD apparatus comprising one or more units for establishing the word boundaries, characterised in that it further comprises an apparatus for detection of voice activity or VAD apparatus connected upstream to said word boundary establishing units, and in that it carries out an EPD method according to this invention.

Further characteristics and embodiments of this invention will be disclosed in the dependent claims.

This invention will now be described by way of illustration and not by way of limitation according to its preferred embodiment, by particularly referring to the Figures of the enclosed drawings, in which:
Figure 1 is a graph of the total accuracy, expressed as a percentage, of the digit recognition, as a function of the variation, measured in milliseconds, of the end point positions,
Figure 2 is a block diagram of the preferred embodiment of the neural VAD apparatus according to this invention, and
Figure 3 is a block diagram of a preferred embodiment of the EPD apparatus according to this invention.

Particular attention has been paid by the inventor to the two substantial aspects of a pattern or shape recognition approach: namely, establishing an efficient assembly or set of parameters and a proper adaptation or matching technique.

Following a new parameter selection methodology, the inventor succeeded in establishing a minimum set of parameters which, besides assuring a high degree of separability between the classification categories, offer a low computational complexity. In particular, the problem of establishing the parameter set that is most significant as input to a VAD system has been investigated by exploiting the separability method described by K. Fukunaga in "Introduction to Statistical Pattern Recognition", Academic Press, San Diego, California, 1990, April 1993.

This criterion makes it possible to select, from a larger set of parameters, those that offer the greatest contribution to the classification operation.

The following 37 parameters (described in the above mentioned references by Lawrence Rabiner in "Applications...", by Lawrence Rabiner and Biing-Hwang Juang in "Fundamentals..." and by A. Benyassine, E. Shlomot, H. Y. Su, D. Massaloux, C. Lamblin and J. P. Petit in "ITU Recommendation...") have been initially considered:
* the energy differential over the whole band, ΔEf,
* the energy differential over the band 0-1 kHz, ΔEl,
* the zero crossing rate differential, ΔZCR,
* the seventeen cepstral coefficients, c0, ..., c16, and
* the seventeen cepstral coefficient differentials, Δc0, ..., Δc16.

With the goal of computational simplicity and efficiency in mind, a set of parameters that are simple from a computational viewpoint has been selected. For this reason, among the four parameters of the G.729 VAD, the spectral distortion parameter ΔS has not been considered, because its complexity is higher by at least one order of magnitude than that of the other three parameters.

The above parameters have been computed over frames of 10 ms duration and in connection with a set of words comprising the Italian numerals (zero, one, two, three, four, five, six, seven, eight, nine) and some commands (delete, call, no, yes, OK, record, check) voiced by four speakers (two male and two female). The starting database, which consists of 68 words, has been scaled to a level equal to -15.86 dBm0, and four different noise types, specifically babble noise, traffic noise, car noise and white noise, have been digitally added thereto, with three different SNR figures of 20 dB, 10 dB and 0 dB, respectively.

Since the VAD device should be able to discriminate between silence or pause frames and voice activity frames, it has been decided to consider exclusively, for each word, the twenty frames straddling each of the initial and final end points, the latter being detected by an ideal mark.

Thirteen different scenarios have been realised, respectively related to the noiseless or clean case and to the above mentioned twelve noisy cases: each scenario is formed by 2720 vectors of 37 components, equally distributed between the two above mentioned classes of voice activity and no voice activity. The various parameters have been ordered for each scenario. Table I shows the parameters in the first eight ordered positions in each of said thirteen scenarios.

No. Scenario        1st    2nd    3rd    4th    5th    6th    7th    8th
 1  CLEAN           C5     ΔZCR   C9     C11    C3     C10    C6     C0
 2  BABBLE 20 dB    C2     ΔC4    C14    C8     ΔC6    C12    C10    ΔEl
 3  BABBLE 10 dB    C2     ΔC4    C14    ΔC8    ΔC6    ΔC12   ΔZCR   C3
 4  BABBLE 0 dB     ΔEl    ΔC14   ΔC4    ΔC2    ΔEf    C6     ΔC16   C8
 5  CAR 20 dB       ΔEf    C16    ΔC12   C8     ΔC0    C1     ΔC14   C10
 6  CAR 10 dB       ΔEf    ΔC16   ΔC12   ΔZCR   C8     ΔC3    C4     C10
 7  CAR 0 dB        C5     C7     ΔEf    C9     C13    ΔC0    ΔC11   ΔEl
 8  TRAFFIC 20 dB   ΔEl    ΔC13   C5     ΔC9    ΔC11   C3     C7     ΔC2
 9  TRAFFIC 10 dB   ΔEl    C13    ΔC12   ΔC9    ΔC11   ΔC10   C1     ΔZCR
10  TRAFFIC 0 dB    ΔEl    C13    ΔC5    C7     ΔC12   C2     C6     ΔC9
11  WHITE 20 dB     ΔEl    ΔZCR   ΔC9    ΔC11   C8     C6     C5     ΔC13
12  WHITE 10 dB     ΔEl    ΔEf    C11    ΔZCR   C10    ΔC8    ΔC13   ΔC15
13  WHITE 0 dB      ΔEl    ΔEf    ΔC5    ΔC16   ΔC10   ΔZCR   C11    C16

Table I

Starting from such orderings, a "significance" score, defined as separability grade, has been granted to each parameter according to a relationship in which p represents the position obtained by parameter j in scenario i and each scenario i is allotted a weight; in particular, the same weight has been allotted to every scenario. However, a higher weight could be allotted to the clean case and to the noisy cases with higher SNR figure.

In this way, a global ordering of all parameters has been obtained according to a decreasing value of the separability grade, as appears from Table II.

The preferred embodiment of the method for detection of voice activity according to this invention provides for the VAD method to be based only on the following three parameters: ΔEf, ΔEl and ΔZCR.

The Inventor realised a non-linear matching technique that turns out to be particularly robust in respect of the various noise types that are superimposed on voice in a telephone conversation carried out in noisy environments. Such technique is based upon use of a suitably trained neural network which forms the matching block.

This block receives the individual parameters at its input port and outputs, frame by frame, a value in the range between 0 (pause) and 1 (voice activity). This value is subsequently compared to a threshold value equal to 0.5 for the final decision.

Pos.  Parameter  Score          Pos.  Parameter  Score
 1    ΔEl        8.400000095    21    C12        1.500000037
 2    ΔEf        6.100000083    22    ΔC12       1.400000028
 3    C2         3.999999978    23    C15        1.399999976
 4    C5         3.800000094    24    ΔC2        1.200000018
 5    ΔZCR       3.800000004    25    ΔC0        1.100000024
 6    C13        2.700000025    26    ΔC8        1.100000016
 7    ΔC4        2.600000054    27    ΔC1        1.000000015
 8    C7         2.500000045    28    ΔC9        1.000000015
 9               2.49999997     29    ΔC3        1.000000007
10    C1         2.400000013    30    ΔC6        0.900000028
11    C3         2.300000064    31    ΔC7        0.900000013
12    C14        2.20000004     32    ΔC16       0.800000027
13    C4         2.200000025    33    ΔC13       0.700000025
14    C8         2.000000045    34    ΔC14       0.700000025
15    C9         2.000000015    35    ΔC10       0.500000015
16    C6         1.900000051    36    ΔC11       0.500000007
17    C16        1.900000051    37    ΔC15       0
18    ΔC5        1.700000048
19    C11        1.70000004
20    C10        1.500000052

Table II

As far as the terminology utilised in respect of the neural network is concerned, reference will be made hereinafter to the work of Simon Haykin, "Neural Networks", Macmillan College Publishing Company, New York, 1994.

In the preferred embodiment of this invention, the neural network includes a perceptron having three inputs (parameters ΔEf, ΔEl and ΔZCR), an output (in the range between 0 and 1) and nine nodes in an intermediate stage.

The network is trained by means of the above mentioned "Delta Learning Algorithm". In the preferred embodiment of this invention, the matching block is trained by means of the parameters retrieved from the "clean" signal and from the signals obtained by adding "babble", "traffic" and "white" noises thereto, with signal-to-noise ratios of 20, 10 and 0 dB. The training operation is carried out in such a way that the neural network is adapted to furnish, in response to a given set of input values, an output in the value range [0, 1], in which values tending to 1 indicate a presence of voice activity and values tending to 0 indicate an absence of voice activity. Upon ending the training stage, the values of coefficients wij, bi, Wi and B are determined.

The relationship that links the intermediate output yi of each one of said nine intermediate nodes to said three inputs is:

yi = f(αi), where αi = wi1·ΔEf + wi2·ΔEl + wi3·ΔZCR + bi

and the relationship that links the output Y to the outputs yi of said nine intermediate nodes is expressed in terms of:

α = W1·y1 + ... + W9·y9 + B

The preferred embodiment of this invention provides for the neural network to have:

- the following matrix w (9x3) of the interconnection weights between the three inputs and the outputs yi of the nine intermediate nodes:

wij     j=1     j=2     j=3
i=1     0.62   -0.47   -4.55
i=2     1.10    0.38    4.33
i=3     0.52   -0.69   -6.42
i=4     4.36    5.55    2.00
i=5    -0.37   -0.29    3.57
i=6     2.17    2.64    7.23
i=7     2.34   -3.00    4.00
i=8     0.02   -0.01    7.60
i=9    -1.74   -2.40   -4.38

- the following vector b (9x1) of the additive constants, or bias, of the outputs yi of said nine intermediate nodes:

        bi
i=1    -4.61
i=2     9.42
i=3    -0.36
i=4    10.54
i=5    10.69
i=6     4.19
i=7    -7.81
i=8     5.78
i=9    -2.53

- the following vector W (1x9) of the interconnection weights Wi between the output Y of the neural network and the outputs yi of the nine intermediate nodes:

        i=1     i=2     i=3     i=4     i=5     i=6     i=7     i=8     i=9
Wi      1.75   -1.57    2.78   -0.37    1.57    0.57    3.03    0.19    0.40

and

- the following bias B of the output Y: B = -2.61.

By referring to Figure 2, it can be observed that the neural VAD device according to the preferred embodiment of this invention operates in such a way that the voice signal 1 is initially filtered by a high-pass filter 2 having a cut-off frequency in the range of 130 to 170 Hz, preferably a frequency near to 150 Hz, in order to eliminate all low frequency noise components included in the concerned signal.

The concerned signal is subsequently transferred to a first, a second and a third processing unit 3, 4 and 5, respectively, which, for each frame, compute the energy Ef in the broad band 0 to 4 kHz, the energy El in the low band 0 to 1 kHz and the zero crossing rate ZCR. Preferably, the above three processing units 3, 4 and 5 perform said computation following the same mechanism as adopted in the ITU-T G.729 VAD device.

A fourth processing unit 6 receives the computed values Ef, El and ZCR and computes the respective average values of Ef, El and ZCR, preferably in an adaptive way according to the process adopted in the ITU-T G.729 VAD device.

The computed values Ef and their average values are applied to a fifth processing unit 7, which computes their differential values ΔEf. In a similar way, the computed values El and their average values are applied to a sixth processing unit 8, which computes their differential values ΔEl, and the computed values ZCR and their average values are applied to a seventh processing unit 9, which computes their differential values ΔZCR.

The computed differential values ΔEf, ΔEl and ΔZCR are applied to said neural network 10, which furnishes, frame by frame, an output value Y in the range of 0 to 1.
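A forward pass through the three-input, nine-node network can be sketched as follows, using the coefficients of the preferred embodiment. One point is an assumption of this sketch rather than something stated legibly in the text: the form of the node function f is not available, so the logistic sigmoid is used here both for the intermediate nodes and for the output, which at least keeps Y between 0 and 1. The 0.5 threshold is the preferred value of (Ymin + Ymax)/2.

```python
import math

# Coefficients of the preferred embodiment (w: 9x3, b: 9, W: 9, B: scalar).
W_HIDDEN = [
    [0.62, -0.47, -4.55],
    [1.10, 0.38, 4.33],
    [0.52, -0.69, -6.42],
    [4.36, 5.55, 2.00],
    [-0.37, -0.29, 3.57],
    [2.17, 2.64, 7.23],
    [2.34, -3.00, 4.00],
    [0.02, -0.01, 7.60],
    [-1.74, -2.40, -4.38],
]
B_HIDDEN = [-4.61, 9.42, -0.36, 10.54, 10.69, 4.19, -7.81, 5.78, -2.53]
W_OUT = [1.75, -1.57, 2.78, -0.37, 1.57, 0.57, 3.03, 0.19, 0.40]
B_OUT = -2.61


def sigmoid(x):
    """Logistic function (ASSUMED activation), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def vad_output(d_ef, d_el, d_zcr):
    """Compute Y from the three differential parameters of one frame."""
    hidden = [
        sigmoid(w[0] * d_ef + w[1] * d_el + w[2] * d_zcr + b)
        for w, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    alpha = sum(wo * y for wo, y in zip(W_OUT, hidden)) + B_OUT
    return sigmoid(alpha)


def vad_decision(y, threshold=0.5):
    """Boolean flag D: 1 for voice activity, 0 otherwise."""
    return 1 if y > threshold else 0


y0 = vad_output(0.0, 0.0, 0.0)
```

Because the activation is assumed, the numerical outputs of this sketch should not be read as the outputs of the trained network described here; the sketch only shows how the matrices and vectors fit together.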

A comparison unit 11 compares the output values Y to a threshold value and furnishes a decision Boolean value or flag D in respect of the classification of the signal as voice activity signal or non-voice activity signal.

As described in the above mentioned ETSI GSM 06.32 (ETS 300-580-6) "European digital cellular telecommunications system...", in another embodiment of this invention a post-processing unit, not shown in Figure 2, can be provided to eliminate the errors caused by evaluating some voice segments as noise. In such a context, said post-processing unit is based upon the so-called "hangover mechanism".

Starting from a VAD device based upon a neural network according to this invention, the Inventor realised an EPD device for detection of the word boundaries in order to effect segmentation of isolated words, by adding a post-processing stage to the output of the neural network.

By referring to Figure 3, it can be observed that, in the block diagram of the preferred embodiment of an EPD detector according to this invention, the comparison unit 11 of Figure 2 is replaced by more complex post-processing units.

In particular, it can be observed that the output Y of the neural network 10 of the VAD device is processed in a smoothing unit 12, preferably implemented as a median filter of the seventh order, aimed at reducing any abrupt variation in the concerned signal.
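A median filter of this kind can be sketched in a few lines. Two choices below are assumptions made for the example: "seventh order" is read as a window of seven samples centred on the current frame, and the window is simply truncated at the two ends of the signal.

```python
from statistics import median


def median_smooth(y, order=7):
    """Return the smoothed signal V from the network outputs Y.

    Each output sample is the median of up to `order` samples centred on
    the current frame (the window is shortened at the signal edges).
    """
    half = order // 2
    out = []
    for n in range(len(y)):
        lo = max(0, n - half)
        hi = min(len(y), n + half + 1)
        out.append(median(y[lo:hi]))
    return out


spike = [0, 0, 0, 0, 1, 0, 0, 0, 0]
step = [0] * 5 + [1] * 5
v_spike = median_smooth(spike)
v_step = median_smooth(step)
```

An isolated one-frame peak is removed, while a sustained transition from pause to activity is left in place, which is the behaviour wanted before the boundaries are marked.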

In order to locate the initial frame and the final frame and starting from time at which the user activates the system, a provisional marking unit 13 analyses the shape of the signal V emitted by said smoothing unit 12 on a frame-by-frame basis.

In particular, the value Vi relating to the i-th frame is compared to the value S1 of a fixed threshold: the provisional point P'I at which a word begins is coarsely established as the final end point of a window comprising N2 frames, in which the relation Vi > S1 applies to at least a pre-established number N1 of frames, where N2 > N1. Preferably, N1 = 4 and N2 = 5, and S1 is in the range of 0.1 to 0.5 (in the general case, in the range of [Ymin + 0.1*(Ymax - Ymin)] to [Ymin + 0.5*(Ymax - Ymin)]); still more preferably, S1 = 0.4 (in the general case, [Ymin + 0.4*(Ymax - Ymin)]). This criterion prevents noise peaks, which generally have a limited time duration, from being construed as words.

Upon carrying out such provisional estimation of the initial end point P'I and continuing with analysing the output V of said smoothing unit 12, the provisional point P'F at which a word ends is coarsely established as the initial end point of a window comprising N4 frames, in which the relation Vi < S1 applies to at least a pre-established number N3 of frames, where N4 > N3. Preferably, N3 = 26 and N4 = 30. This criterion prevents drops of the signal due to micropauses within the word, such as those occurring, for instance, in pronouncing double consonants of occlusive type, from being construed as final points. In other embodiments, the threshold value for establishing the coarse point P'F can be different from the threshold value utilised to establish the coarse point P'I, even if it is in any case in the range of 0.1 to 0.5 (in the general case, in the range of [Ymin + 0.1*(Ymax - Ymin)] to [Ymin + 0.5*(Ymax - Ymin)]).
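The coarse marking of both end points can be sketched as follows, by way of illustration and with the preferred values as defaults. The function and argument names are chosen for this example, and the same threshold is used for both points, which is only one of the options contemplated.

```python
def coarse_endpoints(v, s1=0.4, n1=4, n2=5, n3=26, n4=30):
    """Return the coarse frame indices (start, end) of a word in signal V.

    start: last frame of the first N2-frame window holding at least N1
           frames with V > S1.
    end:   first frame of the first later N4-frame window holding at least
           N3 frames with V < S1.
    Either index is None when the corresponding point is not found.
    """
    start = None
    for n in range(n2 - 1, len(v)):
        window = v[n - n2 + 1 : n + 1]
        if sum(1 for x in window if x > s1) >= n1:
            start = n
            break
    if start is None:
        return None, None

    for n in range(start + 1, len(v) - n4 + 1):
        window = v[n : n + n4]
        if sum(1 for x in window if x < s1) >= n3:
            return start, n
    return start, None


word = [0.0] * 10 + [1.0] * 40 + [0.0] * 40
with_pause = [0.0] * 10 + [1.0] * 20 + [0.0] * 10 + [1.0] * 20 + [0.0] * 40
coarse = coarse_endpoints(word)
```

In the second test signal a pause of ten frames (100 ms at 10 ms per frame) inside the word does not satisfy the N3-of-N4 condition, so it is not taken as the end of the word.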

Signal V and the coarse initial end point P'I and final end point P'F of the word are then applied to a marking trimming unit 14. Said trimming unit 14 utilises a further threshold S2 > S1 and analyses the sign of the first derivative of signal V coming from said smoothing unit 12, in order to ascertain any slope change. Preferably, the value of S2 is in the range of 0.5 to 0.9 (in the general case, in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)]) and still more preferably S2 = 0.6 (in the general case, [Ymin + 0.6*(Ymax - Ymin)]).

In particular, the initial point PI is established, in a window immediately preceding the coarse initial end point P'I of the word and including a pre-established number NI of frames, as the nearest point to P'I where Vi < S2 or where the derivative of V changes its sign. If neither of said events occurs, then point PI is taken as PI = P'I - NI frames.

Preferably, NI = 10, in order to reduce the computation load needed in the voice recognition stage, particularly in real time applications.

In a similar way, the final point PF is the nearest point to P'F, in a window immediately subsequent to P'F and including a pre-established number NF of frames, where Vi < S2 or where the derivative of V changes its sign. If neither of said events occurs, point PF is taken as PF = P'F + NF frames. Preferably, NF = 30, in order to ascertain with a sufficient accuracy level that no intra-word micropause is involved. In other embodiments, the threshold value utilised for establishing the final point PF can be different from the value utilised for establishing the initial point PI, even if it is still included in the range of 0.5 to 0.9 (in the general case, in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)]).

Lastly, according to a preferred embodiment of the EPD detector of the present invention, a further final check step is carried out on the results of the performed recognition procedure, and consequently all words having a duration shorter than a minimum time interval Pmin, preferably not longer than 300 ms and still more preferably equal to 100 ms, as well as all words having a duration longer than a maximum time interval Pmax, preferably not shorter than 1 second and still more preferably equal to 2 seconds, are discarded.
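The trimming step and the final duration check can be sketched together. This is a minimal illustration under stated assumptions: the derivative of V is approximated by the first difference between consecutive frames, a sign change is detected only where both neighbouring differences are non-zero, and 10 ms frames are assumed when converting frames to milliseconds.

```python
def _sign(x):
    return (x > 0) - (x < 0)


def _slope_changes(v, n):
    """True if the first difference of V changes sign at frame n."""
    if n - 1 < 0 or n + 1 >= len(v):
        return False
    before = _sign(v[n] - v[n - 1])
    after = _sign(v[n + 1] - v[n])
    return before != 0 and after != 0 and before != after


def trim_endpoints(v, coarse_start, coarse_end, s2=0.6, n_i=10, n_f=30):
    """Refine the coarse points P'I and P'F into PI and PF."""
    start = max(0, coarse_start - n_i)           # fallback: P'I - NI
    for n in range(coarse_start - 1, max(-1, coarse_start - n_i - 1), -1):
        if v[n] < s2 or _slope_changes(v, n):
            start = n                            # nearest event before P'I
            break

    end = min(len(v) - 1, coarse_end + n_f)      # fallback: P'F + NF
    for n in range(coarse_end + 1, min(len(v), coarse_end + n_f + 1)):
        if v[n] < s2 or _slope_changes(v, n):
            end = n                              # nearest event after P'F
            break
    return start, end


def plausible_duration(start, end, frame_ms=10, p_min_ms=100, p_max_ms=2000):
    """Keep only words whose duration lies between Pmin and Pmax."""
    duration = (end - start) * frame_ms
    return p_min_ms <= duration <= p_max_ms


v = [0.0] * 10 + [1.0] * 40 + [0.0] * 40
trimmed = trim_endpoints(v, coarse_start=13, coarse_end=46)
keep = plausible_duration(*trimmed)
```

Both searches are bounded by NI and NF, so the cost of trimming is fixed per word regardless of the length of the recording.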

The Inventor developed an apparatus to realise the method for recognition of isolated words according to this invention in real time. In particular, the concerned apparatus comprises a finite state machine aimed at distinguishing the various operating conditions of the EPD detector, as well as a number of storage or buffer registers that make it possible to store the results relating to the latest frames for carrying out said coarse marking and then said trimming stages. This automatic apparatus is characterised by the following four states: A. initial or non-activity state, B. activity begin or false alarm state, C. activity state, and D. activity end or micropause state.

The above mentioned state machine starts from the initial state A and runs through its various states, thereby establishing the frames P'I and P'F connected with said coarse marking, while the analysis of the buffer contents makes it possible to carry out said trimming step after having located said coarse initial and final points P'I and P'F, respectively.
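A four-state machine of this kind can be sketched as follows. The text names the states but this passage does not give the transition rules, so every transition below is an assumption made for illustration: a frame is treated as active when V is at or above a threshold s1, a word is confirmed after min_active consecutive active frames, and a word is closed after max_pause consecutive inactive frames. The values of s1, min_active and max_pause are hypothetical.

```python
from enum import Enum


class State(Enum):
    A = "initial / non-activity"
    B = "activity begin / false alarm"
    C = "activity"
    D = "activity end / micropause"


def coarse_marking(v, s1=0.4, min_active=3, max_pause=5):
    """Return a list of (coarse_start, coarse_end) frame pairs.

    All transition rules are assumed, not taken from the source text.
    """
    state = State.A
    start = last_active = None
    run = 0
    words = []
    for i, x in enumerate(v):
        active = x >= s1
        if state is State.A:
            if active:                      # possible word onset
                state, start, run = State.B, i, 1
        elif state is State.B:
            if active:
                run += 1
                if run >= min_active:       # onset confirmed
                    state = State.C
            else:                           # false alarm, go back to rest
                state = State.A
        elif state is State.C:
            if not active:                  # possible word end
                state, last_active, run = State.D, i - 1, 1
        elif state is State.D:
            if active:                      # it was only a micropause
                state = State.C
            else:
                run += 1
                if run >= max_pause:        # word end confirmed
                    words.append((start, last_active))
                    state = State.A
    # Close a word that is still open when the signal ends.
    if state is State.C:
        words.append((start, len(v) - 1))
    elif state is State.D:
        words.append((start, last_active))
    return words


# Five active frames, a two-frame micropause, four active frames, silence.
v = [0.0] * 3 + [1.0] * 5 + [0.0] * 2 + [1.0] * 4 + [0.0] * 10
marks = coarse_marking(v, s1=0.5)
```

Because the machine only looks at the current frame and a few counters, it works in the forward direction and needs no long signal buffer.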

In the method for segmentation of words according to this invention and in the EPD apparatus related thereto, the fourth processing unit 6 of the VAD section of the EPD detector carries out a simplified computation of the average values of parameters Ef, El and ZCR. In particular, the average value of each parameter is established by analysing an initial portion of the signal, preferably having a duration in the range of 100 to 500 ms. Still more preferably, exclusively an initial portion of the signal having a net duration of 300 ms is considered, starting from the time at which the user switches on the apparatus (or the method is started). This is justified by the assumption that such initial portion of the signal represents the environmental noise that adds to the subsequently voiced word.
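The simplified averaging can be illustrated with a short sketch. It assumes that each frame is already described by a tuple (Ef, El, ZCR) and that the frame duration is passed as the hypothetical parameter frame_ms. The function name is illustrative.

```python
from statistics import fmean


def initial_noise_means(frames, frame_ms, window_ms=300.0):
    """Average Ef, El and ZCR over the initial portion of the signal.

    frames: sequence of (Ef, El, ZCR) tuples, one per frame.
    Only the frames covering the first window_ms milliseconds are used,
    on the assumption that they contain environmental noise only.
    """
    n = max(1, int(window_ms // frame_ms))
    head = frames[:n]
    if not head:
        raise ValueError("no frames available for noise estimation")
    ef, el, zcr = zip(*head)
    return fmean(ef), fmean(el), fmean(zcr)


# 30 noise-only frames of 10 ms followed by 10 louder frames.
frames = [(1.0, 2.0, 3.0)] * 30 + [(100.0, 100.0, 100.0)] * 10
means = initial_noise_means(frames, frame_ms=10.0)
```

Since the averages are computed once, at start-up, the per-frame cost of the detector afterwards is limited to the parameter extraction and the network evaluation.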

In other embodiments of the recognition method and of the apparatus related thereto, a maximum time period Δt, preferably in the range of 2 to 6 seconds, can be provided between the time point at which the method or the apparatus is started and the starting point of the concerned word, such that, upon expiry of this time period, the method or the apparatus is returned to the initial stage, thereby indicating that no voice command has been recorded and consequently recognised.

The performance of a word boundary detector is evaluated by utilising a parameter WTE (Weighted Total Error) that considers the following four possible detection errors: 1) Start Front Advance (SFA), defined as the number of frames by which the initial point of the automatic marking is anticipated with respect to the manual one; 2) Start Front Clipping (SFC), defined as the number of frames by which the initial point of the automatic marking is delayed with respect to the manual one; 3) End Front Clipping (EFC), defined as the number of frames by which the final point of the automatic marking is anticipated with respect to the manual one; 4) End Front Delay (EFD), defined as the number of frames by which the final point of the automatic marking is delayed with respect to the manual one.

Parameter WTE, as expressed by the formula WTE = 1.4 * (SFC + EFC) + 0.6 * (SFA + EFD), allots a higher weight to the cuts SFC and EFC introduced within the word in view of the fact that, as can be observed in Figure 1, an error concerning the cuts is more severe than any error possibly made in respect of the enlargements.
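The four boundary errors and the weighted total can be computed as in the sketch below. The weight 0.6 for the enlargements comes from the text. The clipping weight is passed as a parameter and is taken here as 1.4, a value that should be checked against the published formula. Function and argument names are illustrative.

```python
def boundary_errors(auto_start, auto_end, ref_start, ref_end):
    """Return (SFA, SFC, EFC, EFD) in frames.

    auto_*: end points from the automatic marking.
    ref_*:  end points from the manual (reference) marking.
    """
    sfa = max(ref_start - auto_start, 0)   # start marked too early
    sfc = max(auto_start - ref_start, 0)   # start marked too late (clipping)
    efc = max(ref_end - auto_end, 0)       # end marked too early (clipping)
    efd = max(auto_end - ref_end, 0)       # end marked too late
    return sfa, sfc, efc, efd


def wte(auto_start, auto_end, ref_start, ref_end, w_clip=1.4, w_ext=0.6):
    """Weighted total error; w_clip = 1.4 is an assumed value."""
    sfa, sfc, efc, efd = boundary_errors(auto_start, auto_end,
                                         ref_start, ref_end)
    return w_clip * (sfc + efc) + w_ext * (sfa + efd)


# Automatic marking two frames wider than the manual one on each side.
errors = boundary_errors(8, 52, 10, 50)
score_wide = wte(8, 52, 10, 50)
# Automatic marking two frames narrower on each side (clipped word).
score_clipped = wte(12, 48, 10, 50)
```

With any clipping weight above 0.6, a marking that cuts into the word scores worse than one that enlarges it by the same number of frames, which is the behaviour the text motivates.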

In terms of WTE and of the number and average duration of the false alarms, a comparison with conventional EPD detectors shows that a neural EPD detector according to this invention performs better in word boundary identification for all noise kinds and all SNR values, in that it reduces, on average, to a half the error, the number and the average duration of the false alarms. The EPD detector according to this invention, therefore, provides better accuracy for the whole voice recognition system.

A further embodiment of the method for detection of voice activity according to this invention provides for a VAD device based upon the three parameters ΔEf, ΔEl and ΔZCR as well as on the second and fifth cepstral coefficients, c2 and c5, respectively.

In other embodiments of the VAD device according to this invention, said neural network 10 is trained by means of parameters extracted from the "clean" signal having "babble" noise, "car" noise, "traffic" noise and "white" noise, respectively, added thereto, with a SNR equal to 20 dB and 10 dB, or by means of parameters exclusively extracted from the "clean" signal.

The method for segmentation of isolated words appears to be particularly robust in respect of the environmental noise and has the further advantage of a reduced computational complexity, in view of the reduced number and of the simplicity of all parameters utilised therein, as well as in view of the simplicity of the post-processing and matching algorithm, which enables the end points of a word to be obtained reliably.

In contrast with conventional EPD algorithms, the EPD recognition method according to this invention utilises fixed thresholds and is of the so-called "forward" type, that is to say that it analyses the word only in the forward direction, so that it does not require long buffers to store the signal to be processed.

The preferred embodiments of this invention have been described and a number of variations have been suggested hereinbefore, but it should expressly be understood that those skilled in the art can make other variations and changes without departing from the scope thereof, as defined by the following claims.
