

Title:
AUDIO SOURCE SEPARATION
Document Type and Number:
WIPO Patent Application WO/2023/052345
Kind Code:
A1
Abstract:
An electronic device having circuitry configured to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

Inventors:
UHLICH STEFAN (DE)
FABBRO GIORGIO (DE)
ENENKL MICHAEL (DE)
KEMP THOMAS (DE)
OSAKO KEIICHI (DE)
Application Number:
PCT/EP2022/076804
Publication Date:
April 06, 2023
Filing Date:
September 27, 2022
Assignee:
SONY GROUP CORP (JP)
SONY EUROPE BV (GB)
International Classes:
G10L21/0272; G06N3/02; G10H1/36; G10L25/30; G10L25/81
Domestic Patent References:
WO2015150066A12015-10-08
Foreign References:
US20070021958A12007-01-25
US20160180861A12016-06-23
US20180350381A12018-12-06
CN111540374A2020-08-14
US20180122403A12018-05-03
EP3201917A12017-08-09
Other References:
UHLICH, STEFAN ET AL.: "Improving music source separation based on deep neural networks through data augmentation and network blending", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017
STÖTER, FABIAN-ROBERT ET AL.: "Open-Unmix - A Reference Implementation for Music Source Separation"
Attorney, Agent or Firm:
MFG PATENTANWÄLTE MEYER-WILDHAGEN, MEGGLE-FREUND, GERHARD PARTG MBB (DE)
Claims:

CLAIMS

1. An electronic device comprising circuitry configured to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

2. The electronic device of claim 1 further comprises circuitry configured, if the source separation is deactivated by the enable signal, to adjust the audio signal to obtain an adjusted audio signal as the processed audio signal.

3. The electronic device of claim 1 further comprises circuitry configured to change a position of a switch based on a value of the enable signal to activate or deactivate the source separation.

4. The electronic device of claim 1, wherein the source separation is implemented by a deep neural network (DNN) and the enable signal is used to deactivate some or all layers of the DNN such that their outputs are no longer updated.

5. The electronic device of claim 2 further comprises circuitry configured to apply a gain to the audio signal based on the enable signal to obtain the adjusted audio signal.

6. The electronic device of claim 1 further comprises circuitry configured, if the source separation is deactivated by the enable signal, to delay the audio signal to obtain a delayed audio signal.

7. The electronic device of claim 2 further comprises circuitry configured to apply a gain to a user’s vocals signal to obtain an adjusted user’s vocals signal, the user’s vocals signal being acquired by a microphone.

8. The electronic device of claim 7 further comprises circuitry configured to mix the adjusted user’s vocals signal with the processed audio signal to obtain a mixed audio signal.

9. The electronic device of claim 1 further comprises circuitry configured to perform enable signal generation based on the separated source and the residual signal to obtain the enable signal.

10. The electronic device of claim 9 further comprises circuitry configured to perform vocals detection on the audio signal to obtain a vocals detection signal, wherein the enable signal generation is performed based on the vocals detection signal, the separated source and the residual signal to obtain the enable signal.

11. The electronic device of claim 9, wherein the enable signal is pre-computed on a server side.

12. The electronic device of claim 9, wherein the enable signal is computed during the first time a song is played on the electronic device.

13. The electronic device of claim 1, wherein the separated source comprises vocals and the residual signal comprises accompaniment.

14. The electronic device of claim 13 further comprises circuitry configured to apply a gain to the vocals to obtain adjusted vocals and apply a gain to the accompaniment to obtain adjusted accompaniment.

15. The electronic device of claim 14 further comprises circuitry configured to mix the adjusted vocals with the adjusted accompaniment to obtain the processed audio signal.

16. The electronic device of claim 1, wherein the audio signal comprises at least one of vocals and accompaniment or wherein the separated source comprises speech and the residual signal comprises background noise.

17. The electronic device of claim 1, wherein the processed audio signal is output to a loudspeaker system.

18. An electronic device comprising circuitry configured to perform source separation on an audio signal to obtain a separated source and a residual signal; and perform enable signal generation based on the separated source and the residual signal to obtain an enable signal, wherein the enable signal is configured to activate or deactivate the source separation.

19. The electronic device of claim 18 further comprises circuitry configured to perform vocals detection on the separated source and the residual signal to obtain a vocals detection signal, wherein the enable signal generation is performed based on the vocals detection signal, the separated source and the residual signal to obtain the enable signal.

20. The electronic device of claim 18, wherein the enable signal is pre-computed on a server side using a vocals detection network, or the enable signal is computed during the first time a song is played on the electronic device using an energy threshold on the separated source and the residual signal.

21. A method comprising: performing source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

22. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 21.

Description:
AUDIO SOURCE SEPARATION

TECHNICAL FIELD

The present disclosure generally pertains to the field of audio processing, and in particular, to devices, methods and computer programs for audio playback.

TECHNICAL BACKGROUND

There is a lot of audio content available, for example, in the form of compact disks (CD), tapes, audio data files which can be downloaded from the internet, but also in the form of sound tracks of videos, e.g. stored on a digital video disk or the like, etc.

When a music player is playing a song from an existing music database, the listener may want to sing along. Typically, state-of-the-art karaoke and play-along systems constantly use audio source separation to remove the original vocals from the played-back song.

It is generally desirable to improve methods and apparatus for reducing energy consumption.

SUMMARY

According to a first aspect, the disclosure provides an electronic device comprising circuitry configured to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

According to a second aspect, the disclosure provides a method comprising performing source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

According to a third aspect, the disclosure provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

Further aspects are set forth in the dependent claims, the following description, and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments are explained by way of example with respect to the accompanying drawings, in which:

Fig. 1 schematically shows a general approach of audio mixing by means of blind source separation (BSS), such as music source separation (MSS);

Fig. 2 schematically shows an embodiment of a process of audio mixing based on audio processing;

Fig. 3 schematically shows in more detail an embodiment of the audio processing performed in the process of audio mixing described in Fig. 2, wherein source separation is performed based on an enable signal;

Fig. 4 schematically illustrates a deep neural network comprising a recurrent neural network (RNN) and additional non-recurrent learnable layers before and after the RNN layers, wherein switching-off the source separation described in Fig. 3 is performed;

Fig. 5 schematically shows an embodiment of a process of enable signal generation, wherein the enable signal is generated for the first time;

Fig. 6 schematically shows a diagram of an enable signal over time during song play-back;

Fig. 7a schematically shows a table in which an enable signal value and a switch position are mapped;

Fig. 7b schematically shows a table in which an enable signal value and a gain factor are mapped;

Fig. 8 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is true-vocals and accompaniment;

Fig. 9 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is false-only vocals;

Fig. 10 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is false-only accompaniment;

Fig. 11 schematically shows another embodiment of a process of enable signal generation, wherein vocals detection is performed;

Fig. 12 shows a flow diagram visualizing a method for signal mixing related to audio processing by performing source separation based on an enable signal to obtain a mixed audio signal;

Fig. 13 shows a flow diagram visualizing a method for audio processing related to source separation based on an enable signal to obtain an adjusted audio signal; and

Fig. 14 shows a block diagram depicting an embodiment of an electronic device that can implement the processes of audio mixing based on an enable signal and audio processing.

DETAILED DESCRIPTION OF EMBODIMENTS

Before a detailed description of the embodiments under reference of Fig. 1 to 14 is given, some general explanations are made.

As indicated at the outset, play-along systems, for example karaoke systems, typically use audio source separation constantly to remove the original vocals during song playback. However, it has been recognized that, for example on karaoke devices, such constantly performed audio source separation may be energy consuming, which may result in a quick drain of the battery of such karaoke devices.

Consequently, some embodiments pertain to an electronic device configured to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

The circuitry of the electronic device may include a processor, which may for example be a CPU, a memory (RAM, ROM or the like) and/or storage, interfaces, etc. Circuitry may comprise or may be connected with input means (mouse, keyboard, camera, etc.), output means (a display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as is generally known for electronic devices (computers, smartphones, etc.). Moreover, circuitry may comprise or may be connected with sensors for sensing still images or video image data (image sensor, camera sensor, video sensor, etc.), for sensing environmental parameters (e.g. radar, humidity, light, temperature), etc.

In audio source separation, an audio signal comprising a number of sources (e.g. instruments, voices, or the like) is decomposed into separations. Audio source separation may be unsupervised (called “blind source separation”, BSS) or partly supervised. “Blind” means that the blind source separation does not necessarily have information about the original sources. For example, it may not necessarily know how many sources the original signal contained, or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal without knowing the separations beforehand. A blind source separation unit may use any of the blind source separation techniques known to the skilled person. In (blind) source separation, source signals may be searched for that are minimally correlated or maximally independent in a probabilistic or information-theoretic sense, or, on the basis of non-negative matrix factorization, structural constraints on the audio source signals can be found. Methods for performing (blind) source separation are known to the skilled person and are based on, for example, principal components analysis, singular value decomposition, independent component analysis, non-negative matrix factorization, artificial neural networks, etc.

Although some embodiments use blind source separation for generating the separated audio source signals, the present disclosure is not limited to embodiments where no further information is used for the separation of the audio source signals; in some embodiments, further information is used for the generation of separated audio source signals. Such further information can be, for example, information about the mixing process, information about the type of audio sources included in the input audio content, information about a spatial position of audio sources included in the input audio content, etc.

The audio signal can be an audio signal of any type. It can be in the form of analog or digital signals, it can originate from a compact disk, digital video disk, or the like, or it can be a data file, such as a wave file, mp3-file or the like, and the present disclosure is not limited to a specific format of the input audio content. An input audio content may for example be a stereo audio signal having a first channel input audio signal and a second channel input audio signal, without the present disclosure being limited to input audio contents with two audio channels. In other embodiments, the input audio content may include any number of channels, such as a 5.1 audio signal for remixing, or the like.

The input signal may comprise one or more source signals. In particular, the input signal may comprise several audio sources. An audio source can be any entity which produces sound waves, for example, music instruments, voice, speech, vocals, artificially generated sound, e.g. originating from a synthesizer, etc.

The input audio content may represent or include mixed audio sources, which means that the sound information is not separately available for all audio sources of the input audio content, but that the sound information for different audio sources at least partially overlaps or is mixed.

The separated source produced by source separation from the audio signal may for example comprise a “vocals” separation, a “bass” separation, a “drums” separation and an “other” separation. In the “vocals” separation all sounds belonging to human voices might be included, in the “bass” separation all sounds below a predefined threshold frequency might be included, in the “drums” separation all sounds belonging to the “drums” in a song/piece of music might be included, and in the “other” separation all remaining sounds might be included.

In a case where the separated source is “vocals”, a residual signal may be “accompaniment”, without limiting the present disclosure in that regard. Alternatively, other types of separated sources may be obtained; for example, in a speech enhancement case, the separated source may be “speech” and the residual signal may be “background noise”. Still alternatively, in an instrument separation case, the separated source may be “drums” and the residual signal may be “vocals”, “bass”, “guitar”, “other”, or the like.

Source separation obtained by a Music Source Separation (MSS) system may result in artefacts such as interference, crosstalk or noise.

The processed audio signal may be a signal that comprises the separated source and the residual signal. In other words, the separated source may be adjusted by a gain factor or the like based on the enable signal and then mixed with the residual signal such that the processed audio signal is obtained.

The enable signal may be a digital signal, such as a Boolean signal, i.e. a signal having true and false values, indicating whether only vocals, only accompaniment, or vocals and accompaniment are present in the audio signal. Alternatively, the enable signal may be a binary signal, i.e. a signal having two binary values, namely “0” and “1”, indicating whether or not vocals are present in the audio signal, without limiting the present disclosure in that regard. Still alternatively, the enable signal may be a signal having four binary flags indicating whether or not bass, drums, vocals and other are present in the audio signal. Still alternatively, the enable signal may have values indicating either “on”, “only speech”, “only noise” or a subset thereof.

The enable signal may be a signal that serves as a trigger which activates or deactivates the source separation. For example, the enable signal may switch on the source separation, that is, source separation is performed on the received audio when the source separation is activated by the enable signal; or it may switch off the source separation, that is, source separation is not performed on the received audio when the source separation is deactivated by the enable signal.

In some embodiments, the electronic device may further comprise circuitry configured, if the source separation is deactivated by the enable signal, to adjust the audio signal to obtain an adjusted audio signal as the processed audio signal. The audio signal may be adjusted by applying a gain factor, e.g. a gain parameter, to the audio signal to obtain the processed audio signal.

The source separation may be activated and deactivated based on the enable signal. By performing activation and deactivation of the source separation, the energy consumption of the electronic device, for example a karaoke system, may be reduced.

In some embodiments, the electronic device may further comprise circuitry configured to change a position of a switch based on a value of the enable signal to activate or deactivate the source separation. For example, by changing the position of the switch, the source separation may be activated, i.e. switched-on, or deactivated, i.e. switched-off.

In some embodiments, the source separation may be implemented by a deep neural network (DNN) and the enable signal may be used to deactivate some or all layers of the DNN such that their outputs are no longer updated. The deep neural network may be any kind of DNN, such as, for example, a recurrent neural network (RNN), a feed-forward neural network (FFNN), a convolutional neural network (CNN), or the like. The source separation may be deactivated by freezing the neural network, and thus also the hidden states of the recurrent layers, whereby all computations may be saved. Alternatively, the source separation may be deactivated by performing forward propagation through the neural network up to the recurrent layers and updating their hidden states, whereby all operations that come after them, such as the operations of the decoding layers, may be saved.

In some embodiments, the enable signal may be configured to activate the source separation if the value of the enable signal is “true” and to deactivate the source separation if the value of the enable signal is “false”. For example, in some embodiments, the value of the enable signal may be “true-vocals and accompaniment”, “false-only vocals”, or “false-only accompaniment”, without limiting the present disclosure in that regard. Alternatively, the value of the enable signal may be “on”, “only speech”, “only noise” or a subset thereof.

In some embodiments, the electronic device may further comprise circuitry configured to activate the source separation if the value of the enable signal is “true-vocals and accompaniment”, or deactivate the source separation if the value of the enable signal is “false-only vocals” or “false-only accompaniment”. In this manner, the source separation is deactivated by the enable signal if the audio signal comprises only vocals or only accompaniment. The value of the enable signal “true-vocals and accompaniment” may indicate that the audio signal comprises vocals and accompaniment. The value of the enable signal “false-only vocals” may indicate that the audio signal comprises only vocals. The value of the enable signal “false-only accompaniment” may indicate that the audio signal comprises only accompaniment.

In some embodiments, the electronic device may further comprise circuitry configured to apply a gain to the audio signal based on the enable signal to obtain the adjusted audio signal. For example, in some embodiments, the electronic device may further comprise circuitry configured to apply a delay to the audio signal to obtain a delayed audio signal if the source separation is deactivated based on the enable signal, and to apply a gain to the delayed audio signal, so as to adjust the audio signal and thus obtain the adjusted audio signal. The processed audio signal may comprise the adjusted audio signal. Alternatively, the processed audio signal may be the adjusted audio signal.

The audio signal may be an audio signal comprising vocals, or only vocals and the gain may be a gain factor, i.e. a gain parameter applied to the vocals, for example, +3dB, -12dB, -20dB, to increase or decrease the volume of the vocals, or a gain factor for generating silence, or the like. The skilled person may however choose the gain to be applied in other ways according to the needs of the specific use case.

Alternatively, the audio signal may be an audio signal comprising accompaniment or only accompaniment, and the gain may be a gain factor, i.e. a gain parameter applied to the accompaniment, for example, +6dB, 0dB, -6dB, or the like to increase, decrease or leave unchanged the volume of the accompaniment. For example, the skilled person may set a gain factor to be applied to the accompaniment as a predefined parameter according to the specific requirements of the instrument at issue or according to the needs of the specific use case.
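By way of illustration, a gain factor given in dB corresponds to a linear amplitude factor of 10^(g/20). A minimal NumPy sketch (function and signal names are illustrative, not taken from the disclosure):

    import numpy as np

    def apply_gain(audio, gain_db):
        """Scale an audio signal by a gain factor given in dB."""
        return audio * 10.0 ** (gain_db / 20.0)

    # Example: attenuate separated vocals by 12 dB and leave the
    # accompaniment unchanged (0 dB gain factor).
    vocals = np.zeros(48000, dtype=np.float32)            # placeholder signal
    adjusted_vocals = apply_gain(vocals, -12.0)           # -12 dB -> factor ~0.25
    adjusted_accompaniment = apply_gain(vocals, 0.0)      # 0 dB -> factor 1.0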

In some embodiments, the electronic device may further comprise circuitry configured to apply a gain to a user’s vocals signal to obtain an adjusted user’s vocals signal, wherein the user’s vocals signal may be acquired by a microphone. The gain may be a gain factor, i.e. a gain parameter applied to the user’s vocals signal, for example, +3dB, +6dB, -3dB, or the like, to increase or decrease the volume of the user’s vocals. The skilled person may however choose the gain to be applied in other ways according to the needs of the specific use case.

In some embodiments, the electronic device may further comprise circuitry configured to mix the adjusted user’s vocals signal with the processed audio signal to obtain a mixed audio signal.

In some embodiments, the electronic device may further comprise circuitry configured to perform enable signal generation based on the separated source and the residual signal to obtain the enable signal, without limiting the present disclosure in that regard. In some embodiments, the electronic device may further comprise circuitry configured to perform vocals detection on the audio signal to obtain a vocals detection signal.

In some embodiments, the electronic device may further comprise circuitry configured to perform vocals detection on the audio signal to obtain a vocals detection signal, wherein the enable signal generation is performed based on the vocals detection signal, the separated source and the residual signal to obtain the enable signal.

In some embodiments, the enable signal may be computed during the first time a song is played on the electronic device. The enable signal may be computed from the output of the source separation itself, when seeing an audio for the first time, by using a simple energy threshold on the vocals and/or accompaniment signal, without limiting the present disclosure in that regard.
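As an illustration of this energy-threshold idea, the enable value could be derived per frame from the RMS energy of the separated vocals and of the accompaniment. The following sketch is one possible realization under assumed parameters (frame length, threshold value and mono input signals are illustrative):

    import numpy as np

    def enable_signal_from_energy(vocals, accompaniment, frame=4096, thresh=1e-3):
        """Per-frame enable signal: separation is only needed where both the
        separated vocals and the accompaniment carry energy above a threshold."""
        n_frames = min(len(vocals), len(accompaniment)) // frame
        enable = []
        for i in range(n_frames):
            sl = slice(i * frame, (i + 1) * frame)
            v = np.sqrt(np.mean(vocals[sl] ** 2))         # RMS of separated vocals
            a = np.sqrt(np.mean(accompaniment[sl] ** 2))  # RMS of accompaniment
            if v > thresh and a > thresh:
                enable.append("true-vocals and accompaniment")
            elif v > thresh:
                enable.append("false-only vocals")
            else:
                enable.append("false-only accompaniment")
        return enable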

In some embodiments, the enable signal may be pre-computed on a server side. For example, the enable signal may be pre-computed on the streaming server side using a “vocals detection network” and then transmitted to the electronic device together with the audio. This transmission may be done by embedding the signal directly into the audio by means of watermarking techniques. Alternatively, the enable signal may be computed on the electronic device using a small “vocals detection network”. This may decrease the overall power consumption if more operations are saved by means of the enable signal than are required to compute it.

In some embodiments, the separated source may comprise vocals and the residual signal may comprise accompaniment.

In some embodiments, the electronic device may further comprise circuitry configured to apply a gain to the vocals to obtain adjusted vocals and apply a gain to the accompaniment to obtain adjusted accompaniment. The gain may be a gain factor, i.e. a gain parameter applied to the vocals, for example, -12dB, -20dB, or the like, to increase or decrease the volume of the vocals. The gain may be a gain factor, i.e. a gain parameter applied to the accompaniment, for example, -3dB, 0dB, +3dB, +6dB, or the like, to increase, decrease or leave unchanged the volume of the accompaniment. The skilled person may however choose the gain to be applied in other ways according to the needs of the specific use case.

In some embodiments, the electronic device may further comprise circuitry configured to mix the adjusted vocals with the adjusted accompaniment to obtain the processed audio signal.

In some embodiments, the audio signal may comprise at least one of vocals and accompaniment, without limiting the present disclosure in that regard. Alternatively, in some embodiments, the separated source may comprise speech and the residual signal may comprise background noise. Still alternatively, the separated source may comprise drums and the residual signal may comprise bass, other or the like. Depending on the use case, source separation may be performed to obtain a suitable separated source and a suitable residual signal.

In some embodiments, the user’s vocals may be acquired by a microphone. In some embodiments, the microphone may be a microphone of an electronic device such as a smartphone, headphones, a TV set, a Blu-ray player.

In some embodiments, the processed audio signal may be output to a loudspeaker system. The loudspeaker system may be a loudspeaker array of an electronic device, such that the user of the electronic device may sing along while listening to the played-back audio.

The embodiments also disclose an electronic device comprising circuitry configured to perform source separation on an audio signal to obtain a separated source and a residual signal, and to perform enable signal generation based on the separated source and the residual signal to obtain an enable signal, wherein the enable signal is configured to activate or deactivate the source separation.

In some embodiments, the electronic device may further comprise circuitry configured to perform vocals detection on the separated source and the residual signal to obtain a vocals detection signal, wherein the enable signal generation is performed based on the vocals detection signal, the separated source and the residual signal to obtain the enable signal.

In some embodiments, the enable signal may be pre-computed on a server side using a vocals detection network, or the enable signal may be computed during the first time a song is played on the electronic device using an energy threshold on the separated source and the residual signal.

The embodiments also disclose a method comprising performing source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

It is to be noted that the methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.

The embodiments also disclose a computer program comprising instructions which, when the program is executed by a computer, cause the computer to perform source separation on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

The embodiments also disclose a non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes source separation to be performed on an audio signal based on an enable signal to obtain a processed audio signal comprising a separated source and a residual signal, wherein the enable signal is configured to activate or deactivate the source separation.

Audio mixing by means of audio source separation

Fig. 1 schematically shows a general approach of audio mixing by means of blind source separation (BSS), such as music source separation (MSS).

First, source separation (also called “demixing”) is performed, which decomposes a source audio signal 1 comprising multiple channels i and audio from multiple audio sources Source 1, Source 2, ..., Source K (e.g. instruments, voice, etc.) into “separations”, here into source estimates 2a-2d for each channel i, wherein K is an integer number denoting the number of audio sources. In the embodiment here, the source audio signal 1 is a stereo signal having two channels i = 1 and i = 2. As the separation of the audio source signal may be imperfect, for example due to the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a-2d. The residual signal may for example represent a difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its respective recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround sound input audio content, spatial information for the audio sources is typically also included or represented by the input audio content, e.g. by the proportion of the audio source signal included in the different audio channels. The separation of the input audio content 1 into separated audio source signals 2a-2d and a residual 3 is performed on the basis of blind source separation or other techniques which are able to separate audio sources.
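For illustration, the residual signal r(n) described above can be formed as the per-sample difference between the input audio and the sum of all separated source estimates; a minimal sketch:

    import numpy as np

    def residual(mix, estimates):
        """Residual r(n): input audio content minus the sum of all
        separated audio source signal estimates (per channel and sample)."""
        return mix - np.sum(estimates, axis=0)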

In a second step, the separations 2a-2d and the possible residual 3 are remixed and rendered to a new loudspeaker signal 4, here a signal comprising five channels 4a-4e, namely a 5.0 channel system. Based on the separated audio source signals and the residual signal, an output audio content is generated by mixing the separated audio source signals and the residual signal taking into account spatial information. The output audio content is illustrated by way of example and denoted with reference number 4 in Fig. 1.

In the following, the number of audio channels of the input audio content is referred to as M_in and the number of audio channels of the output audio content is referred to as M_out. As the input audio content 1 in the example of Fig. 1 has two channels i = 1 and i = 2 and the output audio content 4 in the example of Fig. 1 has five channels 4a-4e, M_in = 2 and M_out = 5. The approach in Fig. 1 is generally referred to as remixing, and in particular as upmixing if M_in < M_out. In the example of Fig. 1 the number of audio channels M_in = 2 of the input audio content 1 is smaller than the number of audio channels M_out = 5 of the output audio content 4; the example is thus an upmixing from the stereo input audio content 1 to the 5.0 surround sound output audio content 4. Technical details about the source separation process described in Fig. 1 above are known to the skilled person. An exemplifying technique for performing blind source separation is for example disclosed in European patent application EP 3 201 917, or by Uhlich, Stefan, et al., “Improving music source separation based on deep neural networks through data augmentation and network blending”, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017. There also exist programming toolkits for performing blind source separation, such as Open-Unmix, DEMUCS, Spleeter, Asteroid, or the like, which allow the skilled person to perform a source separation process as described in Fig. 1 above. It is known to include additional non-recurrent learnable layers before and after the RNNs: their role is to encode the signal so as to provide the RNNs with a more appropriate signal representation, as described by Stöter, Fabian-Robert, et al. in “Open-Unmix - A Reference Implementation for Music Source Separation”, where additional layers are used before and after the RNNs.

Audio mixing based on audio processing and an enable signal

Fig. 2 schematically shows an embodiment of a process of audio mixing based on audio processing. The process makes it possible to perform audio mixing using audio processing based on an enable signal.

An audio 200 (see audio input signal 1 in Fig. 1) containing multiple sources (see 1, 2, ..., K in Fig. 1), with, for example, multiple channels (e.g. M_in = 2), e.g. a piece of music, is input to audio processing 202 and processed based on an enable signal 201 to obtain a processed audio 206, i.e. a processed audio signal. A gain 204 is applied to the user’s vocals 203 to obtain adjusted user’s vocals 207. A mixer 205 mixes the processed audio 206 with the adjusted user’s vocals 207 to obtain a mixed audio 208, i.e. a mixed audio signal.

In the embodiment of Fig. 2, audio processing is performed based on an enable signal, which is obtained during source separation when an audio, e.g. a song, is played back for the first time by a play-back device, such as a music player, as described in Fig. 5. The enable signal may be a digital signal, such as a Boolean signal, i.e. a signal having true and false values, or a binary signal, i.e. a signal having “0” and “1” values. In the present disclosure, as described in Fig. 5, the enable signal is a Boolean signal having true and false values, without limiting the present disclosure in that regard. Alternatively, the enable signal may be a binary signal, or the like.

In the embodiment of Fig. 2, a gain is applied to the user’s vocals to adjust the user’s vocals in accordance with the user’s preferences. For example, the gain may be a preset parameter that adjusts the user’s vocals accordingly, or it may be a parameter that the user sets in real time. The preset gain parameter may comprise a predefined volume change parameter, for example a predefined volume increase or decrease parameter related to the vocals. For example, the predefined volume increase parameter may be a volume increase parameter of +3dB, or the like, and the predefined volume decrease parameter may be a volume decrease parameter of -3dB, or the like, without limiting the present embodiment in that regard. Any parameter suitable to the skilled person may be used to adjust the user’s vocals. Alternatively, no gain may be applied to the user’s vocals.
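The mixing stage of Fig. 2 then amounts to applying the preset or real-time gain to the microphone signal and summing it with the processed audio; a minimal sketch (function and variable names are illustrative):

    import numpy as np

    def mix_with_user_vocals(processed_audio, user_vocals, user_gain_db=3.0):
        """Mixer 205: add the adjusted user's vocals 207 (gain 204 applied)
        to the processed audio 206 to obtain the mixed audio 208."""
        adjusted = user_vocals * 10.0 ** (user_gain_db / 20.0)  # gain 204
        mixed = processed_audio + adjusted                      # mixer 205
        return np.clip(mixed, -1.0, 1.0)  # keep samples in the [-1, 1] range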

It is to be noted that the user’s vocals 203 may be received via a microphone, e.g. a microphone included in a microphone array (see 1310 in Fig. 14).

It is to be noted that the processed audio 206 and/or the mixed audio 208 may be output to a loudspeaker system (see 1309 in Fig. 14), e.g. on-ear, in-ear, over-ear, wireless headphones, etc., and/or may be recorded to a recording medium, e.g. a CD, etc., or stored in a memory of an electronic device (see 1302 in Fig. 14), or the like. For example, the processed audio 206 is output to the headphones of the user such that the user can sing along with the played-back audio.

Audio processing based on source separation and an enable signal

Fig. 3 schematically shows in more detail an embodiment of the audio processing performed in the process of audio mixing described in Fig. 2, wherein source separation is performed based on an enable signal. In the present embodiment, the enable signal has already been generated by an enable signal generation process, as described in Fig. 5, and stored in a memory.

The audio 200 (see also audio input signal 1 in Fig. 1) containing multiple sources (see 1, 2, ..., K in Fig. 1), with, for example, multiple channels (e.g. M_in = 2), e.g. a piece of music, is input to audio processing 202 together with the enable signal 201, as described in Fig. 2 above. Source separation 301 is performed on the audio signal 200 during audio processing 202, and the audio signal 200 is decomposed into, here, vocals and accompaniment. In the present embodiment, the source separation 301 has two states, namely a first state being an activation state in which source separation is performed, i.e. a switch-on state, and a second state being a deactivation state in which source separation is not performed, i.e. a switch-off state. Based on the enable signal 201, audio processing 202 is performed on the audio 200, wherein the enable signal 201 activates, i.e. switches on, or deactivates, i.e. switches off, the source separation 301. In the embodiment of Fig. 3, if the state of the source separation is the activation state, that is, the state in which source separation is switched on (here position AB of a switch 300), source separation is performed on the audio signal 200 to obtain vocals and accompaniment. If the state of the source separation is the deactivation state, that is, the state in which source separation is switched off (here position AC of the switch 300), source separation is not performed and the audio signal 200 is processed by a delay 303 and a gain 304. The enable signal 201 indicates, over the duration of the audio, whether the audio 200 contains only vocals, only accompaniment or vocals and accompaniment, and thus whether the source separation 301 is activated or deactivated.

The source separation model may use machine learning techniques such as a neural network, e.g. a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN) and the like. In this source separation model, the source separation 301 is implemented as a recurrent neural network. For example, for the case of RNN layers inside the neural network, there are at least two ways to switch off the network, here to switch off the source separation 301. One way is to freeze the neural network, and therefore the hidden states of the RNN layers, and thus to save all computations (see Fig. 4 below). Another way is to forward-propagate through the neural network up to the RNN layers and update their hidden states, and thus to save all operations that come after the RNN, i.e. the decoding layers (see Fig. 4 below).

In the embodiment of Fig. 3, in a case where the enable signal 201 indicates that the audio 200 contains vocals and accompaniment, the source separation 301 is switched on (here, the pair of contacts AB of the switch 300 is connected) and the audio 200 is decomposed into separations (see separated sources 2a-2d and residual signal 3 in Fig. 1) as described with regard to Fig. 1 above. In the present embodiment, the audio 200 is decomposed into vocals and accompaniment (a separated source 2 and a residual signal 3). Based on the enable signal 201, the vocals and the accompaniment are adjusted by a gain 302, e.g. a gain factor is applied to the vocals and the accompaniment, to obtain adjusted vocals and adjusted accompaniment, respectively. A mixer 305 mixes the adjusted vocals with the adjusted accompaniment to obtain the processed audio 206.

In a case where the enable signal 201 indicates that the audio 200 contains only vocals or only accompaniment, the source separation 301 is switched off (here, the pair of contacts AC of the switch 300 is connected) and a delay 303 is applied to the audio 200 to obtain a delayed audio, i.e. a delayed audio signal, for example delayed vocals or delayed accompaniment. Based on the enable signal 201, the delayed audio is adjusted by a gain 304, e.g. a gain factor is applied to the delayed audio, to obtain an adjusted audio. The adjusted audio is the processed audio 206, which is mixed with the adjusted user’s vocals (see 207 in Fig. 2) to obtain the mixed audio (see 208 in Fig. 2), as described in Fig. 2 above. That is, there is an expected latency, for example a time delay Δt, from the source separation. The expected time delay is a known, predefined parameter, which may be set in the delay 303 as a predefined parameter. At the delay 303, the audio signal, here the vocals or the accompaniment, is delayed by the expected latency of the source separation 301 process to obtain the delayed vocals or delayed accompaniment. This has the effect that the latency due to the source separation 301 process is compensated by a respective delay of the vocals or the accompaniment.

In the embodiment of Fig. 3, based on the enable signal 201, the delayed audio, or the vocals and the accompaniment, are adjusted by a gain, e.g. a gain factor to increase, decrease or leave unchanged the volume of the audio, or of the vocals and the accompaniment, respectively. The gain factor may be set by the user in real time or may be set statically and in advance. In the present embodiment, the vocals may be adjusted by changing the volume of the vocals signal by a gain factor equal to -12dB, or the like, without limiting the present embodiment in that regard. The accompaniment may be adjusted by increasing, decreasing, or leaving unchanged the volume of the accompaniment signal by a gain factor equal to -3dB, 0dB, +3dB, or the like, without limiting the present embodiment in that regard. The delayed audio may be adjusted by a gain factor equal to -12dB or -20dB if the delayed audio is delayed vocals, or equal to 0dB or +3dB if the delayed audio is delayed accompaniment, without limiting the present embodiment in that regard. Any gain factor suitable to the skilled person may be used to adjust the vocals, the accompaniment, or the delayed audio. Alternatively, no gain may be applied to the vocals, the accompaniment, or the delayed audio. Still alternatively, in case the audio is only vocals, the delayed vocals may be adjusted such that only silence is output, or, in case the audio is only accompaniment, the delayed accompaniment may be left unadjusted such that the accompaniment is directly output as the processed audio 206.
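Putting the pieces of Fig. 3 together, the per-block processing could look as follows. This is a sketch under assumptions: block-wise processing, a `separate` callable standing in for the source separation 301, and illustrative gain values.

    import numpy as np
    from collections import deque

    def db_to_lin(gain_db):
        """Convert a dB gain factor to a linear amplitude factor."""
        return 10.0 ** (gain_db / 20.0)

    def make_delay_line(delay_samples):
        """Delay 303: pre-filled with zeros so that the bypass path matches
        the expected source-separation latency (time delay Δt in samples)."""
        return deque(np.zeros(delay_samples, dtype=np.float32))

    def process_block(block, enable, separate, delay_line,
                      vocals_gain_db=-12.0, accomp_gain_db=0.0):
        """One audio block through the Fig. 3 logic (illustrative)."""
        if enable == "true-vocals and accompaniment":
            # Switch 300 in position AB: source separation 301 is active.
            vocals, accompaniment = separate(block)
            return (db_to_lin(vocals_gain_db) * vocals          # gain 302
                    + db_to_lin(accomp_gain_db) * accompaniment)
        # Switch 300 in position AC: no separation; delay 303 compensates
        # the separator latency, then gain 304 is applied.
        delay_line.extend(block)
        delayed = np.array([delay_line.popleft() for _ in range(len(block))],
                           dtype=np.float32)
        gain_db = vocals_gain_db if enable == "false-only vocals" else accomp_gain_db
        return db_to_lin(gain_db) * delayed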

As described above, the source separation 301, the delay 303 and the gain 304 can be performed in real time, e.g. “online” with some latency (here, the delay 303). For example, they could be run directly on the smartphone or smartwatch of the user, in his headphones, on a Bluetooth device, or the like.

The source separation 301 process may for example be implemented as described in more detail in the published paper by Uhlich, Stefan, et al., “Improving music source separation based on deep neural networks through data augmentation and network blending”, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017. There also exist programming toolkits for performing blind source separation, such as Open-Unmix, DEMUCS, Spleeter, Asteroid, or the like, which allow the skilled person to perform a source separation process as described in Fig. 1 above.

In the embodiment of Fig. 2, source separation is performed on an audio which is a song, and the audio is thus decomposed into vocals and accompaniment, without limiting the present embodiment in that regard. Alternatively, other types of source separation may be performed. For example, speech enhancement may be performed, wherein the audio is e.g. a lecture and is decomposed into speech and noise. In such a case, the enable signal can be either “on”, “only speech”, “only noise” or a subset thereof. Still alternatively, instrument separation may be performed, wherein the audio is e.g. a recorded concert and is decomposed into bass, drums, vocals and other. In such a case, the enable signal may be a signal having four binary flags indicating whether “bass”, “drums”, “vocals” and “other” are present or not.

It should be noted that, by using the enable signal 201, an electronic device’s energy consumption may be reduced, wherein the electronic device may be a play-back device. A source separation model, such as the source separation 301, may be trained with the enable signal, to tell the neural network whether it needs to separate or whether there is currently a part where there is only vocals or only accompaniment and the separation is not needed. This may make it possible for the source separation model to exploit the information that comes with the enable signal, for example to adapt more quickly to changing conditions, e.g. a change from “only instruments” to “instruments and vocals”.

In the embodiment of Fig. 3, the source separation is implemented by a recurrent neural network (RNN). Recurrent neural networks are neural networks that, in addition to their inputs, use an internal state to perform a task. The new internal state is calculated from the old internal state and the current input, and the output at one time step influences the computation at the next. In other words, RNNs get part of their output as input for the next time step, i.e. the next state. RNNs can take one or more input vectors and produce one or more output vectors, and the output(s) are influenced by weights applied to the inputs and by a hidden state vector representing the context based on prior input(s)/output(s). Typically, in RNNs the same weights are applied at each time step.

In the embodiment of Fig. 4 below, the neural network is an RNN that implements the source separation, and the source separation is switched off based on the enable signal (here, the state of the switch is AC). The enable signal (see 201 in Figs. 2 and 3) is used to deactivate some or all layers of the DNN, here the RNN, such that their outputs are no longer updated.

Fig. 4 schematically illustrates a deep neural network comprising a recurrent neural network (RNN) and additional non-recurrent learnable layers before and after the RNN layers, wherein switching-off the source separation described in Fig. 3 is performed.

In the present embodiment, a deep neural network for performing source separation (see 301 in Fig. 3) using Open-Unmix is illustrated. As described by Stöter, Fabian-Robert, et al. in the published paper “Open-Unmix - A Reference Implementation for Music Source Separation”, Open-Unmix is based on a three-layer bidirectional Long Short-Term Memory (BLSTM) network. The model learns to predict the magnitude spectrogram 517 of a target, e.g. vocals (see separated source 2 in Fig. 1), from the magnitude spectrogram 500 of a mixture input, e.g. an audio signal (see 200 in Fig. 2). Internally, the prediction is obtained by applying a mask to the input. To perform separation into multiple sources (see separations 2a-2d in Fig. 1), a separate model is trained for each target, namely for each separation.

Although the Open-Unmix model uses bi-directional LSTM cells and hence processes the mixture offline, in the present embodiment the bi-directional LSTM cells are replaced with uni-directional LSTM cells. Thereby, the Open-Unmix model becomes causal and can be used for online karaoke tasks, such as online separation.

Typically, Open-Unmix operates in the time-frequency domain to perform its prediction; therefore, the input of the model can be either a time-domain signal tensor or pre-computed magnitude spectrograms. For example, the magnitude spectrogram can be defined as the logarithmically scaled magnitude spectrum of an audio signal (see 200 in Fig. 2) across time.
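For illustration, such a log-scaled magnitude spectrogram could be computed from the time-domain signal as follows (a sketch; FFT size, hop length and the log1p scaling are illustrative choices):

    import torch

    def magnitude_spectrogram(audio, n_fft=4096, hop=1024):
        """Log-scaled magnitude spectrum of an audio signal across time.

        audio: (channels, samples) time-domain signal.
        returns: (channels, n_fft // 2 + 1, frames) log-magnitude spectrogram.
        """
        window = torch.hann_window(n_fft)
        spec = torch.stft(audio, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        return torch.log1p(spec.abs())  # log scaling; log1p avoids log(0)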

Cropping 501 is performed on the input mix spectrograms 500, having a left and a right channel, to obtain cropped mix spectrograms. Cropping 501 is performed over the frequency dimension, such that only high frequencies are removed and no information is lost over time. The cropped mix spectrograms are standardized, i.e. normalized, by an input scaler 502 using the global mean and standard deviation of every frequency bin across all frames. The standardized cropped mix spectrograms pass through a first fully connected layer (fc1) 503 which applies a feature transformation, e.g. an affine transform (i.e. affine layer fc1), such that a more appropriate representation, e.g. features, for the uni-directional LSTM network 506 is obtained. In addition, the dimensionality of the input spectrograms, i.e. the number of numerical values used to represent the frequency content, is reduced by the fully connected layer (fc1) 503, and thus redundancies in the input are reduced. In other words, the fully connected layer (fc1) 503 compresses the frequency and channel axes of the model and maps the magnitude STFT bins of both channels (and one frame) into features. A first batch normalization (bn1) 504 is performed on the features, followed by a tanh function 505 used as activation function that compresses the numerical values to [-1, 1].

The core of Open-Unmix is a three-layer uni-directional Long Short-Term Memory (Uni-LSTM) network 506. In the present embodiment, the Uni-LSTM is switched off and source separation is not performed; therefore, the uni-directional LSTM network 506 is not used. After the LSTM network 506, which is an RNN, two more affine transforms (i.e. affine layers fc2 and fc3) together with batch normalizations (bn2) 509 and (bn3) 512 are applied. In particular, the estimated target source representation is input to a second fully connected layer (fc2) 508, followed by a second batch normalization (bn2) 509 and a rectified linear unit (ReLU) 510, which is an activation function. Then, the output of the ReLU 510 is input to a third fully connected layer (fc3) 511, which restores the STFT dimensions of the spectrograms. The third fully connected layer (fc3) 511 is followed by a third batch normalization (bn3) 512 and an output scaler 513, which denormalizes the numerical values. A ReLU 514 activation function is applied to the output of the output scaler 513, and the result is then multiplied 516 with the mix spectrogram 500, so that the model is asked to predict a mask.
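The following PyTorch sketch mirrors the layer sequence just described (cropping, input scaler, fc1/bn1/tanh, three-layer uni-directional LSTM, fc2/bn2/ReLU, fc3/bn3, output scaler, ReLU, mask multiplication). It is an illustrative simplification, not the Open-Unmix reference implementation: layer sizes are placeholders, and the input/output scalers are reduced to learnable per-bin affine parameters.

    import torch
    import torch.nn as nn

    class UniUnmixSketch(nn.Module):
        """Illustrative, simplified Open-Unmix-style separator.

        Reference numbers in the comments follow the description of Fig. 4.
        """

        def __init__(self, n_bins=2049, n_crop=1487, channels=2, hidden=512):
            super().__init__()
            self.n_crop = n_crop                                # cropping 501 keeps low bins
            self.in_mean = nn.Parameter(torch.zeros(n_crop))    # input scaler 502
            self.in_scale = nn.Parameter(torch.ones(n_crop))
            self.fc1 = nn.Linear(channels * n_crop, hidden)     # fc1 503 (affine)
            self.bn1 = nn.BatchNorm1d(hidden)                   # bn1 504
            self.lstm = nn.LSTM(hidden, hidden, num_layers=3)   # Uni-LSTM 506
            self.fc2 = nn.Linear(hidden, hidden)                # fc2 508
            self.bn2 = nn.BatchNorm1d(hidden)                   # bn2 509
            self.fc3 = nn.Linear(hidden, channels * n_bins)     # fc3 511 restores STFT dims
            self.bn3 = nn.BatchNorm1d(channels * n_bins)        # bn3 512
            self.out_mean = nn.Parameter(torch.zeros(n_bins))   # output scaler 513
            self.out_scale = nn.Parameter(torch.ones(n_bins))

        def forward(self, mix_spec, state=None):
            """mix_spec: (frames, channels, n_bins) magnitude spectrogram 500."""
            T, C, F = mix_spec.shape
            x = mix_spec[..., : self.n_crop]                      # cropping 501
            x = (x - self.in_mean) / self.in_scale                # per-bin standardization
            x = torch.tanh(self.bn1(self.fc1(x.reshape(T, -1))))  # fc1, bn1, tanh 505
            y, state = self.lstm(x.unsqueeze(1), state)           # causal, online-capable
            y = torch.relu(self.bn2(self.fc2(y.squeeze(1))))      # fc2, bn2, ReLU 510
            y = self.bn3(self.fc3(y))                             # fc3, bn3
            y = y.reshape(T, C, F) * self.out_scale + self.out_mean  # output scaler 513
            mask = torch.relu(y)                                  # ReLU 514
            return mask * mix_spec, state                         # multiply 516: masked estimate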

The above described process is performed internally in the network during the operational phase, i.e. the “evaluation phase”, during which the masks are predicted. The operational phase is independent of and subsequent to the “training phase”, in which the parameters of the model, and not the masks, are learnt.

The Open-Unmix neural network comprises a recurrent neural network (RNN), namely the Uni-LSTM 506, and additional non-recurrent learnable layers, namely the affine layers fc1, fc2 and fc3 with batch normalizations bn1, bn2 and bn3, applied before and after the RNN layers. Affine layers with batch normalization are applied in multiple stages of the model, wherein the feature transformations are performed by the affine layers fc1, fc2 and fc3 and the batch normalization layers are only used to normalize the features. In this manner, the training of deep architectures may become easier.

In the present embodiment, the LSTM network 506 is switched off in a case where source separation is not performed. Switching off the network means not computing anything through fc2, bn2, fc3 and bn3, but updating only the internal LSTM states. The outputs of the first two LSTM layers are used to update the states of the successive LSTM layers, while the output of the last LSTM layer is not used.

Switching off the above described neural network can be performed in two ways, namely by not computing anything through the network (so that the hidden states are not updated), or by performing only the computations up to the RNN, so that the internal states are updated but no new output is produced.

For example, by not computing anything through the neural network, the hidden states of the RNN layers are frozen, and thus all computations that would otherwise have been performed are saved. Freezing the RNN and its layers may thus be performed by not computing anything through the network, which entails not updating the hidden states. In other words, when the RNN and its hidden states are frozen, i.e. the hidden states are not updated or changed, the LSTM is not operating, and the computations it would have performed are saved.

In the present embodiment, since no computation is performed (the LSTM is not operating), the RNN will not produce outputs, i.e. the input will not be forwarded through the RNN, because no output is needed. Alternatively, switching off the neural network can be performed by performing only the computations up to the RNN, so that the internal states are updated but no new output is produced. For example, in order to switch off the source separation, forward propagation is performed through the neural network up to the RNN layers, followed by an update of the RNN hidden states, and thus all operations that come after the RNN are saved. Such operations are the decoding operations performed at the decoding layers.
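A sketch of the two switch-off modes in code, assuming a separator split into `encode` (layers before the LSTM), `lstm` and `decode` (layers after the LSTM) stages; this interface is illustrative, not the API of any particular library:

    import torch

    def step_separator(model, frame_feats, state, enable, freeze_completely=True):
        """One time step of an RNN-based separator with two switch-off modes."""
        with torch.no_grad():
            if enable:                          # separation active: full forward pass
                x = model.encode(frame_feats)
                y, state = model.lstm(x, state)
                return model.decode(y), state
            if freeze_completely:
                # Mode (a): freeze the network; hidden states are not updated
                # and all computations are saved. The caller bypasses the
                # separator via delay and gain instead.
                return None, state
            # Mode (b): forward-propagate up to the RNN so its hidden and cell
            # states stay current; the decoding layers after the LSTM are skipped.
            x = model.encode(frame_feats)
            _, state = model.lstm(x, state)
            return None, state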

In this manner, the internal states of the LSTMs are updated, but the output of the last LSTM is not used. In this case, the LSTM is always fully operational, independently of the value of the switching-off signal (while the computations in the subsequent layers of the neural network are not performed). Therefore, the equations for the update of the hidden states are given by:

f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \sigma_h(c_t)

where all W, U and b are the weights of the LSTM (they only change during training), \sigma is a nonlinear (gate) function, \odot is the element-wise product of matrices, h_t is the hidden state, c_t is the cell state, x_t is the input, and h_t is also the output. Initial values for h and c are usually 0.
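Written out in NumPy, one such hidden-state update reads as follows; this is a sketch in which \sigma_g is taken as the logistic sigmoid and \sigma_c, \sigma_h as tanh (the common choices), and the weight shapes are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM hidden-state update following the equations above.
        W, U and b are dicts with keys 'f', 'i', 'o', 'c' holding the
        trained weights; they only change during training."""
        f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])     # forget gate
        i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])     # input gate
        o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])     # output gate
        c_cand = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state
        c_t = f_t * c_prev + i_t * c_cand   # element-wise cell-state update
        h_t = o_t * np.tanh(c_t)            # hidden state, which is also the output
        return h_t, c_t

As in the description above, the initial values for h and c would usually be zero vectors.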

Enable signal generation for the first time

Fig. 5 schematically shows an embodiment of a process of enable signal generation, wherein the enable signal is generated for the first time.

The audio 200 is input to source separation 301 and decomposed into separations, here vocals and accompaniment. The vocals and accompaniment are input to the enable signal generation 400 to obtain the enable signal 201. The enable signal 201 can then be stored on the electronic device e.g. in a storage unit (see 1302 in Fig. 14) and re-used the next time the user listens to the same song.

In the embodiment of Fig. 5, the first time the audio is played back by an electronic device, such as a karaoke device, a smartphone or the like, the audio is directly input to the source separation to decompose it into separations. The process described in Fig. 5 is similar to the process described in Fig. 3 above, wherein the process of Fig. 5 may be implemented using a preset condition, i.e. an enable signal that activates, i.e. switches on, the source separation, such that the first time the electronic device plays back an audio, source separation is performed to obtain the enable signal.

In the embodiment of Fig. 5, the enable signal is computed from the output of the source separation itself when seeing an audio (here the audio 200 is a song) for the first time, by using a simple energy threshold on the vocals and/or accompaniment signal, without limiting the present embodiment in that regard. Alternatively, the enable signal may be pre-computed on the streaming server side using a “vocals detection network” and is then, together with the audio, transmitted to the electronic device. This transmission may be done by embedding the signal directly into the audio by watermarking techniques. Still alternatively, the enable signal may be computed on the electronic device using a small “vocals detection network”. This may decrease the overall power consumption if more operations are saved with the enable signal than are required to compute it. The process of vocals detection is described with regard to Fig. 11.
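As an illustration of the simple energy threshold mentioned above, the following sketch derives a per-frame enable signal from the separated signals. It assumes the vocals and accompaniment are available as NumPy arrays; the frame size and the -40 dB threshold are arbitrary assumptions.

```python
import numpy as np

def enable_from_energy(vocals, accomp, frame=4096, thresh_db=-40.0):
    """Per-frame enable signal from a simple energy threshold (sketch)."""
    def frame_db(x):
        n = len(x) // frame
        energy = np.square(x[:n * frame].reshape(n, frame)).mean(axis=1)
        return 10.0 * np.log10(energy + 1e-12)  # avoid log(0)

    labels = []
    for v, a in zip(frame_db(vocals), frame_db(accomp)):
        v_on, a_on = v > thresh_db, a > thresh_db
        if v_on and a_on:
            labels.append("true-vocals and accompaniment")
        elif v_on:
            labels.append("false-only vocals")
        else:
            labels.append("false-only accompaniment")
    return labels
```

The resulting label sequence corresponds to the three-valued enable signal shown in Fig. 6 and could be stored alongside the song for re-use.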

Fig. 6 schematically shows a diagram of an enable signal over time during song play-back. The enable signal, according to this embodiment, is a Boolean signal having true and false values, which indicate whether the audio includes vocals and accompaniment, only vocals, or only accompaniment. Here, the enable signal has a value “true-vocals and accompaniment”, a value “false-only accompaniment” and a value “false-only vocals”.

In the embodiment of Fig. 6, the abscissa displays the time and the ordinate the value of the enable signal. The horizontal dashed lines represent the values of the enable signal, here three values, and the vertical dashed lines represent the time instances t0, t1, t2, t3 and t4. The duration period of the audio is from 0 to time instance t4. The horizontal solid lines represent the values of the enable signal during the duration period of the audio. Here, between 0 and time instance t0 the enable signal is “false-only vocals”, between time instance t0 and time instance t1 the enable signal is “true-vocals and accompaniment”, between time instance t1 and time instance t2 the enable signal is “false-only accompaniment”, between time instance t2 and time instance t3 the enable signal is “true-vocals and accompaniment”, and between time instance t3 and time instance t4 the enable signal is “false-only vocals”.

As described above, between 0 and time instance t0 and between time instance t3 and time instance t4 the enable signal is “false-only vocals”, which indicates that during these periods the audio includes only vocals, thereby during audio processing there is no need to perform source separation on the audio, as described in more detail with regard to Fig. 9. Therefore, the processed audio may be silence or may be an audio adjusted by a gain (see 304 in Figs. 3, 9) based on the user’s preferences in real-time or based on a predefined gain parameter. Between time instance t1 and time instance t2 the enable signal is “false-only accompaniment”, which indicates that during this period the audio includes only accompaniment, thereby during audio processing there is no need to perform source separation on the audio, as described in more detail with regard to Fig. 10. Therefore, the processed audio may be the accompaniment or may be an audio adjusted by a gain (see 304 in Figs. 3, 10) based on the user’s preferences or based on a predefined gain parameter. Between time instance t0 and time instance t1 and between time instance t2 and time instance t3 the enable signal is “true-vocals and accompaniment”, which indicates that during these periods the audio includes vocals and accompaniment, thereby during audio processing source separation is performed on the audio, as described in more detail with regard to Fig. 8.

In the embodiment of Fig. 6, the enable signal is a digital signal, such as a Boolean signal, i.e. a signal having true and false values, without limiting the present disclosure in that regard. Alternatively, the enable signal may be a digital signal such as a binary signal, i.e. a signal having “0” and “1” values, or the like. For example, a binary signal with value “0” may indicate that the audio includes no vocals and a binary signal with value “1” may indicate that the audio includes vocals, or the like.

Fig. 7a schematically shows a table in which a value of the enable signal and a switch position of the switch are mapped. The enable signal (see 201 in Figs. 2, 3, 5) has three possible values, namely “true-vocals and accompaniment”, “false-only accompaniment”, and “false-only vocals”, and the switch (see 300 in Fig. 3) has two possible positions, namely AB and AC. The switch symbolizes switching on and off the source separation (see 301 in Fig. 3), wherein when the position of the switch is AB, the source separation is switched-on, i.e. activated, and when the position of the switch is AC, the source separation is switched-off, i.e. deactivated.

For example, the enable signal (see 201 in Figs. 2, 3, 5) with a value “true-vocals and accompaniment” (see Fig. 6) is mapped to the switch position AB (see Fig. 3), which indicates that the source separation (see 301 in Fig. 3) is switched-on based on the enable signal.

In other words, the value “true-vocals and accompaniment” of the enable signal serves as a trigger value that activates the source separation, i.e. source separation is performed. Based on the table of Fig. 7a, an enable signal (see 201 in Figs. 2, 3, 5) with a value “false-only accompaniment” (see Fig. 6) and an enable signal with a value “false-only vocals” are both mapped to the switch position AC (see Fig. 3), which indicates that the source separation (see 301 in Fig. 3) is switched-off based on the enable signal. In other words, the value “false-only accompaniment” and the value “false-only vocals” of the enable signal serve as trigger values that deactivate the source separation, i.e. source separation is not performed.

Fig. 7b schematically shows a table in which an enable signal value and a gain factor are mapped. The enable signal (see 201 in Figs. 2, 3, 5) with a value “true-vocals and accompaniment” (see Fig. 6) is mapped, based on the enable signal, to a gain (see 302 in Fig. 3) of -12 dB, which is applied on the vocals. An enable signal (see 201 in Figs. 2, 3, 5) with a value “false-only accompaniment” (see Fig. 6) is mapped, based on the enable signal, to a gain (see 304 in Fig. 3) of 0 dB, which is applied on the accompaniment. An enable signal (see 201 in Figs. 2, 3, 5) with a value “false-only vocals” (see Fig. 6) is mapped, based on the enable signal, to a gain (see 304 in Fig. 3) of -20 dB, which is applied on the vocals.

In the embodiment of Fig. 7b, based on the value of the enable signal, the gain has different values, e.g. different gain factors are applied to the vocals or to the accompaniment, without limiting the present embodiment in that regard. The skilled person may apply any suitable gain factor based on his expertise. For example, a gain factor may be applied to the vocals to generate silence, in a case where the value of the enable signal is “false-only vocals”. A gain factor may be applied to the accompaniment to increase the volume of the accompaniment, in a case where the value of the enable signal is “false-only accompaniment”. A gain factor may be applied to the vocals to decrease the volume of the vocals, in a case where the value of the enable signal is “true-vocals and accompaniment” .
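The mappings of Figs. 7a and 7b can be represented, for example, as simple lookup tables. The sketch below uses the example values from the description above; the table layout and the dB-to-linear helper are illustrative assumptions, not a prescribed implementation.

```python
# Hypothetical lookup tables mirroring Figs. 7a and 7b.
SWITCH_POSITION = {
    "true-vocals and accompaniment": "AB",  # source separation on
    "false-only accompaniment": "AC",       # source separation off
    "false-only vocals": "AC",              # source separation off
}

GAIN_DB = {  # (signal the gain is applied to, gain in dB)
    "true-vocals and accompaniment": ("vocals", -12.0),
    "false-only accompaniment": ("accompaniment", 0.0),
    "false-only vocals": ("vocals", -20.0),
}

def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain factor to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)
```

For instance, `db_to_linear(-12.0)` yields roughly 0.25, i.e. the vocals amplitude is reduced to about a quarter when the enable signal is “true-vocals and accompaniment”.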

Fig. 8 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is “true-vocals and accompaniment”. In the present embodiment, the enable signal value is “true-vocals and accompaniment” and serves as a trigger value that activates the source separation, i.e. switches it on. The audio 200 is input to source separation 301 and decomposed into vocals and accompaniment. The gain 302 is applied to the vocals to obtain adjusted vocals, and to the accompaniment to obtain adjusted accompaniment, as also described in Fig. 3 above. The mixer 305 mixes the adjusted vocals with the adjusted accompaniment to obtain a processed audio (see 206 in Figs. 2, 3). The vocals are represented by a dashed line indicating that the vocals may be adjusted by a gain factor or may be cut off so that silence is output to the mixer 305 instead. The accompaniment is represented by a solid line indicating that the accompaniment may be adjusted by a gain factor or may be directly output to the mixer 305.

In the embodiment of Fig. 8, the vocals may be adjusted by applying a gain factor equal to -12 dB, -20 dB, or the like, to decrease the volume of the vocals, without limiting the present embodiment in that regard. Alternatively, a gain factor equal to e.g. +3 dB may be applied to the vocals to increase the volume of the vocals, or a gain factor equal to 0 dB may be applied to the vocals to leave the volume unchanged. The accompaniment may be adjusted by applying a gain factor equal to -3 dB, 0 dB, +3 dB, or the like, to decrease, increase or leave unchanged the volume of the accompaniment, without limiting the present embodiment in that regard. The skilled person may set any suitable gain factor to be applied to the vocals and the accompaniment as a predefined parameter according to the needs of the specific use case.
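A minimal sketch of the Fig. 8 path follows. The `separate` callable is a hypothetical stand-in for the source separation 301, and the default gains are merely the example values mentioned above.

```python
import numpy as np

def process_true_case(audio, separate,
                      vocals_gain_db=-12.0, accomp_gain_db=0.0):
    """Sketch of the 'true-vocals and accompaniment' path (Fig. 8)."""
    vocals, accomp = separate(audio)          # source separation 301
    adj_vocals = vocals * 10.0 ** (vocals_gain_db / 20.0)  # gain 302
    adj_accomp = accomp * 10.0 ** (accomp_gain_db / 20.0)  # gain 302
    return adj_vocals + adj_accomp            # mixer 305: processed audio
```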

Fig. 9 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is “false-only vocals”. In the present embodiment, the enable signal value is “false-only vocals” and serves as a trigger value that deactivates the source separation, i.e. switches it off. The audio 200, which comprises only vocals, is input to delay 303 to output delayed audio, here delayed vocals. The gain 304 is applied to the delayed vocals to obtain adjusted audio, here adjusted vocals, as also described in Fig. 3 above. The adjusted vocals signal is the processed audio signal (see 206 in Figs. 2, 3) being output from the audio processing (see 202 in Figs. 2, 3).

Based on the enable signal (see 201 in Figs. 2, 3), the delayed audio is adjusted by a gain 304, e.g. a gain factor is applied to the delayed audio, to obtain adjusted audio, here adjusted vocals. That is, there is an expected latency, for example a time delay Δt, from the source separation (see 301 in Fig. 3). The expected time delay is a known, predefined parameter, which may be set in the delay 303 as a predefined parameter. At the delay 303, the audio signal, here the vocals, is delayed by the expected latency of the source separation process to obtain the delayed vocals. This has the effect that the latency due to the source separation process is compensated by a respective delay of the vocals.

Fig. 10 schematically describes in more detail an embodiment of the audio processing described in Fig. 3 performed in a case where the enable signal value is “false-only accompaniment”. In the present embodiment, the enable signal value is “false-only accompaniment” and serves as a trigger value that switches-off the source separation, i.e. source separation is not performed. The audio 200, which comprises only accompaniment, is input to delay 303 to output delayed audio, here delayed accompaniment. The gain 304 is applied to the delayed accompaniment to obtain adjusted audio, here adjusted accompaniment, as also described in Fig. 3 above. The adjusted accompaniment is the processed audio (see 206 in Figs. 2, 3) being output from the audio processing (see 202 in Figs. 2, 3).

Based on the enable signal (see 201 in Figs. 2, 3), the delayed audio is adjusted by a gain 304, e.g. a gain factor is applied to the delayed audio, to obtain adjusted audio, here adjusted accompaniment. That is, there is an expected latency, for example a time delay Δt, from the source separation (see 301 in Fig. 3). The expected time delay is a known, predefined parameter, which may be set in the delay 303 as a predefined parameter. At the delay 303, the audio signal, here the accompaniment, is delayed by the expected latency of the source separation process to obtain the delayed accompaniment. This has the effect that the latency due to the source separation process is compensated by a respective delay of the accompaniment.
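The delay 303 can be sketched as a simple buffer holding exactly the expected separation latency. The block-based interface below is an assumption for illustration; Δt is given in samples.

```python
import numpy as np

class DelayLine:
    """Delays the bypassed audio by the known source-separation
    latency so that the on/off paths stay time-aligned (sketch)."""

    def __init__(self, delta_t_samples: int):
        self.buf = np.zeros(delta_t_samples, dtype=np.float32)

    def push(self, block: np.ndarray) -> np.ndarray:
        joined = np.concatenate([self.buf, block])
        out = joined[:len(block)]        # samples delayed by delta_t
        self.buf = joined[len(block):]   # keep the tail for the next call
        return out
```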

Enable signal generation using vocals detection

Fig. 11 schematically shows another embodiment of a process of enable signal generation, wherein vocals detection is performed. The audio 200 is input to source separation 301 and decomposed into separations, here vocals and accompaniment. The vocals and accompaniment are input to the vocals detection 1000 to detect the presence or absence of vocals in the audio 200 and thus to obtain a vocals detection signal 1002. The vocals detection signal 1002 and the vocals and accompaniment are input to the enable signal generation 1001 to obtain the enable signal 201. The enable signal 201 can then be stored on the electronic device (see 1302 in Fig. 14) and re-used the next time the user listens to the same song.

In the embodiment of Fig. 11, the vocals detection 1000 together with the enable signal generation 1001 may form a vocals detection network used to compute the enable signal 201. For example, the enable signal may be pre-computed on the streaming server side using such a vocals detection network and, together with the audio, may be transmitted to the electronic device. This transmission may be done by embedding the signal directly into the audio by watermarking techniques. Alternatively, the enable signal may be computed on the electronic device using a small vocals detection network, which may decrease the overall power consumption if more operations are saved with the enable signal than are required to compute it.
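For illustration, a small vocals detection network could be as simple as the following frame-wise classifier; the architecture, feature size and decision threshold are arbitrary assumptions and not the network of Fig. 11.

```python
import torch
import torch.nn as nn

class SmallVocalsDetector(nn.Module):
    """Tiny frame-wise vocals detector (illustrative sketch): maps a
    magnitude-spectrogram frame to a probability that vocals are present."""

    def __init__(self, n_feat=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feat, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        # frames: (batch, n_feat); returns (batch,) probabilities
        return self.net(frames).squeeze(-1)

# Usage sketch: frames with probability > 0.5 are labelled as containing
# vocals; these per-frame decisions form the vocals detection signal 1002.
```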

Method and Implementation

Fig. 12 shows a flow diagram visualizing a method for signal mixing related to audio processing by performing source separation based on an enable signal to obtain a mixed audio signal.

At 1100, the audio processing (see 202 in Figs. 2, 3) receives an audio (see 200 in Figs. 2, 3). At 1101, the audio processing (see 202 in Figs. 2, 3) receives an enable signal (see 201 in Figs. 2, 3). At 1102, audio processing is performed (see 202 in Figs. 2, 3) on the received audio (see 200 in Figs. 2, 3) based on the received enable signal (see 201 in Figs. 2, 3) to obtain a processed audio (see 206 in Figs. 2, 3). At 1103, the mixer (see 205 in Fig. 2) receives user’s vocals (see 203, 207 in Fig. 2) and at 1104, mixing is performed of the processed audio with the received user’s vocals to obtain a mixed audio (see 208 in Fig. 2). The mixed audio and/or the processed audio may be output to a loudspeaker system of a smartphone, of a smartwatch, of a Bluetooth device such as headphones, or the like.

Fig. 13 shows a flow diagram visualizing a method for audio processing related to source separation based on an enable signal to obtain an adjusted audio signal. At 1200, the audio processing (see 202 in Figs. 2, 3) receives an audio (see 200 in Figs. 2, 3). At 1201, the audio processing (see 202 in Figs. 2, 3) receives an enable signal (see 201 in Figs. 2, 3). If at 1202, the enable signal is true (see Figs. 5, 7a, 8), the enable signal activates, i.e. switches-on, the source separation (see 301 in Fig. 3) and the process proceeds to 1203. At 1203, source separation (see 301 in Figs. 3, 8) is performed on the received audio to obtain vocals and accompaniment. At 1204, the vocals are adjusted based on the received enable signal to obtain adjusted vocals. At 1205, mixing the accompaniment with the adjusted vocals is performed to obtain processed audio (see 206 in Figs. 2, 3). If at 1202, the enable signal is not true (see Figs. 6, 7a, 8), the enable signal deactivates, i.e. switches-off, the source separation (see 301 in Fig. 3) and the process proceeds to 1206. At 1206, the source separation is switched-off, i.e. is not performed based on the enable signal, and the process proceeds to 1207. At 1207, the audio is adjusted based on the received enable signal to obtain adjusted audio (see Figs. 3, 9, 10).

Depending on the enable signal, the adjusted audio may be adjusted vocals based on the user’s preferences in real-time, or based on preset gain parameters, which may decrease or increase the vocals volume, or only silence may be output such that the user can sing the vocals himself. Alternatively, the adjusted audio may be adjusted accompaniment based on the user’s preferences in real-time, or based on preset gain parameters, which may decrease or increase the accompaniment volume when only the accompaniment is played back to the user, such that the user can sing the vocals himself.
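Putting the branches of Fig. 13 together, the decision logic could be sketched as follows; `separate`, `delay` and `gains` are hypothetical stand-ins for the source separation 301, the delay 303 and the gains 302/304.

```python
def audio_processing(audio, enable, separate, delay, gains):
    """Sketch of the Fig. 13 flow for one audio chunk.

    enable: one of the three enable-signal values (Fig. 6).
    gains:  maps enable-signal values to linear gain factors.
    """
    if enable == "true-vocals and accompaniment":
        # 1203-1205: separate, adjust the vocals, remix.
        vocals, accompaniment = separate(audio)
        return gains[enable] * vocals + accompaniment
    # 1206-1207: source separation is switched off; the audio is
    # bypassed through the latency-matching delay and adjusted.
    return gains[enable] * delay(audio)
```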

Fig. 14 shows a block diagram depicting an embodiment of an electronic device that can implement the processes of audio mixing based on an enable signal and audio processing. The electronic device 1300 comprises a CPU 1301 as processor. The electronic device 1300 further comprises a microphone array 1310, a loudspeaker array 1309 and a convolutional neural network unit 1307 that are connected to the processor 1301. The processor 1301 may for example implement a gain 302, 304 and a mixer 205, 305 that realize the processes described with regard to Figs. 2 and 3 in more detail. The CNN 1307 may for example be an artificial neural network in hardware, e.g. a neural network on GPUs or any other hardware specialized for the purpose of implementing an artificial neural network. The CNN 1307 may for example implement an audio processing 202, a source separation 301, a delay 303, an enable signal generation 400, 1001 and a vocals detection 1000 that realize the processes described with regard to Figs. 2, 3, 5, 8, 9, 10 and 11 in more detail.

Loudspeaker array 1309 may be headphones, e.g. on-ear, in-ear, over-ear, wireless headphones and the like, or may consist of one or more loudspeakers that are distributed over a predefined space, and is configured to render any kind of audio, such as 3D audio. The microphone array 1310 may be configured to receive speech (voice), vocals (singer’s voice), instrumental sounds or the like, for example, when the user sings a song or plays an instrument (see audio 200 in Figs. 2, 3, 5, 8, 9, 10 and 11). The microphone array 1310 may be configured to receive speech (voice) commands via automatic speech recognition to operate the electronic device 1300. The electronic device 1300 further comprises a user interface 1308 that is connected to the processor 1301. This user interface 1308 acts as a man-machine interface and enables a dialogue between an administrator and the electronic device. For example, an administrator may make configurations to the system using this user interface 1308. The electronic device 1300 further comprises an Ethernet interface 1306, a Bluetooth interface 1304, and a WLAN interface 1305. These units 1304, 1305, 1306 act as I/O interfaces for data communication with external devices. For example, additional loudspeakers, microphones, and video cameras with Ethernet, WLAN or Bluetooth connection may be coupled to the processor 1301 via these interfaces 1304, 1305 and 1306.

The electronic device 1300 further comprises a data storage 1302 and a data memory 1303 (here a RAM). The data memory 1303 is arranged to temporarily store or cache data or computer instructions for processing by the processor 1301. The data storage 1302 is arranged as a long-term storage, e.g., for recording sensor data obtained from the microphone array 1310, or for storing enable signal values used for setting a switch and a mapping table that maps switch positions to enable signal values (see Figs. 3, 6 and 7a). The data storage 1302 may also store audio data that represents audio messages, which the electronic device may output to the user for guidance or help.

It should be noted that the description above is only an example configuration. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces, or the like.

It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding.

It should also be noted that the division of the electronic device of Fig. 14 into units is only made for illustration purposes and that the present disclosure is not limited to any specific division of functions in specific units. For instance, at least parts of the circuitry could be implemented by a respectively programmed processor, field programmable gate array (FPGA), dedicated circuits, and the like.

All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example, on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software. In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.

Note that the present technology can also be configured as described below.

(1) An electronic device comprising circuitry configured to perform source separation (301) on an audio signal (1; 200) based on an enable signal (201) to obtain a processed audio signal (206) comprising a separated source (2) and a residual signal (3), wherein the enable signal (201) is configured to activate or deactivate the source separation (301).

(2) The electronic device of (1) further comprises circuitry configured, if the source separation (301) is deactivated by the enable signal (201), to adjust the audio signal (1; 200) to obtain an adjusted audio signal (206) as the processed audio signal (206).

(3) The electronic device of (1) or (2) further comprises circuitry configured to change a position (B, C) of a switch (300, AB, AC) based on a value (true, false) of the enable signal (201) to activate or deactivate the source separation.

(4) The electronic device of (3), wherein the enable signal (201) is configured to activate the source separation (301) if the value (true, false) of the enable signal (201) is “true” and to deactivate the source separation (301) in a case where the value (true, false) of the enable signal (201) is “false”.

(5) The electronic device of any one of (1) to (4), wherein the source separation (301) is implemented by a deep neural network (DNN) and the enable signal (201) is used to deactivate some or all layers of the DNN such that their outputs are not updated anymore.

(6) The electronic device of (2) further comprises circuitry configured to apply a gain (304) to the audio signal (1; 200) based on the enable signal (201) to obtain the adjusted audio signal (206).

(7) The electronic device of any one of (1) to (6) further comprises circuitry configured, if the source separation (301) is deactivated by the enable signal (201), to delay (303) the audio signal (1; 200) to obtain a delayed audio signal.

(8) The electronic device of (2) further comprises circuitry configured to apply a gain (204) to a user’s vocals signal (203) to obtain adjusted user’s vocals signal (207), the user’s vocals signal (203) being acquired by a microphone (1310).

(9) The electronic device of (8) further comprises circuitry configured to mix the adjusted user’s vocals (207) with the processed audio signal (206) to obtain a mixed audio signal (208).

(10) The electronic device of any one of (1) to (9) further comprises circuitry configured to perform enable signal generation (400; 1001) based on the separated source (2) and the residual signal (3) to obtain the enable signal (201).

(11) The electronic device of (10) further comprises circuitry configured to perform vocals detection (1000) on the audio signal (1; 200) to obtain a vocals detection signal (1002), wherein the enable signal generation (400) is performed based on the vocals detection signal (1002), the separated source (2) and the residual signal (3) to obtain the enable signal (201).

(12) The electronic device of (10), wherein the enable signal (201) is pre-computed on a server side.

(13) The electronic device of (10), wherein the enable signal (201) is computed during the first time a song is played on the electronic device.

(14) The electronic device of any one of (1) to (13), wherein the separated source (2) comprises vocals and the residual signal (3) comprises accompaniment.

(15) The electronic device of (14) further comprises circuitry configured to apply a gain (302) to the vocals to obtain adjusted vocals and apply a gain (302) to the accompaniment to obtain adjusted accompaniment.

(16) The electronic device of (15) further comprises circuitry configured to mix (305) the adjusted vocals with the adjusted accompaniment to obtain the processed audio signal (206).

(17) The electronic device of any one of (1) to (16), wherein the audio signal (1; 200) comprises at least one of vocals and accompaniment or wherein the separated source (2) comprises speech and the residual signal (3) comprises background noise.

(18) The electronic device of (3), wherein the value (true, false) of the enable signal (201) is “true-vocals and accompaniment”, “false-only vocals”, or “false-only accompaniment”.

(19) The electronic device of (18) further comprises circuitry configured, if the value (true, false) of the enable signal (201) is “true-vocals and accompaniment”, to activate the source separation (301), or if the value (true, false) of the enable signal (201) is “false-only vocals”, or “false-only accompaniment”, to deactivate the source separation (301).

(20) The electronic device of (8), wherein the microphone (1310) is a microphone of a device (1300) such as a smartphone, headphones, a TV set, or a Blu-ray player.

(21) The electronic device of any one of (1) to (20), wherein the processed audio (206) is output to a loudspeaker system (1309).

(22) A method comprising: performing source separation (301) on an audio signal (1; 200) based on an enable signal (201) to obtain a processed audio signal (206) comprising a separated source (2) and a residual signal (3), wherein the enable signal (201) is configured to activate or deactivate the source separation (301).

(23) A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of (22).

(24) An electronic device comprising circuitry configured to perform source separation (301) on an audio signal (200) to obtain a separated source (2) and a residual signal (3); and perform enable signal generation (400; 1001) based on the separated source (2) and the residual signal (3) to obtain an enable signal (201), wherein the enable signal is configured to activate or deactivate the source separation (301).

(25) The electronic device of (24) further comprises circuitry configured to perform vocals detection (1000) on the separated source (2) and the residual signal (3) to obtain a vocals detection signal (1002), wherein enable signal generation (1001) is performed based on the vocals detection signal (1002), the separated source (2) and the residual signal (3) to obtain the enable signal (201).

(26) The electronic device of (24), wherein the enable signal (201) is pre-computed on a server side using a vocals detection network (1000), or the enable signal (201) is computed during the first time a song is played on the electronic device using an energy threshold on the separated source (2) and the residual signal (3).