

Title:
METHOD AND DEVICE FOR DISCONTINUOUS TRANSMISSION IN AN OBJECT-BASED AUDIO CODEC
Document Type and Number:
WIPO Patent Application WO/2024/103163
Kind Code:
A1
Abstract:
A method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprise an analyser of the audio streams for producing voice or signal activity information on the audio objects. A DTX controller detects, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a SID frame within the DTX signal segment. The DTX controller (a) updates a global SID counter of inactive frames, and (b) signals the SID frame within the DTX signal segment depending on a value of the global SID counter. An encoder encodes the SID frame. In a device for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, a metadata decoder decodes the metadata and adjusts values of the MD parameter to lower differences in the MD parameter between frames, and an audio stream decoder decodes the audio streams.

Inventors:
EKSLER VACLAV (CZ)
Application Number:
PCT/CA2023/051518
Publication Date:
May 23, 2024
Filing Date:
November 14, 2023
Assignee:
VOICEAGE CORP (CA)
International Classes:
G10L19/00; G10L19/012; H04W76/28
Domestic Patent References:
WO2019174588A12019-09-19
Foreign References:
EP4196980A12023-06-21
US8780978B22014-07-15
Attorney, Agent or Firm:
LACHERÉ, Julien (CA)
Claims:
What is claimed is : 1. A device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: an analyser of the audio streams for producing voice or signal activity information on the audio objects; a DTX controller for detecting, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a silence insertion descriptor (SID) frame within the DTX signal segment, wherein the DTX controller (a) updates a global SID counter of inactive frames, and (b) signals the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and an encoder of the signaled, detected SID frame using SID frame coding. 2. A discontinuous transmission device as defined in claim 1, wherein the activity information comprises an activity detection flag for each audio objects, and wherein the DTX controller detects a DTX signal segment when the activity detection flags of the audio objects are set to a given value. 3. A discontinuous transmission device as defined in claim 1 or 2, wherein, upon detection of a DTX signal segment, the DTX controller sets a DTX flag to a given value. 4. A discontinuous transmission device as defined in any one of claims 1 to 3, wherein the DTX controller signals the SID frame detected within the DTX signal segment by setting a SID flag to a given value in response to a certain value of the global SID counter. 5. A discontinuous transmission device as defined in any one of claims 1 to 4, wherein the DTX controller signals the detected SID frame in response to the global SID counter equal to “0”. 6. A discontinuous transmission device as defined in any one of claims 1 to 5, wherein the DTX controller resets the global SID counter in every active frame. 7. A discontinuous transmission device as defined in any one of claims 1 to 6, wherein the DTX controller increments the global SID counter in every inactive frame up to a value corresponding to a SID update rate. 8. A discontinuous transmission device as defined in claim 7, wherein the DTX controller resets the global SID counter when the value corresponding to the SID update rate is reached by the global SID counter. 9. A discontinuous transmission device as defined in any one of claims 1 to 8, wherein the DTX controller alters the DTX flag using additional classification stages of the audio objects. 10. A discontinuous transmission device as defined in claim 9, wherein the additional classification stages compare a mean value of long-term background noises over the audio objects and a mean value of long-term background noise variations over the audio objects to respective thresholds. 11. A discontinuous transmission device as defined in claim 9, wherein the additional classification stages use an energy of background noise in the audio objects. 12. A discontinuous transmission device as defined in any one of claims 1 to 11, wherein the audio objects each comprise an audio stream with metadata, and wherein the SID frame encoder comprises a metadata encoder for encoding the metadata of the audio objects using absolute coding. 13. 
A discontinuous transmission device as defined in any one of claims 1 to 11, wherein the audio objects each comprise an audio stream with metadata, and wherein the DTX controller calculates a metadata (MD) flag for each audio object, indicating that MD parameters are unchanged to avoid coding and transmission of the metadata of the audio object by the SID frame encoder and thereby save SID bit-budget. 14. A discontinuous transmission device as defined in any one of claims 1 to 11, wherein the audio objects each comprise an audio stream with metadata, and wherein the DTX controller estimates a bit-budget for quantizing the metadata and compares the estimated bit-budget with a bit-budget available for quantizing the metadata to select SID frame coding or active frame coding. 15. A discontinuous transmission device as defined in claim 14, wherein the DTX controller sets the DTX flag to a first given value when the estimated bit-budget is higher than the bit-budget available for quantizing the metadata and selects active frame coding for coding the metadata. 16. A discontinuous transmission device as defined in claim 14 or 15, wherein the DTX controller sets the DTX flag to a second given value when the estimated bit-budget is lower than the bit-budget available for quantizing the metadata and selects SID frame coding for coding the metadata.

17. A discontinuous transmission device as defined in any one of claims 1 to 16, wherein the audio objects each comprise an audio stream with metadata (MD), and wherein the DTX controller makes a resolution of MD values dependent on the number of audio objects. 18. A device for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: a metadata decoder for decoding the metadata, wherein the metadata decoder adjusts values of the MD parameter to lower differences in the said MD parameter between frames; and an audio stream decoder for decoding the audio streams. 19. An audio object decoding device according to claim 18, wherein the metadata decoder, for lowering differences in MD parameter values, smoothes the MD parameter by interpolation between a value of the MD parameter in a current frame and a value of the MD parameter in a previous frame. 20. An audio object decoding device according to claim 19, wherein the metadata decoder smoothes the MD parameter in frames following a silence insertion descriptor (SID) frame whereby the value of the MD parameter evolves smoothly. 21. An audio object decoding device according to any one of claims 18 to 20, wherein the metadata decoder lowers differences in the MD parameter such that a maximum difference in the MD parameter between two adjacent frames is lower than a given threshold. 22. An audio object decoding device according to claim 19 or 20, wherein the metadata decoder limits a maximum number of frames in which smoothing is applied to a given threshold. 23. An audio object decoding device according to claim 22, wherein the metadata decoder skips smoothing of the MD parameter in active frames when an absolute value of the difference between the value of the MD parameter in the current frame and the value of the MD parameter value in the previous frame is higher than a smoothing step value multiplied by the given threshold. 24. A method for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: analysing the audio streams for producing voice or signal activity information on the audio objects; detecting, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a silence insertion descriptor (SID) frame within the DTX signal segment, wherein the segment and frame detection comprises (a) updating a global SID counter of inactive frames, and (b) signaling the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and encoding the signaled, detected SID frame using SID frame coding. 25. A discontinuous transmission method as defined in claim 24, wherein the activity information comprises an activity detection flag for each audio objects, and wherein the segment and frame detection comprises detecting a DTX signal segment when the activity detection flags of the audio objects are set to a given value. 26. A discontinuous transmission method as defined in claim 24 or 25, wherein, upon detection of a DTX signal segment, the segment and frame detection comprises setting a DTX flag to a given value. 27. 
A discontinuous transmission method as defined in any one of claims 24 to 26, comprising signaling the SID frame detected within the DTX signal segment by setting a SID flag to a given value in response to a certain value of the global SID counter. 28. A discontinuous transmission method as defined in any one of claims 24 to 27, comprising signaling the detected SID frame in response to the global SID counter equal to “0”. 29. A discontinuous transmission method as defined in any one of claims 24 to 28, comprising resetting the global SID counter in every active frame. 30. A discontinuous transmission method as defined in any one of claims 24 to 29, comprising incrementing the global SID counter in every inactive frame up to a value corresponding to a SID update rate. 31. A discontinuous transmission method as defined in claim 30, comprising resetting the global SID counter when the value corresponding to the SID update rate is reached by the global SID counter. 32. A discontinuous transmission method as defined in any one of claims 24 to 31, comprising altering the DTX flag using additional classification stages of the audio objects. 33. A discontinuous transmission method as defined in claim 32, wherein the additional classification stages compare a mean value of long-term background noises over the audio objects and a mean value of long-term background noise variations over the audio objects to respective thresholds. 34. A discontinuous transmission method as defined in claim 32, wherein the additional classification stages use an energy of background noise in the audio objects. 35. A discontinuous transmission method as defined in any one of claims 24 to 34, wherein the audio objects each comprise an audio stream with metadata, and wherein the SID frame encoding comprises encoding the metadata of the audio objects using absolute coding. 36. A discontinuous transmission method as defined in any one of claims 24 to 34, wherein the audio objects each comprise an audio stream with metadata, and wherein the segment and frame detection comprises calculating a metadata (MD) flag for each audio object, indicating that MD parameters are unchanged to avoid coding and transmission of the metadata of the audio object and thereby save SID bit-budget. 37. A discontinuous transmission method as defined in any one of claims 24 to 34, wherein the audio objects each comprise an audio stream with metadata, and wherein the segment and frame detection comprises estimating a bit-budget for quantizing the metadata and comparing the estimated bit-budget with a bit-budget available for quantizing the metadata to select SID frame coding or active frame coding. 38. A discontinuous transmission method as defined in claim 37, wherein the segment and frame detection comprises setting the DTX flag to a first given value when the estimated bit-budget is higher than the bit-budget available for quantizing the metadata and selecting active frame coding for coding the metadata.

39. A discontinuous transmission method as defined in claim 37 or 38, wherein the segment and frame detection comprises setting the DTX flag to a second given value when the estimated bit-budget is lower than the bit-budget available for quantizing the metadata and selecting SID frame coding for coding the metadata. 40. A discontinuous transmission method as defined in any one of claims 24 to 39, wherein the audio objects each comprise an audio stream with metadata (MD), and wherein the method comprises making a resolution of MD values dependent on the number of audio objects. 41. A method for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: decoding the metadata comprising adjusting values of the MD parameter to lower differences in the said MD parameter between frames; and decoding the audio streams. 42. An audio object decoding method according to claim 41, wherein decoding the metadata comprises, for lowering differences in MD parameter values, smoothing the MD parameter by interpolation between a value of the MD parameter in a current frame and a value of the MD parameter in a previous frame. 43. An audio object decoding method according to claim 42, wherein decoding the metadata comprises smoothing the MD parameter in frames following a silence insertion descriptor (SID) frame whereby the value of the MD parameter evolves smoothly. 44. An audio object decoding method according to any one of claims 41 to 43, wherein decoding the metadata comprises lowering differences in the MD parameter such that a maximum difference in the MD parameter between two adjacent frames is lower than a given threshold. 45. An audio object decoding method according to claim 42 or 43, wherein decoding the metadata comprises limiting a maximum number of frames in which smoothing is applied to a given threshold. 46. An audio object decoding method according to claim 45, wherein decoding the metadata comprises skipping smoothing of the MD parameter in active frames when an absolute value of the difference between the value of the MD parameter in the current frame and the value of the MD parameter value in the previous frame is higher than a smoothing step value multiplied by the given threshold. 47. A device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to implement: an analyser of the audio streams for producing voice or signal activity information on the audio objects; a DTX controller for detecting, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a silence insertion descriptor (SID) frame within the DTX signal segment, wherein the DTX controller (a) updates a global SID counter of inactive frames, and (b) signals the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and an encoder of the signaled, detected SID frame using SID frame coding. 48. 
A device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to: analyse the audio streams for producing voice or signal activity information on the audio objects; detect, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a silence insertion descriptor (SID) frame within the DTX signal segment, comprising (a) updating a global SID counter of inactive frames, and (b) signaling the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and an encoder of the signaled, detected SID frame using SID frame coding. 49. A device for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to implement: a metadata decoder for decoding the metadata, wherein the metadata decoder adjusts values of the MD parameter to lower differences in the said MD parameter between frames; and an audio stream decoder for decoding the audio streams.

50. A device for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to: decode the metadata comprising adjusting values of the MD parameter to lower differences in the said MD parameter between frames; and decode the audio streams.

Description:
METHOD AND DEVICE FOR DISCONTINUOUS TRANSMISSION IN AN OBJECT- BASED AUDIO CODEC TECHNICAL FIELD [0001] The present disclosure relates to sound coding, in particular but not exclusively to a method and device for discontinuous transmission (DTX) in an object-based audio codec. [0002] In the present disclosure and the appended claims: (a) The term “audio” may be related to speech, music and any other sound. (b) The term “multichannel” may be related to two or more channels. (c) The term “stereo” is an abbreviation for “stereophonic”. (d) The term “mono” is an abbreviation for “monophonic”. (e) The term “object-based audio” is intended to represent an auditory scene as a collection of individual elements, also known as audio objects. Also, “object-based audio” may comprise, for example, speech, music and any other sound including general audio sound. (f) The term “audio object” is intended to designate an audio stream with associated metadata. For example, in the present disclosure, an “audio object” is referred to as an independent audio stream with metadata (ISM). (g) The term “audio stream” is intended to represent, in a bit-stream, an audio waveform, for example speech, music and/or any other sound including general audio sound, and may consist of one channel (mono) although multi-channels including two channels (stereo) might be also considered. (h) The term “metadata” is intended to represent a set of information describing for example an audio stream and an artistic intension used to translate the original or coded audio objects to a reproduction system. The metadata usually describes spatial properties of each individual audio object, such as position, orientation, volume, width, etc. As a non-limitative example, in the context of the present disclosure, two sets of metadata are considered: - input metadata: unquantized metadata representation used as an input to a codec; the present disclosure is not restricted a specific format of input metadata; and - coded metadata: quantized and coded metadata forming part of a bit-stream transmitted from an encoder to a decoder. (i) The term “audio format” is intended to designate an approach to achieve an immersive audio experience. (j) The term “reproduction system” is intended to designate an element, in a decoder, capable of rendering audio objects, for example but not exclusively in a 3D (Three- Dimensional) audio space around a listener using the transmitted metadata and artistic intension at the reproduction side. The rendering can be performed to a target loudspeaker layout (for example 5.1 surround) or to headphones while the metadata can be dynamically modified, for example in response to feedback from a head-tracking device. Other types of rendering may be contemplated. BACKGROUND [0003] Discontinuous transmission (DTX) is used in mobile communication systems to switch a radio transmitter off during speech or general audio pauses. The use of DTX saves power in the mobile station and increases the time required between battery recharging. It also reduces the general interference level and thus improves transmission quality. During speech or general audio pauses, however, the background noise that is typically transmitted with the speech or general audio also disappears if the channel is completely cut off. The result is an unnatural sounding audio signal (silence) at the receiving end of the communication. 
[0004] Instead of completely switching the transmission off during speech or general audio pauses, a number of techniques have been developed in which parameters that characterize the background noise are generated and sent in a Silence Insertion Descriptor (SID) frames bit-stream at a low bitrate. These parameters, often referred to as Comfort Noise (CN) parameters, can then be used at the receiver side (decoder) to regenerate background noise respecting, as much as possible, the spectral and temporal content of the background noise at the transmitter side (encoder). The process to regenerate the background noise is known as the Comfort Noise Generation (CNG). [0005] Historically, conversational telephony has been implemented with mono handsets having only one transducer to output sound only to one of the user’s ears. Consequently, the SID of mono codecs can achieve a low bitrate. In the last decade, users have started to use their portable handset in conjunction with a headphone to receive the sound over their two ears mainly to listen to music but also, sometimes, to listen to speech. Nevertheless, when a portable handset is used to transmit and receive conversational speech, the content is still mono but presented to the user’s two ears when a headphone is used. [0006] With the 3GPP (3rd Generation Partnership Project) speech coding standard implementing a Codec for Enhanced Voice Services (EVS) as described in Reference [1], of which the full content is incorporated herein by reference, the quality of the coded audio sound, for example speech, music and any other sound that is transmitted and received through a portable handset has been significantly improved. The next natural step is to transmit stereo information such that the receiver gets as close as possible to a real-life audio scene that is captured at the other end of the communication link. [0007] Further, in last years, the generation, recording, representation, coding, transmission, and reproduction of audio is moving towards enhanced, interactive and immersive experience for the listener. The immersive experience can be described, for example, as a state of being deeply engaged or involved in an audio scene while sounds are coming from all directions. In immersive audio (also called 3D (Three-Dimensional) audio), the sound image is reproduced in all three dimensions around the listener, taking into consideration a wide range of sound characteristics like timbre, directivity, reverberation, transparency and accuracy of (auditory) spaciousness. Immersive audio is produced for a particular audio playback or reproduction system such as loudspeaker-based-system, integrated reproduction system (sound bar) or headphones. Then, interactivity of an audio reproduction system may include, for example, an ability to adjust sound levels, change positions of sounds, or select different languages for the reproduction. [0008] There are three fundamental approaches (also referred below as audio formats) to achieve an immersive audio experience. [0009] A first approach is a channel-based audio where multiple spaced microphones are used to capture sounds from different directions while one microphone corresponds to one audio channel in a specific loudspeaker layout. Each recorded channel is supplied to a loudspeaker in a particular location. Examples of channel-based audio comprise, for example, stereo, 5.1 surround, 5.1+4 etc. 
[0010] A second approach is a scene-based audio (SBA) which represents a desired sound field over a localized space as a function of time by a combination of dimensional components. The signals representing the scene-based audio are independent of the sound sources positions while the sound field has to be transformed to a chosen loudspeakers layout at the rendering reproduction system. An example of scene-based audio is ambisonics. [0011] A third, last immersive audio approach is an object-based audio which represents an auditory scene as a set of individual audio elements (for example singer, drums, guitar) accompanied by information about, for example their position in the audio scene, so that they can be rendered at the reproduction system to their intended locations. This gives an object-based audio a great flexibility and interactivity because each object is kept discrete and can be individually manipulated. [0012] Beyond the fundamental approaches, new multi-channel coding techniques are being developed such as Metadata-Assisted Spatial Audio (MASA) as described for example in Reference [5] of which the full content is incorporated herein by reference. In the MASA approach, the MASA metadata (for example direction, energy ratio, spread coherence, distance, surround coherence, all in several time-frequency slots) are generated in a MASA analyzer, quantized, coded, and passed into the bit-stream while MASA audio channel(s) are treated as (multi-)mono or (multi-)stereo transport signals coded by the core encoder(s). At the MASA decoder, MASA metadata then guide the decoding and rendering process to recreate an output spatial sound. [0013] Each of the above-described audio approaches to achieve an immersive experience presents pros and cons. It is thus common that, instead of only one audio approach, several audio approaches are combined in a complex audio system to create an immersive auditory scene. An example can be an audio system that combines scene-based audio (SBA) or MASA with object-based audio, for example SBA or MASA with a few discrete audio objects. [0014] In recent years, 3GPP started working on developing a 3D audio codec for immersive services called IVAS (Immersive Voice and Audio Services) as described in Reference [2] of which the full content is incorporated herein by reference, based on the EVS codec as described in Reference [1]. The IVAS codec is a multi-channel codec where the bitrate is usually more demanding with the increased number of coded and transmitted channels. [0015] The DTX operation in multi-channel codec thus needs to address a trade-off between (a) keeping the SID bitrate low and (b) using a high number of channels to be represented. If, for example, each of the channels would be represented by its own SID, the overall codec SID bitrate would be too high. Consequently, there is a need for efficient DTX method and SID coding. 
SUMMARY [0016] According to a first aspect, the present disclosure relates to a method for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: analysing the audio streams for producing voice or signal activity information on the audio objects; detecting, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a SID frame within the DTX signal segment, wherein the segment and frame detection comprises (a) updating a global SID counter of inactive frames, and (b) signaling the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and encoding the signaled, detected SID frame using SID frame coding. [0017] According to another aspect, the present disclosure is concerned with a device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, the audio objects including respective audio streams, comprising: an analyser of the audio streams for producing voice or signal activity information on the audio objects; a DTX controller for detecting, in response to the activity information on the audio objects, a DTX signal segment of the audio objects and a SID frame within the DTX signal segment, wherein the DTX controller (a) updates a global SID counter of inactive frames, and (b) signals the detected SID frame within the DTX signal segment depending on a value of the global SID counter; and an encoder of the signaled, detected SID frame using SID frame coding. [0018] According to a further aspect, the present disclosure describes a method for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: decoding the metadata comprising adjusting values of the MD parameter to lower differences in the said MD parameter between frames; and decoding the audio streams. [0019] According to a fourth aspect, the present disclosure discloses a device for decoding audio objects during discontinuous transmission (DTX) operation, the audio objects each including an audio stream with metadata (MD) including at least one MD parameter, comprising: a metadata decoder for decoding the metadata, wherein the metadata decoder adjusts values of the MD parameter to lower differences in the said MD parameter between frames; and an audio stream decoder for decoding the audio streams. [0020] The foregoing and other objects, advantages and features of (a) the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec and (b) the method and device for decoding audio objects during discontinuous transmission (DTX) operation will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS [0021] In the appended drawings: [0022] Figure 1 is a schematic block diagram illustrating concurrently the DTX transmission method and device implemented in an ISM encoder and corresponding ISM encoding method; [0023] Figure 2 is a flow chart illustrating a SID/DTX logic used in the DTX transmission method and device implemented in the ISM encoder and corresponding ISM encoding method of Figure 1; [0024] Figure 3 is a non-limitative example of structure of a SID bit-stream in an object-based audio codec; [0025] Figure 4 is a schematic block diagram illustrating concurrently an ISM decoder and corresponding ISM decoding method; [0026] Figure 5 is a graph of an example of metadata (azimuth) adjustment in DTX operation; and [0027] Figure 6 is a simplified block diagram of an example configuration of hardware components forming (a) the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, (b) the ISM encoding method and encoder, (c) the method and device for decoding audio objects during discontinuous transmission (DTX) operation, and/or (d) the ISM decoding method and decoder. DETAILED DESCRIPTION [0028] The present disclosure describes the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec and the method and device for decoding audio objects during discontinuous transmission (DTX) operation. [0029] Non-restrictive illustrative embodiments of the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec and the method and device for decoding audio objects during discontinuous transmission (DTX) operation as described hereinafter involves concepts such as identification of inactive signal segments, decision about coding SID frames and associated CN parameters, and coding, quantization, and restoring metadata in inactive signal segments in an object-based audio codec. [0030] In the present disclosure, the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec and the method and device for decoding audio objects during discontinuous transmission (DTX) operation will be described, by way of non-limitative example only, with reference to an IVAS coding framework referred to throughout the present disclosure as IVAS codec (or IVAS audio codec). Specifically, the following IVAS formats make the reference: ISM format, OMASA format, and OSBA format. However, it is within the scope of the present disclosure to incorporate such DTX technique in any other audio codec supporting object-based audio. 1. Introduction [0031] As a non-limitative example, the present disclosure considers a framework that supports simultaneous coding of several audio objects (for example up to 4 audio objects) while a fixed constant codec total bitrate is considered for coding the audio objects, including the audio streams with their associated metadata. It should be noted that the metadata are not necessarily transmitted for at least some of the audio objects, for example in the case of non-diegetic content. Non-diegetic sounds in movies, TV shows and other videos are sounds that the characters cannot hear. Soundtracks are an example of non-diegetic sound, since the audience members are the only ones to hear the music. 
[0032] The present disclosure also considers a basic non-limitative example of input metadata consisting of two metadata (MD) parameters, namely azimuth and elevation, which are stored per audio frame for each audio object. In this example, an azimuth range of [-180⁰, 180⁰), and an elevation range of [-90⁰, 90⁰], are considered. However, it is within the scope of the present disclosure to consider only one or more than two (2) MD parameters with various ranges. 2. Object-Based Coding [0033] Figure 1 is a schematic block diagram illustrating concurrently the method 150 and device 100 for discontinuous transmission (DTX) of audio objects in an object- based audio codec implemented within an ISM encoding method and corresponding ISM encoder. 2.1 Input buffering [0034] Referring to Figure 1, the ISM encoding method comprises an operation of input buffering 151. To perform the operation 151 of input buffering, the ISM encoder comprises an input buffer 101. [0035] The input buffer 101 buffers a number N of input audio objects 102, i.e. a number N of audio streams with the associated respective N metadata. The N input audio objects 102, including the N audio streams and the N metadata associated to each of these N audio streams are buffered for one frame, for example a 20 ms long frame. As well known in the art of audio signal processing, the audio signal is sampled at a given sampling frequency and processed by successive blocks of these samples called “frames” each divided into a number of “sub-frames.” 2.2 Audio Streams Analysis and Front Pre-Processing [0036] Still referring to Figure 1, the DTX transmission method 150 comprises an operation of analysis and front pre-processing 153 of the N audio streams. To perform the operation 153, the DTX transmission device 100 comprises an audio stream processor (analyser) 103 to analyze and front pre-process, for example sequentially, the buffered N audio streams transmitted from the input buffer 101 to the audio stream processor (analyser) 103 through a number N of transport channels 104, respectively. [0037] The analysis and front pre-processing operation 153 performed by the audio stream processor (analyser) 103 may comprise, for example, at least one of the following sub-operations: time-domain transient detection, spectral analysis, long-term prediction analysis, pitch tracking and voicing analysis, voice or signal activity detection (VAD/SAD), bandwidth detection, noise estimation and signal classification (which may include in a non-limitative embodiment (a) core-encoder selection between, for example, ACELP core- encoder, TCX core-encoder, HQ core-encoder, etc., (b) signal type classification between, for example, inactive core-encoder type, unvoiced core-encoder type, voiced core-encoder type, generic core-encoder type, transition core-encoder type, and audio core-encoder type, etc., (c) speech/music classification, etc.). Information obtained from the analysis and front pre-processing operation 153 is supplied to a configuration and decision processor 106 through a line 121. Examples of the foregoing sub-operations are described in Reference [1] in relation to the EVS codec and, therefore, will not be further described in the present disclosure. 2.2.1 DTX/CNG Operation Overview [0038] As indicated herein above, discontinuous transmission (DTX) and comfort noise generation (CNG) are used in the method 150 and corresponding device 100 to reduce the transmission bitrate by simulating background noise during inactive signal periods. 
In the EVS codec as described in Reference [1], a regular DTX/CNG scheme is supported for bitrates up to 24.4 kbps. For higher bitrates, the EVS codec supports a less aggressive DTX/CNG scheme that only switches to comfort noise generation (CNG) for low input signal power. [0039] The reduction of the transmission bitrate during inactive signal periods is achieved by coding the parameters referred to as comfort noise (CN) parameters. The CN parameters can be transmitted at a fixed or adaptive bitrate during inactive signal periods. As an example, in the EVS codec as described in Reference [1], a default transmission rate of CNG update (aka SID update rate) is set to 8 frames. [0040] When the EVS codec is operated with the DTX/CNG scheme, a signal activity detector (SAD) is used to analyze the input audio signal to determine whether the input audio signal is active or inactive (see SAD decision in subclause 5.1.12 of Reference [1]). Based on its analysis, the SAD detector generates a SAD flag, f SAD , whose state indicates whether the input audio signal is active (fSAD = 1) or merely a background noise (fSAD = 0). When the flag fSAD = 1, regular encoding and decoding is performed, as in the default option. When fSAD = 0, DTX functions are run at the encoder that transmit either a Silence Insertion Descriptor (SID) frame or a NO_DATA frame. In the following description, the frames where the input audio signal is active (fSAD = 1) are called “active frames” and the frames with merely background noise (fSAD = 0) are called “inactive frames”. [0041] The SID frame contains the CN parameters, which are used to update the characteristics of the background noise at the decoder, whereas the NO_DATA frame is empty. The SID frame in EVS is always encoded using 48 bits (which corresponds to the bitrate of 2.4 kbps). [0042] As indicated herein above, the multichannel IVAS codec is based on the EVS codec. Specifically, a variable bitrate version of the EVS encoder is employed as the so- called core-encoder. The SID bitrate of the core-encoder in IVAS is the same as the SID bitrate of EVS, i.e.2.4 kbps, while the CNG algorithms of the EVS codec are reused in the core-encoder of the IVAS codec as much as possible. [0043] In general, the number of core-encoders in IVAS is one up to the number of input audio channels (i.e. every audio channel is coded by an associated core-encoder). In case of the ISM format, the number of core-encoders N is usually equal to the number N of independent audio streams with metadata (ISMs) and it is then called “discrete ISM” coding. It should be noted that “parametric ISM” techniques can be also employed where the number of core-encoders is less than the number of audio objects. These “parametric ISM” techniques are usually employed at low bitrates where active channel coding of all audio objects is not possible due to bitrate constraints. [0044] It is obvious that coding all ISMs in an SID frame would result in N times (2.4 kbps + metadata) bitrate which would further result in too high an IVAS SID bitrate. In order to keep it reasonably low, the IVAS SID bitrate is set to 5.2 kbps and efficient coding of SID frames is introduced. 2.2.2 Classification of Inactive Frames in ISM Encoder [0045] As indicated herein above, the audio stream processor (analyser) 103 analyzes the audio streams conveyed by respective channels 104 from the input buffer 101, classifies the input audio objects and produces the voice or signal activity detection (VAD/SAD) flag fSAD, one per audio object. 
A change from fSAD = 1 to fSAD = 0 indicates the start of an inactive signal segment for a particular audio object. Several VAD/SAD flag variants can be used, for example a “local VAD” flag as described in Paragraph 5.1.12 of Reference [1]. It is obvious that the start of an inactive signal segment usually happens at different time instances in different audio objects. [0046] By default, in an object-based audio system a DTX signal segment is declared when an inactive signal segment is declared for all audio objects. Then, additional classification stages are used to classify the inactive signal segments. Also, the values of metadata have an impact on whether a frame is declared as an inactive frame. The classification logic of inactive (SID or NO_DATA) frames or active frames is illustrated in Figure 2. 2.2.3 Global SID Counter, DTX Flag, SID Flag [0047] Figure 2 is a flow chart illustrating concurrently a DTX control operation 250 and corresponding DTX controller 200 used in the DTX transmission method 150 and device 100 implemented in the ISM encoder and corresponding ISM encoding method of Figure 1. [0048] Referring to Figure 2, in order to control the SID/NO_DATA frame decision and the start of a DTX signal segment, the DTX controller 200 uses a global SID counter cntSID of inactive frames. This global SID counter cntSID makes it possible to control the SID update rate and to synchronize the SID/NO_DATA frames across all audio objects. Consequently, the global SID counter cntSID enables efficient tuning of a potential hang-over (hysteresis) DTX logic, or permits a SID update rate other than the default of, for example, 8 frames, or an adaptive SID update rate. [0049] In the disclosed logic, the global SID counter cntSID (parameter “cnt_SID_ISM” in the source code at the end of the present disclosure) is thus superordinate to the per-audio-object SID counters and effectively ensures that the individual per-audio-object SID counters are synchronized. [0050] Referring to Figure 2, the DTX controller 200 receives the information from line 121 obtained from the analysis and pre-processing operation 153, including the VAD information. The DTX controller 200 then initializes the DTX flag flagDTX and the SID flag flagSID to “0” (block 201). [0051] By default, the DTX controller 200 detects a DTX signal segment (SID or NO_DATA frame) of the audio objects when the VAD flags flagVAD of all audio objects are equal to 0 (block 202). The DTX controller 200 signals (block 203) that the VAD flags flagVAD of all audio objects are equal to 0 by setting the DTX flag flagDTX = 1 (parameter ‘dtx_flag’ in the source code at the end of the present disclosure). This can be expressed using relation (1):

flagDTX = { 1 if local VAD = 0 for all audio objects; 0 otherwise }     (1)

[0052] where flagDTX is, as mentioned earlier, the DTX flag. When the VAD flags flagVAD of all audio objects are not equal to 0 (block 202), the DTX controller 200 signals an active frame (block 210) and selects active frame coding. [0053] Further, the DTX controller 200 signals SID frames within the DTX signal segment using a SID flag flagSID (parameter ‘sid_flag’ in the source code at the end of the present disclosure).
The SID flag flagSID is set (a) to 1 (block 205) when the global SID counter cntSID is equal to 0 (block 204), in which case the DTX controller 200 signals a SID frame (block 207) within the DTX signal segment and selects SID frame coding of the signaled SID frame, and (b) to 0 when the global SID counter cntSID is not equal to 0 (block 204), in which case the DTX controller 200 signals a NO_DATA frame (block 209) within the DTX signal segment and selects NO_DATA frame coding of the signaled NO_DATA frame. This can be expressed using relation (2):

flagSID = { 1 if cntSID = 0; 0 otherwise }     (2)

[0054] In block 206 of Figure 2, the DTX controller 200 resets the SID counter cntSID to -1 in every active frame, and the counter is incremented by 1 at every inactive frame up to the value corresponding to the SID update rate (by default 8 frames in the above example of implementation). If the value of counter cntSID reaches the SID update rate, it is reset to 0 and the DTX controller 200 signals a SID frame (blocks 204, 205 and 207) within the DTX signal segment and selects SID frame coding of the signaled SID frame. [0055] The DTX controller 200 can alter the DTX flag flagDTX by using other classification stages (block 208) based on core-encoder pre-processing values for all audio objects. [0056] As a non-limitative example, the DTX controller 200 may comprise a logic that forces active frame coding (flagDTX = 0) and selection of active frame coding (block 210) in cases when 1) a mean value of the LT (Long-Term) background noises over all audio objects, noisemean, is higher than a first threshold thr1, or 2) the mean value of LT background noises over all audio objects, noisemean, is higher than a second threshold thr2 and the LT background noise variation over all audio objects, noise_varmean, is higher than a third threshold thr3. This can be expressed using relation (3):

flagDTX = 0 if noisemean > thr1 or (noisemean > thr2 and noise_varmean > thr3)     (3)

[0057] The thresholds are found experimentally and can be set, e.g., to thr1 = 50, thr2 = 10, and thr3 = 2. The assessment of the LT background noise variations is introduced in order to check for background noise similarities among all audio objects. The parameter noisemean represents the mean value of the long-term background noise energy values of all audio objects (see Paragraph 5.1.11 in Reference [1]). The parameter noise_varmean then represents the variation of the long-term background noise energy values of all audio objects. [0058] Also, logic similar to that of the EVS codec is used in the IVAS codec; for higher bitrates, the IVAS codec supports a less aggressive DTX/CNG scheme that only switches to CNG for low input signal power. [0059] As another non-limitative example of decision about an inactive signal segment, the DTX controller 200 comprises a logic that assesses the energy of background noise in all audio channels 104 (audio streams). Typically, when there is one audio object with a high-energy dominant background noise and the other audio objects have low-energy background noises, the DTX flag is set to 1 (flagDTX = 1); otherwise it is set to 0 (flagDTX = 0), which means that DTX is not triggered (active frame coding (block 210) is selected) when there is more than one audio object with high-energy noise.
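The SID/DTX control logic of Figure 2 can be summarized by the following sketch in C, the language of the referenced source code. It is a minimal illustration of relations (1) to (3) and of the global SID counter handling of block 206; apart from the parameter names cnt_SID_ISM, dtx_flag and sid_flag quoted above, all identifiers, the data structure and the exact ordering of the checks are assumptions made for this example and do not reproduce the actual IVAS implementation.

/* Illustrative sketch of the SID/DTX control of Figure 2, relations (1)-(3). */

#include <stdint.h>

#define SID_UPDATE_RATE   8      /* default SID update rate (frames)          */
#define THR1              50.0f  /* example threshold values from [0057]      */
#define THR2              10.0f
#define THR3               2.0f

typedef struct {
    int16_t cnt_SID_ISM;         /* global SID counter, shared by all objects */
} DTX_HANDLE;

/* One call per frame; vad_flags[] holds the local VAD flag of each audio
 * object, noise_mean / noise_var_mean are the long-term background noise
 * statistics over all objects.                                               */
static void ism_dtx_control( DTX_HANDLE *h, const int16_t vad_flags[], int16_t n_objects,
                             float noise_mean, float noise_var_mean,
                             int16_t *dtx_flag, int16_t *sid_flag )
{
    int16_t i;

    /* relation (1): DTX segment only when the local VAD of all objects is 0  */
    *dtx_flag = 1;
    for ( i = 0; i < n_objects; i++ )
    {
        if ( vad_flags[i] != 0 )
        {
            *dtx_flag = 0;           /* at least one active object -> active frame */
            break;
        }
    }

    /* relation (3): additional classification stages may force active coding */
    if ( *dtx_flag && ( noise_mean > THR1 ||
                        ( noise_mean > THR2 && noise_var_mean > THR3 ) ) )
    {
        *dtx_flag = 0;
    }

    if ( *dtx_flag == 0 )
    {
        h->cnt_SID_ISM = -1;         /* block 206: reset in every active frame */
        *sid_flag = 0;
        return;                      /* active frame coding is selected        */
    }

    /* inactive frame: advance the global SID counter up to the update rate   */
    h->cnt_SID_ISM++;
    if ( h->cnt_SID_ISM >= SID_UPDATE_RATE )
    {
        h->cnt_SID_ISM = 0;
    }

    /* relation (2): a SID frame is signaled only when the counter equals 0,
     * otherwise a NO_DATA frame is sent                                       */
    *sid_flag = ( h->cnt_SID_ISM == 0 ) ? 1 : 0;
}

With this structure, the first inactive frame after an active segment brings the counter from -1 to 0 and therefore produces a SID frame, and subsequent SID frames follow at the SID update rate, which matches the behaviour described in [0054].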
2.3 Metadata Analysis, Quantization and Coding [0060] The ISM encoding method of Figure 1, for coding the object-based audio signal, further comprises an operation of metadata analysis, quantization and coding 155. To perform the operation 155, the ISM encoder for coding the object-based audio signal comprises a metadata processor 105. [0061] The analysis, quantization and coding of metadata in non-DTX operation can be done, for example, as described in Reference [3], of which the full content is incorporated herein by reference. It is noted that the metadata processor 105 of Figure 1 quantizes and codes the metadata 140 of the N audio objects 102, in the described non-restrictive illustrative embodiments, sequentially in a loop, while a certain dependency can be employed between the quantization of audio objects and the metadata parameters of these audio objects. [0062] As indicated herein above, in the present disclosure, two metadata parameters, azimuth and elevation (as included in the N input metadata), are considered in the example implementation. As a non-limitative example, the metadata processor 105 comprises a quantizer (not shown) of the following metadata parameter indices using the following example resolution to reduce the number of bits being used:
- Azimuth parameter: A 12-bit azimuth parameter index from a file of the input metadata is quantized to a Baz-bit index (for example Baz = 7). Given the minimum and maximum azimuth limits (-180⁰ and +180⁰), the quantization step for a (Baz = 7)-bit uniform scalar quantizer is 2.835⁰.
- Elevation parameter: A 12-bit elevation parameter index from the input metadata file is quantized to a Bel-bit index (for example Bel = 6). Given the minimum and maximum elevation limits (-90⁰ and +90⁰), the quantization step for a (Bel = 6)-bit uniform scalar quantizer is 2.857⁰.
[0063] Both azimuth and elevation parameter indices, once quantized, can be coded by a metadata encoder (not shown) of the metadata processor 105 using either absolute or differential coding (112 in Figure 1). As known in the art, absolute coding means that a current value of a parameter is coded. Differential coding means that a difference between a current value and a previous value of a parameter is coded. As the indexes of the azimuth and elevation parameters usually evolve smoothly (i.e. a change in azimuth or elevation position can be considered as continuous and smooth), differential coding is used by default as it often consumes fewer bits compared to absolute coding. 2.3.1 Metadata Analysis, Quantization and Coding in SID Frames [0064] As indicated herein above, the bit-budget in SID frames is relatively low. In the case of the IVAS codec, the SID bitrate is 5.2 kbps, of which roughly one half is reserved for a metadata (MD) payload. In order to transmit as many MD values as possible, some compromises are made. [0065] The present disclosure is based on forcing the MD coding by the metadata encoder (not shown) of the metadata processor 105 in SID frames to the absolute coding method. It is noted that this is different from the MD coding in active frames, where absolute/differential coding is employed. The motivation to use solely absolute coding in SID frames is to avoid potential degradation in long segments of inactive frames due to a lost SID frame in case of a noisy channel, as could occur if differential coding were used. [0066] Then, one possibility to match the bit-budget constraints is to lower the resolution of MD values.
However, this would mean that the MD reconstruction at the decoder will not be precise enough, causing subjective quality degradation. The present disclosure thus introduces several mechanisms to overcome these constraints. [0067] First, the DTX controller 200 makes the resolution of MD values dependent on the number of audio objects. Specifically, when the number of audio objects is low, the available bit-budget for coding the metadata (MD) is relatively generous and the MD resolution is thus kept relatively high (for example the same as in the active frames). On the other hand, when the number of coded audio objects is high, the resolution of MD values is relatively low. [0068] In an example implementation, where azimuth and elevation MD parameters are considered, the metadata processor 105 is set up to encode:
a) azimuth and elevation indexes, for example, by means of Baz = 8 bits and Bel = 7 bits in a system with one or two audio objects,
b) azimuth and elevation indexes, for example, by means of Baz = 6 bits and Bel = 5 bits in a system with three or four audio objects,
while the actual number of coded bits, and thus the resolution, is explicitly known from the ISM common signaling (see 113 of Figure 1 and Paragraph 2.8 below). [0069] Second, in order to keep the SID bit-budget as low as possible, a saving in MD bit-budget is achieved by computing a flag, one flag per audio object, indicating that the MD parameters have not changed (or have not changed significantly) since the last frame and, consequently, that the metadata (MD) parameters for a specific audio object are not coded and transmitted. Similarly, this flag serves as an indication that the input MD parameters are not present for that specific audio object. [0070] Referring back to Figure 2, in an example implementation, the DTX controller 200 computes an MD on/off flag, flagMD (parameter ‘diff_flag’ in the source code at the end of the present disclosure), for all MD parameter values (block 213). For example, a flag flagMD,θ is calculated for the azimuth MD parameter using the following relation (4):

flagMD,θ = { 1 if |θ − θlast| > δθ; 0 otherwise }     (4)

[0071] where θ is the current frame azimuth, θlast is the last frame azimuth and δθ is an azimuth maximum difference value, for example δθ = 15. In the same manner (see relation (4)), the DTX controller calculates (block 213) a flag flagMD,φ for the elevation MD parameter (or for any other MD parameters). The final MD on/off flag flagMD for one audio object can be obtained as a logical OR (symbol ∨ in relation (5) below) between all particular MD flags:

flagMD = flagMD,θ ∨ flagMD,φ ∨ …     (5)

[0072] Finally, the flag flagMD (214), represented as 1 bit of information per audio object, is inserted into the bit-stream in the SID frame 207 right after the ISM signalization of the number N of coded audio objects 301 (see Paragraph 2.8.1 below). [0073] Third, the DTX controller comprises a mechanism (blocks 211 and 212 of Figure 2) that estimates the bit-budget bitsMD for metadata quantization. It should be noted that only the bit-budget for the quantization of metadata values is estimated (computed in advance) at this stage, while the quantization itself is possibly performed only later. When the estimated bit-budget bitsMD is higher than a maximum available bit-budget for the MD coding, bitsavailable (block 212), the flag flagDTX is reset to 0 (block 215) and active frame coding (block 210) is performed. On the other hand, when the estimated MD bit-budget, bitsMD, is lower than or equal to the maximum available bit-budget for the MD coding, bitsavailable (block 212), the flag flagDTX is not changed and the DTX coding segment continues. The maximum available bit-budget for the MD coding, bitsavailable, is computed as the codec SID bit-budget minus the bit-budget needed for other than MD coding (e.g. SID frame signaling, core-encoder SID bit-budget, spatial information bit-budget, ISM signaling) in the SID frame. [0074] This logic comes from the premise that CNG frames do not represent important audio details, while the metadata (MD) are considered important details for representing the output audio and are thus transmitted to the decoder. When there are more audio objects, the probability of switching to active frame coding (block 210) is obviously higher. This logic also becomes even more important when more metadata (other than only azimuth and elevation) are present and to be coded.
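A compact sketch of these three mechanisms is given below in C. Apart from the parameter name diff_flag quoted above, every identifier, the elevation threshold and the exact bit accounting are assumptions made for illustration; the real bit-budget estimate in the codec may account for additional fields.

/* Illustrative sketch of the SID metadata decisions of blocks 211-215:
 * MD resolution selection ([0068]), per-object MD on/off flag of relations
 * (4)-(5), and the bit-budget check of blocks 211-212 ([0073]).             */

#include <math.h>
#include <stdint.h>

#define DELTA_AZI   15.0f   /* azimuth maximum difference value (example)    */
#define DELTA_ELE   15.0f   /* elevation threshold, assumed for the example  */

/* relations (4)-(5): 1-bit MD on/off flag for one audio object              */
static int16_t compute_diff_flag( float azi, float azi_last, float ele, float ele_last )
{
    int16_t flag_azi = ( fabsf( azi - azi_last ) > DELTA_AZI ) ? 1 : 0;
    int16_t flag_ele = ( fabsf( ele - ele_last ) > DELTA_ELE ) ? 1 : 0;

    return flag_azi | flag_ele;      /* logical OR over all MD parameters    */
}

/* blocks 211-215: estimate the MD bit-budget and fall back to active frame
 * coding when it exceeds the budget available in the SID frame              */
static int16_t check_sid_md_budget( const int16_t diff_flag[], int16_t n_objects,
                                    int16_t bits_available, int16_t *dtx_flag )
{
    /* [0068]: higher MD resolution with 1-2 objects than with 3-4 objects   */
    int16_t B_az = ( n_objects <= 2 ) ? 8 : 6;
    int16_t B_el = ( n_objects <= 2 ) ? 7 : 5;

    int16_t bits_md = 0, i;

    for ( i = 0; i < n_objects; i++ )
    {
        bits_md += 1;                /* 1-bit MD on/off flag per object       */
        if ( diff_flag[i] )
        {
            bits_md += B_az + B_el;  /* absolute coding of azimuth/elevation  */
        }
    }

    if ( bits_md > bits_available )
    {
        *dtx_flag = 0;               /* block 215: force active frame coding  */
    }

    return bits_md;
}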
2.4.1 Bitrates per Channel Configuration and Decision [0075] Referring to Figure 1, the ISM encoding method comprises an operation 156 of configuration and decision about bitrates per transport channel 104. To perform the operation 156, the ISM encoder comprises a configuration and decision processor 106 forming a bit-budget allocator. [0076] The configuration and decision processor 106 (hereinafter bit-budget allocator 106) can use a bitrate adaptation algorithm to distribute the available bit-budget for core-encoding the N audio streams in the N transport channels 104. Details of the bitrate adaptation algorithm to distribute the available bit-budget for core-encoding can be found in Reference [3]. 2.5 Pre-Processing [0077] Referring to Figure 1, the ISM encoding method comprises an operation of pre-processing 158 of the N audio streams conveyed through the N transport channels 104 from the configuration and decision processor 106 (bit-budget allocator). To perform the operation 158, the ISM encoder comprises a pre-processor 108. [0078] Once the configuration and bitrate distribution between the N audio streams is completed by the configuration and decision processor 106 (bit-budget allocator), the pre-processor 108 performs sequential further pre-processing 158 on each of the N audio streams. Such pre-processing 158 may comprise, for example, further signal classification, further core-encoder selection (for example selection between ACELP core, TCX core, and HQ core), further resampling at a different internal sampling frequency Fs adapted to the bitrate to be used for core-encoding, etc. Examples of such pre-processing can be found, for example, in Reference [1] in relation to the EVS codec and, therefore, will not be further described in the present disclosure. 2.6 Core-Encoding [0079] Referring to Figure 1, the ISM encoding method comprises an operation of core-encoding 159. To perform the operation 159, the ISM encoder 100 comprises, for example, a number N of core-encoders 109 to respectively code the N audio streams conveyed through the N transport channels 104 from the pre-processor 108. [0080] Specifically, in the case of active frame coding, the N audio streams are encoded using N fluctuating bitrate core-encoders 109, for example mono core-encoders. The bitrate used by each of the N core-encoders is the bitrate selected by the configuration and decision processor 106 (bit-budget allocator) for the corresponding audio stream.
For example, core-encoders based on EVS as described in Reference [1] can be used as core-encoders 109.

[0081] In case of DTX operation (SID frame coding), the SID of one audio object is coded by one of the core-encoders 109 while the processing in the other core-encoders 109 can be done in NO_DATA frame operation (they only update the core-encoder state parameters). Alternatively, no processing is performed in the said other core-encoders 109, i.e. pre-processing and core-encoding are completely skipped in the said other core-encoders 109.

2.7 Coding of Spatial Information

[0082] Referring to Figure 1, the ISM encoding method comprises an operation 180 of coding spatial information, and the ISM encoder comprises a corresponding coding module 130. The spatial information coding analyses similarities between audio objects and can be based, for example, on inter-channel cues such as the inter-channel level difference or the inter-channel coherence.

[0083] The spatial information is usually estimated and coded in parametric ISM coding techniques or in SID frames coding. Examples can be found in the IVAS framework described in Reference [4], of which the full content is incorporated herein by reference.

2.8 SID Bit-Stream Structure

[0084] Referring to Figure 1, the ISM encoding method comprises an operation of multiplexing 160. To perform the operation 160, the ISM encoder comprises a multiplexer 110.

[0085] Figure 3 is a schematic diagram illustrating, for a frame, the structure of the SID bit-stream 111 produced by the multiplexer 110 and transmitted from the ISM encoder of Figure 1 to the ISM decoder 400 of Figure 4. Regardless of whether metadata are present and transmitted or not, the SID bit-stream 111 may be structured as illustrated in Figure 3.

[0086] Referring to Figure 3, the multiplexer 110 writes the indices 302 of SID format signaling followed by the indices 114 of one core-encoder SID from the beginning of the bit-stream 111, while the indices of ISM common signaling 113 from the configuration and decision processor 106 (bit-budget allocator), spatial information 131 from the coding module 130, and metadata 112 from the metadata processor 105 are written from the end of the bit-stream 111.

2.8.1 ISM Common Signaling in SID Frame

[0087] The multiplexer 110 writes the ISM common signaling 113 from the end of the bit-stream 111. The ISM common signaling in the SID frame is produced by the configuration and decision processor 106 (bit-budget allocator) and comprises a variable number of bits representing:

[0088] (a) a number N of audio objects: the signaling for the number N of coded audio objects present in the bit-stream 111 is in the form of, for example, a unary code 301 with a stop bit (for example, for N = 3 audio objects, the first 3 bits of the ISM common signaling would be “110” written in a backward order).

[0089] (b) the metadata on/off flag flagMD, one per audio object, as described in Relation (5) and represented by line 214 (a minimal sketch of this signaling is given after Paragraph [0091] below).

2.8.2 Metadata Payload

[0090] In a SID frame, right after the ISM common signaling 113 and in backward order, there are written into the bit-stream 111 a) the spatial information indices 131, and finally b) the metadata values 112 as quantized in Paragraph 2.3.1 above.

2.8.3 Audio Streams Payload

[0091] In active frames, the multiplexer 110 receives the N audio streams 114 coded by the N core-encoders 109 through the N transport channels 104, and writes them from the beginning of the bit-stream 111 (see Figure 3) right after the IVAS format bits 302.
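To illustrate the ISM common signaling described in Paragraphs [0088] and [0089], the following minimal sketch writes the unary-coded number of objects with a stop bit, followed by one MD on/off flag per object. The helper push_bit() and the flag values are assumptions introduced only for this example; the actual IVAS bit-stream writer (push_indice() in the source code at the end of the present disclosure) operates on a different structure.

/* Sketch of the ISM common signaling write: unary-coded number of
 * objects with a stop bit, then one MD on/off flag per object.
 * push_bit() is an assumed helper, not the IVAS API. */
#include <stdio.h>

static int bits[32];
static int pos = 0;

static void push_bit( int b ) { bits[pos++] = b; }

int main( void )
{
    const int num_obj = 3;
    const int flag_md[3] = { 1, 0, 1 };  /* example per-object MD on/off flags */

    /* number of objects: unary code with stop bit -> "110" for N = 3 */
    for ( int ch = 1; ch < num_obj; ch++ )
    {
        push_bit( 1 );
    }
    push_bit( 0 );

    /* one MD on/off flag per object, as in Relation (5) */
    for ( int ch = 0; ch < num_obj; ch++ )
    {
        push_bit( flag_md[ch] );
    }

    for ( int i = 0; i < pos; i++ )
    {
        printf( "%d", bits[i] );
    }
    printf( "\n" );  /* prints 110101 */
    return 0;
}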
[0092] In SID frames, the multiplexer 110 receives one audio stream 114 coded by one of the core-encoders 109 through one transport channel 104, and writes it from the beginning of the bit-stream 111 (see Figure 3) right after the IVAS SID format bits 302 (signalization of the SID mode related to the IVAS format).

3. Decoding of Audio Objects

[0093] Figure 4 is a schematic block diagram illustrating concurrently an ISM decoding method 450 implementing the method for decoding audio objects during discontinuous transmission (DTX) operation and a corresponding ISM decoder 400 implementing the device for decoding audio objects during discontinuous transmission (DTX) operation.

3.1 Demultiplexing

[0094] Referring to Figure 4, the ISM decoding method 450 comprises an operation of demultiplexing 451. To perform the operation 451, the ISM decoder 400 comprises a demultiplexer 401.

[0095] The demultiplexer 401 receives a bit-stream 402 transmitted from the ISM encoder of Figure 1 to the ISM decoder 400 of Figure 4. Specifically, the bit-stream 402 of Figure 4 corresponds to the bit-stream 111 of Figure 1.

[0096] The demultiplexer 401 extracts from the bit-stream 402 (a) the coded N audio streams 114 in case of active frame coding, or one audio stream in case of a SID frame, (b) the coded metadata 112 for the N audio objects, (c) the spatial information 131, and (d) the ISM common signaling 113 read from the end of the received bit-stream 402.

3.2 Metadata Dequantization and Decoding

[0097] Referring to Figure 4, the ISM decoding method 450 comprises an operation 453 of metadata decoding and dequantization. To perform the operation 453, the ISM decoder 400 comprises a metadata decoding and dequantization processor (metadata decoder) 403.

[0098] The metadata decoding and dequantization processor (metadata decoder) 403 is supplied with the coded metadata 112 for the transmitted N audio objects, the ISM common signaling 113, and an output set-up 404 to decode and dequantize the metadata for the audio streams/objects with active contents. The output set-up 404 is a command line parameter about the number M of decoded audio objects/transport channels and/or audio formats, which can be equal to or different from the number N of coded audio objects/transport channels. The metadata decoding and dequantization processor (metadata decoder) 403 produces decoded metadata 405 for the M audio objects/transport channels, and supplies the decoded metadata and information about their respective bit-budgets on line 406. Obviously, the decoding and dequantization performed by the processor (metadata decoder) 403 is the inverse of the quantization and coding performed by the metadata processor 105 of Figure 1.

3.2.1 Metadata Adjustment in DTX Operation

[0099] At the ISM decoder 400, the SID frames are received at a certain rate (for example, every 8 frames by default in IVAS), resulting in a possibility that the received MD parameter values change between SID frames with a large step. These large steps can cause subjective artifacts; for example, the positions of the audio objects can change suddenly from one place to another. In order to avoid these artifacts, the metadata decoding and dequantization processor (metadata decoder) 403 adjusts the MD parameter values at the ISM decoder 400 such that the MD parameter value differences between frames are lowered.
For example, an interpolation between the true decoded and dequantized current frame MD parameter value and the previous frame MD parameter value can be applied in certain frames following the SID frame. This results in the MD parameter values evolving more smoothly, while the smoothing is applied in several CNG frames, or several active frames, or several CNG and active frames following a SID frame.

[00100] As an example, the adjustment, or smoothing, can be applied to each MD parameter such that the maximum difference (step) of an MD parameter between two adjacent frames is not larger than a given threshold. Let us suppose that the current decoded and dequantized azimuth MD parameter value is θdec and that the maximum smoothing step for the azimuth difference is Δθ. Then:

θdec = θlast + sgn(θtrue − θlast) · Δθ   if |θtrue − θlast| > Δθ
θdec = θtrue   otherwise    (6)

[00101] where θtrue is the transmitted quantized azimuth (quantized MD parameter value in general) in the SID frame and θlast is the azimuth (MD parameter value in general) in the preceding frame, updated at the end of each frame decoding as θlast = θdec. In the example implementation, the azimuth (MD parameter value in general) smoothing step value is Δθ = 5 (corresponding to the parameter ‘CNG_MD_MAX_DIFF_AZIMUTH’ in the source code at the end of the present disclosure). Further, the operation sgn(x) in Relation (6) is the mathematical expression:

sgn(x) = 1 if x ≥ 0, and −1 otherwise    (7)

[00102] Further, the maximum number of frames in which to apply the smoothing step can be set to a given threshold value Dmax. For example, in the example implementation it is not limited in inactive segments but is limited in active segments to Dmax = 5 frames (constant ‘IVAS_ISM_DTX_HO_MAX’ in the source code at the end of the present disclosure).

[00103] Moreover, the smoothing step is skipped in active frames when the absolute value of the difference between the current frame quantized MD parameter value θtrue and the previous frame decoded MD parameter value θlast is higher than the smoothing step value Δθ multiplied by the threshold value Dmax. For example, in case of the azimuth parameter, it means that:

if |θtrue − θlast| > Dmax · Δθ then skip the smoothing    (8)

[00104] Relation (8) also applies to MD parameters other than azimuth. Relation (8) is used to prevent wrong smoothed MD parameter estimation when the true values of an MD parameter are changing significantly from frame to frame and the number of frames between the last SID frame and the active frame is low (in general lower than Dmax).

[00105] The effect of the smoothing can be seen from the graph of Figure 5 for the MD azimuth parameter of the first audio object when coding two audio objects in the DTX operation in the IVAS codec. From top to bottom, for a 1.1 second segment, there are a) a noisy segment of the input audio signal, b) the CNG synthesis, c) the IVAS total bitrate indicating active frames (48 kbps), SID frames (5.2 kbps), and NO_DATA frames (0 bps), d) the encoder input azimuth, e) the decoder (quantized) azimuth in case of active frame coding (no DTX), f) the decoder azimuth in the CNG segment without smoothing, and g) the decoder azimuth in the CNG segment with smoothing.

3.3 Configuration and Decision about Bitrates

[00106] Referring to Figure 4, the ISM decoding method 450 comprises an operation 457 of configuration and decision about bitrates per channel.
To perform the operation 457, the ISM decoder 400 comprises a configuration and decision processor 407 (bit-budget allocator).

[00107] The bit-budget allocator 407 receives the information about the respective bit-budgets for the M decoded metadata on line 406 to determine the core-decoder bitrates per audio stream. The bit-budget allocator 407 uses the same procedure as in the bit-budget allocator 106 of Figure 1 to determine the core-decoder bitrates (see section 2.4). In case of a SID frame, the core-decoder SID bitrate is assigned to one core-decoder while a bitrate of 0 kbps is assigned to the other core-decoders. Obviously, in case of a NO_DATA frame, the bitrate of 0 kbps is assigned to all core-decoders.

3.4 Decoding of Spatial Information

[00108] Referring to Figure 4, the ISM decoding method 450 comprises an operation of decoding of spatial information 458. The ISM decoder 400 comprises a spatial information decoding module 408 to perform operation 458.

[00109] Operation 458 is responsible for decoding of the spatial information 409 used in parametric ISM techniques or in SID frames. The decoding 458 is the inverse of the coding 180 of Figure 1.

3.5 Core-Decoding

[00110] Still referring to Figure 4, the ISM decoding method 450 comprises an operation of core-decoding 460. To perform the operation 460, the ISM decoder 400 comprises a decoder of the N audio streams 410 including a number N of core-decoders 410, for example N fluctuating bitrate core-decoders.

[00111] The N audio streams 114 from the demultiplexer 401 are decoded, for example sequentially decoded, in the number N of fluctuating bitrate core-decoders 410 at their respective core-decoder bitrates 411 as determined by the bit-budget allocator 407. When the number of decoded audio objects, M, as requested by the output set-up 404 is lower than the number of transport channels, i.e. M < N, a lower number of core-decoders 410 is used. Similarly, not all metadata payloads may be decoded in such a case.

[00112] In response to the N audio streams 114 from the demultiplexer 401, the core-decoder bitrates 411 as determined by the bit-budget allocator 407, and the output set-up 404, the core-decoders 410 produce a number M of decoded audio streams 412 on respective M transport channels.

[00113] In case of the SID frame, one audio stream 114 from the demultiplexer 401 is decoded and fed to one of the core-decoders 410. The other core-decoders 410 are fed by the SID of the first core-encoder and the spatial information 409.

3.6 Audio Channels Rendering

[00114] The ISM decoding method 450 may comprise an operation of audio channels rendering 463. To perform operation 463, the ISM decoder 400 comprises a renderer 413 of audio objects.

[00115] The renderer 413 transforms the M decoded metadata 405 and the M decoded audio streams 412 into a number of output audio channels 414, taking into consideration an output set-up 415 indicative of the number and contents of the output audio channels to be produced. Again, the number of output audio channels 414 may be equal to or different from the number M.

[00116] The renderer 413 may be designed in a variety of different structures to obtain the desired output audio channels. For that reason, the renderer is not further described in the present disclosure.
4. Example Configuration of Hardware Components

[00117] Figure 6 is a simplified block diagram of an example configuration of hardware components forming (a) the method and device for discontinuous transmission (DTX) of audio objects in an object-based audio codec, (b) the ISM encoding method and encoder, (c) the method and device for decoding audio objects during discontinuous transmission (DTX) operation, and/or (d) the ISM decoding method and decoder (hereinafter collectively the “encoding and decoding methods and devices”).

[00118] The encoding and decoding methods and devices may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The encoding and decoding methods and devices (identified as 600 in Figure 6) comprise an input 602, an output 603, a processor 601 and a memory 604.

[00119] The input 602 is configured to receive the input signal(s). The output 603 is configured to supply the output signal(s). The input 602 and the output 603 may be implemented in a common module, for example a serial input/output device.

[00120] The processor 601 is operatively connected to the input 602, to the output 603, and to the memory 604. The processor 601 is realized as one or more processors for executing code instructions in support of the functions of the various elements and operations of the above described encoding and decoding methods and devices as shown in the accompanying figures and/or as described in the present disclosure.

[00121] The memory 604 may comprise a non-transient memory for storing code instructions executable by the processor 601, specifically, a processor-readable memory storing non-transitory instructions that, when executed, cause a processor to implement the elements and operations of the encoding and decoding methods and devices. The memory 604 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 601.

[00122] Those of ordinary skill in the art will realize that the description of the encoding and decoding methods and devices is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed encoding and decoding methods and devices may be customized to offer valuable solutions to existing needs and problems of encoding and decoding audio signals.

[00123] In the interest of clarity, not all of the routine features of the implementations of the encoding and decoding methods and devices are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the encoding and decoding methods and devices, numerous implementation-specific decisions may need to be made in order to achieve the developer’s specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of audio processing having the benefit of the present disclosure.
[00124] In accordance with the present disclosure, the elements, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.

[00125] Elements and processing operations of the encoding and decoding methods and devices as described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.

[00126] In the encoding and decoding methods and devices, the various processing operations and sub-operations may be performed in various orders and some of the processing operations and sub-operations may be optional.

[00127] Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.

5. References

[00128] The present disclosure mentions the following references, of which the full content is incorporated herein by reference:

[1] 3GPP TS 26.445, v.17.0.0, “Codec for Enhanced Voice Services (EVS); Detailed Algorithmic Description”, April 2022.

[2] 3GPP TS 26.250, v.1.0.0, “Codec for Immersive Voice and Audio Services - General overview”, September 2023.

[3] V. Eksler, “Method and System for Coding Metadata in Audio Streams and for Flexible Intra-Object and Inter-Objects Bitrate Adaptation,” Patent Publication US 2022/0238127 A1, July 28, 2022.

[4] 3GPP SA4 contribution S4-231233, “High-Level Description of the IVAS Codec Candidate of the “IVAS Codec Public Collaboration””, SA4 meeting #125, August 21-25, 2023, https://www.3gpp.org/ftp/Meetings_3GPP_Sync/SA4/Docs/S4-231233.zip

[5] 3GPP SA4 contribution S4-180462, “On spatial metadata for IVAS spatial audio input format”, SA4 meeting #98, April 9-13, 2018, https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_98/Docs/S4-180462.zip

SOURCE CODE

[00129] The code excerpts implementing the present disclosure in the IVAS audio codec framework are as follows.

/*---------------------------------------------------------- ---------* * ivas_ism_enc() * * ISM Metadata + CoreCoders encoding routine *----------------------------------------------------------- --------*/ ivas_ism_enc( ) { ... /*---------------------------------------------------------- --------* * DTX analysis *----------------------------------------------------------- ------*/ if ( st_ivas->hEncoderConfig->Opt_DTX_ON ) { /* analysis and decision about DTX */ dtx_flag = ivas_ism_dtx_enc( ... ); } /*---------------------------------------------------------- --------* * Analysis of objects, config.
and decision about brates per channel * Metadata quantization and encoding *----------------------------------------------------------- ------*/ if ( dtx_flag ) { ivas_ism_metadata_sid_enc( ... ); } else if ( st_ivas->ism_mode == ISM_MODE_PARAM ) { ivas_ism_compute_noisy_speech_flag( ... ); ivas_ism_metadata_enc( ... ); } else /* ISM_MODE_DISC */ { ivas_ism_metadata_enc( ... ); } update_last_metadata( ... ); /*---------------------------------------------------------- ------* * Write IVAS format signaling in SID frames *----------------------------------------------------------- -----*/ if ( sid_flag ) { ivas_write_format_sid ... ); } ... } /*---------------------------------------------------------- ---------* * ivas_ism_get_dtx_enc() * * Analysis and decision about DTX in ISM format *----------------------------------------------------------- --------*/ /*! r: indication of DTX frame */ int16_t ivas_ism_dtx_enc( ISM_DTX_HANDLE hISMDTX, SCE_ENC_HANDLE hSCE[MAX_SCE], const int16_t num_obj, const int16_t nchan_transport, int16_t vad_flag[MAX_NUM_OBJECTS], ISM_METADATA_HANDLE hIsmMeta[], int16_t md_diff_flag[], int16_t *sid_flag ) { ... /*---------------------------------------------------------- --------* * compute global ISM DTX flag *----------------------------------------------------------- ------*/ /* compute global ISM based on localVAD */ dtx_flag = 1; for ( ch = 0; ch < num_obj; ch++ ) { dtx_flag &= !vad_flag[ch]; } /* compute global ISM based on long-term background noise */ /* one of the channels is active -> no DTX */ for ( ch = 0; ch < num_obj; ch++ ) { lp_noise[ch] = hSCE[ch]->hCoreCoder[0]->lp_noise; } noise_var = var( lp_noise, num_obj ); noise_mean = mean( lp_noise, num_obj ); if( noise_mean > BETA1 || (noise_mean > BETA2 && noise_var > BETA3)) { dtx_flag = 0; } /*---------------------------------------------------------- --------* * Reset the bit-stream *----------------------------------------------------------- ------*/ if ( dtx_flag ) { reset_indices_enc( hSCE[0]->hCoreCoder[0]->hBstr, MAX_NUM_IND ); } /*---------------------------------------------------------- --------* * decide about SID metadata to be sent or not (per object) * estimate the MD bit-budget consumption *----------------------------------------------------------- ------*/ if ( dtx_flag ) { ivas_get_ism_sid_quan_bitbudget( num_obj, &nBits_azimuth, &nBits_elevation, &nBits_ener, &nBits_coh ); nBits = 0; for ( ch = 0; ch < num_obj; ch++ ) { /* check difference between current and last metadata */ md_diff_flag[ch] = 0; if ( fabsf( hIsmMeta[ch]->azimuth - hIsmMeta[ch]- >last_azimuth ) > MD_MAX_DIFF_AZIMUTH ) { md_diff_flag[ch] = 1; } if ( fabsf( hIsmMeta[ch]->elevation - hIsmMeta[ch]- >last_elevation ) > MD_MAX_DIFF_ELEVATION ) { md_diff_flag[ch] = 1; } /* estimate SID metadata bit-budget */ nBits++; /* number of objects */ nBits++; /* SID metadata flag */ if ( md_diff_flag[ch] == 1 ) { nBits += nBits_azimuth; nBits += nBits_elevation; } } /* calculate maximum available MD bit-budget */ nBits_MD_max = ( IVAS_SID_5k2 - SID_2k40 ) / FRAMES_PER_SEC; nBits_MD_max -= SID_FORMAT_NBITS; for ( ch = 0; ch < nchan_transport - 1; ch++ ) { nBits_MD_max -= nBits_ener; nBits_MD_max -= nBits_coh; } /* too many metadata bits -> switch to active coding */ if ( nBits > nBits_MD_max ) { dtx_flag = 0; } } /*---------------------------------------------------------- --------* * set core_brate for all channels * get 'sid_flag' value *----------------------------------------------------------- ------*/ *sid_flag = 0; if ( !dtx_flag 
) { /* at least one of the channels is active -> no DTX */ for ( ch = 0; ch < num_obj; ch++ ) { hSCE[ch]->hCoreCoder[0]->core_brate = -1; } hISMDTX->cnt_SID_ISM = -1; /* IVAS format signaling was erased in dtx() */ if ( hSCE[0]->hCoreCoder[0]->hBstr->nb_bits_tot == 0 ) { push_indice( hSCE[0]->hCoreCoder[0]->hBstr, IND_IVAS_FORMAT, 2 /* == ISM format */, IVAS_FORMAT_SIGNALING_NBITS ); } } else /* ism_dtx_flag == 1 */ { for ( ch = 0; ch < num_obj; ch++ ) { hSCE[ch]->hCoreCoder[0]->cng_type = FD_CNG; } /* * update the global SID counter */ hISMDTX->cnt_SID_ISM++; if ( hISMDTX->cnt_SID_ISM >= hSCE[0]->hCoreCoder[0]->hDtxEnc- >max_SID ) { /* adaptive SID update interval */ hSCE[0]->hCoreCoder[0]->hDtxEnc->max_SID = hSCE[0]- >hCoreCoder[0]->hDtxEnc->interval_SID; hISMDTX->cnt_SID_ISM = 0; } /* encode SID in one channel only */ for ( ch = 0; ch < num_obj; ch++ ) { hSCE[ch]->hCoreCoder[0]->core_brate = FRAME_NO_DATA; } if ( hISMDTX->cnt_SID_ISM == 0 ) { hSCE[hISMDTX->sce_id_dtx]->hCoreCoder[0]->core_brat e = SID_2k40; *sid_flag = 1; } } if ( dtx_flag == 1 && *sid_flag == 0 ) { set_s( md_diff_flag, 0, num_obj ); } return dtx_flag; } /*---------------------------------------------------------- ---------* * ivas_ism_metadata_sid_enc() * * Quantize and encode ISM metadata in SID frame *----------------------------------------------------------- --------*/ void ivas_ism_metadata_sid_enc( ISM_DTX_HANDLE hISMDTX, const int16_t num_obj, const int16_t nchan_transport, ISM_METADATA_HANDLE hIsmMeta[], const int16_t sid_flag, const int16_t md_diff_flag[], BSTR_ENC_HANDLE hBstr, int16_t nb_bits_metadata[] ) { ... if ( sid_flag ) { nBits = ( IVAS_SID_5k2 - SID_2k40 ) / FRAMES_PER_SEC; nBits -= SID_FORMAT_NBITS; nBits_start = hBstr->nb_bits_tot; /*---------------------------------------------------------- -* * Write ISm common signaling *----------------------------------------------------------- */ /* write number of objects - unary coding */ for ( ch = 1; ch < num_obj; ch++ ) { push_indice( hBstr, IND_ISM_NUM_OBJECTS, 1, 1 ); } push_indice( hBstr, IND_ISM_NUM_OBJECTS, 0, 1 ); /* write SID metadata flag (one per object) */ for ( ch = 0; ch < num_obj; ch++ ) { push_indice( hBstr, IND_ISM_METADATA_FLAG, md_diff_flag[ch], 1 ); } /*---------------------------------------------------------- -* * Set quantization bits based on the number of coded objects *----------------------------------------------------------- */ low_res_q = ivas_get_ism_sid_quan_bitbudget( num_obj, &nBits_azimuth, &nBits_elevation, &nBits_ener, &nBits_coh ); /*---------------------------------------------------------- -* * Spatial parameters, loop over TCs - 1 *----------------------------------------------------------- */ for ( ch = 0; ch < nchan_transport - 1; ch++ ) { /* quantize and write energy ratio*/ idx = (int16_t) ( hISMDTX->ene_ratio[ch] * ( ( 1 << nBits_ener ) - 1 ) + 0.5f ); push_indice( hBstr, IND_ISM_DTX_ENER, idx, nBits_ener ); /* quantize and write coherence */ idx = (int16_t) ( hISMDTX->coh[ch] * ( ( 1 << nBits_coh ) - 1 ) + 0.5f ); push_indice( hBstr, IND_ISM_DTX_COH_SCA, idx, nBits_coh ); } /*---------------------------------------------------------- -* * Metadata quantization and coding, loop over all objects *----------------------------------------------------------- */ for ( ch = 0; ch < num_obj; ch++ ) { if ( md_diff_flag[ch] == 1 ) { hIsmMetaData = hIsmMeta[ch]; if ( low_res_q ) { ivas_ism_quantize_dtx_low_res( hIsmMetaData->azimuth, hIsmMetaData->elevation, nBits_azimuth, nBits_elevation, &idx_azimuth, &idx_elevation ); 
} else { idx_azimuth = ism_quant_meta( hIsmMetaData->azimuth, &valQ, ism_azimuth_borders, 1 << ISM_AZIMUTH_NBITS ); idx_elevation = ism_quant_meta( hIsmMetaData->elevation, &valQ, ism_elevation_borders, 1 << ISM_ELEVATION_NBITS ); } push_indice( hBstr, IND_ISM_AZIMUTH, idx_azimuth, nBits_azimuth ); push_indice( hBstr, IND_ISM_ELEVATION, idx_elevation, nBits_elevation ); hIsmMetaData->last_azimuth_idx = idx_azimuth; hIsmMetaData->last_elevation_idx = idx_elevation; } } /* Write unused (padding) bits */ nBits_unused = nBits - hBstr->nb_bits_tot; while ( nBits_unused > 0 ) { i = min( nBits_unused, 16 ); push_indice( hBstr, IND_UNUSED, 0, i ); nBits_unused -= i; } nb_bits_metadata[0] = hBstr->nb_bits_tot - nBits_start; } return; } /*---------------------------------------------------------- -----------* * ivas_dec() * * Principal IVAS decoder routine *----------------------------------------------------------- ----------*/ ivas_error ivas_dec( ) { ... else if ( st_ivas->ivas_format == ISM_FORMAT ) { /* Metadata decoding and configuration */ if ( ivas_total_brate == IVAS_SID_5k2 || ivas_total_brate == FRAME_NO_DATA ) { ivas_ism_dtx_dec( st_ivas, nb_bits_metadata ); } else if ( st_ivas->ism_mode == ISM_MODE_PARAM ) { ivas_ism_metadata_dec( ... ); } else /* ISM_MODE_DISC */ { ivas_ism_metadata_dec( ... ); } ... } /*---------------------------------------------------------- ---------* * ivas_ism_dtx_dec() * * ISM DTX Metadata decoding routine *----------------------------------------------------------- --------*/ ivas_error ivas_ism_dtx_dec( Decoder_Struct *st_ivas, /* i/o: IVAS decoder structure */ int16_t *nb_bits_metadata /* o : number of metadata bits */ ) { ... /* read number of objects */ if ( !st_ivas->bfi && ivas_total_brate == IVAS_SID_5k2 ) { if ( st_ivas->ism_mode == ISM_MODE_PARAM ) { num_obj_prev = st_ivas->hDirAC->hParamIsm->num_obj; } else /* ism_mode == ISM_MODE_DISC */ { num_obj_prev = st_ivas->nchan_transport; } num_obj = 1; pos = (int16_t) ( ( ivas_total_brate / FRAMES_PER_SEC ) - 1 - SID_FORMAT_NBITS ); while ( get_indice( st_ivas->hSCE[0]->hCoreCoder[0], pos, 1 ) == 1 && num_obj < MAX_NUM_OBJECTS ) { ( num_obj )++; pos--; } ...
ivas_ism_dec_config( st_ivas, num_obj ); for ( ch = 0; ch < st_ivas->nchan_transport; ch++ ) { st_ivas->hSCE[ch]->hCoreCoder[0]->cng_paramISM_flag = 1; } } else { if ( st_ivas->ism_mode == ISM_MODE_PARAM ) { num_obj = st_ivas->hDirAC->hParamIsm->num_obj; } else /* ism_mode == ISM_MODE_DISC */ { num_obj = st_ivas->nchan_transport; } } /* Metadata decoding and dequantization */ ivas_ism_metadata_sid_dec( st_ivas->hSCE, ivas_total_brate, st_ivas- >bfi, num_obj, st_ivas->nchan_transport, st_ivas->hIsmMetaData, nb_bits_metadata ); set_s( md_diff_flag, 1, num_obj ); update_last_metadata( st_ivas->nchan_transport, st_ivas- >hIsmMetaData, md_diff_flag ); /* set core_brate for all channels */ for ( ch = 0; ch < num_obj; ch++ ) { st_ivas->hSCE[ch]->hCoreCoder[0]->core_brate = FRAME_NO_DATA; } if ( ivas_total_brate == IVAS_SID_5k2 ) { st_ivas->hSCE[0]->hCoreCoder[0]->core_brate = SID_2k40; } for ( ch = 1; ch < st_ivas->nchan_transport; ch++ ) { nb_bits_metadata[ch] = nb_bits_metadata[0]; } return IVAS_ERR_OK; } /*---------------------------------------------------------- ---------* * ivas_ism_metadata_sid_dec() * * Decode ISM metadata in SID frame *----------------------------------------------------------- --------*/ ivas_error ivas_ism_metadata_sid_dec( SCE_DEC_HANDLE hSCE[MAX_SCE], const int32_t ism_total_brate, const int16_t bfi, const int16_t num_obj, const int16_t nchan_transport, ISM_METADATA_HANDLE hIsmMeta[], int16_t nb_bits_metadata[] ) { ... dtx_hangover_cnt = 0; if ( ism_total_brate == FRAME_NO_DATA ) { ism_metadata_smooth( hIsmMeta, ism_total_brate, num_obj ); return IVAS_ERR_OK; } /* initialization */ st0 = hSCE[0]->hCoreCoder[0]; nb_bits_start = 0; last_bit_pos = (int16_t) ( ( ism_total_brate / FRAMES_PER_SEC ) - 1 - SID_FORMAT_NBITS ); bstr_orig = st0->bit_stream; next_bit_pos_orig = st0->next_bit_pos; st0->next_bit_pos = 0; /* reverse the bit-stream for easier reading of indices */ for ( i = 0; i < min( MAX_BITS_METADATA, last_bit_pos ); i++ ) { bstr_meta[i] = st0->bit_stream[last_bit_pos - i]; } st0->bit_stream = bstr_meta; st0->total_brate = ism_total_brate; /* needed for BER detection in get_next_indice() */ if ( !bfi ) { /* take into account padding bits as metadata bits to keep later bitrate checks valid */ nb_bits_metadata[0] = ( IVAS_SID_5k2 - SID_2k40 ) / FRAMES_PER_SEC; /*---------------------------------------------------------- -* * ISm common signaling *----------------------------------------------------------- */ /*number of objects was already read in ivas_ism_get_dtx_dec()*/ /* update the position in the bit-stream */ st0->next_bit_pos += num_obj; /* read SID metadata flag( one per object ) */ for ( ch = 0; ch < num_obj; ch++ ) { md_diff_flag[ch] = get_next_indice( st0, 1 ); } /*---------------------------------------------------------- -* * Set quantization bits based on the number of coded objects *----------------------------------------------------------- */ low_res_q = ivas_get_ism_sid_quan_bitbudget( num_obj, &nBits_azimuth, &nBits_elevation, &nBits_ener, &nBits_coh ); /*---------------------------------------------------------- -* * Spatial parameters, loop over TCs - 1 *----------------------------------------------------------- */ if ( nchan_transport > 1 ) { total_scaling = 0.0f; for ( ch = 0; ch < nchan_transport - 1; ch++ ) { /* decode the energy ratio */ idx = get_next_indice( st0, nBits_ener ); hSCE[ch]->hCoreCoder[0]->hFdCngDec->hFdCngCom->s caling = (float) ( idx ) / (float) ( ( 1 << ISM_DTX_ENER_BITS ) - 1 ); total_scaling += 
hSCE[ch]->hCoreCoder[0]->hFdCngDec- >hFdCngCom->scaling; /* decode the coherence */ idx = get_next_indice( st0, nBits_coh ); hSCE[ch]->hCoreCoder[0]->hFdCngDec->hFdCngCom->c oherence = (float) ( idx ) / (float) ( ( 1 << ISM_DTX_COH_SCA_BITS ) - 1 ); } /* rearrange to obtain proper values */ total_scaling /= ( nchan_transport - 1 ); hSCE[ch]->hCoreCoder[0]->hFdCngDec->hFdCngCom->s caling = 1.0f - total_scaling; for ( ch = nchan_transport - 1; ch > 0; ch-- ) { hSCE[ch]->hCoreCoder[0]->hFdCngDec->hFdCngCom->c oherence = hSCE[ch - 1]->hCoreCoder[0]->hFdCngDec->hFdCngCom->coheren ce; } } else { hSCE[0]->hCoreCoder[0]->hFdCngDec->hFdCngCom->sc aling = 1.0f; } /*---------------------------------------------------------- -* * Metadata decoding and dequantization, loop over all objects *----------------------------------------------------------- */ for ( ch = 0; ch < num_obj; ch++ ) { hIsmMetaData = hIsmMeta[ch]; if ( md_diff_flag[ch] == 1 ) { if ( low_res_q ) { idx_azimuth = get_next_indice( st0, nBits_azimuth ); idx_elevation = get_next_indice( st0, nBits_elevation ); ivas_ism_dec_dequantize_dtx_low_res( ... ); } else { /* Azimuth decoding */ idx_azimuth = get_next_indice( st0, nBits_azimuth ); /* azimuth is on a circle - check for diff coding for -180° -> 180° and vice versa changes */ if ( idx_azimuth > ( 1 << ISM_AZIMUTH_NBITS ) - 1 ) { idx_azimuth -= ( 1 << ISM_AZIMUTH_NBITS ) - 1; /* +180° -> -180° */ } else if ( idx_azimuth < 0 ) { idx_azimuth += ( 1 << ISM_AZIMUTH_NBITS ) - 1; /* -180° -> +180° */ } /* +180° == -180° */ if ( idx_azimuth == ( 1 << ISM_AZIMUTH_NBITS ) - 1 ) { idx_azimuth = 0; } /* sanity check in case of FER or BER */ if ( idx_azimuth < 0 || idx_azimuth > ( 1 << ISM_AZIMUTH_NBITS ) - 1 ) { idx_azimuth = hIsmMetaData->last_azimuth_idx; } hIsmMetaData->azimuth = ism_dequant_meta( idx_azimuth, ism_azimuth_borders, 1 << ISM_AZIMUTH_NBITS ); /* Elevation decoding */ idx_elevation = get_next_indice( st0, nBits_elevation ); /* sanity check in case of FER or BER */ if ( idx_elevation < 0 || idx_elevation > ( 1 << ISM_ELEVATION_NBITS ) - 1 ) { idx_elevation = hIsmMetaData->last_elevation_idx; } /* Elevation dequantization */ hIsmMetaData->elevation = ism_dequant_meta( idx_elevation, ism_elevation_borders, 1 << ISM_ELEVATION_NBITS ); } hIsmMetaData->last_azimuth_idx = idx_azimuth; hIsmMetaData->last_elevation_idx = idx_elevation; /* save for smoothing metadata evolution */ hIsmMetaData->last_true_azimuth = hIsmMetaData->azimuth; hIsmMetaData->last_true_elevation = hIsmMetaData- >elevation; } } /* set the bit-stream pointer to its original position */ st0->bit_stream = bstr_orig; st0->next_bit_pos = next_bit_pos_orig; } /* smooth the metadata evolution */ ism_metadata_smooth( hIsmMeta, ism_total_brate, num_obj ); return IVAS_ERR_OK; } /*---------------------------------------------------------- ---------* * ism_metadata_smooth() * * Smooth the metadata evolution *----------------------------------------------------------- --------*/ static void ism_metadata_smooth( ISM_METADATA_HANDLE hIsmMeta[], const int32_t ism_total_brate, const int16_t num_obj ) { ISM_METADATA_HANDLE hIsmMetaData; int16_t ch; float diff; for ( ch = 0; ch < num_obj; ch++ ) { hIsmMetaData = hIsmMeta[ch]; /* smooth azimuth */ diff = hIsmMetaData->last_true_azimuth - hIsmMetaData- >last_azimuth; if ( diff > ISM_AZIMUTH_MAX ) { diff -= ( ISM_AZIMUTH_MAX - ISM_AZIMUTH_MIN ); hIsmMetaData->last_azimuth += ( ISM_AZIMUTH_MAX - ISM_AZIMUTH_MIN ); } ( hIsmMetaData->azimuth = hIsmMetaData->last_true_azimuth; } if ( 
hIsmMetaData->azimuth > ISM_AZIMUTH_MAX ) { hIsmMetaData->azimuth -= ( ISM_AZIMUTH_MAX - ISM_AZIMUTH_MIN ); } /* smooth elevation */ diff = hIsmMetaData->last_true_elevation - hIsmMetaData- >last_elevation; if ( ism_total_brate > IVAS_SID_5k2 && diff > IVAS_ISM_DTX_HO_MAX * CNG_MD_MAX_DIFF_ELEVATION ) { /* skip the smoothing */ } else if ( fabsf( diff ) > CNG_MD_MAX_DIFF_ELEVATION ) { hIsmMetaData->elevation = hIsmMetaData->last_elevation + sign( diff ) * CNG_MD_MAX_DIFF_ELEVATION; } } return; } /*---------------------------------------------------------- ------* * ivas_get_ism_sid_quan_bitbudget() * * Set quantization bits based on the number of coded objects *----------------------------------------------------------- -----*/ /*! r: low resolution flag */ int16_t ivas_get_ism_sid_quan_bitbudget( const int16_t num_obj, /* i : number of objects */ int16_t *nBits_azimuth, /* o : number of Q bits for azimuth */ int16_t *nBits_elevation, /* o : number of Q bits for elevation */ int16_t *nBits_ener, /* o : number of Q bits for energy */ int16_t *nBits_coh /* o : number of Q bits for coherence */ ) { int16_t low_res_q; low_res_q = 0; *nBits_azimuth = ISM_AZIMUTH_NBITS; *nBits_elevation = ISM_ELEVATION_NBITS; }
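/*----------------------------------------------------------------------*
 * Illustrative sketch (added for clarity, not part of the original
 * IVAS source): the excerpt of ivas_get_ism_sid_quan_bitbudget() above
 * is reproduced only partially. Based solely on Paragraph [0068]
 * (Baz = 8 / Bel = 7 bits for one or two audio objects, Baz = 6 /
 * Bel = 5 bits for three or four audio objects), a self-contained
 * example of such a bit-budget selection could look as follows. The
 * function name, the branch condition and the hard-coded values are
 * assumptions; the actual IVAS implementation may differ.
 *----------------------------------------------------------------------*/
static int16_t example_sid_quan_bitbudget(
    const int16_t num_obj,       /* i : number of coded audio objects */
    int16_t *nBits_azimuth,      /* o : azimuth quantization bits     */
    int16_t *nBits_elevation )   /* o : elevation quantization bits   */
{
    int16_t low_res_q = 0;

    /* default (high) resolution for one or two audio objects */
    *nBits_azimuth = 8;
    *nBits_elevation = 7;

    if ( num_obj >= 3 )
    {
        /* lower resolution when three or four audio objects are coded */
        low_res_q = 1;
        *nBits_azimuth = 6;
        *nBits_elevation = 5;
    }

    return low_res_q;
}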