[music-dsp] Audio codec

Sampo Syreeni decoy at iki.fi
Tue Feb 17 20:41:19 EST 2004

On 2004-02-17, Joshua Scholar uttered:

>SBR is pretty much an aural exciter (perhaps with a few bits set aside
>to suppress it when it does more harm than good), right?

Not really -- in actuality SBR is a fairly nuanced technique which
constructs approximate high-frequency spectral envelopes with minimal
out-of-band data, with some help from the actual perceptually coded
band. I'll (mostly) adjoin a post I sent on the subject to the sursound
list, a while ago. (The text was written in a hurry and I have a
propensity for intuitive leaps, so...)

In case anybody's interested, I went on a fishing expedition this
morning, and came up with the following references to SBR. (aacPlus is
AAC+SBR, MP3Pro is MP3+SBR. In MPEG-4 documents ISO talks about
bandwidth extension, or BWE.)


[...] SBR isn't based on a sink model at all, but a source one. The
emphasis is on noise substitution and the approximate resynthesis of
high frequency spectral envelopes, not on perceptual coding fodder like
masking calculations. In the remainder I'll attempt to outline the

SBR is aimed specifically at high frequency enhancement in situations
where the band would otherwise have a hard upper limit. The task is to
avoid the sense of muffling which comes with such limits, without actual
reconstruction of the high band. This is achieved by expressing the
tonal parts in the high range as a function of the lower end partials,
substituting generic white noise for the residue, adding expressly
synthesized high frequency sinusoids when they aren't generated by the
first part, and finally applying adaptive gain to each frequency band so
that the original high frequency envelope is regenerated. The envelope
data used is heavily aggregated across both time and frequency, so it
takes little space (2 kbps is typical). The result is amazingly similar
to HILN, except that a large part of the tonal content is derived from
an existing signal instead of being synthesized from individual
sinusoids. The actual steps taken by the SBR decoder are:

 1) SBR uses a highly adaptable mechanism to vary the time-frequency
    tradeoff of its control data, and to group adjacent channels together
    (remember, we're aiming at an approximate envelope, not a real
    reconstruction), so the first thing to do is to derive the grid from
    the control data. Time resolution of course varies inversely with
    the frequency one, so the latter is apparently derived from the
    former. Temporal placing of the stuff generated below is controlled by
    the temporal grid derived here, but also by transient locations as
    detected by the main AAC codec. It seems SBR envelope control frame
    boundaries do not need to align with the filterbank blocks, either.
 2) Up to five blocks of low frequency channels are copied to the high
    end and spectrally flattened by inverse filtering with an LPC derived
    from the low end. The process is used to extend any harmonic series
    present in the low end, but so that at first, the newly generated
    partials have approximately equal amplitude.
 3) Noise scalefactors sent by the encoder are used to substitute an
    appropriate amount of white noise into the high end. The idea is,
    noise scalefactors roughly set the tonal-to-noise balance over each
    generated band, and the patching process marks where the tonal
    parts should be.
 4) The variable time-frequency tradeoff and the heterogeneous
    architecture make it somewhat difficult to keep track of how much
    energy we already added. We expressly estimate the energy and limit it
    in each band, compensating for the losses induced by the limiter.
    After this, we decode the aggregate spectral envelopes (remember,
    multiple analysis bands per envelope because this is an
    approximation), smooth the current energy and envelope scalefactors
    sent by the encoder and use the resulting gain contours to recreate the
    time-variable high frequency spectral envelope.
 5) Finally, extra sinusoids are generated if the encoder tells us so.
    Since the envelope already tells us what the amplitude should be at
    each frequency, the encoder only sends a command to generate a
    sinusoid at a given band, and we use the above energy estimates and
    envelope data to adjust its amplitude. This is done because not every
    strong partial can be synthesized from the low band.
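
To make steps 2 through 5 a bit more concrete, here is a toy sketch in
Python. It is not the real decoder: it works on one time slot of
real-valued subband samples instead of a complex QMF bank, it replaces
LPC inverse filtering with a plain per-band energy normalisation, and it
folds the sinusoid in before the final gain instead of after it. The
function name, its parameters and the band layout (contiguous bands
starting at channel 0 of the high range) are all made up for
illustration.

```python
import math
import random


def regenerate_high_band(low, bands, env_energy, noise_ratio, sine_flags, seed=0):
    """Toy single-slot high-band regeneration (hypothetical interface).

    low         -- low-band subband samples, one per channel
    bands       -- contiguous (start, stop) channel ranges in the high band
    env_energy  -- target mean energy for each range (sent by the encoder)
    noise_ratio -- noise-to-tonal energy ratio for each range
    sine_flags  -- one bool per high-band channel: synthesize a sinusoid
    """
    rng = random.Random(seed)
    n_high = bands[-1][1]
    # Step 2: copy low-band channels upward, wrapping if the high band is wider.
    patch = [low[k % len(low)] for k in range(n_high)]
    high = []
    for (start, stop), target, q in zip(bands, env_energy, noise_ratio):
        seg = patch[start:stop]
        # Stand-in for spectral flattening: scale the patch to unit mean energy.
        e_seg = sum(x * x for x in seg) / len(seg)
        scale = 1.0 / math.sqrt(e_seg) if e_seg > 0.0 else 0.0
        # Split unit energy between tonal and noise parts in the ratio 1 : q.
        tonal_amp = math.sqrt(1.0 / (1.0 + q))
        noise_amp = math.sqrt(q / (1.0 + q))
        out = []
        for k, x in zip(range(start, stop), seg):
            # Step 5 (simplified): a flagged channel gets a unit sinusoid
            # in place of the patched partial.
            tonal = 1.0 if sine_flags[k] else x * scale
            # Step 3: mix in white noise according to the noise ratio.
            out.append(tonal_amp * tonal + noise_amp * rng.gauss(0.0, 1.0))
        # Step 4: estimate the energy actually generated, then apply the
        # gain that makes this range hit the transmitted envelope energy.
        e_out = sum(y * y for y in out) / len(out)
        gain = math.sqrt(target / e_out) if e_out > 0.0 else 0.0
        high.extend(gain * y for y in out)
    return high


low_band = [1.0, 0.5, 0.25, 0.125]
high_band = regenerate_high_band(
    low_band,
    bands=[(0, 3), (3, 6)],
    env_energy=[0.04, 0.01],
    noise_ratio=[0.5, 2.0],
    sine_flags=[False] * 5 + [True],
)
```

The point of the sketch is only the division of labour: the patch
supplies the tonal structure, the noise ratio sets the tonal-to-noise
balance, and the envelope energy alone decides the final level in each
range. No limiter or smoothing is modelled here.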

Scalefactors are sent delta encoded either over time or over frequency,
quantized and Huffman coded. Coupled channels are uncoupled before
processing, and each channel is treated more or less separately. Added
sinusoids are indicated by a bitmap with a single bit per synthesis
filterbank band. Frequency band mappings are derived from parameters
with a rather lengthy calculation. Otherwise the coding seems to be a
rather simple bitstream format.
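
The delta coding of scalefactors over time or over frequency can be
illustrated with a small Python sketch. It assumes integer scalefactor
indices and the same number of bands in consecutive envelopes, and it
leaves out quantization and the Huffman stage entirely; the function
names are hypothetical and do not come from the standard.

```python
def encode_scalefactors(values, previous=None):
    """Delta-encode one envelope of integer scalefactors.

    previous is None -> delta over frequency (first value sent as-is,
                        each later band relative to the band below it)
    previous given   -> delta over time (each band relative to the same
                        band in the previous envelope)
    """
    if previous is None:
        return [v - u for u, v in zip([0] + values[:-1], values)]
    return [v - p for p, v in zip(previous, values)]


def decode_scalefactors(deltas, previous=None):
    """Invert encode_scalefactors for the same choice of direction."""
    if previous is None:
        out, acc = [], 0
        for d in deltas:
            acc += d  # running sum across frequency
            out.append(acc)
        return out
    return [p + d for p, d in zip(previous, deltas)]


prev_env = [19, 22, 22, 18]
this_env = [20, 22, 21, 18]
over_freq = encode_scalefactors(this_env)            # [20, 2, -1, -3]
over_time = encode_scalefactors(this_env, prev_env)  # [1, 0, -1, 0]
```

For a slowly varying envelope the deltas over time cluster around zero,
while after a transient the deltas over frequency are usually the
smaller set, which is presumably why the direction is selectable.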

Sampo Syreeni, aka decoy - mailto:decoy at iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
