[music-dsp] Perceptual "masking" as used in MP3 etc?

Sampo Syreeni decoy at iki.fi
Sat Apr 10 18:55:45 EDT 2004


On 2004-04-10, Richard (UK) uttered:

>I have seen that there are acoustic features of "masking" where: [...]

As I see masking, the three main components are a bit different:

a) each tone competes for being heard; in the spectral domain each tone
   will mask other tones progressively less with increasing frequency
   separation; psychoacoustics gives us more or less constant
   masking profiles/functions on the Bark scale
b) wideband noise behaves a bit differently, which is why we have
   tone-masking-noise and noise-masking-tone charts and tonality
   calculations in perceptual codecs
c) temporal masking stretches roughly exponentially both backwards and
   forwards, but the fall-off is about ten times faster backwards in time

That's a very rough outline, of course...

>My question is: which of these masking effects is used by the MP3 and
>similar compression schemes?

That's a difficult one, because MP3 and all of the other serious
perceptual coding formats are defined as recipes for how to decode.
There is no normative encoder, so the encoders might employ whatever
algorithms they choose.

Current MP3 encoders employ all of the above-mentioned masking criteria.

>And how the heck do they do it?

a) the coder calculates a total perceptual masking threshold from all
   of the sound that is going on in the current frame and quantizes the
   signal in each band so that the noise power induced by
   the quantization distortion is as far below the threshold as possible
b) the coder calculates tonality indices for each band and
   chooses the masking contour based on that; noisy bands lead to higher
   masking thresholds all around
c) the coder switches transform block sizes if there is a transient
   around; typically the coder works with two different transform sizes,
   so that tonal content gets handled with longer blocks and transients
   get the shorter ones; this is basically a heuristic aimed at
   reducing pre-echo, that is, coding noise spreading ahead of a
   transient, where the weak backward temporal masking characteristic
   of our hearing cannot hide it
c1) in the most sophisticated systems such as AAC, we actually exploit
   the time-frequency tradeoff explicitly; when there are transients, we
   actually expect the spectrum to be smooth, and encode it via LPC
   applied across frequency (what AAC calls temporal noise shaping)

Lots of terms there, and it's still an outline... Modern audio coding is
brutally complex.
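
To unpack one of those terms: a tonality index can be estimated from
the spectral flatness of a band. The sketch below follows the general
shape of Johnston's classic model (flatness in dB scaled against -60 dB,
and a masking offset of 14.5 + band number dB for tones versus 5.5 dB
for noise), but treat the constants, the per-band application and the
function names as assumptions for illustration, not as what a given
encoder implements.

```python
import math

def tonality_index(power, sfm_db_max=-60.0, eps=1e-12):
    """Tonality estimate: 0.0 is noise-like, 1.0 is tone-like.

    power is a list of power values for the bins of one band.
    sfm_db_max is the flatness (in dB) treated as fully tonal (assumed).
    """
    p = [max(v, eps) for v in power]
    geo_mean = math.exp(sum(math.log(v) for v in p) / len(p))
    arith_mean = sum(p) / len(p)
    # Spectral flatness: 0 dB for a flat spectrum, very negative for peaks.
    sfm_db = 10.0 * math.log10(geo_mean / arith_mean)
    return max(0.0, min(sfm_db / sfm_db_max, 1.0))

def masking_offset_db(alpha, bark_band):
    """dB by which the threshold sits below the spread masker energy.

    alpha is the tonality index; bark_band is the band number in Bark.
    """
    return alpha * (14.5 + bark_band) + (1.0 - alpha) * 5.5
```

A noise-like band gets the small offset, so its masking threshold ends
up higher than that of a tonal band with the same energy, which is the
"noisy bands lead to higher masking thresholds" effect in b).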

>Do they FFT scan the frequency bands and simply knock out "bins" which
>are too close?

Typically they FFT a block of data and lower the number of bits used to
encode the signal in a subband if the other subbands seem to be
masking it.

>Do they simply handle case [b] by knocking out "upstream" FFT bins?

Sort of. This is part of the process described above. We derive a global
masking threshold from psychoacoustic masking calculations which tells
us what level of noise is tolerable in each band, and adapt the number
of bits used to represent each sample in that subband based on the
threshold.
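
As a rough sketch of that last step, assuming a plain uniform quantizer
where each extra bit buys about 6.02 dB of signal-to-noise ratio: the
number of bits a band needs follows from how far its signal level sits
above the masking threshold. The function name is hypothetical, and
real codecs work with scalefactors and entropy coding rather than a
fixed word length per sample.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gain per bit of a uniform quantizer

def bits_for_band(signal_db, threshold_db):
    """Bits per sample needed to keep quantization noise under the threshold.

    signal_db    : signal level in the band, in dB
    threshold_db : masking threshold in the band, in dB
    """
    smr_db = signal_db - threshold_db  # signal-to-mask ratio
    if smr_db <= 0.0:
        return 0  # the band is fully masked, nothing needs to be sent
    return math.ceil(smr_db / DB_PER_BIT)
```

A band whose signal lies below the threshold gets no bits at all, which
is the closest thing to "knocking out bins" that actually happens.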

If the total coding bandwidth is enough to represent all of the bands so
that no noise will be heard, we go with that. If it isn't, we allocate
the bits proportionally, so that the total amount of audible coding
noise per band stays constant. Then we run an optimisation loop to
actually make that happen -- this process is highly nonlinear, so no
closed-form solution exists for how to allocate the bits between the
bands.
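
A much simplified stand-in for such a loop is a greedy allocator: keep
handing one more bit to whichever band currently has the worst
noise-to-mask ratio until the budget runs out or nothing is audible any
more. This is not the actual MP3 rate/distortion loop; the function
name, the 6.02 dB per bit rule and the per-band bit cap are assumptions
made for the sake of the example.

```python
def allocate_bits(smr_db, total_bits, samples_per_band=1,
                  db_per_bit=6.02, max_bits=16):
    """Greedy bit allocation over bands (illustrative only).

    smr_db           : signal-to-mask ratio of each band, in dB
    total_bits       : bit budget for the whole frame
    samples_per_band : samples per band, so one extra bit of resolution
                       costs this many bits from the budget (assumed)
    """
    bits = [0] * len(smr_db)
    budget = total_bits
    while budget >= samples_per_band:
        # Noise-to-mask ratio per band; positive means audible noise.
        nmr = [s - db_per_bit * b for s, b in zip(smr_db, bits)]
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: nmr[i])
        if nmr[worst] <= 0.0:
            break  # all noise is already below the threshold
        bits[worst] += 1
        budget -= samples_per_band
    return bits
```

With a generous budget every band ends up with just enough bits to push
its noise under the threshold; with a tight one the leftover audible
noise is spread fairly evenly instead of piling up in one band.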

>And is the temporal masking effect catered for?

It is, but it's handled differently in different codecs, as I noted
above.
-- 
Sampo Syreeni, aka decoy - mailto:decoy at iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2


