[music-dsp] Perceptual "masking" as used in MP3 etc?
decoy at iki.fi
Sat Apr 10 18:55:45 EDT 2004
On 2004-04-10, Richard (UK) uttered:
>I have seen that there are acoustic features of "masking" where: [...]
As I see masking, the three main components are a bit different:
a) each tone competes for being heard; in the spectral domain each tone
will mask other tones progressively less with increasing frequency
separation; psychoacoustics gives us more or less constant
masking profiles/functions on the Bark scale
b) wideband noise behaves a bit differently, which is why we have
tone-masking-noise and noise-masking-tone charts and tonality
calculations in perceptual codecs
c) temporal masking stretches roughly exponentially both backwards and
forwards, but the fall-off is about ten times faster backwards in time
That's a very rough outline, of course...
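To make (a) a bit more concrete, here's a minimal sketch, assuming Zwicker's approximation of the Bark scale and purely illustrative spreading slopes (the exact slopes vary by model and with masker level):

```python
import math

def hz_to_bark(f):
    # Zwicker's approximation of the Bark scale; one of several
    # formulas found in the literature
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def spreading_db(masker_bark, maskee_bark):
    # crude triangular spreading function: attenuation is steep towards
    # lower frequencies (~27 dB/Bark) and shallow towards higher ones
    # (~10 dB/Bark) -- the "upward spread of masking"; real models make
    # the upper slope level-dependent
    dz = maskee_bark - masker_bark
    return 27.0 * dz if dz < 0.0 else -10.0 * dz
```

On this scale 1 kHz lands at roughly 8.5 Bark, and the asymmetric slopes mean a masker reaches much further up in frequency than down.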
>My question is: which of these masking effects is used by the MP3 and
>similar compression schemes?
That's a difficult one, because MP3 and all of the other serious
perceptual coding formats are defined as recipes for how to decode.
There is no normative encoder, so the encoders might employ whatever
algorithms they choose.
Current MP3 encoders employ all of the above-mentioned masking criteria.
>And how the heck do they do it?
a) the coder calculates a total perceptual masking threshold from all
of the sound that is going on in the current frame and quantizes the
signal in each band so that the noise power induced by
the quantization distortion is as far below the threshold as possible
b) the coder calculates tonality indices for each band and
chooses the masking contour based on that; noisy bands lead to higher
masking thresholds all around
c) the coder switches FFT transform block sizes if there is a transient
around; typically the coder works with two different transform sizes,
so that tonal content gets handled with longer blocks and transients
get the shorter ones; this is basically a heuristic which relies on
diminishing pre-echoes, that is, trouble with the weak backward
temporal masking characteristic of our hearing
c1) in the most sophisticated systems such as AAC, we actually exploit
the time-frequency tradeoff explicitly; when there are transients, we
actually expect the spectrum to be smooth, and encode it via LPC
Lots of terms there, and it's still an outline... Modern audio coding is
a complicated business.
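As a rough sketch of how (a) and (b) combine: the MPEG psychoacoustic models use an offset of roughly (14.5 + z) dB below the masker for a tone masking noise and about 5.5 dB for noise masking a tone, interpolated by a tonality index, and the familiar ~6 dB/bit rule then turns the resulting signal-to-mask ratio into a word length. The numbers below are illustrative, not any particular encoder's:

```python
import math

def masking_offset_db(tonality, bark):
    # interpolate between the tone-masking-noise offset (~14.5 + z dB
    # below the masker) and the noise-masking-tone offset (~5.5 dB),
    # roughly as in the MPEG psychoacoustic models; tonality in [0, 1],
    # with 1 meaning a pure tone
    tmn = 14.5 + bark
    nmt = 5.5
    return tonality * tmn + (1.0 - tonality) * nmt

def bits_needed(signal_db, threshold_db):
    # every bit of quantizer word length buys ~6.02 dB of SNR, so the
    # signal-to-mask ratio fixes a minimum number of bits per band
    smr = signal_db - threshold_db
    return max(0, math.ceil(smr / 6.02))
```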
>Do they FFT scan the frequency bands and simply knock out "bins" which
>are too close?
Typically they take an FFT of a block of data and lower the number of
bits used to encode the signal in a subband when the other subbands do
not seem to be masking it.
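A minimal sketch of that analysis step, assuming a hypothetical band layout given as FFT bin indices (MP3's actual filterbank is a hybrid of a 32-band polyphase bank and an MDCT, not a plain FFT):

```python
import numpy as np

def band_energies_db(block, band_edges):
    # window one block, take its power spectrum, and sum the energy in
    # each analysis band; band_edges are FFT bin indices (hypothetical
    # layout, for illustration only)
    win = np.hanning(len(block))
    spec = np.abs(np.fft.rfft(block * win)) ** 2
    return [10.0 * np.log10(np.sum(spec[lo:hi]) + 1e-12)
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]
```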
>Do they simply handle case [b] by knocking out "upstream" FFT bins?
Sort of. This is part of the process described above. We derive a global
masking threshold from psychoacoustic masking calculations which tells
us what level of noise is tolerable in each band, and adapt the number
of bits used to represent each sample in that subband accordingly.
If the total coding bandwidth is enough to represent all of the bands so
that no noise will be heard, we go with that. If it isn't, we allocate
the bits proportionally, so that the total amount of audible coding
noise per band stays constant. Then we run an optimisation loop to
actually make that happen -- the process is highly nonlinear, so no
closed-form solution exists for how to allocate the bits between the
bands.
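One way to sketch such a loop is a greedy allocation that keeps feeding bits to whichever band currently has the worst noise-to-mask ratio, again assuming the ~6 dB/bit rule (real encoders instead iterate over quantizer step sizes and, in MP3's case, Huffman table choices):

```python
def allocate_bits(smr_db, total_bits):
    # smr_db: per-band signal-to-mask ratios, i.e. how far above the
    # masking threshold the quantization noise would sit with zero bits
    nmr = list(smr_db)
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        worst = max(range(len(nmr)), key=lambda i: nmr[i])
        if nmr[worst] <= 0.0:   # all coding noise already inaudible
            break
        bits[worst] += 1
        nmr[worst] -= 6.0       # one more bit ~= 6 dB less noise
    return bits
```

Note how the loop stops early once every band's noise is below threshold, which matches the "if the bandwidth is enough, go with that" case above.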
>And is the temporal masking effect catered for?
It is, but it's handled differently in different codecs, as I noted
above.
Sampo Syreeni, aka decoy - mailto:decoy at iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2