[music-dsp] who else needs a fractional delay.
rossb-lists at audiomulch.com
Mon Nov 22 06:43:46 EST 2010
[I suspect my previous reply got bounced for size reasons. Here's one with
less context and a few minor edits. Sorry about duplicates if the previous
one makes it through in the end.]
robert bristow-johnson wrote (at the end):
> well, i admit that i don't totally understand what is going on here. are
> the incoming packets all containing uniformly sampled audio (even if the
> packets come in all jittery)?
> who determines when an output packet goes out? you or the recipient of
> the output packets?
Packets are captured by an A/D converter, bundled into UDP network packets
(with low latency CELT compression) and transmitted to receivers on a WiFi
network. The transmission is driven by callbacks from the A/D.
> On Nov 21, 2010, at 6:10 AM, Ross Bencina wrote:
>> The problem is that there can be a huge amount of jitter on the audio
> now, does this mean that the samples that go into the packets will have
> jitter? i hope not.
No, they are sampled from an A/D with a good clock. For all intents and
purposes they are stable.
> the rate it draws samples out of the original stream should be based on
> *long* term packet rate and not jitter around just because there is some
> jitter in the packet rate (if you don't want it to sound like shit).
Agreed. Thing is, I don't have time to detect the *long* term packet rate at
the client. I only have a couple of seconds to start the stream. So I need
to make the most of the limited amount of packet timing data I have. Hence
resorting to robust regression -- or anything else that makes the most of a
small number of timestamps.
As for sounding like shit, I'll be happy if I can get the weighted wow to
the same level as an analog studio tape machine when the stream starts. It
will stabilise from there.
>> Let's say the packets have a 29ms period -- the jitter caused by the
>> network (especially with Multicast traffic over WiFi) can be up to 50ms
> but that 50 ms is only in the reception of the packets. were they not
> originated with a nice 29 ms clock? adjacent packets still represent
> uniformly sampled audio even if the packets are coming in with bursts.
>> so asking the servo mechanism to smooth out all of that jitter is asking
>> quite a lot...
> i think, you will need some latency. just to smooth things out. how much
> can you put up with?
Very little. I'm aiming for <10ms extra buffering over what's needed to mask
the packet jitter (which is about 50ms with the test network, perhaps
slightly better in production).
In my simulations with normally distributed packet jitter, the OLS rate
estimator can lock on to the word clock rate of the source stream with 1 in
44100 accuracy within 1 minute. I think that kind of precision is OK for
this application. I just need it to stabilise quicker.
As an aside, real packet jitter doesn't follow a normal distribution, so the
OLS will perform worse than in my simulations and will be more sensitive to
outliers (occasional big delays).
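In case it helps make that concrete, here's a toy version of the simulation
I mean (Python, with made-up jitter and packet-size numbers, not values from
the real system): generate jittered arrival times for ~29ms packets, then
take the OLS slope of arrival time against cumulative sample count to recover
the sample period.

    # Toy simulation only: recover the source sample rate from jittered
    # packet arrival times by ordinary least squares.  All numbers are
    # illustrative assumptions, not measurements from the real system.
    import random

    TRUE_RATE = 44103.0          # actual source rate (Hz), unknown to receiver
    SAMPLES_PER_PACKET = 1280    # roughly 29 ms at 44.1 kHz
    CONST_DELAY = 0.050          # constant network delay (s), assumed
    JITTER_SD = 0.015            # std dev of arrival jitter (s), assumed

    def simulate(n_packets):
        xs, ys = [], []          # x = cumulative samples, y = arrival time (s)
        for k in range(n_packets):
            ideal_t = k * SAMPLES_PER_PACKET / TRUE_RATE
            ys.append(ideal_t + CONST_DELAY + random.gauss(0.0, JITTER_SD))
            xs.append(k * SAMPLES_PER_PACKET)
        return xs, ys

    def ols_rate_hz(xs, ys):
        # slope of y on x is the estimated sample period; invert to get rate
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return sxx / sxy

    xs, ys = simulate(2000)              # roughly one minute of packets
    print(ols_rate_hz(xs, ys))           # converges towards TRUE_RATE

With those made-up numbers the estimate typically lands within a Hz or so of
the true rate after a minute, which is the kind of 1-in-44100 convergence I
was referring to.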
>> especially since a P(I)(D) mechanism has no inherent model of system or
>> measurement error. you could heavily damp the system and I suppose it
>> might stabilise eventually but I need more stability and faster
>> convergence than that.
> is this in case the sample rate (or packet rate) coming in might change
> and you need to adapt to it? or is it going out?
No, the packet rate is constant. I just don't know exactly what it is at
startup, because the source and receiver rates can differ (crystal tolerance
and thermal clock drift on the relatively low-precision consumer devices
that are the receivers).
I need faster convergence because if the PID is heavily damped it will take
a long time to entrain and during that period either I will get buffer
underruns (if the receiver plays out too fast) or overruns/excess latency
(if the receiver plays out too slowly).
>> So I have two phases:
>> 1. The network packets come in and I timestamp them
> *you* timestamp them? so this means that you are assuming that they drew
> out of the original audio stream with a nice, smooth, uniform rate,
> the first (or zeroth) sample of packet K is the sample that immediately
> follows the last sample of packet K-1, right? with equal sample spacing,
> right? do you have timestamps from the originator?
The packets have source sequence numbers, which is basically the same thing.
There is no timestamp synchronisation between source and receiver (or any
feasible way to create it with high precision), so my whole model is
predicated on clock recovery at the client purely from the incoming packet
arrival times.
>> and run some kind of robust clock recovery procedure (OLS, Kalman,
>> whatever) to determine the incoming sample rate and phase/offset.
> these packets do not contain information about their original sample rate?
The incoming sample rate is known to be nominally 44100. Any deviations are
at the level of manufacturing tolerances, which I haven't measured across
the whole set of 1300 receivers, but for a small sample of receivers the
clock differences seem to be +/- a few samples -- which is enough to slip a
5ms buffer in 1 minute without this scheme we're discussing. The literature
suggests that tolerances in consumer word clocks are usually bigger than that.
> i dunno, but it seems to me that it would be worth a couple of bytes per
> packet to let you know what rate the samples were intended to be played
> back with. that isn't in the packet structure?
Yeah. I get that info when I start the stream. I'm trying to correct for the
difference between nominal and actual sample rates.
>> 2. Take this stabilised rate and offset information and feed it to the
>> PI controller to adjust the playout rate.
> now, if the playout rate changes and the samples represent uniform
> sampling, you will hear that as pitch/speed change.
Yes. That effect will be kept to a margin below the acceptable weighted
"wow" level.
> are you playing it out at an ostensibly lower or higher sampling rate
> than what was the original sampling rate?
Could be either, it depends on the actual sample rate of the receiver.
> is not that playout rate considered stable, even if the packets come and
> go in bursts?
Yes, it is considered *stable* within time frames where thermal clock-drift
magnitude is not significant. However it is not *known*. If I begin with the
nominal rate (say 44100Hz) and the actual rate is 44103Hz things go out of
sync pretty quickly -- that's a 3 sample slippage every second -- if I have
a safety margin of 256 samples, well, you can work out how long it takes for
the safety to be eaten up.
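(To spell it out: at 3 samples/second of slip, a 256 sample margin is gone
in 256 / 3 ~= 85 seconds.)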
Now, I could use _only_ a PI controller to correct for this, but as far as I
can tell I will need a much bigger safety margin than 5ms and/or I will get
much bigger deviations (wow) in the output because the PI will have to work
a lot harder to keep things in sync than if I use a more powerful
statistical method to feed-forward more accurate recovered-rate information.
The idea is that the clock-recovery procedure does the bulk of the incoming
sample rate estimation work, and the PI-controller is responsible primarily
for the phase/offset correction.
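Roughly, I mean something like this (Python sketch, hypothetical names and
untuned gains): the recovered source rate sets the nominal resampling ratio
as a feed-forward term, and the PI only trims around it based on the
phase/buffer-fill error.

    # Sketch only: feed-forward from the recovered rate plus a PI trim.
    # Gains and names are hypothetical, not values from the real system.
    class PlayoutControl:
        def __init__(self, kp, ki, local_rate_hz):
            self.kp = kp
            self.ki = ki
            self.local_rate_hz = local_rate_hz
            self.integral = 0.0

        def ratio(self, recovered_rate_hz, phase_error_samples, dt_s):
            # phase_error_samples: buffer fill minus target fill, in samples.
            # Positive error -> too much buffered -> consume slightly faster.
            self.integral += phase_error_samples * dt_s
            trim = self.kp * phase_error_samples + self.ki * self.integral
            feed_forward = recovered_rate_hz / self.local_rate_hz
            return feed_forward + trim   # resampling ratio fed to the ASRC

    ctl = PlayoutControl(kp=1e-6, ki=1e-7, local_rate_hz=44100.0)
    r = ctl.ratio(recovered_rate_hz=44103.0, phase_error_samples=12.0,
                  dt_s=0.005)

The point is that the PI never has to discover the few-Hz rate difference by
itself; it only has to hold the fill level, so it can be damped much more
heavily.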
> i'm trying to figure out exactly what the deal is here. are you to infer
> the playout sample rate solely from how many "packet requests" you
> receive during some fixed period in the past?
When the system starts (when the user presses the start button on the
receiver) I'm trying to infer the source sample rate and offset solely from
a small number of received packets. As time goes on I have more packets to
work with and can use them all to estimate the sample rate. The long-term
behaviour is less interesting because many techniques will stabilise over a
long time period, whereas I need one that will stabilise as fast as possible.
I can always add a bit more latency at startup to mask poor startup clock
recovery performance if I have to, but I'm trying to avoid that in the
first place.
> but if the samples were drawn from a uniformly-sampled stream, you do not
> want to do a varying interpolation between them, you just want to buffer
> them, no?
1. I need to match the _actual_ playout rate to the _actual_ source rate so
that the buffer-fill level is constant within a small margin (say 5ms but
ideally less). Neither of these actual rates is known (even in pro audio
systems we use word-clock cables to distribute a single master clock because
rates between clocks vary).
2. I need to vary the playout rate to make use of the increasing certainty
about the source rate and source offset. The source offset is important
because it determines the actual latency of the system -- having a good
estimate of the time at which a packet was originated allows me to keep the
buffering margins low -- if I just used the first received packet as the
time origin my latency would have a random offset bounded by the network
jitter -- ideally I don't want network jitter to have a dynamic impact on my
latency at all.
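(With the ~50ms jitter on the test network, anchoring on the first received
packet could add anywhere from zero up to ~50ms of unnecessary latency, and
you wouldn't know how much.)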
> your interpolation depends on where the output pointer (which has an
> integer and fractional value) lies. but that increment (with an integer
> and fractional part) should not change much due to jitter, otherwise your
> audio sounds like crap. it should change only a little when you realize
> that the increment was too low or too high over the long term and you
> will get buffer overrun if nothing is done about it.
I agree. But I think we may have different ideas about the definition "over
the long term" -- if you only add 5 ms of extra buffering to account for
misestimation of clock skew then "over the long term" is not very long.
There is a trade-off here between having more wow and having a slower
response (needing more buffering). Like I said in an earlier post, I haven't
tuned it against this trade-off yet. But the spec is "start up within
2-5 seconds, minimise latency", so I will be allowing wow at startup up to
the perceptual threshold and then adding extra latency as needed.
>> That's what I'm doing. I feed samples into the structure using the
>> incoming sample rate i derive at step (1) above.
> that incoming sample rate is derived purely from the timestamps *you* put
> on the packets as they are received? (as well as the knowledge of the
> number of samples per packet, i s'pose.)
> if that is the case, you need buffering (and latency).
To a point.
The more accurately I can estimate sample rate and offset the less
additional buffering/latency I need to add. Given that the spec is "make it
low latency" I can afford to throw whatever I can at getting a more accurate
estimate, and I'm pretty sure the mechanism I have is better at reducing
latency than pure PI + extra buffering.
Check out section 2 ("Clock Recovery Work Review in DVB-IPTV") of the
following review paper:
"Technical Review of Various Clock Recovery Methods for IP Set-Top Box",
Monika Jain, P.C. Jain, Sharad Jain. JTES: Journal of Technology and
Engineering Science, Jul-Dec 2009, Vol. 1, No. 2.
I think you're thinking of something like the PLL approach and I'm working
on something like the LLR transmitter time-base recovery approach.
> anyway, it's not so much the input sample rate nor the output sample rate,
> but the *ratio* between the two that you must determine. who determines
> the rate at which packets go out? do you? or is someone tapping you
> every time they need a packet?
CoreAudio taps me every time it needs a 5ms playback packet. So I also need
to estimate the actual local sample rate... and, as you say, calculate the
difference. Calculating the local sample rate is easier because (1) the
jitter in the local time measurements is low, and (2) CoreAudio provides
actual sample rate information so I may be able to use it directly (although
I haven't verified its accuracy yet).
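If I end up not trusting the reported rate, the fallback is just to measure
it -- something like this (Python sketch with hypothetical bookkeeping names,
not a CoreAudio API). Because local timing jitter is low, a simple endpoint
estimate is fine here:

    # Sketch: estimate the actual local playback rate from callback timing.
    def local_rate_hz(first_callback_time_s, last_callback_time_s,
                      frames_delivered_between):
        # frames delivered between two callbacks, divided by elapsed host time
        return frames_delivered_between / (last_callback_time_s -
                                           first_callback_time_s)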
>> I timestamp a UDP packet when it arrives. The packet contains ~29ms of
>> audio data @44.1k in the current implementation... but the packet size
>> might change.
> okay, if the packet size increases, does that mean you expect the packet
> rate to decrease?
Sorry, I've misled you here. The packet size is fixed for the life of a
stream. It might change for different instances of a stream.
>> Yes, I think that's the regression problem I mention. I need to
>> continuously compute the relative word-clock phase of the incoming
> even if some of the packets have not arrived yet because they've been
> snarled in traffic for as long as 50 ms, right?
That's right. I need to compute a time alignment/playout rate that is
accurate. That allows my jitter buffer to go to almost-empty during traffic
snarls without any impact on playout rate.
This whole thing runs on a controlled private network with QoS for audio
traffic so although there can be traffic snarls, the maximum network delay
is assumed to be bounded -- the jitter buffer has a fixed duration computed
according to modelled and measured network delay characteristics.
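For completeness, the sort of calculation that could set that fixed duration
(sketch only, with made-up percentile and margin, not the numbers we
actually use):

    # Illustrative only: size a fixed jitter buffer from measured one-way
    # delay samples (seconds).  Percentile and margin are made-up numbers.
    def jitter_buffer_s(measured_delays_s, percentile=0.999, margin_s=0.005):
        d = sorted(measured_delays_s)
        high = d[min(len(d) - 1, int(percentile * len(d)))]
        return (high - d[0]) + margin_s   # cover delay variation plus margin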
>>> i've been comparing this problem to a more hardware-like ASRC where,
>>> from reading the DSP chip's clock register (that increments at the
>>> machine instruction rate) when an input sample arrives (and putting it
>>> into a buffer) and when the output samples goes out. that, and the
>>> difference of the output pointer and the set point (that increments
>>> with the input pointer and that fractional portion that is computed).
>>> there is a simple way to anticipate what the fractional portion is
>>> from the difference of the most recent two input timestamps and the
>>> difference of the output timestamp and the most recent input
>>> timestamp. but i can see that with more of the input timestamps (than
>>> just two), you can get a better guess.
>> The difference between what you describe, and the problem I have is that
>> I have only one timestamp every 29ms, not one timestamp per sample. And
>> my timestamp can be wrong by up to 50ms.
> it's a quantitative difference, and it seems to me that you need buffering
> (of a lot more than 50 ms) with its associated latency to deal with that.
If you view it as a quantitative difference and only try to solve it with a
controller then perhaps extra buffering is the only answer. My reasoning is
that using more powerful clock-recovery techniques ahead of the PI
controller will allow me to keep total client buffering down to the minimum
needed to mask total-path network jitter (~50ms or less) and then a small
margin for the PI controller to do its thing.
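(Roughly: the ~50ms jitter buffer plus the <10ms margin I mentioned earlier.)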
From what I've read, most network streaming audio systems use buffering on
the order of 300ms or more, but the system I'm working on needs to deliver
latencies way below that. Originally we were aiming for ~50ms total
end-to-end latency. I'm not sure that's possible but it's still the goal.
There are major differences between unicast and multicast jitter profiles
for example that will affect the final outcome.
As a side note, we're talking about continuous music audio streaming here.
If it was VOIP there are a whole bunch of other techniques that can be used
(talkspurts etc.) instead of ASRC to deal with clock skew and keep the
latency down.
>> So I need to do the regression (or similar) accross a larger number of
>> timestamps to try to recover something remotely approximating a word
>> clock (constant time increment per incoming sample)
> simple thing (whether it's packets or samples), is put your timestamp data
> into a buffer, subtract the timestamp from N packets ago from the
> timestamp of the most current packet. divide that timestamp difference
> by the number of samples (which is proportional to N-1, if the packet
> size doesn't change) received and that's a word clock period (in units of
> your timestamp clock).
Yep. That's a simple way of doing it, and it gets more accurate the more
timestamps you have. In fact this is similar to the approach RADclock uses
for high precision network time synchronisation
(http://www.cubinlab.ee.unimelb.edu.au/radclock/ their paper "Principles of
Robust Timing Over the Internet" in ACM Queue is a good read if you can get
it). Thing is, you need your first and last timestamp to be quite a long
time apart to make the effects of jitter on those end-packet timestamps
negligible.
If one or both of your end timestamps is an outlier (big jitter) then your
estimate will have a big error. So although the effect of jitter is reduced
over a longer time period (as N increases), the estimate is still relatively
sensitive to jitter in the first and last timestamp. Least squares
regression across all
timestamps will give a better rate estimate than that... and you can update
the running sums in OLS each time you get a new packet. So it's not very
expensive to compute an OLS regression. Thing is, OLS is still sensitive to
outliers, which is why I'm looking at robust regression schemes like LMS
(least median of squares) and TLS (trimmed least squares) -- these are
harder to update incrementally but there are some algorithms out there for
low complexity LMS update at least.
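The running-sum update is the standard one -- something like this (Python
sketch): each new (cumulative-sample-count, arrival-time) pair updates a
handful of accumulators, and the slope (sample period) is O(1) to recompute
per packet.

    # Incremental OLS of arrival time (y) against cumulative sample count (x).
    # The slope is the estimated sample period of the source word clock.
    class IncrementalOLS:
        def __init__(self):
            self.n = 0
            self.sx = self.sy = self.sxx = self.sxy = 0.0

        def add(self, x, y):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

        def period_s(self):
            # slope of the least squares line through all points so far
            num = self.n * self.sxy - self.sx * self.sy
            den = self.n * self.sxx - self.sx * self.sx
            return num / den

        def rate_hz(self):
            return 1.0 / self.period_s()

Of course that's plain OLS, so it keeps the outlier sensitivity I mentioned;
LMS/TLS replace the slope computation but need cleverer incremental updates.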
Bogac suggested using an H-infinity (minimax) estimator, which from my brief
research might perform better than Kalman or OLS for this problem.
> now when the output packet goes out, you record a timestamp using the same
> clock as what you use to timestamp the input packets. now you want to
> know where the smoothly incrementing input pointer might be at that time.
> it's wherever the last *true* input pointer is (immediately after the
> last sample from the last input packet) at the time of timestamping the
> output packet *plus* what you predict by subtracting the timestamp of the
> last input packet from the timestamp of the outgoing packet and dividing
> that by the time-per-sample calculated simply by the above
> paragraph (or whatever more sophisticated method you are using).
Yes. That's what I'm doing.
That description only covers rate adjustment though. The controller also
has to deal with phase adjustment as the phase estimate improves.
>>> but i haven't tried it because i was thinking that the simple method
>>> would be good enough if the two asynced sample streams settled down.
>>> if they settle down, then your increment rate on the output pointer
>>> should become constant and about 1/r. that's what your controller has
>>> to do.
>> once I've done the wordclock recovery, then i can do that bit. That's my
>> step 2 above.
> so the issue is if there is a sudden change in the clock of the packets
> going out, right?
There will be no sudden changes anywhere in the actual clock rates. The only
sudden changes are in the estimates -- the estimates get better over time --
the estimate of source sample rate improves exponentially, the estimate of
source timebase offset can improve suddenly (which doesn't mean I need to
respond to it suddenly).
The timestamps of a stream of incoming packets can be modelled as:
received_t = (ideal_transmit_time + constant_delay) + jitter
The packet with minimum jitter (the least delayed packet) gives the best
estimate of the offset to the source timebase -- for this application that's
a better number to use than the intercept of a least squares regression, or
the offset resulting from the method you describe. Intuitively it makes
sense: the system is causal, so all packets are delayed, the least delayed
packet gives the closest estimate of the actual source time base.
I have an algorithm that keeps a running list of candidate earliest packets
(it's basically a convex hull problem) and then chooses the actual earliest
point based on the rate estimate (using our favourite method, OLS, line
through the ends, whatever). This computes the offset.
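Ignoring the convex hull bookkeeping, the offset step boils down to
something like this (Python sketch, hypothetical names):

    # Sketch: estimate the source time-base offset from the least delayed
    # packet, given a packet period derived from the current rate estimate.
    # received = seq * packet_period + constant_delay + jitter, jitter >= 0,
    # so the minimum residual is the best available offset estimate.
    def offset_estimate_s(seq_and_arrival, packet_period_s):
        # seq_and_arrival: list of (sequence_number, arrival_time_s)
        return min(t - seq * packet_period_s for seq, t in seq_and_arrival)

The candidate-point (convex hull) list just lets that minimum be re-evaluated
cheaply when the rate estimate changes, without keeping every packet's
timestamp around.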
With a rate and offset we can do the adjustment with the PI controller you
described in the previous paragraph. But the rate and offset estimates
change over time (they improve). The offset can improve suddenly (a packet
with lower network jitter than any previously received packet arrives) --
this means that we've been buffering more than necessary and I can lower the
buffer fill level -- how the controller responds to this is a trade off
between perceptual quality and lowering latency quickly.
Ultimately, if this system had 5 minutes to entrain, we wouldn't be having
this conversation and the PI controller would be slowly adjusting the
playback rate at the +/- 1 sample/second level, but at startup the source
rate and offset estimates aren't as accurate as I would like so the
controller works harder (up to a perceptual limit "doesn't sound like shit")
and I deal with the rest with buffering.
Actually, with unicast transport I expect this all to work OK. But for the
last little while I've been trying to get this all working with multicast
WiFi network jitter (all packets are queued at the router until the next
beacon -- so the router is retransmitting packets according to a constant
25ms beacon pulse, while you're trying to transmit them with a 29ms
period -- clock recovery in this situation is more difficult <g>).