[music-dsp] who else needs a fractional delay.

Ross Bencina rossb-lists at audiomulch.com
Sun Nov 21 06:10:22 EST 2010

robert bristow-johnson wrote:
> On Nov 20, 2010, at 4:46 PM, Ross Bencina wrote:
>> I'm implementing a low-latency audio-over-wi-fi system with UDP 
>> transport. The packet period is somewhere between 5 and 30ms. I'm  doing 
>> clock-recovery on the client to keep the buffering in sync.  Since it's a 
>> low latency system I can't afford to have more than the  minimum required 
>> buffering - so getting the playout rate correct is  important.
>> I've already implemented a prototype/simulation of most of the 
>> mechanisms. I'm using a PI controller for the servo mechanism with a 
>> feedforward path for (smoothed) playback rate (so the servo only  really 
>> needs to deal with correcting offset errors). I haven't tuned it yet, but 
>> the simulation results look OK. I found the wikipedia  article on PID 
>> controllers pretty helpful: http://en.wikipedia.org/wiki/PID_controller
>> Would you recommend something other than a PI controller for this?
> i don't think you want any D, but you probably want some P and I.

Yeah, that's why I said "I'm using a PI controller"

> one thing to remember, because this becomes the rate input to an NCO 
> (essentially the output pointer address, which has a fractional 
> component), there is an inherent integrator in your "plant" (using 
> control systems lingo :).  the plant and the controller are  essentially 
> in series, so the integrator in the plant teams up with  the PI making it 
> I and I^2, instead.  maybe you *do* want a D.  i  dunno.  but i'm pretty 
> sure you don't need an I, because your  controller P already is your I.

Interesting, I hadn't thought of that, thanks. So if P is already my I, 
what's my P? Or is that why you're saying I might want D...
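
A toy simulation may make the plant-integrator point easier to see. It is 
only a sketch under assumptions that are not from the thread: the set point 
advances by exactly one sample per tick, the NCO phase accumulates the 
stride, and the gains, names and step count are made up for illustration.

```python
def simulate(kp, ki, feedforward, true_stride=1.0, steps=2000):
    """Return the last measured pointer error (in samples)."""
    set_point = 0.0   # where the read pointer should be
    read_ptr = 0.0    # NCO phase: accumulates the stride (the plant's integrator)
    integ = 0.0       # the controller's own integrator state
    error = 0.0
    for _ in range(steps):
        set_point += true_stride               # input side advances
        error = set_point - read_ptr           # what the controller sees
        integ += ki * error                    # I term
        stride = feedforward + kp * error + integ
        read_ptr += stride                     # the NCO integrates the rate command
    return error


p_only = simulate(kp=0.1, ki=0.0, feedforward=0.0)
p_with_ff = simulate(kp=0.1, ki=0.0, feedforward=1.0)
pi_no_ff = simulate(kp=0.1, ki=0.01, feedforward=0.0)
```

In this toy model the P-only loop locks its rate but settles with the read 
pointer lagging the set point by 1/kp = 10 samples. Either an exact 
feedforward rate or a small I term drives that lag to zero. A real loop 
sees jittery measurements and an imperfect feedforward estimate, so this 
says nothing about tuning.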

>> From my point of view the more difficult thing is recovering a stable
>> wordclock from a jittery packet stream -- and getting this to start  up 
>> quickly enough to be useful. In the past I've used an Ordinary  Least 
>> Squares regression on packet timestamps to estimate the  incoming sample 
>> rate and intercept (time offset).
> whoa!  i wonder how that is incorporated in this?
> i'm only thinking about how a difference signal between two pointers  in a 
> buffer.


The problem is that there can be a huge amount of jitter on the audio 
packets. Let's say the packets have a 29ms period -- the jitter caused by 
the network (especially with Multicast traffic over WiFi) can be up to 
50ms... so asking the servo mechanism to smooth out all of that jitter is 
asking quite a lot... especially since a P(I)(D) mechanism has no inherent 
model of system or measurement error. You could heavily damp the system and 
I suppose it might stabilise eventually but I need more stability and faster 
convergence than that.

So I have two phases:

1. The network packets come in and I timestamp them and run some kind of 
robust clock recovery procedure (OLS, Kalman, whatever) to determine the 
incoming sample rate and phase/offset.

2. Take this stabilised rate and offset information and feed it to the PI 
controller to adjust the playout rate.
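
Step 1 with the OLS option could be sketched roughly as follows. This is a 
minimal illustration, not the actual implementation: the function name is 
hypothetical, packets are assumed to arrive in order with none lost, and 
every packet is assumed to carry the same number of samples.

```python
def ols_clock_fit(arrival_times, samples_per_packet):
    """Fit arrival_time ~ offset + period * packet_index by ordinary least squares.

    Returns (estimated_sample_rate_hz, estimated_offset_seconds), where the
    offset is the fitted arrival time of packet 0.
    """
    n = len(arrival_times)
    if n < 2:
        raise ValueError("need at least two timestamps")
    mean_k = (n - 1) / 2.0
    mean_t = sum(arrival_times) / n
    sxx = sum((k - mean_k) ** 2 for k in range(n))
    sxy = sum((k - mean_k) * (t - mean_t) for k, t in enumerate(arrival_times))
    period = sxy / sxx                      # seconds per packet (the slope)
    offset = mean_t - period * mean_k       # the intercept
    return samples_per_packet / period, offset


# Jitter-free example: 1280 samples per packet is about 29ms at 44.1k.
times = [0.5 + k * 1280 / 44100.0 for k in range(200)]
rate, offset = ols_clock_fit(times, 1280)
```

Plain OLS treats the jitter as zero-mean noise. Network delay only ever 
makes packets late, so the fitted intercept lands later than the true 
sender phase, and a few very late packets can pull the slope around.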

I might be able to do it all in one step with some kind of fancy controller 
that I have no idea about right now, but my intuition is that it's best to 
recover a stable incoming sample clock first and feed this into a PI 
controller (ie a mostly feedforward structure) rather than having a feedback 
system act on such massively jittery phase information.

I have read quite a few papers on approaches to this (mostly in clock 
recovery for IP TV, but also some audio streaming applications) and everyone 
seems to have their own favourite method -- with little consensus. I found 
people proposing new methods in the literature as late as 2009.

> there is an input pointer that increments each time you  get an input 
> sample (you might be getting them in asynchronous bursts  of samples). 
> and there is an output pointer that advances by a given  stride that has 
> both integer and fractional components.  the value of  that stride is the 
> reciprocal of the output/input sample rate ratio,  r.  at least it is, 
> when it settles down to a reasonably constant value.

Yep. The arrival of input samples in this system has a jitter of ~50ms.

> so imagine this circular buffer with samples popping in with a pointer 
> whose increment stride is 1.  and an output pointer that increments by 
> something that settles down to 1/r.  that is the signal that comes out  of 
> your controller.
> what goes into your controller is the difference between the output 
> pointer (the pointer that has an integer and fractional value) and a  set 
> point for that pointer.  the set point is at some fixed delay  behind the 
> input pointer, which might be at the opposite side of the  circular buffer 
> (if you want an equal amount of elbow room, but you  might not if you want 
> low latency).

That's what I'm doing. I feed samples into the structure using the incoming 
sample rate I derive at step (1) above.
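
The buffer structure described above might look something like this. It is 
a sketch only: the class and method names are hypothetical, linear 
interpolation stands in for whatever fractional-delay interpolator is really 
used, and nothing stops the read pointer from overtaking the write pointer.

```python
import math


class FractionalReader:
    """Circular buffer with an integer write pointer and a fractional read pointer."""

    def __init__(self, size, target_delay):
        self.buf = [0.0] * size
        self.size = size
        self.write_count = 0              # input pointer, stride 1
        self.read_pos = 0.0               # output pointer, integer + fraction
        self.target_delay = target_delay  # set point trails the input by this much

    def write(self, sample):
        self.buf[self.write_count % self.size] = sample
        self.write_count += 1

    def pointer_error(self):
        """Set point minus read pointer, in samples (positive: reading too far behind)."""
        return (self.write_count - self.target_delay) - self.read_pos

    def read(self, stride):
        """Return one interpolated output sample, then advance by `stride` (about 1/r)."""
        i = int(math.floor(self.read_pos))
        frac = self.read_pos - i
        a = self.buf[i % self.size]
        b = self.buf[(i + 1) % self.size]
        self.read_pos += stride
        return a + frac * (b - a)


fd = FractionalReader(size=64, target_delay=4)
for n in range(16):
    fd.write(float(n))                        # a ramp, so interpolation is easy to check
out = [fd.read(0.5) for _ in range(4)]        # half-speed playout
```

Here `pointer_error()` is the signal that would go into the controller, and 
the stride passed to `read()` is what comes out of it. With jittery packet 
arrivals, `write_count` jumps in bursts, so a smoothed estimate of the input 
position would have to replace it in the error calculation.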

> do you get time-stamps for the samples going out?  does it come from  an 
> asynchronous interrupt source or can you increment some number for  it?

I timestamp a UDP packet when it arrives. The packet contains ~29ms of audio 
data @44.1k in the current implementation... but the packet size might 

> when the input and output sample rates are *almost* the same, there  can 
> be a problem of delay slippage of a single sample when the  controller 
> knows it's off.  so you have to compute a signal from the  difference of 
> timestamp of the sample (or packet) now going out and  the computed 
> timestamp of effectively where the input pointer is.   even though it 
> always increments by 1, you need a fractional signal  that represents a 
> smoothly incrementing pointer value that happens to  cross an integer 
> sample boundary every time an input sample comes in.   this is estimated 
> from the N most recent input samples (or packets).   this might be the 
> regression problem you mention.  i dunno.

Yes, I think that's the regression problem I mention. I need to continuously 
compute the relative word-clock phase of the incoming signal.
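
One way to picture that continuously computed phase, given a recovered 
packet period and sample rate, is sketched below. Both function names are 
hypothetical. The offset estimate uses the "most on time" assumption quoted 
further down in this message, and position 0 is arbitrarily defined as the 
nominal arrival time of packet 0.

```python
def on_time_offset(arrival_times, period):
    """Offset taken from the most on-time packet.

    Network delay can only make a packet late, so the smallest residual
    arrival_time - k * period is taken as the best estimate of the phase.
    """
    return min(t - k * period for k, t in enumerate(arrival_times))


def input_pointer(now, offset, sample_rate):
    """Fractional input position (in samples) a jitter-free clock would have reached by `now`."""
    return (now - offset) * sample_rate


period = 1280 / 44100.0                       # assumed: 1280-sample packets at 44.1k
delays = [0.012, 0.0, 0.031, 0.005, 0.050]    # made-up network delays, in seconds
arrivals = [2.0 + k * period + d for k, d in enumerate(delays)]
offset = on_time_offset(arrivals, period)
position = input_pointer(3.0, offset, 44100.0)
```

This depends on the period estimate already being good: any rate error 
tilts the residuals and biases which packet looks most on time, so in 
practice the minimum would be taken over a sliding window.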

>> This time I have a mechanism for time offset based on the assumption 
>> that the "most on time" packets represent the best time offset, but  the 
>> rate estimator is still a bit of a mystery... I have a Kalman  filter 
>> version that works about as well as the OLS rate detector and  is a 
>> little cheaper -- lately I've been reading up on "robust  regression" 
>> methods (LMS and TLS) -- they're pretty costly but I'm  hoping they'll 
>> allow me to lock on to the word clock more quickly.
> i haven't done a Kalman filter problem since grad school.  i dunno how  to 
> do it or use it anymore.  i've been comparing this problem to a  more 
> hardware-like ASRC where, from reading the DSP chip's clock  register 
> (that increments at the machine instruction rate) when an  input sample 
> arrives (and putting it into a buffer) and when the  output samples goes 
> out.  that, and the difference of the output  pointer and the set point 
> (that increments with the input pointer and  that fractional portion that 
> is computed).  there is a simple way to  anticipate what the fractional 
> portion is from the difference of the  most recent two input timestamps 
> and the difference of the output  timestamp and the most recent input 
> timestamp.  but i can see that  with more of the input timestamps (than 
> just two), you can get a  better guess.

The difference between what you describe and the problem I have is that I 
have only one timestamp every 29ms, not one timestamp per sample. And my 
timestamp can be wrong by up to 50ms. So I need to do the regression (or 
similar) across a larger number of timestamps to try to recover something 
remotely approximating a word clock (constant time increment per incoming 

> but i haven't tried it because i was thinking that the  simple method 
> would be good enough if the two asynced sample streams  settled down.  if 
> they settle down, then your increment rate on the  output pointer should 
> become constant and about 1/r.  that's what your  controller has to do.

Once I've done the wordclock recovery, then I can do that bit. That's my 
step 2 above.

Thanks... it's a big help to have the opportunity to try to explain it. And 
although I'm still hunting for the ideal robust regression/state estimation 
method to do the clock recovery with, I think I understand what I want the 
result to look like a bit better than the (P)(I)(D) controller bit.

