No subject


Sun Jan 21 11:55:30 EST 2007


easy to get 4-6 instructions per cycle using only "optimized" C code. I have
used assembly only once in this time, for a very particular implementation
of a fixed-point biquad structure with 64-bit poles and accumulations. The
preferred way to go is to use intrinsics (keywords allowing the use of
assembly instructions from C code). This is pretty efficient because it gets
you closer to the architecture while still remaining in a high-level
language environment.
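
As a minimal sketch of the idea, the loop below computes a saturating Q15 dot product through two wrappers that map onto the C6000 intrinsics `_smpy` and `_sadd` when built with the TI compiler. The wrapper names, the function name `dot_q15`, and the portable fallbacks are assumptions made for illustration; they are not taken from the app note, and the fallbacks only approximate the documented behaviour of the intrinsics.

```c
#include <stdint.h>

#ifdef _TMS320C6X
/* TI C6000 compiler: use the real intrinsics. */
#define sat_add32(a, b)   _sadd((a), (b))
#define sat_mpy_q15(a, b) _smpy((a), (b))
#else
/* Portable stand-ins (assumed equivalents, for desktop builds only). */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

static int32_t sat_mpy_q15(int16_t a, int16_t b)
{
    /* Q15 x Q15 -> Q31: multiply, then shift left by one bit. */
    int64_t p = (int64_t)a * (int64_t)b * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}
#endif

/* Saturating dot product of two Q15 vectors; the result is in Q31. */
int32_t dot_q15(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc = sat_add32(acc, sat_mpy_q15(x[i], h[i]));
    return acc;
}
```

The loop body stays ordinary C, so the compiler is still free to schedule and pipeline it, while the saturating arithmetic is expressed directly instead of through explicit range checks.
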

Examples can be found in this app note, as well as some demo code:
http://focus.ti.com/docs/apps/catalog/resources/appnoteabstract.jhtml?abstractName=spra867

More generally speaking, I presented a paper on this topic 2 years ago at
AES. The abstract can be found under the following URL:
http://www.aes.org/events/112/papers/y.cfm

Thanks and best regards, 
Remi 
PS: Thanks to Angelo for the very interesting comments.

> -----Original Message-----
> From: music-dsp-admin at shoko.calarts.edu [mailto:music-dsp-
> admin at shoko.calarts.edu] On Behalf Of Angelo Farina
> Sent: Monday, December 08, 2003 12:58 PM
> To: farina at pcfarina.eng.unipr.it; music-dsp at shoko.calarts.edu
> Subject: RE: [music-dsp] re: What music DSPs?
> 
> Thanks Dafydd for Your very unbiased post. I hope I can add
> some information.
> My past experience is with TI and AD processors, always
> working in floating point. I hate the mess of fixed-point DSP
> programming, so I did not even consider the Motorola platform.
> I also have to admit that I have not yet received the latest
> evaluation boards with the latest-generation processors; my
> experience is based on the previous C6711 (TI) and 21161N
> (AD). However, AD promised to deliver me an evaluation board
> with the new 21262 before the end of the year...
> On both processors, the theoretically achievable performance
> depends significantly on the particular code. A high degree
> of parallelization is obtained only in very simple cases,
> such as a basic FIR structure that performs convolution with
> an old, stupid tight loop of MAC (multiply-accumulate) operations.
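
For reference, the "tight loop of MACs" mentioned above can be sketched in C as follows. The function name `fir_sample` and the circular delay-line layout are assumptions chosen for illustration, not code from either vendor.

```c
#include <stddef.h>

/* One output sample of an n-tap FIR filter.
   h:     the n filter coefficients
   delay: the n most recent input samples, stored circularly
   pos:   index of the newest sample in delay */
float fir_sample(const float *h, const float *delay, size_t n, size_t pos)
{
    float acc = 0.0f;
    size_t i, k = pos;
    for (i = 0; i < n; i++) {
        acc += h[i] * delay[k];        /* one multiply-accumulate per tap */
        k = (k == 0) ? n - 1 : k - 1;  /* step back to the next older sample */
    }
    return acc;
}
```

Each iteration needs one multiply, one add and two memory reads, which is why this loop maps so well onto the parallel units described below.
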
> Going into more detail, and with reference to the SHARC
> processor (I hope that Dafydd can complete a similar analysis
> for the TI), the DSP is equipped with two "independent" CPUs,
> each of them working on a separate data flow, and with
> separate filtering coefficients.
> Each of these CPUs can perform, in a single clock cycle, the
> following operations:
> - Sum of two registers
> - Difference between the SAME two registers
> - Multiplication of two registers
> - one memory access (read OR write) to the Data Memory bus
> - one memory access (read OR write) to the Program Memory bus
> This means that up to 5 "operations" can be conducted in each
> cycle, which means that each of the two CPUs can be "rated"
> at 200 MHz x 5 = 1000 Mops. But only 3 of these 5 operations
> are actually "computations", so the CPU is rated at 600
> Mflops. We have two CPUs, so we get the 1200 Mflops declared.
> In practice, in a basic FIR structure You only need one sum
> and one multiplication per cycle, the subtraction capability
> is useless. Consequently, we have actually just 400 Mflops
> from each CPU, and 800 Mflops total. Consider that the second
> CPU operates on independent data and coefficients, but is
> forced to be executing always the same instruction as the
> first CPU. This is great for FIR filtering in stereo, but in
> other algorithms, where branches do occur, the second CPU has
> to stall whenever the instructions for the two data paths differ.
> This means that, in general-purpose algorithms, You often
> cannot use the second CPU efficiently.
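
The rating arithmetic in the paragraph above can be written out as a small helper. This is only a sketch of the bookkeeping; the function name `peak_mops` is hypothetical, and the per-cycle operation counts are the ones given in the quoted message.

```c
/* Peak rate in millions of operations per second:
   clock (MHz) x operations per cycle x number of identical units. */
unsigned peak_mops(unsigned clock_mhz, unsigned ops_per_cycle, unsigned units)
{
    return clock_mhz * ops_per_cycle * units;
}
```

At 200 MHz, `peak_mops(200, 5, 1)` gives the 1000 Mops per CPU, `peak_mops(200, 3, 2)` gives the declared 1200 Mflops, and `peak_mops(200, 2, 2)` gives the 800 Mflops that a plain FIR loop can actually use.
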
> In addition, there are limitations on the usage of
> the CPU registers for obtaining the simultaneous sum,
> multiply and memory access. In practice, You are not free to
> choose any of the available registers for each of these
> operations; You have to follow rules about the internal bit
> addressing of these registers.
> This also means that, in general cases, it is not always
> possible to get 4 operations per cycle; very often this
> number is reduced to 2 or even just one.
> Proper coding in Assembly language can
> result in very significant speed improvement. I do not know
> if this is due to a suboptimal implementation of the C
> compiler for AD processors, but we have seen that working in
> Assembly language is mandatory if optimal performance is
> required, the obtainable speed-up usually being
> greater than a factor of 2.
> However, it must be said that the Assembly language of the
> Sharc processor is very friendly and high-level (it resembles a
> sort of Basic language), so the step between C programming
> and Assembly programming is very small. The same cannot be said
> for TI processors, which have very good C support but
> are very, very hard to program in Assembly.
> That said, I do not think that the actual true performance of
> a DSP is rated properly by those declared Mflops values. I
> usually evaluate the real-world performance from these
> other two figures (both are evaluated with reference to a 48
> kHz sampling rate, with 24-bit codecs and a 32-bit float
> internal data path):
> - for very simple code, a good evaluator is the maximum
> number of filtering coefficients in a basic FIR structure
> (tight loop of MACs). For example, the Sharc 21161N (100 MHz)
> can perform approximately 1250 coefficients on 2 channels
> simultaneously, or 2500 coefficients in total. The new 21262
> should double these values.
> - for more advanced code, the proper evaluator is again the
> maximum number of coefficients obtained when employing fast
> convolution (that is, based on FFT processing). The SHARC
> 21161N is actually capable of convolving 110,000 coefficients
> in mono, and 32,000 in stereo (2 inputs, 2 outputs, 2
> independent filters). The new 21262 should give a boost of a
> factor much more than 2, because it has a larger internal
> memory, which is the critical factor when performing these
> memory-intensive convolutions. But I have not tested it yet,
> so I cannot give You precise quantities here....
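
The first figure can be put in context with a cycle-budget estimate. The sketch below assumes that each data path retires at most one MAC per clock cycle; that assumption and the function names are illustrative only and are not measured figures.

```c
/* Clock cycles available in one audio sample period. */
unsigned long cycles_per_sample(unsigned long clock_hz, unsigned long fs_hz)
{
    return clock_hz / fs_hz;
}

/* Percentage of the per-sample cycle budget spent on MACs, assuming
   one MAC per cycle per data path (an assumption, see the text). */
unsigned long mac_utilisation_percent(unsigned long taps_per_channel,
                                      unsigned long clock_hz,
                                      unsigned long fs_hz)
{
    return (100UL * taps_per_channel) / cycles_per_sample(clock_hz, fs_hz);
}
```

For a 100 MHz clock and a 48 kHz sampling rate, `cycles_per_sample(100000000UL, 48000UL)` is 2083, so 1250 coefficients per channel would correspond to about 60% of the available cycles under that assumption.
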
> 
> It would be very interesting to get from Dafydd the
> corresponding figures for the new TI unit, which I suppose
> will be significantly better...
> 
> Bye!
> 
> Angelo Farina
> University of Parma, ITALY
> HTTP://pcfarina.eng.unipr.it
> 
> 
> > > -----Original Message-----
> > > From: music-dsp-admin at shoko.calarts.edu
> > > [mailto:music-dsp-admin at shoko.calarts.edu] On Behalf Of Dafydd Roche
> > > Sent: 08 December 2003 10:51
> > > To: music-dsp at shoko.calarts.edu
> > > Cc: proaudio at tendolla.com
> > > Subject: [music-dsp] re: What music DSPs?
> > >
> > > Jim,
> > >
> > > From what I see, you have 3 DSP manufacturers you can choose from:
> > >
> > > - Texas Instruments
> > > - Analog Devices
> > > - Motorola
> > >
> > > Now, before you read further - I have a confession to make. I work
> > > for TI.
> > > I know very well that this mailing list is not a promotional
> > > channel for the latest and greatest that TI, ADI or Motorola has :)
> > >
> > > That said, I've tried to make the next bit as bias-free as possible.
> > > Hopefully, it's opinion free.
> > >
> > > All this information can be found on the respective websites.
> > > I've tried to just state facts - I'm sure everyone has their
> > > opinion on what is best.
> > >
> > >
> > > TI:
> > > TMS320C6713 - 32bit floating point (64bit in double
> > > precision) running at 225MHz with 8 functional units, allowing
> > > 8 simultaneous instructions per cycle. (1350MFLOPS or 1800
> > > MIPS) (1.3GFlops or 1.8Gips)
> > > (http://focus.ti.com/docs/prod/folders/print/tms320c6713.html)
> > >
> > > ADI:
> > > Sharc ADSP21262 - 32bit/40bit floating point (64 bit in double
> > > precision mode) running at 200MHz with 2 functional units (2
> > > datapaths). I haven't worked with the device directly, so I'm
> > > quoting from AD's website:
> > > "200MHz (5ns) SIMD SHARC Core, capable of 1.2GFLOPS"
> > > (http://www.analog.com/processors/epProductPage/0,2542,ADSP%252D21262S,00.html)
> > >
> > > Motorola:
> > > DSP56371 - 24bit fixed point (48bit fixed in double
> > > precision) running at 180MHz. The DSP56371 offers 180Mips of
> > > performance.
> > > (http://e-www.motorola.com/webapp/sps/site/prod_summary.jsp?code=DSP56371&nodeId=03z6wYqRXz8596)
> > >
> > > So, that's a really quick overview of the respective cores and the
> > > devices they can be found in. Each of these devices has advantages
> > > and disadvantages in terms of things like memory, serial ports,
> > > power consumption, cost etc. etc. etc.
> > >
> > > This would be an easy place to start a slagging match between
> > > vendors, which I hope I have not started. DSP choice normally
> > > depends on a lot of different issues from programming experience
> > > and existing code, through to cost, performance needs and
> > > connectivity requirements.
> > >
> > > Best Regards
> > >
> > > Dafydd Roche
> > > dupswapdrop -- the music-dsp mailing list and website:
> > > subscription info, FAQ, source code archive, list archive, book
> > > reviews, dsp links http://shoko.calarts.edu/musicdsp
> > > http://ceait.calarts.edu/mailman/listinfo/music-dsp
> > >
> >
> 
> 


