[music-dsp] vectorizers

Christopher Weare chriswea at microsoft.com
Thu Aug 2 17:12:35 EDT 2001


Yes, but not for floats.

-chris


-----Original Message-----
From: Ian Lewis [mailto:ILewis at acclaim.com] 
Sent: Thursday, August 02, 2001 12:47 PM
To: 'music-dsp at shoko.calarts.edu'
Subject: RE: [music-dsp] vectorizers


Now for the $64,000 question: can Altivec do horizontal ops? That is,
can it take a vector (x,y,z,w) and perform x+y+z+w? Ian
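
For anyone unsure what a "horizontal op" means here, a minimal portable C sketch follows. It is plain scalar code, not AltiVec or SSE intrinsics, and the vec4f type and helper names are made up for illustration. It shows the usual fallback when no single sum-across instruction exists for an element type: two rotate-and-add steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 4-float SIMD register. */
typedef struct { float e[4]; } vec4f;

/* Rotate elements left by k positions (the job a permute/shift would do). */
static vec4f rotate_left(vec4f v, size_t k)
{
    vec4f r;
    assert(k < 4);
    for (size_t i = 0; i < 4; i++)
        r.e[i] = v.e[(i + k) % 4];
    return r;
}

/* Ordinary element-wise ("vertical") add. */
static vec4f add4(vec4f a, vec4f b)
{
    vec4f r;
    for (size_t i = 0; i < 4; i++)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Horizontal sum: after two rotate+add steps every lane holds x+y+z+w. */
static float horizontal_sum(vec4f v)
{
    v = add4(v, rotate_left(v, 2));  /* (x+z, y+w, z+x, w+y) */
    v = add4(v, rotate_left(v, 1));  /* all lanes: x+y+z+w   */
    return v.e[0];
}
```

With (1, 2, 3, 4) as input, horizontal_sum returns 10.
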

> -----Original Message-----
> From: Rob Barris [mailto:rbarris at quicksilver.com]
> Sent: Thursday, August 02, 2001 1:12 PM
> To: music-dsp at shoko.calarts.edu
> Subject: Re: [music-dsp] vectorizers
> 
> 
> At 10:25 AM -0600 8/2/01, Ian Lewis wrote:
> >Has anyone here programmed the Mac AltiVec units? Can you compare/contrast
> >them with SSE/SSE2, 3DNow, and "traditional" DSP processors? Ian
> >
> 
> 
>    I wrote the following years ago.  It predates SSE2.  I like AltiVec
> (referred to as Velocity Engine here, in the Apple nomenclature) - I would
> welcome corrections or add'l information.
> 
>    It was before I got to actually write some AV code, so the example code
> is bogus - as it turns out AV can only do 8 byte multiplies in parallel,
> not 16 IIRC.  FWIW- and don't forget www.altivec.org, they have a good
> mailing list.
> 
> 
> 
> >
> >   With the high level of noise on this NG about relative performance of
> >G4 vs G3, G4 vs x86, and AltiVec/Velocity vs MMX/SSE/3dNow, I thought some
> >might appreciate a programmer's view detailing specific strengths and
> >weaknesses of the G4 and the AltiVec/Velocity model.
> >
> >   It seems to me that there's a lot of handwaving going on, and some of
> >the significant differences in design and philosophy get completely
> >overlooked.  It's not surprising, given that so few actually have their
> >hands on the G4's yet, but maybe this will help.  A lot of this is
> >straight off of Apple's web site
> >   <http://developer.apple.com/hardware/altivec/summary.html>
> >and also from the Motorola documentation.
> >
> >   Velocity adds 32 new registers to PowerPC.  PPC already has 32 integer
> >(32-bit) and 32 FP (64-bit) registers; the 32 new Velocity registers are
> >128 bits wide.
> >   In contrast with x86: MMX provides 8 64-bit integer-only SIMD
> >registers, and SSE (P-III) brings in 8 more 128-bit 4-element FP
> >registers.
> >   Concretely, the Velocity register file is big enough to hold "eight"
> >complete 4x4 matrices.  SSE's FP register file can only hold "two".
> >   Since you usually need a register or two free to hold a vector to be
> >modified by a matrix, this might be better expressed as "seven" and "one".
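
The "eight" and "two" figures are just register-file size divided by matrix size. A quick C sketch of that arithmetic, assuming single-precision (32-bit) matrix elements; the function name is made up for illustration:

```c
#include <assert.h>

/* How many 4x4 single-precision matrices fit in a register file?
   One matrix is 16 floats x 32 bits = 512 bits. */
static unsigned matrices_that_fit(unsigned num_regs, unsigned reg_bits)
{
    const unsigned matrix_bits = 4u * 4u * 32u;
    assert(reg_bits > 0);
    return (num_regs * reg_bits) / matrix_bits;
}
```

matrices_that_fit(32, 128) gives 8 for Velocity and matrices_that_fit(8, 128) gives 2 for SSE.
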
> >
> >   Velocity is largely modeless.  Any Velocity register can be interpreted
> >as holding 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4
> >32-bit (single precision) floating point values.
> >   Further, instructions are available to efficiently convert between the
> >integer and FP domains, without having to store out to memory and back,
> >and can even do "float to fixed" conversion, where a FP number "x" is
> >conveniently converted into an integer of the form (x * 2^n).
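
To make the (x * 2^n) form concrete, here is a scalar C sketch of a float-to-fixed conversion and its inverse. It only illustrates the arithmetic on one element. The function names are hypothetical, and the truncation, clamping and NaN handling are assumptions of this sketch rather than a statement of what the hardware does (I believe the relevant intrinsics are vec_cts and vec_ctf, but check the manual).

```c
#include <assert.h>
#include <stdint.h>

/* Float to fixed: x * 2^n as a signed 32-bit integer, truncated toward
   zero.  Out-of-range values are clamped and NaN maps to 0; both are
   choices made for this sketch. */
static int32_t float_to_fixed(float x, unsigned n)
{
    assert(n < 31);
    float scaled = x * (float)(1u << n);
    if (scaled != scaled)         return 0;          /* NaN */
    if (scaled >= 2147483648.0f)  return INT32_MAX;  /* >= 2^31 */
    if (scaled <= -2147483648.0f) return INT32_MIN;
    return (int32_t)scaled;
}

/* Fixed to float: q / 2^n. */
static float fixed_to_float(int32_t q, unsigned n)
{
    assert(n < 31);
    return (float)q / (float)(1u << n);
}
```

For example, 1.5 with n = 8 becomes 384, and 384 converts back to 1.5.
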
> >   In contrast with x86: MMX integer work has to be mode-switched with the
> >classic FP stack (by the programmer), and the 8 MMX registers are strictly
> >partitioned from the 8 SSE registers.
> >   If you have an algorithm that needs LOTS of individual integer strings,
> >or LOTS of individual FP strings, Velocity excels because it can handle
> >either case.
> >   Put another way, you can have the equivalent of the entire register set
> >of MMX and SSE inside the Velocity unit, with 16 more 128-bit registers to
> >spare.
> >   Put yet another way, if you have an algorithm with between two and
> >seven active FP 4x4 matrices, it'll fit on Velocity.  And if your results
> >wind up in the integer domain, you'll find that the conversion steps are
> >trivial and fast.
> >
> >   Velocity largely follows the RISC model of "3-operand" instructions.
> >In general, given two source operands X and Y in registers, the result can
> >go anywhere (z = x+y).
> >   In contrast with x86: virtually all MMX operations are of the
> >two-operand destructive form (x = x+y), often necessitating extra
> >moves/reloads on the part of the programmer, and more cycles spent by the
> >processor.  I haven't done SSE programming yet, perhaps someone can
> >clarify whether it follows that model as well.
> >
> >   Velocity includes virtually all of the convenient instructions you
> >would want, for dealing with signal processing, 3D math, graphics, etc.
> >Saturated[clamping] / unsaturated arithmetic, shift/rotate, fused
> >multiply-add, sum-across, average, min/max, round, AND/OR/XOR/NOR,
> >conditional-select (useful for avoiding branches in Velocity code).
> >   <http://developer.apple.com/hardware/altivec/highlights.html#Highlights>
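
Two of the items in that list are easy to show on a single byte lane. The C sketch below is scalar and the function names are made up; a vector unit would apply the same thing to all 16 lanes at once. It contrasts saturated with unsaturated addition, and uses a compare mask plus bitwise select to compute a max without branching.

```c
#include <stdint.h>

/* Saturated (clamping) unsigned 8-bit add: sums above 255 stick at 255. */
static uint8_t add_saturate_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Unsaturated (modulo) add: sums above 255 wrap around. */
static uint8_t add_modulo_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Bitwise select: where a mask bit is 1 take the bit from b, else from a. */
static uint8_t select_u8(uint8_t a, uint8_t b, uint8_t mask)
{
    return (uint8_t)((a & (uint8_t)~mask) | (b & mask));
}

/* Max without an if/else on the data path: build an all-ones or
   all-zeros mask from the comparison, then select. */
static uint8_t max_u8(uint8_t a, uint8_t b)
{
    uint8_t mask = (uint8_t)-(b > a);   /* 0xFF if b > a, else 0x00 */
    return select_u8(a, b, mask);
}
```

With 200 + 100, the saturated add gives 255 and the modulo add gives 44.
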
> >
> >
> >   Velocity is directly programmable from the C language using the new
> >"vector" keyword, leaving the compiler free to assign registers, deal with
> >initializations etc., just as it does for conventional (non vector)
> >integer and FP work.  A snippet of some Velocity code:
> >
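> >  {
> >   /* AltiVec C vector literals are written as a parenthesized cast */
> >   vector unsigned char x = (vector unsigned char)( 0,0,0,1, 0,0,0,2, 0,0,0,3, 0,0,0,4 );
> >   vector unsigned char y = (vector unsigned char)( 0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15 );
> >   vector unsigned char sum,prod;
> >
> >   sum = vec_add( x, y );    /* 16 byte-wise sums */
> >   /* illustrative only: there is no vec_prod intrinsic; byte multiplies
> >      go 8 at a time (vec_mule / vec_mulo) and widen to 16 bits */
> >   prod = vec_prod( x, y );
> >  }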
> >
> >   Here, the compiler takes care of translating all of these operations
> >(initialization of the two 16-element byte vectors, the multiply operation
> >and the addition operation) into specific Velocity instructions. The
> >compiler knows what data type is involved, so it is able to choose the
> >right form of instruction (8-bit add, 8-bit mul) for each line of code.
> >   This example is pretty brain dead, it just puts 16 individual byte sums
> >into "sum" and 16 individual byte products into "prod".  Aside from the
> >setup/initialization, I would guess it would execute in somewhere between
> >2 and 4 clock cycles.
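
A portable way to see what the snippet above is meant to compute is to emulate the lanes with plain arrays. The C sketch below is scalar code with made-up function names. Since the note at the top of this message says byte multiplies only go 8 at a time, the multiply here handles the odd-numbered lanes only and widens each product to 16 bits, which is how I understand vec_mulo to behave (an assumption; check the AltiVec manual).

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* 16 independent modulo-256 byte sums, one per lane. */
static void add_u8x16(const uint8_t a[LANES], const uint8_t b[LANES],
                      uint8_t out[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}

/* Products of the odd-numbered byte lanes (1, 3, 5, ...), each widened
   to 16 bits: 8 results from 16 input lanes. */
static void mul_odd_u8x16(const uint8_t a[LANES], const uint8_t b[LANES],
                          uint16_t out[LANES / 2])
{
    for (size_t i = 0; i < LANES / 2; i++)
        out[i] = (uint16_t)(a[2 * i + 1] * b[2 * i + 1]);
}
```

With the x and y values from the snippet, lane 15 of the sum is 19 and the last odd-lane product is 60.
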
> >
> >   More examples at
> >   <http://developer.apple.com/hardware/altivec/examples.html>
> >
> >   In contrast with x86: the vast majority of MMX/SSE code must be written
> >by hand in the native assembler form.  Some compilers presently offer
> >3dNow! code generation for specific types of FP/matrix code; my opinion is
> >that this is not nearly as general a solution as the "vector" extension to
> >the C language used by Velocity.
> >   The Velocity approach is to accept the notion that for best
> >performance, recoding is going to be necessary for key functions, and to
> >make that coding/debugging process as easy as possible.  You can't grasp
> >how important this is until you've written some MMX assembler code for the
> >8-register model and its 2-operand format (I have).
> >
> >   There has been some talk in this NG about Velocity support being as
> >easy as a checkbox in a compiler; it's not.  I don't think there is yet a
> >compiler that will go in and auto-vectorize code. The good news is, it
> >doesn't take an assembler hacker or any tricky mode switches to start
> >taking advantage of Velocity from C.
> >
> >
> >    Other Velocity strengths:
> >
> >   Convenient instructions and data types for dealing with 1-5-5-5 RGB
> >pixel data. While none of the logical/arithmetic operations act directly
> >on 1-5-5-5 format data, there are "pack" and "unpack" instructions that
> >can convert 5-5-5 pixels to 8-8-8 format and back.  It's very simple to do
> >all 8-8-8 format pixel calculations and then issue a single "pack"
> >operation before storing the finished pixels out to memory.
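
For readers who haven't handled 1-5-5-5 pixels, here is a generic C sketch of the unpack/pack round trip on one pixel. The bit layout (alpha in bit 15, then red, green, blue), the struct and the function names are assumptions for illustration, and the scaling used here is a common software convention, not necessarily bit-for-bit what the AltiVec pack/unpack instructions produce.

```c
#include <stdint.h>

/* Hypothetical 8-bit-per-channel pixel. */
typedef struct { uint8_t a, r, g, b; } pixel8888;

/* Expand a 5-bit channel to 8 bits by replicating its top bits,
   so 0 maps to 0 and 31 maps to 255. */
static uint8_t expand5(unsigned c5)
{
    return (uint8_t)((c5 << 3) | (c5 >> 2));
}

/* Unpack one 1-5-5-5 pixel (assumed layout: A RRRRR GGGGG BBBBB). */
static pixel8888 unpack_1555(uint16_t p)
{
    pixel8888 out;
    out.a = (uint8_t)((p >> 15) ? 0xFF : 0x00);
    out.r = expand5((p >> 10) & 0x1Fu);
    out.g = expand5((p >> 5) & 0x1Fu);
    out.b = expand5(p & 0x1Fu);
    return out;
}

/* Pack back by keeping the top 5 bits of each colour channel and the
   top bit of alpha. */
static uint16_t pack_1555(pixel8888 px)
{
    return (uint16_t)(((unsigned)(px.a >> 7) << 15) |
                      ((unsigned)(px.r >> 3) << 10) |
                      ((unsigned)(px.g >> 3) << 5)  |
                       (unsigned)(px.b >> 3));
}
```

With this scaling, packing an unpacked pixel returns the original 16-bit value.
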
> >
> >   "Permute" - mix and match individual bytes from two source vectors,
> >into a single destination vector, in a single operation.  The generality
> >of this operation is remarkable.
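
A scalar C emulation makes the generality of permute easier to see. In this sketch (function name made up), each byte of a control vector picks one of the 32 bytes formed by the two sources placed end to end, with only the low 5 bits of the index used. That matches my understanding of vec_perm, but treat it as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulated permute: out[i] = (a followed by b)[control[i] & 31]. */
static void permute_u8x16(const uint8_t a[16], const uint8_t b[16],
                          const uint8_t control[16], uint8_t out[16])
{
    for (size_t i = 0; i < 16; i++) {
        unsigned idx = control[i] & 0x1Fu;
        out[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

A control vector of 15, 14, ..., 0 reverses the first source; indices 16 to 31 pull bytes from the second source, so interleaves, shifts across two vectors and table lookups all reduce to choosing the right control bytes.
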
> >
> >   Velocity as presently implemented has two pipelines, one for the
> >arithmetic/logical/FP operations, and one for permute/shift/move type
> >operations. It's sometimes the case that an operation that would seem to
> >belong in one class can be re-expressed as an operation of the other
> >class, allowing for more overlap of computation.
> >   More generally speaking, this split means that algorithms operating on
> >arbitrarily-aligned input data can have a greater chance of not stalling,
> >as it is possible to overlap vector loads/stores, and "permutes" with the
> >actual computations.
> >
> >   The Velocity SIMD FP unit is "four wide", the SSE unit according to
> >some reports is only two wide.  What this means is that Velocity can
> >finish off four FP ops in parallel on any given clock cycle.  As I
> >understand it, SSE can only retire two, implying that its SIMD FPU is
> >only "two wide", even though the actual registers are "four wide".  This
> >was discussed at some length on comp.sys.intel when P-III shipped.
> >   For FP algorithms that are memory limited, this *may or may not* make a
> >noticeable difference in speed, but it certainly affects the "peak GFLOP"
> >numbers - G4 can hit 4 GFLOPS peak at 500MHz, since it can finish four
> >mul-adds per cycle, and a mul-add counts as two ops.
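
The peak figure is simple multiplication: clock rate times lanes times ops per lane per cycle. A small C sketch of that arithmetic, with a made-up function name:

```c
#include <assert.h>

/* Peak GFLOPS = clock (MHz) * 1e6 * lanes * ops per lane per cycle / 1e9. */
static double peak_gflops(double clock_mhz, unsigned lanes,
                          unsigned ops_per_lane)
{
    assert(clock_mhz > 0.0);
    return clock_mhz * 1e6 * (double)lanes * (double)ops_per_lane / 1e9;
}
```

peak_gflops(500.0, 4, 2) gives 4.0: four lanes, with each mul-add counted as two ops, at 500 MHz.
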
> >
> >
> >   Some Velocity weaknesses:
> >
> >   Velocity loads and stores ONLY happen on 16-byte boundaries.  There is
> >no such thing as a misaligned load or store, nor even an exception if such
> >a load is attempted - the low 4 bits of the effective address are simply
> >masked off. Depending on your algorithm and your data set this can make
> >life tricky; the "permute" and "shift" operations are meant to make this
> >process simpler.  You basically have to fetch two aligned vectors and then
> >extract the elements you want - two loads and a "permute".  See
> ><http://developer.apple.com/hardware/altivec/algorithms.html> .
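
The "two loads and a permute" recipe can be sketched in portable C by treating memory as a byte array whose start is assumed to be 16-byte aligned. The function name and the offset-based interface are made up for illustration; real code would use the vector load and permute operations instead of memcpy and a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulated misaligned 16-byte load at byte offset addr:
   1. mask off the low 4 bits to get the aligned base,
   2. fetch the two aligned 16-byte blocks that straddle the data,
   3. extract 16 bytes starting at the misalignment (the "permute"). */
static void load_unaligned_16(const uint8_t *mem, size_t mem_size,
                              size_t addr, uint8_t out[16])
{
    size_t base  = addr & ~(size_t)15;  /* what the hardware load sees */
    size_t shift = addr & 15;           /* offset within the first block */
    uint8_t both[32];

    assert(base + 32 <= mem_size);      /* sketch always reads both blocks */
    memcpy(both,      mem + base,      16);  /* first aligned load  */
    memcpy(both + 16, mem + base + 16, 16);  /* second aligned load */

    for (size_t i = 0; i < 16; i++)
        out[i] = both[shift + i];
}
```

Note that this sketch always reads the second block, even when the address is already aligned; real code has to take care not to read past the end of a buffer in that case.
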
> >
> >   It's a really big register file - 32 registers, 16 bytes each. Context
> >switch times will certainly rise, although there is a VRsave register
> >which can help the OS figure out which Velocity regs are actually in use
> >by a particular process, in order to reduce the amount of memory traffic
> >at context switch time.
> >   However, assuming a rather high 1 kHz context switch rate, the extra
> >registers only add up to 1MB/sec of additional memory traffic in the worst
> >case - on a machine with several hundred MB/sec of main memory bandwidth,
> >I doubt anyone will notice.
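
The 1MB/sec worst case can be reproduced with a line of arithmetic. The C sketch below assumes each context switch writes out one full register set and reads in another, which is my reading of "worst case"; the function name is made up.

```c
#include <assert.h>

/* Worst-case extra memory traffic from saving and restoring the vector
   register file: one full save plus one full restore per switch. */
static unsigned long context_switch_bytes_per_sec(unsigned regs,
                                                  unsigned bytes_per_reg,
                                                  unsigned switches_per_sec)
{
    assert(regs > 0 && bytes_per_reg > 0);
    return 2ul * regs * bytes_per_reg * switches_per_sec;
}
```

context_switch_bytes_per_sec(32, 16, 1000) gives 1,024,000 bytes per second, i.e. about 1MB/sec.
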
> >
> >   It takes up a lot of chip area; some of the latest copper G3 chips are
> >down around 40-50 sq mm of area, the first G4's are about 85 sq mm.
> >
> >
> >   I probably missed a couple things. In general I feel that the Velocity
> >design, while it did take longer to get to market than MMX and SSE, is
> >easier to get started programming with, and offers some significantly
> >better functions and features at the machine level.  I didn't even touch
> >on some of the other G4 features like the autonomous data stream
> >prefetch/touch instructions, maybe another time.
> >
> >Rob
> 
> 
> --
> Rob Barris       Quicksilver Software Inc.      rbarris at quicksilver.com
> 
> 
> 
> dupswapdrop -- the music-dsp mailing list and website: subscription info,
> FAQ, source code archive, list archive, book reviews, dsp links
> http://shoko.calarts.edu/musicdsp/
> 



dupswapdrop -- the music-dsp mailing list and website: subscription info,
FAQ, source code archive, list archive, book reviews, dsp links
http://shoko.calarts.edu/musicdsp/



