[music-dsp] vectorizers
Christopher Weare
chriswea at microsoft.com
Thu Aug 2 17:12:35 EDT 2001
Yes, but not for floats.
-chris
-----Original Message-----
From: Ian Lewis [mailto:ILewis at acclaim.com]
Sent: Thursday, August 02, 2001 12:47 PM
To: 'music-dsp at shoko.calarts.edu'
Subject: RE: [music-dsp] vectorizers
Now for the $64,000 question: can Altivec do horizontal ops? That is,
can it take a vector (x,y,z,w) and perform x+y+z+w? Ian
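
Since the reply at the top of this thread says sum-across exists but not for floats, one common way to get x+y+z+w for floats is two rounds of "rotate the lanes, then add". The following is a minimal plain-C sketch of that idea, not AltiVec code: the vec4f type and the helper names are hypothetical stand-ins for a 4-lane float register and for the lane-wise add and lane-rotate steps.

```c
#include <stddef.h>

/* Plain-C model of a 4-lane float vector (hypothetical, not an AltiVec type). */
typedef struct { float lane[4]; } vec4f;

/* Rotate lanes left by k positions; models a whole-register shuffle. */
static vec4f rotate_lanes(vec4f v, size_t k)
{
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = v.lane[(i + k) % 4];
    return r;
}

/* Lane-wise add; models one ordinary "vertical" SIMD add. */
static vec4f add_lanes(vec4f a, vec4f b)
{
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Horizontal sum in two rotate+add steps. */
static float horizontal_sum(vec4f v)
{
    v = add_lanes(v, rotate_lanes(v, 2)); /* (x+z, y+w, z+x, w+y) */
    v = add_lanes(v, rotate_lanes(v, 1)); /* every lane: x+y+z+w  */
    return v.lane[0];
}
```

After the second step every lane holds the full sum, so the result can stay in a vector register for further lane-wise work.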
> -----Original Message-----
> From: Rob Barris [mailto:rbarris at quicksilver.com]
> Sent: Thursday, August 02, 2001 1:12 PM
> To: music-dsp at shoko.calarts.edu
> Subject: Re: [music-dsp] vectorizers
>
>
> At 10:25 AM -0600 8/2/01, Ian Lewis wrote:
> >Has anyone here programmed the Mac AltiVec units? Can you compare/contrast
> >them with SSE/SSE2, 3DNow, and "traditional" DSP processors? Ian
> >
>
>
> I wrote the following years ago. It predates SSE2. I like AltiVec
> (referred to as Velocity Engine here, in the Apple nomenclature) - I
> would welcome corrections or add'l information.
>
> It was before I got to actually write some AV code, so the example code
> is bogus - as it turns out AV can only do 8 byte multiplies in parallel,
> not 16 IIRC. FWIW - and don't forget www.altivec.org, they have a good
> mailing list.
>
>
>
> >
> > With the high level of noise on this NG about relative performance of
> >G4 vs G3, G4 vs x86, and AltiVec/Velocity vs MMX/SSE/3dNow, I thought some
> >might appreciate a programmer's view detailing specific strengths and
> >weaknesses of the G4 and the AltiVec/Velocity model.
> >
> > It seems to me that there's a lot of handwaving going on, and some of
> >the significant differences in design and philosophy get completely
> >overlooked. It's not surprising, given that so few actually have their
> >hands on the G4's yet, but maybe this will help. A lot of this is
> >straight off of Apple's web site
> > <http://developer.apple.com/hardware/altivec/summary.html>
> >and also from the Motorola documentation.
> >
> > Velocity adds 32 new registers to PowerPC. PPC already has 32 integer
> >(32-bit) and 32 FP (64-bit) registers; the 32 new Velocity registers are
> >128 bits wide.
> > In contrast with x86: MMX provides 8 64-bit integer-only SIMD
> >registers, and SSE (P-III) brings in 8 more 128-bit 4-element FP
> >registers.
> > Concretely, a 4x4 single-precision matrix fills four 128-bit registers,
> >so the Velocity register file is big enough to hold "eight" complete 4x4
> >matrices. SSE's FP register file can only hold "two".
> > Since you usually need a register or two free to hold a vector to be
> >modified by a matrix, this might be better expressed as "seven" and "one".
> >
> > Velocity is largely modeless. Any Velocity register can be interpreted
> >as holding 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4
> >32-bit (single precision) floating point values.
> > Further, instructions are available to efficiently convert between the
> >integer and FP domains, without having to store out to memory and back,
> >and can even do "float to fixed" conversion, where a FP number "x" is
> >conveniently converted into an integer of the form (x * 2^n).
> > In contrast with x86: MMX integer work has to be mode-switched with the
> >classic FP stack (by the programmer), and the 8 MMX registers are strictly
> >partitioned from the 8 SSE registers.
> > If you have an algorithm that needs LOTS of individual integer strings,
> >or LOTS of individual FP strings, Velocity excels because it can handle
> >either case.
> > Put another way, you can have the equivalent of the entire register set
> >of MMX and SSE inside the Velocity unit, with 16 more 128-bit registers to
> >spare.
> > Put yet another way, if you have an algorithm with between two and
> >seven active FP 4x4 matrices, it'll fit on Velocity. And if your results
> >wind up in the integer domain, you'll find that the conversion steps are
> >trivial and fast.
> >
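
The "float to fixed" conversion described above, an integer of the form (x * 2^n), can be sketched per element in portable C. This is a scalar model under stated assumptions rather than the actual vector instruction: the function names are hypothetical, and truncation toward zero plus saturation at the 32-bit limits are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert x to signed fixed point with n fractional bits: trunc(x * 2^n).
   Assumed behavior for this sketch: truncate toward zero, saturate at the
   int32 range, and map NaN to 0. */
static int32_t float_to_fixed(float x, int n)
{
    assert(n >= 0 && n < 32);
    double scaled = (double)x * (double)((uint64_t)1 << n);
    if (scaled != scaled)        return 0;          /* NaN */
    if (scaled >= 2147483647.0)  return INT32_MAX;  /* clamp high */
    if (scaled <= -2147483648.0) return INT32_MIN;  /* clamp low */
    return (int32_t)scaled;      /* C conversion truncates toward zero */
}

/* Inverse direction: q / 2^n as a float. */
static float fixed_to_float(int32_t q, int n)
{
    assert(n >= 0 && n < 32);
    return (float)((double)q / (double)((uint64_t)1 << n));
}
```

For example, 1.5 with n = 8 becomes 384 (1.5 * 256), which is the usual 24.8 fixed-point layout for audio or pixel work.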
> > Velocity largely follows the RISC model of "3-operand" instructions.
> >In general, given two source operands X and Y in registers, the result can
> >go anywhere (z = x+y).
> > In contrast with x86, virtually all MMX operations are of the
> >two-operand destructive form (x = x+y), often necessitating extra
> >moves/reloads on the part of the programmer, and more cycles spent by the
> >processor. I haven't done SSE programming yet, perhaps someone can
> >clarify whether it follows that model as well.
> >
> > Velocity includes virtually all of the convenient instructions you
> >would want, for dealing with signal processing, 3D math, graphics, etc.
> >Saturated[clamping] / unsaturated arithmetic, shift/rotate, fused
> >multiply-add, sum-across, average, min/max, round, AND/OR/XOR/NOR,
> >conditional-select (useful for avoiding branches in Velocity code).
> >
> > <http://developer.apple.com/hardware/altivec/highlights.html#Highlights>
> >
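
Two of the operations listed above, saturated arithmetic and conditional-select, are easy to model one element at a time in portable C. This is only a sketch of what each lane computes; the function names are hypothetical, and the real unit applies the same operation to every lane of a 128-bit register at once.

```c
#include <stdint.h>

/* Saturating unsigned 8-bit add: clamps at 255 instead of wrapping. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Unsaturated (modular) 8-bit add: wraps around past 255. */
static uint8_t add_mod_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Bitwise select: take the bit from b where mask is 1, else from a. */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Branch-free max: a compare yields an all-ones or all-zeros mask,
   and the select picks the winner without a conditional jump. */
static uint32_t max_u32(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(b > a); /* all ones if b > a */
    return select_bits(a, b, mask);
}
```

The compare-then-select pattern is what replaces an if/else in vector code, where different lanes may need different outcomes.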
> >
> > Velocity is directly programmable from the C language using the new
> >"vector" keyword, leaving the compiler free to assign registers, deal with
> >initializations etc., just as it does for conventional (non vector)
> >integer and FP work. A snippet of some Velocity code:
> >
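> > {
> >     /* AltiVec C vector literals are written as a cast plus a list. */
> >     vector unsigned char x =
> >         (vector unsigned char)( 0,0,0,1, 0,0,0,2, 0,0,0,3, 0,0,0,4 );
> >     vector unsigned char y =
> >         (vector unsigned char)( 0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15 );
> >     vector unsigned char sum, prod;
> >
> >     sum  = vec_add( x, y );   /* 16 byte-wise sums */
> >     /* vec_prod is a placeholder, not a real AltiVec operation: per the
> >        note at the top of this message, byte multiplies run 8 at a time. */
> >     prod = vec_prod( x, y );
> > }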
> >
> > Here, the compiler takes care of translating all of these operations
> >(initialization of the two 16-element byte vectors, the multiply operation
> >and the addition operation) into specific Velocity instructions. The
> >compiler knows what data type is involved, so it is able to choose the
> >right form of instruction (8-bit add, 8-bit mul) for each line of code.
> > This example is pretty brain dead: it just puts 16 individual byte sums
> >into "sum" and 16 individual byte products into "prod". Aside from the
> >setup/initialization, I would guess it would execute in somewhere between
> >2 and 4 clock cycles.
> >
> > More examples at
> > <http://developer.apple.com/hardware/altivec/examples.html>
> >
> > In contrast with x86: the vast majority of MMX/SSE code must be written
> >by hand in the native assembler form. Some compilers presently offer
> >3dNow! code generation for specific types of FP/matrix code; my opinion is
> >that this is not nearly as general a solution as the "vector" extension to
> >the C language used by Velocity.
> > The Velocity approach is to accept the notion that for best
> >performance, recoding is going to be necessary for key functions, and to
> >make that coding/debugging process as easy as possible. You can't grasp
> >how important this is until you've written some MMX assembler code for the
> >8-register model and its 2-operand format (I have).
> >
> > There has been some talk in this NG about Velocity support being as
> >easy as a checkbox in a compiler; it's not. I don't think there is yet a
> >compiler that will go in and auto-vectorize code. The good news is, it
> >doesn't take an assembler hacker or any tricky mode switches to start
> >taking advantage of Velocity from C.
> >
> >
> > Other Velocity strengths:
> >
> > Convenient instructions and data types for dealing with 1-5-5-5 RGB
> >pixel data. While none of the logical/arithmetic operations act directly
> >on 1-5-5-5 format data, there are "pack" and "unpack" instructions that
> >can convert 5-5-5 pixels to 8-8-8 format and back. It's very simple to do
> >all 8-8-8 format pixel calculations and then issue a single "pack"
> >operation before storing the finished pixels out to memory.
> >
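
The unpack / compute / pack workflow described above can be illustrated for a single pixel in portable C. This sketch shows the general idea only and is not bit-exact to the AltiVec instructions: the bit layout (1 alpha bit on top, then 5 bits each of R, G, B), the bit-replication scaling, and all names are assumptions chosen for the example.

```c
#include <stdint.h>

/* One pixel expanded to 8 bits per channel (hypothetical helper type). */
typedef struct { uint8_t a, r, g, b; } pixel8888;

/* Scale a 5-bit value to 8 bits by replicating its top bits (assumed). */
static uint8_t expand5(unsigned v)
{
    return (uint8_t)((v << 3) | (v >> 2));
}

/* Unpack a 1-5-5-5 pixel: bit 15 = alpha, then R, G, B at 5 bits each. */
static pixel8888 unpack_1555(uint16_t p)
{
    pixel8888 out;
    out.a = (p & 0x8000u) ? 0xFF : 0x00;
    out.r = expand5((p >> 10) & 0x1Fu);
    out.g = expand5((p >> 5) & 0x1Fu);
    out.b = expand5(p & 0x1Fu);
    return out;
}

/* Pack back to 1-5-5-5 by keeping the top bit(s) of each channel. */
static uint16_t pack_1555(pixel8888 p)
{
    return (uint16_t)(((unsigned)(p.a & 0x80u) << 8) |
                      ((unsigned)(p.r >> 3) << 10) |
                      ((unsigned)(p.g >> 3) << 5) |
                      (unsigned)(p.b >> 3));
}
```

With this scaling, unpacking and then packing a pixel returns the original 16-bit value, so all arithmetic can happen at 8 bits per channel in between.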
> > "Permute" - mix and match individual bytes from two source vectors,
> >into a single destination vector, in a single operation. The generality
> >of this operation is remarkable.
> >
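
A scalar model makes the permute operation above concrete: each byte of a third "control" vector picks one of the 32 bytes in the two sources. The sketch below is plain C, with a hypothetical vec16 type and function name, and models the selection rule as using the low 5 bits of each control byte as the index.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain-C model of a 16-byte vector register (hypothetical type). */
typedef struct { uint8_t byte[16]; } vec16;

/* For each result byte, the low 5 bits of the control byte index into
   the 32-byte concatenation of a (indices 0-15) and b (indices 16-31). */
static vec16 permute(vec16 a, vec16 b, vec16 control)
{
    vec16 r;
    for (size_t i = 0; i < 16; ++i) {
        unsigned idx = control.byte[i] & 0x1Fu;
        r.byte[i] = (idx < 16) ? a.byte[idx] : b.byte[idx - 16];
    }
    return r;
}
```

Byte reversal, interleaving two streams, broadcasting one element, and table lookups are all just different control vectors fed to this one operation.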
> > Velocity as presently implemented has two pipelines, one for the
> >arithmetic/logical/FP operations, and one for permute/shift/move type
> >operations. It's sometimes the case that an operation that would seem to
> >belong in one class can be re-expressed as an operation of the other
> >class, allowing for more overlap of computation.
> > More generally speaking, this split means that algorithms operating on
> >arbitrarily-aligned input data can have a greater chance of not stalling,
> >as it is possible to overlap vector loads/stores, and "permutes" with the
> >actual computations.
> >
> > The Velocity SIMD FP unit is "four wide", the SSE unit according to
> >some reports is only two wide. What this means is that Velocity can
> >finish off four FP ops in parallel on any given clock cycle. As I
> >understand it, SSE can only retire two, implying that its SIMD FPU is
> >only "two wide", even though the actual registers are "four wide". This
> >was discussed at some length on comp.sys.intel when P-III shipped.
> > For FP algorithms that are memory limited, this *may or may not* make a
> >noticeable difference in speed, but it certainly affects the "peak GFLOP"
> >numbers - G4 can hit 4 GFLOPS peak at 500MHz, since it can finish four
> >mul-adds per cycle, and a mul-add counts as two ops.
> >
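
The peak figure quoted above is a simple product: lanes per vector, times FP operations counted per lane per cycle, times the clock rate. A tiny C helper (hypothetical name) spells out the arithmetic; it describes a theoretical ceiling only, not a measured result.

```c
/* Peak GFLOPS = lanes * ops per lane per cycle * clock in MHz / 1000.
   A fused multiply-add is counted as two operations. */
static double peak_gflops(unsigned lanes, unsigned ops_per_lane,
                          double clock_mhz)
{
    return (double)lanes * (double)ops_per_lane * clock_mhz / 1000.0;
}
```

With 4 lanes, 2 ops per mul-add and 500 MHz, this gives 4 * 2 * 500 / 1000 = 4 GFLOPS, matching the number in the text.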
> >
> > Some Velocity weaknesses:
> >
> > Velocity loads and stores ONLY happen on 16-byte boundaries. There is
> >no such thing as a misaligned load or store, nor even an exception if such
> >a load is attempted - the low 4 bits of the effective address are simply
> >masked off. Depending on your algorithm and your data set this can make
> >life tricky; the "permute" and "shift" operations are meant to make this
> >process simpler. You basically have to fetch two aligned vectors and then
> >extract the elements you want - two loads and a "permute". See
> ><http://developer.apple.com/hardware/altivec/algorithms.html> .
> >
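
The "two loads and a permute" recipe above can be modeled in portable C using byte offsets into a buffer. This is a sketch with hypothetical names: load_aligned imitates a load that silently masks off the low 4 bits of the address, and load_unaligned rebuilds a misaligned 16-byte window from two such loads. It assumes the buffer covers every aligned 16-byte block that gets touched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C model of a 16-byte vector register (hypothetical type). */
typedef struct { uint8_t byte[16]; } vec16;

/* Aligned-only load: the low 4 bits of the offset are simply ignored. */
static vec16 load_aligned(const uint8_t *base, size_t offset)
{
    vec16 v;
    memcpy(v.byte, base + (offset & ~(size_t)15), 16);
    return v;
}

/* Misaligned load built from two aligned loads plus a byte extraction.
   Loading at offset + 15 makes the second load hit the same block as the
   first when the offset is already aligned, so nothing extra is read. */
static vec16 load_unaligned(const uint8_t *base, size_t offset)
{
    vec16 lo = load_aligned(base, offset);
    vec16 hi = load_aligned(base, offset + 15);
    size_t shift = offset & 15;
    vec16 r;
    for (size_t i = 0; i < 16; ++i) {   /* the "permute" step */
        size_t idx = shift + i;
        r.byte[i] = (idx < 16) ? lo.byte[idx] : hi.byte[idx - 16];
    }
    return r;
}
```

In a streaming loop the "hi" block of one iteration becomes the "lo" block of the next, so the steady-state cost is one load and one permute per 16 bytes.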
> > It's a really big register file - 32 registers, 16 bytes each. Context
> >switch times will certainly rise, although there is a VRsave register
> >which can help the OS figure out which Velocity regs are actually in use
> >by a particular process, in order to reduce the amount of memory traffic
> >at context switch time.
> > However, assuming a rather high 1 kHz context switch rate, the extra
> >registers only add up to 1MB/sec of additional memory traffic in the worst
> >case - on a machine with several hundred MB/sec of main memory bandwidth,
> >I doubt anyone will notice.
> >
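
The 1MB/sec worst-case estimate above can be reproduced with a few multiplications. The helper below uses a hypothetical name, and it assumes the figure counts both saving the outgoing process's vector registers and restoring the incoming one's on every switch.

```c
#include <stdint.h>

/* Worst-case extra memory traffic from saving and restoring the whole
   vector register file on every context switch (assumption: one full
   save plus one full restore per switch). */
static uint64_t vector_context_bytes_per_sec(unsigned regs,
                                             unsigned bytes_per_reg,
                                             unsigned switches_per_sec)
{
    uint64_t per_switch = 2u * (uint64_t)regs * bytes_per_reg;
    return per_switch * switches_per_sec;
}
```

For 32 registers of 16 bytes at 1000 switches per second this is 2 * 512 * 1000 = 1,024,000 bytes per second, roughly 1 MB/sec.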
> > It takes up a lot of chip area: some of the latest copper G3 chips are
> >down around 40-50 sq mm of area, the first G4's are about 85 sq mm.
> >
> >
> > I probably missed a couple things. In general I feel that the Velocity
> >design, while it did take longer to get to market than MMX and SSE, is
> >easier to get started programming with, and offers some significantly
> >better functions and features at the machine level. I didn't even touch
> >on some of the other G4 features like the autonomous data stream
> >prefetch/touch instructions, maybe another time.
> >
> >Rob
>
>
> --
> Rob Barris Quicksilver Software Inc.
> rbarris at quicksilver.com
>
>
>
> dupswapdrop -- the music-dsp mailing list and website: subscription info,
> FAQ, source code archive, list archive, book reviews, dsp links
> http://shoko.calarts.edu/musicdsp/
>