SSE4, future processors and GPGPU thoughts

santyhammer · Nov 22, 2006

Intel released a PDF some days ago with the future SSE4 instructions:

http://www.intel.com/technology/architecture/new_instructions.htm
ftp://download.intel.com/technology/architecture/new-instructions-paper.pdf

Notice it includes a 1-cycle SIMD dot product (and also an interesting instruction for regex-style string manipulation, which antivirus software and sprintf/sscanf could use very well). Oh, btw, there is no CPU supporting that atm... People think SSE4 is already present in the Core 2 Duo, but nope... what the C2D has is SSSE3, a sort of SSE3.1...

And I wonder... when did CPU design go so wrong that it lost the battle against the GPUs, PPUs, etc.?
Why didn't Intel and AMD listen to us and put 1-cycle DOT, MAD, ADD/SUB/MUL/DIV, SQRT, LOG and EXP shader instructions inside the CPU to create a REAL and USEFUL SSE implementation?
Why is SSE, in general, so poor, badly designed and slow? I need like 4 shuffles, 2 adds and 2 muls to perform a simple dot product... With SSE3 it is a bit better (mul and hadd) but still bad... Oh, come on... See the Xenos/Cell VMX128 for example... it is much better... http://arstechnica.com/articles/culture/mattlee.ars/3
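
For readers who want to see the two reductions side by side, here is a rough lane-by-lane model in Python rather than real intrinsics. The helper names (`mulps`, `addps`, `haddps`, `shuffle`) only mimic the SSE/SSE3 mnemonics; they are plain-Python stand-ins, and the shuffle-based ordering is just one possible sequence, so the exact instruction count depends on how the data is laid out.

```python
# Plain-Python stand-ins for 4-lane SSE registers; these are NOT real intrinsics.

def mulps(a, b):
    """Lane-wise multiply, modelled on SSE MULPS."""
    return [x * y for x, y in zip(a, b)]

def addps(a, b):
    """Lane-wise add, modelled on SSE ADDPS."""
    return [x + y for x, y in zip(a, b)]

def haddps(a, b):
    """Horizontal add, modelled on SSE3 HADDPS: adjacent lanes are summed."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def shuffle(a, order):
    """Lane permutation of a single register (a simplified SHUFPS)."""
    return [a[i] for i in order]

def dot4_shuffle(a, b):
    """SSE1-style: one multiply, then shuffle/add pairs to reduce the lanes."""
    p = mulps(a, b)
    s = addps(p, shuffle(p, (1, 0, 3, 2)))  # lanes: p0+p1, p0+p1, p2+p3, p2+p3
    s = addps(s, shuffle(s, (2, 3, 0, 1)))  # every lane now holds the full sum
    return s[0]

def dot4_hadd(a, b):
    """SSE3-style: one multiply followed by two horizontal adds."""
    p = mulps(a, b)
    h = haddps(p, p)
    h = haddps(h, h)
    return h[0]
```

Both functions compute the same value; the difference is only in how many reduction steps the result has to pass through before it lands in a single lane.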

Intel/AMD could perfectly well do this and recover part of the power they lost to GPUs...

Now we can use CUDA and the ATI GPGPU SDKs... But I think the future (with Fusion kicking hard) is to integrate these 1-cycle CISC shader instructions inside the CPU and forget about any GPU, PPU or NPU. And, seriously, I want somebody to tell me why the CPU companies didn't design SIMD well from the start, so they could have avoided the growth of the GPU and kept all the calculations inside the CPU.

So pls, if anybody at Intel is listening... rethink that SSE4 and ADD all the DX10 shader instructions to it.

Techno+ · Nov 22, 2006

hey hammer read this

http://arstechnica.com/news.ars/post/20061119-8250.html

the last part says 'Both of these steps pave the way for the introduction of GPU-specific extensions to the x86 ISA, extensions that eventually will probably be modeled to some degree on the ISA for the existing ATI hardware. These extensions will start life as a handful of instructions that help keep the CPU and GPU in sync and aware of each other as they share a common socket, frontside bus, and memory controller. A later, stream-processor-oriented extension could turn x86 into a full-blown GPU ISA.'

Now u know what will happen. IMO the CPU will eventually have instructions similar to those of the GPU, just like what happened with the FPU. I don't really know what he meant by that phrase about the x86 stream-oriented GPU ISA, though: does he mean better communication between the CPU and GPU, or integrating GPU-like instructions into the CPU?

Arun · Nov 22, 2006

I think you are MASSIVELY misunderstanding why GPUs are good at math - and why they'll get even better at it than CPUs could ever dream of. Dot products and SF are only a small part of that; the icing on the cake, if you wish.

Furthermore, I think you'll agree it's ironic that you're taking dot products so seriously here, because both NVIDIA and Intel (in the G965) have already gotten rid of it for this generation. The trend is towards GPUs becoming completely scalar (for math operations, at least!), and ATI will follow up sooner or later. This makes sense because the ALUs can remain SIMD internally anyway; they just process 16 pixels/vertices per ALU, with the same instruction. (32 threads with the same instruction for pixels, actually!)

GPUs are also moving towards out-of-order processing, although in a less sophisticated way than CPUs, and with basically no runtime overhead. The first scheme used is out-of-order processing of threads to hide latency; you can think of this as a massively beefed-up HyperThreading scheme, although with hundreds of cores and thousands of threads!

Secondly, independent instructions are detected by the compiler and may be run simultaneously up to a certain extent. This is done to further hide latency, and to minimize the costs of a scalar architecture in real-world situations. Think of a Vec4 MUL in SSE and decompose it into 4 scalar instructions; each of these is independent, so in order to minimize register usage and maximize latency tolerance, you might as well issue four instructions of the same thread rather than one instruction each for four threads. It is unknown at this point whether the G80 implements this or not, though; NVIDIA has a number of patents on the subject, so if it doesn't, I'm sure future generation GPUs will - and maybe Intel's G965 or ATI's future R600 do/will implement it.

As for special function operations such as log/exp/sin/cos/etc. - modern GPUs are indeed terrifyingly powerful at those. If you look at G80, you got 128 scalar MADDs at 1.35GHz, and 128 attribute interpolators. The attribute interpolators are also used for special function computation (there are a number of NVIDIA papers on the subject by Stuart Oberman, who used to be the lead 3DNow! and FPU guy at AMD, afaik), and for several reasons I won't get into here, you need 4 SF units per SF function, and you also mobilize the MADD unit to put the values in range.

So if you used the SF (sin/cos/log/exp/etc.) as much as you could, you'd have 32 scalar SF operations per cycle and 96 scalar MADDs per cycle, at 1.35GHz. Certainly, CPUs can't come close to that, but it's also questionable they should have so much SF power (for scientific applications, you'd want FP64 which the GPU can't support, and in other cases, that's massive overkill in terms of SF - the GPU only has that much of it because it's shared with the attribute interpolation unit and that rcp is used for perspective correction...)
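
The split quoted above can be reproduced with a few lines of arithmetic. This is only a sketch of the poster's reasoning, not a hardware spec: the unit counts and clock come from the post, while the assumption that each in-flight SF operation ties up exactly one MADD unit for range reduction is a reading of the "32 SF / 96 MADD" figures.

```python
# Figures quoted in the post; the split below is an interpretation, not a spec.
MADD_UNITS = 128          # scalar MADD units on G80
INTERPOLATORS = 128       # attribute interpolators, shared with SF work
SF_UNITS_PER_OP = 4       # interpolator units needed per SF operation
CLOCK_GHZ = 1.35          # shader clock

# Each SF operation occupies 4 interpolator units.
sf_ops_per_cycle = INTERPOLATORS // SF_UNITS_PER_OP

# Assumption: one MADD unit per SF operation is busy with range reduction.
free_madds_per_cycle = MADD_UNITS - sf_ops_per_cycle

# Per-second rates; a MADD counts as 2 floating-point operations.
sf_gops = sf_ops_per_cycle * CLOCK_GHZ
madd_gflops = free_madds_per_cycle * 2 * CLOCK_GHZ

print(sf_ops_per_cycle, free_madds_per_cycle)
```

Under these assumptions the peak works out to roughly 43 billion SF operations per second alongside roughly 259 GFLOPS of remaining MADD throughput.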

And I very much doubt AMD will do it for Fusion. I'm not sure if you saw the AMD-released diagrams, but Fusion is basically the CPU and the GPU on a single die. They remain completely distinct parts. Perhaps Fusion2 or some such will be quite different, but we aren't there yet; that's the Intel Gesher timeframe, which we know basically nothing about yet, anyway.

Uttar
P.S.: Techno+: That arstechnica article is pretty good, but you need to take into consideration that they're talking about products that possibly aren't even on the drawing board yet. Such a thing would be slated for the 2011+ timeframe, and as I said, you don't know what Intel is planning then either. And both Intel and AMD could fuck up completely and have a laughable architecture as a result of such an ambition. How often did you see 3D programmers liking MMX or 3DNow! when they came out? Did they ever use it for serious graphics programming in shipping games? (hint: never)

Techno+ · Nov 22, 2006

thanx for the hint lol, and i was just speculating (my best hobby). If u remember, programs like iTunes were programmed to use the G5's VMX unit, so i think if this kind of trend is expanded, programmers might, and i mean only might, agree, but u know more than me.

Arun · Nov 22, 2006

Techno+ said:
thanx for the hint lol, and i was just speculating(my best hobby).

Heh, speculation is fun, don't get me wrong

I used to do my fair share of ridiculous fan-boyish speculation in the NVIDIA NV30 days. After a while, you learn to be a bit less overly confident in your speculation, and to consider less extreme outcomes. In the end, it's all about looking at the big picture and not taking partial snapshots.

For example, how can anyone honestly consider speculating on NVIDIA's and AMD's handheld initiatives without analyzing the market in-depth? That's literally dozens of companies you need to look at, and figure out their business model, their customers and their development processes. And the reason I'm saying this is not that I actually did that analysis - rather that I decided not to focus too much on it yet, given how much effort I realized it was going to take.

If u remember, programs like iTunes were programmed to use the G5's VMX unit, so i think if this kind of trend is expanded, programmers might, and i mean only might, agree, but u know more than me.

Right, I'm not sure exactly what iTunes was doing with VMX. I think it was basically doing operations like a DSP would do for sound processing, so obviously that's a kind of application that benefits significantly from something like VMX.
The point isn't that SSE isn't capable of some pretty good acceleration, and often without out-of-this-world efforts. Even MMX and 3DNow! had some interesting uses if you decided it was worth the trouble to implement. The point just is that all these instruction set extensions are not, at all, delivering what graphics need for proper high-quality acceleration.

The one thing SSE is pretty good at in graphics is DirectX9 Vertex Shading. No texture fetches and plenty of Vec4 operations. Although even then, it could be slightly more optimal if it supported dot products or, better, scalar operations. Still, it ain't too bad in terms of raw performance; but if you looked at it from the point of view of performance/watt, it's likely that the upcoming unified DX10 low-end IGPs will match a Conroe in terms of vertex shading, and yet they take an order of magnitude less power!

Anyway, this isn't the biggest problem with CPUs "replacing" GPUs. The DX10-level programmable shader cores have pretty damn huge (and attainable) gigaflop ratings, but if you used a CPU to do modern graphics operations, its inefficiency would be even more pronounced in texture filtering. The G80 can do 64 bilinear filtering operations per clock on INT8 inputs (but the outputs are ~FP32). Each filtering operation consists of averaging 4 Vec4 samples, with a given weight for each sample.

One way to calculate this is with 4 Vec4 dot products, which corresponds to 28 flops if your architecture is ideally suited to dot products. As such, the filtering alone on G80 corresponds to 64*28*575MFlops, or one teraflop, if my estimate is accurate. And here we aren't even counting things like the attribute interpolation unit, triangle setup, rasterization, blending, etc. - the list goes on. And in terms of texture filtering, you should also count texture addressing, which is not exactly a simple task either (although it's more about actual unit complexity than mere gigaflops...)
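
The estimate above is easy to check by hand. The sketch below assumes a Vec4 dot product costs 4 multiplies plus 3 adds (7 flops), which is what makes 4 dot products come to 28 flops; the 64 operations per clock and the 575 MHz clock are the figures from the post.

```python
# Figures from the post; flops-per-dot-product is the usual 4 mul + 3 add count.
FILTER_OPS_PER_CLOCK = 64       # bilinear filtering operations per clock
CLOCK_HZ = 575_000_000          # 575 MHz clock used in the estimate
FLOPS_PER_DOT4 = 4 + 3          # 4 multiplies and 3 adds
DOTS_PER_FILTER = 4             # one weighted sum per colour channel

flops_per_filter = DOTS_PER_FILTER * FLOPS_PER_DOT4
filter_flops = FILTER_OPS_PER_CLOCK * flops_per_filter * CLOCK_HZ

print(flops_per_filter)          # 28
print(filter_flops / 1e12)       # about 1.03 teraflops
```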

Could a CPU add specialized units for some of these functions? Yes. Would it be a good design choice? Probably not, as they'd be idling when the processor does something other than graphics-heavy work. And when it is doing graphics-heavy work, there might not be enough of them. The latter situation makes it acceptable for an IGP replacement perhaps, but anything above the $50 food chain is unlikely to be affected by actual unification of both CPU and GPU functionality.

Could a more exotic architecture exist that balances the design constraints of both schemes perfectly? Maybe, but I don't think it will be the case as long as the industry remains focused on rasterization. And I doubt we'll switch away from that anytime soon - furthermore, I don't think a CPU is an ideal processor for it, either. Future GPUs will, ironically, probably be better at both rasterization and raytracing than future CPUs (I'm thinking specifically of the ~2012 timeframe here, for what it's worth...)

Uttar

3dilettante · Nov 22, 2006

santyhammer said:
notice it includes a 1-cycle SIMD dot product (and also an interesting instruction for regex-style string manipulation, which antivirus software could use very well). Oh, btw, there is no CPU supporting that atm... People think SSE4 is already present in the Core 2 Duo, but nope... what the C2D has is SSSE3, a sort of SSE3.1...

I think that's a one-cycle throughput dot product, in that the CPU can issue a dot product every cycle.

Each instruction would still take multiple clocks. For most CPUs, a multiply takes at least 3 cycles for integer math, and probably closer to 5 for floating point. A dot product would take longer than that to propagate the results.

To do better than that would take a lot more hardware, or a slower clock cycle. GPUs do both, and even they have operations that take multiple cycles.
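
The distinction between one-cycle throughput and one-cycle latency can be made concrete with a toy pipeline model. This is a minimal sketch: the function names are made up for illustration, and the 10-cycle dot-product latency used in the example is a hypothetical figure, not a number from Intel's paper.

```python
def cycles_independent(n_ops, latency, issue_interval=1):
    """Cycles to finish n independent ops on a fully pipelined unit.

    A new op starts every `issue_interval` cycles, and the last one
    still needs `latency` cycles to produce its result.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency

def cycles_dependent(n_ops, latency):
    """Cycles when each op needs the previous result, so nothing overlaps."""
    return n_ops * latency

# Hypothetical 10-cycle dot product with one-per-cycle issue.
print(cycles_independent(100, latency=10))  # 109
print(cycles_dependent(100, latency=10))    # 1000
```

With plenty of independent work the unit looks like it delivers one result per cycle; a chain of dependent dot products pays the full latency every time.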

And I wonder... when did CPU design go so wrong that it lost the battle against the GPUs, PPUs, etc.?
Why didn't Intel and AMD listen to us and put 1-cycle DOT, MAD, ADD/SUB/MUL/DIV, SQRT, LOG and EXP shader instructions inside the CPU to create a REAL and USEFUL SSE implementation?

No operation is free; it either has a cost in silicon, power, or time. Most of the code CPUs need to be able to run fast never needs a dot product. Most of the transistors that would go into a dedicated high-speed DOT unit would go unused 99.9% of the time, and then people would complain why their 99.9% of integer code is running 10% slower than it could be.

DIV is a fine example of this. There are ways of implementing DIV that are "faster", but only if you look at the DIV hardware in isolation. Since everything in a CPU is connected, a super-ALU would either take up room that could be used for increasing performance on a lot more code, or it could drag down the clock rate. A slower clock rate would mean reduced performance 100% of the time for the sake of one instruction out of a billion.

GPUs can have all this extra math hardware because they know they're going to use it.

Why is SSE, in general, so poor, badly designed and slow? I need like 4 shuffles, 2 adds and 2 muls to perform a simple dot product... With SSE3 it is a bit better (mul and hadd) but still bad... Oh, come on... See the Xenos/Cell VMX128 for example... it is much better...

It's difficult to paste vector-type performance onto an architecture that isn't designed for it, and even more difficult to make it happen when the pre-existing software base needs to be convinced to change.

Incremental changes are needed because big changes wind up being unused. Each SSE op that has been added has been evaluated as being likely to be used. It's a guess on the future, one the designers don't have full control over. Look at MMX as an example of educated guesses that didn't make a huge splash.

Cell and Xenos don't have that problem. They have their own platform; one that didn't exist earlier.

That's why GPUs can kick any CPU in math operations... Notice I am not talking about pointers, etc... Intel/AMD could perfectly well do this and recover part of the power they lost to GPUs...

Not without trade-offs. Silicon is inflexible, so things only go into a design that have a good chance of being used. CPU software space is so broad that they can't afford to specialize too much.

As people adjust to the changes, perhaps the software ecosystem will change as well, but that is very slow.

And, seriously, I want somebody to tell me why the CPU companies didn't design SIMD well from the start, so they could have avoided the growth of the GPU and kept all the calculations inside the CPU.

When the first CPUs came out, SIMD didn't really exist. Neither did standardized floating point, for that matter. 3d computer graphics didn't really exist either.

The transistor densities needed for 3d computer graphics didn't exist until the 90s.
Until it became feasible to integrate the functions, and the massive amounts of memory and bandwidth became economical, there was no point to putting the new functions in.

CouldntResist · Nov 22, 2006

santyhammer said:
Why is SSE, in general, so poor, badly designed and slow? I need like 4 shuffles, 2 adds and 2 muls to perform a simple dot product... With SSE3 it is a bit better (mul and hadd) but still bad... Oh, come on... See the Xenos/Cell VMX128 for example... it is much better... http://arstechnica.com/articles/culture/mattlee.ars/3

The Cell SPU doesn't have a dot-product instruction, nor even a horizontal FP add.

I guess Intel's SSE enhancements are a necessity: when you only have 8 registers, it's kind of hard to process 4 stream elements in scalar code (the style recommended by Intel itself), rather than 1 stream element with vector code.
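
To illustrate the "4 stream elements in scalar code" style: each register holds the same component of four different vectors (structure-of-arrays), so four dot products need no shuffles at all, but all eight inputs want to be live at once. The Python below is a plain model with hypothetical helper names, not SSE code; each list stands in for one 4-lane register.

```python
def mul(p, q):
    """Lane-wise multiply of two 4-lane 'registers'."""
    return [s * t for s, t in zip(p, q)]

def add(p, q):
    """Lane-wise add of two 4-lane 'registers'."""
    return [s + t for s, t in zip(p, q)]

def to_soa(vectors):
    """Transpose four (x, y, z, w) vectors into x, y, z and w lane lists."""
    return [list(component) for component in zip(*vectors)]

def dot4_soa(ax, ay, az, aw, bx, by, bz, bw):
    """Four dot products at once: 4 multiplies, 3 adds, zero shuffles.

    Note the eight inputs: that is already every XMM register on 32-bit x86.
    """
    return add(add(mul(ax, bx), mul(ay, by)), add(mul(az, bz), mul(aw, bw)))

a = [(1, 2, 3, 4), (0, 1, 0, 0), (2, 2, 2, 2), (1, 0, 0, 1)]
b = [(5, 6, 7, 8), (9, 9, 9, 9), (1, 2, 3, 4), (3, 0, 0, 4)]
dots = dot4_soa(*to_soa(a), *to_soa(b))
print(dots)  # [70, 9, 20, 7]
```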

Bob · Nov 22, 2006

Uttar said:
The G80 can do 64 bilinear filtering operations per second on INT8 inputs

I sure hope G80 runs faster than that! (575 million times faster, even).

Arun · Nov 22, 2006

Bob said:
I sure hope G80 runs faster than that! (575 million times faster, even).

Are you sure? I mean, that order of magnitude makes more sense, given the 11ms fixed branch cost...

But good catch - thankies, and I corrected the post.

Uttar

santyhammer · Nov 22, 2006

Techno+ said:
hey hammer read this

http://arstechnica.com/news.ars/post/20061119-8250.html

Very good! See, ATI/AMD is innovating and putting those instructions in properly. Intel should follow that too.

3dilettante said:
When the first CPUs came out, SIMD didn't really exist. Neither did standardized floating point, for that matter. 3d computer graphics didn't really exist either.

Why don't they wipe all the absurd and slow instructions from x86 (like the horrible 387 instructions and the old 3DNow!/SSE) to save silicon? They could perfectly well do it in the x86 to x64 transition and nobody would complain... after all, you need to recompile to get good x64 applications (ok, x64 has an x86 compatibility mode, but it is going to be deprecated in 2 years...). I think it would be a good moment to get rid of all the old x86 instructions.

Also I think the 900M-transistor CPUs are a joke... Expensive, hard to design... Why not make a motherboard with 200 ZIF sockets instead and plug small coprocessors into it? I think I saw some presentation about this (Torrenza?, CellGrid?).

The CPUs must change or they will completely disappear, eaten by the GPUs, PPUs, NPUs and XXUs... I can see my new ZIP hw card to accelerate WinZip... I want everything integrated inside the CPU like in the good old times, and not to spend $1000 on separate gadgets like the Ageia, the KillerNIC, the GeForce/Radeon or the Pure raytracing card.

And another thing I can't understand... GPUs running C programs with CUDA/CTM? Absurd. So the CPU can't be good at graphics but the GPU can be extremely good at graphics + computation + conditionals? Then why do I need a CPU, only to do one or two pointer operations or recursive stack operations?

I don't like the future! I still think SSE4 is gonna be bad like all the previous ones. I still think SSE4 should give us the full DX10 shader instructions.

3dilettante · Nov 22, 2006

santyhammer said:
I still think SSE4 is gonna be bad like all the previous ones. I still think SSE4 should give us the full DX10 shader instructions.

SSE2 isn't so bad if you count the scalar FP instructions. They're trying to move to those instead of x87.

Why don't they wipe all the absurd and slow instructions from x86 (like the horrible 387 instructions and the old 3DNow!/SSE) to save silicon? They could perfectly well do it in the x86 to x64 transition and nobody would complain... after all, you need to recompile to get good x64 applications (ok, x64 has an x86 compatibility mode, but it is going to be deprecated in 2 years...). I think it would be a good moment to get rid of all the old x86 instructions.

Number of programs that have some of those old x86 instructions=billions.
Number of old programs needed by institutions that either don't want to recompile, can't afford to, or have lost the source code=thousands.
Number of programs that would be in the new slim x86 ISA=0.

Backwards compatibility is one of the biggest reasons why x86 has survived this long. Any new chip would be basically competing with x86 all over again, only it'd still have all the odd instruction semantics and weird formats.

Also I think the 900M-transistor CPUs are a joke... Expensive, hard to design... Why not make a motherboard with 200 ZIF sockets instead and plug small coprocessors into it? I think I saw some presentation about this (Torrenza?, CellGrid?).

That's a lot of sockets.
Transistors are cheap; the rest (pins, packaging, sockets, board routing, mounting hardware) is much more expensive.

A chip 1/200th the size of a full processor will not cost 1/200th what it cost to manufacture the single large one.

And another thing I can't understand... GPUs running C programs with CUDA/CTM? Absurd. So the CPU can't be good at graphics but the GPU can be extremely good at graphics + computation + conditionals? Then why do I need a CPU, only to do one or two pointer operations or recursive stack operations? I don't like the future, especially if I were Intel...

You can make a computerized coffee pot's microprocessor run a C program; that doesn't mean anything.
It's not like there aren't plenty of other CPU tasks that go into setting up the program, the OS, and system that the GPU works within.

Arun · Nov 22, 2006

santyhammer said:
And another thing I can't understand...

If you can't understand anything, there's not much I can do about it. The above posts are relatively extensive, and you seem to basically just be ignoring everything said in them, possibly because they don't fit your current vision of reality.

We're not paid to explain this stuff to you, or at least I'm not! There's zero problem with proposing some ideas and speculating about future trends, and that can be interesting, but if your response to our explanations is that you repeat exactly what you initially said without taking into consideration what we posted, that's not even worth the reading time anymore. Please don't think you're right on everything, as this is just plain annoying - nobody can possibly be right on everything, and from my point of view, you're horribly wrong on most things.

So if you want to have any worthwhile discussion going on here, I suggest you drop that reply style and/or that overly confident attitude...

Uttar

santyhammer · Nov 23, 2006

3dilettante said:
SSE2 isn't so bad if you count the scalar FP instructions. They're trying to move to those instead of x87.

They are moving, yep... but not enough. The dot product and the regex instruction are ok, but insufficient.

3dilettante said:
Number of programs that have some of those old x86 instructions=billions.

I heard that argument some time ago about Win16 applications. How many programs are you running in 16-bit mode atm?

3dilettante said:
Backwards compatibility is one of the biggest reasons why x86 has survived this long.

Yes, like EDO RAM. Patches, patches and more patches for nothing. Then SDRAM came, and where is EDO RAM now? Sometimes it is better to say "stop this, let's redesign from zero" than to keep improving and patching an obsolete design.

3dilettante said:
That's a lot of sockets.
Transistors are cheap, the rest (pins, packaging sockets, board routing, mounting hardware) is much more expensive.

Everything is cheap, really. I think what is expensive is the development cost. Developing less complicated CPUs would help. That's why NVIDIA put the NVIO chip OUTSIDE the G80 die... that's the policy.

Uttar said:
I think you'll agree it's ironic that you're taking dot products so seriously here, because both NVIDIA and Intel (in the G965) have already gotten rid of it for this generation

???? I still see DOT3/4 in OpenGL and D3D. The dot product is the basis of vertex transformations with matrices, lighting, BSP, backface culling, etc... Sorry, I don't understand what you mean. The dot product is vital for me.

Uttar said:
If you can't understand anything, there's not much I can do about it...

We're not paid to explain this stuff to you, or at least I'm not!

Please don't think you're right on everything...

So if you want to have any worthwhile discussion going on here, I suggest you drop that reply style and/or that overly confident attitude.

I know, but I started this thread to complain, not to explain or to debate.
My intention was that if we all complain, perhaps some engineer could think about it, instead of just improving nm or MHz or putting 900000 cores on the silicon die.

So, basically your answers are always welcome... but all I want is to complain, complain and complain so they can see the job they are doing is severely lacking.

I complained, complained and complained and, after 5 years, they put in the @~~#~@#1%%% dot product... why? Because somebody complained. The question is why they didn't ask the developers "what instructions do you want?" ... no... they just added the instructions THEY thought were good... And that's why I complain, because they don't listen at all.

And of course I am not always right... but that doesn't matter. All I'm saying is that CPUs get worse and worse with time... They are faster, but worse (larger, more complicated, more power consumption, too much cost, etc.).

Also, sorry if my English appears rude, but I'm not a native English speaker and sometimes I can't find the exact words I want. In my country we have a proverb, "El que no llora, no mama" (something like "if you don't cry, you won't get your milk"), and that's exactly what I want.

And to finish, another thought... CPUs currently lack innovation. They just improve the silicon integration: more transistors, more MHz, more power and more cost. Personally I don't think that's the way to go, with 1200W power supplies to run an 8-core CPU which has 1/900 of the mathematical power of a GPU. They should look at a processor like the Cell, where you can join zillions of them using the parallel "pins" to produce, for example, the BlueGene. Or look at AMD with Fusion. Or use carbon nanotubes to cool it. Or NVIDIA/ATI who, in five years, multiplied the FPU power 100x and took the graphics computations out of the CPU. That's innovation. Including more cores, improving MHz and going to fewer nanometers is not real innovation.

We are reaching the limits of electronics and we need new design/conceptual solutions, not to bump the MHz and say "here is the Pentium18". NO thx.

---
To return to the original topic, because the thread is degenerating a lot with all those GPU explanations and comparisons... The future SSE4 is gonna suck. People will use their $1200 8-core quad-SLI GeForce8/R600 with CUDA/CTM/GPGPU to do their SIMD calculations... and Intel will spend billions of dollars developing ANOTHER SSE set that nobody will use... I will see Folding@Home clusters with GPUs instead of CPUs. I will see a Windows calculator using OpenGL to calculate a SQRT. Congratulations.

Wanna avoid that?

1) Improve SSE4 with a decent SIMD set (something like the DX10 shader instructions + video + that nice regex instruction for antivirus), not with the instructions proposed in the PDF I linked.

2) Adopt a "coprocessor" policy. You need more calculation power? Add another copro to your motherboard without throwing your old CPU in the trash can. I don't wanna see a 1900M-transistor 8-core CPU dissipating 900W. Multiple small copros are a better approach, I bet (and perhaps could avoid the fan!).

3) Intel should also open a public forum where EVERYBODY could post the instructions they want.

Nick · Nov 23, 2006

santyhammer said:
When did CPU design go so wrong that it lost the battle against the GPUs, PPUs, etc.?
Why didn't Intel and AMD listen to us and put 1-cycle DOT, MAD, ADD/SUB/MUL/DIV, SQRT, LOG and EXP shader instructions inside the CPU to create a REAL and USEFUL SSE implementation? Why is SSE, in general, so poor, badly designed and slow?

When comparing a 3 GHz CPU with a 500 MHz GPU, it's actually pretty much possible to issue these operations in one GPU clock (yes I know a G80 is clocked higher). The main difference is that a GPU can issue dozens of them in parallel. But SSE is not fundamentally flawed (as a CPU instruction set). It could have been better, but it's still very useful.

And, seriously, I want somebody to tell me why the CPU companies didn't design SIMD well from the start, so they could have avoided the growth of the GPU and kept all the calculations inside the CPU.

The main problem is parallel throughput, and they're working on that. Intel isn't planning to stop till there are Cores-a-plenty™.

Nick · Nov 23, 2006

Uttar said:
As for special function operations such as log/exp/sin/cos/etc. - modern GPUs are indeed terrifyingly powerful at those.

Just for your information, here are ways to compute sin and cos on the CPU with very few operations: Fast and accurate sine/cosine. GPUs are certainly still much faster, but I wouldn't say they are, relatively speaking, much better at special instructions compared to a CPU.
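
The linked article is not reproduced in the thread, so the following is an assumption about the kind of technique meant: a commonly circulated parabolic approximation of sine on [-pi, pi], with an optional refinement step. The function name `fast_sin` and the constant 0.225 (the usual tuning value for the refinement) are illustrative; this is a sketch, not necessarily the exact code from the article.

```python
import math

def fast_sin(x, refine=True):
    """Parabolic sine approximation, valid only for x in [-pi, pi].

    The first step fits a parabola through 0, +/-pi/2 and +/-pi;
    the optional second step blends in the squared parabola to cut the error.
    """
    b = 4.0 / math.pi
    c = -4.0 / (math.pi * math.pi)
    y = b * x + c * x * abs(x)
    if refine:
        p = 0.225  # commonly used weight for the refinement term
        y = p * (y * abs(y) - y) + y
    return y

# Scan the valid range and record the worst absolute error of each variant.
samples = [-math.pi + i * (2 * math.pi / 2000) for i in range(2001)]
max_err_rough = max(abs(fast_sin(x, refine=False) - math.sin(x)) for x in samples)
max_err_fine = max(abs(fast_sin(x) - math.sin(x)) for x in samples)
print(max_err_rough, max_err_fine)
```

The whole thing is a handful of multiplies, adds and absolute values with no table lookups, which is why it maps well onto SIMD units. Cosine follows by shifting the argument by pi/2 and wrapping it back into range.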

Obviously the CPU is also the only choice if you need full IEEE-754 (single or double) precision. Which is just another reason why CPUs will never be GPUs and GPUs will never be CPUs.

And both Intel and AMD could fuck up completely and have a laughable architecture as a result of such an ambition. How often did you see 3D programmers liking MMX or 3DNow! when they came out? Did they ever use it for serious graphics programming in shipping games? (hint: never)

Hint: Unreal. And 3DNow! was used a lot in the pre-T&L days. Also, while no real 3D game uses it for rendering these days, it's still present in engines/libraries/APIs/drivers for many other purposes. And video processing has always relied on it. So I'd give it a little more credit. Even though it was marketed differently they are pretty generic SIMD instruction sets.

santyhammer · Nov 23, 2006

Nick said:
Hint: Unreal. And 3DNow! was used a lot in the pre-T&L days. Also, while no real 3D game uses it for rendering these days, it's still present in engines/libraries/APIs/drivers for many other purposes. And video processing has always relied on it. So I'd give it a little more credit. Even though it was marketed differently they are pretty generic SIMD instruction sets.

Fully agree with you, Nick.

I am using MMX, 3DNow! and SSE1-3 in my raytracers too. Also inside my 3D engine to accelerate some vector/matrix/quaternion maths, SW skinning, triangle-ray tests, BSPs and sound processing (the PC falls a bit short because I don't have a damn good DOT product... the Xenos is nice because it can do a 1-cycle DOT, so I'm happy there).

For cosine/sine/etc. I mix a precomputed table with a Newton-Raphson approximation (or use those damn precomputed PS1.0 textures to look up a complex function).

Humus · Nov 23, 2006

santyhammer said:
notice it includes a 1-cycle SIMD dot product

Holy glory and hallelujah!

santyhammer said:
and also an interesting instruction for regEx string manipulation which can be used by antivirus very well

I think the bigger problem with Antivirus programs is the HD access though.

Uttar said:
The point isn't that SSE isn't capable of some pretty good acceleration, and often without out-of-this-world efforts. Even MMX and 3DNow! had some interesting uses if you decided it was worth the trouble to implement.

In my experience SSE often requires a significant amount of work to do the task you want to accomplish. Except for a few particular tasks that Intel engineers had in mind back then, you tend to always end up shuffling so much that you might as well just stick to scalar in the end. 3DNow! on the other hand was done right from the beginning. It added only a small number of instructions, but ended up a lot more powerful in typical SIMD tasks than SSE, which had two or three times the number of instructions. While things improved in SSE2, it was still largely unsuitable for any kind of "3D" task. It took until SSE3 before we got horizontal adds and dot products finally became practical to implement. 3DNow! had these from the start. It boggles my mind that this was not also in the first SSE implementation and took all the way to SSE3 to get in there.

Uttar said:
Although even then, it could be slightly more optimal if it supported dot products

Not just slightly. A lot.

3dilettante said:
Most of the code CPUs need to be able to run fast never needs a dot product.

But I'd say a lot, if not most, of the code that would ever be considered for SSE optimizations uses dot products.

santyhammer · Nov 23, 2006

Humus! Hey, glad to see you here too!
Tell these guys what ATI is doing in Fusion so they can see.

Also tell us if we can manage pointers and complex structures with CTM, omg!

And hey! Do you know if Torrenza is the coprocessor policy I talked about, instead of a colossal 8-core 1900M-transistor CPU? And is that also the policy for the R700? See http://uk.theinquirer.net/?article=35818

Nick · Nov 23, 2006

3dilettante said:
Each instruction would still take multiple clocks. For most CPUs, a multiply takes at least 3 cycles for integer math, and probably closer to 5 for floating point. A dot product would take longer than that to propagate the results.

To do better than that would take a lot more hardware, or a slower clock cycle. GPUs do both, and even they have operations that take multiple cycles.

I believe a CPU's latency is actually better than a GPU's latency (even if multiple instructions are required for a shader instruction). The result can be forwarded and start a new instruction the very next clock cycle (hence some instructions have a latency of just one clock cycle even though the total pipeline is much longer). The GPU just hides latency completely by pushing other pixels through the pipeline. But it takes many clock cycles from one instruction to the next for the same pixel (no forwarding - but it doesn't matter anyway).

All I'm trying to say is that the CPU is still a masterpiece of engineering. It's totally no match at graphics against a GPU, but it's extremely fast at executing one stream of general purpose instructions. If it can issue one dot product every clock cycle it's actually faster than one GPU pipeline. So really the 'problem' is parallelism, not that much the instruction set or clock frequency.

No operation is free; it either has a cost in silicon, power, or time. Most of the code CPUs need to be able to run fast never needs a dot product. Most of the transistors that would go into a dedicated high-speed DOT unit would go unused 99.9% of the time, and then people would complain why their 99.9% of integer code is running 10% slower than it could be.

It most likely uses the existing mul and add units. It just requires extra wiring for the subresults. This way it would take multiple clock cycles, but it can be fully pipelined and has a fairly low hardware cost.

Anyway in the end it's pretty useless for graphics, and there must be other reasons why Intel added it now.

santyhammer · Nov 23, 2006

Nick said:
So really the 'problem' is parallelism, not that much the instruction set or clock frequency

What do you think about the idea of multiple CPUs inserted into multiple ZIF sockets? I think a 1-cycle DOT and the DX10 shader instructions inside a new SSE, plus the ability to put like 2 or 4 of those CPUs on the motherboard, could rock!

And that is what Fusion + Torrenza is, or not? Humus, tell us, omg!
