SSE4, future processors and GPGPU thoughts

I guess Intel's SSE enhancements are a necessity: when you only have 8 registers, it's kind of hard to process 4 stream elements in scalar code (the style recommended by Intel itself), rather than 1 stream element with vector code.
For your information, x86-64 offers 16 SSE registers. Still no abundance, but definitely a significant improvement for lots of applications. Also, an x86 CPU's cache is so fast that in several situations using it as temporary storage is almost for free (since load/store can execute in parallel).
 
Nick: That looks like some pretty smart code for sin/cos on the CPU! The precision doesn't quite match what GPUs can manage (G80: see this graph), but then again GPUs can't do double precision sincos and such yet either ;) So I'll call it a tie.

As for 3DNow!, yeah, I got my timeframes slightly confused wrt when NV introduced T&L, so I had completely forgotten about that. Oops! TBH, I never used 3DNow!, only a bit of SSE, so I guess that's partially what's biasing me (see: Humus' reply for SSE vs 3DNow!)

I know, but I started this thread to complain and not to explain or to debate.
My intention was that if we all complain, perhaps some engineer could think about it. Not to improve nm or MHz or to include 900,000 cores in the silicon die.

So, basically your answers are always welcome... but all I want is to complain, complain and complain so they can see the job they are doing is severely lacking.
Poor crybaby. You clearly aren't on the right forum then, btw... And if you did want to post on Beyond3D (which I encourage you not to do too much if that's your only motive), Hardware & Software Talk would be the appropriate section, considering you want to exclusively focus on the CPU here.
3) Intel should also open a public forum where EVERYBODY could post the instructions they want.
Do you honestly think Intel has zero Developer Relations? The guys they keep in contact with most likely have a much better idea of these things than you do. They might benefit from a formal process, but that wouldn't guarantee anything either.


Uttar
 
You clearly aren't on the right forum then, btw... And if you did want to post on Beyond3D (which I encourage you not to do too much if that's your only motive), Hardware & Software Talk would be the appropriate section, considering you want to exclusively focus on the CPU here.
Well then move it, pls!

Do you honestly think Intel has zero Developer Relations? The guys they keep in contact with most likely have a much better idea of these things than you do. They might benefit from a formal process, but that wouldn't guarantee anything either.
The OpenGL forums have what I'm requesting, and it worked very well in the realization of Shader Model 4 and the new extensions (see the "Suggestions for the new OpenGL release" thread, etc.).
Does OpenGL have zero developer relations like you suggest? I don't think so.
 
Why don't they wipe all the absurd and slow instructions from x86 (like the horrible 387 instructions and the old 3DNow!/SSE) to save silicon? They could do it perfectly in the x86 to x64 transition and nobody would complain... after all, you need to recompile to get good x64 applications (OK, x64 has an x86 compatibility mode, but that is going to be deprecated in 2 years...). I think it would be a good moment to get rid of all the old x86 instructions.
Deprecating it in 2 years is definitely not an option. Maybe 10 years, but then you still have duplicate hardware (decoders in particular). Internally an x86 CPU is pretty much RISC, and giving it another instruction encoding wouldn't help much. Another major reason why they won't do this is that SSE is not as bad as you think. In comparison with GPUs it doesn't look so fantastic, but it suits the needs of a lot of other applications. And what application other than graphics would benefit from a full 3D shader instruction set with limited-precision special functions?

Last but not least, all x86 instructions take only two operands. The microcode and all the data paths are also optimized for that. Supporting up to four operands like shader instructions do would add so much complexity that the clock frequency would have to be significantly lower. You could lower the complexity by removing out-of-order execution, but then single-threaded performance would be terrible, and you'd end up with something that would be very similar to a GPU...
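To make the two-operand point a bit more concrete, here's a rough sketch of my own (plain SSE intrinsics; the assembly in the comment is just what a typical compiler would emit, nothing authoritative): a shader-style MAD, d = a*b + c, has to be split into two destructive two-operand instructions plus a register copy, where a GPU issues it as a single four-operand instruction.

#include <xmmintrin.h>

// d = a*b + c with plain SSE. Because mulps/addps overwrite their first operand,
// the compiler has to emit roughly:
//     movaps xmm3, xmm0   ; copy a, so it isn't destroyed
//     mulps  xmm3, xmm1   ; xmm3 = a*b
//     addps  xmm3, xmm2   ; xmm3 = a*b + c
// i.e. two ALU instructions plus a copy, versus one "mad dst, a, b, c" on a GPU.
__m128 mad_ps(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}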
 
What do you think about the idea of multiple CPUs inserted into multiple ZIF sockets?
Multiple sockets are a bad idea. We already have dual-socket workstation motherboards, but they are very expensive because manufacturers have to design multiple boards while selling mostly single-socket ones. Multi-core is a better idea. We have $100 dual-core CPUs now, and with every silicon shrink they can keep doubling the number of cores (or do something else so the gap between high-end and low-end stays relatively small).
I think a 1-cycle DOT and the DX10 shader instructions inside a new SSE, plus the ability to put 2 or 4 of those CPUs on the motherboard, could rock!
Rock what?
 
santyhammer, what you said about GPUs dominating CPUs isn't completely correct. GPUs (and other processors) will never take over the CPU; no matter what happens, we do need CPUs. I have been reading through AMD's plans, and I have found something interesting that might please you (and maybe me and the others). These two quotes share something in common:

1) FROM PHIL HESTER
"When referring to the future goals for AMD's architecture, the only example Phil Hester provided for FPU Extensions to AMD64 was the idea of introducing extensions that would accelerate 3D rendering. We got the impression that these extensions would be similar to a SSEn type of extension, but more specifically focused on usage models like 3D rendering.

Through the use of extensions to the AMD64 architecture, Hester proposed that future multi-core designs may be able to treat general purpose cores as almost specialized hardware, but refrained from committing to the use of Cell SPE-like specialized hardware in future AMD microprocessors. We tend to agree with Hester's feelings on this topic, as he approached the question from a very software-centric standpoint; the software isn't currently asking for specialized hardware, it is demanding higher performance general purpose cores, potentially augmented with some application specific instructions. "

2) From that Ars Technica article about Fusion

"To support CPU/GPU integration at either level of complexity (i.e. the modular core level or something deeper), AMD has already stated that they'll need to add a graphics-specific extension to the x86 ISA. Indeed, a future GPU-oriented ISA extension may form part of the reason for the company's recently announced "close to metal" (CTM) initiative. By exposing the low-level hardware of its ATI GPUs to coders, AMD can accomplish two goals. First, they can get the low-level ISA out there and in use, thereby creating a "legacy" code base for it and moving it further toward being a de facto standard. Second, they can get feedback from the industry on what coders want to see in a graphics-specific ISA.

Both of these steps pave the way for the introduction of GPU-specific extensions to the x86 ISA, extensions that eventually will probably be modeled to some degree on the ISA for the existing ATI hardware. These extensions will start life as a handful of instructions that help keep the CPU and GPU in sync and aware of each other as they share a common socket, frontside bus, and memory controller. A later, stream-processor-oriented extension could turn x86 into a full-blown GPU ISA."

Both of these quotes talk about adding some kind of stream-processor/GPU instructions to AMD's processors. In my opinion this is what will happen, starting in the late 2008/early 2009 timeframe:

1) AMD releases Fusion, and adds simple gfx instructions to their Fusion chips to make the CPU communicate easily with the on-die GPU.
2) AMD improves the Fusion chips and adds more instructions; the CPU starts to be able to process lightweight gfx workloads.
3) Developers start accepting AMD's Fusion as an industry standard, Intel is forced to copy AMD, and Fusion begins to go high-end.
4) More improvements follow.
5) Eventually, the CPU ends up with a lot of gfx-related instructions, including stream processing (shading), and thus the CPU cores start to be treated as a sort of specialized hardware, just like Phil Hester said.

And since the CPU will have shader instructions, it will also be able to process non-graphics workloads. So the CPU and GPU will both converge at some point in time.
 
Nick: That looks like some pretty smart code for sin/cos on the CPU! The precision doesn't quite match what GPUs can manage (G80: see this graph), but then again GPUs can't do double precision sincos and such yet either ;) So I'll call it a tie.
I can't seem to access that site (even without the parenthesis). :( Anyway, indeed I would call it a tie if we look at their respective purposes. Both have undergone significant engineering effort, and I don't believe it's possible to improve one or the other in a certain direction without sacrificing other specifications. That being said, I applaud the addition of fast low-precision instructions for the CPU (e.g. rcpps, rsqrtps), and the continued effort to improve the precision of GPU operations.
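To show what I mean by those fast low-precision instructions, here's a small sketch of my own (plain SSE1 intrinsics): rsqrtps gives roughly a 12-bit estimate in one instruction, and a single Newton-Raphson step brings it close to full single precision for far less than the cost of sqrtps plus divps.

#include <xmmintrin.h>

// Approximate 1/sqrt(x) for four floats at once: rsqrtps estimate plus one
// Newton-Raphson refinement step, y = y * (1.5 - 0.5 * x * y * y).
__m128 fast_rsqrt_ps(__m128 x)
{
    __m128 y     = _mm_rsqrt_ps(x);                 // ~12-bit estimate
    __m128 half  = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    __m128 xyy   = _mm_mul_ps(x, _mm_mul_ps(y, y)); // x*y*y
    return _mm_mul_ps(_mm_mul_ps(half, y),          // 0.5*y*(3 - x*y*y)
                      _mm_sub_ps(three, xyy));
}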
As for 3DNow!, yeah, I got my timeframes slightly confused wrt when NV introduced T&L, so I had completely forgotten about that. Oops! TBH, I never used 3DNow!, only a bit of SSE, so I guess that's partially what's biasing me (see: Humus' reply for SSE vs 3DNow!)
I never used 3DNow! either. I avoid it because of incompatibility with Intel processors, and I've always regarded SSE as superior. While I probably have a biased view as well, the lack of horizontal instructions was never a serious issue. It requires some rethinking of data structures and keeping the number of swizzles low, but with the right approach SSE can be almost two times faster. And it also keeps the MMX registers free.
 
Through the use of extensions to the AMD64 architecture, Hester proposed that future multi-core designs may be able to treat general purpose cores as almost specialized hardware, but refrained from committing to the use of Cell SPE-like specialized hardware in future AMD microprocessors. We tend to agree with Hester's feelings on this topic, as he approached the question from a very software-centric standpoint; the software isn't currently asking for specialized hardware, it is demanding higher performance general purpose cores, potentially augmented with some application specific instructions. "
What I think could make a lot of sense is to add SPE-like x86 cores. By removing out-of-order execution, we could have for example two full cores, and about eight SPE-like cores, on the die space of a regular quad-core.

This way we'd still have full x86 compatibility (no need to rewrite applications or compilers), but much higher throughput for applications that make good use of the in-order cores. It would take extra effort to minimize instruction dependencies, but that's true for Cell and GPUs as well.
 
I think you are MASSIVELY misunderstanding why GPUs are good at math - and why they'll get even better at it than CPUs could ever dream of. Dot products and SF are only a small part of that; the icing on the cake, if you wish.

Furthermore, I think you'll agree it's ironic that you're taking dot products so seriously here, because both NVIDIA and Intel (in the G965) have already gotten rid of it for this generation. The trend is towards GPUs becoming completely scalar (for math operations, at least!), and ATI will follow up sooner or later. This makes sense because the ALUs can remain SIMD internally anyway; they just process 16 pixels/vertices per ALU, with the same instruction. (32 threads with the same instruction for pixels, actually!)

I'd wait to see what ATI brings to the table before making blanket statements like "ATI will follow later". ;)
 
What I think could make a lot of sense is to add SPE-like x86 cores. By removing out-of-order execution, we could have for example two full cores, and about eight SPE-like cores, on the die space of a regular quad-core.

This way we'd still have full x86 compatibility (no need to rewrite applications or compilers), but much higher throughput for applications that make good use of the in-order cores. It would take extra effort to minimize instruction dependencies, but that's true for Cell and GPUs as well.

I think a Cell processor approach will be pointless; on-die GPUs (and future GPU instructions, as I said) will be more programmable and provide a lot of parallelism.
 
I think a Cell processor approach will be too stupid; on-die GPUs (and future GPU instructions, as I said) will be more programmable and provide a lot of parallelism.
It's not exactly a Cell approach. Cell SPEs are not general-purpose, and they have their own instruction set, which means they can't help out the PowerPC core and they require a specialized compiler and lots of developer effort.

With x86 mini-cores, every multi-threaded application would benefit, and there would be process-level parallelism as well for other legacy software (making one design useful for both server and desktop). Plus existing compilers can be used to develop for it. And it would only require simple extensions to make sure that threads meant to run on the mini-cores get proper instruction scheduling.

An on-die GPU core would be most efficient for graphics, but only just graphics. As soon as another graphics chip is added this is dead silicon. With the right x86 instruction set extensions the mini-cores could be very adequate for graphics (running Vista without wasting calories on out-of-order execution), and benefit every other application.
 
It's not exactly a Cell approach. Cell SPEs are not general-purpose, and they have their own instruction set, which means they can't help out the PowerPC core and they require a specialized compiler and lots of developer effort.

With x86 mini-cores, every multi-threaded application would benefit, and there would be process-level parallelism as well for other legacy software (making one design useful for both server and desktop). Plus existing compilers can be used to develop for it. And it would only require simple extensions to make sure that threads meant to run on the mini-cores get proper instruction scheduling.

An on-die GPU core would be most efficient for graphics, but only just graphics. As soon as another graphics chip is added this is dead silicon. With the right x86 instruction set extensions the mini-cores could be very adequate for graphics (running Vista without wasting calories on out-of-order execution), and benefit every other application.

An on-die GPU will be good for stuff other than gfx; haven't you heard of GPGPU and CUDA? A lot of mini-cores will be very hard to program for, and that Intel Terascale is only being researched, and I think this Intel Terascale is just like the 10 GHz story. Maybe a lot of mini-cores will be good when graphics instruction extensions are added to them, but 4 x86 heavy cores with graphics extensions which can also be used for FP will be better. Keep this in mind: the multi-core race isn't just simply tossing in more cores, it is a race of using your resources efficiently for more adaptive solutions. Haven't you also heard of Amdahl's law, which states that at some point tossing in more processors won't yield any improvements?
 
An on-die GPU will be good for stuff other than gfx; haven't you heard of GPGPU and CUDA?
Yes, but GPGPU is generally done on a high-end GPU, not something you would include on a CPU die. Furthermore, it takes significant effort to program for and wouldn't help running legacy x86 executables or new multi-threaded x86 applications.
A lot of mini-cores will be very hard to program for...
Easier than Cell and not harder than GPGPU. Some multimedia applications (video encoding, raytracing) are already ready for 16+ cores. The rest will follow when concurrency is considered in every software architecture.
...and that Intel Terascale is only being researched, and I think this Intel Terascale is just like the 10 GHz story. Maybe a lot of mini-cores will be good when graphics instruction extensions are added to them, but 4 x86 heavy cores with graphics extensions which can also be used for FP will be better.
It remains a compromise of course, but 2 full cores and 16 mini-cores could definitely perform better than 4 full cores. The applications that can use 4 cores can most likely use more as well. And a lot of variation is possible. Instead of completely sacrificing out-of-order execution they could shorten the reorder buffers and simplify branch prediction. Or they could have for example only eight mini-cores without out-of-order execution but each capable of Hyper-Threading to help fill the pipelines.
Keep this in mind: the multi-core race isn't just simply tossing in more cores, it is a race of using your resources efficiently for more adaptive solutions. Haven't you also heard of Amdahl's law, which states that at some point tossing in more processors won't yield any improvements?
Certainly, but Amdahl's law doesn't pester graphics, multimedia and server applications that much. Besides, adding an on-die GPU would also have to deal with Amdahl's law for GPGPU applications.
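For reference, Amdahl's law says that with a parallel fraction P of the work and N cores, the overall speedup is 1 / ((1 - P) + P / N). So if only 90% of the work is parallel the speedup can never exceed 10x, no matter how many cores you add; but for graphics, multimedia and server workloads P is so close to 1 that extra cores keep paying off for a long time.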
 
Why all this fuss about DOT products? Store and process your data in SoA order and live happy :)
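Something like this is what I mean (an untested sketch of mine; it assumes the data is already laid out as separate, 16-byte-aligned x/y/z/w float arrays and that count is a multiple of 4): four dot products per iteration with nothing but vertical mulps/addps, no horizontal ops needed.

#include <xmmintrin.h>

// Untested sketch: four dot products per iteration from SoA data, where
// ax[i], ay[i], az[i], aw[i] hold the components of vector i.
void dot4_soa(const float* ax, const float* ay, const float* az, const float* aw,
              const float* bx, const float* by, const float* bz, const float* bw,
              float* out, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 d = _mm_mul_ps(_mm_load_ps(ax + i), _mm_load_ps(bx + i));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(ay + i), _mm_load_ps(by + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(az + i), _mm_load_ps(bz + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(aw + i), _mm_load_ps(bw + i)));
        _mm_store_ps(out + i, d);  // four results at once
    }
}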
 
Well, it's not just down to a personal preference. To begin with, you may not even be working with arrays in the first place. Most problems are more advanced than just a long row of data that needs to be processed in order. Not to mention the fact that object-oriented programming makes AoS the natural arrangement for most kinds of data. Or the fact that AoS typically results in a better memory access pattern. But the most important reason is that you want to be able to use SIMD even for small computations, like writing a small inline dot() function that can be used anywhere, especially in this age when intrinsics in a high-level language are the preferred way (and the only way on x64 in MSVC 2005) to use SIMD.
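For example, the kind of small inline dot() I mean looks something like this (my own sketch, plain SSE1 intrinsics, one xyzw vector per __m128); the shuffle-based horizontal add at the end is exactly the part a 1-cycle DOT instruction would replace.

#include <xmmintrin.h>

// Sketch of a small inline dot() on AoS data (one xyzw vector per register),
// using only SSE1 shuffles for the horizontal add.
inline float dot(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);                                             // ax*bx, ay*by, az*bz, aw*bw
    __m128 s = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1))); // pairwise sums
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));        // full sum in every lane
    float r;
    _mm_store_ss(&r, s);
    return r;
}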
 
Well, it's not just down to a personal preference. To begin with, you may not even be working with arrays in the first place.
In that case either you make up your own arrays or you simply don't use them.
Most problems are more advanced than just a long row of data that needs to be processed in order.
So what? Many are still amenable to being SoA'ed.
Not to mention the fact that object-oriented programming makes AoS the natural arrangement for most kinds of data. Or the fact that AoS typically results in a better memory access pattern.
Having smaller granularity almost always gives you a better memory access pattern over a certain threshold.
But the most important reason is that you want to be able to use SIMD even for small computations, like writing a small inline dot() function that can be used anywhere, especially in this age when intrinsics in a high-level language are the preferred way (and the only way on x64 in MSVC 2005) to use SIMD.
SIMD for small computations here and there won't change your performance one bit; no one is going to gain a thing just because they compute a few dot products instead of a few MADDs.
 
SIMD for small computations here and there won't change your performance one bit; no one is going to gain a thing just because they compute a few dot products instead of a few MADDs.
I fully agree. The only way to get the real potential of SIMD is to write the whole bottleneck in assembly. In many cases that means a whole loop or a function, not just a few of the vector operations inside of it.
 