|
|
#1 |
|
B3D Scallywag
|
As it stands, it can be argued that there are three major CPU SIMD instruction sets in use for modern high-end gaming (okay, ignoring SPUs). Those being:
SSE4: Used on Penryn and Nehalem (in slightly different configurations)
AVX: Used on the very latest PC CPU architectures, namely Sandy Bridge, Bulldozer and Ivy Bridge
VMX: Used in Xenon x3 and in a slightly reduced form in the PPU on Cell
So given the same theoretical throughput, what are the general thoughts about which of these instruction sets is best suited for modern gaming? Obviously AVX has twice the theoretical single precision throughput of SSE4 and VMX per clock, so let's say we're using as near as dammit 100% vectorised code on the following hypothetical CPUs:
1x Penryn core @ 3.2 GHz
1x Sandy Bridge core @ 1.6 GHz
1x Xenon core @ 3.2 GHz
Any views on how these would fare against one another?
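For a rough sense of why these three clocks were picked, here is a minimal back-of-envelope sketch. It assumes the per-clock peak quoted later in the thread for AVX (16 single precision flops per cycle per core) and half of that for the 4-wide sets, as the post's "twice the throughput" premise implies. The Xenon figure in particular is an assumption, not a published number, and `peak_gflops` is just an illustrative name.

```python
def peak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Theoretical peak in GFLOPS: clock (GHz) x flops per cycle x cores."""
    return clock_ghz * flops_per_cycle * cores

# (clock in GHz, ASSUMED peak single precision flops per cycle per core)
HYPOTHETICAL_CORES = {
    "Penryn (SSE4)": (3.2, 8),
    "Sandy Bridge (AVX)": (1.6, 16),
    "Xenon (VMX)": (3.2, 8),   # assumption taken from the post's premise
}

for name, (clock_ghz, flops_per_cycle) in HYPOTHETICAL_CORES.items():
    print(f"{name}: {peak_gflops(clock_ghz, flops_per_cycle):.1f} GFLOPS peak")
```

Under those assumptions all three land on the same paper figure, which is the point of the question: any difference would have to come from the instruction set and the implementation, not from peak throughput.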
__________________
PowerVR PCX1 4MB --> Voodoo Banshee 16MB --> GeForce2 MX200 32MB --> GeForce2 Ti 64MB --> GeForce4 Ti 4200 128MB --> 9800Pro 128MB --> 8800GTS 640MB --> Radeon HD 4890 1GB --> GeForce GTX 670 DirectCU II TOP 2GB |
|
|
|
|
|
#2 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,497
|
Is AVX used in any games? Or is it transparent to the programmer?
I'm guessing SSE3 is needed for older CPUs; my CPU doesn't support SSE4.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#3 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
You cannot really say which instruction set is faster, as that depends on the implementation. Latency and throughput of e.g. SSE2 instructions vary greatly between different CPUs.
Furthermore, the instruction set of AVX isn't actually different from SSE(4): it's exactly the same instructions, just extended to 256 bits (well, for floats only; 256-bit ints need to wait until AVX2 in Haswell). The instructions are mostly just slightly different with AVX, since the VEX encoding has non-destructive (3-operand) syntax, which makes the instructions slightly larger but saves most register-register move instructions and should be good for some small performance improvement. AVX with ints is thus just minimally faster than SSE4 on the same CPU (the only advantage comes from fewer move instructions), and with floats it's a bit more than twice as fast in theory (except for divisions on Sandy Bridge, as the divide unit is only 4-wide, though Ivy Bridge "fixed" that).
This assumes, though, that your algorithm really can be adjusted to use 8-wide floats trivially, and further assumes no load/store bottlenecks (Sandy Bridge can load two 128-bit values and store one 128-bit value per clock), not to mention that other limits, such as memory bandwidth or latency, are still the same.
I don't know much about VMX. I believe it has somewhat better support for horizontal operations and shuffles, but whether you can benefit from such instructions can't be said in general. As for VMX on Xenon, I have absolutely no idea what the throughput is for even the "basic" operations (float vector mul, add). Just because the instructions are 4-wide doesn't tell you much about what the CPU can do per clock, and I'm not sure whether that information was published anywhere for Xenon (it might be that, just like older CPUs supporting SSE2, it really only has 2-wide instead of 4-wide execution units, for instance).
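To illustrate the load/store caveat, here is a rough sketch of a streaming `a[i] = b[i] + c[i]` loop using the port widths quoted above (two 128-bit loads and one 128-bit store per clock). Everything else is a simplifying assumption: the data sits in L1, one vector add can issue per clock, and nothing else limits the loop. The function name is made up for illustration.

```python
FLOAT_BITS = 32

def floats_per_clock(vector_bits, loads=2, stores=1,
                     load_bits_per_clock=2 * 128, store_bits_per_clock=128):
    """Sustained floats per clock for a loop that does `loads` vector loads,
    one vector add and `stores` vector stores per iteration."""
    load_cycles = loads * vector_bits / load_bits_per_clock
    store_cycles = stores * vector_bits / store_bits_per_clock
    alu_cycles = 1.0  # ASSUMPTION: one vector add issues per clock
    cycles = max(load_cycles, store_cycles, alu_cycles)  # slowest resource wins
    return (vector_bits // FLOAT_BITS) / cycles

print("SSE, 4-wide:", floats_per_clock(128))
print("AVX, 8-wide:", floats_per_clock(256))
```

In this toy model both widths sustain the same number of floats per clock, so a loop that only streams data through one add gains nothing from going 8-wide; the theoretical 2x only shows up when more arithmetic is done per value loaded.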
|
|
|
|
|
#5 | |
|
Member
Join Date: Aug 2011
Posts: 369
|
AVX1 is honestly not all that interesting. Getting 8-wide parallelism without gather is a whole lot harder than 4-wide. AVX2, to be released with Haswell, however, is very interesting. All the low-level coders I routinely talk with are pretty stoked for the gather support and FMA. Not only does it make "lists of elements" style code a lot easier to vectorize, it should finally make reasonable gains from autovectorization a reality. Vector instructions with gather are just better than ones without.
Quote:
|
|
|
|
|
|
|
#6 |
|
B3D Scallywag
|
Thanks for the input guys. So it sounds like AVX is most certainly not just 2x SSE4 but AVX2 coming with Haswell might be getting close? Sounds like there's a lot to be excited about in Haswell.
Does VMX support FMA? I assume that's quite advantageous for games and so would give it a leg up in some respects over AVX?
|
|
|
|
|
#7 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,497
|
Do you need to code specifically for SSE4/AVX?
Back in the day (cue old fart story), if I wanted to support x87 I would just use a compiler directive ($N, I think) and that was it, job done: if the PC had a math co-pro it would get used; if not, the program would just use the integer unit.
|
|
|
|
|
#8 | |||
|
Member
Join Date: Aug 2011
Posts: 369
|
Just to elaborate. Most loads do not vectorize that easily on all implementations. That's why comparing ideal cases is pointless -- you just don't see them that much in the real world. The ability to vectorize more cases is much more important than the optimal throughput in the optimal case.
Quote:
Quote:
Quote:
You can probably see why this gets hairy fast. It's hard to do by hand, and nigh-impossible for a compiler to do automatically. There is some downright heroic work on the subject by the Intel and GCC teams, but even they really don't get that much speedup from autovectorized code. So today, only the things that are absolutely trivial tend to get optimized (position and speed = two 4-element vectors).
AVX2 brings gather instructions, which are basically vectorized loads. They take a base address and a vector full of offsets, and fill the target register with [base + offset]. This should make vector instructions useful in a lot of places they weren't before, because a lot of loops can then be trivially vectorized by the compiler.
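The gather semantics described above can be modelled in a few lines. This is only a sketch of the behaviour, not of the real instruction: offsets here are element indices rather than byte offsets, the 8-lane width is an assumption matching 256-bit registers of 32-bit values, and both function names are made up for illustration.

```python
def gather(memory, base, offsets):
    """Model of a gather: lane i receives memory[base + offsets[i]]."""
    return [memory[base + off] for off in offsets]

def sum_indexed(memory, base, indices, width=8):
    """Sum memory[base + idx] over all indices, `width` lanes at a time.

    Each pass of the loop stands in for one gather plus one vector add,
    which is the kind of indexed loop a compiler could vectorize."""
    total = 0
    for start in range(0, len(indices), width):
        lanes = gather(memory, base, indices[start:start + width])
        total += sum(lanes)
    return total

memory = list(range(100, 132))             # stand-in for an array in memory
indices = [5, 0, 9, 3, 7, 1, 8, 2, 6, 4]   # scattered "list of elements" access

print(gather(memory, 2, indices[:4]))
print(sum_indexed(memory, 2, indices))
```

Without a gather, each of those lanes has to be loaded and inserted into the vector register one at a time, which is where most of the hand-written effort goes.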
|||
|
|
|
|
|
#9 |
|
Senior Member
|
Let's not forget the hardware transactional memory support, primed for Haswell too. It will further optimize memory pipeline performance under heavy MP loads.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#10 |
|
Member
Join Date: Nov 2007
Posts: 942
|
VMX128 is actually a very good set of instructions (compared to SSE at least). It has very good shuffles/inserts/select, multiply-add, complex bit packing instructions (including float16 conversion), (AOS) dot product, etc. However, the instruction set is only one side of the coin; the other is the CPU architecture implementing it.
Nothing of course compares to AVX2 (in Haswell). But gather is only good if it is fast enough, and nobody really knows that yet. 256-bit wide integer operations are of course a nice addition as well.
|
|
|
|
|
#11 |
|
B3D Scallywag
|
Sounds like Haswell is going to be a pretty impressive chip. I wonder how long it will be before CPUs drop specialised SIMD units altogether, though, and move vector processing to the GPU. Are we getting close to that yet? Or would GPUs be unsuitable as complete replacements?
I know AMD has been hinting about it in a future Fusion iteration, but I'm not sure whether that would be a complete replacement for the CPU's SIMD abilities or just complementary.
|
|
|
|
|
#12 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,109
|
There would need to be some pretty striking advances in implementation to allow for a CPU FP unit to be completely stripped out of the CPU core.
The latency in hopping from a CPU to a GPU would be unacceptable for workloads that require higher straightline performance. Problems that do not need much more data level parallelism than a CPU provides would also be a waste on a CU unit that needs four to eight times as many work items.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#13 | |
|
Member
Join Date: Aug 2011
Posts: 369
|
Quote:
I really don't think that the cpu vector units will ever be dropped. More likely, either they will evolve into the GPU ones (expand avx to full width, put 4-8 threads into the frontend, run GPU code on the CPU), or at some point the manufacturers will stop adding to them, and just put all the new advancements in the new dedicated vector block. |
|
|
|
|
|
|
#14 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Eventually it will make sense to just move all programmable throughput computing to the CPU. AVX2 will already be perfectly suitable for graphics shaders. The only remaining deal breaker is the higher power consumption. But this can be tackled with AVX-1024. The VEX encoding already supports extending it to 1024-bit registers, and by executing such instructions on 256-bit units in four cycles, the CPU's front-end and scheduler will have four times less switching activity, hence dramatically lowering the power consumption. A 16 nm successor to Haswell could deliver 2 TFLOPS for the same die size and not break a sweat.
GPGPU is dying. Even though AMD is making its GPU architecture more flexible, NVIDIA went the other direction with Kepler. And on top of that you get wildly inconsistent performance between discrete and integrated parts. So GPGPU is utter rubbish for mainstream applications. Developers will instead focus on AVX2, since that will be available in every CPU from Haswell forward, and is only going to get more powerful.
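The arithmetic behind the "four times less switching activity" argument can be sketched as follows. AVX-1024 is hypothetical here, and the model is deliberately crude: it only counts instructions the front-end has to issue and cycles the 256-bit unit stays busy, and it says nothing about actual power. The function name is made up for illustration.

```python
FLOAT_BITS = 32

def issue_and_execute(num_floats, register_bits, unit_bits=256):
    """Return (instructions issued, cycles the execution unit is busy)
    for one operation applied to `num_floats` single precision values."""
    lanes = register_bits // FLOAT_BITS
    instructions = -(-num_floats // lanes)         # ceiling division
    passes = max(1, register_bits // unit_bits)    # unit cycles per instruction
    return instructions, instructions * passes

for bits in (256, 1024):
    issued, busy = issue_and_execute(1024, bits)
    print(f"{bits}-bit registers: {issued} instructions issued, {busy} unit cycles")
```

In this model the execution unit does the same amount of work either way; only the number of instructions that have to be fetched, decoded and scheduled drops by a factor of four.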
|
|
|
|
|
|
#15 | |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,497
|
Quote:
|
|
|
|
|
|
#16 | |
|
B3D Scallywag
|
Quote:
I'm not sure how you get 500 GFLOPS out of a quad Haswell, though? Even running at 4 GHz (which is certainly possible) it would need to be capable of twice the single precision FLOPS of Ivy Bridge. Is AVX2 going to double the throughput of AVX? (32 flops per cycle vs 16)
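The arithmetic behind that question, as a quick sketch. It assumes a quad-core part at 4 GHz and uses the 16 and 32 flops per cycle figures from the post; the function names are made up for illustration.

```python
def required_flops_per_cycle(target_gflops, cores, clock_ghz):
    """Flops per cycle each core must sustain to reach the target."""
    return target_gflops / (cores * clock_ghz)

def chip_peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak for the whole chip in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle

needed = required_flops_per_cycle(500, cores=4, clock_ghz=4.0)
print(f"needed per core: {needed} flops/cycle")
print(f"at 16 flops/cycle: {chip_peak_gflops(4, 4.0, 16)} GFLOPS")
print(f"at 32 flops/cycle: {chip_peak_gflops(4, 4.0, 32)} GFLOPS")
```

So 500 GFLOPS on four cores at 4 GHz only works out if each core roughly doubles its per-cycle peak over the 16 flops per cycle figure.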
|
|
|
|
|
|
#17 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
#18 | |
|
Member
|
Quote:
Even talking about SP it's just barely enough... because I don't think a 4-core Haswell can go up to 4 GHz.
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding. |
|
|
|
|
|
|
#19 |
|
Senior Member
|
By the time Haswell is out, I think Intel should already have a refined 22nm process up and running. After all, Haswell will be the first native architecture built for Tri-Gate.
|
|
|
|
|
#20 | |
|
B3D Scallywag
|
Quote:
EDIT: I've no doubt Haswell will be capable of hitting 4 GHz, but I doubt Intel will clock it that high given the lack of competition. I'm fairly sure Intel could have been releasing stock 4 GHz CPUs since Sandy Bridge if they'd felt the need.
|
|
|
|
|
|
#21 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
We know for a fact Haswell will support FMA, and we also know Sandy Bridge has a separate ADD and MUL execution unit. They can't go for a single FMA unit with Haswell, since that would dramatically cripple legacy performance. They also can't go for an ADD+FMA or MUL+FMA combination, because then the same port is needed by MUL and FMA or ADD and FMA respectively, and with typical instruction mix frequencies this actually results in lower performance due to port contention! So under the safe assumption that they want the extra transistors to pay off, the only sane option is dual FMA units. This also simplifies scheduling.
And note that Bulldozer already has dual FMA (even though it's 128-bit each, note that it's on 32 nm). This also isn't all that incredible compared to what we've come to expect from GPUs. And Intel clearly is putting a lot of Larrabee's technology into AVX2.
Quote:
Last edited by Nick; 15-May-2012 at 06:57. |
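The unit-count reasoning in this post can be put into numbers with a small sketch. It is a simplified model resting on stated assumptions: every unit is 8 lanes wide (256-bit, 32-bit floats), an FMA unit can also run a plain add or mul at one flop per lane, legacy code has a balanced add/mul mix, and nothing else limits issue. It does not model the port contention case. Names are made up for illustration.

```python
LANES = 8  # 256-bit unit / 32-bit floats

def legacy_peak(units):
    """Peak flops per cycle for code using separate ADD and MUL instructions.
    ASSUMPTION: each unit retires one 8-wide add or mul per cycle."""
    return LANES * len(units)

def fma_peak(units):
    """Peak flops per cycle for code written with FMA (2 flops per lane).
    Only units of kind 'fma' can execute it."""
    return 2 * LANES * sum(1 for kind in units if kind == "fma")

configs = {
    "separate ADD + MUL": ["add", "mul"],
    "single FMA": ["fma"],
    "dual FMA": ["fma", "fma"],
}

for name, units in configs.items():
    print(f"{name}: legacy {legacy_peak(units)}, FMA {fma_peak(units)} flops/cycle")
```

Under these assumptions a single FMA unit halves legacy throughput relative to separate ADD and MUL units, while dual FMA keeps legacy code where it was and doubles the peak for FMA code to 32 flops per cycle.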
||
|
|
|
|
|
#22 |
|
B3D Scallywag
|
Cheers Nick, it's a pretty exciting prospect. I kinda wish we could see Haswell in a console just so we can see what such a monster would be capable of if fully utilised for games. Good point about 4 GHz too; I guess if AMD hit it then Intel will have little choice but to match for marketing reasons.
|
|
|
|
|
#23 | |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,109
|
Quote:
|
|
|
|
|
|
#24 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
A quad-core Haswell chip would offer four times the peak FP throughput, and definitely consume less. Fortunately Piledriver looks like an improvement on the power consumption front, but AMD has to put AVX2 on the roadmap sooner rather than later to keep up. |
|
|
|
|
|
|
#25 | ||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,109
|
Quote:
If they had their way, there would have been an 8 core running several hundred MHz above where the 8150 is at. The architecture is not able to overcome its many tradeoffs in per-clock performance until it does.
Quote:
I'm in a relatively sour mood with regards to AMD today, so I was going to snark that the "keep up" part was out of the question several years ago.
||
|
|
|
|
|