SSSE3/SSE4.1/SSE5 already for 45nm "K10.5"?

Too bad Intel will never support SSE5, therefore severely limiting its reach (much in the same way 3DNow! was largely abandoned in favor of the original SSE).
They wish to promote their own upcoming SSE4.2 with "Nehalem".
 
Nice find w0mbat!
Too bad Intel will never support SSE5, therefore severely limiting its reach (much in the same way 3DNow! was largely abandoned in favor of the original SSE).
They wish to promote their own upcoming SSE4.2 with "Nehalem".
With run-time code generation it doesn't matter that much. Also, SSE4.2 merely adds some string compare instructions and such, while SSE5 adds fused multiply-add instructions which are much more useful for stream processing and the like. It's going to be surpassed by Intel's AVX though (doubling the vector width).

In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?

Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...
 
Nice find w0mbat!

With run-time code generation it doesn't matter that much. Also, SSE4.2 merely adds some string compare instructions and such, while SSE5 adds fused multiply-add instructions which are much more useful for stream processing and the like. It's going to be surpassed by Intel's AVX though (doubling the vector width).

In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?

Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...

So the expectation is 16 FLOPS/clock per AVX execution unit? Do you know if they are DP or SP? Also, what are the current rumors about Sandy Bridge's floating-point capability?
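To make the arithmetic behind the 1 TFLOP figure explicit, here is a rough sketch. It assumes an FMA counts as two floating-point operations per lane and that each execution unit retires one 256-bit instruction per cycle; neither is confirmed for Sandy Bridge, and the function names are made up for illustration.

```python
def flops_per_cycle(vector_bits, element_bits, fma):
    """FLOPs one vector unit retires per cycle (assumes 1 instruction/cycle)."""
    lanes = vector_bits // element_bits
    # Assumption: a fused multiply-add counts as 2 operations per lane.
    return lanes * (2 if fma else 1)


def peak_gflops(clock_ghz, cores, units_per_core, unit_flops_per_cycle):
    """Theoretical peak in GFLOPS; ignores memory and issue limits."""
    return clock_ghz * cores * units_per_core * unit_flops_per_cycle


sp_fma = flops_per_cycle(256, 32, fma=True)   # 8 SP lanes x 2 = 16
dp_fma = flops_per_cycle(256, 64, fma=True)   # 4 DP lanes x 2 = 8

print(peak_gflops(4.0, 8, 2, sp_fma))  # 1024.0 (single precision)
print(peak_gflops(4.0, 8, 2, dp_fma))  # 512.0 (double precision)
```

Under those assumptions, 16 FLOPS/clock per unit only works out for single precision with FMA; in double precision the same hypothetical chip would peak at about half a TFLOP.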

Cheers :D
 
In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?
AVX was disclosed to have FMA.
It will also have non-destructive operands, whereas SSE5's FMA overwrites the first multiplication operand.

Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...
The expectation on Intel's slides was 7 DP SSE ops per cycle, which may be inaccurate, because even a single 256-bit FMA should deliver the equivalent of 8.
Perhaps Intel has revamped the FPU specs from the early slides in response to SSE5, though it seems late to try to revamp it enough to deliver a factor of 5 increase.
 
Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...

Unless they give next-gen motherboards some huge multiple of present-day memory bandwidth, it may end up being a waste of silicon...
 
Unless they give next-gen motherboards some huge multiple of present-day memory bandwidth, it may end up being a waste of silicon...

I expect 38.4GB/s would be an absolute minimum. That's 3x DDR3 1600MHz.

But DDR3 2000MHz isn't that unrealistic IMO. That would give 48GB/s assuming only 3 channels.

I have no idea if that would be enough to support a TFLOP though.
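As a back-of-the-envelope check on those numbers, the sketch below assumes a standard 64-bit (8-byte) DDR3 channel and treats the "MHz" figures as transfer rates in MT/s. The 1 TFLOP target is the hypothetical from earlier in the thread, not a confirmed spec, and the function names are illustrative only.

```python
def ddr_bandwidth_gbs(transfers_mt_s, channels, bus_bytes=8):
    """Peak bandwidth in GB/s for DDR memory with 64-bit (8-byte) channels."""
    return transfers_mt_s * bus_bytes * channels / 1000.0


def bytes_per_flop(bandwidth_gbs, gflops):
    """Bytes of memory traffic available per floating-point operation."""
    return bandwidth_gbs / gflops


bw_1600 = ddr_bandwidth_gbs(1600, channels=3)
bw_2000 = ddr_bandwidth_gbs(2000, channels=3)

print(f"{bw_1600:.1f} GB/s")  # 38.4 GB/s
print(f"{bw_2000:.1f} GB/s")  # 48.0 GB/s

# Hypothetical 1 TFLOP chip, taken here as 1000 GFLOPS.
print(f"{bytes_per_flop(bw_1600, 1000.0):.4f} bytes/FLOP")
print(f"{bytes_per_flop(bw_2000, 1000.0):.4f} bytes/FLOP")
```

That is well under 0.05 bytes per FLOP in both cases, so code would need to do on the order of a hundred operations per 4-byte float fetched from memory to stay compute-bound. Anything that streams through main memory would be bandwidth-limited long before reaching peak; only cache-resident workloads could get close.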
 
AVX was disclosed to have FMA.
FMA is a separate CPUID flag. I haven't found any confirmation yet that Sandy Bridge will support it.
The expectation on Intel's slides was 7 DP SSE ops per cycle, which may be inaccurate, because even a single 256-bit FMA should deliver the equivalent of 8.
It may imply that FMA is actually not going to be supported by Sandy Bridge. 7 operations per cycle could be obtained with a dot product instruction (dpps: four multiplies plus three adds), although dpps is single-precision rather than double-precision.

I hope I'm wrong though.
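For reference, the count of 7 for a dot product falls out of simple arithmetic: an n-element dot product needs n multiplies and n-1 adds. The sketch below just counts them; how Intel's slide actually arrived at its figure is not known, so this is only one possible reading.

```python
def dot_product_ops(elements):
    """Arithmetic operations in an n-element dot product: n muls + (n-1) adds."""
    return elements + (elements - 1)


# dpps works on 4 single-precision elements in a 128-bit register.
print(dot_product_ops(4))  # 7

# dppd works on 2 double-precision elements in a 128-bit register.
print(dot_product_ops(2))  # 3
```

So 7 operations matches the 4-element single-precision form; the 128-bit double-precision form only accounts for 3.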
 
It may imply that FMA is actually not going to be supported by Sandy Bridge. 7 operations per cycle could be obtained with a dot product instruction (dpps: four multiplies plus three adds), although dpps is single-precision rather than double-precision.

I hope I'm wrong though.

http://softwarecommunity.intel.com/.../Intel-AVX-Programming-Reference-31943302.pdf

"Intel Advanced Vector Extensions (AVX) introduces 256-bit vector processing capability and includes two components to be introduced on Intel processor generations built from 32nm process and beyond.
-The first generation Intel AVX provides 256-bit SIMD register support, 256-bit vector floating-point instructions, enhancements to 128-bit SIMD instructions, support for three and four operand syntax.
-FMA is a future extension of Intel AVX, FMA provides floating-point, fused-multiply-add instructions supporting 256-bit and 128-bit SIMD vectors."
 
http://www.zdnetasia.com/news/hardware/0,39042972,62041107,00.htm

Looks like AMD's upcoming processors aren't based on the Bulldozer architecture. They seem to be betting too much on these proprietary extensions for performance PR. Reminds me a bit of the K6-2, when 3DNow! was used in a similar way.

There's no news about Fusion either. Hopefully, this means they're just announcing products that they've sampled and that they can deliver for now. I hope they get the execution right for Shanghai and build a few quarters of profitability as a base to expand R&D.
 
The roadmap update was for servers, so Fusion wouldn't have been mentioned.

Bulldozer being delayed is no surprise, but the extent of the delay is even larger than I had seen speculated.
The original delay from 2009 to 2010 was enough to have matched the delay between the original K8 replacement and the restart that led to Barcelona.

I don't get why AMD would scrap Montreal outright unless the system platform built around the core simply isn't working. There was a fair gap between its introduction in 2009 and the likely introduction of its successors in 2010.

edit: I missed the line on Istanbul. What socket is that going into?
 
I don't get why AMD would scrap Montreal outright unless the system platform built around the core simply isn't working. There was a fair gap between its introduction in 2009 and the likely introduction of its successors in 2010.

You know, oddly enough Charlie Demerjian's flippant remarks in the latest Inq piece on AMD's roadmap are actually right on the money (if you ask me).

edit: I missed the line on Istanbul. What socket is that going into?

edit: I was incorrect originally when I stated that the roadmap showed Istanbul on Socket F. Istanbul/Magny Cours will be on Socket G34.

The disappearance of Montreal was at first troubling, but the addition of Istanbul makes up for a lot of it: what you lose with two cores you make up for with no additional HT hop latency. Well, not really, but it will help.

10h is already suffering latency issues thanks to the pathetic L3 AMD implemented, and the lack of cHT-3 + the 4th cHT link, particularly in multi-socket systems.
 
3DNow! was a bit useful, as it was used in 3dfx drivers (ah, the fond memory of my K6-2 400 with a Voodoo2, the first PC I really owned and which I had LAN parties with). The Voodoo2 absolutely spanked a TNT on that rig because of that.
 