SSSE3/SSE4.1/SSE5 already for 45nm "K10.5"?

w0mbat · Apr 14, 2008

AMD just updated their CPUID guide and there are some interesting new features listed:

• CPUID Fn0000_0001_ECX[SSE41]: Added.
• CPUID Fn0000_0001_ECX[SSSE3]: Added.
• CPUID Fn8000_0001_ECX[SSE5]: Added.
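
For anyone who wants to check these flags on a given ECX value, here is a minimal sketch of the bit decoding. The SSSE3 (bit 9) and SSE4.1 (bit 19) positions follow the standard CPUID function 1 layout; the SSE5 position (bit 11 of Fn8000_0001_ECX) is an assumption based on the slot AMD later documented as XOP. The names `decode_features`, `FN0000_0001_ECX` and `FN8000_0001_ECX` are made up for illustration, and the register values are hypothetical rather than read from hardware.

```python
# Bit positions inside the ECX register returned by each CPUID function.
# SSSE3 and SSE4.1 use the standard function 1 layout.
FN0000_0001_ECX = {"SSSE3": 9, "SSE41": 19}
# Assumption: SSE5 is reported in bit 11 (the slot later documented as XOP).
FN8000_0001_ECX = {"SSE5": 11}


def decode_features(ecx, layout):
    """Return {flag_name: bool} for every named bit in an ECX value."""
    return {name: bool((ecx >> bit) & 1) for name, bit in layout.items()}


# Hypothetical register contents, not captured from a real CPU.
std_ecx = (1 << 9) | (1 << 19)
ext_ecx = 1 << 11

print(decode_features(std_ecx, FN0000_0001_ECX))
print(decode_features(ext_ecx, FN8000_0001_ECX))
```

On real hardware the ECX values would come from executing CPUID with EAX set to 1 and 0x80000001 respectively; that part is left out here so the sketch runs anywhere.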

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25481.pdf

So will we see the 45nm "K10.5 Shanghai" generation with Intel's SSSE3/SSE4.1 and the new SSE5?

INKster · Apr 15, 2008

Too bad Intel will never support SSE5, therefore severely limiting its reach (much in the same way 3DNow! was largely abandoned in favor of the original SSE).
They wish to promote their own upcoming SSE4.2 with "Nehalem".

Nick · Apr 15, 2008

Nice find w0mbat!

INKster said:
Too bad Intel will never support SSE5, therefore severely limiting its reach (much in the same way 3DNow! was largely abandoned in favor of the original SSE).
They wish to promote their own upcoming SSE4.2 with "Nehalem".

With run-time code generation it doesn't matter that much. Also, SSE4.2 merely adds some string compare instructions and such, while SSE5 adds fused multiply-add instructions which are much more useful for stream processing and the like. It's going to be surpassed by Intel's AVX though (doubling the vector width).

In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?

Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...
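
A rough sketch of the arithmetic behind the 1 TFLOP figure. It assumes single-precision elements, one FMA issued per unit per cycle, and an FMA counted as two floating-point operations; none of that is confirmed for Sandy Bridge. The helper names `flops_per_cycle` and `peak_gflops` are invented for this example.

```python
def flops_per_cycle(vector_bits, element_bits, fma=True):
    """FLOPs one vector unit retires per cycle at one instruction per cycle."""
    lanes = vector_bits // element_bits
    # Assumption: a fused multiply-add counts as two operations per lane.
    return lanes * (2 if fma else 1)


def peak_gflops(ghz, cores, units_per_core, flops_per_unit):
    """Theoretical peak in GFLOPS; ignores memory and issue limits."""
    return ghz * cores * units_per_core * flops_per_unit


sp = flops_per_cycle(256, 32)  # 8 lanes with FMA
dp = flops_per_cycle(256, 64)  # 4 lanes with FMA

print(peak_gflops(4, 8, 2, sp))  # 4 GHz, 8 cores, 2 units, single precision
print(peak_gflops(4, 8, 2, dp))  # same chip, double precision
```

Under these assumptions the single-precision case comes out at 1024 GFLOPS and the double-precision case at half that, so the 1 TFLOP number only holds for single precision with FMA on both units.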

pjbliverpool · Apr 15, 2008

Nick said:
Nice find w0mbat!

With run-time code generation it doesn't matter that much. Also, SSE4.2 merely adds some string compare instructions and such, while SSE5 adds fused multiply-add instructions which are much more useful for stream processing and the like. It's going to be surpassed by Intel's AVX though (doubling the vector width).

In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?

Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...

So the expectation is 16 FLOPS/clock per AVX execution unit? Do you know if they are DP or SP? Also, what are the current rumors about Sandy Bridge's floating-point capability?

Cheers

3dilettante · Apr 15, 2008

Nick said:
In my opinion the really interesting questions are: Will Sandy Bridge have 256-bit AVX execution units? Will it feature FMA? Will FMA have a throughput of 1 per cycle? Will it have two FMA execution units?

AVX was disclosed to have FMA.
It will also have non-destructive operands, whereas SSE5's FMA overwrites the first multiplication operand.

Nick said:
Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...

The expectation on Intel's slides was 7 DP SSE ops per cycle, which may be inaccurate, because even a single 256-bit FMA should deliver the equivalent of 8.
Perhaps Intel has revamped the FPU specs from the early slides in response to SSE5, though it seems late to revamp them enough to deliver a factor of 5 increase.

tachyon_john · Apr 17, 2008

Nick said:
Early GFLOP reports of Sandy Bridge seem to suggest that one or more of the above has a negative answer. This gives SSE5 a bit more potential, at least for a couple years more or so. Either way, a 4 GHz 8-core with two AVX execution units per core could reach 1 TFLOP...

Unless they give next-gen motherboards some huge multiple of present-day memory bandwidth, it may end up being a waste of silicon...

pjbliverpool · Apr 17, 2008

tachyon_john said:
Unless they give next-gen motherboards some huge multiple of present-day memory bandwidth, it may end up being a waste of silicon...

I expect 38.4GB/s would be an absolute minimum. That's 3x DDR3 1600MHz.

But DDR3 2000MHz isn't that unrealistic IMO. That would give 48GB/s assuming only 3 channels.

I have no idea if that would be enough to support a TFLOP though.
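
The bandwidth figures above can be reproduced with a small sketch. It assumes the quoted "1600MHz" and "2000MHz" are transfer rates in MT/s and that each DDR3 channel is 64 bits (8 bytes) wide; `dram_bandwidth_gbs` is a made-up helper name. The last two lines compare against a hypothetical 1 TFLOP (1000 GFLOPS) peak to show how few bytes per operation that leaves.

```python
def dram_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s for identical channels."""
    # Assumption: 8 bytes moved per transfer on each 64-bit channel.
    return channels * mega_transfers_per_s * bus_bytes / 1000


bw_1600 = dram_bandwidth_gbs(3, 1600)
bw_2000 = dram_bandwidth_gbs(3, 2000)
print(bw_1600, bw_2000)

# Bytes of DRAM traffic available per floating-point operation,
# measured against a hypothetical 1000 GFLOPS peak.
print(bw_1600 / 1000, bw_2000 / 1000)
```

Whether a few hundredths of a byte per FLOP is enough depends entirely on how much of the working set stays in cache, which this arithmetic says nothing about.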

Nick · Apr 18, 2008

3dilettante said:
AVX was disclosed to have FMA.

FMA is a separate CPUID flag. I haven't found any confirmation yet that Sandy Bridge will support it.

3dilettante said:
The expectation on Intel's slides was 7 DP SSE ops per cycle, which may be inaccurate, because even a single 256-bit FMA should deliver the equivalent of 8.

It may imply that FMA is actually not going to be supported by Sandy Bridge. 7 double-precision operations per cycle could be obtained with a dot product instruction (dpps).

I hope I'm wrong though.
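
To show where a count of 7 can come from, here is a sketch that tallies the arithmetic in a four-element dot product, the shape dpps computes on four single-precision lanes. It is a plain scalar model for counting operations, not a description of how the hardware schedules them; `dot_with_op_count` is a made-up name.

```python
def dot_with_op_count(a, b):
    """Return (dot product, number of multiplies plus adds performed)."""
    products = [x * y for x, y in zip(a, b)]  # one multiply per lane
    total = products[0]
    adds = 0
    for p in products[1:]:  # one add for each lane after the first
        total += p
        adds += 1
    return total, len(products) + adds


value, ops = dot_with_op_count([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(value, ops)
```

Four lanes give 4 multiplies and 3 adds, so one such instruction per cycle would be counted as 7 operations.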

DavidC · May 8, 2008

Nick said:
It may imply that FMA is actually not going to be supported by Sandy Bridge. 7 double-precision operations per cycle could be obtained with a dot product instruction (dpps).

I hope I'm wrong though.

http://softwarecommunity.intel.com/.../Intel-AVX-Programming-Reference-31943302.pdf

"Intel Advanced Vector Extensions (AVX) introduces 256-bit vector processing capability and includes two components to be introduced on Intel processor generations built from 32nm process and beyond.
-The first generation Intel AVX provides 256-bit SIMD register support, 256-bit vector floating-point instructions, enhancements to 128-bit SIMD instructions, support for three and four operand syntax.
-FMA is a future extension of Intel AVX, FMA provides floating-point, fused-multiply-add instructions supporting 256-bit and 128-bit SIMD vectors."

Raqia · May 8, 2008

http://www.zdnetasia.com/news/hardware/0,39042972,62041107,00.htm

Looks like AMD's upcoming processors aren't based on the Bulldozer architecture. They seem to be betting too much on these proprietary extensions for performance PR. Reminds me a bit of the K6-2, when 3DNow! was used in a similar way.

There's no news about Fusion either. Hopefully, this means they're just announcing products that they've sampled and that they can deliver for now. I hope they get the execution right for Shanghai and build a few quarters of profitability as a base to expand R&D.

3dilettante · May 8, 2008

The roadmap update was for servers, so Fusion wouldn't have been mentioned.

Bulldozer being delayed is no surprise, but the extent of the delay is even larger than I had seen speculated.
The original delay from 2009 to 2010 was enough to have matched the delay between the original K8 replacement and the restart that led to Barcelona.

I don't get why AMD would scrap Montreal outright unless the system platform built around the core simply isn't working. There was a fair gap between its introduction in 2009 and the likely introduction of its successors in 2010.

edit: I missed the line on Istanbul. What socket is that going into?

ShaidarHaran · May 8, 2008

3dilettante said:
I don't get why AMD would scrap Montreal outright unless the system platform built around the core simply isn't working. There was a fair gap between its introduction in 2009 and the likely introduction of its successors in 2010.

You know, oddly enough Charlie Demerjian's flippant remarks in the latest Inq piece on AMD's roadmap are actually right on the money (if you ask me).

3dilettante said:
edit: I missed the line on Istanbul. What socket is that going into?

edit: I was incorrect originally when I stated that the roadmap showed Istanbul on Socket F. Istanbul/Magny Cours will be on socket G34.

The disappearance of Montreal was at first troubling, but the addition of Istanbul makes up for a lot of it: what you lose with two cores you make up for with no additional HT hop latency. Well, not really, but it will help.

10h is already suffering latency issues thanks to the pathetic L3 AMD implemented, and the lack of cHT-3 + the 4th cHT link, particularly in multi-socket systems.

Blazkowicz · May 9, 2008

3DNow! was a bit useful as it was used in 3dfx drivers (ah, the fond memory of my K6-2 400 with Voodoo2, the first PC I really owned and which I had LAN parties with). The Voodoo2 absolutely spanked a TNT on that rig because of that.
