K8L performance

pascal · Sep 11, 2006

AMD will launch a new core named K8L next year. It is supposed to be the successor to the K8 (Athlon 64), avoiding the unfortunate name K9 (canine). See some info/links here: http://en.wikipedia.org/wiki/K8L

The new core will come with a new 65nm process, so it should be faster.
But what is interesting are the rumors of a second FPU unit and SSSE3 capabilities.
IIRC SSSE3 has the MA (multiply-add) operation defined.

Will this K8L be up to 4 times as fast as the K8 core at the same MHz?
Will we see a dual-core 4GHz chip next year capable of 16GFlops?
What can we expect from this new K8L?
Is this the end of many new ...PUs?
How fast can it run Quake in softmode?
Could it run a dot3 in soft mode?
What will happen with the 45nm, 35nm and 22nm processes?
Will we see a 22nm 8-K8L cores 12GHz 384GFlops CPU by ~2012?
Will this new k8l multicore be memory limited?
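
For what it's worth, the GFlops figures in these questions fall out of a simple peak-throughput product: cores x clock x floating-point operations per cycle per core. A minimal Python sketch follows. The `flops_per_cycle` values (2 and 4) do not come from any AMD document; they are simply what the 16 and 384 GFlops numbers imply, and `peak_gflops` is a name made up for this illustration.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every core completes flops_per_cycle FP ops each cycle."""
    return cores * clock_ghz * flops_per_cycle

# Dual-core 4 GHz chip; 2 flops/cycle/core is back-solved from the 16 GFlops figure.
dual_core = peak_gflops(cores=2, clock_ghz=4.0, flops_per_cycle=2)

# Hypothetical 8-core 12 GHz part; 4 flops/cycle/core is back-solved from 384 GFlops.
eight_core = peak_gflops(cores=8, clock_ghz=12.0, flops_per_cycle=4)

print(dual_core, eight_core)  # 16.0 384.0
```

These are ceilings only; real code rarely sustains the peak, which is where the memory-limit question comes in.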

K.I.L.E.R · Sep 11, 2006

More importantly, can it play Quake 3 raytraced at 640x480 at realtime framerates?

_xxx_ · Sep 11, 2006

Dude, you got unresolved questions there for at least ten threads, the only one missing being "Is there an Elvis pic hidden on the die?" or such

Blazkowicz · Sep 11, 2006

canine doesn't sound too bad, here previous generations read as "cassis" (a kind of red fruit) and "cassette".

pascal · Sep 11, 2006

_xxx_ said:
Dude, you got unresolved questions there for at least ten threads, the only one missing being "Is there an Elvis pic hidden on the die?" or such

Probably there is a pit-bull pic (canine)

pascal · Sep 11, 2006

Another question:
How well could a multicore K8L run a voxel-based game at 720p?

swaaye · Sep 11, 2006

Sounds like a test to perform with one of them thar Delta Force games. Or maybe a hacked Comanche Maximum Overkill!!

Guden Oden · Sep 11, 2006

swaaye said:
Or maybe a hacked Comanche Maximum Overkill!!

Outcast. Nuff said!

pascal · Sep 11, 2006

Guden Oden said:
Outcast. Nuff said!

Outcast has: bump-mapping, depth of field, anti-aliasing. Sounds good

The downside is that those games are not optimized for newer architectures like K8L.
Maybe the second FPU can be activated without any special optimization.

But a real test would be a new voxel engine, properly optimized and with new tricks/FX.

pascal · Sep 11, 2006

Looks like there aren't two FPUs, but rather a wider, improved FPU and wider buses.
http://www.xbitlabs.com/articles/cpu/display/amd-k8l_5.html
This explains the 1.5x improvement I read about on Wikipedia.
But this probably doesn't account for the MA instruction improvement.

zsouthboy · Sep 12, 2006

Having just read about the quad-core Core 2 that Intel will be releasing soon, I am salivating at the future.

I've already promised myself (and set aside the funds for) an 8-way box; whether that's 2x quad-core Intel or 2x quad-core K8L remains to be seen.
(An aside: I'm going balls to the wall: 8 gigs of RAM at the minimum, RAID, the whole deal... I can't wait!)

Of course, that depends on the availability of 2-way motherboards; something tells me that we'll see more K8L-based ones. (The socket for Torrenza will be a standard AM2/3, correct?)

3dilettante · Sep 12, 2006

pascal said:
The new core will come with a new 65nm process, so it should be faster.
But what is interesting are the rumors of a second FPU unit and SSSE3 capabilities.
IIRC SSSE3 has the MA (multiply-add) operation defined.

I don't think it does. That would require loading an additional operand and require a much more robust fp loading pipeline than either Prescott (for which SSE3 was made) or K8.

Will this K8L be up to 4 times as fast as the K8 core at the same MHz?

I'm going to say no on most workloads that don't choke horribly on K8 right now.

Will we see a dual-core 4GHz chip next year capable of 16GFlops?

I think AMD would have to seriously struggle to get that to happen. Intel might be able to force Conroe or a descendant past that point, but probably not unless AMD poses a serious performance threat, and it won't for some time.

What can we expect from this new K8L?

Better integer performance per clock, possibly as anemic a clock speed rise from K8 to K8L as it was for K7 to K8 (remember how long it took to get a Hammer core that outclocked a Barton?).

Against Conroe, integer single-threaded performance will probably be lower on K8. It simply is not as aggressive as Intel's core is when it comes to speculating ahead of memory accesses. This might be less noticeable in Long-mode, since K8L's instruction fetch is geared to handle longer instructions, while Conroe handles more, shorter instructions.

Is this the end of many new ...PUs?

When they start, I'll get around to looking at them.

How fast can it run Quake in softmode?

Faster than K8, and faster than Conroe, if the code hasn't been reworked for SSE3. For x87 fp (which will probably die a slower death than it should), K8L will have a serious advantage over Conroe, because Intel's core has some issue restrictions due to how it allocates units for old fp instructions.

What will happen with the 45nm, 35nm and 22nm processes?

If AMD does what it usually does, it will probably replace K8L by 35nm. Considering how far off that node is, they better have a better core by then.

Will we see a 22nm 8-K8L cores 12GHz 384GFlops CPU by ~2012?

I hope not, they'd get slaughtered by the Core4 Decaduos or whatever.

Will this new k8l multicore be memory limited?

Not a high performance core in the world that isn't already limited by memory in most situations.

pascal · Sep 12, 2006

3dilettante said:
I don't think it does. That would require loading an additional operand and require a much more robust fp loading pipeline than either Prescott (for which SSE3 was made) or K8.

Looks like the K8L will have a more robust (wider) fp pipeline http://www.xbitlabs.com/articles/cpu/display/amd-k8l_5.html

In the K8L processor the FADD and FMUL devices will be expanded to 128 bits (Figure 5), which will help double the theoretical floating-point performance with code that uses vector SSE instructions (not only due to a doubled dispatch rate, but also due to an increased decoding and retiring rate caused by the reduced number of generated macro-ops).

The buses for reading data from the cache will also become two times wider, which will enable the processor to perform two 128-bit data loads from the L1 cache per cycle. The ability to perform two 128-bit data reads per cycle can give the K8L an advantage in some algorithms over a Conroe-core processor that is only capable of performing one 128-bit load.

And wider/faster integer SIMD

Besides having wider execution units, the K8L will also have wider integer units inside the FADD and FMUL blocks that deal with SSE2 commands processing. As a result, the integer applications using these instruction sets will work faster. Also K8L will learn to perform a few extra SSE instructions that we won't discuss here.
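
A rough way to read the quoted paragraphs in numbers, as a Python sketch. It assumes that a 128-bit SSE op costs one macro-op per pass through the execution unit and that the current K8 units are 64 bits wide (implied by the "expanded to 128 bits" wording, but an assumption here). The function names are made up for this illustration.

```python
import math

def macro_ops_per_sse_op(vector_bits: int, unit_bits: int) -> int:
    """Passes needed to push one vector op through a unit of the given width."""
    return math.ceil(vector_bits / unit_bits)

def l1_load_bytes_per_cycle(loads_per_cycle: int, load_bits: int) -> int:
    """Peak bytes read from the L1 cache per cycle."""
    return loads_per_cycle * load_bits // 8

k8_ops = macro_ops_per_sse_op(128, 64)    # 64-bit units (assumed for K8): 2 macro-ops
k8l_ops = macro_ops_per_sse_op(128, 128)  # 128-bit units (K8L per the article): 1

k8l_loads = l1_load_bytes_per_cycle(2, 128)     # two 128-bit loads: 32 bytes/cycle
conroe_loads = l1_load_bytes_per_cycle(1, 128)  # one 128-bit load: 16 bytes/cycle

print(k8_ops, k8l_ops, k8l_loads, conroe_loads)  # 2 1 32 16
```

Halving the macro-ops per vector instruction is where the "doubled" theoretical rate in the article comes from; it says nothing about sustained performance.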

pascal · Sep 12, 2006

3dilettante said:
I hope not, they'd get slaughtered by the Core4 Decaduos or whatever.

Agreed. Do you think we will see, somewhere in the future, the end or discontinuation of old ISAs like IA-32? Maybe just keep x86-64 without the legacy? Or some kind of C-optimized x86-64 (more like a RISC)?

3dilettante · Sep 12, 2006

pascal said:
Looks like the K8L will have a more robust (wider) fp pipeline http://www.xbitlabs.com/articles/cpu/display/amd-k8l_5.html

The units seem to be clustered to allow for two simultaneous 128-bit ops.
It's unlikely that AMD will have a MAD instruction because it wouldn't be in SSE3 (which was introduced on a Prescott that didn't have the resources for such an instruction), and they don't have a good history with non-Intel instructions.

3DNOW had some nice stuff in it, even before SSE2 made it irrelevant.

If there is some kind of fused multiply-add, it will not be under the auspices of any SSE set. Because of that, it is likely it would never be used.

to answer the next post:

pascal said:
Agreed. Do you think we will see, somewhere in the future, the end or discontinuation of old ISAs like IA-32? Maybe just keep x86-64 without the legacy? Or some kind of C-optimized x86-64 (more like a RISC)?

Discontinuation for x86 is inevitable in the same way that some day every mountain shall turn to dust and all stars in the sky will burn out. Sure it'll happen, but it'll probably outlive us only to be defeated by ARM-powered cockroaches.

x86-64 would in the end be better off without having to worry about IA-32, but there are a lot of quirks to it that only make sense when you know it is an extension of an older ISA. Those wrinkles won't go away as long as x86 is part of the name.

A safe bet would be in the future they'll probably make a multicore with a lot of cores, but only a few have full backwards compatibility.

pascal · Sep 13, 2006

Anyway, people believe AMD will try to promote an SSE4a (SSSE3 + extensions), and that it will include a multiply-add instruction. Only time will tell.

The ARM-powered cockroaches bit is prophetic. Many believe ARM will pass the 8051 in unit volume within the next 10 years.

For x86-64, what makes sense or not from an ISA point of view is less relevant than optimizing it for the de facto low-level language these days (C). Like what Motorola did to the 68k with the new ColdFire C-optimized ISA.

And the idea of keeping just a few cores with full backwards compatibility came to my mind too. For example, when you have 16 cores, maybe keep just 2 or 4 with full backwards compatibility.

3dilettante · Sep 14, 2006

pascal said:
Anyway, people believe AMD will try to promote an SSE4a (SSSE3 + extensions), and that it will include a multiply-add instruction. Only time will tell.

AMD has terrible luck at pushing its own instructions. Hammer originally was going to go with its own FP instruction set, but that got junked in favor of SSE2.
Then there are questions about just what kind of multiply-add it would be.
Register>Register>Register,
Memory>Register>Register,
etc.
Then there's the thing where anything x86 has the shared source/destination operand. Where's that going to go with a three-operand instruction?

Depending on how it works, the software is either more complicated, or the extra load's going to conflict with K8's cache porting.

For x86-64, what makes sense or not from an ISA point of view is less relevant than optimizing it for the de facto low-level language these days (C). Like what Motorola did to the 68k with the new ColdFire C-optimized ISA.

That would limit the market quite a bit for the chip. It needs the support of everyone, and hitching an ISA to a programming language is a problem when an ISA may live for decades, but a new revision of the language comes out every few years. It would threaten to turn AMD's chip into a niche product, since not everyone is going to use C.

Science types still use FORTRAN, and those compilers still get the best FP performance.
Languages that restrict memory aliasing would be a better target than C, which I think does not.

Gubbi · Sep 15, 2006

There's no real reason (other than instruction stream density) to go with a fused multiply-accumulate (mul-add is a bit more complex in x86). As 3Dd. says, it plays havoc with the 2-address opcode scheme of x86.

But instead of defining a new extension of the ISA the instruction-decoder could simply fuse consecutive multiplies and adds:

a=a*b
a=a+c

to

a=a*b+c

While preserving semantics (that would include rounding of the intermediate).

This way you don't clutter your ISA with new instructions, and at the same time you potentially double throughput, since a fused mul-acc requires ½ the issue ports and result bus resources internally in the CPU (but it would require changes to the scheduler, since it has 3 source operands instead of 2).
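
The decoder-side fusion described above can be sketched as a peephole pass over a list of two-address ops. This is only an illustration in Python, not how any real decoder works: the tuple encoding and the `muladd` op are invented for the example, and the pass only fuses when the add accumulates into the multiply's destination and the addend is a different register.

```python
def fuse_mul_add(ops):
    """Fuse 'mul d, s1' followed directly by 'add d, s2' into 'muladd d, s1, s2'.

    Each op is a tuple: ("mul", dst, src) means dst = dst * src,
    and ("add", dst, src) means dst = dst + src.
    """
    fused = []
    i = 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (
            nxt is not None
            and op[0] == "mul"
            and nxt[0] == "add"
            and nxt[1] == op[1]   # add accumulates into the multiply's destination
            and nxt[2] != op[1]   # addend must not be the freshly written product
        ):
            fused.append(("muladd", op[1], op[2], nxt[2]))
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused

program = [("mul", "a", "b"), ("add", "a", "c"), ("add", "d", "a")]
print(fuse_mul_add(program))
# [('muladd', 'a', 'b', 'c'), ('add', 'd', 'a')]
```

Three ops become two here, which is the issue-port saving; the fused op still has to round the intermediate product if the original semantics are to be preserved.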

Cheers

ADEX · Sep 15, 2006

Will this K8L be up to 4 times as fast as the K8 core at the same MHz?

In some case it may be, but only where the differences in the cores make a difference.
For the most part no, processors do not scale like that and neither does most software (if it scales at all).

Will we see a dual-core 4GHz chip next year capable of 16GFlops?

Dual core 2.5GHz PowerPC 970 already beats that.
An IBM JS21 (quad core blade) gets 33.2 GFlops on Linpack (n=1000).

What can we expect from this new K8L?

It'll give Intel a good run for its money but you'll need rewritten software to take full advantage of it.
Don't expect much in the way of a clock speed boost, in fact it'll probably be lower clocked.
Intel's quad core chips (actually 2 dual core chips in the same package) are lower clocked and use the Merom core which was designed for laptops.

What will happen with the 45nm, 35nm and 22nm processes?

Cores, cores and more cores.

Will we see a 22nm 8-K8L cores 12GHz 384GFlops CPU by ~2012?

No, there'll be completely new core designs by then, some of which will be GPU shaders (AMD have already done presentations on this). They'll probably be remarkably Cell-like.

Will this new k8l multicore be memory limited?

If you double the number of cores you also need to double the memory bandwidth and not reduce memory latency. In reality memory bandwidth per core will drop and latency will increase to both cache and memory.
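
To put a number on the per-core squeeze, here is a small Python sketch. The 12.8 GB/s total is only an illustrative figure (the theoretical peak of dual-channel DDR2-800), not a K8L specification, and it is held fixed while the core count grows.

```python
def bandwidth_per_core(total_gb_s: float, cores: int) -> float:
    """Even split of a shared memory bus across cores (ignores caches and contention)."""
    return total_gb_s / cores

TOTAL_GB_S = 12.8  # illustrative assumption: dual-channel DDR2-800 peak, not a K8L spec

shares = {cores: bandwidth_per_core(TOTAL_GB_S, cores) for cores in (1, 2, 4, 8)}
for cores, share in shares.items():
    print(f"{cores} core(s): {share:.1f} GB/s each")
```

Each doubling of cores halves the share unless the memory system is widened to match.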

3dilettante · Sep 15, 2006

Gubbi said:
There's no real reason (other than instruction stream density) to go with a fused multiply-accumulate (mul-add is a bit more complex in x86). As 3Dd. says, it plays havoc with the 2-address opcode scheme of x86.

But instead of defining a new extension of the ISA the instruction-decoder could simply fuse consecutive multiplies and adds:

a=a*b
a=a+c

to

a=a*b+c

While preserving semantics (that would include rounding of the intermediate).

This way you don't clutter your ISA with new instructions, and at the same time you potentially double throughput, since a fused mul-acc requires ½ the issue ports and result bus resources internally in the CPU (but it would require changes to the scheduler, since it has 3 source operands instead of 2).

Cheers

On-the-fly merging of ops wouldn't make it a fused multiply-add exactly. Part of the bonus of the instruction is that it maintains a much higher internal precision than two separate operations.
However, floating point math is often very counterintuitive. Sometimes keeping lower-order bits that would be rounded or truncated can mess with an algorithm that assumes the junk will be lost.

This can be very critical in cases where the multiply operands lead to results that should be denormalized or zeroed out prior to being added to the third operand.

If the code worked prior to the operation fusion, a true emulation of a fmadd could ruin it.
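
The difference is easy to show with doubles. The Python sketch below emulates a true fused multiply-add by computing a*b+c exactly with rationals and rounding once; the emulation via `fractions.Fraction` is a stand-in chosen for illustration (it only handles finite inputs), not how hardware does it. The operands are picked so that the exact product, 1 - 2**-54, sits halfway between two doubles and rounds to 1.0.

```python
from fractions import Fraction

def mul_then_add(a: float, b: float, c: float) -> float:
    """Two separate double operations: the product is rounded before the add."""
    return a * b + c

def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulated fused op: exact a*b+c in rationals, rounded to double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

separate = mul_then_add(a, b, c)  # product rounds to 1.0, so the sum is 0.0
fused = fused_mul_add(a, b, c)    # keeps the low-order term: -2**-54

print(separate, fused)
```

Code that tests the result against zero behaves differently in the two cases, which is exactly the kind of breakage a silent fusion could introduce.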

If it were extended to register/memory ops and other complex situations, semantics can't be preserved with respect to memory accesses if the fusion nixes one of the writes.

A properly rounded fused multiply-add would lose some of its luster, and it would require additional read ports from the register file or risk frequent cache stalls due to the extra operand. On the other hand, unless it is used often, the additional hardware devoted to the instruction won't get used, given how the separate pipelines are laid out.

Without it being explicitly in the ISA, AMD and Intel are probably better off making it so that their FPU can just churn and issue through the ops much more quickly, or at most just save the scheduler a headache and do some kind of shared internal entry for a slightly longer operation.
