K8L performance

pascal · Sep 19, 2006

fellix said:
I think it's a bit too early to consider any real intention of AMD using Z-RAM cells in rev. H, or isn't it!?
Nevertheless, the L2 SRAM cells in the 65nm shots are really very dense.

Agreed. But we see a 5 times relative reduction in cache size.

fellix said:
Another curious thing is the 2nd FP block.
Given that it will obviously handle only SIMD (SSE 1~4a) ops, it should, together with the legacy (1st) FP block, simply double (or in some cases quadruple) SSE performance. The strange thing is that this 2nd block has its own FP register file.
Maybe both FP blocks will work together in some interleaved manner!?

Humm, maybe.

Blazkowicz · Sep 20, 2006

don't forget what you're looking at : 512K L2 on revision H core, versus 1M on the previous ones

so, with half the size and the process going from 90 to 65nm, that accounts for a surface 3.8 times smaller ( 2*(90/65)² ).
area is around 5.2 times lower, so AMD, by refining its cache or I don't know what, has done a further 37% increase in density (or (90/65)² is a bit simplistic).
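
A quick way to check that arithmetic, as a rough sketch only: it assumes area scales with the square of the process node and linearly with capacity, and takes the 5.2x figure from the post as given. The function and variable names are made up for illustration. Without rounding the intermediate 3.8, the leftover density gain comes out at roughly 36%, close to the 37% quoted.

```python
def ideal_area_ratio(old_nm, new_nm, capacity_ratio):
    """How many times smaller the cache should get from scaling alone.

    Assumes area ~ capacity * (feature size)^2, which is a simplification.
    """
    return capacity_ratio * (old_nm / new_nm) ** 2


# 1M -> 512K halves the capacity (ratio 2), process goes 90nm -> 65nm.
expected = ideal_area_ratio(90, 65, 2.0)

# Area reduction read off the die shots, as quoted in the post.
observed = 5.2

# Whatever is left over must come from a denser cell/array design.
extra_density = observed / expected - 1.0

print(f"scaling alone: {expected:.2f}x smaller")
print(f"extra density: {extra_density:.1%}")
```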

weren't AMD caches fatter (less dense) than Intel ones till now? (considering an equal process size)

pascal · Sep 20, 2006

This combination of factors could explain such reduction. Thanks

Moloch · Sep 20, 2006

Well since K7s and even more-so K8s aren't really dependent on cache it's a good move imo :smile:

3dilettante · Sep 20, 2006

radeonic2 said:
Well since K7s and even more-so K8s aren't really dependent on cache it's a good move imo :smile:

They rely heavily on their caches. One of the annoying trade-offs made thus far has been the 3-cycle latency of their L1 caches, which could probably gain them several percent in performance if they could figure a way to shave a cycle off.
They just didn't need so much extra cache to hide some performance weakness.

That won't be the case if they're facing off against a Conroe with larger cache and equal or better execution power.

P4 got a lot of cache slathered on because of a number of missteps and unforeseen problems that restricted its clock speed gains and reduced its IPC even more than the designers thought would happen.

Conroe is unlikely to need cache just to make up for an execution shortfall, and it will still have a cache situation as good as or more likely better than K8L's.

The unknown is the outcome from the extra level of cache being given to AMD's future quad-cores. However, Intel has been supplying L3 cache on its chips for some time.

Moloch · Sep 20, 2006

3dilettante said:
They rely heavily on their caches. One of the annoying trade-offs made thus far has been the 3-cycle latency of their L1 caches, which could probably gain them several percent in performance if they could figure a way to shave a cycle off.
They just didn't need so much extra cache to hide some performance weakness.

That won't be the case if they're facing off against a Conroe with larger cache and equal or better execution power.

P4 got a lot of cache slathered on because of a number of missteps and unforeseen problems that restricted its clock speed gains and reduced its IPC even more than the designers thought would happen.

Conroe is unlikely to need cache just to make up for an execution shortfall, and it will still have a cache situation as good as or more likely better than K8L's.

The unknown is the outcome from the extra level of cache being given to AMD's future quad-cores. However, Intel has been supplying L3 cache on its chips for some time.

I meant size-wise :???:

I know their cache latency and speed have always been the downside, though.

fellix · Sep 20, 2006

All the K8 (and K7 to some degree) are architectures that primarily rely on working with the L1 array, partially due to their exclusive L1-L2 relation using an LRU scheme.
I'm not sure about K7, but K8 can bypass L2 completely and load data straight from memory (one benefit of the IMC). In this case L2 is actually used mostly as an LRU "overflow" container, for state changes and other stuff.

btw, the very first K7 (for Slot-A) was an inclusive (L1-L2) design.
If you compare L1 throughput between P4 and K8, clock-for-clock, the latter is a tad better. It's not all about access latency.

Baraclese · Sep 26, 2006

I believe FMADD is in the revision of IEEE 754, also Itanium has an FMADD. Plus the AMD64 extension was a success and AMD has gained much more respect since 3dnow!. Which makes me think that we could very well see FMADD in the K8L.

Gubbi · Sep 26, 2006

Baraclese said:
I believe FMADD is in the revision of IEEE 754, also Itanium has an FMADD. Plus the AMD64 extension was a success and AMD has gained much more respect since 3dnow!. Which makes me think that we could very well see FMADD in the K8L.

I don't think we will see it. AMD already talked about technical floating point extensions prior to the AMD64 enabled K8s.

They never materialized.

Rumour: Not because they wouldn't have been a good idea, but rather because a certain software monopoly said that they would only support one FP extension, SSE2.

Cheers

Gubbi · Sep 26, 2006

3dilettante said:
P4 got a lot of cache slathered on because of a number of missteps and unforeseen problems that restricted its clock speed gains and reduced its IPC even more than the designers thought would happen.

P4 Prescott is one big enigma. Why did they double the register file to 256 integer and 256 floating point registers while maintaining a 126 entry ROB? Worst case rename register usage would be architected registers + number of entries in the ROB, which is way lower than 256, right?
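
A back-of-the-envelope version of that bound, as a sketch only. It assumes 16 architected integer registers per context (the x86-64 GPR count), at most one destination register per ROB entry, and it ignores any microarchitectural temporaries. The helper name and constants are made up for illustration.

```python
def worst_case_physical_regs(arch_regs_per_context, contexts, rob_entries):
    """Upper bound on physical registers needed at once.

    Every context must hold its committed architected state, and every
    in-flight ROB entry may hold one not-yet-committed result.
    """
    return arch_regs_per_context * contexts + rob_entries


X86_64_INT_REGS = 16   # assumption: integer GPRs per hardware context
ROB_ENTRIES = 126      # figure quoted in the post

# Try one, two and four hardware thread contexts.
needs = {
    contexts: worst_case_physical_regs(X86_64_INT_REGS, contexts, ROB_ENTRIES)
    for contexts in (1, 2, 4)
}

for contexts, need in needs.items():
    print(f"{contexts} context(s): {need} registers")
```

Under these assumptions even four contexts would stay below 256, so the bound alone does not explain a file that large.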

One could suspect Prescott of being able to have more than two active contexts, like, say, four. The reason it isn't enabled could be because real life tests show that more than two hyper thread contexts degrade performance... Or that software licensing issues allow two logical CPUs on a physical CPU for the price of one, but not four.

Who knows?

Cheers

3dilettante · Sep 26, 2006

Gubbi said:
P4 Prescott is one big enigma. Why did they double the register file to 256 integer and 256 floating point registers while maintaining a 126 entry ROB? Worst case rename register usage would be architected registers + number of entries in the ROB, which is way lower than 256, right?

The only source for that 256 register number I could find was from Chip-Architect's speculative article that came out prior to Prescott's release.

Nothing after release seems to indicate 256 physical registers in the integer file.
Is there another source confirming this post-release?

That the RAT doubled in size is unsurprising, since the extra registers in x86-64 need their own entries as well.

It's also not entirely unheard of to have more registers than the ROB seems to need.
The Alpha EV8 (if it hadn't been canceled) would have had two mirrored register files to cut down on access time. Being 8-wide, the number of ports would have strangled clock scaling if they all went to the same place.

edit:
Now that I remember, so did the EV6, and that was released.

Gubbi · Sep 26, 2006

3dilettante said:
The only source for that 256 register number I could find was from Chip-Architect's speculative article that came out prior to Prescott's release.

Nothing after release seems to indicate 256 physical registers in the integer file.
Is there another source confirming this post-release?

Hmm, I don't know where I picked that up. Just browsed the P4 Prescott chapter in "The Unabridged Pentium 4" which details a whole bunch of improvements, but does not mention any changes to the register file.

So you might be right.

3dilettante said:
Now that I remember, so did the EV6, and that was released.

Yup, trades half the read ports for an extra cycle of latency for results from the other exec unit cluster

Cheers

fellix · Sep 29, 2006

OK, it seems the Inquirer got stuck (literally) on some pre-production wafer with Deerhounds.
Anyway, I took a chance and jumped into PS to do some measurements. Assuming this is a 300mm wafer (of course), my previous daring attempt to guess the die size departs quite a bit from what I counted on the wafer: 204 "visually" healthy pieces (+/- 2 sliced at the edge) versus 242 if I count with my die size numbers. This means that the initial die size should be a tad higher than 300mm².
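
For anyone who wants to play with the numbers, here is a rough sketch that uses a common textbook approximation for gross dies per wafer (wafer area divided by die area, minus an edge-loss term). It ignores scribe lanes, edge exclusion and die aspect ratio, so it gives a ballpark only and will not match a manual count from a photo. The function names are made up for illustration.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Approximate gross die count: area ratio minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return wafer_area / die_area_mm2 - edge_loss


def die_area_for_count(wafer_diameter_mm, target_dies, lo=1.0, hi=2000.0):
    """Invert dies_per_wafer by bisection (count falls as die area grows)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if dies_per_wafer(wafer_diameter_mm, mid) > target_dies:
            lo = mid   # too many dies fit, so the die must be larger
        else:
            hi = mid   # too few dies fit, so the die must be smaller
    return (lo + hi) / 2


# Die area implied by each count on a 300mm wafer, under this approximation.
for count in (242, 204):
    area = die_area_for_count(300, count)
    print(f"{count} dies -> about {area:.0f} mm^2 per die")
```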
