K8L performance

I think it's a bit too early to consider any real intention of AMD using Z-RAM cells in rev. H, or isn't it!?
Nevertheless, the L2 SRAM cells in the 65nm shots are really very dense. ;)
Agree. But we see a 5 times relative reduction in cache size ;)

Another curious thing is the 2nd FP block.
Given that it will apparently handle only SIMD (SSE 1~4a) ops, it will, together with the legacy (1st) FP block, simply double (or quadruple in some cases) SSE performance. The strange thing here is why this 2nd block has its own FP register file.
Maybe both FP blocks will work together in some interleaved manner!?
Hmm, maybe.
 
don't forget what you're looking at : 512K L2 on revision H core, versus 1M on the previous ones ;)

so, with half the size and the process going from 90 to 65nm, that accounts for a surface 3.8 times smaller ( 2*(90/65)² ).
area is around 5.2 times lower, so AMD, by refining its cache or I don't know what, has done a further 37% increase in density (or (90/65)² is a bit simplistic).
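
A quick way to check the arithmetic in this post is the sketch below. It assumes, as the post does, that cache area scales with capacity and with the square of the process node; the function name and the 5.2x figure as an input are only for illustration, and the 5.2x itself is an eyeballed estimate from the die shots.

```python
def expected_shrink(old_kb, new_kb, old_nm, new_nm):
    """Area ratio expected from halving capacity plus an ideal linear shrink."""
    return (old_kb / new_kb) * (old_nm / new_nm) ** 2

# 1M L2 at 90nm versus 512K L2 at 65nm (numbers taken from the post).
expected = expected_shrink(1024, 512, 90, 65)

# Observed area reduction quoted in the post (an estimate, not a measurement).
observed = 5.2

# Whatever the ideal shrink does not explain is attributed to denser cells.
extra_density = observed / expected - 1

print(f"expected shrink: {expected:.2f}x")
print(f"extra density:   {extra_density:.0%}")
```

With the unrounded 3.83x the leftover comes out nearer 36% than 37%, which is well inside the error of the 5.2x estimate.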

weren't AMD caches fatter (less dense) than Intel ones till now? (considering an equal process size)
 
Well, since K7s and even more so K8s aren't really dependent on cache, it's a good move imo :smile:
 
They rely heavily on their caches. One of the annoying trade-offs made thus far has been the 3-cycle latency of their L1 caches; shaving a cycle off could probably gain them several percent in performance, if they could figure out a way to do it.
They just didn't need so much extra cache to hide some performance weakness.

That won't be the case if they're facing off against a Conroe with larger cache and equal or better execution power.

P4 got a lot of cache slathered on because of a number of missteps and unforeseen problems that restricted its clock speed gains and reduced its IPC even more than the designers thought would happen.

Conroe is unlikely to need cache just to make up for an execution shortfall, and it will still have a cache situation as good as or more likely better than K8L's.

The unknown is the outcome from the extra level of cache being given to AMD's future quad-cores. However, Intel has been supplying L3 cache on its chips for some time.
 
I meant size wise :???:
I know their cache latency and speed have always been the downside though.
 
All the K8s (and K7s to some degree) are architectures that primarily rely on working with the L1 array, partially due to their exclusive L1-L2 relation using an LRU scheme.
I'm not sure about K7, but K8 can bypass L2 completely and load data straight from memory (one benefit of the IMC). In this case L2 is actually used mostly as an LRU "overflow" container, for state changes and other stuff.
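
A toy model of the exclusive arrangement described above might look like the sketch below. It is only an illustration under simplifying assumptions: fully associative levels, strict LRU, and a made-up class name. It is not a description of how K8 actually implements its caches.

```python
from collections import OrderedDict


class ExclusiveCache:
    """Toy two-level exclusive hierarchy: a line lives in L1 or L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        # OrderedDict keeps the least recently used line first.
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def _fill_l1(self, addr):
        # Make room in L1 first; the LRU victim overflows into L2.
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)  # L2's own LRU line is dropped
            self.l2[victim] = True
        self.l1[addr] = True

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)  # refresh LRU position
            return "L1"
        if addr in self.l2:
            del self.l2[addr]  # exclusive: the line moves up, it is not copied
            self._fill_l1(addr)
            return "L2"
        self._fill_l1(addr)  # miss: fill L1 straight from memory, skipping L2
        return "MEM"


cache = ExclusiveCache(l1_lines=2, l2_lines=2)
trace = [cache.access(a) for a in (0, 1, 2, 3, 0)]
print(trace, list(cache.l1), list(cache.l2))
```

The point of the sketch is that L2 only ever receives lines evicted from L1, so the two levels together hold L1 + L2 distinct lines rather than L2 alone.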

btw, the very first K7 (for Slot-A) was an inclusive (L1-L2) design.
If you compare L1 throughput between P4 and K8, clock-for-clock, the second one is a tad better. It's not all about access latency.
 
I believe FMADD is in the revision of IEEE 754, and Itanium also has an FMADD. Plus the AMD64 extension was a success and AMD has gained much more respect since 3DNow!. Which makes me think that we could very well see FMADD in the K8L.
 
I don't think we will see it. AMD already talked about technical floating point extensions prior to the AMD64 enabled K8s.

They never materialized.

Rumour: Not because they wouldn't have been a good idea, but rather because a certain software monopoly said that it would only support one FP extension, SSE2.

Cheers
 
P4 got a lot of cache slathered on because of a number of missteps and unforeseen problems that restricted its clock speed gains and reduced its IPC even more than the designers thought would happen.

P4 Prescott is one big enigma. Why did they double the register file to 256 integer and 256 floating point registers while maintaining a 126-entry ROB? Worst-case rename register usage would be architected registers + number of entries in the ROB, which is way lower than 256, right?
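
The bound in that question can be put into numbers with the sketch below. The assumptions are mine: 16 architected integer registers per context (the x86-64 GPR count), a ROB shared between contexts, and a hypothetical helper name. The 126 and 256 figures come from the post and are unconfirmed.

```python
def worst_case_rename_regs(architected, rob_entries, contexts=1):
    """Upper bound on live physical registers.

    Each context keeps its committed architected state, and every
    in-flight ROB entry may hold one additional destination register.
    """
    return contexts * architected + rob_entries


ARCH_INT_REGS = 16  # x86-64 general purpose registers (assumption for this sketch)
ROB_ENTRIES = 126   # figure quoted in the post

for contexts in (1, 2, 4):
    bound = worst_case_rename_regs(ARCH_INT_REGS, ROB_ENTRIES, contexts)
    print(f"{contexts} context(s): at most {bound} registers needed")
```

Under these assumptions even four contexts stay below 256, so a 256-entry file would not be explained by this bound alone.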

One could suspect Prescott of being able to have more than two active contexts, like, say, four. The reason it isn't enabled could be because real life tests show that more than two hyper thread contexts degrade performance... Or that software licensing issues allow two logical CPUs on a physical CPU for the price of one, but not four.

Who knows?

Cheers
 
The only source for that 256 register number I could find was from Chip-Architect's speculative article that came out prior to Prescott's release.

Nothing after release seems to indicate 256 physical registers in the integer file.
Is there another source confirming this post-release?

That the RAT doubled in size is unsurprising, since the extra registers in x86-64 need their own entries as well.

It's also not entirely unheard of to have more registers than the ROB seems to need.
The Alpha EV8 (if it hadn't been canceled) would have had two mirrored register files to cut down on access time. Being 8-wide, the number of ports would have strangled clock scaling if they all went to the same place.

edit:
Now that I remember, so did the EV6, and that was released.
 
Hmm, I don't know where I picked that up. Just browsed the P4 Prescott chapter in "The Unabridged Pentium 4" which details a whole bunch of improvements, but does not mention any changes to the register file.

So you might be right.

Now that I remember, so did the EV6, and that was released.
Yup, it trades half the read ports for an extra cycle of latency on results from the other exec unit cluster.

Cheers
 
OK, seems the Inquirer got stuck (literally) on some pre-production wafer with Deerhounds.
Anyway, I took a chance and jumped into PS to do some measurements. Assuming that this is a 300mm wafer (of course), my previous daring attempt to guess the die size departs quite a bit from what I've counted on the wafer: 204 "visually" healthy pieces (+/- 2 sliced at the edge) versus 242 if I count with my die size numbers. This means that the initial die size should be a tad higher than 300mm².
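
For anyone who wants to play with the numbers, here is a rough sketch using a commonly quoted first-order dies-per-wafer approximation. It ignores scribe lines, edge exclusion and die aspect ratio, so it only shows how the count moves with die area; it should not be read as confirming or refuting the count above. The function name is made up for this example.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """First-order estimate: gross wafer area over die area, minus an edge-loss term."""
    gross = math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)


# Sweep a few candidate die areas on a 300mm wafer.
for area in (250, 275, 300, 325):
    print(f"{area} mm²: about {dies_per_wafer(300, area)} dies")
```

Since real wafers lose extra dies to scribe lines and the edge exclusion zone, the counts from this formula are optimistic for any given die area.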
 