|
|
#1201 | ||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
That this could be snuck in as a parallel process to the decoder block probably took some engineering work, and it does buffer the more limited 16B decoder path. Quote:
__________________
Dreaming of a .065 micron etch-a-sketch. |
||
|
|
|
|
|
#1202 |
|
Senior Member
|
I still think AMD went a bit too far with their automated design approach for Bulldozer, especially regarding the front-end. If you look carefully at the die shot, the front-end block is very similar to what they have been using since K8. Now, of course, the thing is somehow patched for dual-threaded workloads, but I can't help but think there are too many legacy leftovers there, the "copy-pasted" L1i being just one of them.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#1203 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,544
|
The uop cache is effectively a cache for one trace. If anything, it validates the trace cache concept: massive issue width while consuming less power.
The problem with the P4 was that it was limited to decoding a single instruction per cycle when it missed the trace cache. Intel picked the low-hanging fruit by exploiting the very predictable nature of loops, but I wouldn't be surprised to see more aggressive "uop" caches in the future with support for more traces. It won't be called a trace cache; that name is forever stigmatized. Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
#1204 | |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
The contents of the cache have changed, since branch prediction no longer has bits in the L1 like in K8. The cache would have been reimplemented in BD, so it's not a simple matter of copy/paste. The designers looked at the major design parameters of the previous gen's L1, and they kept them.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
|
#1205 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
That said, are you suggesting AMD is going to stick with 2-way L1 instruction cache associativity for the next 5 years or so? Because tweaked Bulldozer is all that's on the roadmap (well, apart from the low-power designs). FWIW, L1I associativity was 8 for Core 2, 4 on Nehalem, and now back to 8 for Sandy/Ivy Bridge. Clearly these things can be redesigned. AMD, OTOH, has stuck with 2-way 64KB L1 instruction/data caches forever (since K7 days; K6 also had two-way caches, but only 2x32KB). Only BD now has a different L1D (Bobcat is also "blessed" with a 2-way 32KB L1I, though it probably makes sense there, and its 8-way 32KB L1D is actually more than what you get with BD...) |
|
|
|
|
|
|
#1206 | |||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
The uop cache is a post-decode cache that has a fixed relationship to the linearly addressed Icache, and it primarily validates a benefit for a chip running a complex ISA. To a limited extent, it gathers some low-hanging fruit for the original purpose of a trace cache--solving the problem of discontinuities in the fetch stream compromising superscalar issue. Quote:
The TLB and tag check logic would be altered if the ratio of tag and index bits changes. One possible, if unlikely, change would be to significantly change the associativity or reduce capacity so that it matches the size/associativity ratio of Sandy Bridge. This would eliminate the aliasing problem entirely, and discard a portion of the cache fill pipeline used to invalidate synonyms. Quote:
Some reports say that more change is in store with Steamroller, more so than was promised with Piledriver. Since Steamroller is also meant to be on a new non-SOI node, more changes could be in the air because various parts of the pipeline will need to be adjusted anyway. Quote:
Bulldozer to Piledriver is something like the SB to IVB transition, without the node jump. Quote:
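The size/associativity ratio mentioned above for the L1 instruction cache can be made concrete with a rough sketch. It assumes a virtually indexed cache and 4KB pages (both assumptions, not something stated in the thread), and the helper name `alias_bits` is made up for illustration. If one way of the cache is no larger than a page, the index comes entirely from untranslated address bits and synonyms cannot land in different sets.

```python
PAGE_SIZE = 4 * 1024  # assumption: 4KB base pages


def alias_bits(cache_bytes: int, ways: int, page_bytes: int = PAGE_SIZE) -> int:
    """Return how many index bits lie above the page offset.

    Zero means the set index uses only untranslated address bits,
    so a virtually indexed cache has no synonym (aliasing) problem.
    """
    way_bytes = cache_bytes // ways
    if way_bytes <= page_bytes:
        return 0
    # Sizes are powers of two, so bit_length() - 1 is an exact log2.
    return (way_bytes // page_bytes).bit_length() - 1


# Cache parameters as quoted in the thread.
configs = {
    "64KB 2-way L1I (Bulldozer)": (64 * 1024, 2),
    "32KB 8-way L1I (Sandy Bridge)": (32 * 1024, 8),
}

for name, (size, ways) in configs.items():
    bits = alias_bits(size, ways)
    print(f"{name}: {size // ways // 1024}KB per way, "
          f"{bits} index bit(s) above the page offset, "
          f"{2 ** bits} possible set(s) per physical line")
```

Under those assumptions, the 64KB 2-way configuration has 32KB per way and three index bits above the page offset, while the 32KB 8-way configuration has 4KB per way and none, which is why matching that ratio would remove the need for synonym invalidation.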
__________________
Dreaming of a .065 micron etch-a-sketch. |
|||||
|
|
|
|
|
#1207 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
That would make it 12KB, but it probably takes up significantly more space than a normal 12KB 6-way set-associative instruction cache would due to metadata mapping instruction addresses between it and the L1 instruction cache. |
|
|
|
|
|
|
#1208 |
|
Senior Member
|
Thanks for the heads up.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#1209 | ||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
Any given window can be represented by up to 18 uops that can take up to 3 lines in the uop cache. There isn't a 1:1 correspondence between the external representation and the uop cache in terms of size or op count. I'm not sure how to arrive at 8 bytes per uop in a general case. There is a restriction that instructions with 64-bit immediates take up two slots, which may give a granularity to the uop cache where a 32 bit immediate can fit comfortably, so perhaps each slot is between 64 and 128 bits in length. The amplification of a 32 byte chunk of instructions would be variable. If slots are 64 bits and each way can have up to 6 uops, that's at least 48 bytes of uop, not counting metadata. A 32 byte window mapping to a fully occupied way would be amplified by a factor of 1.5. 96 bits per uop would give a factor of 2.25. Quote:
There are at least 48 bits for the IP of the first instruction in the window. Then there would be a theoretical max of 18 length counters per window. The max byte length for x86 is 18, which naively makes me think 5 bits per counter. This assumes there is a valid way to pad an instruction out to 18 bytes while translating to one uop. It may be unnecessary to be that naive because an instruction that long wouldn't allow enough room in the window for 17 additional instructions.
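The amplification factors and the counter width in this post can be checked with a quick sketch. The slot widths are the guesses from the post (64, 96, or 128 bits), not confirmed figures, and the function names are made up for illustration.

```python
WINDOW_BYTES = 32   # size of the x86 instruction window, from the post
UOPS_PER_WAY = 6    # uops held by one way (line) of the uop cache, from the post


def amplification(slot_bits: int) -> float:
    """Bytes of uop data in a fully occupied way per byte of x86 window.

    Metadata is not counted, matching the estimate in the post.
    """
    way_bytes = UOPS_PER_WAY * slot_bits // 8
    return way_bytes / WINDOW_BYTES


def counter_bits(max_value: int) -> int:
    """Bits needed to store any value from 0 to max_value inclusive."""
    return max_value.bit_length()


for bits in (64, 96, 128):  # candidate slot widths, all guesses
    print(f"{bits}-bit slots: {UOPS_PER_WAY * bits // 8} bytes per way, "
          f"amplification {amplification(bits)}")

# Width of one length counter for the maximum length assumed in the post.
print("bits per length counter:", counter_bits(18))
```

The 64-bit and 96-bit cases give the 1.5 and 2.25 factors quoted above, and a counter that must reach 18 needs 5 bits.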
__________________
Dreaming of a .065 micron etch-a-sketch. |
||
|
|
|
|
|
#1210 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
64 bits might fit as the uop size. That is the same size as Prescott's (generally simpler) uops, but the uop fusion rules don't really add a lot of extra data per uop. If this number is incorrect then I suspect it's not that much larger, for example 80 bits; 128 bits would really surprise me. We both agree they're fixed width though, right? The confusion behind fellix's original comment is probably because of Intel's claim that the uop cache performs "like a 6KB instruction cache." Going just by capacity, that'd imply that a uop in the cache is worth about 4 bytes of x86 code - the average bytes/instruction in typical programs is probably well under 4, but the average uops/instruction is also going to be a bit higher than 1. Then the uop cache will have some unused parts in lines. So it's a pretty reasonable sounding estimate. Last edited by Exophase; 08-Jun-2012 at 20:24. |
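The "about 4 bytes of x86 code per uop" figure follows from simple division, sketched below. The 1536-uop capacity is an assumption taken from the roughly 1.5K-uop figure discussed in this thread, and the averages passed to `effective_icache_bytes` in the last call are purely hypothetical placeholders, not measurements.

```python
UOP_CAPACITY = 1536              # assumption: ~1.5K uops, as discussed in this thread
CLAIMED_ICACHE_BYTES = 6 * 1024  # Intel's "like a 6KB instruction cache" claim


def implied_bytes_per_uop(icache_bytes: int, uop_capacity: int) -> float:
    """x86 code bytes each cached uop must stand for to match the claim."""
    return icache_bytes / uop_capacity


def effective_icache_bytes(uop_capacity: int, bytes_per_insn: float,
                           uops_per_insn: float, occupancy: float) -> float:
    """x86 code bytes covered by the uop cache under the given averages.

    occupancy is the fraction of uop slots actually filled (lines are
    not always full); all three averages are inputs, not known values.
    """
    insns = uop_capacity * occupancy / uops_per_insn
    return insns * bytes_per_insn


print("implied bytes per uop:",
      implied_bytes_per_uop(CLAIMED_ICACHE_BYTES, UOP_CAPACITY))

# Hypothetical averages, only to show how the three effects combine.
print("effective bytes:",
      round(effective_icache_bytes(UOP_CAPACITY, 3.5, 1.1, 0.85)))
```

Shorter instructions, more than one uop per instruction, and partly empty lines all pull the effective figure below the raw slot count times 4 bytes, which is the trade-off described in the post.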
|
|
|
|
|
|
#1211 | ||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
I reread the description of the uop cache hit process a few times, and after thinking it through, the initial guess of 64 bits fits. Upon a hit the uop cache sends the 1-3 lines to an intermediate buffer, and that buffer can take 6 uops a cycle but it has an output limit of 4 uops/32 bytes, which gives each one 64 bits. Since each way of the uop cache has 6 uops, that would mean each way of the uop cache contains 48 bytes worth of op data plus additional metadata. 32 sets * 8 ways * 48 bytes per way / 8 bytes per 64-bit uop gives 1.5K uops. Quote:
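The capacity arithmetic in this post can be laid out step by step. This is only a restatement of the numbers given above; the variable names are made up, and the 64-bit uop width remains a guess inferred from the buffer's output limit.

```python
SETS = 32             # sets in the uop cache, from the post
WAYS = 8              # ways per set, from the post
UOPS_PER_WAY = 6      # uops per way, from the post
BUFFER_OUT_BYTES = 32 # output limit of the intermediate buffer per cycle
BUFFER_OUT_UOPS = 4   # uops delivered per cycle at that limit

# 32 bytes spread over 4 uops suggests 8 bytes (64 bits) per uop.
uop_bytes = BUFFER_OUT_BYTES // BUFFER_OUT_UOPS

# Op data per way, not counting metadata.
way_bytes = UOPS_PER_WAY * uop_bytes

# Total op data in the cache and the number of uop slots it holds.
total_bytes = SETS * WAYS * way_bytes
total_uops = total_bytes // uop_bytes

print(f"uop size: {uop_bytes * 8} bits")
print(f"op data per way: {way_bytes} bytes")
print(f"total op data: {total_bytes // 1024}KB")
print(f"uop slots: {total_uops}")
```

The result is 12KB of op data and 1536 slots, which lines up with both the 12KB figure and the 1.5K uop figure mentioned in the thread.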
__________________
Dreaming of a .065 micron etch-a-sketch. Last edited by 3dilettante; 09-Jun-2012 at 05:36. |
||
|
|
|
|
|
#1212 |
|
Member
Join Date: Apr 2007
Location: Australia
Posts: 645
|
Looks like Trinity has made some solid gains in quite a few areas in regards to "IPC" but then not in others.
http://www.tomshardware.com/reviews/...400k,3224.html much like Barcelona this is what bulldozer should have been. |
|
|
|
|
|
#1213 |
|
Junior Member
Join Date: Jan 2010
Posts: 48
|
How does ~10-15% better performance for Piledriver stack up to Intel's bridges made of sand and ivy though?
|
|
|
|
|
|
#1214 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
I will need to go back and review numbers from years back.
On a multithreaded basis, there may be more areas where on a module to core basis Piledriver is more competitive. In terms of single thread performance, I will need to check where it is in relation to Westmere before worrying about SB and IB.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#1215 | |
|
Member
Join Date: Apr 2007
Location: Australia
Posts: 645
|
Quote:
There have been statements from AMD saying Steamroller will bring their single-thread performance much closer to Intel's. I wonder what they are going to change? Decode, load/store to the FPU, number of ALUs, trace cache? It will be very interesting to see what Steamroller is; looks like it's good enough for at least Sony, maybe even Microsoft. I guess we will see if Bulldozer was Yonah or Northwood |
|
|
|
|
|
|
#1216 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 633
|
Are there any benchmarks around with the latest Win8 preview?
|
|
|
|
|
|
#1217 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
|
Is this question in relation to the updated scheduling that was touted in Win8 to provide an incremental performance improvement on Bulldozer? My own speculation: I doubt it's significant enough to write about.
__________________
"...twisting my words" |
|
|
|
|
|
#1218 |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
The most important scheduler changes were already released for Windows 7 during the winter, so I don't expect big improvements over them with W8.
|
|
|
|
|
|
#1219 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 633
|
Really? I read that the new scheduler is a Win8 exclusive.
|
|
|
|
|
|
#1220 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,440
|
Incorrect, it's been in Windows 7 for months now.
|
|
|
|
|
|
#1221 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
|
__________________
"...twisting my words" |
|
|
|
|
|
#1222 |
|
Senior Member
|
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#1223 |
|
Junior Member
Join Date: Mar 2012
Location: cracks
Posts: 53
|
I don't see it as so bad.
Clock mesh gave them a free +10%, and it seems they mainly improved BD's single-core IPC. The BD architecture gains more overall from single-core IPC improvements than Intel does, since you have (sort of) two cores. So, for one year's distance, a +20% increase in performance is pretty good. Also, it is very interesting that they haven't yet fixed the huge front-end problem: Intel has a 32K 8-way L1I whereas AMD still has a pathetic 64K 2-way... and AMD cores are way more hungry since they cannot cover latencies with HT (not talking about the AMD decoder, ofc). But I am curious to get Steamroller info, where they *should* fix the front-end issues. edit: just a note on schedulers - the Win7 scheduler kills AMD CMT since by default it reschedules threads without processor affinity, thus thrashing the L2 cache. On Intel this is not a problem since the L2 cache is just 128KB and the L3 cache can keep everything up. On AMD you get a lot of problems due to basically thrashing 2MB of cache instead of 128KB... Also, I do not believe MS rewrote the W7 scheduler. One would be mad to touch a critical working component with such huge changes 'on the fly'. So it is likely that W8 will have a better scheduler for AMD. Last edited by imaxx; 10-Aug-2012 at 18:11. |
|
|
|
|
|
#1224 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
|
Did you mean 256K L2 for Intel? Or are you implying it's split in two halves, one for each thread? (It isn't, I'd think, but I've never thought about how two threads, or more on other archs, would share the L1 or L2 cache...)
|
|
|
|
|
|
#1225 | ||
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
|
Quote:
As for thrashing their cache? Yes, but because they've yet to spend the R&D to get the associativity beyond 2-way on their front end, which directly contradicts how they want certain parts of these cores to be shared... This isn't a scheduling issue, it's a design issue. Edit: Let's put this to rest, can we? PC Stats tested the exact same rig, powered by an FX-8150, on both Win7 and Win8. Here is their basic summary: Quote:
__________________
"...twisting my words" Last edited by Albuquerque; 10-Aug-2012 at 18:59. |
||
|
|
|
| Tags |
| amd, blewdozer, oh well, patents |
|
|