#26 | |
|
Senior Member
|
Quote:
The <1 cycle values are most likely a result of super-scalar execution issue.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
|
#27 |
|
Member
Join Date: Jan 2006
Location: France
Posts: 197
|
"Throughput measures the speed between each processed instruction when several are being processed." (Translated from the French, where the word used is "débit".) So here, "débit" is like "time between instructions".
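A rough sketch of the distinction, in case it helps: latency is what a chain of dependent instructions pays per step, while the "time between instructions" figure (reciprocal throughput) is what independent instructions pay once the pipeline is full. The function names and the numbers below are made up for illustration and are not taken from any measured table.

```python
def cycles_dependent(n: int, latency: float) -> float:
    """Cycles for a chain of n instructions where each needs the previous result."""
    return n * latency


def cycles_independent(n: int, latency: float, recip_throughput: float) -> float:
    """Cycles for n independent instructions issued back to back.

    The first result arrives after `latency` cycles; each further result
    follows `recip_throughput` cycles after the previous one.
    """
    if n == 0:
        return 0.0
    return latency + (n - 1) * recip_throughput


if __name__ == "__main__":
    # Hypothetical instruction: 4-cycle latency, two issued per cycle,
    # which is how a "<1 cycle" (here 0.5) figure can appear in a table.
    print(cycles_dependent(100, 4.0))          # 400.0
    print(cycles_independent(100, 4.0, 0.5))   # 53.5
```

A reciprocal throughput below one cycle only means more than one such instruction can start per cycle. No single instruction finishes in less than a cycle.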
__________________
- I'm french. Sorry if you don't understand what i say - |
|
|
|
|
|
#28 |
|
Senior Member
|
Thanks!
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts. Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration! |
|
|
|
|
|
#29 | |||
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
[edit: actually it indeed seems it's slower in games more often than not - the games would really need to have more threads for it to be faster for the most part.] Quote:
Quote:
It is hard to say, though, what the achievable frequency would be (and hence whether it would be faster), as we don't know how well (or not) 32nm fares against 45nm. Also, Phenom II might have some trouble scaling to more CPUs: the memory throughput wasn't all that high, and I'm not sure about L3 cache throughput either.
Yes, this confirms some information which was already available: since it has two FMAC pipes, some FPU/SSE2 instructions which can only run in one pipe on other architectures can run in both pipes here (muls at least, and interestingly it looks like DIVs too? Or maybe they are more pipelined now, though at first glance it really looks like almost anything could run in either pipe). The latency of complex x87 operations in particular is horrific, though, so it's probably not really a win for old software, larger queues or not...
Last edited by mczak; 12-Oct-2011 at 15:13. |
|||
|
|
|
|
|
#30 | |
|
Senior Member
|
Quote:
Looks like AMD is (un)willingly following Intel's old steps into NetBurst "hell".
|
|
|
|
|
|
#31 | |
|
Member
Join Date: Jan 2006
Location: France
Posts: 197
|
Quote:
__________________
- I'm french. Sorry if you don't understand what i say - Last edited by Rootax; 12-Oct-2011 at 14:52. |
|
|
|
|
|
|
#32 |
|
Senior Member
|
I can only hope for AMD that they've got the connection between the cores and the co-processors right, and that GCN-like cores will be part of future BDs sooner rather than later. Otherwise …
edit: BTW, where's Charlie's rant complaining about AMD forcing an HPC/server part down the mainstreamers' throats? *SCNR*
|
|
|
|
|
#33 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
It seems to me that the horrific latencies for the complex x87 ops are simply because they can now be executed in both pipes, but AMD implemented them a bit more cheaply, so each individual op takes longer to execute (though of course two of them can execute simultaneously). Those ops might be rare enough that this is "fast enough", and making both FMAC pipes (nearly) symmetric probably made things easier for scheduling etc., hence AMD doing this instead of going the more traditional route of executing these ops in only one pipe but a bit faster (of course, more traditional FPU execution ports are asymmetric anyway).
Last edited by mczak; 12-Oct-2011 at 22:01. |
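The trade-off described above can be put into a toy model. This is only a sketch under loud assumptions: the op is treated as unpipelined (a pipe is busy for the whole op), and the cycle counts 20 and 32 are invented for illustration, not measured Bulldozer or Phenom II figures.

```python
def independent_ops_per_cycle(pipes: int, cycles_per_op: float) -> float:
    """Peak rate for independent, unpipelined ops spread across `pipes` pipes."""
    return pipes / cycles_per_op


def dependent_chain_cycles(n_ops: int, cycles_per_op: float) -> float:
    """A dependent chain cannot use the second pipe: every op waits for the last."""
    return n_ops * cycles_per_op


if __name__ == "__main__":
    # Hypothetical "traditional" design: one pipe, faster op.
    one_fast = independent_ops_per_cycle(pipes=1, cycles_per_op=20)
    # Hypothetical "symmetric" design: two pipes, slower op in each.
    two_slow = independent_ops_per_cycle(pipes=2, cycles_per_op=32)

    print(two_slow > one_fast)             # True: more throughput overall
    print(dependent_chain_cycles(10, 20))  # 200
    print(dependent_chain_cycles(10, 32))  # 320
```

So in this model the symmetric design wins whenever there are enough independent ops in flight, and loses on dependent chains, which is the pattern old x87-heavy code tends to produce.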
|
|
|
|
|
|
#34 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
|
|
|
|
|
|
|
#35 |
|
Senior Member
|
No, I wasn't talking about a specific iteration of BD. But from what I'm seeing in the reviews, AMD needs to get their performance per watt up as fast as possible, also for the HPC space, and efficient, wider vector units would greatly help to at least make/keep (depending on how you assess the current situation) them competitive in that market.
http://www.hardware.fr/articles/842-...ergetique.html
|
|
|
|
|
#36 | |
|
Mr. Upgrade
Join Date: Nov 2003
Location: Finland
Posts: 1,335
|
Quote:
|
|
|
|
|
|
|
#37 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
|
|
|
|
|
|
|
#38 |
|
Senior Member
|
http://www.anandtech.com/show/4955/t...x8150-tested/5
The classic Queens bench shows the sad state of the single-threaded performance, hammered by the ill-fated front-end and wimpy L1i cache.
|
|
|
|
|
#39 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Ok here's my summary of what I'm thinking after reading some reviews:
The good:
- TurboCore which actually does something. Of course for perf/power this is not good, but it's nice to finally see this truly working.
- Lower idle power. In some reviews it made a much larger difference (almost to the point of being similar to Intel idle power consumption), in some not so much, but in any case it looks like an improvement, most likely thanks to power-gating the cores.
- Higher memory bandwidth than Phenom II (not quite as good as SNB, but definitely an improvement).
- AES instructions for catching up with SNB, AVX, FMA4 etc.
- Shared FPU actually looks ok. Only some rare synthetics seem to show this causing a performance hit; otherwise scaling to multiple threads seems largely independent of whether it's an integer or float workload (of course this is most likely a result of the beefed-up FPU too; it will probably show bad scaling with code using AVX-256, where the FPU suddenly doesn't look all that beefy anymore).
- CMT is a neat idea and scaling isn't too bad (hardware.fr has some numbers: 4->8 threads scaling is better than going from 4 to 6 Phenom II cores, and of course better than HT), so it might be a win on a perf/area scale for multithreaded workloads, if only the single-threaded baseline were a bit higher...

The bad:
- Nowhere near the once promised "30% higher" clock speed. Whether that's due to CPU design or manufacturing trouble I don't know.
- High load power consumption. Efficiency actually didn't increase compared to an X6 1100T, which is worrisome.
- Low single-thread performance: even with some clock increase it is just barely at Phenom II levels, and not in the same ballpark as SNB CPUs at all (of course it was expected, but the difference is bigger than it should be).

The ugly:
- AMD promised roughly the same IPC for single-threaded performance and they clearly missed it for typical workloads by about 10-15% or so. I think that's the biggest problem actually. Why did they miss it? Is it just the removal of the 3rd ALU in an integer core? It looked like it could have been possible to "compensate" for that with other improvements (like memory disambiguation, better branch prediction, larger scheduling queues), given that it's hard to schedule 3 ALU ops in the first place, but it didn't happen.
- L1I thrashing issues, with only iffy OS band-aids for a problem which should have been avoided by better cache design (higher associativity).
- Not convinced by the whole cache design. It requires a large area and the latency is just bad, both compared to Phenom II and even more so the competition, so the large size might not be worth it. Moreover, I wonder if the very low L1D write bandwidth (not even 1/5 of read bandwidth, while traditionally it "should" be roughly half of the read bandwidth; it is a result of the L1D write-through design together with the low L2 write bandwidth) isn't a real problem, reducing throughput quite heavily in some cases (there is a "Write Coalescing Cache" to help with that, but at least in the hardware.fr numbers it didn't show up).

Last edited by mczak; 12-Oct-2011 at 17:09. |
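To put the 10-15% IPC shortfall in clock terms, here is a back-of-the-envelope sketch. It assumes single-threaded performance is simply IPC times clock, which ignores memory effects and turbo behaviour, and the helper name is made up for this example.

```python
def clock_gain_needed(ipc_ratio: float) -> float:
    """Fractional clock increase needed so that ipc_ratio * (1 + gain) == 1.0.

    ipc_ratio is new IPC divided by old IPC (e.g. 0.90 for a 10% shortfall).
    Assumes performance = IPC * clock and nothing else changes.
    """
    return 1.0 / ipc_ratio - 1.0


if __name__ == "__main__":
    for deficit in (0.10, 0.15):
        gain = clock_gain_needed(1.0 - deficit)
        print(f"IPC -{deficit:.0%} -> clock +{gain:.1%} just to break even")
```

Under that simple model a 10-15% IPC deficit needs roughly 11-18% more clock merely to match the old core, so a large clock advantage would have been needed before any real single-thread gain showed up.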
|
|
|
|
|
#40 |
|
Member
Join Date: Mar 2003
Location: Finland
Posts: 938
|
What is most sad: Deus Ex: Human Revolution, arguably this year's biggest "Gaming Evolved" title, just BSODs with the FX-8150.
What a sorry mess.
__________________
Mikael Koskinen blog: .NET Programming, Windows Phone Development, Software Architecture |
|
|
|
|
|
#41 |
|
Member
Join Date: Jan 2010
Posts: 416
|
Has anyone seen any multitasking benchmarks for the 'dozer? If it needs those threads so badly
I also guess this could be a problem in the future if software can't catch up with the core counts. 8 or 12 cores will just sit idle and won't bring much improvement if you don't use them for rendering or encoding. Last edited by GZ007; 12-Oct-2011 at 16:58. |
|
|
|
|
|
#42 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
This benchmark MIGHT show the worst case as it's possible it is indeed optimized and written in a way which allows the cpus to schedule as many (int) instructions per clock as they can - 4 for SNB, 3 for Phenom II, 2 for BD. The scores would align nearly perfectly with that. [edit: of course SNB can't actually execute 4 alu ops per clock - not sure what I was thinking there...] Thankfully it's not quite _that_ bad in general... Makes you wonder if removing the third ALU pipe was a good idea though I guess it's all part of the compromise - e.g. third ALU pipe might mean instruction decode becomes the bottleneck too often (at least for multithreaded workloads), and of course it runs counter to the idea of CMT where you streamline the int cores a bit but have more of them instead. Last edited by mczak; 12-Oct-2011 at 19:57. |
|
|
|
|
|
|
#43 |
|
Regular
Join Date: Nov 2005
Posts: 5,048
|
An impressive failure.
__________________
Hall of fame thread: http://forum.beyond3d.com/showthread.php?t=50668 |
|
|
|
|
|
#44 | |
|
Mr. Upgrade
Join Date: Nov 2003
Location: Finland
Posts: 1,335
|
Quote:
|
|
|
|
|
|
|
#45 | |
|
Senior Member
|
Quote:
We'll see. What I found the most interesting insight so far is that CMT seems to slow down heavily threaded games. http://www.hardware.fr/articles/842-...acite-cmt.html More exactly, the graph on that page (image not preserved here). SC2 uses two threads at most and relies heavily on high IPC, and I am not exactly sure about ArmA II, but F1 2011 and RoF are definitely heavily threaded, as is Anno 1404 (aka Dawn of Discovery), and all are running slower with CMT enabled, i.e. 8C/4CU vs. 4C/4CU.
|
|
|
|
|
|
#46 |
|
Senior Member
Join Date: Jul 2008
Posts: 2,157
|
This kinda reminds me of Phenom 1: it came out with a couple of dumb mistakes that made it underperform badly, but those were corrected with Phenom II shortly after.
Nonetheless, it is an impressive failure as a consumer desktop chip. How come it has twice the transistor count, no iGPU, and it's still badly beaten by Sandy Bridge clock-for-clock? How come they launch a completely new architecture that apparently shows an equal number of disadvantages and advantages over the previous one? Seems like they should've just stuck with the Phenom II X6 architecture instead, improved the core throttling and power management, and just taken advantage of the new process to raise the clocks. About the weaker FP processing (6 units in PII-X6 vs. 4 units in BD), could it be that AMD intends to offload all FP calculations to the iGPU cores in future Fusion chips? Last edited by ToTTenTranz; 12-Oct-2011 at 17:54. |
|
|
|
|
|
#47 |
|
Member
Join Date: Mar 2008
Location: Jurong West
Posts: 759
|
Seriously, what's with the cache splurging?
2MB per module and then another 8 for L3, oh wow, seriously. That, and if you compare this to Gulftown for perf/transistors too... I'm not sure if AMD's 15% a year will help much if they're always going to be stuck on dies that big. If BD was aimed at being lean and mean, this is surely not the chip to prove it. I'm gonna say one thing though: once you look at the lower frequencies (oh the irony), there's this 2.5GHz Opteron octa-core at 32W ACP. It'll fare better on mobile, where Trinities should clock in at about 2.5GHz+ or so and have 2 modules. The current 35W Llano maxes out at 1.5GHz, the 45W variant at 1.9GHz. That's where the 30% comes in, but not how AMD wished it would be
__________________
<rpg.314> - I have a feeling that shielding 480 from the evils of afr, embodied in that creation of satan called 5970, will be a part of epic battle between good and evil <neliz> - The Devil doesn't wear green. |
|
|
|
|
|
#48 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,117
|
The amount of cache seems more appropriate for server loads. I haven't gotten to a review that looks at those.
The latencies are definitely not client-friendly. Hopefully it looks better for server. The cache structure seems like it has gotten worse compared to the previous chips, and AMD's setup there wasn't exactly the greatest to begin with. The measured L2 latencies in some of the reviews are worse than I expected (one has 25-27, but there's a lot of variability there). In cycle terms, a miss from the puny L1 that hits in the L2 is in the same league as an SB L3 hit.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#49 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,437
|
Quote:
|
|
|
|
|
|
|
#50 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,566
|
Quote:
I don't think turning a 2-way set associative 64KB L1 icache into 16-way set associative is something that can be described as a "fix." Especially not with this being the same L1 cache design AMD has used since K7, hell, even K6 had a 32KB 2-way associative cache. There's usually no really good reason for software to have 4KB->32KB aliasing in the first place, even between two threads.. it's just a side effect of aggressive address space randomization and has already been changed for Linux. The impact is minor, as you'd expect, because I doubt two threads are always spending all of their active cycles in the same shared library code. It might not be a problem in Windows to begin with, I mean, it isn't a problem where large pages are used.. Last edited by Exophase; 12-Oct-2011 at 19:17. |
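For anyone wondering where the "4KB->32KB" range comes from, a small sketch of the set-index arithmetic may help. The 64KB size and 2-way associativity are from the post above; the 64-byte line size and the idea that the index is taken from the virtual address are assumptions made for this example, and the addresses are invented.

```python
CACHE_BYTES = 64 * 1024  # 64KB L1 icache (from the post)
WAYS = 2                 # 2-way set associative (from the post)
LINE_BYTES = 64          # assumption: 64-byte cache lines
PAGE_BYTES = 4096        # 4KB pages

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 512 sets
WAY_BYTES = CACHE_BYTES // WAYS                # 32KB per way


def set_index(addr: int) -> int:
    """Set selected by an address: address bits 6..14 with the sizes above."""
    return (addr // LINE_BYTES) % NUM_SETS


def same_set(a: int, b: int) -> bool:
    return set_index(a) == set_index(b)


def index_bits_above_page_offset() -> int:
    """Index bits that come from the page number rather than the page offset."""
    return (WAY_BYTES // PAGE_BYTES).bit_length() - 1


if __name__ == "__main__":
    # Hypothetical: the same library code at page offset 0x240, mapped at two
    # different virtual bases that differ in bits 12..14.
    va_a = 0x7F1234560000 + 0x240
    va_b = 0x7F1234563000 + 0x240
    print(index_bits_above_page_offset())     # 3 (bits 12, 13, 14)
    print(set_index(va_a), set_index(va_b))   # 9 201
    print(same_set(va_a, va_a + WAY_BYTES))   # True: 32KB apart, same set
```

With a 32KB way and 4KB pages, three index bits sit above the page offset, so the placement of a mapping within each 32KB window decides which sets its code lands in. With large pages those bits fall inside the page offset, which fits the remark that large pages do not have the problem.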
|
|
|
|