Old 12-Oct-2011, 14:00   #26
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Send a message via Skype™ to fellix
Default

Quote:
Originally Posted by CarstenS View Post
Possibly debit is better translated as CPI, opposed to IPC? After initial latency of course?
Yes, it probably means how many cycles it takes on average to complete the operation over a test run.
The <1 cycle values are most likely a result of superscalar execution, where more than one instruction can issue per cycle.
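To make the latency vs. throughput distinction concrete, here is a minimal sketch. The numbers (a 4-cycle latency and two identical pipes giving 0.5 cycles per instruction) are hypothetical and not taken from the hardware.fr tables; the function names are made up for illustration.

```python
def cycles_for_independent_ops(n_ops, latency, recip_throughput):
    """Cycles to finish n_ops independent instructions on a pipelined unit.

    The first result appears after `latency` cycles; each further result
    follows `recip_throughput` cycles after the previous one.
    """
    if n_ops <= 0:
        return 0.0
    return latency + (n_ops - 1) * recip_throughput


def cycles_for_dependent_chain(n_ops, latency):
    """Cycles when every instruction needs the previous result."""
    return n_ops * latency


# Hypothetical unit: 4-cycle latency, two pipes -> 0.5 cycles per instruction.
independent = cycles_for_independent_ops(1000, latency=4, recip_throughput=0.5)
dependent = cycles_for_dependent_chain(1000, latency=4)

print(independent / 1000)  # average cycles per instruction, close to 0.5
print(dependent / 1000)    # average cycles per instruction, equal to the latency
```

With independent instructions the average cost per instruction approaches the "débit" figure, which can drop below 1 cycle; with a dependent chain it stays at the "latence" figure.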
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 12-Oct-2011, 14:09   #27
Rootax
Member
 
Join Date: Jan 2006
Location: France
Posts: 197
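Default

"le débit mesure la vitesse entre chaque instruction traitée lorsqu'on en traite plusieurs." In English: "throughput measures the rate between successive instructions when several are being processed." So here, "débit" is like "time between instructions".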
__________________
- I'm french. Sorry if you don't understand what i say -
Rootax is offline   Reply With Quote
Old 12-Oct-2011, 14:16   #28
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Send a message via ICQ to CarstenS
Default

Thanks!
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
CarstenS is offline   Reply With Quote
Old 12-Oct-2011, 14:22   #29
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by function View Post
Phenom II is faster for gaming. AMD have been behind for years now and they're actually moving backwards.
Depends on the game, looked overall quite similar to me - lose some, win some. But that is indeed disappointing.
[edit: actually it indeed seems it's slower in games more often than not - the games would really need to have more threads for it to be faster for the most part.]

Quote:
AMDs decision not do anything new on 45nm for years, then jump in with a brand new architecture on a completely unproven process sure paid off.
They didn't quite do that. Forgot about Llano? If the process isn't up to the task (I really have no idea whether the chip not reaching higher frequencies is due to 32nm trouble), there's not much you can do about it anyway at this point.


Quote:
Originally Posted by V3 View Post
Would a 32nm Phenom II X8 be faster and smaller compare to 4-module BD ?
An interesting question. With the same 512KB L2 per core and 8MB L3 it should be smaller. Which is counter to the idea that 1 BD module should be smaller than 2 older cores because some resources are shared...
It is hard to say, though, what the achievable frequency would be (and hence whether it would be faster), as we don't know how well (or not) 32nm fares against 45nm. Also, Phenom II might have some trouble scaling to more cores: the memory throughput wasn't all that great, and I'm not sure about L3 cache throughput either.

Quote:
Originally Posted by fellix View Post
Something interesting I found, from hardware.fr's review:
"Latence" = Latency
"Debit" = Throughput
Yes, this confirms some information which was already available: since it has two FMAC pipes, some FPU/SSE2 instructions that can run in only one pipe on other architectures can run in both (muls at least, and interestingly it looks like DIVs too? Or maybe they are more pipelined now, though at first glance it really looks like almost anything could run in either pipe). The latency of complex x87 operations in particular is horrific, though, so it's probably not really a win for old software, larger queues or not...

Last edited by mczak; 12-Oct-2011 at 15:13.
mczak is offline   Reply With Quote
Old 12-Oct-2011, 14:34   #30
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Send a message via Skype™ to fellix
Default

Quote:
Originally Posted by mczak View Post
The latency of in particular complex x87 operations is horrific though so it's probably not really a win for old software, larger queues or not...
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get a longer pipeline to pump up the clock rate. Sadly, AMD is quiet on the number of pipeline stages for BD.
Looks like AMD is (un)willingly following Intel's old steps into the NetBurst "hell".
fellix is offline   Reply With Quote
Old 12-Oct-2011, 14:44   #31
Rootax
Member
 
Join Date: Jan 2006
Location: France
Posts: 197
Default

Quote:
Originally Posted by fellix View Post
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get longer pipeline to pump up the clock-rate. Sadly, AMD is quiet on the pipeline number of stages for BD.
Looks like AMD is (un)willingly following the old steps to Intel's NetBurst "hell".
Yep, high clock speed, crappy performance... the only problem is that Intel has deep pockets and very good engineering teams, while AMD has seemed to be in difficulty for years now. I read somewhere that a lot of AMD engineers left some time ago because of crappy management. I hope AMD will rebound, but I doubt it (on the desktop side anyway).

Last edited by Rootax; 12-Oct-2011 at 14:52.
Rootax is offline   Reply With Quote
Old 12-Oct-2011, 14:50   #32
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Send a message via ICQ to CarstenS
Default

I can only hope for AMD that they've got the connection between the Cores and the Co-processors right and that GCN-like cores will be part of future BDs rather sooner than later. Otherwise

edit: BTW, where's Charlie's rant complaining about AMD forcing an HPC/server part down the mainstreamers' throats? *SCNR*
CarstenS is offline   Reply With Quote
Old 12-Oct-2011, 14:58   #33
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by fellix View Post
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get longer pipeline to pump up the clock-rate. Sadly, AMD is quiet on the pipeline number of stages for BD.
Looks like AMD is (un)willingly following the old steps to Intel's NetBurst "hell".
Well, the horrific latencies in the FMACs are quite specific to complex x87 ops; the simple ones increased too, but "only" from 4 to 5 clocks. So I don't think pipeline length has increased massively.
It seems to me that the horrific latencies for the complex x87 ops are there simply because they can now be executed in both pipes, but AMD implemented them a bit more cheaply, so each individual op takes longer to execute (though two of them can execute simultaneously). Those ops might be rare enough that this is "fast enough", and making both FMAC pipes (nearly) symmetric probably made scheduling etc. easier. Hence AMD did this instead of going the more traditional route of executing these ops in only one pipe but a bit faster (of course, more traditional FPU execution ports are asymmetric anyway).
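The trade-off described here (two slower pipes vs. one faster pipe) can be sketched with a toy model. The cycle counts below are hypothetical, chosen only to illustrate the shape of the trade-off; they are not measured BD or Phenom II figures, and the model assumes the ops are independent and not pipelined.

```python
import math


def batch_cycles(n_ops, pipes, cycles_per_op):
    """Cycles to run n_ops independent, unpipelined ops spread over `pipes` units."""
    return math.ceil(n_ops / pipes) * cycles_per_op


# Hypothetical designs: one pipe at 20 cycles per op vs. two pipes at 30 cycles per op.
for n in (1, 2, 8):
    one_fast = batch_cycles(n, pipes=1, cycles_per_op=20)
    two_slow = batch_cycles(n, pipes=2, cycles_per_op=30)
    print(n, one_fast, two_slow)
```

In this model a single op (or a dependent chain) is slower on the symmetric design, while a batch of independent ops finishes sooner, which matches the "fast enough if they are rare" reasoning.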

Last edited by mczak; 12-Oct-2011 at 22:01.
mczak is offline   Reply With Quote
Old 12-Oct-2011, 15:00   #34
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by CarstenS View Post
I can only hope for AMD that they've got the connection between the Cores and the Co-processors right and that GCN-like cores will be part of future BDs rather sooner than later. Otherwise
If you're talking Piledriver it doesn't look like that, with Anand stating L3 cache will likely be gone, which would suggest a totally separate IGP like in Llano (of course the connection COULD be faster but I don't think without common L3 cache this is really very useful for that use case). Though well integrated GCN is still one more generation ahead...
mczak is offline   Reply With Quote
Old 12-Oct-2011, 15:32   #35
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Send a message via ICQ to CarstenS
Default

No, I wasn't talking about a specific iteration of BD. But from what I'm seeing in the reviews, AMD needs to get their performance per watt up as fast as possible, also for the HPC space, and efficient, wider vector units would greatly help to at least make/keep (depending on how you assess the current situation) them competitive in that market.

http://www.hardware.fr/articles/842-...ergetique.html
CarstenS is offline   Reply With Quote
Old 12-Oct-2011, 15:34   #36
Mendel
Mr. Upgrade
 
Join Date: Nov 2003
Location: Finland
Posts: 1,335
Default

Quote:
Originally Posted by Hanners
I'd say this sums it up better:

This should indeed be merged with the doom&gloom thread
Mendel is offline   Reply With Quote
Old 12-Oct-2011, 15:58   #37
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by CarstenS View Post
No, I wasn't talking about a specific iteration of BD. But from what I'm seeing in the reviews, AMD needs to get their performance per watt up as fast a possible also for the HPC space and efficient, wider vector units would greatly help to at least make/keep (depends on how you assess the current situation) them competitive in that market.
But in any case I think you're looking at 2013 at the earliest for any such thing. The competition will be Haswell by that time, and I've no idea how it will compare to today's chips.
mczak is offline   Reply With Quote
Old 12-Oct-2011, 16:26   #38
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Send a message via Skype™ to fellix
Default

http://www.anandtech.com/show/4955/t...x8150-tested/5

The classic Queens bench shows the sad state of the single-threaded performance, which is being hammered by the ill-fated front-end and wimpy L1i cache.
fellix is offline   Reply With Quote
Old 12-Oct-2011, 16:38   #39
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Ok here's my summary of what I'm thinking after reading some reviews:

The good:
- TurboCore which actually does something. Of course this is not good for perf/power, but it's nice to finally see it truly working.
- lower idle power. In some reviews it made a much larger difference (almost to the point of matching Intel's idle power consumption), in others not so much, but in any case it looks like an improvement, most likely thanks to power gating the cores.
- higher memory bandwidth than Phenom II (not quite as good as SNB but definitely an improvement).
- AES instructions for catching up with SNB, AVX, FMA4 etc.
- shared FPU actually looks ok. Only some rare synthetics seem to show it causing a performance hit; otherwise scaling to multiple threads seems largely independent of whether it's an integer or float workload (of course this is most likely also a result of the beefed-up FPU; it will probably show bad scaling with code using AVX-256, where the FPU suddenly doesn't look all that beefy anymore).
- CMT is a neat idea and scaling isn't too bad (hardware.fr has some numbers: 4->8 threads scaling is better than going from 4 to 6 Phenom II cores, and of course better than HT), so it might be a win on a perf/area scale for multithreaded workloads, if only the single-threaded baseline were a bit higher...

The bad:
- nowhere near the once promised "30% higher" clock speed. Whether that's due to CPU design or manufacturing trouble I don't know.
- high load power consumption. Efficiency actually didn't increase compared to an X6 1100T, which is worrisome.
- low single-thread performance: even with some clock increase it is just barely at Phenom II levels, and not in the same ballpark as SNB CPUs at all (of course that was expected, but the difference is bigger than it should be).

The ugly:
- AMD promised roughly the same IPC for single-threaded performance and they clearly missed it for typical workloads, by about 10-15% or so. I think that's the biggest problem actually. Why did they miss it? Is it just the removal of the 3rd ALU in an integer core? It looked like it could have been possible to "compensate" for that with other improvements (like memory disambiguation, better branch prediction, larger scheduling queues), given that it's hard to schedule 3 ALU ops in the first place, but it didn't happen.
- L1I thrashing issues, with only iffy OS band-aids for a problem which should have been avoided by better cache design (higher associativity).
- not convinced by the whole cache design. It requires a large area and the latency is just bad, both compared to Phenom II and even more so compared to the competition, so the large size might not be worth it. Moreover, I wonder if the very low L1D write bandwidth (not even 1/5 of the read bandwidth, while traditionally it "should" be roughly half; it is a result of the L1D write-through design together with the low L2 write bandwidth) isn't a real problem, reducing throughput quite heavily in some cases (there is a "Write Coalescing Cache" to help with that, but at least in hardware.fr's numbers it didn't show up).
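As a back-of-the-envelope check on the IPC point above: if single-threaded performance is modeled simply as IPC times clock (a simplifying assumption that ignores memory effects and turbo behavior), a 10-15% IPC deficit needs a somewhat larger clock increase just to break even. The helper name below is made up for illustration.

```python
def clock_gain_needed(ipc_deficit):
    """Relative clock increase that offsets a relative IPC loss.

    Assumes performance = IPC * clock, so (1 - deficit) * (1 + gain) = 1.
    """
    return 1.0 / (1.0 - ipc_deficit) - 1.0


for deficit in (0.10, 0.15):
    gain = clock_gain_needed(deficit)
    print(f"{deficit:.0%} lower IPC -> {gain:.1%} higher clock to break even")
```

Under that model the required clock gain is roughly 11% to 18%, which is why missing both the IPC target and the clock target at the same time hurts so much.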

Last edited by mczak; 12-Oct-2011 at 17:09.
mczak is offline   Reply With Quote
Old 12-Oct-2011, 16:47   #40
Miksu
Member
 
Join Date: Mar 2003
Location: Finland
Posts: 938
Default

What is saddest: Deus Ex: Human Revolution, arguably this year's biggest "Gaming Evolved" title, just BSODs with the FX-8150.

What a sorry mess.
__________________
Mikael Koskinen blog: .NET Programming, Windows Phone Development, Software Architecture
Miksu is offline   Reply With Quote
Old 12-Oct-2011, 16:49   #41
GZ007
Member
 
Join Date: Jan 2010
Posts: 416
Default

Has anyone seen any multitasking benchmarks for the dozer? If it needs those threads so badly.


I also guess this could be a problem in the future if software can't catch up with the core counts. 8 or 12 cores will just sit idle and won't bring much improvement if you don't use them for rendering or encoding.

Last edited by GZ007; 12-Oct-2011 at 16:58.
GZ007 is offline   Reply With Quote
Old 12-Oct-2011, 16:56   #42
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by fellix View Post
http://www.anandtech.com/show/4955/t...x8150-tested/5

The classic Queens bench shows the sad state of the single-threaded performance, being hammered by the ill fated front-end and wimpy L1i cache.
The L1i isn't the problem here as it's largely the same as in Phenom II (it's only a real problem with multiple threads).
This benchmark MIGHT show the worst case as it's possible it is indeed optimized and written in a way which allows the cpus to schedule as many (int) instructions per clock as they can - 4 for SNB, 3 for Phenom II, 2 for BD. The scores would align nearly perfectly with that. [edit: of course SNB can't actually execute 4 alu ops per clock - not sure what I was thinking there...]
Thankfully it's not quite _that_ bad in general...
Makes you wonder if removing the third ALU pipe was a good idea though I guess it's all part of the compromise - e.g. third ALU pipe might mean instruction decode becomes the bottleneck too often (at least for multithreaded workloads), and of course it runs counter to the idea of CMT where you streamline the int cores a bit but have more of them instead.

Last edited by mczak; 12-Oct-2011 at 19:57.
mczak is offline   Reply With Quote
Old 12-Oct-2011, 17:00   #43
RobertR1
Regular
 
Join Date: Nov 2005
Posts: 5,048
Default

An impressive failure.
__________________
Hall of fame thread: http://forum.beyond3d.com/showthread.php?t=50668
RobertR1 is offline   Reply With Quote
Old 12-Oct-2011, 17:05   #44
Mendel
Mr. Upgrade
 
Join Date: Nov 2003
Location: Finland
Posts: 1,335
Default

Quote:
Originally Posted by Miksu View Post
What is most sad: Deus Ex: Human Revolution, this year's arguable the biggest "Gaming Evolved"-title just BSODs with FX-8150.

What a sorry mess.
Well, a friend of mine once said in a similar situation: "It's the Ati Wonder."
Mendel is offline   Reply With Quote
Old 12-Oct-2011, 17:32   #45
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,842
Send a message via ICQ to CarstenS
Default

Quote:
Originally Posted by mczak View Post
But in any case I think you're looking 2013 at the earliest for any such thing. The competition will be Haswell at that time which I've no idea how it compares to todays chips .
Yes, probably 2013. But if AMD manages to integrate a winning vector co-processor and GPU compute has gained more traction by then, maybe they can be very competitive with Haswell, which I doubt will be more than an evolutionary step, whereas pursuing GCN as a CPU co-processor could well be revolutionary.

We'll see.

What I found the most interesting insight so far is that CMT seems to decelerate heavily threaded games.
http://www.hardware.fr/articles/842-...acite-cmt.html
More exactly this graph:


While SC2 is only using two threads at most and relies heavily on a high IPC, I am not exactly sure about ArmA II, but F1 2011 and RoF are definitely heavily threaded as well as Anno 1404 (aka Dawn of Discovery) and all are running slower with CMT enabled, i.e. 8C/4CU vs. 4c/4CU.
CarstenS is offline   Reply With Quote
Old 12-Oct-2011, 17:36   #46
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,157
Default

This kinda reminds me of Phenom I: it came out with a couple of dumb mistakes that made it underperform badly, but they were corrected with Phenom II shortly after.


Nonetheless, it is an impressive failure as a consumer desktop chip.
How come it has twice the transistor count, no iGPU, and it's still badly beaten by Sandybridge clock-for-clock?


How come they launched a completely new architecture that apparently shows as many disadvantages as advantages over the previous one?
Seems like they should've just stuck with the Phenom II X6 architecture instead, improved the core throttling and power management, and just taken advantage of the new process to raise the clocks.



About the weaker FP processing (6 units in PII-X6 vs. 4 units in BD), could it be that AMD intends to offload all FP calculations to the iGPU cores in future Fusion chips?

Last edited by ToTTenTranz; 12-Oct-2011 at 17:54.
ToTTenTranz is offline   Reply With Quote
Old 12-Oct-2011, 18:04   #47
Tchock
Member
 
Join Date: Mar 2008
Location: Jurong West
Posts: 759
Default

Seriously, what's with the cache splurging?
2MB per module and then another 8 for L3- oh wow seriously.

That, and if you compare this to Gulftown for perf/transistors too...

I'm not sure if AMD's 15% a year will help much if they're always going to be stuck on dice that big. If BD was aimed at being lean and mean, this is surely not the chip to prove it.


I'm gonna say one thing though: once you look at the lower frequencies (oh the irony), there's this 2.5GHz octa-core Opteron at 32W ACP. It'll fare better on mobile, where Trinity parts should clock in at about 2.5GHz+ or so and have 2 modules. The current 35W Llano maxes out at 1.5GHz, the 45W variant at 1.9GHz.

That's where the 30% comes in, but not how AMD wished it would be
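For reference, the arithmetic behind the "30%" remark can be checked quickly. This sketch treats the "about 2.5GHz+" Trinity estimate as exactly 2.5GHz, which is an assumption, and it is not clear from the post which Llano variant is the intended baseline, so both are shown.

```python
def uplift(new_ghz, old_ghz):
    """Relative clock increase of new_ghz over old_ghz."""
    return new_ghz / old_ghz - 1.0


TRINITY_GHZ = 2.5     # "about 2.5GHz+" in the post, taken as exactly 2.5 here
LLANO_35W_GHZ = 1.5   # from the post
LLANO_45W_GHZ = 1.9   # from the post

print(f"vs. 45W Llano: {uplift(TRINITY_GHZ, LLANO_45W_GHZ):.1%}")
print(f"vs. 35W Llano: {uplift(TRINITY_GHZ, LLANO_35W_GHZ):.1%}")
```

Against the 45W part the gain comes out at a little over 30%; against the 35W part it is considerably larger.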
__________________
<rpg.314> - I have a feeling that shielding 480 from the evils of afr, embodied in that creation of satan called 5970, will be a part of epic battle between good and evil
<neliz> - The Devil doesn't wear green.
Tchock is offline   Reply With Quote
Old 12-Oct-2011, 18:20   #48
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,117
Default

The amount of cache seems more appropriate for server loads. I haven't gotten to a review that looks at those.
The latencies are definitely not client-friendly.

Hopefully it looks better for server. The cache structure seems like it has gotten worse compared to the previous chips, and AMD's setup there isn't exactly the greatest.

The measured L2 latencies in some of the reviews are worse than I expected (one has 25-27 cycles, but there's a lot of variability there). In cycle terms, a miss in the puny L1 that hits in the L2 costs about as much as an L3 hit on SB.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is online now   Reply With Quote
Old 12-Oct-2011, 18:24   #49
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,437
Default

Quote:
Originally Posted by CarstenS View Post
What I found the most interesting insight so far is that CMT seems to decelerate heavily threaded games.
http://www.hardware.fr/articles/842-...acite-cmt.html
While SC2 is only using two threads at most and relies heavily on a high IPC, I am not exactly sure about ArmA II, but F1 2011 and RoF are definitely heavily threaded as well as Anno 1404 (aka Dawn of Discovery) and all are running slower with CMT enabled, i.e. 8C/4CU vs. 4c/4CU.
I think that's quite easy to explain. Performance is mostly the same when using only 2 modules too, so while those games might be heavily threaded, it looks like (at least on this chip) performance is more or less determined by how fast one particular thread (or maybe two) is running, and adding more threads just takes resources away from that thread. It might behave the same on SNB with/without HT (or maybe not, as BD apparently has issues with its L1I cache that SNB does not).
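A toy model of that explanation, with entirely hypothetical workloads and slowdown factors: if a frame is only done when its slowest thread is done, then spreading the light work over more threads gains nothing, while sharing a module slows the one thread that matters.

```python
def frame_time(work_per_thread, speed_per_thread):
    """A frame is finished when its slowest thread is finished."""
    return max(w / s for w, s in zip(work_per_thread, speed_per_thread))


# Hypothetical game: one heavy thread plus helper work totalling 18 units.
# 4 threads, one per module, each at full speed:
no_cmt = frame_time([10.0, 6.0, 6.0, 6.0], [1.0] * 4)

# 7 threads on shared modules: helper work is split finer, but every thread
# (including the heavy one) is assumed to run at 80% speed when sharing.
cmt = frame_time([10.0] + [3.0] * 6, [0.8] * 7)

print(no_cmt, cmt)
```

In this sketch the frame takes longer with the extra threads even though total throughput went up, because the heavy thread is the critical path.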
mczak is offline   Reply With Quote
Old 12-Oct-2011, 19:09   #50
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,566
Default

Quote:
Originally Posted by Raqia View Post
I wonder how well a 16 core bobcat CPU at 3 ghz would do vs. an Interlagos 16 core.
The problem is that I don't think the Bobcat design can actually do anywhere close to 3GHz under any real conditions.

Quote:
Originally Posted by Gubbi View Post
BD might be a decent server chip, but I'm really surprised how much AMD dropped the ball with BD as a desktop chip.

Things to improve:
1. Fix the false aliasing L1 caches (increase associativity to pad up the bits in the index/tags)
I don't think turning a 2-way set associative 64KB L1 icache into 16-way set associative is something that can be described as a "fix." Especially not with this being the same L1 cache design AMD has used since K7, hell, even K6 had a 32KB 2-way associative cache. There's usually no really good reason for software to have 4KB->32KB aliasing in the first place, even between two threads.. it's just a side effect of aggressive address space randomization and has already been changed for Linux. The impact is minor, as you'd expect, because I doubt two threads are always spending all of their active cycles in the same shared library code. It might not be a problem in Windows to begin with, I mean, it isn't a problem where large pages are used..
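To illustrate where the 4KB/32KB numbers come from, here is a sketch of the set-index arithmetic for a 64KB 2-way cache. The 64-byte line size, the assumption that the index is taken from virtual address bits, and the example addresses are all my own assumptions for illustration, not something stated in the reviews.

```python
CACHE_BYTES = 64 * 1024   # 64KB L1 instruction cache (from the post)
WAYS = 2                  # 2-way set associative (from the post)
LINE_BYTES = 64           # assumed line size
PAGE_BYTES = 4 * 1024     # 4KB pages

WAY_BYTES = CACHE_BYTES // WAYS      # 32KB covered by each way
NUM_SETS = WAY_BYTES // LINE_BYTES   # 512 sets, indexed by address bits 6..14


def set_index(virtual_addr):
    """Set selected by a virtual address, assuming a virtually indexed cache."""
    return (virtual_addr // LINE_BYTES) % NUM_SETS


# Hypothetical: the same shared-library page mapped at different virtual
# addresses in two threads/processes, looking at the same offset in the page.
offset = 0x40
map_a = 0x7F1200003000           # mapping seen by thread A
map_b = 0x7F5600005000           # randomized mapping seen by thread B
map_b_aligned = 0x7F560000B000   # mapping congruent to map_a modulo 32KB

print(set_index(map_a + offset), set_index(map_b + offset))          # different sets
print(set_index(map_a + offset), set_index(map_b_aligned + offset))  # same set
```

Bits 12..14 of the index lie above the 4KB page offset, so the same code only lands in the same set for both mappings when the two mappings agree modulo 32KB. That is consistent with the point that this is a side effect of how mappings are randomized, and that it goes away with large pages.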

Last edited by Exophase; 12-Oct-2011 at 19:17.
Exophase is online now   Reply With Quote
