Bring back high performance single core CPUs already!

Discussion in 'PC Hardware, Software and Displays' started by Frontino, Apr 10, 2012.

  1. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,891
    Likes Received:
    2,309
    No Never....
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    Of course, that depends on the hit rate (and miss penalty) of the P4's trace cache compared to a more traditional L1-I cache. IIRC, since the original P4's trace cache is not exactly large (12K uOps, roughly equivalent to 16KB), its hit rate should be similar to a more traditional CPU's, but its miss penalty is much larger (the whole trace has to be flushed and rebuilt, compared to just loading the instructions from the L2 cache in the traditional case).

    Looking back, a full trace cache seems to have been a bad idea. However, Sandy Bridge revived the idea in the form of a uOps cache (1.5K entries). Basically it works just like a traditional I-cache, but it stores decoded uOps instead of x86 instructions. So it's not a trace cache, but it's still able to take some load off the decoders.
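A rough way to see why the P4 arrangement hurts: with similar hit rates, the expected fetch cost is dominated by the miss penalty. The sketch below is a toy model; the hit rates and penalties are illustrative assumptions, not measured P4 or Sandy Bridge figures.

```python
# Toy model: expected front-end cost per fetch. With equal hit rates,
# the trace cache loses purely on its larger miss penalty (the trace
# must be flushed and rebuilt, vs a simple refill from L2).
# All numbers below are illustrative assumptions.

def expected_fetch_cost(hit_rate, hit_cost, miss_penalty):
    """Average cycles per fetch: hit contribution + miss contribution."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_penalty

trace_cache = expected_fetch_cost(0.95, 1, 30)  # assumed trace-rebuild cost
l1_icache   = expected_fetch_cost(0.95, 1, 10)  # assumed L2-refill cost

print(round(trace_cache, 2))  # 2.45
print(round(l1_icache, 2))    # 1.45
```

Same 95% hit rate in both cases, yet the average cost per fetch differs by a full cycle just from the rebuild penalty.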
     
  3. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    You're right, my point was that the TC in the P4 is... peculiar, since there's no front L1I, so you have to balance the comparison somehow, by either leaving both out or finding a way to include both.

    The one in SB, instead, looks cool to me. In moderate-sized loops it can pump up the processor speed heavily by taking the front end out of the picture.

    @Grall: BD seems really not bad at all. Don't look at performance for a while; look at the idea behind it. I did try to see it from every angle, wondering why it was so pathetic on performance. Its IPC is surely... terrible. I thought it could be due to many missing AGLU instructions, L1D cache thrashing, the extra cycle of latency, the undersized BTB, the missing 3rd ALU (but AMD said it was rarely used), or the shared L2 access that literally makes it ~an L3.
    In the end, I believe my very first impression was right.
    Over a long run, decoders average around 2 instr/cycle. BD can average 2 instr/cycle/core (theoretical 2+1 ALU + 2 AGLU + 2 float). A shared decoder would output, in the average/good case, 2-3 mops per 2 cycles... or between about 1 and 1.5 instructions per cycle per core. Mah.
     
  4. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    Except a Bulldozer module (L2 included) is about the same size as an SB core (L3 included). You can spin it any way you want, but the simple fact is that in the 17 watt class Intel has a 2c/4t part, and in a month or two Trinity will be out with 4c/4t, and from the looks of it it seems much improved over both Bulldozer and Stars.

    http://amdfx.blogspot.com.au/2012/04/amd-trinity-benchmark-geekbench.html
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    Are you sure about this? Bulldozer's module size seems to be quite large. Its die size is 315 mm^2 @ 32nm, and from the floor plan each module looks to be roughly 50 mm^2 (that's without the L3 cache).

    Sandy Bridge, on the other hand, is 216 mm^2 @ 32nm. Its GPU/display/DMI part takes roughly the same proportion as Bulldozer's L3 cache + system part, so each Sandy Bridge core with its share of the L3 cache is roughly 36 mm^2. That's much smaller and, frankly, offers better performance in general.

    [EDIT] I used die photos of both CPUs to estimate the core size (Bulldozer's module without the L3 cache, and one Sandy Bridge core plus 1/4 of the L3 cache). The number for Bulldozer is 36 mm^2, while Sandy Bridge is 28.7 mm^2.
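The estimate above is just proportional scaling of a die photo. A sketch of the arithmetic, where the area fractions are rough eyeball assumptions chosen to match the quoted numbers:

```python
# Scale a region's share of the die photo by the known total die size.
# The fractions are eyeballed assumptions, not measured values.

def region_area(total_die_mm2, region_fraction):
    return total_die_mm2 * region_fraction

bd_module = region_area(315.0, 0.114)  # Bulldozer module, no L3 (~11.4% of die, assumed)
sb_core   = region_area(216.0, 0.133)  # SB core + 1/4 of L3 (~13.3% of die, assumed)

print(round(bd_module, 1))  # 35.9 -- close to the quoted 36 mm^2
print(round(sb_core, 1))    # 28.7 mm^2
```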
     
  6. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,141
    Likes Received:
    5,078
    That's also an excellent illustration of why AMD multi-core CPUs are just about as good as Sandy Bridge CPUs for enthusiast game settings in current games that can take advantage of multiple threads.

    In single-threaded or lightly threaded games, or games which stress the CPU more than the GPU, SB can sometimes have a noticeable lead, but in many cases an enthusiast would do just as well with an AMD multicore as an Intel multicore at enthusiast graphics settings.

    At least if all they are doing is gaming.

    Regards,
    SB
     
  7. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    I'm one of those enthusiasts that also care about noise. Having the CPU consume a ton more power and thus needing beefier cooling would mean more noise.

    Though yeah, raw-performance-wise BD is good enough for most stuff, especially after its price was lowered somewhat. Its only problem is that its maximum performance isn't all that great vs the i7's.
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    AMD's way of sharing resources (vector units) between two threads is also very well designed. Vector pipelines are longer than scalar/logic/fetch pipelines, so it's harder to keep them fed from single-thread ILP alone. Getting instructions from two threads benefits vector processing even more than generic processing. Additionally, vector instructions are not used as frequently as normal instructions (the usual usage pattern contains heavy bursts with lots of pipeline idling in between). So sharing a vector pipeline & execution units between two threads improves vector pipeline usage even more than HT does for generic processing.

    But the biggest gain of sharing vector units between two threads is that vector execution units take a lot of die space, and sharing cuts the die space requirement of the vector units in half. It's better to slightly beef up the vector execution unit and share it between two threads than to have separate ones for each core (with a very low utilization rate, because of the long pipelines and bursty instruction usage patterns).

    The reason why Bulldozer cannot currently match Sandy Bridge is not the shared vector units. One of the bottlenecks in the Bulldozer design is the L1i cache. It has only two-way associativity, but is shared between the two threads in the module. A two-way cache is just enough for one thread if the branching pattern is simple, but sharing such a simple cache between two threads is asking for trouble. Sandy Bridge's L1i is 8-way associative, so it works very well with two threads as well (HT). AMD should have improved their L1i cache logic (for example, 4-way associativity would have been a good compromise) when they decided to share the cache between two threads (just like Intel did with HT). Or they could have split the L1i cache in half and given both threads their own cache (so that they would not randomly evict each other's lines). It's not a common scenario that two threads are executing exactly the same code, so a shared cache doesn't help much compared to smaller separate ones.
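The associativity point can be shown with a toy LRU cache model: two threads with disjoint code footprints interleaved on one shared L1i. The geometry below (64 sets, 2 lines per set per thread) is an illustrative assumption, not Bulldozer's real configuration:

```python
# Toy set-associative I-cache with LRU replacement. Two threads each
# touch 2 distinct lines per set; with 2 ways they thrash each other,
# with 4 ways both footprints fit after warm-up.
# Sizes and access pattern are illustrative assumptions.
from collections import OrderedDict

def miss_rate(ways, sets, accesses):
    cache = [OrderedDict() for _ in range(sets)]
    misses = 0
    for line in accesses:
        s = cache[line % sets]          # set index from line address
        if line in s:
            s.move_to_end(line)         # hit: mark most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)   # evict least recently used
            s[line] = True
    return misses / len(accesses)

sets = 64
thread_a = [s for s in range(sets)] + [s + sets for s in range(sets)]
thread_b = [s + 2 * sets for s in range(sets)] + [s + 3 * sets for s in range(sets)]

interleaved = []
for _ in range(10):                     # replay both code loops, interleaved
    for a, b in zip(thread_a, thread_b):
        interleaved += [a, b]

print(miss_rate(2, sets, interleaved))  # 1.0 -- every access misses
print(miss_rate(4, sets, interleaved))  # 0.1 -- only cold misses remain
```

Each set cycles through 4 distinct lines, so 2-way LRU misses on every single access while 4-way only pays the cold misses, which matches sebbbi's argument that doubling the associativity (or splitting the cache) would have avoided the mutual eviction.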
     
    #88 sebbbi, Apr 18, 2012
    Last edited by a moderator: Apr 18, 2012
  9. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Except with multi-GPU setups, where I've seen some curious test results with Bulldozer. I don't know if Phenom II was better, but there's no point in multi-GPU on one of those anymore anyway.
     
  10. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
    For comparison, Sandy Bridge has an 18-stage pipeline.
     
  11. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    What?? C2/Nehalem had around 13-14... more or less like the K8/K10 design.

    I don't remember seeing any relevant µarch changes (to the pipeline)...
    Ah, I think I understand: ~+4 for the loop buffer that has become a true TC.

    Well, this brings back the problem of the P4: when you consider a loop that fits within the TC boundaries (~1500 mops), decoding is out of play and you get full decode throughput.
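The reason pipeline depth keeps coming up here is the branch-mispredict penalty, whose flush cost scales with depth. A back-of-envelope sketch, where the branch frequency and mispredict rate are illustrative assumptions:

```python
# Extra cycles per instruction lost to mispredicted branches: the flush
# cost is roughly the pipeline depth. Rates below are assumptions.

def mispredict_overhead(depth, branch_freq=0.2, mispredict_rate=0.05):
    """Average CPI penalty = branches/instr * mispredicts/branch * flush depth."""
    return branch_freq * mispredict_rate * depth

print(round(mispredict_overhead(14), 2))  # 0.14 -- C2/Nehalem-like depth
print(round(mispredict_overhead(31), 2))  # 0.31 -- Prescott-like depth (assumed)
```

This is also why a uOp/trace cache hit helps twice over: it saves decode work and shortens the effective restart path.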
     
  12. almighty

    Banned

    Joined:
    Dec 17, 2006
    Messages:
    2,469
    Likes Received:
    5
    As I've run both, including a Phenom II X6 that I run 24/7 at 4.8GHz, I can say that average frame rates between all of them are mostly the same.

    Minimum frame rates, on the other hand, are a completely different story: they are a lot lower on the AMD chips, and on BD they are really low.

    Minimums are the most important number of all.

    And personally I would never compare any Core i series CPU with an AMD chip, as the Intel would completely bash them in gaming.

    The old 1156 Core i3 at 4GHz+, from a pure gaming point of view, would flat out leave any AMD chip for dead.
     
    #92 almighty, Apr 21, 2012
    Last edited by a moderator: Apr 21, 2012
  13. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    I've had similar gains on single-core, single-thread computers, just by asking the OS scheduler to set the priority of an offending process (Firefox) lower.

    Moving from a Sempron 64 to an Athlon II X2, on the same mobo, made my PC incredibly better. Sure, installing an OS in VirtualBox was utter pain (a major CPU hog for every simulated I/O).
     
  14. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,891
    Likes Received:
    2,309
    On the subject of P4s: my 3GHz Northwood seems to sit at 90% CPU usage while watching FLVs/Flash.
     
  15. pascal

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,830
    Likes Received:
    49
    Location:
    Brasil
    Just give me a 22nm 6.4GHz Xenon CPU :)

    Very small, cheap and fast for entertainment applications.
    And probably cool too.
     
    #95 pascal, May 20, 2012
    Last edited by a moderator: May 20, 2012
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    Even at 6.4GHz, Xenon would be slow compared to a modern mid-range quad core.
     
  17. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    I would like AMD to make a single-module Piledriver, still unlocked like the other FX processors:
    no GPU, just the basics for the AM3+ socket, small and somewhat cool running, so it can be o/c'd on the stock heatsink without too much noise.

    Maybe we can get something like that in the form of a disabled Trinity, but there's the socket dance.
     
  18. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Yeah something like that would be interesting.
     
  19. pascal

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,830
    Likes Received:
    49
    Location:
    Brasil
    For gaming and entertainment it would be fast and cheap.
    Probably something around 230 GFlops peak :)
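The 230 GFLOPS figure checks out if one assumes Xenon-style numbers scaled to 6.4 GHz; the 12 flops/cycle/core value is the one commonly quoted for the 3.2 GHz part (115.2 GFLOPS) and is an assumption here:

```python
# Peak GFLOPS = cores * flops/cycle/core * clock (GHz).
# 12 flops/cycle/core is the commonly quoted Xenon figure (assumed).

cores = 3
flops_per_cycle_per_core = 12
clock_ghz = 6.4

peak_gflops = cores * flops_per_cycle_per_core * clock_ghz
print(round(peak_gflops, 1))  # 230.4
```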
     
  20. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Xenon has been said to be practically similar to a high-end Athlon 64 X2. But there's no way 6.4 GHz is going to happen.

    What is "entertainment"? An HTPC just needs a video DSP like the ones APUs, IGPs and video cards have, plus any CPU that's faster than a P4.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.