Bring back high performance single core CPUs already!

Well, non-P4 processors typically do not have a trace cache either, whose whole purpose is to essentially take the decoder stages out of the equation, from what I've read.

Of course, that depends on the hit rate (and miss penalty) of P4's trace cache compared to a more traditional L1-I cache. IIRC, since the original P4's trace cache is not exactly large (12K uOps, roughly equivalent to 16KB), its hit rate should be similar to that of a more traditional CPU, but its miss penalty is much larger (the whole trace has to be flushed and rebuilt, compared to just loading the instructions from L2 cache in the traditional case).
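
To make the hit rate vs. miss penalty trade-off concrete, here is a minimal sketch of the usual average-cost formula. The function name and every number in it are hypothetical, chosen only to illustrate the point that an equal hit rate with a larger miss penalty gives a worse average; they are not measured P4 figures.

```python
def average_fetch_cost(hit_rate, hit_cycles, miss_penalty_cycles):
    """Expected cycles per front-end lookup: hit cost plus the weighted miss penalty."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

# Hypothetical numbers: same hit rate for both, only the miss penalty differs.
l1i_cost = average_fetch_cost(hit_rate=0.95, hit_cycles=1, miss_penalty_cycles=10)
trace_cost = average_fetch_cost(hit_rate=0.95, hit_cycles=1, miss_penalty_cycles=30)

print(f"traditional L1-I: {l1i_cost:.2f} cycles per lookup")
print(f"trace cache:      {trace_cost:.2f} cycles per lookup")
```

With identical hit rates, the structure with the more expensive miss always comes out behind, which is the whole argument in one line of arithmetic.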

Looking back, a full trace cache seems to be a bad idea. However, Sandy Bridge revived this idea in the form of a uOps cache (1.5K entries). Basically it works just like a traditional I-cache, but it stores decoded uOps instead of x86 instructions. So it's not a trace cache, but it's still able to take some load off the decoders.
 
Of course, that depends on the hit rate (and miss penalty) of P4's trace cache compared to a more traditional L1-I cache.

You're right, my point was that the TC in P4 is... peculiar, since there's no L1I in front of it, so you have to balance the comparison somehow by either leaving both out or finding a way to include both.

The one in SB, instead, looks cool to me. In moderately sized loops, it can boost the processor's speed considerably by taking the front end out of the picture.

@Grall: BD really doesn't seem bad at all. Don't look at performance for a moment, look at the idea behind it. I tried to look at it from every angle, wondering why its performance was so pathetic. Its IPC is surely... terrible. I thought it could be due to the many missing AGLU instructions, L1D cache thrashing, or the extra cycle of latency, or the undersized BTB, or the missing 3rd ALU (but AMD said it was rarely used), or the shared L2 access that literally makes it ~an L3.
In the end, I believe my very first impression was right.
Over a long run, decoders average around 2 instr/cycle. BD can average 2 instr/cycle/core (theoretical 2+1Alu+2Aglu+2Float). A shared decoder would output, in the average/good case, 2-3 mops every 2 cycles... or somewhere between 1 and 1.5 instructions per cycle per core. Mah.
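
The arithmetic behind that estimate, as a small sketch. It assumes my reading of the post: the shared decoder alternates between the two cores of a module, so each core gets a batch of 2-3 mops once every 2 cycles. The function name is made up for illustration.

```python
def per_core_decode_rate(mops_per_turn, cores_sharing):
    """Mops per cycle seen by one core when the decoder serves each core in turn."""
    # Assumption: strict alternation, so a core waits `cores_sharing` cycles between turns.
    return mops_per_turn / cores_sharing

average_case = per_core_decode_rate(mops_per_turn=2, cores_sharing=2)
good_case = per_core_decode_rate(mops_per_turn=3, cores_sharing=2)
core_demand = 2.0  # instr/cycle/core that the post says a BD core can average

print(f"decoder supplies {average_case} to {good_case} per cycle per core")
print(f"core could consume about {core_demand} per cycle")
```

Under those assumptions the decoder feeds each core at best three quarters of what the core could sustain, which is the mismatch the post is pointing at.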
 
But the reality is that HT costs minimal die space, while doubling core count almost doubles the die space (and raises power draw dramatically). You should be comparing 2c/2t with 2c/4t (and 4c/4t with 4c/8t) because those designs are similar in power draw, transistor count and manufacturing costs.

Comparing 2c/4t directly with 4c/4t is not a fair comparison, because the 2c/4t CPU is much cheaper to produce and consumes considerably less power (half the cores, half the execution units, etc).
Except a Bulldozer module (L2 included) is about the same size as a SB core (L3 included). You can spin it any way you want, but the simple fact is that in the 17 watt class Intel has a 2c/4t, and in a month or two Trinity will be out with 4c/4t, and from the looks of it, it seems much improved over both Bulldozer and Stars.

http://amdfx.blogspot.com.au/2012/04/amd-trinity-benchmark-geekbench.html
 
Except a Bulldozer module (L2 included) is about the same size as a SB core (L3 included). You can spin it any way you want, but the simple fact is that in the 17 watt class Intel has a 2c/4t, and in a month or two Trinity will be out with 4c/4t, and from the looks of it, it seems much improved over both Bulldozer and Stars.

Are you sure about this? Bulldozer's module size seems to be quite large. Its die size is 315 mm^2 @ 32nm, and from the floor plan each module looks to be roughly 50 mm^2 (that's without L3 cache).

Sandy Bridge, on the other hand, is 216 mm^2 @ 32nm. Its GPU/Display/DMI part takes up roughly the same proportion of the die as Bulldozer's L3 cache + system part, so each Sandy Bridge core with its respective L3 cache is roughly 36 mm^2. That's much smaller and, frankly, offers better performance in general.

[EDIT] I used die photos of both CPUs to estimate the core size (Bulldozer for its module without the L3 cache, and Sandy Bridge for one core and 1/4 of the L3 cache). The number for Bulldozer is 36 mm^2, while Sandy Bridge is 28.7 mm^2.
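
For reference, the area ratios implied by the two sets of estimates, as a quick sketch. The figures are the ones quoted in the post above (floor-plan guess first, die-photo measurement second); the helper name is made up.

```python
def area_ratio(bulldozer_module_mm2, sandy_bridge_core_mm2):
    """How many times larger a Bulldozer module is than a Sandy Bridge core + L3 slice."""
    return bulldozer_module_mm2 / sandy_bridge_core_mm2

# First estimate from the floor plans: ~50 mm^2 module vs ~36 mm^2 core + L3.
floor_plan_ratio = area_ratio(50.0, 36.0)
# Revised estimate from the die photos: 36 mm^2 module vs 28.7 mm^2 core + 1/4 L3.
die_photo_ratio = area_ratio(36.0, 28.7)

print(f"floor plan estimate: {floor_plan_ratio:.2f}x")
print(f"die photo estimate:  {die_photo_ratio:.2f}x")
```

Either way the module (without L3) comes out larger than the Sandy Bridge core with its L3 slice, just by a smaller margin after the revision.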
 
Look at the 1.5GHz data! Sebbi is on to something, I believe :) The "cave" scene shows that even a mostly fillrate limited scene still needs two physical cores, and so does the "City" scene (same one from my first test but now with the enhanced ugrids and shadows), although four threads is best if you're not going to overclock.

That's also an excellent illustration of why AMD multi-core CPUs are just about as good as Sandy Bridge CPUs for enthusiast game settings in current games that can take advantage of multiple threads.

In single threaded or lightly threaded games, or games which stress the CPU more than the GPU, SB can sometimes have a noticeable lead, but in many cases an enthusiast would do just as well with an AMD multicore as an Intel multicore at enthusiast graphics settings.

At least if all they are doing is gaming.

Regards,
SB
 
in many cases an enthusiast would do just as well with an AMD multicore as an Intel multicore at enthusiast graphics settings.
I'm one of those enthusiasts that also care about noise. Having the CPU consume a ton more power and thus needing beefier cooling would mean more noise.

Though yeah, raw performance wise BD is good enough for most stuff, especially after its price was lowered somewhat. Its only problem is that its maximum performance isn't all that great vs i7s.
 
You can spin it any way you want, but the simple fact is that in the 17 watt class Intel has a 2c/4t, and in a month or two Trinity will be out with 4c/4t, and from the looks of it, it seems much improved over both Bulldozer and Stars.
AMD's way of sharing resources (vector units) between two threads is also a very well designed one. Vector pipelines are longer than scalar/logic/fetch pipelines, so it's harder to keep them fed from single thread ILP alone. Getting instructions from two threads benefits vector processing even more than generic processing. Additionally, vector instructions are not used as frequently as normal instructions (the usual usage pattern contains heavy bursts with lots of pipeline idling in between). So sharing a vector pipeline & execution units between two threads improves vector pipeline usage even more than HT does for generic processing.

But the biggest gain of sharing vector units between two threads is that vector execution units take a lot of die space, and sharing cuts the die space requirement of the vector units in half. It's better to slightly beef up the vector execution unit and share it between two threads than to have separate ones for each core (with a very low utilization rate because of long pipelines and bursty instruction usage patterns).

The reason why Bulldozer cannot currently match Sandy Bridge is not the shared vector units. One of the bottlenecks in the Bulldozer design is the L1i cache. It has only two way associativity, but is shared between the two threads in the module. A two way cache is just enough for one thread if the branching pattern is simple, but sharing such a simple cache between two threads is asking for trouble. Sandy Bridge's L1i is 8 way associative, so it works very well with two threads as well (HT). AMD should have improved their L1i cache logic (for example, 4 way associativity would have been a good compromise) when they decided to share the cache between two threads (just like Intel did with HT). Or they could have split the L1i cache in half and given both threads their own caches (so that they would not randomly evict each other's cache lines). It's not a common scenario that two threads are executing exactly the same code, so a shared cache doesn't help much compared to smaller separate ones.
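
The associativity argument can be illustrated with a toy model. This is only a sketch under loud assumptions: a tiny LRU set-associative cache, two hypothetical loops that each touch exactly 2 lines per set, strictly interleaved fetches, and a deliberately worst-case aliasing pattern. It says nothing about real Bulldozer or Sandy Bridge miss rates.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tiny LRU set-associative cache model that only counts hits and misses."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, line_addr):
        lines = self.sets[line_addr % self.num_sets]
        if line_addr in lines:
            lines.move_to_end(line_addr)   # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[line_addr] = True

def steady_state_miss_rate(ways, thread_loops, num_sets=4, passes=50):
    """Miss rate after one warm-up pass, with fetches interleaved across threads."""
    cache = SetAssociativeCache(num_sets, ways)

    def one_pass():
        for group in zip(*thread_loops):   # one line from each thread in turn
            for line in group:
                cache.access(line)

    one_pass()                             # warm-up, not counted
    cache.hits = cache.misses = 0
    for _ in range(passes):
        one_pass()
    return cache.misses / (cache.hits + cache.misses)

NUM_SETS = 4
loop_a = list(range(0, 2 * NUM_SETS))                          # 2 lines per set
loop_b = list(range(100 * NUM_SETS, 100 * NUM_SETS + 2 * NUM_SETS))  # aliases with loop_a

print("2-way, one thread: ", steady_state_miss_rate(2, [loop_a]))
print("2-way, two threads:", steady_state_miss_rate(2, [loop_a, loop_b]))
print("4-way, two threads:", steady_state_miss_rate(4, [loop_a, loop_b]))
```

In this contrived pattern the 2 way cache holds one thread's loop comfortably, thrashes completely once the second thread is added, and the 4 way cache absorbs both. Real code is far less regular, so take the extremes as an illustration of the mechanism only.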
 
For comparison, Sandy Bridge has an 18 stage pipeline.
what?? C2/nehalem had around 13-14.. more or less like K8/10 design.

I don't remember seeing relevant uarch changes (<- to the pipeline)....
ah, i think i understand. ~+4 for the loop buffer that has become a true TC cache.

Well, this brings again the problem of P4: when you consider a loop within TC cache boundaries (~1500 mops), decoding is out of play and you get full decoding throughput.
 
That's also an excellent illustration of why AMD multi-core CPUs are just about as good as Sandy Bridge CPUs for enthusiast game settings in current games that can take advantage of multiple threads.

In single threaded or lightly threaded games, or games which stress the CPU more than the GPU, SB can sometimes have a noticeable lead, but in many cases an enthusiast would do just as well with an AMD multicore as an Intel multicore at enthusiast graphics settings.

As I've run both, including a Phenom 2 x6 that I run 24/7 at 4.8GHz, I can say that average frame rates across all of them are mostly the same.

Minimum frame rates on the other hand are a completely different story, they are a lot lower on the AMD chips, and on BD they are really low.

Minimums are the most important number of all.

And personally I would never compare any Core i series CPU with an AMD chip as the Intel would completely bash them in gaming.

The old 1156 Core i3 at 4GHz+, from a pure gaming point of view, would flat out leave any AMD chip for dead.
 
Yes, that's another point I forgot to mention earlier.

Moving from a single core to even a measly 2.8GHz Northwood P4 with HT was by far the biggest upgrade to system responsiveness I've seen. EVER. And that includes moving from a 233MHz P1 with 16M RAM @ 40MB/s to a 500MHz P3 with half a gig at some 5x+ higher throughput.

I've had similar gains on single core, single thread computers, just by asking the OS scheduler to set the priority of an offending process (firefox) lower.
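
For anyone who wants to try the same trick from a script, a minimal sketch using Python's standard `os.getpriority` / `os.setpriority` (Unix only). The helper name `lower_priority` and the default increment are my own choices for illustration; the cap of 19 assumes the usual Linux niceness range.

```python
import os

def lower_priority(pid, increment=5):
    """Raise the niceness of `pid` by `increment` (Unix only) and return the new value."""
    current = os.getpriority(os.PRIO_PROCESS, pid)
    # Assumption: 19 is the weakest priority (the usual Linux limit).
    target = min(19, current + increment)
    os.setpriority(os.PRIO_PROCESS, pid, target)
    return os.getpriority(os.PRIO_PROCESS, pid)

# Demo on the current process; in practice you would pass the offending process's pid.
print("new niceness:", lower_priority(os.getpid(), increment=1))
```

An unprivileged user can normally only make a process nicer, not restore a higher priority afterwards, so it is worth testing on something harmless first.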

moving from sempron 64 to athlon II X2, on same mobo, made my PC incredibly better. sure. installing an OS in virtualbox was utter pain (major CPU hog for every simulated I/O)
 
Just give me a 22nm 6.4GHz Xenon CPU :)

Very small, cheap and fast for entertainment applications.
And probably cool too.
 
I would like AMD to make a single module Piledriver, still unlocked like other FX processors.
no GPU, just the basics for am3+ socket, small and somewhat cool running - can be o/c on stock heatsink without too much noise.

maybe we can get something like that in the form of a disabled Trinity - but there's the socket dance.
 
For gaming and entertainment it would be fast and cheap.
Probably something around 230 GFlops peak :)
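
Where a figure like that would come from, as a back-of-the-envelope sketch. It assumes the commonly quoted 115.2 GFLOPS theoretical peak for the 3.2 GHz Xenon and perfectly linear scaling with clock speed; both are assumptions for illustration, not something stated in this thread.

```python
XENON_CLOCK_GHZ = 3.2        # stock Xenon clock
XENON_PEAK_GFLOPS = 115.2    # assumption: the commonly quoted theoretical peak at 3.2 GHz

def peak_gflops(clock_ghz):
    """Theoretical peak, assuming it scales linearly with clock speed."""
    return XENON_PEAK_GFLOPS * (clock_ghz / XENON_CLOCK_GHZ)

print(f"{peak_gflops(6.4):.1f} GFLOPS peak at 6.4 GHz")
```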

Xenon has been said to be practically similar to a high end Athlon 64 X2. But there's no way 6.4 GHz is going to happen.

What is "entertainment"? HTPC just needs a video DSP like the APUs, IGPs and video cards have, plus any CPU that's faster than a P4.
 