Bring back high performance single core CPUs already!

Yeah, not sure what the bottleneck is with the 8192 shadow setting, but I ran headlong into the same bottleneck on my Q9450 + 5850 config a few months ago. I made the (naive?) assumption it was VRAM-limited, but never did the proper research to prove it.

GPU-Z and MSI Afterburner both show >2500MB of VRAM usage while I'm dorking around outdoors in Skyrim; typically less when I'm indoors somewhere. I haven't compared it after making Richard's shadow and uGrids changes; I'll check it out tonight and report back.
 
I generally agree, but there are some odd outliers. Skyrim really loves six cores even at the highest details:
[Attached chart: CPU_2.png]

I suppose it might be caching, but it really seems to love threads. Bizarre.

X2 3.3GHz = 44FPS
X4 3.7GHz = 54FPS
X6 3.3GHz (up to 3.7GHz with turbo) = 49FPS

The game doesn't seem to care for more than maybe 3 cores.
I don't think you should compare the i7 3xxx to the 2600K, because there is simply too much difference: a lot more L3 cache, a lot more memory bandwidth, and in this game maybe a higher turbo clock too.
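
A quick per-clock sanity check of those numbers (just a rough Python sketch; the FPS and clock values are the quoted ones, and the X6 is assumed to sit near its 3.7GHz turbo for most of the run):

[code]
# Rough normalization of the quoted Skyrim results by clock and core count.
# Assumption: the X6 holds roughly its 3.7GHz turbo under this mostly
# lightly-threaded load.
results = {
    "X2 @ 3.3GHz": (44, 3.3, 2),
    "X4 @ 3.7GHz": (54, 3.7, 4),
    "X6 @ ~3.7GHz": (49, 3.7, 6),
}

for name, (fps, ghz, cores) in results.items():
    print(f"{name}: {fps / ghz:5.1f} fps/GHz, "
          f"{fps / (ghz * cores):4.1f} fps per core-GHz")
[/code]

Per GHz the X2 and X4 are in the same ballpark (~13 vs ~15 fps/GHz), while fps per core-GHz drops off sharply at six cores, which fits the idea that only about three threads are doing useful work.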

My experience with Skyrim was split across 3 CPUs.
First, an E5400 (2MB L2, 2 cores) at 3.7GHz.
Even before the first patches the framerate was already OK at my settings (above 20 in the most intensive places, but normally above 40), BUT the game had some terrible stuttering and freezes which, as far as I know, only happened on dual core CPUs. There was an unofficial fix that worked (from enbdev.com), but it was later solved by the official patches.
I tested some underclocking: at 2.4GHz the game was still playable, with more than 20fps in Riften. And here is the funny thing: with a higher level of detail (everything at ultra, which is too much for my VGA) the experience was still smooth at that clock, with a constant framerate, while at 3.7GHz it was awful, with a lot of variation, the framerate jumping up and down all the time.
Anyway, the 1.4 patch made the game a lot lighter; I started seeing the framerate at 50fps or more most of the time, with the lows going into the 30s...
I swapped this CPU for a 65nm Core 2 Quad at 2.85GHz, and performance decreased a little but stayed close enough. It was clear from looking at Task Manager that this game uses very little more than 2 cores...

Going to an i3 2100 the framerate definitely improved. I guess the architecture improvements really can make up for the "missing" cores easily in this case (gaming), plus a much faster memory/IO subsystem, I guess...

But there are games that are far more successful at using more threads than Skyrim.
Comparing the E5400 @ 3.75GHz to the C2Q @ 2.85GHz, I saw a huge advantage for the C2Q in some games like The Witcher 2 and GTA 4, with things basically jumping from 20 to 30FPS (but the C2Q also had more L2 cache).

I think a dual core Sandy Bridge with HT at 4.5GHz would make extra cores mostly useless for gaming right now... and I also think that this is the reason why Intel only unlocks overclocking on their more expensive parts... so yes, the OP has a point, I think... most users don't really need 4/6 cores, but could use 2 stronger cores.
 
Read the rest of his data, you'll see that HT just isn't quite enough.
 
Yeah, uh, try reading the rest of my posts :LOL: ;) Let me help with one of my cliff notes from an earlier post (but long after the one you quoted...)
Yes, you are correct. I chose to drag Skyrim in here as a game that had previously demonstrated scaling beyond 4 cores, and someone rightfully asked if that held true after all the recent patching.

I felt it necessary to properly answer the question, and the answer was generally "no", scaling did NOT hold true after the newer patches, at least when playing using graphics settings that the ultra-enthusiast is probably going to use. I guess you could say that I was doing the proper due diligence to either support or refute my claim, and it kinda went 50/50 for me :D

Negatives: six-core scaling appears to be zero (or perhaps even slightly negative?) Meh.
Positives: it still needs a minimum of two cores to be playable, preferably four.

I've done a bit of homework for the world to see, and HT isn't much help. You want real cores, not HT... The i5-2500k seems to be your absolute best "bang for the buck" in terms of gaming performance, which really isn't news to anyone... Also, at very low speeds (i.e. low-power processors found in laptops) there is a measurable performance benefit to having four physical cores in Skyrim. The benches indicated a jump from 10fps -> 30fps going from a single core to a quad core (hyperthreading helped at lower core counts, but maximum performance was found at 4c / 4t rather than 2c / 4t)
 

so what you're really trying to say is that 17-watt Trinity is going to be awesome :LOL:
 
I'm pretty sure the CPU market took the turn towards multicore the way it did for a reason, a reason decided by people smarter and more informed than me. However, one has to wonder what kind of performance a hypothetical single core Sandy Bridge pushed to 5GHz+, with 4 threads, a 512-bit Larrabee-style vector unit, and a massive 32-64MB L2 cache would deliver, had the single core approach stuck around. It might be able to keep up with or maybe even beat a dual core SB of today. Though 4/6 cores would probably crush it. Someone make it happen! :p
 
so what you're really trying to say is that 17-watt Trinity is going to be awesome :LOL:
No comment ;)

However, one has to wonder what kind of performance a hypothetical single core Sandy Bridge pushed to 5GHz+, with 4 threads, a 512-bit Larrabee-style vector unit, and a massive 32-64MB L2 cache would deliver, had the single core approach stuck around.
Meh. Four threads from a single core sounds foolish and unlikely to be useful; a fat vector unit makes a lot of assumptions about how game code would get written, and an epic L2 cache doesn't actually seem of much use given prior history and the 'usefulness' of parts that sport the same clockspeed and architecture but a fatter cache (i.e. very little difference.)

Case in point: my six-core, twelve-thread 3930K sports 50% more cache, 50% more cores, 50% more threads, 100% more main memory bandwidth, and 200% more PCI-E lanes, and yet basically equals or loses to a 2600K when talking strictly about games. Of course, when you throw in something that is "compute" related (H.264 encoding, raytracing, blah-de-blah) then the SB-E platform brings out the big guns and lays waste to the 2600K.

Single core with all that jazz? Not seeing it.
 
@Davros: NetBurst was a very interesting CPU architecture. Not a good one, but very interesting. Pushed to the limit, at a 4GHz base clock, it was running internally at an amazing 8GHz (!!). The problem is, the gain from the higher base clock (something like 33% if I remember correctly) was lost to the compromises (a 32-stage pipeline) required to reach such clocks. AMD has done the same with BD to raise its clock, bringing its pipeline to more or less the same length as the original NetBurst architecture (around 20-23 stages). A risky choice, considering the precedent, at least (but BD sucks hard because of the shared decoder, anyway).

AFAIK Willamette/Northwood had 28 stages, and AFAIK Bulldozer has only about 20 stages (the precise number is not stated).
And Prescott had 42 stages.

So Bulldozer's pipeline length is only "halfway from P6/K7 to Willamette and Northwood" and "1/3 of the way from P6/K7 to Prescott".

And the long pipeline was not the biggest/only problem with Willamette/Northwood's IPC; slow shifts, slow multiplications, and a small L1D cache were bigger IPC limiters.

On Prescott, with many of these fixed or improved and an even longer pipeline, the pipeline length really was the biggest IPC-reducer.

And the pipeline length of Bulldozer is about equal to the pipeline length of POWER7, the world's fastest microprocessor.
 
20 stages AFTER the trace cache, 8 stages before the trace cache (28 in total).
Counting the TC stages for the P4 is like counting L1I latency - it doesn't sound very fair.
NetBurst did indeed pay a high price for such clocks, but it was fun: a nice example I remember is that it needed an additional µop for INC vs ADD to mask out the flags, or the TC space-'borrowing' alchemy.

I'm not saying BD is slower because it has the same number of stages as the P4, but rather that it sounds like a trend reversal for x86. I'm sure AMD made sure that the 10/15% clock advantage more than covers the issues it brings (well, the same could have been said for Intel, so..). BD looks a bit like K10 to me - a caged, immense-firepower toy with a tiny entrance (K10 had a tiny exit, too).

POWER7... if x86 had a fixed instruction size, maybe with VLIW possibilities, more registers, an optimized instruction set, etc... but you can see where Itanium ended up - at AMD64. Compatibility wins, at large scale.
 
Look at the 1.5GHz data! Sebbi is on to something, I believe :)
Yes... a 4.5 GHz single core (with HT) already reaches 52 fps (very near the 60 fps cap), and a dual core (without HT) reaches the max. Adding more cores or threads does not help at all, since two beefy, high-clocked Sandy Bridge cores can already execute all the game's threads sequentially within the allocated 16 ms time slot (60 fps target).

At 1.5 GHz, however, you see good scaling: 1 core = 10.5 fps, 1 core with HT = 15.5 fps, 2 cores = 16.5 fps, 2 cores with HT = 23.3 fps, and 4 cores = 29.5 fps. You also see that the extra hardware threads provided by HT give very good gains at low core counts (1 core + HT = 93% of 2 cores, 2 cores + HT = 78% of 4 cores).

It seems that Skyrim is designed to run at 30 fps on lower end processors (and consoles). The 29.5 fps is almost exactly half of the 59.3 fps cap seen in high end benchmarks. Maybe the game detects the CPU clock speed and halves the cap if a low end CPU is detected (a very good idea, since a constant frame rate is always better than a fluctuating one). That's why there are no additional gains when going over 4 cores. You could try slightly increasing the CPU clocks and see when the game switches to 60 fps mode; 2 GHz would be a good round number, for example... The 3 GHz CPU scales up slightly from 4 -> 6 cores, so there's likely extra scaling to be discovered at lower clocks (as long as the frame cap is not lowered to 30 fps).
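
The frame-budget and scaling claims are easy to double-check with a few lines (a quick sketch; all of the fps figures are the ones quoted in this thread, and the 16.7 ms budget just follows from the 60 fps cap):

[code]
# Frame-time budget implied by each frame-rate cap.
for cap_fps in (60, 30):
    print(f"{cap_fps} fps cap -> {1000 / cap_fps:.1f} ms per frame")

# Quoted 1.5 GHz results: how close HT gets to an extra real core.
fps = {"1c": 10.5, "1c+HT": 15.5, "2c": 16.5, "2c+HT": 23.3, "4c": 29.5}
print(f"1c+HT reaches {fps['1c+HT'] / fps['2c']:.1%} of 2c")
print(f"2c+HT reaches {fps['2c+HT'] / fps['4c']:.1%} of 4c")
print(f"4c @ 1.5 GHz is {fps['4c'] / 59.3:.1%} of the 59.3 fps cap")
[/code]

That last line is why the halved-cap theory is tempting: 29.5 fps sits at almost exactly 50% of the 59.3 fps seen on the fast configs.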
 
What's wrong with real cores AND hyperthreading?

I'm sure my application would benefit from it.
If I can choose between 2c / 4t and 4c / 4t, then the obvious winner is the latter (given all other things are equal.) If you can get both, then more power to you!

My overclocked 3930k has no problem obviously crushing this game and anything else I throw at it, but I leave HT turned on regardless. It certainly helps when transcoding all the videos of my daughter...
 
Counting the TC stages for the P4 is like counting L1I latency - it doesn't sound very fair.

Calculating P4 pipeline stages without counting the decode stages, but counting the decode stages on other processors, is even less "fair".


The "corrent" way of comparing these pipeline legths would be calculating the maximum sequential gate count of pipeline stages, ie. how long is one stage, not total amount of the pipeline stages. And there P4 is clearly longer than bulldozer, and bulldozer is very close to power7.
 
If I can choose between 2c / 4t and 4c / 4t, then obvious winner is the latter (given all other things are equal.) If you can get both, then more power to you!
But the reality is that HT costs minimal die space, while doubling the core count almost doubles the die area (and raises power draw dramatically). You should be comparing 2c/2t with 2c/4t (and 4c/4t with 4c/8t), because those designs are similar in power draw, transistor count and manufacturing cost.

Comparing 2c/4t directly with 4c/4t is not a fair comparison, because the 2c/4t CPU is much cheaper to produce and consumes considerably less power (half the cores, half the execution units, etc).

HT is a good way to get extra performance efficiently (only minimal extra transistors needed, and it doesn't raise power consumption that much). A processor without HT (or a similar SMT technology) wastes lots of CPU cycles doing nothing. This happens because it's very hard to keep long pipelines filled with instructions from only one thread; the parallelism that can be extracted from a single thread (ILP) is just not enough. A processor with HT can fill the empty slots of the pipeline (which would otherwise just execute NOPs) with instructions from another thread. That's basically free performance (without having to add any new execution resources).
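
The "filling the empty slots" point can be shown with a toy model (purely hypothetical stall probability and a single issue slot per cycle - nothing like a real core, just the intuition):

[code]
import random

# Toy SMT model: each cycle a thread is either ready or stalled (e.g. on a
# cache miss). With one thread, a stall wastes the issue slot; with two
# threads sharing the core, the other thread can often use it instead.
random.seed(0)
CYCLES = 100_000
P_STALL = 0.4   # hypothetical chance that a given thread is stalled

def slot_utilization(threads):
    used = sum(
        any(random.random() > P_STALL for _ in range(threads))
        for _ in range(CYCLES)
    )
    return used / CYCLES

print(f"1 thread : {slot_utilization(1):.0%} of issue slots used")
print(f"2 threads: {slot_utilization(2):.0%} of issue slots used")
[/code]

With the made-up 40% stall rate you get roughly 60% vs 84% utilization - the second thread mostly recycles cycles the first one would have wasted, which is the "almost free" part.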

HT is most important for CPUs that have a low core count to begin with. Good examples are those 17W dual core Sandy Bridge CPUs (found in Ultrabooks and the MacBook Air). With good turbo clocks these CPUs can handle single-threaded situations pretty well, and with HT they can also handle four-threaded code pretty well. Without HT these ULV processors would really crumble in code that is designed for highly multithreaded execution. Of course, if you already have four or six real cores (at high desktop clock rates), adding HT doesn't help that much in most games and applications (which are often designed for four-core execution).
 
Calculating P4 pipeline stages without counting the decode stages, but counting the decode stages on other processors, is even less "fair".
Well, non-P4 processors typically do not have a trace cache, whose whole purpose is essentially to take the decoder stages out of the equation, from what I've read.

I could be wrong though, it's been known to happen. :D

The "corrent" way of comparing these pipeline legths would be calculating the maximum sequential gate count of pipeline stages, ie. how long is one stage, not total amount of the pipeline stages. And there P4 is clearly longer than bulldozer, and bulldozer is very close to power7.
Sounds like an entirely pointless metric IMO, since Bulldog has crap performance regardless of how many or few gates comprise its pipeline.

Absolute numbers mean nothing; it's the real-world performance that counts.
 
I agree with and already knew everything you've said, and yet it does not invalidate what I said. Which was, specifically, that two physical cores (regardless of HT implementation) will perform worse than four physical cores -- all else being equal. Yes, it's obvious. I know. I was responding to the "one core, four threads" concept brought up earlier regarding the building of some conceptual high-performance 'single core' CPU with a fat AVX unit and 64MB of L2 cache.

And the whole "well they aren't equal if one has HT, Duuuhhhrr" response is not necessary, because "all ELSE being equal" has a unique defining term that I've helpfully highlighted for you. Hyperthreading would be the unique delta in this case; all ELSE would be the same. If you want to argue this, consider that HT / non-HT Intel processors on the SB platform are identical minus a capabilities fuse that was blown. Thus, an HT versus non-HT core is indeed EQUAL, except for the HT being enabled. See? All else? That else part, yeah, it's equal :D

So, carry on :)
 