Why is Nehalem not better than Yorkfield for games?

Discussion in 'PC Gaming' started by MTd2, Oct 10, 2008.

  1. MTd2

    Newcomer

    Joined:
    May 13, 2004
    Messages:
    212
    Likes Received:
    0
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Some possible factors are the immaturity of the platform, a limit to how well the games scale with multiple cores and threads, and the revamped and higher-latency cache structure.

    The massive L2 on Yorkfield can really benefit games whose working set fits in it. On Nehalem, a similar working set will not fit in the tiny per-core L2s and will spill into the much slower L3.
    A percentage point or two of performance might be lost on the higher latency L1 as well.

    Some of these changes, like the L1 latency, remove some stumbling blocks to higher clocks in the future.
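
    As an illustration (a sketch added here, not part of the original post), a pointer-chasing sweep like the following makes the working-set effect visible: time per access stays flat while the chain fits in a fast cache level and steps up once it spills into a slower one. The specific sizes in the loop are arbitrary examples.

        #include <algorithm>
        #include <chrono>
        #include <cstdio>
        #include <numeric>
        #include <random>
        #include <vector>

        // Average latency of a dependent (pointer-chasing) load chain over a working
        // set of the given size. The chain is one random cycle over the whole array,
        // so hardware prefetchers get little help and raw cache latency dominates.
        static double ns_per_access(size_t bytes) {
            const size_t n = bytes / sizeof(size_t);
            std::vector<size_t> order(n), next(n);
            std::iota(order.begin(), order.end(), size_t{0});
            std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
            for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
            next[order[n - 1]] = order[0];                 // close the cycle

            size_t idx = 0;
            const size_t steps = 20000000;
            auto t0 = std::chrono::steady_clock::now();
            for (size_t i = 0; i < steps; ++i) idx = next[idx];   // each load depends on the previous one
            auto t1 = std::chrono::steady_clock::now();
            if (idx == n) std::puts("");                   // keeps the compiler from dropping the loop
            return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        }

        int main() {
            // Sweep working sets around the cache sizes discussed above.
            for (size_t kb : {128u, 256u, 512u, 1024u, 2048u, 4096u, 6144u, 8192u, 32768u})
                std::printf("%6u KB: %.1f ns per access\n", unsigned(kb), ns_per_access(size_t(kb) * 1024));
        }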
     
  3. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Hm, looks like SMT is still able to wreak havoc on performance, just as in the NetBurst days.
     
  4. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Correct. Small, per-core L2s plus the addition of an L3 whose latency is high relative to the L1 and L2 latencies.

    There may have been a few apps that didn't "play nice" with HT years ago, but this is hardly the case anymore.
     
  5. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    The Windows XP scheduler (and I assume Vista) is HT-aware, trying to avoid any such situation.
    The performance problems with HT were therefore mostly seen in Windows 2000, which was still the most popular OS at the introduction of the P4 HT.
    So indeed, that issue is dead now, because I don't think anyone will run Windows 2000 on their Core i7.
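
    For illustration (a sketch added here, not from the original post), this is the kind of manual affinity workaround an application could use before HT-aware schedulers: two workers pinned to logical CPUs on different physical cores, assuming logical CPUs 0/1 and 2/3 are HT siblings. Real code would query the topology (e.g. via GetLogicalProcessorInformation) instead of assuming the mapping.

        #include <windows.h>
        #include <process.h>
        #include <cstdio>

        unsigned __stdcall worker(void* arg) {
            std::printf("worker %u running\n", (unsigned)(UINT_PTR)arg);
            // ... do work ...
            return 0;
        }

        int main() {
            HANDLE t0 = (HANDLE)_beginthreadex(nullptr, 0, worker, (void*)0, CREATE_SUSPENDED, nullptr);
            HANDLE t1 = (HANDLE)_beginthreadex(nullptr, 0, worker, (void*)1, CREATE_SUSPENDED, nullptr);

            // One logical CPU from each physical core, so the two threads never
            // compete for execution resources the way two HT siblings would.
            SetThreadAffinityMask(t0, 1 << 0);   // logical CPU 0 (assumed core 0)
            SetThreadAffinityMask(t1, 1 << 2);   // logical CPU 2 (assumed core 1)

            ResumeThread(t0);
            ResumeThread(t1);
            WaitForSingleObject(t0, INFINITE);
            WaitForSingleObject(t1, INFINITE);
            CloseHandle(t0);
            CloseHandle(t1);
        }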
     
  6. Berek

    Regular

    Joined:
    Oct 17, 2004
    Messages:
    271
    Likes Received:
    4
    Location:
    Houston, TX
    Nehalem is essentially optimized for computational acceleration, database processing and multithreaded apps. Each core gets its own, smaller L2 cache, so you're going to take a hit there in some cases. Even the L3 level has higher latency, which affects performance.

    And as mentioned, HT is aimed at multithreaded everything-but-games; games make little use of it at this time. When games start to seriously take advantage of multiple cores, and Intel adds a few more MB of cache, Nehalem will make a serious difference.

    It's a shame really that they didn't do this in some cases, but it doesn't matter much considering today's CPUs are good enough for today's games. If a game happens to support multithreading in a serious way, then you will see some performance benefit. I'm not sure which games do that effectively, though...
     
  7. Wesker

    Newcomer

    Joined:
    May 3, 2008
    Messages:
    103
    Likes Received:
    0
    Location:
    Australia
    Since when were games the only measure of real performance?

    Games benefit greatly from large, low-latency caches (L2 typically being the sweet spot) and short pipelines in CPU architectures.

    Except you missed the fact that SMT/Hyper-Threading was never one of NetBurst's problems.

    If anything, SMT/HT was one of the only good things to come out of the NetBurst fiasco (other things being NetBurst's branch predictors and process node optimisations that Intel needed to sustain NetBurst's high clocks).
     
  8. MTd2

    Newcomer

    Joined:
    May 13, 2004
    Messages:
    212
    Likes Received:
    0
    I raised the topic specifically about games, so real games are the actual measure of performance here, not Vantage or other synthetic benchmarks.
     
  9. igg

    igg
    Newcomer

    Joined:
    May 16, 2008
    Messages:
    63
    Likes Received:
    0
    I think AnandTech had a good article about that:
    Source
     
  10. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think that Anandtech article is nonsense really. It's a very naive and ignorant view of game engines.
    You can't just sweep all games on one big pile, and you can't make sweeping claims about how cache and integer performance are the only things that matter.
    The integer claim in particular is rather silly, since some of the most computationally intensive tasks in a game today, such as character animation, pathfinding, physics and determination of the potentially visible set, are mostly done in floating point.
     
  11. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    I think Johan De Gelas has a somewhat better understanding of CPUs than you (or I, for that matter).

    While I agree that sweeping generalizations are not good practice, you have to understand he is only doing so for the sake of brevity, in an attempt to make things easy to understand for the layman audience.
     
  12. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    We're talking about game engines, not CPUs. I'm quite sure I know more about that than Johan De Gelas does. His statements don't exactly reflect a deeper understanding of game engines at any rate, as I already addressed in my previous post.
    That part wasn't exactly in-depth as far as CPU design goes anyway; it didn't go deeper than basic cache size and latency (which still leaves at least one big unknown: the associativity).

    What I see is laymen on forums everywhere shouting that Nehalem sucks for games because it has a small cache. Articles like these are to blame for that.
    I'm getting rather tired of it.
     
  13. ChronoReverse

    Newcomer

    Joined:
    Apr 14, 2004
    Messages:
    245
    Likes Received:
    1
    If the new chip doesn't exceed the previous one, especially for the cost, then in practice it does "suck" for games. While this remains to be seen, it's still a concern for gamers.
     
  14. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Send the guy an email.

    As I said, it was dumbed-down for the audience.

    I won't make any official "calls" yet because Core i7 has yet to be released and reviewed, but in theory he is correct. Shrinking the L2 down to a minuscule 256KB per core isn't going to help single-threaded IPC, and adding a third level of cache with relatively high latency isn't going to help either. These changes were made to enhance multithreaded performance. No use arguing against it; these are commonly held precepts.
     
  15. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Can't be bothered really.

    If you assume that all other factors are equal... But they aren't.
    For example, Conroe has the same total L2 cache size as Presler. It also runs on the same chipset, so memory bandwidth, FSB and all that are equal.
    But because of a completely different caching strategy, Conroe performs much better with the same L2 size.

    In most benchmarks Nehalem actually demonstrates better single-threaded IPC, so apparently it's not as simple as just looking at cache size. Nehalem most probably has better prefetching strategies than Core2 had... and in the case of a cache miss, Nehalem has an onboard memory controller with far more bandwidth and far lower latency than Core2 has.

    Now I'm not saying that Nehalem actually will be the better gaming CPU in all cases, but I am saying this:

    1) We've not seen enough benchmarks to really draw any conclusions yet. We've seen a handful of games, some performing better on Nehalem, some worse. Perhaps the majority actually does run faster, and these few were the exception to the rule.
    2) There could be other reasons than cache.
    Firstly, we've not seen actual benchmarks of final hardware yet. BIOS and chipset might not be fine-tuned yet, which could affect game performance as well (perhaps because of suboptimal PCIe bus performance, or the fact that the hardware now has to do memory transfers through the CPU's memory controller rather than directly via the northbridge).
    Secondly, it could be that the videocard drivers don't work properly with Nehalem for whatever reasons (perhaps they don't make proper use of the 8 logical cores, adversely affecting performance... or perhaps Core i7 requires different optimization strategies).

    Let's just wait and see...

    I can only say that I find it rather strange that Nehalem seems to win in all benchmarks but games. I find it highly unlikely that games are the only software among these benchmarks that happen to benefit from the large Core2 caches.
    Also, if you look at AMD's Phenom... it has a similar cache hierarchy to Nehalem... however, Phenom doesn't seem to display such a discrepancy between games and other applications. The difference in IPC between Core2 and Phenom is pretty much constant regardless of the application... Why would Core i7 be so different, and why just games?
    At least with Pentium 4 it was obvious... The architecture was completely different, and worked okay in some applications but was very inappropriate for others. Core i7 is not that different from Core2 and Phenom.
     
    #15 Scali, Oct 10, 2008
    Last edited by a moderator: Oct 10, 2008
  16. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Presler's cache is actually 2x2MB (2MB per core), not a shared 4MB.

    Yes, the shared cache helps, but it is (up to) double the size per core, which of course also helps. Penryn's performance improvements over Conroe are due in large part to the 50% larger cache (4MB->6MB).

    I can't say I agree with this. The single-threaded benchmarks in which Nehalem has been demonstrated to outperform Penryn are largely bandwidth-bound, and the tri-channel DDR3 with its massive ~20GB/s bandwidth is solely responsible for the performance gains in these benchmarks. Other than that the only gains demonstrated thus far have been from multi-threaded benchmarks.

    As I said previously, we still don't have official numbers on Nehalem, so I'll leave it at that.
     
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Data prefetching and load/store ops in the Conroe/Penryn architecture are much more aggressive than anything from AMD to date and anything from Intel before it, so a large, fast L2 is much better utilized there. On top of this, the inclusive nature of the hierarchy in Intel's case doesn't impose an additional cache-line write penalty.

    In Nehalem's case, sharing data between threads/cores all goes through the L2 shadow copies in the L3 array, so MT-intensive workloads are more likely to be limited by the narrow, high-latency shared L3 interface than by the tiny but fast L2s. Any performance improvement across the broad desktop application base is to be credited to getting rid of the FSB limitation.

    Anyway, this thread should be relocated to the proper forum section.
     
    #17 fellix, Oct 10, 2008
    Last edited by a moderator: Oct 10, 2008
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I say 2x2 MB = 4 MB, hence same size. The point was exactly that: different cache architectures can give different performance characteristics, even with the same size. So size in itself doesn't mean that much considering the fact that Nehalem's hierarchy is completely different from Core2.
    So you can try to juggle with some sizes and latencies, but that really doesn't mean anything. You'd have to consider each application separately to see how it behaves on the different architectures, because each application has a different working set of data, different access patterns etc.
    Which also makes generalizations like "games want a large L2 cache" rather suspect.
    That would assume that all games basically have the same algorithms and datasets. Which we know for a fact is not true.

    If all games perform poorly, then there must be some common ground between all of them. Chipset or driver inefficiencies make a lot more sense. That also ties in with the fact that none of the other benchmarks are I/O-bound, so those sorts of factors wouldn't affect their results; you'd get more of a synthetic CPU test, without factoring in the rest of the system.

    I wasn't talking about performance gains so much as the fact that it doesn't seem to be slower at anything.
    Games are the only thing where it actually appears slower in some cases. Which is suspect, if it is as fast or faster than a Core2 in EVERY other benchmark. Yes it gets a larger gain when more memory bandwidth or multithreading is involved, but still, even the single-threaded stuff is no problem for Nehalem.
    A well-known single-threaded, cache-heavy benchmark like SuperPi runs just great on Nehalem. How is that possible, if that is exactly Nehalem's weak spot?
     
  19. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,457
    Likes Received:
    580
    Location:
    WI, USA
    Are you sure SuperPi is cache-heavy beyond ~256KB?

    A Nehalem core has less L2 cache than any recent performance CPU from the company. You have to go back to the Coppermine/Willamette cores to see a 256KB cache on a top part. That definitely throws up some warning flags I'd say. I think it's an obvious concession to keep transistor count down on a part that has to be very competitive in cost because of the market it is entering.

    The focus is definitely to win big in multithreaded apps, and they appear to do so. Bring the memory controller onboard and add that L3, both to dramatically improve core-to-core communication. Add HT to extract even more from each core in such apps. Single-threaded apps look to have been almost ignored in comparison. Maybe the designers expected us to have more heavily parallelized software by now.
     
    #19 swaaye, Oct 10, 2008
    Last edited by a moderator: Oct 10, 2008
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    How irregular are the accesses in SuperPi?
    Some games might not hit the prefetch logic well enough to mask L2 to L3 latency.
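
    A rough way to see how much the prefetch logic hides (an illustrative sketch added here, not from the original post): sum the same array once in streaming order, which the hardware prefetcher can follow, and once in shuffled order, which it cannot. The gap between the two runs is roughly the latency the prefetcher was masking; the array size below is an arbitrary example chosen to exceed any L2 discussed in this thread.

        #include <algorithm>
        #include <chrono>
        #include <cstdio>
        #include <numeric>
        #include <random>
        #include <vector>

        int main() {
            const size_t n = 1 << 22;                 // 16 MB of ints: larger than any L2 here
            std::vector<int> data(n, 1);
            std::vector<size_t> order(n);
            std::iota(order.begin(), order.end(), size_t{0});

            auto timed_sum = [&](const char* label) {
                long long sum = 0;
                auto t0 = std::chrono::steady_clock::now();
                for (size_t i : order) sum += data[i];
                auto t1 = std::chrono::steady_clock::now();
                std::printf("%s: %.1f ms (sum=%lld)\n", label,
                            std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
            };

            timed_sum("sequential (prefetch-friendly)");     // streaming pattern
            std::shuffle(order.begin(), order.end(), std::mt19937_64{1});
            timed_sum("random (prefetch-hostile)");          // same bytes, irregular pattern
        }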
     