I Can Hazwell?

Discussion in 'PC Industry' started by Grall, Nov 9, 2011.

  1. Alexko

    It's flowing through me too, in the exact same way.
     
  2. sebbbi

    SSE3 is supported by over 99% of the current installed CPU base according to the latest Steam Survey, so supporting at least that is a no-brainer. I don't think many current generation games are using AVX, since most games are developed first for current generation consoles, and those do not support vectors wider than 128 bits. PC CPUs are so much faster than current console CPUs that the extra work and version management for going beyond SSE3 doesn't pay off. It is much better to spend that time optimizing the GPU stuff (as the draw call / API overhead on PC is still an issue compared to consoles).

    I think the biggest single improvement is that AVX is supported by both AMD and Intel. SSE 4.X(a) had many different versions that weren't fully compatible with each other. Jaguar also supports AVX, and is the CPU in PS4. This is good news for PC gaming, since games will be AVX optimized already for the console. No extra work is needed to support it.

    Compared to SSE4.X, 128 bit AVX has some extra instructions such as broadcasts and masked moves. These reduce the number of memory instructions in some cases. Also the VEX prefix allows AVX to write the result to a separate register (nondestructive operation), reducing register pressure (and the number of extra move operations). Both of these things are actually very good for Jaguar, since unlike Sandy/Ivy Bridge, AMD CPUs do not have "free" moves by register renaming. Also Jaguar can only sustain two uops per cycle. All the extra moves and extra shuffles take away slots that could be used for doing real work (adds and multiplies). AVX helps with that.

    256 bit AVX on Jaguar: That's an interesting question that is not yet answered by AMD (as far as I know). Running 256 bit AVX on Bulldozer doesn't help at all. But Bulldozer has a separate shared vector pipeline, so that might yield slightly different results. Bobcat splits the 128 bit vector instructions into two 64 bit instructions in the decoder. 128 bit operations take two cycles to decode (according to Agner Fog's analysis) and are two separate instructions for the rest of the pipeline. So in the case of Bobcat, 128 bit (vs 64 bit) only helps by reducing instruction cache usage. I don't see the instruction cache being a bottleneck for Jaguar (it has very good L1 caches). Let's wait for the first Jaguar benchmarks (and Agner's analysis). They shouldn't be far away (since there are already some leaked Temash tablet benchmarks around the net).
     
  3. Novum

    No TSX for K models apparently. Seriously Intel? What the hell?
     
  4. pcchen

    Intel does have some very weird ideas about market segmentation.
    However, in the case of TSX, at least one of its features (Hardware Lock Elision) is backward compatible (i.e. code written to use it runs fine on older CPUs, just without the benefit).
     
  5. mczak

    BD does have free moves so it's not just intel cpus - only for xmm regs though not ymm (I guess it's not only easier for the implementation, it actually probably makes sense since you rarely need moves with avx anyway). Well the moves aren't entirely free since you still got the uops moving around, but that's the same as Sandy Bridge (only Ivy Bridge can do better). Not sure what Jaguar will do there though, Bobcat certainly wasn't as advanced.

    As for SSE3, I'm not convinced it's used. It may be supported on more than 99% of all cpus, but really the additional instructions are so minor (float horizontal add/sub and that's about it) that you could just as well take care of the remaining 1% of all cpus by using SSE2 only. SSSE3 is way more interesting (byte shuffle for instance), as is SSE4.1. But support for those is less widespread.
     
    #225 mczak, Mar 25, 2013
    Last edited by a moderator: Mar 25, 2013
  6. rpg.314

    They are used for complex mul mostly. Hardly useful for games.
     
  7. Gubbi

    Moves take up decode bandwidth, but should be otherwise completely free on all physical register file OOOe implementations since the move is completely resolved in the renaming stage.

    Bobcat and Jaguar are physical register file OOOe machines too, but given the narrow decoder, moves probably have an impact.

    Cheers
     
  8. John Reynolds

    Start8 from Stardock.com fixes that for $5. I'd still be using Win7 if it weren't for this little app.
     
  9. Albuquerque

    The "K" series have always been missing certain specific features to keep them out of serious production systems. Have a look at the i7-2600, i7-2600k, i7-3770, and i7-3770k...
    http://ark.intel.com/compare/52213,52214,65719,65523

    Both "k" parts are missing VT-d and TXT. This new "subtraction" of TSX doesn't seem much different to me.
     
  10. Exophase

    It probably doesn't seem different to the Intel management that made that decision either, but it's a lot different. TSX, unlike VT-d and TXT, is something that can apply to a wide variety of software but needs actual programming effort to utilize. But it's a lot harder to motivate software developers to do this when a lot of their userbase won't have access to it and can't test it.

    Only Ivy Bridge implemented this optimization, meaning on Sandy Bridge moves still flowed through the execution units (and of course Netburst uarchs did as well). Bulldozer only has it for SSE moves. I don't think it's necessarily completely free to allow multiple architectural registers to map to the same physical registers. Would not count it as a given on Bobcat and Jaguar.
     
  11. Gubbi

    That's really odd, I'd consider this very low hanging fruit.

    Cheers
     
  12. Albuquerque

    I'm not disagreeing, just stating the obvious.
     
  13. 3dilettante

    There are at least two non-K SKUs that appear to have TSX disabled as well, at least according to Tom's Hardware.

    That's not including any omissions at i3 and below that might turn up eventually. I'm with most commentators in that I don't see the upside to fragmenting things like this, even though I'm not certain TSX will do much at the core counts and with the typical software that consumer SKUs are concerned with.
     
  14. rpg.314

    One possibility: they are locking it out of chips that might be used as Xeon replacements, and will unlock it for Xeons?
     
  15. sebbbi

    Yes, however Sandy/Ivy core can decode four instructions per cycle. Bulldozer/Piledriver (and Bobcat/Jaguar) can only decode two instructions per cycle (per core). Sandy/Ivy thus have plenty of free decode slots available for decoding the extra moves that will be eliminated by the register renaming mechanism. The wider Intel cores should benefit more from this feature compared to narrow AMD cores.
    I agree that both SSSE3 and SSE4.1 are more interesting than SSE3, but SSE4.1 has only 62% hardware coverage (source: Steam Survey). SSE4.1 is not a good baseline (if you are targeting only a single instruction set).

    SSE3 horizontal operations are handy, for example, in a dot product implementation (dot = mul + 2 x horizontal add, or 2 x dot = 2 x mul + 2 x horizontal add). In SSE2 a single dot product costs you six instructions (mul + 2 x add + 3 x shuffle). Games ported from Xbox 360 tend to use (AoS) vector dot products, because dot products are very fast on the Xbox 360 CPU (single cycle throughput rate).

    According to Steam Survey SSE3 has a 99.4% coverage, while SSE2 has 99.8% (0.4% difference). 0.4% is not a valid reason to choose SSE2. Unless you want to have dot products that require 2x-3x more instructions... or are calculating everything using SoA layout... but that seems to be something that gameplay programmers are not willing to do. You give them a good optimized vector class and that's the lowest level abstraction they are going to use. SoA vector batch processing is only used by low level engine programmers (as far as my experience goes).
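For readers who want to see the data flow behind the dot product counts discussed above, here is a minimal lane-level sketch in Python. It is only a model: the helper names (`mul`, `add`, `hadd`, `shuffle`) are hypothetical stand-ins that mimic what MULPS, ADDPS, HADDPS and a generic lane permute do to four float lanes. They are not real intrinsics, and they say nothing about latency, throughput, or the register copies that two-operand SSE code needs, so the exact instruction count of real code may differ from the model.

```python
# Lane-level model of 4-wide float vectors as plain Python lists.
# Illustration only: these helpers imitate instruction semantics, not performance.

def mul(a, b):
    # Like MULPS: lane-wise multiply.
    return [x * y for x, y in zip(a, b)]

def add(a, b):
    # Like ADDPS: lane-wise add.
    return [x + y for x, y in zip(a, b)]

def hadd(a, b):
    # Like SSE3 HADDPS: [a0+a1, a2+a3, b0+b1, b2+b3].
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def shuffle(a, order):
    # Generic lane permute of one vector (a stand-in for SHUFPS with both sources equal).
    return [a[i] for i in order]

def dot_sse3(a, b):
    # mul + 2 x horizontal add, as described in the post.
    m = mul(a, b)
    h = hadd(m, m)        # [m0+m1, m2+m3, m0+m1, m2+m3]
    return hadd(h, h)[0]  # every lane now holds the full sum

def two_dots_sse3(a, b, c, d):
    # 2 x mul + 2 x horizontal add yields two dot products at once.
    h = hadd(mul(a, b), mul(c, d))
    r = hadd(h, h)        # [dot(a,b), dot(c,d), dot(a,b), dot(c,d)]
    return r[0], r[1]

def dot_sse2(a, b):
    # Shuffle-and-add reduction using only SSE2-style operations.
    m = mul(a, b)
    s = add(m, shuffle(m, [1, 0, 3, 2]))  # [m0+m1, m0+m1, m2+m3, m2+m3]
    s = add(s, shuffle(s, [2, 3, 0, 1]))  # every lane now holds the full sum
    return s[0]

if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
    print(dot_sse3(a, b), dot_sse2(a, b))
```

The paired version shows why horizontal adds look attractive on paper for AoS code: the second HADDPS is shared between two dot products. Whether that pays off on real hardware depends on how the instruction is implemented internally.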
     
  16. Exophase

    That's probably the rationale, and it at least fits with other feature segmentation like VT-d, TXT, AESNI, ECC, etc., but if they're looking at TSX as an enterprise-class only feature then they're not positioning it well, IMO.
     
  17. mczak

    Well I agree those instructions could be handy. But I still have some doubts that they are really all that useful - personally I've been able to avoid them whenever I first thought they'd be useful (actually that's not quite true but almost). And even if they are a perfect fit for your code (such as that AoS dot product) the performance benefit is most likely close to nonexistent, because the internal implementation is apparently exactly that: ordinary add + shuffle. They generate tons of uops, have high latency and crappy throughput, e.g. Wolfdale lists this as 3 uops, latency 7, throughput of one every two clocks. And it's typically worse for AMD, where it looks like you could actually do better by doing the add+shuffle manually for some reason, at least with Bulldozer (at least it's not the same order of fail as SSE4.1 DPPS on BD, which generates 16! uops and definitely looks like you could always do much better manually).
    So those instructions probably don't help as much as you'd think they do by just looking at the instruction count - they look much better on paper than they are. And the workarounds (using shuffles) really are quite trivial, in contrast to the instructions you get with ssse3/sse41 (emulating byte shuffles by hand is hilarious for instance, emulating rounding correctly tricky at best etc.).
    Oh and while here I'd like to bring up the other rare SSE3 instruction, lddqu, a band-aid specifically invented for helping the P4 because its movdqu implementation was simply unbearable, and completely useless on any other cpu...
     
  18. max-pain

    Crysis 3 only supports DX11 GPUs and only 58% of Steam users have that. That is lower than SSE4.1's share. Most next-gen games/ports will require a decent DX11 GPU (and a decent CPU). Every Intel CPU (that isn't low-end) from the past 5 years supports SSE4.1. But (not so) old AMD CPUs could be a problem...
    What we know (Steam):
    May 2012: SSE4.1 - 52.66%, SSE4.2 - 38.56%, DX11 GPUs - 45.83%
    November 2012: SSE4.1 - 59.94%, SSE4.2 - 46.70%, DX11 GPUs - 55.50%
    February 2013: SSE4.1 - 62.06%, SSE4.2 - 50.08%, DX11 GPUs - 58.32%
    Prediction:
    November 2013: SSE4.1 - ~70%, SSE4.2 - ~60%, DX11 GPUs - ~70% (when next-gen consoles launch)
    May 2014: SSE4.1 - ~75%, SSE4.2 - ~70%, DX11 GPUs - ~75%
    Next-gen games/ports could require SSE4.x and I think they should.
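As a rough cross-check of the prediction above, the quoted Steam Survey figures can be extended with a straight-line fit. This is a sketch under a loud assumption: adoption keeps growing linearly at the May 2012 to February 2013 rate, which ignores saturation and any console-launch effect, so it is not a forecast. The `SURVEY` table only restates the numbers from the post; `fit_line` and `extrapolate` are hypothetical helper names introduced for this example.

```python
# Steam Survey figures quoted in the post: (months since May 2012, percent of users).
SURVEY = {
    "SSE4.1":    [(0, 52.66), (6, 59.94), (9, 62.06)],
    "SSE4.2":    [(0, 38.56), (6, 46.70), (9, 50.08)],
    "DX11 GPUs": [(0, 45.83), (6, 55.50), (9, 58.32)],
}

def fit_line(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate(points, month):
    """Value of the fitted line at `month`, capped at 100 percent."""
    slope, intercept = fit_line(points)
    return min(100.0, slope * month + intercept)

if __name__ == "__main__":
    # Month 18 is November 2013 and month 24 is May 2014, counted from May 2012.
    for name, points in SURVEY.items():
        nov_2013 = extrapolate(points, 18)
        may_2014 = extrapolate(points, 24)
        print(f"{name}: Nov 2013 ~{nov_2013:.0f}%, May 2014 ~{may_2014:.0f}%")
```

With only three data points per series the fit is fragile, and a straight line will tend to overshoot as coverage approaches its ceiling, so the output is best read as an upper-ish bound next to the hand estimates in the post.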
     
  19. Silent_Buddha

    Crysis 3 is an outlier in that they aren't interested in selling a game to as many people as possible. They are more interested in pushing the tech as far as possible.

    Most companies take the Blizzard approach of trying to sell to the largest audience possible.

    But yes, the new consoles will change that dynamic, as the largest pool of people will then consist of PS4/Xbox next/PC, which means Dx11 class features. But don't be surprised if you still get Dx9 class games with Dx11 added if the developer/publisher wants to target PS3/X360/older PCs in addition to PS4/Xbox next/Dx11 PCs.

    Regards,
    SB
     
  20. Grall

    That...is a sexy piece of hardware pr0n. Wow. Haswell is the first non-rectangular (4-core) Core i-series CPU ever. That GPU looks to be absolutely massive, unless Intel re-jigged the whole layout of the chip.

    In previous i-series chips, CPU cores were lined up in a row with L3 beneath them and the GPU tacked on to the side. Now I would assume that the GPU sits on the opposite side of the L3 compared to the cores, thus filling the chip out into a square-ish shape. That'd mean 50+ percent of the die is GPU... Ugh! :D

    The off-chip die is fairly large. I wonder what geometry it is manufactured on; I'd assume something coarser than 22nm, seeing as DRAM is quite frugal with power and older fabs are cheaper to run... Anyhow, damn nice piece of kit. I'm all hot and bothered now! :razz:
     
