Haswell vs Kaveri

Discussion in 'Architecture and Products' started by AnarchX, Feb 8, 2012.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    (Slightly OT, continued from my above post)

    I did some extra testing with the GCN ROP caches. Assuming the mobile versions have ROP caches the same size as my Radeon 7970's (128 KB), it looks like rendering in 128x128 tiles could help these new mobile APUs a lot, since APUs are very much BW limited.
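    (To put a number on that, with the 4x16f HDR target discussed just below; a quick check, nothing more:)

    ```cpp
    // 4x16f (RGBA16F) = 8 bytes per pixel, so one 128x128 tile is exactly 128 KB,
    // i.e. one tile's worth of blending fits the assumed ROP cache size.
    constexpr int kTileBytes = 128 * 128 * 8;
    static_assert(kTileBytes == 128 * 1024, "128x128 RGBA16F tile == 128 KB");
    ```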

    Particle rendering has a huge backbuffer BW cost, especially when rendering HDR particles to a 4x16f backbuffer. Our system renders all particles using a single draw call (the particle index buffer is depth sorted, and we use premultiplied alpha to achieve both additive and alpha-channel blended particles simultaneously). It is actually over 2x faster to brute-force render a draw call containing 10k particles 60 times to 128x128 tiles (moving the scissor rectangle across a 1280x720 backbuffer) than to render it once (single draw call, full screen). And you can achieve these kinds of gains in 15 minutes (it's just a brute-force hack). With a little bit of extra code, you can skip particle quads (using a geometry shader) that do not land on the active 128x128 scissor area (and save most of the extra geometry cost). This is a good way to reduce the particle overdraw BW cost to zero: a 128x128 tile is rendered (alpha blended) completely inside the GPU ROP cache. It is an especially good technique for low-BW APUs, but it helps even the Radeon 7970 GE (with a massive 288 GB/s of BW).
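    Roughly what the 15-minute brute-force version boils down to, as a sketch (not our actual code; it assumes D3D11, a bound rasterizer state with ScissorEnable = TRUE, and a hypothetical drawAllParticles() helper that issues the single particle draw call with everything else already set up):

    ```cpp
    #include <algorithm>
    #include <d3d11.h>

    // Hypothetical engine hook: issues the single depth-sorted, premultiplied-alpha
    // particle draw call with all shaders, blend state and buffers already bound.
    void drawAllParticles(ID3D11DeviceContext* ctx);

    // Brute-force tiling: sweep a 128x128 scissor rectangle across the backbuffer
    // and re-issue the same draw call once per tile, so each tile's blending stays
    // resident in the ROP cache instead of going out to memory.
    void drawParticlesTiled(ID3D11DeviceContext* ctx, int width, int height)
    {
        const int kTile = 128;
        for (int y = 0; y < height; y += kTile)
        {
            for (int x = 0; x < width; x += kTile)
            {
                D3D11_RECT scissor;
                scissor.left   = x;
                scissor.top    = y;
                scissor.right  = std::min(x + kTile, width);
                scissor.bottom = std::min(y + kTile, height);
                ctx->RSSetScissorRects(1, &scissor); // rasterizer state needs ScissorEnable = TRUE

                drawAllParticles(ctx);               // e.g. 10k particles, 60 iterations at 1280x720
            }
        }
    }
    ```

    The geometry-shader refinement mentioned above would then reject quads whose screen-space bounds miss the currently active scissor rectangle, so the repeated passes don't pay the full geometry cost each time.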

    With this technique, soft particles gain even more, since the depth texture reads (a 128x128 area per tile) fit in the GCN 512/768 KB L2 cache (and become BW-free as well). Of course Kepler-based chips should see similar gains (but I don't have one for testing).
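    (Same back-of-the-envelope check for the depth reads, assuming a 32-bit depth format, which is an assumption rather than something stated above:)

    ```cpp
    // Depth samples touched while shading one 128x128 tile, assuming 4 bytes each.
    constexpr int kDepthTileBytes = 128 * 128 * 4;              // 64 KB
    static_assert(kDepthTileBytes < 512 * 1024, "well within a 512 KB GCN L2");
    ```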

    If techniques like this become popular in the future, and developers start to spend a lot of time optimizing for modern GPU L2/ROP caches, it might make larger GPU memory pools (such as the Haswell 128 MB L4 cache) less important. It's going to be interesting to see how things pan out.
     
    #541 sebbbi, Jun 18, 2013
    Last edited by a moderator: Jun 18, 2013
  2. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
  3. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,489
    Likes Received:
    907
    I was hoping for some change in the cache hierarchy, but that doesn't appear to be the case either. I guess the L2 could still be faster.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    You could count the hUMA changes for CPU/GPU cache snooping as a change there, I guess.
    But if you're waiting for a shared (all cores + GPU) L3, then I don't know if Excavator is going to have that.
     
  5. Zaphod

    Zaphod Remember
    Veteran

    Joined:
    Aug 26, 2003
    Messages:
    2,202
    Likes Received:
    101
    The Hotchips presentation last year said that Steamroller would have "Shared L2 Cache" with "Dynamic resizing of L2 cache, Adaptive mode based on workload", which might sound like a unified L2 for all CPU cores at least (similar to Kabini?). But maybe that didn't make it for Kaveri, or they meant something less ambitious.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,118
    Likes Received:
    2,860
    Location:
    Well within 3d
    The L2 is already shared between two cores. The description didn't mention sharing across modules.
     
  7. Zaphod

    Zaphod Remember
    Veteran

    Joined:
    Aug 26, 2003
    Messages:
    2,202
    Likes Received:
    101
    You're right. I see they only showed a single module in the diagram too. So they were basically talking about better load balancing between the execution units for efficiency gains and/or powering down parts of it for a perf/watt improvement. Maybe for Excavator then.
     
  8. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Great post to read :)

    I never thought about it, but that amount of cache is spread across 8 render back-ends. It got me wondering whether all the units have equal access (in bandwidth and latency) to the "non-local" slices of the Z and color caches.
    Your results are pretty impressive, though I wonder (for the record I'm neither a coder nor a hardware guy, sorry if my comment is dumb) whether you could try treating that amount of cache as 8 separate pieces of cache.
    If those caches behave like the L3 in Intel or IBM architectures (each slice bound to a specific resource, in that case a CPU core), you could get even better results by fitting your tiles to the size of a "local subset" of those caches.
    So have you tried submitting 60x8 (240, that sounds high 8O ) 16x16 tiles that would fit in the local share of the cache of an RBE? EDIT <= horrendous math mistake...
    Maybe by trying 16x16, 32x32 and 64x64 (vs. 128x128) tiles you could figure out how much bandwidth the render units have to the 'non-local' shares of the aforementioned caches.
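    For reference, the number of scissor tiles (hence repeated draw calls) needed to cover 1280x720 at each candidate size, counting only tiles that actually touch the backbuffer (plain arithmetic, nothing GPU-specific):

    ```cpp
    #include <cstdio>

    int main()
    {
        const int width = 1280, height = 720;
        const int sizes[] = { 128, 64, 32, 16 };
        for (int tile : sizes)
        {
            const int tilesX = (width  + tile - 1) / tile;  // round up
            const int tilesY = (height + tile - 1) / tile;
            std::printf("%4dx%-4d tiles -> %d draw calls\n", tile, tile, tilesX * tilesY);
        }
        // Prints 60, 240, 920 and 3600 draw calls respectively.
        return 0;
    }
    ```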

    Though maybe they have equal access (which for some reason I don't expect), and maybe the impact of the rising number of draw calls could "taint" the results?

    / I'll stop here, this is over my head; I hope it was not dumb, incorrect or, worse, nonsensical.
     
    #548 liolio, Jun 19, 2013
    Last edited by a moderator: Jun 20, 2013
  9. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    I can't help it... my mind has to wander...

    I don't know how ROPs work, but I could imagine situations where the units are kept busy enough that the extra latency and bandwidth of accessing a "non-local" share of those caches would not actually cost anything /?

    -------------
    About Intel, D. Kanter has an in-depth analysis of the Intel Gen 7 GPU; to me they are the "same", though it seems there is only one RBE, and so the amount of cache is only 8 KB for color and 32 KB for Z.

    Anyway, it would still seem weird to me if manufacturers came up with solutions that don't allow software to "scale". In your case, if I get it right, you optimized for 8 RBEs / 32 ROPs, but it would not benefit a lesser part with 4 RBEs / 16 ROPs. I would understand that optimization based on the size of the L2 would be platform specific, but I find it odd for the ROPs/RBEs. They are scalable structures, so I would expect software performance to scale with the number of ROPs, both up and down (though bandwidth should scale linearly too).

    So the "building block" is the RBE to me, and I wonder whether software should be optimized around that building block. A bit like a multi-core CPU: on Xenon, say, you have 3x32 KB of L1 data cache, but I would think you would optimize your data structures for 32 KB, not 96 KB, no?

    My idea is that RBEs are designed so that if your data fits within their cache (4 KB color cache, 16 KB Z cache?) they should achieve their maximum throughput without needing external bandwidth. Actually the color cache is tiny; I'm not sure what tile format would fit in there.
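    Putting rough numbers on that idea, using only the figures already in the thread (128 KB total, 8 RBEs, 8 bytes per 4x16f pixel); the even per-RBE split is purely an assumption here, not a known fact about the hardware:

    ```cpp
    // Hypothetical even split of the 128 KB figure across 8 RBEs.
    constexpr int kPerRbeBytes = (128 * 1024) / 8;   // 16 KB per RBE
    constexpr int kTile32Bytes = 32 * 32 * 8;        // 8 KB  (4x16f, 32x32 tile)
    constexpr int kTile64Bytes = 64 * 64 * 8;        // 32 KB (4x16f, 64x64 tile)
    // Under that assumption a per-RBE tile would land between 32x32 and 64x64.
    static_assert(kTile32Bytes < kPerRbeBytes && kPerRbeBytes < kTile64Bytes, "sanity");
    ```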

    /Ignore if it doesn't make sense.

    Edit: Oops, maths are not my friend... going from 128x128 to 16x16 I said 8 times the number of draw calls... it is more like 64 times, so a lot of draw calls (3840...).
     
    #549 liolio, Jun 19, 2013
    Last edited by a moderator: Jun 20, 2013
  10. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Each slice gets its own pixel backend + color and Z caches (and setup, rasterizer, etc..).
    For instance HSW GT3 has two of everything and you can scale it up (see slide 19: http://www.highperformancegraphics.org/previous/www_2012/media/Hot3D/HPG2012_Hot3D_Intel.pdf)

    Also the color cache is backed by the whole CPU+GPU cache hierarchy.
     
  11. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Nice paper, thanks.

    I find Gen X pretty interesting, as it looks a lot like a "proper" multi-core GPU to me.
    AMD and Nvidia scale the number of CUs or SMXes, but the ROPs and the (GPU) last level of cache are not part of the "party".
    Now Intel seems to go with "self-contained" blocks, slices to use their wording.
    I'm not a technical guy, but my gut tells me it is the right move, as I can see how it allows everything to be "fine-tuned": matching the number of execution units with the amount of cache you have, the bandwidth to those caches, and the number of threads in flight needed to cover the latency (to put it more sensibly: with a proper cache hierarchy, a very high thread count is no longer your main tool for hiding latency).

    I could also see some benefits for the "wiring". If I look at an AMD GPU, I imagine a pretty massive bus running around "a bunch" of CUs, linking the CUs, the fixed-function units, the L2, the ROPs, etc.
    Now with those self-contained blocks, slices, I could see that amount of wiring being smaller, and whatever needs to be shared could be obtained just by checking the local share of the last level of cache in each slice. I could also guess that the latency to "reach" anything within a self-contained block, a slice, is lower than checking the "whole chip".

    It may be an incorrect way to state it, but I see something "beautiful" in the way that cache hierarchy is put together, not too surprising as it is Intel's forte. I'm not technical enough (by quite a stretch) to really grasp the benefit clearly, but it looks more like a "proper" cache hierarchy to me, and by that I mean something as lean and evolved as what we find in CPUs that have gone through decades of evolution to end up with the nice set-up one finds in Intel or IBM CPUs.

    It seems the architecture is already ahead of the competition: a more balanced architecture that can make more of data locality than competing architectures, that could possibly require a lot fewer threads to function, and that can already work successfully on significantly narrower vectors. No offence to Nvidia or AMD engineers, but it looks like something really intended to leverage parallelism, and not just the most extreme cases of it, with respect to vector width but also data structures.

    Anyway, I'll stop with the pseudo-technical blathering akin to "pub-level philosophy" :lol: I can't properly analyze the competing architectures; sometimes the high-level representations differ in more than one way from the actual silicon, etc. Still, looking at it I see some "beauty" to it; you guys have to be proud of it and the competition concerned.
     
    #551 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  12. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    They can scale the ROPs and the L2 cache independently from the core section. It's a different approach, but it gives them more control over performance scaling.
    Intel just uses fewer independent blocks, and this is definitely not the right way, because they spend many more transistors to achieve the same performance level.

    Sure, it has benefits, but AMD and NVIDIA design their architectures for high scalability: ultramobiles to supercomputers, as they say. Intel doesn't scale Gen X across that kind of performance range. They could, but it wouldn't be efficient.
     
  13. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    I don't see how it is less scalable; they are just moving toward more "self-contained" building blocks.

    As for spending more transistors, well, that is clearly disputable. If you read the AnandTech reviews, you see the Core i7-4950HQ competing with a quad-core CPU + GT 650M.
    The GT 650M alone is 1.2 billion transistors.
    And the perf per watt does not compare favorably for Nvidia.

    As for ultramobile parts, there are no GPUs from Nvidia or AMD there that compare to their desktop GPUs, whereas Intel is to deploy its GPU in a couple of months (Silvermont) in really low-power platforms.
    Let's see how Kepler fares in the mobile realm; that would be a point of comparison, but as far as timelines are concerned it seems Intel is set to beat Nvidia and AMD significantly.

    For supercomputers, Intel has other products; their GPU doesn't support DP calculations, but I don't see that as a sign of lesser scalability. They can scale the architecture up to close to 2 TFLOPS (according to the paper nAo linked). As you say, they can, but it doesn't fit their goals / power consumption. For the kind of power budget they aim at, they fry the competition (though they sell those Haswell + CW parts at a crazy high price; that is not set in stone, and the market will tell them whether that price is worth it, the crazy part being that it could be ;) ).

    In the supercomputer realm I think IBM rules, but it seems Intel may attempt to come after them, as they seem to be moving to a pretty aggressive roadmap for their Xeon Phi line. A part that includes Crystalwell should launch on their 14nm process at some point in 2014.

    Overall, I think that in a world where the marketing/PR departments of GPU manufacturers claim to sell "many-core" GPUs (with core counts in the thousands), Intel's move toward a more comprehensive cache hierarchy for what could genuinely qualify as "real" GPU cores / general-purpose vector machines 2.0 looks like the right thing to do to me.

    --------------
    There is also the fact that Intel designs iGPUs: it will be interesting to see how well competing AMD and Nvidia platforms handle communication between the CPU and GPU, how coherency traffic is handled, at what power cost, etc., and things like meeting bandwidth requirements without constraining the amount of RAM the system can use (as GDDR5 does).
     
    #553 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  14. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    ... and they can't scale the design beyond 4 blocks.
    The eDRAM and the GDDR5 are "messing up the results". Without the eDRAM, the GT3 design is not faster than the fastest Richland IGP, which uses far fewer transistors.

    AMD Temash is an ultramobile SoC with a 3.x watt TDP, and it uses a GCN-based iGPU. It's the fastest solution on the market at this power level, more than ten times faster than the Atom Z2760.
    NVIDIA's Logan SoC will use Kepler next year.
     
  15. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14

    This looks quite different to a Haswell slice. I see four samplers on each slice. Haswell only has one sampler per slice. Is that a real upcoming GPU?
     
  16. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Well, it is not messing anything up; both come at a cost, though from a power perspective CW wins.

    As for Trinity/Richland, if I look here or here the perf is really close, and so is the transistor count.
    Though the difference in power consumption is pretty significant.
    Other than that, CPU perf is not in the same ballpark; taking into account the fact that a 4770K carries a lot of L3, I'm not sure who out of Intel or AMD is spending its silicon in the best manner.
    Anyway, I smell an attempt at triggering a bullshit war; let me be clear, I don't care. Neither of those CPUs offers enough (GPU) performance for me. GT3e is OK-ish but way too expensive vs a discrete solution. Ultimately I'm not "on the market" researching a new set-up to buy.
    I've no intention to further discuss benches, driver optimizations, etc. If you can elaborate on the cache hierarchy and provide more information, you are welcome to do so.
    Wow, more than ten times faster. Sorry, I'm wary of AMD power figures; those 100-watt parts burn close to 130 watts. I will wait for reviews, and yes, the next Atom aims at higher perf per watt, enough to fit in a phone (reviews will tell whether Intel succeeded).
     
    #556 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,485
    Likes Received:
    396
    Location:
    Varna, Bulgaria
    Probably a Broadwell SKU. Intel promised a major boost in IGP performance in the generation after Haswell.
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    This doesn't make sense. A well-thought machine is made of parts that have been specifically designed to work together with certain trade-offs. If you take a piece off in order to compare it to another one you often reach the wrong conclusion. You can't simply take the eDRAM and pretend the rest was not architected to use it.
     
  20. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14

    Indeed, I missed the second one. It's quite different to Haswell nevertheless. Also +4 EUs per slice, obviously. If this is real and not just a showcase slide, it possibly belongs to Gen8 Broadwell.
     