Haswell vs Kaveri

Discussion in 'Architecture and Products' started by AnarchX, Feb 8, 2012.

  1. Albuquerque

    Albuquerque Red-headed step child Veteran

    I thought I would quote myself for posterity, seems I figured out "connected standby" to some degree before it was announced. Go me! :razz:
     
  2. Raqia

    Raqia Regular

    Does this mean code involves a lot less ugly context creation and switching?
     
  3. 3dilettante

    3dilettante Legend Alpha

    Since this is supposed to support HSA, there should be queue binding and the setup of the kernel at the outset, but much less overhead after that.
    Commands should be able to be passed around in memory or on-chip queues.
    The original requirements, copying memory back and forth between memory spaces and making multiple transitions into the kernel driver at command issue, should no longer apply.

    It should be much better than what has come before.
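
    To make the queue model concrete, here's a minimal C++ sketch with entirely hypothetical names (the actual packet and queue formats haven't been disclosed at this level of detail): the driver only gets involved in the one-time setup, and each dispatch afterwards is just a store into shared memory plus a doorbell write.

    Code:
    #include <atomic>
    #include <cstdint>

    // Hypothetical AQL-style packet living in ordinary shared memory.
    struct DispatchPacket {
        uint64_t kernel_object;    // pre-bound kernel code, set up once
        uint64_t kernarg_address;  // pointer to arguments, nothing copied
        uint32_t grid_size[3];     // work dimensions
        uint64_t completion_signal;
    };

    // Hypothetical user-mode queue: a ring buffer the GPU reads directly.
    struct UserQueue {
        DispatchPacket*        ring;
        uint32_t               ring_mask;  // ring size - 1 (power of two)
        std::atomic<uint64_t>  write_index;
        std::atomic<uint64_t>* doorbell;   // memory-mapped GPU register
    };

    // Per-dispatch cost: one atomic increment, one packet store, one
    // doorbell write. No ioctl, no copy between memory spaces.
    void enqueue(UserQueue& q, const DispatchPacket& pkt) {
        uint64_t idx = q.write_index.fetch_add(1, std::memory_order_relaxed);
        q.ring[idx & q.ring_mask] = pkt;
        q.doorbell->store(idx, std::memory_order_release);
    }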
     
  4. Andrew Lauritzen

    Andrew Lauritzen Moderator Moderator Veteran

    Well I contend that the market for "high end" socketed APUs has yet to be proven. The cost analysis just never comes out in favor of these things compared to cheap dGPUs unless you are form factor or power-constrained, and I don't expect that to change any time soon.

    So sure, you can say that they fill that niche and thus aren't comparable to anything Intel ships, but I'm not convinced that niche exists to start with :)

    I guess if they are going to resist that comparison I'll have to wait for the mobile chips. It'll be even harder for them to compete there though due to a process disadvantage I imagine.

    This is my biggest concern as well. I was assuming Kaveri would have some shared LLC similar to Haswell with all of the talk of heterogeneous computation and so on but someone told me that was not the case. Thus if you need to spill your whole working set to memory every time you want to swap between the CPU and the GPU, it's not going to be a lot more fine-grained than it is today. Being able to use pointer-based data structures is nice and all (although hardly required), but it's far more important to be able to share data on-chip. Hopefully either the person that told me this is wrong or this will get addressed in the next chip.

    I'm also not totally clear on how the GPU-sending-work-to-the-CPU stuff is supposed to happen. Is the HSA runtime going to own threads that it spins up and down to do work and fires callbacks into user code? Hopefully they expose the low-level stuff (coherent atomics, interrupts, OS events) in addition, because I hardly want to be shackled to a made-up queue abstraction on the CPU that isn't tied to any hardware reality. I can handle whatever threading and tasking on the CPU myself, thank you very much :)
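
    To be concrete about the low-level stuff, something like this sketch is all I'm asking for on the CPU side; it's hypothetical and assumes the platform atomics really are coherent across the CPU/GPU boundary:

    Code:
    #include <atomic>

    // Hypothetical consumer of GPU-produced results.
    void use_results(const float* results, int n);

    // Flag in coherent shared memory; the GPU (or any other agent) does
    // a release-store of 1 when the results are ready. Assumes platform
    // atomics actually work across the CPU/GPU boundary.
    std::atomic<int> ready{0};

    void cpu_wait_and_consume(const float* results, int n) {
        while (ready.load(std::memory_order_acquire) == 0)
            ;  // spin here, or block on an OS event raised by an interrupt
        use_results(results, n);  // data is visible without any copies
    }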
     
    Last edited by a moderator: Nov 12, 2013
  5. Alexko

    Alexko Veteran Subscriber

    Judging from the diagrams we've seen, it's almost certain at this point that there is no shared LLC in Kaveri. I just don't think AMD has the transistor budget.
     
  6. Raqia

    Raqia Regular

    It does sound like latency will be much, much better, since all the CPU needs to do now is dereference a pointer in the same physical and logical memory space to get the results of the GPU or issue it more work. This sounds much faster than pulling data back and forth across a narrow, high-latency PCIe bridge, or being burdened by memory copies within the same physical RAM in the middle of a computation. If anything, this hopping around is exactly what Kaveri sounds like it's intended to solve, but we'll see.
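
    As a sketch of the contrast (the OpenCL calls are the standard ones; the unified-memory calls are made-up placeholders for whatever HSA exposes):

    Code:
    #include <CL/cl.h>

    // Discrete-GPU model: every round trip is staged through PCIe copies.
    void old_model(cl_command_queue q, cl_kernel k, cl_mem buf,
                   float* host, size_t bytes, size_t global) {
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    }

    // Hypothetical HSA-style dispatch and wait; names are made up.
    void launch_kernel(float* data, size_t n);
    void wait_for_gpu();

    // Unified model: the CPU hands the GPU a pointer and later just
    // dereferences the same pointer to read the results.
    float new_model(float* data, size_t n) {
        launch_kernel(data, n);  // same physical and logical address space
        wait_for_gpu();
        return data[0];          // plain load; no copy, no staging buffer
    }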

    You're right about AMD being the smaller, less popular player, and whatever programming model it's using here might not stay popular once Intel introduces something comparable.
     
  7. 3dilettante

    3dilettante Legend Alpha

    Given the very low bar Kaveri sets for coherent memory, the expectation is that data spills to memory outside of the initial handoff from CPU to GPU, and even that may not happen all the time.
    Onion+ is what AMD is declaring "coherent", and it simply bypasses the unsnooped GPU cache hierarchy. GPU reads can hit a CPU cache, but traffic in the other direction goes to memory. Unless something happens to hit that narrow time window in the memory queue, it's in DRAM.
    The GPU caches are likely too primitive, too numerous, and too slow to be plugged into the same coherent broadcast path.

    The latencies of the GPU are such that I would suspect the longevity of cached data is not going to be that great. Commands should be handled by queues and hardware, and the wavefront initialization requires no outside intervention.
    It's much lower latency than what has come before, but the latencies were originally horrible.
    The GPU itself is still a very long-latency entity relative to the CPU, so the odds are good that any data you'd want to share is going to be evicted.
    Because of the GPU's coarseness, it would favor heavy streamout that would swamp any cache anyway.

    I suspect it's not wrong.
    Not much has been disclosed that would point to it changing.
    The architectures as they stand don't look like they can take advantage of such sharing.

    I'm not sure about the CPU side; I think the HSA runtime is a part of it. However, AMD stated that that kind of thread management could be accessed at a lower level.
     
  8. Raqia

    Raqia Regular

    You could use an Intel Iris Pro-like solution, a giant L4 victim cache behind the L3, to hide the latency to RAM. Intel must be using the L3 like AMD's USWC for GPU reads. Edit: maybe I should say instead that AMD's USWC is a limited kind of L3.

    http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

    It is an extremely asymmetric situation, but the old model seems like such a clusterfuck for determining best practices, since the split memory spaces add two "local memory" cases a developer has to consider for best performance. The actual hardware is more complex as well, since it's essentially a shrink of structures present on a motherboard, like the northbridge, into silicon instead of more direct memory controllers.

    Onion+ seems much nicer:

    http://beyond3d.com/showthread.php?t=64406
     
    Last edited by a moderator: Nov 12, 2013
  9. 3dilettante

    3dilettante Legend Alpha

    What do you think Kaveri's method is?
    Aside from removing pinning requirements and possibly the static allocation of device memory, much of the infrastructure isn't changing. There are still going to be distinctions between cached and uncached pages, and an additional wrinkle where only a specific allocation type can be considered coherent.
     
  10. rapso

    rapso Newcomer

    Or are you saying there is a theoretical benefit in the case when ... and ... and ... and ... all happen at the same time, while in all other cases it's just wasted space?

    That shouldn't change; there seems to be the same number of units.
    If you use MMX, you're probably doing some int processing and not utilizing the FMA units, so there is no real conflict; I'd guess the performance shouldn't change in most real-world cases.

    I don't really have any idea why they did it; I can only guess it's a simple clean-up/refactoring. If there is no real-world benefit from having 4 ports, make the design cleaner/simpler; maybe that was the original plan and they just had no time until now.


    Sorry, I wasn't that clear in my explanation. I didn't mean just the pure units, but all the data paths. Intel doubled all the internal paths for Haswell; they said there is no point in having FMA if you cannot supply enough data to the units. So if you have 4 ports and issue 4 instructions, and it's not just a one-cycle thing but a longer-lasting algorithm, then you also need to feed those 4 (instead of 3) units with enough data. BD doesn't seem to be a bandwidth monster like Haswell, so I'd assume that if you push 4 instructions, BD will not satisfy all units with data (except maybe at the beginning, when the L/S units have had enough time to prefetch and pipe the data).
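
    As a back-of-the-envelope illustration (my Haswell numbers, and the worst case of no register reuse):

    Code:
    #include <cstdio>

    int main() {
        // Haswell: two 256-bit FMA ports, and an L1D that can sustain
        // two 32-byte loads plus one 32-byte store per cycle.
        const int fma_per_cycle = 2;
        const int srcs_per_fma  = 3;   // d = a*b + c
        const int bytes_per_src = 32;  // 256-bit vector
        const int demand  = fma_per_cycle * srcs_per_fma * bytes_per_src; // 192 B
        const int l1_read = 2 * 32;                                       // 64 B
        // Even Haswell's doubled paths cover only a third of the worst
        // case, so FMA code has to reuse registers heavily; a narrower
        // load/store side would starve extra issue ports even sooner.
        printf("needed %d B/cycle, L1 supplies %d B/cycle\n", demand, l1_read);
    }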

    Why did they do it in the first place? Maybe it was a simple thing to do, rather than merging an FMAL into an existing pipeline.
     
  11. rapso

    rapso Newcomer

    It's a convenient way to work, but you pay all the time for the rare case where you need it. GPUs live off the fact that they are not coherent even within the GPU; having to snoop across different units isn't fast.

    HSA isn't meant for AMD GPUs only; it's a communication layer across all kinds of units. In theory the CPU could consume the tasks it created, or some FPGA could run your tasks.

    I am also concerned about the tasking. It's easy to reach high utilization that way, but it's also easy to end up with low efficiency.
     
  12. Raqia

    Raqia Regular

    You're right, the need for specifying coherent pages doesn't change, but wouldn't you be able to get away with looser practices, like interleaving CPU and GPU operations more finely, because of the lower overhead? And wouldn't you also stop worrying about two kinds of pointers to different heap spaces as you're programming? I guess the performance dynamics would still differ depending on whether accesses come from the GPU or the CPU, but I like having less complexity. When you're trying to eke out some performance, on that chart on page 35 of 1004_final, you wouldn't need to worry about those four "local" cases anymore; it'd just be uncached and cacheable.

    It doesn't look like Onion+ is making a hop through an old-fashioned northbridge, which involves extra moving around of data, address translation overhead, and local buffers that probably take up die space.

    Anyway, I should probably stop talking as I've never actually programmed seriously in this space. :)
     
    Last edited by a moderator: Nov 12, 2013
  13. 3dilettante

    3dilettante Legend Alpha

    It's not entirely theoretical. There were tests and benchmarks of some applications relying heavily on integer SIMD that did better than expected, for which the additional port is one contributing factor.

    There is one less FMAL unit that can be utilized in any given clock cycle.
    The old setup could issue an FMA, FMAL, FMAL in one cycle; the new one cannot.

    There are two cores sharing the same FPU. There's no physical reason why they can't be using different SIMD codes.

    It helps if the two cores that share an FPU are running different SIMD codes. Since the fourth pipe is also the STO pipe, it also keeps MMX code from being disproportionately hit if it's store-heavy.
    Putting FMAL in pipe 0 probably increased the complexity of the logic in that pipe, since originally it only had to deal with IMAC.

    AMD may have weighed the costs and benefits of the wider FPU and found more benefit from narrowing it, but to say there was zero benefit to the old scheme would require some additional justification.
     
  14. mczak

    mczak Veteran

    Pretty sure it really is less area, and it might also mean less power.
    I kinda agree that the old distribution of units in the SIMD unit wasn't the best (if you only have FP code, one port is completely idle all the time, as it can't do anything useful in that case, not even something data-type-independent like shuffles). So I guess for code which is "mostly" using floats the new arrangement could even be a win, but code using a good portion of ints (in total, coming from both int cores) will definitely lose. In fact, the new arrangement doesn't look too dissimilar to what K8/K10 used; the SIMD unit isn't all that fat anymore. I have a revolutionary idea: they could give each int core its own SIMD unit! Compared to Intel CPUs, that SIMD unit is really looking old now: half the peak throughput of an Intel SIMD unit, and shared by 2 cores... But of course AMD seems to think that's what the GCN shader units are for.
     
  15. moozoo

    moozoo Newcomer

    So I'm guessing Kaveri's GPU DP ratio is 1:16?
    I wish there was a FirePro Kaveri with 1:2 :)
     
  16. Andrew Lauritzen

    Andrew Lauritzen Moderator Moderator Veteran

    I'm not arguing for full cache-line coherency on the GPU, but it's important to have *some* mechanism to keep data on-chip efficiently. Obviously it's expected to have to flush at least some GPU caches at certain points (for instance, the texture cache is not going to snoop for every tap...), but that's somewhat orthogonal to having a shared cache to start with.

    To put it another way, while fine-grained interop between the CPU/GPU that would require stronger coherence is almost certainly not necessary, being able to flush GPU L1/L2/* caches and have data picked up from some sort of LLC without hitting memory is desirable. On Haswell parts, even without EDRAM, the 6-8MB of LLC is useful for these purposes. You just don't want to burn memory bandwidth every time you go back and forth... on these bandwidth-starved APUs that's not even that much better than burning PCIe bandwidth to discrete cards (although obviously the latency is improved).
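
    Back-of-envelope numbers, assuming dual-channel DDR3-2133 on the APU and a PCIe 3.0 x16 link to a discrete card:

    Code:
    #include <cstdio>

    int main() {
        // Assumed configs, not any specific shipping part.
        const double ddr3  = 2 * 8 * 2.133;  // ~34 GB/s, shared by CPU+GPU
        const double pcie3 = 16 * 0.985;     // ~15.8 GB/s each direction
        // A CPU<->GPU round trip through DRAM costs a write plus a read
        // of the working set, so the effective rate is roughly halved.
        printf("DRAM round trip: ~%.0f GB/s effective\n", ddr3 / 2);
        printf("PCIe 3.0 x16:    ~%.1f GB/s per direction\n", pcie3);
    }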
     
  17. Raqia

    Raqia Regular

    The slides posted earlier by rSkip:

    http://www.slideshare.net/ssuser512d95/final-apu13-philrogerskeynote21

    provide some additional detail about what the programmer can expect. Looks like some features are going to be baked into JIT VMs, like the "parallel" keyword in Java, and others into libraries like Bolt and CLMath. No mention of what lower-level commands are used to allocate new memory and access GPU resources.

    What I'm interested in is doing something like a modified tiled mergesort, where data is split recursively on the CPU side until we get sufficiently many subsets. Then the GPU handles sorting all the smaller tiles in parallel, and the CPU combines them back together. I have no idea how I would do something like that with HSA, and maybe it wouldn't even be a good idea.
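
    Something like this C++ sketch is what I have in mind; gpu_sort_tiles is a made-up stand-in for whatever HSA ends up exposing (I've given it a plain CPU body so the sketch actually runs):

    Code:
    #include <algorithm>
    #include <vector>

    // Stand-in for a real HSA dispatch: sort each tile of `tile` elements
    // in place, notionally one workgroup per tile. CPU body so this runs.
    void gpu_sort_tiles(float* data, size_t n, size_t tile) {
        for (size_t lo = 0; lo < n; lo += tile)
            std::sort(data + lo, data + std::min(lo + tile, n));
    }

    // The hybrid: GPU sorts small tiles in parallel, CPU merges them back
    // together pairwise. With a shared address space nothing is copied.
    void hybrid_mergesort(std::vector<float>& v, size_t tile = 4096) {
        gpu_sort_tiles(v.data(), v.size(), tile);
        for (size_t width = tile; width < v.size(); width *= 2)
            for (size_t lo = 0; lo + width < v.size(); lo += 2 * width) {
                size_t mid = lo + width;
                size_t hi  = std::min(lo + 2 * width, v.size());
                std::inplace_merge(v.begin() + lo, v.begin() + mid,
                                   v.begin() + hi);
            }
    }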
     
  18. RecessionCone

    RecessionCone Regular Subscriber

    Mergesort is fast on the GPU at all levels of segmentation. No need for the CPU at all.
    http://nvlabs.github.io/moderngpu/mergesort.html
     
  19. Raqia

    Raqia Regular

    I noticed in the code that you have to pass a context (which presumably has to be initialized and pulled back and forth between CPU and GPU) to all the methods. I wonder if Kaveri could change that paradigm, if the compiler could incorporate its HSA structures at run time with just a flag, and how much speed we'd gain from the unified memory space. It looks like the Java VM is able to sweep this under the rug with "parallel..."
     
  20. RecessionCone

    RecessionCone Regular Subscriber

    The context just holds memory allocations. It's not going to gain any speed from HSA.
     