Haswell vs Kaveri

Albuquerque · Nov 12, 2013

Albuquerque said:
Yup, this is exactly where I see it. Having a processor that's just "barely running" is conceptually far less latency than one that needs to recover from S3, while perhaps using only a small fraction of power above that S3 state. Enough to capture an interrupt, speed up enough to process it (or perhaps, decide not to) and then slow back down. An operating system that can do interrupt aggregation could significantly benefit from this.

Imagine being able to have the box go into the "SuperS1" state with the display still powered. Your keystroke could generate an interrupt that could be handled within the nearest few hundred cycles (ie, quite lazily), the screen updated, and the box may never have to be more busy than that. The processor draw for an 'office use' laptop could go to less than a watt for a box that could be perceived as perfectly responsive.

I thought I would quote myself for posterity, seems I figured out "connected standby" to some degree before it was announced. Go me!

Raqia · Nov 12, 2013

3dilettante said:
There is hardware support for queueing and memory sharing.
The GPU architecture is potentially no longer in the "downright unacceptable" category.

Does this mean code involves lot less ugly context creation and switching?

3dilettante · Nov 12, 2013

Since this is supposed to support HSA, there should be queue binding and the setup of the kernel at the outset, but much less overhead after that.
Commands should be able to be passed around in memory or on-chip queues.
The original requirement that memory be copied back and forth between memory spaces or multiple transitions into the kernel driver at command issue should no longer be necessary.

It should be much better than what has come before.

Andrew Lauritzen · Nov 12, 2013

Paran said:
R-series isn't a mass product whereas Kaveri is. Not really comparable. At least in the desktop space. Intel has to bring something decent for the socket version which happens with Skylake at the earliest if they want compete. Mobile as always is completely different of course.

Well I contend that the market for "high end" socketed APUs has yet to be proven. The cost analysis just never comes out in favor of these things compared to cheap dGPUs unless you are form factor or power-constrained, and I don't expect that to change any time soon.

So sure, you can say that they fill that niche and thus aren't comparable to anything Intel ships, but I'm not convinced that niche exists to start with

I guess if they are going to resist that comparison I'll have to wait for the mobile chips. It'll be even harder for them to compete there though due to a process disadvantage I imagine.

3dilettante said:
Barring some advance in the wide gulf in latency, SIMD granularity, and cache behavior of the CPU and GPU, there's going to be a swath of workloads that do not do well on the GPU.

This is my biggest concern as well. I was assuming Kaveri would have some shared LLC similar to Haswell with all of the talk of heterogeneous computation and so on but someone told me that was not the case. Thus if you need to spill your whole working set to memory every time you want to swap between the CPU and the GPU, it's not going to be a lot more fine-grained than it is today. Being able to use pointer-based data structures is nice and all (although hardly required), but it's far more important to be able to share data on-chip. Hopefully either the person that told me this is wrong or this will get addressed in the next chip.

I'm also not totally clear on how the GPU sending work to the CPU stuff is supposed to happen. Is the HSA runtime going to own threads that it spins up and down to do work and fires callbacks into user code? Hopefully they expose the low level stuff (coherent atomics, interrupts, OS events) in addition because I hardly want to be shacked with a made-up queue abstraction on the CPU that isn't tied to any hardware reality. I can handle whatever threading and tasking on the CPU myself thank you very much

Alexko · Nov 12, 2013

Judging from the diagrams we've seen, it's almost certain at this point that there is no shared LLC in Kaveri. I just don't think AMD has the transistor budget.

Raqia · Nov 12, 2013

3dilettante said:
Barring some advance in the wide gulf in latency, SIMD granularity, and cache behavior of the CPU and GPU, there's going to be a swath of workloads that do not do well on the GPU.
For heterogenous loads, there will be those with latency requirements or mediocre speedups that cannot tolerate the overhead of hopping between the sides.
Then there's all the software that was and will not be coded specially for a minority holder of the x86 market.

It does sound like latency will be much much better since all the CPU needs to do now is dereference a pointer the same physical and logical memory space to get the results of the GPU or issue it more work. This sounds much much faster than pulling data back and forth across a narrow, high latency PCIe bridge or being burdened by memory copies in the same physical ram in the middle of a computation. If anything, this hopping around is exactly what Kaveri sounds like it's intended to solve, but we'll see.

You're right about AMD being the smaller, less popular player and whatever programming model it's using here might not be popular once Intel introduces something comparable.

3dilettante · Nov 12, 2013

Andrew Lauritzen said:
This is my biggest concern as well. I was assuming Kaveri would have some shared LLC similar to Haswell with all of the talk of heterogeneous computation and so on but someone told me that was not the case. Thus if you need to spill your whole working set to memory every time you want to swap between the CPU and the GPU, it's not going to be a lot more fine-grained than it is today.

Given the very low bar for coherent memory given for Kaveri, there is an expectation that it's spilling to memory outside of the case of an initial handoff from CPU to GPU, and that may not happen all the time.
Onion+ is what AMD is declaring "coherent", which is a bypass of the unsnooped GPU cache hierarchy. GPU reads can hit a CPU cache, but the other direction is going to memory. Unless something happens to hit that narrow time window in the memory queue, it's in DRAM.
The GPU caches are likely too primitive, too numerous, and too slow to be plugged into the same coherent broadcast path.

The latencies of the GPU are such that I would suspect the longevity of cached data is not going to be that great. Commands should be handled by queues and hardware, and the wavefront initialization requires no outside intervention.
It's much lower latency than what has come before, but the latencies were originally horrible.
The GPU itself is still a very long-latency entity relative to the CPU, so the odds are good that any data you'd want to share is going to be evicted.
Because of the GPU's coarseness, it would favor heavy streamout that would swamp any cache anyway.

Being able to use pointer-based data structures is nice and all (although hardly required), but it's far more important to be able to share data on-chip. Hopefully either the person that told me this is wrong or this will get addressed in the next chip.

I suspect it's not wrong.
Not much has been disclosed that would point to it changing.
The architectures as they stand don't look like they can take advantage of such sharing.

I'm also not totally clear on how the GPU sending work to the CPU stuff is supposed to happen. Is the HSA runtime going to own threads that it spins up and down to do work and fires callbacks into user code?

I'm not sure about the CPU side, I think the HSA runtime is a part of it. However, AMD stated that that kind of thread management could be accessed at a lower level.

Raqia · Nov 12, 2013

You could use an Intel Iris Pro like solution, giant L4 victim cache for L3, to hide the latency to RAM. Intel must be using L3 like AMD's USWC for GPU reads. Edit: maybe I should say instead that AMD's USWC is a limited kind of L3.

http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

It is an extremely asymmetric situation, but the old model seems like such a cluster fuck for determining best practices since the split memory spaces adds two "local memory" cases a developer has to consider for best performance. The actual hardware is more complex as well since it's essentially a shrink of the structures on present a motherboard like the northbridge into silicon instead of more direct memory controllers.

Onion+ seems much nicer:

http://beyond3d.com/showthread.php?t=64406

3dilettante · Nov 12, 2013

What do you think Kaveri's method is?
Aside from removing pinning requirements and possibly the static allocation of device memory, much of the infrastructure isn't changing. There are still going to be distinctions between cached and uncached pages, and an additional wrinkle where only a specific allocation type can be considered coherent.

rapso · Nov 12, 2013

3dilettante said:
That's an argument used when trying to say that a regression isn't that bad, or if it is bad that it's limited enough to awkwardly handwave away.

or when you're saying, there is a theoretical benefit, incase when .. and ... and ... and... and.. happens at the same time, while in all other cases it's just wasted space.

The higher MMX throughput cases are themselves a limited set, just ones where BD wasn't that bad.

that shouldn't change, there seem to be the same amount of units.
if you use mmx, you probably do some int processing and you don't utilize the ffma units, so there is no real conflict, I'd guess the performance shouldn't change in most real world cases.

If the savings somehow allowed for more units (no) or higher clocks (no), it might mean more. It might mean less area, which might mean cheaper for AMD.

I nave not really an idea why they did it, I could only guess it's a simple clean up/refactoring. if there is no real world benefit from having 4 ports, make the design cleaner/simpler, maybe that was the original plan and they had just no time till now.

The FPU does not have direct access to the L/S units, just a buffer.
The L/S units would see no difference, other than the possibility they see less activity because of the reduced throughput.

sorry, I wasn't that clear in my explanation. I didn't mean just the pure units, but all the data paths. Intel doubled for haswell all internal paths. they said there is no point in having FMA if you cannot supply enough data to the units. so, having 4ports and issuing 4instructions, if that's not just a one-cycle thing but longer lasting algorithm/code, then you need to also feed those 4 instead of 3 units with enough data. BD seems not to be a bandwidth monster like haswell, so I'd assume, if you push 4 instructions, BD will not satisfy all units with data (except at the beginning maybe when those L/S-units had enough time to prefetch+pipe the data).

Why the did it in the first place? maybe it was a simple thing to do, rather then merging an fmal into an existing pipeline.

rapso · Nov 12, 2013

Andrew Lauritzen said:
This is my biggest concern as well. I was assuming Kaveri would have some shared LLC similar to Haswell with all of the talk of heterogeneous computation and so on but someone told me that was not the case. Thus if you need to spill your whole working set to memory every time you want to swap between the CPU and the GPU, it's not going to be a lot more fine-grained than it is today. Being able to use pointer-based data structures is nice and all (although hardly required), but it's far more important to be able to share data on-chip. Hopefully either the person that told me this is wrong or this will get addressed in the next chip.

it's a convenient way to work,but you pay all the time for the rare case that you need it. GPUs are living from the fact that they are not coherent within the gpu, having to snoop across different units isn't fast.

I'm also not totally clear on how the GPU sending work to the CPU stuff is supposed to happen. Is the HSA runtime going to own threads that it spins up and down to do work and fires callbacks into user code? Hopefully they expose the low level stuff (coherent atomics, interrupts, OS events) in addition because I hardly want to be shacked with a made-up queue abstraction on the CPU that isn't tied to any hardware reality. I can handle whatever threading and tasking on the CPU myself thank you very much

HSA isn't meant for AMD GPUs only, it's a communication layer across all kind of units. in theory the CPU could consume the tasks it created, or some fpga could runs your task.

I am also concerned about the tasking. it's easy to reach high utilization, but it's also easy to get a low efficiency that way.

Raqia · Nov 12, 2013

You're right, the need for specifying coherent pages doesn't change, but wouldn't you be able to get away with looser practices like interleaving CPU and GPU operations more finely because of less overhead? And wouldn't you also not be worrying about two kinds of pointers to different heap spaces as you're programming as well? I guess the performance dynamics would still be different depending on if accesses are coming from GPU or CPU but I like having less complexity. When you're trying to eke out some performance, on that chart on page 35 of 1004_final, you wouldn't need to worry about those four "local" cases anymore, it'd just be uncached and cacheable.

It doesn't look like Onion+ is making a hop through an old fashion northbridge which involves extra moving around of data and translation of memory addresses as overhead and local buffers that probably take up die space.

Anyway, I should probably stop talking as I've never actually programmed seriously in this space.

3dilettante · Nov 12, 2013

rapso said:
or when you're saying, there is a theoretical benefit, incase when .. and ... and ... and... and.. happens at the same time, while in all other cases it's just wasted space.

It's not entirely theoretical. There were tests and benchmarks of some applications that relied heavily on integer SIMD that did better than expected, for which the additional port is one contributor to an explanation.

that shouldn't change, there seem to be the same amount of units.

There is one less FMAL unit that can be utilized in any given clock cycle.
The old setup could issue an FMMA, FMAL, FMAL in one cycle, the new one cannot.

if you use mmx, you probably do some int processing and you don't utilize the ffma units, so there is no real conflict, I'd guess the performance shouldn't change in most real world cases.

There are two cores sharing the same FPU. There's no physical reason why they can't be using different SIMD codes.

Why the did it in the first place? maybe it was a simple thing to do, rather then merging an fmal into an existing pipeline.

It helps if the two cores that share an FPU are running different SIMD codes. Since the fourth pipe is also the STO pipe, it also keeps MMX code from being disproportionately hit if it's store-heavy.
Putting FMAL in pipe 0 probably increased the complexity of the logic in that pipe, since originally it only had to deal with IMAC.

AMD may have weighed the costs and benefits of the wider FPU and found more benefit from narrowing it, but to say there was zero benefit to the old scheme would require some additional justification.

mczak · Nov 13, 2013

3dilettante said:
That's an argument used when trying to say that a regression isn't that bad, or if it is bad that it's limited enough to awkwardly handwave away.
The higher MMX throughput cases are themselves a limited set, just ones where BD wasn't that bad.
In terms of their throughput, whether all the ports are generally used is a Do Not Care.

If the savings somehow allowed for more units (no) or higher clocks (no), it might mean more.
It might mean less area, which might mean cheaper for AMD.

Pretty sure it really is less area, and it might also mean less power.
I kinda agree that the old distribution of units in the simd unit wasn't the best (if you only have fp code one port is completely idle all the time as it can't do anything useful in that case, not even something data-type independent like shuffles). So I guess for code which is "mostly" using floats the new arrangement could even be a win, but code using a good portion (in total coming from both int cores) of ints will definitely lose. In fact the new arrangement doesn't look too dissimilar to what K8/K10 used, the simd unit isn't all that fat anymore, I have some revolutionary idea they could give each int core its own simd unit! Compared to intel cpus that simd unit is really looking old now, half the peak throughput of an intel simd unit and shared by 2 cores... But of course AMD seems to think that's what the gcn shader units are for.

moozoo · Nov 13, 2013

So I'm guessing Kaveri's GPU DP ratio is 1:16?
I wish the was a FirePro Kaveri with 1:2

Andrew Lauritzen · Nov 13, 2013

rapso said:
it's a convenient way to work,but you pay all the time for the rare case that you need it. GPUs are living from the fact that they are not coherent within the gpu, having to snoop across different units isn't fast.

I'm not arguing for full cache-line coherency on the GPU, but it's important to have *some* mechanism to keep data on-chip efficiently. Obviously it's expected to have to flush at least some GPU caches at certain points (for instance, the texture cache is not going to snoop for every tap...), but that's somewhat orthogonal to having a shared cache to start with.

To put it another way, while fine-grained interop between the CPU/GPU that would require stronger coherence is almost certainly not necessary, being able to flush GPU L1/L2/* caches and have data picked up from some sort of LLC without hitting memory is desirable. On Haswell parts, even without EDRAM, the 6-8MB of LLC is useful for these purposes. You just don't want to burn memory bandwidth every time you go back and forth... on these bandwidth-starved APUs that's not even that much better than burning PCIe bandwidth to discrete cards (although obviously the latency is improved).

Raqia · Nov 13, 2013

The slides posted earlier by rSkip:

http://www.slideshare.net/ssuser512d95/final-apu13-philrogerskeynote21

provide some additional detail about what the programmer can expect. Looks like some features are going to be baked into JIT VM's like the "parallel" keyword in Java, and others into libraries like Bolt and CLMath. No mention of what lower level commands are used to allocate new memory and access GPU resources.

What I'm interested in doing something like a modified tiled Mergesort where data is split recursively on the CPU side until we get sufficiently many subsets. Then the GPU handles sorting all the smaller tiles in parallel and the CPU combines them back together. I have no idea how I would do something like that with HSA and maybe it wouldn't even be a good idea.

RecessionCone · Nov 13, 2013

Raqia said:
What I'm interested in doing something like a modified tiled Mergesort where data is split recursively on the CPU side until we get sufficiently many subsets. Then the GPU handles sorting all the smaller tiles in parallel and the CPU combines them back together. I have no idea how I would do something like that with HSA and maybe it wouldn't even be a good idea.

Mergesort is fast on the GPU at all levels of segmentation. No need for the CPU at all.
http://nvlabs.github.io/moderngpu/mergesort.html

Raqia · Nov 13, 2013

RecessionCone said:
Mergesort is fast on the GPU at all levels of segmentation. No need for the CPU at all.
http://nvlabs.github.io/moderngpu/mergesort.html

I noticed in the code that you have to pass a context (which presumably has to be initialized and pulled back and forth between CPU and GPU) to all the methods. I wonder if Kaveri could change that paradigm if the compiler could incorporate its HSA structures at run time with just a flag and how much speed we gain from the unified memory space. It looks like the Java VM is able to sweep this under the rug with "parallel..."

RecessionCone · Nov 13, 2013

Raqia said:
I noticed in the code that you have to pass a context (which presumably has to be initialized and pulled back and forth between CPU and GPU) to all the methods. I wonder if Kaveri could change that paradigm if the compiler could incorporate its HSA structures at run time with just a flag and how much speed we gain from the unified memory space. It looks like the Java VM is able to sweep this under the rug with "parallel..."

The context just holds memory allocations. It's not going to gain any speed from HSA.

Haswell vs Kaveri

Albuquerque

Red-headed step child

Raqia

3dilettante

Andrew Lauritzen

Moderator

Alexko

Raqia

3dilettante

Raqia

3dilettante

rapso

rapso

Raqia

3dilettante

mczak

moozoo

Andrew Lauritzen

Moderator

Raqia

RecessionCone

Raqia

RecessionCone