22 nm Larrabee

Simple question:
Is it better to waste half of your cycles when only a single unblocked wavefront is left, or would it be better to be able to do useful work in every cycle under those circumstances?

Sure, but if you find yourself in that situation often, then you have much bigger problems.

I said already it would be a minor effect.

The main advantage would be the ability to hide the ALU latencies with fewer threads.

Yup. I figure pipeline latencies aren't a big deal though. 10 warps per SIMD is enough to completely fill the Fermi pipeline, not a tall order by any means in throughput workloads. The battle will be fought and won with cache efficiency and lower memory latency.

GCN (and Cayman) already has 33% more L2 bandwidth/clk than Fermi for example.
 
But there has been some talk, not only from AMD but also from Nvidia, about strengthening their architectures for less-threaded algorithms and enabling more fine-grained, "braided" multithreading. 10 warps per SIMD means 320 warps for Fermi and probably 500+ for its 28 nm shrink, which is already ~16,000 data elements in flight. GCN with just 2 wavefronts in flight per SIMD and 32 CUs is also already at 16k data elements (with 10 per SIMD we are looking at 80k data elements). Wouldn't it be nice if it also worked more efficiently for lower counts? There are probably quite a few algorithms where hundreds of thousands or even millions of data elements aren't the relevant problem size. GPGPU, or whatever it develops into, needs to tackle that problem.
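Just to make that thread-count arithmetic explicit, a quick back-of-the-envelope sketch (the unit counts are the same assumptions as above, not official specs):

Code:
#include <stdio.h>

int main(void)
{
    /* Fermi: 16 SMs x 2 SIMDs, ~10 warps per SIMD, 32 work items per warp */
    int fermi_warps     = 16 * 2 * 10;        /* 320 warps     */
    int fermi_elements  = fermi_warps * 32;   /* ~10k elements */

    /* assumed 28 nm shrink: 500+ resident warps */
    int shrink_elements = 500 * 32;           /* ~16k elements */

    /* GCN: 32 CUs x 4 SIMDs, 2 (or 10) wavefronts per SIMD, 64 lanes */
    int gcn_min = 32 * 4 * 2  * 64;           /* ~16k elements */
    int gcn_max = 32 * 4 * 10 * 64;           /* ~80k elements */

    printf("Fermi (10 warps/SIMD):   %d\n", fermi_elements);
    printf("28 nm shrink:            %d\n", shrink_elements);
    printf("GCN, 2 wavefronts/SIMD:  %d\n", gcn_min);
    printf("GCN, 10 wavefronts/SIMD: %d\n", gcn_max);
    return 0;
}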
 
a) AVX2 will come in 2013, and AVX-1024, if it comes at all, not before 2015
So?
b) By 2013 lots of code will have GPU acceleration and the system architecture will be fixed.
By 2013 we'll only see the very first mainstream APUs where GPGPU computing on the IGP might just have become a viable option from a technical viewpoint. It would still take several years before they're widespread enough for developers to bother investing much effort into it though. But with a Haswell quad-core offering up to 500 GFLOPS and an even more flexible programming model plus a solid software infrastructure, the APU won't get much of a chance at all.
What about it?
 
I doubt that it is significant. As mentioned, the cache structure of GPUs has traditionally had different design goals. It is designed for accesses to a lot of different addresses (hence the very high associativities) and is not at all meant to keep the complete working set in the L1.
Sure, that's what it's "designed" for. But that means performance plummets when you have a workload that is any more divergent than rasterization graphics.

That's why GPGPU applications often require a high-end GPU to outperform an ancient CPU. Think of the PhysX fiasco. On an IGP, it becomes a complete humiliation. So GPU designers better start caring about reducing the thread count. In practice that means converging it toward a CPU architecture. So it just seems simpler to me to turn the CPU into a high throughput device. After AVX2 it will take just one more step.
Name one compiler which uses SSE/AVX for compilation!
That kind of task is simply not meant for vector/throughput architectures. So any amount of AVX resources you add to CPUs won't speed it up either. :rolleyes:
CPUs are already efficient at tasks like that! AVX is just to make them a lot more powerful at throughput computing. It further reduces the set of applications for which the IGP could prove useful.

So if you can't beat them, join them. Stop wasting resources on a heterogeneous architecture that has no future, but create a superior homogeneous architecture which combines the best of both worlds and even benefits from the synergy between them.
 
How exactly do you plan on designing an architecture where texture samples travel through time to give you a homogeneous workload?
Why would there have to be a homogeneous workload? A CPU doesn't care if you have a 1:10 TEX:ALU ratio or a 10:1 ratio.
You need a fast texel rate because parts of the scene have you doing lots of texture sampling while others have none.
There's no shortage of texel rate. I merely measured the average required texel rate because to a CPU that's the only thing that counts.
 
Wouldn't it be nice if it also worked more efficiently for lower counts? There are probably quite a few algorithms where hundreds of thousands or even millions of data elements aren't the relevant problem size. GPGPU, or whatever it develops into, needs to tackle that problem.
Amen.
 
There are probably quite a few algorithms where hundreds of thousands or even millions of data elements aren't the relevant problem size. GPGPU, or whatever it develops into, needs to tackle that problem.

The way I see it, the only way to do that is to ditch the horribly restrictive register file and just allocate registers from caches, like lrb. So that even with few threads, you don't feel the pain of memory latency, or at least not as much.

You could try having both a big cache and a big register file, but if the cache has any reasonable size at all, it will look like a horrible waste of transistors.
 
Sure, that's what it's "designed" for. But that means performance plummets when you have a workload that is any more divergent than rasterization graphics.
Care to explain why a "divergent workload" (whatever that means) hurts GPUs more than CPUs (when running on wide vector AVXx units)?
That's why GPGPU applications often require a high-end GPU to outperform an ancient CPU. Think of the PhysX fiasco. On an IGP, it becomes a complete humiliation. So GPU designers better start caring about reducing the thread count.
I guess you got that wrong. If a low thread count is really the problem, it would run on a low-end GPU at basically the same speed as on a high-end GPU. Generally, I wouldn't take some of nvidia's sometimes dubious decisions and marketing claims as a basis for such an assessment.
So if you can't beat them, join them. Stop wasting resources on a heterogeneous architecture that has no future, but create a superior homogeneous architecture which combines the best of both worlds and even benefits from the synergy between them.
So if you can't beat them, join them. Stop wasting resources on a humongous and power hungry AVX unit that eventually stands no chance in efficient throughput computing, but create a superior heterogeneous architecture which combines the best of both worlds and even benefits from the synergy between them.
Fixed :LOL:
 
The way I see it, the only way to do that is to ditch the horribly restrictive register file and just allocate registers from caches, like lrb. So that even with few threads, you don't feel the pain of memory latency, or at least not as much.
That would be bad for the power consumption, I think. The short term fix will be to allow several small problems to run simultaneously on the same units.
And I doubt it needs to be "solved" entirely, just lessened to a degree where you can live with it (and dispatch smaller problems to the SSE/AVX units).
 
CPUs are already efficient at tasks like that! AVX is just to make them a lot more powerful at throughput computing. It further reduces the set of applications for which the IGP could prove useful.

So if you can't beat them, join them. Stop wasting resources on a heterogeneous architecture that has no future, but create a superior homogeneous architecture which combines the best of both worlds and even benefits from the synergy between them.


You can buy a CPU/GPU and then use the GPU as a graphics card and only rarely for throughput computing, or you can buy an oversized, AVX-buffed Intel CPU and use it only rarely for throughput computing.
From a consumer standpoint I don't see why a heterogeneous architecture is a waste of resources. I would rather say that Intel's version has no future. It's harder to sell Knights Ferry to the masses than a GPU.
 
That would be bad for the power consumption, I think. The short term fix will be to allow several small problems to run simultaneously on the same units.
And I doubt it needs to be "solved" entirely, just lessened to a degree where you can live with it (and dispatch smaller problems to the SSE/AVX units).

a) Are you saying that a cache would consume more power than a reg file of equal size? Why?

b) Since mem latency is the biggest problem, to get good perf with fewer threads/workgroup, I think more cache is the only solution since GPUs can already run workloads with few threads at high throughput if these threads store a large part of their working set in registers.

c) Muxing different workgroups from different kernels onto the same CU/SM would reduce cache/workgroup back to the same level, giving no perf benefit.
 
a) Are you saying that a cache would consume more power than a reg file of equal size? Why?
If it is used like a register file, yes.
The reason is simple: a register file is (normally) directly addressed; you don't need to check whether the data is in there at all, nor which way of the highly associative cache it sits in. Furthermore, a cache is likely physically further away than the register file (especially the split ones for each lane of a SIMD unit) in current GPUs, and driving data over long wires costs a lot of energy.
b) Since mem latency is the biggest problem, to get good perf with fewer threads/workgroup, I think more cache is the only solution since GPUs can already run workloads with few threads at high throughput if these threads store a large part of their working set in registers.
That really depends on what problems you think will be relevant for execution on those evolved GPU-like units. Almost by definition, if latency is your overwhelming constraint, you should head straight for the CPU. But those are most likely not throughput-oriented scenarios anyway.
And I gave a few numbers on how many threads you need to fully load the next round of GPUs (a count that only rises as GPUs keep scaling the number of units on newer processes). There is quite a gap between current CPUs and current GPUs to fill ;)
c) Muxing different workgroups from different kernels onto the same CU/SM would reduce cache/workgroup back to the same level, giving no perf benefit.
It raises your throughput of course. It won't help the individual kernel, but altogether it will be faster. And we also need the possibility that different (or the same) contexts can run different kernels on different CUs (which GCN provides with its ACEs), so you don't have to fully load the whole GPU with a single kernel.
 
If it is used like a register file, yes.
The reason is simple: a register file is (normally) directly addressed; you don't need to check whether the data is in there at all, nor which way of the highly associative cache it sits in. Furthermore, a cache is likely physically further away than the register file (especially the split ones for each lane of a SIMD unit) in current GPUs, and driving data over long wires costs a lot of energy.
Well, a 50-50 split then, between rf and cache? The cache will still cost a lot of power though.

That really depends on what problems you think will be relevant for execution on those evolved GPU-like units. Almost by definition, if latency is your overwhelming constraint, you should head straight for the CPU. But those are most likely not throughput-oriented scenarios anyway.
Latency will be a constraint for everything. If you don't have enough parallelism, then rf can't scale. And with a small cache, you can only do horizontal reuse (across workitems) but not vertical reuse (within the same workitem), which is what high perf with few threads is about anyway.
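To make the horizontal/vertical distinction concrete, here is a toy sketch in plain C (the function name and the 4-wide tile are just illustrative, and I'm assuming n is a multiple of 4): the accumulators live in registers for the whole inner loop, so a single "work item" reuses its own data vertically instead of hoping the cache keeps it around.

Code:
/* One "work item" computes 4 outputs of c = a * B (row vector times matrix).
   Vertical reuse: a[k] is loaded once and reused 4 times from a register,
   and the accumulators acc0..acc3 never leave the register file.          */
void row_times_matrix(const float *a, const float *b, float *c, int n)
{
    for (int j = 0; j < n; j += 4) {
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
        for (int k = 0; k < n; ++k) {
            float av = a[k];                  /* one load, four uses */
            acc0 += av * b[k * n + j + 0];
            acc1 += av * b[k * n + j + 1];
            acc2 += av * b[k * n + j + 2];
            acc3 += av * b[k * n + j + 3];
        }
        c[j + 0] = acc0;  c[j + 1] = acc1;
        c[j + 2] = acc2;  c[j + 3] = acc3;
    }
}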
 
Latency will be a constraint for everything. If you don't have enough parallelism, then rf can't scale. And with a small cache, you can only do horizontal reuse (across workitems) but not vertical reuse (within the same workitem), which is what high perf with few threads is about anyway.
You should try to use the register file for that. If you don't have a huge number of threads, there should be quite a bit of space there.
 
Both the register file and L1 cache consume a significant amount of power as a result of their nearly constant use.
Designing a cache to provide the same level of porting and bandwidth as a register file would significantly add to the cost.

The space of register identifiers is vanishingly small compared to that of the address space; as such, it is much simpler to check for dependences amongst a few hundred IDs than 2^64 addresses. A lot of things get much harder if data is resident in memory. Operand bypass must go through the result forwarding path of the load/store unit, and is subject to the restrictions and limitations of that process.
Then there is the TLB (and speculation hardware, if we are talking about Haswell).

Then there is the problem that memory accesses are of indeterminate latency, have additional OS-level interactions and protection checks, and can cause exceptions or be interrupted. There are ways around it, as x86 can attest. There are costs, as a micrograph/thermal shot/transistor count of an x86 die can attest.

Now, what if there were a "cache" that could skip all of that and just address lines of SRAM directly?
A CU has 64 KiB of exactly that. I'm not sure whether this is a desired outcome, or whether AMD could not figure out a way to combine the cache and LDS in a satisfactory manner at the upcoming process node.

Fermi does some kind of allocation from the cache for shared memory, which leaves me wondering whether this is effectively a TLB-level check, which saves little in complexity, or if it has a separate part of the memory pipeline for it.
 
Care to explain why a "divergent workload" (whatever that means) hurts GPUs more than CPUs (when running on wide vector AVXx units)?
With just one or two threads per core a CPU has a lot of cache space per thread. There are many cache hits even when the data accesses diverge for a while and then old data is reused. GPUs on the other hand offer hardly any wiggle room. There are too many threads for each of them to get a decent amount of cache space. Only incredibly correlated accesses benefit from having a cache at all, such as constants shared between threads, and overlapping texture filter kernels. That's fine for rasterization graphics, but not for much else.
If a low thread count is really the problem, it would run on a low-end GPU at basically the same speed as on a high-end GPU.
A high-end GPU has more storage and bandwidth in total. So instead of having just a few threads that manage to make painstakingly slow progress, you get several more of them.
Stop wasting resources on a humongous and power hungry AVX unit that eventually stands no chance in efficient throughput computing...
AVX units are neither humongous nor power hungry. Haswell should reach 500 GFLOPS for 1 billion transistors. That's roughly the same computing density as NVIDIA's latest GPUs, and the TDP wouldn't be too far off either (even at today's 32 nm process).

Add in FinFET technology and AVX-1024, and it's readily clear that GPU manufacturers should not underestimate what a homogeneous CPU could do.
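For what it's worth, the 500 GFLOPS figure is just this arithmetic, assuming the rumored two 256-bit FMA units per core and a ~3.9 GHz clock (neither of which is confirmed):

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed Haswell: 4 cores, 2 x 256-bit FMA units per core,
       8 single-precision lanes per unit, FMA = 2 flops per lane. */
    double haswell = 4 * 2 * 8 * 2 * 3.9;      /* ~499 GFLOPS */

    /* Quad-core Sandy Bridge for comparison: 8-wide add + 8-wide
       mul per core at 3.4 GHz.                                   */
    double sandy   = 4 * (8 + 8) * 3.4;        /* ~218 GFLOPS */

    printf("Haswell (assumed): %.0f GFLOPS\n", haswell);
    printf("Sandy Bridge:      %.0f GFLOPS\n", sandy);
    return 0;
}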
 
Add in FinFET technology and AVX-1024, and it's readily clear that GPU manufacturers should not underestimate what a homogeneous CPU could do.

Why not? You seem to have no issue with underestimating the capabilities of current and future GPUs :)
 
With just one or two threads per core a CPU has a lot of cache space per thread. There are many cache hits even when the data accesses diverge for a while and then old data is reused.
The difference may be not as large as you think as GPUs can hold a lot more data in the registers.

But let's make up some numbers and say a GCN CU runs 12 heavy threads. That is 3 threads per 512-bit wide SIMD unit (the logical width is 2048 bit in ATI's case). A CPU core with SMT likewise runs 2 threads on, let's say, two 256-bit SIMD units. Each thread (aka wavefront) on a CU has significantly more vector registers (in this example 85 instead of 16) available to store its active data before any need to resort to the caches for reuse of data. Additionally, it has access to the local memory array (64 kB per CU), usable as a user-controlled cache for instance. This results in a significantly less loaded cache system for the same throughput to begin with.
Each CU can read from 4 independent addresses per clock (4x16 bytes/clock) from its L1 (that's what current GPUs already do); in addition, each of the 32 banks of the 64 kB shared memory can deliver 4 bytes/clock (coming from up to 32 different locations, obviously).
For comparison, Sandybridge can load up to 2x16 bytes/clock (from 2 addresses). The interface between the L1 and L2 cache in GCN is 64 (or even 128) bytes/clock wide, Sandybridge's is 32 bytes/clock. One has to normalize that to the clock speed and the intended throughput of course. So let us do it:

GCN, single CU @ 850 MHz:
54.4 (108.8) GFlop/s (with fma)
256 kB vector register space for 12*64=768 simultaneous data elements, i.e. 85 floats per data element
8 kB scalar register space for the 12 threads
64 kB local memory
16 kB 64 way associative L1
108.8 GB/s local memory bandwidth
54.4 GB/s L1 cache bandwidth (L1<->L2 is the same)

Sandybridge core @ 3.4 GHz
54.4 GFlop/s (no fma available)
1 kB vector register space for 2*8 = 16 data elements with SMT (512 byte for 8 data elements without), i.e. 16 floats per data element
256 Byte (0.25 kB) integer register space for 2 threads
no explicit local memory
32 kB 8 way associative L1
108.8 GB/s L1 cache bandwidth (L1<->L2 is the same)

I still fail to see the distinct advantage the CPU is supposed to have. While it is true that one can imagine problems where the L1 of the CPU can hold an extended working set for data elements that may not fit in the registers of a GPU, it is very likely that the reduction in the needed L1 traffic easily outweighs that. After all, the available bandwidth/flop is very comparable.
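If anyone wants to check those numbers, they all fall out of the per-clock figures quoted above (the GCN ones are pre-launch, so treat them as assumptions):

Code:
#include <stdio.h>

int main(void)
{
    /* GCN CU @ 0.85 GHz: 4 SIMDs x 16 lanes, FMA counted as 2 flops */
    double cu_clk   = 0.85;                       /* GHz                              */
    double cu_flops = 4 * 16 * 2 * cu_clk;        /* 108.8 GFLOP/s (54.4 without FMA) */
    double cu_l1    = 4 * 16 * cu_clk;            /*  54.4 GB/s: 4 addresses x 16 B   */
    double cu_lds   = 32 * 4 * cu_clk;            /* 108.8 GB/s: 32 banks x 4 B       */
    double cu_regs  = 256.0 * 1024 / (12 * 64);   /* ~341 B = ~85 floats per element  */

    /* Sandy Bridge core @ 3.4 GHz: 8-wide add + 8-wide mul, 2 x 16 B loads/clock */
    double sb_clk   = 3.4;                        /* GHz                              */
    double sb_flops = (8 + 8) * sb_clk;           /*  54.4 GFLOP/s                    */
    double sb_l1    = 2 * 16 * sb_clk;            /* 108.8 GB/s                       */
    double sb_regs  = 2 * 16 * 32.0 / (2 * 8);    /*  64 B = 16 floats per element    */

    printf("GCN CU  : %.1f GFLOP/s, %.1f + %.1f GB/s (L1 + LDS), %.0f B regs/element\n",
           cu_flops, cu_l1, cu_lds, cu_regs);
    printf("SNB core: %.1f GFLOP/s, %.1f GB/s (L1), %.0f B regs/element\n",
           sb_flops, sb_l1, sb_regs);
    printf("bytes per flop: GCN %.2f (L1+LDS)  SNB %.2f (L1)\n",
           (cu_l1 + cu_lds) / cu_flops, sb_l1 / sb_flops);
    return 0;
}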

GPUs on the other hand offer hardly any wiggle room. There are too many threads for each of them to get a decent amount of cache space. Only incredibly correlated accesses benefit from having a cache at all, such as constants shared between threads, and overlapping texture filter kernels. That's fine for rasterization graphics, but not for much else.
That is not true. First, for most problems one indeed has to fetch data from nearby locations for data elements lying close together (if not, organize your problem accordingly; it will help the CPU too), and that is definitely true for graphics, which is still your main concern, if I haven't misunderstood you.
And it is enough that another (or the same) thread needs data from the same cache line to get a benefit. Actually, the conditions for seeing a positive effect from the caches are basically the same as on a CPU, the only real problem being completely random memory accesses into arrays far larger than any cache (a threshold that is higher for CPUs) and no arithmetic to hide the latency (GPUs are often better at this).
I found it actually quite amazing how much bandwidth you can realize on a GPU with random indexing (with basically no coherence between neighboring data elements) into a 16 MB buffer (far larger than any cache) when each access reads 16 bytes (it's more than you can expect from a CPU when reading from the L3 cache).
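For reference, the access pattern I mean looks roughly like the sketch below (a plain-C stand-in for the GPU kernel; buffer size, the 16-byte element and rand() indexing are my assumptions, and timing is left out for brevity):

Code:
#include <stdio.h>
#include <stdlib.h>

/* 16-byte element, matching the "each access reads 16 bytes" case */
typedef struct { float x, y, z, w; } vec4;

int main(void)
{
    const size_t bytes = 16u << 20;               /* 16 MB buffer, larger than any cache */
    const size_t n     = bytes / sizeof(vec4);    /* ~1M elements                        */
    const size_t reads = 1u << 24;                /* number of random 16-byte gathers    */

    vec4   *buf = malloc(bytes);
    size_t *idx = malloc(reads * sizeof *idx);
    if (!buf || !idx) return 1;

    for (size_t i = 0; i < n; ++i)
        buf[i] = (vec4){ 1.0f, 2.0f, 3.0f, 4.0f };
    for (size_t i = 0; i < reads; ++i)
        idx[i] = (size_t)rand() % n;              /* basically no coherence between reads */

    /* the measured loop: one uncorrelated 16-byte read per "data element" */
    float sum = 0.0f;
    for (size_t i = 0; i < reads; ++i) {
        vec4 v = buf[idx[i]];
        sum += v.x + v.y + v.z + v.w;
    }

    printf("moved %zu MB in random 16-byte reads, checksum %f\n",
           (reads * sizeof(vec4)) >> 20, sum);
    free(buf); free(idx);
    return 0;
}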
A high-end GPU has more storage and bandwidth in total. So instead of having just a few threads that manage to make painstakingly slow progress, you get several more of them.
Again, a high-end GPU does basically nothing for a problem with a low thread count. If you have just a single thread (wavefront/warp) per SIMD/SM on a high-end GPU, a GPU with half the number of units will execute the kernel at the same speed. Scaling the number of units doesn't scale the number of threads of a kernel.
AVX units are neither humongous nor power hungry. Haswell should reach 500 GFLOPS for 1 billion transistors.
How do you count that? A quad-core Sandybridge already has more than 900 million transistors for ~200 GFlop/s in single precision. Do you think two 256-bit FMA units per core, and all those data paths necessary so they don't just idle, come for free?
 
Haswell should reach 500 GFLOPS for 1 billion transistors. That's roughly the same computing density as NVIDIA's latest GPUs, and the TDP wouldn't be too far off either (even at today's 32 nm process).
Obviously, nvidia will be selling Fermi in Haswell's time.

:D
 
The difference may be not as large as you think as GPUs can hold a lot more data in the registers.

But let's make up some numbers and say a GCN CU runs 12 heavy threads. That is 3 threads per 512-bit wide SIMD unit (the logical width is 2048 bit in ATI's case). A CPU core with SMT likewise runs 2 threads on, let's say, two 256-bit SIMD units. Each thread (aka wavefront) on a CU has significantly more vector registers (in this example 85 instead of 16) available to store its active data before any need to resort to the caches for reuse of data. Additionally, it has access to the local memory array (64 kB per CU), usable as a user-controlled cache for instance. This results in a significantly less loaded cache system for the same throughput to begin with.
Each CU can read from 4 independent addresses per clock (4x16 bytes/clock) from its L1 (that's what current GPUs already do); in addition, each of the 32 banks of the 64 kB shared memory can deliver 4 bytes/clock (coming from up to 32 different locations, obviously).
For comparison, Sandybridge can load up to 2x16 bytes/clock (from 2 addresses). The interface between the L1 and L2 cache in GCN is 64 (or even 128) bytes/clock wide, Sandybridge's is 32 bytes/clock. One has to normalize that to the clock speed and the intended throughput of course. So let us do it:

GCN, single CU @ 850 MHz:
54.4 (108.8) GFlop/s (with fma)
256 kB vector register space for 12*64=768 simultaneous data elements, i.e. 85 floats per data element
8 kB scalar register space for the 12 threads
64 kB local memory
16 kB 64 way associative L1
108.8 GB/s local memory bandwidth
54.4 GB/s L1 cache bandwidth (L1<->L2 is the same)

Sandybridge core @ 3.4 GHz
54.4 GFlop/s (no fma available)
1 kB vector register space for 2*8 = 16 data elements with SMT (512 byte for 8 data elements without), i.e. 16 floats per data element
256 Byte (0.25 kB) integer register space for 2 threads
no explicit local memory
32 kB 8 way associative L1
108.8 GB/s L1 cache bandwidth (L1<->L2 is the same)

I still fail to see the distinct advantage the CPU is supposed to have. While it is true that one can imagine problems where the L1 of the CPU can hold an extended working set for data elements that may not fit in the registers of a GPU, it is very likely that the reduction in the needed L1 traffic easily outweighs that. After all, the available bandwidth/flop is very comparable.
The problem with storing your data set in registers is that it's not always feasible, or even possible. Registers are a good choice for storing scalars, but anything more complex than that is HARD.

When ray tracing, are programmers supposed to hold, let's say, one or two top levels of the acceleration structure in registers? For a more complex data structure, it might be flat out impossible.

Alternatively, which parts of a ray tracing acceleration structure are suitable for storage in registers?

 