22 nm Larrabee

So did they reduce pipeline latency or increase storage (or both) to achieve that?
They basically quadrupled the number of pipelines while dividing the width of the (formerly VLIW) pipelines by 4. Throughput stays the same. They may have also halved the latency, helping to get away with fewer threads.
 
Given the 4x MIMD I sincerely doubt ALU latency was decreased. Cache got a lot more complex, so latency probably didn't decrease much there either.

As I said, they probably ran the numbers on their average workload and found out that memory access dominated the flight time of the work items ... making ILP and ALU latencies a secondary concern. GPUs don't execute much cache to cache.
 
And now there are 4x the number of threads running in parallel.



It was your example, not mine :) My argument stays the same as well if you change the ratio.

Given any fixed alu:mem ratio you don't get away with fewer threads unless you reduce aggregate alu throughput or absolute memory latency. GCN has the same aggregate instruction throughput per CU as Cayman has per SIMD. You might get some opportunities for register reuse going from VLIW to scalar but that's not the general case.

GCN has 10 threads per SIMD. That's enough to hide 400 cycles of memory latency per SIMD given a 10:1 alu:mem ratio. How do you maintain that level of latency hiding with fewer threads without increasing the alu:mem ratio or slowing down the ALUs (using narrower SIMD width)?
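To put rough numbers on that (a sketch, assuming a 64-wide wavefront on a 16-lane SIMD, i.e. one instruction issued per wavefront every 4 cycles):

```python
# Back-of-the-envelope latency hiding for one GCN SIMD (assumed numbers).
wavefronts_per_simd = 10        # wavefronts resident on one SIMD
alu_per_mem = 10                # assumed alu:mem instruction ratio
cycles_per_alu_instr = 4        # 64-wide wavefront over a 16-lane SIMD

# ALU work each wavefront can do between two memory accesses:
work_per_wavefront = alu_per_mem * cycles_per_alu_instr     # 40 cycles

# With 10 wavefronts round-robining, the SIMD stays busy for:
hidden_latency = wavefronts_per_simd * work_per_wavefront   # 400 cycles
print(hidden_latency)  # -> 400
```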

Old SIMD != new SIMD.

Look at the slide titled "programmer's view of SIMD". Threads don't schedule across SIMDs in a CU. Each SIMD has to hide its latency from its own register file, and that is a quarter of the original SIMD's RF.

Even if 4x more work items are running in GCN's CU, it doesn't help an individual SIMD with latency hiding in any way.
 
They basically quadrupled the number of pipelines while dividing the width of the (formerly VLIW) pipelines by 4. Throughput stays the same. They may have also halved the latency, helping to get away with fewer threads.

Even if they did halve ALU latency, how does that help with hiding memory latency?

Old SIMD != new SIMD.

Look at the slide titled "programmer's view of SIMD". Threads don't schedule across SIMDs in a CU. Each SIMD has to hide its latency from its own register file, and that is a quarter of the original SIMD's RF.

Even if 4x more work items are running in GCN's CU, it doesn't help an individual SIMD with latency hiding in any way.

Yes, that's correct. However, you started this particular thread claiming a reduction in register requirements. Per SIMD they are reduced but now there are 4x the SIMDs so you don't really save squat from an architectural or chip standpoint given equivalent throughput.
 
Yes, that's correct. However, you started this particular thread claiming a reduction in register requirements. Per SIMD they are reduced but now there are 4x the SIMDs so you don't really save squat from an architectural or chip standpoint given equivalent throughput.

I only claimed that slower ALUs mean fewer threads to hide latency.

I never said anything about chip level savings.
 
You think that's a good thing?
It's the reality we face ;)
This massive register file is shared by a very large number of strands, leaving only a modest number of registers per strand.
Still more or at least as much as CPUs and their SSE/AVX units have at their disposal. And they don't have to swap that content out in case of a thread switch.
When executing strands more slowly, you need more of them to reach the same throughput, which means you'd need an even larger register file.
They have already reasonably sized register files for that purpose. That doesn't need to be changed dramatically.
At the same time, the software is getting more complex as well, demanding even more registers. They can't continue to sacrifice die space for that.
SRAM is cheap, especially as GPUs can easily live with access latencies of 3 cycles (or even more) and can distribute the register space over a lot of small files (a GCN CU actually has 64 separate 4 kB register files), as opposed to a massive AVXx unit (when it has horizontal capabilities and supports permute instructions). So will a rise in needed register space hurt throughput computing more on CPU or on GPU-like units? I would say CPUs don't fare too well with this kind of workload either. ;)
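A rough sketch of that capacity comparison (the 64 x 4 kB layout is from the slides; the per-thread AVX figure is my own ballpark, added only for contrast):

```python
# GCN CU register capacity as described above: 64 banks of 4 kB each.
gcn_cu_vregs_bytes = 64 * 4 * 1024   # 256 kB of vector registers per CU

# For contrast (my assumption, not from the slides): the architectural AVX
# state of one x86 thread is 16 registers x 32 bytes.
avx_arch_bytes = 16 * 32             # 512 bytes per thread

print(gcn_cu_vregs_bytes // 1024, "kB per CU vs", avx_arch_bytes, "B of AVX state per thread")
```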
Instead, some simple forms of out-of-order execution and superscalar issue can increase ILP and lower storage demand.
I always had the impression that OoOE increases the need for physical registers on the chip (rename registers and such).
Either way, if that costs you more silicon and power than a bit of additional register space, it's not worth it.
Note that it's not just about registers. If the working set of all strands combined doesn't fit inside the L1 cache most of the time, you get a very high percentage of misses, which results in higher bandwidth usage, and higher latency. Ironically higher latency means you need more strands, which again means more register and cache pressure...
Help me out here, strand was Intel's name for a data element on Larrabee, wasn't it?
And what do you mean by "the working set of all strands combined"? Normally you try to keep as much of your working set as possible in the registers (which sets a limit on the number of data elements processed simultaneously). The caches primarily work as bandwidth amplifiers when you need data from nearby places for all data elements (you should group them accordingly; it's the same as on CPUs, completely random reads are going to be slow). If that is not enough, a CPU will basically be lost, too. Each CU has about 55 GB/s of bandwidth to its L1 cache and 110 GB/s to the local memory (assuming 0.85 GHz). Considering that a CU uses it more lightly than a CPU with, let's say, 96 GB/s @ 3 GHz (assuming 32 bytes/cycle read bandwidth), because it can hold more in the registers, I fail to see the advantage on the CPU's side. Factor in that GPU caches are traditionally quite good at handling a high number of concurrent accesses to non-contiguous addresses (texturing!) and the GPUs clearly have a head start here. Of course, the price is higher latencies even for cache hits. But they have already paid this price and are capable of handling it (yes, the same story again: large register files, lots of warps/wavefronts in flight).
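For what it's worth, here is where those per-CU figures plausibly come from (a sketch, assuming 64 bytes/cycle from the L1 and 128 bytes/cycle from the local memory at 0.85 GHz, plus the 32 bytes/cycle @ 3 GHz CPU case from above):

```python
# Per-CU bandwidth figures, reconstructed from assumed bytes/cycle numbers.
clk_gpu = 0.85e9
l1_bw  = 64 * clk_gpu / 1e9      # ~54.4 GB/s (quoted as ~55 GB/s)
lds_bw = 128 * clk_gpu / 1e9     # ~108.8 GB/s (quoted as ~110 GB/s)

# CPU comparison used above: 32 bytes/cycle read bandwidth at 3 GHz.
cpu_l1_bw = 32 * 3e9 / 1e9       # 96 GB/s
print(round(l1_bw, 1), round(lds_bw, 1), cpu_l1_bw)
```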
It doesn't mean just anything: Merging CPUs and GPUs.

"You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like FP and integer units each have their own register files). Elements of the architecture will be shared, especially things like the cache hierarchy, which will prove useful when running applications that require both CPU and GPU power."
That somehow doesn't sound to me like a homogeneous CPU where everything is done in the AVX units. Didn't we already discuss that an array of CUs with its own scheduler (ACE is probably AMD's term for it) might be linked to and shared by all cores in all modules of a Bulldozer-like CPU? Larger throughput-oriented tasks are dispatched there and others to the FPU/SSE/AVX units.
 
Given the 4x MIMD I sincerely doubt ALU latency was decreased.
But the individual vector-ALU lanes got simpler. That you have four of them is an orthogonal property. The four vector ALUs don't see each other, and neither do their register files. It is definitely simpler than the 4 vector ALUs bundled in the VLIW setup, where each ALU could read the registers of the neighbouring slots.
Cache got a lot more complex, so latency probably didn't decrease much there either.
I was only talking about the ALU latency.
As I said, they probably ran the numbers on their average workload and found out that memory access dominated the flight time of the work items ... making ILP and ALU latencies a secondary concern. GPUs don't execute much cache to cache.
In the general view of things they also had a distinct disadvantage in work distribution, kernel/shader dispatch and control flow (the UTDP and the thread sequencer were simply slow and probably not meant to scale to 24 SIMDs in the first place).
But you are right, slowing down the execution of an individual wavefront (by ditching VLIW) makes it more memory latency tolerant.
The ALU latency thing goes in the direction of saving on scheduling logic (you can issue dependent instructions back-to-back), which helps you in scenarios where you don't have a lot of unstalled wavefronts to play with. That can be kernels with high register/local memory usage (limiting the number of wavefronts on a SIMD in the first place), but also scenarios where most of your wavefronts are waiting for data from cache/memory, leaving only a few for execution.
And such an approach would follow the history of AMD GPUs, where these dependencies are not tracked (current ones use simple interleaving of 2 wavefronts, nothing needs to be checked during ALU clauses). And the "horizontal" operations like the DP4 (and also the dependent MADs or dependent MULs within one VLIW which started to appear with Cypress) basically prove that the ALUs themselves are already capable of doing a MAD in half that time.
Even if they did halve ALU latency, how does that help with hiding memory latency?
You need fewer threads to fill the ALUs while waiting for the data. You have fewer stalls, you lose less performance. If they halved the latency, a single non-blocked wavefront would be enough to avoid stalls and bubbles. You get away with one wavefront less on the SIMD.
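A small sketch of that claim, assuming each wavefront issues one instruction every 4 cycles (64 work-items over a 16-lane SIMD) and a dependent instruction has to wait out the full ALU latency:

```python
import math

# Wavefronts a SIMD needs just to cover its own ALU latency with useful work,
# assuming one instruction issued per wavefront every 4 cycles.
def wavefronts_to_cover(alu_latency_cycles, issue_cycles=4):
    return max(1, math.ceil(alu_latency_cycles / issue_cycles))

print(wavefronts_to_cover(8))  # 8-cycle latency -> 2 wavefronts needed
print(wavefronts_to_cover(4))  # 4-cycle latency -> 1 wavefront suffices
```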
 
No, but I do know the Intel HD Graphics 3000 has only 2 samplers. Also, sampling would take roughly 40 instructions for 16 filtered samples using AVX2. So with an IPC of ~2, the equivalent of one CPU core would be needed to provide the same peak texel rate.
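Roughly how that estimate works out (a sketch; the ~3.4 GHz core clock is my assumption, the 40 instructions and IPC of ~2 are from above):

```python
# Rough texel rate for software bilinear sampling with AVX2 (assumptions:
# ~40 instructions per 16 filtered samples, IPC of ~2, ~3.4 GHz core clock).
instr_per_16_samples = 40
ipc = 2
clk = 3.4e9

cycles_per_16 = instr_per_16_samples / ipc        # 20 cycles for 16 samples
texels_per_sec = 16 / cycles_per_16 * clk         # ~2.7e9
print(round(texels_per_sec / 1e9, 1), "GTexel/s per core")  # same ballpark as the iGPU's peak
```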

The Intel HD Graphics 3000 can sample 4 GTexels/s with bilinear filtering, as shown here.
Care to explain how this can be done with 2 samplers?
 
The Intel HD Graphics 3000 can sample 4 GTexels/s with bilinear filtering, as shown here.
Care to explain how this can be done with 2 samplers?
My bad, it has 4 samplers (5.4 GT/s). That said, I measured that Crysis Warhead requires merely 1.14 GT/s to run at 1680x1050 @ 60 FPS. HD Graphics 3000 doesn't run Crysis at that performance level at all.
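For reference, what that measurement works out to per frame and per pixel:

```python
# Measured texel rate for Crysis Warhead at 1680x1050 @ 60 FPS (figures from above).
texel_rate = 1.14e9
fps = 60
pixels = 1680 * 1050

texels_per_frame = texel_rate / fps            # ~19 million texels per frame
texels_per_pixel = texels_per_frame / pixels   # ~10.8 filtered samples per pixel on average
print(round(texels_per_frame / 1e6, 1), round(texels_per_pixel, 1))
```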

So again, a software renderer really doesn't require all that much texel fillrate.
 
They are quite correlated. The question is: "Does 3DMark performance really correlate well with *real* performance?" The answer is somewhat unclear.

DK

Unless FLOPs were the only variable, with all else being equal, that's a pretty strong statement to make based on the presented data. Don't think that the only difference between a 4850 and a 5850 is arithmetic throughput, for example. Probably the best experimental platform for something like this would be NVIDIA's HW, since they allow separate clocking of the ALUs and the rest of the functional units, so you would be reasonably close to varying FLOP throughput only.
 
They may have also halved the latency, helping to get away with fewer threads.
Ah, there we go. Together with the higher utilization versus VLIW, reducing the ALU latency would explain how they can manage to keep the thread run time roughly the same and thus wouldn't need extra storage.

I still wonder whether NVIDIA is at an advantage by having scoreboarding and superscalar issue though. Remember the early criticism of Hyper-Threading? It sometimes made things slower simply because the threads evict each other's data from the caches. And that's with just two threads. So keeping the number of threads as low as reasonably possible can have a significant effect on certain workloads (in particular divergent ones).
 
Still more or at least as much as CPUs and their SSE/AVX units have at their disposal.
CPUs can use the L1 data cache as additional temporary storage, if necessary. The relatively small number of architectural registers is never really an issue because the live ranges of variables are quite short, compilers minimize spilling, and register renaming works around false dependencies.

Also note that AVX-1024 would increase the effective register space, and when executed in 4 cycles on 256-bit execution units it would make it easier to hide latencies. It's like you implicitly get access to four times more registers!
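A quick sketch of the register-space arithmetic behind that claim (architectural registers only):

```python
# Architectural register space per thread, AVX-256 vs a hypothetical AVX-1024.
regs = 16
avx256_bytes  = regs * 32    # 512 bytes
avx1024_bytes = regs * 128   # 2048 bytes, i.e. 4x the register space

# Executed over 4 cycles on 256-bit units, each 1024-bit instruction also
# occupies the pipeline 4x longer, which is what makes latency easier to hide.
print(avx256_bytes, avx1024_bytes)
```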

GPUs are still horrendously slow at complex tasks, such as compilation. Making them any better at them would require far more than just bigger register files. Turning CPUs into high-throughput devices on the other hand is fairly straightforward. AVX2 is already on the roadmap and AVX-1024 is quite feasible.

Throw the software economics into the mix, and GPGPU doesn't stand a chance...
 
If the AVX 1024 implementation has native 1024-bit registers, the register space is increased.
If it relies on chaining together 256-bit registers, there is a chance the chip will experience what BD experiences when cracking 256-bit ops, where performance can go down.

If we were to extend Sandy Bridge's FP register file to 1024 bits, the resulting file is 20480 bytes.
If there is modest growth in the register file in each of the next 2-3 generations, we could see a physical register file size that is in the same ballpark as a GT200 SM.

This size growth may need to happen anyway. Maintaining the 16 architectural registers of 1024-bit AVX for two threads would leave a sliver of storage for rename registers if we used an SB register file. The chip would choke for no externally visible reason.
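A rough sketch of why, assuming Sandy Bridge's FP physical register file has on the order of 144 entries of 256 bits:

```python
# Why an SB-sized FP register file would choke with 1024-bit architectural state
# (assumption: ~144 physical 256-bit entries, as commonly cited for Sandy Bridge).
prf_bytes = 144 * 32                   # ~4.5 kB physical FP register file
arch_bytes = 2 * 16 * 128              # 2 threads x 16 regs x 128 bytes = 4096 bytes

rename_bytes = prf_bytes - arch_bytes  # 512 bytes left over
rename_regs = rename_bytes // 128      # only ~4 spare 1024-bit entries for renaming
print(rename_bytes, rename_regs)
```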
 
Ah, there we go. Together with the higher utilization versus VLIW, reducing the ALU latency would explain how they can manage to keep the thread run time roughly the same and thus wouldn't need extra storage.
Actually, it would have quite a small effect. It just reduces the required number of wavefronts by 1. It is more a question of aesthetics and some saved effort in the schedulers. But the world doesn't break down if the latency stays at 8 cycles (it probably depends on how they really organized the ALUs; maybe AMD just "rotated" the VLIW4 units of Cayman by 90 degrees, i.e. 4 VLIW4 units now make up a SIMD).
I still wonder whether NVIDIA is at an advantage by having scoreboarding and superscalar issue though.
If AMD managed to halve the latencies, those things are just trying to keep the disadvantages of the longer pipeline in check.
Remember the early criticism of Hyper-Threading? It sometimes made things slower simply because the threads evict each other's data from the caches. And that's with just two threads. So keeping the number of threads as low as reasonably possible can have a significant effect on certain workloads (in particular divergent ones).
I doubt that it is significant. As mentioned, the cache structure of GPUs traditionally has different design goals. It is designed for accesses to a lot of different addresses (very high associativities) and is not meant at all to keep the complete working set in the L1.
CPUs can use the L1 data cache as additional temporary storage, if necessary. The relatively small number of architectural registers is never really an issue because the live ranges of variables are quite short, compilers minimize spilling, and register renaming works around false dependencies.
And that adds to the power consumption in comparison to a throughput oriented in-order design.
GPUs are still horrendously slow at complex tasks, such as compilation.
Name one compiler which uses SSE/AVX for compilation!
That kind of task is simply not meant for vector/throughput architectures. So any amount of AVX resources you add to CPUs won't speed that up either. :rolleyes:
 
My bad, it has 4 samplers (5.4 GT/s). That said, I measured that Crysis Warhead requires merely 1.14 GT/s to run at 1680x1050 @ 60 FPS.
How exactly do you plan on designing an architecture where texture samples travel through time to give you a homogeneous workload?

You need a fast texel rate because parts of the scene have you doing lots of texture sampling while others have none.
 
You need fewer threads to fill the ALUs while waiting for the data. You have fewer stalls, you lose less performance. If they halved the latency, a single non-blocked wavefront would be enough to avoid stalls and bubbles. You get away with one wavefront less on the SIMD.

If AMD managed to halve the latencies, those things are just trying to keep the disadvantages of the longer pipeline in check.

I see what you're saying but just don't buy it. If you assume that reducing ALU latencies allows you to get away with fewer unblocked threads to cover memory latency, then you must also assume those threads will themselves become blocked more quickly, as they arrive at memory instructions faster. In terms of hiding memory latency, GCN benefits far more from having lower instruction throughput than it does from a short ALU pipeline.

It's 1/4 instr/clk compared to 1 instr/clk on Fermi. It's this relatively slow roll through the ALUs that lowers thread requirements per SIMD.
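A hedged way to put numbers on that, reusing the 10:1 alu:mem mix and 400-cycle latency from earlier in the thread and ignoring clock-domain differences:

```python
import math

# Threads needed to cover a given memory latency as a function of issue rate
# (assumptions: 10:1 alu:mem mix, 400-cycle latency, clock domains and absolute
# latencies treated as equal for simplicity).
def threads_needed(mem_latency_cycles, alu_per_mem, cycles_per_instr):
    busy_cycles = alu_per_mem * cycles_per_instr   # ALU work per thread between memory ops
    return math.ceil(mem_latency_cycles / busy_cycles)

print(threads_needed(400, 10, 4))  # GCN-style 1/4 instr/clk -> 10 wavefronts
print(threads_needed(400, 10, 1))  # 1 instr/clk issue       -> 40 threads
```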

Name one compiler which uses SSE/AVX for compilation!

:D
 
If you assume that reducing ALU latencies allows you to get away with fewer unblocked threads to cover memory latency, then you must also assume those threads will themselves become blocked more quickly, as they arrive at memory instructions faster.
Simple question:
Is it better to waste half of your cycles when only a single unblocked wavefront is left, or would it be better to be able to do useful work in every cycle under those circumstances?
In terms of hiding memory latency, GCN benefits far more from having lower instruction throughput than it does from a short ALU pipeline.
I already said it would be a minor effect. To quote myself:
Actually, it would have quite a small effect. It just reduces the required number of wavefronts by 1. It is more a question of aesthetics and some saved effort in the schedulers. But the world doesn't break down if the latency stays at 8 cycles (it probably depends on how they really organized the ALUs; maybe AMD just "rotated" the VLIW4 units of Cayman by 90 degrees, i.e. 4 VLIW4 units now make up a SIMD).
The main advantage would be the ability to hide the ALU latencies with fewer threads. Fermi has a long pipeline (18 hot clocks) and a relatively high issue rate of 1 warp every 2 (hot clock) cycles per scheduler. This already requires a lot of instructions in flight just to hide the pipeline length. If GCN didn't need that and could really issue dependent instructions back-to-back, don't tell me it would be no advantage, both for the execution speed of lowly threaded problems and for the complexity of the schedulers (at the expense of more effort for the ALUs themselves and the result forwarding). The question the AMD engineers needed to decide was whether it would be worth it. Implementing a single precision FMA with 4 cycles of latency (DP would have 8, same as now on Cypress/Cayman) at <= 1 GHz doesn't sound too challenging to me when we will see six cycles for a DP FMAC at ~4 GHz in Bulldozer (Phenoms do an 80-bit EP multiplication with 4 cycles of latency at up to 3.7 GHz). As reasoned already, the "horizontal" instructions of dependent operations within one VLIW instruction basically prove that the raw ALU latency is actually lower than the 8 cycles dictated by the scheduling in current designs.
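Putting those cycle counts into wall-clock terms makes the point clearer (a sketch using the clocks quoted above):

```python
# Absolute (wall-clock) latencies implied by the cycle counts and clocks above.
gcn_sp_fma_ns = 4 / 1.0e9 * 1e9    # 4 cycles at <= 1 GHz  -> >= 4 ns per FMA
bd_dp_fma_ns  = 6 / 4.0e9 * 1e9    # 6 cycles at ~4 GHz    -> ~1.5 ns (Bulldozer DP FMAC)
k10_ep_mul_ns = 4 / 3.7e9 * 1e9    # 4 cycles at 3.7 GHz   -> ~1.1 ns (Phenom 80-bit multiply)

# A 4-cycle FMA at GPU clocks allows far more time per pipeline stage than
# what the CPU designs above already achieve.
print(gcn_sp_fma_ns, round(bd_dp_fma_ns, 2), round(k10_ep_mul_ns, 2))
```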
 