22 nm Larrabee

That still doesn't explain the need for advanced scoreboarding. The only reason to do that is to exploit ILP.

Wouldn't nv need advanced scoreboarding even if you weren't exploiting ILP? RAW hazards exist for them.
 
That would result in horrible ILP, which means you need massive register files.
Last time I checked GPUs used to have massive register files (256 kB per VLIW SIMD-Engine for Radeons, that's space for 64k floats). :rolleyes: And you don't care about ILP at all if your design executes only a single instruction after another, strictly in-order (GCN) ;)
As far as I'm aware NVIDIA uses lightweight scoreboarding and superscalar issue to tackle this very problem. So why would AMD be able to do without dynamic scheduling?
Look at the SI/HD7000 thread ;)
Short version: The execution latency is matched to the ratio of logical and physical vector length.
Originally AMD planned on fully unifying the CPU and GPU:
http://images.anandtech.com/reviews/cpu/amd/roadahead/evolving2.jpg
Seriously? You are reading that kind of stuff into a marketing driven pictogram? It can mean almost anything and definitely doesn't give away much about the actual implementation. Or do you think that the GPU and CPU parts are shaped like puzzle pieces now?
TSMC and Global Foundries will move to FinFET at the 14 nm node. That's four years from now, if everything goes well. That's four years during which Intel has a substantial advantage beyond mere density, and during which others will also severely struggle to keep leakage under control.
Intel said that FD-SOI (and planar transistors) can bring comparable advantages to FinFET. Intel dismissed it because they think FinFET has a cost advantage. What makes you sure we are not going to see 20nm FD-SOI processes from the members of the SOI consortium?
Yes but that means threads advance at an excruciatingly slow pace (one scalar at a time), meaning you need more threads in flight, and thus more storage (beyond the ever increasing need for storage required by long and complex kernels).
??? You probably mean one vector at a time.
See beginning of the post.
By the way, SRAM itself isn't that power hungry. Accessing it is. Nvidia has stated that reading the operands from the register file already costs more energy than the actual processing of an FMA. In that respect, it is an advantage to split your reg file into smaller ones in close proximity to the ALUs, i.e. each SIMD lane gets its own register file. Swizzling data between the lanes is probably better "outsourced" to a shared memory array from a power perspective, so the additional power for the longer distances is only spent when it is really needed.
Of course moving from VLIW4 to scalar SIMD lowers latency a bit
It may be true for GCN, but there is no "of course". Just compare the absolute latencies of nvidia GPUs (Fermi: 18 cycles, older ones even more) with AMD ones (8 cycles for ages now).
but then why would NVIDIA perform scoreboarding but not AMD?
Slightly longer version of the sentence about it above:
Current AMD GPUs interleave the execution of two wavefronts, so the latency of 8 cycles matches the throughput of 8 cycles per 2 wavefront instructions. That means two consecutive instructions from the same wavefront (which is basically a logical vector) can never have an unresolved dependency, so there is nothing to check for.
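A back-of-the-envelope sketch of why that works, as a tiny Python snippet. The 64-wide wavefront, 16-wide SIMD and 8-cycle latency are the figures discussed in this thread; everything else is simplified for illustration.

# Why interleaving two wavefronts hides the ALU latency by construction.
# 64-wide logical wavefronts, 16-wide physical SIMDs and an 8-cycle execution
# latency are the figures from the discussion; the rest is simplified.
WAVEFRONT_WIDTH = 64        # logical vector length (work-items per wavefront)
SIMD_WIDTH = 16             # physical vector length (lanes)
ALU_LATENCY = 8             # cycles before a result can be consumed
INTERLEAVED_WAVEFRONTS = 2

cycles_per_instruction = WAVEFRONT_WIDTH // SIMD_WIDTH                        # 4 cycles per wavefront instruction
issue_gap_same_wavefront = cycles_per_instruction * INTERLEAVED_WAVEFRONTS    # 8 cycles

# If the gap between consecutive instructions of the same wavefront is at least
# the ALU latency, the previous result is always ready, so no RAW check is needed.
print(issue_gap_same_wavefront >= ALU_LATENCY)                                # True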
 
Wouldn't nv need advanced scoreboarding even if you weren't exploiting ILP? RAW hazards exist for them.

I don't know of any situations where RAW hazards exist when each instruction is completely retired from the pipeline before another is submitted (i.e. GCN). RAW checks are only required when you need to have multiple instructions in-flight from the same thread (i.e. ILP).
 
Intel said that FD-SOI (and planar transistors) can bring comparable advantages to FinFET. Intel dismissed it because they think FinFET has a cost advantage. What makes you sure we are not going to see 20nm FD-SOI processes from the members of the SOI consortium?
20nm FD-SOI isn't being discussed. GF so far has promised SOI for the nodes closest to Intel's 22nm.

TSMC did say at one point it was planning for FinFET at 14nm. GF, I'm not sure about.
The window for surprises is closing fast. If they have a process full of awesomesauce, it can't be used by fabless companies if it's sprung on them at the last minute.

FD-SOI does help leakage problems and one source of variation. FinFET has an advantage over planar FD-SOI in an effective increase in area for the inversion layer for a similar footprint. A FinFET can provide more current, something the substrate doesn't change.
 
Ugh, no. For GPUs, FLOPS and 3DMark performance may be strongly correlated. Or not.

They are quite correlated. The question is: "Does 3DMark performance really correlate well with *real* performance?" The answer is somewhat unclear.

DK
 
dkanter said:
They are quite correlated. The question is: "Does 3DMark performance really correlate well with *real* performance?" The answer is somewhat unclear.

DK

Your experiment showed that flops and 3DMark performance are only correlated within an architectural family. This conversation has been hypothesizing about performance for wildly different architectures, so I don't think you can say flops and performance are correlated at all. Otherwise, Cayman would completely dominate GF110, which it clearly doesn't, despite its overwhelming peak flops advantage.
 
I remember a top AMD engineer (maybe Huddy) saying that by 2015 you won't be able to tell the difference between CPU and GPU. So they may have a different roadmap than Intel. There may also be differences in implementation.
That's quite interesting. Do you have a source for that or was it at a conference maybe? Was he talking about the programmability (software view) or the actual architecture (hardware view)?
 
No, it means that since each thread advances up to 4x slower, it needs up to 4x less reg file to hide the same latency, which is exactly what they have done.
Could you please explain to me how advancing threads more slowly reduces the number of registers required?
 
That model used a synthetic bench for data. It is not necessarily applicable to any real workload, aka games.
3DMark Vantage renders graphics. It might not be perfectly representative of actual games (no benchmark ever is), but I doubt that for games which aren't CPU bottlenecked the correlation would be significantly less. Also note that with software rendering on a homogeneous CPU you can't really get CPU bottlenecked in the same sense...

Anyway, since you're apparently more interested in "real" workloads, here's a breakdown of the shader instructions executed in Crysis (percentage of the total):
mov 20.398790
mul 18.116055
mad 17.485568
texld 13.826393
add 12.470492
dp3 3.679010
rcp 3.397286
cmp 1.960043
texkill 1.893258
rsq 1.603783
abs 1.288219
max 1.131027
exp 1.001035
nrm 0.968049
dp2add 0.315590
pow 0.278941
lrp 0.186461
Clearly this will run much faster on a CPU with 4 times the GFLOPS, non-destructive instructions, and gather. Of course it also depends on the cache bandwidth, but they'd be fools not to scale that accordingly.
Desktops are dying.
You missed the point. There are plenty of other reasons why someone may opt for software rendering instead of an IGP.
 
If mem latency is 100 cycles, and 1 alu op (costing 1 cycle each) interleaves 1 mem op on average, then you need 100 threads to hide latency. But if you make every alu op take 4 cycles, then you need only 25 threads. Hence fewer threads.
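In sketch form, using only the example numbers from this post:

# Threads needed to hide memory latency when each thread alternates one ALU op
# with one memory op on average (the 1:1 example above).
def threads_to_hide(mem_latency_cycles, alu_op_cycles):
    # While one thread waits on memory, the ALU work of the other threads
    # covers the latency.
    return mem_latency_cycles // alu_op_cycles

print(threads_to_hide(100, 1))   # 100 threads when an ALU op completes every cycle
print(threads_to_hide(100, 4))   # 25 threads when each ALU op takes 4 cycles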
 
That's quite interesting. Do you have a source for that or was it at a conference maybe? Was he talking about the programmability (software view) or the actual architecture (hardware view)?
This is the best I could find, but it's coming from a marketing guy, which I did not notice back then. Anyway, there may be some truth to the statement.
 
Last time I checked GPUs used to have massive register files (256 kB per VLIW SIMD-Engine for Radeons, that's space for 64k floats). :rolleyes:
You think that's a good thing? This massive register file is shared by a very large number of strands, leaving only a modest number of registers per strand. When executing strands more slowly, you need more of them to reach the same throughput, which means you'd need an even larger register file. At the same time, the software is getting more complex as well, demanding even more registers. They can't continue to sacrifice die space for that. Instead, some simple forms of out-of-order execution and superscalar issue can increase ILP and lower storage demand.
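To put rough numbers on that, here is a small sketch. The 256 kB register file is the figure quoted above; the wavefront width and the in-flight counts are assumptions purely for illustration.

# The more strands you keep in flight to hide latency, the fewer registers each
# one gets out of a fixed register file. 256 kB is the figure quoted above;
# the wavefront width and in-flight counts are assumptions for illustration.
REG_FILE_BYTES = 256 * 1024      # per SIMD engine (space for 64k floats)
WAVEFRONT_WIDTH = 64             # strands per wavefront (assumed)
BYTES_PER_REG = 4                # one 32-bit value per strand per register

for wavefronts_in_flight in (8, 16, 32):
    strands = WAVEFRONT_WIDTH * wavefronts_in_flight
    regs_per_strand = REG_FILE_BYTES // (strands * BYTES_PER_REG)
    print(wavefronts_in_flight, regs_per_strand)    # 8 -> 128, 16 -> 64, 32 -> 32 registers per strand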

Note that it's not just about registers. If the working set of all strands combined doesn't fit inside the L1 cache most of the time, you get a very high percentage of misses, which results in higher bandwidth usage, and higher latency. Ironically higher latency means you need more strands, which again means more register and cache pressure...

Either they'll attempt to reduce the pipeline depth, they'll use more dynamic scheduling, or they'll need supermassive register files and caches. It might be a combination, but only increasing the storage seems like a waste of die space to me.
Seriously? You are reading that kind of stuff into a marketing driven pictogram? It can mean almost anything and definitely doesn't give away much about the actual implementation.
It doesn't mean just anything: Merging CPUs and GPUs.

"You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like FP and integer units each have their own register files). Elements of the architecture will be shared, especially things like the cache hierarchy, which will prove useful when running applications that require both CPU and GPU power."

So at least initially when they bought ATI they envisioned combining the flexibility of the CPU with the throughput of a GPU. It looks like Bulldozer and GCN can still be part of this long-term plan, but I wonder what the next steps will be.

Not entirely surprisingly, it looks like AVX2 and Larrabee put Intel one step closer to a fully converged architecture. It can't be a coincidence that they've already reserved an encoding bit for 512-bit and 1024-bit AVX (they could have instead just reserved it for an undetermined feature). It's also quite interesting that Intel paid NVIDIA $1.5 billion to get access to patents which they might require to implement the sequencing logic to execute AVX-1024 on 256-bit execution units in a power-efficient manner.
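Purely as a conceptual sketch of the idea of executing AVX-1024 on 256-bit units (the chunking below is an assumption about the general approach, not a description of Intel's actual sequencing logic):

# Conceptual sketch only: a 1024-bit vector add issued once, then sequenced over
# four 256-bit chunks, so fetch/decode is amortized over four execution cycles.
# This is an assumption about the general idea, not Intel's implementation.
LOGICAL_WIDTH_FLOATS = 32        # 1024 bits of 32-bit floats
PHYSICAL_WIDTH_FLOATS = 8        # a 256-bit execution unit

def avx1024_add(a, b):
    out = [0.0] * LOGICAL_WIDTH_FLOATS
    for start in range(0, LOGICAL_WIDTH_FLOATS, PHYSICAL_WIDTH_FLOATS):
        for i in range(start, start + PHYSICAL_WIDTH_FLOATS):
            out[i] = a[i] + b[i]             # one 256-bit chunk per "cycle"
    return out

print(avx1024_add([float(i) for i in range(32)], [1.0] * 32))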
 
If mem latency is 100 cycles, and 1 alu op (costing 1 cycle each) interleaves 1 mem op on average, then you need 100 threads to hide latency. But if you make every alu op take 4 cycles, then you need only 25 threads. Hence fewer threads.

Are you sure? You can reduce ALU instruction throughput either by processing a wide vector over many cycles (like current GPUs) or by making your ALUs slower. The former probably won't reduce register requirements and the latter isn't feasible for obvious reasons. The only sure way to reduce register requirements for hiding latency in a 1:1 alu:mem scenario is to reduce absolute memory latency.
 
If mem latency is 100 cycles, and 1 alu op (costing 1 cycle each) interleaves 1 mem op on average, then you need 100 threads to hide latency. But if you make every alu op take 4 cycles, then you need only 25 threads. Hence fewer threads.
What happens to throughput? And shouldn't cache hits have a much lower latency?
 
The former probably won't reduce register requirements and the latter isn't feasible for obvious reasons.
AMD just did the latter. Execution rate of individual threads is now up to 4x slower.

The only sure way to reduce register requirements for hiding latency in a 1:1 alu:mem scenario is to reduce absolute memory latency.
1:1 is just an example. The argument would still say fewer threads if 10:1 alu:mem was the baseline, although specific numbers would change.
 
If mem latency is 100 cycles, and 1 alu op (costing 1 cycle each) interleaves 1 mem op on average, then you need 100 threads to hide latency. But if you make every alu op take 4 cycles, then you need only 25 threads. Hence fewer threads.
Since you have 4 times as many work items executing in parallel with the same number of ALUs (VLIW runs one work item on 4 ALUs vs scalar running 4 work items), that part of the equation gets you exactly diddly.

The register usage is registers needed per work item times throughput times flight time. So increasing flight time will increase register usage ... what probably made AMD do it anyway is that the flight time of work items in an average kernel is mostly memory-access dominated at the moment, so the overall hit is small ... and there are also gains in ALU utilization for code with low levels of exploitable ILP.
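That relation in sketch form (all specific numbers below are made up, purely to show the scaling):

# Register usage = registers per work-item * work-item throughput * flight time
# (the relation from the paragraph above). All numbers are made up.
def regs_in_flight(regs_per_item, items_per_cycle, flight_time_cycles):
    return regs_per_item * items_per_cycle * flight_time_cycles

# Same aggregate throughput, but if each work-item stays in flight 4x longer,
# the total register storage needed grows 4x as well.
print(regs_in_flight(16, 16, 100))    # 25600 registers in flight
print(regs_in_flight(16, 16, 400))    # 102400 registers in flight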

As I said before though, ideally the hardware could switch between VLIW and scalar execution of work items on the fly. So if you were running, say, an FFT, it could race through with VLIW, and then switch back to scalar for a sparse matrix multiply.

PS. I wish everyone would just settle on using OpenCL terminology ... like me :)
 
AMD just did the latter. Execution rate of individual threads is now up to 4x slower.

And now there are 4x the number of threads running in parallel.

1:1 is just an example

The argument would still say fewer threads if 10:1 alu:mem was the baseline, although specific numbers would change.

It was your example, not mine :) My argument stays the same as well if you change the ratio.

Given any fixed alu:mem ratio you don't get away with fewer threads unless you reduce aggregate alu throughput or absolute memory latency. GCN has the same aggregate instruction throughput per CU as Cayman has per SIMD. You might get some opportunities for register reuse going from VLIW to scalar but that's not the general case.

GCN has 10 threads per SIMD. That's enough to hide 400 cycles of memory latency per SIMD given a 10:1 alu:mem ratio. How do you maintain that level of latency hiding with fewer threads without increasing the alu:mem ratio or slowing down the ALUs (using narrower SIMD width)?
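For reference, the arithmetic behind those numbers; the 64-wide wavefront and 16-wide SIMD are the usual GCN figures, treated as assumptions here:

# "10 wavefronts hide 400 cycles at a 10:1 alu:mem ratio", spelled out.
# 64-wide wavefronts and 16-wide SIMDs are the usual GCN figures, assumed here.
SIMD_WIDTH = 16
WAVEFRONT_WIDTH = 64
CYCLES_PER_ALU_INSTRUCTION = WAVEFRONT_WIDTH // SIMD_WIDTH     # 4 cycles per wavefront instruction

ALU_PER_MEM = 10              # the 10:1 alu:mem ratio from the post
WAVEFRONTS_PER_SIMD = 10

alu_cycles_between_mem_ops = ALU_PER_MEM * CYCLES_PER_ALU_INSTRUCTION   # 40 cycles of ALU work per wavefront
hideable_latency = WAVEFRONTS_PER_SIMD * alu_cycles_between_mem_ops     # 400 cycles, as stated above
print(hideable_latency)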
 