AMD: Speculation, Rumors, and Discussion (Archive)

I assume you mean that GPU A is supposed to be Nvidia Maxwell or Pascal. You should note that Maxwell needs at least 4 warps per SMM to reach peak ALU rate, since there are 4 vector units per SMM. A single warp per SMM can only harness, at best, 1/4 of the SMM's ALU horsepower, and at worst 1/24th.
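For a rough sense of where those bounds come from, here's a back-of-the-envelope sketch in C. The ~6-cycle ALU latency is my assumption (a commonly cited figure for Maxwell), not something stated above:

#include <stdio.h>

/* One warp on a Maxwell-style SMM with 4 vector units. Best case it
   keeps one of the four units busy every cycle; worst case every
   instruction also depends on the previous one, so it issues only once
   per ALU latency. The 6-cycle latency is an assumed figure. */
int main(void)
{
    const double units = 4.0;          /* vector units per SMM */
    const double alu_latency = 6.0;    /* assumed dependent-issue latency */
    printf("best case:  1/%.0f of peak\n", units);                /* 1/4  */
    printf("worst case: 1/%.0f of peak\n", units * alu_latency);  /* 1/24 */
    return 0;
}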
No. It was a completely made-up simple example scenario. There is no GPU with IPC that high. Without OoO you just can't find that much ILP in common code. A more CPU-like architecture could reach a result like that, but at a very high extra power cost.
 
It's not that hard to generate code that has a high degree of ILP in the arithmetic portions. Your big tool is loop unrolling combined with not reusing registers between iterations, which trades higher register count for higher usable ILP. Here's an example of what I'm talking about. The code is just a naive loop that sums up the numbers in a buffer serially.

Naive:

MOV $r9,BufferPointer    ; address of the current element
MOV $r10,BufferSize      ; bytes remaining
MOV $r0,0                ; accumulator
LOOP:
ADD $r0,$r0,[$r9]        ; accumulate one 4-byte element
ADD $r10,$r10,-4         ; 4 bytes consumed
ADD $r9,$r9,4            ; advance the pointer
JPGZ $r10,LOOP           ; loop while bytes remain

With loop unrolling (assume the element count is divisible by 4):

MOV $r9,BufferPointer    ; address of the current element
MOV $r10,BufferSize      ; bytes remaining
MOV $r0,0                ; accumulator
LOOP:
ADD $r0,$r0,[$r9]        ; four elements per iteration, but each
ADD $r0,$r0,[$r9+4]      ; ADD still depends on the previous one
ADD $r0,$r0,[$r9+8]      ; through $r0 (RAW hazard)
ADD $r0,$r0,[$r9+12]
ADD $r10,$r10,-16        ; 16 bytes consumed
ADD $r9,$r9,16           ; advance the pointer
JPGZ $r10,LOOP

With loop unrolling and register optimization:

MOV $r9,BufferPointer    ; address of the current element
MOV $r10,BufferSize      ; bytes remaining
MOV $r0,0                ; four independent accumulators
MOV $r1,0
MOV $r2,0
MOV $r3,0
LOOP:
ADD $r10,$r10,-16        ; counter updated early: no RAW hazard on the jump
ADD $r0,$r0,[$r9]        ; the four ADDs are independent of each
ADD $r1,$r1,[$r9+4]      ; other, so the pipeline never stalls
ADD $r2,$r2,[$r9+8]      ; between them
ADD $r3,$r3,[$r9+12]
ADD $r9,$r9,16           ; advance the pointer
JPGZ $r10,LOOP
ADD $r0,$r0,$r1          ; reduce the four partial
ADD $r1,$r2,$r3          ; sums into $r0
ADD $r0,$r0,$r1

Without any unrolling, we waste a lot of time updating pointers and counters. Unrolling fixes this, making most of the instructions inside the loop contribute to the actual computation. However, the repeated adds to $r0 generate read-after-write hazards, meaning the pipeline will stall before each one! To fix that, we split the accumulation across 4 independent registers and reduce them at the end. This gets rid of all the RAW hazards between the adds. We also moved the loop counter update to the top of the loop so that we avoid the read-after-write hazard on the jump comparison.
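For anyone who'd rather read it in C, here's a minimal sketch of the same four-accumulator idea (the function name and signature are mine, not from the post above):

#include <stddef.h>

/* Sums 'count' ints, where count is assumed divisible by 4. The four
   accumulators form independent dependency chains, so an in-order
   pipeline never stalls on a read-after-write hazard between the adds. */
int sum_buffer(const int *buf, size_t count)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < count; i += 4) {
        s0 += buf[i + 0];    /* independent of s1/s2/s3 */
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    return (s0 + s1) + (s2 + s3);    /* final reduction, as in the asm */
}

An optimizing compiler will often do this transformation for you (for integer adds, where reassociation is legal), but writing it out makes the dependency-chain argument explicit.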

Latencies between arithmetic instructions can be fairly easily hidden through this type of optimization, simply because they're so short. This is somewhat different on a superscalar CPU, where instead of a 6-cycle dependency lasting 6 instructions, it lasts, say, 4*6 = 24 instructions due to having 4 parallel pipelines! Oh, and the pipelines are longer due to the high target clockspeeds. This is where OoO starts being relevant.

There is, of course, a pink elephant in the room: memory transactions. When you suddenly have to cover a 200-cycle access to DRAM, trying to find other instructions to cover it is sketchy at best. Your best hope is that you have a really good cache system to prevent as many of these transactions as possible, or else another handy thread to switch to.
 
What's up with this: http://techreport.com/news/30133/amd-debuts-radeon-m400-mobile-gpus-with-a-host-of-rebrands
Why all the rebrands when Polaris 11 is about to launch?

The report notes that there are some slots around the Tonga rebrand that have not yet been filled. If that is the target range, it might be that AMD doesn't want to sell its spiffy products on more expensive FinFET silicon in the crapware tier that the 28nm rebrands can cover. I think a few of those slots go to GPUs that are meant to pair with existing APUs, and the two Polaris chips might not be able to meet the volume for that broad a list of SKUs.
 
Aren't these rebrands just the same as what happened with the OEM 7xxx => 8xxx while everyone was waiting for Tonga, Hawaii, etc.?
 
From the perspective of the existing products, this is business as usual at this point.

It's happening in a mobile space that was one of the areas being targeted by the advance Polaris marketing push, and allegedly near its launch.
The 7xxx series rename was something like 9 months before Hawaii would launch as a product above them, and another year would go by before Tonga--which didn't have a clear product space to exist in at the time.

These rebrands appear to blot out much of the mobile range, except for some omissions that were noted in the article mentioned earlier.
 
Probably right, and even if 10nm can support larger/higher-power chips, it won't until 2018 at the earliest.

Specifically, "10nm" is more or less another "half" improved node over 16/14nm, just like 20nm was. It's being shoved out ASAP by foundries as part of trying to keep mobile demand up with continuous upgrades. But the engineering cost and time to go to tape out for large chips for a relatively small perf/watt improvement on "10nm" probably won't be worth it. Not when initial demand will be eaten by mobile anyway, and by the time yields and supply ramp up enough for larger chips and other vendors "7nm" will probably be just a year away.

"7nm" for foundries other than Intel appears to be another large jump, EG 28nm to 16/14nm. But right now there's a huge question of how it will actually be made. Right now the industry relies on 193nm patterning, and at 7nm that's going to mean a lot of manufacturing and design cost. EUV replacing 193nm would mean both design and manufacturing costs could plummet, but probably won't be ready in the earliest runs. But early supply is dominated by small yield friendly SOCs with high demand from Apple and the like anyway. I'd wonder if AMD and Nvidia might wait all the way till 2019 or so when they (theoretically? Hopefully?) would be able to design 7nm on EUV, vastly reducing multiple pattern design and costs while increasing wafer throughput by a lot at the same time.

Of course after 7nm it seems to be bye bye silicon. So right now it seems 16/14nm will last quite a while for GPUs, and 7nm will last uhhh... until (?) for everyone.
 
Of course after 7nm it seems to be bye bye silicon. So right now it seems 16/14nm will last quite a while for GPUs, and 7nm will last uhhh... until (?) for everyone.
I wouldn't worry about that too much. Not as long as the "node" number is just a label for overall precision and accuracy rather than an actual feature size. So until we see the end of silicon, we are at least going to see so-called 2nm and 1nm processes, and multiple iterations of these. The end of silicon will come when the smallest feature would have to fall below a single atom in its smallest dimension, and when, at the same time, we no longer see new designs for old structures. And especially regarding structure design, not just size, we haven't seen all of it yet.
 
The 7xxx series rename was something like 9 months before Hawaii would launch as a product above them, and another year would go by before Tonga--which didn't have a clear product space to exist in at the time.
Hrm, this was also when AMD ditched the old naming scheme. So even though a major part of the lineup was dropped, AMD still pushed the version numbers for the old parts (Cedar) one last time to 8xxx for the OEM market, prior to killing them off entirely together with the old naming scheme.

Going by that logic, we are possibly going to see a couple of Polaris GPUs pressed into the 4xx naming scheme temporarily, but also a switch in the naming scheme as soon as the lineup is filled entirely with FinFET products. So 4xx products for the OEM market, probably to fulfill the need for big version numbers to keep up with Nvidia's versioning, but we are in for a full rebrand again, and with it probably also the end of the remaining GCN 1.0 and 2.0 parts.
 
It's not that hard to generate code that has a high degree of ILP in the arithmetic portions. Your big tool is loop unrolling combined with not reusing registers between iterations, which trades higher register count for higher usable ILP. Here's an example of what I'm talking about. The code is just a naive loop that sums up the numbers in a buffer serially.

(REMOVED CODE EXAMPLE)

Without any unrolling, we waste a lot of time updating pointers and counters. Unrolling fixes this, making most of the instructions inside the loop contribute to the actual computation. However, the repeated adds to $r0 generate read-after-write hazards, meaning the pipeline will stall before each one! To fix that, we split the accumulation across 4 independent registers and reduce them at the end. This gets rid of all the RAW hazards between the adds. We also moved the loop counter update to the top of the loop so that we avoid the read-after-write hazard on the jump comparison.

Latencies between arithmetic instructions can be fairly easily hidden through this type of optimization, simply because they're so short. This is somewhat different on a superscalar CPU, where instead of a 6-cycle dependency lasting 6 instructions, it lasts, say, 4*6 = 24 instructions due to having 4 parallel pipelines! Oh, and the pipelines are longer due to the high target clockspeeds. This is where OoO starts being relevant.

There is, of course, a pink elephant in the room: memory transactions. When you suddenly have to cover a 200-cycle access to DRAM, trying to find other instructions to cover it is sketchy at best. Your best hope is that you have a really good cache system to prevent as many of these transactions as possible, or else another handy thread to switch to.
Yes, you can definitely reach high peak ILP even on in-order architectures with long pipelines, but you need LOTS of extra registers and aggressive loop unrolling to do it. I still remember the last-gen PPC-based consoles. Xbox 360 had a very long VMX-128 multiply-add pipeline. It had 128 SIMD registers per thread (256 total per core). VMX-128 was practically only useful for very long loops of unrolled VMX-128 code (and no scalar<->VMX data moves, as those caused LHS stalls).

The problem is that GPU shader code doesn't have nearly as many loops as performance-critical CPU code. Many expensive shaders have no loops at all. On a GPU, the loop is commonly executed in parallel (threads instead of loop iterations). Thus the available loop parallelism is already exploited as DLP, and much less ILP is left. On a CPU, most of the code that matters for performance lives inside a hot inner loop (and can be exploited as ILP). Loop unrolling thus isn't a similarly generic solution for improving in-order GPU ILP. There are definitely cases where GPU loop unrolling is a big benefit, but the whole GPU architecture needs to be designed to be efficient even when no loops exist in the shader (minimal amount of ILP). Shaders with zero ILP are perfect for AMD's architecture, as it hides all the common (fast-path, 4-cycle) instruction latency. You can execute dependent instructions back to back, and there is no penalty.
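A toy timing model makes that point concrete. It uses the 4-cycle fast-path latency quoted above and compares a machine that issues a wave's instructions every cycle against one that issues every 4 cycles; everything else here is an illustrative assumption:

#include <stdio.h>

/* Counts stall cycles for a chain of 100 fully dependent instructions
   under different issue cadences. With a 4-cycle cadence and a 4-cycle
   latency, a result is always ready before the next dependent issue. */
int main(void)
{
    const int latency = 4;                   /* fast-path ALU latency */
    for (int cadence = 1; cadence <= 4; cadence++) {
        int cycle = 0, ready = 0, stalls = 0;
        for (int i = 0; i < 100; i++) {
            if (cycle < ready) {             /* operand not ready yet */
                stalls += ready - cycle;
                cycle = ready;
            }
            ready = cycle + latency;         /* when this result lands */
            cycle += cadence;                /* next issue opportunity */
        }
        printf("issue every %d cycle(s): %3d stall cycles\n", cadence, stalls);
    }
    return 0;
}

At a cadence of 4 the stall count is zero, which is exactly the "no penalty for dependent instructions" property described above.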

Update:
There is no free lunch. If you hide your instruction latency completely (like AMD GCN does), you don't need to allocate extra registers to allow multiple instructions from the same wave to execute concurrently. On the other hand, if an in-order CPU/GPU can issue a new instruction every cycle (from the same thread/wave), there is quite a big visible pipeline latency, and the compiler has to allocate more registers to keep the pipeline filled (hide instruction latency).

I am not a compiler expert, so I don't know how many extra registers you need (on GPUs only the peak count matters). But I would guess that it is only a constant increase (pipeline length * registers per instruction?). However, even a small constant increase in register count hurts GPUs, since there are so many work items running at the same time.

Advancing a wave only once every 4 cycles also helps in hiding cache/memory latency. For example, if an L2 fetch takes 100 cycles, a single wave has (at most) advanced 25 instructions during this time. If there are enough instructions between the load and the use, the memory latency has zero impact. Of course a GPU can execute other waves while one is waiting, but again, a heavy ILP focus would need more registers per wave, leading to a smaller number of concurrent waves -> less potential to hide memory latency. It seems that all the current IHVs strike a slightly different balance in their architectures. It's tricky to author shaders that are perfect for all of them.
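To put numbers on that register/occupancy trade-off, here's a toy calculation. The 256-register budget and 10-wave cap are the commonly cited per-SIMD figures for GCN, but treat them as assumptions for illustration:

#include <stdio.h>

/* Occupancy model for a GCN-like SIMD: all resident waves share a pool
   of 256 vector registers, with a hardware cap of 10 waves per SIMD. */
static int waves_per_simd(int vgprs_per_wave)
{
    int by_registers = 256 / vgprs_per_wave;
    return by_registers < 10 ? by_registers : 10;
}

int main(void)
{
    /* Spending registers on unrolling/ILP directly costs waves that
       could otherwise be hiding memory latency. */
    const int usage[] = { 24, 32, 64, 128 };
    for (int i = 0; i < 4; i++)
        printf("%3d VGPRs/wave -> %2d waves/SIMD\n",
               usage[i], waves_per_simd(usage[i]));
    return 0;
}

Going from 32 to 64 registers per wave halves the resident waves, which is why even a "small constant increase" in register count is painful on a GPU.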
 
I wouldn't worry about that too much. Not as long as the "node" number is just a label for overall precision and accuracy rather than an actual feature size. So until we see the end of silicon, we are at least going to see so-called 2nm and 1nm processes, and multiple iterations of these. The end of silicon will come when the smallest feature would have to fall below a single atom in its smallest dimension, and when, at the same time, we no longer see new designs for old structures. And especially regarding structure design, not just size, we haven't seen all of it yet.

There's one thing that was rather consistent, even when node names deviated. The names started deviating because basing them on gate length was no longer realistic. It wasn't that they were trying to lie to people or anything. If it's not economical to scale gate length, but you can still halve the area and gain transistor performance, what's the difference?

The node names are only comparable within a single company: TSMC 16 to TSMC 20 or 28, Samsung 14 to Samsung 22, Intel 14 to Intel 22.

That said, the one thing that has been consistent is the contacted gate pitch/M1 pitch. Right now, in the "16nm" generation, it's at ~55nm. With Intel's 32nm the pitch was 112.5nm, and that's when they started using immersion lithography. Traditionally, to reach 1/2 the area, pitch scales down by the square root of 2, or about 0.7x. So for successive nodes it should be (DP/QP/OP = double/quadruple/octa patterning):

45nm: 160nm
32nm: 112.5nm - Immersion (Intel)
22nm: ~80nm - Limited DP (Intel)
14nm: ~56nm - DP (Intel)
10nm: ~40nm - QP (Intel)
7nm: ~28nm - QP?
5nm: ~20nm - OP?
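As a sanity check on that 0.7x rule, the list above is just 160nm divided by sqrt(2) once per generation (the 160nm starting point is the 45nm figure quoted above; everything here is approximate, not vendor data):

#include <stdio.h>
#include <math.h>

/* Reproduces the pitch list above: halving the area per node means the
   pitch shrinks by sqrt(2), i.e. roughly 0.7x per generation. */
int main(void)
{
    const char *nodes[] = { "45nm", "32nm", "22nm", "14nm",
                            "10nm", "7nm", "5nm" };
    double pitch = 160.0;            /* contacted gate pitch at 45nm */
    for (int i = 0; i < 7; i++) {
        printf("%s: ~%.0fnm\n", nodes[i], pitch);
        pitch /= sqrt(2.0);          /* 0.7x pitch -> 1/2 the area */
    }
    return 0;
}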

Funny thing: immersion is said to offer a 30-40% improvement in resolution. Now, the 193nm wavelength is about 70% larger than that 112.5nm pitch, so maybe optimizations plus immersion are what allow it to reach 112.5nm. Not all companies use exactly the same sizes, so for some it's bigger, for some it's smaller. Even with something as complicated as process technology, simple equations usually reveal the pattern.

Even though it's much more complicated than that, doubling resolution basically allows you to quadruple density. There are some doubts whether QP or OP patterning is realistic or profitable, but if 8x patterning is realistic, then we'd see a practical limit without EUV at the 3.5nm generation; considering how much trouble they seem to go through to achieve 14nm with DP, OP would be incredibly more complicated than that.

So really, the gate pitch (whether contacted gate pitch or M1) is roughly what we need to base scaling on, whether we're estimating how far scaling stays viable or what the density will be. We can leave the engineers to figure out the more complicated aspects of the process, like lowering leakage and increasing drive current.

The absolute limit for scaling may arrive well before features are 1-2 atoms in size. The 1.2nm gate oxide thickness on Intel's 90nm process was already said to be hard, and that's what, 10-12 atoms? Gate length (which is what minimum feature size was traditionally based on) fell off the scaling curve after 32nm. So they basically "removed" the limits of scaling, first by no longer scaling the gate oxide, and then the gate length. That buys about 3 full generations before gate pitch shrinks to where gate length was at 22nm, and that's when things get really hard.
 
Capacity 36CU

Confirmation?
It's been 36 CUs in every single claimed test leak, excluding the very latest "sources", which claimed only 32 CUs without any data to back it up.
The most common theory has been a 40 CU chip with 36 CUs enabled in that model.
 