AMD: Speculation, Rumors, and Discussion (Archive)

Status
Not open for further replies.
Were there similar register pressure issues pre-Pascal? I know register pressure is always something you want to avoid in CUDA, but I don't know to what extent game developers struggle with it.

For GP100, my understanding is that it was done mainly for FP64. Of course it's nice to have more registers, but they, and more importantly their connection fabric to the ALUs, naturally take up space, power and complexity.
 
Each GCN SIMD has a 2048-bit wide, 256-entry register file. 2048 bits is 64 x 32-bit.

Each cycle the register file is either read or written. Three cycles of reads and one cycle for write is enough for MAD (FMA).
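Multiplying those figures out as a quick sanity check (the 4-SIMDs-per-CU count is from AMD's public GCN documentation, not this post):

```python
# Back-of-the-envelope GCN register file sizing, per the figures above.
ENTRY_BITS = 2048      # one register file entry: 64 lanes x 32 bits
ENTRIES = 256          # entries per SIMD
SIMDS_PER_CU = 4       # GCN groups four 16-wide SIMDs into one CU (assumption
                       # from public GCN docs, not stated in the post)

lanes = ENTRY_BITS // 32                      # 64 work items per wavefront
bytes_per_simd = ENTRY_BITS // 8 * ENTRIES    # 256 B * 256 = 64 KiB per SIMD
bytes_per_cu = bytes_per_simd * SIMDS_PER_CU  # 256 KiB of VGPRs per CU

# FMA (d = a*b + c) needs three source reads and one destination write;
# with one register file access per cycle, that is 3 read cycles plus
# 1 write cycle, matching the 4-cycle wavefront cadence of a 16-wide SIMD.
fma_accesses = 3 + 1
```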
 
Does AMD buffer operands for the entire wavefront first or are the reads done on the fly?
 
Well, we will see in 13 days. If it's not a launch, just an announcement of a future announcement like they have done recently, or if they give a drawn-out launch date (more than a month away), that would be bad news. If they give a date in the near future (within a month), great!
 
My understanding is that it executes a wave of 64 over 4 cycles on a 16-wide SIMD. I understand how the wave of 64 can give utilization issues when there's thread divergence, but once you accept that, does it matter whether that 64-wide wave gets executed in 1, 2 or 4 cycles? That's just an implementation issue, isn't it?
Example.

GPU A CU takes instructions from a single wave. It can start a new instruction from the same wave every cycle (and has low enough instruction latency / reordering to keep the pipelines mostly filled). Let's say it fails to find an independent instruction 20% of the time. Result = 80% ALU utilization, IPC (of a wave) = 0.8.

GPU B CU on the other hand can start a new instruction every 4 cycles from a single wave. It takes instructions (round robin) from 4 waves to guarantee 100% ALU unit utilization (when not waiting for memory). Result = 100% ALU utilization, IPC (of a wave) = 0.25.

GPU B needs register space for 4 waves per CU, as it runs them all at the same time. GPU B thus needs 4x larger register files to operate efficiently. Low IPC (from a wave's point of view) is a trade-off, as it means that every single wave executes for a longer time, reserving registers. Also, running more waves concurrently tends to thrash caches more, unless the caches are also bigger (and more associative).

A CU can run as many waves as there are registers for (or LDS, if that is the bottleneck). If a shader is complex, it requires a high number of registers. If not enough waves can be running concurrently, there is no memory latency hiding, and utilization suffers from memory stalls: all waves on the same CU end up waiting for memory simultaneously, and the GPU doesn't issue any instructions to the SIMD.
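The arithmetic in this hypothetical can be sketched as a toy model (GPU A and GPU B are the invented machines above, not real parts):

```python
# Toy issue model for the two hypothetical GPUs above.
cycles = 100

# GPU A: one wave, can issue every cycle, but fails to find an
# independent instruction 20% of the time.
issued_a = cycles * 0.8
util_a = issued_a / cycles        # 0.8 ALU utilization
ipc_a = issued_a / cycles         # 0.8 IPC for the single wave

# GPU B: round-robins 4 waves, each wave issuing once every 4 cycles,
# so the ALU is always fed (ignoring memory stalls).
waves_b = 4
issued_b = cycles                 # one instruction issued per cycle
util_b = issued_b / cycles        # 1.0 ALU utilization
ipc_b_per_wave = (issued_b / waves_b) / cycles   # 0.25 per wave

# GPU B keeps 4 waves resident at once, so it needs 4x the register
# file of GPU A to sustain this.
reg_ratio = waves_b
```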
 
I have been wondering why GPUs haven't already switched to two-tier register files. Most of the registers are not in active use at any given moment; they are just storing loaded data for later use (the compiler does this statically to hide memory/cache latency, as there is no OoO execution and no register renaming). A big low-latency register file near the execution units is expensive and consumes a lot of power. Why not keep "storage" registers that are not currently in use a bit further away from the execution units? This way you should be able to keep the main register file smaller and its latency lower (and the distance to the main register file shorter -> power savings).

Nvidia made a move in this direction with Maxwell, when they introduced an operand cache (basically a tiny L0 register file). It lowers register file power consumption. I wonder whether this change was one of the factors allowing them to double registers per thread in Pascal, now that the new process gives them the die space to spend on more registers. If AMD increases their register file size in Polaris, I am curious to see their approach. A full-blown 2-tier register file would be very interesting.

Interesting read about this topic:
On Maxwell there are 4 register banks, but unlike on Kepler (also with 4 banks) the assignment of registers to banks is very simple. The Maxwell assignment is just the register number modulo 4. On Kepler it is possible to arrange the 64 FFMA instructions to eliminate all bank conflicts. On Maxwell this is no longer possible.

Maxwell, however, provides something to make up for this and at the same time offers the capability to significantly reduce register bank traffic and overall chip power draw. This is the operand reuse cache. The operand reuse cache has 8 bytes of data per source operand slot. An instruction like FFMA has 3 source operand slots. Each time you issue an instruction there is a flag you can use to specify if each of the operands is going to be used again. So the next instruction that uses the same register in the same operand slot will not have to go to the register bank to fetch its value. And with this feature you can see how a register bank conflict can be averted.
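The banking and reuse behaviour described in that excerpt can be illustrated with a toy model; the modulo-4 banking and per-slot reuse cache follow the description, while the exact cache-management details are a simplification of my own:

```python
# Toy model of Maxwell-style register banking with an operand reuse cache.
def bank(reg):
    return reg % 4    # Maxwell: bank is just the register number modulo 4

def fetch_cost(instructions):
    """Count register-bank fetches for a sequence of instructions.

    Each instruction is a list of (register, reuse_flag) pairs, one per
    source operand slot. If the previous instruction set the reuse flag
    on the same register in the same slot, the operand comes from the
    8-byte-per-slot reuse cache and the bank fetch is skipped.
    """
    cache = {}        # slot index -> register number held in the reuse cache
    fetches = 0
    for inst in instructions:
        for slot, (reg, reuse) in enumerate(inst):
            if cache.get(slot) != reg:
                fetches += 1              # miss: go to the register bank
            cache[slot] = reg if reuse else None
    return fetches

# Two FFMA-like instructions sharing reg 4 in slot 0 and reg 6 in slot 2.
# Without reuse flags, every operand hits the banks (6 fetches); with the
# flags set on the first instruction, the second skips two bank accesses.
no_reuse = fetch_cost([[(4, False), (5, False), (6, False)],
                       [(4, False), (9, False), (6, False)]])
with_reuse = fetch_cost([[(4, True), (5, False), (6, True)],
                         [(4, False), (9, False), (6, False)]])
```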
 
A general observation about CUs: the hardware has a register file (or several), a shared memory and a L1 data cache - usually in that order of descending size.

NVidia unified L1 and shared memory. Registers stand as the odd one out.

Larrabee architecture uses L1 for all three roles: L1, work item state and shared memory. It then has a small register file, AVX style (and, obviously, an x86 RF too). An operand cache, if you like.
 
NVidia unified L1 and shared memory. Registers stand as the odd one out.
I was under the impression they unified Texture Cache and L1 with shared memory still being separate. Do I remember incorrectly?
 
Obvious to whom? The general public, or tech journalists with decades of experience?
AMD definitely has more direct and effective methods to reach their partners, especially ones that won't generate hype and then disappointment among enthusiasts.
The choice of using the company's general account on a worldwide social media platform to tease a "sneak peek at Polaris" seems obtuse at best.

Completely agreed. Especially when you consider that you don't hype your partners by showing them "high level views". Didn't Anandtech already show that months ago? You hype your partners by sending them an email saying "our new chip will cost X and the reference card will cost X, so we expect the profit on every card to increase by X". In my opinion, that is the language they talk and/or understand: how much money they can make, not the "new tech we will introduce", which, again, they already talked about.

This is even worse because of the timing. They tweeted just after the Nvidia presentation, so people are waiting for AMD's response. They also placed the event right after Nvidia's NDA lifts, so again people will be waiting for AMD's response, and AMD says "Hey, we have an announcement to make!" and then "oh no, that wasn't meant for you, it was meant for our partners". Like, really?

Someone in AMD marketing department needs to go home.
 
http://docs.nvidia.com/cuda/kepler-tuning-guide/#shared-memory-capacity

Since Fermi, Nvidia has had configurable 64 KB memory pool per SM. You can configure L1 / shared memory split. Options: 16/48, 32/32 and 48/16.
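A trivial sanity check on those splits (all three carve up the same 64 KB pool):

```python
# Fermi/Kepler per-SM 64 KB pool: (L1 KB, shared memory KB) configurations.
SPLITS = [(16, 48), (32, 32), (48, 16)]
POOL_KB = 64

# Every configuration partitions the same pool; only the split point moves.
for l1_kb, shared_kb in SPLITS:
    assert l1_kb + shared_kb == POOL_KB
```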
I am aware of that. You cannot, however, do that with Maxwell any more. It's open for debate whether that's a hard limit or only imposed via software restrictions. Since simpler hardware is usually more efficient and draws less power, I would guess the first option: hard limit.
 
Do you know they haven't for a fact? This paper: http://dl.acm.org/citation.cfm?id=2155675 came out quite a few years ago, for example.
This paper describes a compile-time (static) technique, so you should see it clearly in the GPU shader microcode. CodeXL shows GCN Gen1-3 microcode, and it doesn't have any hints of a tiered register file. Neither do AMD's ISA documents or AMD's open source Linux drivers. The same is true for Intel's open source Linux drivers and ISA documents.

Are there any public low-level ISA documents for modern Nvidia GPUs? PTX is a virtual, higher-level ISA and doesn't expose low-level features like a tiered register file or operand reuse. Are there any open source drivers available?
 
The paper compares a number of possible 2 and 3 level register file schemes, including hardware managed ones that might not expose the tiers in the instruction encoding. But yes, the software managed one turns out to be the most efficient, hence the title of the paper.
 
Example.

GPU A CU takes instructions from a single wave. It can start a new instruction from the same wave every cycle (and has low enough instruction latency / reordering to keep the pipelines mostly filled). Let's say it fails to find an independent instruction 20% of the time. Result = 80% ALU utilization, IPC (of a wave) = 0.8.

GPU B CU on the other hand can start a new instruction every 4 cycles from a single wave. It takes instructions (round robin) from 4 waves to guarantee 100% ALU unit utilization (when not waiting for memory). Result = 100% ALU utilization, IPC (of a wave) = 0.25.

GPU B needs register space for 4 waves per CU, as it runs them all at the same time. GPU B thus needs 4x larger register files to operate efficiently. Low IPC (from a wave's point of view) is a trade-off, as it means that every single wave executes for a longer time, reserving registers. Also, running more waves concurrently tends to thrash caches more, unless the caches are also bigger (and more associative).

A CU can run as many waves as there are registers for (or LDS, if that is the bottleneck). If a shader is complex, it requires a high number of registers. If not enough waves can be running concurrently, there is no memory latency hiding, and utilization suffers from memory stalls: all waves on the same CU end up waiting for memory simultaneously, and the GPU doesn't issue any instructions to the SIMD.
I assume you mean that GPU A is supposed to be Nvidia Maxwell or Pascal. You should note that Maxwell needs at least 4 warps per SMM to reach peak ALU rate, since there are 4 vector units per SMM. A single warp per SMM can only harness, at best, 1/4 of the SMM's ALU horsepower, and at worst 1/24th.
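The 1/4 and 1/24 bounds follow from the SMM layout: the 4 vector units are stated above, while the ~6-cycle dependent-issue latency is my assumption (it matches commonly cited Maxwell figures):

```python
# Single-warp throughput bounds on a Maxwell SMM (toy arithmetic).
SCHEDULERS = 4      # 4 warp schedulers, each feeding its own 32-wide unit
DEP_LATENCY = 6     # assumed cycles before a dependent instruction can issue

# Best case: the warp always has an independent instruction ready, so its
# scheduler issues every cycle -- but the other 3 units sit idle.
best = 1 / SCHEDULERS                    # 1/4 of SMM throughput

# Worst case: every instruction depends on the previous one, so the warp
# issues only once per DEP_LATENCY cycles on its single unit.
worst = 1 / (SCHEDULERS * DEP_LATENCY)   # 1/24 of SMM throughput
```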
 