AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

The maximum number is determined by the size of your register file. The minimum required to cover internal latency is determined by pipeline depth. For CUDA at least, you try to get as many threads as possible except in some special cases.

Getting as many threads as possible is indeed the rule of thumb put forward by NVIDIA, but I'm not sure the cases where it's wrong are so special:

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

But once you've maximized ILP, maximizing thread count is probably a sound strategy in most cases. I wonder how Volkov's findings hold up with Maxwell, which I gather is somewhat less reliant on ILP.
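As a back-of-envelope illustration of that trade-off (my own toy model via Little's law, not something taken from the slides): the work that must be in flight to hide a latency is latency × issue rate, and it can come either from more threads or from more independent instructions (ILP) per thread.

```python
# Back-of-envelope Little's law model (my own illustration, not from Volkov's slides).
# To keep a pipeline busy, in-flight work >= latency * issue rate.
# That work can come from many threads, or from fewer threads with more ILP each.

import math

def threads_needed(latency_cycles, issue_rate_per_cycle, ilp_per_thread):
    """Threads required to cover a given latency at a given issue rate."""
    in_flight_needed = latency_cycles * issue_rate_per_cycle
    return math.ceil(in_flight_needed / ilp_per_thread)

# Hypothetical numbers: 400-cycle memory latency, 1 instruction issued per cycle.
for ilp in (1, 2, 4):
    print(f"ILP={ilp}: ~{threads_needed(400, 1, ilp)} threads to hide latency")
# ILP=1: ~400 threads, ILP=2: ~200, ILP=4: ~100 -- more ILP means less occupancy needed.
```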
 
Isn't the number of threads determined by pipeline/memory latency hiding requirements?
As an example, imagine the kernel requires 256 scalar registers. On VLIW that's 64 vec4 registers. So, on VLIW you get 4 hardware threads to hide latency. On GCN that's also 4 hardware threads. The hardware thread count is the same because both architectures have the same quantity of register file per CU.

With VLIW these 4 hardware threads are sharing a single SIMD and a single register file. So when 1 thread issues a memory request, there are 3 other hardware threads that can run.

On GCN, you have 4 SIMDs, each with a private register file that's one-quarter of the CU's total register file capacity. So when sharing out the 4 hardware threads, each SIMD ends up with a single hardware thread. So when any of those hardware threads issues a memory request, the SIMD it was running on falls idle. Therefore latency has not been hidden.

To hide even the smallest amount of latency requires at least two hardware threads. On GCN with 4 SIMDs per CU, that means 8 hardware threads are required. On VLIW, only 2 hardware threads are required. When both are given a kernel with the same register file allocation (measured in bytes per work item), you will end up with more latency-hiding capability on VLIW.

hardware thread count per SIMD = Compute Unit RF size / work item allocation / work items per hardware thread / count of SIMDs

CU RF Size = 262144 bytes
work item allocation = 256 scalar registers * 4 bytes per register = 1024 bytes
work items per hardware thread = 64

You can rewrite this in terms of the per-SIMD register file:

hardware thread count per SIMD = RF size per SIMD / work item allocation / work items per hardware thread

RF size per SIMD is 65536 bytes on GCN, but 262144 bytes on VLIW (the single SIMD owns the whole CU's register file).
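A quick sketch of that arithmetic in code (using only the numbers above; the per-SIMD 65536-byte figure for GCN and the single 262144-byte file for VLIW are as stated in this post):

```python
# Hardware threads per SIMD = RF bytes available to the SIMD
#                             / bytes per work item / work items per hardware thread.
def hw_threads_per_simd(rf_bytes, regs_per_work_item, work_items_per_thread=64,
                        bytes_per_reg=4):
    bytes_per_work_item = regs_per_work_item * bytes_per_reg
    return rf_bytes // (bytes_per_work_item * work_items_per_thread)

# GCN: each of the 4 SIMDs owns a private 64 KiB register file.
print("GCN :", hw_threads_per_simd(65536, 256))    # -> 1 (no latency hiding)
# VLIW: one SIMD with the whole 256 KiB file (256 scalar regs = 64 vec4 regs).
print("VLIW:", hw_threads_per_simd(262144, 256))   # -> 4
```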

The counter argument would be that on GCN, when a SIMD falls idle, only 16 lanes are actually idling, whereas on VLIW it's 80 or 64 lanes (VLIW-5 or VLIW-4). The problem with this argument is simply that the VLIW architecture generally has enough hardware threads not to fall idle, or to idle for much shorter periods of time.
 
You make it sound as if VLIW were the architecture the IHVs were moving towards and not the one they're coming from. :)
 
You make it sound as if VLIW were the architecture the IHVs were moving towards and not the one they're coming from. :)
VLIW tends to look good on paper and in theory. They are perennially revived for that particular reason. In practice, as a general-purpose solution they tend to fare poorly, but they're well matched to certain workloads. I believe that if GPU compute hadn't become a thing (and AMD hadn't tricked itself into focusing on it so much), the VLIW architectures would have remained a pretty competent alternative for graphics. They were robust flingers of pixels; their glass jaws were elsewhere.
 
VLIW tends to look good on paper and in theory. They are perennially revived for that particular reason. In practice, as a general-purpose solution they tend to fare poorly, but they're well matched to certain workloads. I believe that if GPU compute hadn't become a thing (and AMD hadn't tricked itself into focusing on it so much), the VLIW architectures would have remained a pretty competent alternative for graphics. They were robust flingers of pixels; their glass jaws were elsewhere.

I think a prominent graphics architect from AMD (I can't remember who it was exactly) once commented that if not for compute, AMD wouldn't even have moved from VLIW5 to VLIW4.

But since compute is now a part of graphics too, it's probably just as well that they did, even for games.
 
You make it sound as if VLIW were the architecture the IHVs were moving towards and not the one they're coming from. :)
If AMD were to double or quadruple the RF size per GCN SIMD, this latency-hiding argument in favour of VLIW would disappear.

64KB to be shared by a minimum of 128 work items (two hardware threads) is simply too little. Note that NVidia's GM2xx GPUs have 64KB shared by 64 work items (two hardware threads of 32 work items) in the worst case, which makes NVidia much more robust.
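To put numbers on that worst case (just restating the figures above; treating GM2xx's 64KB as the file available to one scheduler partition is my reading of it):

```python
# Worst-case register budget per work item with two hardware threads resident,
# using the figures quoted above (64 KiB per GCN SIMD / per GM2xx partition).
def regs_per_work_item(rf_bytes, work_items_resident, bytes_per_reg=4):
    return rf_bytes // (work_items_resident * bytes_per_reg)

print("GCN SIMD, 2 waves of 64 :", regs_per_work_item(65536, 2 * 64))  # 128 registers
print("GM2xx,    2 warps of 32 :", regs_per_work_item(65536, 2 * 32))  # 256 registers
```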

NVidia has taken this seriously. The biggest change I want to see in the GCN core is more RF per work item. Ideally 4x more. Yes, we're then talking about RF area coming close to the ALU area.

I expect it not to happen.
 
what ? ..,.. im not sure you are knowing OpenGL so-- well specially OpenGL 4.0 .. But thats another story, so maybe we will not go so far.
Just a friendly suggestion - you need to spend some time working on aspects of your posting. While I am certain that in various circles this kind of unfounded nonsense is cool and everybody high-fives everybody for it and for outlining the evil green company, B3D holds itself to a higher standard. This kind of random made-up crud that spawns across the Internetz echo chamber is not really welcome. This is just an example, but this is a constant theme with your posting, so please stop doing it. The exchange is particularly ridiculous (and shows just how the sleep of reason sometimes gives birth to monsters) considering 3dcgi's...err...lineage.
 
If AMD were to double or quadruple the RF size per GCN SIMD, this latency-hiding argument in favour of VLIW would disappear.

64KB to be shared by a minimum of 128 work items (two hardware threads) is simply too little. Note that NVidia's GM2xx GPUs have 64KB shared by 64 work items (two hardware threads of 32 work items) in the worst case, which makes NVidia much more robust.

NVidia has taken this seriously. The biggest change I want to see in the GCN core is more RF per work item. Ideally 4x more. Yes, we're then talking about RF area coming close to the ALU area.

I expect it not to happen.

If I'm not mistaken, the density of SRAM cells is bound by transistor size more than by the metal stack. Since FinFET processes improve density for transistors far more than for metal wires, the density of SRAM cells should improve much more than that of logic.

So this might be an opportune time to increase the SRAM:logic ratio.
 

That's not what this document says. GCN Compute Units have a scalar unit that can be used for various things, but most of the execution units are 16-wide SIMDs. So it's a scalar + SIMD architecture, somewhat like a modern CPU.

However, when programming it you can think of it as a scalar, multithreaded architecture, because it can use predication for branching.
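A minimal sketch of what that predication looks like, in plain Python rather than anything GCN-specific: both sides of a branch are evaluated for all lanes, and a per-lane execution mask decides which result each lane keeps, so each "thread" still appears to follow scalar control flow.

```python
# Toy model of per-lane predication on a SIMD: both branch paths execute,
# and an execution mask selects which lanes commit which result.
def simd_if_else(xs):
    mask = [x > 0 for x in xs]                 # "compare" sets the exec mask
    then_vals = [x * 2 for x in xs]            # then-side, computed for all lanes
    else_vals = [-x for x in xs]               # else-side, computed for all lanes
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

print(simd_if_else([3, -1, 0, 5]))   # [6, 1, 0, 10] -- each lane looks scalar
```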
 
That's not what this document says. GCN Compute Units have a scalar unit that can be used for various things, but most of the execution units are 16-wide SIMDs. So it's a scalar + SIMD architecture, somewhat like a modern CPU.

However, when programming it you can think of it as a scalar, multithreaded architecture, because it can use predication for branching.


Interesting!
 
On GCN, you have 4 SIMDs, each with a private register file that's one-quarter of the CU's total register file capacity. So when sharing out the 4 hardware threads, each SIMD ends up with a single hardware thread. So when any of those hardware threads issues a memory request, the SIMD it was running on falls idle. Therefore latency has not been hidden.

Got it, thanks. So all else equal (wavefront size, max registers per thread, total register file capacity, etc.), VLIW can support more threads per execution unit and hence hide more latency.
 
VLIW tends to look good on paper and in theory. They are perennially revived for that particular reason.
They also tend to come back when a team is constrained in the resources available for implementing or validating a design, since VLIW leaves the hardware pretty dumb. That also ties into the other motivation: when the hardware budget is constrained, dumb hardware is in the best case smaller, so it saves area and power.
GPUs facing node stagnation and power ceilings are where limits become more obvious, and one product line adjusted its architecture while another one did not.

Nvidia's latest architecture has shifted somewhat back to a statically-scheduled ISA. Bypassing and dependence information are encoded in the software stream, and the FP16 solution that the mobile version and the next gen use encapsulates two operations in a word.
The long-term benefit or cost of this approach, particularly in compute, remains to be seen. It has not proven to be a misstep relative to GCN, and may only be an incremental cost when it comes to the difficulty of handling the onrush from x86.

There seems to be real utility in giving the hardware readily available context information, and a limited level of instruction grouping has been fine for CPUs (U/V pipes, 4+1+1+1, etc.).
GCN's architecture forces a regimented cycle model and a form of trivial coherence that point to limited hardware closer to VLIW's, while presenting to software a more conventional architecture that has lost much of the context information its still-not-smart compiler could work with. The expanding list of wait states needed for certain operations shows where this veil is being stretched too thin.
It uses more threads to simplify software scheduling and latency hiding, but at a hardware level that means it is keeping more silicon awake.
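On the "two operations in a word" point: as I understand it this is vec2 FP16, where a pair of half-precision values shares one 32-bit register and a single instruction works on both halves. A rough illustration of the packing (my own, using NumPy):

```python
import numpy as np

# Two FP16 values packed into one 32-bit word; a vec2-FP16 instruction
# operates on both halves of each such register at once.
a = np.array([1.5, 2.25], dtype=np.float16)
b = np.array([0.5, 4.0],  dtype=np.float16)

print(hex(a.view(np.uint32)[0]))   # the pair occupies a single 32-bit word
print(a + b)                       # one "instruction" updates both halves
```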

If I'm not mistaken, the density of SRAM cells is bound by transistor size more than by the metal stack. Since FinFET processes improve density for transistors far more than for metal wires, the density of SRAM cells should improve much more than that of logic.

Sufficient transistor performance can also increase density by saving on the periphery of the SRAM arrays. TSMC's enhanced version of 16nm seems to be taking advantage of this.

http://semiengineering.com/ibm-intel-and-tsmc-roll-out-finfets/

SRAM speed has been improved by greater than 25%. “This improvement allows the use of a 512 bits per bit-line scheme instead of a 256 bits per bit-line scheme to reduce the periphery circuit size,” according to TSMC.
 
However, when programming it you can think of it as a scalar, multithreaded architecture, because it can use predication for branching.
Umm, aren't GPUs still under the SIMT paradigm? And according to you and the doc Razor provided, SIMT on GCN is 16 wide on the hardware side of things, and four of those 16-wide vector units make up part of a CU?
 
AMD are currently beaten by the 9xx cards at lower resolutions, so lower CPU overhead would make a difference; it would also help at the lower end, where they aren't playing well with an i3.
What does OS driver overhead have to do with improving the compute performance of GCN cores?

Obviously, but it may still be a relatively easy job compared to trying to fit genuinely new blocks like the TrueAudio DSPs in there
Even if so, it doesn't matter because most time and money go into setting up production and getting good yields.

I wouldn't agree with the way you read that (supposed) slide, especially since all the GCNs (at least 1.1+) already do FL 12_0 with resource binding tier 3
GCN 1.1+ GPUs take ~2% of all Direct3D 12 capable GPUs at best, according to the April 2015 Steam Survey (hard to tell the exact share since AMD does not differentiate individual models in OEM strings, but all R7 200 and R9 200 cards combined take ~2.5%). Fiji is going to be an even more niche product - if it really has FL 12_1 and new rendering features, what would be the reason for AMD to conceal this?

The AMD shader compiler is the biggest issue for GCN on PC. Current generation consoles have roughly the same GPU architecture as the AMD PC GPUs, making it easy for the developers to compare the compiler results (instruction counts, GPR counts, etc). A new (fully rewritten) shader compiler would be a big step in catching up with NVIDIA.
Don't the PS4/XBOne use the same AMD source code for their implementations?

The tessellation and geometry shader design is still bad in GCN. I know this is not an easy issue to solve, but it currently makes geometry shaders useless and limits tessellation usage to small amplification factors.
What do you mean "shader design"? The AMD compiler/optimizer is inefficient for domain/hull/geometry HLSL profiles of the IML bytecode, or the GCN instruction set is lacking instructions or registers to perform these shaders efficiently?
 