AMD: Speculation, Rumors, and Discussion (Archive)

Status
Not open for further replies.
There are wait counts for the various memory accesses and exports, but NOP or independent instruction padding handles the class of operations that involve sourcing or forwarding between vector and scalar elements. It would be an unenviable bug to have to trace through for a problem that dynamically may or may not occur based on what SIMD the scheduler chooses. The alternative can be an excessively pessimistic compiler that inserts enough NOPs to satisfy the worst case where a wait state of several 4-cycle vector issues gets multiplied by a 4x scalar issue rate.
GCN ISA already includes moves twixt scalar/vector registers :D for cases that don't involve execution masks. The code is quite explicit already, so there's no bug lying in wait because some timings are different.
 
GCN ISA already includes moves twixt scalar/vector registers :D for cases that don't involve execution masks. The code is quite explicit already, so there's no bug lying in wait because some timings are different.
EXEC-related cases would be some of the cases where NOPs or independent instructions are expected. There are shorter waits for vector writes to SGPRs that are then used as vector sources. I did not mention some of the shorter delays that may or may not show up, like those for RAW hazards with DPP instructions. Whether there is a wall-clock delay that is being hidden by the 4-cycle cadence may factor into what happens.
 
The 4 cycle cadence makes everything work (current GCN):

e.g. SIMDs 0 to 3 each have 6 hardware threads: the SALU on cycle 0 executes an instruction for SIMD0 thread 0, cycle 1 might be SIMD1 thread 3, cycle 2 might be SIMD2 thread 5 and cycle 3 SIMD3 thread 1. Meanwhile, during precisely the same 4 cycles, the SIMDs are executing (for example): SIMD0 thread 1, SIMD1 thread 0, SIMD2 thread 4 and SIMD3 thread 0.

The SALU executes four different threads over 4 consecutive cycles, while the four VALUs execute 4 different threads over those same cycles, a set disjoint from the threads live in the SALU.

As soon as we start talking about a CU-wide RF then the timings have to become more slack, because each SIMD requires an operand collector (and scatter unit). The code doesn't need NOPs to cover this since any available alternative hardware thread can execute in the slack. If there are no alternative hardware threads, then it's just a stall. You don't need an explicit NOP to make that happen.
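The cadence described above can be sketched as a toy scheduler. The per-cycle thread picks are made up; the only invariant the sketch encodes is the SALU/VALU disjointness the post describes, not any actual hardware behavior:

```python
# Toy model of the 4-cycle issue cadence: one SALU shared by four SIMDs,
# each SIMD holding 6 hardware threads.  In one 4-cycle window the SALU
# issues for SIMD0..SIMD3 in turn, while each SIMD's VALU works on a
# thread of its own.  The invariant is only that a SIMD's VALU thread
# differs from the SALU's pick for that SIMD.
def schedule_window(salu_picks, valu_picks):
    """Each list gives the thread index (0..5) chosen per SIMD 0..3."""
    for simd in range(4):
        assert salu_picks[simd] != valu_picks[simd], "thread collision"
    # (cycle, SALU thread for SIMD <cycle>, VALU thread on SIMD <cycle>)
    return [(c, salu_picks[c], valu_picks[c]) for c in range(4)]

# The example from the post: SALU walks SIMD0 t0, SIMD1 t3, SIMD2 t5,
# SIMD3 t1 while the VALUs run a disjoint set of threads.
window = schedule_window([0, 3, 5, 1], [1, 0, 4, 0])
```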
 
The 4 cycle cadence makes everything work (current GCN):
It makes everything work except where the ISA document requires that software insert wait states. These can be mandated by undetected accesses to specially-aliased registers like EXEC, by a lack of forwarding, or by some other race condition where data movement takes longer than 4 clock cycles to finish and no existing wait instruction can halt further issue.

The multiple-length SIMD scheme provides for a complete breaking of the cadence, which matters when the scheme is supposed to be transparent to software. GCN as we know it is fragile in this case, and requires static scheduling that can either be pessimistic or be broken by the front end. Something might have to change in GCN as it will be in order to support this.

As soon as we start talking about a CU-wide RF then the timings have to become more slack, because each SIMD requires an operand collector (and scatter unit). The code doesn't need NOPs to cover this since any available alternative hardware thread can execute in the slack. If there are no alternative hardware threads, then it's just a stall. You don't need an explicit NOP to make that happen.
The multiple-length SIMD patent does not give a CU-wide RF, and even provides for the scenario where a wavefront can split across multiple SIMDs, based on how the front-end is able to pack lanes.
As far as requiring an operand collector, I may have skimmed past that being mentioned.
 
Can't compare bandwidth with desktop parts, as desktops don't have to deal with memory contention the way the PS4 has to. The GDDR5 on the PS4 has to service both the CPU and GPU, so effective bandwidth for the PS4 will generally be lower than the number stated.

Regards,
SB
We aren't talking about a 32-core 4GHz workstation CPU, but a low-power, conservatively-clocked derivative of a mobile low-end architecture. It may consume about 5% of the total bandwidth at full load, maybe even less.
 
We aren't talking about a 32-core 4GHz workstation CPU, but a low-power, conservatively-clocked derivative of a mobile low-end architecture. It may consume about 5% of the total bandwidth at full load, maybe even less.

It's currently allocated more than 5% (20-30 GB/s IIRC). But regardless of that, there is still memory contention over and above the bandwidth allocated to each subsystem as the entire pool of memory must be accessible by either the CPU or GPU at any given time. Switching between CPU access and GPU access incurs overhead. That doesn't exist for current desktop GPUs where memory is exclusive to the GPU.
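For scale: taking the PS4's nominal 176 GB/s GDDR5 figure and the 20-30 GB/s CPU allocation quoted from memory above, the CPU slice alone is well over 5% even before any turnaround overhead:

```python
# Rough arithmetic on the contention point: if 20-30 GB/s of the PS4's
# nominal 176 GB/s is reserved for the CPU, that allocation alone is
# 11-17% of the pool, before counting any CPU<->GPU switching overhead
# on the shared bus.
TOTAL_BW = 176.0                      # GB/s, PS4 GDDR5 (nominal figure)
cpu_alloc = (20.0, 30.0)              # GB/s, range quoted in the thread
shares = [round(100 * a / TOTAL_BW, 1) for a in cpu_alloc]
print(shares)  # → [11.4, 17.0], i.e. 11-17% of total bandwidth
```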

Regards,
SB
 
Raw GPU performance is less than 6% higher than the R9 380X, while the bandwidth is 20% higher.

Once you add in the CPU bandwidth and inefficiencies that contention introduces they're probably pretty close in compute/bandwidth ratio. Which to me means it's probably pretty well balanced.
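A back-of-envelope check of those percentages, assuming the commonly cited R9 380X specs (~3.97 TFLOPS, 182.4 GB/s) and the rumored upgraded-console figures (4.2 TFLOPS, 218 GB/s) — all of these numbers are assumptions, not confirmed specs:

```python
# Sanity-check the "<6% compute, ~20% bandwidth" claim and compare the
# compute-per-bandwidth ratios of the two parts.
r9_380x = {"tflops": 3.97, "bw": 182.4}   # bandwidth in GB/s
console = {"tflops": 4.20, "bw": 218.0}   # rumored figures

gpu_gain = console["tflops"] / r9_380x["tflops"] - 1   # ≈ 0.058 → <6%
bw_gain  = console["bw"] / r9_380x["bw"] - 1           # ≈ 0.195 → ~20%

# GFLOPS per GB/s: the console sits slightly lower, i.e. it has a bit
# more bandwidth headroom per FLOP, which CPU contention then eats into.
ratio_380x   = 1000 * r9_380x["tflops"] / r9_380x["bw"]   # ≈ 21.8
ratio_console = 1000 * console["tflops"] / console["bw"]  # ≈ 19.3
```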

All of a sudden, Polaris 11 doesn't offer "console-level performance" in a subnotebook form factor.

Although that was in reference to 3-year-old consoles (by the time this actually launches). Another way of looking at it would be to say that Polaris 11 offers performance equivalent to (or possibly greater than) a brand-new, state-of-the-art console release in a notebook form factor (~100 W). I think that might be a first ever for a console launch.
 
4 consecutive cycles
Definition of "cycles" probably complicates things. Paper mentioned dynamic clocks in addition to the cadence changing. Could be a fixed 2/4x multiplier to keep things in sync, but SIMDs are probably clocked asynchronously. Might be why they made no mention of partial 16 ALU SIMDs as an 8x multiplier is probably pushing the limits.

My understanding of the paper was that SIMD and scalar were mutually exclusive under most conditions. The SIMD hits a sync() and the scalar does its thing for a while. Prefetching highlights that it would be possible, but they ran further ahead. They may also have a scalar per SIMD and per CU with slightly different capabilities/clocks.
 
GCN as we know it is fragile in this case,
No, it always works.

If operands have to be moved to ALUs (either just in time or in bulk) then it's a variant of the already existing mechanisms that GCN has for situations where static analysis of code indicates that state needs time to settle before an ALU can consume operands.

The most obvious example of this is simply a change in execution mask, which could result in a decision to move state to a different ALU.

Consistency for read/write of the execution mask is affected at the time that state is moved around the CU, but otherwise there's no difference from GCN as it currently is.

I can't think of a more finely-grained consistency problem that is introduced specifically by any of these patent documents.

e.g. imagine that state movement is effected by a variant of spill-out and spill-in instructions where instead of an off-die implicit target memory address in global memory, the target is an explicit place in another register file within the CU. Predicated spill-out/spill-in with lane indexing should do the job. SALU would have instructions to do lane index magic to feed into the spill instructions.
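That spill-out/spill-in idea can be sketched on a toy register file. Everything here — the RF layout, the predicated copy, the lane-index compaction — is hypothetical illustration of the mechanism just described, not anything taken from the patents:

```python
# Hypothetical sketch of predicated spill-out/spill-in between two
# register files inside a CU, with SALU-computed lane indexing deciding
# where each live lane lands in the destination file.
def spill_move(src_rf, dst_rf, exec_mask, lane_map):
    """Copy lanes enabled in exec_mask from src_rf into dst_rf at the
    positions given by lane_map (computed by the SALU's 'index magic')."""
    for lane, value in enumerate(src_rf):
        if exec_mask & (1 << lane):
            dst_rf[lane_map[lane]] = value
    return dst_rf

def compact_map(exec_mask, width):
    """Pack live lanes densely from lane 0 of the destination file."""
    mapping, next_slot = {}, 0
    for lane in range(width):
        if exec_mask & (1 << lane):
            mapping[lane] = next_slot
            next_slot += 1
    return mapping

# 8-lane example: lanes 1, 4 and 6 are live and get packed into slots 0..2.
src = [10, 11, 12, 13, 14, 15, 16, 17]
mask = 0b0101_0010
dst = spill_move(src, [0] * 8, mask, compact_map(mask, 8))
print(dst[:3])  # → [11, 14, 16]
```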

Something might have to change in GCN as it will be in order to support this.
Yes, I would expect the ISA to be extended to cover the specific new cases. I can't think of any reason why existing code would just break. At least none that can't be blamed on the compiler people.

The multiple-length SIMD patent does not give a CU-wide RF, and even provides for the scenario where a wavefront can split across multiple SIMDs, based on how the front-end is able to pack lanes.
As far as requiring an operand collector, I may have skimmed past that being mentioned.
It's pretty glib about that problem.

Moving any kind of bulk state around is in my view the biggest reason why this is all a land grab: so that actually all we are left with is ALUs that can power-off lanes.

And that can only work if intra-hardware-thread lane reassignment is performed. Something that the cross-lane permutation hardware in GCN 3 already does, mostly.

In other words, I'm betting against variable width SIMDs. The most I'm expecting is that SALU inherits the full VALU instruction set (I expect there are some instructions that aren't worth porting over). In which case the bandwidth/latency problems involved in moving state back and forth between per-VALU RFs and the SALU RF are so minor that it isn't worth all this debate. Certainly none of these patents.

Yes, there's devil in the details of the SALU ISA, e.g. will it have FMA or will it use a macro to do FMA in two successive instructions?
 
Improved AMD GCN

What kind of hardware improvement, probably borrowed from Polaris, is possible while keeping perfect hardware compatibility with regular PS4 games? Color compression? What else?
 
What kind of hardware improvement, probably borrowed from Polaris, is possible while keeping perfect hardware compatibility with regular PS4 games? Color compression? What else?
Primitive Discard Accelerator could be big and transparent. Prefetching. ASTC for textures which could be big. Would require some console specific content and repackaging, but not unreasonable. Could also be better color compression for ROPs.
 
No, it always works.

If operands have to be moved to ALUs (either just in time or in bulk) then it's a variant of the already existing mechanisms that GCN has for situations where static analysis of code indicates that state needs time to settle before an ALU can consume operands.
I wouldn't say it always works if the instruction stream can fail to pad, or perhaps purposely not pad, with NOPs or independent instructions and thereby yield incorrect or undefined behavior. The usual bare minimum for that kind of description is that the hardware would at least stall.

I can't think of a more finely-grained consistency problem that is introduced specifically by any of these patent documents.
The multi-length patent's provision for a high-performance scalar unit that explicitly abandons the 4-cycle cadence changes the implicit number of cycles the earlier static analysis would depend on. The motivation of the patent is the capture of information that cannot be determined by the compiler, and so defies the use of static analysis for this purpose in general.

If a 4-cycle operation dependence requires 5 NOPs or independent operations, that is 20 cycles of settling time. The high-performance scalar unit that a normal wavefront can be switched to on the fly quadruples the issue rate in terms of actual cycles.
The compiler could cover it, if it took the onerous step of adding 20 NOPs in the off-chance the hardware might dynamically subvert the wait states.
The other smaller SIMDs may also change the cadence on the fly, if the patent is to be believed, with varying levels of pessimism required of the compiler.
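The arithmetic in that example: a wait measured in issue slots has to be re-expressed in raw cycles once a unit can issue every cycle. The numbers are the illustrative ones from the post, not from any spec:

```python
# Worst-case NOP padding when the issue cadence is no longer fixed.
# A hazard needing 5 issue slots of settling on a 4-cycle-cadence unit
# spans 20 raw cycles; a unit issuing every cycle needs 20 NOPs to
# cover the same wall-clock window.
def nops_needed(wait_slots, slot_cycles, issue_every_n_cycles):
    settle_cycles = wait_slots * slot_cycles
    return settle_cycles // issue_every_n_cycles

print(nops_needed(5, 4, 4))  # 4-cycle vector cadence → 5 NOPs
print(nops_needed(5, 4, 1))  # 1-cycle scalar issue   → 20 NOPs
```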

In other words, I'm betting against variable width SIMDs. The most I'm expecting is that SALU inherits the full VALU instruction set (I expect there are some instructions that aren't worth porting over). In which case the bandwidth/latency problems involved in moving state back and forth between per-VALU RFs and the SALU RF are so minor that it isn't worth all this debate. Certainly none of these patents.

Yes, there's devil in the details of the SALU ISA, e.g. will it have FMA or will it use a macro to do FMA in two successive instructions?
It would need the ability to source an additional operand if FMA were fully implemented. LDS sourcing is another thing the current SALU cannot do, either.
Getting all that into place might make the next step of creating a pipelined FMA unit incremental in complexity. Keeping it as a singular operation would avoid the need to track macro progression in case the GPU opts to preempt things in the next 15 cycles.
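One concrete reason a two-instruction FMA macro is not equivalent to a fused op is the extra rounding of the intermediate product. A float32 emulation in Python, with values chosen to expose the double rounding — this is a numerical illustration, not a claim about SALU behavior:

```python
import struct

def f32(x):
    """Round a Python float to single precision, as a 32-bit ALU would."""
    return struct.unpack('f', struct.pack('f', x))[0]

a = f32(1.0 + 2**-12)
b = a
c = f32(-(1.0 + 2**-11))

# Two-instruction macro: the product is rounded to float32 before the add.
macro = f32(f32(a * b) + c)

# Fused: one rounding at the end (the double-precision product is exact here).
fused = f32(a * b + c)

print(macro, fused)  # → 0.0 5.960464477539063e-08
```

The macro loses the 2^-24 term of the product to the intermediate rounding and returns 0, while the fused version keeps it.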

What kind of hardware improvement, probably borrowed from Polaris, is possible while keeping a perfect hardware compatibility with regular PS4 games? Color compression? What else?
Frame buffer compression would be helpful. There's also the possibility that "improved GCN" is the PS4's original IP improvements over the original GCN.
 
Videocardz has an AMD roadmap showing the Polaris line split evenly between Polaris 10 and Polaris 11.

AMD-Radeon-2016-2017-Polaris-Vega-Navi-Roadmap.png
 
I wouldn't say it always works if the instruction stream can fail to pad, or perhaps purposely not pad, with NOPs or independent instructions and thereby yield incorrect or undefined behavior. The usual bare minimum for that kind of description is that the hardware would at least stall.
The compiler knows how to keep the instruction stream valid.

The multi-length patent's provision for a high-performance scalar unit that explicitly abandons the 4-cycle cadence changes the implicit number of cycles the earlier static analysis would depend on. The motivation of the patent is the capture of information that cannot be determined by the compiler, and so defies the use of static analysis for this purpose in general.
The static analysis doesn't depend on the 4 cycle cadence - it depends on a hardware thread having a constrained set of issues. SALU and VALU are not co-issued from the same hardware thread. The hazards caused by state being in one or the other ALU and required in the following instruction by the other ALU are fully understood.

Any time some hypothetical new architecture has to decide whether to move state to another ALU, purely for the sake of efficient scheduling, a barrier placed by the compiler will show the hardware where and how it should schedule the move. The possible places for a barrier are really quite limited: anywhere there's a branch and anywhere there's a hazard.

You forget that LDS and global memory operations have variable settling times and there is no consistency problem experienced there. The barriers for these operations are placed there by the compiler too, despite the fact that the compiler doesn't know the settling time.

Any time the execution mask can change, a barrier, which implies a context switch, prevents a blind run through a hazard.
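The point about variable settling times rests on GCN's counting waits: memory ops bump a counter at issue and decrement it on completion, and the compiler emits a wait on the counter value, never on a cycle count. A simplified model of the s_waitcnt idea (not the real encoding):

```python
# Simplified model of counting waits: the compiler doesn't know how long
# a load takes, only how many loads are outstanding, so barriers are
# expressed as "wait until at most N remain in flight".
class Waitcnt:
    def __init__(self):
        self.outstanding = 0
    def issue_load(self):
        self.outstanding += 1        # incremented when the load issues
    def complete_load(self):
        self.outstanding -= 1        # decremented whenever the data lands
    def wait(self, at_most):
        # In hardware this stalls the wave; here we just report whether
        # the dependent instruction may proceed.
        return self.outstanding <= at_most

cnt = Waitcnt()
cnt.issue_load(); cnt.issue_load()
assert not cnt.wait(at_most=1)   # two loads in flight: must stall
cnt.complete_load()
assert cnt.wait(at_most=1)       # one left: dependent instruction may issue
```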

If a 4-cycle operation dependence requires 5 NOPs or independent operations, that is 20 cycles of settling time. The high-performance scalar unit that a normal wavefront can be switched to on the fly quadruples the issue rate in terms of actual cycles.
The compiler could cover it, if it took the onerous step of adding 20 NOPs in the off-chance the hardware might dynamically subvert the wait states.
The other smaller SIMDs may also change the cadence on the fly, if the patent is to be believed, with varying levels of pessimism required of the compiler.


It would need the ability to source an additional operand if FMA were fully implemented. LDS sourcing is another thing the current SALU cannot do, either.
Getting all that into place might make the next step of creating a pipelined FMA unit incremental in complexity. Keeping it as a singular operation would avoid the need to track macro progression in case the GPU opts to preempt things in the next 15 cycles.
It would be ironic if the way execution is switched to a scalar ALU is via a SALU-specific code path generated by the compiler:

Code:
for each x
    if bit_count(exec_mask) > 1 then
        [VALU loop code]
    else
        [SALU loop code]

:p

Then the macro would just be normal code and preemption wouldn't make any meaningful difference.

This would also allow the compiler to put in explicit MOV instructions to get data to the required ALU :runaway:
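The branch in that pseudocode comes down to a popcount on the execution mask; a literal Python rendering of the dispatch, with the loop bodies left as placeholders:

```python
# Literal rendering of the pseudocode above: send each iteration down a
# "VALU" path when more than one lane is live, else down a "SALU" path.
def popcount(mask):
    return bin(mask).count("1")

def run_loop(exec_masks):
    trace = []
    for mask in exec_masks:          # "for each x"
        if popcount(mask) > 1:
            trace.append("VALU")     # [VALU loop code] placeholder
        else:
            trace.append("SALU")     # [SALU loop code] placeholder
    return trace

print(run_loop([0b1111, 0b0100, 0b0011, 0b0000]))
# → ['VALU', 'SALU', 'VALU', 'SALU']
```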
 
But using a MOV instruction, what kind of latency are you going to incur by doing that? Moving from a register to memory sounds like it will introduce quite a waiting period. I can see it working if it's planned for... but on the fly just doesn't seem likely.
 
227% more GPU throughput and a 31% higher CPU clock, relying on only a 23% bandwidth boost. Probably an inevitable compromise for a mid-term upgrade.
Looking at existing GPUs, that sounds OK. A big missing part of the picture is the number of ROPs, which is of great importance when it comes to pushing out lots of pixels.
 