AMD: Speculation, Rumors, and Discussion (Archive)

Status
Not open for further replies.
NVidia's arrangement of SP and DP SIMDs within an SM is already similar to this in some ways.

This document talks about each SIMD having an associated register file. And the associated RF may only contain a subset of the register data required for execution. But this is likely irrelevant, since a loop can be thousands of cycles long and encompass 80%+ of the entire RF allocation per work item.

Also, putting register data close to a SIMD is costly if that data was previously held close to another SIMD: it involves moving the data around between the compute unit's SIMDs.

If this document describes a real implementation (can't help feeling sceptical), these ALUs are likely to be of the same four-cycle-duration-per-instruction design as current ALUs, just because that is in itself an energy saving.

The current design's scalar ALU is already highly functional, but it doesn't have the full set of math instructions seen in current SIMDs. It could be expanded to fulfil that role. If it were adapted then it could become overloaded though. This ALU is still required to evaluate execution masks (i.e. run conditional instructions). But it could be argued that low-latency evaluation of conditional instructions is not a top priority, since a resulting change in execution mask causes a data movement from one RF to another RF.

A change in execution mask has a worst case latency of 16 cycles in GCN, I believe (need to check).

If we're talking about moving data from SIMD to SIMD (RF to RF) then I can't help wondering if it would be simpler to have a thin RF design (like Larrabee) and rely upon L1 cache for thread state instead. A 64 byte cache line is 16 RF entries.
 
I think the most relevant part of this invention would be this:
[attached image: diagram from the patent]

You won't need to move data around much more than now, but you can selectively disable parts of each SIMD Unit in order to save power. This will cost a bit of die area for the gating mechanisms, but given the potential density improvements for 14/16 nm, it could be well worth it to save power - part of Polaris' Power Miracle. :)
 
I'm not the resident µarch-expert around here, so my guess is as good as yours. But probably yes - if it's not already in Polaris as well.
 
Carsten, I suspect you're right.

Following on from that, I suspect that while writing a patent application for what you have highlighted, they broadened the application to include the other stuff relating to a collection of varied-width SIMDs and moving context (registers) to the correct SIMD. This other stuff isn't what they intended to patent, but they thought they might as well grab those concepts since they're nearby.

Applying this technique to GCN is potentially problematic though. Each SIMD is 16 lanes and a single instruction takes four clock cycles of SIMD execution. So, imagine we have a wavefront of 64 work items in which work items 32 to 63 (base work item is 0) have 0 in their execution mask, i.e. the second half of the wavefront is not being executed.

This corresponds with clock cycles 3 and 4 of the SIMD, where the entire SIMD is "switched off". This looks good: yay we've saved half the power. But when the next instruction comes along the SIMD needs to turn on again: for cycles 1 and 2 of the next instruction. Then off for cycles 3 and 4. Then instruction 3 comes along and the SIMD needs to be on for the first 2 cycles then off for the next 2 cycles.
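To make the toggling concrete, here's a tiny simulation. The lane-to-cycle mapping (work items 16*k..16*k+15 issue in cycle k) is my assumption about the cadence, not something the patent spells out:

```python
SIMD_WIDTH = 16
CYCLES_PER_INSTR = 4

def simd_active_per_cycle(exec_mask, n_instructions):
    """For each clock cycle, report whether any lane of the 16-wide
    SIMD has useful work, given a 64-item wavefront's execution mask."""
    timeline = []
    for _ in range(n_instructions):
        for cycle in range(CYCLES_PER_INSTR):
            lo = cycle * SIMD_WIDTH
            timeline.append(any(exec_mask[lo:lo + SIMD_WIDTH]))
    return timeline

# Work items 0..31 active, 32..63 predicated off:
mask = [True] * 32 + [False] * 32
print(simd_active_per_cycle(mask, 3))
# -> [True, True, False, False] * 3: on for 2 cycles, off for 2, repeatedly
```

So the gate would have to flip every two cycles, instruction after instruction, which is why gating at anything like instruction rate looks hard.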

Can power gating run at base clock rate? It seems extremely unlikely to me!
 
What you describe, i.e. gating at base frequency, would be the best case and I agree that it seems highly unlikely, although it could save LOADS of power. But from what I read in the patent, it seems to apply to wavefronts with threads that are predicated off for good at some point, in which case you could switch off the remaining SIMD lanes, maybe even with a slight delay (8, 16 or more cycles), and still save some power.

Of course, you still need to be able to decide whether it's feasible to just continue running this on the original, 16-wide configured SIMD (for example if all registers are readily populated and you just need to execute one more ADD on all of them), or if the remaining instructions are numerous enough (and maybe have a few hazards in them as well, so that they would be put back in the queue anyway) to warrant a reconfigured SIMD for execution.

Now I am not sure about this, but I also think that it could work slightly differently as well: you don't switch parts of the SIMD off, but instead have variable wavefront widths from the start, so wavefronts stay fully populated.

Edit:
Paragraphs 39, 40 and 41 are very interesting as well: configuration updates every cycle, a fully-configured scalar unit (you mentioned that!), and scheduling to narrower (configured) SIMDs in case of memory bottlenecking. I think this is a very interesting patent with respect to Polaris and Vega!
 
What you describe, i.e. gating at base frequency, would be the best case and I agree that it seems highly unlikely, although it could save LOADS of power. But from what I read in the patent, it seems to apply to wavefronts with threads that are predicated off for good at some point, in which case you could switch off the remaining SIMD lanes, maybe even with a slight delay (8, 16 or more cycles), and still save some power.
Yes, agreed, that is the fundamental idea of the patent: that one or more lanes within the SIMD are turned off for more than a single cycle to save power. The classical example of this is where a loop is bounded by a test whose result varies for each work item:

while (pixel.x < 1)
.... stuff that lasts for lots of cycles

And each time you start an iteration of the loop, the set of work items that passes the test changes. The size of the set could grow or shrink or stay the same. So the execution mask (which specifies which work items are required to compute a result) varies each time a new iteration of the loop starts.
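As a toy illustration of that mask churn (the integer values and the step function are arbitrary stand-ins for shader state, not anything from the patent):

```python
def loop_masks(values, limit, step, max_iters=10):
    """Execution mask at the top of each iteration of a divergent loop
    like `while (x < limit)`; items drop out at different iterations."""
    masks, vals = [], list(values)
    for _ in range(max_iters):
        mask = [v < limit for v in vals]
        if not any(mask):
            break
        masks.append(mask)
        # Only active items execute the loop body.
        vals = [step(v) if m else v for v, m in zip(vals, mask)]
    return masks

print(loop_masks([0, 1, 2, 4], limit=3, step=lambda v: v + 2))
# -> [[True, True, True, False], [True, False, False, False]]
```

Each iteration starts with a different mask, so any lane gating has to be re-evaluated at loop-iteration rate.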

But GCN runs an instruction over multiple cycles, which means that an ALU lane would toggle between on and off at a much higher frequency than either the loop iteration (which could be 20 cycles long, say, corresponding with 5 instructions) or the instruction frequency (every 4 cycles).

The patent mentions the problem I described in paragraph 35. That requires varying-width SIMDs, though. It also requires that operand collection/resultant scatter be performed. Which is where we get into the subject of moving RFs from one SIMD to another.

Alternatively, all the SIMDs share a RF. But then each SIMD requires at least an operand collector.

It's interesting that the patent doesn't explicitly talk about a single cycle of execution for a SIMD.

Of course, you still need to be able to decide whether it's feasible to just continue running this on the original, 16-wide configured SIMD (for example if all registers are readily populated and you just need to execute one more ADD on all of them), or if the remaining instructions are numerous enough (and maybe have a few hazards in them as well, so that they would be put back in the queue anyway) to warrant a reconfigured SIMD for execution.

Now I am not sure about this, but I also think that it could work slightly differently as well: you don't switch parts of the SIMD off, but instead have variable wavefront widths from the start, so wavefronts stay fully populated.
Yes, it's entirely possible that a wavefront spends its entire life only having 4, say, work items. So a wavefront that was defined as 4-wide would run on the scalar ALU over four cycles for all of its instructions (could be hundreds of instructions).

I've noticed that paragraph 38 indicates a setup that is very close to current GCN which has five ALUs (four 16-lane SIMDs and one scalar):

three 16-wide SIMDs + four 4-wide SIMDs + four scalar ALUs = 68 lanes

Here the 16-wide SIMDs are sort of the same as three of the four SIMDs in GCN. Then the fourth GCN SIMD is split into four 4-wide SIMDs instead. So far, that's 64 lanes, the same as GCN.

Then the four scalar ALUs compare to the single GCN scalar ALU. So that's three more scalar ALUs.

So this design is 3 ALU lanes wider than GCN. But, of course, there's much more control flow and data routing to worry about. Instead of GCN's 5 ALUs, we now have 11 ALUs.

And instead of GCN's register file being dedicated to each SIMD, with a bi-directional data path from each ALU to the compute unit's scalar ALU (and its dedicated register file), we conceptually have 11 ALUs all sharing a single register file.

So the ALU cost is only a small increment over GCN (more decoders, more narrow ALUs), but the operand gather (collection) and resultant scatter units are far more costly, since the data paths are generally very wide and you have a many-to-many routing problem to solve (a crossbar of some kind).

Running all instructions on all ALUs over 4 cycles does help with collection/scatter complexity/cost/timing though. And if a change in execution mask needs to be evaluated to determine how to route work items to SIMDs, that overhead makes the routing latency less painful: if you're waiting on one latency anyway (execution mask evaluation), adding another (routing) isn't a dramatic change in the amount of latency that needs to be hidden.

Edit:
Paragraphs 39, 40 and 41 are very interesting as well: configuration updates every cycle, a fully-configured scalar unit (you mentioned that!), and scheduling to narrower (configured) SIMDs in case of memory bottlenecking. I think this is a very interesting patent with respect to Polaris and Vega!
I can't help thinking it would be easier to re-configure GCN with smaller native wavefronts.

Currently four compute units share shader code (instruction cache) and each compute unit has four 16-wide SIMDs and a scalar ALU. So that's sixteen 16-wide SIMDs and 4 scalar ALUs that are grouped together.

So keep the SIMD count the same, sixteen, and make them 4-wide instead. With a native wavefront size of 16. Done.

If you really wanted to apply this patent then you could make the four scalar ALUs shared and general purpose. In current GCN programs, scalar ALU code and scalar registers are never normally a bottleneck. In other words GCN's current scalar ALU is over-specified.

Giving the scalar ALUs full math, and making a compute unit consist of two types of ALU (4-wide and scalar), would bring all the advantages of the patent, but would constrain the data routing (collection and scatter) problems to the scalar ALUs. In this configuration the scalar ALUs simply can't get through data fast enough to require a big fat crossbar. And each scalar ALU would be constrained to a subset of four SIMDs, further reducing the chip area over which the crossbar has to operate.
 
The patent is relatively broad in implementation possibilities, as most patents are.
I remember that back when dynamic warp formation was being discussed around the Fermi generation, some of the questions surrounding divergence handling came up. I wondered if there were a way for a GPU's broad hardware to "change gears" as it were, although it seemed at the time that the cost was in excess of the gains.

In terms of departure from the current configuration, there is a succession of possibilities that range from mainly power-efficiency plays to utilization or even peak performance plays.
The more straightforward implementation posited was a CU that can selectively gate some of its 16-wide SIMDs to play the role of at least some of the reduced-width SIMDs. It has the benefit of giving the GPU peak numbers when the situation presents itself. The diagrammed example, by contrast, where a CU increases the SIMD and unit count but loses a lot of width in aggregate, could be considered a loss in performance and power efficiency for a workload that plays well to the existing batch size.
The idea that there would be voltage or clock differences, and possibly a dynamic boost when enough of the SIMDs are gated off would be where some of the performance might be clawed back.

One way to partly hack this into the existing architecture would be to expand on the implementation of the DPP instructions that allow a limited amount of permutation between lanes. It unfortunately does not cover more than one operand, but it would allow among other things one of the methods of handling migrating lanes. Besides using special migration instructions, another possibility was using a crossbar or other interconnect to just move data from the usable register file lanes to the new ALU lane for the thread.
The power cost would be higher per-operation, but it might not be a loss for X number of instructions relative to the cost of migrating register context if the context is large and the method for moving it goes further out from the SIMDs.

Some additional context tracking might be necessary for these wavefronts, since moving to a shrunken SIMD does not change the logical lane positioning for the threads. That matters when cross-lane operations behave differently once lane 4 sits next to lane 38, which could cut either way depending on the situation.
The branch stack used by GCN might expand to keep track of some of this, and might help guide where a context might migrate.

The high-performance scalar unit would be something of a departure since it brings a wavefront's issue rate out of the 4-cycle cadence, which brought to my mind the Fibonacci benchmark used in the DX12 thread. The results would have been interesting to see.

Getting dedicated register files for specific hardware SIMD widths might provide a way to more finely allocate register capacity. A 64-lane wavefront's single software register allocation translates into 256 bytes, even if it's 63-lanes-predicated-off. After one context dump to a narrower SIMD or to the scalar unit, a more intelligent front end might know it has capacity on the rest of the SIMDs that can go to something else--if the GPU architecture doesn't go for an approach that gates off lanes or units when a specific wave cannot use all of it. It might take a multiple-issue CU to handle this.

The later posited architecture with variable SIMDs, multiple thread pools, multiple issue units, and heterogeneous unit capability seems like it might be a bit pie in the sky. However, that would give better utilization without losing as much peak performance. More widely-varied execution units might have some long-term implications for some of the simpler math and varied width hardware sections in the GPU.
 
Realignment controller using a realignment element:

http://www.freepatentsonline.com/y2015/0100758.html

A data processor includes a register file divided into at least a first portion and a second portion for storing data. A single instruction, multiple data (SIMD) unit is also divided into at least a first lane and a second lane. The first and second lanes of the SIMD unit correspond respectively to the first and second portions of the register file. Furthermore, each lane of the SIMD unit is capable of data processing. The data processor also includes a realignment element in communication with the register file and the SIMD unit. The realignment element is configured to selectively realign conveyance of data between the first portion of the register file and the first lane of the SIMD unit to the second lane of the SIMD unit.
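Read naively, the realignment element is a mux between register file portions and SIMD lane groups. A toy model (all names are mine, not the patent's):

```python
def read_operands(rf_portions, routing):
    """routing[i] names the RF portion that feeds SIMD lane group i;
    [0, 1] is the straight-through (aligned) case."""
    return [rf_portions[src] for src in routing]

rf = [["a0", "a1"], ["b0", "b1"]]    # two register file portions
print(read_operands(rf, [0, 1]))     # aligned:   [['a0', 'a1'], ['b0', 'b1']]
print(read_operands(rf, [1, 0]))     # realigned: [['b0', 'b1'], ['a0', 'a1']]
```

The "selectively realign" language in the abstract would then just mean choosing a non-identity routing for a given fetch.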

I can't find the register file cache patent application that's referred to in the text as application 13/689421.

This document goes as far as suggesting that work from distinct wavefronts is combined within the register file cache for submission to the SIMD :oops:
 
AMD Polaris 11 shows in CompuBench with 1024 shader processors

The most recent find is AMD Polaris 11, Device ID 67FF:C8, codenamed “Goose”. This would be the base GPU for several entry-level products. The CompuBench database reports that this device has 16 CUs with a maximum clock frequency of 1000 MHz. Multiply your CUs (compute units) by the number of shader processors per cluster (assuming that AMD keeps 64 per cluster) and you'll notice that Polaris 11 in this configuration has 1024 shader processors tied to a 128-bit bus and 2048 MB of memory.

Polaris 10, codenamed "Ellesmere," would then feature 2304 stream processors (36 CUs), and Vega 10 would feature 4096 stream processors (64 CUs). Things could end up looking like this:

http://www.guru3d.com/news-story/amd-polaris-11-in-shows-compubench-has-1024-shader-processors.html
 
Realignment controller using a realignment element:

http://www.freepatentsonline.com/y2015/0100758.html



I can't find the register file cache patent application that's referred to in the text as application 13/689421.

This document goes as far as suggesting that work from distinct wavefronts is combined within the register file cache for submission to the SIMD :oops:

There is a mention of a register ID stack needed for each successive remapping as a wavefront's divergence is handled. Unlike the software-visible fork and join branching method documented for existing GCN, this would seemingly need to be handled fully in hardware, since the idea is that the lanes do not know about this remapping. It would seem like the previous patent would have similar needs if it were to handle dynamic remapping.
 
http://worldwide.espacenet.com/publ...CC=US&NR=2014149710A1&KC=A1&ND=1&locale=en_EP

Methods, media, and computing systems are provided. The method includes, the media are configured for, and the computing system includes a processor with control logic for allocating memory for storing a plurality of local register states for work items to be executed in single instruction multiple data hardware and for repacking wavefronts that include work items associated with a program instruction responsive to a conditional statement. The repacking is configured to create repacked wavefronts that include at least one of a wavefront containing work items that all pass the conditional statement and a wavefront containing work items that all fail the conditional statement.
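A minimal sketch of that repacking, assuming the work items are already held at a barrier (the wave size of 4 and the tuple representation are mine; real hardware would also have to migrate register context):

```python
WAVE_SIZE = 4  # tiny for illustration; GCN wavefronts are 64 wide

def repack(waves):
    """waves: list of wavefronts, each a list of (work_item, passed)
    pairs. Returns all-pass wavefronts and all-fail wavefronts."""
    passed = [it for w in waves for (it, p) in w if p]
    failed = [it for w in waves for (it, p) in w if not p]
    chunk = lambda xs: [xs[i:i + WAVE_SIZE] for i in range(0, len(xs), WAVE_SIZE)]
    return chunk(passed), chunk(failed)

# Two half-divergent wavefronts repack into one all-pass and one all-fail:
w0 = [((0, lane), lane % 2 == 0) for lane in range(4)]
w1 = [((1, lane), lane % 2 == 0) for lane in range(4)]
print(repack([w0, w1]))
```

Each repacked wavefront then runs with a uniform execution mask, which is the whole point of the exercise.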
 
Which means the power gating is applied to the individual stages and individual registers of the ALU, and the power gate itself is cascaded across the whole width of the SIMD unit.

Summing up the stuff from the last few posts:
  1. For each lane, have a bit mask denoting which lanes (and optionally which wave) it originated from.
  2. Disabled lanes pull in operands from the right neighbor instead of from the top, repacking the operand array.
  3. (optional, but logical) Adjacent operands are compared, and for each pair of identical operands, the right one is disabled. In addition, a bit is set on the left lane to indicate which additional lanes are handled by this one.
  4. (optional, required if 3. is used) Disabled lanes pull in operands from the right neighbor instead of from the top, repacking the operand array again after deduplication.
  5. (optional) If at most 50% of the operands are filled, wait for the next batch of the same wave. Repeat until occupation is >50%, the next op is >50% filled, or a different wave arrives.
  6. For each lane, feed through the disabled bit unconditionally. Power gate all ALU stages to the right.
  7. (optional, required if 5. is used) Unpack combined waves.
  8. Unpack result operands based on the bitmask.
In addition to that, rely on the compiler detecting scalar sections and switching down to a single operand set.

In summary, when using all steps, that should:
  • Cut down power consumption to active lanes + static overhead.
  • Cut down power consumption and ALU occupation for large same-valued fields.
  • Cut power consumption and ALU occupation in half for patterns with <50% active threads (e.g. MOD 2, or checkerboard).
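The packing/unpacking steps above can be modelled in a few lines (a value-domain sketch covering steps 1-4 and 8 only; real hardware would be mux trees, and all names here are mine):

```python
def pack(operands, active):
    """Steps 1-2: drop predicated-off lanes, remembering origin lanes."""
    return [(lane, op) for lane, (op, a) in enumerate(zip(operands, active)) if a]

def dedup(pairs):
    """Steps 3-4: collapse runs of identical adjacent operand values;
    each surviving entry records every origin lane it now serves."""
    groups = []
    for lane, op in pairs:
        if groups and groups[-1][1] == op:
            groups[-1][0].append(lane)
        else:
            groups.append(([lane], op))
    return groups

def unpack(results, groups, width):
    """Step 8: scatter each computed result back to its origin lanes."""
    out = [None] * width
    for res, (lanes, _) in zip(results, groups):
        for lane in lanes:
            out[lane] = res
    return out

ops, active = [3, 3, 0, 7, 7], [True, True, False, True, True]
groups = dedup(pack(ops, active))
results = [op * 2 for (_, op) in groups]   # the ALU runs 2 lanes, not 5
print(unpack(results, groups, len(ops)))   # -> [6, 6, None, 14, 14]
```

Five lanes of work collapse to two ALU evaluations here, and the bookkeeping is just the origin-lane lists.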

Power gating individual stages of the pipeline at regular clock speed definitely isn't a problem at all, as long as you pass through the "disabled" flag so the output register of the lane is assigned a static, safe value and the resulting hazards don't accidentally propagate. I'm not aware that any of the individual pipeline stages of an ALU would need to be saturated or get into oscillation, so all of them should be safe to power up right away when they have a valid input supplied.

You don't even need to place power gates on ALL lanes to start with - as the lanes are utilized from left to right, it should be enough to gate before the 1st lane (full NOP), after the 1st lane, after the 2nd lane, and perhaps again at 50%. So you actually need only 4 large power gates per stage, instead of one per lane, to shave off most of the possible savings.
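With lanes left-packed, gate selection reduces to picking the narrowest powered region that covers all active lanes. A sketch using the boundaries suggested above (0, 1, 2, 8 lanes, else the full 16; purely illustrative):

```python
GATE_BOUNDARIES = [0, 1, 2, 8, 16]  # lanes powered left of each gate

def powered_width(active_lanes):
    """Narrowest gateable region covering `active_lanes` packed lanes."""
    for b in GATE_BOUNDARIES:
        if active_lanes <= b:
            return b
    return GATE_BOUNDARIES[-1]

print(powered_width(0))   # -> 0  (full NOP, everything gated)
print(powered_width(3))   # -> 8  (coarse gates waste a few lanes)
print(powered_width(12))  # -> 16
```

The coarse boundaries give up some savings in the middle cases but keep the gate count per stage tiny.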


No guarantee that it actually looks like that, but it's certainly possible and not even all too difficult or complex. I honestly wouldn't have expected it to look that simple.
 

Relying on the LDS to store register context and an indirection table harkens back to a paper on using an upgraded scalar unit to run the vector subset of the ISA. This would extend it to allow SIMDs to do the same, in order to allow coalescing across the wavefronts in a workgroup.

I'm curious how well this synthesizes with the in-register realignment patent, and the disparate register files in the multi-length SIMD case. The LDS is being leaned on for its scatter/gather capability, but the tradeoffs seem uncertain in terms of what kind of complexity is needed to track when a register reference requires a dereference back to LDS, and then the uncertainty injected by banking conflicts, contention, extra hardware powered up, and the reduced bandwidth relative to the more straightforward SIMD register files. Then there's the idea of the realignment portion of the other patent, which seems like an evolution of what DPP instructions can do.

There might be thresholds to the process so that the schemes outlined in the three patents can somehow coexist. Realignment might allow the LDS to not expand its banking and routing logic as much, by letting register contexts spread through the LDS banks and load as-is even if the SIMD's predication or length don't match up to the lanes that need to consume the data.

Dynamically mapping to different SIMD widths might allow for optimizations such as how much the stack needs to track if a given truncated wavefront already knows that N lanes of the original wavefront are irrelevant. Different widths might also be helpful in a different corner case with the LDS remap scheme, where there are multiple work items stacked up within the same LDS banks. Stacking deeper on a narrower SIMD would allow the conflicts to be resolved without ALU underutilization.

Another question I have is whether it might be helpful to not create a context table in LDS until after the first wavefront refactoring. With the in-register methods from the other two schemes, a lot of divergence work might be handled for one path without dumping back to the LDS, lessening the space impact and traffic to a memory resource shared by the whole CU.

It seems like the LDS would be beefed up in a number of these synthesized possibilities.

It all seems rather complex to put together, however, so maybe not all of these are going to be used together?
 
You wouldn't want to touch the LDS itself, but gather at the full (max) width. That is already complex enough, with deduplication and multiplexing for same register operands. So the repacking would definitely not be integrated too deeply. Especially not since the repacking needs to operate on the value domain, not the register address, so you need to buffer that anyway.

If you do it in a separate cycle (or at least buffered with registers, so the input is hazard free; it might still happen in the load cycle), that 3-stage repack process shouldn't be all too difficult to implement, neither in terms of transistor count nor latency. Maybe 5-10 thousand transistors in total, for packing a full 16x3x32bit operand array (for both NOP lane elimination and deduplication) and generating the bit masks for bookkeeping. (Even a single 24bit multiplier array is already much bigger than that.)

Unpacking is absolutely trivial.
 
PS:
OK, if you really want to cut down on register file bandwidth by eliminating double use of the same registers, you obviously DO need to modify something. Not much though, and not in the LDS at all. A single-element (or larger, if you prefer) LRU filter for each operand slot, applied during the decode phase, should be sufficient to block off duplicate requests long before they hit the register file. Scattering into all input registers should already be part of the LDS, so that one doesn't really need to be touched at all.

It doesn't make much sense to me to try and repack lanes by register address ahead of time. All that matters is that you have the coalesced batch of register addresses ready at the start of the load cycle, and for that, using an LRU during the decode cycle should be sufficient.

As long as the critical part of the pipeline, actual load till store, remains the same length, it's mostly fine.
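That single-entry LRU filter per operand slot is simple enough to sketch (the class and names are mine, purely illustrative):

```python
class OperandSlotFilter:
    """Remembers the last register address fetched for one operand
    slot and suppresses the register file read when it repeats."""
    def __init__(self):
        self.last = None

    def needs_fetch(self, reg_addr):
        hit = (reg_addr == self.last)
        self.last = reg_addr
        return not hit

slot = OperandSlotFilter()
print([slot.needs_fetch(r) for r in ["v4", "v4", "v7", "v4"]])
# -> [True, False, True, True]: the back-to-back v4 read is blocked
```

Being a single entry, it only catches back-to-back repeats, but that is exactly the common case for operand reuse across consecutive instructions.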
 
Wavefront repacking works by inserting barrier instructions that would otherwise not exist, forcing the CU's population of active wavefronts to all arrive at the same PC that causes divergence. This technique requires a decent population of wavefronts otherwise you're going to run out of candidate work items very rapidly. Considering that there's only a maximum of 10 wavefronts per SIMD in GCN currently, I can't help thinking that isn't enough.

On the other hand pixel shaders probably don't need to go below quad level granularity on packing.

If a CU has a register file that spans all ALUs then the population problem is substantially mitigated, since there's now 4x as many wavefronts.

If a CU has a single register file, then the realignment element comes into play: you are doing variable-width SIMDs so you need realignment. And, well, once you build a realignment element you might as well do wavefront repacking. So perhaps once you start down the road with a CU-wide register file and varied-width SIMDs you end up doing wavefront repacking because it's a minor additional cost.
 
You wouldn't want to touch the LDS itself, but gather at the full (max) width.
The full max width is 32 banks per cycle, and I am a little unclear from the patents on how many operands can be remapped in this fashion. The patent for that method will try to refactor through a full workgroup on a CU, which by current understanding is up to 40 64-item wavefronts per CU, with up to 3 operands per active item potentially dereferenced.
It would take a rather low active thread ratio to avoid saturation for one SIMD, and the scheme is relying on this methodology for all SIMDs in a CU. The allocation of storage would be interesting, because it sounds like this could be gathering some fraction of entries spread out non-uniformly across 2560 entries, either from a single workgroup or in aggregate if more than one are on a CU.

Then there's the transparent use of this method for code that uses the LDS, the instructions that have both a register and an LDS source, and flat addressing, which has a race condition for GCN as we know it. Any of these could stand to be helped by a stronger LDS.

Capacity-wise, a single context register per lane across a workgroup in a CU is going to consume ~10K, so it seems like the LDS would benefit from more room, particularly for LDS-consuming algorithms that have already mapped out their capacity usage.
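The ~10K figure checks out, assuming 32-bit registers and the 40-wavefront-per-CU occupancy limit discussed earlier:

```python
# One 32-bit context register per lane, fully occupied CU:
waves_per_cu, wave_size, reg_bytes = 40, 64, 4
print(waves_per_cu * wave_size * reg_bytes)  # -> 10240 bytes, vs. 64 KiB of LDS
```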

If you do it in a separate cycle (or at least buffered with registers, so the input is hazard free, might still happen in the load cycle), that 3-stage repack process shouldn't be all too difficult to implement, neither in terms of transistor count nor latency.
Is this extra cycle placed somewhere within the continued use of the existing cycle cadence of the variable-length SIMD patent?
How the various schemes could play together is what I am curious about.
The simplest interpretation of the variable-length scheme has some odd implications as well, since it only really discusses one front end unit trying to balance one wavefront among its SIMD units, whereas the description of prior art is the 4-SIMD current version of GCN, which at least might avoid stalling everything if there were a hiccup in the variable-latency LDS path.
 
Oops, LDS is the local data share, not the load/store unit. So my previous post didn't make much sense, or rather described something entirely different.

Storing register allocation in the LDS shouldn't be all too bad. Sure, it costs LDS bandwidth, but the indirection can be resolved early, during instruction decode. Theoretically, an instruction could still be preempted, or rather overtaken, at this point without penalty, since the register map should never change during the lifetime of a wavefront; no race conditions should be able to occur at this stage of the pipeline.

Is this extra cycle placed somewhere within the continued use of the existing cycle cadence of the variable-length SIMD patent?
How the various schemes could play together is what I am curious about.
The simplest interpretation of the variable-length scheme has some odd implications as well, since it only really discusses one front end unit trying to balance one wavefront among its SIMD units, whereas the description of prior art is the 4-SIMD current version of GCN, which at least might avoid stalling everything if there were a hiccup in the variable-latency LDS path.
No, if you did it in a full extra cycle it wouldn't be covered. But you wouldn't actually need to; using a delay element is usually enough to chain multiple smaller pipeline stages (such as this one) within a single clock cycle. The clock cycle is only relevant for clearing the whole section of the pipeline for the next instruction. So as long as there is time left to spare in any of the clock gated sections...

You wouldn't be able to prevent any type of stalling using variable length SIMDs. Allowing instructions from different wavefronts to overtake each other is an entirely different thing. Plus, repacking using the map in the LDS sounds like madness; that's not something you could do on the fly.

The simplest interpretation would be only skipping 3 of 4 SIMD batches altogether if only a single thread is active. I'd be surprised if the current GCN generation didn't already do that.
Every type of power gating inside a SIMD would already require analyzing a single SIMD batch, plus repacking if anything apart from the trivial "1st lane only" case is desired. However, that also enables data-aware, dynamic power savings.
 
Wavefront repacking works by inserting barrier instructions that would otherwise not exist, forcing the CU's population of active wavefronts to all arrive at the same PC that causes divergence. This technique requires a decent population of wavefronts otherwise you're going to run out of candidate work items very rapidly. Considering that there's only a maximum of 10 wavefronts per SIMD in GCN currently, I can't help thinking that isn't enough.

On the other hand pixel shaders probably don't need to go below quad level granularity on packing.

If a CU has a single register file, then the realignment element comes into play: you are doing variable-width SIMDs so you need realignment. And, well, once you build a realignment element you might as well do wavefront repacking. So perhaps once you start down the road with a CU-wide register file and varied-width SIMDs you end up doing wavefront repacking because it's a minor additional cost.
The repack element is probably one or more of those scalar ALUs that are really good at INT and conditionals with a high clock. It's also possible they are NOT repacking the register file, just changing references to threads for alignment. It's likely possible, but I'd think a lot of threads would need to die for that to be worthwhile.

It should also be possible to start with 4 threads/wave, and a consolidated register file, that will get scheduled on a high performance SIMD.

The barrier could force realignment/consolidation of all threads in a warp or possibly workgroup. With the async work it's likely not all waves are the same and they'd target concurrent workloads. Even the scalar unit would want 4 cycles so a quad at minimum. The exception being operations where all data was local (realign).

Even though you don't even need to place power gates on ALL lanes to start with - as the lanes are utilized from the left to the right, it should be enough to gate before 1st lane (full NOP), after 1st lane, 2n lane and perhaps again at 50%. So you actually need only 4 large power gates per stage, instead of one per lane, to shave of most of the possible savings.
It seems likely there would be a handful (<4) of power gates, each disabling the remaining odd/even lanes. All the configurations in the patent were pow2, and you would want to space out the active lanes prior to running up the clocks, then realign.

It also seems likely the SIMDs would run uncoupled, ramping their clocks separately, so they would have to dereference.
 