AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Won't HBM be a much wider bus at the physical level? e.g. 1024 or 2048 bits?

If so, I'd say "there's your risk right there, it reaches back into the chip's overall bus architecture".
I agree that there's a risk, but since this is a memory standard meant for broader use, the tolerance level for irregularity should be much lower. The on-die memory subsystem, besides the controllers, has not been discussed as being significantly altered. At least relative to GDDR5, HBM targets a very wide bus, but it has very modest speed and electrical characteristics.
R600 is the reverse, where there was DRAM on the outside that was established and an internal bus that was significantly scaled up and elaborated from what had been done before.

HBM might present a manufacturing risk, which is difficult to quantify. The working chips should do fine bandwidth-wise.

Tonga seems like a great example of bad balance. Not enough fillrate for the bandwidth available. ROPs were R600's crushing problem, especially Z-rate.
There seems to be something amiss with Tonga, but I'm not sure what. However, it seems to slot pretty close to Tahiti, with generally the same CU and ROP complement, a narrower bus, but a wider geometry front end.
I suppose if Tahiti is also an example of bad balance, it would follow that Tonga would be as well.

As an example, if the VALUs in 1 or more SIMDs are idle due to un-hidden latencies, but other SIMDs within the same CU are working at full speed, are those idle SIMDs burning loads of power?
By this point, the physical design for the GPU should be capable of something like clock gating a physically separate and idle vector ALU, particularly when it has 4 cycles to do it in.
I suppose I can't guarantee that the design has become that thorough, but power management that coarse would be difficult to reconcile with AMD's aggressive DVFS and the fact that it can spike in power so readily. It couldn't have that large a differential unless it was able to at least partly idle at a lower power level.

One thing I discovered recently, with my HD7970 (which is a 1GHz stock card that was launched a while before the 1GHz Editions were official) is that I have to turn up the power limit to stop it throttling. That's despite the fact that some kernels are only getting ~55-75% VALU utilisation and the fact that the card is radically under-volted (1.006V). I don't know if the BIOS is super-conservative, or if it's detecting current or what. I've just run a test and got 8% more performance on a kernel with 84% VALU utilisation with the board power set to "+20" (is that percent?). I suspect it's still throttling (the test is about 2 minutes long).
How is VALU utilization being measured? Is there a trace for issue cycles, or a throughput/peak measurement? Clock throttling will reduce the clock rate, but it would be something else if it somehow altered what was being done in those cycles. I can envision a secondary effect if the memory bus didn't also underclock: that would decrease the ratio of core to memory speed, which reduces the perceived memory latency and so would increase VALU utilization. Raising the power limit should in theory increase the number of cycles the VALUs spend stalled on anything that didn't scale in frequency, because they are given more core cycles.

I can't help wondering if HSA has explicitly distracted them from efficiency.
Compute in general distracted from graphics efficiency, as VLIW was stipulated to still be very efficient for traditional loads at the time GCN launched.
Various elements, such as the promise for more virtualized resources, better responsiveness, and virtual memory compatible with x86 were all efficiency detractors from an architecture that could dumbly whittle away at graphics workloads.

Simple example: it's possible to write a loop with either a for or a while. Both using an int for the loop index/while condition. One will compile to VALU instructions that increment and compare the index. The other uses SALU:
(edit: I'm pretty sure I read it backwards after some time away. The increment leading into a conditional jump looked like a do while where the loop's only job was the index update, but I misread the s_cmp_cc0 as its opposite. I'm guessing the leading move to s10 and the jump past the increment were elided.)
To reiterate in case I'm reading backwards, the while loop emits as scalar while the for loop is vector. I guess I'm not sure why. Is the vector variant simply the vector equivalent of the scalar ops posted in the snippet? No special modes like vskip?
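For reference, here's a minimal OpenCL C sketch of the two loop shapes in question (an illustrative reconstruction, not the original poster's kernel): a trip count that is uniform across the wavefront leaves the compiler free to keep the counter in SGPRs and drive the loop with s_add/s_cmp/s_cbranch, while a counter it can't (or doesn't bother to) prove uniform ends up in a VGPR with v_add/v_cmp instead. Which of the two forms actually came out scalar is exactly the confusion above; the sketch only shows the shapes being compared.

Code:
// Illustrative only; bodies trimmed to almost nothing so the loop overhead dominates.
kernel void loop_forms(global float *out, const int n)
{
    int gid = get_global_id(0);
    float acc = 0.0f;

    // for-style loop: index and trip count are wavefront-uniform
    for (int i = 0; i < n; ++i)
        acc += 1.0f;

    // while-style loop: same trip count, written the other way
    int j = 0;
    while (j < n) {
        acc += 1.0f;
        ++j;
    }

    out[gid] = acc;
}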


I hadn't noticed that paper. I dare say I've given up on thinking divergence will be tackled.
It's not a strong guarantee, as papers are submitted for many untaken directions, but it is something that has come up and it might have some hardware help in the future (evidence is pretty thin at this point, though).
http://people.engr.ncsu.edu/hzhou/ipdps14.pdf

Why do you say "apparent"? And what's the problem you're alluding to?
One of AMD's proposed goals with GCN was to provide a sort of idealized machine to software, for which dwelling on the 4-cycle loop is a clunky thing to expose. As long as a dependent instruction can issue at the next issue opportunity, that should at a basic level be good enough for reasonable levels of latency, for the abstract compute model AMD is trying to convey.

Well, more fundamentally, when a work group is fully occupying a CU (not difficult with >128 VGPR allocation and a work group size of 256), there's a very serious cost if a decision is taken to switch that CU to a different kernel - LDS and all registers are either discarded and the work group is restarted later or all that state is stashed somewhere... With that much pain, maybe the nature of context switching is rather coarse.
The hardware could have a heuristic that avoids evicting large kernels, but in the case of preemption the need might be urgent enough to do so.
However, if the convention holds that long-latency ops like vector memory operations are grouped together, it might be exploitable. If a wavefront cannot resolve itself in time, the issue logic might be able to run for a set number of cycles or until it hits an operation like a load, and perform null issues until it hits a waitcnt. Any destination registers in that null run do not need to be saved.
I suppose a large allocation kernel wouldn't benefit much, but smaller shaders might.

Would that be a scalar wait instruction (i.e. scalar pipe waiting for VALU to finish)?

The IB_STS field in the hardware register is documented as having a counter for vector ops, and S_WAITCNT doesn't have an equivalent range defined. It's not much of an indicator, since the counter lengths documented for IB_STS and S_WAITCNT's immediate don't agree. It might be a vestigial remnant of some discarded direction.
 
To reiterate in case I'm reading backwards, the while loop emits as scalar while the for loop is vector. I guess I'm not sure why. Is the vector variant simply the vector equivalent of the scalar ops posted in the snippet? No special modes like vskip?

But isn't a 'for' loop easier to auto-vectorize in the first place? At least, the index of the loop would be easier to locate due to the particular syntax.

(For a C-like for loop, I would try rewriting all the index-related statements inside the body of the for loop to rule that out.)
 
Possibly, yes, but all the other information is unreadable anyway (3520SP, 44C, 1 GHz, 3 GB).

3520 Stream Processors, 44 CU, 1 GHz, 3GB.

It just seems that someone/something assumed CUs had 80 SPs as they did in the VLIW5 days: 44×80 = 3520.
 
So we're looking at going from 44 to 64 CUs for AMD and 16 to 24 SMs for Nvidia? That's not too encouraging for AMD in terms of gaming-related compute power.
The memory BW should blow away all the rest, but how often is BW really the limiting factor in terms of performance?

Fiji and gm200 are going to end up very close to each other, as usual. Power consumption may be the deciding factor.

Compute is a different story...

I think this also puts the 20nm speculation to rest.
 
But isn't a 'for' loop easier to auto-vectorize in the first place? At least, the index of the loop would be easier to locate due to the particular syntax.

(For a C-like for loop, I would try rewriting all the index-related statements inside the body of the for loop to rule that out.)
I probably reversed the meaning of the conditional jump.
 
I agree that there's a risk, but since this is a memory standard meant for broader use, the tolerance level for irregularity should be much lower.
What I'm trying to say is that mapping 8 RBE/16 L2 blocks to individual 64-bit MCs is different from (for the sake of argument) 16 RBE/32 L2 blocks to 4096 bits of memory (is that 64 64-bit channels, or 32 128-bit channels or ...)

Does tiling of textures work the same way? What about render targets? What's the effect on the latency-v-bandwidth curve of such a vast bus?

That "leak" would indicate 640GB/s of bandwidth. Latency could be 1/4 of what's seen in current GPUs (guess).

How do you scale a chip to use this performance profile?

There seems to be something amiss with Tonga, but I'm not sure what. However, it seems to slot pretty close to Tahiti, with generally the same CU and ROP complement, a narrower bus, but a wider geometry front end.
I suppose if Tahiti is also an example of bad balance, it would follow that Tonga would be as well.
With Tonga I'm alluding specifically to the delta compression's effect on performance. It needs more fillrate. Or it would have been close to the same performance with a 128-bit bus.

By this point, the physical design for the GPU should be capable of something like clock gating a physically separate and idle vector ALU, particularly when it has 4 cycles to do it in.
Well, maybe it's a 4-way superscalar SIMD, with instructions issued to each way on a rotating cadence :eek:

I suppose I can't guarantee that the design has become that thorough, but power management that coarse would be difficult to reconcile with AMD's aggressive DVFS and the fact that it can spike in power so readily. It couldn't have that large a differential unless it was able to at least partly idle at a lower power level.
I bet spiking is just down to dynamic clocking and count of CUs (or, hell, shader engines?) switched on.

How is VALU utilization being measured? Is there a trace for issue cycles, or a throughput/peak measurement?
CodeXL lets you run your application with "profiling mode on". It presumably applies the right hooks to capture the on-chip performance counters.

http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/codexl-benefits-detail/

Clock throttling will reduce the clock rate, but it would be something else if it somehow altered what was being done in those cycles. I can envision a secondary effect if the memory bus didn't also underclock: that would decrease the ratio of core to memory speed, which reduces the perceived memory latency and so would increase VALU utilization. Raising the power limit should in theory increase the number of cycles the VALUs spend stalled on anything that didn't scale in frequency, because they are given more core cycles.
I sampled the VALU utilisation on a brief, intermediate test. I should re-do this to find out if the full length test shows utilisation that differs.

Originally, I discovered the throttling because I was running brief tests and got notably higher than expected performance.

Compute in general distracted from graphics efficiency, as VLIW was stipulated to still be very efficient for traditional loads at the time GCN launched.
Various elements, such as the promise for more virtualized resources, better responsiveness, and virtual memory compatible with x86 were all efficiency detractors from an architecture that could dumbly whittle away at graphics workloads.
This is only half true. Modern graphics are compute-heavy. And virtualisation and responsiveness are actually important to gaming, too.

(edit: I'm pretty sure I read it backwards after some time away. The increment leading into a conditional jump looked like a do while where the loop's only job was the index update, but I misread the s_cmp_cc0 as its opposite. I'm guessing the leading move to s10 and the jump past the increment were elided.)
To reiterate in case I'm reading backwards, the while loop emits as scalar while the for loop is vector. I guess I'm not sure why. Is the vector variant simply the vector equivalent of the scalar ops posted in the snippet? No special modes like vskip?
Sorry, I wasn't trying to dwell on this point.

The for loop generates the pure scalar ALU code. The while generates VALU code, with integer ops.

The snippet I showed is the tail end of the loop (in this case the tail end of the for loop). Change all the "s"s to a "v"s and you get the code for the tail end of the while loop.

One could argue that a while loop is more likely to show inter-work-item divergence. When divergence occurs you have no choice but to use vcc (since that's per work-item). But the compiler can see that there's no divergence in this particular while loop, so there's no need to hold on to vcc.
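To make the divergence point concrete, a hypothetical contrast case (not from the thread): once the loop bound is per-work-item data, the comparison has to produce a per-lane result in VCC and the exec mask has to track which lanes are still iterating, so the scalar unit can no longer carry the loop by itself.

Code:
// Hypothetical divergent loop: the bound differs per work-item.
kernel void divergent_loop(global const int *limits, global float *out)
{
    int gid = get_global_id(0);
    float acc = 0.0f;

    int i = 0;
    while (i < limits[gid]) {   // per-lane trip count -> possible divergence
        acc += 1.0f;
        ++i;
    }
    out[gid] = acc;
}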

It's not a strong guarantee, as papers are submitted for many untaken directions, but it is something that has come up and it might have some hardware help in the future (evidence is pretty thin at this point, though).
http://people.engr.ncsu.edu/hzhou/ipdps14.pdf
Can't say I'm impressed by that. They basically optimised around a wodge of deficiencies in NVidia's old architecture, inspired by the SALU concept of GCN.

Cache-aware algorithms, by definition, pre-fetch.

As the discussion of the for loop above shows, the compiler can issue scalar instructions that will run in parallel with the VALU instructions. In fact the scalar add and scalar compare should complete well before the branch needs to be taken (though I admit I have no idea how the hardware actually scheduled this). If there is only a single instance of scc, then the compare can't be issued at any time other than immediately before the branch, though. I have no idea whether GCN has a single scc shared by all work-groups in a CU.

Work-item shuffling (through data movement) is a toy concept. Any general purpose implementation is stuck with moving all the state of work-items and local memory around. The programmer can do this themself. That's what happens in a smart reduction, for example. The programmer has the bigger picture. It's similar to what persistent kernels do. etc.

The hardware could have a heuristic that avoids evicting large kernels, but in the case of preemption the need might be urgent enough to do so.
I suspect, on the other hand, that merely pre-empting at the CU level is enough. i.e. allow individual CUs to finish what they're doing. A shader engine can only send work to one CU at a time, as I understand it. Therefore stopping all CUs in a SE simultaneously is pointless. Either let them drain naturally, or just kill a small subset and re-queue them for other CUs to run in parallel with the pre-empting workload.

The IB_STS field in the hardware register is documented as having a counter for vector ops, and S_WAITCNT doesn't have an equivalent range defined. It's not much of an indicator, since the counter lengths documented for IB_STS and S_WAITCNT's immediate don't agree. It might be a vestigial remnant of some discarded direction.
With only 3 bits, that looks like nothing more than some kind of interaction with the instruction cache (mini-fetches from cache into ALU?). Though I had wondered whether it is related to macros (such as a double-precision divide) or to debugging or exception handling. Macros are way longer though, aren't they? Nope, I can't work it out.

But well, it doesn't seem very important being only 3 bits (though it might be 3 bits encoding the range 0-128, say).
 
What I'm trying to say is that mapping 8 RBE/16 L2 blocks to individual 64-bit MCs is different from (for the sake of argument) 16 RBE/32 L2 blocks to 4096 bits of memory (is that 64 64-bit channels, or 32 128-bit channels or ...)
A single HBM module has 8 128-bit channels and a prefetch of 2. A burst would be 256 bits, or 32 bytes. A GDDR5 channel is 32 bits with a prefetch of 8, so 32 bytes per burst.
Other presentations put the general command latencies in nanoseconds as roughly equivalent to GDDR5.
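As a quick sanity check on those figures (the pin rate is my inference from the numbers floating around this thread, not a spec quote): the per-channel bursts do come out identical, and a 4096-bit bus only reaches the alleged 640GB/s at roughly 1.25Gb/s per pin.

Code:
#include <stdio.h>

int main(void)
{
    /* burst = channel width x prefetch, in bytes */
    int hbm_burst   = 128 * 2 / 8;   /* 32 bytes per HBM channel burst   */
    int gddr5_burst = 32  * 8 / 8;   /* 32 bytes per GDDR5 channel burst */

    int bus_bits = 4096;             /* alleged config: 4 stacks of 8 x 128-bit channels */
    int channels = bus_bits / 128;   /* 32 channels of 128 bits */

    double gbps_per_pin = 1.25;      /* implied by the 640 GB/s figure */
    double bw_gb_s = bus_bits * gbps_per_pin / 8.0;

    printf("HBM burst %d B, GDDR5 burst %d B, %d channels\n", hbm_burst, gddr5_burst, channels);
    printf("%d-bit bus at %.2f Gb/s/pin = %.0f GB/s\n", bus_bits, gbps_per_pin, bw_gb_s);
    return 0;
}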

Does tiling of textures work the same way? What about render targets? What's the effect on the latency-v-bandwidth curve of such a vast bus?
More stripes would be needed; the alleged 4096-bit leak would have double the number of channels to stripe across.

That "leak" would indicate 640GB/s of bandwidth. Latency could be 1/4 of what's seen in current GPUs (guess).
I'm uncertain it will be that significant. The DRAM device latency is unlikely to change significantly, and the GPU memory subsystem and controllers inject a large amount of latency all their own. The biggest change would be the interface and its IO wires, but they are not a primary contributor to latency.

I bet spiking is just down to dynamic clocking and count of CUs (or, hell, shader engines?) switched on.
If that's all there is to it, then having the CUs run a trivial shader that allocates some VGPRs and then spends the rest of its time incrementing an SGPR should have a similar power profile to having them run one that is incrementing on a VGPR.
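A rough sketch of that experiment in OpenCL C (the mapping is an assumption: a work-group-uniform accumulator is the usual way to coax a value into SGPRs, a per-work-item one into VGPRs, and the LCG update is only there so the compiler can't fold the loop away; checking the ISA dump in CodeXL would be the way to confirm what actually got emitted):

Code:
kernel void spin_uniform(global uint *out, int iters)
{
    uint s = 7u;                          // uniform across the wavefront
    for (int i = 0; i < iters; ++i)
        s = s * 1664525u + 1013904223u;   // LCG keeps the loop from being folded
    if (get_global_id(0) == 0)
        out[0] = s;                       // keep the result live
}

kernel void spin_per_lane(global uint *out, int iters)
{
    uint v = (uint)get_global_id(0);      // per-lane seed forces a VGPR
    for (int i = 0; i < iters; ++i)
        v = v * 1664525u + 1013904223u;
    out[get_global_id(0)] = v;
}

If the two draw about the same power at the same occupancy, that would support the clocks-and-CU-count view; a sizeable gap would point at finer-grained gating of the vector ALUs.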

Originally, I discovered the throttling because I was running brief tests and got notably higher than expected performance.
Throttling behavior that actually changes which CUs get wavefronts launched, or whether instructions are issued at all, would be some kind of workload management at a higher level than what the clock-gating would have awareness of.
If by higher than expected you mean higher than the theoretical peak for the upper clock range, that probably should not happen for a pre-GHz edition card that doesn't have a boost state.
Performance higher than what's normally achieved, but still within theoretical bounds, can be a sign of power management effects poking through.

The snippet I showed is the tail end of the loop (in this case the tail end of the for loop). Change all the "s"s to a "v"s and you get the code for the tail end of the while loop.

One could argue that a while loop is more likely to show inter-work-item divergence. When divergence occurs you have no choice but to use vcc (since that's per work-item). But the compiler can see that there's no divergence in this particular while loop, so there's no need to hold on to vcc.
I'm not sure if the compiler is being pessimistic, like it cannot go back and review the properties of a while loop's index variable. The scope is potentially different; perhaps it's giving up early because of that?

Can't say I'm impressed by that. They basically optimised around a wodge of deficiencies in NVidia's old architecture, inspired by the SALU concept of GCN.
An alteration to the hardware would be incremental, and I wouldn't count on it being impressive because of that fact.
It's somewhat less forgettable because there are schemes AMD posited (and patented, for all those are worth) for repacking wavefronts, where an extra indirection table tracks per-lane thread contexts and allows physical lanes to use the active variables of other threads, so that threads which are predicated on can be repacked onto a physical SIMD lane whose own logical lane is off. Both schemes leverage cross-lane communication by storing some amount of context in shared memory.

The latter scheme for repacking is questionable for the amount of indirection it puts in the execution path, the possible bank conflicts, and the worsening of physical locality for data propagation. The scheme would theoretically make the indirection check part of the execution pipeline.

I have no idea whether GCN has a single scc shared by all work-groups in a CU.
SCC is documented as being a part of a wavefront's context, and is one of the values initially generated at wavefront launch. I'm not sure there is a safe way to share it beyond that.

I suspect, on the other hand, that merely pre-empting at the CU level is enough. i.e. allow individual CUs to finish what they're doing.
This would run counter to AMD's desire to allow very long-lived wavefronts of arbitrary allocation size. One possible way of handling this that has been described is to give the hardware a period of time in which it can wait for enough resources to open up, but after that it will force a CU to begin a context switch.

With only 3 bits, that looks like nothing more than some kind of interaction with the instruction cache (mini-fetches from cache into ALU?). Though I had wondered whether it is related to macros (such as a double-precision divide) or to debugging or exception handling. Macros are way longer though, aren't they? Nope, I can't work it out.
If it's anything like the other counters, it's an increment when an operation begins, and a decrement after completion/result available. For existing VALUs, it would be a +1 at issue and -1 at the last cycle.
It could very well be a false start that was not totally scrubbed from the documents. Currently, thanks to how execution is handled, any attempt to make use of it would be meaningless because no further instruction issue can occur in that 4 cycle window.
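For illustration, a tiny model of how the documented counters behave (a paraphrase of the ISA manual's description, not vendor code): the count goes up when an op of that class issues, comes down when its result lands, and s_waitcnt stalls the wave until the count has drained to the encoded limit. The hypothetical VALU counter would presumably follow the same pattern if it were ever hooked up.

Code:
#include <assert.h>

typedef struct { int outstanding; } wait_counter;

static void on_issue(wait_counter *c)    { c->outstanding += 1; }  /* op of this class issued   */
static void on_complete(wait_counter *c) { c->outstanding -= 1; }  /* its result became visible */

/* s_waitcnt semantics: the wave may proceed once outstanding <= limit. */
static int waitcnt_satisfied(const wait_counter *c, int limit)
{
    return c->outstanding <= limit;
}

int main(void)
{
    wait_counter vm = { 0 };
    on_issue(&vm); on_issue(&vm);          /* two loads in flight       */
    assert(!waitcnt_satisfied(&vm, 0));    /* s_waitcnt vmcnt(0) stalls */
    on_complete(&vm); on_complete(&vm);
    assert(waitcnt_satisfied(&vm, 0));     /* both returned: proceed    */
    return 0;
}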
 
Fiji and gm200 are going to end up very close to each other, as usual. Power consumption may be the deciding factor.

Compute is a different story...

I think this also puts the 20nm speculation to rest.
If the leaked info is accurate, and no further measures are taken, the 4GB memory capacity would make compute a short story for AMD.
 
If the leaked info is accurate, and no further measures are taken, the 4GB memory capacity would make compute a short story for AMD.
Good point. Which raises the question: what kind of measures can be taken? 4GB should be sufficient for GPU work in practice, but it will look bad on the retail box once the more memory heavy 8GB (and 12GB? Insane!) GeForce SKUs enter the market.

What are the options? Mixed HBM and GDDR5 doesn't sound very practical.
 
Good point. Which raises the question: what kind of measures can be taken? 4GB should be sufficient for GPU work in practice, but it will look bad on the retail box once the more memory heavy 8GB (and 12GB? Insane!) GeForce SKUs enter the market.

What are the options? Mixed HBM and GDDR5 doesn't sound very practical.

I'm not sure. Waiting for HBM Gen2 seems like the best way to avoid what looks to be a very awkward product arrangement. Nvidia had some slides on HBM, as it will be using it as well, and the bit rate and capacity shown are consistent with waiting for Gen2, but that's years away.
HBM Gen1 is somewhat compelling in some metrics, but the capacity loss is significant and bandwidth gains get lost if going for a more modest stack count.
For the alleged 4K sample, even if there were a second memory type, HBM is going to be on two of the four sides, and GDDR5 is itself not a particularly dense memory type. There are other memory standards that could provide more capacity, but the bandwidth disparity would be greater.
The competing HMC standard has a more complex fabric that could allow other modules to hang off the adjacent memory modules, but that's not a path AMD has looked to go down.

I don't know how a chip with the alleged size of a high-price product, using a higher-price memory that requires a higher-price package manufacturing process, is going to get away with a capacity that shuts it out of high-end gaming (as you've noted), and also out of much of the most lucrative compute and professional graphics segments. AMD already has 16GB FirePro cards, and there's no font I'm aware of that is going to make cutting that by a factor of four look good on the box.

The question came up a while back as to why AMD might not ship working silicon, and whiffing on the memory or interposer could do it, particularly since the limits of that choice may have killed the market for it.
 
Good point. Which raises the question: what kind of measures can be taken? 4GB should be sufficient for GPU work in practice, but it will look bad on the retail box once the more memory heavy 8GB (and 12GB? Insane!) GeForce SKUs enter the market.

What are the options? Mixed HBM and GDDR5 doesn't sound very practical.

https://asc.llnl.gov/fastforward/AMD-FF.pdf

AMD's FastForward project explicitly calls for a two-level memory system.

I am a great fan of the idea of two memory systems: one being the fast, low-latency HBM pool, the other being another pool of memory that offers lower bandwidth but high capacity. For comparison, you could take Intel's Xeon Phi, with 8/16GB of fast HBM memory and another pool of DDR4.

A Fiji with 4GB of HBM and additional DDR4 memory could offer very high bandwidth and very high memory capacity, especially with the new DDR4 stacks.
 
https://asc.llnl.gov/fastforward/AMD-FF.pdf

AMD's FastForward project explicitly calls for a two-level memory system.

I am a great fan of the idea of two memory systems: one being the fast, low-latency HBM pool, the other being another pool of memory that offers lower bandwidth but high capacity. For comparison, you could take Intel's Xeon Phi, with 8/16GB of fast HBM memory and another pool of DDR4.

A Fiji with 4GB of HBM and additional DDR4 memory could offer very high bandwidth and very high memory capacity, especially with the new DDR4 stacks.

There is one difference in the presentation between evaluating memory interfaces and two-tier memory: they've gotten as far as defining a programming model and simulating the latter, while they have evaluated actual existing examples of the former.
If the alleged leaked example exists, it has a memory pool of existing and non-simulated RAM. AMD has not done much evangelizing of tiered memory, which at least would benefit HPC and compute. Existing applications not developed for the possibility (games, conservatively developed workstation applications?) would be a question mark.

At least Xeon Phi's promised on-package pool (which is HMC-derived, not HBM) is sized such that it doesn't drop below what would be decent capacity for existing upper-tier gaming and professional cards.
 
https://asc.llnl.gov/fastforward/AMD-FF.pdf

AMD's FastForward project explicitly calls for a two-level memory system.

I am a great fan of the idea of two memory systems: one being the fast, low-latency HBM pool, the other being another pool of memory that offers lower bandwidth but high capacity. For comparison, you could take Intel's Xeon Phi, with 8/16GB of fast HBM memory and another pool of DDR4.

A Fiji with 4GB of HBM and additional DDR4 memory could offer very high bandwidth and very high memory capacity, especially with the new DDR4 stacks.
With an APU, your two levels would be an improvement over the current two-level system in that the GPU has first-class access to the regular DRAM as well. With a chip like Fiji, you'd introduce a third level. Now, adding additional levels is something that's done all the time (see the L1/L2/L3 caches in a regular CPU), but at least those are managed in a transparent way. For a GPU, it'd be another software-managed level.
I don't know how much overhead that would give. It can probably work...
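Purely as an illustration of what that extra software-managed level might look like to application code (every name here is hypothetical, not an AMD or HSA API): the application allocates against one of two pools and explicitly migrates buffers into the fast pool before heavy use, much as ESRAM or staging schemes are handled today.

Code:
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { POOL_HBM, POOL_DDR } pool_t;

typedef struct {
    void  *ptr;
    size_t bytes;
    pool_t where;
} tiered_buf;

/* Stand-in: a real runtime would carve these out of the two physical pools. */
static void *pool_alloc(pool_t where, size_t bytes) { (void)where; return malloc(bytes); }

static tiered_buf tiered_alloc(size_t bytes, pool_t where)
{
    tiered_buf b = { pool_alloc(where, bytes), bytes, where };
    return b;
}

/* Explicit migration: the application decides when a buffer earns the high-bandwidth tier. */
static void tiered_migrate(tiered_buf *b, pool_t dest)
{
    if (b->where == dest) return;
    void *n = pool_alloc(dest, b->bytes);
    memcpy(n, b->ptr, b->bytes);   /* on real hardware this would be a DMA between pools */
    free(b->ptr);
    b->ptr = n;
    b->where = dest;
}

int main(void)
{
    tiered_buf rt = tiered_alloc((size_t)4 << 20, POOL_DDR);  /* park 4MB in the big pool    */
    tiered_migrate(&rt, POOL_HBM);                            /* pull it in before heavy use */
    free(rt.ptr);
    return 0;
}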
 
The Xbox One APU to a certain extent has already introduced a two-tier memory system, as the ESRAM works on pages mapped to on-die addresses rather than having them transparently cached.
The automatic management of this has proven pretty functional for graphics, at least. It's not so transparent that it doesn't take additional work to massage things relative to a certain other APU's single GDDR5 pool, and HPC would appreciate deeper control and more advance notice than has been given.

The console example's problems could be attributed to storage that is massively smaller than what an HBM system would provide, sustained performance not being significantly different from the competing GDDR5 pool, and additional wrinkles like an immature platform and other hardware differences that can lead to performance problems being blamed on that storage.
If there were a DDR4 backing store for an HBM GPU, or possibly other standards that can reach higher density than GDDR5, the bandwidth benefit of the HBM pool probably would be more clear. It would be much more unambiguous with HBM Gen2, but Gen1 with sufficient width can have a wider gap than the console solutions presented.
 
A single HBM module has 8 128-bit channels and a prefetch of 2. A burst would be 256 bits, or 32 bytes. A GDDR5 channel is 32 bits with a prefetch of 8, so 32 bytes per burst.
Other presentations put the general command latencies in nanoseconds as roughly equivalent to GDDR5.
I'm finding it hard to locate much that's concrete, except pretty pictures stating that first-generation HBM has ~half the latency of DDR3.

More stripes would be needed; the alleged 4096-bit leak would have double the number of channels to stripe across.
I'm wondering if the striping algorithms will be different, e.g. to the extent that striping is abandoned for most of the small texture sizes.

If that's all there is to it, then having the CUs run a trivial shader that allocates some VGPRs and then spends the rest of its time incrementing an SGPR should have a similar power profile to having them run one that is incrementing on a VGPR.
Interesting experiment...

If by higher than expected you mean higher than the theoretical peak for the upper clock range, that probably should not happen for a pre-GHz edition card that doesn't have a boost state.
Performance higher than what's normally achieved, but still within theoretical bounds, can be a sign of power management effects poking through.
The latter.

I'm not sure if the compiler is being pessimistic, like it cannot go back and review the properties of a while loop's index variable. The scope is potentially different; perhaps it's giving up early because of that?
Honestly it just seems like the for loop is the easy case (absurdly trivial) and they left it there.

An alteration to the hardware would be incremental, and I wouldn't count on it being impressive because of that fact.
It's somewhat less forgettable because there are schemes AMD posited (and patented, for all those are worth) for repacking wavefronts, where an extra indirection table tracks per-lane thread contexts and allows physical lanes to use the active variables of other threads, so that threads which are predicated on can be repacked onto a physical SIMD lane whose own logical lane is off. Both schemes leverage cross-lane communication by storing some amount of context in shared memory.

The latter scheme for repacking is questionable for the amount of indirection it puts in the execution path, the possible bank conflicts, and the worsening of physical locality for data propagation. The scheme would theoretically make the indirection check part of the execution pipeline.
You bothered to write what I couldn't be arsed with ;)

SCC is documented as being a part of a wavefront's context, and is one of the values initially generated at wavefront launch. I'm not sure there is a safe way to share it beyond that.
Well that definitely makes the "parallel execution case" I talked about earlier work.

If it's anything like the other counters, it's an increment when an operation begins, and a decrement after completion/result available. For existing VALUs, it would be a +1 at issue and -1 at the last cycle.

It could very well be a false start that was not totally scrubbed from the documents.
Looking more closely at the ISA manual, these bits can be read or written by an SALU instruction, e.g. s_setreg_b32. So "abandoned" is not the case.

Currently, thanks to how execution is handled, any attempt to make use of it would be meaningless because no further instruction issue can occur in that 4 cycle window.
I think it might be what I originally guessed at: allowing SALU to keep an eye on the latency of VALU instructions, before it can proceed with the next scalar instruction.
 