> Won't HBM be a much wider bus at the physical level? e.g. 1024 or 2048 bits?

The supposed Radeon SiSoft results claim a 4096-bit memory bus on the HBM.
> Won't HBM be a much wider bus at the physical level? e.g. 1024 or 2048 bits?
> If so, I'd say "there's your risk right there, it reaches back into the chip's overall bus architecture".

I agree that there's a risk, but since this is a memory standard meant for broader use, the tolerance level for irregularity should be much lower. The on-die memory subsystem, besides the controllers, has not been discussed as being significantly altered. At least relative to GDDR5, HBM targets a very wide bus, but it has very modest speed and electrical characteristics.
> Tonga seems like a great example of bad balance. Not enough fillrate for the bandwidth available. ROPs were R600's crushing problem, especially Z-rate.

There seems to be something amiss with Tonga, but I'm not sure what. However, it seems to slot pretty close to Tahiti, with generally the same CU and ROP complement, a narrower bus, but a wider geometry front end.
> As an example, if the VALUs in 1 or more SIMDs are idle due to un-hidden latencies, but other SIMDs within the same CU are working at full speed, are those idle SIMDs burning loads of power?

By this point, the physical design for the GPU should be capable of something such as clock-gating a physically separate and idle vector ALU, particularly when it has 4 cycles to do it in.
> One thing I discovered recently, with my HD7970 (which is a 1GHz stock card that was launched a while before the 1GHz Editions were official), is that I have to turn up the power limit to stop it throttling. That's despite the fact that some kernels are only getting ~55-75% VALU utilisation and the fact that the card is radically under-volted (1.006V). I don't know if the BIOS is super-conservative, or if it's detecting current or what. I've just run a test and got 8% more performance on a kernel with 84% VALU utilisation with the board power set to "+20" (is that percent?). I suspect it's still throttling (the test is about 2 minutes long).

How is VALU utilization being measured? Is there a trace for issue cycles, or a throughput/peak measurement? Clock throttling will reduce clock rate, but it would be something else if it somehow altered what was being done in those cycles. I can envision a secondary effect if the memory bus didn't also underclock, decreasing the ratio of core to memory speed; however, that would reduce the perceived memory latency, which would increase VALU utilization. Raising the power limit should in theory increase the number of cycles that the VALUs are stalled on something that didn't scale in frequency, because they are given more cycles.
> I can't help wondering if HSA has explicitly distracted them from efficiency.

Compute in general distracted from graphics efficiency, as VLIW was stipulated to still be very efficient for traditional loads at the time GCN launched.
> Simple example: it's possible to write a loop with either a for or a while. Both using an int for the loop index/while condition. One will compile to VALU instructions that increment and compare the index. The other uses SALU:

(edit: I'm pretty sure I read it backwards after some time away. The increment leading into a conditional jump looked like a do-while where the loop's only job was the index update, but I misread the s_cmp_cc0 as its opposite. I'm guessing the leading move to s10 and the jump past the increment were elided.)
> I hadn't noticed that paper. I dare say I've given up on thinking divergence will be tackled.

It's not a strong guarantee, as papers are submitted for many untaken directions, but it is something that has come up and it might have some hardware help in the future (evidence is pretty thin at this point, though).
> Why do you say "apparent"? And what's the problem you're alluding to?

One of AMD's proposed goals with GCN was to provide a sort of idealized machine to software, for which dwelling on the 4-cycle loop is a clunky thing to expose. As long as a dependent instruction can issue at the next issue opportunity, that should at a basic level be good enough for reasonable levels of latency, for the abstract compute model AMD is trying to convey.
> Well, more fundamentally, when a work group is fully occupying a CU (not difficult with >128 VGPR allocation and a work group size of 256), there's a very serious cost if a decision is taken to switch that CU to a different kernel - LDS and all registers are either discarded and the work group is restarted later, or all that state is stashed somewhere... With that much pain, maybe the nature of context switching is rather coarse.

The hardware could have a heuristic that avoids evicting large kernels, but in the case of preemption the need might be urgent enough to do so.
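For a rough sense of the state involved (using the numbers quoted above, and assuming 4-byte registers plus a full 64 KiB LDS allocation): 256 work-items x 128 VGPRs x 4 B = 128 KiB of vector registers, plus 64 KiB of LDS, so on the order of 192 KiB per CU that either gets thrown away or has to be parked somewhere.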
Would that be a scalar wait instruction (i.e. scalar pipe waiting for VALU to finish)?
To reiterate in case I'm reading backwards, the while loop emits as scalar while the for loop is vector. I guess I'm not sure why. Is the vector variant simply the vector equivalent of the scalar ops posted in the snippet? No special modes like vskip?
http://www.sisoftware.eu/rank2011d/...dbe6deedd4e7d3f587ba8aacc9ac91a187f4c9f9&l=en
Claims 4096 stream processors and HBM memory.
> Looks like there's a second entry, with a 3520SP / 44CU chip: http://www.sisoftware.eu/rank2011d/...dbe6deeddaeed6f082bf8fa9cca994a482f1ccf4&l=en

That just looks like Hawaii with a misreported SP number.
Possibly, yes, but all the other information is unreadable anyway (3520 SP, 44 C, 1 GHz, 3 GB).
> But isn't a 'for' loop easier to auto-vectorize in the first place? At least, the index of the loop would be easier to locate due to the particular syntax.
> (For a C-like for loop I would maybe try to rewrite all the index-related statements inside the body of the for to rule that out.)

I probably reversed the meaning of the conditional jump.
> I agree that there's a risk, but since this is a memory standard meant for broader use, the tolerance level for irregularity should be much lower.

What I'm trying to say is that mapping 8 RBE/16 L2 blocks to individual 64-bit MCs is different from (for the sake of argument) 16 RBE/32 L2 blocks to 4096 bits of memory (is that 64 64-bit channels, or 32 128-bit channels, or ...).
> There seems to be something amiss with Tonga, but I'm not sure what. However, it seems to slot pretty close to Tahiti, with generally the same CU and ROP complement, a narrower bus, but a wider geometry front end.
> I suppose if Tahiti is also an example of bad balance, it would follow that Tonga would be as well.

With Tonga I'm alluding specifically to the delta compression's effect on performance. It needs more fillrate. Or it would have been close to the same performance with a 128-bit bus.
> By this point, the physical design for the GPU should be capable of something such as clock-gating a physically separate and idle vector ALU, particularly when it has 4 cycles to do it in.

Well, maybe it's a 4-way superscalar SIMD, with instructions issued to each way on a rotating cadence.
> I suppose I can't guarantee that the design has become that thorough, but power management that coarse would be difficult to reconcile with AMD's aggressive DVFS and the fact that it can spike in power so readily. It couldn't have that large a differential unless it was able to at least partly idle at a lower power level.

I bet spiking is just down to dynamic clocking and count of CUs (or, hell, shader engines?) switched on.
> How is VALU utilization being measured? Is there a trace for issue cycles, or a throughput/peak measurement?

CodeXL lets you run your application with "profiling mode on". It presumably applies the right hooks to capture the on-chip performance counters.
> Clock throttling will reduce clock rate, but it would be something else if it somehow altered what was being done in those cycles. I can envision a secondary effect if the memory bus didn't also underclock, decreasing the ratio of core to memory speed; however, that would reduce the perceived memory latency, which would increase VALU utilization. Raising the power limit should in theory increase the number of cycles that the VALUs are stalled on something that didn't scale in frequency, because they are given more cycles.

I sampled the VALU utilisation on a brief, intermediate test. I should re-do this to find out if the full-length test shows utilisation that differs.
> Compute in general distracted from graphics efficiency, as VLIW was stipulated to still be very efficient for traditional loads at the time GCN launched.
> Various elements, such as the promise of more virtualized resources, better responsiveness, and virtual memory compatible with x86, were all efficiency detractors from an architecture that could dumbly whittle away at graphics workloads.

This is only half true. Modern graphics are compute-heavy. And virtualisation and responsiveness are actually important to gaming, too.
> (edit: I'm pretty sure I read it backwards after some time away. The increment leading into a conditional jump looked like a do-while where the loop's only job was the index update, but I misread the s_cmp_cc0 as its opposite. I'm guessing the leading move to s10 and the jump past the increment were elided.)
> To reiterate in case I'm reading backwards, the while loop emits as scalar while the for loop is vector. I guess I'm not sure why. Is the vector variant simply the vector equivalent of the scalar ops posted in the snippet? No special modes like vskip?

Sorry, I wasn't trying to dwell on this point.
> It's not a strong guarantee, as papers are submitted for many untaken directions, but it is something that has come up and it might have some hardware help in the future (evidence is pretty thin at this point, though).

Can't say I'm impressed by that. They basically optimised around a wodge of deficiencies in NVidia's old architecture, inspired by the SALU concept of GCN.
http://people.engr.ncsu.edu/hzhou/ipdps14.pdf
> The hardware could have a heuristic that avoids evicting large kernels, but in the case of preemption the need might be urgent enough to do so.

I suspect, on the other hand, that merely pre-empting at the CU level is enough, i.e. allow individual CUs to finish what they're doing. A shader engine can only send work to one CU at a time, as I understand it. Therefore stopping all CUs in an SE simultaneously is pointless. Either let them drain naturally, or just kill a small subset and re-queue them for other CUs to run in parallel with the pre-empting workload.
> The IB_STS field in the hardware register is documented as having a counter for vector ops, and S_WAITCNT doesn't have an equivalent range defined. It's not much of an indicator, since the counter lengths documented for IB_STS and S_WAITCNT's immediate don't agree. It might be a vestigial remnant of some discarded direction.

With only 3 bits, that looks like nothing more than some kind of interaction with the instruction cache (mini-fetches from cache into ALU?). Though I had wondered whether it is related to macros (such as a double-precision divide) or to debugging or exception handling. Macros are way longer though, aren't they? Nope, I can't work it out.
> What I'm trying to say is that mapping 8 RBE/16 L2 blocks to individual 64-bit MCs is different from (for the sake of argument) 16 RBE/32 L2 blocks to 4096 bits of memory (is that 64 64-bit channels, or 32 128-bit channels, or ...).

A single HBM module has 8 128-bit channels and a prefetch of 2. A burst would be 256 bits, or 32 bytes. A GDDR5 channel is 32 bits with a prefetch of 8, so 32 bytes per burst.
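Spelling that out against the leaked figure (assuming four stacks): 4096 bits / 128 bits per channel = 32 channels, each bursting 128 x 2 = 256 bits = 32 bytes, versus a 512-bit GDDR5 card's 16 x 32-bit channels bursting 32 x 8 = 256 bits = 32 bytes. The access granularity stays the same; only the channel count roughly doubles.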
> Does tiling of textures work the same way? What about render targets? What's the effect on the latency-v-bandwidth curve of such a vast bus?

More stripes would be needed; the alleged 4096-bit leak would have double the number of channels to stripe across.
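As a toy model of the striping in question (made-up parameters, not AMD's actual memory-controller hash), burst-granular interleaving across channels looks roughly like this:

```c
#include <stdint.h>

/* Toy address-interleaving model, illustration only.
 * burst_bytes would be 32 for both GDDR5 (32-bit x 8 prefetch) and HBM
 * (128-bit x 2 prefetch); num_channels would roughly double going from a
 * 512-bit GDDR5 bus (16 x 32-bit) to a 4096-bit HBM setup (32 x 128-bit). */
static inline uint32_t channel_of(uint64_t addr,
                                  uint32_t burst_bytes,
                                  uint32_t num_channels)
{
    return (uint32_t)((addr / burst_bytes) % num_channels);
}
```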
> That "leak" would indicate 640GB/s of bandwidth. Latency could be 1/4 of what's seen in current GPUs (guess).

I'm uncertain it will be that significant. The DRAM device latency is unlikely to change significantly, and the GPU memory subsystem and controllers inject a large amount of latency all their own. The biggest change would be the interface and its IO wires, but they are not a primary contributor to latency.
> I bet spiking is just down to dynamic clocking and count of CUs (or, hell, shader engines?) switched on.

If that's all there is to it, then having the CUs run a trivial shader that allocates some VGPRs and then spends the rest of its time incrementing an SGPR should have a similar power profile to having them run one that is incrementing on a VGPR.
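A rough sketch of what such a test pair might look like in OpenCL C (hypothetical kernels, ignoring the detail of padding out extra VGPR allocations; whether the first counter really stays in an SGPR would have to be confirmed in the ISA dump):

```c
/* Hypothetical power-test kernels, illustration only. `iters` is a kernel
 * argument so the trip count isn't known at compile time, and the update
 * is non-linear so it can't be folded into a closed form. */

/* The value is uniform across the wavefront, so it can in principle be
 * kept in an SGPR and updated by the scalar ALU while the VALUs idle. */
kernel void spin_uniform(global uint *out, int iters)
{
    uint x = 1u;
    for (int i = 0; i < iters; ++i)
        x = x * 1664525u + 1013904223u;   /* LCG step on a uniform value */
    if (get_global_id(0) == 0)
        out[0] = x;
}

/* The value is seeded with the work-item ID, so it has to live in a VGPR
 * and be updated by the vector ALU every iteration. */
kernel void spin_per_lane(global uint *out, int iters)
{
    uint x = (uint)get_global_id(0);
    for (int i = 0; i < iters; ++i)
        x = x * 1664525u + 1013904223u;   /* same LCG step, per-lane value */
    out[get_global_id(0)] = x;
}
```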
> Originally, I discovered the throttling because I was running brief tests and got notably higher than expected performance.

Throttling behavior that actually changes which CUs get wavefronts launched, or whether instructions are issued or not, would be some kind of workload management at a higher level than what the clock-gating would have awareness of.
> The snippet I showed is the tail end of the loop (in this case the tail end of the for loop). Change all the "s"s to "v"s and you get the code for the tail end of the while loop.
> One could argue that a while loop is more likely to show inter-work-item divergence. When divergence occurs you have no choice but to use vcc (since that's per work-item). But the compiler can see that there's no divergence in this particular while loop, so there's no need to hold on to vcc.

I'm not sure if the compiler is being pessimistic, like it cannot go back and review the properties of a while loop's index variable. The scope is potentially different; perhaps it's giving up early because of that?
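For reference, a minimal OpenCL C sketch of the kind of loop pair under discussion (a hypothetical kernel, not the one whose assembly was posted; which form ends up with the index in SGPRs versus VGPRs is entirely up to the compiler):

```c
/* Hypothetical kernel, illustration only. The bound `n` is a kernel
 * argument, so it is uniform across the wavefront; in principle the index
 * can live in an SGPR with a scalar loop tail (s_add_i32 / s_cmp_lt_i32 /
 * s_cbranch_scc1). If the compiler can't or won't prove that, the tail
 * becomes the v_add / v_cmp / vcc version instead. */
kernel void loop_styles(global float *out, global const float *in, int n)
{
    const int gid = get_global_id(0);
    float acc = 0.0f;

    /* "for" formulation: trivially analysable index and trip count. */
    for (int i = 0; i < n; ++i)
        acc += in[i];

    /* "while" formulation of the same loop: identical semantics, but the
     * index update sits in the body, which may change what gets emitted. */
    int j = 0;
    while (j < n) {
        acc += in[j];
        ++j;
    }

    out[gid] = acc;
}
```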
> Can't say I'm impressed by that. They basically optimised around a wodge of deficiencies in NVidia's old architecture, inspired by the SALU concept of GCN.

An alteration to the hardware would be incremental, and I wouldn't count on it being impressive because of that fact.
> I have no idea whether GCN has a single scc shared by all work-groups in a CU.

SCC is documented as being a part of a wavefront's context, and is one of the values initially generated at wavefront launch. I'm not sure there is a safe way to share it beyond that.
> I suspect, on the other hand, that merely pre-empting at the CU level is enough, i.e. allow individual CUs to finish what they're doing.

This would run counter to AMD's desire to allow very long-lived wavefronts of arbitrary allocation size. The description of one possible way of handling this is to give the hardware a period of time where it can wait for enough resources to open up, but after that it will force a CU to begin a context switch.
> With only 3 bits, that looks like nothing more than some kind of interaction with the instruction cache (mini-fetches from cache into ALU?). Though I had wondered whether it is related to macros (such as a double-precision divide) or to debugging or exception handling. Macros are way longer though, aren't they? Nope, I can't work it out.

If it's anything like the other counters, it's an increment when an operation begins, and a decrement after completion/result available. For existing VALUs, it would be a +1 at issue and -1 at the last cycle.
> Fiji and gm200 are going to end up very close to each other, as usual. Power consumption may be the deciding factor.
> Compute is a different story...
> I think this also puts the 20nm speculation to rest.

If the leaked info is accurate, and no further measures are taken, the 4GB memory capacity would make compute a short story for AMD.
> If the leaked info is accurate, and no further measures are taken, the 4GB memory capacity would make compute a short story for AMD.

Good point. Which raises the question: what kind of measures can be taken? 4GB should be sufficient for GPU work in practice, but it will look bad on the retail box once the more memory-heavy 8GB (and 12GB? Insane!) GeForce SKUs enter the market.
What are the options? Mixed HBM and GDDR5 doesn't sound very practical.
> How do you scale a chip to use this performance profile?

Replace bandwidth-saving parts with computational ones.
https://asc.llnl.gov/fastforward/AMD-FF.pdf
AMD's FastForward project explicitly calls for a two-level memory system.

I'm a big proponent of the idea of two memory pools: one fast, low-latency HBM pool, and another pool that offers lower bandwidth but high capacity. For comparison, take Intel's Xeon Phi, with 8/16GB of fast on-package memory and another pool of DDR4.

A Fiji with 4GB of HBM and additional DDR4 memory could offer very high bandwidth and a very large memory capacity, especially with the new DDR4 stacks.
> https://asc.llnl.gov/fastforward/AMD-FF.pdf
> AMD's FastForward project explicitly calls for a two-level memory system.
> I'm a big proponent of the idea of two memory pools: one fast, low-latency HBM pool, and another pool that offers lower bandwidth but high capacity. For comparison, take Intel's Xeon Phi, with 8/16GB of fast on-package memory and another pool of DDR4.
> A Fiji with 4GB of HBM and additional DDR4 memory could offer very high bandwidth and a very large memory capacity, especially with the new DDR4 stacks.

With an APU, your two levels would be an improvement over the current two-level system, in that the GPU gets first-class access to the regular DRAM as well. With a chip like Fiji, you'd introduce a third level. Now, adding additional levels is something that's done all the time (see the L1/L2/L3 caches in a regular CPU), but at least those are managed in a transparent way. For a GPU, it'd be another software-managed level.
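To make "software managed" concrete, a minimal sketch (hypothetical helper and sizes, plain OpenCL 1.x host calls) of an application explicitly pulling a tile from a big, slow pool into a small, fast pool before using it:

```c
#include <CL/cl.h>

/* Hypothetical tile-staging helper: `far_pool` stands in for the large,
 * slower memory and `near_pool` for the small, fast on-package memory.
 * The copy has to be issued explicitly by the application, which is the
 * "software managed" burden being described above. */
#define TILE_BYTES (8u << 20)

static cl_int stage_tile(cl_command_queue queue,
                         cl_mem far_pool, cl_mem near_pool,
                         size_t tile_index)
{
    return clEnqueueCopyBuffer(queue, far_pool, near_pool,
                               tile_index * TILE_BYTES, /* src offset */
                               0,                       /* dst offset */
                               TILE_BYTES,
                               0, NULL, NULL);
}
```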
> A single HBM module has 8 128-bit channels and a prefetch of 2. A burst would be 256 bits, or 32 bytes. A GDDR5 channel is 32 bits with a prefetch of 8, so 32 bytes per burst.
> Other presentations put the general command latencies in nanoseconds as roughly equivalent to GDDR5.

Finding it hard to find much that's concrete, except pretty pictures stating first-generation HBM has ~half the latency of DDR3.
> More stripes would be needed; the alleged 4096-bit leak would have double the number of channels to stripe across.

I'm wondering if the striping algorithms will be different, e.g. to the extent that striping is abandoned for most of the smaller texture sizes.
> If that's all there is to it, then having the CUs run a trivial shader that allocates some VGPRs and then spends the rest of its time incrementing an SGPR should have a similar power profile to having them run one that is incrementing on a VGPR.

Interesting experiment...
> If by higher than expected you mean higher than the theoretical peak for the upper clock range, that probably should not happen for a pre-GHz edition card that doesn't have a boost state.
> Higher than the normally achieved performance, but still within theoretical bounds, can be a sign of power management effects poking through.

The latter.
> I'm not sure if the compiler is being pessimistic, like it cannot go back and review the properties of a while loop's index variable. The scope is potentially different; perhaps it's giving up early because of that?

Honestly it just seems like the for loop is the easy case (absurdly trivial) and they left it there.
> An alteration to the hardware would be incremental, and I wouldn't count on it being impressive because of that fact.
> It's somewhat less forgettable because there are schemes AMD posited (and patented, for all those are worth) for repacking wavefronts, where an extra indirection table is used to track per-lane thread contexts and allows physical lanes to use the active variables of other threads, should threads that are predicated on be repacked to a physical SIMD lane whose logical lane is off. Both schemes leverage cross-lane communication by storing some amount of context in shared memory.
> The latter repacking scheme is questionable for the amount of indirection it puts in the execution path, the possible banking conflicts, and the worsening of physical locality for data propagation. The scheme would theoretically make the indirection check part of the execution pipeline.

You bothered to write what I couldn't be arsed with.
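A very loose C model of the indirection described in those schemes (purely illustrative; the real mechanism would sit in hardware next to the register file and LDS, and every name here is made up): each physical lane looks up which logical thread it is currently standing in for, and operand reads take that extra hop, which is where the added latency and potential bank conflicts come from.

```c
#include <stdint.h>

/* Toy software model of wavefront repacking via an indirection table. */
#define WAVE_SIZE 64

typedef struct {
    float vgpr[16];               /* a few "active variables" per logical thread */
} thread_ctx;

typedef struct {
    thread_ctx ctx[WAVE_SIZE];    /* per-logical-thread state (conceptually in LDS) */
    uint8_t    remap[WAVE_SIZE];  /* physical lane -> logical thread it executes */
} repacked_wave;

/* Every operand read now goes through remap[], i.e. the indirection check
 * becomes part of the execution path. */
static inline float read_vgpr(const repacked_wave *w, int phys_lane, int reg)
{
    return w->ctx[w->remap[phys_lane]].vgpr[reg];
}
```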
> SCC is documented as being a part of a wavefront's context, and is one of the values initially generated at wavefront launch. I'm not sure there is a safe way to share it beyond that.

Well, that definitely makes the "parallel execution case" I talked about earlier work.
> If it's anything like the other counters, it's an increment when an operation begins, and a decrement after completion/result available. For existing VALUs, it would be a +1 at issue and -1 at the last cycle.
> It could very well be something of a false start that was not totally scrubbed from the documents.

Looking more closely at the ISA manual, these bits can be read or written by an SALU instruction, e.g. s_setreg_b32. So "abandoned" is not the case.
> Currently, thanks to how execution is handled, any attempt to make use of it would be meaningless because no further instruction issue can occur in that 4-cycle window.

I think it might be what I originally guessed at: allowing SALU to keep an eye on the latency of VALU instructions before it can proceed with the next scalar instruction.