I guess I'm not sure which powerpoint had that particular value. I'm aware of the leaked Hynix presentation, which had one particular latency parameter for column-to-column access at half of DDR3, but that would be a minor contributor. I'm finding it hard to find anything concrete beyond pretty pictures stating that first-generation HBM has ~half the latency of DDR3.
A 2013 presentation about HSA and GPUs had memory addresses distributed as Address/256 % N, where N was the channel count. 256 bytes has some nice synergy with the write throughput of a vector instruction, as well as with a 64-pixel block, although saying the striping exists because the SIMD architecture likes 256 bytes may be looking at things backwards. It may be that the architecture of both the vector and ROP systems is structured to appeal to a certain range of strides that DRAM arrays like to work with. I'm wondering if the striping algorithms will be different, e.g. to the extent that striping is abandoned for most of the small texture sizes.
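For illustration, here's roughly what that Address/256 % N mapping amounts to. The fixed 256-byte stripe and the names are just my reading of the slide, not a confirmed description of the hardware's address map.

```c
#include <stdint.h>

/* Sketch of the striping from that presentation: addresses interleave
 * across N channels every 256 bytes. Assumed, not documented. */
#define STRIPE_BYTES 256u

static unsigned channel_for(uint64_t addr, unsigned num_channels)
{
    return (unsigned)((addr / STRIPE_BYTES) % num_channels);
}
```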
4 Gbps GDDR5 has a prefetch of 8 on a 32-bit channel, so it delivers 128 bits in 1 ns. Faster than that, and the banks are subdivided into four bank groups whose accesses must be interleaved to avoid exceeding the speed of the arrays.
HBM Gen1 at 1 Gbps has a prefetch of 2 on a 128-bit channel, so it delivers 128 bits in 1 ns.
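Just to sanity-check the arithmetic for myself (my own numbers, purely illustrative; the ~500 MHz array figure falls out of the math rather than being quoted from a datasheet):

```c
#include <stdio.h>

/* Back-of-the-envelope check that both interfaces drain their arrays
 * at about the same rate, despite very different pin counts. */
static void channel_rate(const char *name, double gbps_per_pin,
                         int pins, int prefetch)
{
    double bits_per_ns    = gbps_per_pin * pins;        /* 1 Gbps = 1 bit/ns */
    double bits_per_fetch = (double)prefetch * pins;    /* one array access  */
    double array_mhz      = bits_per_ns / bits_per_fetch * 1000.0;

    printf("%s: %.0f bits/ns, %.0f bits per array fetch, array ~%.0f MHz\n",
           name, bits_per_ns, bits_per_fetch, array_mhz);
}

int main(void)
{
    channel_rate("GDDR5 4 Gbps x32,  prefetch 8", 4.0, 32, 8);
    channel_rate("HBM1  1 Gbps x128, prefetch 2", 1.0, 128, 2);
    return 0;
}
```

Both come out to 128 bits per nanosecond on the pins and 256 bits per array fetch, which is why the arrays behind them end up running at roughly the same speed.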
There is apparently some margin to push the arrays faster, going by the leaked 1.2 Gbps device, but getting near 2 Gbps with Gen2 likely requires a change to the DDR signaling.
GDDR5 has 16 banks per channel if we're talking about a single device and not clamshell. Clamshell is physically 32, but the two chips behave as if they were a single double-capacity channel.
HBM has 8 banks per channel--but the banks are actually subdivided into 2 sub-banks each. There are also 2 channels per slice, so 16 banks or 32 sub-banks per layer in HBM.
Striding through all the banks in GDDR5 seems to only get you up to 128 bytes, but it looks like it helps to have overlapping accesses to hide command and refresh overhead.
Given my limited understanding of DRAM, I'm fumbling through what amounts to numerology to describe the base constraint: that HBM and GDDR interfaces are pulling data from very similar, ploddingly slow DRAM arrays.
It seems like AMD and Hynix kept an eye on what existing hardware is optimized for when designing HBM, so if one standard likes a granularity of 256 bytes per channel to keep the arrays happy, then I think it's possible a similar granularity is helpful for the other.
IB_STS is hardware register 7 and is listed as read-only in the ISA doc. I don't know how the hardware is supposed to react to a setreg on a read-only register; it might be ignored. If the operation were allowed to execute, it would seem likely to cause a serious problem should a program overwrite one of the other counters that are known to be in use. One simplifying assumption for the hardware would be to not make the internal adder fully featured enough to worry about a nonsense scenario where loads are launched and VMCNT is set back to zero before the decrement can kick in. Looking more closely at the ISA manual, though, these bits can be read or written by an SALU instruction, e.g. s_setreg_b32. So "abandoned" is not the case.
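To make that nonsense scenario concrete, here's a toy model of what a bare-bones counter could do if a setreg zeroed it while loads were still outstanding. The 4-bit width and the wrap-around behavior are my assumptions about a minimal adder, not anything the ISA doc states.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model: a 4-bit VMCNT-style counter with nothing clever about
 * underflow. Width and wrap behavior are assumptions for illustration. */
static uint8_t vmcnt;                                   /* 0..15 */

static void issue_load(void)   { vmcnt = (vmcnt + 1) & 0xF; }
static void load_returns(void) { vmcnt = (vmcnt - 1) & 0xF; }

int main(void)
{
    issue_load();
    issue_load();             /* two loads in flight, vmcnt == 2        */
    vmcnt = 0;                /* setreg-style overwrite back to zero    */
    load_returns();           /* decrement now wraps the counter to 15  */
    printf("vmcnt = %u\n", vmcnt);
    return 0;                 /* a wait for vmcnt == 0 would now stall  */
}
```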