Is that in GCN's VALU set (pre VI)?
Yes, AMD's only been doing 8-bit packed media instructions since late VLIW, such a laggard.
Then I rescind my point on the 8-bit multimedia section.
VLIW is significantly higher IPC (for one thread), meaning that it finishes a thread faster. When threads finish faster they keep the resources (registers) for a shorter time. VLIW was also 16 wide, meaning that there was a smaller probability for a load/store (wave-wide scatter/gather) to cache miss. A single missed lane stalls the whole wave. During the stall period the other wave registers are still reserved. These are some reasons why VLIW was less register bound. Obviously in practice it had problems with lane occupancy, register bank conflicts, etc.
GCN can do that because the math pipeline is very short. It's no different in this regard than the VLIW chips.
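To make the wave-width point above concrete, here is a minimal sketch (mine, not from the thread) of the stated reasoning: if each lane of a wave-wide gather misses independently with some probability, the chance that at least one lane misses, and therefore stalls the whole wave, grows quickly with the wave width. The per-lane miss rates and the 16- vs 64-wide comparison are illustrative assumptions only.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative only: probability that a wave-wide gather stalls because at
 * least one lane misses, assuming each of `lanes` accesses misses
 * independently with probability p_miss.  P(stall) = 1 - (1 - p_miss)^lanes. */
static double p_wave_stall(int lanes, double p_miss)
{
    return 1.0 - pow(1.0 - p_miss, lanes);
}

int main(void)
{
    const double rates[] = { 0.01, 0.05 };   /* assumed per-lane miss rates */
    for (int i = 0; i < 2; ++i)
        printf("p_miss=%.2f: 16-wide stall=%.3f, 64-wide stall=%.3f\n",
               rates[i], p_wave_stall(16, rates[i]), p_wave_stall(64, rates[i]));
    return 0;
}
```

Even at a 1% per-lane miss rate the 64-wide case stalls the wave roughly three times as often as the 16-wide case in this toy model.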
That is a fine way to describe it as well. Both size and bandwidth are important for fast on-chip memory pools. 16-bit registers use only half the register file bandwidth (so you can access twice as many per clock).
I think the more illuminating number is the peak register file bandwidth: 64 CUs with 64 lanes each using 16 bytes at 1 GHz = 64 TB/s.
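For what it's worth, the arithmetic behind that figure checks out; here is a quick sketch. Reading "16 bytes per lane" as four 32-bit register accesses per clock (e.g. three source reads plus one destination write for an FMA) is an assumption on my part, and a later post asks exactly where that number comes from.

```c
#include <stdio.h>

/* Quick check of the quoted 64 TB/s figure.  Treating 16 bytes per lane as
 * 4 x 32-bit register accesses per clock is an assumption, not something
 * stated in the thread. */
int main(void)
{
    const double cus = 64, lanes_per_cu = 64, bytes_per_lane = 16, clock_hz = 1e9;
    double bytes_per_clock = cus * lanes_per_cu * bytes_per_lane;   /* 65,536 B = 64 KiB */
    double bytes_per_sec   = bytes_per_clock * clock_hz;
    printf("%.0f B/clock -> %.1f TB/s\n", bytes_per_clock, bytes_per_sec / 1e12);
    /* ~65.5 TB/s, loosely quoted as "64 TB/s" (64 KiB per clock at 1 GHz). */
    return 0;
}
```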
Yes. Fermi had OoO and Kepler ditched it (and the double clocked ALUs). Both were not good for perf/watt. I was not suggesting that OoO is a benefit for GPUs (right now at least), I just used it as a comparison, because it is the CPU's most common latency hiding technique (in addition to cache prefetching). OoO has some benefits when it comes to register file size. For example, the Xbox 360 VMX-128 needed 2x 128 4d vector registers (128 per HW thread) since it didn't have OoO (register renaming). Register renaming in general is a good idea for handling branches and cache misses, since neither is known at compile time. It allows the hardware to use as many registers as the taken code path (with actual cache misses) really needs, instead of reserving the worst case. The flip side of the coin is that OoO consumes lots of power.
NVidia abandoned OoO machinery. It just isn't worth the power and die area. NVidia relies upon static compilation.
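As a rough illustration of the renaming point, here is a sketch (mine, not from the thread) of what hiding latency looks like when the code, rather than the hardware, has to supply the extra registers: unrolling with independent accumulators keeps several results live at once, and each one ties up an architectural register for the worst case, which is exactly what register renaming lets the hardware do on demand instead. The function name and the unroll factor are arbitrary.

```c
#include <stdio.h>

/* Without register renaming, hiding the latency of a dependent multiply-add
 * chain means the code itself keeps several independent partial sums live,
 * each in its own architectural register.  With renaming, a single
 * accumulator written in a loop could be mapped to many physical registers
 * by the hardware instead. */
static float dot_unrolled(const float *a, const float *b, int n)
{
    /* four live accumulators => four registers reserved for the whole loop */
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i + 4 <= n; i += 4) {   /* remainder iterations omitted for brevity */
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%f\n", dot_unrolled(a, b, 8));   /* 36.0 */
    return 0;
}
```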
sebbbi is correct if referring to how a CU works. On GCN, there are 4 SIMDs per CU. Each SIMD executes a typical instruction in 4 clocks as there are 16 ALUs per SIMD, so a 64-thread wavefront takes 4 clocks to process.
No it doesn't. A single hardware thread can issue to the VALU/SALU for as long as the code that fits into the instruction cache can run, until it hits a latency event (read/write memory, constant buffer, LDS, etc.).
That's why you can get high throughput with only 2 or 3 hardware threads per VALU, if arithmetic intensity is very high.
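A back-of-envelope model of that claim, under assumed numbers only (the 400-cycle miss latency and the instruction mixes are placeholders, not measurements): each VALU instruction of a 64-wide wave occupies the 16-wide SIMD for 4 clocks, so the higher the arithmetic intensity, the fewer resident waves a SIMD needs to cover a memory stall.

```c
#include <stdio.h>

/* Rough latency-hiding model for one GCN SIMD.  Assumptions for illustration:
 * a 64-wide wave issues each VALU instruction over 4 clocks (16 lanes), and a
 * memory miss costs ~400 clocks.  Waves needed is roughly 1 (the stalled
 * wave) plus enough others to fill the stall with ALU work. */
int main(void)
{
    const int clocks_per_valu_inst = 4;      /* 64 threads / 16 lanes */
    const int mem_latency_clocks   = 400;    /* assumed miss latency */
    const int alu_per_mem[] = { 10, 25, 50, 100 };   /* arithmetic intensity */

    for (int i = 0; i < 4; ++i) {
        int cover = alu_per_mem[i] * clocks_per_valu_inst;   /* ALU clocks per wave per miss */
        int waves = 1 + (mem_latency_clocks + cover - 1) / cover;
        printf("%3d VALU insts per memory op -> ~%d wave(s) per SIMD\n",
               alu_per_mem[i], waves);
    }
    return 0;
}
```

With very high arithmetic intensity the model lands at roughly 2-3 waves per SIMD, in line with the figure above.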
Btw. How do you come up with the 16 bytes per lane number (as registers are 4 bytes each)?
AMD Working On An Entire Range of HBM GPUs To Follow Fiji And Fury Lineup – Has Priority To HBM2 Capacity
Read more: http://wccftech.com/amd-working-entire-range-hbm-gpus-follow-fiji-fury-lineup/#ixzz3fpyrkUdp
I really don't know if I'd call it OoO in the big picture sense, but in any case Fermi was able to do dependency checking and register scoreboarding in hardware in order to pick warps and reorder instructions within warps.
Which element was OoO in Fermi? I don't recall.
Is this a comparison between code optimized for an in-order pipeline with a significant amount of loop unrolling, and one that assumes the hardware can do so via register renaming?
Both were not good for perf/watt. I was not suggesting that OoO is a benefit for GPUs (right now at least), I just used it as a comparison, because it is the CPU's most common latency hiding technique (in addition to cache prefetching). OoO has some benefits when it comes to register file size. For example, the Xbox 360 VMX-128 needed 2x 128 4d vector registers (128 per HW thread) since it didn't have OoO (register renaming). Register renaming in general is a good idea for handling branches and cache misses, since neither is known at compile time. It allows the hardware to use as many registers as the taken code path (with actual cache misses) really needs, instead of reserving the worst case. The flip side of the coin is that OoO consumes lots of power.
Hmm, is this true:
Don't forget that Nvidia were working on HMC (Hybrid Memory Cube) and that turned out to be a complete failure, so Nvidia brushed it under the rug and pretended nothing ever happened.
A quote from the comments.
I really don't know if I'd call it OoO in the big picture sense, but in any case Fermi was able to do dependency checking and register scoreboarding in hardware in order to pick warps and reorder instructions within warps.
http://images.anandtech.com/doci/5699/Scheduler.jpg
They're linking to each other, with an "it is said" reference. Shouldn't we link to this thread too as a source of corroboration?
Intel's register file is an interesting arrangement.
Off-topic a bit, but: https://01.org/sites/default/files/...-osrc-hsw-commandreference-instructions_0.pdf
That's the Gen7 instruction set reference. It's pretty much general purpose for U8.
Edit: new thread for Gen! https://forum.beyond3d.com/threads/intel-gen-architecture-discussion.57146/
Dude, it's all over the interwebs already.
They're linking to each other, with an "it is said" reference. Shouldn't we link to this thread too as a source of corroboration?