AMD: Speculation, Rumors, and Discussion (Archive)

It is a small set, but I am not able to state things as definitively as I did originally. I don't have a ready reference to compare what the Gen 6 or 7 ISAs have in this regard.
The existence of packed ops points to some hardware facility for extracting bytes prior to VI.

On the other hand, Intel's graphics FPUs have a longer history of 2x Int 16 throughput.
 
GCN can do that because the math pipeline is very short. It's no different in this regard from the VLIW chips.
VLIW has significantly higher IPC (for one thread), meaning that it finishes a thread faster. When threads finish faster they hold their resources (registers) for a shorter time. VLIW was also 16 wide, meaning there was a smaller probability for a load/store (wave-wide scatter/gather) to cache miss. A single missed lane stalls the whole wave, and during the stall period the wave's registers are still reserved. These are some reasons why VLIW was less register bound. Obviously in practice it had problems with lane occupancy, register bank conflicts, etc.
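To put a number on the wave-width point (illustrative only; p here is an assumed per-lane miss rate, not a measured one): if each lane misses the cache independently with probability p, the chance that at least one lane misses, and therefore stalls the whole wave, is

$$P(\text{wave stall}) = 1 - (1 - p)^{N}.$$

With p = 0.01, a 16-wide wave stalls on roughly 15% of its memory accesses, while a 64-wide wave stalls on roughly 47% of them.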

But there were and still are tasks that suit VLIW much better. GPU DXT compression was one of them. One thread encodes one block. It has high register usage (it loads the color+alpha data of 4x4 pixels into registers and stores the compressed version at the end of the shader) and a relatively low number of threads.
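Since the DXT example is argued in prose, here is a minimal scalar C++ sketch of the per-thread BC1/DXT1 workload (hypothetical and simplified: naive bounding-box endpoints, no alpha, no 3-color mode). The point to notice is that the 16 texels, the 4-entry palette, and the index accumulator all stay live across the whole function, which is exactly the register pressure described above.

```cpp
#include <cstdint>
#include <algorithm>

struct RGB { uint8_t r, g, b; };

static uint16_t pack565(RGB c) {
    return uint16_t((c.r >> 3) << 11 | (c.g >> 2) << 5 | (c.b >> 3));
}

// Encodes one 4x4 block into the 8-byte BC1 format (two 565 endpoints +
// sixteen 2-bit palette indices). Endpoint selection is the naive min/max
// bounding-box method, chosen for brevity, not quality.
uint64_t encodeBC1(const RGB px[16]) {
    RGB lo = px[0], hi = px[0];
    for (int i = 1; i < 16; ++i) {              // bounding box of the block
        lo.r = std::min(lo.r, px[i].r); hi.r = std::max(hi.r, px[i].r);
        lo.g = std::min(lo.g, px[i].g); hi.g = std::max(hi.g, px[i].g);
        lo.b = std::min(lo.b, px[i].b); hi.b = std::max(hi.b, px[i].b);
    }
    // 4-entry palette: the two endpoints plus two interpolated colors.
    RGB pal[4] = { hi, lo,
        { uint8_t((2*hi.r + lo.r) / 3), uint8_t((2*hi.g + lo.g) / 3), uint8_t((2*hi.b + lo.b) / 3) },
        { uint8_t((hi.r + 2*lo.r) / 3), uint8_t((hi.g + 2*lo.g) / 3), uint8_t((hi.b + 2*lo.b) / 3) } };
    uint32_t indices = 0;
    for (int i = 0; i < 16; ++i) {              // nearest palette entry per texel
        int best = 0, bestDist = 1 << 30;
        for (int j = 0; j < 4; ++j) {
            int dr = px[i].r - pal[j].r, dg = px[i].g - pal[j].g, db = px[i].b - pal[j].b;
            int d = dr*dr + dg*dg + db*db;
            if (d < bestDist) { bestDist = d; best = j; }
        }
        indices |= uint32_t(best) << (2 * i);
    }
    // BC1's 4-color mode requires endpoint0 > endpoint1; swap if needed and
    // flip the low bit of every 2-bit index (0<->1, 2<->3) to compensate.
    uint16_t e0 = pack565(hi), e1 = pack565(lo);
    if (e0 < e1) { std::swap(e0, e1); indices ^= 0x55555555u; }
    return uint64_t(indices) << 32 | uint32_t(e1) << 16 | e0;
}
```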
I think the more illuminating number is the peak register file bandwidth: 64 CUs with 64 lanes each using 16 bytes at 1GHz = 64TB/s.
That is a fine way to describe it as well. Both size and bandwidth are important for fast on-chip memory pools. 16 bit registers use only half the register file bandwidth (so you can access twice as many per clock).

Btw. How do you come up with 16 bytes per lane number (as registers are 4 bytes each)?
NVidia abandoned OoO machinery. It just isn't worth the power and die area. NVidia relies upon static compilation.
Yes. Fermi had OoO and Kepler ditched it (and the double clocked ALUs). Both were not good for perf/watt. I was not suggesting that OoO is a benefit for GPUs (right now at least), I just used it as a comparison, because it is the CPU's most common latency hiding technique (in addition to cache prefetching). OoO has some benefits when it comes to register file size. For example Xbox 360 VMX-128 needed 2x 128 4d vector registers (128 for each HW thread) since it didn't have OoO (register renaming). Register renaming in general is a good idea for handling branches and cache misses, since neither are known at compile time. It allows the hardware to use as many registers as the taken code path (with actual cache misses) really needs, instead of reserving the worst case. The flip side of the coin is that OoO consumes lots of power.
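A small illustrative C++ example of that trade-off (my sketch, not from the post above): without renaming, the compiler must name every in-flight value itself, so hiding load/FMA latency by unrolling visibly multiplies architectural register use, whereas an OoO core with renaming can extract the same overlap from the rolled loop.

```cpp
#include <cstddef>

// Rolled loop: one accumulator. An in-order core serializes on 'sum',
// so each multiply-add waits for the previous one (and its loads).
float dot_rolled(const float* a, const float* b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Unrolled 4x: four independent accumulators keep four multiply-adds in
// flight at once, but the worst-case register footprint is now baked into
// the code -- the "reserving the worst case" cost described above. A
// renaming OoO core gets similar overlap from dot_rolled automatically.
float dot_unrolled(const float* a, const float* b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)                 // remainder
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```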
 
No it doesn't. A single hardware thread can keep issuing to the VALU/SALU for as long as the code that fits in the instruction cache can run, until it hits a latency event (memory read/write, constant buffer, LDS, etc.).

That's why you can get high throughput with only 2 or 3 hardware threads per VALU, if arithmetic intensity is very high.
sebbbi is correct if referring to how a CU works. On GCN there are 4 SIMDs per CU. Each SIMD executes a typical instruction in 4 clocks, as there are 16 ALUs per SIMD, so a 64-thread wavefront takes 4 clocks to process.
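Back-of-the-envelope, using the figures above:

$$\frac{64\ \text{work-items per wavefront}}{16\ \text{lanes per SIMD}} = 4\ \text{clocks per instruction}.$$

So a single ALU-bound wavefront already claims its SIMD's issue slot once every 4 clocks, which is consistent with the claim above that 2 or 3 hardware threads per VALU suffice when arithmetic intensity is very high.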
 
Btw. How do you come up with 16 bytes per lane number (as registers are 4 bytes each)?

A single CU should be able to do 64 FMAs per clock, so it should be able to read 3 operands per ALU and write one result. That's 4×32 bits per ALU, or 16 bytes.
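Spelled out, that recovers the earlier peak-bandwidth figure:

$$(3\ \text{reads} + 1\ \text{write}) \times 4\ \text{B} = 16\ \text{B per lane per clock},$$
$$64\ \text{CUs} \times 64\ \text{lanes} \times 16\ \text{B} \times 1\ \text{GHz} = 65{,}536\ \text{GB/s} \approx 64\ \text{TB/s}.$$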
 
Both were not good for perf/watt. I was not suggesting that OoO is a benefit for GPUs (right now at least), I just used it as a comparison, because it is the CPU's most common latency hiding technique (in addition to cache prefetching). OoO has some benefits when it comes to register file size. For example Xbox 360 VMX-128 needed 2x 128 4d vector registers (128 for each HW thread) since it didn't have OoO (register renaming). Register renaming in general is a good idea for handling branches and cache misses, since neither are known at compile time. It allows the hardware to use as many registers as the taken code path (with actual cache misses) really needs, instead of reserving the worst case. The flip side of the coin is that OoO consumes lots of power.
Is this a comparison between code optimized for an in-order pipeline with a significant amount of loop unrolling, and one that assumes the hardware can do so via register renaming?
In terms of how register consumption appears relative to the same code sequence, register renaming can only increase consumption. At a minimum, the last non-speculative architected state is recorded in addition to any speculative registers. As far as the hardware goes, the fixed thread count and fixed register files leave plenty of slack.
Aside from the 8-16 architectural registers of each type per thread in the x86 case, there can be tens to over a hundred physical registers in the register file that can only be used, or wasted, by 2 or possibly 4 contexts.
 
Hmm, is this true:

Don't forget that Nvidia was working on HMC (Hybrid Memory Cube) and that turned out to be a complete failure, so Nvidia brushed it under the rug and pretended nothing ever happened.

A quote from the comments.


At least, they were collaborating to include HMC (with Micron and Elpida tech, if I'm right). But as you have certainly seen, nobody speaks about HMC anymore today. Then suddenly we saw Nvidia speaking only about "3D memory" (the funny thing is that I have not seen a single one of their presentations mention HBM; they always use the term "3D memory"), but it has since been confirmed that they moved from HMC to HBM.


As for the article, you will find it on nearly every site (including Guru3D, who published it today); they just pick the article up from each other. That doesn't mean it is something coming from their own sources...
 
I really don't know if I'd call it OoO in the big picture sense, but in any case Fermi was able to do dependency checking and register scoreboarding in hardware in order to pick warps and reorder instructions within warps.

http://images.anandtech.com/doci/5699/Scheduler.jpg

Is this reordering within the same warp, i.e. within one instruction stream, or between independent warps in the scheduler?
What I recall is a form of out-of-order completion where it was possible to continue issuing as long as no dependence was found, meaning longer-latency operations could return results after later short ones. Once the scoreboard picked up on a hazard, the GPU did not continue issuing for that warp.
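A toy C++ model of that behavior (a sketch of the concept only, not NVIDIA's actual mechanism; WAR/WAW hazards and multi-warp scheduling are ignored): issue proceeds in order and keeps going until a source operand has a write still in flight, so a long-latency result can land after later short ones.

```cpp
#include <cstdio>
#include <vector>

// Toy scoreboard: in-order issue, out-of-order completion.
struct Instr { int dst, src0, src1, latency; };

int main() {
    // ready[i] = cycle at which register i's pending write completes.
    int ready[8] = {0};
    std::vector<Instr> prog = {
        {1, 0, 0, 20},  // long-latency (load-like) op writing r1
        {2, 3, 4, 4},   // independent ALU op: issues right behind it
        {3, 2, 4, 4},   // depends on r2 only: still no stall on the load
        {4, 1, 2, 4},   // reads r1: scoreboard halts issue until the load lands
    };
    int cycle = 0;
    for (const Instr& in : prog) {
        // Hazard: a source operand's write is still in flight -> stop issuing.
        while (cycle < ready[in.src0] || cycle < ready[in.src1])
            ++cycle;
        ready[in.dst] = cycle + in.latency;  // result completes out of order...
        std::printf("issue @%2d, r%d ready @%2d\n", cycle, in.dst, ready[in.dst]);
        ++cycle;                             // ...while issue continues in order
    }
}
```

Running it shows r2 and r3 completing (cycles 5 and 9) long before r1 (cycle 20), with issue halting only when r1 is actually needed.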
 
They're linking to each other, with an "it is said" reference. Shouldn't we link to this thread too as a source of corroboration?

It's OK, just give the press a few days, websites will start coming up with the same story without mentioning Wccftech, to pretend that they came up with it on their own. Then this rumor will officially be DOUBLE CONFIRMED.
 
Off-topic a bit, but: https://01.org/sites/default/files/...-osrc-hsw-commandreference-instructions_0.pdf

That's the Gen7 instruction set reference. It's pretty much general purpose for U8.

Edit: new thread for Gen! https://forum.beyond3d.com/threads/intel-gen-architecture-discussion.57146/
Intel's register file is an interesting arrangement.
The IVB documentation points to byte granularity for the register file, although the EUs would promote it to a wider width for the purposes of execution.
 