I was just about to ask how big the GPU caches are and what latency they operate at.
Rage3D speculated that RV770 has 64 KB of L1 per SIMD (= 640 KB total on RV770) and 256 KB of L2 per memory controller (= 1 MB total on RV770).
> On Larrabee the entire 2K register file gets flushed to cache on every fiber change, no matter what.

Intel's paper on Larrabee specifies in section 4.5 that each core will run 4 threads, i.e. one thread per hardware context, so there won't be any context switches (the paper also specifies that threads are pinned to a core). The four threads are split into one setup thread and three worker threads for rasterization/shading/blending, and within each worker thread the strands (i.e. fibers) are switched co-operatively to hide the latency of texture operations. This probably means that, depending on some heuristics, the Larrabee runtime will assign x quads to each worker thread, and the worker thread code is unrolled/software-pipelined to accommodate them. The code might look like this for a worker thread with four 'strands' (some very simple pseudo-code here):
for each batch of quads {
// strand 1
rasterize & interpolate quad 1
send the texture fetches for quad 1 to the TUs
// strand 2
rasterize & interpolate quad 2
send the texture fetches for quad 2 to the TUs
// strand 3
rasterize & interpolate quad 3
send the texture fetches for quad 3 to the TUs
// strand 4
rasterize & interpolate quad 4
send the texture fetches for quad 4 to the TUs
// strand 1
texels for quad 1 arrive in the L1
shade quad 1
blend quad 1
// strand 2
texels for quad 2 arrive in the L1
shade quad 2
blend quad 2
// strand 3
texels for quad 3 arrive in the L1
shade quad 3
blend quad 3
// strand 4
texels for quad 4 arrive in the L1
shade quad 4
blend quad 4
}
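To make that scheme a bit more concrete, here is a minimal C sketch of the same idea (not actual Larrabee code; quad_t, rasterize_quad(), issue_texture_fetch(), shade_quad() and blend_quad() are all hypothetical names). The fetches for a whole group of quads are issued first, so each quad's texture latency is hidden behind the rasterization and issue work done for the other quads in the group:

#include <stddef.h>

#define STRANDS 4   /* one software strand/fiber per in-flight quad */

typedef struct { int x, y; } quad_t;   /* placeholder quad descriptor */

/* Hypothetical pipeline stages, stubbed out so the sketch compiles. */
static void rasterize_quad(quad_t *q)            { (void)q; }
static void issue_texture_fetch(const quad_t *q) { (void)q; }  /* non-blocking: texels arrive in L1 later */
static void shade_quad(quad_t *q)                { (void)q; }  /* consumes the texels fetched above */
static void blend_quad(const quad_t *q)          { (void)q; }

static void worker_thread(quad_t *quads, size_t count)
{
    for (size_t base = 0; base + STRANDS <= count; base += STRANDS) {
        /* phase 1: every strand rasterizes its quad and issues its texture fetches */
        for (int s = 0; s < STRANDS; ++s) {
            rasterize_quad(&quads[base + s]);
            issue_texture_fetch(&quads[base + s]);
        }
        /* phase 2: by the time we come back to strand s, its texels should be in the L1 */
        for (int s = 0; s < STRANDS; ++s) {
            shade_quad(&quads[base + s]);
            blend_quad(&quads[base + s]);
        }
    }
}

int main(void)
{
    quad_t quads[8] = {{0, 0}};
    worker_thread(quads, 8);
    return 0;
}

The 'fiber switch' here is nothing more than falling through to the next chunk of straight-line code, which is why no register state needs to be spilled.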
> On Larrabee the entire 2K register file gets flushed to cache on every fiber change, no matter what.

Umh.. why would you want to do that?!?
> eg:
> 345fewrtrtderewrt
> 345fewrtrtberewrt

Well, apart from starting an identifier with a digit (!), "FORTRAN 4 FTW!" :smile:
> SCATTERD - Scatter Doubleword Vector to Memory
> Downconverts and stores elements in doubleword vector v1 to the memory locations pointed to by base address m + index vector index * scale.

Notice that the array base is most likely 64-bit, but the index offsets are only 32-bit integers. If you're doing normal 4-byte accesses, that limits you to 16 GB of virtual address space you can gather/scatter across. Nothing you'd ever hit when doing graphics, but certainly something the compiler will have to take into consideration, and it could become an issue in larger HPC systems.
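As a quick sanity check on that 16 GB figure, here is the arithmetic as a small C program (the 64-bit base / 32-bit index / 4-byte scale are the assumptions stated above, not something read out of the spec text):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t base  = 0x100000000ull;   /* 64-bit array base (arbitrary) */
    uint32_t index = UINT32_MAX;       /* indices are only 32 bits wide */
    uint64_t scale = 4;                /* normal 4-byte (dword) elements */

    /* size of the window reachable from the base with a 32-bit index */
    uint64_t window = ((uint64_t)UINT32_MAX + 1) * scale;
    uint64_t last   = base + (uint64_t)index * scale;

    printf("gather/scatter window: %" PRIu64 " bytes (= %" PRIu64 " GB)\n",
           window, window >> 30);
    printf("last reachable address: 0x%" PRIx64 "\n", last);
    return 0;
}

2^32 indices x 4 bytes = 16 GB; with byte-sized elements the window shrinks to 4 GB, and with 8-byte elements it grows to 32 GB.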
> SHUF128x32 - Shuffle Vector Dqwords Then Doublewords
> Shuffles 128-bit blocks of the vector read from vector v2, and then 32-bit blocks of the result.

From a hardware perspective this implementation makes sense: you have an initial 4x4x128-bit crossbar followed by 4 individual 4x4x32-bit crossbars, which results in a much smaller and faster design than a full 16x16x32-bit crossbar. However, to do an any-to-any shuffle it takes 4 instructions and you need to either pre-set or set 4 mask registers. Again, you're extremely unlikely to hit this in graphics, but it's a limitation for general-purpose usage.
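For illustration, here is a scalar C model of such a two-level shuffle (the selector encoding is assumed, not taken from the instruction description): one crossbar permutes whole 128-bit blocks, then a single 4-wide pattern permutes the 32-bit elements inside every block. Because that in-block pattern is shared by all four blocks, an arbitrary 16-to-16 permutation generally has to be assembled from several such shuffles merged under write-masks, which is where the 4-instruction / 4-mask cost mentioned above comes from:

#include <stdint.h>
#include <stdio.h>

/* dst block b gets source block perm128[b]; within each block, slot k gets
   element perm32[k] of the already block-permuted data */
static void shuf128x32_model(uint32_t dst[16], const uint32_t src[16],
                             const int perm128[4], const int perm32[4])
{
    uint32_t tmp[16];
    for (int b = 0; b < 4; ++b)                 /* first crossbar: 128-bit blocks */
        for (int k = 0; k < 4; ++k)
            tmp[b * 4 + k] = src[perm128[b] * 4 + k];
    for (int b = 0; b < 4; ++b)                 /* second crossbar: 32-bit elements */
        for (int k = 0; k < 4; ++k)
            dst[b * 4 + k] = tmp[b * 4 + perm32[k]];
}

int main(void)
{
    uint32_t src[16], dst[16];
    for (int i = 0; i < 16; ++i) src[i] = (uint32_t)i;

    int perm128[4] = { 3, 2, 1, 0 };   /* reverse the 128-bit blocks   */
    int perm32[4]  = { 3, 2, 1, 0 };   /* reverse dwords in each block */
    shuf128x32_model(dst, src, perm128, perm32);

    for (int i = 0; i < 16; ++i) printf("%u ", dst[i]);
    printf("\n");                      /* prints 15 14 13 ... 1 0 */
    return 0;
}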
> COMPRESS{D,Q} - Pack and Store Vector to Unaligned Memory
> EXPAND{D,Q} - Load Unaligned and Unpack to Vector
I'm curious about their transcendental support. They list EXP2, LOG2, RECIP, and RSQRT, but no mention of precision (often the hardware instruction is only a close approximation, not exact). Also no SIN/COS support?
Getting 3 options for the float MAD is kind of silly. V1 = V1*V3 + V2 is usefully different from V1 = V2*V3 + V1, but the compiler should be able to generate the third variant on its own. Also, it's interesting that the float MADs have 3 variants plus 3 negated variants, while the int MAD only has a single variant. It's also interesting that they appear to be implementing only a standard multiply-add, not a fused multiply-add.
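A scalar stand-in for the point about the third variant (function names made up; only the operand patterns matter): since the multiply operands commute, the "destination is also a multiply operand" case needs only one hardware form, and the compiler can synthesize the other one by swapping source registers.

#include <stdio.h>

/* v1 = v2*v3 + v1 : the accumulator is the destination */
static float madd_acc(float v1, float v2, float v3) { return v2 * v3 + v1; }

/* v1 = v1*v3 + v2 : the destination is one of the multiply operands */
static float madd_mul(float v1, float v2, float v3) { return v1 * v3 + v2; }

int main(void)
{
    float v1 = 2.0f, v2 = 3.0f, v3 = 4.0f;

    /* the remaining pattern, v1 = v1*v2 + v3, needs no instruction of its
       own: it is madd_mul with the two source registers swapped */
    float third_variant = madd_mul(v1, v3, v2);   /* = v1*v2 + v3 = 10 */

    printf("%g %g %g\n", madd_acc(v1, v2, v3),    /* 14 */
           madd_mul(v1, v2, v3),                  /* 11 */
           third_variant);                        /* 10 */
    return 0;
}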
> Kind of surprising there is a fixed amount at all (very much unlike GPUs). In theory the task switch instructions might have a couple of bits to indicate the number of registers used, though.

There's a fixed number of hardware threads, 4, so it seems reasonable.
> 32 is quite a lot, too many even if they aren't being used ... but maybe you can choose to use less. That all depends on the really interesting instructions.

It's not a lot really. What shader nowadays uses only 32 values (or eight 4-component vectors)? And if possible you probably also want to use them for interpolation parameters and such. Even if you then still have 'too many' registers, there's always loop unrolling.
I assume the compiler will try to execute 4-wide SIMD code in a 4x4 configuration rather than 16x1.
> Probably the precision should be enough so that you can get the full 32-bit result with two Newton-Raphson iterations, like their SSEx brethren.

It's probably higher than that. As far as I know it's not that hard to design fast hardware that is accurate to within 1-2 ulp (32-bit). That would make them immediately usable for graphics and most other applications that use 32-bit floats. Applications that require exact results work with double precision anyway, and those are implemented in the 'utility math' functions. Newton-Raphson works great for division and square root, and for everything else I imagine they use gather to look up polynomial coefficients.
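For reference, the refinement being discussed looks like this in plain C (the ~12-bit seed value is made up to stand in for a low-precision hardware RECIP result); each iteration roughly doubles the number of correct bits, so a good seed reaches full single precision in one or two steps:

#include <stdio.h>

/* one Newton-Raphson step for the reciprocal of a, given approximation x */
static float nr_recip_step(float x, float a) { return x * (2.0f - a * x); }

int main(void)
{
    float a    = 3.14159f;
    float seed = 0.318f;              /* pretend ~12-bit accurate 1/a from hardware */
    float x1   = nr_recip_step(seed, a);
    float x2   = nr_recip_step(x1, a);

    printf("exact 1/a  = %.9g\n", 1.0f / a);
    printf("seed       = %.9g\n", seed);
    printf("after 1 NR = %.9g\n", x1);
    printf("after 2 NR = %.9g\n", x2);
    return 0;
}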
> We are talking about 32 x 512-bit registers though... you have space for 16 x 32-bit values per register per HW thread, or am I missing your point here?

The point is how many pixels are processed per thread. If it's 16, then you only have 32 scalars per pixel. And it can be even more pixels per thread with unrolling. So either way, 32 registers is never going to be too many (nor too few, for that matter).
Exactly. And that would be 32 live registers, not just 32 registers.
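Putting numbers on that exchange (back-of-the-envelope only, using the 32 x 512-bit register file mentioned above and assuming one pixel per lane):

#include <stdio.h>

int main(void)
{
    int regs_per_thread   = 32;   /* vector registers per hardware thread */
    int lanes_per_reg     = 16;   /* 512-bit register = 16 x 32-bit lanes */
    int pixels_per_thread = 16;   /* e.g. one pixel per lane */

    int scalars_total     = regs_per_thread * lanes_per_reg;    /* 512 */
    int scalars_per_pixel = scalars_total / pixels_per_thread;  /*  32 */

    printf("%d 32-bit values per thread, %d per pixel at %d pixels/thread\n",
           scalars_total, scalars_per_pixel, pixels_per_thread);
    return 0;
}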
It's software, you can do whatever you want to maximize efficiency.
You mean that they implemented only multiply-accumulate (MAC, 3 operands) instructions instead of multiply-add (MAD, 4 operands)? Because the only difference between a MAD/MAC and FMAD/FMAC is that the fused ones do not round the result of the multiplication. In the instruction descriptions there is no mention of rounding the result of the multiplications so those are all fused multiply-accumulate instructions.
[edit] I realized it is impossible to tell from the descriptions if the hardware instructions will be 3 operand MACs or 4 operand MADs, my bad. Anyway they are fused.
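Since the fused/unfused distinction is exactly about that intermediate rounding, here is a tiny C demonstration using the standard fmaf() (compile without fast-math so the separate multiply and add don't get contracted into an FMA anyway):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + ldexpf(1.0f, -13);   /* 1 + 2^-13 */
    float b = 1.0f - ldexpf(1.0f, -13);   /* 1 - 2^-13 */
    float c = -1.0f;

    float prod    = a * b;                /* rounded here: 1 - 2^-26 rounds to 1.0f */
    float unfused = prod + c;             /* 0.0 */
    float fused   = fmaf(a, b, c);        /* single rounding: exactly -2^-26 */

    printf("unfused: %.9e\n", unfused);
    printf("fused:   %.9e\n", fused);
    return 0;
}

The unfused version prints 0 because the product 1 - 2^-26 rounds to 1.0f before the add; the fused version keeps the full product and returns exactly -2^-26.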