Which block on the die shot do we assume to be the LDS?

Unless the claim is that the eDRAM is plugged directly into the LDS (it is visibly separate from the SIMDs), I didn't see a reason to discuss it.
That much is clear - the device should have its 'proper' LDS. Question is, how close the eDRAM block would need to be to have a chance of being accessed by the SIMDs, Evergreen style.

I'm assuming the big swath of eDRAM that is not near the SIMDs isn't LDS.
Yes, according to the official floorplan, the LDS block sits right on the left of the SIMD0 block, and not so close to the SIMD1 block.

On a more regular layout like RV770, the LDS is between the texture block and the ALU blocks.
The less regular small APUs don't do this. Brazos might have its LDS to the left of one of its SIMDs. I'm not sure about the Wii U's setup.
What do we know about the usage model of the eDRAM, though?

The usage model described for the eDRAM doesn't match up with what the LDS is capable of doing, so I didn't have a reason to think it relevant.
There are descriptions of games using it to hold framebuffer and intermediate buffer data. That involves handling ROP output, which the LDS does not have a path to receive.
Apropos, backtracking to your earlier discussion with DRS, I'm curious to see what benchmarks demonstrate little latency benefit from the caches on an AMD VLIW design. Any pointers would be welcome.
Yes, but we don't know how the multiple banks in there factor in - are fb accesses interleaved across all banks, or are they interleaved across a few, and if the latter, what do the 'idle' banks do - are they open for access for other purposes?
The first link is http://www.sisoftware.net/?d=qa&f=gpu_mem_latency .
One correction is that I had misremembered which lines belonged to which architectures. There is a difference of about 100-150ns from best case to worst case, with the best case being roughly 500ns for a discrete GPU.
The physical external memory devices would have less than 40ns of latency, leaving the majority of the hit to be taken up by the cache and memory pipeline of the GPU.

Thank you. I'm really curious to see Sandra's code, though - do they use vfetch or texsample? If the former, in the Global mem random access case Evergreen would use its caches effectively as a mere coalescing apparatus - the cache lines are invalidated after each wave. Not so with the texfetch path - caches would behave as actual caches there.
The big question I have about the numbers is that they do seem very high, especially relative to what was tested for RV770 here:
http://forum.beyond3d.com/showthread.php?p=1322632#post1322632
That earlier testing showed 180 cycles for an L1 hit, which is in the neighborhood of half the Sandra numbers. In either case, the baseline latency for a GPU is either very long, or extremely long. External DRAM has a fixed component of total latency, and it's not the majority.

That's a very good read, thanks (and props to prunedtree). I did a similar, albeit rudimentary test (involving 4x4 matrices) myself, and it demonstrates a clear gradation across the vfetch, texsample and lds accesses - roughly a mem efficiency factor of 2x, going from left to right (while the matmul alu clause remains essentially the same). I can post the asm listings as well if needed.
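For reference, the 4x4 matmul comparison described above was done against AMD VLIW's vfetch, texsample and LDS paths, and none of that code is shown here; those paths don't map one-to-one onto anything current. As a loose sketch of the same idea - identical ALU work, only the operand-fetch path changes - here is a CUDA version contrasting plain global loads, the read-only/texture cache path (__ldg), and operands staged through shared memory. All kernel and variable names are mine, purely for illustration.

[code]
// Loose CUDA analogue of the "same ALU work, different fetch path" 4x4 matmul
// experiment described above. Evergreen's vfetch/texsample/LDS don't map 1:1
// onto CUDA, so read this as: ordinary global loads vs. the read-only/texture
// cache path (__ldg) vs. operands staged through on-chip shared memory.
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // 4x4 matrices, one matrix pair per thread

__device__ void mm4(const float* a, const float* b, float* c) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int k = 0; k < N; ++k) acc += a[i * N + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
}

// Path 1: operands fetched with ordinary global loads.
__global__ void mm4_global(const float* A, const float* B, float* C, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float a[N * N], b[N * N], c[N * N];
    for (int i = 0; i < N * N; ++i) { a[i] = A[t * N * N + i]; b[i] = B[t * N * N + i]; }
    mm4(a, b, c);
    for (int i = 0; i < N * N; ++i) C[t * N * N + i] = c[i];
}

// Path 2: operands fetched through the read-only (texture) data cache.
__global__ void mm4_readonly(const float* __restrict__ A, const float* __restrict__ B,
                             float* C, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float a[N * N], b[N * N], c[N * N];
    for (int i = 0; i < N * N; ++i) { a[i] = __ldg(&A[t * N * N + i]); b[i] = __ldg(&B[t * N * N + i]); }
    mm4(a, b, c);
    for (int i = 0; i < N * N; ++i) C[t * N * N + i] = c[i];
}

// Path 3: operands staged through shared memory (the LDS-like path).
// Each thread only touches its own slice, so no __syncthreads() is needed.
__global__ void mm4_shared(const float* A, const float* B, float* C, int n) {
    extern __shared__ float s[];  // blockDim.x * 2 * N * N floats
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float* a = &s[threadIdx.x * 2 * N * N];
    float* b = a + N * N;
    for (int i = 0; i < N * N; ++i) { a[i] = A[t * N * N + i]; b[i] = B[t * N * N + i]; }
    float c[N * N];
    mm4(a, b, c);
    for (int i = 0; i < N * N; ++i) C[t * N * N + i] = c[i];
}

int main() {
    const int n = 1 << 16;
    const size_t bytes = (size_t)n * N * N * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes); cudaMemset(B, 0, bytes);
    const int block = 128, grid = (n + block - 1) / block;
    mm4_global  <<<grid, block>>>(A, B, C, n);
    mm4_readonly<<<grid, block>>>(A, B, C, n);
    mm4_shared  <<<grid, block, block * 2 * N * N * sizeof(float)>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("launches done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
[/code]

Bracket each launch with cudaEvent timers (omitted here) and the relative numbers give the same kind of fetch-path gradation described above, albeit on a different architecture.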
The question comes back to whether the memory pipeline has been customized to bypass some of the main memory stages for the eDRAM, although even this is watered down by the very long L1 time.

That sounds reasonable.
The numbers from SiSoft Sandra are in some cases just wrong (or misrepresent the operation of the caches in normal use). They probably didn't measure what they wanted to measure because of a lack of understanding of the different architectures. I wouldn't put too much faith in it.

That's part of my concerns re such benches as well - CPU caches are strictly domestic animals compared to the GPU caches in the wild.
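Neither Sandra's kernel nor the exact code behind the RV770 numbers is public in this thread, but latency benchmarks of this sort are normally built on a dependent-load (pointer-chase) loop, where each load's address comes from the previous load, so nothing can be overlapped. A minimal sketch, in CUDA purely for brevity (the AMD-side tests would have been IL/CAL or OpenCL), with the stride and iteration counts being arbitrary choices of mine:

[code]
// Minimal dependent-load ("pointer chase") latency kernel: each load's address
// depends on the previous result, so the measured time is pure latency rather
// than bandwidth. Stride and sizes below are illustrative, not Sandra's.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const unsigned* next, unsigned start, int iters,
                      long long* cycles, unsigned* sink) {
    unsigned idx = start;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        idx = next[idx];              // serial chain of loads
    long long t1 = clock64();
    *cycles = t1 - t0;
    *sink = idx;                      // keep the chain from being optimized away
}

int main() {
    const int n = 1 << 22;            // 16 MB of indices: large enough to miss the caches
    const int stride = 1024;          // in elements; sweep this to walk the cache levels
    const int iters = 10000;

    unsigned* h = (unsigned*)malloc(n * sizeof(unsigned));
    for (int i = 0; i < n; ++i) h[i] = (unsigned)((i + stride) % n);   // simple strided chain

    unsigned *d_next, *d_sink;
    long long* d_cycles;
    cudaMalloc(&d_next, n * sizeof(unsigned));
    cudaMalloc(&d_sink, sizeof(unsigned));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, 0, iters, d_cycles, d_sink);   // one thread: no latency hiding
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    // cycles/iters is roughly the latency per dependent load (plus a little loop
    // overhead); divide by the shader clock to compare against figures in ns,
    // e.g. 180 cycles at a 750 MHz shader clock would be about 240 ns, versus
    // the ~500 ns best case quoted above.
    printf("%.1f cycles per load\n", (double)cycles / iters);

    free(h);
    return 0;
}
[/code]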
Or that the scheduler delegates these commands to a set of dedicated TEV cores?

I would be surprised if it was genuinely necessary to include fifteen-year-old integer, fixed-function hardware in a...*ahem*...current-ish...GPU. What would be so magically special about the TEV, do you think, that it would necessitate being included in wuugpu? It was clocked at what, 180MHz or somesuch. Seriously, I think today's shader processors could do everything it could with both hands tied behind their backs and then some.
A SIMD ALU does 2 FLOPS per clock; it takes a few clocks to 'emulate' that.

Wuu has many more SIMDs than the TEV has ROPs. Many, many more. So not an issue to worry about really. Not that it'd be very costly to include it, hardware-wise; it's got to be a tiny thing at today's silicon processes, merely an imperceptible blip on the chip probably, but you gotta tie it into an alien rendering pipeline somehow, interface it with hardware it was never designed to co-exist with. That's gotta be a lot more complicated and costly than just the (relatively) small amount of logic for the TEV itself. It'll never be hardware compatible with the old Hollywood chip anyway - Nintendo's already stated they're not including the whole Wii kit and caboodle in Wuu - so I wonder if there really would be any point in tearing out one piece of non-critical hardware and transplanting it into Wuu...
I assume you're suggesting that multiple threads on the SIMD unit could emulate a single Flipper pipeline, as AMD's VLIW architectures have exactly 5 (or 4) ALUs per clock per thread on the SIMD.
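For a sense of scale on the "emulate the TEV in shaders" point: going by the publicly documented GX behaviour, one TEV colour stage boils down to roughly out = (d ± lerp(a, b, c) + bias) * scale, clamped. A sketch of a single stage follows (the names and the float-per-channel representation are mine; the real unit works in low-precision fixed point):

[code]
// One TEV colour-combiner stage, roughly as the public GX docs describe it:
//   out = clamp((d +/- ((1-c)*a + c*b) + bias) * scale)
// Float per channel here for readability; the actual hardware is narrow fixed
// point. The point is that this is a handful of multiply-adds per channel.
#include <cstdio>

struct Rgb { float r, g, b; };

__host__ __device__ inline float clamp01(float x) { return x < 0.f ? 0.f : (x > 1.f ? 1.f : x); }

__host__ __device__ Rgb tev_stage(Rgb a, Rgb b, Rgb c, Rgb d,
                                  float bias, float scale, bool subtract) {
    const float s = subtract ? -1.f : 1.f;
    Rgb out;
    out.r = clamp01((d.r + s * ((1.f - c.r) * a.r + c.r * b.r) + bias) * scale);
    out.g = clamp01((d.g + s * ((1.f - c.g) * a.g + c.g * b.g) + bias) * scale);
    out.b = clamp01((d.b + s * ((1.f - c.b) * a.b + c.b * b.b) + bias) * scale);
    return out;
}

int main() {
    // Classic "modulate" setup: a = 0, b = texture colour, c = vertex colour, d = 0
    Rgb tex = {0.5f, 0.25f, 1.0f}, vtx = {1.0f, 0.5f, 0.5f}, zero = {0.f, 0.f, 0.f};
    Rgb out = tev_stage(zero, tex, vtx, zero, 0.f, 1.f, false);
    printf("%.3f %.3f %.3f\n", out.r, out.g, out.b);   // 0.500 0.125 0.500
    return 0;
}
[/code]

Even chained over the maximum of 16 such stages, that is trivial arithmetic for a modern shader core, which is the point being made above.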
I think I can answer this myself now. Where I thought that Single Instruction Multiple Data referred to a complete SIMD core, it doesn't. Each horizontal row (wavefront) can execute instructions independently of the others. So to do 40 SP with only 16 banks, the banks would either have to be clocked at twice the speed, or the SIMD must pair two rows to a single instruction (which doesn't help branching operations of course, but for graphics it would still be faster than having only half the rows). Is this the correct perception?

Also, back to the 160 SP thing, which of course is the base of my questions: I think I can understand why a SIMD core needs more than 1 SRAM bank per lane and why thread execution is interleaved per clock; there is overlap between the current thread reading registers and the previous thread writing its result. However, this only justifies 2 (single-ported) banks per lane. So why must they have 4? In my head it seems that having 4 banks results in having 2 spare r/w cycles (assuming that a minimum of 4 threads is required in that case). 4 banks is OK for interfacing with the outside world without interfering with SIMD operations of course, but it doesn't seem to be an absolute necessity.
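On the "pair two rows to a single instruction" idea: here is a tiny paper model of how a 64-wide wavefront is usually described as issuing over a 16-lane SIMD in 4 clocks, with an even and an odd wavefront alternating so that one's operand reads can overlap the other's writebacks. This is my own toy illustration of that interleaving, not a statement about the actual register-file banking:

[code]
// Toy schedule for the "16 lanes, 64-thread wavefront, even/odd interleave"
// model: each wavefront issues one 16-thread quarter per clock, and the two
// wavefronts alternate in 4-clock slots. Purely illustrative.
#include <stdio.h>

int main(void) {
    const int lanes = 16, wave = 64;
    const int quarters = wave / lanes;              // 4 clocks to issue one wavefront
    for (int clock = 0; clock < 16; ++clock) {
        int slot = (clock / quarters) % 2;          // which wavefront owns this 4-clock slot
        int quarter = clock % quarters;             // which 16-thread group issues
        printf("clock %2d: wave %s issues threads %2d-%2d "
               "(the other wave can be writing back results)\n",
               clock, slot == 0 ? "EVEN" : "ODD ",
               quarter * lanes, quarter * lanes + lanes - 1);
    }
    return 0;
}
[/code]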
The "lazy devs" mantra is still going strong. Seems to be the heir apparant to the "unfinished dev kits" catch-all.
I'm loath to link to Lens of Truth, on the grounds that they are shit, but as no-one else seems to care enough about the Wii U any more, here is Lens of Truth on Splinter Cell:
http://www.lensoftruth.com/head2hea...arison-and-analysis-ps3-vs-xbox-360-vs-wii-u/
They refuse to install the HD texture pack for the 360 because it's too confusing, or some shit like that. But it's the Wii U version that's really interesting because ... it's pretty much exactly the same as the PS360 versions (who saw that coming?).
And when I say pretty much exactly the same, what I mean is worse frame rates than the 360 and really shit loading times. No torn frames though, which is a definite bonus, and seems to be achieved through triple buffering, which the PS360 don't use (memory limits?). Too bad the extra memory isn't used for higher res textures, but maybe the load times would have been even worserer (and totally unacceptable if they had). Memory - both main RAM and eDRAM - is probably the only area where the Wii U has a significant and genuine advantage, so it's a shame to see it seemingly underutilised again.
So anyway, once again the Wii U is coming in at more or less PS360 levels: a little better in some ways, and a little worse in others. Surely there's nowhere left for the reality detached Nintendo fanboys to go now? I mean, surely?
The "lazy devs" mantra is still going strong. Seems to be the heir apparant to the "unfinished dev kits" catch-all.