I messed it up there. Yes, hot clock, but it's the scheduler clock that I mentioned in my comment (since we're talking about feeding the ROPs). Or are you saying it's only 1 warp per scheduler clock?
This is a special case, though, as you have monstrous register use. Since you don't have enough threads to hide latency for a fetch immediately followed by an ALU clause that uses the data, the ordering of the fetches is very important. Version 5 is faster than version 3 despite the fact that 5 has fewer threads in flight per SIMD than 3. The estimated threads for version 3 is 256/28 = 9, while for version 5 it is 256/38 = 6. (Both estimates are subject to clause-temporary overhead. Also, I suspect that 256 is not the correct baseline; something like 240 might be better, not sure...)
Number of threads has to do with the number of registers per hw thread (i.e. registers per fragment * fragments per hw thread). It doesn't matter what the ALU:TEX ratio of the SIMD engine is. If the hw thread size grows to 128 fragments, then I agree (only for programs with extremely high register use, though), but I doubt ATI is going to do that, because the branching granularity gap with Fermi really starts to get wide. Going back in time, your argument is that if AMD doubles ALU:TEX, e.g. to 8:1 in the next GPU while leaving the overall ALU/TEX architecture alone, each ALU would only need half the register file, so the 256KB of aggregate register file per SIMD we see in Evergreen would be enough for the next GPU. Well, clearly this is fallacious, as version 5 above would be reduced to a mere 3 hardware threads, killing throughput (3 hardware threads means that both ALU and TEX clauses cannot be 100% occupied by hardware threads, since both require pairs of hardware threads for full utilisation).
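The back-of-envelope arithmetic from the last two posts can be sketched as follows. The register budget of 256 per fragment and the floor-division model are assumptions taken from the numbers quoted above (and, as noted, ignore clause-temporary overhead), not from any spec:

```python
# Estimated hardware threads in flight for a given per-fragment register
# use. AVAILABLE_REGS = 256 is the baseline assumed in the thread above;
# clause-temporary overhead is ignored.

AVAILABLE_REGS = 256

def hw_threads(regs_per_fragment, available=AVAILABLE_REGS):
    """Hardware threads a SIMD can keep resident, floor-divided."""
    return available // regs_per_fragment

print(hw_threads(28))        # version 3: 256 // 28 = 9
print(hw_threads(38))        # version 5: 256 // 38 = 6
# If doubling ALU:TEX left each fragment only half the register budget:
print(hw_threads(38, 128))   # 128 // 38 = 3 -- too few to keep both
                             # ALU and TEX clauses fully occupied
```

The last line is the crux of the fallacy argument: 3 hardware threads is below the pair needed per clause type for full utilisation.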
I still say that when you multiply pixel count by 10 and geometry count by a few hundred, it's going to skew the distribution towards a greater percentage of small triangles, not a lower one. It's not like artists only used more polys on world geometry and kept the same low-poly enemies. In reference to Quake 2: most of the screen space is covered by the world geometry, which is large triangles, and not all that many of them. The small triangles will primarily be in enemy models. The enemies themselves didn't have all that many triangles, but they weren't particularly large on screen either. Overall there weren't many triangles being rendered, and the game really isn't that useful for discussing things more than 10 years later.
Not when we're talking about geometry bottlenecks, because triangles come in clumps of similar size (often zero-sized), so you can't buffer out this inefficiency. You can't process those millions of small triangles while having the rasterizer and shading engines work on the large triangles, because the workload just doesn't arrive in such a neatly interleaved fashion. Talking about what percentage of the visible triangles are less than 25 pixels isn't as useful as talking about what percentage of the screen is covered by triangles less than 25 pixels.
Interesting. I always thought format conversion happens in the ROPs, but that suggests there's some logic for it somewhere at the end of the pixel shader pipe. The bottleneck, as it seems right now, is the connection between shader engines and ROPs, which is tailored to accommodate 32 pixels of 32 bits at a time. Formats like RGB9E5 or RGB10A2 and the like take up as many slots as fully blown FP16 pixels, thus being half rate here also. I can only guess at what the connection itself will be, but it seems like it can (for each pixel of theoretical throughput) operate on four lanes of 8 bits at a time. If a channel exceeds 8 bits, like in RGB9E5, it takes double time, either serialized or with paired lanes (and then 2-by-2 serialized). More than 16 bits and we go to four cycles / groups of four.
One-channel FP32 pixels seem to occupy only a single time slot across all four lanes, so maybe this is the base unit and it can really be split two- and four-way.
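The guessed lane model in the last two posts can be written down as a toy calculation. The format bit widths are real; the four-lanes-of-8-bits bus and the lane-pairing behaviour are the speculation from above, not documented hardware:

```python
import math

# Toy model of the assumed shader-to-ROP connection: per pixel of
# theoretical throughput, four 8-bit lanes per cycle. Each channel
# consumes ceil(bits/8) lanes; total lanes divided by 4 gives cycles.
# This reproduces the half-rate / quarter-rate pattern guessed above.

def cycles_per_pixel(channel_bits):
    """channel_bits: list of per-channel widths of a pixel format."""
    lanes = sum(math.ceil(b / 8) for b in channel_bits)
    return math.ceil(lanes / 4)   # four 8-bit lanes available per cycle

print(cycles_per_pixel([8, 8, 8, 8]))      # RGBA8: 1 cycle, full rate
print(cycles_per_pixel([9, 9, 9, 5]))      # RGB9E5: 2 cycles, half rate
print(cycles_per_pixel([10, 10, 10, 2]))   # RGB10A2: 2 cycles, half rate
print(cycles_per_pixel([16, 16, 16, 16]))  # 4-ch FP16: 2 cycles
print(cycles_per_pixel([32]))              # 1-ch FP32: 1 cycle, one slot
print(cycles_per_pixel([32, 32, 32, 32]))  # 4-ch FP32: 4 cycles
```

Notably, the single-channel FP32 case falls out at one cycle, matching the "base unit that can be split two- and four-way" reading.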
It wouldn't explain the slow 4-channel FP32 blend result, but maybe this is really something like 1/4 full speed per channel for the blender, so 1/16 of nominal (34 GPix/s) ROP rate. At least that would fit all the FP32 blend data for both GTX285 and GTX480...
A bit strange that you'd have 48 ROPs when for most things they aren't any better than 32, though maybe they are very cheap anyway.
Maybe it's time to spend those transistors getting color data back to the shader clusters instead, and doing blending and writing back through some generic memory controller (which would still handle color compression); that ROP design sounds kinda lame. Well, for color at least.
Sure, yes. Still a lame design. The way Carsten described the issue, it sounded like a bandwidth problem.
Well, if fp32 blend is really quarter-speed per channel, that would at least be good for a bit faster fp32 blend rate... Of course, that would just be 48 incredibly slow fp32 blend units vs. 32 incredibly slow fp32 blend units, but at least that's something... Well, it's obvious why they aren't better than 32 for most things, but I'm trying to understand if they're better than 32 for anything.
That theory makes sense, but at a high level the measured performance seems to track shader throughput more closely than anything else. This is based on Damien's numbers:
Right. So it's still a mystery...
BTW, we're always talking about a 32 pixel (per clock) rasterization limit. But that apparently doesn't affect Z fill rate (neither for Cypress nor GF100), so is this only true for pixels actually going to the shader core? I'm wondering what this actually measures...
I don't really understand what's going on with z-fillrate in general. Can someone confirm how we get high z-fillrates even when AA is not enabled? Are rasterizers actually capable of producing more depth samples than color samples per clock?
It's been that way since the NV3x, at least if you switch off color fill.
From my understanding of what I've been told, multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics. Not sure about DX compute shaders though.
Currently, heavy use of CUDA kernels will still bring the Windows GUI (Win7, Aero G.) to a crawl.
That was what I meant by "multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics". No, in CUDA it's actually arbitrary kernels: any SM could be running up to 4 different kernels all at the same time. What it CAN'T do is suspend one of those kernels, swap out its state and swap in a new one for a different kernel, then resume by reswapping.
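The constraint described above can be illustrated with a toy model: kernels can be concurrently resident, but a launch that finds no free slot must wait for a kernel to finish rather than preempt one. The 4-slot limit comes from the post above; the class, its names, and its behaviour are illustrative assumptions, not measured hardware:

```python
# Toy model of the Fermi-era concurrency constraint: several kernels can
# be resident on an SM at once, but a resident kernel is never suspended
# and swapped out -- a slot only frees up when a kernel completes.

class ToySM:
    MAX_RESIDENT = 4  # concurrent-kernel limit per SM, per the post above

    def __init__(self):
        self.resident = []

    def launch(self, kernel_name):
        """Try to make a kernel resident; refuse rather than preempt."""
        if len(self.resident) >= self.MAX_RESIDENT:
            return False          # no preemption: the launch must wait
        self.resident.append(kernel_name)
        return True

    def retire(self, kernel_name):
        self.resident.remove(kernel_name)  # only completion frees a slot

sm = ToySM()
for k in ["cloth", "fluid", "rigid", "particles"]:
    assert sm.launch(k)
print(sm.launch("extra"))   # False: slots full, cannot swap one out
sm.retire("fluid")
print(sm.launch("extra"))   # True: a slot was freed by completion
```

The missing piece discussed next in the thread would be a `suspend`/`resume` pair that spills a resident kernel's state, which this model (like the hardware, apparently) does not offer.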
Right, it doesn't seem to be available in hardware, otherwise Nvidia would have boasted about it too. AFAIK they only went on about having reduced context switching for the whole chip to 20 microseconds, which is supposed to be a couple of times faster than with previous GeForce cards. This switching ability isn't in Fermi now (at least nobody has even hinted at it), but my question is mostly about how hard it'd be to add, since the hardware already can do most of the substeps of context switching.