> The pixel shader doesn't really care what triangle a quad is from, you can pool them and run them in a single thread group.

That used to be true in the land of fixed function interpolators. Not so much anymore.
Nvidia does have a head start, but the momentum of the industry is behind OpenCL.
> That used to be true in the land of fixed function interpolators. Not so much anymore.

I don't see why not ... for small triangles iterative interpolation does not provide a gain over purely parallel interpolation and if you do it purely parallel it suits SIMD just fine.
Sorry, Charlie, but that roadmap, at least the second table, is fake at best.
Nvidia's top card is the GTX 285, not the 280, and in any case both versions can come with 1 or 2 GB of VRAM. And that's only one of the inconsistencies.
Why would they want to do this? More efficiency would be traded for how compact and numerous they can make those 5-way scalar units.
I remember it was Mike Houston from AMD who said that the VLIW efficiency is at least 80% in Folding@Home code, and even better in complex game shaders.
I'm suspicious of this claim. If it were true, wouldn't a 4850, with its 1 TFLOPS theoretical peak, be running rings around a GTS 250, with its paltry 0.5 TFLOPS theoretical peak? Instead, they're pretty similar in performance. Comparing actual game benchmarks, 1 nVidia FLOP seems to be worth about 2 ATI FLOPs, suggesting that ATI's utilization efficiency in game shader code is closer to 50% than 80+ percent. I think Fermi will perform quite well, relative to Cypress, in games.
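To spell out the arithmetic behind that comparison, here is a back-of-the-envelope sketch using the round numbers above; the 90% utilization assumed for the scalar NVIDIA design is purely an illustrative guess, not a measurement.

```python
# Rough sanity check of the utilization argument, using the round numbers
# quoted above (1 TFLOPS vs 0.5 TFLOPS). The 0.9 utilization assumed for the
# scalar NVIDIA design is an illustrative assumption, not measured data.
peak_hd4850 = 1.0    # theoretical peak, TFLOPS
peak_gts250 = 0.5    # theoretical peak, TFLOPS (MAD-only counting)

# Similar game performance implies roughly equal *effective* throughput:
#     peak_hd4850 * util_ati  ~=  peak_gts250 * util_nvidia
util_nvidia = 0.9
util_ati = peak_gts250 * util_nvidia / peak_hd4850
print(f"implied ATI shader utilization: {util_ati:.0%}")  # ~45%, closer to 50% than 80%
```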
> So what happens when you try and run some code that uses DP on such a design?

My assumption was that this would be for consumer-level gaming chips, which led me to expect a few scenarios:

1) driver compiler code-replacement antics
> Register files care as well, since they have to allocate and access 64b regs differently.

The physical register files themselves, I don't know.
> So do you think they will do DP in a library as part of the driver?

That would be awfully nice of Nvidia.
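To make "DP in a library" concrete: one way it could work on FP32-only hardware is float-float ("double-single") arithmetic, where each double is carried as an unevaluated sum of two singles (roughly 48 bits of significand versus FP64's 53). The sketch below is my own NumPy illustration, not anything NVIDIA has described; the function names are made up.

```python
import numpy as np

f32 = np.float32

def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_add(a_hi, a_lo, b_hi, b_lo):
    """Add two double-single numbers, each an unevaluated sum hi + lo of float32s."""
    s, e = two_sum(a_hi, b_hi)
    e = f32(e + f32(a_lo + b_lo))
    hi = f32(s + e)            # renormalize so |lo| is small relative to hi
    lo = f32(e - f32(hi - s))
    return hi, lo

# quick demo: 1 + 2^-30 rounds away in plain float32, but the pair keeps it
hi, lo = ds_add(f32(1.0), f32(0.0), f32(2.0**-30), f32(0.0))
print(hi, lo)   # 1.0  ~9.31e-10
```

A real library would also need multiply, divide and comparisons built the same way, which is why emulated DP costs many FP32 operations per result.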
> Let's say the vertex shader precomputes 1/z and X/z per vertex (with X being the parameters) and the rasterizer supplies vertex blending factors per pixel ... why would a pixel shader + parameter interpolator care what triangle a pixel belongs to?

A lot of IFs and way more data injected into the pixel shader. No rocket science required, but not as straightforward as it used to be.
I do not think it makes much of a difference... OpenCL and CUDA are not two completely different beasts, and all the R&D nVIDIA has poured into CUDA will pay off when they move to OpenCL and DirectCompute, IMHO.
It is not as nice for them as controlling the industry with their own proprietary standard, but I think they will be able to leverage their CUDA expertise (drivers and tools) when developing their OpenCL support (and they already have).
Fermi might have CUDA cores, but you could just as easily call them OpenCL cores.
> What's your idea Mintmaster? Running 4 vertices/pixel per 5-vector?

No, just having an SIMD where 64 ALUs run the same instruction on the whole batch as opposed to 16 vec4 ALUs (I'll ignore the transcendental for a moment).
Although working on 256-pixel batches doesn't sound terribly efficient when it comes down to dynamic branching. Not to mention increased register pressure and less-than-optimal instruction bandwidth/instruction cache usage.
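As a toy illustration of the branching point (my own model, nothing vendor-specific): if a branch condition flips somewhere along a row of pixels, any batch straddling that boundary has to execute both sides, and the fraction of batches that straddle it grows with the batch width.

```python
import random

def straddling_fraction(batch_size, row_len=4096, trials=2000):
    """Fraction of batches containing pixels on both sides of a branch
    boundary (and which would therefore execute both sides of the branch)."""
    diverged = total = 0
    for _ in range(trials):
        cutoff = random.randrange(1, row_len)   # where the branch condition flips
        for start in range(0, row_len, batch_size):
            total += 1
            if start < cutoff < start + batch_size:
                diverged += 1
    return diverged / total

for n in (16, 64, 256):
    print(f"batch {n:>3}: {straddling_fraction(n):.2%} of batches pay for both paths")
```

The absolute numbers depend entirely on the toy setup; the point is only that the share of work paying the both-paths penalty scales with the batch width.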
> Why would they want to do this? More efficiency would be traded for how compact and numerous they can make those 5-way scalar units.

Not in my opinion. See my post above to nAo. Cost is almost negligible.
> That used to be true in the land of fixed function interpolators. Not so much anymore.

I think it's still true. The only difference is that the vertex parameters are stored nearby now (in the local memory?), and instead of each quad storing interpolated values for each pixel, it will store indices plus vertex weights.
> I don't see why not ... for small triangles iterative interpolation does not provide a gain over purely parallel interpolation and if you do it purely parallel it suits SIMD just fine.

Agreed.
> Let's say the vertex shader precomputes 1/z and X/z per vertex (with X being the parameters) and the rasterizer supplies vertex blending factors per pixel ... why would a pixel shader + parameter interpolator care what triangle a pixel belongs to?

FYI, this method needs at least two sets of blending factors per pixel for DX11, as you can now choose centroid or regular interpolation positions on the fly, and maybe more.
> Just to throw this out there, an alternative is to store perspective-correct vertex blending factors, so that the vertex shader doesn't need to multiply each parameter by 1/z, and the pixel shader doesn't need to multiply each interpolated X/z by z.

The vertex shader can't multiply by 1/z anyway in case there's clipping involved.
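A small sketch of the scheme under discussion, in my own notation (using w for the perspective term rather than z, and plain Python standing in for shader code): interpolation from per-vertex 1/w and X/w plus per-pixel blending factors, with the perspective-correct-weights variant from the quote above shown alongside.

```python
def interp_divide_per_pixel(attrs, ws, bary):
    """Interpolate one attribute from per-vertex X/w and 1/w, dividing per pixel.

    attrs : per-vertex attribute values (a0, a1, a2)
    ws    : per-vertex clip-space w (w0, w1, w2)
    bary  : screen-space blending factors for this pixel (b0, b1, b2), summing to 1
    """
    inv_w = [1.0 / w for w in ws]                          # precomputed per vertex
    a_over_w = [a * iw for a, iw in zip(attrs, inv_w)]     # precomputed per vertex
    num = sum(b * aw for b, aw in zip(bary, a_over_w))
    den = sum(b * iw for b, iw in zip(bary, inv_w))        # interpolated 1/w
    return num / den                                       # one divide per attribute

def perspective_correct_weights(ws, bary):
    """Alternative: fold the divide into the blending factors once per pixel,
    so each attribute is then just a weighted sum of the raw vertex values."""
    inv_w = [1.0 / w for w in ws]
    den = sum(b * iw for b, iw in zip(bary, inv_w))
    return [b * iw / den for b, iw in zip(bary, inv_w)]

# The two variants agree:
attrs, ws, bary = (0.0, 1.0, 0.5), (1.0, 4.0, 2.0), (0.25, 0.5, 0.25)
cw = perspective_correct_weights(ws, bary)
assert abs(interp_divide_per_pixel(attrs, ws, bary)
           - sum(c * a for c, a in zip(cw, attrs))) < 1e-12
```

Either way, the per-pixel cost is a few multiply-adds plus a reciprocal, which is the extra shader work being discussed above.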
> Breaking up the groups of 4 would bring up register file concerns.

That's putting it mildly.