I still don't get your angle on this. Let's say Nvidia's stuff benefits from ILP.
There's no "if" about it. Whether the degree matters is a separate question.
The point is that Nvidia's hardware is a lot less dependent on ILP than the competition's.
Yes, I said as much in the other thread.
People recognize that it takes work to get the best out of Nvidia hardware too. I know early on you were especially annoyed that people underestimated that effort. Is that angst still at the root of all this negativity?
No, the "it must be alright, it's NVidia" default position that you and others have is tedious. And it's so entrenched that ...
Nothing that fancy. Just examples of actual applications.
You mean like the video conferencing system? I linked that. Or Cyberlink? I forget, is it their video encoder that includes AMD acceleration? I don't know if the medical visualisation stuff is a commercial application or just research.
Improvement compared to what? Where are all the fantastic applications taking advantage of AMD's DP advantage? You're arguing from a pretty weak position here.
Well, DGEMM is one - it's actually useful. But I don't know who's using it. It's boring, you know?
Even more tedious is your inability to recognize that you're constantly criticizing Nvidia while your team isn't even in the game.
Yes, you're still on that "zero GPGPU penetration" wicket.
What exactly is transpiring in the field? All you're doing right now is making excuses for AMD and trying your best to minimize the value of everything Nvidia and its partners have actually produced. Your position is so untenable it's crazy at this point.
I think you might want to scan through the names of the thread starters:
http://forum.beyond3d.com/forumdisplay.php?f=42
and the subjects of those threads.
Yep, the one with the mature CPU physics library? What do they have to do with AMD and/or GPGPU? Oh, is the red-dress demo now equivalent to all of PhysX?
Eh? You think they knocked something up the day before GDC? And that's all we'll ever see of it?
Well, isn't it interesting that you can recognize the extra work needed to get the same benefit as shared memory, while at the same time dismissing the attractiveness of the shared-memory approach in the first place?
OK, when did I dismiss shared memory?
For all we know, LDS in ATI is no good going forward and they'll have to do something different. We discussed its utility back here:
http://forum.beyond3d.com/showthread.php?t=53089
Can't tell what the performance is like, except we now know that SGEMM is slowed down on ATI by the use of LDS.
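For reference, the shared-memory/LDS approach we keep arguing about is basically the classic tiled matrix multiply: stage tiles of A and B in on-chip memory so every element gets reused TILE times instead of being re-fetched from DRAM. A minimal CUDA-flavoured sketch of the idea (kernel name and tile size are mine, block assumed to be TILE x TILE, edge handling ignored):

[code]
#define TILE 16   // illustrative tile size; launch with a TILE x TILE block

// Shared-memory-tiled SGEMM sketch: C = A * B for square N x N matrices,
// with N assumed to be a multiple of TILE.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged on-chip (shared memory / LDS)
    __shared__ float Bs[TILE][TILE];   // tile of B staged on-chip

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // each thread loads one element of each tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...and the whole block then reuses the staged data TILE times
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
[/code]

Which is why it's notable that, right now, routing SGEMM through LDS apparently slows ATI down rather than speeding it up.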
It's not that you don't make good points when it comes to limitations inherent to Nvidia's approach. You obviously spend a lot of time thinking about this stuff. But it just doesn't make sense to push the "AMD is just as good or better" mantra when they have nothing to show for it.
No, I'm just pushing against the "AMD's hardware design is incapable of being competitive" mantra.
For example, with in-pipe registers there's essentially no register read-after-write latency (there are corner cases, but they're seriously obscure). As far as I can tell this feature goes back to R300. It means there's less total latency for the scheduler to hide over the lifetime of a shader, compared with an architecture that incurs 24 cycles of latency on every register write.
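To make that concrete, here's a toy CUDA-style sketch (kernel names and the iteration count are made up, purely illustrative). The first kernel is one long dependent chain, so an architecture with ~24 cycles of register read-after-write latency has to find that many cycles of other work between every pair of instructions, while a forwarding/in-pipe design barely notices; the second kernel supplies the ILP in the instruction stream itself.

[code]
#define N_ITERS 4096   // arbitrary, just to make the chains long

// Serial dependence: every FMA reads the result of the previous one, so all
// of the register read-after-write latency must be hidden by other warps.
__global__ void dependent_chain(float *out, float a, float b)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < N_ITERS; ++i)
        x = a * x + b;                       // each iteration waits on x
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Four independent chains interleaved: the instruction stream itself now
// provides ILP, so far less latency is left for thread switching to cover.
__global__ void independent_chains(float *out, float a, float b)
{
    float x0 = (float)threadIdx.x;
    float x1 = x0 + 1.0f, x2 = x0 + 2.0f, x3 = x0 + 3.0f;
    for (int i = 0; i < N_ITERS; ++i) {
        x0 = a * x0 + b;
        x1 = a * x1 + b;
        x2 = a * x2 + b;
        x3 = a * x3 + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}
[/code]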
Do you think that makes a difference to the compiler? Do I need to give you a clue?
Do you think there's any relationship between the maturity of AMD's software stack and the hardware design? In any case, the proof of the pudding is in the eating, as they say, so no amount of theorizing will ever stand up in the face of actual results.
Well, as I pointed out the other day, there are still gotchas in hardware assembly. e.g. if I vectorise the Mandelbrot code to generate multiple points per thread instead of just one, the fucking hardware compiler insists on using only .x and runs out of registers because it isn't touching .y, .z and .w. It doesn't tell me it's run out of registers - it just says "failed". 4 results per thread consumes 32 registers (that's 128 scalars)! All because I use a struct of floats.
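For clarity, this is the shape of the thing (sketched in CUDA C since I can't usefully paste IL here; the names are mine): one thread carries four points' worth of Mandelbrot state in a struct of floats, which ought to map straight onto the .x/.y/.z/.w of a vec4 register rather than eating a whole register per value.

[code]
struct Quad { float x, y, z, w; };   // four points' worth of state per thread

// One escape-time step for a single lane; returns 1.0f while still bounded.
__device__ inline float step_count(float &zr, float &zi, float cr, float ci)
{
    float t = zr * zr - zi * zi + cr;
    zi = 2.0f * zr * zi + ci;
    zr = t;
    return (zr * zr + zi * zi < 4.0f) ? 1.0f : 0.0f;
}

// Each thread computes 4 adjacent pixels; width assumed to be a multiple of
// 4 * blockDim.x, launched as a 2D grid over the image.
__global__ void mandelbrot4(float *out, float x0, float y0, float dx, float dy,
                            int width, int max_iter)
{
    int px = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    float ci = y0 + py * dy;

    Quad cr = { x0 + (px + 0) * dx, x0 + (px + 1) * dx,
                x0 + (px + 2) * dx, x0 + (px + 3) * dx };
    Quad zr = { 0, 0, 0, 0 }, zi = { 0, 0, 0, 0 }, iters = { 0, 0, 0, 0 };

    for (int i = 0; i < max_iter; ++i) {
        // four independent lanes - exactly the sort of thing that should pack
        // into one vec4 register per Quad instead of four registers' .x lanes
        iters.x += step_count(zr.x, zi.x, cr.x, ci);
        iters.y += step_count(zr.y, zi.y, cr.y, ci);
        iters.z += step_count(zr.z, zi.z, cr.z, ci);
        iters.w += step_count(zr.w, zi.w, cr.w, ci);
    }

    out[py * width + px + 0] = iters.x;
    out[py * width + px + 1] = iters.y;
    out[py * width + px + 2] = iters.z;
    out[py * width + px + 3] = iters.w;
}
[/code]

Instead, the compiler scalarises the struct into the .x of separate registers and the register count explodes.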
There are gotchas on the NVidia side too, such as people complaining that they can't control the optimiser or that register allocation is a black art. Volkov's work has been very influential in shifting perspectives as far as I can tell, and I know NVidia's been putting a lot of effort into making the development experience much more like developing for a CPU, e.g. with support for profiling.
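The gist of Volkov's argument, for anyone who hasn't read it: give each thread several outputs and more registers, and the independent instructions inside one thread hide latency, so you can happily run at lower occupancy. A hedged sketch of the shape (kernel name and blocking factor are mine, not his code):

[code]
#define OUTPUTS_PER_THREAD 4   // illustrative register-blocking factor

// Each thread owns OUTPUTS_PER_THREAD consecutive elements of y = a*x + y.
// The unrolled loops produce independent loads and FMAs within one thread,
// so fewer warps are needed to keep the machine busy.
__global__ void saxpy_blocked(const float *x, float *y, float a, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * OUTPUTS_PER_THREAD;
    if (base + OUTPUTS_PER_THREAD > n) return;   // tail handling kept simple

    float r[OUTPUTS_PER_THREAD];                 // lives entirely in registers
#pragma unroll
    for (int i = 0; i < OUTPUTS_PER_THREAD; ++i)
        r[i] = y[base + i];                      // independent loads in flight together

#pragma unroll
    for (int i = 0; i < OUTPUTS_PER_THREAD; ++i)
        r[i] = a * x[base + i] + r[i];           // independent FMAs: ILP, not occupancy

#pragma unroll
    for (int i = 0; i < OUTPUTS_PER_THREAD; ++i)
        y[base + i] = r[i];
}
[/code]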
NVidia's very much in the "polish" stage of its toolset whereas AMD is still in the "let's make it work" stage.
Jawed