Apparently, it can change the number of samples used for a single screen-pixel, thus saving work deemed unnecessary by the driver team.
What does Catalyst AI (not) do?
That's probably the point, isn't it? Make people think your 40 TMUs are doing as good a job as the competitor's 80 units. Somewhere you have to cut corners when you want to compete with your 280mm² chip against a die double the size.
Well, more power to them if the final result is the same.
Hopefully level.
I didn't say you did, just that it's very hard to get a good idea of how GPGPU performance compares when you can only compare CUDA to Brook+, with both using a different programming model, different compilers, etc. OpenCL will even the playing field in many respects, just as OpenGL/D3D did for regular graphics tasks.
And so is GT200, so what?
Well, first the obvious:
RV770 is functionally equivalent to G80, a chip that is 2 years older!
"On ATI you get better performance (in absolute terms), and if you use LDS then you get worse performance." I should have added on to the end something like "...get worse performance than without" to clarify that that wasn't a comparison of absolute performance, merely that ATI performance is lower with LDS.You said ATi was slower than nVidia when shared memory is involved.
I agree. AMD held steadfastly to the traditional "stream" view of stream computing, whereas NVidia decided to embrace shared memory. I've said it before: it was a mistake. Apart from that, of course, AMD didn't actually have a marketing strategy for throughput computing - still stuck in thinking "if we build it, someone will use it".
Now I am willing to go as far as stating that shared memory will be a very important tool in many GPGPU algorithms (unlike graphics).
Yep, absolutely zero
Yea, it's a shame ATi doesn't have any actual software out there.
You won't get any argument from me.
And it proves my point that there are GPGPU algorithms that can benefit significantly from fast shared memory.
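For what it's worth, here's a minimal sketch of the kind of kernel being argued about - a block-wise sum reduction staged through shared memory so every input only touches DRAM once. This is my own illustration (names and sizes are arbitrary), not code from the client being discussed:

[code]
// Minimal sketch (my own example): a block-wise sum reduction that stages data
// in shared memory, so each input is read from global memory exactly once and
// all the intermediate arithmetic hits the fast on-chip store instead.
__global__ void blockReduceSum(const float *in, float *out, int n)
{
    extern __shared__ float s[];            // blockDim.x floats, sized at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    s[tid] = (i < n) ? in[i] : 0.0f;        // one global read per thread
    __syncthreads();

    // Tree reduction entirely in shared memory.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];             // one partial sum per block
}

// Launch example: 256 threads per block, 256 * sizeof(float) bytes of shared memory.
// blockReduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
[/code]

Without shared memory, those intermediate sums would have to round-trip through global memory or be restructured as multiple passes, which is exactly the gap being debated here.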
So, what's the basis for you saying this?
Question remains: if the ATi client is rewritten to use shared memory, will it once again become competitive?
I wouldn't be surprised if they can't quite close the gap.
Hmm, I'm not the one counting chickens before they've hatched.
Oh please, as if your bias wasn't showing enough already. This remark was completely uncalled for, and makes you look like a silly frustrated fanboy.
And Vantage shows the opposite.
While I won't argue with you about the sticker-speed of theoretical DP throughput (though I'd love to see some real-world numbers on real applications), TEX especially is a different beast, since Catalyst AI saves the TUs quite some (mega)bits of fetching and filtering. Without the filtering stuff, some texturing benchmarks show the expected numbers with respect to frequency and unit count on both architectures.
And so is GT200, so what?
Hmm, I'm not the one counting chickens before they've hatched.
I think OpenCL on Larrabee will be the real eye-opener. But of course its competitiveness is utterly unknown.
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
So that means when you THINK you're benchmarking 32-bit uncompressed texture operations, you're actually doing less.
So with their two-year lead, NVidia is still making GPUs that are uncompetitively large:
2-year gap, as I said.
So, again, why are they bigger? RV770 is functionally equivalent to any of NVidia's CUDA capabilities as far as I can tell. Perhaps you'd like to indicate why they're bigger?
Because that's what they focused on in the past few years, and that's largely the reason why their chips are so much bigger than ATi's.
If you're going to speculate:
Neither am I.
I'm just saying that I *think* nVidia has put more thought into their GPGPU design than ATi in the past few years, which I *think* might give them an advantage in OpenCL. This is just a speculation thread about GT300, remember? That's what I'm doing, speculating.
I haven't said that I'm sure of it, let alone that I would actually put money on it.
So your remark was rather lame and out of line, in my opinion.
... then answer my earlier question: why wouldn't you be surprised? What technical insight do you have?
No, it does mean something:
ATi didn't have shared memory until recently. nVidia had it for years. nVidia is now reaping the benefits.
And it proves my point that there are GPGPU algorithms that can benefit significantly from fast shared memory.
Question remains: if the ATi client is rewritten to use shared memory, will it once again become competitive?
I wouldn't be surprised if they can't quite close the gap.
Well, Intel itself isn't aiming for the stars. They say they aim at midrange performance at introduction. Given the info we have on Larrabee so far, I think that is a reasonable goal.
So I don't expect Larrabee to outperform nVidia and ATi solutions, not even in GPGPU tasks, because I don't quite see any advantages for Larrabee in OpenCL. OpenCL seems to suit nVidia's G80+ architecture just fine.
I don't see how that means anything like this.
If anything, the whole "simultaneous" GT200/RV770 launch thing screams that someone was waiting to see what the other one would put on the market. And from what I know (and logic should tell you), it wasn't AMD.
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
Jawed
SB, that's been a disingenuous comparison from day one. Regardless of whether or not Nvidia was caught by surprise by the 48xx cards, those cards started off in price territory where Nvidia never meant to go with GT200. So you're incorrectly bundling in sales of the 4850, for example, for which the rightful comparison should be G92.
Because currently, it's more marketing than useful?
Not sure why you hate on this approach so much.
AMD does dynamic clause scheduling - it's the instructions that are fixed in VLIW.
It's obviously more flexible and has fewer corner cases than AMD's pre-determined clause scheduling, which maps much better to traditional graphics workloads. Fine-grained scheduling requires more hardware, yes, but it's probably the way of the future.
How does VLIW affect viability?
And as I said above, where are all the compute apps that demonstrate the viability of AMD's approach?
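To make the VLIW point concrete, here's an illustration of my own (not taken from anyone's shader): a 5-wide VLIW unit like RV770's needs independent operations within a single thread to fill its slots, while a scalar design like G80's just needs enough threads in flight.

[code]
// Illustration only (my own example): how instruction-level parallelism inside
// one thread matters to a VLIW design but not to a scalar one.
__global__ void ilpExample(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];

    // Dependent chain: each multiply needs the previous result, so a 5-wide
    // VLIW bundle can only fill one slot here, while a scalar unit simply
    // switches to another thread while it waits.
    float a = x * x;
    float b = a * x;
    float c = b * x;

    // Independent operations: these can be packed into the same VLIW bundle,
    // which is the case the compiler's static bundling relies on finding.
    float p = x + 1.0f;
    float q = x * 2.0f;
    float r = x - 3.0f;

    out[i] = c + p + q + r;
}
[/code]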
And so we should have noticed if this was being done.
Anything which does post-processing likely does - not that the driver could do a format change on you to make that faster. Also if you had a texture as a lookup table into a texture atlas (I think the older DICE presentation had an example of this in a shipped title). There are other examples. Adjusting texture format from uncompressed to compressed can in some cases be really, really bad.
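Some back-of-the-envelope arithmetic on why such a silent format swap is tempting in the first place (my numbers, not a claim about what Catalyst AI actually does): RGBA8 is 32 bits per texel while DXT1 is 4 bits per texel, so recompressing cuts fetch traffic by 8x - at the cost of lossy block compression, which is exactly why it can be disastrous for lookup tables.

[code]
#include <cstdio>

// Back-of-the-envelope fetch-traffic comparison for a 2048x2048 texture
// (illustrative numbers only, not measurements).
int main()
{
    const long long texels  = 2048LL * 2048LL;
    const double rgba8Bytes = texels * 4.0;   // 32 bits per texel, uncompressed
    const double dxt1Bytes  = texels * 0.5;   // 4 bits per texel, block-compressed

    std::printf("RGBA8: %.1f MiB, DXT1: %.1f MiB, ratio: %.0fx\n",
                rgba8Bytes / (1024.0 * 1024.0),
                dxt1Bytes  / (1024.0 * 1024.0),
                rgba8Bytes / dxt1Bytes);      // prints 16.0 MiB, 2.0 MiB, 8x
    return 0;
}
[/code]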
[*] Larrabee programmers will have the choice to use almost any amount of local memory (shared memory in CUDA) per work-item, in comparison with the relatively small and fixed amounts in any GPU - unless the GPUs are radically re-designed.
What's interesting is that only 1 operand per instruction can come from shared memory in GT200. The same rule applies in Larrabee, if we label L1 cache as "shared memory".
There is also a big difference here, as shared memory is just that, and not a cache.
I never understood why it has been called "parallel data cache".
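As a hedged illustration of what that one-operand rule means in practice (my own sketch - the exact instructions emitted are up to the compiler and toolchain): when both sources of an arithmetic operation live in shared memory, one of them has to be staged through a register first.

[code]
// Sketch of the one-shared-operand point above (my own example; actual code
// generation is the compiler's business).
__global__ void sharedOperands(float *out)
{
    __shared__ float s_a[256];
    __shared__ float s_b[256];
    int tid = threadIdx.x;

    s_a[tid] = (float)tid;          // fill shared memory with arbitrary values
    s_b[tid] = (float)tid * 2.0f;
    __syncthreads();

    // On GT200 an ALU instruction can source at most one operand directly from
    // shared memory, so an expression like this needs one of the two values
    // moved into a register before the add can issue.
    float r = s_a[tid] + s_b[tid];

    out[tid] = r;
}
[/code]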
What benchmark, or game being benchmarked, has 32-bit uncompressed texture operations?
... then answer my earlier question: why wouldn't you be surprised? What technical insight do you have?
[*] Larrabee will be competitive purely in terms of FLOPs - and particularly in double-precision - because Larrabee is FLOPs swimming on a sea of cache, with some texturing decorating the edges.
[*] Larrabee programmers will have the choice to use almost any amount of local memory (shared memory in CUDA) per work-item, in comparison with the relatively small and fixed amounts in any GPU - unless the GPUs are radically re-designed. Likely to be 32KB per SIMD in the GPUs (which it seems is what D3D11-CS requires, not 16KB as I originally thought). Though if GT300 is 64 SIMDs with 32KB each, that's going some. Programmers will be freed from the terrors of CUDA occupancy balanced against bytes per thread of shared memory (a rough sketch of that balancing act follows below). I can hear them cheering already.
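The "occupancy balanced against bytes per thread" point boils down to simple arithmetic; here's a rough sketch with made-up kernel numbers (16KB of shared memory per multiprocessor is the G80/GT200 figure, the rest is illustrative):

[code]
#include <cstdio>

// Rough occupancy arithmetic (illustrative): how much shared memory a block
// asks for limits how many blocks a multiprocessor can keep resident.
int main()
{
    const int smemPerSM       = 16 * 1024;  // bytes of shared memory per SM (G80/GT200)
    const int threadsPerBlock = 256;
    const int bytesPerThread  = 32;         // what this hypothetical kernel wants

    const int smemPerBlock    = threadsPerBlock * bytesPerThread;  // 8 KiB
    const int residentBlocks  = smemPerSM / smemPerBlock;          // 2 blocks

    std::printf("%d bytes/block -> %d resident blocks (%d threads) per SM\n",
                smemPerBlock, residentBlocks, residentBlocks * threadsPerBlock);

    // Ask for 64 bytes per thread instead and only one block fits, halving the
    // threads available to hide memory latency - the balancing act in question.
    return 0;
}
[/code]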
Tom Forsyth's presentation said most vector instructions have latencies in the 4-9 cycle range.
There is. The scalar half of the core couldn't
But control flow and instruction sequencing within the VPU appears to be entirely static. It helps that the VPU is a purely scalar ALU from the point of view of a strand. So it seems like there's no need to perform instruction-by-instruction scheduling like NVidia does - though what we don't know yet is the read-after-write register latency in the VPU.