Jawed
> If the math is non-linear though you can't use hardware filtering (or SATs for that matter)

Ah, I interpreted what Marco said to mean he wouldn't be using hardware filtering for extended precision, but would do so in the ALUs.
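"Filtering in the ALUs" just means fetching the raw texels and blending them with ordinary shader arithmetic instead of relying on the texture unit's fixed-function (and fixed-precision) bilinear filter, which also lets you wrap non-linear math around each sample. A minimal CUDA-flavoured sketch, assuming a flat single-channel array; the function name and layout are illustrative only:

```cuda
// Manual bilinear filter done in the ALUs: fetch the four nearest
// texels yourself and lerp them, instead of using fixed-function
// texture filtering. The weights and blends run at full FP32, and a
// non-linear transform could be applied to each texel before blending.
__device__ float bilinear_alu(const float* tex, int w, int h, float u, float v)
{
    float x = u * w - 0.5f;                 // normalized -> texel space
    float y = v * h - 0.5f;
    int x0 = max(0, min(w - 2, (int)floorf(x)));
    int y0 = max(0, min(h - 2, (int)floorf(y)));
    float fx = x - x0, fy = y - y0;         // fractional blend weights

    float t00 = tex[y0 * w + x0],       t10 = tex[y0 * w + x0 + 1];
    float t01 = tex[(y0 + 1) * w + x0], t11 = tex[(y0 + 1) * w + x0 + 1];

    float top = t00 + fx * (t10 - t00);     // lerp along x
    float bot = t01 + fx * (t11 - t01);
    return top + fy * (bot - top);          // lerp along y
}
```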
> Ah, I interpreted what Marco said to mean he wouldn't be using hardware filtering for extended precision, but would do so in the ALUs.

Could be, but then I see no reason not just to do the log filtering stuff he has already been doing... however, that wouldn't work with a SAT. Not sure how to make any of this work with a SAT nicely (without DP), but maybe that's where his cleverness comes in...
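The "log filtering stuff" here presumably refers to filtering in log space, where the texture stores d but the quantity you actually want averaged is exp(c·d). A hedged sketch of the standard numerically stable formulation (the log-sum-exp trick); the names and the scale factor c are illustrative, not Marco's actual code:

```cuda
// Filter in log space: compute log( sum_i w[i] * exp(c*d[i]) ) / c.
// Factoring out the largest exponent keeps every term <= 1, so the sum
// never overflows FP32 -- which is how single-precision ALU filtering
// can stand in for extended precision here.
__device__ float filter_log_space(const float* d, const float* w, int n, float c)
{
    float m = -1e30f;                       // running max of the exponents
    for (int i = 0; i < n; ++i)
        m = fmaxf(m, c * d[i]);
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += w[i] * expf(c * d[i] - m);     // rescaled terms, all in [0,1]
    return (m + logf(s)) / c;               // back to d's units
}
```

A SAT, by contrast, needs running sums of the exponentiated values themselves, and those grow without bound across the table, which is exactly where FP32 runs out of mantissa without DP.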
> Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware queue support (GS output) or perhaps just GS scheduling which is now thread friendly?

Err, just a higher triangle setup rate?
> Err, just a higher triangle setup rate?

Heh, so you think they won't be benching it on a "balanced platform"...
> As for the "low" DP rate in comparison to AMD, is the typical DP workflow on the GPU bandwidth bound anyway? Or are those using DP on AMD's GPUs actually very often reaching the peak DP rate, other than in the obvious and trivial cases?

MfA reported almost 1/3 throughput for double-precision dense matrix multiply when compared with single-precision. I think that's as bandwidth bound as you can get.
> I think I would place dense matrix multiply into the trivial and obvious case (easy data locality and reuse).

Oh, I do too. That's all the information relating to DP I have, though.
> I think I would place dense matrix multiply into the trivial and obvious case (easy data locality and reuse).

It is, but it does tell you a couple of things. For instance, matrix multiplication hits the caches very hard, and it can deal with that competently. It would have been nice if they could get the near-100% efficiency of Cell, but 50% is pretty good regardless.
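To make "easy data locality and reuse" concrete: in a tiled matrix multiply, each operand brought on-chip is reused TILE times, so off-chip loads per output element drop from 2n to 2n/TILE, and a big enough tile moves the kernel away from the bandwidth limit toward the ALU limit. A minimal CUDA sketch, assuming n is a multiple of TILE (illustrative only, not MfA's actual benchmark):

```cuda
#define TILE 16

// Tiled double-precision matrix multiply: C = A * B, all n x n,
// row-major, with n assumed to be a multiple of TILE.
__global__ void dgemm_tiled(const double* A, const double* B, double* C, int n)
{
    __shared__ double As[TILE][TILE];       // staging tiles: every element
    __shared__ double Bs[TILE][TILE];       // loaded here is reused TILE times
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double acc = 0.0;

    for (int t = 0; t < n; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();                    // tile fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    // done with this tile
    }
    C[row * n + col] = acc;
}
```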
> Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware GS output merging support or perhaps just GS scheduling which is now thread friendly?

They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.
> They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.

Was there some discussion of why it's slow on G8x/G9x? All I remember was that vertex throughput on G80/G92 was pretty low (slower than HD3850, and slower than G94 even).
> Guess I missed the B3D post on why GS was slow...

Rys seems to be the only one in the know on this issue. (Well, we all know why the GS is slow in general, but not why it's so ridiculously slow on G80.) Maybe a quick post or article summarizing it for us mortals?
> Well since the execution resources are the same for both GS and VS, and we have a throughput issue, I'd suspect either a cache

I doubt the G80 can cache the GS output at all.
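The usual explanation is that GS output has to be buffered on-chip, in primitive order, before it can go further down the pipe, so the number of GS threads in flight is capped by the output buffer rather than by the ALUs. As a back-of-envelope model (all numbers hypothetical, not G80's real figures):

$$T \approx \left\lfloor \frac{B}{M \cdot S} \right\rfloor$$

where $B$ is the output buffer size, $M$ the declared maximum output vertex count, and $S$ the bytes per output vertex. With, say, $B = 16\,\text{KiB}$ and $M \cdot S = 64 \times 64\,\text{B} = 4\,\text{KiB}$, only four GS threads fit in flight, far too few to hide memory latency, while a pass-through GS with a tiny $M \cdot S$ runs at full speed. A substantially larger output buffer in GT200 would fit the "much faster geometry shading" bullet point.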