Nvidia GT300 core: Speculation

That's pretty brutal considering it should be on-die.


What are the grid dimensions? How many warps per block?

Is it definitely coalescing that's increasing performance? Not merely the number of warps concurrently able to do atomics (determined by either or both of the number of clusters and the number of MCs)?

Jawed

Good point.

Grid was {512,1} blocks and {16,16} threads per block (8 warps, 256 threads). I think this should have been enough to saturate both cards.

Actually, you're right that an increase in the number of parallel units (ROPs, MCs, or whatever the hardware uses) able to do atomics would likely factor into the performance increase in the non-100%-collide case.

Also, I'm not sure how CUDA distributes blocks to cores, so that would likely also affect how many atomic operations collided on the same memory segment (in my test).
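
For reference, the test was shaped roughly like this (a reconstruction from memory, not the actual benchmark code; the way "overlap" picks the target counter here is just my illustrative assumption):

    // Minimal sketch of the collision test described above. "overlap" sets
    // how many blocks share one counter, so overlap == gridDim.x is the
    // 100%-collide case. Every thread in a block hits the same counter.
    #include <cuda_runtime.h>

    __global__ void atomicCollideTest(unsigned int *counters, int overlap)
    {
        atomicAdd(&counters[blockIdx.x / overlap], 1u);
    }

    int main()
    {
        const dim3 grid(512, 1);   // {512,1} blocks
        const dim3 block(16, 16);  // {16,16} threads = 256 threads = 8 warps
        unsigned int *counters;
        cudaMalloc(&counters, 512 * sizeof(unsigned int));
        cudaMemset(counters, 0, 512 * sizeof(unsigned int));

        atomicCollideTest<<<grid, block>>>(counters, 1);    // spread case
        atomicCollideTest<<<grid, block>>>(counters, 512);  // 100%-collide case
        cudaThreadSynchronize();   // CUDA 2.x-era sync call
        cudaFree(counters);
        return 0;
    }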
 
Was there the expectation that it wouldn't serialize accesses to the same location during atomic ops?
It's serialising memory accesses. If this was fully cached then it would be cheaper.

If this was a pixel blend, then I guess it's flushing the result to memory upon completion, on the assumption that the cache isn't big enough to hold the pixel until the next blend operation arrives. So it has to read the pixel back every time a new blend (atomic) comes along.

The multiple atomics case is just like having lots of pixels in flight being blended separately.

Jawed
 
Hmm, interesting. It looks like Nvidia's implementation of atomic add works in such a way that repeatedly accessing the same address takes longer than accessing different addresses, with the latency roughly equal to their memory latency.

The performance here is actually the opposite of what you would expect in a CPU-based system. It looks like Nvidia is fully serializing accesses to the same address, but in the case of different addresses it issues them and then switches warps, which effectively hides the latency.

BTW, a few days ago I did at least one quick test on the 8600 GTS in which I added instructions into the loop to see if I could get concurrent execution in the 100%-collide case, and it seemed as if this was so. I could vary the number of instructions following the global atomicAdd and wouldn't see an effect on run-time until I went over the expected number of cycles of latency.

I really need to double-check those results, but it seemed to me as if the ALUs were issuing the atomic operations without serializing execution (so the atomic operation's ALU work happens outside the shader ALUs)?
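
The test looked roughly like this (again reconstructed from memory, so treat it as a sketch; "aluWork" is the knob I varied):

    // Latency-hiding check: issue the atomic, then do a tunable amount of
    // independent ALU work. If the SM doesn't stall on the atomic, run-time
    // should stay flat until the ALU work exceeds the atomic's round-trip
    // latency. (A reconstruction, not the original test code.)
    __global__ void atomicLatencyTest(unsigned int *counter, float *out,
                                      int aluWork)
    {
        float acc = threadIdx.x * 0.5f;

        // 100%-collide case: every thread hits the same word.
        atomicAdd(counter, 1u);

        // Independent work that doesn't depend on the atomic's result, so
        // it can in principle overlap the atomic's memory round trip.
        for (int i = 0; i < aluWork; ++i)
            acc = acc * 1.000001f + 0.5f;

        // Store the result so the compiler can't dead-code the loop away.
        out[blockIdx.x * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x + threadIdx.x] = acc;
    }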
 
2400M transistors at 490mm^2 would still be at a significant transistors/mm^2 disadvantage (based on RV740). Notwithstanding the uselessness of that particular metric, it still points to the likelihood that they weren't able to match AMD's density.
 
Grid was {512,1} blocks and {16,16} threads per block (8 warps, 256 threads). I think this should have been enough to saturate both cards.
With <=32 warps per multiprocessor (only 24 on 8600GT) and 8 warps per block, that's 4 blocks per multiprocessor. At any one time on GT200 there are 4 blocks per multiprocessor * 30 multiprocessors = 120 blocks in flight.

The only issue with this is occupancy. How many registers are allocated for the kernel? If you can allocate a huge amount of shared memory per block, that might be a way to artificially lower the number of simultaneous warps/blocks per multiprocessor. Dunno if this will actually work, to be honest.
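
Something like this might do it, if the compiler doesn't strip the allocation (a sketch only, assuming 16KB of shared memory per multiprocessor on GT200):

    // Occupancy-limiting trick: shared memory is allocated per block, so a
    // ~12KB static array should leave room for only one resident block per
    // multiprocessor on a 16KB part. Untested; the array has to be "used"
    // or the compiler may eliminate it.
    __global__ void atomicTestLowOccupancy(unsigned int *counters)
    {
        __shared__ unsigned int pad[3072];  // 3072 * 4B = 12KB of 16KB
        int tid = threadIdx.y * blockDim.x + threadIdx.x;
        pad[tid] = tid;                     // touch it so it isn't eliminated
        __syncthreads();
        atomicAdd(&counters[blockIdx.x], pad[tid]);
    }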

It should be possible to work out the number of colliding addresses for each of your overlaps. Then you can get an average number of collisions for each of the 7 MCs, assuming that memory segments are distributed evenly, round-robin.

Also, I'm not sure how CUDA distributes blocks to cores, so that would likely also affect how many atomic operations collided on the same memory segment (in my test).

Blocks can't be split across multiprocessors. The collision count is purely defined by the overlap and the size of the variable. Presumably you used a 32-bit atomic (only 32- and 64-bit global atomic variables are possible, I think). GT200 has a 128-byte segment size for 32-bit variables, but it can halve that for independent memory operations, so 16 addresses:
  • overlap = 1 : 512 blocks / 16 addresses per segment = 32 colliding addresses and 16 * 8 warps * 2 half-warps = 256 collisions per address
  • overlap = 16 : 512 blocks / 1 address per segment = 512 colliding addresses and 8 warps * 2 half-warps = 16 collisions per address
Erm, I think that's how it works...
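
For what it's worth, the same arithmetic in code form (just restating the back-of-envelope numbers above, so the same caveats apply):

    // Back-of-envelope collision counts for the two overlap cases above.
    #include <cstdio>

    int main()
    {
        const int blocks = 512, warpsPerBlock = 8, halfWarps = 2;
        const int addrsPerSegment = 16;  // 64B half-segment / 4B per word

        // overlap = 1: 16 addresses per segment.
        printf("overlap=1:  %d colliding addresses, %d collisions each\n",
               blocks / addrsPerSegment,                      // 32
               addrsPerSegment * warpsPerBlock * halfWarps);  // 256

        // overlap = 16: 1 address per segment.
        printf("overlap=16: %d colliding addresses, %d collisions each\n",
               blocks / 1,                                    // 512
               warpsPerBlock * halfWarps);                    // 16
        return 0;
    }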

Jawed
 
Yes, but probably not by much...

If it was less than 480mm^2, surely it would have read < 480mm^2. Let's split the difference and say 485mm^2.

But... if we are to believe the specs, then GT300 should offer a minimum of 3x RV740 performance and a maximum of 4x. 3x RV740's die size would be ~411mm^2, 4x would be ~548mm^2, and 3.5x would be ~480mm^2. From a performance/mm^2 standpoint they appear quite similar. After all, it's not how many transistors so much as what you do with them.
 
http://www.brightsideofnews.com/news/2009/5/12/nvidias-gt300-is-smaller2c-faster-than-larrabee.aspx

GT300 packs 512 MIMD-capable cores and yet it uses "just" one billion transistors extra. I'll be first to admit that I wondered how GT300 packs at least three billion transistors, but according to our highly confidential source, the 2.4 billion transistors are packed in just 495mm2.
Later the article raises an eyebrow over the single-PCB GTX295 that's coming. Yes, it definitely would be interesting if NVidia launched a "GTX395" concurrently with "GTX380", using a single board for 2 GPUs.

But, if a single board is so good, why hasn't NVidia done it already?

Jawed
 
But, if a single board is so good, why hasn't NVidia done it already?

Seems to me that they went for the quick 'n' dirty solution.
Who knows... Perhaps they thought they weren't going to sell enough of them to do a major rework of the GT200 PCB. And perhaps they reconsidered when they were selling more GTX295s than they predicted... Or perhaps because the GT300 will use a PCB very similar to the GT200, and they figured they'd get a headstart on a 2-GPU PCB this time...
 
I personally don't think a single board is better for the consumer unless you water-cool. I think dual PCBs are generally superior from a thermal standpoint and for PCB/weight considerations. But it might be more cost-effective for Nvidia.
 
Cooling is only an issue when you try to vent through the back ... for a gamer card I just don't see why they would have to religiously stick to that.
 
FWIW, I wouldn't take Theo's or hardware-infos' info. They've been wrong on too many occasions and, according to CJ, are just taking stabs in the dark, emailing him asking for confirmation on their guesses. :unsure:

Maybe CJ can enlighten us what is going on in the background? :runaway:
 
FWIW, I wouldn't take Theo's or hardware-infos' info. They've been wrong on too many occasions and, according to CJ, are just taking stabs in the dark, emailing him asking for confirmation on their guesses. :unsure:

Maybe CJ can enlighten us what is going on in the background? :runaway:

I don't even click 'bright side of news' links any more; it's almost always more Theo Valich drooling.
 