Nvidia GT300 core: Speculation

3dilettante · May 12, 2009

Was there the expectation that it wouldn't serialize accesses to the same location during atomic ops?

TimothyFarrar · May 12, 2009

Jawed said:
That's pretty brutal considering it should be on-die.

What are the grid dimensions? How many warps per block?

Is it definitely coalescing that's increasing performance? Not merely the number of warps concurrently able to do atomics (either or both of number of clusters and number of MCs)?

Jawed

Good point.

Grid was {512,1} blocks and {16,16} threads per block (8 warps, 256 threads). I think this should have been enough to saturate both cards.

Actually, you are right: an increase in the number of parallel units (ROPs, MCs, or whatever the hardware uses) able to do atomics would likely factor into the increased performance in the non-100%-collide case.

Also I'm not sure how CUDA distributes blocks to cores. So likely this would also affect how many atomic operations had the same memory segment collision (in my test).

Jawed · May 12, 2009

3dilettante said:
Was there the expectation that it wouldn't serialize accesses to the same location during atomic ops?

It's serialising memory accesses. If this was fully cached then it would be cheaper.

If this were a pixel blend, then I guess it's flushing the result to memory upon completion, on the assumption that the cache isn't big enough to hold the pixel until the next blend operation arrives. So it has to read the pixel back every time a new blend (atomic) comes along.

The multiple atomics case is just like having lots of pixels in flight being blended separately.

Jawed

3dilettante · May 12, 2009

It would seem Nvidia went for the simplest route in this case.
Write back with no route for an on-chip RAW attempt.

TimothyFarrar · May 12, 2009

aaronspink said:
Hmm, interesting. It looks like Nvidia's implementation of atomic add works in such a way that repeatedly accessing the same address takes longer than accessing different addresses and the latency is roughly equal to their memory latency.

The performance here is actually the opposite of what you would expect in a CPU-based system. It looks like Nvidia is fully serializing the accesses to the same address but, in the case of different addresses, is issuing them and then switching warps, which effectively hides the latency.

BTW, a few days ago I did at least one quick test on the 8600 GTS in which I added instructions into the loop to see if I could get concurrent execution in the 100% collide case, and it seemed as if this was the case. I could vary the number of instructions following the global atomicAdd and wouldn't see an effect on run-time until I got around or over the expected number of cycles of latency.

I really need to double check those results again, but it seemed to me as if the ALUs were issuing the atomic operations without serializing execution (so the atomic operation's arithmetic is happening outside the ALUs)?

KonKort · May 12, 2009

GT300 will have around 2.4 billion transistors and a die size under 490 mm^2.

Source: Hardware-Infos

pjbliverpool · May 12, 2009

Sounds very promising. A smaller die-size gap than GT200 vs RV770 with (from the sounds of it) a larger performance gap.

trinibwoy · May 12, 2009

2400m at 490mm^2 will still be at a significant trans/mm^2 disadvantage (based on RV740). Notwithstanding the uselessness of that particular metric, it still points to the likelihood that they weren't able to match AMD's density.
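The density comparison can be put into numbers with a short sketch. The GT300 figures are the rumoured ones quoted in this thread; the RV740 figures (826 million transistors, 137 mm^2) are the commonly cited ones and are an assumption here, since the thread does not state them.

```python
def density(transistors_millions, area_mm2):
    """Transistors per square millimetre, in millions."""
    return transistors_millions / area_mm2

gt300_rumour = density(2400, 490)  # rumoured figures quoted in this thread
rv740 = density(826, 137)          # assumed figures, not stated in this thread

print(f"GT300 (rumoured): {gt300_rumour:.2f} Mtrans/mm^2")
print(f"RV740 (assumed):  {rv740:.2f} Mtrans/mm^2")
print(f"ratio: {gt300_rumour / rv740:.2f}")
```

Under those assumptions the rumoured part comes out at roughly 4.9 million transistors per mm^2 against about 6.0 for RV740. A die somewhat smaller than 490 mm^2 would narrow the gap but not close it.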

Jawed · May 12, 2009

TimothyFarrar said:
Grid was {512,1} blocks and {16,16} threads per block (8 warps, 256 threads). I think this should have been enough to saturate both cards.

With <=32 warps per multiprocessor (but 24 on 8600GT), that's 4 blocks per multiprocessor. At any one time there are 4 blocks per multiprocessor * 30 multiprocessors = 120 blocks in flight.
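The blocks-in-flight arithmetic can be checked with a small sketch. It uses only the figures from the posts (256 threads per block, a 32-warp limit, 30 multiprocessors) and deliberately ignores register and shared-memory limits, so it is an upper bound rather than a real occupancy calculation. The function name is made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on these GPUs

def blocks_in_flight(threads_per_block, max_warps_per_sm, num_sms):
    """Return (warps per block, resident blocks per SM, resident blocks on the chip)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks_per_sm = max_warps_per_sm // warps_per_block
    return warps_per_block, blocks_per_sm, blocks_per_sm * num_sms

# {16,16} threads per block, 32-warp limit, 30 multiprocessors (figures from the posts)
print(blocks_in_flight(16 * 16, 32, 30))  # (8, 4, 120)
```

With the 24-warp limit mentioned for the 8600 part, the same function gives 3 resident blocks per multiprocessor instead of 4.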

The only issue with this is occupancy. How many registers are allocated for the kernel? If you can allocate a huge amount of shared memory per thread, that might be a way to artificially lower the number of simultaneous warps/blocks per multiprocessor. Dunno if this will actually work, to be honest.

It should be possible to work out the number of colliding addresses for each of your overlaps. Then you can get an average number of collisions for each of the 7 MCs, assuming that memory segments are distributed evenly, round-robin.

Also I'm not sure how CUDA distributes blocks to cores. So likely this would also affect how many atomic operations had the same memory segment collision (in my test).

Blocks can't be split across multiprocessors. The collision count is purely defined by the overlap and the size of the variable. Presumably you used a 32-bit atomic (only 32- and 64-bit global atomic variables are possible, I think). GT200 has a 128-byte segment size for 32-bit variables, but it can halve that for independent memory operations, so 16 addresses:

overlap = 1 : 512 blocks / 16 addresses per segment = 32 colliding addresses and 16 * 8 warps * 2 half-warps = 256 collisions per address
overlap = 16 : 512 blocks / 1 address per segment = 512 colliding addresses and 8 warps * 2 half-warps = 16 collisions per address

Erm, I think that's how it works...
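The two collision estimates above can be reproduced with a short sketch. The generalisation to other overlap values (16 addresses divided by the overlap) is an assumption read off the two cases given, and the function name and parameters are hypothetical.

```python
def collisions(overlap, blocks=512, warps_per_block=8, addresses_per_half_segment=16):
    """Reproduce the post's arithmetic for one overlap setting.

    Assumption: 'overlap' divides the 16 addresses of a half-segment,
    so overlap=1 leaves 16 addresses per segment and overlap=16 leaves 1.
    """
    per_segment = addresses_per_half_segment // overlap
    colliding_addresses = blocks // per_segment
    half_warps = warps_per_block * 2  # the estimate counts one atomic per half-warp
    per_address = per_segment * half_warps
    return colliding_addresses, per_address

for overlap in (1, 16):
    addresses, per_address = collisions(overlap)
    # The product is the total number of half-warp atomics and stays constant.
    print(overlap, addresses, per_address, addresses * per_address)
```

Both cases account for the same 8192 half-warp atomics (512 blocks * 16 half-warps); only their spread over addresses changes.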

Jawed

DegustatoR · May 13, 2009

trinibwoy said:
2400m at 490mm^2 will still be at a significant trans/mm^2 disadvantage (based on RV740).

Well, it says "less than 490mm^2", doesn't it...

ninelven · May 13, 2009

Yes, but probably not by much...

If it was less than 480mm^2, surely it would have read < 480mm^2. Let's split the difference and say 485mm^2.

But... if we are to believe the specs then GT300 should offer a minimum of 3x RV740 performance to a maximum of 4x. 3x RV740 die size would be ~411mm^2, 4x 548mm^2, and 3.5x 480mm^2. From a performance/mm^2 standpoint they appear quite similar. After all, it's not how many transistors so much as what you do with them.
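As a quick check of the scaling figures, here is a minimal sketch. The 137 mm^2 base is implied by the post's own numbers (411 / 3); the 485 mm^2 target is the "split the difference" guess from the same post, not a confirmed die size.

```python
RV740_AREA_MM2 = 137  # implied by the post: 411 / 3

def scaled_area(multiple, base=RV740_AREA_MM2):
    """Die area of a hypothetical RV740 scaled up by the given multiple."""
    return multiple * base

for multiple in (3, 3.5, 4):
    print(multiple, scaled_area(multiple))

# Performance multiple at which a scaled-up RV740 would match a 485 mm^2 die
break_even = 485 / RV740_AREA_MM2
print(round(break_even, 2))
```

On these numbers a 485 mm^2 part would need roughly 3.5x RV740 performance to match it per mm^2, which sits in the middle of the 3x to 4x range quoted.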

DegustatoR · May 13, 2009

ninelven said:
If it was less than 480mm^2, surely it would have read < 480mm^2. Let's split the difference and say 485mm^2.

Hint: GT200b is 490mm^2. So it's more like "less than GT200b" really.

ninelven said:
From a performance/mm^2 they appear quite similar.

How's that? We don't know anything about G300's architecture at the moment.

ninelven · May 13, 2009

ninelven said:
if we are to believe the specs...

Should I really have to do this?

DegustatoR said:
GT200b is 490mm^2. So it's more like "less than GT200b" really.

Since you apparently know GT300's die size why don't you go ahead and post it?

Jawed · May 13, 2009

http://www.brightsideofnews.com/news/2009/5/12/nvidias-gt300-is-smaller2c-faster-than-larrabee.aspx

GT300 packs 512 MIMD-capable cores and yet it uses "just" one billion transistors extra. I'll be first to admit that I wondered how GT300 packs at least three billion transistors, but according to our highly confidential source, the 2.4 billion transistors are packed in just 495mm2.

Later the article raises an eyebrow over the single-PCB GTX295 that's coming. Yes, it definitely would be interesting if NVidia launched a "GTX395" concurrently with "GTX380", using a single board for 2 GPUs.

But, if a single board is so good, why hasn't NVidia done it already?

Jawed

Scali · May 13, 2009

Jawed said:
But, if a single board is so good, why hasn't NVidia done it already?

Seems to me that they went for the quick 'n' dirty solution.
Who knows... Perhaps they thought they weren't going to sell enough of them to justify a major rework of the GT200 PCB. And perhaps they reconsidered when they were selling more GTX295s than they predicted... Or perhaps because the GT300 will use a PCB very similar to the GT200, and they figured they'd get a head start on a 2-GPU PCB this time...

ChrisRay · May 13, 2009

I personally don't think a single board is better for the consumer unless you water cool. I think dual PCBs are generally superior from a thermal standpoint and PCB/weight considerations. But it might be more cost effective for Nvidia.

MfA · May 13, 2009

Cooling is only an issue when you try to vent through the back ... for a gamer card I just don't see why they would have to religiously stick to that.

Arty · May 13, 2009

FWIW, I wouldn't trust Theo's or Hardware-Infos' info. They've been wrong on too many occasions and, according to CJ, are just taking stabs in the dark, emailing him asking for confirmation on their guesses.

Maybe CJ can enlighten us what is going on in the background?

AlNom · May 13, 2009

Jawed said:
http://www.brightsideofnews.com/news/2009/5/12/nvidias-gt300-is-smaller2c-faster-than-larrabee.aspx

That's some very sketchy transistor scaling.
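One way to see why the scaling looks odd is to divide transistors by cores. This is only a rough sketch: the 1.4 billion figure for GT200 is implied by the article's "one billion transistors extra", the 240-core count for GT200 is an assumption not stated in this thread, and a per-core average ignores everything on the die that is not a shader core.

```python
def per_core(transistors_millions, cores):
    """Average transistor budget per core, in millions."""
    return transistors_millions / cores

gt200 = per_core(1400, 240)  # 1.4 billion implied by the article; 240 cores assumed
gt300 = per_core(2400, 512)  # rumoured figures from the article

print(f"GT200: {gt200:.2f} Mtrans/core")
print(f"GT300 (rumoured): {gt300:.2f} Mtrans/core")
print(f"cores x{512 / 240:.2f}, transistors x{2400 / 1400:.2f}")
```

On these assumptions the rumour has the core count slightly more than doubling while the transistor count grows by about 1.7x, so each supposedly more capable core would get a smaller budget than before.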

bowman · May 13, 2009

Arty said:
FWIW, I wouldn't trust Theo's or Hardware-Infos' info. They've been wrong on too many occasions and, according to CJ, are just taking stabs in the dark, emailing him asking for confirmation on their guesses.

Maybe CJ can enlighten us what is going on in the background?

I don't even click 'Bright Side of News' links any more; it's almost always more Theo Valich drooling.
