Old 20-Jul-2010, 18:56   #1
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347
GeForce GTX 460 Fun

I wrote a CUDA program to test GF104's MP. In theory, with fully dependent computations an MP can only use 32 of its 48 SPs, while with independent computations it should be able to use all 48. So I decided to test this theory.

I made three different kernels. The first one looks like this (all floats):

e = b * c + a;
b = a * d + e;

This is repeated 1000 times, inside a loop that runs 100 times (i.e. 1000 * 100 * 4 FLOPs in total).

The second kernel looks like this:

e = b * c + a;
a = e * d;

This one is for testing the so-called "dual-issue" MUL units, which should only be available on pre-GF100 GPUs.

The third one is the independent one:

e = b * c + a;
e2 = b2 * c2 + a2;
b = a * d + e;
b2 = a2 * d2 + e2;

This one only repeats 500 times to make the program size similar to the first one.
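
To see why halving the repeat count keeps the work comparable, the FLOPs per thread can be tallied as below. This is only a minimal sketch in C, assuming each MAD counts as 2 FLOPs and each MUL as 1, and assuming the second and third kernels use the same 100-iteration outer loop as the first. `flops_per_thread` is a hypothetical helper, not part of the benchmark source.

```c
#include <assert.h>

/* Hypothetical helper: FLOPs executed by one thread when a block of
 * statements worth `flops_per_repeat` FLOPs is unrolled
 * `unrolled_repeats` times inside a loop of `loop_iterations` passes. */
long flops_per_thread(long unrolled_repeats, long loop_iterations,
                      long flops_per_repeat)
{
    assert(unrolled_repeats > 0 && loop_iterations > 0);
    assert(flops_per_repeat > 0);
    return unrolled_repeats * loop_iterations * flops_per_repeat;
}
```

Under these assumptions, `flops_per_thread(1000, 100, 4)` for the first kernel and `flops_per_thread(500, 100, 8)` for the third both give 400000, while the second kernel (one MAD plus one MUL) gives `flops_per_thread(1000, 100, 3)`, i.e. 300000.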

In theory, the first kernel should give ~64 FLOPs/MP/cycle. The second is probably ~48 FLOPs/MP/cycle, and the third one ~96 FLOPs/MP/cycle.
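
These theoretical figures follow from simple lane arithmetic. The sketch below is one way to reproduce them in C, assuming a MAD is 2 FLOPs, a MUL is 1 FLOP, dependent code keeps 32 SPs busy and independent code keeps all 48 busy. The function name `peak_flops_per_mp_cycle` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: FLOPs per MP per cycle when `lanes` SPs each
 * retire one instruction per cycle and the instruction stream averages
 * `flops_per_pair` FLOPs over every two instructions. */
double peak_flops_per_mp_cycle(int lanes, int flops_per_pair)
{
    assert(lanes > 0 && flops_per_pair > 0);
    return lanes * (flops_per_pair / 2.0);
}
```

With these assumptions, the first kernel (MAD + MAD on 32 lanes) is `peak_flops_per_mp_cycle(32, 4)`, i.e. 64; the second (MAD + MUL on 32 lanes) is `peak_flops_per_mp_cycle(32, 3)`, i.e. 48; and the third (MADs on 48 lanes) is `peak_flops_per_mp_cycle(48, 4)`, i.e. 96.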

The real-world result for the first one is 63.4988 FLOPs/MP/cycle, which is pretty good. The second one reaches 47.6897 FLOPs/MP/cycle, again pretty good. However, the third one is pretty bad, at only 67.3126 FLOPs/MP/cycle. This result sort of "proves" that GF104 has more than 32 SPs per MP, but the efficiency is pretty bad in this case. I'm not sure what the problem is, though.

I also tried using sm_21 as the compile target, but it didn't help much.
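
For readers who want to turn a timed run into the same unit, here is a rough sketch in C. How the actual program measures time is not shown in this post, so the parameters (`total_flops`, `seconds`, `shader_clock_hz`, `num_mps`) and both function names are assumptions for illustration only.

```c
#include <assert.h>

/* Hypothetical conversion from a timed run to FLOPs per MP per cycle. */
double flops_per_mp_cycle(double total_flops, double seconds,
                          double shader_clock_hz, int num_mps)
{
    assert(seconds > 0.0 && shader_clock_hz > 0.0 && num_mps > 0);
    double cycles = seconds * shader_clock_hz;  /* shader cycles elapsed */
    return total_flops / (cycles * num_mps);
}

/* Fraction of a theoretical peak that a measurement reaches. */
double efficiency(double measured, double peak)
{
    assert(peak > 0.0);
    return measured / peak;
}
```

Applied to the numbers above, `efficiency(63.4988, 64.0)` and `efficiency(47.6897, 48.0)` are both above 0.99, while `efficiency(67.3126, 96.0)` is only about 0.70.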

Another interesting thing is, at first, I tried using this:

a += b * c;
b += a * d;

This runs well on G92 (~16 FLOPs/MP/cycle) but not as well on GF104 (only ~52 FLOPs/MP/cycle), which is very weird.

The program (source + executable) can be downloaded here:

http://www.kimicat.com/dang-an-jia/cudatest.rar
Old 20-Jul-2010, 19:57   #2
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809

Bandwidth? Can you try the independent test without MADs and see if you get 48 FLOPs/MP/cycle?

e = b * c;
e2 = b2 * c2;
b = d * e;
b2 = a2 * e2;
__________________
What the deuce!?
Old 20-Jul-2010, 20:37   #3
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347

Yes, it's possible to do 48 FLOPs/cycle/MP with 4 MULs. So apparently there are some limitations on register usage.

I tried this:

e = b * b + a;
e2 = b2 * b2 + a;
b = e * e + a;
b2 = e2 * e2 + a;

This does reach 95.6295 FLOPs/cycle/MP, but it is an extremely boring operation.

Changing to

e = b * b + a;
e2 = b2 * b2 + c;
b = e * e + a;
b2 = e2 * e2 + c;

is still able to do 95.593 FLOPs/cycle/MP, but

e = b * b + a;
e2 = b2 * b2 + a;
b = e * e + c;
b2 = e2 * e2 + c;

slows to 85.0111 FLOPs/cycle/MP.

e = b * b + a;
e2 = b2 * b2 + c;
b = e * c + a;
b2 = e2 * a + c;

slows further to 75.8765 FLOPs/cycle/MP.
Old 20-Jul-2010, 20:55   #4
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 985

Quote:
Originally Posted by pcchen
Yes, it's possible to do 48 FLOPs/cycle/MP with 4 MULs. So apparently there are some limitations on register usage.
You simply hit the register bandwidth wall. Obviously the register file is only able to deliver 96 operands per clock and SM (some additional access restrictions may apply). This isn't more than what GF100 register files are able to do. GF104 would need quite a bit more read ports (probably some more write ports too) to saturate the scheduling bandwidth with all possible instruction combinations when throwing in some SFU and L/S action into the mix.
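
The bandwidth argument can be put into numbers with a small model. This is only a sketch in C under an assumed rule, not a documented hardware fact: the register file supplies a fixed number of source operands per MP per clock, and each instruction needs a fixed number of operand reads per lane. The name `bandwidth_bound_flops` and all of its parameters are hypothetical.

```c
#include <assert.h>

/* Assumed model: the register file delivers `operands_per_clock` source
 * operands per MP per clock, and every instruction needs
 * `reads_per_instr` operand reads per lane. Returns the resulting upper
 * bound on FLOPs per MP per clock. */
double bandwidth_bound_flops(int operands_per_clock, int reads_per_instr,
                             int flops_per_instr, int max_lanes)
{
    assert(operands_per_clock > 0 && reads_per_instr > 0);
    assert(flops_per_instr > 0 && max_lanes > 0);
    double lanes = (double)operands_per_clock / reads_per_instr;
    if (lanes > max_lanes)
        lanes = max_lanes;  /* cannot issue to more SPs than exist */
    return lanes * flops_per_instr;
}
```

With 96 operands per clock and 48 SPs, the model gives 48 for a two-operand MUL, 64 for a MAD reading three different registers, and 96 for a MAD reading only two. It does not account for the intermediate results of about 85 and 76 FLOPs/cycle/MP reported earlier in the thread, so additional restrictions may well apply.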
Old 20-Jul-2010, 21:04   #5
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347

I think it's probably some weird restriction. For example, although

e = b * b + a;
e2 = b2 * b2 + a;
b = e * e + a;
b2 = e2 * e2 + a;

is OK, simply rearranging the operands:

e = b * a + b;
e2 = b2 * a + b2;
b = e * a + e;
b2 = e2 * a + e2;

brings it back to ~64 FLOPs/cycle/MP.
Old 20-Jul-2010, 21:55   #6
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 985

Quote:
Originally Posted by pcchen
I think it's probably some weird restriction. For example, although

e = b * b + a;
e2 = b2 * b2 + a;
b = e * e + a;
b2 = e2 * e2 + a;

is OK, simply rearranging the operands:

e = b * a + b;
e2 = b2 * a + b2;
b = e * a + e;
b2 = e2 * a + e2;

brings it back to ~64 FLOPs/cycle/MP.
But both use only two different source operands in each instruction. I agree that it is some weird additional register file access restriction (perhaps a fetched operand can only be shared if the same operand appears twice directly next to itself, and otherwise it is fetched twice?).

Maybe you should try:

e = a * b + b;
e2 = a * b2 + b2;
b = a * e + e;
b2 = a * e2 + e2;

to check it? If that really is the restriction, it should be back to 96 FLOPs/cycle.

But nevertheless, the register file apparently can't deliver more than ~96 operands per cycle. The optimum would be twice that. Additionally, we don't really know how the register file is banked and how this maps to the ALU blocks in the SMs. Maybe it is simply some sort of bank conflict, as NVIDIA reused the GF100 register file without adapting it to the additional requirements of the GF104 SMs, which could potentially further reduce the available register file bandwidth in a pretty unpredictable way.
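
The sharing rule hypothesised above can be written down as a small counting function. This is just a sketch in C of that guess, not of confirmed hardware behaviour; `fetches_if_adjacent_shared` is a hypothetical name, and operands are identified by plain strings.

```c
#include <assert.h>
#include <string.h>

/* Register fetches for one FMA `x * y + z` under the hypothesised rule:
 * a fetched operand is shared only when the same register name sits in
 * two directly neighbouring operand slots. */
int fetches_if_adjacent_shared(const char *x, const char *y, const char *z)
{
    assert(x && y && z);
    int fetches = 3;
    if (strcmp(x, y) == 0)
        fetches--;  /* slots 1 and 2 share one fetch */
    if (strcmp(y, z) == 0)
        fetches--;  /* slots 2 and 3 share one fetch */
    return fetches;
}
```

Under this rule `b * b + a` and `a * b + b` each cost 2 fetches, while `b * a + b` costs 3, which is exactly the difference the proposed reordering is meant to expose.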

Last edited by Gipsel; 20-Jul-2010 at 22:01.
Old 21-Jul-2010, 15:15   #7
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347

This does not make a difference though. I tried another one with minimal register usage:

e = b * b + e;
e2 = b2 * b2 + e2;
b = e * e + b;
b2 = e2 * e2 + b2;

This also reaches 96 FLOPs/cycle/MP. However, simply altering the first line to

e = b * c + e;

slows it down to ~79 FLOPs/cycle/MP.
Actually I can't find any combination with two different operands for the mul which can reach 96 FLOPs/cycle/MP.
I also looked at the PTX file generated by NVCC and it looks pretty normal (just 2000 fma instructions).
Changing any one of the operands to a constant also slows it down.
Old 21-Jul-2010, 15:35   #8
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 985

Quote:
Originally Posted by pcchen
Actually I can't find any combination with two different operands for the mul which can reach 96 FLOPs/cycle/MP.
I also looked at the PTX file generated by NVCC and it looks pretty normal (just 2000 fma instructions).
Changing any one of the operands to a constant also slows it down.

That sounds like quite a screw-up on NVIDIA's part.
Old 21-Jul-2010, 15:50   #9
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347

Quote:
Originally Posted by pcchen
Actually I can't find any combination with two different operands for the mul which can reach 96 FLOPs/cycle/MP.
Aha, I found one:

e = b * e + 2.0f;
e2 = b2 * b2 + e2;
b = e * b + 2.0f;
b2 = e2 * e2 + b2;

This is able to do 95.573 FLOPs/cycle/MP. Further modification of e2 and b2 slows it down though.
So it still looks like a register bandwidth/allocation thing.

Also the PTX file for this is seriously weird. It looks like:

mov.f32 %f6, 0f40000000; // 2
fma.rn.ftz.f32 %f7, %f1, %f5, %f6;
fma.rn.ftz.f32 %f8, %f3, %f3, %f4;
mov.f32 %f9, 0f40000000; // 2
fma.rn.ftz.f32 %f10, %f1, %f7, %f9;
fma.rn.ftz.f32 %f11, %f8, %f8, %f3;

repeated 500 times (with the movs).
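
As a side note on reading the listing: a PTX `0f` literal is the hexadecimal bit pattern of a single-precision float. The sketch below shows one way to decode such a pattern in C, assuming a platform where `float` is a 32-bit IEEE 754 value; `float_from_bits` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single-precision float,
 * which is how PTX `0f` literals are written. */
float float_from_bits(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);  /* avoids aliasing problems */
    return value;
}
```

Calling `float_from_bits(0x40000000u)` yields 2.0f, matching the `// 2` comment in the listing and the `2.0f` constant in the source.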
Old 22-Jul-2010, 15:30   #10
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 985

Quote:
Originally Posted by pcchen
Also the PTX file for this is seriously weird. It looks like:

mov.f32 %f6, 0f40000000; // 2
fma.rn.ftz.f32 %f7, %f1, %f5, %f6;
fma.rn.ftz.f32 %f8, %f3, %f3, %f4;
mov.f32 %f9, 0f40000000; // 2
fma.rn.ftz.f32 %f10, %f1, %f7, %f9;
fma.rn.ftz.f32 %f11, %f8, %f8, %f3;

repeated 500 times (with the movs).
Are you looking directly at the generated PTX, or have you disassembled the cubin with decuda? To me it looks like the PTX generated by nvcc before the final optimization step (which would remove the moves and also reuse registers).
Old 22-Jul-2010, 17:45   #11
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,347

Quote:
Originally Posted by Gipsel
Are you looking directly at the generated PTX, or have you disassembled the cubin with decuda? To me it looks like the PTX generated by nvcc before the final optimization step (which would remove the moves and also reuse registers).
Oh, this is the directly generated PTX. The cubin, of course, does not look like this. Unfortunately, the cubin format has been changed to ELF, which makes things complicated, so decuda no longer works directly.

All times are GMT +1.