NVIDIA Kepler speculation thread

Ok, some of these countries might be rich, but there is something special about the USA that makes people wanna go there and not somewhere else. ;)
And I know a few physicists in the US searching for a job here so they can come back to Germany. Two years ago I could have gone to California too, but I didn't want to. I would not claim my sample size is representative. :rolleyes:
You mean like getting cheapo prices on hardware which in turn is barely available? ;)
:LOL:
 
This is way OT but....

I'm from neither the US nor Europe, but in my travels people from Europe talk way more about wanting to go to the US than the other way around.
 
This is way OT but....

I'm from neither the US nor Europe, but in my travels people from Europe talk way more about wanting to go to the US than the other way around.
I agree about the way OT, but I have to mention that we (of course ;)) have official statistics about such stuff in Germany from the "Statistisches Bundesamt" (Federal Statistics Office). The net effect is that we had more immigration than emigration in the last few years, also from the US (though that exchange is almost even). Only if you count just German citizens do slightly more leave than come back (and as said, counting everybody coming from the USA irrespective of citizenship, we arrive at almost net zero).
By the way, the most attractive target for German emigrants is Switzerland! :smile:

And now back to topic!

Does anybody have some more insight into the question I asked here?
I mean, if nV takes the data locality issue seriously, they should pin warps to a certain vALU (or actually to a set of vALUs, SFUs and L/S units) in roughly the same way GCN does by pinning its wavefronts to a certain vALU.
 
This is way OT but....

I'm from neither the US nor Europe, but in my travels people from Europe talk way more about wanting to go to the US than the other way around.


Continuing way OT ...

I call it the Hollywood effect ;)


Back on topic:
I was wondering why GK104 is slower at BitCoin mining than GF110. I know this workload is purely integer, yet it still seems odd that the new GPU is 20-30% slower in both OpenCL and CUDA miners (including a CUDA miner compiled with the 4.2 toolkit).

average numbers:
110MH/s (GTX680) vs 140MH/s (GTX580)
 
PS:
If they didn't make a similar mistake in those slides as during the Fermi presentation, the total register space is the same as on GF100/GF110 (2 MB), so really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering the similar ALU counts of GK104 and Tahiti. I would expect double the value given in that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
Tahiti can support a much larger number of concurrent threads. I don't think the RF size in GK104 is particularly lacking in that regard. The number of warps per SMX is more troubling, as are the consequences for memory access latency hiding -- which takes us back to the question of how the new SW scheduling will deal with data locality and dependencies.
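One way to sanity-check the slide's number would be to just ask the runtime once the cards are in hand; here is a minimal sketch using the CUDA device property query (assuming, as holds at least for Fermi, that the maximum registers per block matches the per-SM register file):

Code:
// Minimal sketch: query per-SM register resources via the CUDA runtime API.
// Assumption: on Fermi/Kepler the maximum registers per block matches the
// per-SM register file size, so this gives a rough total for the chip.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t regsPerSM = prop.regsPerBlock;                 // 32-bit registers
    size_t kbPerSM   = regsPerSM * 4 / 1024;              // 4 bytes each
    size_t kbTotal   = kbPerSM * prop.multiProcessorCount;

    printf("%s: %d SMs, %zu kB registers/SM, %zu kB total\n",
           prop.name, prop.multiProcessorCount, kbPerSM, kbTotal);
    return 0;
}

If the slide is right, a GTX 680 should report 256 kB per SMX and 2 MB for the whole chip; if the doubled figure is right, 512 kB and 4 MB.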
 
I was wondering why GK104 is slower at BitCoin mining than GF110. I know this workload is purely integer, yet it still seems odd that the new GPU is 20-30% slower in both OpenCL and CUDA miners (including a CUDA miner compiled with the 4.2 toolkit).

average numbers:
110MH/s (GTX680) vs 140MH/s (GTX580)
+
it's slower in the GPC OpenCL benchmark too

Code:
              GTX 580   GTX 680
SHA-1 Hash      571.0     471.9
 
I was wondering why GK104 is slower at BitCoin mining than GF110. I know this workload is purely integer, yet it still seems odd that the new GPU is 20-30% slower in both OpenCL and CUDA miners (including a CUDA miner compiled with the 4.2 toolkit).

Bitcoin is basically all shifts. Perhaps the shift hardware is not as good in Kepler? (It certainly has no reason to be as good for gaming loads.)
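For reference, the inner loop of a bitcoin miner is two SHA-256 rounds, and the SHA-256 sigma functions are nothing but 32-bit rotates and shifts; without a native rotate, each rotation compiles to two shifts plus an OR. A quick sketch in CUDA C (not lifted from any particular miner):

Code:
// SHA-256 building blocks as hammered by Bitcoin mining (double SHA-256).
// Each rotr() becomes two 32-bit shifts plus an OR when there is no native
// rotate instruction, which is why shift throughput dominates here.
__device__ __forceinline__ unsigned int rotr(unsigned int x, unsigned int n)
{
    return (x >> n) | (x << (32u - n));   // valid for 1 <= n <= 31
}

__device__ __forceinline__ unsigned int Sigma0(unsigned int x)
{
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22);
}

__device__ __forceinline__ unsigned int Sigma1(unsigned int x)
{
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25);
}

__device__ __forceinline__ unsigned int sigma0(unsigned int x)
{
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3);
}

__device__ __forceinline__ unsigned int sigma1(unsigned int x)
{
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10);
}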
 
Bitcoin is basically all shifts. Perhaps the shift hardware is not as good in Kepler? (It certainly has no reason to be as good for gaming loads.)

Granted, but the results are closer to the level of a GTX 560 Ti, which GK104 doubles in almost every aspect.
It will be interesting to see whether big Kepler brings any improvements in these tasks or not.
 
Granted, but the results are closer to the level of a GTX 560 Ti, which GK104 doubles in almost every aspect.
It will be interesting to see whether big Kepler brings any improvements in these tasks or not.
Why can't a new driver help with this?
 
Why can't a new driver help with this?

A new driver won't help much here, as the integer performance is severely handicapped compared to the GTX 580 and even the GTX 560. According to the CUDA C Programming Guide version 4.2, 32-bit integer shifts and compares have only 1/24 of the throughput of 32-bit FMA. That would put the GTX 680 at around 1/6 of the GTX 580's throughput in those operations. The other integer operations aren't quite that slow, though.
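If someone wants to verify the guide's numbers on real hardware, a crude microbenchmark along these lines should do (kernel, launch configuration and op counting are mine, purely for illustration): run a long stream of 32-bit shifts with enough resident warps to keep the pipelines full and divide ops by elapsed time. Swap the loop body for a float FMA chain to get the baseline the 1/24 figure refers to.

Code:
// Crude throughput test: every thread runs a long dependent chain of
// 32-bit shifts.  With enough resident warps the pipelines stay full,
// so total ops / elapsed time approximates the shift throughput.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void shift_chain(unsigned int *out, unsigned int seed)
{
    unsigned int x = seed + threadIdx.x;
    #pragma unroll 64
    for (int i = 0; i < ITERS; ++i)
        x = (x >> 1) | (x << 31);          // two shifts + one OR per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
}

int main()
{
    const int blocks = 64, threads = 256;
    unsigned int *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    shift_chain<<<blocks, threads>>>(d_out, 0x12345678u);  // warm-up
    cudaEventRecord(start);
    shift_chain<<<blocks, threads>>>(d_out, 0x12345678u);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double ops = 3.0 * ITERS * blocks * threads;   // 2 shifts + 1 OR per iter
    printf("%.2f Gops/s\n", ops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}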
 
A new driver won't help much here, as the integer performance is severely handicapped compared to the GTX 580 and even the GTX 560. According to the CUDA C Programming Guide version 4.2, 32-bit integer shifts and compares have only 1/24 of the throughput of 32-bit FMA. That would put the GTX 680 at around 1/6 of the GTX 580's throughput in those operations. The other integer operations aren't quite that slow, though.

Seems like the compiler would then want to examine the potential to favor integer MADDs over left shifts, since they have 4x better throughput (and can pair with ADDs, being an alternative to a useful left shift + insert operation). It's not intuitive, but I've actually done that sort of thing in NEON code.
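To make that concrete: a left shift by a constant is just a multiply by a power of two, and a shift-plus-add folds into a single integer mad. A toy sketch of the substitution (whether NVIDIA's compiler keeps the multiply or canonicalises it back into a shift is exactly the open question):

Code:
// Toy example of trading left shifts for integer multiplies where imul32
// has higher throughput than shifts.  Only works for left shifts by a
// constant; right shifts have no such cheap substitute.
__device__ __forceinline__ unsigned int shl5_add(unsigned int x, unsigned int y)
{
    return (x << 5) + y;    // shift + add: hits the slow shift path on GK104
}

__device__ __forceinline__ unsigned int mad5_add(unsigned int x, unsigned int y)
{
    return x * 32u + y;     // same result, expressible as one integer mad
}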
 
Seems like the compiler would then want to examine the potential to favor integer MADDs over left shifts, since they have 4x better throughput (and can pair with ADDs, being an alternative to a useful left shift + insert operation). It's not intuitive, but I've actually done that sort of thing in NEON code.

But the highest-throughput int MADD you can probably get is the 24-bit one, and the problem BTC deals with really wants 32-bit int shifts.

The fastest implementation you could build is probably one that splits the 32-bit words into 16-bit halves, but then you are going to at least quadruple the number of ops and double the amount of state per thread. The state is probably the bigger hit.

I have actually always been a bit puzzled as to exactly why AMD gpus are as good at 32-bit shifts as they are. There really isn't any use that justifies the expenditure outside crypto. Is AMD the main supplier to NSA or something?
 
I have actually always been a bit puzzled as to exactly why AMD gpus are as good at 32-bit shifts as they are. There really isn't any use that justifies the expenditure outside crypto. Is AMD the main supplier to NSA or something?
With the bitalign instruction you can basically do bit shifts of 64-bit data (though it delivers only 32 bits of the result) at full rate on AMD GPUs (since Cypress; the R700 generation had only the normal shifts at full rate, but that was already a huge jump over R600/RV670, where bit shifts executed at 1/5 rate [only in the t slot]). You can also use this for full-speed rotates. [strike]AFAIK nVidia added this instruction with Fermi, too. But maybe it's slower (executed on the SFUs? Implemented only in a part of the vALUs? Only DP/IMUL32 throughput?) or they added it only as a macro consisting of multiple native instructions, no idea.[/strike] [edit]nV GPUs have only the BFE instruction, not bitalign and also not BFI, as Man from Atlantis pointed out[/edit] [edit2]According to the documentation, starting with Fermi they do have a BFI instruction; just the bitalign is missing compared to AMD.[/edit2] And don't forget that the HD 5870 and HD 6970 had a higher peak arithmetic performance than GF100/110 either way.
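For anyone who hasn't used it: as I understand the cl_amd_media_ops extension, amd_bitalign concatenates two 32-bit sources and returns the 32-bit window shifted right by the third operand, so a rotate is simply the instruction applied to the same register twice. A reference emulation in plain CUDA C, just to illustrate the semantics (it obviously doesn't make anything fast on nVidia hardware):

Code:
// Reference emulation of AMD's bitalign semantics (cl_amd_media_ops,
// amd_bitalign): concatenate hi:lo and return the 32-bit window shifted
// right by n & 31.  A rotate is the same operation with hi == lo.
__device__ __forceinline__ unsigned int bitalign_ref(unsigned int hi,
                                                     unsigned int lo,
                                                     unsigned int n)
{
    unsigned long long pair = ((unsigned long long)hi << 32) | lo;
    return (unsigned int)(pair >> (n & 31u));
}

__device__ __forceinline__ unsigned int rotr_ref(unsigned int x, unsigned int n)
{
    return bitalign_ref(x, x, n);   // full-rate rotate on Cypress and later
}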

As to the reason, I always thought that bit manipulation instructions are quite cheap, except maybe for the shifts. But AMD obviously thought it was little enough effort to put them in at full speed. Maybe someone can enlighten us as to how much a 32-bit shift unit costs compared to an FMA unit?
 
There is some talk that nvidia's lack of BFI_INT and int rotate instructions makes them significantly slower than AMD's...
I guess GK104 will absolutely stink at bitcoin or cryptographic stuff if there is no hidden magic coming to the rescue. I've just seen in nV's documentation that 32-bit integer shifts are supposed to run at only a 1/24 rate (8 operations per clock cycle per SMX, same as double precision). For some extremely strange reason, that is slower than 32-bit integer multiplication (1/6 rate), so one could try to replace shifts with multiplications where possible.
Comparing GF100/110 : GF104/114 : GK104 per clock cycle (for the whole GPU, taking the hotclock for Fermi), the instruction issue rates relate as 4:2:1 for 32-bit integer shifts and 2:1:2 for 32-bit integer multiplies, making this stuff really slow on GK104 (and you still have to factor in the lower clock speed of Kepler); shifts run at only about 1/3 of the speed of a GF114! Only integer adds are significantly faster.
 
Maybe someone can enlighten us as to how much a 32-bit shift unit costs compared to an FMA unit?

Bit shift units, especially ones that can operate at 4-cycle latency, are really, really cheap compared to FMA. Basically, having a name for the instruction and the paths to send operands to it are going to be more expensive than the actual shift hardware.

The thing is, throughput loads that use shifts don't really exist outside crypto. Which makes the existence of the instruction strange. I find it entirely believable that AMD added it for a single client.
 
The thing is, throughput loads that use shifts don't really exist outside crypto. Which makes the existence of the instruction strange. I find it entirely believable that AMD added it for a single client.
As said above, AMD already made the shifts fast with the R700 generation (RV770 was doing shifts ~12 times as fast as RV670); Cypress only added to that by also enabling full-speed rotates and by simplifying shifts of wider data with the bitalign instruction. If there had been a specific customer, I guess they would have done it only for RV770, not for the entire line (DP was also RV770-only). I think they did it mainly because it was cheap and it even simplifies some things, because everything (the ALUs) gets more symmetric.
 
I guess GK104 will absolutely stink at bitcoin or cryptographic stuff if there is no hidden magic coming to the rescue.
No need to guess, Kaotik posted some GPGPU benchmarks over in the 7970 thread, which include results from a bitcoin miner:

http://muropaketti.com/artikkelit/naytonohjaimet/gpgpu-suorituskyky-amd-vs-nvidia,2 (2nd benchmark)

It isn't at "absolutely stink" level, but it's still disappointing. If GK110 adds the missing instruction at full speed, it should beat a 7970.
 