NVIDIA Kepler speculation thread

As I understood it, he is suggesting the compiler has some say in which other warp to switch to.
I'm saying that the compiler is optimising ALU scheduling at the instruction level, and that it's NVIDIA's plan to pre-program the memory hierarchy usage pattern proactively, to minimise the effort programmers spend on getting the hardware to use the memory hierarchy optimally.

E.g. using pre-fetch so that data is in cache rather than DDR, which has the effect of improving ALU throughput.

No different than the kinds of techniques we've discussed with respect to Larrabee.
 
Fat lot of good that would do; here you can usually get away with paying no tax from Newegg, e.g. $499=$499.

Generally you only have to pay tax if you live in one of the few states where the online retailer has a physical distribution center.
For us in Germany, and AFAICT in Europe in general, consumers cannot avoid tax, so € 499 is the normal price.
Cheapest 7970 here is ~440, cheapest 680 is ~480 right now; both readily available.

I didn't know that added tax is not mandatory in the US.
 
For us in Germany, and AFAICT in Europe in general, consumers cannot avoid tax, so € 499 is the normal price.
Cheapest 7970 here is ~440, cheapest 680 is ~480 right now; both readily available.

I didn't know that added tax is not mandatory in the US.

Just on online sales, where the retailer does not have a physical presence in your state...even then there is pressure to end that loophole, but so far it stands.

So if Newegg has distribution centers only in NJ, Cali, and Tennessee, you only pay tax if you live in one of those 3 states...OTOH, if you buy from Walmart.com you will always pay tax because there are Walmarts everywhere...of course at all brick and mortar stores, tax is mandatory...
 
Just on online sales, where the retailer does not have a physical presence in your state...even then there is pressure to end that loophole, but so far it stands.

So if Newegg has distribution centers only in NJ, Cali, and Tennessee, you only pay tax if you live in one of those 3 states...OTOH, if you buy from Walmart.com you will always pay tax because there are Walmarts everywhere...of course at all brick and mortar stores, tax is mandatory...

Plus, sales tax rates vary considerably from state to state: http://en.wikipedia.org/wiki/US_sales_tax#By_jurisdiction
 
So what you are saying boils down to the compiler inserting some kind of "don't deschedule this warp" flags in instructions to help with superscalar issue, say to keep in-pipe registers from being corrupted. GCN avoids this by not having dual issue at all.
In principle, it has nothing to do with dual issue. AFAIU, Kepler basically uses some kind of latency or dependency counters. The compiler knows the latency of each operation (it is fixed and not somewhat variable as in Fermi). It tries to reorder the instructions to insert as many independent instructions as possible between dependent ops. So far so simple.
An instruction that depends on another one is encoded by the compiler with information about this dependency in the instruction stream (which instruction it depends on and how many cycles need to pass between the issue of the two ops). The scheduler uses this information to simply count the cycles between the issue of instructions. If the dependent one is reached before the dependency cycle counter reaches zero, the warp gets blocked from further instruction issue (the scheduler selects another one). Basically you replace the scoreboard and the OoO dispatch of Fermi with a handful of simple counters (a few bits are enough) for ALU instructions to handle the dependencies.
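
To make the counter idea concrete, here is a toy Python sketch of how I picture it. Everything in it is an assumption for illustration: the `Instr`/`Warp` names, the `dep`/`gap` fields, the "first ready warp wins" policy and the 4-cycle gap are made up, and it is not NVIDIA's actual encoding or scheduler. It stores the issue cycle of each instruction instead of decrementing a counter, which amounts to the same check.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str
    dep: Optional[int] = None  # index of the producer instruction in this warp (hypothetical encoding)
    gap: int = 0               # compiler-known cycles required between producer issue and this issue


@dataclass
class Warp:
    wid: int
    program: List[Instr]
    pc: int = 0
    issued_at: List[int] = field(default_factory=list)  # issue cycle of each instruction so far

    def done(self) -> bool:
        return self.pc >= len(self.program)

    def ready(self, cycle: int) -> bool:
        """True if the next instruction may issue; no operand or register checks are made."""
        if self.done():
            return False
        ins = self.program[self.pc]
        if ins.dep is None:
            return True
        # Equivalent to "the dependency counter has reached zero".
        return cycle - self.issued_at[ins.dep] >= ins.gap


def run(warps: List[Warp], max_cycles: int = 100) -> List[Tuple[int, int, str]]:
    """Issue at most one instruction per cycle from the first ready warp."""
    trace = []
    for cycle in range(max_cycles):
        if all(w.done() for w in warps):
            break
        for w in warps:
            if w.ready(cycle):
                trace.append((cycle, w.wid, w.program[w.pc].name))
                w.issued_at.append(cycle)
                w.pc += 1
                break  # a blocked warp is skipped and another one is selected
    return trace


def demo() -> List[Tuple[int, int, str]]:
    # Each warp runs "a" and then "b", where "b" must wait 4 cycles after "a" issued.
    def program() -> List[Instr]:
        return [Instr("a"), Instr("b", dep=0, gap=4)]
    return run([Warp(0, program()), Warp(1, program())])


if __name__ == "__main__":
    for cycle, wid, name in demo():
        print(f"cycle {cycle}: warp {wid} issues {name}")
```

With only two warps the scheduler has nothing to issue in cycles 2 and 3, which is the gap more warps (or more independent instructions placed between the dependent pair by the compiler) would fill.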
 
GCN doesn't do this. A given wavefront cannot issue another instruction for itself until after the current instruction is done. For most SIMD ops, there is a 4-cycle wavefront where the SIMD scheduler physically cannot pick another instruction before the current one is finished.
From the GCN description, it seemed the GCN scheduler was able to execute multiple instructions from the same thread as long as they were independent (not OOO, but a sort of in-order 'look ahead'). Or are you referring to the lag time between issuing instructions to different SIMDs of the same GCN?

using pre-fetch so that data is in cache rather than DDR
What would you do if you have a sufficient number of 'if (a) [mem1] else [mem2]' in parallel threads? Prefetching (like using prefetchntx) makes sense as long as you are reasonably sure such data will be used... sooner or later. That's probably something that would happen way less on GPGPU than in rendering, I suppose.
 
So what you are saying boils down to the compiler inserting some kind of "don't deschedule this warp" flags in instructions to help with superscalar issue, say to keep in-pipe registers from being corrupted. GCN avoids this by not having dual issue at all.
It's not there for just superscalar issue.
Instruction execution was quoted as being 11 cycles. That's 11 cycles where the ALU scheduler needs to decide if it's going to stick with a warp or find a new one to decode instructions for.
It's not going to do so by checking the operands or register readiness of upcoming instructions for the same warp; instead it tracks the compiler's dependence information, which tells it when the warp will be active again.

I am curious about what the architecture does if the instruction information says there are no dependences, but the register identifiers do have a name dependence. Would the architecture stall an instruction if it detects a dependence, or would it blithely continue on?
The latter case would be consistent with a simplified classical non-interlocked VLIW. It could be assumed that the compiler knows what it's doing when it enters this state, but this won't lead to deterministic behavior unless the warp is guaranteed to issue and complete all instructions without other warps or system events injecting latency.

From the GCN description, it seemed the GCN scheduler was able to execute multiple instructions from the same thread as long as they were independent (not OOO, but a sort of in-order 'look ahead'). Or are you referring to the lag time between issuing instructions to different SIMDs of the same GCN?
AMD's presentations say one instruction per wave per cycle.
If multiple instructions are picked in a cycle, they come from different threads.
 
AMD's presentations say one instruction per wave per cycle.
If multiple instructions are picked in a cycle, they come from different threads.
Not sure I was clear, so let me spell it out.
I meant:
Code:
Cycle 0: SIMD0<-- Op0 from Thread 0, 4 cycles length
Cycle 1: SIMD1<-- Op1 from Thread 0 (Op0 and Op1 independent), 4 cycles length
Cycle 2: etc.
Where at most 1 instruction/cycle is issued, but nothing prevents scheduling further independent instructions from the same thread.
Whereas you seem to imply something like
Code:
Cycle 0: SIMD0<-- Op0 from Thread 0, 4 cycles length
Cycle 1: SIMD1<-- Op0 from Thread 1, 4 cycles length
Cycle 2/3: etc.
Cycle 4: SIMD0<-- Op1 from Thread 0
 
AMD's presentations say one instruction per wave per cycle.
If multiple instructions are picked in a cycle, they come from different threads.
Yes. And each wave is considered only every four cycles for instruction issue. And according to AMD, GCN can issue dependent vector instructions back-to-back, which means latency equals throughput (4 cycles, would like to have that tested though) and no dependency check is necessary at all for ALU instructions (vector to scalar has a fixed 4-cycle penalty and memory accesses are handled by the dependency counters).
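
A small Python sketch of that issue cadence, in case it helps. It assumes 4 SIMDs per CU visited round-robin (one SIMD per cycle) with waves pinned to their SIMD; the function name and the round-robin pick among waves on one SIMD are my own simplifications, not AMD's documented arbitration.

```python
from typing import List, Tuple


def gcn_issue_trace(waves_per_simd: int, instrs_per_wave: int,
                    num_simds: int = 4) -> List[Tuple[int, int, int, int]]:
    """Return (cycle, simd, wave, op_index) tuples for a toy GCN-style CU.

    One SIMD is considered per cycle, so each SIMD (and each wave on it)
    gets at most one issue slot every `num_simds` cycles.
    """
    pc = {(s, w): 0 for s in range(num_simds) for w in range(waves_per_simd)}
    next_wave = [0] * num_simds  # round-robin pointer per SIMD (assumed policy)
    total = num_simds * waves_per_simd * instrs_per_wave
    trace = []
    cycle = 0
    while len(trace) < total:
        simd = cycle % num_simds
        for k in range(waves_per_simd):
            w = (next_wave[simd] + k) % waves_per_simd
            if pc[(simd, w)] < instrs_per_wave:
                # No dependency check: the previous vector op of this wave
                # issued at least 4 cycles ago, so it is assumed complete.
                trace.append((cycle, simd, w, pc[(simd, w)]))
                pc[(simd, w)] += 1
                next_wave[simd] = (w + 1) % waves_per_simd
                break
        cycle += 1
    return trace


if __name__ == "__main__":
    for cycle, simd, wave, op in gcn_issue_trace(waves_per_simd=1, instrs_per_wave=2):
        print(f"Cycle {cycle}: SIMD{simd} <-- Op{op} from wave {wave}")
```

With one wave per SIMD, Op1 of the wave on SIMD0 issues at cycle 4, right after Op0 from cycle 0 has finished, which is the "latency equals throughput" case.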
 
Not sure I was clear, so let me spell it out.
I meant:
Code:
Cycle 0: SIMD0<-- Op0 from Thread 0, 4 cycles length
Cycle 1: SIMD1<-- Op1 from Thread 0 (Op0 and Op1 independent), 4 cycles length
Cycle 2: etc.
Where at most 1 instruction/cycle is issued, but nothing prevents scheduling further independent instructions from the same thread.
Whereas you seem to imply something like
Code:
Cycle 0: SIMD0<-- Op0 from Thread 0, 4 cycles length
Cycle 1: SIMD1<-- Op0 from Thread 1, 4 cycles length
Cycle 2/3: etc.
Cycle 4: SIMD0<-- Op1 from Thread 0
Threads are actually fixed to SIMDs, so 3dilettante's picture is right.
 
Threads are actually fixed to SIMDs, so 3dilettante's picture is right.
I totally missed that - the picture is way clearer now.
Hmmm, doesn't it then also require way more threads in flight to fill a GCN for the same computational power, with potentially more stalls waiting on the memory bus?
 
499 Euro is like $650 US. I'd rather pay $499 US. But maybe that's just me. Last I checked there were some on eBay for ~$599 US, which would still be $50 cheaper than getting it from Europe, it seems. There's no doubt Europe has better availability, as it did with the 7970 as well, and it's probably because the price winds up being much higher.

Without taxes it is around the same but you should also consider that the European market is much bigger than the North American market.

In Denmark you can get both the Radeon 7970 and Geforce 680 for around $535-545 without taxes.
 
I totally missed that - the picture is way clearer now.
Hmmm, doesn't it then also require way more threads in flight to fill a GCN for the same computational power, with potentially more stalls waiting on the memory bus?

The number of wavefronts per CU in GCN is modestly increased over the maximum number of wavefronts on a SIMD in the prior generation.
A GCN CU must have room for at least 40, or 10 per SIMD unit, while Cayman had 32 for its VLIW4 SIMD (RealWorldTech indicates there were global restrictions that kept it around 20 if averaged across the chip).

The SIMDs for the prior generation needed at least 2 threads or the SIMD would be idle every other cycle.
A CU needs to have at least 4 to get all SIMDs active.
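
Putting those numbers into a quick Python back-of-the-envelope, just to show the minimums and maximums side by side. The helper names are made up, the GCN function assumes waves are spread evenly over the 4 SIMDs, and memory stalls are ignored entirely.

```python
GCN_SIMDS_PER_CU = 4
GCN_MAX_WAVES_PER_CU = 40
CAYMAN_MAX_WAVES_PER_SIMD = 32


def gcn_active_simds(waves_on_cu: int) -> int:
    """SIMDs that have at least one wave, assuming waves are spread evenly."""
    return min(waves_on_cu, GCN_SIMDS_PER_CU)


def cayman_issue_utilisation(waves_on_simd: int) -> float:
    """Fraction of cycles the VLIW4 SIMD can issue: two waves interleave,
    so a single wave leaves it idle every other cycle."""
    return min(waves_on_simd, 2) / 2


if __name__ == "__main__":
    print("GCN max waves per SIMD:", GCN_MAX_WAVES_PER_CU // GCN_SIMDS_PER_CU)
    print("Cayman max waves per SIMD:", CAYMAN_MAX_WAVES_PER_SIMD)
    print("GCN SIMDs active with 1 wave on the CU:", gcn_active_simds(1))
    print("GCN SIMDs active with 4 waves on the CU:", gcn_active_simds(4))
    print("Cayman utilisation with 1 wave:", cayman_issue_utilisation(1))
    print("Cayman utilisation with 2 waves:", cayman_issue_utilisation(2))
```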

Except at the lowest count or the highest, the implementations seem to behave similarly. There are fewer scenarios where either extreme would be employed.

Stalls on memory would need testing. The L1 for the prior generation has a hundred or so cycles for latency, and it was smaller.
I haven't seen similar pointer chasing tests done for GCN's larger L1, or latency measurements for the rest of the hierarchy.
 
Without taxes it is around the same but you should also consider that the European market is much bigger than the North American market.

Yup, the European market might be bigger, but I think no one considers it as a whole. There are no pan-European (r)e-tailers (like Newegg in the USA and Canada).
Maybe that's why all electronics prices are much lower in the USA: the North American market is unified, and it is not dominated by small e-tailers (ok, some may exist, but they are not the only ones) that have small turnover and so look for a bigger profit on each unit sold.

In Denmark you can get both the Radeon 7970 and Geforce 680 for around $535-545 without taxes.

Also, it's not logical that in most countries around Europe, which have a much poorer standard of living than Canada or the USA, everything is so much more expensive.
 
You mean Eastern Europe, Middle East and North Africa?

Not only those, but Central European countries, Germany (especially German engineers who go to the USA to work there :LOL: ), France, Italy, even Israel, Dubai...
Ok, some of these countries might be rich, but there is something special in USA which makes people wanna go there and not somewhere else. ;)
 
Yup, the European market might be bigger, but I think no one considers it as a whole. There are no pan-European (r)e-tailers (like Newegg in the USA and Canada).
Maybe that's why all electronics prices are much lower in the USA: the North American market is unified, and it is not dominated by small e-tailers (ok, some may exist, but they are not the only ones) that have small turnover and so look for a bigger profit on each unit sold.



Also, it's not logical that in most countries around Europe, which have a much poorer standard of living than Canada or the USA, everything is so much more expensive.

The distributors are the same, so yes, the entire European Market is basically one big pot of gold.

Regardless, this is getting off-topic ;)

But anyone in Europe would really argue with the claim of a poorer standard of living, when we have public healthcare systems and safety nets for the weakest in society. Regardless, the graphics cards are the same price before value added taxes are applied.
 