NVIDIA shows signs ... [2008 - 2017]

I can't see how a G200 can be fast with only 1/32 of its raw performance...
30 SIMDs running at 1.5GHz is still pretty fast, though you could argue it's no faster than ~3GHz 4-core SSE with perfect SIMD utilisation (scalar code on the GPU would be 4x faster). The GPU has more bandwidth to play with.

Flat or nested? What cycle counts for the alternate paths? What kinds of scatter/gather operations is the code doing?

For a trivial Mandelbrot it's simple, I will get some data when time permits... Unfortunately I have to work tomorrow...

Just to remember, the PS was faster than the scalar CS, right?
Looking at the code and from what I remember, the vector CS is faster than the scalar CS on ATI. The PS is scalar only, I believe.

It's a bool :D
Current hardware has no support for a real predicate register usable on a per-ALU basis; having it in hardware would also allow implementing the trick you said doesn't increase performance.
All the GPUs have real predicates. ATI has a stack of predicates. NVidia has predicate registers. I don't understand the distinction you're trying to make.

:???: It may avoid a performance hit of up to 75%, so may increase the performance up to 4 times.
Sorry, you're right.

So, they did a double-precision multiplication using 1 int multiplier and 2 FP multipliers; maybe they already did what I was describing :smile:
It'd be good to work out how Fermi does DP. If it really has 32 SP units and 32 integer units per core that are entirely distinct (sharing register file ports and not overlapped) then something along those lines sounds feasible.

It would be funny if Fermi and Larrabee do SP, DP and INT the same way.

AMD probably did a long multiplication using each of the multipliers in the thin ALUs to do part of it; this is the simplest method and requires few extra transistors. But a 2x-precision multiplier could be done with just three 1x-precision multipliers. The two methods that make sense to me are Karatsuba and the long multiplication with the lowest partial product replaced by a good guess; both require some extra hardware, but cheaper than a full multiplier.
http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

The optimal multiple-precision unit described here is 3.7x the size of the single-precision unit. It's 18% bigger than the optimal double-precision only unit. The double-precision only unit is 3.1x the size of the single-precision unit.

So people already talked about some of what I described here :smile:
Hmm, I guess the 16->4 granularity I was describing is like the 256->64 granularity you're describing.

Jawed
 
30 SIMDs running at 1.5GHz is still pretty fast
Since each warp executes over 4 clocks and we are using only 1 of them, it's more like 30 scalar cores running at 375MHz, and they need 6 threads per core (is that for GTX285?) just to hide execution latency; CPUs do better than that...

Flat or nested?
Varies a lot, sometimes none.

What cycle counts for the alternate paths?
Varies too much.

What kinds of scatter/gather operations is the code doing?
Well... Currently I don't remember any code where it matters... Do you remember any?

Looking at the code and from what I remember, the vector CS is faster than the scalar CS on ATI. The PS is scalar only, I believe.
Yes, the vector is faster than scalar and the PS is scalar only. I was referring to the speed difference between the PS and the CS in the scalar version; forget it, I made some mistakes...

Anyway I also did simple simulations today:

[chart: simdsizeandorder.png, efficiency vs. warp/wavefront width for row-major and Z-order]


Simple, ugly, but better than nothing or than just guessing. I took it from a trivial Mandelbrot of 1024x1024 pixels: I counted the number of loop iterations of each thread to get the number of unmasked executions, then divided by the number of iterations of a given warp/wavefront. The numbers at the bottom are the width of the warp/wavefront, the numbers in the column are the efficiency/occupancy. I also tested two patterns, a row-major order and a Z-order, just for information.
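In case anyone wants to reproduce it, here's a rough Python reconstruction of that measurement (my own sketch: the view window, iteration cap, resolution and function names below are assumptions for illustration, not necessarily what was used for the chart).

Code:
# Rough reconstruction of the kind of measurement described above (my own
# sketch; settings are assumptions, not necessarily those used for the chart).
import numpy as np

W = H = 256          # use 1024 to match the chart (slow in pure Python)
MAX_ITER = 256

def iterations(px, py):
    """Escape-time iteration count for one pixel over an assumed -2..1 x -1.5..1.5 view."""
    c = complex(-2.0 + 3.0 * px / W, -1.5 + 3.0 * py / H)
    z = 0j
    for i in range(MAX_ITER):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return MAX_ITER

def z_order(px, py):
    """Interleave the bits of px and py to get a Morton (Z-order) index."""
    idx = 0
    for bit in range(16):
        idx |= ((px >> bit) & 1) << (2 * bit)
        idx |= ((py >> bit) & 1) << (2 * bit + 1)
    return idx

# Per-thread iteration counts, in row-major launch order.
iters = np.array([iterations(p % W, p // W) for p in range(W * H)])
# The same counts reordered along a Z-order curve.
iters_z = iters[np.argsort([z_order(p % W, p // W) for p in range(W * H)])]

def efficiency(counts, width):
    """Unmasked iterations / iterations issued when 'width' threads run in lockstep."""
    groups = counts.reshape(-1, width)
    return groups.sum() / (groups.max(axis=1).sum() * width)

for width in (16, 32, 64, 256):
    print(width, efficiency(iters, width), efficiency(iters_z, width))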

Looking at the chart, the first thing to note is that moving away from plain row-major order for 2D arrays is a no-brainer :D locality is important. The curve in this chart is related to the resolution I chose.

The second thing, looking at the Z-order line, is the performance gain nVidia would get from using 32 threads per warp instead of 64, and the performance drop from vectorizing the code on AMD hardware (and so increasing the granularity from 64 to 256): respectively 3.79% and 11.4%. That's hardly a big advantage for nVidia, and also hardly an excuse for not vectorizing code with less than 80% ALU occupancy.

Increasing the resolution increases efficiency in all cases, but I won't plot another chart today. Tomorrow I will try to add some methods for mitigating branch divergence; any bids?

About Voxille's Mandelbrot: due to the loop unrolling I don't expect it to suffer as much from branch divergence as the trivial implementation does.

The raw data:
Code:
                1024x1024                 4096x4096
Width    RowMajor     Zorder       RowMajor     Zorder
    2    98.75399%    98.75399%    99.42808%    99.42808%
    4    97.00922%    97.43842%    98.6191%     98.80145%
    8    94.20914%    95.67982%    97.43809%    97.96538%
   16    89.6497%     93.6318%     95.65114%    97.0504%
   32    82.99384%    90.6495%     92.74299%    95.80479%
   64    74.42341%    87.33846%    88.18603%    94.32423%
  128    61.60484%    82.94035%    81.49467%    92.22841%
  256    41.60408%    77.61035%    73.10745%    89.71968%
  512    24.65091%    72.02141%    60.45533%    85.88808%
 1024    21.14622%    65.19784%    40.78072%    81.84855%
 2048    20.80997%    57.08746%    24.42915%    76.54399%
 4096    20.1047%     49.18869%    21.02423%    71.98172%
 8192    19.28069%    41.28873%    20.83935%    65.88856%
16384    18.3604%     32.75394%    20.49293%    57.56558%

All the GPUs have real predicates. ATI has a stack of predicates. NVidia has predicate registers. I don't understand the distinction you're trying to make.
It doesn't have a predicate that lets me execute the instruction from one ALU but not from another in the VLIW group. A predicate there would allow two threads to execute simultaneously, each using part of the execution units in the VLIW; note that it only makes sense in the context of trying to mitigate branch divergence penalties.

http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

The optimal multiple-precision unit described here is 3.7x the size of the single-precision unit. It's 18% bigger than the optimal double-precision only unit. The double-precision only unit is 3.1x the size of the single-precision unit.
Optimal for a given task ;)

The paper goes the reverse path from what we are discussing: it has a full DP multiplier and wants to reuse it for vector SP, while we are discussing how to use SP multipliers to perform a DP, which is good for x86, maybe not so useful for GPUs.

We could go a level lower, to how to build multipliers from scratch that can perform either one DP or x SP operations. I'm still arguing that it's possible to achieve one DP or 3 SP operations with few extra resources compared to a DP-only multiplier or 3 independent SP multipliers (and both happen to cost almost the same), and also that this ratio is the most cost-effective.
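To make the three-multiply idea concrete, here's a minimal sketch (just the Karatsuba identity on a split 53-bit integer mantissa; the split point and function name are my own choices for illustration, not a claim about how any actual hardware wires it up):

Code:
# A minimal sketch of the three-multiply idea (Karatsuba on a split 53-bit
# integer mantissa; not a description of any vendor's datapath).
import random

B = 1 << 27          # assumed split point for a 53-bit DP mantissa

def karatsuba_dp_mul(x, y):
    """Multiply two 53-bit integers with three ~27-bit multiplications."""
    x1, x0 = divmod(x, B)
    y1, y0 = divmod(y, B)
    hi  = x1 * y1                            # 1st "SP-sized" multiply
    lo  = x0 * y0                            # 2nd "SP-sized" multiply
    mid = (x1 + x0) * (y1 + y0) - hi - lo    # 3rd multiply; operands are one bit
                                             # wider, part of the "extra hardware"
    return hi * B * B + mid * B + lo

# Sanity check against a full-width multiply.
x = random.getrandbits(53)
y = random.getrandbits(53)
assert karatsuba_dp_mul(x, y) == x * y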

BTW, when designing a multiplier it's possible to make some area-latency tradeoffs; some tricks that are good for SSE CPUs may not be good for GPUs.

Hmm, I guess the 16->4 granularity I was describing is like the 256->64 granularity you're describing.
Kind of.
 
It'd be good to work out how Fermi does DP. If it really has 32 SP units and 32 integer units per core that are entirely distinct (sharing register file ports and not overlapped) then something along those lines sounds feasible.
Nvidia seems quite tight-lipped about anything Fermi beyond the whitepaper at the moment.

Considering what Keane said at GTC, I think it's not totally out of the question that they have an expensive and a cheap set of 16 ALUs in each SM (what before would have been a TPC), the expensive one being able to occupy ports from the cheap one to do DP, while also doubling as an additional SP unit during normal operation.

That's at least something that'd make sense considering the "even more modular" approach Fermi's taking and Nvidia supposedly being able to easily re-do a gaming-oriented Fermi with fewer transistors.
 
I was wondering how the above in-depth technical discussion proves how strained Nvidia are. Can anyone explain this to me?
 
Just came across this short article at VR-Zone, "Jen-Hsun Huang: 'Nvidia is a Software Company', Nvidia's future, Fermi", which was summarised from Jen-Hsun's interview by CHW [link @CHW].

This is quoted from VR-Zone's article...
Yes, JHH does talk about Fermi, and the reason why Fermi is largely targeted towards the Workstation customers is not just future promise, but also present fact. Though gaming GPUs make up more than 2/3rd of Nvidia sales, the high margin Quadro products bring in 2/3rd of Nvidia's profit.

Couple this with the massive potential in HPC based Tesla products, and we have a clear picture as to why Nvidia's Fermi brings about major improvements in computing and workstation graphics rather than gaming.

However, in the most interesting part of the interview, JHH insists the new graphics capabilities of Fermi have not been revealed yet. Fermi was announced at GTC - which is a parallel computing event rather than a gaming one - which is why the demonstration focused on the computing aspects of Fermi. It is possible, as JHH puts it, that there are a lot of "other exciting secrets" about Fermi that will be revealed later. There was no talk of release dates, however.
Perhaps those high-margin Quadro products were behind the design decisions of Fermi.

Hope I put it on the right thread ;)
 
I was wondering how the above in-depth technical discussion proves how strained Nvidia are. Can anyone explain this to me?

Remember, remember the fifth of November.
NVIDIA will hold their Q3 2010 earnings call on Nov 5, 5:00 pm Eastern time. That should shed some light on their situation.
 
Since each warp executes over 4 clocks and we are using only 1 of them, it's more like 30 scalar cores running at 375MHz, and they need 6 threads per core (is that for GTX285?) just to hide execution latency; CPUs do better than that...
Whoops, yes. So that's down to approximately the rate of scalars in a 4 core CPU. As for the threads, well we'd be talking about something that's got some kind of data parallel feel to it, otherwise it wouldn't be on the GPU.

Varies a lot, sometimes none. [...] Varies too much.
That's a measure of the infancy of this stuff, still no real analysis.

Well... Currently I don't remember any code where it matters... Do you remember any?
Ray tracing is a nice, long-standing, example.

Anyway I also did simple simulations today:
Nice :D

Looking at the chart, the first thing to note is that moving away from plain row-major order for 2D arrays is a no-brainer :D locality is important. The curve in this chart is related to the resolution I chose.
The row major order is more "random" I suppose and is therefore more interesting for the general case. I suppose the Z order's speed-up over row major is something like the gain one would expect to see from using DWF.

The second thing, looking at the Z-order line, is the performance gain nVidia would get from using 32 threads per warp instead of 64, and the performance drop from vectorizing the code on AMD hardware (and so increasing the granularity from 64 to 256): respectively 3.79% and 11.4%. That's hardly a big advantage for nVidia, and also hardly an excuse for not vectorizing code with less than 80% ALU occupancy.
To me the row major order difference between 16 and 64 is the interesting one. That's Larrabee versus ATI on something that's reasonably random. That's 20% better for Larrabee. And nesting would make that difference grow massively, presuming there's a reasonable difference in path lengths. The Julia application seems to be nested 4 deep at most (looking at the D3D and ATI assembly) - two IFs then two loops, all nested.

Increasing the resolution increases efficiency in all cases, but I won't plot another chart today. Tomorrow I will try to add some methods for mitigating branch divergence; any bids?
How many times is the hardware allowed to re-combine? :p

Optimal for a given task ;)

The paper goes the reverse path from what we are discussing: it has a full DP multiplier and wants to reuse it for vector SP, while we are discussing how to use SP multipliers to perform a DP, which is good for x86, maybe not so useful for GPUs.
It implements 2-SP/1-DP. Seems equally applicable to CPUs and GPUs (though x86 has extended precision, 80-bit, as well). I quoted it because it gives a reasonable baseline for a comparison with what you're proposing and an indication of the cost of making something multi-functional.

We could go a level lower, to how to build multipliers from scratch that can perform either one DP or x SP operations. I'm still arguing that it's possible to achieve one DP or 3 SP operations with few extra resources compared to a DP-only multiplier or 3 independent SP multipliers (and both happen to cost almost the same), and also that this ratio is the most cost-effective.
It's definitely interesting and it seems possible that it wouldn't be dramatically more expensive than the 1/4 rate we see in ATI right now. Though I'm still puzzled what it is about the current implementation that is so expensive that it is worthwhile cutting it out of the smaller chips. Maybe it's just marketing/differentiation.

BTW, when designing a multiplier it's possible to make some area-latency tradeoffs; some tricks that are good for SSE CPUs may not be good for GPUs.
ATI's execution pipeline appears to be only 6 cycles at most (if the MUL starts after operands 1 and 2 have arrived while operand 3 is being fetched) so there isn't much breathing room. Also SSE doesn't have FMA, and a quick Google:

http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/61121/

reveals fundamental uncertainties about how to add it...

Jawed
 
Whoops, yes. So that's down to approximately the rate of scalars in a 4 core CPU. As for the threads, well we'd be talking about something that's got some kind of data parallel feel to it, otherwise it wouldn't be on the GPU.
Well... It's harder to scale to more threads... But OK, in some cases it may still be fast.

And this reminds me of Niagara...

The row major order is more "random" I suppose and is therefore more interesting for the general case.
Row-major takes a hit due to the shape of the Mandelbrot set: using long rectangles instead of squares increases the chance of crossing black regions (even if they don't take a big area, they go from top to bottom), so it's not exactly random or unavoidable.

And... Well... I need more apps to see how common it is for others :)

I suppose the Z order's speed-up over row major is something like the gain one would expect to see from using DWF.
I didn't have time today to simulate it, let's see tomorrow; would you like to propose some algorithms to do it? Preliminary results are not what you are expecting :)
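For what it's worth, one strawman policy that could go into the simulator (my own suggestion, with illustrative names; not a description of what real DWF hardware does): treat a larger pool of threads as re-packable, and at every loop iteration compact the still-active threads into as few warps/wavefronts as possible.

Code:
# A strawman re-packing policy for the simulator (my own suggestion, not a
# description of real DWF hardware): threads within a pool may be re-packed
# into warps at every loop iteration, so each step only issues
# ceil(active / warp_size) warps.
import math
import numpy as np

def repacked_efficiency(counts, warp_size, pool_size):
    """Efficiency when divergent threads can be re-packed within a pool."""
    useful = int(counts.sum())
    issued = 0
    for pool in counts.reshape(-1, pool_size):
        max_iter = int(pool.max())
        for k in range(max_iter):
            active = int((pool > k).sum())              # threads still looping at step k
            issued += math.ceil(active / warp_size) * warp_size
    return useful / issued

Fed with the per-pixel iteration counts from the earlier sketch, something like repacked_efficiency(iters_z, 64, 256) would estimate how much a 64-wide wavefront gets back by re-packing within a 256-thread pool.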

To me the row major order difference between 16 and 64 is the interesting one. That's Larrabee versus ATI on something that's reasonably random. That's 20% better for Larrabee. And nesting would make that difference grow massively, presuming there's a reasonable difference in path lengths. The Julia application seems to be nested 4 deep at most (looking at the D3D and ATI assembly) - two IFs then two loops, all nested.
I would ask "how much does the front end cost? Does this gain pay for it?" But then I remembered that Larrabee is x86, with a greater overhead on the front end than both AMD and nVidia, and also thought about the architectural differences between them:
1) Having more overhead means the sweet spot moves to the wider SIMD side;
2) Slow gather (as apparently it is...) means that if a 2D buffer isn't stored in memory in Z-order it will be very slow to load it that way; if apps' memory accesses behave like Mandelbrot's, wider SIMDs would take a serious performance hit;
3) Not being clause-based could result in a high number of idle non-computation units (control flow, load/store);
4) Even if it's not so wide compared to the others, it's the widest SIMD Intel has developed to date.

Even if 1 favors wider vectors, 2 and 3 go against it, and they may not be willing to take the risk due to 4, so the reason for the width choice may have no relation to branching.

How many times is the hardware allowed to re-combine? :p
I could try many different forms.

ATI's execution pipeline appears to be only 6 cycles at most (if the MUL starts after operands 1 and 2 have arrived while operand 3 is being fetched) so there isn't much breathing room.
6 cycles at 850MHz is about 5.3 times more time than 4 cycles at the 3GHz of a K8.

Also SSE doesn't have FMA, and a quick Google:

http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/61121/

reveals fundamental uncertainties about how to add it...
For Intel, that is; everyone else already has FMA (IIRC POWER6 has it and does it in 5 cycles at 5GHz), and AMD is set to add it regardless of Intel's problems.
 
Don't suppose there's any way for a mod to split off the recent technical discussion and put it in the appropriate forum? It's fascinating stuff but isn't exactly relevant to this thread anymore.

Regards,
SB
 
Row-major takes a hit due to the shape of the Mandelbrot set: using long rectangles instead of squares increases the chance of crossing black regions (even if they don't take a big area, they go from top to bottom), so it's not exactly random or unavoidable.
I think worst-cases are more interesting...

And... Well... I need more apps to see how common it is for others :)
As well as Julia I suppose it's possible to rummage in the CUDA/Stream SDKs. Fishing around in GPGPU implementations that haven't scaled usefully might be worthwhile.

I didn't have time today to simulate it, let's see tomorrow; would you like to propose some algorithms to do it? Preliminary results are not what you are expecting :)
I'm not sure how detailed your simulation is, but constraints like strands having to retain their position when merged (e.g. modulo SIMD width or register file width), or merging being limited to 8 or 16 threads, could cause some grief.

I would ask "how much does the front end cost? Does this gain pay for it?" But then I remembered that Larrabee is x86, with a greater overhead on the front end than both AMD and nVidia, and also thought about the architectural differences between them:
1) Having more overhead means the sweet spot moves to the wider SIMD side;
Hmm, well it seems Fermi is generalising away from graphics, towards Larrabee - big L1 (bigger than Larrabee) is one indication. But Larrabee has MB of L2 cache.

2) Slow gather (as apparently it is...) means that if a 2D buffer isn't stored in memory in Z-order it will be very slow to load it that way; if apps' memory accesses behave like Mandelbrot's, wider SIMDs would take a serious performance hit;
Larrabee doesn't favour Z order, architecturally, except for texturing I reckon. Z order is a pain in the arse for applications that don't easily match that.

3) Not being clause-based could result in a high number of idle non-computation units (control flow, load/store);
Ha, you can't keep everything running at 100% utilisation. The Sequencer in ATI should be idle most of the time in any reasonably complex kernel. Anything that's compute bound will make load/store idle for some of the time, too. The scalar part of each Larrabee core is arguably more flexible than the Sequencer in ATI.

4) Even if it's not so wide compared to the others, it's the widest SIMD Intel has developed to date.
Just got to wait for the die shot that'll reveal the balance between x86, SIMD and cache.

I think NVidia and AMD have no choice but to move towards the kind of generality in Larrabee as the applications get more complex.

Even if 1 favors wider vectors, 2 and 3 go against it, and they may not be willing to take the risk due to 4, so the reason for the width choice may have no relation to branching.
There's also the size of memory transactions, 512-bit cache lines and ring bus that all go with the SIMD width. And register file organisation.

6 cycles at 850MHz is about 5.3 times more time than 4 cycles at the 3GHz of a K8.
That paper suggests 3 cycles for the combined DP/SP unit, which seems pretty nifty. ATI might be bound by the latency of transcendentals, not MAD/FMA. I've got no idea of typical cost/latency trade-offs.

Jawed
 

I don't think there are any real uncertainties. I think the issue has to do with benefits and costs and quite a few of those are unique to Intel's situation. It takes a really long time for people to start using new instructions.

Intel knows how to do FMA, they have designed those units - and there are papers on doing FMA with an intermediate round for backwards compatibility (targeted at 2.5GHz for Rock).

I think if you read between the lines of what Mark said, you'll figure out what the problems are...

David
 
GPU shipments grew 21.2% in Q3 '09

Jon Peddie Research have revealed GPU shipment figures for Q3 '09, and the results are quite amazing. Total GPU shipments grew an immense 21.2% over Q2 '09. This is the highest quarter-to-quarter growth in nine years!

The big gainers were AMD and Intel, experiencing 30.2% and 25.2% growth over Q2, respectively. Nvidia trailed behind, but still chalked up an increase of 3.3%.

Impressively, both AMD and Intel registered increases over pre-recession Q3 '08. Nvidia suffered a -4% decrease, but this is still much improved from the largely negative results of the previous quarters.

While AMD have been very competitive in providing the best value product since the HD 4800 series, Nvidia have been determined not to surrender any market share. Aggressive marketing, price cuts and die shrinks combined, Nvidia have held on to their dominant market share.

However, Q3 '09 sees Nvidia's market share finally succumb in alarming ways. Nvidia's market share dropped a full 4.3% over just one quarter, from 29.2% to 24.9%. Intel and AMD were both winners, increasing their market share to 52.7% and 19.8% respectively.

JPR do not expect Q4 results to be quite as dramatic, suggesting the channel had already been supplied before Q4 started.

Overall, the GPU results for Q3 have been quite staggering. To say that the GPU market seems well on its way to recovery would be an understatement. Shipments have, in fact, grown since pre-recession quarters!

Of course, these results reflect shipments from the entire GPU market - both discrete AIB and IGP. Intel have long dominated, thanks to their affordable IGP motherboards that find their way into many homes, especially in highly populated developing countries.

We can expect AIB results to follow shortly.
http://vr-zone.com/articles/gpu-shipments-grew-21.2-in-q3-09/7936.html?doc=7936
http://techreport.com/discussions.x/17834
 
Interesting that it adds up to 97% in the most recent quarter and 98% in the quarter before. Someone else was the winner really :)

According to BSoN

Largest gains were achieved by SiS and S3 who experienced staggering growth in emerging markets such as APAC [Asia-Pacific]. The effects of Chinese stimulus package resulted in 209.3% quarter-on-quarter growth for SiS, rising up to 1.1% of overall world market [1.31 million units]. VIA/S3 also came back from the grave, taking 1.5% of the market [1.79 million units], up from 0.8% the quarter before.
The biggest loser is now Matrox Graphics, dipping below a single decimal percentage and now takes 0.0% of the market.
 
According to BSoN

Oh I don't look at that site. :)

Anyway, in a funny turn, someone earlier linked to Newegg to show that Nvidia was discontinuing products because most were out of stock
http://www.newegg.com/Product/Produ...0048 106793261 1067949754&name=Radeon HD 5870

Oh noes! Discontinued :devilish:

Actually that is obviously not the case, but the timing of all this is annoying me, as I wanted to get a card before Christmas (which for me means I want some competition or prices to decline); but I guess I can manage to wait.
 
Anyway, in a funny turn, someone earlier linked to Newegg to show that Nvidia was discontinuing products because most were out of stock
http://www.newegg.com/Product/Produ...0048 106793261 1067949754&name=Radeon HD 5870

Oh noes! Discontinued :devilish:

What they were trying to prove with that link is that the prices haven't gone down for nVidia products, so either a GTX285 that is more expensive than an HD5850 is really, really, really competitive, or the product isn't really treated as a competitive part and they hope to minimize losses on brand sales.
 
Interesting, if Nvidia doesn't get something out for the holiday season, it's quite possible that ATI might match or surpass them in market share for Q4. Definitely NOT something I was expecting. Their Q3 numbers were far, FAR worse than I expected.

Considering it appeared (at least to me) that they were still competitive in the retail space, I can only imagine much of this is due to OEM sales.

Regards,
SB
 
What they were trying to prove with that link is that the prices haven't gone down for nVidia products, so either a GTX285 that is more expensive than an HD5850 is really, really, really competitive, or the product isn't really treated as a competitive part and they hope to minimize losses on brand sales.

And we already went over it. Newegg is not going to just hold them for fun unless someone else is paying them to. The longer they wait, the cheaper they will need to go to move them. Newegg has been in business for quite some time, so I figure they understand this. Either people who don't know better are still buying them, or someone is paying Newegg for the loss. There is no way they would just hoard them. Of course, it is possible they have one in stock and are just keeping it there for window dressing, I suppose.
 