NVIDIA GF100 & Friends speculation

Fermi being 6 months later than a retaped-out Cypress, and the midrange/mainstream part being even more delayed than Juniper, sounds like a very obvious indicator.

So what? Where were you to cry "compromising execution" when NV30 came out later and slower than R300, when R600 came out later and slower than G80, and when Larrabee never even came out? Clearly you don't know anything about the history of GPU design cycles for NVIDIA, ATI, and others.

And you're arguing that RV8XX's major enhancements wrt RV770 don't at all meet the requirements for GPU compute? That's rather daft.

Are you purposely trying to be dense, or is that simply the norm for you? I never said anything about RV7xx/8xx not meeting "requirements" (whatever that means). All I said was that GF100 has a very strong graphics and compute feature set.

Save the "ease of scaling to the lower end" claims until you see any midrange cards. In fact, Juniper was the first chip demonstrated, in mid-2009, first with Wolfenstein and later with Heaven. That's hard pre-launch proof.

Again, re-read what I said. I said that the GF100 architecture lends itself more to scaling down to lower-end designs compared to GT200. Anyone who claims otherwise is simply showing ignorance.


That's really just fluff. Is nVidia really doing the best they can?

You sound like an armchair critic (or perhaps someone who has an axe to grind?). Clearly you don't seem to know the first thing about new product development, especially with respect to high-end GPUs. If it were so easy to do with flawless execution, then everybody would be doing it, including Intel. ;)
 
Evergreen introduced some of this, see the ADD_PREV, BCNT_ACCUM_PREV_INT, MBCNT_32LO_ACCUM_PREV_INT, MUL_IEEE_PREV, MUL_PREV, MULADD_IEEE_PREV, MULADD_PREV, SAD_ACCUM_PREV_UINT instructions. The INTERP_* instructions also do combinations of ADDs and DOTs.
Oh, neat. The limitations suggest that they aren't doing what I suggested, so do you think they lengthened the ALU pipeline? I can't imagine that Cypress can do A*B*C with 8 cycles latency.

Which geometry operation is the bottleneck currently?
Nothing is a major bottleneck right now, but it's still there, particularly during framerate minimums. I also think it might be a bigger problem in the future. I'd say faster backface and frustum culling would be a great first step if not setup.
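
For reference, the per-triangle test itself is simple; a minimal OpenCL C sketch of a screen-space backface check (the helper name and the counter-clockwise front-face convention are my assumptions), just to illustrate the kind of per-triangle work whose throughput is at issue here:

Code:
/* Sketch: backface test via the signed area of the projected triangle.
   Assumes counter-clockwise winding for front faces. */
bool is_backfacing(float2 a, float2 b, float2 c)
{
    float area2 = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return area2 <= 0.0f;   /* zero-area (degenerate) triangles get culled too */
}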
 
Yes definitely and that's the point. I could write an app that doesn't use LDS at all but consistently reads randomly from global memory. This would probably run decently on Fermi but not so well on ATI (no L1$, only LDS).
Well if it's random then cache doesn't really make a difference ;)

But what I was trying to say was why couldn't ATI use its L2 for R/O buffers? And is L1 only for textures?
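
For concreteness, the kind of kernel described above might look something like this minimal OpenCL sketch (buffer and kernel names are made up): no __local memory at all, just data-dependent reads from a global buffer, so it lives or dies by whatever read caching the hardware provides.

Code:
__kernel void gather_random(__global const uint  *indices,  /* effectively random */
                            __global const float *data,
                            __global float       *out,
                            uint n)
{
    uint gid = get_global_id(0);
    if (gid >= n)
        return;

    /* Data-dependent read from global memory with no __local staging;
       this is the access pattern being discussed, where Fermi's
       general-purpose L1/L2 could help (whether ATI's caches are usable
       for this is the open question above). */
    out[gid] = data[indices[gid]];
}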
 
Again, re-read what I said. I said that the GF100 architecture lends itself more to scaling down to lower-end designs compared to GT200. Anyone who claims otherwise is an idiot.
Why is that? Before GF100, derivatives had half or one quarter the shading power but just as much geometry throughput. GF100 derivatives will be slower in all aspects, thus a greater step down. I don't see why it's so much easier to execute, either. Before GF100, there was basically no communication between clusters, so you could chop them off quite easily. GF100 has much more communication from the GPCs to the L2 and between the polymorph engines.

Thinking that GF100 is at least as hard as GT200 to scale down, if not more, is not idiotic at all.
 
With Fermi nearly here, I'm looking forward to a bit of fun with the histogramming "the words in Dickens' novels" problem that floored GTX285 a while back:

http://forum.beyond3d.com/showthread.php?p=1305053#post1305053

A little bit of competition with Cypress wouldn't go amiss either...

Jawed

This could be interesting :)
I ported my old 8-bit histogram code to OpenCL and did some experiments:

1. Simply using global atomics, doing an 8-bit histogram over 16 MB of data:

GTX 285: 0.273s
Radeon 5850: 0.043s

This means Cypress has insanely fast global atomics (probably with the help of the global data store?) compared to GT200.
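
The global-atomics variant is essentially one atomic increment per input byte into a 256-bin histogram in global memory; a minimal OpenCL sketch (not the actual code, names made up) looks something like this:

Code:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void hist8_global(__global const uchar *data,
                           __global uint *bins,   /* 256 bins, zeroed by the host */
                           uint n)
{
    uint gid = get_global_id(0);
    if (gid < n)
        atom_inc(&bins[data[gid]]);   /* one global atomic increment per byte */
}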

2. Using local atomics, split into 8 banks, doing an 8-bit histogram over 256 MB of data:

GTX 285: 0.031s
Radeon 5850: 0.041s

All times are from OpenCL's internal profiler. These are kernel execution times only, i.e. they do not include the time spent copying data to/from host memory.

The "8 bits number overflow to 32 bits number" trick does not work on Radeon 5850 because its OpenCL does not support byte addressable local memory.
 
Cypress is very fast at global atomics, but *much* faster at local atomics. I am surprised you don't get any benefit from using the LDS on the 5850
 
Why is that? Before GF100, derivatives had half or one quarter the shading power but just as much geometry throughput. GF100 derivatives will be slower in all aspects, thus a greater step down. I don't see why it's so much easier to execute, either. Before GF100, there was basically no communication between clusters, so you could chop them off quite easily. GF100 has much more communication from the GPCs to the L2 and between the polymorph engines.

Thinking that GF100 is at least as hard as GT200 to scale down, if not more, is not idiotic at all.

What I meant is that the GF100 architecture lends itself more to scaling down to lower-end designs in a timely, cost-effective, and efficient manner once the high-end GPU is ready, compared to GT200. After all, each GPC in GF100 is reportedly nearly a full GPU in and of itself, right? Wasn't it ATI/AMD who decided in recent years to move away from monolithic GPUs in part because time to market for lower-end derivatives was very poor compared to the introduction of new GPUs at the high end? With GF100, it seems (at least on the surface) that once the high-end GPU is ready, time to market for the lower-end derivatives stands to be significantly better than before. Also, if NVIDIA can make a balanced high-end GPU, then by definition the lower-end derivatives should be balanced too. Is it really balanced to have lower-end derivatives with the same geometry throughput as the higher-end models?

Of course, ATI/AMD's strategy will always have some merit. NVIDIA cannot easily get around the fact that monolithic GPUs take a long time to come to market and are very difficult to engineer. That said, the proof is in the pudding, and the results later this year will speak for themselves. I guess we'll learn a lot more in the coming months in seeing how everything plays out.
 
I think it's 6PM Eastern.

Usually a full review will get leaked from some two-bit website online the night before the NDA lifts, though. That's usually the case with morning NDAs. So maybe something like 12 hours prior for this one.
 
Any chance that we will be able to view the video stream from this GF100 "launch" event?

According to the provantage.com link above, availability of that PNY GTX 480 card is estimated at 3-10 business days. So that would mean by April 9.

Edit: Actually, the link says nothing about availability, other than showing the company's average processing time of 3-10 business days. Looks like slim pickings for this card.
 
GTX 470 overclock benchmark:


Source: http://we.pcinlife.com/thread-1385213-1-1.html
 
It depends on how they set up their lineup. If they go for a die size like RV770 as their top chip again, then maybe 20% is enough because they won't be able to double SIMDs this time. They may resurrect the sideport, making dual-GPU work better, using that to take on Fermi2 at the $400-500 point and ignore the higher price points.
What I had in mind was that the 5770 and 5750 ship with a 128-bit bus and the 5790 ships with a 192-bit one. The core count would need to change a bit, but that is fixable if this decision is taken early enough in the design phase.
 
From the Fermi docs it appears that caching will be enabled by default on DirectCompute global UAV accesses, which is a big deal. Beyond helping algorithms that actually have unpredictable memory access patterns, it raises the question of how important the local data store is now.

LDS is definitely more power efficient (and probably more area efficient too) than an equal-capacity block of general-purpose cache. There is a definite use case for the existence of these things.
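
To make the trade-off concrete, here's a minimal OpenCL sketch of the same 3-point stencil written both ways (kernel names are hypothetical): the first leans entirely on whatever read caching the hardware offers, the second stages a tile into __local memory by hand, which is the explicit-reuse pattern the LDS is built for.

Code:
/* Cache-only version: neighbouring work-items re-read overlapping inputs,
   so any reuse is captured (or not) by the hardware's read caches. */
__kernel void stencil_cached(__global const float *in,
                             __global float *out,
                             uint n)
{
    uint i = get_global_id(0);
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

/* LDS version: each work-group loads its tile plus a one-element halo into
   __local memory once, then reuses it explicitly. */
__kernel void stencil_lds(__global const float *in,
                          __global float *out,
                          __local float *tile,   /* get_local_size(0) + 2 floats */
                          uint n)
{
    uint i   = get_global_id(0);
    uint li  = get_local_id(0) + 1;
    uint lsz = get_local_size(0);

    if (i < n)
        tile[li] = in[i];
    if (get_local_id(0) == 0 && i > 0)
        tile[0] = in[i - 1];
    if (get_local_id(0) == lsz - 1 && i + 1 < n)
        tile[lsz + 1] = in[i + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (i > 0 && i < n - 1)
        out[i] = 0.25f * tile[li - 1] + 0.5f * tile[li] + 0.25f * tile[li + 1];
}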
 
What I had in mind was that the 5770 and 5750 ship with a 128-bit bus and the 5790 ships with a 192-bit one. The core count would need to change a bit, but that is fixable if this decision is taken early enough in the design phase.

Why would they do that?

I figure their whole lineup will shift. The 5830 will fill the $200 gap, not a 5790. The 5850 will drop to the $260 price range the 5830 is in. The 5870 will slot into the $350 price tag, and the Eyefinity 6 5870 2GB will slot into the $400 price point. They can put out a 5890 in the $400+ range if they want to take the performance crown. I think it's obvious that ATI has high-priced cards because there was no reason not to reap as much money from them as possible, and I think due to the large gaps in pricing we will see them drop down.

The best thing is that the 5870 at $350 is only a $30 drop from its original $379 MSRP. The 5850 at $260 would be at its original MSRP.

ATI shifting pricing like that (which shouldn't be very hard for them when you look at the original MSRPs) will really screw with NVIDIA's rumored $350/$500 price tags. Why buy a 470 when the 5870 is faster at the same price? Why buy a 480 for $500 when the 5870 is only 20% slower but $150 less?

With pricing like that the 5850 will become a really big card and most likely the sweet spot for gamers for the next 3-6 months.
 