NVIDIA GF100 & Friends speculation

I see no technical objection to that.

Yeah, me neither; it's just that we haven't seen ATI do it, while NV has done it twice with G80 and GF100. Arguably ATI went overboard with 512-bit on R600, and 384-bit would have suited it much better.

Although I think ATI definitely has to go 384-bit for the next gen. The highest GDDR5 speed they can probably get for it is 6 Gbps, barely a 20% increase in bandwidth. 384-bit guarantees them 50% greater bandwidth. Then again, the overclocking tests done so far show that Cypress gains more from core clock than from memory clock. Who knows, by making it more bandwidth-efficient they might get by with a 256-bit interface.
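To make the bus-width arithmetic concrete, here's a trivial back-of-the-envelope sketch; the 5 Gbps baseline is my own assumption (roughly Cypress-class GDDR5), not a figure from the post.

Code:
#include <cstdio>

// Peak bandwidth in GB/s: (bus width in bits / 8 bits per byte) * per-pin rate in Gbps.
static double bandwidth_gbs(int bus_width_bits, double data_rate_gbps) {
    return bus_width_bits / 8.0 * data_rate_gbps;
}

int main() {
    double base   = bandwidth_gbs(256, 5.0);  // ~160 GB/s, assumed current baseline
    double faster = bandwidth_gbs(256, 6.0);  // ~192 GB/s, ~20% over that baseline
    double wider  = bandwidth_gbs(384, 6.0);  // ~288 GB/s, 50% over 256-bit at the same 6 Gbps
    std::printf("%.0f / %.0f / %.0f GB/s\n", base, faster, wider);
    return 0;
}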
 
Do we have a confirmed release date or timeframe for any Fermi derivatives, btw? Surely if they had planned to release GF100 in Q4/09 they must have planned for the derivatives to be released within a few months of it. So does this mean the derivatives are also delayed, or was this the original plan? If it in fact was the original plan, then NV have truly lost the plot. If GF104 does come out in Q2, say May/June, then Juniper will have gone unchallenged in the mid-range segment for something like 8 months!
 
I don't think NVIDIA have officially announced anything about derivatives other than the GTX 480 and GTX 470 at this time.
 
The situation might be acceptable from AMD's point of view, but it is definitely a missed opportunity. I wonder if they'll go with a 192-bit bus on Juniper's replacement.
I don't see a missed opportunity at all. You can't have an unlimited number of chips. If Juniper were bigger/faster, the gap between the 5670 and 5770 would be too big. The cost of designing and taping out another chip would probably be greater than the cost of the total idle silicon of all 5830s sold.

IMO, neither ATI nor NVidia will ever release a chip that is so uncompetitive that it will be blown out of the water at its price point. All that we'll see is one making more profit than the other.
 
GF104 won't compete with Cypress; its projected specs put it somewhere north of GT200 in the computational department and a bit down on bandwidth. Definitely something to beat up Juniper with.

Last time I checked, north of GT200 is actually north of the 5850, which is still Cypress :LOL:
 
Although I think ATI definitely has to go 384-bit for the next gen. The highest GDDR5 speed they can probably get for it is 6 Gbps, barely a 20% increase in bandwidth.
It depends on how they set up their lineup. If they go for a die size like RV770 as their top chip again, then maybe 20% is enough because they won't be able to double SIMDs this time. They may resurrect the sideport, making dual-GPU work better, using that to take on Fermi2 at the $400-500 point and ignore the higher price points.

If the top N.I. chip was 256-bit, 2000 SPs capable of dependent instructions within an instruction group, and did all geometry operations at more than one tri per clock, that may be good enough for next gen.
 
If the top N.I. chip was 256-bit, 2000 SPs capable of dependent instructions within an instruction group, and did all geometry operations at more than one tri per clock, that may be good enough for next gen.

What do you mean by that?

BTW..I'd take coherent r/w caches over more than one triangle per clock any time of the day.
 
Who knows how long it took nVidia to design and debug the distributed geometry units. This is not something I would expect ATI to add unless they had started working on it a long time ago.
 
If the top N.I. chip was 256-bit, 2000 SPs capable of dependent instructions within an instruction group,
Evergreen introduced some of this; see the ADD_PREV, BCNT_ACCUM_PREV_INT, MBCNT_32LO_ACCUM_PREV_INT, MUL_IEEE_PREV, MUL_PREV, MULADD_IEEE_PREV, MULADD_PREV and SAD_ACCUM_PREV_UINT instructions. The INTERP_* instructions also do combinations of ADDs and DOTs.

and did all geometry operations at more than one tri per clock, that may be good enough for next gen.
Which geometry operation is the bottleneck currently?

Jawed
 
BTW..I'd take coherent r/w caches over more than one triangle per clock any time of the day.
Agreed, but Fermi's cache actually creates a really awkward situation when comparing these cards, one that exposes something we haven't seen much of on GPUs up to this point: you can now write very platform-specific code in HLSL (or OpenCL, etc.).

From the Fermi docs it appears that caching will be enabled by default on DirectCompute global UAV accesses, which is a big deal. Beyond helping algorithms that actually have unpredictable memory access patterns, it raises the question of how important the local data store is now.

In the more cheesy department, it means that NVIDIA (or anyone) can now write DC code that runs well on Fermi but terribly on ATI parts - even artificially - by just avoiding explicit use of local memory even in places with predictable memory access patterns. Conversely, I'm guessing that the "globallycoherent" modifier was added to DC with Fermi in mind (it doesn't appear to do anything on ATI parts), so ATI (or anyone) could artificially disable this caching in their own code by just putting that on every global buffer, whether it is "needed" or not.
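As a rough CUDA analogue of that point (the HLSL/DirectCompute details differ, and the kernels below are purely illustrative), the same neighbor-sum can be written to lean entirely on a hardware-managed cache, or to stage data in the on-chip local store explicitly:

Code:
// Hypothetical 1D "sum of 3 neighbors" kernel, written two ways.

// Variant A: plain global loads only. On a part that caches global reads
// (Fermi-style L1), the overlapping loads mostly hit in cache; on a part
// without such caching, each thread pays for redundant memory traffic.
__global__ void sum3_global(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

// Variant B: stage the block's window in shared memory (the local data store),
// so the redundant reads are served on-chip regardless of cache behaviour.
// Block size of 256 is assumed for the tile.
__global__ void sum3_shared(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    if (i < n) tile[t] = in[i];
    if (threadIdx.x == 0 && i > 0)                  tile[0] = in[i - 1];
    if (threadIdx.x == blockDim.x - 1 && i < n - 1) tile[t + 1] = in[i + 1];
    __syncthreads();
    if (i > 0 && i < n - 1)
        out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}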

More realistically, this means that code "optimized" on one card will not necessarily run well on the other (particularly in the developed-on-NVIDIA, run-on-ATI case for now). This has been true for some specific operations for a little while, but we're talking about orders of magnitude now. We're in the range where you might need to start writing different code for different architectures... "performance portability" is turning into a bit of a thing of the past. This obviously also goes for code scaling to future architectures, as it's amusing to see NVIDIA already noting several "legacy" code problems in their newest CUDA programming guides (use of texture lookups for caching, local data store sizes, bank conflict patterns, etc.).

Sorry for being a bit OT, but I figured this was a good place to throw down some thoughts given the impending release of competitive benchmarks between these two architectures. It's going to be increasingly tough to declare overall "winners", since it's going to depend a lot more on how code is written in the future (more like comparing CPUs, but along the lines of Core i7 vs Cell or similarly vague comparisons).

[Aside: globallycoherent is a particularly weird modifier in that it seems to apply to only a case that is unsafe to write code against, namely one CS group communicating with another. The problem is that with parallelism and CS group execution ordering left completely undefined, I'm not sure if there is a "safe" use of this functionality. Maybe I'm misreading the usefulness of this though...]
 
From the Fermi docs it appears that caching will be enabled by default on DirectCompute global UAV accesses, which is a big deal. Beyond helping algorithms that actually have unpredictable memory access patterns, it raises the question of how important the local data store is now.
Very good point.

AMD has to catch up; it's the direction we want to go in anyway.

[Aside: globallycoherent is a particularly weird modifier in that it seems to apply to only a case that is unsafe to write code against, namely one CS group communicating with another. The problem is that with parallelism and CS group execution ordering left completely undefined, I'm not sure if there is a "safe" use of this functionality. Maybe I'm misreading the usefulness of this though...]
Maybe I'm not catching your drift, but semaphores via atomics are the key here, aren't they? Those plus append/consume.
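For what it's worth, here's a rough CUDA-flavoured sketch (not DirectCompute, and purely illustrative) of the kind of atomics pattern that doesn't depend on group scheduling order: the last group to finish does the follow-up work, so no group ever has to wait on another.

Code:
__device__ unsigned int blocks_done = 0;

// Each block publishes a partial result, then atomically bumps a counter.
// The block that bumps it last knows every partial is globally visible
// (thanks to the fence) and combines them. No block waits on another, so
// undefined group scheduling order doesn't matter.
__global__ void block_sum_then_combine(const float* in, float* partials,
                                       float* result, int num_blocks) {
    // Trivial per-block "partial": one element per block, purely illustrative.
    if (threadIdx.x == 0)
        partials[blockIdx.x] = in[blockIdx.x];

    __threadfence();                       // make the partial visible to other blocks
    __shared__ bool am_last;
    if (threadIdx.x == 0)
        am_last = (atomicAdd(&blocks_done, 1u) == num_blocks - 1);
    __syncthreads();

    if (am_last && threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < num_blocks; ++i)
            sum += partials[i];
        *result = sum;
        blocks_done = 0;                   // reset for a subsequent launch
    }
}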

Jawed
 
GZ007 said:
If they beat the 5970 with the next card, it should be fine.
I will be surprised if they do. I doubt they will be willing to increase die size by the same amount again given how well they have done with RV870.
 
Yes, but I still see a big, yawning gap between the 5770 and 5850. I think AMD blew a big hole, bandwidth-wise, in their lineup. This may just provide NV with the opening they desperately need. Juniper with a 192-bit bus would have been a much better match for GF104. That gap is way too big, and the 5830 is pretty lame.

An opening only if Nvidia is actually making money on every board sold and has ordered more wafers from TSMC. If yields are still well below 30% with little improvement foreseen due to their flawed base layer design, and Nvidia can't make A3 silicon at a profit, they may just slowly milk their existing supply without ordering any more A3 wafers, minimizing their losses.

If they're selling their chips at a loss at the released board MSRPs and making it up in volume... that's not the best solution.
 
I don't think NVIDIA have officially announced anything about derivatives other than the GTX 480 and GTX 470 at this time.
Considering that it would just distract people from the launch of the GTX 480 and GTX 470, there's no way they would announce them until a little while after these parts are released.
 
An opening only if Nvidia is actually making money on every board sold and has ordered more wafers from TSMC. If yields are still well below 30% with little improvement foreseen due to their flawed base layer design, and Nvidia can't make A3 silicon at a profit, they may just slowly milk their existing supply without ordering any more A3 wafers, minimizing their losses.

If they're selling their chips at a loss at the released board MSRPs and making it up in volume... that's not the best solution.
Sorry, but there is essentially zero gain for nVidia to go into full production before yields are profitable. There just isn't any way they would do it.
 
What do you mean by that?
The y unit can use the result of the x unit, z can use the result of the y unit, etc., within a single xyzw instruction group. You know how the ALUs work on two wavefronts in an AAAABBBB pattern? Just use 8 batches in an ABCDEFGHAB... 32-cycle pattern. Same net instruction throughput, branch throughput, register access scheme, wavefront throughput, etc., except you get higher ALU utilization.

It's not too important given ATI's strength at math, but it's low-hanging fruit nonetheless.
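A tiny illustration of what that buys (the mapping to x/y/z/w slots in the comments is mine, just to show the idea): a serially dependent chain like this can't be packed into one instruction group today, because each op needs the previous result, but with intra-group forwarding the whole chain could issue as a single group.

Code:
// Hypothetical fragment: each operation depends on the previous one, so a
// VLIW compiler today must spread it across four instruction groups, using
// only one slot in each. With intra-group forwarding (y reads x's result,
// z reads y's, w reads z's) the whole chain could issue as one xyzw group.
__device__ float dependent_chain(float b, float c, float e, float g, float i) {
    float a = b * c;   // x slot
    float d = a + e;   // y slot, consuming x's result
    float f = d * g;   // z slot, consuming y's result
    return f + i;      // w slot, consuming z's result
}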

BTW..I'd take coherent r/w caches over more than one triangle per clock any time of the day.
Yeah, but for a massively parallel processor, it's probably more costly and difficult. Out of curiosity, what applications are you thinking of? I figured that shared memory and atomics help reduce the need for coherency a fair amount.
 
Beyond helping algorithms that actually have unpredictable memory access patterns, it raises the question of how important the local data store is now.
I think it's still fairly important. Fermi's caches are small; if you don't use them carefully, they will be effectively worthless due to capacity misses. The local store lets you protect highly reused data from eviction as you stream other data through.
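A toy CUDA sketch of that "protect the hot data" idea (the table size and kernel are made up for illustration): a small, heavily reused lookup table lives in shared memory, so the large streaming reads of the input can't evict it the way they could in a small hardware-managed cache.

Code:
#define TABLE_SIZE 256   // illustrative size; small enough to fit in the local store

// The lookup table is reused by every thread for every element, while the
// input indices are streamed through once. Keeping the table in shared
// memory pins it on-chip; the streaming traffic can't evict it, which a
// small hardware-managed cache could not guarantee.
__global__ void apply_lut(const unsigned int* __restrict__ indices, float* out,
                          const float* __restrict__ table, int n) {
    __shared__ float lut[TABLE_SIZE];
    for (int t = threadIdx.x; t < TABLE_SIZE; t += blockDim.x)
        lut[t] = table[t];                       // load the hot table once per block
    __syncthreads();

    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = lut[indices[i] % TABLE_SIZE];   // reused data served from the local store
}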
 