Nvidia GT300 core: Speculation

One day in the near future, a message on the screen when starting up your brand new IBM PC compatible...


"no CPU detected, starting software emulation"

LOL

One would hope that by then we wouldn't need POST screens anymore. Who am I kidding, we're still going to be POSTing in x86 mode regardless... the future will bring x86 bootstrap hardware in the mobo using CMOS for the working set; mark my words. For an industry so quick to change we are an awfully crotchety bunch.
 
With nV being so die-space limited again, focusing heavily on the Tesla family in design, and trying to pack as much compute power onto the die as possible under the current fab process, I'm guessing the return of the NVIO is a safe assumption, no?

With that, what would the limitations be on putting more than one traditional NVIO on the PCB to allow for greater multiple-monitor configurations (more as a rarer 'we can do it too' configuration than as a general design)? With the DRAM and ROP/RBE partitions being an odd number, as inferred from the blurry diagram, I'm assuming a six-cluster would be easier to feed to two external NVIOs than three distinct groups of even numbers.

It would be another way to tick a PR checkbox, in an era of the return of the checkbox (3DVision, Eyefinity, PhysX, etc.), and if possible would be simpler than a near-term NVIO redesign.

I'm just not sure of the restrictions on the NVIO, as there's not much out there on the underlying design, just the base components included (TMDS, RAMDACs, etc.).

I always thought the NVIO was a cop-out for the near term, but it would be essential if you wanted to go to a multi-die MCM-style future design, to avoid duplication of resources and maximize the transistor budget for this and for the idea of multiple offspring designs (like Tesla).

I know there are two NVIOs on the GTX 295, but that's primarily due to SLI considerations when communicating with the bridge.

Anywhoo, just curious if anyone knows for sure whether dual NVIOs per chip is possible, or whether it's limited by the memory interface or by RBE/ROP restrictions by design.
 
With that, what would the limitations be on putting more than one traditional NVIO on the PCB to allow for greater multiple-monitor configurations (more as a rarer 'we can do it too' configuration than as a general design)?
AFAIK even the first version of NVIO allows for 4 simultaneous outputs.
And NVIO has nothing to do with being die size limited.
 
That's a bit of an overstatement.
At best 10-20% better performance than HD4890 in games despite having more bandwidth and being dramatically larger. I don't see any overstatement there.

So how come nobody did?
Maybe they did but AMD blanked them :cry:

GT218 is a 60mm^2 GPU. I don't think that you can compare it to the 140mm^2 RV740.
It wasn't a reference to the performance of RV740. It was a reference to the ability to refresh on a new node and significantly improve all the performance-per-watt and performance-per-mm² metrics.

And you surely can't compare it to a GPU made on another process.
Why? They're direct competitors (until Cedar arrives). If it were higher performance and/or lower power, we'd say "that's the benefit of 40nm". Instead we're just scratching our heads.

In other words we need more information before any conclusion that GT21x is a failure can be made. One review of GT218 isn't enough for such a conclusion.
It'll need to be quite a turnaround. Remember NVidia was boasting about expecting to be first with 40nm chips.

When something as "simple" as GT218 is delayed and working badly it's not particularly surprising that NVidia's not ready for W7 launch with a 40nm D3D11 GPU.

Jawed
 
Previously the interpolator provided the interpolated values to the shader. In SM 5.0 the shader can ask for interpolated values by itself. There are some functions for it: EvaluateAttributeAtCentroid(), EvaluateAttributeAtSample() and EvaluateAttributeSnapped().
Ooh, very interesting, thanks. Can't find anything about those online :cry:
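Here's my rough mental model of it, as a plain-Python sketch rather than HLSL (all the helper names are mine, not D3D's): the shader pulls barycentric weights for whatever position it cares about (centroid, a given sample, a snapped offset) and does the weighted sum itself, instead of being handed one pre-interpolated value.

Code:
def interpolate(vertex_attr, weights):
    # weighted (barycentric) sum of a per-vertex attribute over a triangle
    (a0, a1, a2), (w0, w1, w2) = vertex_attr, weights
    return w0 * a0 + w1 * a1 + w2 * a2

def evaluate_attribute_at(vertex_attr, barycentrics_at, position):
    # "pull model": the shader chooses the position and interpolates itself
    return interpolate(vertex_attr, barycentrics_at(position))

# Example: scalar attribute (0.0, 1.0, 0.5) at the three vertices, evaluated at
# two different in-pixel positions with made-up barycentric weights.
weights_at = lambda pos: {(0.5, 0.5): (0.25, 0.50, 0.25),
                          (0.3, 0.7): (0.10, 0.60, 0.30)}[pos]
print(evaluate_attribute_at((0.0, 1.0, 0.5), weights_at, (0.5, 0.5)))  # 0.625
print(evaluate_attribute_at((0.0, 1.0, 0.5), weights_at, (0.3, 0.7)))  # 0.75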

Is there something similar for use in DS to help in obtaining attributes at the newly generated points?

Jawed
 
When something as "simple" as GT218 is delayed and working badly it's not particularly surprising that NVidia's not ready for W7 launch with a 40nm D3D11 GPU.
I fail to see any correlation between GT218 and DX11.
And it was late because of TSMC, not NVIDIA. Which raises the question of who's to blame for its power characteristics as well.
 
Any such data path puts RV870 one step closer to fully closing the write/read loop in the manner CPU caches do.
It would probably still be less flexible and have higher latency, but at least there's an on-chip path.
Another thought is merely that the L2 system can query the RBE-owned render target structures and either decode the RBEs' compression tag tables or request decompression semantics for the data it wants to fetch from memory. So the on-chip linkage might be quite simple and L2 is simply doing most of the work, rather than having RBEs fetching the data and using the render target caches.

As a side note, I'm curious about the additional non-texture L1 that was added alongside the regular texture cache, as mentioned in the Anandtech article. What this brings to the table at that size, compared to the larger texture cache and LDS, I'm not sure. It would help with thrashing problems if graphics and compute shaders hit the same SIMD, I suppose.
In a GPGPU situation, what would it offer over using the larger L1?
Sure this isn't just the regular L1 cache that's used for textures? I don't trust Anandtech.

Jawed
 
With both NVIDIA and ATI now interpolating in the shader cores, divisions are used even more often :)
Even Larrabee has RCP and RSQRT intrinsics :D I wonder what the throughput for these is. I guess that's the cost of doing graphics, rather than just general compute. The EXP2 and LOG2 functions are useful too - though base-2 stuff is pretty easy I dare say (partly re-using FTOI/ITOF I guess?).
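For illustration only, here's the classic bit-twiddling flavour of "base-2 is easy", in Python rather than actual ISA code: approximate LOG2 by pulling the exponent field straight out of the IEEE-754 encoding. This is just the general trick, not a claim about how Larrabee's intrinsics are actually implemented.

Code:
import struct

def float_bits(x):
    # reinterpret a float32 as its raw 32-bit pattern (the FTOI-flavoured step)
    return struct.unpack('<I', struct.pack('<f', x))[0]

def approx_log2(x):
    bits = float_bits(x)
    exponent = ((bits >> 23) & 0xFF) - 127         # = floor(log2 x) for normal floats
    mantissa = (bits & 0x7FFFFF) / float(1 << 23)  # fractional part in [0, 1)
    return exponent + mantissa                     # linear approximation, max error < 0.09

print(approx_log2(10.0))  # ~3.25 (true log2(10) = 3.3219...)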

Jawed
 
How do you figure that?

I get 30 * 16 banks * 4 bytes * 600 MHz ≈ 1.15 TB/s
I was looking at it from the point of view of the throughput for the ALUs, that 1 operand per clock is available per MAD: 30 SIMDs * 8 ALUs * 4 bytes * 1476MHz (GTX285) = 1.417TB/s.
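For reference, the two figures side by side (just re-running the arithmetic, nothing new):

Code:
# Register-file view: 30 SIMDs * 16 banks * 4 bytes * 600 MHz
rf_bw = 30 * 16 * 4 * 600e6       # ~1.15e12 B/s  ~= 1.15 TB/s

# ALU-operand view: 30 SIMDs * 8 MADs * 4 bytes * 1476 MHz (GTX 285 hot clock)
alu_bw = 30 * 8 * 4 * 1476e6      # ~1.417e12 B/s ~= 1.417 TB/s

print(rf_bw / 1e12, alu_bw / 1e12)  # 1.152 1.41696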

Jawed
 
I was looking at it from the point of view of the throughput for the ALUs, that 1 operand per clock is available per MAD: 30 SIMDs * 8 ALUs * 4 bytes * 1476MHz (GTX285) = 1.417TB/s.

Jawed

Oh, ok, though I'm not sure if the register file and/or shared memory runs at the hot clock. For one thing, results from the pipeline are written 16 at a time, which implies some sort of buffering.
 
I fail to see any correlation between GT218 and DX11.
And it was late because of TSMC, not NVIDIA. Which raises the question of who's to blame for its power characteristics as well.

That's hardly an excuse; AMD didn't suffer as much, so it comes down to NV's design.

Thanks for the webcast time, appreciate it. :)
 
I fail to see any correlation between GT218 and DX11.
They both need a 40nm process and it's "safe" for IHVs to build an "easy" chip on a new process before attempting a behemoth.

And it was late because of TSMC, not NVIDIA. Which raises the question of who's to blame for its power characteristics as well.
You think NVidia was entirely blameless?

Jawed
 
Except that increasing the engine clock by 9% alone was enough to gain 5%. Increasing memory clocks by 9% as well couldn't get you more than an additional 4%, so engine clock has more impact than memory clock.
That doesn't prove nAo wrong. I did an analysis of these games with RV770, and it has even less dependence on BW for Crysis. Crysis is a bad game to evaluate this with, too, as the timedemos/walkthroughs that most reviewers use definitely have some parts that are CPU limited.
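To put rough numbers on it, here's a toy split-of-frame-time model (the fractions are my guesses, not measurements) in which a mostly engine-bound frame with a small clock-independent slice reproduces roughly the 5% and 8% figures being discussed:

Code:
# Toy model, not measured data: frame time splits into a part that scales with
# engine clock, a part that scales with memory clock, and a fixed part (CPU,
# sync, etc.) that scales with neither.
def fps_gain(engine_frac, mem_frac, engine_oc=1.09, mem_oc=1.09):
    fixed_frac = 1.0 - engine_frac - mem_frac
    new_time = engine_frac / engine_oc + mem_frac / mem_oc + fixed_frac
    return 1.0 / new_time - 1.0  # fractional FPS gain

# e.g. 60% engine-bound, 30% BW-bound, 10% clock-independent:
print(f"{fps_gain(0.60, 0.30, mem_oc=1.0) * 100:.1f}%")  # engine +9% only -> ~5.2%
print(f"{fps_gain(0.60, 0.30) * 100:.1f}%")              # both +9%        -> ~8.0%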
 
Oh, ok, though I'm not sure if the register file and/or shared memory runs at the hot clock. For one thing, results from the pipeline are written 16 at a time, which implies some sort of buffering.
That's due to banking. RF and SM are both twice as wide as the MAD SIMD.
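Just to illustrate what I mean (toy numbers, my reading of it, not anything from NVidia docs): with the register file twice as wide as the MAD unit, two hot clocks' worth of results fill exactly one write, so "16 at a time" falls out naturally without extra buffering beyond one batch.

Code:
SIMD_WIDTH = 8   # MAD lanes per multiprocessor (per the figures in this thread)
RF_WIDTH = 16    # register-file/shared-memory width in lanes (assumption)

pending = []
for hot_clock in range(4):
    # one 8-wide result vector produced per hot clock
    pending.extend(f"r{hot_clock}.{lane}" for lane in range(SIMD_WIDTH))
    if len(pending) >= RF_WIDTH:
        batch, pending = pending[:RF_WIDTH], pending[RF_WIDTH:]
        print(f"hot clock {hot_clock}: one 16-wide RF write ({len(batch)} results)")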

Jawed
 
That doesn't prove nAo wrong. I did an analysis of these games with RV770, and it has even less dependence on BW for Crysis. Crysis is a bad game to evaluate this with, too, as the timedemos/walkthroughs that most reviewers use definitely have some parts that are CPU limited.
If you gain 8% from a 9% increase in engine and memory clocks, how can you claim it's CPU limited?
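To quantify it with the same sort of toy model as above (the fractions here are hypothetical): if some fraction of frame time really were CPU-bound, i.e. independent of GPU clocks, it would cap the gain from a +9% overclock well below 8% pretty quickly.

Code:
# If cpu_frac of frame time doesn't scale with GPU clocks at all, the best
# possible gain from +9% on everything else is:
def max_gain(cpu_frac, oc=1.09):
    return 1.0 / (cpu_frac + (1.0 - cpu_frac) / oc) - 1.0

print(f"{max_gain(0.0) * 100:.1f}%")  # fully GPU-bound    -> 9.0% ceiling
print(f"{max_gain(0.2) * 100:.1f}%")  # 20% CPU-bound time -> ~7.1% ceiling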

-FUDie
 