Nvidia GT300 core: Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 20, 2008.

Thread Status:
Not open for further replies.
  1. Richard

    Richard Mord's imaginary friend
    Veteran

    Joined:
    Jan 22, 2004
    Messages:
    3,508
    Likes Received:
    40
    Location:
    PT, EU
One would hope that by then we wouldn't need POST screens anymore. Who am I kidding, we're still going to be posting in x86 mode regardless... the future will bring x86 bootstrap hardware on the mobo using CMOS for the working set; mark my words. For an industry so quick to change, we are an awful crotchety bunch.
     
  2. GrapeApe

    Newcomer

    Joined:
    Apr 3, 2004
    Messages:
    57
    Likes Received:
    2
    Location:
    Calgary, Canada
With nV being so die-space limited again, focusing heavily on the Tesla family in design, and trying to pack as much compute power onto the die as possible under the current fab process, I'm guessing the return of the NVIO is a safe assumption, no?

With that, what would the limitations be towards putting more than one traditional NVIO on the PCB to allow for greater multiple-monitor configurations (more as a rarer 'we can do it too' configuration than as a general design)? With the DRAM and ROP/RBE partitions being an odd number, as inferred from the blurry diagram, I'm assuming a six-cluster would be easier to feed to two external NVIOs than 3 distinct groups of even numbers.

    It would be another way to address a PR checkbox, in an era of the return of the checkbox (3DVision, Eyefinity, PhysX etc), and if possible would be simpler than an NVIO near-term redesign.

    I'm just not sure of the restriction on the NVIO as there's not too much on the underlying design, just the base components included (TMDS, RAMDACs, etc).

I always thought the NVIO was a near-term cop-out, but it would be essential if you wanted to go to a multi-die, MCM-style future design: it avoids duplication of resources and maximizes the transistor budget for this and for the idea of multiple offspring designs (like Tesla).

I know there are 2 NVIOs on the GTX295, but that's primarily due to SLi considerations when communicating with the bridge.

Anywhoo, just curious if anyone knows for sure whether dual NVIOs per chip are possible, or if that's limited by the memory interface or by RBE/ROP restrictions by design?
     
  3. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    Likes Received:
    0
    What is this?

[image]
     
  4. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,319
    Likes Received:
    23
    Location:
    msk.ru/spb.ru
AFAIK even the first version of NVIO allows four simultaneous outputs.
And NVIO has nothing to do with being die-size limited.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    At best 10-20% better performance than HD4890 in games despite having more bandwidth and being dramatically larger. I don't see any overstatement there.

    Maybe they did but AMD blanked them :sad:

    It wasn't a reference to the performance of RV740. It was a reference to the ability to refresh on a new node and improve all performance-per metrics significantly.

    Why? They're direct competitors (until Cedar arrives). If it was higher performance and/or lower-power we'd say "that's the benefit of 40nm". Instead we're just scratching our heads.

    It'll need to be quite a turnaround. Remember NVidia was boasting about expecting to be first with 40nm chips.

    When something as "simple" as GT218 is delayed and working badly it's not particularly surprising that NVidia's not ready for W7 launch with a 40nm D3D11 GPU.

    Jawed
     
  6. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    G92(b)
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Ooh, very interesting, thanks. Can't find anything about those online :sad:

    Is there something similar for use in DS to help in obtaining attributes at the newly generated points?

    Jawed
     
  8. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,319
    Likes Received:
    23
    Location:
    msk.ru/spb.ru
I fail to see any correlation between GT218 and DX11.
And it was late because of TSMC, not NVIDIA. Which raises the question of who's to blame for its power characteristics as well.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Another thought is merely that the L2 system can query the RBE-owned render target structures and either decode the RBEs' compression tag tables or request decompression semantics for the data it wants to fetch from memory. So the on-chip linkage might be quite simple and L2 is simply doing most of the work, rather than having RBEs fetching the data and using the render target caches.

    Sure this isn't just the regular L1 cache that's used for textures? I don't trust Anandtech.

    Jawed
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Even Larrabee has RCP and RSQRT intrinsics :grin: I wonder what the throughput for these is. I guess that's the cost of doing graphics, rather than just general compute. The EXP2 and LOG2 functions are useful too - though base-2 stuff is pretty easy I dare say (partly re-using FTOI/ITOF I guess?).

    Jawed
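As a rough illustration of why base-2 transcendentals are cheap (this is just a sketch of the standard IEEE-754 trick, not a claim about NVidia's or Larrabee's actual implementation): for integer n, 2^n can be built simply by writing n + 127 into the single-precision exponent field, which is ITOF-style bit shuffling rather than real math.

```python
import struct

def exp2_int(n: int) -> float:
    """Build 2**n for integer n directly from IEEE-754 single-precision
    bits: exponent field holds n + 127, sign and mantissa are zero."""
    assert -126 <= n <= 127  # stay within the normalized range
    bits = (n + 127) << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(exp2_int(10))   # 1024.0
print(exp2_int(-3))   # 0.125
```

The fractional part of the exponent still needs a small polynomial or table lookup in a real EXP2 unit, but the integer part really is this cheap.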
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I was looking at it from the point of view of the throughput for the ALUs, that 1 operand per clock is available per MAD: 30 SIMDs * 8 ALUs * 4 bytes * 1476MHz (GTX285) = 1.417TB/s.

    Jawed
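For reference, that arithmetic checks out (using the GTX 285 figures quoted above):

```python
# Operand bandwidth needed to feed one 4-byte operand per MAD per clock
# on a GTX 285: 30 SIMDs, 8 MAD ALUs each, 1476 MHz hot clock.
simds = 30
alus_per_simd = 8
bytes_per_operand = 4
hot_clock_hz = 1476e6

bandwidth = simds * alus_per_simd * bytes_per_operand * hot_clock_hz
print(f"{bandwidth / 1e12:.3f} TB/s")  # 1.417 TB/s
```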
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I don't know. It seemed like an odd thing to just make up out of thin air.
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
Oh, ok, though I'm not sure if the register file and/or shared memory run at the hot clock. For one thing, results from the pipeline are written 16 at a time, which implies some sort of buffering.
     
  14. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
That's hardly an excuse; AMD didn't suffer as much, so it comes down to NV's design.

    Thanks for the webcast time, appreciate it. :)
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    See figure 4:

    http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf

    Sure, it's not comprehensive, but SFU isn't getting much use there.

    Jawed
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    They both need a 40nm process and it's "safe" for IHVs to build an "easy" chip on a new process before attempting a behemoth.

    You think NVidia was entirely blameless?

    Jawed
     
  17. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
That doesn't prove nAo wrong. I did an analysis of these games with RV770, and it has even less dependence on BW for Crysis. Crysis is a bad game to evaluate this with, too, as the timedemos/walkthroughs that most reviewers use definitely have some parts that are CPU limited.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That's due to banking. RF and SM are both twice as wide as the MAD SIMD.

    Jawed
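The banking point is just a width-times-clock equivalence: a bank twice as wide, running at half the ALU clock, sustains the same operands per second. A minimal sketch (the frequencies here are illustrative assumptions, not confirmed specs):

```python
# A register file twice as wide as the MAD SIMD, clocked at half the
# hot clock, delivers the same operand throughput.
alu_lanes, hot_clock_hz = 8, 1_296_000_000   # 8-wide SIMD at hot clock
rf_lanes, rf_clock_hz = 16, 648_000_000      # 16-wide bank at half clock

assert alu_lanes * hot_clock_hz == rf_lanes * rf_clock_hz
print("operands/s per SIMD:", alu_lanes * hot_clock_hz)
```

This is also why 16-wide writes don't by themselves imply the RF runs at the hot clock.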
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
Yeah, I know, but I took the RF and SM clocks to be 600MHz (for the GTX280). Not sure of the right way to calculate it.
     
  20. FUDie

    Regular

    Joined:
    Sep 25, 2002
    Messages:
    581
    Likes Received:
    34
    If you gain 8% from a 9% increase in engine and memory clocks, how can you claim it's CPU limited?

    -FUDie
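FUDie's point can be made quantitative: if frame rate scales at nearly the same rate as the clocks, the workload is tracking the GPU, not the CPU. A quick sketch:

```python
# If clocks rise 9% and performance rises 8%, scaling efficiency is
# roughly 8/9 ~= 0.89: performance follows GPU clocks closely, which
# argues against a CPU limit for that scene.
clock_gain = 0.09
perf_gain = 0.08
efficiency = perf_gain / clock_gain
print(f"scaling efficiency: {efficiency:.2f}")
```

A genuinely CPU-limited scene would show efficiency near zero, since faster GPU clocks couldn't raise the frame rate.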
     
