Nvidia GT300 core: Speculation

I haven't seen this before, but it just looks like a piece of the hierarchical rasterisation feature set that we've seen in other NVidia patents.
I wasn't aware of that. :(

I can't see anything there that's meaningfully beyond G80.
Given that it was filed only after G80's launch (and thus hasn't been issued yet), I doubt this particular assumption.
 
Yea, 3DMark03 is the one that the FX failed in.
It contained two ps1.x game tests, and the nature test with ps2.0.
FX did fine in the ps1.x tests, but completely died in the ps2.0 test.
Then a driver update appeared where the ps2.0 performance was 'fixed'... nVidia had replaced everything with int and half-precision shaders, and also 'optimized' some other things, like not rendering things that were outside the visible range (abusing the fact that the camera path was fixed). It ran about as fast as ATi's stuff, but it suffered from blocky aliasing because of the limited precision.
That's when Futuremark started with the whole driver approval thing.
Funnily enough, many people couldn't believe the FX series was THAT bad at ps2.0, and suspected foul play from FM/ATi instead. Then again, who could blame them, really? Games only used fixed function or ps1.x, and there was no reason to assume performance problems based on that.

If I remember correctly, two of the game tests that did pixel shader 1.1 were actually using 1.4 on any card that supported it. So GT2 and GT3 were actually 1.4 shaders, unless those were replaced. 1.1 shaders were not always faster than 1.4 shaders on the FX cards; it really depended on register usage.
 
So the restrictions are at the compiler/library level?
Restrictions mostly at the hardware level.

I dare say I'm getting a sense that GPU/game programmers will be blazing a trail, from what you've described. Though there's still a very tricky scaling question beyond a single GPU. I stumbled into this:

http://insidehpc.com/2009/05/12/argonne-researchers-receive-award-for-mpi-performance-study/

which paints a grim picture.

Not grim IMO; rather, it shows what will become important. For example, note how the BG/P OS doesn't do disk-backed memory: pages are always physically pinned so the DMA engine has low latency and the CPU doesn't touch pages during communication. What I gather from all of it is that eventually the hardware is going to consist of cores and an interconnect which provides dedicated hardware support for the most important parallel communication patterns, so that the cores aren't involved in communication which is latency bound. Things like CPUs manually doing all the work on interrupts (preemption) just aren't going to scale... nor are ALUs doing atomic operations on shared queues between cores... etc. I think all this goes away at some point in favor of dedicated hardware, and a different model of general purpose computing.

My little brother (James Lottes, different last name) worked at Argonne in the MCS Division on tough scaling issues for Bluegene (until he decided to go back to get his PhD this year; now he works there on/off). An interesting paper related to the issues of scaling algorithms in interconnect-limited cases, http://www.iop.org/EJ/article/1742-6596/125/1/012076/jpconf8_125_012076.pdf?request-id=12293745-5238-4326-9be2-43b91b4c4753, covers how they adjust data exchange strategies for the problem to lower network latency.

Are global fetches cached? I disagree fundamentally on the cache question - just because you can hide latency doesn't mean performance is fine without a cache.

If you haven't read this PTX simulator paper, http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf, you might find it interesting. Their results showed performance more sensitive to interconnection network bisection bandwidth rather than latency. They also added a cache in their simulation, which indeed helped some of the apps, but also reduced the performance of a lot of them.
 
If I remember correctly, two of the game tests that did pixel shader 1.1 were actually using 1.4 on any card that supported it. So GT2 and GT3 were actually 1.4 shaders, unless those were replaced.

You are correct that they were ps1.4 on hardware that supported it.
And I'm not entirely sure, but I vaguely recall that nVidia may have reported ps1.1 capability in those tests because it ran faster than ps1.4 on the FX.
 
1. I think you're confused there, as I believe you mean 3DMark2000, not 2001. 2000 was a DX6/7 tester and 2001 was DX7/8. 03/05/06 are DX9 for the most part, with maybe a tiny bit of DX8.
2. Secondly, I've scored damn near 10k in 2001 with a GF2 (a 3D Prophet that made NV angry for being as fast as a Pro card because of its core and memory OC) and a P3 1GHz. I've yet to see any 2GHz single-core processor with any non-TnL GPU come close to that. Hell, my laptop with an ATI Xpress 200 and an A64 s754 3200+ doesn't even top 5k in 2001.
3DMark 2001 is called a DX8 test, but it doesn't test real DX8 capabilities at all:

You can run tests 1-3 in full quality on any DX6-compatible graphics card. No effect will be missing. The only advantage of a DX7/8 graphics card in these tests is hardware-accelerated geometry.

Test 4 uses PS1.1 on the lake surface, which is shown for 15-20% of the testing time - that's the only DX8-exclusive effect, which can reflect DX8 performance in the score.

Score is calculated via this formula: (total low-detail FPS * 10) + (total high-detail FPS + nature FPS) * 20

Here are the results of a DX8 graphics card: (107,1 + 98,6 + 103,2)*10 + (41,4 + 67,3 + 46,9 + 29,4)*20 = 6789 3D Marks

The last value (29,4) is the framerate in the Nature test. Imagine that the graphics card were so crappy at pixel shading that the performance in the PS/lake scenes would be zero. We know that the lake scenes take about 18% of the test time, so it's quite easy to work out what the framerate would be: 29,4*0,82 = 24,1 FPS

If I use the 3DMark formula, the graphics card would score 6683 3D Marks. Well, this "DX8 benchmark" shows a 1,5% difference between a fast DX8 graphics card and a graphics card with zero DX8 performance.
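
For clarity, here is the same arithmetic as a tiny Python sketch; the formula and every number are taken straight from the post above (decimal points instead of commas, since that's what Python needs):

Code:
# 3DMark 2001 score formula as quoted above, with this post's numbers.
def score_3dm01(low_detail_fps, high_detail_fps):
    return sum(low_detail_fps) * 10 + sum(high_detail_fps) * 20

low  = [107.1, 98.6, 103.2]        # low-detail game tests
high = [41.4, 67.3, 46.9, 29.4]    # high-detail game tests, last value is Nature

print(score_3dm01(low, high))      # ~6789 3D Marks

# Now assume zero pixel-shading performance in the lake scenes (~18% of Nature):
nature_no_ps = 29.4 * 0.82         # ~24.1 FPS
print(score_3dm01(low, high[:3] + [nature_no_ps]))  # ~6683 3D Marks, ~1.5% lower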

Do you understand now why I rate 3DMark 2001 as a DX6 test? ;)

As for a GeForce 2 scoring near 10k in 3DM01 - are you sure? A 10k score was typical for a GeForce 4 Ti...
I've yet to see any 2GHz single-core processor with any non-TnL GPU come close to that.
You don't need a non-TnL GPU to prove my point. Just switch to SW TnL in 3DMark. For the majority of DX7 TnL cards, SW TnL on a 2GHz+ CPU will score slightly better in the 3DMark score. The 8-lights test score will be about twice as high with SW TnL.

The real performance advantage of the GF2 wasn't hidden in the TnL engine, but in the 4x2 configuration. The competition was 4x1, 2x2, or 2x3 - the GF2 simply offered almost double the fill rate...
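
To make that concrete, here's a rough sketch of the pipelines-times-TMUs math; the 200 MHz clock is a hypothetical equal value for all layouts, purely to compare configurations rather than actual products:

Code:
# Illustrative fill-rate math: pixel fill = pipelines x clock,
# texel fill = pipelines x TMUs per pipe x clock.
# The 200 MHz clock is assumed equal for all layouts, just to compare configs.
def fill_rates(pipes, tmus_per_pipe, clock_mhz=200):
    pixel = pipes * clock_mhz                  # Mpixels/s
    texel = pipes * tmus_per_pipe * clock_mhz  # Mtexels/s
    return pixel, texel

for name, cfg in {"4x2": (4, 2), "4x1": (4, 1), "2x3": (2, 3), "2x2": (2, 2)}.items():
    print(name, fill_rates(*cfg))

At equal clocks, the 4x2 layout has double the pixel fill of the 2-pipe designs and roughly double the texel fill of 4x1/2x2, which is the gap being described.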
 
I think you have to see physics much like shadows.
When the first games with dynamic shadows arrived (eg Doom3), the effect was VERY expensive, and didn't do much for gameplay itself.
But they did make the game look nicer and more realistic, and now all games have it, and people take the performance hit for granted.

I'd thank you but the forums don't use thanks.

Oh wait! 'Thanks'

:D

US
 
3DMark 2001 is called a DX8 test, but it doesn't test real DX8 capabilities at all:

You can run tests 1-3 in full quality on any DX6-compatible graphics card. No effect will be missing. The only advantage of a DX7/8 graphics card in these tests is hardware-accelerated geometry.

Test 4 uses PS1.1 on the lake surface, which is shown for 15-20% of the testing time - that's the only DX8-exclusive effect, which can reflect DX8 performance in the score.

Score is calculated via this formula: (total low-detail FPS * 10) + (total high-detail FPS + nature FPS) * 20

Here are the results of a DX8 graphics card: (107,1 + 98,6 + 103,2)*10 + (41,4 + 67,3 + 46,9 + 29,4)*20 = 6789 3D Marks

The last value (29,4) is the framerate in the Nature test. Imagine that the graphics card were so crappy at pixel shading that the performance in the PS/lake scenes would be zero. We know that the lake scenes take about 18% of the test time, so it's quite easy to work out what the framerate would be: 29,4*0,82 = 24,1 FPS

If I use the 3DMark formula, the graphics card would score 6683 3D Marks. Well, this "DX8 benchmark" shows a 1,5% difference between a fast DX8 graphics card and a graphics card with zero DX8 performance.

Do you understand now why I rate 3DMark 2001 as a DX6 test? ;)

As for a GeForce 2 scoring near 10k in 3DM01 - are you sure? A 10k score was typical for a GeForce 4 Ti...

You don't need a non-TnL GPU to prove my point. Just switch to SW TnL in 3DMark. For the majority of DX7 TnL cards, SW TnL on a 2GHz+ CPU will score slightly better in the 3DMark score. The 8-lights test score will be about twice as high with SW TnL.

The real performance advantage of the GF2 wasn't hidden in the TnL engine, but in the 4x2 configuration. The competition was 4x1, 2x2, or 2x3 - the GF2 simply offered almost double the fill rate...


I'm sorry, but I disagree with you, and for fun and giggles I will put together a P4 2.8GHz HT FSB800 machine, use a GF4/3 or 2MX (depending on what I can find stashed away), and post the numbers from 2001. And I guarantee that SW T&L will not be faster than hardware, except maybe for the 2MX.
 
Thanks for the responses on physics on the upcoming GPUs.

DX11 is supposed to have some physics implementation, and with the new GPUs being a lot more powerful than the current crop, it had me thinking and wondering whether physics could be implemented with little or no performance loss in fps.

Looking at the PhysX Sacred 2 patch (youtube video of the differences here) has me thinking that physics on GPUs will be a really good thing.

The amount of memory and bandwidth that the new GPUs will have, with faster and more efficient cores and shaders, should help with getting more and better physics in games, or at least that's my hope.

As mentioned, fluid and cloth physics should get a boost imo (being easier to simulate); environmental destruction physics, though, is a bit more taxing.

Still, all this would be nice if it gets the support it requires, whether through PhysX, Havok Physics or further implementations in DirectX from MS.

For me the most important physics improvement: HAIR. When will we have proper hair physics?
 
Given that it was filed only after G80's launch (and thus hasn't been issued yet), I doubt this particular assumption.
The provisional filing date was a year earlier. I'm not even sure what value there is in a comparison of patent application filing date and launch date for a technology.

Jawed
 
Normally, you file your patent as soon as you're done with your work and don't wait 'til all the other execution stages + marketing are done as well.

But since the provisional filing was a year earlier, which I did not notice, this is also moot.
 
Nvidia's G300 has taped out. It is reportedly running well at the A1 stepping.
The GDDR5 memory it uses clocks higher than 1,000 MHz, so you can expect a bandwidth higher than 256 GB/s.

Source: Hardware-Infos
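
As a quick sanity check of that figure, assuming a 512-bit bus (my assumption, not stated in the news item):

Code:
# Back-of-the-envelope GDDR5 bandwidth check.
# Assumption (not from the source): a 512-bit memory bus.
def gddr5_bandwidth_gbps(command_clock_mhz, bus_width_bits=512):
    # GDDR5 moves 4 bits per pin per command clock.
    bits_per_second = bus_width_bits * command_clock_mhz * 1e6 * 4
    return bits_per_second / 8 / 1e9  # GB/s

print(gddr5_bandwidth_gbps(1000))  # 256.0 GB/s; clocks above 1,000 MHz exceed that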
 
That (doubling bandwidth to ~280ish GByte/sec.) would IMO only be necessary if they've really decided to ditch the FF-ROPs (thus also removing quite a bit of compression/decompression hardware) and are doing all this stuff in the shader ALUs.

If I am not mistaken, the scheduler/scoreboarding stuff could also be simplified quite a lot with this step, since each pixel/thread is effectively "fire and forget", once it's left for the shader core. If there's geometry stuff to be done, it can be re-queued from VRAM.
 
That (doubling bandwidth to ~280ish GByte/sec.) would IMO only be necessary if they've really decided to ditch the FF-ROPs (thus also removing quite a bit of compression/decompression hardware) and are doing all this stuff in the shader ALUs.

If I am not mistaken, the scheduler/scoreboarding stuff could also be simplified quite a lot with this step, since each pixel/thread is effectively "fire and forget", once it's left for the shader core. If there's geometry stuff to be done, it can be re-queued from VRAM.

Noooo, R600 all over again, noooo!

Seriously, if that were the case, I hope they have a real shader-based AA solution this time, or what amounts to the same thing: lots of flops!
 
That (doubling bandwidth to ~280ish GByte/sec.) would IMO only be necessary if they've really decided to ditch the FF-ROPs (thus also removing quite a bit of compression/decompression hardware)
I know NVIDIA's design decisions haven't always impressed everyone lately, but I hope you're not suggesting they replaced all their engineers with drunk monkeys?
 
@Love_in_Rio: First of all, I think it could make a difference whether you plan your architecture around this "feature"/"economization" from the start or have to bolt it on afterwards.

Second: please look at what Edge-Detect AA costs you on an HD 4890. I've just had time to run Deep Freeze from 3DMark 06 (at least it uses HDR rendering) at 1680x1050:

1x MSAA: 72,2 Fps
4x MSAA: 53,2 Fps
8x MSAA: 42,3 Fps
4x & EDAA: 47,3 Fps

Nice, isn't it?
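
If it helps to read those numbers as frame times rather than fps, here's a trivial conversion (the fps values are the ones above):

Code:
# Convert the HD 4890 results above from fps to per-frame cost in milliseconds.
results_fps = {"1x MSAA": 72.2, "4x MSAA": 53.2, "8x MSAA": 42.3, "4x + EDAA": 47.3}
base_ms = 1000.0 / results_fps["1x MSAA"]
for mode, fps in results_fps.items():
    frame_ms = 1000.0 / fps
    print(f"{mode}: {frame_ms:.1f} ms/frame (+{frame_ms - base_ms:.1f} ms vs. 1x)")

So the edge-detect filter adds roughly 2-2.5 ms on top of plain 4x MSAA in this scene.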

@Arun:
At least shader-based AA seems feasible IMO. What else would you suggest one could need that amount of bandwidth for? We're talking about doubling again! If it's at all true, that is.
 


Would GF2/4MXs be fine by you then? It's not like the T&L engine stopped being fixed-function on the 3/4s. And you claimed SW T&L on a 2GHz+ proc would be faster than on the majority of DX7-capable hardware. GF3/4s are capable of DX7, or did they stop supporting it when they became DX8-capable? Something tells me they didn't.
 