NV33 Rumours, if anyone is interested

mczak said:
I could be wrong, but AFAIK texture caches are tiny compared to the cache sizes CPUs have. Around 3/4 of all transistors on a CPU are for caches, but the fraction is much lower on GPUs. I don't have an estimate, though - public information about GPU internals is not very detailed.

Internal cache on R200 is 2K. Double that for RV250/M9.

MuFu.
 
Does this mean NV31 is not DX9?

Won't ATI's RV350 also be used for mobile, just like RV250 and M9, RV200 and M7, RV100 and M6?

I would bet ATI is more than ready for nVidia. They will not give up market share easily, and nVidia has never met its power expectations; ATI kills them here.
 
SpellSinger said:
Does this mean NV31 is not DX9?

NV31 is fully DX9 compliant (in nVidia's eyes anyway). NV34 is "DX9-compatible", whatever that means...

Won't ATI's RV350 also be used for mobile, just like RV250 and M9, RV200 and M7, RV100 and M6?

Yeah - it is virtually identical to M10, although the latter has some pretty eyebrow-raising, mobile-specific technology.

MuFu.
 
MuFu said:
Internal cache on R200 is 2K. Double that for RV250/M9.

MuFu.

Oh, now I finally understand why the 8500 was so darn slow compared to the GF4.
The GF4 is 40% cache, according to publicly released nVidia information. That's obviously a LOT more than 2K. A lot, lot more. It's even a lot more than 4K...


Uttar
 
40% cache?!

The RV250 figure is from an internal document. I presume it refers to texture cache only.

MuFu.
 
Oh, texture cache only? Then that's probably about the same as the GF4.

But yes, with all cache counted, the GF4 is 40% cache & 60% logic.
I could have my numbers off by 5% (it's been a while since I saw nVidia state it), but probably not more.
BTW, those figures are for an NV25. I should have been more precise - it's most likely quite different for a GF4 MX.

Texture cache is really just a part of the overall cache. Here are several other caches used in a GPU:
- Vertex Cache, Pixel Cache & Primitive Cache
- Shader Caches ( that's temporary registers, instructions, ... )
- AGP cache (not sure that one really exists - very few sources mention it, and if it does, it's probably fairly small; I'd love more info about it)

A good reason many people don't see where all this cache goes is that they don't consider the shader caches. Many would tend to count it all as logic. But it isn't :)

It wouldn't surprise me if that shader cache were actually responsible for a good part of the NV30's transistor count increase over the R300. 1024 PS instructions gotta cost a lot...
IMO, nVidia should have limited itself to about 512 instructions in the NV30; 1024 was kinda overkill... The R300 limit of 96 seems too little, however (even Carmack says he has already crossed that border several times when experimenting with stuff!)
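To put a rough number on that speculation - both inputs below are assumptions, not published figures (~40-bit instruction words, going by the ~45 bits the B3D NV30/R300 article quotes for the VS, and ~6 transistors per SRAM bit):

    # Back-of-the-envelope sketch: on-chip storage for a PS instruction
    # store. Both inputs are assumptions: ~40-bit instruction words and
    # ~6 transistors per SRAM bit.
    BITS_PER_INSTR = 40
    TRANSISTORS_PER_BIT = 6

    for limit in (96, 512, 1024):  # R300 / a middle ground / NV30
        bits = limit * BITS_PER_INSTR
        print(f"{limit:4d} instructions: {bits / 8 / 1024:.1f} KiB, "
              f"~{bits * TRANSISTORS_PER_BIT / 1000:.0f}K transistors per copy")

Even the full 1024-entry store only comes out at a few hundred K transistors per copy under those guesses, so the store alone can't be the whole story - per-pipe duplication and the rest of the shader state would have to make up the difference.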

Speculation: I think a good part of the cost of programmable architectures such as the GF3 is the cache, because when it's programmable, you've got to cache what to do next, too.
I'd love to know the NV30 & NV17 cache ratios, so we could see the real effect of programmability.


Uttar

P.S. : I can already imagine people wondering why there isn't more texture cache...
Well, the reason is simple. You know pretty much for *sure* that you ain't gonna use the same texture info on the other side of the triangle, and keeping useless info isn't optimal.
Much larger texture cache wouldn't provide a performance benefit AFAIK.
 
Re: Instructions

Heathen said:
Thought the R300 limit was 160 instructions?

Oopsy, my mistake. 96 is for PS 2.0, and the R300 is very slightly above spec (it's 160, as you say).

But it's slightly more complex than that, too.
The PS 2.0 spec divides the instruction limit into texture & arithmetic instructions, requiring 32 and 64 respectively.

PS 3.0 & the NV30 put everything in one huge pool.

The R300, however, divides it further. From http://www.beyond3d.com/articles/nv30r300/index.php?p=6#ppp

R300 has 32 texture instructions, and 64 ALU instructions each for scalar and vector. R300 can issue instructions from each of the instruction sets each cycle, and consequently execute 3 instructions per cycle. A nice design!
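To see what that co-issue buys in the ideal case, here's a hedged little sketch (hypothetical helper - it assumes perfect scheduling and no dependent instructions, which real shaders won't give you):

    # Hypothetical ideal-case cycle count for R300-style co-issue: one
    # texture, one vector ALU and one scalar ALU instruction per cycle.
    # Assumes perfect scheduling and no dependencies between the ops.
    def r300_ideal_cycles(tex_ops, vec_ops, scalar_ops):
        return max(tex_ops, vec_ops, scalar_ops)

    # e.g. 16 texture + 40 vector + 20 scalar ops: at best 40 cycles
    # here, versus 76 on a chip issuing one instruction per cycle.
    print(r300_ideal_cycles(16, 40, 20))  # -> 40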

Sorry for the mistake. Anyway, Carmack was referring to the R300 instruction limit, so I guess he was talking of 160 instructions.


Uttar
 
Re: Instructions

Uttar said:
Sorry for the mistake. Anyway, Carmack was referring to the R300 instruction limit, so I guess he was talking of 160 instructions.

I guess he's waiting for XXXX with unlimited instructions... ;)
 
Re: Instructions

Ante P said:
I guess he's waiting for XXXX with unlimited instructions... ;)
Eh, don't we all want that in secret? :D

I don't see him saying he crossed the 1024 limit of the NV30, however.

Uttar
 
Re: Instructions

Heathen said:
Thought the R300 limit was 160 instructions?
It's 32 texture address ops, 64 vector ops and 64 scalar ops in parallel, but this ability to execute one vector op and one scalar op in parallel is not exposed in D3D, IIRC.

I remember reading that Glaze3D was supposed to have 24 KiB of texture cache: 16 KiB for even mip levels and lightmaps, and 8 KiB for odd mip levels (or something like that).

2 KiB is really a bit small for a chip that supports 6 textures per pass.

btw, GFFX stores PS code in video memory AFAIK. There are no jumps, so access is predictable.
 
Re: Instructions

Uttar said:
I don't see him saying he crossed the 1024 limit of the NV30, however.

He probably got tired of waiting for the NV30 to finish executing it... ;)
 
Re: Instructions

Xmas said:
2 KiB is really a bit small for a chip that supports 6 textures per pass.

I thought that too - since they doubled the cache going from R200 to RV250, perhaps that figure refers to the allocation per mapping unit or per pipe.

MuFu
 
Re: Instructions

Xmas said:
2 KiB is really a bit small for a chip that supports 6 textures per pass.

btw, GFFX stores PS code in video memory AFAIK. There are no jumps, so access is predictable.

Hmm, well yeah, with 6 textures at 32 bits each, it is too small.
2048 / 6 = 341 bytes per texture.

But then, there are 4 pipes...

That's ~85 bytes per texture per pipe... Assuming 32-bit texels (4 bytes each), that's ~21 pixels.

Now, that seems too little.

Triangles are NOT processed one full scanline at a time: several pixels are processed on one line, then on the next line, then you move to the right and treat the next part of the same two lines.
Otherwise, you'd need a texture cache able to hold two full lines. With such a scheme, you save a lot of transistors and barely lose any memory bandwidth.
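Purely as an illustration - the real walk order isn't public, and chips differ - the idea looks something like this:

    # Illustrative only: visit pixels in 2-row-high, N-pixel-wide chunks
    # instead of full scanlines, so the texture cache only needs to hold
    # the texels touched by the current chunk, not two whole lines.
    def chunked_traversal(width, height, chunk_w, rows=2):
        for y0 in range(0, height, rows):
            for x0 in range(0, width, chunk_w):
                for y in range(y0, min(y0 + rows, height)):
                    for x in range(x0, min(x0 + chunk_w, width)):
                        yield x, y

    print(list(chunked_traversal(8, 4, chunk_w=4)))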

But still, that would mean only about 10 pixels on a line being processed at a time. That seems insufficient.

But then again, in most cases, you won't use 6 textures.

So, well, 30 pixels on a line when using 2 textures seems sufficient. It should give decent efficiency.
And if you're not using 32-bit textures but something like low-quality DXTC at 8 bits per texel, it would be 120 pixels. That's nearly too much! :) Many games use 3 or 4 textures, so with DXTC it should be "okay".
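For reference, the arithmetic behind those estimates, with the same assumptions as above (2048 bytes shared across 4 pipes, two rows in flight):

    # Reproduces the rough numbers above: texels a 2 KiB texture cache
    # holds per texture per pipe, and thus pixels per row with two rows
    # in flight. Assumptions: 2048 bytes, 4 pipes, 2 rows in flight.
    CACHE_BYTES, PIPES, ROWS = 2048, 4, 2

    for textures, bytes_per_texel in [(6, 4), (2, 4), (2, 1)]:
        texels = CACHE_BYTES / textures / PIPES / bytes_per_texel
        print(f"{textures} textures @ {bytes_per_texel * 8} bpp: "
              f"{texels:.0f} texels -> ~{texels / ROWS:.0f} pixels per row")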

Something I wonder, too, is whether the hardware can automatically decide how many pixels on a line to process for maximum efficiency, based on texture cache size. That would matter a lot more. If it can't - which is actually quite likely - efficiency might get really bad (near zero) when using 6 textures...

But even setting that problem aside, it would be "okay" - not much better.
If I didn't get any of my calculations wrong, 4KB for the four pipes might very well be fine.
But then again, more could always give a slight boost to performance. The real question is whether that boost is enough to justify the transistor count increase :)

Another interesting factor is the decreasing size of triangles. I don't think texture cache efficiency is good (if not outright nil) when keeping texture info from another triangle. So, with polygons shrinking, could something like 20 pixels/line be sufficient in most situations?


Uttar

EDIT: Sounds like you are right: the GFFX *does* store all of its instructions in video memory. Sounds like that's a good reason for NV31 & NV34 to support 1024 instructions too.
This would indeed prevent dynamic branching from working effectively in the PS, I guess. But could static branching still work well in the PS with that scheme? I'd guess it could, but I might be wrong.
But GFFX temp registers are still stored in cache, as are several other things used in shaders. And those things are more expensive than on the R300, because they're FP32 (yes, although FP32 performance is bad and nVidia is trying to make DX9 drivers use FP16 everywhere, it sounds like they designed everything with FP32 in mind - performance probably isn't on par with their expectations...)

EDIT 2: After thinking about it some more, I just don't understand how putting all of that in video memory makes sense...
Let's imagine each instruction is 45 bits, as in the case of the VS according to the B3D article. Or rather, let's imagine it is 40 bits, just to be conservative.
Imagine an average of 20 instructions/pixel at 1600x1200, all at 60FPS.
That's nearly 12GB/s...

Now, I just don't quite understand how that makes sense. There's gotta be a misunderstanding somewhere. Unless nVidia found a way to defy mathematics, too! :D Woah, that's gotta need serious driver tuning.
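For the record, here's that estimate spelled out (all inputs are the assumptions above - nothing here is a measured figure):

    # The instruction-fetch bandwidth estimate from above, spelled out.
    bits_per_instr = 40      # conservative guess; B3D quotes ~45 for the VS
    instr_per_pixel = 20     # assumed average shader length
    pixels = 1600 * 1200
    fps = 60

    gb_per_sec = bits_per_instr / 8 * instr_per_pixel * pixels * fps / 1e9
    print(f"{gb_per_sec:.1f} GB/s")  # -> 11.5 GB/s, i.e. roughly 12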
 
Re: Instructions

Uttar said:
And if you're not using 32-bit textures but something like low-quality DXTC
As the resident texture compression advocate, let me remind everyone that DXTC is not 'low quality'.

In the vast majority of cases DXTC is visually indistinguishable from 32-bit textures, and in most cases the ability to have more textures is far better for image quality than any perceived degradation from conversion to DXTC - as long as the DXTC is applied intelligently, to the right textures (about 80-90% of them will be highly compressible).

Put it this way: if you had the choice of 128M of 32-bit textures, or 128M of 20% 32-bit and 80% DXTC textures, there's no contest as to which would give better image quality.
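To illustrate with made-up but plausible numbers - assuming the DXTC portion is DXT1, which is 8:1 versus 32bpp (DXT3/5 would only be 4:1):

    # Effective 32-bit-equivalent capacity of a 128 MB texture pool that
    # is 20% uncompressed 32-bit and 80% DXTC. Assumes DXT1's 8:1 ratio
    # versus 32bpp textures.
    pool_mb = 128
    plain_frac, dxtc_frac, dxt1_ratio = 0.2, 0.8, 8

    equivalent = pool_mb * plain_frac + pool_mb * dxtc_frac * dxt1_ratio
    print(f"~{equivalent:.0f} MB of 32-bit-equivalent textures")  # ~845 MB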
 
Re: Instructions

Dio said:
As the resident texture compression advocate, let me remind everyone that DXTC is not 'low quality'.

In the vast majority of cases DXTC is visually indistinguishable from 32-bit textures, and in most cases the ability to have more textures is far better for image quality than any perceived degradation from conversion to DXTC - as long as the DXTC is applied intelligently, to the right textures (about 80-90% of them will be highly compressible).

Put it this way: if you had the choice of 128M of 32-bit textures, or 128M of 20% 32-bit and 80% DXTC textures, there's no contest as to which would give better image quality.

Okay, okay... Let me rephrase that.
"And in the case of not using 32-bit textures but something like low per-pixel quality DXTC"
IMO, DXTC has bad per-pixel quality. But where it shines is that it enables you to use bigger textures.
And we're talking pixels here, you know :)


Uttar
 
:) I still disagree that DXTC is in any way 'low quality'.

When I was first plugging S3TC, our open challenge was for anyone to bring an image in; we'd compress it and then play spot-the-difference, with a pint bet that they couldn't. I won a lot of beer from that.

I even won on the 'Tank Girl on a Mandelbrot background' image, which I was really quite worried about when I first saw it.

Far too many people have only seen the results from low-quality DXTC compressors. The compressor is key. If you've got a good one, then once you've applied trilinear filtering you'll never spot the difference except on pathological cases (like the sky in Quake3).
 
Dio said:
Far too many people have only seen the results from low-quality DXTC compressors. The compressor is key. If you've got a good one, then once you've applied trilinear filtering you'll never spot the difference except on pathological cases (like the sky in Quake3).
Interpolating in 24-bit color is also very important... ;)
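For the curious, here's a minimal sketch of the DXT1 colour path (ignoring the 1-bit-alpha mode): the two in-between palette colours are interpolated from the RGB565 endpoints, and doing that maths at 8 bits per channel rather than in 16-bit colour is exactly the point.

    # Minimal DXT1 colour-palette decode, with the interpolation done at
    # 8 bits per channel (the "24-bit interpolation" mentioned above).
    # Ignores the c0 <= c1 mode that encodes 1-bit alpha.
    def expand565(c):
        r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
        return (r * 255 // 31, g * 255 // 63, b * 255 // 31)

    def dxt1_palette(c0, c1):
        a, b = expand565(c0), expand565(c1)
        blend = lambda num: tuple((x * (3 - num) + y * num) // 3
                                  for x, y in zip(a, b))
        return [a, b, blend(1), blend(2)]

    print(dxt1_palette(0xF800, 0x001F))  # pure red & pure blue endpoints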
 