Parhelia-512 texturing compared to R8500 & GF4

Now, I personally don't have any detailed understanding of these matters. However, I would really like to see some discussion of this from you people here "in the know".

The P-512 has 4 pixel pipes and 4 TMUs per pipe, yet from what I read it can only do 4 textures per pass. The GF4 can do 4 per pass. The 8500 can do 6 per pass.

Is the advantage of having 4 TMUs per pipe that 4 pixels can be processed with 4 textures each in parallel, basically in a single clock cycle, where the 8500 does a loopback? Must the GF4 then use its pipes as 2x2 to get 4 textures on 2 pixels in a single cycle? I am only asking as I am trying to understand the advantage the P-512 has.

I know my analysis is probably oversimplistic. Please go into as much detail as possible, as I (and others) would love to get some better insight on the matter.

In short... What is the P-512's pixel pipe design advantage?
 
Parhelia has 4 TMUs/pipeline, while GF3/4/Radeon8500 have 2. So it doesn't need additional cycles for the other textures when 3 or 4 textures are used simultaneously. Whether it's worth it given the additional die space and memory bandwidth requirement is another question though.
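
To put numbers on that, here is a minimal Python sketch. It assumes the simplest possible model: each TMU applies one texture layer per clock, and a pipe loops back for another cycle whenever a pixel needs more layers than the pipe has TMUs. The function name is made up for illustration, and real hardware has more going on than this.

```python
import math

def cycles_per_pixel(num_textures: int, tmus_per_pipe: int) -> int:
    """Clock cycles one pipe spends texturing one pixel.

    Assumed model: one texture layer per TMU per clock, with a
    loopback cycle for every further group of layers.
    """
    if num_textures < 1 or tmus_per_pipe < 1:
        raise ValueError("both arguments must be at least 1")
    return math.ceil(num_textures / tmus_per_pipe)

# Compare a 2-TMU pipe (GF3/4, Radeon 8500) with a 4-TMU pipe (Parhelia).
for layers in range(1, 5):
    print(f"{layers} textures: "
          f"2 TMUs -> {cycles_per_pixel(layers, 2)} cycle(s), "
          f"4 TMUs -> {cycles_per_pixel(layers, 4)} cycle(s)")
```

Under this model the 4-TMU pipe only pulls ahead at 3 or 4 textures per pixel; at 1 or 2 both designs take a single cycle.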
 
Bandwidth is one issue: the more TMUs you try to keep fed with data, the more likely it is that one of the bits of data is still missing, causing a full pipeline stall.

Also remember the potential "unused" resources when only 1, 2 or 3 textures are used in a pass. Remember that alpha textures are almost always single textured (causing 75% of your hardware units to sit there idle). Also, many game developers still target the lowest common denominator, which is 2 TMUs these days. Fewer TMUs per pipe but more pipes with a loopback function would have been better, but would require more die space and be harder to keep up and running (the pixel output number is higher). Then again, they theoretically have double the bandwidth of a GF4, so "in theory" they should be able to handle twice the TMUs.
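
The 75% figure falls out of simple counting. A rough Python sketch, assuming one texture layer per TMU per clock and that TMUs left over in a cycle cannot be lent to another pixel (both are simplifications, and the function name is hypothetical):

```python
def idle_tmu_fraction(num_textures: int, tmus_per_pipe: int) -> float:
    """Fraction of TMU slots in a pipe left unused while texturing one pixel.

    Assumed model: one layer per TMU per clock, and spare TMUs in a
    cycle cannot be reassigned to a different pixel.
    """
    cycles = -(-num_textures // tmus_per_pipe)  # ceiling division
    slots = cycles * tmus_per_pipe              # TMU slots spent on the pixel
    return (slots - num_textures) / slots

for layers in (1, 2, 3, 4):
    print(f"{layers} layer(s): 4-TMU pipe idle "
          f"{idle_tmu_fraction(layers, 4):.0%}, "
          f"2-TMU pipe idle {idle_tmu_fraction(layers, 2):.0%}")
```

A single-textured pixel leaves 3 of 4 TMUs idle on a 4-TMU pipe, against 1 of 2 on a 2-TMU pipe.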

Many TMUs per pipe might also help to execute anisotropic filtering.
 
I'm not sure if I understand it correctly, but to me it seems that the Parhelia-512 can use 8 textures per pass when it combines the output of two pipelines (maybe they hope to support DX9 that way).


Whether it's worth it given the additional die space and memory bandwidth requirement is another question though.

IMHO the 4 TPUs per pipeline are a waste, because you need more than 350 x 4 x 4 x 4 x 32 bit x 0.33 / 8 = ~30 GB/sec of bandwidth for the texturing alone to feed all 4 TPUs per pipeline (with 32-bit textures). Only when you use trilinear filtering or better, or when you use 16-bit textures or S3TC, could you feed the pipelines, and so IMHO the 4 TPUs seem to be a _big_ waste.
 
Well, in games like Doom3 it clearly won't be a waste... Right? I guess its greatest benefit will be less slowdown in really complex textured games.

Also, I have just read that the P-512 uses a 2mb on-die texture cache... If that's true, wow...

It appears that Bitboys are going to get beaten to market with on-chip cache.
 
Doom3 might use more multi-pass than multitexture; it also uses massive stencil testing, which places the potential bottleneck at the per-pixel level rather than the per-texel level.
 
Hellbinder said:
Well, in games like Doom3 it clearly won't be a waste... Right? I guess its greatest benefit will be less slowdown in really complex textured games.

That's how I see it too.

Also, I have just read that the P-512 uses a 2mb on-die texture cache... If that's true, wow...

It appears that Bitboys are going to get beaten to market with on-chip cache.

Knowing that 3D chips have had on-chip texture caches (in the kB range) for years, you must mean 2 MB (hey, bytes or bits?) of eDRAM as a texture cache; either as a separate L2 cache block, or embedded and built in to the TMU logic with gobs of bandwidth.

Kewl. Where did you find that info?

Yeah, Bitboys were narrowly beaten to the market. ROTFLMAO. ;)
 
I think DaveB hit on an important point here. With Parhelia, we're probably looking at a case where you are going to start seeing the greatest relative benefit when comparing AA and/or advanced filtering modes.

In other words, comparing, say, a GeForce4 Ti vs. Parhelia at 1600x1200x32 with no AA and bilinear filtering, the performance delta might not be that impressive. But turn on trilinear and/or aniso, plus edge AA, and then we'll probably start to see Parhelia's significant value. Just a hunch. ;)
 
The number of TMUs per pipe is not going to help with FSAA, it might help with better filtering but... below is an extract (translated) from www.chip.de (they took their article down again for obvious reasons) :

(altavista translate)

In the following one again the entire feature list of the Parhelia-512-GPU:

Chip feature

Chip technology: 512 bits
Memory interface: 256 bits
Memory range: around 20.8 GByte/sec (with a storing act of effectively 650 MHz)

3D-Features
Vertex Shader units: 4
Pixel pipelines: 4
Texture stages per pipeline: 4
dual textured, trilinear filtered pixels per clock: 4
Quadruple pixel per clock textured: 4

Pixel Shader per pipeline: 5
Entire Shader stages: 36 ((4+5)*4)

Vertex Shader version: 2.0 (DirectX9)
Pixel Shader version: 1.3 (DirectX8)
Anti- Aliasing quality: 16x/4x super Sampling
Hardware DISPLACEMENT Mapping (DirectX 9)
Depth adaptive Tessellation
N-Patch-Tessellation

2D-Features

Frame Buffer: 10 bits per color component
64 super Sampling Texture Filtering
TripleHead: Surround Gaming
Text anti- Aliasing with gamma correction
UltraSharp display technology
10-Bit-DVD-Playback
Two 10-Bit/400-MHz-RAMDACs
Maximum dissolution (similar): 2,048 x of 1,536 pixels @ 32 bits color
10-Bit-TV-Encoder

I have placed in bold the interesting bit (the "dual textured, trilinear filtered pixels per clock: 4" line). It seems like Matrox has to use 2 TMUs to execute trilinear (they cannot do 4 times 4-layer texturing with trilinear filtering enabled?). Would this not make the system just as fast/slow as a GF4, which has 4 pipes with 2 TMUs but can do trilinear with one TMU (IIRC, and always assuming the data needed to execute this is available)?

K-
 
Ahh, I suspected something like this from what was said at ixbt, but I didn't dare say it, so as not to sound too negative, until we had more info.
Ixbt said something about 64 texels as input to the TMUs, and that matches 4x4 bilinear filter.

But bilinear might actually be useful for some low-res textures where you don't expect to reach any but the highest mipmap. For instance for caustics, or a blurred environment map used for glossy lighting.

Otherwise yes, same performance as GF4, except the higher clock frequency.
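
The per-clock comparison can be sketched in a few lines of Python. Everything here leans on the thread's own assumptions: that Parhelia pairs two TMUs per trilinear layer (as the leaked feature list suggests) and that a GF4 TMU does trilinear on its own (the "IIRC" claim above). The clock values are placeholders as well: 350 MHz is the figure used earlier in this thread, not a confirmed spec, and 300 MHz is assumed for a GF4 Ti 4600.

```python
def trilinear_layers_per_clock(pipes: int, tmus_per_pipe: int,
                               tmus_per_trilinear_layer: int) -> int:
    """Trilinear-filtered texture layers the whole chip produces per clock."""
    return pipes * (tmus_per_pipe // tmus_per_trilinear_layer)

# Assumption: Parhelia needs 2 TMUs per trilinear layer, GF4 needs 1.
parhelia_per_clock = trilinear_layers_per_clock(4, 4, 2)
gf4_per_clock = trilinear_layers_per_clock(4, 2, 1)

# Assumed core clocks in MHz (see the lead-in; neither is confirmed here).
PARHELIA_MHZ = 350
GF4_MHZ = 300

print("Parhelia:", parhelia_per_clock, "layers/clock,",
      parhelia_per_clock * PARHELIA_MHZ, "M layers/s")
print("GF4:     ", gf4_per_clock, "layers/clock,",
      gf4_per_clock * GF4_MHZ, "M layers/s")
```

Both chips come out at 8 trilinear layers per clock in this model, so any difference in raw TMU throughput would come from clock speed alone.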
 
So it's the same speed in trilinear filtering as a GF4. However, from what I read, its FSAA performance (due to its method) will be much faster, with higher quality.

Interesting.....

It is going to be very interesting to see how the R300 and NV30 compare, and what methods they employ to accomplish filtering this time around. If the R300 rumors hold true, with 8 pipes and 4 TMUs per pipe... it may deliver some serious trilinear filtering/FSAA speed.
 
Otherwise yes, same performance as GF4, except the higher clock frequency.

Well, as far as fill rate is concerned, yes, "clock for clock" the same performance with trilinear. However, if the GF4 is bandwidth limited in most trilinear filtering cases, then I would expect the Parhelia to be faster clock-for-clock in the real world. I know nvidia should probably have an advantage in "efficiency" as far as bandwidth per pixel is concerned, but I doubt it would make up for a 2X increase in raw bandwidth.
 
mboeller said:
IMHO the 4 TPUs per pipeline are a waste, because you need more than 350 x 4 x 4 x 4 x 32 bit x 0.33 / 8 = ~30 GB/sec of bandwidth for the texturing alone to feed all 4 TPUs per pipeline (with 32-bit textures). Only when you use trilinear filtering or better, or when you use 16-bit textures or S3TC, could you feed the pipelines, and so IMHO the 4 TPUs seem to be a _big_ waste.

Why is it 350 x 4x4x4x32x0.33? Clock rate x no. of pipelines x no. of TMUs x ??? x color depth x ???

I just don't understand..
 
Tatchan said:
mboeller said:
IMHO the 4 TPUs per pipeline are a waste, because you need more than 350 x 4 x 4 x 4 x 32 bit x 0.33 / 8 = ~30 GB/sec of bandwidth for the texturing alone to feed all 4 TPUs per pipeline (with 32-bit textures). Only when you use trilinear filtering or better, or when you use 16-bit textures or S3TC, could you feed the pipelines, and so IMHO the 4 TPUs seem to be a _big_ waste.

Why is it 350 x 4x4x4x32x0.33? Clock rate x no. of pipelines x no. of TMUs x ??? x color depth x ???

I just don't understand..


350 = 350 MHz
x4 = 4 pipelines
x4 = 4 TMUs / pipeline
x4 = 4 texels / TMU (= bilinear filtering)
x 32 bit = 32-bit textures
x 0.33 = cache miss rate for bilinear filtering (could be a little higher or lower, depending on the texture cache, texture filtering and architecture of the chip; the 0.33 comes from a Kyro presentation)
/8 = converting Mbit/sec into MByte/sec
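
Spelled out as a small Python sketch, the breakdown above looks like this. The default values are simply the numbers from this post (the 350 MHz clock and the 0.33 miss rate are the poster's assumptions, not measured figures), and the function name is invented for illustration.

```python
def texture_bandwidth_gb_per_s(clock_mhz: float = 350,
                               pipes: int = 4,
                               tmus_per_pipe: int = 4,
                               texels_per_sample: int = 4,   # bilinear
                               bits_per_texel: int = 32,
                               cache_miss_rate: float = 0.33) -> float:
    """Texture fetch bandwidth needed to keep every TMU busy, in GB/s."""
    bits_per_second = (clock_mhz * 1e6 * pipes * tmus_per_pipe
                       * texels_per_sample * bits_per_texel
                       * cache_miss_rate)
    return bits_per_second / 8 / 1e9  # bits -> bytes -> GB

print(f"{texture_bandwidth_gb_per_s():.1f} GB/s")
```

With the defaults this works out to just under 30 GB/s, which is where the "~30 GB/sec" comes from.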
 
Only when you use trilinear filtering or better, or when you use 16-bit textures or S3TC, could you feed the pipelines, and so IMHO the 4 TPUs seem to be a _big_ waste.

Is it really a bad thing to design your architecture in today's environment to be "optimal" with "trilinear filtering or better"? If the card is bandwidth limited when using bilinear on non-compressed 32-bit textures, but becomes fill rate limited when using trilinear, 16-bit textures, or S3TC, then I'd say that's a decent fill-rate / bandwidth balance.
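
A rough way to check that balance in Python, reusing mboeller's formula and the leaked memory figures (256-bit bus at an effective 650 MHz). The 350 MHz core clock and the 0.33 miss rate are carried over from earlier in the thread as assumptions, S3TC is taken as DXT1 at 4 bits per texel, and frame buffer and Z traffic, which share the same bus, are ignored. Trilinear is left out because the simple formula does not say how its cache behaviour differs.

```python
def available_bandwidth_gb_per_s(bus_bits: int = 256,
                                 effective_mhz: float = 650) -> float:
    """Peak memory bandwidth in GB/s from bus width and effective clock."""
    return bus_bits / 8 * effective_mhz * 1e6 / 1e9

def required_texture_bandwidth_gb_per_s(bits_per_texel: int,
                                        clock_mhz: float = 350,
                                        pipes: int = 4,
                                        tmus_per_pipe: int = 4,
                                        texels_per_sample: int = 4,
                                        cache_miss_rate: float = 0.33) -> float:
    """Bilinear texture fetch bandwidth needed to feed every TMU, in GB/s."""
    return (clock_mhz * 1e6 * pipes * tmus_per_pipe * texels_per_sample
            * bits_per_texel * cache_miss_rate) / 8 / 1e9

# Bits per texel for each case; DXT1 is assumed for S3TC.
texture_formats = {"32-bit uncompressed": 32, "16-bit": 16, "S3TC (DXT1)": 4}

available = available_bandwidth_gb_per_s()
for name, bits in texture_formats.items():
    needed = required_texture_bandwidth_gb_per_s(bits)
    verdict = "bandwidth limited" if needed > available else "fill rate limited"
    print(f"{name}: needs {needed:.1f} of {available:.1f} GB/s -> {verdict}")
```

In this simplified model only the uncompressed 32-bit bilinear case asks for more than the bus can deliver.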
 
Hellbinder & Joe:
Yes, I meant that in a purely TMU performance sense. Other features and limitations will of course change the real life performance.
 