Old 25-Jun-2002, 15:34   #1
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default The utterly inefficient Parhelia (Let's do some math...)

Even though we found out that Parhelia will only have a 220 MHz clock speed, I was still expecting decent performance out of it. Let's take Quake3 at 1600x1200x32 for example.

Counting pixels, a GF4 Ti4600 can do 1.2 billion / (1600x1200) = 625 fps. Taking overdraw to be 3.3 (some STMicro dude said that, and it's about right), we're down to 190 fps. Because GF4 has some HSR, that will go back up to about 230 fps or so. Now what brings it down to the ~150 fps that it actually scores?

1. Texture bandwidth. In some areas of the screen, the textures can require upwards of 50 bits per pixel (trilinear base map at 1 texel/pixel + light map + cache inefficiencies), although mostly it is much less.

2. Color bandwidth. When doing alpha blending, that extra read (32 bits extra) can sometimes really put bandwidth requirements over the top.

3. Trilinear filtering. Closer pixels are magnifying the biggest mipmap, so it's the same as bilinear filtering, but after a certain point you'll need to get two bilinear samples per texture, requiring an extra clock cycle and reducing fillrate to 2 pixels per clock.

4. Drivers, CPU, and T&L. While you can argue that the test is no longer CPU limited at 1600x1200, it is close enough to the CPU limit of ~210 fps (as seen in lower resolutions) that there will be some moments when the CPU is holding the GF4 back. T&L is also minimal, as Quake 3 is hardly polygon intensive.

Overall, the GF4's fillrate is effectively 3 pixels per clock instead of 4+HSR, which is actually extremely efficient - better than any other card today, and it trounces that POS known as GF2.
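
The fill-rate arithmetic above can be sketched in a few lines of Python. This is only a rough check of the numbers in the post, not a performance model; the helper name `fillrate_fps` is made up for illustration, and HSR gains are ignored.

```python
def fillrate_fps(pixels_per_sec, width, height, overdraw=1.0):
    """Upper bound on frame rate if the card is purely fill-rate limited."""
    return pixels_per_sec / (width * height * overdraw)

# GF4 Ti4600 figures quoted in the post: 1.2 Gpixels/s at 1600x1200
gf4_no_overdraw = fillrate_fps(1.2e9, 1600, 1200)       # 625 fps
gf4_overdraw = fillrate_fps(1.2e9, 1600, 1200, 3.3)     # about 189 fps

# Effective pixels per clock implied by the ~150 fps actually measured,
# relative to the 4 pixels/clock peak (HSR gain not included)
effective_ppc = 4 * 150 / gf4_overdraw                   # a little over 3
```

The last line is where the "effectively 3 pixels per clock" figure comes from.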



Now let's look at Parhelia. At 220 MHz, its fillrate leads to a speed of 880 million / (1600 x 1200 x 3.3) = 139 fps. However, it has no excuses, as the above does not apply:

1,2: With a 256-bit bus and a much higher memory clock than core clock, Parhelia has more than twice the bandwidth per pixel of the GF4 (160 vs. 69 bits per pixel per clock). Very rarely would there be pixels requiring this much bandwidth.

3: With 4 texture units per pipe, Matrox has no excuse for extra cycles in trilinear filtering.

4: The Parhelia's score for 1024x768 is more than twice the score at 1600x1200 (151 vs 70 fps, or 2.2x). This is almost entirely due to the increase in pixels on the screen at 1600x1200 (2.4x), meaning that there are very few driver, CPU and T&L related bottlenecks.

Parhelia really should be getting close to 130 fps, but gets a truly pathetic 70 fps.
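
As a rough check on the figures above, here is a small Python sketch. The helper name is hypothetical. Parhelia's 275 MHz DDR memory clock is the one used later in the thread; the GF4 Ti4600 memory configuration (325 MHz DDR on a 128-bit bus) is an assumption that is not stated in the post, though it reproduces the 69 bits quoted.

```python
def bits_per_pixel_per_clock(mem_mhz, bus_bits, core_mhz, pipes, ddr=2):
    """Memory bits available for each pixel the core can emit per clock."""
    return mem_mhz * ddr * bus_bits / (core_mhz * pipes)

# Parhelia: 275 MHz DDR, 256-bit bus, 220 MHz core, 4 pipes
parhelia_bw = bits_per_pixel_per_clock(275, 256, 220, 4)   # 160.0
# GF4 Ti4600: memory clock and bus width are assumed, 300 MHz core, 4 pipes
gf4_bw = bits_per_pixel_per_clock(325, 128, 300, 4)        # about 69.3

# Fill-rate-limited frame rate at 1600x1200 with 3.3 overdraw
parhelia_fps = 880e6 / (1600 * 1200 * 3.3)                 # about 139

# Resolution scaling: measured fps ratio vs pixel-count ratio
fps_ratio = 151 / 70                                       # about 2.2
pixel_ratio = (1600 * 1200) / (1024 * 768)                 # about 2.4
```

The closer `fps_ratio` is to `pixel_ratio`, the more the score is governed by per-pixel work rather than by CPU, driver, or T&L overhead.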

Well, there you have it. A mathematical proof of how much the Parhelia blows.

Matrox's G400 was a very good card at the time it came out, but the Parhelia is a horrible effort considering they had 3 years since the G400 to focus on one chip, and other manufacturers put out 2 generations of cards in that time, with a third coming soon. They do not have a very talented hardware design team at all.
Mintmaster is offline  
Old 25-Jun-2002, 16:15   #2
JF_Aidan_Pryde
Member
 
Join Date: Feb 2002
Location: Sydney, Australia
Posts: 593
Send a message via MSN to JF_Aidan_Pryde
Default

I guess if the reviews so far have said anything, it's that NVIDIA and the Geforce4 are pretty darn optimized. All the best engineers are there, for one reason or another.
JF_Aidan_Pryde is offline  
Old 25-Jun-2002, 16:16   #3
mboeller
Member
 
Join Date: Feb 2002
Location: Germany
Posts: 845
Default

IMHO the math works a little bit differently.

Looking at Quake3 :
At a resolution of 1024x768 you need 786432 x 3 Texel (3Dfx-style) fillrate. The 3 Texel come from multitexturing and alphablending. On top you have an overdraw of around 1.2-1.5 (from memory only).

So you need an effective fillrate of 3538944 Texel (3DFx-Style; Overdraw 1.5) for every frame.

Bandwidth demand for this fillrate ( with 2pass rendering ) :

Textures :
3,5 Mio x 32bit x 4(bilinear filtering) *0,33 ( cache-missrate ) / 8 = ~19MB / Frame

Z-Buffer :
3,5 Mio x 32 x 2 x2(2passes) /8 = ~56 MB / Frame ( the GF4 can save up to 75% of this bandwidth )

Framebuffer :
3,5 Mio x 32 x2(2passes) /8 = ~28 MB /frame
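
The per-frame figures above can be reproduced with a short Python sketch. It only mirrors the arithmetic in this post. Reading the "x 2" in the Z-buffer line as one read plus one write is an assumption, and the variable names are made up for illustration.

```python
# Figures from the post: 1024x768, 3 texels per pixel, overdraw 1.5
texels = 1024 * 768 * 3 * 1.5                  # 3,538,944 per frame

BITS = 32                                      # 32-bit texels, Z and colour
BITS_PER_MB = 8 * 1e6                          # decimal megabytes

# 4 texel fetches for bilinear filtering, 33% cache miss rate
texture_mb = texels * BITS * 4 * 0.33 / BITS_PER_MB
# Z read + Z write (assumed meaning of "x 2"), 2 passes
z_mb = texels * BITS * 2 * 2 / BITS_PER_MB
# Colour write, 2 passes
frame_mb = texels * BITS * 2 / BITS_PER_MB
```

With the unrounded texel count these come out near 19, 57 and 28 MB per frame, in line with the rounded values in the post.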



So the GF4 4200 can do theoretically around :

1000 / ( 3,5 x4/3; cause of the pipeline ) = 214fps

Bandwidth demand :
214 x ( 19+0,25*56+28 ) = 13054 MB/sec

The GF4 4200 has only 8GB/sec bandwidth, and so it reaches only around 60% of the theoretical figure. This gives 214 x 0,6 = 130 fps

As you see this figure is too low, but this is because this is only a rough estimate without taking into account 8bit alpha-textures, S3TC, real cache-missrate etc...
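
A minimal Python sketch of this estimate follows, under the post's own assumptions (3.5 Mtexels per frame, the 4/3 pipeline factor, Z traffic cut to 25% on the GF4). The function name `estimate_fps` is hypothetical.

```python
def estimate_fps(fill_mpix, mb_per_frame, bandwidth_mb_s, mtexels=3.5):
    """Fill-rate peak scaled down by the share of bandwidth actually available."""
    peak = fill_mpix / (mtexels * 4 / 3)        # the post's 4/3 pipeline factor
    demand = peak * mb_per_frame                # MB/s needed to sustain the peak
    scale = min(1.0, bandwidth_mb_s / demand)   # never above the fill-rate peak
    return peak, demand, peak * scale

# GF4 4200: 1000 Mpixels/s, 8 GB/s, Z-buffer traffic reduced to 25%
peak, demand, fps = estimate_fps(1000, 19 + 0.25 * 56 + 28, 8000)
# peak is about 214 fps, demand about 13,000 MB/s, fps about 131
```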


The Parhelia on the other side would have the following speed :

880 / ( 3,5 x4/3; cause of the pipeline ) = 188 fps

Bandwidth demand :
188 x ( 19+56+28 ) = 19364 MB /sec
Theoretically the bandwidth demand for the Z-buffer and the framebuffer would be only half the amount given, because the Parhelia has 4 TMUs per pipeline, but most games are written with 2 TMUs per pipeline in mind and so the extra units sit idle (but can be used for trilinear filtering or anisotropic filtering).

real bandwidth : 275x256x2 /8 = 17,6 GB/sec

So the speed of the Parhelia should be 188 x (17600/19364) = 170fps.

The real speed should be even higher as we see with the GF4 4200.
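
The same arithmetic for Parhelia, as a quick Python sketch using only numbers from this post (variable names are illustrative):

```python
peak_fps = 880 / (3.5 * 4 / 3)                  # about 188.6 fps from fill rate
demand = peak_fps * (19 + 56 + 28)              # about 19,400 MB/s at that rate
available = 275 * 256 * 2 / 8                   # 17,600 MB/s (275 MHz DDR, 256-bit)

# Scale the fill-rate peak by the fraction of demand the bus can supply
fps = peak_fps * min(1.0, available / demand)   # about 171 fps
```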


So in the end, I agree the Parhelia is very inefficient at the moment; BUT this can be corrected with drivers more or less (unless Matrox has built a really inefficient chip).
mboeller is offline  
Old 25-Jun-2002, 16:24   #4
Randell
Senior Daddy
 
Join Date: Feb 2002
Location: London
Posts: 1,869
Default

I don't know the theoretical maths, but even I would have expected the raw speed to at least equal GF4 Ti4200/4400 performance and then see it pull well ahead as the IQ was turned on.
Randell is offline  
Old 25-Jun-2002, 19:05   #5
Kristof
Senior Member
 
Join Date: Jan 2002
Location: Abbots Langley
Posts: 732
Default

Has anybody seen any single texture results from 3DMark? All I find is the multitexture ones where Matrox wins. I wonder about their pixel throughput rate...

K~
Kristof is offline  
Old 25-Jun-2002, 19:11   #6
CHHAS
Junior Member
 
Join Date: Feb 2002
Posts: 14
Default

Single Texture was around 700ish.
CHHAS is offline  
Old 25-Jun-2002, 19:13   #7
Jazz
Junior Member
 
Join Date: Jun 2002
Posts: 49
Default

Jazz is offline  
Old 25-Jun-2002, 22:00   #8
Kristof
Senior Member
 
Join Date: Jan 2002
Location: Abbots Langley
Posts: 732
Default

OK, 880 theoretical and 750.7 in a fairly cache friendly test with twice the bandwidth available as the others per pixel... efficiency rate : 85%. You'd think they would at least hit 100% efficiency on that one... wonder if they have any ability in the drivers to tweak their memory interface.

K~
Kristof is offline  
Old 25-Jun-2002, 22:03   #9
LittlePenny
Member
 
Join Date: Feb 2002
Location: Rolla, Missouri, USA
Posts: 276
Send a message via AIM to LittlePenny
Default

Maybe they need those old Voodoo5 drivers with the HSR.
LittlePenny is offline  
Old 25-Jun-2002, 22:38   #10
Nappe1
lp0 On Fire!
 
Join Date: Feb 2002
Location: South east finland
Posts: 1,527
Send a message via ICQ to Nappe1
Default

Quote:
Originally Posted by LittlePenny
Maybe they need those old Voodoo5 drivers with the HSR.
hehe too bad that doesn't work when T&L is done in chip...

I have been reading their white papers and started to wonder: if their FAA unit knows which pixels are totally covered, why not add something like Pixel Skip to it? It would not have taken much more room and would have helped quite a lot.
__________________
Nappe1 of Division & Future Vision
Founder of AF3DE
Nappe1 is offline  
Old 25-Jun-2002, 23:55   #11
Jerry Cornelius
Member
 
Join Date: May 2002
Posts: 116
Default

hmmm...
I used a much rougher calculation for Quake III using a single pixel per pipe and HQ filtering

880000000 / (1280 * 1024 * 6)

the 6 being 2 textures * 2 overdraw + 2 more for alpha and dynamic lights

That works out to 112, a far cry from the mid 70's reported in some reviews.

It's impossible to guesstimate the impact of special effects in Q3. A multiplier of 7 or 8 may be more accurate.
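
For reference, a small Python sketch of this rougher estimate with the three multipliers mentioned (the function name is made up for illustration):

```python
def q3_fps(fill_rate, width, height, cost_per_pixel):
    """Frame rate if each screen pixel costs `cost_per_pixel` pixel writes."""
    return fill_rate / (width * height * cost_per_pixel)

# 880 Mpixels/s at 1280x1024 with multipliers of 6, 7 and 8
estimates = {cost: round(q3_fps(880e6, 1280, 1024, cost)) for cost in (6, 7, 8)}
print(estimates)   # {6: 112, 7: 96, 8: 84}
```

Even with a multiplier of 8 the estimate stays above the mid 70s reported.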
Jerry Cornelius is offline  
Old 26-Jun-2002, 00:05   #12
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by Kristof
OK, 880 theoretical and 750.7 in a fairly cache friendly test with twice the bandwidth available as the others per pixel... efficiency rate : 85%. You'd think they would at least hit 100% efficiency on that one... wonder if they have any ability in the drivers to tweak their memory interface.

K~
I think 100% is expecting a bit much, but 95% is totally reasonable. Anyway, I think their multitexturing is much worse:



As far as I know Parhelia can combine two pipelines together for 8 textures per pass (unless they can only combine the pixel shader part), so we're talking 320 bits of bandwidth per pixel with 8 textures. The textures are very low resolution and cache friendly, as you said, so the texture bandwidth for 8 textures will be very little, maybe ~20 bits per pixel at most (even lower at higher resolutions). Add in an alpha read, color buffer write, and no Z (I believe Z is disabled for this test, hence the good scores of Radeon 8500 and Geforce4), and you get only ~85 bits per pixel. They have SO much bandwidth to spare, it isn't funny.

Even if they can only do 4 textures per pass, there is plenty of bandwidth to spare (160 bits per pixel bandwidth available and ~75 bits required). I can understand the multitexture rate being slightly less than 4 times the single texture rate, but to drop down to about 3 times is pretty bad.

Look at the Radeon 8500 - 76% efficient in single texturing and 93% efficient in multitexturing. To hit 70% efficiency in multitexturing with over twice the bandwidth per pixel per clock is just abhorrent.
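
The bandwidth budget above can be laid out as a quick Python sketch. The ~20 bits of texture traffic for 8 textures is the figure from this post; the ~10 bits used for the 4-texture case is an assumption chosen to land near the ~75 bits quoted. Names are hypothetical.

```python
AVAILABLE_PER_PIPE = 160   # bits per pixel per clock, from earlier in the thread

def required_bits(texture_bits, alpha_read=32, colour_write=32, z_bits=0):
    """Memory traffic per pixel: textures + colour read + colour write (+ Z)."""
    return texture_bits + alpha_read + colour_write + z_bits

# Two pipes combined for 8 textures: twice the per-pipe bandwidth per pixel
avail_8tex = 2 * AVAILABLE_PER_PIPE            # 320
need_8tex = required_bits(texture_bits=20)     # 84, i.e. the ~85 quoted

# 4 textures in a single pipe; 10 bits of texture traffic is assumed
avail_4tex = AVAILABLE_PER_PIPE                # 160
need_4tex = required_bits(texture_bits=10)     # 74, close to the ~75 quoted

headroom_8 = avail_8tex / need_8tex            # about 3.8x
headroom_4 = avail_4tex / need_4tex            # about 2.2x
```

Either way there is at least twice as much bandwidth on hand as the test should need.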
Mintmaster is offline  
Old 26-Jun-2002, 00:57   #13
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

It appears to me so far that Matrox, when designing this chip, just assumed that raw memory bandwidth was the only big bottleneck holding back GPU performance, and that therefore a 256-bit bus alone would be enough to beat Nvidia/ATI's 128-bit solutions. The lack of significant performance optimizations in the chip (like fast/hierarchical Z tests, Z-compression, crossbar memory controllers, etc.) and the generally low efficiency of the architecture would point in that direction: the 70% efficiency when multi-texturing and the performance hit taken when doing anisotropic mapping point to a badly optimized texture cache, and the 4 vertex shaders deliver less than impressive performance as well.
arjan de lumens is online now  
Old 26-Jun-2002, 01:12   #14
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by mboeller
IMHO the math works a little bit differently.
I don't know what kind of f#$%!d up math you're talking about, but what you're suggesting makes no sense at all. You're calculating based on memory bandwidth, but what about when memory bandwidth is not the limiting factor (i.e. low res texture near the camera, only a colour write, no Z read due to fast Z clear, etc)? Your method will project far too high fillrates, especially for Parhelia. And there are more problems too...

Quote:
Originally Posted by mboeller
Looking at Quake3 :
At a resolution of 1024x768 you need 786432 x 3 Texel (3Dfx-style) fillrate. The 3 Texel come from multitexturing and alphablending.
First, not every pixel has alpha blending. In fact, the majority don't. Why does alpha blending need another texel? It just needs more bandwidth for the colour read, which you add separately below.

Quote:
Originally Posted by mboeller
On top you have an overdraw of around 1.2-1.5 (from memory only).
Memory overdraw?!? What the hell is that? Overdraw is how many pixels are drawn divided by the visible pixels. Trust me when I say 3.3 is correct.

Quote:
Originally Posted by mboeller
Bandwidth demand for this fillrate ( with 2pass rendering ) :
What current video card needs 2 passes? The voodoo series?

Quote:
Originally Posted by mboeller
Textures :
3,5 Mio x 32bit x 4(bilinear filtering) *0,33 ( cache-missrate ) / 8 = ~19MB / Frame
Texture bandwidth is NEVER that high. A single mip map will never average more than 1 texel per pixel unless you crank the LOD, because the video card will then select the next lower mip map. In other words, your cache missrate is way off - a video card will rarely load the same texel twice from memory except for repeating textures and dependent textures. With trilinear, one mipmap is also much lower in resolution than the other, so the bandwidth is even less. Finally, near the camera there is texture magnification, which can reduce texture bandwidth by a factor of 10 easily. Then there's texture compression, bringing that down even further.

Quote:
Originally Posted by mboeller
Z-Buffer :
3,5 Mio x 32 x 2 x2(2passes) /8 = ~56 MB / Frame ( the GF4 can save up to 75% of this bandwidth )

Framebuffer :
3,5 Mio x 32 x2(2passes) /8 = ~28 MB /frame
Z and colour buffers need to be accessed only once per pixel, not once per texture sample.

Quote:
Originally Posted by mboeller
So the GF4 4200 can do theoretically around :

1000 / ( 3,5 x4/3; cause of the pipeline ) = 214fps
Again, video cards don't need an extra cycle to do alpha blending. Assuming you did get the texel count right, your math is wrong.

fps = [texel rate] / [# texels on the screen]

That means (250*4*2)/(3.5) = 571 fps. But your texel count is wrong - it's 1024x768 x 2 textures x 3.3 overdraw = 5.9 million. Then you get fps of 340.
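
A short Python sketch of the formula above, for anyone who wants to plug in their own numbers. The 250 MHz x 4 pipes x 2 TMUs breakdown is read off the (250*4*2) term. Evaluated directly, the texel count comes to roughly 5.2 million and the frame rate to about 385 fps, a little different from the rounded figures in the post, though the conclusion is the same.

```python
texel_rate = 250e6 * 4 * 2                 # 250 MHz x 4 pipes x 2 TMUs, texels/s

# With the 3.5 million texel count from the quoted post
fps_quoted_count = texel_rate / 3.5e6      # about 571 fps

# With 2 textures per pixel and 3.3 overdraw at 1024x768
texels_per_frame = 1024 * 768 * 2 * 3.3    # about 5.19 million
fps = texel_rate / texels_per_frame        # about 385 fps
```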

Quote:
Originally Posted by mboeller
Bandwidth demand :
214 x ( 19+0,25*56+28 ) = 13054 MB/sec

The GF4 4200 has only 8GB/sec bandwidth, and so it reaches only around 60% of the theoretical figure. This gives 214 x 0,6 = 130 fps
First off, no video card can get 100% bandwidth utilization. It's more like 85-90% for the very best memory controllers in very friendly, balanced situations. So your calculation would result in an even lower score. Secondly, GF4 can get far more than 214 fps at 1024x768, as it's just CPU limited. It scores some 120 fps at 1600x1200, so if you multiply by the pixel ratio, that gives you 290 fps at 1024x768, showing how your answer is way off.
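
Both points can be checked with a few lines of Python. This sketch reuses the 214 fps and 13054 MB/s figures from the quoted post and applies the 85-90% utilization range mentioned above; variable names are illustrative.

```python
# Scale the 1600x1200 score by the pixel-count ratio
pixel_ratio = (1600 * 1200) / (1024 * 768)        # about 2.44
projected_fps = 120 * pixel_ratio                 # about 293 fps at 1024x768

# Quoted estimate redone with realistic bus utilization instead of 100%
adjusted = [214 * (8000 * u) / 13054 for u in (0.85, 0.90)]
# about 111 and 118 fps, lower still than the 130 fps in the quote
```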

Quote:
Originally Posted by mboeller
The Parhelia on the other side would have the following speed :

880 / ( 3,5 x4/3; cause of the pipeline ) = 188 fps

Bandwidth demand :
188 x ( 19+56+28 ) = 19364 MB /sec
Theoretically the bandwidth demand for the Z-buffer and the framebuffer would be only half the amount given, because the Parhelia has 4 TMUs per pipeline, but most games are written with 2 TMUs per pipeline in mind and so the extra units sit idle (but can be used for trilinear filtering or anisotropic filtering).

real bandwidth : 275x256x2 /8 = 17,6 GB/sec

So the speed of the Parhelia should be 188 x (17600/19364) = 170fps.

The real speed should be even higher as we see with the GF4 4200.
Here, your method fails tremendously. Suppose that you did correctly calculate the bandwidth needed per frame. Matrox is hardly ever bandwidth bound, especially in Q3. Your calculation assumes Parhelia's pipelines are fast enough to keep the memory busy all the time, but you'd probably need 8 pixel pipes on a 300 MHz core to keep a 275 MHz 256-bit DDR bus busy with Quake 3.

Quote:
Originally Posted by mboeller
So in the end, I agree the Parhelia is very inefficient at the moment; BUT this can be corrected with drivers more or less (unless Matrox has built a really inefficient chip).
Unless Matrox shipped their product before it was working properly (as NVidia did with GF3 I think), better drivers will only significantly help CPU limited scores. We can hope that they just didn't bother tweaking the memory controller, but even so it has 2.5 times the bandwidth per pixel as other cards that are more efficient. I really think there are serious design flaws in the hardware. Well, I guess anything is possible.

Anyway, this method of calculating theoretical framerate has more holes than a block of swiss cheese.
Mintmaster is offline  
Old 26-Jun-2002, 04:26   #15
3dcgi
Senior Member
 
Join Date: Feb 2002
Posts: 2,019
Default

Quote:
Originally Posted by Mintmaster
better drivers will only significantly help CPU limited scores.
That's not necessarily true. It all depends on what the driver inefficiencies are. If the drivers are sending more render state changes than necessary or mis-ordering commands, they could stall the pipe, which means they could be artificially limiting the rendering performance.

A driver comment unrelated to my first paragraph. I think the first chart on the following page at Anandtech might show that the drivers are creating some CPU inefficiencies. I find it odd that Parhelia trails significantly when the other cards are practically identical. http://www.anandtech.com/video/showdoc.html?i=1645&p=9
3dcgi is offline  
Old 26-Jun-2002, 14:30   #16
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by 3dcgi
Quote:
Originally Posted by Mintmaster
better drivers will only significantly help CPU limited scores.
That's not necessarily true. It all depends on what the driver inefficiencies are. If the drivers are sending more render state changes than necessary or mis-ordering commands, they could stall the pipe, which means they could be artificially limiting the rendering performance.
True... I guess render state changes are a possibility. However, the card has problems even on synthetic tests where everything is the same (i.e. VERY few render state changes). High polygon count is limited to 25 Mpolys/sec. Single and multitexturing fillrate (especially the latter) are not up to par.

Also, raising the resolution in Quake 3 increases the number of pixels without raising the number of render state changes, polygons, or CPU work per frame. Still, the Parhelia's scores are scaling almost exactly with increase in fillrate demands (demonstrated above - the 2.2x vs 2.4x thing), suggesting inefficient fillrate, not any other reason.

I'm not talking about a 5-10% increase, I'm talking about an 80%+ increase (at least in Q3 at 1600x1200) needed for performance to be where it is expected.
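
To put numbers on the scaling argument and the size of the gap, here is a rough Python sketch using the scores quoted earlier in the thread (151 fps at 1024x768, 70 fps at 1600x1200, about 130 fps expected). Variable names are illustrative.

```python
pixel_ratio = (1600 * 1200) / (1024 * 768)     # about 2.44

# If the card were purely fill-rate limited at both resolutions,
# the 1024x768 score would predict this at 1600x1200:
fill_limited_prediction = 151 / pixel_ratio    # about 62 fps, vs 70 measured

# Gain needed to go from the measured score to the expected one
needed_gain = 130 / 70 - 1                     # about 0.86, i.e. 80%+
```

The measured 70 fps sits close to the purely fill-limited prediction, which is the basis for blaming fillrate rather than drivers or CPU.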

Unless they have some problem with the card, such as limiting it to only 128 bit due to a timing bug, or haven't paid any attention to getting performance reasonable (this is a far bigger problem than just "tweaking"), there really doesn't seem to be hope that performance will reach expected levels, or at least not what I'm expecting as explained above.

With over double the bandwidth per clock and double the texture units, we should see at least a 20% increase in efficiency PER CLOCK (i.e. if Parhelia were to run at GF4 speeds) compared to Radeon 8500 or GF4. Instead, its efficiency is actually significantly lower.
Mintmaster is offline  

 
