GeForce FX: 8x1 or 4x2?

Discussion in 'General 3D Technology' started by Dave Baumann, Feb 10, 2003.

  1. Pete

    Pete Moderate Nuisance
    Moderator Legend

    Joined:
    Feb 7, 2002
    Messages:
    5,777
    Likes Received:
    1,814
    Most of the discussion in the past two pages is slightly over my head, but allow me to pose a simple question:

    nVidia typically seems to have more cache built in than ATi. How does this help? Is it similar to a CPU's L1/L2 distinction, or do GPUs and CPUs function differently enough to make that analogy somewhat useless? And could a large amount of cache necessitate .13u simply for power draw, as I've read elsewhere?

    (I'm assuming NV30 was always meant to be 4x2, and was optimized for that--so, no thought that they traded 8x1 for 4x2 + extra cache.)
     
  2. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    According to some nVidia PR material I've read lately (I really need to start a links folder...;)), nVidia is maintaining that the actual physical bandwidth of the promotional Ultra product is ~20 gigs/sec. The PR formula by which this is achieved is:

    (1) 16-gigs/sec physical bandwidth across the local bus (DDRII)

    (2) A further ~4-gigs/sec as the result of unspecified and unidentified "gpu caching" nVidia claims. (Efforts to penetrate the "gpu caching" representation as to specifics have met with the same sort of "trade secret" barriers which undermined efforts to have nVidia explain its FSAA methods for 2x and QC so that a determination as to any actual FSAA that might be occurring could be made.)

    So, in many senses this is a product as shrouded in "trade secrets" as a bucket of the Colonel's Kentucky Fried Chicken with its "11 secret herbs and spices"....;) Everywhere we probe for answers, a shroud is thrown up to block them.

    But for a moment, anyway, let's take the trade secrets at face value. nv30 still seems to have a problem, although it has what should be sufficient bandwidth for most cases, at least if we compare it to R300. Do the bandwidth limits you imply afflict both the promotional Ultra products and the actual standard FX (400/800)? In other words, you seem to be saying that a standard FX at 400/800 would more nearly achieve its potential at 400/1GHz....If so, I would think nVidia would see a 400/1GHz model as a better "ultra" than the one it originally planned. Could it be that the reason it doesn't is itself the answer to the question?

    So...you're saying you think a concern for "very small polygons" was a prime motivator in the design philosophy, then...

    I see Dave already answered this one.

    Edits: typos
     
  3. Ichneumon

    Regular

    Joined:
    Feb 3, 2002
    Messages:
    414
    Likes Received:
    1
    While a 400/1000 GFFX would perhaps be more balanced, that extra 100MHz of GPU speed gives them "bench points," so to speak, even if you never really see that in actual games/applications...
     
  4. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    As for bandwidth limitations, I'll attack this with a very simple argument:

    Bits of bandwidth available per clock: 256
    Assume only z-read and color write enabled: 64 bits per pixel
    4 pixels * 64 bits per pixel = 256 bits.

    Which already exhausts the bandwidth limitations.

    In a more realistic scenario, you may have up to 128 bits per pixel, depending (framebuffer read and write, for alpha blending, and z-buffer read and write), though 96 bits is more likely the norm.

    So, what about memory bandwidth savings techniques? These can help things a bit. For a very complex scene, where many neighboring pixels pass and fail z-tests, you won't be bandwidth limited (but rather limited by the pixel pipes trying to find some data that's actually meaningful to work on). However, this just doesn't happen in most cases.

    But then there's z-buffer compression. This can help quite a bit. Let's assume 4:1 compression. That will reduce the bandwidth per pixel for one of the examples above (color write + z read only) to about 40 bits per pixel, for a total of 160 bits per clock, leaving room for more. But how often will this occur? I'd say a more probable scenario for single texturing is an alpha blend (for explosions, skyboxes, etc.) where you'll have a z-read, a color read, and a color write.

    I guess what I'm trying to say is that it's conceivable that you won't be bandwidth-bound with single texturing on an FX, but I doubt that many upcoming games are going to be just applying a single non-transparent texture very commonly. Performance in very old games where this was common is kinda pointless to accelerate.
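
    The arithmetic above can be sketched as a quick back-of-the-envelope script. All figures are the ones from this post (a 128-bit DDR bus delivering 256 bits per clock, 4 pixel pipes, and the assumed per-pixel access patterns), not measured data:

    ```python
    # Sketch of the bandwidth argument above: a 128-bit DDR bus delivers
    # 256 bits per memory clock; compare that against what 4 pixel pipes
    # would consume per clock under a few assumed per-pixel access patterns.
    BUS_BITS_PER_CLOCK = 128 * 2  # 128-bit bus, DDR -> 256 bits/clock
    PIPES = 4

    # bits of memory traffic per pixel (32-bit color, 32-bit z assumed)
    scenarios = {
        "z-read + color-write":                   32 + 32,        # 64 bits
        "alpha blend (z-read, c-read, c-write)":  32 + 32 + 32,   # 96 bits
        "worst case (z r/w + color r/w)":         128,
        "z-read (4:1 compressed) + color-write":  8 + 32,         # ~40 bits
    }

    for name, bits_per_pixel in scenarios.items():
        demand = PIPES * bits_per_pixel
        verdict = "bandwidth-bound" if demand >= BUS_BITS_PER_CLOCK else "headroom left"
        print(f"{name}: {demand} bits/clock -> {verdict}")
    ```

    As the post says, even the cheapest uncompressed case (z-read plus color-write) already saturates the bus at 4 pixels/clock; only the z-compressed case leaves headroom.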
     
  5. antlers

    Regular

    Joined:
    Aug 14, 2002
    Messages:
    457
    Likes Received:
    0
    Funny how the pendulum swings. When T&L was new, people were predicting that newer games using it would have hordes of tiny polys, and adding pixel pipes (i.e. having 8x1 instead of 4x2) would be a waste.

    Now we see higher resolutions becoming more common, next-gen games using geometry-intensive shadowing and dot3 lighting in lieu of lots of geometry, and shaders making state changes even more expensive. It looks like larger polys might stay around for a few years longer.
     
  6. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    So, if we've got two ALUs per pipeline, I would assume that these are pipelined? i.e. it's not selecting 8 pixels and operating on them all in parallel, but selecting 4 and going through the pipeline and then selecting another 4 on the next clock, in a more CPU-like manner?
     
  7. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    71
    In a nutshell:

    The FX is a pretty balanced part....problem is, the R9700 is also balanced, but at a higher end of the scale. That would be acceptable if the FX were priced lower than the 9700, unfortunately....
     
  8. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    Probably right. I had even thought it might be because 1GHz DDRII is still prohibitively expensive, and that, due to a lack of general demand for the Ultra product from its OEMs, nVidia paid more to its RAM suppliers for a smaller supply when it decided to limit shipments.
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I'm still hopeful that some very flexible HOS will become standard soon, making high polycounts much more accessible.

    As for pixel sizes in current games, I can see where games like UT2k3 and Unreal 2 could have a significant number of triangles that are rather small.

    And keep in mind that organizing pixels in blocks won't hurt performance solely for small triangles, but there will be some performance hit at the edge of all triangles (unless, perhaps, the hardware is able to send, say, many triangles in a single strip through the pipes in such a way that different pipes can be working on different triangles...which shouldn't be impossible, and definitely would not be a bad thing).
     
  10. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    It seems that programmable HOS is a question not for R400/NV40 but probably for the full DX10 parts - the NV50 and R500 chips.

    IMHO R400 and NV40 are pretty powerful floating-point DX9/PS 3.0 parts.

    8x1 is better than 4x2 if we do it in the NV30 manner, but R300 has a different pixel pipe, so it depends on the task - in some places the 4x2 of NV30 is better than the 8x1 of R300, and in some it isn't. With intensive dependent texturing NV30 is probably better; with intensive calculations, R300.

    So ATi's 8x1 has its own minuses and pluses...

    Doing 8x1 in the NV30 manner requires too much silicon to spend, so maybe it really is better to make a 4x2 with large caches, especially with a 128-bit bus.

    500 MHz looks unbalanced when we don't deal with anisotropic filtering; when we take it into consideration, it seems pretty balanced ;)
     
  11. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    Exactly, which lends credence and enhances an understanding of nVidia's earlier remarks about doing a 256-bit bus "when it's needed." Assuming this is only 10% PR bluster and 90% fact, this certainly indicates a design philosophy behind nv30 in which 128-bit DDRII was deemed entirely sufficient for nv30's high end. (How this takes me back to 3dfx's "We aren't opposed to 24-bits plus 8, and we'll certainly get around to it when the time is right, in the meantime here's 16/22-bits" remarks a few years ago. In the years since, I've learned that this is actually doubletalk for "We wish we'd done it first." *chuckle*)
     
  12. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    I say 8x1 is "better" from the standpoint of current software, and the software likely to be written in the next year, or within the useful, possibly even extended, lifetime of nv30 or R300 as so-called high-end products. One thing is that R300 is already into its sixth month of life, and we're still waiting on nv30 to ship. nVidia's nv30s aren't shipping, and already the company is picking out synthetic benchmarks and criticizing them as being unlikely to represent future software development. This conjures the question of whether nVidia has been designing nv30 around 3DMark 2001SE....*chuckle* And now has a big case of indigestion as the benchmark has shifted gears (not, of course, that I really believe that...completely *chuckle*). I just think all of this underscores just how late the nv30 design is in getting to market--released 8-9 months ago, I can see very compelling, non-Dustbuster-equipped markets--but it seems nVidia erred in making the production process more important than the architecture (which I think is the reverse of how it should be done, i.e. you get the architecture right and then you look for benefits from future improvements in production processes).
     
  13. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    I was talking about the register combiners (the integer-only units) - not the FP ALUs.
    The FP ALU is looped.
    Notice that in fact there is likely one big FP ALU.

    It can either produce:
    a.) 4 arithmetic results for 4 (2x2) pixels
    b.) 8 texture fetches from 2 textures for 4 (2x2) pixels
    It does (a) or (b) in every clock cycle until the pixel program is finished.
    (This is over-simplified; the unit surely works on many pixel blocks alternately to hide texture fetch latency.)

    When a pixel program is finished (for 2x2 pixels), a few output values are sent to the register combiners, and the FP ALU continues processing the next 2x2 pixels.

    After that, the data is processed by the integer register combiners.
    Reg combiners can execute 0-2 instructions at a 4 pixels/clock rate, 3-4 instructions at a 2 pixels/clock rate, or 5-8 instructions at a 1 pixel/clock rate.

    (As always, everything is a guess based on the currently available information.)
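
    The guessed combiner-rate rule above amounts to a small step function from instruction count to throughput. A minimal sketch, using only the speculative rates from this post (not confirmed hardware behaviour):

    ```python
    # Step function implied by the post's guess: register combiners run
    # 0-2 instructions at 4 pixels/clock, 3-4 at 2 pixels/clock,
    # and 5-8 at 1 pixel/clock. Purely illustrative figures.
    def combiner_pixels_per_clock(instructions: int) -> int:
        if not 0 <= instructions <= 8:
            raise ValueError("combiners assumed to handle at most 8 instructions")
        if instructions <= 2:
            return 4
        if instructions <= 4:
            return 2
        return 1

    for n in range(9):
        print(f"{n} combiner instructions -> {combiner_pixels_per_clock(n)} pixels/clock")
    ```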
     
  14. Vince

    Veteran

    Joined:
    Apr 9, 2002
    Messages:
    2,158
    Likes Received:
    7
    It's quite ironic that there are 2-3 pages of enlightened and amazingly intelligent discussion on the NV30's fundamental architecture. It's truly a great conversation between highly literate people and is a delight to read.

    And then people like Joe and WaltC come in and start talking useless rhetoric and back to this whole big nomenclature ideal of 4*2 and 8*1 - which outside of making it easier to comprehend and use in an anti-IHV crusade is useless.

    Listen up, you just might learn something - I hold out hope for us all to increase our knowledge on this topic.
     
  15. Nagorak

    Regular

    Joined:
    Jun 20, 2002
    Messages:
    854
    Likes Received:
    0
    In my opinion that's because hardwired T&L itself pretty much turned out to be a waste. By the time it made it into games, the CPU was fast enough to just handle the calculations, meanwhile T&L didn't really open any new doors for graphics development. Shaders may be a step in the right direction, but T&L, by and large, was just BS.
     
  16. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    71
    Ahhh...Vince. Spoken truly like someone who doesn't disagree with what Walt and I are saying...it's just that you don't like to hear it being said.

    It's quite ironic how some people have nothing to add to a conversation but personal insults, and at the same time complain about the conduct of other members....
     
  17. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    Croteam on this issue.

    Just like what I said, and no, I didn't talk to Dean about this!

    Oh, wait a minute... the discussion has become much more detailed and informative about this topic. :lol:
     
  18. Diespinnerz

    Newcomer

    Joined:
    Jan 6, 2003
    Messages:
    23
    Likes Received:
    0
    Yeah, but everybody already knows you're the biggest nvidiot here, rev... ;)

    Guess we can add Croteam to the list of mouths on nVidia's payroll.
     
  19. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Rev:

    I think both you and he have missed the point to a certain extent. A lot of people made assumptions about the card based on the information that nVidia made available. We're now finding out that what we thought was true isn't, and people feel deceived. It raises the question: what other assumptions are we making about the card that potentially aren't true? nVidia's been really hesitant to give concrete information about it, and that makes them seem suspicious. It really doesn't matter that a 4x2 may not be slower than an 8x1 in most cases. I think the thing that most people here are interested in is to see what nVidia left out when talking about it. Perhaps this is an inconsequential discussion in terms of how it will affect performance, but I think it's a very worthwhile discussion when analyzing nVidia's marketing tactics, and what *they* are worried about with the NV30. When looking for problems, it's probably best to first find out what they see as problems (and are thus trying to keep quiet).

    Nite_Hawk
     
  20. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    I agree that effective "proxel" output is all that matters. The problem is that it looks like in all realistic heavy workloads the nv30 isn't able to achieve anything that justifies the implication of "8 pipelines".

    For example, if the nv35 were designed exactly the same way, except that by addition or reorganization of processing functionality it overcame the limitations the nv30 is exhibiting (which I sort of think is pretty close to the minimum of what it should be), then the "8x1" naming would be justified in my mind. That's why I made up the "proxel" nomenclature..."pixel pipeline" is too abusable and indeterminate. I do also happen to think "zixel fillrate" matters, I just don't think it should be called "pixel fillrate" when it manifests like it does for the nv30 right now.

    Performance is what really matters, but the design of the nv30 doesn't deliver the performance in accordance to nvidia's marketing description of the architecture. It is advertised as a 500(or whatever) MHz 8x1 "pixel pipeline" part, which is simply a misleading description...consumers do pay attention to such numbers though I think there is less attachment to them for GPUs than for CPUs.
    Again, the "4x?" versus "8x1" discussion, to me, is a discussion about how performance falls short of what is advertised. Though I agree "pixel pipeline" might be inappropriate, I don't think using the term distorts the basic issue with the nv30...however, if the nv35 is only as I describe the minimum above, you'd likely find me on the other "side" of the issue at that time...it's just that I don't think we'd end up having an "uproar" about it being described as "8x1" in the first place.

    From the way things are shaping up as we explore further, it doesn't seem like the nv30 is capable of overcoming these issues in real-world usage (though I am still curious about what they are able to realize with the full scope of optimizations they planned, and I think someone should write a Cg version of some "proxel" fillrate/MIPS testing), and this is the real issue.

    I do agree the nv30 will still be just as fast as its design and drivers dictate when running games, no matter how you describe it with words.

    ...

    I don't get why people are complaining about the "4x? uproar", since to me it is nvidia's fault we are discussing how poor an "8x1" implementation it is, instead of how excellent (IMO, apart from the AA and aniso issues :p ) a "4x?" implementation it is. :-?
     