3D Technology & Architecture

Discussion in 'Architecture and Products' started by Frank, May 18, 2007.

  1. santyhammer

    Newcomer

    Joined:
    Apr 22, 2006
    Messages:
    85
    Likes Received:
    2
    Location:
    Behind you
    I think the R600 should be faster than the 8800GTX... It can use GDDR4, it doubles the stream units and the bandwidth is a lot higher... I suspect the problem is immature drivers (and also immature DX10). All the new stuff needs to mature a while before serious benchmark testing... I remember when Metal Gear Solid came to PC... the Psycho Mantis invisibility cubemap made my PC almost hang because it was so slow... a year later it ran like silk. Vista and the Catalyst drivers need some time to be well optimized.

    I'm definitely gonna buy an R600 because:

    1) It's cheaper than the NVIDIA ones.

    2) The preliminary benchmarks don't do it justice. Let's wait a while so the drivers and tests mature a bit (personally I'll wait for 3DMark 2007 before judging the R600's performance). Speed is not important for me, I just want all the DX10 features at low cost.

    3) It comes in AGP (or at least that's what I saw here http://xtreview.com/addcomment-id-2...-HD-2900-XTX-and-AGP-Radeon-HD-2600-2400.html) because I lack a PCI-E motherboard. The HD 2600 Pro looks very nice to me... no external power connector needed, 128-bit GDDR3 at 1 GHz, a silent passive heatsink, small and nice.

    4) I saw that wonderful terrain tessellation, and Humus said it wasn't done with the GS, so I bet the R600 has a tessellator, which I want to play with! If the R600's GS really is better than the 8800's, that will help too.

    5) The HDMI and the integrated "sound card" look good.

    6) Its new AA modes are very interesting.

    So I'm trying to decide whether to get that HD 2600 Pro or to wait a while and get a Barcelona CPU with a DX10 integrated VGA on the motherboard! (Or pray to NVIDIA that they release the 8600 in AGP!)
     
    #21 santyhammer, May 19, 2007
    Last edited by a moderator: May 19, 2007
  2. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    2,461
    Likes Received:
    178
    By the time there is a use for GS at the limits Humus explained, Nvidia might have a proper solution at hand. I doubt they are sleeping at the moment, and I also doubt there'll be an 8900 whatever. I expect them to be working on the Next Gen Card.

    US
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm surprised nobody has mentioned the scalar architecture yet. G80 at 1.35 GHz can deliver 346 GFLOPS all the time, while R600 at 742 MHz can drop to 95 GFLOPS for shaders with all-dependent scalars. To reach its peak performance of 475 GFLOPS it needs 5 independent scalars every clock cycle (multiply-add dependencies are ok).

    While I'm confident that driver improvements can offer some nice speedups, that worst case which is less than a third of G80's performance still exists. Frankly, I don't know of any reason to prefer a VLIW architecture over a scalar one.

    Anyway, am I exaggerating the importance of this, or is it indeed one of the primary reasons R600 doesn't compete with the GTX and Ultra?
     
    #23 Nick, May 19, 2007
    Last edited by a moderator: May 19, 2007
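A quick back-of-the-envelope check of the figures in the post above, as a Python sketch; it assumes the commonly quoted configurations (128 scalar MAD units on G80, 64 five-wide VLIW units on R600) and counts a MAD as two flops, none of which is stated in the post itself:

```python
# Peak vs. worst-case ALU throughput, counting a MAD as 2 flops.
# Assumed configurations: G80 = 128 scalar units at 1.35 GHz,
# R600 = 64 five-wide VLIW units at 742 MHz.
G80_UNITS, G80_CLK = 128, 1.35e9
R600_UNITS, R600_WIDTH, R600_CLK = 64, 5, 742e6

g80_peak   = G80_UNITS * 2 * G80_CLK                 # ~346 GFLOPS, sustained regardless of dependencies
r600_peak  = R600_UNITS * R600_WIDTH * 2 * R600_CLK  # ~475 GFLOPS, needs 5 independent scalars per clock
r600_worst = R600_UNITS * 1 * 2 * R600_CLK           # ~95 GFLOPS, fully dependent scalar chain

print(f"G80 peak {g80_peak/1e9:.0f}, R600 peak {r600_peak/1e9:.0f}, "
      f"R600 worst case {r600_worst/1e9:.0f} GFLOPS")
```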
  4. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    For the most part shaders work on vectors anyway. I very much doubt you will see anything close to the worst case in any useful shader.

    What surprises me is the lack of benchmarks that explain the differences in performance between the expected results and the measured results. It surprises me that sites don't do things like run a relatively standard shader, then progressively modify the shader to ascertain what stops it from hitting its peak.

    As a dev I do things like reduce texture resolution all the way down to the minimum possible on the hardware to ensure the texture is in the cache. I run in progressively smaller windows (down to 1x1) to isolate the performance of the vertex shader. I remove texture reads and ALU instructions to try and understand what the bottlenecks are.

    What I see online is a few standard benchmarks run at different resolutions, and rampant speculation.
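The kind of isolation sweep described above could be scripted along these lines; a minimal sketch where `render_frame(width, height, shader_variant)` is a hypothetical callback standing in for the real renderer, not an actual API:

```python
import time

def time_variant(render_frame, width, height, shader_variant, frames=200):
    """Average seconds per frame for one (resolution, shader variant) combination."""
    start = time.perf_counter()
    for _ in range(frames):
        render_frame(width, height, shader_variant)
    return (time.perf_counter() - start) / frames

def sweep(render_frame, variants):
    # Shrinking the window toward 1x1 isolates vertex/setup cost; comparing
    # shader variants with texture reads or ALU ops stripped out isolates those.
    for width, height in [(1600, 1200), (640, 480), (1, 1)]:
        for name, variant in variants.items():
            ms = time_variant(render_frame, width, height, variant) * 1000.0
            print(f"{width}x{height}  {name:<24} {ms:7.3f} ms/frame")
```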
     
  5. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    A major reason for VLIW is that you've got a shitpile of money invested in such a thing somewhere along the line, you've had major shuffling of your roadmap, and you thought that your primary competitor would do a sucky hack job, but they turned out to have some balls attached and tried a serious departure from what they did before.

    I'm partially kidding there, as there are certainly advantages to the way ATi chose... as long as your compiler monkeys work proper magic and you have a good enough relationship with developers.
     
  6. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Vectors, yes, but of what dimension? Texture coordinates typically have two components, colors three, normals three, camera and light positions three, etc. Furthermore, fog is a scalar, point size is a scalar, depth is a scalar...
    Worst case, obviously not, but not the best case either. The amount of instruction-level parallelism they need to extract is considerable. Also, at branches and at the end of the shader it's highly unlikely you can fill all 5 ALUs.

    So maybe with a great compiler they can reach an average of 4 operations per shader unit. That's enough to beat the GTX, but still not enough to beat the Ultra. On paper R600 should have gone for gold, but clearly something is preventing it from reaching best-case performance.
    Why is that surprising? Reviewers rarely know how to write shaders, let alone evaluate the architecture with them. And even if they could do it, it takes considerable effort, delaying their review and costing them page hits. Furthermore, readers rarely want to be bothered with math and stuff. They want to see pretty pictures of benchmarks and performance graphs they can understand.

    Anyway, nobody's stopping you from creating a site/blog where you post your results and invite other professional developers to post theirs... ;)
     
    #26 Nick, May 19, 2007
    Last edited by a moderator: May 20, 2007
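The arithmetic behind the "4 operations per unit" claim above, as a sketch; the Ultra's shader clock is assumed at roughly 1.5 GHz and a MAD is counted as two flops:

```python
# If the compiler fills 4 of R600's 5 slots on average:
r600_avg = 64 * 4 * 2 * 742e6    # ~380 GFLOPS
gtx      = 128 * 2 * 1.35e9      # ~346 GFLOPS
ultra    = 128 * 2 * 1.5e9       # ~384 GFLOPS (assumed ~1.5 GHz shader clock)

print(f"R600 @ 4 ops/unit: {r600_avg/1e9:.0f}  GTX: {gtx/1e9:.0f}  Ultra: {ultra/1e9:.0f} GFLOPS")
# ~380 GFLOPS edges out the GTX but falls just short of the Ultra.
```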
  7. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,580
    Likes Received:
    662
    Location:
    WI, USA
    I don't think that's true of the average B3D reader. :) (tho I do like pretty pictures and graphs interspersed to replenish my mental energies)
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    GTX is dual-issue, not scalar. It has plenty of hazards arising out of the SF units: think about contiguous scalar-dependent instructions that feed into or depend upon SF.

    Also think of texturing instructions, whose coordinates need interpolating (one ordinate at a time) before the texturing instruction can be fired off to the TMUs.

    Think about the apparently non-existent branch-evaluation unit in G80. Branching in G80 is a fair old mystery - you'll get hints of it in the CUDA guide.

    The result is compiler complexity, and less ALU-instruction throughput than the shader outwardly indicates.

    If you want to see a trivial example of G80's ALU throughput falling over for no apparent reason:

    http://forum.beyond3d.com/showthread.php?p=1005099#post1005099

    There's still not been an explanation for this behaviour. If you read on you'll see the shader code. R600 also has unexpected behaviour in these ALU tests, but G80's is rather more eye-catching.

    I presume you're playing catch-up on this ALU-throughput storyline in the R600 soap opera :lol: That's the 3rd appearance of that storyline, at least...

    Jawed
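A toy model of the issue behaviour described above, purely illustrative and not the real G80 scheduler: assume one MAD-style port and one SF/interpolation port can issue per clock, and a dependent instruction can't issue in the same clock as its producer. Independent work co-issues; a dependent chain serializes and leaves a port idle.

```python
def issue_cycles(instrs):
    """instrs: list of (unit, depends_on_previous) pairs, unit being 'MAD' or 'SF'."""
    cycles, i = 0, 0
    while i < len(instrs):
        cycles += 1
        issued, start = set(), i
        while i < len(instrs):
            unit, dep = instrs[i]
            if unit in issued:        # that port is already busy this clock
                break
            if dep and i > start:     # its producer only issues this same clock
                break
            issued.add(unit)
            i += 1
    return cycles

independent = [('MAD', False), ('SF', False)] * 2
dependent   = [('MAD', False), ('SF', True), ('MAD', True), ('SF', True)]
print(issue_cycles(independent))   # 2 clocks: a MAD and an SF co-issue each clock
print(issue_cycles(dependent))     # 4 clocks: the dependent chain issues one at a time
```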
     
  9. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    GTX can do 43 GFLOPS worth of special-function operations, while R600 does 47 GFLOPS. I don't think that's a significant difference. The SFUs are also used for interpolation, but that only costs one cycle, as you've explained to me. ;)

    Anyway, I don't see how this 'dual-issue' architecture could be made 'fully' scalar. So I don't think we can consider this a weakness of G80 unless there's a better way. Or am I missing something?
    Do they first compute 'u' for a whole batch, then 'v' for a whole batch? Or can they compute 'u' of the first 8 pixels, then 'v' of the first 8 pixels, then the rest of the pixels in the batch?
    Interesting. Any chance it's just limited by the CPU? Since this is an AMD test, have these numbers been confirmed with recent drivers for G80?
    Yeah, sorry, long threads scare me away. I don't have the time now to read everything, and the things I'm interested in only recently started to surface in shorter threads. :D
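Where those special-function figures roughly come from, as a sketch; the unit counts are assumptions (32 SFUs on G80 at the 1.35 GHz shader clock, one transcendental-capable lane per R600 five-wide unit at 742 MHz, one special-function op per unit per clock), not something stated in the posts:

```python
g80_sf  = 32 * 1.35e9   # ~43 G special-function ops/s
r600_sf = 64 * 742e6    # ~47 G special-function ops/s
print(f"G80 SF {g80_sf/1e9:.0f}, R600 SF {r600_sf/1e9:.0f} Gops/s")
```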
     
  10. aeryon

    Newcomer

    Joined:
    Oct 5, 2006
    Messages:
    85
    Likes Received:
    3
    Location:
    France / China
    No time to reply to the other points (and everybody can easily give good arguments against them), but one shocks me, since a lot of people are confused about the R600 specs:

    5) The HD 2900 XT has no UVD engine. The new UVD is only for RV610/630, and the HDMI is only 1.2, so it's not compatible with HD-DVD and Blu-ray (for that you need HDMI 1.3). Without proper HDMI, this feature is IMHO useless...
     
    #30 aeryon, May 20, 2007
    Last edited by a moderator: May 20, 2007
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    There are too many differences to come to any reliable conclusions. R600 might be based on Xenos, but it took so long to develop because there are a lot of differences.
     
  12. aeryon

    Newcomer

    Joined:
    Oct 5, 2006
    Messages:
    85
    Likes Received:
    3
    Location:
    France / China
    hmmmm

    [two benchmark graphs]


    source: dynamic branching test on hardware.fr

    not really what you say, in fact quite the opposite...
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,582
    Likes Received:
    625
    Location:
    New York
    Jawed, you keep hanging onto those AMD benchmark numbers as gospel, yet you ignore Rys' own findings:

    Also, you keep referring to this SF dual-issue hazard as G80's Achilles' heel, yet we have no evidence of such a thing occurring outside of contrived cases. Do you really believe it will be an issue in real shaders, or are you just trying to dispel the G80 scalar myth?
     
  14. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    Why do you want all the DX10 features?
     
  15. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    465
    Call me Ishmael.
     
  16. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    There definitely is a branch evaluation unit in both G80 and R600, not sure where you got the idea there isn't? I'm not sure exactly how it works though, because I'm not sure you get a free branch every cycle (which would be ridiculous overkill, anyway); perhaps it's clocked at 675MHz, which is the scheduler's clock? That'd still be more branches/clock (no matter how useless that metric is ;)) than R600, although I'd presume neither architecture is really starved there.

    Also, one thing to keep in mind for the CUDA guide: they always, always explain things in terms of "number of clocks taken", so you cannot really conclude much about throughput there. For example, based on that documentation, you might conclude the main ALU and the SFU cannot execute instructions at the same time, but this is obviously incorrect. They explain things that way so that the doc is fairly abstract and not too architecture-dependent.

    As for the SFU/FMUL dual-issue, I want to finish my triangle setup & ROP testers, and then I'll try fiddling a bit with G80 again based on, let us say, new information... :)
     
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    True, there are differences, but it's fairly safe to assume they didn't make anything slower than Xenos; everything should be as fast or faster. So if you simply decrease R600's benchmark scores by 50% to account for the clock speed difference, that should pretty much be a best-case scenario for Xenos (under texture addressing constraints).
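For what it's worth, a straight 50% cut works out close to the combined ALU-count and clock ratio between the two parts; a rough sketch, assuming the commonly quoted 240 ALU lanes at 500 MHz for Xenos versus 320 lanes at 742 MHz for R600:

```python
xenos = 240 * 500e6     # ALU lanes x clock
r600  = 320 * 742e6
print(f"Xenos/R600 ALU-rate ratio: {xenos / r600:.2f}")   # ~0.51
```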
     
  18. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,164
    Likes Received:
    1,461
    Location:
    Beyond3D HQ
    Branching isn't free on G80 or R600, so there's overhead there, but both have dedicated logic for it and the overhead is minimal (compared to the truly free branching on R5xx).

    EDIT: And Jawed's right, G80 isn't the paragon of simplicity in terms of scheduling that it can be made out to be, but that needs to be thought about in broader context.
     
  19. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Is that compared to requiring no scheduling effort at all, or compared to actual architectures you have in mind that would have better scheduling properties (without increasing complexity to the point that you have to lower the number of ALUs and/or clock frequency)?

    I mean, they'll always have to rely on some compiler work. But no matter how hard you try, a VLIW architecture can only use a fraction of its ALUs with dependent scalars.

    So I'm going to ask again: Is there ever a reason to prefer VLIW over scalar? Or in other words: Will any new architecture ever use VLIW again?
     
  20. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    In theory, you could have somewhat lower control overhead in terms of transistor count, I'd imagine. Whether that's worth it or not depends on how efficient it is in practice. Arguably, this is already much less important for GPUs than for CPUs, because the control overhead is much lower since the ALUs are SIMD anyway, even if they were "scalar" from a programming point of view.

    And thanks for the correction regarding the cost of branching Rys - oopsie! :eek:
     