AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My concern for flexibility is in workloads that don't use rasterization, filtering, or Z-testing.
    AMD would be concerned about this too.

    That's a whole other level of flexibility I hadn't been thinking of, actually.
    I for some reason thought R600's virtualized resources already had some flexibility in this regard.

    SIMD count puts a very usable upper bound on the amount of concurrent and independent work a GPU in the style of R600 can do.
    We know by virtue of there being 4 SIMDs, we won't be seeing 5 different shader clauses hitting the ALUs at the same time.

    Inflexibility normally comes at the cost of utilization, which in turn impacts performance and power.
    Power, in particular, is something that will be even more critical in the future.
    I honestly hoped that by now we'd have seen a more assertive repudiation (by execs, some no longer with AMD) of statements I heard earlier claiming that GPUs didn't have to worry about a power ceiling.

    Even if they did time-slice, time-slicing isn't concurrent execution.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    In a given clock cycle, only one instruction is actually going through the ALUs. The SIMD alternates between the two sequences.
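    A toy sketch of that scheduling claim (sequence and instruction names are made up for illustration): two instruction sequences can be in flight on one SIMD, but the alternation means each clock cycle carries exactly one instruction through the ALUs.

```python
# Toy model of one SIMD alternating between two instruction sequences.
# Hypothetical instruction names; the point is that every cycle is
# assigned to exactly one instruction, never two.

def interleave(seq_a, seq_b):
    """Return (cycle, instruction) pairs, alternating between sequences."""
    schedule = []
    cycle = 0
    a, b = list(seq_a), list(seq_b)
    while a or b:
        if a:
            schedule.append((cycle, "A:" + a.pop(0)))
            cycle += 1
        if b:
            schedule.append((cycle, "B:" + b.pop(0)))
            cycle += 1
    return schedule

sched = interleave(["MUL", "ADD"], ["MAD", "ADD"])
for cycle, instr in sched:
    print(cycle, instr)  # each cycle holds exactly one instruction
```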
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,111
    Location:
    New York
    Oh, I'm using the exact same definitions that you are. My possibly incorrect assumption is that an atomic instruction ADD/MUL/MAD etc is actually spread across multiple clocks/pipeline stages which opens the door for multiple batches to be "executing" at once. If the entire calculation is performed in a single cycle then I see what you're saying.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I think I was unconsciously assuming single-cycle ops, but I think the utilization argument still stands.

    If we stretch things over multiple cycles in a pipelined unit, we're injecting bubbles that still indicate ALU units are not performing useful work.
    If 1/3 of an ALU is idle in cycle one, then another 1/3 in cycle two, and then the remaining 1/3 in cycle three, after 3 cycles, it's the same as if the whole ALU were idle for 1.

    We won't find more than one instruction in the same stage of execution across all 16 SIMD elements.
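    The bubble-accounting arithmetic above can be checked directly (the 1/3 fractions are the post's own illustrative numbers): idleness spread across pipeline stages wastes exactly as many ALU-cycles as the same idleness concentrated in one cycle.

```python
# Toy utilization accounting: bubbles spread over several cycles cost
# the same total ALU-cycles as one fully idle cycle.

def wasted_alu_cycles(idle_fraction, cycles):
    """Total ALU-cycles lost to bubbles over a run of `cycles` cycles."""
    return idle_fraction * cycles

spread = wasted_alu_cycles(1 / 3, 3)      # 1/3 of the ALU idle in each of 3 cycles
concentrated = wasted_alu_cycles(1.0, 1)  # whole ALU idle for 1 cycle
print(spread, concentrated)               # both lose one full ALU-cycle
```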
     
    #1104 3dilettante, Apr 8, 2008
    Last edited by a moderator: Apr 8, 2008
  5. Shtal

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    1,344
    Likes Received:
    4
    This is just a theory: if ATI increased the L2 cache to 1024KB, could the larger cache, even with only 32 TMUs, remove the limitations of combining 800 stream processors in RV770?
     
  6. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    You can see that today, where adding more GPUs gives you diminishing returns and a worse gaming experience because of AFR. The number of customers today with the ability to go to 3+ GPUs must surely be disappointed with their purchase, or wondering why they should bother, since those are supported shipping configurations.

    Multi-GPU needs new ways to scale rendering performance across the chips, or a renewed focus on an increased IQ contribution, and ideally both, for it to succeed in the longer term.

    If R700 is still RV770 x 2 (and I think it is), then going 3-way+ seems no more attractive to me than it is now, without something in the architecture to help it do new things there.

    EDIT: whoops, thought I was quoting a brand new post there. My points still stand though :lol:
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Well I have to say you've got me utterly mystified about the general point you're making.

    Clearly there's a sweet spot for performance when you have a set of kernels that make up an application, all of which need to communicate. But that sweet spot is heavily processor-specific due to bandwidths, latencies, cache sizes, supported-batch counts, state overheads, etc.

    It would certainly be an entertaining prospect if AMD is planning a GPU of 4 SIMDs, each processing 96 objects in parallel (1920 SPs in today's parlance, batch size of 384), for 2010.

    In the meantime GPUs' inability to timeslice kernels or simply execute multiple kernels side by side is definitely a problem.

    Jawed
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    If body of executable code A for whatever reason winds up utilizing overall resources by 40% and code B is even worse at 20% (made up numbers for the sake of argument), timeslicing them changes nothing.

    In that case, we'd wind up sandwiching time slices of 40%/20%/40%/20% utilization. No time whatsoever is saved, and units are wasted twiddling their thumbs.
    If the GPU could run them concurrently in an ideal world, we'd instead see a single period of 60% utilization.

    In a less than ideal world, we see that an R600-style GPU's ability to run concurrently is limited by the fact that the SIMDs are the basic unit of granularity.
    At 4 SIMDs, we could at best see a step function of quarter increments.

    If the SIMDs only got longer and still numbered only 4, we'd still only ever see utilization increments of 25%, on even coarser and less-utilized SIMDs.

    If we had more than 4 code streams, we'd see absolutely no change in utilization, because one stream must inevitably sit around and wait its turn.
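    A rough model of the argument so far (the 40%/20% figures are the post's made-up numbers, and the 4-SIMD step function is the R600-style granularity described above):

```python
# Time-slicing two under-utilizing kernels vs. running them concurrently,
# plus the whole-SIMD utilization granularity with 4 SIMDs.

def timesliced_utilization(u_a, u_b):
    """Alternating slices: average utilization; no time is saved."""
    return (u_a + u_b) / 2

def concurrent_utilization(u_a, u_b):
    """Ideal concurrent execution: utilizations add, capped at 100%."""
    return min(u_a + u_b, 1.0)

def simd_step(busy_simds, simd_count=4):
    """With whole SIMDs as the grain, utilization moves in 1/4 steps."""
    return min(busy_simds, simd_count) / simd_count

print(timesliced_utilization(0.40, 0.20))  # ~0.30 on average
print(concurrent_utilization(0.40, 0.20))  # ~0.60 in a single period
print([simd_step(n) for n in range(6)])    # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```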

    I would find it interesting, but even a moderate amount of underutilization would be like leaving an RV670's-worth of performance (but perhaps not the TDP) on the table.
    If it got that long, I'd hope for better utilization, or a way for a more definitive power-down of unused units and datapaths.

    If there were some way to fudge the SIMD units out of lockstep, or allow them to occasionally sneak in other work, then the effective performance impact of extreme batch sizes could be reduced.

    I agree on that.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The count of SIMDs in a GPU doesn't affect the divergence penalty for dynamic branching. It's the size of a batch, which is some multiple of SIMD-width, that determines this.

    If what you're saying is that more SIMDs are better in dynamic branching performance for a fixed number of pixels per clock - then I agree with that. For a given number of pixels, say 128, all executed during the same clock cycle at the same time, then it's better to have 16 SIMDs of 8 lanes instead of 4 SIMDs of 32 lanes. Clearly the former costs more die area.

    Now, the comparison of G80 and R600 is interesting because G80 processes 128 pixels per clock, while R600 processes 64. G80 uses 16 SIMDs, while R600 uses 4. So G80 has 8 lanes per SIMD while R600 has 16.

    We already know that in pixel shading R600 has twice the batch size of G80, i.e. it's worse. In vertex shading it's even worse, 16 on G80, 64 on R600.

    G80's custom logic implementation for the ALUs allows NVidia to use half as many as would otherwise have been needed. Otherwise G80 would have been something like 32 SIMDs that are 8 wide or 16 SIMDs that are 16 wide.
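    The batch-size point above can be sketched as an expected-cost calculation. Assuming each element of a batch takes a branch independently with probability p (an illustrative model, not measured data), a batch pays for both paths whenever its elements diverge, and larger batches diverge more often:

```python
# Expected branch cost per batch: a uniform batch runs one path,
# a divergent batch runs both.

def expected_branch_cost(batch_size, p):
    """Expected path-executions per branch for one batch of given size."""
    p_uniform = p ** batch_size + (1 - p) ** batch_size
    return 1 * p_uniform + 2 * (1 - p_uniform)

# G80-style pixel batch of 32 vs R600-style batch of 64, with p = 0.1:
print(expected_branch_cost(32, 0.1))  # ~1.97
print(expected_branch_cost(64, 0.1))  # ~2.00 -- the larger batch is worse
```

    Note the cost depends only on batch size, not on how many SIMDs the chip has, matching the point at the top of the post.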

    Conditional Routing:

    http://cva.stanford.edu/publications/2004/kapasi-thesis/

    The problem is the fragmentation you get with nesting and the bandwidth/time consumed in forming the temporary batches required for CR.

    Jawed
     
  10. IbaneZ

    Regular

    Joined:
    Apr 15, 2003
    Messages:
    743
    Likes Received:
    17
    http://xtreview.com/addcomment-id-4461-view-RV770-30-40-percent-faster.html

    30-40% faster than RV670? That's good enough for me. Two 4850's in CF should be sweet.
     
  11. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,395
    Anything idle in a chip is a waste of transistors -- that's a disadvantage.
    I don't think that G80 has any underused texturing capability. G92 -- yes, it has more than 'enough' TMUs, but I suppose that going from 32 trilinear to 64 bilinear didn't cost much from a die-area POV, so it can't really be considered a disadvantage.
    Going from 320 underutilized ALUs to 800 underutilized ALUs will cost plenty of die space, and if they're idle most of the time then that die space is wasted. And if RV770 is a mid-range chip then I don't see any room to waste die area in it (because what matters in the mid-range is the average cost/performance ratio in today's applications).
     
  12. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Two points:
    1) If TMUs are also increased to 40, then underutilization in a given workload wouldn't be any higher (excluding non-shader-core bottlenecks). Clearly going for only 32 TMUs would increase it slightly, though.
    2) This would be a substantially faster chip, which means it is more likely to be used at higher resolutions. Because texture resolutions are limited, multi-cycle texture filtering is then less likely to be necessary, thus increasing the effective ALU:TEX ratio.
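    One way to read point 2 as arithmetic (shader and cycle numbers below are hypothetical, purely for illustration): cheaper per-fetch filtering shrinks the texture-cycle side of a shader's cycle-weighted ALU:TEX demand, making the same workload look more ALU-bound.

```python
# Cycle-weighted ALU:TEX demand of a hypothetical shader.

def workload_alu_tex(alu_cycles, tex_fetches, cycles_per_fetch):
    """Ratio of ALU cycles to texture cycles demanded per pixel."""
    return alu_cycles / (tex_fetches * cycles_per_fetch)

# A shader with 16 ALU cycles and 4 texture fetches per pixel:
print(workload_alu_tex(16, 4, 2))  # 2-cycle filtering -> ratio 2.0
print(workload_alu_tex(16, 4, 1))  # single-cycle (magnified) bilinear -> 4.0
```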
     
  13. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    If we are going to be technical, let's be technical. Best case, unless RV770 is a ground-up redesign, is 12 or fewer processors, each consisting of 80 or more shader units/ALUs/SIMD bitslices.

    Aaron Spink
    speaking for myself inc.
     
  14. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    I'm not even sure we should *really* be counting the FPS numbers in AFR modes. Certainly they should be de-rated. I would like to see a much higher focus on using SLI (a misnomer if there ever was one) or CF either to increase image quality or to accelerate the rendering of individual frames (which requires some sort of distributed workload).

    Especially with the single-card designs, they should be able to design in ~18-24 GB/s of inter-chip bandwidth, which should be plenty to support cross-buffer reads and actually allow real improvements in frame rates.

    Aaron Spink
    speaking for myself inc.
     
  15. nicolasb

    Regular

    Joined:
    Oct 21, 2006
    Messages:
    421
    Likes Received:
    4
    For what it's worth: http://www.fudzilla.com/index.php?option=com_content&task=view&id=6721&Itemid=1

     
  16. XMAN26

    Banned

    Joined:
    Feb 17, 2003
    Messages:
    702
    Likes Received:
    1
  17. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,791
    Likes Received:
    1,596
    Err, I think it would be a good deal better than that. In some games the 3870 even beats the 8800GT/S. Generally the 3870 is about as fast as a 9600GT, which isn't far behind an 8800GT either. Heck, fix the AA hit and the 3870 would compete with the 8800GT right now (as it generally does with AA turned off).

    But I agree with your point, I am sure Nvidia is making a solution aimed at being 100% faster than G80. 30-40% is what AMD expects to compete with?

    I mean, if the tech specs are true I don't see how that is, either... all RV770 specs seem at least doubled from RV670... on the other hand, the die-size rumors place it pretty small...

    The fudzilla report is pretty exciting though, June 3 is less than two months away.
     
  18. Berek

    Regular

    Joined:
    Oct 17, 2004
    Messages:
    274
    Likes Received:
    4
    Location:
    Austin, TX
    Given numbers like those in the review link below, it better be 50 percent MINIMUM over the 3870x2 if it's going to compete all-around against the GX2. 100 percent or more may just bring it up to perhaps compete with the GT200 when that is released.

    http://www.anandtech.com/video/showdoc.aspx?i=3266&p=1

    Some of the benchies here bring the 3870x2 to within 15 percent, but most of them put it way behind, due either to OpenGL driver-related problems or to simple performance issues in hardware. The 3870x2 is considerably cheaper though, which is its benefit... if you're willing to settle for 50 percent less performance in many games.

    ATI must know they need to at least take out the GX2 (and get close to the GT200) one way or another, even if they aren't all about "high-end" at this time... whatever they come out with in June has GOT to be amazing or the next thing I'm getting is 2x "9800" gt's or the GT200.

    A good start would be more supportive drivers, combined with bringing performance up on those games that suffer severely against the GX2. It's too much of a mixed bag with the RV670 series at this time.
     
    #1118 Berek, Apr 9, 2008
    Last edited by a moderator: Apr 9, 2008
  19. nicolasb

    Regular

    Joined:
    Oct 21, 2006
    Messages:
    421
    Likes Received:
    4
    Mm, and I bet ATI is really worried about what your next purchase will be. :cool2:
     
  20. Berek

    Regular

    Joined:
    Oct 17, 2004
    Messages:
    274
    Likes Received:
    4
    Location:
    Austin, TX
    It wasn't an intended threat, of course :). I was simply informing you guys of my decision factor, not making a target against them... obviously. They will ultimately release what they want and see as most beneficial. I'm simply pointing out the differences in competition-level requirements.

    Just to ensure clarification and avoid another unneeded response, I meant "performance" related competition :p.
     