Xenos - invention of the BackBuffer Processing Unit?

Discussion in 'Console Technology' started by Shifty Geezer, May 23, 2005.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Nitehawk - I think the IHVs use simulators for this kind of stuff.

    Jawed
     
  2. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Where did you get the info about NV40 only processing 1 shader per clock over its 4 quads? I had understood that each quad processed its own discrete set of pixels, but that the quads were assigned a group of pixels on a case-by-case basis from a triangle, unlike R3/4xx, which sets up its pixel shaders with tiles.
     
  3. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.

    Nite_Hawk
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    This page:

    http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=7

    implies that all quads will render a single triangle.

    What's not clear is how soon a new triangle can be rendered by spare quads. It would make sense that (I'm wrong:) each quad has separate instruction decode and spare quads can start the new triangle immediately.

    But I'm not sure. It would be interesting to find out for sure.

    One of the curiosities of NV40 is that dynamic branching isn't at the quad level (as far as experimenters can tell)...

    Jawed
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    NV40 pixel pipelines work on batches of ~1000 pixels. I doubt all the quads are working on the same triangle; performance on small triangles would be quite poor if they were.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://3dcenter.org/artikel/2005/03-31_a_english.php

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Do you know how a batch is constructed?

    Jawed
     
  8. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,985
    Likes Received:
    88
    Location:
    20001
    When counting bandwidth is it reasonable to separate bandwidth into functional units?

    For example

    Logic to Logic : CPU -> GPU is an example

    Logic to RAM: CPU -> Main Ram is an example

    These bandwidth assessments represent "bandwidth hops" between functional unit types (memory and logic).

    Counting the internal memory of the SPUs doesn't work here, because that is CPU work that would not normally require external bandwidth under any circumstances. For example, no one counts the bandwidth between the X360 CPU cores and their L1/L2 caches.

    However, under the structure of Xenos, work that would normally use external bandwidth (a "hop") is internalized and made faster. The hard part is that some logic is now nested... do you count the bandwidth based on hops all the way through?

    CPU -> [GPU] is one "bandwidth hop" but then there are additional bandwidth hops represented as [GPU (Shader -> [eDRAM -> Logic])]

    Somehow or other we can justify the internal bandwidth hop from Shader to eDRAM (48 GB/s) but not the additional nested bandwidth hop (eDRAM memory -> eDRAM Logic)?

    Think of it like this: if there was a small 10MB cache separate from GPU, CPU, and Main RAM used to do this work, you WOULD count the bandwidth it took to access it... because it encapsulates "a hop" in the traditional sense.

    Just because it's unconventional doesn't mean you don't count it.
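
    A minimal sketch of the hop-counting question, using only the link figures quoted in this thread (48 GB/s between the shader die and the eDRAM die, and the 256 GB/s eDRAM-to-logic figure mentioned further down). The link names and the "kind" labels are hypothetical, chosen for illustration; the point is only that the aggregate depends entirely on which hops you decide to count.

```python
# Link figures quoted in this thread, in GB/s. The "kind" labels are
# hypothetical names for this sketch, not an official taxonomy.
LINKS = [
    ("shader die -> eDRAM die", 48.0, "die_to_die"),
    ("eDRAM -> eDRAM logic", 256.0, "inside_edram_die"),
]


def aggregate_bandwidth(links, counted_kinds):
    """Sum the bandwidth of every link whose kind is in counted_kinds."""
    return sum(gb_s for _name, gb_s, kind in links if kind in counted_kinds)


# Counting only the hop between the two dies.
print(aggregate_bandwidth(LINKS, {"die_to_die"}))
# Counting the nested hop inside the eDRAM die as well.
print(aggregate_bandwidth(LINKS, {"die_to_die", "inside_edram_die"}))
```

    Adding further links (CPU to GPU, CPU to main RAM) would just mean more rows in the table, each with its own kind.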
     
  9. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    So what you're saying is that all existing examples of what you call "internalized bandwidth" aren't valid to count, well... because... well, because... we just don't count them...
    But in R500 we must make an exception to the above rule because... well... because you or someone else (Microsoft?) says so...

    Ok just wanted to clear this up.

    Anyway, even by this logic (aka CPUs don't count, or whatever), no one has explained to me why GS page refill bandwidth doesn't count - it's pretty much the same thing as the R500 high number.
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    It would require external bandwidth if it didn't have local storage. The introduction of local storage means we don't count the bandwidth the processors need, though they are consuming data at however many bytes/s. Why should that be different for GPUs?

    I would agree. In such cases [eDRAM -> Logic], why do you count eDRAM and Logic as two separate functional units?

    If you count hops, why does the backbuffer processing on local storage count as a hop? If that is a hop, why isn't it a hop from level 1 cache to CPU logic?
     
  11. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Thanks for the link, that was interesting reading. :)

    I'd be really interested in talking to Greg more about their testing methodology. I wonder how hard he is to track down.

    Nite_Hawk
     
  12. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    I'm slowly starting to view R500 a bit as a hybrid IMR/TBDR of some kind. Sure, it doesn't have units to do Z layout of one tile while shading another. And it can't remove opaque overdraw completely, or do order-independent transparency. But it can do a very fast Z first pass and then remove most of the opaque overdraw (limited by hierZ granularity), do blending and AA without apparent cost, and it renders into a large "on-chip" tile. Though, we don't know how it handles the "binning" yet, i.e. how it handles triangles that pass tile borders, whether they're going through VS again or not, etc.
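
    A toy model of the Z-first idea for a single pixel, assuming opaque fragments only and a "smaller depth is nearer" convention. The function names are hypothetical. It does not model hierZ granularity, which is why real hardware would remove most rather than all of the opaque overdraw.

```python
def shaded_without_prepass(depths):
    """Fragments shaded when opaque geometry is depth-tested in submission order."""
    shaded = 0
    nearest = float("inf")
    for depth in depths:
        if depth < nearest:  # passes the depth test, so it gets shaded
            nearest = depth
            shaded += 1
    return shaded


def shaded_with_prepass(depths):
    """Fragments shaded once a depth-only pass has already stored the nearest depth."""
    if not depths:
        return 0
    nearest = min(depths)  # what the Z-only first pass leaves in the depth buffer
    # In the colour pass only fragments matching the stored depth survive.
    return sum(1 for depth in depths if depth == nearest)


back_to_front = [0.9, 0.5, 0.2]  # worst case: every fragment is nearer than the last
print(shaded_without_prepass(back_to_front), shaded_with_prepass(back_to_front))
```

    With back-to-front submission the plain depth test shades every layer, while the prepass version shades one fragment per pixel regardless of order.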
     
  13. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,985
    Likes Received:
    88
    Location:
    20001
    I think it is normal to do that back buffer work in main RAM (like in the RSX). Under that circumstance you count the bandwidth used to access the RAM. Just because you use specialized RAM with high bandwidth and a fairly wide interconnect, separate from main RAM, why shouldn't it be counted? It's a functional work unit too. I'm not necessarily invested in the outcome or the answer, I just like posing the question :D

    It's quite the opposite with the local SPUs, because no one counts the bandwidth between CPU logic and local memory stores such as L1 and L2, and it's normal not to do that...

    Somehow or other, this makes sense to me, but it's not getting across very well.
     
  14. Laa-Yosh

    Laa-Yosh I can has custom title?
    Legend Subscriber

    Joined:
    Feb 12, 2002
    Messages:
    9,568
    Likes Received:
    1,455
    Location:
    Budapest, Hungary
    Some thoughts...

    People seem to try to dismiss the advantages of Xenos's EDRAM just because an MS PR person made the wrong choice of counting it into the system bandwidth. Yeah, he was wrong - so does that suddenly remove this as an advantage?

    Summing up bandwidth in the way presented here is obviously not the ideal approach, so why beat a dead horse from both sides? Stop it and move to more interesting topics :)

    I'm sure the console devs here have ideas about average bandwidth requirements for actual ingame graphics. Like, we probably have such an amount of opaque overdraw, and such an amount of transparent overdraw and so on. We choose a resolution like 720p 2x AA to make the field as even as possible, and calculate the probable backbuffer/framebuffer bandwidth utilization for the two systems. We'll then see how much the RSX has to spend from its bandwidth, how much is left; and how that actually compares to the amount of traffic that Xenos needs to copy the backbuffer out to the main memory.
    Then we can repeat for other resolutions like 1080i/p, move up to 4X AA and so on... this would be a lot more profitable to the forum than the flamewar that you seem to get into about opinions :)
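
    A back-of-envelope sketch of the calculation proposed above. Everything here is an assumption for illustration: the overdraw factors, the 4-byte colour and Z formats, the per-fragment access pattern (Z read, Z write and colour write for opaque layers; Z read, colour read and colour write for blended layers), every fragment passing the depth test, and no compression. None of the numbers are measurements of either system.

```python
def framebuffer_traffic_gb_per_s(width, height, aa_samples, fps,
                                 opaque_overdraw, transparent_overdraw,
                                 bytes_color=4, bytes_z=4):
    """Uncompressed backbuffer traffic in GB/s under the stated assumptions."""
    samples = width * height * aa_samples
    # Opaque layer: Z read + Z write + colour write per sample.
    opaque_bytes = opaque_overdraw * (2 * bytes_z + bytes_color)
    # Blended layer: Z read + colour read + colour write per sample.
    transparent_bytes = transparent_overdraw * (bytes_z + 2 * bytes_color)
    return samples * (opaque_bytes + transparent_bytes) * fps / 1e9


def resolve_traffic_gb_per_s(width, height, fps, bytes_color=4):
    """Traffic for copying one resolved frame per refresh out to main memory."""
    return width * height * bytes_color * fps / 1e9


# 720p, 2x AA, 60 fps, with placeholder overdraw of 3 opaque and 2 blended layers.
print(framebuffer_traffic_gb_per_s(1280, 720, 2, 60, 3, 2))
print(resolve_traffic_gb_per_s(1280, 720, 60))
```

    Repeating the exercise for 1080p or 4x AA is a matter of changing the first three arguments; the overdraw factors are the part that needs real data from developers.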
     
  15. JAD

    JAD
    Newcomer

    Joined:
    May 23, 2005
    Messages:
    25
    Likes Received:
    0
    Location:
    Netherlands
    The Xbox 360 has no real need for any kind of compression between the on-die logic and the eDRAM, because it has more or less unlimited "bandwidth" between them. RSX, on the other hand, will surely use as much compression as it can to save bandwidth. Given this, can we calculate the backbuffer bandwidth utilization? I figure that compression would be less effective with a more complex scene.

    Can anyone comment on this?
    Maybe we can calculate bandwidth utilization without compression and then sort of make an educated guess at how effective compression can be?
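
    The two-step approach suggested here could be sketched as follows: start from an uncompressed traffic figure and divide by a range of guessed compression ratios. The ratios and the 8 GB/s starting figure are placeholders, not known values for RSX or any other part, and a single average ratio is itself a simplification, since effectiveness would vary with scene complexity.

```python
def effective_traffic(uncompressed_gb_s, compression_ratio):
    """Traffic left after compression; a ratio of 1.0 means no compression."""
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    return uncompressed_gb_s / compression_ratio


def traffic_range(uncompressed_gb_s, ratios):
    """Map each guessed compression ratio to the resulting traffic in GB/s."""
    return {ratio: effective_traffic(uncompressed_gb_s, ratio) for ratio in ratios}


# Placeholder input: 8 GB/s uncompressed, ratios from none to 4:1.
for ratio, gb_s in traffic_range(8.0, (1.0, 2.0, 4.0)).items():
    print(f"{ratio}:1 -> {gb_s} GB/s")
```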
     
  16. Khronus

    Newcomer

    Joined:
    Apr 15, 2004
    Messages:
    62
    Likes Received:
    2
    An excellent question, can't wait to hear back on that one!
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    How does PowerVR handle triangles that cross tile boundaries? I would imagine they are shaded once for each tile and clipped.
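
    For illustration, one simple way a tiler can decide which tiles a triangle belongs to is conservative bounding-box binning, sketched below. This is a generic technique, not a description of what PowerVR or Xenos actually does; the function name and parameters are hypothetical.

```python
import math


def tiles_touched(triangle, tile_size, screen_w, screen_h):
    """Return (tile_x, tile_y) for every tile the triangle's bounding box overlaps.

    triangle is three (x, y) screen-space vertices. The bounding box is
    conservative: it can list a tile that the triangle itself never covers.
    """
    xs = [x for x, _y in triangle]
    ys = [y for _x, y in triangle]
    tiles_x = math.ceil(screen_w / tile_size)
    tiles_y = math.ceil(screen_h / tile_size)
    # Clamp the box to the screen so off-screen vertices do not create bins.
    tx0 = max(0, math.floor(min(xs) / tile_size))
    tx1 = min(tiles_x - 1, math.floor(max(xs) / tile_size))
    ty0 = max(0, math.floor(min(ys) / tile_size))
    ty1 = min(tiles_y - 1, math.floor(max(ys) / tile_size))
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


# A triangle straddling the boundaries of 32x32 tiles on a 64x64 target.
print(tiles_touched([(10, 10), (40, 10), (10, 40)], 32, 64, 64))
```

    A triangle listed in several bins would then be rasterized once per tile and clipped (or scissored) to that tile, which is the behaviour the question above guesses at.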
     
  18. jvd

    jvd
    Banned

    Joined:
    Feb 13, 2002
    Messages:
    12,724
    Likes Received:
    9
    Location:
    new jersey
    Where are we getting the eDRAM-to-Xenos bandwidth from? I know the eDRAM to the rest of the eDRAM chip logic is 256 GB/s. But where did the figure for the eDRAM to Xenos come from?
     
  19. Tacitblue

    Newcomer

    Joined:
    Apr 23, 2005
    Messages:
    131
    Likes Received:
    1
    Presumably from the Tech report diagram.

    http://techreport.com/etc/2005q2/xbox360-gpu/block.gif

    Pretty impressive, but somewhat unfortunate that the MS marketing types use that number to fill in their aggregate bandwidth for the whole system, when architecturally it's an internal bus between two elements of the GPU rather than a link to the rest of the system, like the CPU or memory. But that's what sells hype, I guess. That's been discussed elsewhere.
     
  20. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    The bandwidth between the Parent die (i.e. Shader Logic) and the Daughter die (i.e. eDRAM with backbuffer logic for Z/Alpha/Stencil) is said to be 32 GB/s read and 16 GB/s write. These figures are derived from the Xbox Block Diagram leak from last year.

    I have not seen that information in any of the new material posted from MS. I guess Dave will be able to confirm what the bandwidth is when he gets his snazzy report done for us :D
     