Xenos - invention of the BackBuffer Processing Unit?

Discussion in 'Console Technology' started by Shifty Geezer, May 23, 2005.

  1. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
    Nvidia calls NV40 a three-processor architecture: vertex shaders + pixel shaders + ROPs.
    The ROPs even run at the memory clock rather than the GPU core clock, and it's a given that the ROPs have their own local store/cache holding some pixel tiles.
     
  2. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Statements like this should be qualified - there are many eyes on this board looking purely for fanboy console-war fodder for other boards. It's its own processor, but it's very limited to a specific task (framebuffer tasks), not a second equivalent to the parent GPU. Such logic is on pretty much every GPU; from what I know of the eDram in R500, they've just moved it out onto its own chip, closer to the eDram, to save bandwidth.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    This kind of evaluation is entirely valid in my opinion. Cell is an ultra-high performance digital signal processor. Very small blocks of data being processed incredibly quickly to produce a mind-boggling aggregate.

    Jawed
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    You can go further and describe each quad-pipeline as a processor core. :wink:

    What's interesting is that in R300 and up, the quad-pipelines are each able to run a different shader.

    In NV40 it seems that the same shader is running on all pipelines.

    So R420 is, for example, 4-way MIMD, whereas NV40 is 16-way SIMD.
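    To make the SIMD-versus-MIMD distinction concrete, here is an editor's sketch with invented numbers (not the actual R420/NV40 dispatch logic): a wide SIMD array pays for branch divergence across all of its lanes, while smaller independent groups only pay when their own lanes disagree.

```python
# Sketch: cycles spent on a divergent branch, one instruction per side.
# A group must execute a side if ANY of its lanes takes it; groups run
# in parallel, so the overall cost is the worst group's cost.

def divergent_cycles(lane_taken, group_size):
    worst = 0
    for g in range(0, len(lane_taken), group_size):
        group = lane_taken[g:g + group_size]
        worst = max(worst, len(set(group)))  # 1 if coherent, 2 if divergent
    return worst

# 16 pixels: the first 8 take the branch, the last 8 don't.
blocky = [True] * 8 + [False] * 8
print(divergent_cycles(blocky, 16))  # one 16-wide SIMD group: runs both sides
print(divergent_cycles(blocky, 4))   # four independent 4-wide groups: coherent
```

    With a spatially coherent branch like this, the 16-wide array executes both sides while each 4-wide group executes only one.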

    I think the vertex pipes in both architectures are purely independent and parallel.

    Hope I've got that all correct - wouldn't want to start a second off-forum flame-war in only two posts!

    Jawed
     
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Yes, but we are making progress! :D It's agreed that the eDRAM isn't local storage for the GPU per se, but a subset of the GPU. It's connected to the GPU via a 48 GB/s bus, which is fast, but not the same as actually being embedded on the GPU. Presumably this decision was made to improve yields, as embedding the backbuffer logic and 10 MB of eDRAM along with the unified shaders would have been difficult.

    @ nAo : In the case of other GPUs being considered three processors, how do those processors share data? Do they have an actual bus system with restricted data flow?
     
  6. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    The problem with this is twofold - one, how do you count this into "system" bandwidth figures? Simply adding the numbers together like those Xenon charts did is complete nonsense.
    Second, if you count these, the question immediately comes up of why we don't count bandwidth from on-demand loaded caches in the equation as well.

    Or to drive my GS example one step further - "hidden" bandwidth is there too.
    48GB/sec is the bandwidth between the page buffer and the ROPs. The eDram->page buffer bandwidth is actually much higher - 150GB/sec.
    Surely we should count that figure instead? ;)
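    A quick way to see the point about summing (editor's sketch; the external-bus figure is invented, the GS figures are from the post above): links at different levels of the hierarchy can't just be added, because data that actually has to travel the whole path only moves as fast as the narrowest link on it.

```python
# Sketch: summing per-link bandwidths vs. what data that must cross
# every link can actually achieve. Figures are illustrative only.

links_gb_s = {
    "main memory bus": 22.4,        # invented external figure
    "eDram -> page buffer": 150.0,  # figure from the post above
    "page buffer -> ROPs": 48.0,    # figure from the post above
}

marketing_sum = sum(links_gb_s.values())     # the "chart" number
end_to_end_limit = min(links_gb_s.values())  # what a full traversal gets
print(marketing_sum, end_to_end_limit)
```

    The internal links matter for framebuffer traffic that stays on-chip, but they say nothing about how fast textures or geometry can arrive from external memory.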
     
  7. JAD

    JAD
    Newcomer

    Joined:
    May 23, 2005
    Messages:
    25
    Likes Received:
    0
    Location:
    Netherlands
    Yeah, but making the same mistake again this time would be really stupid, right, as there are real benefits. I didn't contribute to the PS2 vs Xbox discussions, so I really don't know what was said and what wasn't.

    I think that dismissing this bandwidth-saving feature here and now because everyone did it 5 years ago is something we shouldn't do around here.
     
  8. Npl

    Npl
    Veteran

    Joined:
    Dec 19, 2004
    Messages:
    1,905
    Likes Received:
    7
    I don't think anyone is dismissing it; it's just the direct comparisons between the bandwidth of a 10 MB "cache" and that of the whole memory that are futile.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Fafalada - I think the problem with these bandwidths is trying to identify them and work out which are useful to count up.

    When AMD doubles the cache on an A64, it bumps up CPU performance by around 10%. That's a good example of increased performance (increased effective bandwidth) accruing from a non-obvious design change. A Newcastle and a Clawhammer core at the same clock speed perform differently because of the difference in cache size.

    Damn, doesn't that muck up our ability to "count" system bandwidth? Yep!
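    The cache effect can be put in rough numbers (editor's sketch; the bandwidth and hit-rate figures are invented, not A64 measurements): effective bandwidth blends the cache's rate and the external bus's rate by hit rate, so a bigger cache raises the blended figure without the external bus changing at all.

```python
# Sketch: effective bandwidth as a hit-rate-weighted blend of cache
# bandwidth and memory bandwidth. All figures invented for illustration.

def effective_bandwidth(cache_gb_s, mem_gb_s, hit_rate):
    return cache_gb_s * hit_rate + mem_gb_s * (1.0 - hit_rate)

smaller_cache = effective_bandwidth(50.0, 6.4, hit_rate=0.90)
doubled_cache = effective_bandwidth(50.0, 6.4, hit_rate=0.95)
print(smaller_cache, doubled_cache)
```

    A five-point hit-rate gain from doubling the cache moves the effective figure noticeably, which is exactly what makes "counting" system bandwidth so slippery.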

    Jawed
     
  10. AzBat

    AzBat Agent of the Bat
    Legend

    Joined:
    Apr 1, 2002
    Messages:
    7,749
    Likes Received:
    4,847
    Location:
    Alma, AR
    Yeah, _now_ I know what you mean. I always like bigger numbers, but only if they're used correctly. Though I can understand these numbers being kinda sticky. ;)

    Tommy McClain
     
  11. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
    I've asked Dave to find out whether 3Dc or higher is used in the R500. I've also asked for clarification on whether Fast14 was used in the design of the Xbox2.

    Do you think MS might have used any of the above in the Xbox2 since they are bandwidth saving features?

    btw.. welcome JAD

    US
     
  12. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Heya AzBat,

    Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

    Nite_Hawk
     
  13. AzBat

    AzBat Agent of the Bat
    Legend

    Joined:
    Apr 1, 2002
    Messages:
    7,749
    Likes Received:
    4,847
    Location:
    Alma, AR
    Yeah, I remember. That would be a helluva project. Not sure it would work in a closed console system, but it could still be useful for the PC market. Not as useful as 2 or 3 years ago though. It seems right now things are pretty even with the only 2 players out there. Basically you make decisions based on price and brand preference. ;)

    Tommy McClain
     
  14. Bohdy

    Regular

    Joined:
    Jun 9, 2003
    Messages:
    731
    Likes Received:
    4
    I agree that the eDram is important, but that you can't just add its bandwidth to that of main memory.

    A good place to start is to figure out how the RSX does the equivalent functions to the eDram logic, and how much bandwidth that would typically consume, then subtract it from the PS3 bandwidth figure for comparison. Or if the PS3 would not do some of those functions, then consider it as points towards the X360.
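    An editor's sketch of the kind of estimate suggested above; every figure here (sample count, bytes per sample, overdraw, frame rate) is an assumption for illustration, not a measured RSX or X360 number.

```python
# Sketch: rough colour+Z framebuffer traffic that on-die eDram would
# absorb. Assumes 8 bytes colour read+write plus 8 bytes Z read+write
# per sample; all parameters are invented.

def framebuffer_gb_s(width, height, aa_samples, bytes_per_sample,
                     overdraw, fps):
    per_frame = width * height * aa_samples * bytes_per_sample * overdraw
    return per_frame * fps / 1e9

fb = framebuffer_gb_s(1280, 720, aa_samples=4, bytes_per_sample=16,
                      overdraw=3.0, fps=60)
print(round(fb, 2))  # GB/s of traffic main memory would otherwise carry
```

    Under these assumptions the figure comes out around 10-11 GB/s; subtracting something like that from the main-memory figure is the apples-to-apples comparison being proposed.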
     
  15. nelg

    Veteran

    Joined:
    Jan 26, 2003
    Messages:
    1,557
    Likes Received:
    42
    Location:
    Toronto
    [guessing] I would suspect that the level of granularity at which this would be a problem is unlikely to occur. The design seems well thought out, and I would assume it very unlikely that such a limitation would have been overlooked. GPUs are all about hiding latency. The arbiter's job is to keep utilization of the ALUs high, so it would most likely be responsible for ensuring that such (pathological?) cases have minimal effect. [/guessing]
     
  16. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    I actually think for a closed system it would be really interesting if ATI/MS or nVidia/Sony did something like this internally to figure out where their bottlenecks are coming from. In all of these threads we keep talking about whether or not the PS3 is going to be bandwidth-starved with respect to the framebuffer throughput required, and how effective the edram on the ATI part is going to be. All of these things are going to be immensely dependent on how effective their compression is and how much overdraw there is. ATI, for instance, solved the problem with their ROP/edram solution. We don't know what nVidia's solution is (if they even have one), but we also don't really know if they need one.

    I think it would be immensely interesting to see tests done on something like the PS3 where the speeds of the SPUs, the number of SPUs, the GPU speed, and the various bus speeds can all be manipulated to see how changing them changes performance. Does the PS3 even need an edram solution like ATI's? Does ATI really need the edram solution, or could they have gotten away with less (like upping the main memory throughput)? I think these are the kinds of questions such a system could help us answer. I wonder if something like this is already in place?
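    An editor's sketch of the sort of toy model such a test rig would fit; the stage rates and workload counts below are invented, and real pipelines overlap far less cleanly than a simple max() over stages.

```python
# Sketch: frame time modelled as the slowest of three independent
# stages. Sweeping one parameter reveals whether it is the bottleneck.

def frame_ms(verts, pixels, vert_rate, fill_rate, bw_gb_s, bytes_per_pix):
    t_vertex = verts / vert_rate                          # seconds
    t_fill = pixels / fill_rate
    t_bandwidth = pixels * bytes_per_pix / (bw_gb_s * 1e9)
    return max(t_vertex, t_fill, t_bandwidth) * 1000.0

base = frame_ms(1e6, 50e6, 500e6, 4e9, 22.4, 8)        # bandwidth-bound
doubled_bw = frame_ms(1e6, 50e6, 500e6, 4e9, 44.8, 8)  # now fill-bound
print(round(base, 2), round(doubled_bw, 2))
```

    Here doubling bandwidth helps, but only until fill rate becomes the new limit - exactly the "does it even need eDram" question asked above, in miniature.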

    Nite_Hawk
     
  17. PC-Engine

    Banned

    Joined:
    Feb 7, 2002
    Messages:
    6,799
    Likes Received:
    12
    I see another advantage of ATI's solution when shrinking to 65nm. :wink:
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    nelg - I agree with your guess, which is why I'm dubious that all 48 ALUs are running the same instruction.

    On the other hand, if all 48 ALUs are running different instructions, that necessitates a huge amount of instruction decode logic for the whole GPU, whereas in previous designs this logic only occurred a relatively small number of times (once per quad in the pixel shaders, and once per vertex pipeline (?)). It also increases the amount of program counter logic (every ALU requires a private counter) and means that the shader state memory has to be partitioned per ALU too, since register accesses would be incoherent.

    In other words 48 ALUs with no grouping into quads is really messy.

    Perhaps it's 16 ALUs per group, so there are three different threads concurrently executing on 16 ALUs each. Sounds like a reasonable compromise to me, but still with a high branch misprediction cost.
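    One way to weigh group width against that branch cost (editor's sketch; real pixel branches are spatially coherent, so this independent-lanes model overstates divergence):

```python
# Sketch: probability that a group of ALUs diverges on a branch each
# pixel takes independently with probability p.

def p_diverge(group_size, p_taken):
    # Diverges unless all lanes take the branch or none do.
    return 1.0 - p_taken ** group_size - (1.0 - p_taken) ** group_size

for width in (4, 16, 48):
    print(width, round(p_diverge(width, 0.1), 3))
# Wider groups diverge far more often on the same branch.
```

    Even for a branch only 10% of pixels take, a 48-wide group almost always diverges, while a 4-wide quad stays coherent about two times in three - which is why the grouping choice matters so much.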

    So this matter of SIMD versus n-way MIMD is intriguing...

    Jawed
     
  19. London Geezer

    Legend Subscriber

    Joined:
    Apr 13, 2002
    Messages:
    24,151
    Likes Received:
    10,297
    Who's fabbing the chip?
     
  20. PC-Engine

    Banned

    Joined:
    Feb 7, 2002
    Messages:
    6,799
    Likes Received:
    12
    Doesn't matter. :wink:
     