Predict: The Next Generation Console Tech

Discussion in 'Console Technology' started by Acert93, Jun 12, 2006.

Thread Status:
Not open for further replies.
  1. Tahir2

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,978
    Likes Received:
    86
    Location:
    Earth
    In the run-up to the eventual demise of LRB there was talk of an LRB 2 and 3. All fine and good, but what was the point of LRB 1's fully programmable nature if there was speculation that the parts would not necessarily be compatible with each other!

    And it was at Beyond3D that I read that LRB 1 was not going to be 100% DirectX 11 compliant! I'm going to try and search through the many posts now - and hopefully update you when/if I find the information. Of course, I may have been misinterpreting a post.

    In the end, though, there is no point to a feature in a graphics card if it is not usable in real time (and that means more than 3 or 4 fps).

    Edit: something about a "fixed function texture filtering mechanism" seems to ring a bell, however I may be completely wrong on this one and LRB may in fact have been destined to become a DX11 compliant part as well.

    Edit2: I have asked the proper people in the appropriate forum to put an end to my possibly misguided thoughts. If I am wrong, I blame the interweb.
     
    #3301 Tahir2, Jan 14, 2010
    Last edited by a moderator: Jan 14, 2010
  2. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    It could have been related to supported texture formats ;)
     
  3. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Yes, it could be limited by performance, just like the DX Reference Rasteriser could render any DX effect, but not at hardware-accelerated speeds. If implementing DX11 features brought LRB to its knees, then it could be considered DX10 in terms of what it could achieve in DX games. That said, it could also do things DX 10 couldn't (whether devs decided to or not!).
     
  4. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    Can't the DirectX 10 cards do everything in DirectX 11 in software? I'm not disagreeing with you, but perhaps we might need another distinction in the age of programmable pipelines.
     
  5. sunscar

    Regular

    Joined:
    Aug 21, 2004
    Messages:
    343
    Likes Received:
    1
    Not that I'm aware of - you could code something like a tessellator in CUDA, but I'm unaware of any way to do that in DX10 specifically. That doesn't bar anybody from building such a thing in a primarily DX10 engine; it'd just mean the engineer would have to find a workaround (which may end up comparatively slooooow). As it stands, Larrabee should've been capable of spitting out just about any frame you could coax out of a future Radeon or GeForce, but it comes down to practicality, i.e. is it impractically slow to render random frame A on Larrabee vs. on dedicated hardware.
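    To make the "build your own tessellator" idea concrete, here is a minimal sketch of one level of midpoint subdivision - roughly the core operation a software tessellation workaround would run where no hardware stage exists. All names are illustrative and not taken from any real engine or API:

    ```python
    # Hypothetical sketch: one level of midpoint subdivision, the kind of
    # software "tessellator" an engine might roll by hand on DX10-class parts.

    def midpoint(a, b):
        """Average two 3D vertices component-wise."""
        return tuple((a[i] + b[i]) / 2.0 for i in range(3))

    def subdivide(tri):
        """Split one triangle (v0, v1, v2) into four smaller triangles."""
        v0, v1, v2 = tri
        m01 = midpoint(v0, v1)
        m12 = midpoint(v1, v2)
        m20 = midpoint(v2, v0)
        return [
            (v0, m01, m20),
            (m01, v1, m12),
            (m20, m12, v2),
            (m01, m12, m20),  # centre triangle
        ]

    tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    tris = subdivide(tri)
    print(len(tris))  # 4 triangles from 1; n levels of recursion give 4**n
    ```

    Running this once over every triangle each frame on the shader cores is exactly the kind of thing that works but may end up "comparatively slow" next to a dedicated tessellation unit.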
     
  6. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    Between the texture units and the ROP units, both are over-sized for pretty much all situations. As they are fixed function units, the rule of thumb is to spend more transistor area than is strictly needed, because you cannot always predict what kind of workload you're going to get, and the whole pipeline can be limited by how fast work moves through these units. So between fixed function and unified function (as with stream processors) it's a trade-off, and there's a point somewhere at which it's better to have more general processing performance than specialised performance.

    The difference between the two is that a ROP unit generally limits you to rasterization, whereas a texture unit can be applied more flexibly across different rendering methods such as ray tracing and voxel-based rendering. In addition, a ROP unit performs calculations which can be emulated well on a stream processor - or, where present stream processors can't emulate them well, they are functions which would increase the overall usefulness of the ALU banks for GPGPU work, as rasterization is very vector-heavy. We have already seen some benefit to more flexibility and efficiency here. This is the reason why Larrabee didn't have ROP units but still retained texture units.
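    As a rough illustration of what "emulating the ROP on the stream processors" means, here is the per-fragment work a ROP performs - depth test plus alpha blend - written out as plain code. Names and the 1x1 framebuffer are made up for the example:

    ```python
    # Illustrative sketch of per-pixel ROP work (depth test + "over" blend).
    # Fixed-function hardware does this for every fragment; doing it on the
    # shader ALUs instead is the trade-off discussed above.

    def rop_write(framebuffer, depthbuffer, x, y, src_rgba, src_z):
        """Depth-test then alpha-blend one fragment into the framebuffer."""
        if src_z >= depthbuffer[y][x]:      # fails the less-than depth test
            return
        depthbuffer[y][x] = src_z
        a = src_rgba[3]
        dst = framebuffer[y][x]
        # Classic "over" blend: src*alpha + dst*(1-alpha)
        framebuffer[y][x] = tuple(
            src_rgba[c] * a + dst[c] * (1.0 - a) for c in range(3)
        )

    fb = [[(0.0, 0.0, 0.0)]]            # 1x1 black framebuffer
    zb = [[1.0]]                        # depth cleared to the far plane
    rop_write(fb, zb, 0, 0, (1.0, 0.0, 0.0, 0.5), 0.4)
    print(fb[0][0])  # (0.5, 0.0, 0.0): half-transparent red over black
    ```

    The math is trivial; the reason it stays fixed-function is that it runs for every fragment with guaranteed ordering and bandwidth, which is harder to sustain in software.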

    I cannot personally answer this one, I tried then deleted then tried again but still I just don't have the fundamental knowledge required. I hope that someone else can answer you here! :)
     
  7. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    Nicely put :) I guess that's the reason why, for example, in R8xx interpolation is now done in the shaders with good results. Well, I get it; so far I've read of nobody sane wanting to get rid of the texture units.
    I guess it's just that texture units are that efficient at their job - or, the other way around, that SIMD arrays are that bad at the texture units' job. Anyway, as I'm not sane, and as I realized that the die space taken by the texture units was far from marginal and that serious bandwidth bottlenecks are about to strike back badly, I felt like saying "hey guys! are you really sure it will still make sense in a few years?" (as I obviously can't decide this myself).
    Actually I wanted to start another fight between Nick and some other "violently disagreeing" members :lol:

    Actually, I deleted that part for many reasons.
    I got a bit carried away by my enthusiasm, to the point where I feel it came close to being a bit ridiculous; I mean, that was quite a question - the billion-dollar question. There are clever guys out there (not only here on the forum), and there are actually teams of these clever guys working on that kind of question. Even if a member could have answered, I would most likely not have been able to understand the implications/nuances behind his answer. On the other hand, I might have been happy with "at some point the logic has to be that close to the execution; having an even not-distant chip acting as a manager won't cut it", or the other way around, "Kudos, genius - where else did you expect GPUs to go if not down the Larrabee path?". Anyway, I feel that kind of question should be raised by members with real knowledge of the matter and discussed based on facts/rumors/etc. (if only out of respect for the army of engineers working on the matter). The kind of topic I like to read and wonder about, but where posting would/should actually feel awkward.

    If I were once again to go out of my way into territories I should not broach - forgivable, as I'm not sane, and thus I'll never learn my lesson :lol: - I should accept that maybe going the "Larrabee path" is the right thing to do, and that the lack of criticism of Intel's choice (many simple CPUs + wide SIMD, x86 as the ISA) should be enough to convince me given my knowledge of the matter. Actually, the only person I saw questioning the concept a bit (he said he wanted a proof of concept in some interview) was John Carmack.

    Anyway, when all is said and done, contrition aside, I would still be really happy if a conversation between high-level members could emerge on the matter :)
     
    #3307 liolio, Jan 15, 2010
    Last edited by a moderator: Jan 15, 2010
  8. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    Currently they are fighting the bandwidth demons with larger on-die caches and local stores for the various components, to keep them fed and make better use of what bandwidth is available. Dave has said as much himself. Developers are doing the same, as there has been a shift towards deferred rendering, so that should push the bandwidth monster away for at least a couple more years. Looking ahead at 3D hardware, it makes a lot of sense for consoles at least to have an on-die frame-buffer, and I'm pretty certain AMD and Nvidia have considered it for their desktop GPU parts, so long as they can keep any tiling transparent to the developer. However, the main issue is still vertex load, and unless they can figure that out for the desktop parts - I hate to think of the vertex load of a heavily tessellated model sitting astride the boundary of 2 or more tiles. :shock:

    Deleting something like the ROP units makes a lot more sense on a console part than on a desktop part. Consoles only have to run previous-generation titles and titles designed to fit within the console's design parameters. It's also a lot easier to present a wide body of developers with a fait accompli, as they have to make do with whatever the console manufacturer throws at them. In addition, a desktop part would have to run a game engine perfectly from the start; AMD/Nvidia cannot simply say, 'wait 12 months and you'll see!'. They need their GPUs to work perfectly and efficiently right from the start. That's probably the reason why Intel was so keen to get the LRB part into a console or two. :cool:

    Oh you sell yourself short! :lol: In any case I've passed the question along to a few knowledgeable members.

    Well, if you don't raise the question then you'll get nowhere, and like all my teachers have said, it's quite likely that someone else has the same question as you. So you may as well raise your hand and have everyone be better for it. What the engineers are doing is so steeped in NDA anyway that we have to make educated guesses, or we wouldn't get very far. :razz:

    The problems with Larrabee were many; you couldn't point to just its lack of raster hardware or its memory architecture as the fault. It was probably still before its time, as over time the overhead from any x86 baggage would diminish next to whatever efficiencies Intel brought to the design and the sheer prowess of their fabrication processes. Perhaps, as a larger chip, it ought to have waited until 2013 for 450mm wafer production to ramp up with 22nm chips?

    I would be happy too if we could bring back some of the regular 3D architecture people into this discussion, however I would probably die of a heart attack if Dave showed up drunk one day and spilled all the beans. :-D

    Edit: Too many smileys? LOL
     
  9. green.pixel

    Veteran

    Joined:
    Dec 19, 2008
    Messages:
    2,546
    Likes Received:
    781
    Location:
    Europe
    I think someone in this thread (or some other) mentioned that Carmack said that there should be no mandatory resolution TRC, that devs should render @ whatever res they want, or something like that, but I can't find it. Can someone find that post and/or post a link to what JC said? :)
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Texture filtering could be done entirely in the shaders ... but that's not really what makes the texture unit a texture unit. The problematic parts of the texture units are the decompression and the texture cache, the decompression is more expensive on the shaders and the texture cache has access patterns which make it unsuitable to pull into say the normal Larrabee L1 (unless you like wasting loads of bandwidth).
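    To show how small the filtering math itself is - and why, as argued above, it is the decompression and cache behaviour rather than the filter that define a texture unit - here is a bilinear sample written as plain arithmetic, the way a shader would have to do it. The texture-as-list-of-rows representation is just for illustration:

    ```python
    # Sketch of bilinear texture filtering done "in the shaders", i.e. as
    # ordinary arithmetic over four neighbouring texels.

    def bilinear(tex, u, v):
        """Sample a 2D texture (rows of float texels) at normalized (u, v)."""
        w, h = len(tex[0]), len(tex)
        x = u * (w - 1)
        y = v * (h - 1)
        x0, y0 = int(x), int(y)
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        fx, fy = x - x0, y - y0
        # Weighted average of the 2x2 texel quad around (x, y)
        top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
        bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
        return top * (1 - fy) + bot * fy

    tex = [[0.0, 1.0],
           [0.0, 1.0]]
    print(bilinear(tex, 0.5, 0.5))  # 0.5, halfway between the two columns
    ```

    Note the four dependent texel reads per sample: it is feeding those reads (decompressed, at odd strides) that the dedicated texture cache exists for.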
     
  11. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    From texture units to another "billion-dollar questions" post... :lol:

    Nice post ;) (that was aimed at Squilliam)
    I thought a bit more about your idea of opening a new thread in the architecture forum, and I'm not sure that we can't have this discussion here. I feel that the decisions made by console manufacturers will greatly influence ATI, Nvidia and Intel's choices. Especially as MS owns the main 3D API, the choices they make for their next console are likely to greatly impact what DirectX 12 will be about, and thus the direction taken by the industry on both the hardware and the software side.
    The relevance of fixed function units is part of this, as any radical move in this regard is more likely to happen in the console realm first. The likelihood of such a move is another matter. I guess we can leave this in the mods' hands, no?

    Anyway, you convinced me; sometimes someone has to ask. It happens that I have more questions about texture units and their impact on GPU architecture (not only on performance).
    One thing that bothers me (possibly because I don't get it properly) is the "memory model" they enforce. As I understand it, most of the data the ALUs/SIMD arrays deal with are textures (or data put into textures). If you look at a multiprocessor in an ATI card (say the arbiter-sequencer, texture units, LDS and the SIMD array), the only coherent memory space, the L1 cache, is tied to the texture unit. You raised the issue of "slightly overblown fixed function units"; I feel this is another issue: depending on what the ALUs are doing, you may have this L1 cache full of data that doesn't need any texture unit action - the texture is simply not a texture but a container.
    It looks pretty inefficient, as the fixed function hardware stalls, so you might want a separate way for the ALUs to access data. My feeling is that this would make an already really complicated task (designing a chip) even more complicated.
    This brings me to another point: how to clean this up? It looks clear that the texture units are here to stay; on the other hand, you would want a unified L1 cache. The texture cache is optimized for texture operations, but could it make sense to lose performance here in the name of a leaner memory model?
    I have another question orthogonal to this issue: sharing data between different entities raises quite some issues, like data protection when the memory space is coherent, or balancing resources (whether it's the hardware itself or software running on the hardware, something has to decide how much memory space each client gets, when to evict data from the cache, etc.). This looks pretty hairy to me.
    You need logic - possibly a huge argument for many in favour of a Larrabee-style design.
    It's a bit unclear to me how Intel handled the texture units in Larrabee, but by the look of it I would guess that having one texture unit per "core" under the control of the CPU logic was not an option (i.e. you can't scale the hardware down the way you can play with a SIMD unit's width; you need sizable blocks). As you can't have one texture unit per core, Intel had to give each a fixed share of the L2. What I mean is that this still looks a bit inefficient to me: if at some point your texture units are idle (say you're not dealing with graphics), so is their local share of the L2.
    I came to think about this: how about attaching a healthy texture unit, or multiple ones, to a CPU core?
    I actually thought a bit further, about a "fusion/heterogeneous design" where you would have a lot of simplistic CPU cores augmented by a SIMD array. Could the best choice be to tie the texture units to your more potent cores, so that if the texture units are idle, the potent core uses all the cache?
    I could see such a chip as a mix of:
    Nvidia's latest SoC, Tegra 2 - in Tegra 2 you find two ARM A9 CPUs and one ARM7 CPU.
    Intel's Larrabee, for the potent SIMD (in ours they would be matched to the ARM A7 cores).
    Intel's SCC chip, for the on-chip grid network and the "grape of cores".
    A last question, as I really know little of the underlying hardware that texture units consist of: what would texture units made more general purpose/completely programmable look like? (As it seems you can't pass on specialized hardware, it makes sense to maximize its usability.) To which uses would they lend themselves (on top of handling textures)?

    Quite some questions once again.
    EDIT
    MfA gave quite some hints that might have changed my post quite a bit, if not completely, but I read his answer too late. I was completely misled on the performance-critical areas... I learned something, though :)
    EDIT 2: I spoiled the parts that are completely wrong (most of the post :lol: ) as MfA made clear how misled my diagnosis was.
     
    #3311 liolio, Jan 16, 2010
    Last edited by a moderator: Jan 16, 2010
  12. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    Really nice post - you managed to clear up quite a lot of my misconceptions on the matter in so few words that it's not even funny, thanks :)
    I still have a little question about the texture cache: what makes it so different from a standard CPU L1 cache? Associativity? Cache line size? The number of cache lines that can be accessed in one cycle?
    Which is truer (if either, obviously)? (Sorry for the rough approach, but I could not find any better.)
    Could you scale such a cache up to a more standard L1 size for a CPU (from 8KB in RV8xx to 32KB in Larrabee)?
    Or
    Once scaled, would such a cache perform badly?

    Thanks for your time :)
    EDIT: As nobody answered my post, I have another question.
    I heard of a multi-dimensional cache, but I never understood what it could possibly mean. Is this related to the number of cache lines you can access in one cycle?
     
    #3312 liolio, Jan 16, 2010
    Last edited by a moderator: Jan 16, 2010
  13. assen

    Veteran

    Joined:
    May 21, 2003
    Messages:
    1,377
    Likes Received:
    19
    Location:
    Skirts of Vitosha
    Tom Forsyth of the LRB team said in a recent lecture that the typical access patterns are very local, meaning you get most of the benefit from a tiny cache, which covers just the bilinear/aniso fetching of adjacent pixels; you then see no benefit from increasing the cache until it becomes texture-sized (e.g. 1-2 MB), which is impractical.
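    A toy experiment illustrates the point above. If we replay a scanline of bilinear fetches (each pixel reading a 2x2 texel quad, neighbouring pixels sharing half of it) through an LRU cache, a handful of entries already captures all the reuse, and a much bigger cache does no better. The access stream and cache sizes are invented for the demonstration:

    ```python
    # Toy locality experiment: tiny texel caches capture bilinear reuse.
    from collections import OrderedDict

    def hit_rate(accesses, cache_entries):
        """Replay texel addresses through an LRU cache; return hit fraction."""
        cache, hits = OrderedDict(), 0
        for addr in accesses:
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)      # refresh LRU position
            else:
                cache[addr] = True
                if len(cache) > cache_entries:
                    cache.popitem(last=False)  # evict least recently used
        return hits / len(accesses)

    # One scanline of bilinear fetches: pixel x touches the 2x2 quad at
    # (x, 0)..(x+1, 1), so adjacent pixels share two of the four texels.
    accesses = []
    for x in range(256):
        for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):
            accesses.append((x + dx, dy))

    print(round(hit_rate(accesses, 8), 2))   # prints 0.5
    print(round(hit_rate(accesses, 64), 2))  # prints 0.5 - no extra benefit
    ```

    Half the fetches hit even with 8 entries because that is all the reuse this stream contains; only a cache big enough to hold whole textures would find more, matching the lecture's observation.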
     
  14. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    114
    Location:
    New Zealand
    No problem, liolio!

    I have to bow out for a while as I'm kinda too busy for long, thought-out posts. I'm no Joshua, who can plow out a 10-page article in 2 minutes!
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    AFAIK it's fully associative and I assume the "cache line" size is 128 bit ... so both I guess.
    That too, although they are probably banked to take the edge off (which is why it's not really accurate to say cache line size)
    It's not really an option.

    PS. no idea what the multidimensional cache is.
     
    #3315 MfA, Jan 16, 2010
    Last edited by a moderator: Jan 16, 2010
  16. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    The problematic parts are NOT the decompression and texture cache. That's really a design detail in the scheme of things. The issue with programmable texturing is the latencies involved, and hence the number of loads that need to be kept in flight. Texel locality within a single frame is incredibly poor. It's actually fairly poor within a single texture load, all things considered, once you get into things like mipmaps and AF levels.

    Decompression can be done fairly easily in the load pipeline if required, and texture caches really are just normal caches with different indexing. Aside from the latencies involved, the next big issue is generating all the memory load addresses, which requires some specialized math that could be added to an int execution unit if you wanted to.

    But in the end, from an efficiency perspective, you are better off with effectively FF texturing hardware simply because of the pipelined latencies required.
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Full associativity isn't really required. The main difference between a texture cache and a cpu/memory cache is the indexing.
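    To illustrate what different indexing can mean here, one widely known scheme is to store textures in a tiled/swizzled layout such as Morton (Z-order), so that 2D-adjacent texels fall into the same cache line rather than straddling rows. Real GPUs use their own proprietary layouts; Morton order is just a stand-in for the idea:

    ```python
    # Sketch: Morton (Z-order) addressing as an example of texture-friendly
    # indexing, vs. the linear row-major addressing a CPU cache sees.

    def morton2d(x, y, bits=16):
        """Interleave the bits of x and y into one Z-order address."""
        addr = 0
        for i in range(bits):
            addr |= ((x >> i) & 1) << (2 * i)        # x bits on even positions
            addr |= ((y >> i) & 1) << (2 * i + 1)    # y bits on odd positions
        return addr

    # A 2x2 texel quad maps to 4 consecutive addresses...
    print([morton2d(x, y) for y in (0, 1) for x in (0, 1)])   # [0, 1, 2, 3]
    # ...whereas row-major order scatters the same quad across two rows:
    print([y * 1024 + x for y in (0, 1) for x in (0, 1)])     # [0, 1, 1024, 1025]
    ```

    With addresses like this, a bilinear footprint touches one or two cache lines instead of two distant ones, which is the kind of gain the indexing difference buys.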
     
  18. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Indeed, which is why you can spend quite a lot on expensive cache features which you would never spend on a processor L1. The coherent part of the working set is small and the accesses narrow (128 bit is just for the compressed formats, normally you are doing independent 64 bit reads for each shader).

    It's not the texture unit which keeps the loads in flight and has to take care of thread context storage though, so other than that I don't see how it's relevant.
    Also narrow gathers as the default access pattern (something which say Larrabee's L1 was obviously not designed for).
     
  19. pixelvertex

    Newcomer

    Joined:
    Jan 19, 2010
    Messages:
    2
    Likes Received:
    0
    Does anyone in here think the next Xbox or PlayStation could have a Fermi-based GPU?
     
  20. dragonelite

    Veteran

    Joined:
    Dec 20, 2009
    Messages:
    1,556
    Likes Received:
    1
    Location:
    netherlands
    I think MS will go with ATI for the GPU.
    And Sony will try something different, like they always do.

    Just hope the next box has backwards compatibility.
     