Predict: The Next Generation Console Tech

In the run-up to the eventual demise of LRB there was talk of an LRB 2 and 3. All fine and good, but what is the point of LRB 1's fully programmable nature if there was speculation that they would not necessarily be compatible with each other?

And it was at Beyond3D that I read that LRB 1 was not going to be 100% DirectX 11 compliant! Just going to try and search through the many posts now, and hopefully update you when/if I get the information. Of course I may have been misinterpreting a post.

In the end, though, there is no point to a feature in a graphics card if it is not usable in real time (and that means more than 3 or 4 fps).

Edit: possibly related to the "fixed function texture filtering mechanism", which seems to ring a bell. However, I may be completely wrong on this one, and LRB was in fact destined to become a DX11-compliant part as well.

Edit2: I have asked the proper people in the appropriate forum to put an end to my possibly misguided thoughts. If I am wrong I blame the interweb.
 
In the end, though, there is no point to a feature in a graphics card if it is not usable in real time (and that means more than 3 or 4 fps).
Yes, it could be limited by performance, just like the DX Reference Rasteriser could render any DX effect, but not at hardware-accelerated speeds. If implementing DX11 features brought LRB to its knees, then it could be considered DX10 in terms of what it could achieve in DX games. That said, it could also do things DX 10 couldn't (whether devs decided to or not!).
 
I'm very confused! How can it be a DX10 class GPU and yet be fully programmable? What features are in DX11/12 that LRB couldn't do?

Can't the DirectX 10 cards do everything in DirectX 11 in software? I'm not disagreeing with you, but perhaps we need another distinction in the age of programmable pipelines.
 
Can't the DirectX 10 cards do everything in DirectX 11 in software?

Not that I'm aware of - you could code something like a tessellator in CUDA, but I'm unaware of any way to do that in DX10 specifically. That doesn't bar anybody from building such a thing in a primarily DX10 engine; it'd just mean the engineer would have to find a workaround (which may end up comparatively slooooow). As it stands, Larrabee should've been capable of spitting out just about any frame you could coax out of a future Radeon or GeForce, but it comes down to practicality, i.e. is it impractically slow to do random frame A on Larrabee vs on dedicated hardware.
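To make the "workaround" point a bit more concrete, here's a rough sketch of the kind of work a software tessellation fallback has to do: one level of uniform subdivision, splitting a triangle into four by inserting edge midpoints. This is plain C rather than CUDA, the vertex struct and function names are my own, and it's purely illustrative; dedicated hardware does this per patch on the fly, and doing it like this is where the "comparatively slow" comes from.

Code:
/* Illustrative sketch only: one level of uniform triangle subdivision,
 * the sort of thing a DX10-era engine would have to do in software (or
 * via a compute workaround) without a hardware tessellator. */
#include <stdio.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 a, b, c; } Tri;

static Vec3 midpoint(Vec3 p, Vec3 q) {
    Vec3 m = { (p.x + q.x) * 0.5f, (p.y + q.y) * 0.5f, (p.z + q.z) * 0.5f };
    return m;
}

/* Split one triangle into four by inserting edge midpoints. */
static void subdivide(Tri t, Tri out[4]) {
    Vec3 ab = midpoint(t.a, t.b);
    Vec3 bc = midpoint(t.b, t.c);
    Vec3 ca = midpoint(t.c, t.a);
    out[0] = (Tri){ t.a, ab, ca };
    out[1] = (Tri){ ab, t.b, bc };
    out[2] = (Tri){ ca, bc, t.c };
    out[3] = (Tri){ ab, bc, ca };
}

int main(void) {
    Tri t = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };
    Tri out[4];
    subdivide(t, out);
    printf("first sub-triangle: (%.2f,%.2f) (%.2f,%.2f) (%.2f,%.2f)\n",
           out[0].a.x, out[0].a.y, out[0].b.x, out[0].b.y, out[0].c.x, out[0].c.y);
    return 0;
}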
 
Nowhere close to being finished, so please Squilliam don't answer yet ;)
I may misunderstand, but I don't see how Intel's choice not to include a rasterizer is related to bandwidth-saving measures. Intel chose deferred rendering as the most appropriate mode of rendering and nobody really disputes that choice, but for me it's unrelated to the choice of not having a dedicated rasterizer. As I see it, Intel chose a tile-based deferred renderer because each core in Larrabee had a sizable amount of L2, and it made sense to do as many operations as possible on material within this memory space and make the most of the bandwidth and flexibility provided by the caches. Even if rasterizer and triangle-setup units are tiny, Intel would have had to amortize their cost across multiple cores. If Intel had chosen to have multiple rasterizer and triangle-setup units, then comes the question of how to manage/feed those units. Bandwidth may have been a concern, but I don't think it was the only reason behind Intel's choice. They imho wanted as little fixed-function hardware as possible because they weren't only aiming at graphics with Larrabee, and high FLOPS throughput was their primary goal.

I think that the ALUs would most likely handle the texture units' job less efficiently.
Between the texture units and the ROP units, they are both over-sized for pretty much all situations. Because they are fixed-function units, the rule of thumb is to spend more transistor area than is strictly needed, since you cannot always predict what kind of workload you're going to get, and the whole pipeline can be limited by how fast work moves through these units. So between fixed function and unified function, as with stream processors, it's a trade-off, and there's a point somewhere where it's better to have more general processing performance than specialised performance.

The difference between the two functions is that a ROP unit generally limits you to rasterization, whereas a texture unit can be applied more flexibly across different rendering methods such as ray tracing and voxel-based rendering. In addition, a ROP unit performs calculations which either can be emulated well on a stream processor or, if they can't be with present stream processors, are functions which would increase the overall usefulness of the ALU banks for GPGPU work, as rasterization is very vector-heavy. We have already seen some benefit to more flexibility and efficiency here. This is the reason why Larrabee didn't have ROP units but still retained texture units.
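As a small illustration of the point above (a sketch of my own, not taken from any real driver or hardware): classic ROP-style alpha blending written as plain integer ALU code. This is the kind of per-pixel math that maps quite naturally onto general stream processors, which is part of why dropping the ROPs is the less painful cut; the 8-bit RGBA layout and the rounding are just assumptions for the example.

Code:
/* Sketch only: ROP-style "source over" alpha blending as plain ALU code. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t r, g, b, a; } Pixel;

/* out = src.a * src + (1 - src.a) * dst, per channel, 8-bit fixed point */
static Pixel blend_over(Pixel src, Pixel dst) {
    unsigned a = src.a, ia = 255 - a;
    Pixel out;
    out.r = (uint8_t)((src.r * a + dst.r * ia + 127) / 255);
    out.g = (uint8_t)((src.g * a + dst.g * ia + 127) / 255);
    out.b = (uint8_t)((src.b * a + dst.b * ia + 127) / 255);
    out.a = 255;
    return out;
}

int main(void) {
    Pixel src = { 255, 0, 0, 128 };   /* half-transparent red   */
    Pixel dst = { 0, 0, 255, 255 };   /* opaque blue background */
    Pixel out = blend_over(src, dst);
    printf("blended: %u %u %u %u\n",
           (unsigned)out.r, (unsigned)out.g, (unsigned)out.b, (unsigned)out.a);
    return 0;
}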

I would also like to know. I'm far from being even an armchair expert on the matter; my original comment came from my realization of how "huge" the texture units looked in RV770.
As you saw, I've most likely been misled by some die shots (cf. 3Dilettante's post). In the end it's a bit unclear how big the texture units are vs the ALUs. The other part of the problem is how much more efficient texture units are in perf/mm² vs even reworked ALU arrays. I have no answer; I was trying to initiate a discussion on the matter.
My feeling is that GPUs will lose some of their fixed-function units in the upcoming years.
It looks like GPUs will have to become more flexible and more "autonomous" to satisfy the needs of developers from various fields. By autonomous I mean in a CPU-like way. Devs want to have read and write caches; on the other hand, that's not the model that offers the best scaling, and the hardware cost can turn prohibitive. My feeling (really a feeling more than anything else) is that the removal of fixed function would make chips easier to design from the hardware vendor's POV and would ease the development of the matching software model/layer.
Another of my "feelings" is in regard to Intel's choices: I read a lot of comments about their choice of x86 for Larrabee's ISA, but I see really few arguments about the real architectural choice on which Larrabee stands: many, many cores augmented by potent SIMD units.
I can guess/see the merit of this choice, but I'm not sure that one simplistic CPU core augmented this way is the sweet spot, the proper building block from which to build the GPU 2.0.
Such a core is tiny, so you need many of them to achieve the throughput the industry is aiming for, which implies quite some communication, whether for memory-space coherency or simply for moving data around.
You end up with relatively few "FLOPS" to amortize the CPU front end/logic (even if one could do better than x86).
It's a really interesting debate and I would absolutely love to see more educated people discuss the issue.
For example, what would be the pros/cons of, say, a properly done Larrabee clone vs something looking a bit like a mix of a Cell and a GPU?
What I mean by a mix of a GPU and the Cell is an idea I got after reading a link about "COMICS" that MfA posted some time ago. If I'm right, "COMICS" is a software layer that provides software cache coherency for the SPUs. In this setup I find that the PPU acts a bit in the same way as the command processor in a GPU does. So I wondered whether a general-purpose GPU, or GPU 2.0, could be something like a blend of these ideas.
Say you have some SIMD arrays attached to a specialized CPU. Each SIMD array would be more potent than they are now (a bit like an SPU, which can't do everything a standard CPU does). The specialized CPU would be completely hidden from the developer and would do the job a CP does in a GPU, and could do what the PPU does in the aforementioned set-up, providing various software coherency models for the memory (under the programmer's command).
I feel like this would have to be tinier than an actual GPU, so the specialized CPU would also handle load balancing and memory coherency between its "brothers".
Say you could have such a core made of:
one specialized CPU with its own resources, which may have a hardware cache (hidden from the developer)
8(*) "16-wide arrays", each with a dedicated LS as in the SPUs (* random number; maybe I love quadratic numbers as much as KK does :LOL:)
a bigger LS shared among the SIMD arrays
a router

You put some of these cores on a grid (à la the Intel SCC or Tilera products), but it would be less of a burden than putting X times as many tinier cores à la Larrabee (simplistic CPU + SIMD) on a chip of the same size to achieve the same throughput.
In the end developers would have a lot of choices, anywhere from seeing each individual SIMD array to handling the chip like a present-day GPU.
You would pay for memory coherency (whether in die space, power consumption, etc.) only for the tasks that require it. As Sweeney suggested in a presentation, if you want you could run a coherent transactional model in software on some of the SIMD arrays if you need it.

Such a chip might be easier to design and offer more perf both per watt and per mm², but it would put a lot of pressure on the software layer/runtime that would make its use practical.

What do you members think? Is that a crazy idea? A brain fart?

I cannot personally answer this one. I tried, then deleted it, then tried again, but I just don't have the fundamental knowledge required. I hope that someone else can answer you here! :)
 
Between the texture units and the ROP units, they are both over-sized for pretty much all situations. Because they are fixed-function units, the rule of thumb is to spend more transistor area than is strictly needed, since you cannot always predict what kind of workload you're going to get, and the whole pipeline can be limited by how fast work moves through these units. So between fixed function and unified function, as with stream processors, it's a trade-off, and there's a point somewhere where it's better to have more general processing performance than specialised performance.

The difference between the two functions is that a ROP unit generally limits you to rasterization, whereas a texture unit can be applied more flexibly across different rendering methods such as ray tracing and voxel-based rendering. In addition, a ROP unit performs calculations which either can be emulated well on a stream processor or, if they can't be with present stream processors, are functions which would increase the overall usefulness of the ALU banks for GPGPU work, as rasterization is very vector-heavy. We have already seen some benefit to more flexibility and efficiency here. This is the reason why Larrabee didn't have ROP units but still retained texture units.
Nicely put :) I guess that's the reason why, for example, interpolation is now done in the shaders in R8xx with good results. Well, I get it; so far I've read of nobody sane wanting to get rid of the texture units.
I guess it's just that texture units are that efficient at their job, or, the other way around, that SIMD arrays are that bad at the texture units' job. Anyway, as I'm not sane, and I realized that the die space taken by the texture units was far from marginal, and with serious bandwidth bottlenecks about to strike back badly, I felt like "hey guys! are you really sure it will still make sense in a few years?" (as I obviously can't say/decide myself).
Actually I wanted to start another fight between Nick and some other "violently disagreeing" members :LOL:

I cannot personally answer this one. I tried, then deleted it, then tried again, but I just don't have the fundamental knowledge required. I hope that someone else can answer you here! :)
Actually I deleted that part for many reasons.
I got a bit carried away by my enthusiasm, to the point where I feel it's close to being a bit ridiculous. I mean, that was quite a question, the billion-dollar question. There are clever guys out there (not only here on the forum), and actually there are teams of these clever guys working on that kind of question. At some point, even if one member could have answered, I most likely would not have been able to understand the implications/nuances behind his answer. On the other hand, I might have been happy with "at some point the logic has to be that close to the execution; having even a not-distant chip acting as a manager won't cut it", or the other way around, "Kudos, genius; where else did you expect GPUs to go if they don't follow the Larrabee path?" Anyway, I feel like that kind of question should be raised by members with real knowledge on the matter and discussed based on facts/rumors/etc. (if only out of respect for the army of engineers working on the matter). The kind of topic I like to read and wonder about, but where posting would/should actually feel awkward.

If I were once again to go out of my way into territories I should not broach
forgivable, as I'm not sane and thus I'll never learn my lesson :LOL:
I should accept the fact that maybe going the "Larrabee path" is the right thing to do, and the lack of criticism of Intel's choices (many simple CPUs + wide SIMD, not x86 as the ISA) should be enough to convince me, given my knowledge on the matter. Actually the only person I saw questioning the concept a bit (he said he wanted a proof of concept in some interview) was John Carmack.

Anyway, when all this is said and done, contrition aside, I would still be really happy if a conversation between high-level members could emerge on the matter :)
 
Nicely put. I guess that's the reason why, for example, interpolation is now done in the shaders in R8xx with good results. Well, I get it; so far I've read of nobody sane wanting to get rid of the texture units.
I guess it's just that texture units are that efficient at their job, or, the other way around, that SIMD arrays are that bad at the texture units' job. Anyway, as I'm not sane, and I realized that the die space taken by the texture units was far from marginal, and with serious bandwidth bottlenecks about to strike back badly, I felt like "hey guys! are you really sure it will still make sense in a few years?" (as I obviously can't say/decide myself).
Actually I wanted to start another fight between Nick and some other "violently disagreeing" members :LOL:

Currently they are fighting the bandwidth demons with larger on-die caches and local stores for the various components, to keep them fed and make better use of what bandwidth is available. Dave has said as much himself. Developers are doing the same, as there has been a shift towards deferred rendering, so that should push the bandwidth monster away for at least a couple more years. In the future, with relation to 3D hardware, it makes a lot of sense for consoles at least to have an on-die frame-buffer, and I'm pretty certain AMD and Nvidia have considered it for their desktop GPU parts, so long as they can keep any tiling transparent to the developer. However, the main issue is still vertex load, and unless they can figure that out for the desktop parts, I hate to think of the vertex load of a heavily tessellated model sitting astride the boundary of 2 or more tiles. :oops:
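To illustrate the tile-boundary worry with a toy example (my own sketch, with an assumed 64-pixel tile size and nothing resembling real binning hardware): a triangle is assigned to every screen tile its bounding box touches, so a triangle straddling a boundary gets set up and processed once per tile it overlaps. Multiply that by a heavily tessellated mesh and the duplicated setup work adds up quickly.

Code:
/* Toy sketch of screen-space tile binning by bounding box. */
#include <stdio.h>

#define TILE 64  /* assumed tile size in pixels */

typedef struct { float x[3], y[3]; } Tri2D;

/* List every tile the triangle's bounding box overlaps. */
static void bin_triangle(Tri2D t) {
    float minx = t.x[0], maxx = t.x[0], miny = t.y[0], maxy = t.y[0];
    for (int i = 1; i < 3; i++) {
        if (t.x[i] < minx) minx = t.x[i];
        if (t.x[i] > maxx) maxx = t.x[i];
        if (t.y[i] < miny) miny = t.y[i];
        if (t.y[i] > maxy) maxy = t.y[i];
    }
    int tx0 = (int)(minx / TILE), tx1 = (int)(maxx / TILE);
    int ty0 = (int)(miny / TILE), ty1 = (int)(maxy / TILE);
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            printf("triangle binned into tile (%d,%d)\n", tx, ty);
}

int main(void) {
    /* A small triangle sitting right on the boundary of two 64x64 tiles:
       it lands in both, so its setup work is duplicated. */
    Tri2D t = { { 60.0f, 70.0f, 65.0f }, { 10.0f, 10.0f, 20.0f } };
    bin_triangle(t);
    return 0;
}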

Deleting something like the ROP units from a console part makes a lot more sense than from a desktop part. They only have to run previous-generation titles and titles designed to fit within the console's design parameters. It's also a lot easier to present a wide body of developers with a fait accompli, as they have to make do with whatever the console manufacturer throws at them. In addition, a desktop part would have to run a game engine perfectly from the start. AMD/Nvidia cannot simply say, 'wait 12 months and you'll see!'. They need their GPUs to work perfectly and efficiently right from the start. That's probably the reason why Intel was so keen to get the LRB part into a console or two. :cool:

Actually I deleted that part for many reasons.
I got a bit carried away by my enthusiasm, to the point where I feel it's close to being a bit ridiculous. I mean, that was quite a question, the billion-dollar question. There are clever guys out there (not only here on the forum), and actually there are teams of these clever guys working on that kind of question. At some point, even if one member could have answered, I most likely would not have been able to understand the implications/nuances behind his answer.

Oh you sell yourself short! :LOL: In any case I've passed the question along to a few knowledgeable members.

On the other hand, I might have been happy with "at some point the logic has to be that close to the execution; having even a not-distant chip acting as a manager won't cut it", or the other way around, "Kudos, genius; where else did you expect GPUs to go if they don't follow the Larrabee path?" Anyway, I feel like that kind of question should be raised by members with real knowledge on the matter and discussed based on facts/rumors/etc. (if only out of respect for the army of engineers working on the matter). The kind of topic I like to read and wonder about, but where posting would/should actually feel awkward.

Well, if you don't raise the question then you'll get nowhere, and as my teachers have all said, it's quite likely that someone else has the same question as you. So you may as well raise your hand and have everyone be better for it. What the engineers are doing is so steeped in NDA anyway that we have to make educated guesses, or we wouldn't get very far. :p

If I were once again to go out of my way into territories I should not broach
forgivable, as I'm not sane and thus I'll never learn my lesson
I should accept the fact that maybe going the "Larrabee path" is the right thing to do, and the lack of criticism of Intel's choices (many simple CPUs + wide SIMD, not x86 as the ISA) should be enough to convince me, given my knowledge on the matter. Actually the only person I saw questioning the concept a bit (he said he wanted a proof of concept in some interview) was John Carmack.

Anyway, when all this is said and done, contrition aside, I would still be really happy if a conversation between high-level members could emerge on the matter

The problems with Larrabee were many, and you couldn't point to its lack of raster hardware or just its memory architecture as the fault. It was probably still before its time, as with time the overhead from any x86 baggage would diminish next to whatever efficiencies Intel brought to the design and the sheer prowess of their fabrication processes. Perhaps, as a larger chip, it ought to have waited until 2013 for their 450mm wafer production to ramp up with 22nm chips?

I would be happy too if we could bring back some of the regular 3D architecture people into this discussion; however, I would probably die of a heart attack if Dave showed up drunk one day and spilled all the beans. :D

Edit: Too many smileys? LOL
 
I think someone in this thread (or some other) mentioned that Carmack said that there should be no mandatory resolution TRC, that devs should render @ whatever res they want, or something like that, but I can't find it. Can someone find that post and/or post a link to what JC said? :)
 
Texture filtering could be done entirely in the shaders ... but that's not really what makes the texture unit a texture unit. The problematic parts of the texture units are the decompression and the texture cache, the decompression is more expensive on the shaders and the texture cache has access patterns which make it unsuitable to pull into say the normal Larrabee L1 (unless you like wasting loads of bandwidth).
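For anyone who wants to see what "filtering in the shaders" amounts to, here's a minimal sketch of bilinear filtering as plain ALU code over an uncompressed, row-major, single-channel texture (the layout and clamp addressing are assumptions for the example, not how real hardware stores textures). The four neighbouring fetches per sample are the access pattern in question: narrow gathers that a normal CPU-style L1 isn't built around.

Code:
/* Sketch only: bilinear filtering done "in the shader" as plain ALU code. */
#include <stdio.h>

static float tex_fetch(const float *tex, int w, int h, int x, int y) {
    if (x < 0) x = 0;              /* clamp addressing */
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return tex[y * w + x];
}

/* u, v in [0,1); returns a bilinearly filtered sample. */
static float sample_bilinear(const float *tex, int w, int h, float u, float v) {
    float fx = u * w - 0.5f, fy = v * h - 0.5f;
    int x0 = (int)fx, y0 = (int)fy;
    float tx = fx - x0, ty = fy - y0;
    float c00 = tex_fetch(tex, w, h, x0,     y0);      /* four texel fetches */
    float c10 = tex_fetch(tex, w, h, x0 + 1, y0);
    float c01 = tex_fetch(tex, w, h, x0,     y0 + 1);
    float c11 = tex_fetch(tex, w, h, x0 + 1, y0 + 1);
    float top = c00 + (c10 - c00) * tx;
    float bot = c01 + (c11 - c01) * tx;
    return top + (bot - top) * ty;
}

int main(void) {
    float tex[4] = { 0.0f, 1.0f, 1.0f, 0.0f };  /* 2x2 checker */
    printf("%f\n", sample_bilinear(tex, 2, 2, 0.5f, 0.5f));
    return 0;
}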
 
From texture units to another "billion-dollar question" post... :LOL:

Nice post ;) (that was intended for Squilliam)
I thought a bit more about your idea of opening a new thread in the architecture forum, and I'm not sure that we can't have this discussion here. I feel like the decisions made by console manufacturers will be of great importance to ATI's, Nvidia's and Intel's choices, especially as MS owns the main 3D API, so the choices they make about their next console are likely to greatly impact what DirectX 12 will be about, and thus the direction taken by the industry on both the hardware side and the software side.
The relevance of fixed-function units is part of this, as if there is a radical move in this regard, it's more likely to happen in the console realm first. The likelihood of such a move is another matter. I guess we can leave this in the mods' hands, no?

Anyway, you convinced me: sometimes someone just has to ask. It happens that I have more questions about texture units and their impact on GPU architecture (not only on perf).
One thing that bothers me (possibly because I don't understand it properly) is the "memory model" they enforce. As I understand it, most of the data the ALUs/SIMD arrays deal with are textures (or data put into textures). If you look at a multiprocessor in an ATI card (say the arbiter/sequencer, texture units, LDS and the SIMD array), the only coherent memory space, the L1 cache, is tied to the texture unit. You raised the issue of "slightly overblown fixed-function units"; I feel like this is another issue: depending on what the ALUs are doing, you may have this L1 cache full of data that doesn't need any texture-unit action, where the texture is simply not a texture but a container.
It looks pretty inefficient, as the fixed-function hardware stalls; thus you may want a separate way for the ALUs to access data. My feeling is that this would make an already really complicated task (designing the chip) even more complicated.
This brings me to another point: how do you clean this up? It looks clear that the texture units are here to stay; on the other hand, you would want a unified L1 cache. The texture cache is optimized for texture operations, but could it make sense to lose perf here in the name of a leaner memory model?
I have another question, orthogonal to this issue: sharing data between different entities raises quite some issues, like data protection when the memory space is coherent, and balancing resources (I mean, whether it's the hardware itself or software running on the hardware, one has to decide how much memory space each one can get, when to evict data from the cache, etc.). This looks pretty hairy to me.
You need logic; possibly a huge argument, for many, in favour of an à-la-Larrabee design.
It's a bit unclear to me how Intel handled the texture units in Larrabee, but by the look of it I would guess that having one texture unit per "core" under the control of the CPU logic was not an option (i.e. you can't scale the hardware down the way you can play with a SIMD unit's width; you need sizable blocks). As you can't have one tex unit per core, Intel had to give each one a fixed share of the L2. What I mean is that it still looks a bit inefficient to me: if at some point your texture units are idle (say you're not doing graphics), so is the local share of the L2.
This got me thinking: how about attaching a healthy texture unit, or multiple ones, to a CPU core?
I actually thought a bit further, about a "fusion"/heterogeneous design where you would have a lot of simplistic CPU cores augmented by a SIMD array. Could the best choice be to tie the texture units to your more potent cores, so that if the tex units are idle the potent core uses all the cache?
I could see such a chip as a mix of:
Nvidia's latest SoC, Tegra 2: in Tegra 2 you find two ARM A9 CPUs and one ARM7 CPU.
Intel's Larrabee, for the potent SIMD (in ours they would be matched to the ARM A7 cores).
Intel's SCC chip, for the on-chip grid network and the "cluster of cores".
One last question, as I know really little about the underlying hardware tex units consist of: what would tex units made more general-purpose/completely programmable look like? (As it seems you can't pass on specialized hardware, it makes sense to maximize its usability.) What uses could they be put to (on top of handling textures)?

Quite some questions once again.
EDIT
MfA gave quite some hints that might have changed my post quite a bit, if not completely, but I read his answer too late. I was completely misled on the performance-critical areas... I learned something though :)
EDIT 2: I spoilered the parts that are completely wrong (most of the post :LOL:) as MfA made clear how misled my diagnosis was.
 
Texture filtering could be done entirely in the shaders ... but that's not really what makes the texture unit a texture unit. The problematic parts of the texture units are the decompression and the texture cache, the decompression is more expensive on the shaders and the texture cache has access patterns which make it unsuitable to pull into say the normal Larrabee L1 (unless you like wasting loads of bandwidth).
Really nice post; you managed to clear up quite a lot of my misconceptions on the matter in so few words that it's not even funny. Thanks :)
I still have a little question about the texture cache: what makes it so different from a standard CPU L1 cache? Associativity? Cache-line size? The number of cache lines that can be accessed in one cycle?
Which is truer (if either, obviously)? (Sorry for the rough approach, but I could not find any better.)
Could you scale such a cache to a more standard L1 size for a CPU (from 8KB in RV8xx to 32KB in Larrabee)?
Or
Once scaled, would such a cache perform badly?

Thanks for your time :)
EDIT: as nobody answered my post, I have another question.
I've heard of multi-dimensional caches, but I never understood what that could possibly mean; is it related to the number of cache lines you can access in one cycle?
 
I still have a little question about the texture cache: what makes it so different from a standard CPU L1 cache? Associativity? Cache-line size? The number of cache lines that can be accessed in one cycle?
Which is truer (if either, obviously)? (Sorry for the rough approach, but I could not find any better.)
Could you scale such a cache to a more standard L1 size for a CPU (from 8KB in RV8xx to 32KB in Larrabee)?
Or
Once scaled, would such a cache perform badly?

Tom Forsyth of the LRB team said in a recent lecture that the typical access patterns are very local, meaning you get most of the benefit from a tiny cache which covers just the bilinear/aniso fetching of adjacent pixels; you then see no benefit from increasing the cache until it becomes texture-sized (e.g. 1-2 MB), which is impractical.
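A toy calculation (mine, not from the lecture) shows why the coherent working set is so small: at a roughly 1:1 texel-to-pixel ratio, a 2x2 pixel quad doing bilinear filtering issues 16 texel fetches but only touches a 3x3 block of unique texels, so even a tiny cache captures most of the reuse. The numbers below are purely illustrative.

Code:
/* Count fetches vs unique texels for a 2x2 pixel quad at 1:1 mapping. */
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    bool touched[8][8] = { { false } };
    int fetches = 0, unique = 0;

    /* 2x2 pixel quad, one texel step between pixels (1:1 mapping). */
    for (int py = 0; py < 2; py++)
        for (int px = 0; px < 2; px++)
            /* bilinear footprint of each pixel: a 2x2 block of texels */
            for (int ty = 0; ty < 2; ty++)
                for (int tx = 0; tx < 2; tx++) {
                    int x = px + tx, y = py + ty;
                    fetches++;
                    if (!touched[y][x]) { touched[y][x] = true; unique++; }
                }

    printf("fetches=%d unique texels=%d\n", fetches, unique);  /* 16 vs 9 */
    return 0;
}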
 
No problem, liolio!

I have to bow out for a while as I'm kinda too busy for long, thought-out posts. I'm no Joshua, who can plow out a 10-page article in 2 minutes!
 
I still have a little question about the texture cache: what makes it so different from a standard CPU L1 cache? Associativity? Cache-line size?
AFAIK it's fully associative, and I assume the "cache line" size is 128 bits... so both, I guess.
The number of cache lines that can be accessed in one cycle?
That too, although they are probably banked to take the edge off (which is why it's not really accurate to talk about cache-line size).
Once scaled, would such a cache perform badly?
It's not really an option.

PS: no idea what the multi-dimensional cache is.
 
Texture filtering could be done entirely in the shaders ... but that's not really what makes the texture unit a texture unit. The problematic parts of the texture units are the decompression and the texture cache, the decompression is more expensive on the shaders and the texture cache has access patterns which make it unsuitable to pull into say the normal Larrabee L1 (unless you like wasting loads of bandwidth).

The problematic parts are NOT the decompression and texture cache. That's really a design detail in the scheme of things. The issue with programmable texturing is the latencies involved and hence the number of loads that need to be kept in flight. Texel locality within a single frame is incredibly poor. It's actually fairly poor within a single texture load, all things considered, once you get into things like mipmaps and AF levels.

Decompression can be done fairly easily in the load pipeline if required, and texture caches really are just normal caches with different indexing. Aside from the latencies involved, the next big issue is generating all the memory load addresses, which requires some specialized math that could be added to an int execution unit if you wanted to.

But in the end, from an efficiency perspective, you are better off with effectively FF texturing hardware simply because of the pipelined latencies required.
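For a feel of the "specialized math" in question, here is a rough sketch of per-sample address generation: picking a mip level from the screen-space UV gradients, then turning texel coordinates into a byte offset. It's my own illustration, with an assumed row-major, square, power-of-two mip chain; real hardware tiles and swizzles addresses and does all of this deep inside a pipelined unit.

Code:
/* Sketch of texture address math: LOD selection plus texel offset. */
#include <math.h>
#include <stdio.h>

/* Select LOD from the UV footprint (log2 of the larger gradient). */
static int select_mip(float dudx, float dvdx, float dudy, float dvdy,
                      int base_size, int mip_count) {
    float fx = sqrtf(dudx * dudx + dvdx * dvdx) * base_size;
    float fy = sqrtf(dudy * dudy + dvdy * dvdy) * base_size;
    float rho = fx > fy ? fx : fy;
    int lod = rho > 1.0f ? (int)floorf(log2f(rho)) : 0;
    return lod < mip_count ? lod : mip_count - 1;
}

/* Byte offset of texel (x, y) in the selected mip of a row-major chain. */
static size_t texel_offset(int base_size, int lod, int x, int y, int bpp) {
    size_t off = 0;
    for (int l = 0; l < lod; l++) {
        int s = base_size >> l;
        off += (size_t)s * s * bpp;      /* skip the larger mips */
    }
    int s = base_size >> lod;
    return off + ((size_t)y * s + x) * bpp;
}

int main(void) {
    int lod = select_mip(0.01f, 0.0f, 0.0f, 0.01f, 1024, 11);
    printf("lod=%d, offset=%zu\n", lod, texel_offset(1024, lod, 3, 5, 4));
    return 0;
}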
 
The issue with programmable texturing is the latencies involved and hence the number of loads that need to be kept in flight. Texel locality within a single frame is incredibly poor. It's actually fairly poor within a single texture load, all things considered, once you get into things like mipmaps and AF levels.
Indeed, which is why you can spend quite a lot on expensive cache features which you would never spend on a processor L1. The coherent part of the working set is small and the accesses narrow (128 bits is just for the compressed formats; normally you are doing independent 64-bit reads for each shader).

It's not the texture unit which keeps the loads in flight and has to take care of thread context storage though, so other than that I don't see how it's relevant.
and texture caches really are just normal caches with different indexing.
Also narrow gathers as the default access pattern (something which say Larrabee's L1 was obviously not designed for).
 