Xenos hardware tessellator

But you can use tiling and memexport in the same scene, just not at the same time?

(or does the tile bracket contain the entire frame?)

Yes, you can use memexport, but not inside predicated tiling. This is not a hardware limitation: the GPU does not know the "concept" of predicated tiling directly. PTR is only a very clever software technique built upon several hardware features, and the API sits at a higher level than things like drawing a primitive.
To cut a long story short, it's little more than recording a command buffer and submitting it to the GPU once per tile (the command buffer may or may not contain the entire scene; it depends on the engine), while predicating away all the primitives outside the current tile.
When it comes to the GPU, if it sees a command to do memexport in the command buffer, it will just execute it: since the command buffer is executed once per tile, that memexport command will be executed once per tile. The GPU doesn't know about predicated tiling; it only sees a stream of commands to execute.
The API limitation that prevents memexport during predicated tiling comes from this, and it makes a lot of sense in the context of console programming, where you don't want to introduce a potentially big performance pitfall behind the developer's back.
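To picture the "record once, replay per tile" idea, here is a purely illustrative sketch. None of these types or functions are the real Xbox 360 API; they are hypothetical stand-ins for what the runtime/driver does under the hood.

Code:
// Illustrative sketch of predicated tiling at the command-buffer level.
// Hypothetical types and helpers, not the actual Xbox 360 API.
#include <cstdio>
#include <vector>

struct Rect { int x0, y0, x1, y1; };
struct CommandBuffer { /* recorded draw calls, state changes, maybe memexport */ };

// Hypothetical helpers standing in for the runtime/driver.
void SetTilePredication(const Rect& t) { std::printf("predicate to tile (%d,%d)-(%d,%d)\n", t.x0, t.y0, t.x1, t.y1); }
void Submit(const CommandBuffer&)      { std::printf("  replay command buffer\n"); }
void ResolveTile(const Rect&)          { std::printf("  resolve tile from EDRAM to main memory\n"); }

void RenderFrame(const CommandBuffer& scene, const std::vector<Rect>& tiles)
{
    for (const Rect& tile : tiles)
    {
        SetTilePredication(tile); // primitives outside this tile get skipped
        Submit(scene);            // the SAME recorded buffer, once per tile
        ResolveTile(tile);        // copy the finished tile out of EDRAM
    }
    // If 'scene' contained a memexport command, it would execute once per
    // iteration of this loop -- once per tile -- which is exactly the
    // performance pitfall the API guards against.
}

int main()
{
    CommandBuffer scene;
    std::vector<Rect> tiles = { {0, 0, 1279, 479}, {0, 480, 1279, 719} }; // e.g. 720p split in two
    RenderFrame(scene, tiles);
}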
 
Yes, you can use memexport, but not inside predicated tiling. This is not a hardware limitation: the GPU does not know the "concept" of predicated tiling directly. PTR is only a very clever software technique built upon several hardware features, and the API sits at a higher level than things like drawing a primitive.
To cut a long story short, it's little more than recording a command buffer and submitting it to the GPU once per tile (the command buffer may or may not contain the entire scene; it depends on the engine), while predicating away all the primitives outside the current tile.
When it comes to the GPU, if it sees a command to do memexport in the command buffer, it will just execute it: since the command buffer is executed once per tile, that memexport command will be executed once per tile. The GPU doesn't know about predicated tiling; it only sees a stream of commands to execute.
The API limitation that prevents memexport during predicated tiling comes from this, and it makes a lot of sense in the context of console programming, where you don't want to introduce a potentially big performance pitfall behind the developer's back.

Great explanation. This was like the "predicated tiling for dummies" post.

Question: so who breaks the frame up into tiles on the GPU? Does the dev have to know that function x needs three tiles to be executed and place those commands for execution accordingly, or does the GPU compute the result of function x on a scene, determine the entire scene size without the function and the result, and then break it into tiles for processing within the eDRAM automatically?

Also what is the difference in the GPU between normal tiling and the "fast tiling" function mentioned in some of the specifications of the Xenos?
 
Fran, is the implementation of manual tiling in a game something that is solved on a 'per engine' basis, or does it need to be figured out for each game? For example, if Valve implemented manual tiling support in the Source engine, would every subsequent game have a relatively easy time using it, or is it something that presents itself with each individual title, regardless of the engine and its previous uses?
 
Yes, you can use memexport, but not inside predicated tiling. This is not a hardware limitation: the GPU does not know the "concept" of predicated tiling directly. PTR is only a very clever software technique built upon several hardware features, and the API sits at a higher level than things like drawing a primitive.
To cut a long story short, it's little more than recording a command buffer and submitting it to the GPU once per tile (the command buffer may or may not contain the entire scene; it depends on the engine), while predicating away all the primitives outside the current tile.
When it comes to the GPU, if it sees a command to do memexport in the command buffer, it will just execute it: since the command buffer is executed once per tile, that memexport command will be executed once per tile. The GPU doesn't know about predicated tiling; it only sees a stream of commands to execute.
The API limitation that prevents memexport during predicated tiling comes from this, and it makes a lot of sense in the context of console programming, where you don't want to introduce a potentially big performance pitfall behind the developer's back.

First of all, great explanation. Thank you again for taking the time to reply.

But one more question ;)

In the case where the command buffer does not contain the entire scene, could you somehow arrange the tiles so the data that needs memexport fits in a single tile? That way, at least from what I understood, you could memexport while tiling without any potential performance hit, right? And if it's possible, can you bypass the API and execute a memexport from inside the tile bracket?

In other words (English is not my first language, so I'm not sure if I'm making sense :p): could an engine designed from the ground up for tiling on Xenos make it possible to tile and memexport at the same time?

And more important: if there's no way to memexport while tiling, is that a big performance hit (does tiling take long enough to complete that it can hurt your memexport performance)? If there is a way to memexport and tile, is the cost associated with it (making sure that all the data needed for memexport is used in one tile only) worth it? And the last one: is memexport usually done before or after tiling? If it's first, and memexport takes longer than expected, could you begin to tile, or would that be a huge screw-up?

Sorry for being too asky :(
 
I think it's predicated tiling rendering. (shoot me if I'm wrong :LOL:)

edit: Even though it's a good compromise, I would say that the pros will have to be pretty darn amazing to offset the cons. I mean, the system has been out for a year now. The best-looking title is a game based on UE3, which doesn't use tiling, and tons of games are multiplatform, of which possibly none will use tiling. It also seems like UE3 will become a very popular engine for X360, and whether there are future updates that will make tiling possible remains to be seen. So I might be sadly mistaken when I say that it looks like only a handful of first-party titles will actually use tiling, and even in those titles developers will have to do lots of work to get it up and running properly. Often when I think about this, I feel like the transistors could have been more useful if they had been put elsewhere.
Why do people think eDRAM is only useful when tiling is used?

It eliminates far and away the biggest bandwidth consumer in the pipeline from the GDDR3. It saves them transistors in compression, decompression, and the memory controller. It gives game devs the huge fillrate BW they're used to on the PS2. It lets X360 draw 64 samples per clock for shadow maps. All this for about 15% of the console's total silicon, maybe half that when you subtract the ROPs. In fact, according to NEC's specs, it's only 18 mm2 for the memory cells alone.

EDRAM makes a lot of sense on a console even without tiling.
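For what it's worth, here's a back-of-the-envelope version of those numbers. The breakdown (8 ROPs x 4 samples x 8 bytes, read-modify-write, at 500 MHz) is my own assumption about how the commonly quoted 256 GB/s figure decomposes, not an official spec sheet.

Code:
// Back-of-the-envelope check of the usual Xenos EDRAM figures.
// The breakdown below is an assumption, not an official spec.
#include <cstdio>

int main()
{
    const double clock_hz         = 500e6; // daughter-die ROP logic clock
    const double rops             = 8;
    const double samples_per_rop  = 4;     // 4xMSAA samples per ROP per clock
    const double bytes_per_sample = 8;     // 4 B color + 4 B Z/stencil
    const double rmw              = 2;     // blend/Z test = read + write

    const double bw = clock_hz * rops * samples_per_rop * bytes_per_sample * rmw;
    std::printf("internal EDRAM bandwidth ~ %.0f GB/s\n", bw / 1e9);               // ~256
    std::printf("Z-only samples per clock ~ %.0f\n", rops * samples_per_rop * 2);  // 64 (double-rate Z)
}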
 
A wide external bus or die area for embedded DRAM are not the only resources to trade against to solve the bandwidth issue. The rendering pipeline could afford some deepening to offer a more optimal approach.
What exactly are you talking about here?
 
Why do people think eDRAM is only useful when tiling is used?

It eliminates far and away the biggest bandwidth consumer in the pipeline from the GDDR3. It saves them transistors in compression, decompression, and the memory controller. It gives game devs the huge fillrate BW they're used to on the PS2. It lets X360 draw 64 samples per clock for shadow maps. All this for about 15% of the console's total silicon, maybe half that when you subtract the ROPs. In fact, according to NEC's specs, it's only 18 mm2 for the memory cells alone.

EDRAM makes a lot of sense on a console even without tiling.

And it can be a win for cost reduction. Instead of a large memory bus and larger, multi-chip configurations you are able to consolidate a large bandwidth reserve (for a specialized task) in a small area that should see fairly quick cost reduction.
 
Why do people think eDRAM is only useful when tiling is used?

It eliminates far and away the biggest bandwidth consumer in the pipeline from the GDDR3. It saves them transistors in compression, decompression, and the memory controller. It gives game devs the huge fillrate BW they're used to on the PS2. It lets X360 draw 64 samples per clock for shadow maps. All this for about 15% of the console's total silicon, maybe half that when you subtract the ROPs. In fact, according to NEC's specs, it's only 18 mm2 for the memory cells alone.

EDRAM makes a lot of sense on a console even without tiling.

How do you get that it's 15% of the silicon? The EDRAM die is ~70mm^2 supposedly (where Xenos is ~180 and RSX ~240). That's a significant chunk of change. Further, the transistor count has to be 80+ million, right? You need 8 bits per byte, and it's 10 megabytes. The EDRAM die is stated at 105M transistors, so the vast majority of it is EDRAM.

Seriously, if EDRAM was that cheap/small, why didn't Microsoft put 30MB in there so you get 4xAA at 720p without tiling?
 
How do you get that it's 15% of the silicon? The EDRAM die is ~70mm^2 supposedly (where Xenos is ~180 and RSX ~240). That's a significant chunk of change. Further, the transistor count has to be 80+ million, right? You need 8 bits per byte, and it's 10 megabytes. The EDRAM die is stated at 105M transistors, so the vast majority of it is EDRAM.

Memory is more dense than logic.

Seriously, if EDRAM was that cheap/small, why didn't Microsoft put 30MB in there so you get 4xAA at 720p without tiling?

Using your numbers, I think you can see the difference between 70mm^2 and 210mm^2. Even if Mint's 15% is accurate, you would be looking at a jump to 45% (assuming a fixed die size regardless of what was present and not accounting for various densities). And larger chips are more prone to defects and to not reaching spec, which all translates to more expense.

The real question is what MS's options were in regards to bandwidth. One alternative would have been a 256-bit bus, but would that have met their needs for performance as well as cost control? 44.8GB/s, while better than the 22.4GB/s it currently has, would get consumed quickly when used as a general resource. It would be barely enough for peak fillrate from the 8 ROPs, meaning stuff like texture access and CPU memory calls would be stalled. It also raises the issue of why design a system with 512MB of memory where a couple of MBs of framebuffer footprint can at times consume 100% of the bandwidth? You pay big money for bandwidth while 500MB sits idle at times.
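A quick sanity check of the "barely enough for peak fillrate" point, under my own simplifying assumptions (no AA, no color/Z compression, which is deliberately pessimistic):

Code:
// Rough check: raw framebuffer traffic of 8 ROPs at 500 MHz vs a 44.8 GB/s bus.
// Assumes no AA and no color/Z compression.
#include <cstdio>

int main()
{
    const double pixels_per_sec  = 8 * 500e6;   // 4 Gpixels/s peak
    const double bytes_per_pixel = 4 + 4 + 4;   // color write + Z read + Z write
    std::printf("framebuffer traffic ~ %.1f GB/s (vs. 44.8 GB/s on a 256-bit bus)\n",
                pixels_per_sec * bytes_per_pixel / 1e9); // ~48 GB/s
}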

This gen there doesn't seem to have been a nice solution to fit all needs. Next one, with 64MB and larger eDRAM modules, we may see it being more robust. Hopefully by then devs can choose 1080p with MSAA and no tiles, or even more MSAA while tiling. I would think that by 2011 a lot of engines will have been designed with tiling as a consideration, and it will be less of an issue than it is now, where PC and last-gen technologies are still being leveraged.
 
acert93 said:
This gen there doesn't seem to have been a nice solution to fit all needs. Next one, with 64MB and larger eDRAM modules, we may see it being more robust.
Current rate of progression in rendering is making eDRAM (purely for render buffer usage) less relevant, not more. Not to mention restricting it to render buffer usage only is another tradeoff that limits its usefulness further.

I suppose one could build a chip engineered specifically to accelerate deferred shading (hey, it almost happened by accident) to really get good use out of eDRAM, but I don't know how good that would really be compared to where "normal" GPU designs are going now.
 
Current rate of progression in rendering is making eDRAM (purely for render buffer usage) less relevant, not more. Not to mention restricting it to render buffer usage only is another tradeoff that limits its usefulness further.

But what if it's in PS4? Then what will you do!
 
Oh, another question about the tessellator... Which kinds of input does it accept? Can I, for example, send a NURBS mesh to it, so it does its magic and creates a polygonal mesh based on that?

I think these sorts of things could be great modeling-wise; you could store a NURBS surface (which, unless I'm mistaken, takes less space than a high-poly model) and still manage to use a very detailed model in the scene.

(and if you could control in which direction you tessellate, the vertices would concentrate on the areas closer to the camera, and in theory no matter how close you get, they would still look very rounded, right? (not by increasing the overall poly count, just adjusting so the polys are denser where needed))
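Something along those lines can be sketched like this; it's just generic distance-based tessellation-factor logic to illustrate the view-dependent idea, not the actual Xenos tessellator interface.

Code:
// Generic illustration of view-dependent tessellation: edges near the camera
// get a high tessellation factor, far edges get a low one. Hypothetical code.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

float Distance(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Maps an edge midpoint's distance from the camera to a factor in [1, maxFactor]:
// 1 = leave the edge alone, maxFactor = split it as finely as allowed.
float EdgeTessFactor(const Vec3& edgeMid, const Vec3& camera,
                     float nearDist, float farDist, float maxFactor)
{
    const float d = Distance(edgeMid, camera);
    const float t = std::min(1.0f, std::max(0.0f, (d - nearDist) / (farDist - nearDist)));
    return 1.0f + (1.0f - t) * (maxFactor - 1.0f);
}

int main()
{
    const Vec3 camera{0, 0, 0};
    std::printf("near edge factor: %.1f\n", EdgeTessFactor({0, 0, 2},  camera, 1, 100, 15)); // high
    std::printf("far edge factor:  %.1f\n", EdgeTessFactor({0, 0, 90}, camera, 1, 100, 15)); // low
}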

It would be very nice if all 360 games were like that, because when played on 720 they could all get even better, since they would run with higher poly counts... (ok, I'll land back on earth now :p)
 
Fully basing rendering upon tiling to use a very large number of tiles saves both the external bus -- using on-chip SRAM, whose speed is even faster than eDRAM's and makes fully deferred texturing/shading practical -- and also conserves processor die area -- using very small tiles which afford an even more consistent level of image quality through deferred rendering.
 
How do you get that it's 15% of the silicon? The EDRAM die is ~70mm^2 supposedly (where Xenos is ~180 and RSX ~240).
Read my quote again. Total console silicon. You're excluding the CPU and other smaller things too.

I'm lumping everything together because it's an integrated system. Xenos is the MC for the CPU too, and saving BW on the GDDR3 lets the CPU have more BW too.
 
Fully basing rendering upon tiling to use a very large number of tiles saves both the external bus -- using on-chip SRAM, whose speed is even faster than eDRAM's and makes fully deferred texturing/shading practical -- and also conserves processor die area -- using very small tiles which afford an even more consistent level of image quality through deferred rendering.
That's what I figured you were talking about, but deferred rendering complicates a LOT of things. It's not simply a matter of a "deeper rendering pipeline" as you put it.

The worst part is that deferred rendering can consume tons of memory for binning, which is probably the most precious resource on a console. It also uses a lot of frame time during the binning stage, and makes triangle setup a bigger bottleneck. Renderstate overhead can get very large in the paradigm of shaders (of which there are thousands) when you sort spatially.
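To make the binning cost concrete, here is a rough sketch of polygon-level binning for a fully tiled renderer (purely illustrative, not any particular hardware's scheme): every triangle index gets appended to every tile its bounding box touches, which is where the memory and frame time go.

Code:
// Rough sketch of polygon-level binning for a fully tiled (TBDR-style) renderer.
// Purely illustrative: bin lists grow with triangle count and tile overlap.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Tri { float x[3], y[3]; };            // screen-space triangle

struct Binner
{
    int tileW, tileH, tilesX, tilesY;
    std::vector<std::vector<uint32_t>> bins; // per-tile list of triangle indices

    Binner(int screenW, int screenH, int tw, int th)
        : tileW(tw), tileH(th),
          tilesX((screenW + tw - 1) / tw),
          tilesY((screenH + th - 1) / th),
          bins(tilesX * tilesY) {}

    void Add(uint32_t triIndex, const Tri& t)
    {
        // Conservative binning by bounding box: the index is appended to every
        // tile the bounds touch, so memory use grows with overlap.
        const float minX = std::min({t.x[0], t.x[1], t.x[2]});
        const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        const float minY = std::min({t.y[0], t.y[1], t.y[2]});
        const float maxY = std::max({t.y[0], t.y[1], t.y[2]});

        const int tx0 = std::max(0, int(minX) / tileW);
        const int tx1 = std::min(tilesX - 1, int(maxX) / tileW);
        const int ty0 = std::max(0, int(minY) / tileH);
        const int ty1 = std::min(tilesY - 1, int(maxY) / tileH);

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(triIndex);
    }
};

int main()
{
    Binner binner(1280, 720, 32, 32);
    binner.Add(0, Tri{ {100, 200, 150}, {100, 100, 160} }); // one triangle, a few tiles
    std::printf("tiles total: %d\n", binner.tilesX * binner.tilesY);
}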

Xenos' solution is the best of both worlds. Because you only have a few seams, you don't need polygon-level binning, thus avoiding most of the cons. You still get high-speed rendering of BW-heavy loads and the same low BW usage on the main memory. You can get near-TBDR culling efficiency with a Z-only pass and HiZ. The only con is that the EDRAM costs you a little perf/mm2 in low-poly situations.
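The "Z-only pass + HiZ" part folds into the per-tile loop something like this (same caveat as before: hypothetical helper names, only the two-pass ordering matters):

Code:
// Sketch of the Z-prepass ordering inside one tile. Hypothetical API; the point
// is only the two-pass structure that gets you near-TBDR culling efficiency.
#include <cstdio>

struct Rect { int x0, y0, x1, y1; };

void SetTilePredication(const Rect&) { std::puts("predicate to tile"); }
void DrawSceneDepthOnly()            { std::puts("  Z-only pass: fills depth + hierarchical Z"); }
void DrawSceneShaded()               { std::puts("  color pass: HiZ rejects hidden pixels before shading"); }
void ResolveTile(const Rect&)        { std::puts("  resolve tile from EDRAM"); }

void RenderTile(const Rect& tile)
{
    SetTilePredication(tile);
    DrawSceneDepthOnly();  // cheap: no pixel shading, ROPs can run double rate
    DrawSceneShaded();     // expensive pass only shades visible pixels
    ResolveTile(tile);
}

int main()
{
    RenderTile({0, 0, 1279, 719});
}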

(BTW the SRAM speed doesn't help you at all because the EDRAM is already fast enough for 256GB/s or more. Latency is easy and cheap to hide since there's no register data. Internal buses are cheap in this application.)
 
Current rate of progression in rendering is making eDRAM (purely for render buffer usage) less relevant, not more. Not to mention restricting it to render buffer usage only is another tradeoff that limits its usefulness further.
True, but I don't see this progression continuing for much longer. A lot of the longer shaders today seem to be long simply because they can be, not because they offer much noticeable improvement. I personally think we're going to have to increase the data per pixel more than math per pixel to get more realistic graphics. Xenos/RSX can do what, ~10000 fp ops per final pixel at 720P/30fps?

It's also likely that memory will grow in capacity much faster than speed, and rendering ability (Gsamples/s) will also increase much faster than memory speed. High res textures make a big difference in realism, and they will increase BW usage. BW per ROP has been decreasing a lot for many years. For cheap consoles, wider buses don't look like an option.
 
True, but I don't see this progression continuing for much longer. A lot of the longer shaders today seem to be long simply because they can be, not because they offer much noticeable improvement. I personally think we're going to have to increase the data per pixel more than math per pixel to get more realistic graphics. Xenos/RSX can do what, ~10000 fp ops per final pixel at 720P/30fps?

It's also likely that memory will grow in capacity much faster than speed, and rendering ability (Gsamples/s) will also increase much faster than memory speed. High res textures make a big difference in realism, and they will increase BW usage. BW per ROP has been decreasing a lot for many years. For cheap consoles, wider buses don't look like an option.

I don't know, a lot of what's happening now is increasingly subtle and subtlety costs in terms of computation.
 