PS360 bandwidths and rendering *full circle*

Since my question is connected to the framebuffer, I think I can ask it here. Why is the Xbox 360 main-core-to-EDRAM-core bus 32 GB/s? The EDRAM is only 10 MB and it contains the backbuffer and Z-buffer. At minimum, a 30 fps game without tiling is only about 300 MB/s of data; at maximum, a 60 fps game with two tiles is about 1.2 GB/s. So why is there a need for 32 GB/s?
 
Since my question is connected to the framebuffer, I think I can ask it here. Why is the Xbox 360 main-core-to-EDRAM-core bus 32 GB/s? The EDRAM is only 10 MB and it contains the backbuffer and Z-buffer. At minimum, a 30 fps game without tiling is only about 300 MB/s of data; at maximum, a 60 fps game with two tiles is about 1.2 GB/s. So why is there a need for 32 GB/s?
Well, it's simple. You don't just write the end result in there; for one frame you read and write a lot more data than the size of the complete frame. All the data inside changes a lot during those 33 ms (or 16 ms). 32 GB/s is not that much, but in most cases it was enough.
 
Well, it's simple. You don't just write the end result in there; for one frame you read and write a lot more data than the size of the complete frame. All the data inside changes a lot during those 33 ms (or 16 ms). 32 GB/s is not that much, but in most cases it was enough.
The PS3 has 20.8 GB/s of VRAM bandwidth, and it's not only for the backbuffer and Z-buffer, but also for the frontbuffer and textures.
 
Since my question is connected to the framebuffer, I think I can ask it here. Why is the Xbox 360 main-core-to-EDRAM-core bus 32 GB/s? The EDRAM is only 10 MB and it contains the backbuffer and Z-buffer. At minimum, a 30 fps game without tiling is only about 300 MB/s of data; at maximum, a 60 fps game with two tiles is about 1.2 GB/s. So why is there a need for 32 GB/s?
The GPU creates the result for each pixel of every surface sent to rasterization inside the daughter die (well, perhaps not Z-only writes).
So the bandwidth needs to be enough to feed ROPs churning out 8 pixels per clock.
The PS3 has 20.8 GB/s of VRAM bandwidth, and it's not only for the backbuffer and Z-buffer, but also for the frontbuffer and textures.
RSX was quite bandwidth starved as a result.
 
The GPU creates the result for each pixel of every surface sent to rasterization inside the daughter die (well, perhaps not Z-only writes).
So the bandwidth needs to be enough to feed ROPs churning out 8 pixels per clock.
Then it's 500,000,000 * 8 * 32 / 8 ≈ 16 GB/s of writes and the same amount of reads, for a total of 32 GB/s?
RSX was quite bandwidth starved as a result.
20.8 GB/s versus 22.4 + 32 = 54.4 GB/s. That's too big a difference. Of course the 22.4 GB/s isn't only for the GPU, but still.
 
Then it's 500,000,000 * 8 * 32 / 8 ≈ 16 GB/s of writes and the same amount of reads, for a total of 32 GB/s?
No, it's:

500 MHz * (64 bits of color, alpha, and depth) * 8 ROPs
/
8 bits per byte * 2^30 bytes per GB

And when the bus is being used to feed the ROPS, all of this bandwidth is "write" bandwidth.

But "write" is a bit of a strange term, because that data isn't directly being written to the memory banks. It's being transmitted to the ROPs, which then do the actual read/modify/write operations to eDRAM. "Transmit" and "receive" might be better terminology than "write" and "read."

Also, the bus is bidirectional; it's also used for copying the eDRAM contents to main RAM through Xenos.
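
Written out as numbers, that arithmetic looks like this (a small sketch in Python; dividing by 2^30 gives binary GiB, while the commonly quoted 32 GB/s figure uses decimal gigabytes):

# Peak GPU -> daughter-die transmit bandwidth, per the formula above.
clock_hz = 500_000_000       # Xenos clock
bits_per_pixel = 64          # 32-bit color/alpha + 32-bit depth/stencil
rops = 8                     # pixels handled per clock

bytes_per_second = clock_hz * bits_per_pixel * rops / 8  # bits -> bytes
print(bytes_per_second / 10**9)  # 32.0  -> "32 GB/s" in decimal GB
print(bytes_per_second / 2**30)  # ~29.8 -> GiB/s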
 
Since my question is connected to the framebuffer, I think I can ask it here. Why is the Xbox 360 main-core-to-EDRAM-core bus 32 GB/s? The EDRAM is only 10 MB and it contains the backbuffer and Z-buffer. At minimum, a 30 fps game without tiling is only about 300 MB/s of data; at maximum, a 60 fps game with two tiles is about 1.2 GB/s. So why is there a need for 32 GB/s?
The Xbox 360 EDRAM was designed to handle the worst-case maximum: 32-bit color + 24/8-bit depth/stencil with 4xMSAA + alpha blending. This simplified the GPU design, as there was no need for color/depth compression hardware.

EDRAM internal bandwidth = 64 bpp (color & depth) * 4 samples (4xMSAA) * 8 ROPs * 500 MHz * 2 (read & write) = 256 GB/s.

GPU->EDRAM bus only needs 32 GB/s, since the EDRAM die internally handles the 4xMSAA sample replication. Only a single color value (+coverage mask = 4 bits?) needs to be output from each pixel shader instance.
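
As a rough sketch of where those two figures come from (assuming the x4 factor is the 4xMSAA sample replication done inside the EDRAM die, and the x2 factor is the read-modify-write needed for blending and Z/stencil testing):

clock_hz = 500_000_000
rops = 8
bytes_color_depth = 8                 # 32-bit color + 32-bit depth/stencil = 64 bpp

# GPU -> EDRAM bus: one color+depth value per pixel; samples are replicated on-die.
bus_bw = clock_hz * rops * bytes_color_depth
print(bus_bw / 10**9)                 # 32.0 GB/s

# Inside the EDRAM die: 4 MSAA samples per pixel, each read and written.
internal_bw = bus_bw * 4 * 2
print(internal_bw / 10**9)            # 256.0 GB/s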
 
No, it's:

500 MHz * (64 bits of color, alpha, and depth) * 8 ROPs
/
8 bits per byte * 2^30 bytes per GB
Ok, I see.

And when the bus is being used to feed the ROPS, all of this bandwidth is "write" bandwidth.
But it can't all be 32 GB/s of write bandwidth, because data also needs to be read, right?

But "write" is a bit of a strange term, because that data isn't directly being written to the memory banks. It's being transmitted to the ROPs, which then do the actual read/modify/write operations to eDRAM. "Transmit" and "receive" might be better terminology than "write" and "read."
Just to make it clear: data goes from main RAM to the EDRAM core to the ROP blocks, then they do the modify, then the data goes to the EDRAM memory banks, and then it goes back to main RAM?

GPU->EDRAM bus only needs 32 GB/s, since the EDRAM die internally handles the 4xMSAA sample replication. Only a single color value (+coverage mask = 4 bits?) needs to be output from each pixel shader instance.
If it's 4 bits per pixel then it's 16 GB/s, right?
 
But it can't all be 32 GB/s of write bandwidth, because data also needs to be read, right?
No. The Xbox 360 GPU doesn't read data from the EDRAM. Alpha blending and (per-sample) Z/stencil testing are done internally inside the EDRAM die. The GPU reads textures and vertices from main memory (GDDR3).
Just to make it clear: data goes from main RAM to the EDRAM core to the ROP blocks, then they do the modify, then the data goes to the EDRAM memory banks, and then it goes back to main RAM?
Not correct. EDRAM is a temporary scratchpad solely dedicated to storing ROP output (pixel shader results). The ROPs are in the EDRAM die. There is no data path from main memory to EDRAM.
If it's 4 bits per pixel then it's 16 GB/s, right?
4bpp would be 2 GB/s.

I recommend reading this article:
https://www.beyond3d.com/content/articles/4/4
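
For reference, the coverage-mask figure works out like this (a sketch, assuming 4 bits per pixel at 8 pixels per clock):

clock_hz = 500_000_000
rops = 8
coverage_bits = 4                     # per-pixel coverage mask

print(clock_hz * rops * coverage_bits / 8 / 10**9)   # 2.0 GB/s, not 16 GB/s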
 
But it can't all be 32 GB/s of write bandwidth, because data also needs to be read, right?
While rendering something, the main GPU is just sending pixel data to the ROPs that reside in the daughter die, and the ROPs do the actual pixel fill operations. During this time, the eDRAM die is not transmitting data back to the main GPU.

It works vaguely like this:

[animated GIF: qk6X1AX.gif]
 
No. The Xbox 360 GPU doesn't read data from the EDRAM. Alpha blending and (per-sample) Z/stencil testing are done internally inside the EDRAM die. The GPU reads textures and vertices from main memory (GDDR3).
But the data from the EDRAM core needs to get out somehow? The frontbuffer is stored in main RAM.

Not correct. EDRAM is a temporary scratchpad solely dedicated to storing ROP output (pixel shader results). The ROPs are in the EDRAM die. There is no data path from main memory to EDRAM.
Ok, I'll try to say it differently. There is 32 GB/s of bandwidth between the main core and the daughter core, so it's 16 GB/s each way, right? Because the same number of pixels that goes from the main core to the ROP blocks should also go out from the ROP blocks back to the main core and then to RAM.

It works vaguely like this:
First of all, I want to say that it's a great picture you've made. Second, it now explains a lot for me.
 
HTupolev's animation is great :)

However, it is important to understand that the resolve (the copy from EDRAM to main memory) doesn't happen immediately after writing MSAA fragments to EDRAM. The developer manually inserts a resolve operation into the command list once the whole viewport has finished rendering. For example: clear EDRAM, submit 5000 draw calls (the GPU renders them), resolve the result to main memory (as a texture). Post-processing steps can now read the texture containing the main scene rendering result. After post-processing you again resolve the image back to main memory. Display output takes the result from main memory at the end of the frame.
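
In rough pseudocode, a frame might look like this (all function names here are made up purely to illustrate the ordering described above; this is not a real API):

# Hypothetical, heavily simplified Xbox 360 frame flow.
def clear_edram():                pass   # clear color/depth in EDRAM
def draw(call):                   pass   # ROPs write color/depth into EDRAM only
def resolve_to_main_memory():     return "texture_in_main_ram"  # copy EDRAM out
def draw_fullscreen(p, texture):  pass   # post-process pass, output goes back to EDRAM

clear_edram()
for call in range(5000):                       # main scene draw calls
    draw(call)                                 # rendered entirely into EDRAM
scene_texture = resolve_to_main_memory()       # resolve #1: scene becomes a texture

for p in ["bloom", "tonemap"]:                 # example post-processing passes
    draw_fullscreen(p, scene_texture)          # reads the resolved texture from main RAM
final_image = resolve_to_main_memory()         # resolve #2: display scans out of main RAM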
 
But the data from the EDRAM core needs to get out somehow? The frontbuffer is stored in main RAM.
Yes. You submit a resolve command after you have rendered the whole scene to EDRAM.
Ok, I'll try to say it differently. There is 32 GB/s of bandwidth between the main core and the daughter core, so it's 16 GB/s each way, right? Because the same number of pixels that goes from the main core to the ROP blocks should also go out from the ROP blocks back to the main core and then to RAM.
The main idea of any ROP scratchpad (be it EDRAM or a tiling buffer) is that you only resolve the final image (one write per final pixel). Usually there's around 2x-5x overdraw for each pixel; the same pixels get written over and over in the EDRAM. You don't pay any main memory bandwidth cost for the overdraw. You only resolve the final image.
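
A rough numerical sketch of that point (assumed numbers: 1280x720, 3x average overdraw, 8 bytes of color+depth written per covered pixel, 4-byte resolved color):

pixels = 1280 * 720
overdraw = 3                         # assumed average; real scenes vary wildly

edram_traffic = pixels * overdraw * 8    # ROP writes that stay inside the EDRAM die
resolve_traffic = pixels * 4             # final color copied to main RAM exactly once

print(edram_traffic / 2**20)         # ~21.1 MB of overdraw traffic, zero main memory cost
print(resolve_traffic / 2**20)       # ~3.5 MB crosses to main memory at resolve time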

For the huge majority of the frame time, there is zero traffic out of the EDRAM. Traffic out of EDRAM only occurs when you do a resolve operation. A resolve writes the whole render target as one big chunk directly from EDRAM to main memory. This stalls the GPU, since the EDRAM is unavailable during the resolve operation.

Also, it's worth noting that Xbox One ESRAM is completely different from Xbox 360 EDRAM. Xbox One has move engines capable of doing background transfers in/out of the ESRAM (without stalling the GPU). Also, the Xbox One GPU can read the ESRAM directly; you don't need to resolve render targets to main memory in order to read them. This obviously saves bandwidth and time.
 
However, it is important to understand that the resolve (the copy from EDRAM to main memory) doesn't happen immediately after writing MSAA fragments to EDRAM. The developer manually inserts a resolve operation into the command list once the whole viewport has finished rendering. For example: clear EDRAM, submit 5000 draw calls (the GPU renders them), resolve the result to main memory (as a texture). Post-processing steps can now read the texture containing the main scene rendering result. After post-processing you again resolve the image back to main memory. Display output takes the result from main memory at the end of the frame.
Very good explanation! I previously thought that post-processing was done differently. Ok, now I know how it really works.

The main idea of any ROP scratchpad (be it EDRAM or a tiling buffer) is that you only resolve the final image (one write per final pixel). Usually there's around 2x-5x overdraw for each pixel; the same pixels get written over and over in the EDRAM. You don't pay any main memory bandwidth cost for the overdraw. You only resolve the final image.
This much I understand. I just want to know whether the same amount of data goes from the main core to the daughter core as goes back from the daughter core to the main core. :smile:

For the huge majority of the frame time, there is zero traffic out of the EDRAM. Traffic out of EDRAM only occurs when you do a resolve operation. A resolve writes the whole render target as one big chunk directly from EDRAM to main memory. This stalls the GPU, since the EDRAM is unavailable during the resolve operation.
How strongly does it affect GPU performance?
 
There's still something unclear to me about bandwidth. Here are my calculations.
Take a game with a 1280x720 resolution. That would be 1280x720 = 921,600 pixels that have to be sent from the GPU to the ROP blocks. As there are 8 ROP blocks, they can take 8 pixels per clock, so it would be 921,600/8 = 115,200 groups of 8 pixels each. As one pixel is 32 bits, that is 921,600x32 = 29,491,200 bits of data, which is 3.515625 MB. That is for one frame! Even for a 60 fps game, that would be 210.9375 MB/s of data. That's all. Exactly the same amount of data goes from the ROP blocks back to the main core, so it's 421.875 MB/s in total. Why, why is there a need for a 32 GB/s bus between the cores?
 
Some more info for the previous post. On PS3 the frontbuffer, backbuffer and Z-buffer are in VRAM. So it would be 1280x720x32 bits = 3.515625 MB for the frontbuffer, and the same for the backbuffer and Z-buffer, for a total of 10.546875 MB per frame. For a 60 fps game that would be 632.8125 MB/s of bandwidth one way. For both ways it's 1265.625 MB/s, or around 1.24 GB/s of VRAM bandwidth. The PS3's total VRAM bandwidth is 20.8 GB/s, so there is 20.8 - 1.24 = 19.56 GB/s of bandwidth left for textures etc.
 
There's still something unclear to me about bandwidth. Here are my calculations.
Take a game with a 1280x720 resolution. That would be 1280x720 = 921,600 pixels that have to be sent from the GPU to the ROP blocks. As there are 8 ROP blocks, they can take 8 pixels per clock, so it would be 921,600/8 = 115,200 groups of 8 pixels each. As one pixel is 32 bits, that is 921,600x32 = 29,491,200 bits of data, which is 3.515625 MB. That is for one frame! Even for a 60 fps game, that would be 210.9375 MB/s of data. That's all. Exactly the same amount of data goes from the ROP blocks back to the main core, so it's 421.875 MB/s in total. Why, why is there a need for a 32 GB/s bus between the cores?
1280x720 * (32 bit color + 32 bit depth) = 7 372 800 bytes. 32 GB/s bus would have enough bandwidth for you to perform 32 GB / 60 frames / 7372800 bytes = 77.67 full screen passes / frame (60 fps). This of course assumes that you are ROP bound all the time. Only simple shaders are ROP bound, others are not using ROPs fully (and thus not using the ROP->EDRAM bandwidth fully). Maximum bandwidth is only reached if you transfer at maximum rate all the time, during the whole frame. No game ever does this. You'd have to use only simple shaders to reach maximum ROP rate all the time.

A theoretical maximum of 77.67 full screen passes per frame might at first seem like a big number, but once you factor in all the overdraw (objects behind each other, vegetation and trees), transparencies (such as fog, particles, glass windows) and post-processing passes, this number is no longer huge. I have seen single trees (in shipping games) with 30x overdraw (near the trunk), and particle systems with over 200x overdraw.
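
For what it's worth, the 77.67 figure falls out of this arithmetic (a sketch; the 32 GB/s is interpreted as binary gigabytes here):

bus_bw = 32 * 2**30                  # GPU -> EDRAM bus, bytes per second
fps = 60
pass_bytes = 1280 * 720 * 8          # 32-bit color + 32-bit depth per pixel

print(bus_bw / fps / pass_bytes)     # ~77.67 ROP-bound full screen passes per frame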
 
You're not accounting for overdraw, where each pixel is touched multiple times. Smoke effects may lead to the same pixel being touched tens of times.
Ok, then should that pixel be written into the frontbuffer each time (after each touch), or not? In other words, when one pixel goes to a ROP block, in the regular situation there is a read/modify/write, and then the pixel goes to the frontbuffer in main RAM. What happens if it's touched several times?

Maximum bandwidth is only reached if you transfer at maximum rate all the time, during the whole frame.

Why so? Can you please explain a little bit? Maybe with some examples.

And yes, why then is the situation on PS3 not so bad, if it has a lot less bandwidth?
 
Ok, then should that pixel be written into the frontbuffer each time (after each touch), or not?
The term "frontbuffer" generally refers to the current buffer being output to the display. It does not get messed with during the rendering process.
By "frontbuffer" I'm assuming you mean buffers in main RAM corresponding to the data in eDRAM.

Look at my gif again. The GPU actually has the ROP write into one of the pixels twice while the filling is happening, overwriting some of the data. It's only after this that it copies the buffer out to main RAM. In theory, the GPU could do as many fill operations as it wants before copying the buffer out. It could spend hours filling tons and tons of color into those sixteen sample locations (obviously doing tons and tons of overwriting), and then, at the end of this lengthy spewing of data, copy out to main RAM.

There is no buffer in main RAM getting updated constantly while the GPU is writing to eDRAM. That's the whole point of having the eDRAM: ROP activity does not affect the bandwidth usage on the bus to main RAM. The GPU completes an entire rendering pass to eDRAM (whether that's the rendering of a shadow map, or rendering the opaque objects of the scene into a G-buffer, or whatever), and then at the end of that pass, copies the buffer(s) in eDRAM out to main RAM for use in later passes. (Or, if it's the final image being copied out, to sit around and become a frontbuffer.)
 