Xbox One (Durango) Technical hardware investigation

Because it IS slow. Have you ever used a dedicated graphics card or integrated graphics that used DDR3?
Yep, and so have you. And the type of RAM is irrelevant to the bandwidth. There are GDDR5 cards that have the same or close to the same bandwidth as the X1 main pool: the Radeon HD 7770, for instance, and the 67XX and 57XX lines too. That's not counting the ESRAM at all, or the efficiencies gained by sending compressed textures to the GPU.
 
Because it IS slow. Have you ever used a dedicated graphics card or integrated graphics that used DDR3?

I think you are confused about bandwidth versus latency (most people are, anyway).

GDDR5 is based on DDR3.

GDDR5 can move more data than DDR3 at any given time (high bandwidth).
DDR3 is actually faster in terms of how soon the processor sees the data (low latency).

Hence GDDR is more suitable for a GPU, since graphics data are streamed in blocks. For a CPU the access pattern is more random, so GDDR's latency is not ideal.
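
As a back-of-the-envelope sketch of that tradeoff: the time to service one request is roughly latency plus size divided by bandwidth, so big streaming transfers favor bandwidth and small random accesses favor latency. The figures below are made up for illustration, not real DDR3/GDDR5 specs.

```python
# Toy model: service time for one memory request is roughly
#   latency + size / bandwidth.
# All figures below are illustrative, not real DDR3/GDDR5 specs.

def access_time_ns(size_bytes, latency_ns, bandwidth_gb_s):
    """Latency to first data plus transfer time (1 GB/s == 1 byte/ns)."""
    return latency_ns + size_bytes / bandwidth_gb_s

for size in (64, 4096, 1 << 20):  # cache line, page, 1 MiB texture block
    t_gddr = access_time_ns(size, latency_ns=60, bandwidth_gb_s=176.0)
    t_ddr = access_time_ns(size, latency_ns=40, bandwidth_gb_s=68.0)
    print(f"{size:>8} B: GDDR5-like {t_gddr:9.1f} ns, DDR3-like {t_ddr:9.1f} ns")
```

The small access comes back sooner on the low-latency pool, while the big block finishes far sooner on the high-bandwidth pool.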
 
I think you are confused about bandwidth versus latency (most people are, anyway).

GDDR5 is based on DDR3.

GDDR5 can move more data than DDR3 at any given time (high bandwidth).
DDR3 is actually faster in terms of how soon the processor sees the data (low latency).

Hence GDDR is more suitable for a GPU, since graphics data are streamed in blocks. For a CPU the access pattern is more random, so GDDR's latency is not ideal.

He is talking about bandwidth for the GPU. DDR3 is cheap but not usable for high-end GPUs. 68 GB/s is very good for DDR3, but still a very low number for a good discrete video card.
 
Yep, and so have you. And the type of RAM is irrelevant to the bandwidth. There are GDDR5 cards that have the same or close to the same bandwidth as the X1 main pool: the Radeon HD 7770, for instance, and the 67XX and 57XX lines too. That's not counting the ESRAM at all, or the efficiencies gained by sending compressed textures to the GPU.

That's usually because they used a small (128-bit) bus at very low speeds (like 4.4 Gbps per pin), which still gives about 70.4 GB/s. In fact I don't think you can go lower than that without further shrinking the bus.

The 68 GB/s DDR3 on the X1, on the other hand, is much more exotic, using 2133 MHz chips (which is high end in 2013) with a 256-bit bus. You can't just say that since both reach ~70 GB/s, with one type pushed as low as possible and the other as high as possible, there's no difference, and call it a day.
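
For reference, the arithmetic behind both figures is the same: peak bandwidth is bus width in bits times the per-pin data rate, divided by 8. A quick sketch:

```python
# Peak theoretical bandwidth = bus width (bits) * per-pin rate (Gbps) / 8.
def peak_gb_s(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8

print(peak_gb_s(128, 4.4))    # slow GDDR5 on a 128-bit bus -> 70.4 GB/s
print(peak_gb_s(256, 2.133))  # DDR3-2133 on a 256-bit bus  -> ~68.3 GB/s
```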

The type of RAM is of course relevant to the bandwidth.
 
The type of RAM is of course relevant to the bandwidth.

Sure, but it isn't the only thing that is relevant, or even necessarily the most important. As you yourself just pointed out, the width of the bus is just as important as the speed at which data can be sent.

For high bandwidth requirements GDDR is less costly than DDR. You need fewer traces on the motherboard, fewer pins on the GPU package, fewer chips, etc. That doesn't mean you couldn't get similar bandwidth by using DDR3 over GDDR5, however. But since cost favors GDDR5 when you get to really high bandwidths, you'll likely never see anyone implement DDR3 with a 512-bit interface, for example. :)
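
To make that concrete, here is the same bandwidth formula solved for bus width at an arbitrary target (the per-pin rates are typical 2013 figures; the 136 GB/s target is just for illustration). Every extra bus bit is another trace on the board and another pin on the package:

```python
# Bus width needed to hit a target bandwidth at a given per-pin rate.
def bus_bits_needed(target_gb_s, gbps_per_pin):
    return target_gb_s * 8 / gbps_per_pin

TARGET = 136.0  # GB/s, an arbitrary "high end" target for illustration
print(bus_bits_needed(TARGET, 2.133))  # DDR3-2133:    ~510 -> a 512-bit bus
print(bus_bits_needed(TARGET, 6.0))    # 6 Gbps GDDR5: ~181 -> a 192-bit bus
```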

Of course, DDR3 has its own benefits, and those come in the form of very low cost per GB and significantly lower power consumption per GB.

If GDDR5 were comparable in cost and power consumption then no one would ever use DDR3 for anything.

Regards,
SB
 
That's usually because they used a small (128-bit) bus at very low speeds (like 4.4 Gbps per pin), which still gives about 70.4 GB/s. In fact I don't think you can go lower than that without further shrinking the bus.

The 68 GB/s DDR3 on the X1, on the other hand, is much more exotic, using 2133 MHz chips (which is high end in 2013) with a 256-bit bus. You can't just say that since both reach ~70 GB/s, with one type pushed as low as possible and the other as high as possible, there's no difference, and call it a day.

The type of RAM is of course relevant to the bandwidth.
Yes, you are correct, I misspoke. What I meant was that if you can meet the bandwidth requirements, the type of RAM you use to do it is irrelevant.
 
Anyone taken a stab at the bandwidth overhead to keep data in ESRAM yet?

It's memory, so once data is in there it remains there until overwritten. I don't understand what you mean. Please clarify.
 
Data needs to be written to ESRAM, and results read back to DDR; I'm curious about that overhead in terms of bandwidth.
 
It's memory, so once data is in there it remains there until overwritten. I don't understand what you mean. Please clarify.

I think a lot of people are assuming you will need to frequently copy to/from ESRAM to/from DDR3, which will add some "overhead" to your bandwidth requirements. I have no insight into the actual usage of the ESRAM.
 
Data needs to be written to ESRAM, and results read back to DDR; I'm curious about that overhead in terms of bandwidth.
It's exactly the same overhead as every other device that reads memory, changes it, and writes it back again. The PS4's frame buffer will not be in some magical place where it never has to be read or written.
If you use the ESRAM for storing intermediate buffers, for instance for shadows, screen-space anti-aliasing, reflection calculations, etc, then there should be minimal to no extra overhead related to copies between ESRAM and DRAM, especially if you then write the final frame buffer directly back to DRAM from the GPU. And you get the advantage of extremely low latency on your intermediate reads and writes, which hopefully will allow you to keep all your pipelines at maximum throughput.
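
As a hypothetical sketch of that placement policy (none of these names correspond to a real Durango API; the 32 MB capacity matches the leaked ESRAM size):

```python
# Hypothetical sketch of the buffer-placement policy described above.
# "ESRAM"/"DRAM" are just named pools here, not a real Durango API.
ESRAM_BYTES = 32 * 1024 * 1024  # the leaked 32 MB figure

class FramePlan:
    def __init__(self):
        self.esram_used = 0
        self.placement = {}

    def place(self, name, size_bytes, transient):
        # Transient intermediates go to ESRAM while it has room;
        # anything consumed later (e.g. the final frame) goes to DRAM.
        if transient and self.esram_used + size_bytes <= ESRAM_BYTES:
            self.esram_used += size_bytes
            self.placement[name] = "ESRAM"
        else:
            self.placement[name] = "DRAM"

plan = FramePlan()
plan.place("shadow_map", 4 * 1024 * 1024, transient=True)
plan.place("ssao_buffer", 8 * 1024 * 1024, transient=True)
plan.place("final_frame", 16 * 1024 * 1024, transient=False)
print(plan.placement)  # intermediates land in ESRAM, final frame in DRAM
```

With that kind of scheme the intermediates never cross the DRAM bus at all, which is where the "no extra copy overhead" claim comes from.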
 
Data needs to be written to ESRAM, and results read back to DDR; I'm curious about that overhead in terms of bandwidth.
This is not really how I would use fast embedded memory. I'd keep temporary data there for dependent renders or some compute workload, and rarely (if ever) anything that would have to be transferred back to main memory.
 
And you get the advantage of extremely low latency on your intermediate reads and writes, which hopefully will allow you to keep all your pipelines at maximum throughput.
Any insight as to how much lower the latency is?
The leaks are not very clear beyond it being lower.
 
The L2 caches of the Jaguar cores have 2-3 ns latency for a fraction of the capacity.
I'm interested in knowing what the interface for the eSRAM is modeled after, and how it hooks into the memory pipeline, since the DRAM itself is only part of the latency figure.

edit: I had a brain misfire, the 2ns figure would be for the L1. The L2 is in some nebulous 10-15 ns range.
 
It's also been suggested that the L1 and L2 caches in GCN GPUs are in the hundreds of ns (or was that cycles?) and I'm assuming the ESRAM will be slower still. My understanding is GPU memory subsystems are not generally optimized for low latency the way CPUs are, so I have to wonder what the ESRAM latency advantage will actually be. Is it an order of magnitude compared to off chip DRAM, or simply a fractional advantage?
 
It's also been suggested that the L1 and L2 caches in GCN GPUs are in the hundreds of ns (or was that cycles?) and I'm assuming the ESRAM will be slower still. My understanding is GPU memory subsystems are not generally optimized for low latency the way CPUs are, so I have to wonder what the ESRAM latency advantage will actually be. Is it an order of magnitude compared to off chip DRAM, or simply a fractional advantage?
I think I've heard something closer to 16-25 cycles or so for the vector L1 latency (the scalar L1 may be at ~20 cycles). Forget about these strange Sandra tests; they are most likely just wrong.
If you have L2 misses and have to go out to memory, then the DRAM controller, very likely optimized to attain a high bandwidth at the expense of some latency, will incur a heavy cost, before it even goes to the DRAM itself. This penalty is likely to be lower for SRAM, as you don't need to care that much about the optimal sequence of opening and closing banks to get a high utilization. It won't remove the latency of the memory hierarchy of the GPU itself, but it will definitely be faster (but I can't put a number on it by how much).
That hierarchy doesn't necessarily mean the L1/L2 caches, as the ROP exports bypass them and go through the ROP caches, which are in some sense large write-combining buffers. The ROPs load compressed tiles from the framebuffer in RAM, decompress them into the internal caches, write or blend within those caches, and write back to memory in compressed format when a tile needs to be evicted (because the ROPs access another tile and no space is left in the cache). This tremendously increases the size of the memory accesses (complete tiles are read and written, not individual pixels). It is clearly a bandwidth-utilization optimization and should be fairly latency tolerant, as long as one is not doing some fancy stuff. But again, I'm not aware of any firm numbers for the involved latencies, so the potential effect of the eSRAM is unclear (to me).
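
As rough arithmetic on what tile-granularity access means (the 8x8 tile and 32-bit pixel are assumptions; the principle, not the exact numbers, is the point):

```python
# Blending one pixel through a tiled ROP cache touches a whole tile,
# not just that pixel. Tile dimensions and pixel size are assumed.
TILE_W, TILE_H, BYTES_PER_PIXEL = 8, 8, 4

pixel_payload = BYTES_PER_PIXEL                    # 4 B actually needed
tile_transfer = TILE_W * TILE_H * BYTES_PER_PIXEL  # 256 B read, then written back

print(f"per-pixel payload: {pixel_payload} B, per-tile transfer: {tile_transfer} B")
```

Compression shrinks the actual traffic, but the access granularity stays tile-sized, which is what makes the scheme bandwidth-oriented rather than latency-sensitive.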
 
If you use the ESRAM for storing intermediate buffers, for instance for shadows, screen-space anti-aliasing, reflection calculations, etc, then there should be minimal to no extra overhead related to copies between ESRAM and DRAM, especially if you then write the final frame buffer directly back to DRAM from the GPU. And you get the advantage of extremely low latency on your intermediate reads and writes, which hopefully will allow you to keep all your pipelines at maximum throughput.

Well... would you even need to? ;)


It's also been suggested that the L1 and L2 caches in GCN GPUs are in the hundreds of ns (or was that cycles?) and I'm assuming the ESRAM will be slower still. My understanding is GPU memory subsystems are not generally optimized for low latency the way CPUs are, so I have to wonder what the ESRAM latency advantage will actually be. Is it an order of magnitude compared to off chip DRAM, or simply a fractional advantage?

Different usage: I don't think the caches store anything post-rasterization... the ESRAM covers that instead of a hop back to the memory controller, and also serves as a bigger overflow if pre-raster data gets big/bursty.
 