The pros and cons of eDRAM/ESRAM in next-gen

In a similarly simple scenario, we know GDDR5 can reach 90-91% of its peak BW, so roughly 160GB/s out of 176GB/s in an ideal real-world case.

Does anyone have more context on this (particularly the bar on the far right)?

[image: 0LOwYux.jpg (bandwidth graph)]
 
Does anyone have more context on this (particularly the bar on the far right)?

It's a little vague on the details, but the general trend makes sense to me and is consistent with some of the concerns I put forward ahead of launch. The GPU numbers seem lower than 90% utilization, but since the GPU is more than ROPs that isn't unexpected.

The interleaving and ordering behavior of CPU and GPU memory pages is different, so there are hits due to CPU pages not being as aggressively distributed amongst the channels.
Unlike the GPU, the CPU portion is also latency-intolerant and more strongly ordered, so the memory controllers cannot be as aggressive about combining and waiting for more optimal accesses to opportunistically fill in gaps that would result from DRAM penalties. Long runs of accesses of a specific type become harder to sustain, and the patterns for CPUs tend to be less regular.
Since many DRAM penalties are disproportionate in their impact, a relatively small amount of sub-optimal CPU traffic will have a bigger effect from dead cycles and missed scheduling opportunities.
This may mean there is room for low-level optimization for the PS4 that could claw back some of this bandwidth, if games get creative about arranging their accesses.
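To make the "disproportionate" point concrete, here's a toy sketch with completely invented burst and turnaround costs (it is not a model of the actual memory controller, just an illustration of the shape of the effect):

[code]
# Toy model of mixed CPU/GPU traffic on a shared DRAM bus. All numbers are
# invented for illustration; a real memory controller is far more complex.
import random

BURST_CYCLES = 4   # cycles to move one streaming (GPU-style) burst -- assumed
TURNAROUND = 12    # extra cycles when a CPU access breaks the streaming pattern -- assumed

def fraction_of_peak(cpu_share, requests=100_000, seed=0):
    """Achieved/peak bandwidth when cpu_share of the requests are
    latency-sensitive CPU accesses the controller cannot reorder around."""
    rng = random.Random(seed)
    cycles = 0
    for _ in range(requests):
        if rng.random() < cpu_share:
            cycles += TURNAROUND + BURST_CYCLES   # CPU access: pays the penalty
        else:
            cycles += BURST_CYCLES                # GPU access: keeps streaming
    return requests * BURST_CYCLES / cycles       # peak = everything streaming

for share in (0.0, 0.05, 0.10, 0.20):
    print(f"CPU = {share:.0%} of requests -> {fraction_of_peak(share):.0%} of peak")
[/code]

With those made-up numbers, 10% of the requests coming from the CPU already costs roughly a quarter of the peak bandwidth.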

At least some of my concerns about the PS4 having worse penalties than the Xbox One, in terms of the GPU trampling on the CPU's latency, may have been at least partly incorrect: there is a real possibility that they are in the same league of badness, seeing how every other APU has such bad numbers.
 
I can't comment on that particular graph but we have bespoke server hardware with mixed processor architectures that in some configurations will share a bus to RAM. That graph typifies the symptoms of one processor benefiting from some sort of burst access mode of the bus, which suffers when a processor of another type (or just a second processor) uses the bus. The more the second processor accesses the bus, the greater (more disproportionate) the effect.

I deal with this kind of problem every day, albeit on a larger scale. For me, the balance is: CPU TYPE A is generic and benefits from burst, CPU TYPE B is specific and doesn't, so when should I use CPU TYPE B to solve my problem, bearing in mind everything else that's running?
 
Does anyone have more context on this (particularly the bar on the far right)?

In short, it's saying that the bandwidth consumed by the CPU takes away more bandwidth from the GPU than the CPU itself uses, which makes total sense.

Simplified example, 100GB/s total bandwidth:
GPU only: can use the full 100GB/s.
If the CPU uses 10GB/s, the GPU can only use 80GB/s, leaving 10GB/s effectively unusable.
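Or as a back-of-the-envelope calculation, with an assumed 2x overhead factor standing in for whatever the controller actually loses to CPU access patterns:

[code]
# Sketch of the "CPU bandwidth costs the GPU more than it uses" idea above.
# The 2x overhead factor is purely illustrative, not a measured number.

TOTAL_BW = 100.0     # GB/s, the simplified example's total bus bandwidth
CPU_OVERHEAD = 2.0   # assumed: each GB/s of CPU traffic burns this many GB/s of bus time

def gpu_usable(cpu_bw):
    return max(0.0, TOTAL_BW - CPU_OVERHEAD * cpu_bw)

for cpu_bw in (0, 5, 10, 20):
    print(f"CPU uses {cpu_bw:2d} GB/s -> GPU can still use {gpu_usable(cpu_bw):5.1f} GB/s")
[/code]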
 
We've been over this. The architects quoted the 145GB/s eSRAM scenario from measurements of real games. That is not some benchmark designed to max out the bandwidth with bogus code, so if you are comparing game code on the X1 against synthetic code on the PS4, I don't see how that's a good comparison.

He's making a comparison based on synthetic code on the PC, but your point stands.
 
The GPU numbers seem lower than 90% utilization, but since the GPU is more than ROPs that isn't unexpected.

Mmhm... I guess my question was where this 135GB/s really fits (in the context of the measured scenario) when there's a 90% figure for PC cards.

But yeah, the trend with CPU mixed in there is pretty apparent.
 
Mmhm... so where does this 135GB/s really fit when there's a 90% figure for PC cards?
90% figure for a ROP test, or more generally?
The highest numbers I've seen were specifically ROP-focused. Sony's graph is non-specific enough that I'm assuming there is at least some functionality being used besides that.
 
So in the case of XB1, esram will allow the ROPs to have "protected" BW, making performance more predictable?

Unfortunate that it only has 16 ROPs and 32 MB of esram, really.
 
The Vgleaks documentation said as much in one part.
ROP loads don't seem to appreciate the fixed read/write ratio the eSRAM offers, which is probably why all but the worst mixes in the Sony graph are only mildly behind, if at all.

The point stands that the analysis for which solution works best is very workload dependent, and can even be dependent on what point you are in a workload.
Either platform will likely have certain preferences and optimizations that work best for them, and some that might be neutral or negative if used on the other.

I really don't think we'll have a more firm picture until both platforms have much more mature software and development tools some time down the line, and even then I suspect some proprietary details will be sanitized from public disclosure.
 
You need more cycles for one read on DRAM, and GDDR5 is even worse than DDR3 here. There are cycles that are not effective at all. The higher clock rate only helps to reduce the latency, not the number of cycles needed.
You're completely forgetting that you complete more cycles in the same time frame.
A higher clock rate means that GDDR5/DDR3 RAM loses more cycles.
That doesn't make any sense. A higher clock rate means you get to plow through your data before it queues up, and you also get much better throughput.

Below is an example to illustrate.

Suppose I have data that I need to move through a connection, and I have two hypothetical connections:

Connection A: 15 bits per second

Connection B: 60 bits per 4 seconds

I have a steady stream of data coming in, requiring the connections to move 1 bit per 0.1 seconds.
From t=6 to t=8 (2 seconds), there is a pause in the data.
We then have a sudden spike of 45 bits from t=8 to t=9 (1 second duration).
The data resumes after t=9.

t=0,  A=0 bits (0 queued),   B=0 bits (0 queued)
t=1,  A=10 bits (0 queued),  B=0 bits (10 queued)
t=2,  A=20 bits (0 queued),  B=0 bits (20 queued)
t=3,  A=30 bits (0 queued),  B=0 bits (30 queued)
t=4,  A=40 bits (0 queued),  B=40 bits (0 queued)
t=5,  A=50 bits (0 queued),  B=40 bits (10 queued)
t=6,  A=60 bits (0 queued),  B=40 bits (20 queued)
t=7,  A=60 bits (0 queued),  B=40 bits (20 queued)
t=8,  A=60 bits (0 queued),  B=60 bits (0 queued)
t=9,  A=75 bits (30 queued), B=60 bits (45 queued)
t=10, A=90 bits (25 queued), B=60 bits (55 queued)
t=11, A=105 bits (20 queued), B=60 bits (65 queued)
t=12, A=120 bits (15 queued), B=120 bits (15 queued)
t=13, A=135 bits (10 queued), B=120 bits (25 queued)
t=14, A=150 bits (5 queued),  B=120 bits (35 queued)
t=15, A=165 bits (0 queued),  B=120 bits (45 queued)
t=16, A=175 bits (0 queued),  B=175 bits (0 queued)

As you can see, the connection with the higher clock rate will perform better in this situation. It cleared the sudden influx of data at t=15, while the lower clock rate had to wait until t=16.

However, it is clear that A had 2 lost cycles (t=7 and t=8) while B didn't.
But who cares? A finished the job!

On paper both connections have the same bandwidth, but in reality the higher clocked one will achieve better throughput!

It also helps that at almost any given time, the faster clock rate connection has less data queued.

Going back to the example, say you queue one bit to be written at t=5.5, and then you need to read it immediately afterwards.
On the higher clock rate connection, you can write it at t=6 and then immediately retrieve it at t=7. Done.
Meanwhile, on the lower clock rate connection you have to wait until t=8 for the write and then until t=12 for the read to be completed.

On A it takes 1.5 seconds to finish the process, on B it takes 6.5 seconds to finish the process.

Which is better? You decide.
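If anyone wants to poke at the numbers, a quick sketch reproduces the table above. The arrival pattern is exactly the scenario described; everything else is just the toy model:

[code]
# Re-creation of the A-vs-B table as a tick-per-second simulation.
# A moves up to 15 bits every second; B moves up to 60 bits every 4 seconds.

def arrivals(t):
    """Bits arriving during the second ending at time t, per the scenario above."""
    if t <= 6:
        return 10    # steady 10 bits/s up to t=6
    if t <= 8:
        return 0     # pause from t=6 to t=8
    if t == 9:
        return 45    # spike during t=8..9
    return 10        # stream resumes after t=9

def simulate(capacity, period, end=16):
    sent, queue, rows = 0, 0, []
    for t in range(1, end + 1):
        queue += arrivals(t)
        if t % period == 0:               # this connection only transfers on its own ticks
            moved = min(capacity, queue)
            sent += moved
            queue -= moved
        rows.append((t, sent, queue))
    return rows

a = simulate(capacity=15, period=1)       # Connection A
b = simulate(capacity=60, period=4)       # Connection B
for (t, sa, qa), (_, sb, qb) in zip(a, b):
    print(f"t={t:2d}  A={sa:3d} bits ({qa:2d} queued)   B={sb:3d} bits ({qb:2d} queued)")
[/code]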

Only if you read or write large chunks can you use the cycles a bit more efficiently on DRAM. And yes, the latencies are much lower on the eSRAM, which means the bandwidth can be used more effectively (small reads/writes).

Lower latencies do not lead to smaller reads/writes.

The higher clock rate makes the latency of GDDR5 and DDR3 almost equal (latency, not bandwidth), but the eSRAM still has much better latency than DDR3. Latency only matters where many small operations are done. GDDR5 is good for big things (like textures) that are not accessed frequently. And if you often switch between reads and writes you lose even more cycles, which again means a loss of bandwidth.
And because MS used DDR3 with lower bandwidth than the GDDR5 in the PS4, they can't afford to lose any bandwidth on those ineffective cycles. So they use the eSRAM to compensate: many of the small operations can be done in the small eSRAM, where they don't harm the bandwidth.
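(Just to illustrate the read/write switching point with a toy model; the penalty numbers below are invented, only the shape of the effect matters.)

[code]
# Toy illustration of read/write turnaround cost on DRAM.
# BURST and TURNAROUND are invented numbers, not real timings.

BURST = 4        # cycles per read or write burst -- assumed
TURNAROUND = 8   # extra cycles whenever the bus switches direction -- assumed

def efficiency(pattern):
    """pattern is a string of 'R'/'W' accesses; returns achieved/peak bandwidth."""
    cycles, prev = 0, None
    for op in pattern:
        cycles += BURST + (TURNAROUND if prev and op != prev else 0)
        prev = op
    return len(pattern) * BURST / cycles

print("long same-direction runs:", f"{efficiency('R' * 64 + 'W' * 64):.0%}")
print("strictly alternating R/W:", f"{efficiency('RW' * 64):.0%}")
[/code]

With those made-up numbers, long same-direction runs stay near peak while strict alternation drops to roughly a third.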
[strike]
The issue here is not latency; it's throughput.
People are forgetting that even though the latency of the eSRAM is low, it doesn't help that its cycle time is ~1700 nanoseconds.
So what if your latency is 1 nanosecond if you can't do anything with it until 1700 nanoseconds later?

To put in context
GDDR5 timings as provided by Hynix datasheet: CAS = 10.6ns tRCD = 12ns tRP = 12ns tRAS = 28 ns tRC = 40ns

5500Mhz cycle time is 181 nanoseconds. Even if you add all the latencies above together, you won't go over the cycle time.

eSRAM with its 1700 nanosecond cycle time is NOT better than GDDR5 in dealing with small operations, if anything, it's worse as I have proven above.
[/strike]

However, we all know that latency is not important in GPU processes. Microsoft even went out of their way to say so.

Please think about it for a moment. You have small buffers on the GPU that must be filled with data; that means small reads/writes. The 32MB was only the render target. On the PS4 it might be a little bigger, because you "only" have one memory pool, but the situation is still the same. If the render target is not spread all over the memory you can never reach the eSRAM bandwidth. The worst case would be the render target sitting in just one physical memory module (512MB of memory is one module), so you would be limited to 11GB/s max. The only way to get it faster is to spread it across all the memory modules so you can theoretically reach the max bandwidth, but now you have even smaller chunks, which reduces the effectiveness of DRAM memory tricks, and you actually lose more bandwidth.

I don't say it is the holy grail, but the eSRAM is really, really fast for its size.
And all I said in my last post was that bandwidth has not been the limiting factor in Xbox One development so far.
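For reference, the arithmetic behind the 11GB/s-per-module figure quoted above, assuming 16 GDDR5 devices of 512MB each sharing the 176GB/s bus equally (the exact chip organisation is my assumption here):

[code]
# Back-of-the-envelope numbers behind the "11 GB/s per module" claim.
# Assumes 16 devices of 512 MB each splitting the 176 GB/s bus evenly.

TOTAL_BW_GBPS = 176        # PS4 peak GDDR5 bandwidth
MODULE_SIZE_MB = 512       # capacity per device, per the post above
TOTAL_MEM_MB = 8 * 1024    # 8 GB total

modules = TOTAL_MEM_MB // MODULE_SIZE_MB
per_module_bw = TOTAL_BW_GBPS / modules

print(f"{modules} modules -> about {per_module_bw:.0f} GB/s each "
      "if a buffer sits entirely in one of them")
[/code]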

Why do you insist that the system cannot spread the load across all the memory modules/pins???
You're proposing hypothetical situations that fly in the face of the technical specifications.

You don't have any proof that bandwidth is not the limiting factor.
 
Strange said:
To put in context
GDDR5 timings as provided by Hynix datasheet: CAS = 10.6ns tRCD = 12ns tRP = 12ns tRAS = 28 ns tRC = 40ns

5500Mhz cycle time is 181 nanoseconds. Even if you add all the latencies above together, you won't go over the cycle time.
Not a memory expert, but your numbers seem wrong.

1 sec = 1,000,000,000 ns
5500MHz = 5,500,000,000 Hz

1 sec / 5500MHz gives 0.18ns a cycle; I think you are way off by 1000x, but it's been a long day of coding and my brain is pretty mushy.

Also, that GDDR5 is quad pumped, so in reality it's running at 1375MHz; not sure what the implication is, just throwing it out there.

In general I've always found overly simplified models are not very helpful for understanding the reality of things at work, so I'll leave it at that.
 
Not a memory expert, but your numbers seem wrong.

1 sec = 1,000,000,000 ns
5500MHz = 5,500,000,000 Hz

1 sec / 5500MHz gives 0.18ns a cycle; I think you are way off by 1000x, but it's been a long day of coding and my brain is pretty mushy.

Also, that GDDR5 is quad pumped, so in reality it's running at 1375MHz; not sure what the implication is, just throwing it out there.

In general I've always found overly simplified models are not very helpful for understanding the reality of things at work, so I'll leave it at that.

Yeah, I'm off by 1000x. My bad.

To correct myself.

GDDR5 is quad pumped, running at 1375MHz => 5500MT/s effective.
DDR3 is dual pumped, running at 1066MHz => 2133MT/s effective.

So not only does GDDR5 push twice the data per clock, it also does it in slightly less time per clock.

Meanwhile, the eSRAM still operates at 853MHz.
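Putting the corrected numbers side by side (pump factors as discussed above; I'm treating the eSRAM as one transfer per clock here, which glosses over its read-plus-write-per-cycle behaviour):

[code]
# Cycle-time arithmetic behind the corrected figures above.
# Clocks and pump factors as quoted in the thread; eSRAM treated as
# one transfer per clock, which is a simplification.

configs = {
    "GDDR5 (PS4)": {"clock_mhz": 1375, "pump": 4},  # quad pumped -> 5500 MT/s
    "DDR3 (XB1)":  {"clock_mhz": 1066, "pump": 2},  # dual pumped -> 2133 MT/s
    "eSRAM (XB1)": {"clock_mhz": 853,  "pump": 1},  # runs at the GPU clock
}

for name, c in configs.items():
    cycle_ns = 1e3 / c["clock_mhz"]          # ns per command clock
    transfer_ns = cycle_ns / c["pump"]       # ns per data transfer
    rate_mts = c["clock_mhz"] * c["pump"]    # effective transfer rate
    print(f"{name}: clock period {cycle_ns:.2f} ns, "
          f"{rate_mts} MT/s ({transfer_ns:.3f} ns per transfer)")
[/code]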
 
I don't disagree with your particular example, except that it doesn't really represent what happens in real computing. There are so many factors being left out: say, the dual-porting, the difference between DRAM and SRAM, the interleaving of GPU and CPU accesses, the interleaving of reads and writes on the bus and in the memory, etc.
 
Why do you insist that the system cannot spread the load across all the memory modules/pins???

I answered ironically, 3dilettante answered him directly, yet he does not get it...

My concern was about this:
If you have a task that reads from memory, processes the data, and stores it in a buffer, wouldn't DDR3+eSRAM be at a slight disadvantage, given the slower read speed compared to GDDR5? Even with tiled access, that would still slow it down. The same applies in the opposite direction.
Aren't those quite common cases in the 3D pipeline?
 
I am not sure why one would want to store any graphics buffer in main memory and read it back. It would eat a large portion of the system-wide bandwidth.
 
The logic in some of the above is fundamentally flawed. Max memory request rates are dictated by the client, not by memory timings. As such, there are workloads that can prefer either setup depending on their nature. The same is true of access-granularity comparisons.
 
I am not sure why one would want to store any graphics buffer in main memory and read it back. It would eat a large portion of the system-wide bandwidth.
That's what the RAM and bandwidth are there for. ;) Honestly, what else will you do with it? Stream 60 GB/s of audio data? Process 60 GB/s of AI entities and physics objects?
 
OK, I worded that (very) poorly. What I meant to ask is: what kinds of buffers would someone store in and read from main memory that would be of benefit, versus making sure you fit whatever you need into embedded memory?
 
Had eDRAM been ready at the 28nm node at GloFo or TSMC during the XO development timeline, and given that you can fit roughly 3x as much eDRAM in the same space as 8T SRAM, do you think MS would have gone with a full 90MB+ (same real-estate allocation) and maintained the same chip size, or allocated more towards CUs? Or shrunk the die and saved on chip costs? Or something in the middle?

Maybe 2-4 more CUs and a 50-60MB eDRAM cache, giving developers a much heftier scratchpad for buffers and other high-bandwidth assets.

[image: XB1SOC-2.jpg]


What ifs are so much fun.
 
Had eDRAM been ready at the 28nm node at GloFo or TSMC during the XO development timeline, and given that you can fit roughly 3x as much eDRAM in the same space as 8T SRAM, do you think MS would have gone with a full 90MB+ (same real-estate allocation) and maintained the same chip size, or allocated more towards CUs? Or shrunk the die and saved on chip costs? Or something in the middle?

Maybe 2-4 more CUs and a 50-60MB eDRAM cache, giving developers a much heftier scratchpad for buffers and other high-bandwidth assets.



What ifs are so much fun.

They'd probably just shrink it. The same die area doesn't cost the same under different nodes, and adding another 50MB of eSRAM isn't something you can do without a redesign plus a lot more development time.
 