Xbox One (Durango) Technical hardware investigation

That would effectively double the bandwidth. That is not what is being described here, especially the "133GB/s during FP16x4 blend".

If it's forwarding from the read/write queues, that could be a risky PR gambit.
It would be an artifact of Durango's GPU read path being wider than its write path, and of its ability to send at least two different read requests, since it was already shown to be able to read from both eSRAM and DRAM.

If AMD stuck the eSRAM behind some modified version of its per-channel graphics memory controllers, there could be something similar baked in to the other memory controller logic for DDR3, or someone else's GDDR5.
If the latter is true, why not do the same math?
 
If the physical interface is 128 bytes wide at 800 MHz, that would be the bandwidth the ports can provide.
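As a quick sanity check (my own arithmetic, assuming the 128-byte, 800 MHz figures being discussed rather than any confirmed spec):

# Peak bandwidth implied by the assumed interface width and clock.
bytes_per_cycle = 128            # assumed interface width
clock_hz = 800e6                 # assumed eSRAM/GPU clock
one_way = bytes_per_cycle * clock_hz / 1e9
print(one_way)        # 102.4 GB/s in one direction
print(2 * one_way)    # 204.8 GB/s if a read and a write overlapped every cycle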

The blending scenario suggests the queuing logic has some kind of forwarding or coalescing capability. Within the length of the read/write queue, detecting a read to a location with a pending write could allow the read to source from the queue and allow the next operation to issue. A small amount of read buffering could also provide a secondary read location within a small window.
Maybe it can also combine contiguous writes, but that might be an unnecessary complication, since no client can send enough write data to tell the difference.
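A minimal sketch of the kind of forwarding I mean, in Python; this is purely illustrative and not a claim about Durango's actual queue logic. The point is that a read hitting an address with a pending write can be satisfied from the queue entry, leaving the array's own read path free for another request.

# Toy store-queue forwarding model (hypothetical, for illustration only).
from collections import OrderedDict

class WriteQueue:
    def __init__(self, array, depth=8):
        self.array = array             # stand-in for the eSRAM array itself
        self.depth = depth
        self.pending = OrderedDict()   # address -> data awaiting commit

    def write(self, addr, data):
        if len(self.pending) >= self.depth:
            self.drain_one()
        self.pending[addr] = data

    def read(self, addr):
        # Forwarding case: the read is serviced from the queue, so the
        # array's read path stays free for another request this cycle.
        if addr in self.pending:
            return self.pending[addr], "forwarded from queue"
        return self.array.get(addr), "read from array"

    def drain_one(self):
        # Oldest pending write finally files itself to the array.
        addr, data = self.pending.popitem(last=False)
        self.array[addr] = data

esram = {}
q = WriteQueue(esram)
q.write(0x100, "pixel")
print(q.read(0x100))   # ('pixel', 'forwarded from queue')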

The GPU's read capability is wider than its write port, so reads forwarded or coalesced in tandem with another access would actually have additional data paths that can carry the data.
This brings back my earlier questions as to where in the memory subsystem the eSRAM hooks in.
What do you mean in the bolded part? Sorry, my vision is a bit blurry, but I don't understand whether you mean the ESRAM is only connected to the GPU, or whether it's physically in another part of the die from where it was expected to be?

They haven't increased performance. They have changed how they calculate things.

Seriously, they only now figured out it can do read+write at the same time?
I knew the ESRAM was flexible/accessible enough with regard to reads/writes by the GPU.

But I thought there was no existing hardware that could write and read at the same time.
 
At the end of the day, if Mark Cerny did the math and worked out that 1TB/s of bandwidth from eDRAM wouldn't offset the advantages of a pure GDDR5 solution, then what is the extra 88% worth, really? Other than trying desperately to conjure some magic goodwill juju.
 
What do you mean in the bolded part? Sorry, my vision is a bit blurry, but I don't understand whether you mean the ESRAM is only connected to the GPU, or whether it's physically in another part of the die from where it was expected to be?
Even within the cache/memory subsystem of the GPU, there would be different parts of the pipeline it could fit into, like whether it's sitting behind the spot where there would otherwise be a standard graphics memory controller or if it's sitting in front of the control logic.

There should be more coalescing behavior if the eSRAM were sitting behind a modified memory controller, but the latency would be different and probably measurably longer than if it could sit in front and hook into/alongside the cache crossbar.
 
At the end of the day, if Mark Cerny did the math and worked out that 1TB/s of bandwidth from eDRAM wouldn't offset the advantages of a pure GDDR5 solution, then what is the extra 88% worth, really? Other than trying desperately to conjure some magic goodwill juju.

You completely dismiss the possibility that Cerny in fact isn't an omnipotent, godlike person and might actually have made a wrong choice in terms of pure performance and other factors like cost, etc.?
 
DF has more about XBO here.

Very interesting new information.

Now that close-to-final silicon is available,

I take it there are a couple more revisions to go? If so, they really are cutting it close to launch, unless they plan to have a limited number of consoles shipped in order to extend beta time; in which case that would kind of explain why we've seen dubious PC devkits still being used nearing launch.



Microsoft has revised its own figures upwards significantly, telling developers that 192GB/s is now theoretically possible.

They might have just superseded that number or rounded it off because it's still a guesstimate.


Apparently there is no suggestion of any downclock. It's still 800MHz for the people making X1 games.

Well, if you put it that way, according to some of their titles even an 800 MHz clock still sounds a bit fishy. I did say a while back about Quantum Break (on another thread) that that particular title is either the best-looking optimized game around, or the worst-kept secret for an increase in specs, if the visuals hold true. Only time will tell. :smile2:
 
It's possible that the physical interface is more than 128 bytes. It's possible that their initial calculation puts the interface at 128 bytes minimum, i.e., 128 bytes is a worst-case scenario with room for error or something. And now that they are getting actual chips, they are hitting better-than-expected results.

Mind you, I would assume that this would have probably happened before E3.
 
Strange. I would have thought the hardware design would be entirely complete by now. Maybe that's why they haven't given out any hard specs yet with regard to bandwidth numbers.
 
You completely dismiss the possibility that Cerny in fact isn't an omnipotent, godlike person and might actually have made a wrong choice in terms of pure performance and other factors like cost, etc.?
Also this is not really the eDRAM solution Cerny talked about.
 
It sounds like the news about the ESRAM is being misinterpreted in the Eurogamer article. If MS is claiming simultaneous read/write operations in certain instances yielding a theoretical total bandwidth of 192 GB/s, that means unidirectional bandwidth would be 96 GB/s (i.e., 192 / 2 = 96). This would be a downgrade from 102 GB/s unidirectional (i.e., specifically a 50 MHz downgrade from GPU/ESRAM originally clocked at 800 MHz down to 750 MHz).
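For what it's worth, here is the back-calculation behind that reading of the figure, assuming the same 128-byte interface discussed earlier in the thread:

# If 192 GB/s were really the sum of both directions (96 GB/s each way),
# a 128-byte interface would back-solve to a 750 MHz clock. That is the
# inferred downgrade, not a confirmed spec.
total = 192.0
each_way = total / 2                      # 96.0 GB/s
clock_mhz = each_way * 1e9 / 128 / 1e6    # 750.0 MHz
print(each_way, clock_mhz)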
 
I suspect, considering they say a theoretical 192GB/s is possible but have given an alpha blend example that reaches 133GB/s, that it is some kind of function or set of functions, or a way of ordering them, that they hadn't investigated properly, which allows a read and a write to happen at the same time in the ESRAM exclusively. A blend operation seems close to an ideal-case scenario, so it remains to be seen how much more than the 33GB/s improvement is possible, but it may well be that they are aware of this, and therefore indicated 68GB/s + 133GB/s = 201GB/s, which fits the over-200GB/s bandwidth figure that had been given earlier.

I am buying it.
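Spelling out the sum being proposed there (the 68 GB/s figure is the DDR3 main-memory peak cited in the post; treat both inputs as the thread's numbers, not mine):

# DDR3 main memory plus the eSRAM alpha-blend example.
ddr3 = 68.0           # GB/s, main memory peak as cited above
esram_blend = 133.0   # GB/s, the FP16 x4 blend figure
print(ddr3 + esram_blend)   # 201.0, roughly the "over 200GB/s" figure mentioned earlier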
 
It sounds like the news about the ESRAM is being misinterpreted in the Eurogamer article. If MS is claiming simultaneous read/write operations in certain instances yielding a theoretical total bandwidth of 192 GB/s, that means unidirectional bandwidth would be 96 GB/s (i.e., 192 / 2 = 96). This would be a downgrade from 102 GB/s unidirectional (i.e., specifically a 50 MHz downgrade from GPU/ESRAM originally clocked at 800 MHz down to 750 MHz).

Not Eurogamer; their sources (Xbox One devs) said there has been no word of any downgrade.
 
Some random math I thought up for the scenario put forward in the article.

Let's assume there's some kind of forwarding going on from the queues in the eSRAM's arbitration and scheduling logic.

How to get from a supposed max of 102.4 GB/s of the interface, but also take into consideration that a pure doubling is 204.8 GB/s?

If it's forwarding:
Assume there's some portion of the time that has to be a pure write to eSRAM, limited by the 102.4 GB/s write path maximum.
Assume that the memory subsystem can handle additional requests because it can split its attention between eSRAM and DRAM, and assume the queues can actually use that spare capacity for the eSRAM.

The interface=102.4 GB/s
Double that, if the so-called simultaneous read+write were true = 204.8 GB/s.
The theoretical maximum in the article = 192 GB/s.

How to get this?
Assuming forwarding.
Some fraction of the cycles is write cycles to the queue and limited by the GPU's write bandwidth.
The other fraction is full traffic with additional read requests serviced by forwarding, leading to an internal doubling.

(204.8*x+102.4)/(x+1)=192
204.8*x+102.4=192*x+192
204.8*x=192*x+89.6
204.8*x-192*x=89.6
(204.8-192)*x=89.6

12.8*x=89.6

x = 7

It's early for me to think about this, but it sounds like 1/8 of the time you can set up the data you forward, and then you can send reads to the eSRAM proper while another portion is automatically forwarded. This leads to double the bandwidth in those cycles, although it would require that more than one client be reading at a time, since no single requestor can take all that bandwidth. (edit: Or half read, half write if they can combine.)

At some point, the write has to file itself to the eSRAM, and during that phase the GPU can queue up the next round of forwarded data (every 8th cycle?). Other scenarios are possible, but that would be the big-number shot.
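A tiny sanity check of that duty cycle in Python (my own numbers, assuming one write-limited cycle in eight and doubled traffic in the other seven), just to show it averages out to the 192 GB/s figure:

# Hypothetical duty cycle: 1 cycle in 8 is a pure write-limited cycle,
# the other 7 see doubled traffic thanks to forwarding.
pure_write = 102.4   # GB/s, limited by the write path
doubled = 204.8      # GB/s, read + write overlapped
cycles = [pure_write] + [doubled] * 7
print(sum(cycles) / len(cycles))   # 192.0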
 
I suspect, considering they say a theoretical 192GB/s is possible but have given an alpha blend example that reaches 133GB/s, that it is some kind of function or set of functions, or a way of ordering them, that they hadn't investigated properly, which allows a read and a write to happen at the same time in the ESRAM exclusively. A blend operation seems close to an ideal-case scenario, so it remains to be seen how much more than the 33GB/s improvement is possible, but it may well be that they are aware of this, and therefore indicated 68GB/s + 133GB/s = 201GB/s, which fits the over-200GB/s bandwidth figure that had been given earlier.

I am buying it.

Hmmm, you might just be right. The eDRAM on the 360, for example, only hits 256GB/s when it's performing several functions at 4xMSAA; as such, it operates at a lower bandwidth when it's not doing 4xMSAA.

So it seems that when the eSRAM is used for framebuffer ops it can hit 192GB/s, with the 133GB/s example being what it can hit if it's doing "alpha transparency blending operations (FP16 x4)". 102GB/s might be the sustainable throughput when it's doing normal operations.
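One way to read the blend number (my arithmetic, not anything from the article): an FP16x4 target is 8 bytes per pixel, and a blend both reads the destination and writes the result, so it's close to a 1:1 read/write mix over the eSRAM.

# FP16 x4 render target: 4 channels x 2 bytes = 8 bytes read + 8 bytes written per pixel.
bytes_per_pixel = 4 * 2
measured = 133.0                  # GB/s, the blend example
doubled_peak = 204.8              # GB/s, if read and write fully overlapped
print(measured / doubled_peak)            # ~0.65 of the ideal overlap
print(measured / (2 * bytes_per_pixel))   # ~8.3 Gpixels/s of blending at that rate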
 
Perhaps this is nothing more than carefully laying out data in the SRAM to avoid bank collisions.

In theory, a memory with one read port and one write port could achieve double the bandwidth if you're reading and writing to different banks all the time. Maybe their test only achieves 133GB/s because it's impossible to avoid all conflicts in a real-world benchmark.
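A rough illustration of that idea in Python (hypothetical bank count and a random access pattern, just to show the mechanism): a read and a write can proceed in the same cycle only when they hit different banks, so real traffic lands somewhere between 1x and 2x the single-port rate.

# Toy model: one read and one write per cycle, blocked when both hit the same bank.
import random

BANKS = 8                      # hypothetical bank count
CYCLES = 100_000
random.seed(0)

serviced = 0
for _ in range(CYCLES):
    read_bank = random.randrange(BANKS)
    write_bank = random.randrange(BANKS)
    if read_bank != write_bank:
        serviced += 2          # both accesses complete this cycle
    else:
        serviced += 1          # conflict: only one gets through
print(serviced / CYCLES)       # ~1.87x the single-port rate with 8 banks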
 
Some random math I thought up for the scenario put forward in the article.

Let's assume there's some kind of forwarding going on from the queues in the eSRAM's arbitration and scheduling logic.

How to get from a supposed max of 102.4 GB/s of the interface, but also take into consideration that a pure doubling is 204.8 GB/s?

If it's forwarding:
Assume there's some portion of the time that has to be a pure write to eSRAM, limited by the 102.4 GB/s write path maximum.
Assume that the memory subsystem can handle additional requests because it can split its attention between eSRAM and DRAM, and assume the queues can actually use that spare capacity for the eSRAM.

The interface=102.4 GB/s
Double that, if the so-called simultaneous read+write were true = 204.8 GB/s.
The theoretical maximum in the article = 192 GB/s.

How to get this?
Assuming forwarding.
Some fraction of the cycles is write cycles to the queue and limited by the GPU's write bandwidth.
The other fraction is full traffic with additional read requests serviced by forwarding, leading to an internal doubling.

(204.8*x+102.4)/(x+1)=192
204.8*x+102.4=192*x+192
204.8*x=192*x+89.6
204.8*x-192*x=89.6
(204.8-192)*x=89.6

12.8*x=89.6

x = 7

It's early for me to think about this, but it sounds like 1/8 of the time you can set up the data you forward, and then you can send reads to the eSRAM proper while another portion is automatically forwarded. This leads to double the bandwidth in those cycles, although it would require that more than one client be reading at a time, since no single requestor can take all that bandwidth. (edit: Or half read, half write if they can combine.)

At some point, the write has to file itself to the eSRAM, and during that phase the GPU can queue up the next round of forwarded data (every 8th cycle?). Other scenarios are possible, but that would be the big-number shot.
What can I say... that's fascinating stuff.

Perhaps your theory is the correct answer, because the model of data management that the ESRAM gives programmers means their memory turnaround is on the order of a few tens of cycles (or just ten), which opens up more algorithms and operations.
 
Maybe the missing 12% goes to the OS?
I'm not a hardware guy, but I do remember something about the OS using 10% of the GPU cycles.
So maybe they told devs you can only have 192GB/s.
 