esram astrophysics *spin-off*

check out the wording Boyd Multerer uses:
[Attached image: xbox_sram_cache2.jpg]


Regarding cache latency.

" Having the right data in the right caches at all the right time makes all the difference in the world [for this architecture]"
This is what you can find in the VGLeaks documents regarding the latency.

http://www.vgleaks.com/durango-gpu-2/2/

ESRAM

Durango has no video memory (VRAM) in the traditional sense, but the GPU does contain 32 MB of fast embedded SRAM (ESRAM). ESRAM on Durango is free from many of the restrictions that affect EDRAM on Xbox 360.

Durango supports the following scenarios:

•Texturing from ESRAM
•Rendering to surfaces in main RAM
•Read back from render targets without performing a resolve (in certain cases)


The difference in throughput between ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec.

The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU, I/O, and display output. Low latency is particularly important for sustaining peak performance of the color blocks (CBs) and depth blocks (DBs).
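As a rough check on where those two throughput numbers come from (a sketch only; the 800 MHz clock, 128-byte-per-clock ESRAM path and 256-bit DDR3-2133 interface are assumptions taken from the same leak, not from the text above):

```python
# Back-of-the-envelope check of the two quoted bandwidth figures (assumed specs).
gpu_clock_hz   = 800e6          # assumed GPU/ESRAM clock
esram_bytes_ck = 128            # assumed ESRAM port width per clock
ddr3_rate_mts  = 2133e6         # assumed DDR3-2133 transfer rate
ddr3_bus_bytes = 256 // 8       # assumed 256-bit main-memory bus

esram_bw = gpu_clock_hz * esram_bytes_ck / 1e9      # ~102.4 GB/s
dram_bw  = ddr3_rate_mts * ddr3_bus_bytes / 1e9     # ~68.3 GB/s
print(f"ESRAM: {esram_bw:.1f} GB/s, DDR3: {dram_bw:.1f} GB/s")
```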
Are colour blocks and depth blocks important for framerate, and does that mean my theory is right? :smile2:
 
I couldn't find anything on GCN, nor the consoles. But there are latency figures mentioned for the HD6000 here.

http://www.sisoftware.co.uk/?d=qa&f=gpu_mem_latency&l=en&a=b
Those measurements are dubious at best (I would say some blatantly contradict other available information) and wouldn't apply to GCN based GPUs anyway (GCN got an almost complete redesign of the cache structure [everything besides the ROP caches]).
I have been reading that article and I learnt some new things.

Fellow forumers have been trying to find out the differences between AMD's GPUs on the PC and their console counterparts.

And according to this article, the Xbox One differs from the standard GCN architecture. Whether the differences are good or bad news, you can read about them here:


http://www.extremetech.com/gaming/147577-xbox-720-gpu-detailed-merely-a-last-gen-radeon
That 4-way/64-way L1 cache thing is a misinterpretation (confusion between the number of ways and the number of sets). All GCN-based GPUs have a 64-way set-associative L1 cache (older GPUs, the HD4000 through HD6000 series, even had a fully associative L1 [meaning 128-way; the whole L1 cache consisted of only 128 64-byte cachelines]). There are 256 cachelines in total, meaning there are 4 sets (this is why it was mixed up; AMD mentioned both numbers in presentations).
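A quick sanity check of those L1 figures (a minimal sketch; the 16 KB size per CU is implied by 256 lines of 64 bytes rather than stated above):

```python
# Cache geometry sanity check for the GCN vector L1, using the figures from the post above.
line_size_bytes = 64      # cacheline size
total_lines     = 256     # lines in the L1 per CU
ways            = 64      # associativity (64-way set-associative)

cache_size = line_size_bytes * total_lines      # 16384 bytes = 16 KB per CU
sets       = total_lines // ways                # 256 / 64 = 4 sets

print(f"L1 size: {cache_size // 1024} KB, {sets} sets x {ways} ways")
# -> L1 size: 16 KB, 4 sets x 64 ways  (hence the 4 vs. 64 mix-up)
```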
And regarding the relative size of the L2, Tahiti is the wrong chip to compare to. CapeVerde (smaller) as well as Pitcairn (larger, or of equal size, respectively) and very probably also Bonaire all have 512kB of L2, the same as Orbis and Durango.
This is what you can find in the VGLeaks documents regarding the latency.

http://www.vgleaks.com/durango-gpu-2/2/

Are colour blocks and depth blocks important for framerate, and does that mean my theory is right? :smile2:
Empirical evidence from synthetic fillrate tests on discrete GPUs doesn't appear to confirm this. There may be some effect, but apparently nothing major.
Btw., color and depth blocks are the two constituents of the RBEs (a.k.a. ROPs).
 
Empirical evidence from synthetic fillrate tests on discrete GPUs doesn't appear to confirm this. There may be some effect, but apparently nothing major.
Btw., color and depth blocks are the two constituents of the RBEs (a.k.a. ROPs).

So, essentially, more evidence for what several here (including, iirc, 3dilattante) have suggested: that the primary benefit is likely with ROP performance?
 
There is almost nothing the CPU would be doing that would require any sustained use of 30GB/s bandwidth. It's not like the CPU is generating giant fractals or something. These CPUs have relatively large caches.
Behaviours that would result in large bandwidth use, like a recursive read-modify-write on a very large and changing dataset, are really the domain of the GPU. Not even the AI or physics loop would be that much of a drain on bandwidth until you get to many, many thousands of objects being updated, and even then you could optimize that by utilizing the caches and loading objects that could affect each other in groups.
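To put some very rough numbers on that (purely illustrative; the object count and per-object size are my own assumptions, not anything from the post above):

```python
# Illustrative gameplay-update bandwidth estimate (all figures assumed).
objects_per_frame = 10_000      # assumed number of AI/physics objects touched per frame
bytes_per_object  = 256         # assumed state read and written per object
frames_per_sec    = 60

bw_gb_s = objects_per_frame * bytes_per_object * 2 * frames_per_sec / 1e9
#                                 read + write --^
print(f"~{bw_gb_s:.2f} GB/s")   # ~0.31 GB/s, a small fraction of 20-30 GB/s
```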

So... you are essentially saying that the (one of the only remaining?) Xbox One advantage(s) of having 30GB/s CPU bandwidth when compared to the 20GB/s competitor..
is meaningless :?:
And here I was, thinking Xbox One was all about balance. :cry:

If they cap the CPU bandwidth access (in software) they can probably limit the amount of framestalls when the system is under heavy load. I'd have to check with some developers, but most of them are of the complaining type anyway, and they have already been very vocal about ESRAM... "performance", so I doubt they have anything nice to say about the overall GPU bandwidth when the system is under heavy load.
I'll update you regardless!
 
Are colour blocks and depth blocks important for framerate, and does that mean my theory is right? :smile2:
All this stuff has been previously discussed, so I won't go into detail. Are there jobs low latency can help with? Yes. Do we know how much lower latency the ESRAM has and what impact that has on those jobs? No. We just have vague statements that "lower latency == better." Therefore attributing any degree of improvement to the latency property is pure guesswork, which is what you're doing. You like the sound of it, so you're running with that theory. Read up on all the previous discussion and you'll see it's not at all clear cut, which is evidence aplenty that the chance of massive improvements is minimal. I'll leave you with the response from the XB1 architects when asked directly about the benefits of lower latency in the DF interview:

Digital Foundry: There's been some discussion online about low-latency memory access on ESRAM. My understanding of graphics technology is that you forego latency and you go wide, you parallelise over however many compute units are available. Does low latency here materially affect GPU performance?


Nick Baker: You're right. GPUs are less latency sensitive. We've not really made any statements about latency.
That's all they had to say, and not, "By optimising for low latency, devs will see significant gains in [these areas of rendering] and substantial boosts in framerate."
 
So... you are essentially saying that the (one of the only remaining?) Xbox One advantage(s) of having 30GB/s CPU bandwidth when compared to the 20GB/s competitor..
is meaningless :?:
And here I was, thinking Xbox One was all about balance. :cry:

If they cap the CPU bandwidth access (in software) they can probably limit the amount of framestalls when the system is under heavy load. I'd have to check with some developers, but most of them are of the complaining type anyway, and they have already been very vocal about ESRAM... "performance", so I doubt they have anything nice to say about the overall GPU bandwidth when the system is under heavy load.
I'll update you regardless!
Having the _ability_ to pull data at 30GB/s means that when you do need a short burst of a lot of data, the One can get that data into the CPU caches 50% faster than the PS4. It's still meaningful, it's just not what I would call a competitive advantage. Sony engineers are not dumb, they wouldn't have limited the PS4 to 20GB/s CPU bandwidth if a CPU required a lot more than that on a normal basis.
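For a sense of scale (a rough sketch; the 4 MB burst size assumes the combined L2 of the two Jaguar clusters, which isn't stated in the post above):

```python
# Time to stream a burst of data into the CPU caches at the two quoted rates.
burst_bytes = 4 * 1024**2            # assumed: fill 4 MB (total Jaguar L2)

for name, bw_gb_s in [("30 GB/s", 30), ("20 GB/s", 20)]:
    t_us = burst_bytes / (bw_gb_s * 1e9) * 1e6
    print(f"{name}: ~{t_us:.0f} microseconds")
# -> ~140 us vs. ~210 us; a real but small difference against a 16.7 ms frame.
```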
 
If they cap the CPU bandwidth access (in software) they can probably limit the amount of framestalls when the system is under heavy load. I'd have to check with some developers, but most of them are of the complaining type anyway, and they have already been very vocal about ESRAM... "performance", so I doubt they have anything nice to say about the overall GPU bandwidth when the system is under heavy load.
I'll update you regardless!

Why the hell would you cap the "CPU bandwidth access" "in software" (assuming that's even possible)? It's not going to change the total amount of bandwidth used to perform the operations in question.

And why would you cap the bandwidth to the *more* latency sensitive CPU?

And how is crippling the CPU going to "limit the amount of framestalls"?

Which developers have been feeding you this shit?

Also, why do you even?!?
 
Those measurements are dubious at best (I would say some blatantly contradict other available information) and wouldn't apply to GCN based GPUs anyway (GCN got an almost complete redesign of the cache structure [everything besides the ROP caches]).
That 4-way/64-way L1 cache thing is a misinterpretation (confusion between the number of ways and the number of sets). All GCN-based GPUs have a 64-way set-associative L1 cache (older GPUs, the HD4000 through HD6000 series, even had a fully associative L1 [meaning 128-way; the whole L1 cache consisted of only 128 64-byte cachelines]). There are 256 cachelines in total, meaning there are 4 sets (this is why it was mixed up; AMD mentioned both numbers in presentations).
And regarding the relative size of the L2, Tahiti is the wrong chip to compare to. CapeVerde (smaller) as well as Pitcairn (larger, or of equal size, respectively) and very probably also Bonaire all have 512kB of L2, the same as Orbis and Durango.
Empirical evidence from synthetic fillrate tests on discrete GPUs doesn't appear to confirm this. There may be some effect, but apparently nothing major.
Btw., color and depth blocks are the two constituents of the RBEs (a.k.a. ROPs).
So it is the same GCN in the end... I hope I got this right because I am terrible at those numbers and I don't fully understand what certain concepts mean.

Regarding the bolded part....

Are you talking about the framerate or the low latency, Gipsel? I don't know what the colour blocks and the depth blocks are and what kind of operations they perform. What are they used for?

Having the _ability_ to pull data at 30GB/s means that when you do need a short burst of a lot of data, the One can get that data into the CPU caches 50% faster than the PS4. It's still meaningful, it's just not what I would call a competitive advantage. Sony engineers are not dumb, they wouldn't have limited the PS4 to 20GB/s CPU bandwidth if a CPU required a lot more than that on a normal basis.
Could this impact framerates more than low latency? I wonder.... Afaik, one of the main limitations to achieving good framerates is usually the CPU.

You are talking about data into the CPU caches, not the main bandwidth, right? I mean that 30GB/s is 50% more than 20GB/s, but in both cases that's the bandwidth between the CPU and the main memory, not the CPU internal caches.

Confused....
 
So it is the same GCN in the end... I hope I got this right because I am terrible at those numbers and I don't fully understand what certain concepts mean.
Yes, at its core it's GCN, plain and simple.
Regarding the bolded part....

Are you talking about the framerate or the low latency, Gipsel? I don't know what the colour blocks and the depth blocks are and what kind of operations they perform. What are they used for?
That was an answer to the talk from MS that the low-latency eSRAM is helping the ROPs (comprised of color and depth blocks) to achieve their maximum throughput. While there may be a positive effect in some scenarios, discrete GPUs without eSRAM routinely achieve their maximum throughput in fillrate tests while dealing with the higher latency (how much higher compared to eSRAM is unknown) of GDDR5. The ROPs include small specialized caches which allow them to tolerate the RAM latency for the common operations. That means it isn't clear at all how much the eSRAM is able to help. I mean, AMD kept the size of the ROP tile caches constant for quite some time. I guess one gets only very limited returns from a size increase, so they never did it. But while the size was kept unchanged, the GCN1.1 generation (Bonaire, Hawaii and obviously also Durango and Orbis) appears to have a somewhat modified caching policy or some other internal changes that help the efficiency for 4xFP16 render targets, completely without eSRAM. And as Shifty has written above, when asked explicitly, MS basically backed off the statement that the lower latency is crucially important for improving GPU performance (which includes the ROPs).

And to answer the general question of what the ROPs are doing: they are the fixed-function units used to write the results from the pixel shader to the render targets, or to blend them with the values already in there. The color blocks are responsible for the actual render targets, while the Z blocks read/write from/to the Z/stencil buffer and do comparisons between old and new values (and one can do it conditionally, for instance only write to the render target if the new Z value is smaller than the existing one, so the Z and color blocks work together and are not isolated from each other).
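A minimal sketch of that per-pixel flow (illustrative only; the function and buffer names are made up, real ROPs support many depth-compare and blend modes, and everything here happens in fixed-function hardware, not code):

```python
# Illustrative per-pixel ROP behaviour: Z block test first, color block write/blend after.
def rop_output(x, y, src_z, src_color, z_buffer, color_buffer, blend=None):
    if src_z < z_buffer[y][x]:              # Z block: "less" depth test (one of several modes)
        z_buffer[y][x] = src_z              # depth write on pass
        dst = color_buffer[y][x]
        # Color block: either overwrite or blend with the existing render-target value.
        color_buffer[y][x] = blend(src_color, dst) if blend else src_color
```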
 
So... why not use eDRAM then, when the eSRAM latency advantage might not even make a real difference? I mean, we've had the talk that it was unlikely for them to even go that route, because of size concerns and all that. They could've put in a lot of RAM this way, taken the latency penalty and been done with it... or they could've raised the CU count by some amount. Or just made a cheaper console...
 
Microsoft already directly answered that one. It was something along the lines of: they went with what was available from the suppliers they were using. I.e. eDRAM simply wasn't an option. I'd dig out the quote but I'm on the phone.
 
So... why not use eDRAM then, when the eSRAM latency advantage might not even make a real difference? I mean, we've had the talk that it was unlikely for them to even go that route, because of size concerns and all that. They could've put in a lot of RAM this way, taken the latency penalty and been done with it... or they could've raised the CU count by some amount. Or just made a cheaper console...
At that capacity, eDRAM could have been somewhat faster.


I speculated earlier that it came down to design targets and manufacturing options. What fab options there were presently for eDRAM at the desired node, and uncertainty about who would have it at a later advanced node.


It was touched on in the DF interview that it came down to who had the technology for a single-die solution with eDRAM, and the time frame.
Someone has to build a big chip with the eDRAM, and the design goals and constraints ruled out the fabs that could.
 
It was touched on in the DF interview that it came down to who had the technology for a single-die solution with eDRAM, and the time frame.
Someone has to build a big chip with the eDRAM, and the design goals and constraints ruled out the fabs that could.
They get to enjoy the same benefits when they decide to move to a smaller node, too. No need to worry about DRAM-friendly processes in the future.

SRAM is the gift that keeps on giving.
 
Just a pity that MS didn't push the boat out and go for a slightly bigger gift ...

Will be interesting to see how much bulk another 16 MB of SRAM would have added to the die.
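One crude way to ballpark the extra bulk (a sketch under my own assumptions: standard 6T SRAM cells and the same cell density as the existing array, ignoring redundancy, tags and interface logic):

```python
# Rough transistor-count estimate for extra ESRAM capacity (6T cells assumed).
def sram_transistors(megabytes, transistors_per_bit=6):
    return megabytes * 1024**2 * 8 * transistors_per_bit

print(f"32 MB: ~{sram_transistors(32)/1e9:.1f} B transistors")   # ~1.6 billion
print(f"+16 MB: ~{sram_transistors(16)/1e9:.1f} B more")          # ~0.8 billion,
# i.e. roughly half again the existing ESRAM array, before any overheads.
```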
 
Just a pity that MS didn't push the boat out and go for a slightly bigger gift ...

Will be interesting to see how much bulk another 16 MB of SRAM would have added to the die.


Well, it's been the same with the current gen since deferred engines; they had to cut corners to fit 10 MB.

Are there any ties between ESRAM and CU count preventing, say, adding two additional 8 MB banks?

I think, given that Orbis is in the same die-size ballpark (~340 mm², versus ~360 mm² for Durango), they settled on that size from a yield POV?


I hope someday we get intel on the profiling and tests that MS did to choose the Durango config.
 
You are talking about data into the CPU caches, not the main bandwidth, right? I mean that 30GB/s is 50% more than 20GB/s, but in both cases that's the bandwidth between the CPU and the main memory, not the CPU internal caches.

Confused....

I am confused too. Where did the PS4's 20 GB/s of CPU bandwidth to system memory originate? I am confused because I would expect that figure to be more in the 80-100 GB/s range. Why would AMD find the need to cap bandwidth from memory to the CPU?
 
I am confused too. Where did the PS4's 20 GB/s of CPU bandwidth to system memory originate? I am confused because I would expect that figure to be more in the 80-100 GB/s range. Why would AMD find the need to cap bandwidth from memory to the CPU?
Fast memory busses are big. Big busses require more space, and so does the logic managing them. In general, you want to make a bus large enough to satisfy your bandwidth needs and the requirements of the bus architecture, but no larger.

These CPUs aren't really designed to do the kind of fast crunching on massive data structures that the GPUs are. If one of these puny eight-core Jaguars finds itself needing 80-100GB/s, something has gone horribly wrong.
 
I am confused too. Where did the PS4's 20 GB/s of CPU bandwidth to system memory originate? I am confused because I would expect that figure to be more in the 80-100 GB/s range. Why would AMD find the need to cap bandwidth from memory to the CPU?

If an access is going to memory from a CPU, it is going through the individual connections from the L1 to the L2, to the L2 interface, to the system request queue, and then the memory controller.
Jaguar is a low-power architecture, so there are already things like half-width buses between the caches, and the L2 interface per cluster is only able to transfer a limited amount of data per clock.

The magic of caches is that (usually, hopefully) you don't need to lean on these connections more than 10% (or some other small number) of the time per level. So you only want to go out of the L1 10% of the time, and of the times you do, only 10% of those should miss the L2 and go out over its interface.
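Put as a quick worked example (the 10% per-level miss rates and the access rate are just the illustrative figures from above, not measured values):

```python
# Effective memory traffic with two cache levels filtering accesses (illustrative).
cpu_accesses_per_sec = 1e9      # assumed load/store rate, for illustration only
bytes_per_access     = 64       # one cacheline fetched per miss
l1_miss_rate         = 0.10     # illustrative, per the post
l2_miss_rate         = 0.10     # illustrative, per the post

to_memory = cpu_accesses_per_sec * l1_miss_rate * l2_miss_rate   # ~1% reach DRAM
print(f"~{to_memory * bytes_per_access / 1e9:.2f} GB/s of DRAM traffic")  # ~0.64 GB/s
```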

AMD's architectures past this point have a memory crossbar with fixed width connections, which has itself been a bottleneck in the past.
There is also a request queue that coherent accesses (most CPU accesses) must go on so that they can be kept in order, snoop the other L2, and so that the Onion bus can do its job.
This is all in the uncore and would be part of the northbridge.
This ordered and coherent domain has to manage the traffic between the clients and broadcast coherence information (a lot of back and forth) between them.
AMD probably wants to do this all without too much hardware, power, latency, or engineering effort.

Contrast this with Garlic, which is not coherent and deals with a GPU whose memory model is very weakly ordered, not CPU coherent, and whose memory subsystem has extremely long latency. This is a comparatively simpler connection that goes from the GPU memory subsystem to the memory controllers.
 
Fast memory busses are big. Big busses require more space, and so does the logic managing them. In general, you want to make a bus large enough to satisfy your bandwidth needs and the requirements of the bus architecture, but no larger.

These CPUs aren't really designed to do the kind of fast crunching on massive data structures that the GPUs are. If one of these puny eight-core Jaguars finds itself needing 80-100GB/s, something has gone horribly wrong.

Understood.

Thanks.
 