Predict: The Next Generation Console Tech

Are you sure it's actually STACKED memory though? From what I can decipher peering at that terrible image, it looks more like a traditional multi-chip module, methinks.
 
As for MSAA and a large on-die memory chunk, I don't see why it needs to accommodate the full framebuffer if it is flexible enough. IIRC Xenos has one bus from the GPU to the eDRAM+ROPs, so anything in there has to go back through the GPU to reach system memory.

512MB GDDR3 <=> Xenos <=> eDRAM+ROPs

But what if the memory pool on "Durango-GPU" is really just that: a completely flexible memory pool. It could be used as a cache or a workspace for binned tiles or whatever?

G <=> ESRAM
P |
U <=> System Memory

Would this not make the size relatively irrelevant? Yes, for a game at 1080p with 4xMSAA and a lot of extra/large buffers there are going to be a handful of "tiles", but wouldn't such a configuration avoid the issues with the Xenos eDRAM?
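
Quick back-of-envelope to show what I mean by "a handful" (the buffer formats are just my assumptions):

[code]
# How many tiles does a 1080p 4xMSAA target need in a given on-die pool?
# RGBA8 colour + D24S8 depth per sample are my assumptions.
width, height, msaa = 1920, 1080, 4
colour_bpp, depth_bpp = 4, 4                     # bytes per sample each

bytes_needed = width * height * msaa * (colour_bpp + depth_bpp)
for pool_mb in (10, 20, 32, 50):
    pool = pool_mb * 1024 * 1024
    tiles = -(-bytes_needed // pool)             # ceiling division
    print(f"{pool_mb:>3} MB pool -> {tiles} tile(s)")
print(f"Total uncompressed: {bytes_needed / 2**20:.0f} MB")
[/code]

That's ~63 MB uncompressed, i.e. 7 tiles in a 10 MB pool but only 2 in a 32 MB one.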
 
But what if the memory pool on "Durango-GPU" is really just that: a completely flexible memory pool. It could be used as a cache or a workspace for binned tiles or whatever?
I wouldn't expect anything less! The limitations of Xenos eDRAM should have been learnt from.

Would this not make the size relatively irrelevant? Yes, for a game at 1080p with 4xMSAA and a lot of extra/large buffers there are going to be a handful of "tiles", but wouldn't such a configuration avoid the issues with the Xenos eDRAM?
Tiling has still proven an unwanted limitation in Xenos, and I doubt devs would want to be forced down that route. But if the eDRAM is whatever the devs want to do with it, rather than a forced framebuffer space, and the system has enough BW for the rendering requirements, then it'd just be a bonus. However, one has to assume that the reason to add the cost of large local storage to the GPU is to address a BW limitation in the rest of the system. It must be there to serve BW-heavy tasks. Whether these would have to be FB tasks or not, I don't know. Nor do I know how render pipelines, currently and in future, can be broken up across a large main RAM and a small working eDRAM. We have a whole thread on this discussion somewhere!
 
Shifty, weren't the major issues with tiling (a) the significant cost of re-working geometry across tiles, due to geometry passing over tile edges, and (b) that, as it was tied to the ROPs (and not a general memory), it allowed only a very limited number of applications?

The idea of a pool of fast *general* embedded memory doesn't mean it has to work like Xenos--as you know, the PS2 GS's 4MB of eDRAM was completely different from Xenos! GPUs already work internally in tiles, so I don't see the issue with the concept, only the implementation.

Not all tiling is created equal, so I don't think it is reasonable to say, "Tiling has still proven an unwanted limitation in Xenos, and I doubt devs would want to be forced down that route." Xenos-style tiling would not be favorable, but that is not what the rumor, or what I laid out, proposes by any stretch.
 
Shifty, weren't the major issues with tiling (a) the significant cost of re-working geometry across tiles, due to geometry passing over tile edges, and (b) that, as it was tied to the ROPs (and not a general memory), it allowed only a very limited number of applications?

The idea of a pool of fast *general* embedded memory doesn't mean it has to work like Xenos
Yes, I was agreeing with you on that. I said:
But if the eDRAM is whatever the devs want to do with it, rather than a forced framebuffer space, and the system has enough BW for the rendering requirements, then it'd just be a bonus.
What I wonder is what BW advantages would be worth the real estate cost? The idea of 'free' transparency appeals to me. PS3 was a real backwards step in that regard vs. PS2.

As I say though, there's a whole thread on eDRAM somewhere where I'm sure the various options were discussed and the relative benefits of a given size of eDRAM were noted. Perhaps in there the benefits of 10, 20, 32 etc. MBs were identified? Of course you could always render in tiles, such as rendering a 1080p smoke layer tile by tile using 10 MB of eDRAM/ESRAM, with enough space to fit the tile and the particle graphics. I do wonder if the brute-force method will still be relevant next-gen though. Calculated volumetrics would look better and not need masses of BW.
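
Quick sanity check on that smoke-layer example (RGBA16F is just my assumed format, and I've left a quarter of the pool free for the particle graphics):

[code]
# How tall can a full-width tile of a 1080p RGBA16F layer be in 10 MB,
# keeping ~1/4 of the pool free for particle textures? All assumptions.
width, bpp = 1920, 8                     # RGBA16F = 8 bytes/pixel
pool = 10 * 1024 * 1024
tile_budget = pool * 3 // 4              # leave room for the graphics
rows = tile_budget // (width * bpp)
print(rows, "rows per tile ->", -(-1080 // rows), "tiles for 1080p")
[/code]

So ~512 rows per tile, three tiles per frame: perfectly workable for a single transparent layer.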
 
So, the ESRAM would mean DDR3 is a given?

No, but I think the likelihood of having a very high bandwidth memory like GDDR5 would be very low.

I still think DDR4 will be the memory of choice. Choosing technologies that will be really cheap in the future (maybe not exactly at launch) will let them lower costs and prices more quickly.
 
Yes, I was agreeing with you on that. I said:
What I wonder is what BW advantages would be worth the real estate cost? The idea of 'free' transparency appeals to me. PS3 was a real backwards step in that regard vs. PS2.

Sorry, sleep deprivation. I would venture a guess that there are some neat compute scenarios where having a "large" very fast local memory could be an advantage. I remember a very interesting sparse-sample GI solution from 2005 which was mostly slowed down by memory.
 
I hadn't thought of that. Considering the rate GPUs can churn through data, a fast bidirectional store seems plausibly advantageous. That said, none of the IHVs seem to be going that route. I suppose the typical dataset is too large to benefit from a small local store.
 
I hadn't thought of that. Considering the rate GPUs can churn through data, a fast bidirectional store seems plausibly advantageous. That said, none of the IHVs seem to be going that route. I suppose the typical dataset is too large to benefit from a small local store.

[puts on snarky hat] Typical datasets too large to benefit from local store? Where is all the vigor for local store from the Cell thread! If smart data organization and data streaming worked great for the 256KB LS in the SPEs, then this should be a dream scenario![/snarky hat]

I am sure there are a lot of scenarios where the memory would not be big enough, but I think the issue would be cost/benefit. I don't have an answer there, but I wouldn't point to the GPU IHVs as a reason against, as their issue is more market-driven than technology-limited. A Xenos-style framebuffer would die on the many-resolution desktop, and even the proprietary memory we are discussing wouldn't be a "standard spec", so you would be trading die space for general-use hardware away for less-used proprietary tech, which would get you killed in current-gen/last-gen benchmarks.
 
So, the ESRAM would mean DDR3 is a given?

I think DDR3 is unlikely because over the expected lifetime of the device, it would be more expensive than DDR4.

[puts on snarky hat] Typical datasets too large to benefit from local store? Where is all the vigor for local store from the Cell thread! If smart data organization and data streaming worked great for the 256KB LS in the SPEs, then this should be a dream scenario![/snarky hat]

There is a massive, qualitative difference between a pool of <256K (remember, code has to fit too) and one of tens of megabytes. One of them can fit entire real-world datasets that you want to work with (namely, the framebuffer), and the other cannot.
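
To put rough numbers on that (a plain RGBA8 colour buffer is my assumption):

[code]
# A single uncompressed colour buffer vs. a 256 KB local store.
for w, h in ((1280, 720), (1920, 1080)):
    size = w * h * 4                          # RGBA8, bytes
    print(f"{w}x{h} RGBA8 = {size / 2**20:.1f} MB "
          f"(~{size / (256 * 1024):.0f}x a 256 KB LS)")
[/code]

Even a 720p colour buffer alone is ~14x the size of an SPE's local store; in a few tens of MB it fits whole, with depth and MSAA besides.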
 
[puts on snarky hat] Typical datasets too large to benefit from local store? Where is all the vigor for local store from the Cell thread! If smart data organization and data streaming worked great for the 256KB LS in the SPEs, then this should be a dream scenario![/snarky hat]
Just because you can doesn't mean people want to - development complexity is one of the reasons for giving up on Cell.

I am sure there are a lot of scenarios where the memory would not be big enough, but I think the issue would be cost/benefit. I don't have an answer there, but I wouldn't point to the GPU IHVs as a reason against, as their issue is more market-driven than technology-limited. A Xenos-style framebuffer would die on the many-resolution desktop, and even the proprietary memory we are discussing wouldn't be a "standard spec", so you would be trading die space for general-use hardware away for less-used proprietary tech, which would get you killed in current-gen/last-gen benchmarks.
For graphics, yes. But for GPU manufacturers selling their chips to supercomputer builders, if local, fast working space was beneficial for GPGPU work, wouldn't they be adding it? Might be a bit too early yet, or it might be that GPGPU using massive parallel processing on massive datasets just needs access to large buckets of data and a few MBs won't be any help.

The way I see it, if you want to do lots of repeat computation on a dataset, then local store on the GPU would work, with the GPU CUs* reading and writing data. Think multipass image enhancement or something. If you have a larger dataset and just want to burn through it with some algorithm, local store won't be any use as you'll still be limited by system BW. I can see local store having uses, but it'll come at a cost of chip size/processing power.

*What exactly are we supposed to call GPU ALUs/math units these days??
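
Here's a toy sketch of that multipass case, under made-up numbers: a hypothetical 10MB "local store", a dummy enhancement pass, and tiles sized to fit, so system BW is paid once per tile rather than once per pass (tile-edge halos are ignored for brevity):

[code]
import numpy as np

LOCAL_STORE = 10 * 1024 * 1024        # hypothetical on-die pool, in bytes
PASSES = 4                            # e.g. iterative image enhancement

def enhance(tile):
    # Dummy unsharp-mask-style pass; everything stays in the "local store".
    blur = (np.roll(tile, 1, 0) + np.roll(tile, -1, 0) +
            np.roll(tile, 1, 1) + np.roll(tile, -1, 1)) / 4.0
    return np.clip(tile + 0.5 * (tile - blur), 0.0, 1.0)

def process(image):
    h, w, c = image.shape
    bytes_per_row = w * c * image.itemsize
    rows = max(1, LOCAL_STORE // bytes_per_row)   # tile height that fits
    out = np.empty_like(image)
    for y in range(0, h, rows):
        tile = image[y:y + rows].copy()   # one read from "system memory"
        for _ in range(PASSES):           # all passes run out of local store
            tile = enhance(tile)
        out[y:y + rows] = tile            # one write back
    return out

frame = np.random.rand(1080, 1920, 4).astype(np.float32)
result = process(frame)
[/code]

The win is exactly as you say: system BW cost is one read and one write per frame regardless of how many passes run, while the repeat traffic all hits the fast pool.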
 
There is a massive, qualitative difference between a pool of <256K (remember, code has to fit too) and one of tens of megabytes. One of them can fit entire real-world datasets that you want to work with (namely, the framebuffer), and the other cannot.

That was my point ;) I was being snarky due to the view some have proposed concerning the SPE LS in the other thread, especially since SPEs were often used for graphics.
 
That was my point ;) I was being snarky due to the view some have proposed concerning the SPE LS in the other thread, especially since SPEs were often used for graphics.

Reading your post, I'm pretty ashamed I didn't see that. The best I can do is blame lack of sleep.
 
Reading your post, I'm pretty ashamed I didn't see that. The best I can do is blame lack of sleep.

Your Posts > My Posts ... and I did the same thing to Shifty a couple posts up so I have no room to complain :p

Now if you have some free time soon, I think a lot of us would be curious what you and other developers would think of, say, a 50MB chunk of embedded memory on-die that is full read/write with ~1TB/s of bandwidth and low latency. Not a Xenos-style eDRAM where it was limited to just the ROPs with limited read/write, but a huge chunk of cache you can read and write to, stream data in and out of main memory from (e.g. bin your framebuffer and render in tiles if you wish), etc.

What would you like to see in such a HW scenario, what HW setups should be avoided (Xenos and ... ?), and what neat algorithms and techniques would you salivate over doing on this? And which ones might this cause problems for, due to size, or because losing the ~40-50mm^2 on 28nm (if it is accurate that IBM gets ~4.2Mb/mm^2 at 45nm) of GPU performance would not be worth it?
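
For reference, a rough check of that area figure (assuming ideal area scaling from 45nm to 28nm, which real SRAM won't hit):

[code]
# 50 MB of on-die memory at the quoted ~4.2 Mb/mm^2 (IBM, 45nm),
# naively scaled to 28nm. Ideal scaling is an optimistic assumption.
size_mbit  = 50 * 8                    # megabits
area_45nm  = size_mbit / 4.2           # mm^2 at 45nm
scale      = (45 / 28) ** 2            # ideal area scaling factor
print(f"{area_45nm:.0f} mm^2 at 45nm -> ~{area_45nm / scale:.0f} mm^2 at 28nm")
[/code]

Ideal scaling gives ~37mm^2; SRAM rarely scales ideally, so your ~40-50mm^2 figure looks about right.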
 