NGGP: NextGen Garbage Pile (aka: No one reads the topics or stays on topic) *spawn*

I was thinking Xenos was better than RSX in about three ways:
- Flexibility (as you mentioned)
- Vertex ops
- Efficiency

I'd say they were more or less the same thing. Flexibility in the context of my post was referring to the flexibility of being able to allocate all your shader resources to either pixel or vertex shaders as needed. That also covers the advantage in vertex ops and the additional overall efficiency of the shader array. I'd say Xenos's other two advantages over RSX were its feature set - it was definitely more forward looking - and the eDRAM, although that obviously came with its own headaches.

Crytek said that, when used properly, they are comparable (aside from the vertex handicap). ND had to use Cell to overcome that handicap, and by a large margin. First parties used it to make RSX a lot more efficient than it would be ordinarily.

It would be interesting if that extra CU, rumored to be in Orbis, can be used in such a flexible manner.

My understanding is that a lot of what Cell was used for was to make up for the relative lack of vertex shader performance in RSX. I can't see Orbis having that requirement given that its shaders are already unified and offer way more peak performance than Durango's, from the sounds of it.

I'd expect the extra CU to be used for more interesting things like physics.
 
A number of the hardware features listed under Orbis are likely exposed in some fashion for Durango as well.
If the DME rumors are mostly accurate, this is actually an elaboration on the data movement infrastructure for GCN.

The point that Durango's graphics may be hampered by bandwidth--unless they tile--may be Microsoft's intent.

What about the API side of things? Did Microsoft expose the full ISA of Xenos in the Xbox 360's enhanced DirectX 9 API?

Anyway, it's nothing that can't be done later. If they see that the performance gap is widening because Sony is giving access to close-to-the-metal coding, they could provide it later on.

What kind of architecture improvements could AMD have made to GCN for Durango? According to Aegis it is something that has to do with how the SIMDs work, if I understand correctly.
 
This is what interests me most about Orbis. Assuming the extra CU is GCN and runs at 800 MHz, it would simply double the raw CPU-side processing power (102.4 GFLOPS -> 204.8 GFLOPS).

Does anyone have an idea what we can expect of (for example) an 8-core Jaguar / 64-core GCN processor in the Orbis SoC compared to homogeneous desktop CPUs, or heterogeneous CPUs like the Cell?

Core i5 2500K = 211 GFLOPS
Core i7 3770K = 224 GFLOPS
Core i7 3970X = 336 GFLOPS
Core i7 4770K = 448 GFLOPS
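For reference, all of these figures fall out of the same peak-FLOPS formula (cores x FLOPs per cycle x clock). A minimal sketch of the arithmetic; the 1.6 GHz Jaguar clock is my assumption, while the 800 MHz CU figure comes from the rumor above:

```python
# Back-of-envelope peak single-precision FLOPS: cores * FLOPs/cycle/core * clock (GHz).
# The 1.6 GHz Jaguar clock and the dedicated-CU setup are rumored/assumed, not confirmed.

def peak_gflops(cores, flops_per_cycle, ghz):
    """Peak single-precision GFLOPS under a simple core-count * issue-width * clock model."""
    return cores * flops_per_cycle * ghz

# 8 Jaguar cores: 128-bit FPU, one 4-wide add + one 4-wide mul per cycle = 8 FLOPs/cycle (assumed 1.6 GHz)
jaguar = peak_gflops(8, 8, 1.6)
# One GCN CU: 64 lanes * 2 FLOPs (fused multiply-add) per cycle, rumored 800 MHz
extra_cu = peak_gflops(64, 2, 0.8)
print(f"8x Jaguar: {jaguar:.1f}  extra CU: {extra_cu:.1f}  total: {jaguar + extra_cu:.1f} GFLOPS")
# -> 102.4, 102.4, 204.8

# The same formula reproduces the desktop numbers quoted above:
print(f"{peak_gflops(4, 16, 3.3):.1f}")  # i5 2500K (AVX add+mul)  -> 211.2
print(f"{peak_gflops(4, 16, 3.5):.1f}")  # i7 3770K                -> 224.0
print(f"{peak_gflops(6, 16, 3.5):.1f}")  # i7 3970X                -> 336.0
print(f"{peak_gflops(4, 32, 3.5):.1f}")  # i7 4770K (AVX2 FMA)     -> 448.0
```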
 
What about the API side of things? Did Microsoft expose the full ISA of Xenos in the Xbox 360's enhanced DirectX 9 API?

Yes, though some of the more esoteric features didn't come until after launch; they have to write the "driver", and stability is somewhat more important than raw feature set.

MS could take a different tack this time though. Both current consoles have very low overhead drivers just because the game has exclusive access to the GPU state. Assuming you have a lot of concurrently running features, you have to deal with arbitrating access to the GPU in some way, and perhaps more importantly, you have to decide whether you want to let a game lock up the entire system if it does something bad and crashes the GPU.
 
He's operating on the (same) lack of info we are. He's presuming the lack of main memory bandwidth is going to hurt Durango, but supposedly that is mitigated by the DMEs, ESRAM and other circuitry.

Indeed. What he says about the potential of the ps4 sounds amazing and I am looking forward to what Sony's first party will produce given what they achieved with the ps3. Good times ahead.

On Durango, we simply don't know a lot about the inner workings of the system to say much, and he is making too many suppositions there: DX11 PC, pre-GCN GPU... Until we know more, I would hesitate to condemn the system. The few comments from those in the know point to a very interesting system, and how good each of these systems is compared to the other is something that will likely be answered over their lifetimes.
 
To me it sounds like people present Durango in a best-case scenario when they talk about it, without knowing what these dedicated engines do. Everybody expects some great stuff from PS4 since it's obviously a very gaming-centric console, and it also has (at least by the rumors) a big theoretical performance advantage over Durango.

To counter that, people assume that Durango's CUs will somehow be much more efficient, that the data move engines will help with the lower bandwidth of main memory, and that the low latency of the eSRAM will prove to be an advantage over the GDDR5 that PS4 is packing (no matter the speed, 192 GB/s or 160 GB/s). This to me sounds like wishful thinking, and I think what we are getting is one high-performing console and another that, while not that far from the other, still comes off as underpowered. It's just the way it's going to be for next gen, by the way things are shaping up.
 
Durango's latest tidbits were more interesting to discuss.
When something wacky turns up for Orbis, we could see more speculation on that.

A fair amount of the puzzling over Durango is focused on the question of how the fast memory pool can or cannot be used to match a raw bandwidth deficit.
 
Durango's latest tidbits were more interesting to discuss.
When something wacky turns up for Orbis, we could see more speculation on that.

A fair amount of the puzzling over Durango is focused on the question of how the fast memory pool can or cannot be used to match a raw bandwidth deficit.
I understand, but from what the rumors are pointing out, Orbis won't have any problems with bandwidth either. It seems like the general feeling is "Durango will be able to use bandwidth very efficiently and keep the GPU fed all the time, while the move engines will result in the performance gulf being much smaller", while at the same time completely ignoring the fact that Orbis should have no problems keeping its GPU fed either.

It seems as though some people try to describe Durango in the best possible case (very efficient, dedicated silicon for gfx-related tasks, low latency), and Orbis like an off-the-shelf PC with worse efficiency and more raw power. I don't think it paints the full picture, but I guess the wackiest we can get from Orbis will be memory bandwidth of 144 GB/s - 176 GB/s, as I don't see them achieving 192 GB/s.
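For what it's worth, every bandwidth figure floating around for Orbis (144, 160, 176, 192 GB/s) lines up with a 256-bit GDDR5 bus at a different per-pin speed grade. A quick sketch of that arithmetic, with the 256-bit bus width being my assumption rather than anything from the rumors:

```python
# Rumored Orbis bandwidth figures as bus width * per-pin data rate.
# Only the GB/s totals come from the rumors; the 256-bit bus is assumed here.

def gddr5_bandwidth_gbs(bus_bits, gbps_per_pin):
    """Peak bandwidth in GB/s for a GDDR5 interface."""
    return bus_bits * gbps_per_pin / 8  # bits -> bytes

for rate in (4.5, 5.0, 5.5, 6.0):
    print(f"{rate} Gbps x 256-bit = {gddr5_bandwidth_gbs(256, rate):.0f} GB/s")
# 4.5 -> 144, 5.0 -> 160, 5.5 -> 176, 6.0 -> 192
```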
 
Durango's latest tidbits were more interesting to discuss.
When something wacky turns up for Orbis, we could see more speculation on that.

A fair amount of the puzzling over Durango is focused on the question of how the fast memory pool can or cannot be used to match a raw bandwidth deficit.

Thankfully ERP's posts have been very helpful in illuminating the mechanisms by which it may be successful and what limitations it will likely incur regardless.
 
Durango's latest tidbits were more interesting to discuss.
When something wacky turns up for Orbis, we could see more speculation on that.

A fair amount of the puzzling over Durango is focused on the question of how the fast memory pool can or cannot be used to match a raw bandwidth deficit.

Oh, that answer's easy: it can't.

But it's the wrong question.

The question should be how much bandwidth is needed and how much can the SRAM increase the utilization of the small number of CU's in general workloads.

The DMEs can never "save" memory bandwidth; they move memory, and that always "costs" bandwidth compared to a single pool where you don't have to move it at all. The question should be: does the utility add more than the cost?

I'm somewhat interested in the Durango ESRAM because it's clearly an attempt to address a real problem with modern GPU's in a novel way. If this was just EDRAM for the framebuffer, IMO it would have no benefit whatsoever over a fast memory pool. But from what I've been told that's not the intent and that makes it interesting.
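To put rough numbers on the "does the utility add more than the cost" question, here's a toy sketch; every size and reuse count below is invented for illustration and has nothing to do with Durango's actual figures:

```python
# Toy model of the point above: staging a buffer through a second pool never reduces the
# traffic needed to get the data out of main memory, it adds a copy on top.

def direct_read_cost(buffer_mb, reads):
    """Main-pool traffic if the GPU reads the buffer from main memory every time."""
    return buffer_mb * reads

def staged_read_cost(buffer_mb, reads):
    """Main-pool and SRAM traffic if a DME first copies the buffer into on-chip SRAM."""
    main_pool = buffer_mb              # the copy still has to read it from main memory once
    sram_pool = buffer_mb * (1 + reads)  # one write plus the subsequent reads hit the SRAM
    return main_pool, sram_pool

buf = 8  # MB, arbitrary
print(direct_read_cost(buf, reads=4))   # 32 MB of main-pool traffic
print(staged_read_cost(buf, reads=4))   # (8, 40): main-pool traffic shrinks only because the
                                        # re-reads moved to the SRAM; the copy itself is pure
                                        # overhead if the data is only needed once.
```

In other words, the copy only pays for itself when the copied data is re-read from the fast pool several times; touch it once and the move was wasted work.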
 
Oh, that answer's easy: it can't.

But it's the wrong question.

The question should be how much bandwidth is needed and how much can the SRAM increase the utilization of the small number of CU's in general workloads.

The DMEs can never "save" memory bandwidth; they move memory, and that always "costs" bandwidth compared to a single pool where you don't have to move it at all. The question should be: does the utility add more than the cost?

I'm somewhat interested in the Durango ESRAM because it's clearly an attempt to address a real problem with modern GPU's in a novel way. If this was just EDRAM for the framebuffer, IMO it would have no benefit whatsoever over a fast memory pool. But from what I've been told that's not the intent and that makes it interesting.

Hmmm... interesting. All the more reason to look forward to getting in-depth knowledge about it.

As an aside, MUCH thanks to you ERP for your answers and contributions to this board, much appreciated.
 
This goes in line with my theory that MS will opt for a faster refresh cycle and a subscription model for the next console, similar to smartphones. In 2-3 years, a new, more powerful version comes out. All games on the XBOX3 perform better on the more powerful hardware, similar to PC land.

That was hinted at in that leaked roadmap too, though the wording there sounded more like MS was going to license 720 to different manufacturers who could come up with different performance targets, than MS themselves refreshing the console like Apple does with their products.

I can totally see that happening, though :p
 
Any non-trivial computed texture coordinate in a shader will lead to some sort of unpredictable access pattern, say computing something based on a reflection vector.
However, texture accesses are usually relatively coherent, so the cache isn't useless unless the texture doesn't fit in the cache and you're massively undersampling it.

Adding CUs doesn't help at all in these data-bound cases; all you can do is increase the register pool and increase the number of in-flight threads (or, as it appears Durango does, lower the latency to data) to try and hide the latency.

People tend to think of bottlenecks as simple and singular, but it's not like that. When you're rendering a scene you will be ALU bound, data bound, ROP bound, vertex bound and bandwidth bound at various points, sometimes changing clock to clock. Increasing the number of ALUs makes the cases where you are ALU bound faster, but nothing else if everything else remains constant.
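As a crude illustration of that point (the per-section split below is completely invented), treating a frame as a mix of sections that are each limited by a different resource shows why extra ALUs only buy back the ALU-bound portion:

```python
# Model a frame as a list of sections, each limited by whichever resource it is bound on.
# The millisecond split is made up purely for illustration.

frame_ms = {"alu_bound": 6.0, "bandwidth_bound": 5.0, "rop_bound": 3.0, "vertex_bound": 2.0}

def frame_time(sections, alu_scale=1.0):
    """Total frame time when only the ALU-bound sections speed up with more CUs."""
    return sum(ms / alu_scale if name == "alu_bound" else ms
               for name, ms in sections.items())

print(frame_time(frame_ms))                 # 16.0 ms baseline
print(frame_time(frame_ms, alu_scale=1.5))  # 14.0 ms with 50% more ALU throughput
print(frame_time(frame_ms, alu_scale=2.0))  # 13.0 ms even with double the ALUs
```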

When PC vendors try to improve performance they spend the transistors where they think it will help the most overall, but they don't have the option of radical solutions, because they have to run last year's hot game better than their competitor and it's not going to get rewritten when they release a new card.

Now don't get me wrong, I'm not saying that something like this would offset a huge compute deficit; it's impossible to know without running code and actually measuring things, and I'm fairly certain MS isn't going to send me a devkit to play with. :p



I'm not sure I understand the context of the question.

Ah sorry for not being clear. I am typing these on a tablet. ^_^

By a brute-force measure with the number of cores (or more power), I meant the compute-bound part of the work gets completed earlier as a result. They may need to tackle/balance the I/O-bound part of the business to prevent aggravating the situation.

One way the SPUs get around the problem is to overload them with many tasks at the same time. So when one job gets stuck, it moves on to another (a second "thread" keeps assorted data coming in, in a double-buffering manner). It will mean they need to customize the GPU to support fast context switching. If they can't do this effectively, then it gets tricky.
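A minimal sketch of that double-buffering idea in generic code (plain Python threads standing in for DMA and SPU work, so purely illustrative): kick off the next transfer before processing the buffer that has already arrived.

```python
# Double buffering: while one buffer is being processed, the transfer of the next chunk
# is already in flight, so the compute side rarely sits idle waiting on data.

import concurrent.futures

def fetch(chunk_id):
    """Stand-in for a DMA transfer bringing a chunk of data into local store."""
    return [chunk_id] * 1024  # pretend payload

def process(buffer):
    """Stand-in for the compute job run on a fetched chunk."""
    return sum(buffer)

def double_buffered(chunk_ids):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, chunk_ids[0])          # kick off the first transfer
        for next_id in chunk_ids[1:]:
            current = pending.result()                    # wait for the in-flight transfer
            pending = io.submit(fetch, next_id)           # start the next one immediately...
            results.append(process(current))              # ...and compute while it runs
        results.append(process(pending.result()))         # drain the last buffer
    return results

print(double_buffered(list(range(8))))
```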

Yes, the problem is app/context specific; maybe they can implement specific measures for some of the common occurrences?
 
One way the SPUs get around the problem is to overload them with many tasks at the same time. So when one job gets stuck, it moves on to another. It will mean they need to customize the GPU to support fast context switching. If they can't do this effectively, then it gets tricky.

This is fundamentally what GPUs already do: they have a massive register pool, and dynamically allocate it to potentially hundreds of running thread contexts. But this only gets you so far; you can only hide latency if there is other work to do that isn't waiting on memory (and even cache hits cost cycles).
If your shader uses a lot of registers you get fewer threads in flight and your ability to accommodate latency suffers.
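For GCN specifically it's easy to put rough numbers on that trade-off, going by AMD's public figures of a 64 KB vector register file per SIMD (256 VGPRs per lane) and a cap of 10 wavefronts per SIMD:

```python
# Rough GCN occupancy calculation: the more VGPRs a shader needs, the fewer wavefronts
# fit in a SIMD's register file, and the less latency can be hidden by switching waves.

MAX_WAVES_PER_SIMD = 10
VGPRS_PER_SIMD = 256

def waves_in_flight(vgprs_per_wave):
    """Wavefronts a single SIMD can keep resident for a given register footprint."""
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // vgprs_per_wave)

for vgprs in (24, 32, 48, 64, 84, 128):
    print(f"{vgprs:3d} VGPRs -> {waves_in_flight(vgprs)} waves per SIMD")
# 24 -> 10, 32 -> 8, 48 -> 5, 64 -> 4, 84 -> 3, 128 -> 2
```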

GPU designers can add more registers to the pool, but this is expensive, and at some point it's cheaper to increase the compute resources and make the non-memory-constrained parts run faster than it is to spend more transistors trying to hide latency.
What MS' solution appears to try to do is reduce latency rather than increase the GPU's tolerance of it.

Modern ATI GPU's also allow you to run compute jobs while rendering, which will increase thread diversity, and probably ALU utilization as a result.

As I said it's just really hard to say what the impact is, we'll have a pretty good idea when we see cross platform titles, or developers start bitching in interviews.

Things like deferred renderers worry me on MS' architecture, especially given their prevalence, but my back-of-the-envelope math says it may not be as bad as I had originally thought.
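Here's one possible version of that back-of-the-envelope math, purely illustrative: the render-target layout is my own assumption, and 32 MB is only the rumored ESRAM size.

```python
# Does a 1080p G-buffer fit in the rumored 32 MB of ESRAM? The layout below is an
# assumption for illustration, not anything known about Durango or any particular engine.

width, height = 1920, 1080
bytes_per_pixel = {
    "albedo (RGBA8)":    4,
    "normals (RGBA8)":   4,
    "spec/misc (RGBA8)": 4,
    "depth (D24S8)":     4,
}

gbuffer_mb = width * height * sum(bytes_per_pixel.values()) / (1024 * 1024)
print(f"G-buffer: {gbuffer_mb:.1f} MB")   # ~31.6 MB -- a tight fit in 32 MB

# Add an FP16 light-accumulation target on top and it no longer fits, which is why
# the partial-residency / tiling questions keep coming up.
hdr_extra_mb = width * height * 8 / (1024 * 1024)   # RGBA16F
print(f"With an FP16 light buffer: {gbuffer_mb + hdr_extra_mb:.1f} MB")  # ~47.5 MB
```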
 
*If* MS has come this far with this architecture, they would have already ported/created a few representative games to run on the new platform, and found the payoff to be positive. At this time, besides optimization (for further gains), I believe they should be working on higher-level issues, like running physics, the new Kinect stuff, etc. in parallel with rendering.

If the payoff is dubious or lousy, I think MS would have dropped the idea long ago.
 
The DMEs can never "save" memory bandwidth; they move memory, and that always "costs" bandwidth compared to a single pool where you don't have to move it at all. The question should be: does the utility add more than the cost?
That's too bad. I thought they had some functions along the lines of the Cell DMA engine, which took advantage of its location in the pipeline to coalesce accesses. GCN's access coalescing abilities are somewhat nebulous. There is some straightforward detection of matched index values for the same vector memory instruction, but the rest of the cache hierarchy and memory controller's abilities weren't outlined.
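As a toy illustration of what coalescing buys (just the arithmetic of the access pattern, not a model of GCN's actual hardware): count how many 64-byte cache lines a single 64-lane vector load touches under different per-lane address patterns.

```python
# Coalescing in a nutshell: adjacent lane addresses collapse into a handful of cache-line
# transactions, while strided or gathered addresses touch a line per lane.

CACHE_LINE = 64  # bytes

def lines_touched(addresses):
    """Number of distinct cache lines covered by one vector memory instruction."""
    return len({addr // CACHE_LINE for addr in addresses})

lanes = range(64)
sequential = [lane * 4 for lane in lanes]              # adjacent 32-bit reads
strided    = [lane * 256 for lane in lanes]            # one lane per cache line
gathered   = [(lane * 977) % 65536 for lane in lanes]  # scattered gather

print(lines_touched(sequential))  # 4 transactions for all 64 lanes
print(lines_touched(strided))     # 64 -- every lane lands in its own line
print(lines_touched(gathered))    # 64 -- effectively a separate line per lane
```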

Additionally, I pondered the possibility that the memory controller and DME parts of the pipeline could intercept accesses in the request traffic in order to route them to the SRAM, or possibly an explicit write path to the SRAM.


I'm somewhat interested in the Durango ESRAM because it's clearly an attempt to address a real problem with modern GPU's in a novel way. If this was just EDRAM for the framebuffer, IMO it would have no benefit whatsoever over a fast memory pool. But from what I've been told that's not the intent and that makes it interesting.
As for the framebuffer, I was thinking not all of the framebuffer, or at least not all of it at once.
Unfortunately, I can't borrow your ears. I'll have to await further tidbits as they come.
 
How can the eSRAM be used smartly, aside from framebuffer allocation?
When I say smartly, I'm not talking about particle effects or similar useless stuff...
 
That's too bad. I thought they had some functions along the lines of the Cell DMA engine, which took advantage of its location in the pipeline to coalesce accesses. GCN's access coalescing abilities are somewhat nebulous. There is some straightforward detection of matched index values for the same vector memory instruction, but the rest of the cache hierarchy and memory controller's abilities weren't outlined.

Additionally, I pondered the possibility that the memory controller and DME parts of the pipeline could intercept accesses in the request traffic in order to route them to the SRAM, or possibly an explicit write path to the SRAM.

As for the framebuffer, I was thinking not all of the framebuffer, or at least not all of it at once.
Unfortunately, I can't borrow your ears. I'll have to await further tidbits as they come.

I shouldn't make blanket statements, because I don't have a lot of details on the DMEs, just what they do, and I'm drawing conclusions from that.
Given their functionality I would be surprised if, for example, they had a direct path to the GPU.
But I wouldn't, for example, rule out an additional port on the SRAM for simultaneous copies. That still doesn't "save" bandwidth though; the data has to come from somewhere, and if you didn't have the two pools you probably wouldn't be copying it at all.
 
As I said it's just really hard to say what the impact is, we'll have a pretty good idea when we see cross platform titles, or developers start bitching in interviews.
What you've described elsewhere about the DMEs' operations has me somewhat concerned that devs will have a Cell-like memory juggling act to perform, micromanaging memory accesses and movements around the system.
 