What are the cases where cache misses are common?
A lot of compute jobs.
Any shader with unpredictable access patterns (see the sketch after this list).
Any shader with enough inputs that the cache can't hold them.
Any unswizzled input.
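To make the unpredictable-access case concrete, here's a minimal CUDA sketch (the names gather_kernel, idx, src and dst are made up for illustration): the read address depends on data only known at runtime, so the hardware can't prefetch and most loads miss.

```cuda
// Unpredictable access: each thread's load address comes from another
// array, so neighbouring threads may touch addresses scattered across
// hundreds of MB. Each such load is likely a cache miss, and the warp
// stalls until DRAM answers.
__global__ void gather_kernel(const int* __restrict__ idx,
                              const float* __restrict__ src,
                              float* __restrict__ dst,
                              int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // data-dependent address = cache-hostile
}
```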
My guess would be that it's designed the way it is for a reason: MS does a lot of measurement of 360 titles and GPU utilization, and I'd guess they found that a lot of the compute resources were underutilized because of data stalls.
There isn't much you can do on a PC to fix this; you need API support for something like a fast memory pool. All you can really do on a PC is increase the size of the register pools to keep more threads in flight and increase the caches, and that might be more expensive for a given performance gain than throwing more CUs at the problem.
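To illustrate the "threads in flight" point: on PC hardware you can at least see how register pressure caps residency. A minimal CUDA sketch, assuming a made-up toy_kernel; the fewer blocks the runtime reports, the fewer threads the scheduler has available to hide memory latency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for "a shader"; the more live registers the compiler assigns
// it, the fewer copies of it the hardware can keep resident at once.
__global__ void toy_kernel(float* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    // Ask the runtime how many 256-thread blocks fit on one SM given the
    // kernel's register and shared-memory footprint.
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, toy_kernel, 256, 0);
    printf("resident blocks per SM: %d (%d threads in flight)\n",
           blocks, blocks * 256);
    return 0;
}
```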
As I said, I wouldn't like to posit how underutilized CUs are in the average modern renderer because of data stalls. If I were guessing, I would GUESS it's a significant amount.
I do know it's stupidly easy to write a compute job that you think will run 100x faster than your trivial CPU solution, benchmark it, and discover it's actually slower, because the ALUs are all sitting there waiting for data. It's one of the reasons I've been saying that FLOPS are not a useful performance metric.
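A hedged sketch of why that happens: this SAXPY-style CUDA kernel does 2 FLOPs per 12 bytes of memory traffic, so it saturates bandwidth long before the ALUs, and once you add the cost of copying the data to and from the device, a trivial CPU loop can come out ahead.

```cuda
// ~0.17 FLOP/byte: two FLOPs against 12 bytes of traffic per element.
// Any modern GPU runs this at a tiny fraction of its quoted peak FLOPS,
// because the ALUs spend almost all their time waiting on memory.
__global__ void axpy(const float* __restrict__ x,
                     float* __restrict__ y,
                     float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // 2 FLOPs; 8 bytes read, 4 written
}
```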
And as I said before, the danger with the solution MS has come up with is that you need to schedule moves of source data into the fast RAM; that eats bandwidth, and if you can't get the data there fast enough, the entire rendering pipeline stalls.
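As a sketch of what that scheduling looks like, here's a hypothetical double-buffered staging loop in CUDA, with shared memory standing in for the small fast pool (TILE, process and staged_kernel are invented names). The next tile is fetched while the current one is processed; if the fetch can't keep up, the barrier stalls every thread in the block, which is exactly the pipeline-stall risk described above.

```cuda
#define TILE 256

__device__ float process(float v) { return v * v; }  // stand-in for real work

// Assumes a single block of TILE threads, for brevity.
__global__ void staged_kernel(const float* __restrict__ src,
                              float* __restrict__ dst,
                              int num_tiles)
{
    __shared__ float buf[2][TILE];  // the "small fast pool"

    buf[0][threadIdx.x] = src[threadIdx.x];  // preload tile 0
    __syncthreads();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1;
        // Schedule the next move while the ALUs work on the current tile.
        if (t + 1 < num_tiles)
            buf[cur ^ 1][threadIdx.x] = src[(t + 1) * TILE + threadIdx.x];

        dst[t * TILE + threadIdx.x] = process(buf[cur][threadIdx.x]);

        // If the next tile hasn't arrived yet, everyone waits here.
        __syncthreads();
    }
}
```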
I also have questions about how deferred renderers are best handled with such a small fast memory pool. It wouldn't be unusual for a deferred renderer to write 28 bytes/pixel, and that won't fit in 32MB at 1080p (the arithmetic is below). Can you split the MRTs across different pools? Does MS provide guidance for devs trying to do this?
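The arithmetic, using a hypothetical MRT layout that adds up to the 28 bytes/pixel above (three 8-byte RGBA16F targets plus one 4-byte RGBA8):

```cuda
#include <cstdio>

int main()
{
    const long long pixels = 1920LL * 1080;       // 2,073,600 pixels at 1080p
    const int bytes_per_pixel = 8 + 8 + 8 + 4;    // hypothetical 4-MRT layout
    const double mib = pixels * bytes_per_pixel / (1024.0 * 1024.0);
    printf("G-buffer: %.1f MiB vs a 32 MiB pool\n", mib);  // ~55.4 MiB
    return 0;
}
```

Nearly twice the pool, so at best a subset of the MRTs could live in fast memory at any one time.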
To me it's an interesting approach. How effective it is will depend on how ALU-bound versus data-bound shaders are in modern games; I just haven't looked at enough data, or spoken to anyone who has, to have a good idea.