Quote:
Then you'll get bandwidth bound because of all the interpolator-data you 'cache' in video memory for TBDR.

I don't understand what you mean by "interpolator-data you 'cache' in video memory". Can you explain?
Quote:
Then you'll get bandwidth bound because of all the interpolator-data you 'cache' in video memory for TBDR.

I'd rather say 'on low performance devices'.
I think TBDR is heading toward the 'bandwidth brick wall' faster than IMR is.
The only way bandwidth might increase on the iPad 2 is through using more polys, which might result in more overdraw. But once you have enough compute power, your models start to get more and more detailed (so the extra polys go into more detailed models rather than more models, which means probably the same overdraw). Then you'll get bandwidth bound because of all the interpolator-data you 'cache' in video memory for TBDR. Let's say you have 1 million vertices per frame (for simplicity's sake, not that I claim games would use that on the iPad 2), each carrying:
float4 position (16 bytes)
float3 tangent (12 bytes)
float4 bitangent (16 bytes)
float2 UV (8 bytes)
-> 52 bytes per vertex x 1M vertices = 52MB.
Drawing to a 1024x768 RGBA8 target means 6MB for Z + color (3MB each), so up to roughly 9x overdraw you end up cheaper bandwidth-wise when running in forward mode.
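As a back-of-the-envelope sketch of that arithmetic (the vertex layout, counts and resolution are the assumptions from the post above, not measured numbers):

Code:
#include <stdio.h>

/* Per-vertex data a TBDR first pass would spill to memory, using the
 * attribute layout assumed above. All floats, so no padding: 52 bytes. */
struct Interpolators {
    float position[4];   /* float4: 16 bytes */
    float tangent[3];    /* float3: 12 bytes */
    float bitangent[4];  /* float4: 16 bytes */
    float uv[2];         /* float2:  8 bytes */
};

int main(void) {
    const long vertices = 1000000;               /* assumed 1M verts/frame */
    const long spill    = vertices * (long)sizeof(struct Interpolators);
    const long fb_bytes = 1024L * 768 * (4 + 4); /* RGBA8 color + 32-bit Z */
    printf("TBDR vertex spill: %ld MB/frame\n", spill / 1000000);
    printf("Z + color:         %ld MB/frame\n", fb_bytes / 1000000);
    printf("break-even overdraw: ~%.1fx\n", (double)spill / fb_bytes);
    return 0;
}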
If you really get pixel bound, a simple Z pass will probably result in the same pixel-shader load on forward pipelines as on TBDR pipelines, thanks to EarlyZ optimizations.
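For reference, a minimal sketch of such a Z pre-pass in OpenGL ES 2.0 terms (drawOpaqueScene() is a hypothetical placeholder for issuing all opaque draw calls):

Code:
#include <GLES2/gl2.h>

extern void drawOpaqueScene(void);  /* hypothetical scene-draw callback */

void renderFrameWithZPrepass(void) {
    /* Pass 1: depth only. Color writes off, trivial fragment work. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawOpaqueScene();

    /* Pass 2: shade only the surviving fragments. EarlyZ rejects
     * anything that isn't the nearest surface laid down in pass 1. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);      /* depth already final; no Z writes */
    glDepthFunc(GL_EQUAL);
    drawOpaqueScene();
}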
I think time is playing in AMD's/Nvidia's favour. I'd really like to know how Imagination would handle tessellation if NVidia added that 'poison pill' to one of the future Tegra GPUs.
Does anyone have any information about the memory bandwidth the iPad might have?
Quote:
I don't understand what you mean by "interpolator-data you 'cache' in video memory". Can you explain?

The data that is passed from the vertex shader to the pixel shader (or, more correctly, to the interpolator that generates the fragments) is usually called 'interpolators', as those are per-vertex values interpolated across the triangle. On TBDR you work in two passes: the first is vertex processing, which spills that interpolator data out into video memory for use in the 2nd pass.
Quote:
...which spills that interpolator data out into video memory for use in the 2nd pass.

No. PowerVR systems don't work like that. Perhaps you are getting confused with "software" deferred rendering.
Quote:
I think after a Z pass, using EarlyZ, you have the same amount of texture fetching.

You're ignoring Z-buffer read/modify/writes, framebuffer read/modify/writes on translucency, and the savings from deferred texturing.
Quote:
The Z pass is usually very, very cheap. You're right that you're either input limited or vertex-shader limited, but if that's the limit, you're extremely fast.

A Z pre-pass massively increases Z-buffer and input geometry bandwidth, so although it effectively nullifies the deferred advantage, it isn't the bandwidth saviour that you think it is.
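To put rough numbers on that, reusing the assumed scene from earlier in the thread (1M vertices, 1024x768, 32-bit Z; back-of-the-envelope figures, not measurements):

float4 position stream read a second time: 1M x 16 bytes -> 16MB
Z writes in the pre-pass (1024x768x4 bytes): ~3MB
Z re-reads during the colour pass: ~3MB, plus more with overdraw
-> roughly 22MB/frame of extra traffic before any shading happens.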
Quote:
With a Z pass, I'd assume you could just increase your Z-reject load; the amount of visible pixels will stay the same, and both architectures should react the same to that.

The higher poly counts also tend to go into more detailed and immersive environments, not just more detailed models; e.g. an environment is more likely to have thousands of rocks in it instead of one extremely detailed rock. This means polygon sizes have not shrunk as much as you'd think, which means overdraw has tended to increase.
Quote:
You either spill out the tessellated geometry, which would result in a huge amount of data, or you execute the tessellation many times (once for every tile), which in the worst case would mean executing the domain shader hundreds of times per vertex.

Tessellation only presents a problem to TBDR if you use a dumb implementation; a correct implementation can turn tessellation into a huge advantage for TBDR.
Quote:
I didn't claim it has a disadvantage at the moment, but if the vertex count rises as compute power rises (as it did on PC), you'll hit bandwidth issues sooner with TBDR than with EZ-IMR.

The reality is that if you look at real application data instead of back-of-the-envelope numbers, you'll find that TBDR still has a bandwidth advantage over EZ-IMR.
Quad-core Tegra 3s won't force Apple's hand on performance. The CPU difference will be effectively marginal, and the A5 will still compare favorably in graphics.
Quote:
No. PowerVR systems don't work like that. Perhaps you are getting confused with "software" deferred rendering.

I'm talking about TBDR: the hardware processes all vertices in the first pass and saves the vertex-processing output into a temporary buffer (referenced by tiles), and in the second pass it reads, per tile, all the referenced data.
But NVidia is also touting 2560x1600 support.
So what if, as they claim, there are Tegra 3 tablets later this year and they deliver their "Retina Displays" first?
Quote:
We interpolate 'on-demand' for the tile, and it's not coupled to vertex processing in that way.

OK, let me explain it again in stages; correct me where my misunderstanding is:
1. pass
input: vertex data
processed by ALU using the vertex program
output: interpolator data
interpolator data saved into vmem
2. pass
input: interpolator data
processed by interpolation units
output: fragments
input: fragments
processed by ALU using fragment programs
output: pixel
pixel saved into vmem.
Quote:
I'm talking about TBDR: the hardware processes all vertices in the first pass and saves the vertex-processing output into a temporary buffer (referenced by tiles), and in the second pass it reads, per tile, all the referenced data.

Yes... I'm reasonably familiar with TBDR.
Code:
2. pass
input: transformed/projected data  (struck out: "interpolator")
processed by interpolation units
output: fragments   (emphasis added)
input: fragments    (emphasis added)
processed by ALU using fragment programs
output: pixel
pixel saved into vmem.
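As a rough sketch of that corrected flow (the type names and layout here are illustrative assumptions for exposition, not how PowerVR hardware actually organizes its buffers):

Code:
#define MAX_TRIS_PER_TILE 256          /* illustrative assumption */

/* Pass 1 output: vertex programs run once per vertex; what goes to
 * memory is transformed/projected data, not per-pixel interpolants. */
typedef struct {
    float clip_pos[4];                 /* transformed/projected position */
    float attribs[9];                  /* tangent, bitangent, UV, ... */
} TransformedVertex;

/* Per-tile list referencing the vertices whose triangles touch it. */
typedef struct {
    int vertex_refs[MAX_TRIS_PER_TILE * 3];
    int tri_count;
} TileList;

/* Pass 2, per tile: read back the referenced TransformedVertex data,
 * interpolate on demand, shade, and write out final pixels. Fragments
 * themselves are never written to or read from memory. */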
Quote:
As a side note, that 9x figure is more like a theoretical marketing number than anything else. Of course an SGX543 MP2 is a lot faster than an SGX535, but in texel/pixel fill-rate limited conditions it's only a factor of 2x (assuming the same frequencies) against the 535.

That would also be a purely theoretical number.
Quote:
Yeah, but they were so nice as to print the DRAM part number directly onto the chip, which I doubt we will see on the new A5.

Why would they change that?
Quote:
Yes... I'm reasonably familiar with TBDR.

You're maybe right; in the D3D world the output from vertex shaders goes into 'interpolator registers', which hold the interpolator data. It's a little ambiguous, as that is the input to the interpolation units, while at the same time the fragment units receive interpolated data in their 'interpolator registers'.
I see from your reply to Rys that it seems to be a language problem. After the vertex shader, you don't really have "interpolated" data but, instead, transformed and projected vertices.
When you wrote "interpolated", that immediately implied, to us, that it was per-pixel interpolated data, which is where the confusion came in.
UPDATE:
I just noticed...
Just want to check that you aren't assuming the "output:fragments input:fragments" are actually written/read.
No. I've heard these sorts of "TBDR is doomed" arguments for over a decade. They didn't convince me then and they still don't.
Quote:
No. I've heard these sorts of "TBDR is doomed" arguments for over a decade. They didn't convince me then and they still don't.

I could have guessed I'm not the only one who has ever replied to the "AMD/NVidia is doomed" argument in that way; obvious things are obvious.
Quote:
You're maybe right; in the D3D world the output from vertex shaders goes into 'interpolator registers', which hold the interpolator data.

What you've just described there is, in the main, implementation specific. You need to interpolate across the triangle for the per-pixel values, but we all do it differently and potentially at different stages. We don't go off chip for those values, so there's no external bandwidth cost.
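For anyone following along, this is the interpolation being discussed: a minimal sketch of barycentric attribute interpolation across a triangle (illustrative only; it skips perspective correction, and as noted above, where and how hardware does this is implementation specific):

Code:
/* Interpolate a per-vertex UV across a triangle with barycentric
 * weights. Illustrative only; real hardware differs considerably. */
typedef struct { float x, y; float uv[2]; } Vertex;

/* Barycentric weights (w0 + w1 + w2 == 1) for point (px,py), via
 * signed areas of the sub-triangles. */
static void barycentric(const Vertex *v0, const Vertex *v1,
                        const Vertex *v2, float px, float py,
                        float *w0, float *w1, float *w2) {
    float area = (v1->x - v0->x) * (v2->y - v0->y)
               - (v2->x - v0->x) * (v1->y - v0->y);
    *w1 = ((px - v0->x) * (v2->y - v0->y)
         - (v2->x - v0->x) * (py - v0->y)) / area;
    *w2 = ((v1->x - v0->x) * (py - v0->y)
         - (px - v0->x) * (v1->y - v0->y)) / area;
    *w0 = 1.0f - *w1 - *w2;
}

/* The 'interpolator' value a fragment at (px,py) would receive. */
static void interpolateUV(const Vertex *v0, const Vertex *v1,
                          const Vertex *v2, float px, float py,
                          float out_uv[2]) {
    float w0, w1, w2;
    barycentric(v0, v1, v2, px, py, &w0, &w1, &w2);
    out_uv[0] = w0 * v0->uv[0] + w1 * v1->uv[0] + w2 * v2->uv[0];
    out_uv[1] = w0 * v0->uv[1] + w1 * v1->uv[1] + w2 * v2->uv[1];
}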
More like never. For one, it would be far too complicated for others to switch to deferred rendering architectures; on the other hand, I don't see any bandwidth brick wall in sight, to be honest, especially if you consider how narrow the bus widths on current SoCs still are.
Think of it this way: IMG, being the only IHV with a true TBDR architecture for all those years, has patented whatever they could have patented so far.
Quote:
I do feel that a TBDR would work best with a CPU like Cell sitting before it and handling as much of the vertex workload as possible, if not all of it. I fail to see how the TBDR part comes into play as far as VS is concerned. It seems like sharing the same ALUs for VS and PS workloads would be less useful than on other architectures.

Why would it be less useful just because VS handling isn't much different?