iPad 2

I'd rather say 'on low performance devices'.

I think the 'bandwidth brick wall' is getting worse for TBDR faster than for IMR.
The only way bandwidth might increase on iPad 2 is through using more polys, which might result in more overdraw. But once you have enough compute power, your models start to get more and more detailed (the extra polys don't come from more models, which would mean more overdraw, but from more detailed models, so overdraw stays roughly the same). Then you'll get bandwidth bound because of all the interpolator-data you 'cache' in video memory for TBDR. Let's say you have 1M vertices per frame (for simplicity's sake, not that I claim games would use that many on iPad 2), each with
float4 position
float3 tangent
float4 bitangent
float2 UV
-> 52 bytes per vertex, i.e. 52 MB per frame.
Drawing to a 1024x768 RGBA8 target means 6 MB for Z + color, so up to roughly 9x overdraw you still end up cheaper bandwidth-wise when running in forward mode.
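For anyone who wants to check those figures, here's a quick back-of-the-envelope sketch in C (the 32-bit depth value and decimal megabytes are my assumptions, and the 1M vertex count is the illustrative number from above, not a measurement):

Code:
#include <stdio.h>

int main(void)
{
    /* Vertex layout from above: float4 pos + float3 tangent
     * + float4 bitangent + float2 UV = 52 bytes per vertex. */
    const long vertices       = 1000000;
    const long bytes_per_vert = 16 + 12 + 16 + 8;

    /* 1024x768 target, RGBA8 colour + assumed 32-bit Z = 8 bytes/pixel. */
    const long bytes_per_pixel = 4 + 4;

    const double interp_mb = vertices * bytes_per_vert / 1e6;     /* ~52 MB  */
    const double fb_mb     = 1024 * 768 * bytes_per_pixel / 1e6;  /* ~6.3 MB */

    printf("interpolator spill : %.0f MB\n", interp_mb);
    printf("Z + colour buffer  : %.1f MB\n", fb_mb);
    printf("break-even overdraw: ~%.1fx\n", interp_mb / fb_mb);   /* ~8-9x   */
    return 0;
}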

If you really get pixel bound, a simple z-pass will probably result in the same pixel shader load on forward pipelines as on TBDR pipelines, thanks to EarlyZ optimizations.
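In case it helps, here's a minimal sketch of the kind of depth pre-pass I mean, assuming OpenGL ES 2.0; drawOpaqueGeometry() and the two program handles are hypothetical placeholders, not part of any real API:

Code:
#include <GLES2/gl2.h>

/* Hypothetical app-side helpers, just for illustration. */
extern void drawOpaqueGeometry(void);
extern GLuint depthOnlyProgram, shadingProgram;

void renderFrame(void)
{
    /* Pass 1: lay down depth only - no colour writes, trivial fragment work. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    glUseProgram(depthOnlyProgram);
    drawOpaqueGeometry();

    /* Pass 2: full shading. Depth is already final, so EarlyZ can reject
     * hidden fragments before the expensive pixel shader ever runs. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    glUseProgram(shadingProgram);
    drawOpaqueGeometry();
}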

I think time is on AMD/Nvidia's side. I'd really like to know how Imagination would handle tessellation if NVidia were to add that 'poison pill' to one of the future Tegra GPUs.

Does anyone have any information about the memory bandwidth the iPad might have?

You're ignoring Z buffer read/modify/writes, FB read/modify/writes on translucency, and the saving from deferred texturing. A Z pre-pass massively increases Z buffer and input geometry bandwidth, so although it effectively nullifies the deferred advantage, it isn't the bandwidth saviour that you think it is.

The higher poly counts also tend to go into more detailed and immersive environments, not just more detailed models, e.g. an environment is more likely to have thousands of rocks in it instead of one extremely detailed rock. This means polygon sizes have not reduced as much as you'd think, which means overdraw has tended to increase.

Tessellation only presents a problem to TBDR if you use a dumb implementation; a correct implementation can turn tessellation into a huge advantage for TBDR.

The reality is that if you look at real application data instead of back-of-the-envelope numbers, you'll find that TBDR still has a bandwidth advantage over EZ-IMR.

John.
 
I don't understand what you mean by "interpolator-data you 'cache' in video memory". Can you explain?
The data that is passed from the vertex shader to the pixel shader (or more correctly, to the interpolators that generate the fragments), usually called 'interpolators' because it is per-vertex information interpolated across the triangle. On a TBDR you work in two passes; the first is vertex processing, which spills out that interpolator data into video memory for use in the 2nd pass.
 
We interpolate 'on-demand' for the tile, and it's not coupled to vertex processing in that way.
 
which spills out that interpolator data into video memory for use in the 2nd pass.
No. PowerVR systems don't work like that. Perhaps you are getting confused with "software" deferred rendering.
 
You're ignoring Z buffer read/modify/writes, FB read/modify/writes on translucency, and the saving from deferred texturing.
I think after a z-pass, using EarlyZ, you have the same amount of texture fetching.
You're right about the transparent objects, but although you save the FB read/write, you still have a high cost for those fragments: since the only thing you save is the FB read/write, you still execute the whole fragment program and fetch all the textures. I'd say if you really want to use that as the advantage, you'll just be less slow with TBDR :)

A Z pre-pass massively increases Z buffer and input geometry bandwidth, so although it effectively nullifies the deferred advantage, it isn't the bandwidth saviour that you think it is.
The z-pass is usually very, very cheap. You're right that you're either input limited or vertex shader limited, but if that's your limit, you're extremely fast.
On a TBDR it's probably more costly to output the interpolator data than it is to read the vertex data during a z-pass on an IMR.
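To put rough numbers on that comparison (my assumptions: the z-pass only fetches a float4 position per vertex, and on the TBDR the 52-byte interpolator record from earlier is written once and read back once, which is how I understand the two-pass model):

Code:
#include <stdio.h>

int main(void)
{
    const long vertices = 1000000;

    /* IMR z-pass: only the position stream is fetched, once. */
    const long zpass_mb = vertices * 16 / 1000000;        /* 16 MB  */

    /* TBDR (as I understand it): full interpolator record written in
     * pass 1 and read back in pass 2. */
    const long tbdr_mb  = vertices * 52 * 2 / 1000000;    /* 104 MB */

    printf("z-pass geometry read      : %ld MB\n", zpass_mb);
    printf("TBDR interpolator traffic : %ld MB\n", tbdr_mb);
    return 0;
}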

The higher poly counts also tend to go into more detailed and immersive environments, not just more detailed models, e.g. an environment is more likely to have thousands of rocks in it instead of one extremely detailed rock. This means polygon sizes have not reduced as much as you'd think, which means overdraw has tended to increase.
With a z-pass, I'd assume that just increases your z-reject load; the number of visible pixels stays the same, and both architectures should react to that in the same way.

Tessellation only presents a problem to TBDR if you use a dumb implementation; a correct implementation can turn tessellation into a huge advantage for TBDR.
You either spill out the tessellated geometry, which would result in a huge amount of data, or you execute the tessellation many times (once for every tile), which in the worst case would mean executing the domain shader hundreds of times per vertex.
Am I missing a smarter (third) solution?
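To make those two options concrete, here's a rough sketch with made-up numbers (the patch count, tessellation factor, per-tile coverage and the (N+1)^2 vertex estimate for a quad domain are all assumptions, not measurements):

Code:
#include <stdio.h>

int main(void)
{
    const long patches         = 10000;
    const long tess_factor     = 16;
    const long verts_per_patch = (tess_factor + 1) * (tess_factor + 1); /* ~(N+1)^2 */
    const long bytes_per_vert  = 52;     /* interpolator layout from earlier */

    /* Option 1: tessellate once and spill the expanded geometry. */
    const long spill_mb = patches * verts_per_patch * bytes_per_vert / 1000000;

    /* Option 2: keep only the patches and re-run tessellation + domain
     * shader for every tile a patch touches. */
    const long tiles_touched = 50;       /* assumed worst-ish case for a big patch */

    printf("option 1: spilled geometry  : ~%ld MB per frame\n", spill_mb);
    printf("option 2: domain shader runs: ~%ldx per vertex\n", tiles_touched);
    return 0;
}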

The reality is that if you look at real application data instead of back-of-the-envelope numbers, you'll find that TBDR still has a bandwidth advantage over EZ-IMR.
I didn't claim it has a disadvantage at the moment, but if the vertex count rises along with compute power (as it did on PC), you'll hit bandwidth issues sooner with TBDR than with EZ-IMR.
The only TBDR advantage I see that will probably always exist is with massive alpha-blend drawing, but that's not real-world (except maybe for some PS2 games).
 
Quad-core Tegra 3s won't force Apple's hand on performance. The CPU difference will be effectively marginal, and the A5 will still compare favorably in graphics.

But NVidia is also touting 2500x1600 support.

So what if, as they claim, there are Tegra 3 tablets later this year and they deliver their "Retina Displays" first?
 
No. PowerVR systems don't work like that. Perhaps you are getting confused with "software" deferred rendering.
I'm talking about TBDR: the hardware processes all vertices in the first pass and saves the vertex processing output into a temporary buffer (referenced by tiles), and in the second pass it reads, per tile, all the referenced data.
 
But NVidia is also touting 2500x1600 support.

So what if, as they claim, there are Tegra 3 tablets later this year and they deliver their "Retina Displays" first?

I would think LCD supply for high-resolution displays of that density would be the limiting factor, more so than the SoC's capability to handle the resolution.
 
We interpolate 'on-demand' for the tile, and it's not coupled to vertex processing in that way.
OK, let me explain it again in stages, and correct me where my misunderstanding is.

Code:
1. pass
  input: vertex data
     processed by ALU using the vertex program
  output: interpolator data
interpolator data saved into vmem


2. pass
  input: interpolator data
    processed by interpolation units
  output: fragments
  input: fragments
    processed by ALU using fragment programs
  output: pixel
pixel saved into vmem.
 
I'm talking about TBDR: the hardware processes all vertices in the first pass and saves the vertex processing output into a temporary buffer (referenced by tiles), and in the second pass it reads, per tile, all the referenced data.
Yes... I'm reasonably familiar with TBDR :D

I see from your reply to Rys that it seems to be a language problem. After the vertex shader, you don't really have "interpolated" data but, instead, transformed and projected vertices. When you wrote "interpolated", that immediately implied, to us, that it was per-pixel interpolated data, which is where the confusion came in.

UPDATE:
I just noticed...
Code:
2. pass
  input: [strike]interpolator [/strike] transformed/projected data
    processed by interpolation units
[B]  output: fragments
  input: fragments[/B]
    processed by ALU using fragment programs
  output: pixel
pixel saved into vmem.

Just want to check that you aren't assuming the "output:fragments input:fragments" are actually written/read.
 
As a side note, that 9x figure is more of a theoretical marketing number than anything else. Of course an SGX543 MP2 is a lot faster than an SGX535, but under texel/pixel fill-rate limited conditions it's only a factor of 2x (assuming the same frequencies) against the 535.
That would also be a purely theoretical number.

Yeah but they were so nice as to print the DRAM part number directly onto the chip, which I doubt we will see on the new A5.
Why would they change that?
 
Yes... I'm reasonably familiar with TBDR :D

I see from your reply to Rys that it seems to be a language problem. After the vertex shader, you don't really have "interpolated" data but, instead, transformed and projected vertices.
You're maybe right; in the D3D world the outputs of vertex shaders are called 'interpolator registers', which hold the interpolator data. It's a little bit ambiguous, as that is the input to the interpolation units, and at the same time the fragment units receive interpolator data in their own 'interpolator registers'.

When you wrote "interpolated", that immediately implied, to us, that it was per-pixel interpolated data, which is where the confusion came in.

UPDATE:
I just noticed...


Just want to check that you aren't assuming the "output:fragments input:fragments" are actually written/read.

Now that the language barrier is conquered :), I hope you guys see my point about why the future might be more challenging for TBDR architectures than for IMRs with respect to bandwidth.
 
No. I've heard these sort of "TBDR is doomed" arguments for over a decade. They didn't convince me then and they still don't.
 
No. I've heard these sort of "TBDR is doomed" arguments for over a decade. They didn't convince me then and they still don't.

You would be a workplace issue if you cried out "we are doomed... we are doomed!" all day :p.
I do feel that a TBDR would work best with a CPU like CELL sitting before it and handling as much of the vertex workload as possible, if not all of it. I fail to see how the TBDR part comes into play as far as VS is concerned. It seems like sharing the same ALUs for VS and PS workloads is less useful than it would be on other architectures.

A TBDR like the PowerVR designs seems to be best at handling workloads after the visibility phase has been dealt with: prepare the scene with a CPU that can handle the vertex workload (see Dreamcast) and let the GPU work on that data as fast as possible.

(Could you spread the word that iOS-specific tools, like an OpenGL ES debugger and performance profiler that run on OS X, would be nice? ;))
 
No. I've heard these sort of "TBDR is doomed" arguments for over a decade. They didn't convince me then and they still don't.
I could have guessed I'm not the only one who has ever replied to the "AMD/NVidia is doomed" argument that way; obvious things are obvious :).
 
You're maybe right; in the D3D world the outputs of vertex shaders are called 'interpolator registers', which hold the interpolator data. It's a little bit ambiguous, as that is the input to the interpolation units, and at the same time the fragment units receive interpolator data in their own 'interpolator registers'.
What you've just described there is implementation specific in the main. You need to interpolate across the triangle for the per-pixel values, but we all do it differently and potentially at different stages. We don't go off chip for those values, so there's no external bandwidth cost.
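For anyone following along, this is roughly what "interpolating across the triangle for the per-pixel values" boils down to; a minimal, non-perspective-correct C sketch (the struct and function names are mine, purely illustrative):

Code:
typedef struct { float x, y, z, w; } float4;

/* Blend three per-vertex attributes with barycentric weights (wa+wb+wc = 1)
 * to get the value at one pixel. Real hardware also does perspective
 * correction and may evaluate this at different pipeline stages. */
static float4 interpolate(float4 a, float4 b, float4 c,
                          float wa, float wb, float wc)
{
    float4 r;
    r.x = wa * a.x + wb * b.x + wc * c.x;
    r.y = wa * a.y + wb * b.y + wc * c.y;
    r.z = wa * a.z + wb * b.z + wc * c.z;
    r.w = wa * a.w + wb * b.w + wc * c.w;
    return r;
}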
 
More like never. For one, it would be far too complicated for others to switch to deferred rendering architectures, and on the other hand I don't see any bandwidth brick wall in sight, to be honest, especially if you consider how narrow the bus widths on current SoCs still are.

Think of it this way: IMG, having been the only IHV with a true TBDR architecture for all those years, has patented whatever could be patented so far.

Well, they still have all those GigaPixel patents, don't they?
 
If we count just bandwidth and no other advantages, isn't a cache actually almost the same?
If they increase the cache size in traditional architectures, they end up with something similar to TBDR. And as the process node shrinks, the on-chip memory can get bigger and save even more bandwidth and power.
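For a sense of scale, here are the on-chip sizes the two approaches imply, using the 1024x768 Z+colour target from earlier (the 32x32 tile size and 32-bit depth are my assumptions; real tile sizes vary):

Code:
#include <stdio.h>

int main(void)
{
    const long bytes_per_pixel = 4 + 4;                  /* RGBA8 + assumed 32-bit Z */
    const long full_target     = 1024 * 768 * bytes_per_pixel;
    const long one_tile        = 32 * 32 * bytes_per_pixel;

    printf("whole render target on chip: %ld KB\n", full_target / 1024); /* 6144 KB */
    printf("one 32x32 tile on chip     : %ld KB\n", one_tile / 1024);    /* 8 KB    */
    return 0;
}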
 
I do feel that a TBDR would work best with a CPU like CELL sitting before it and handling as much of the vertex workload as possible, if not all of it. I fail to see how the TBDR part comes into play as far as VS is concerned. It seems like sharing the same ALUs for VS and PS workloads is less useful than it would be on other architectures.
Why would it be less useful just because VS handling isn't much different?
 