Sony's Next Generation Portable unveiling - PSP2 in disguise

Ailuros said:
As for where you're standing, uhmm trust the more experienced one out of the two ;)
Where did we say different things? ;)
 
I'm not trying to start a flamewar between comrades or anything.. but where are we standing exactly?

Is it "so low" that you consider it "non significant"?
What ratios are we talking about? For each 100% increase in cores, you'll need 10% increase in memory bandwidth? More? Less? Not allowed to specify? :(

Hmm, here I'll say it again for you,
Actually I think we can be specific in saying that there is no significant change in memory cost associated with multi-core; I'm not sure why anyone would think there was.
 
I'm not trying to start a flamewar between comrades or anything.. but where are we standing exactly?

Is it "so low" that you consider it "non significant"?
What ratios are we talking about? For each 100% increase in cores, you'll need 10% increase in memory bandwidth? More? Less? Not allowed to specify? :(

This is getting strange - the amount of bandwidth needed per core will depend on what you're trying to do with it. And what the guys from IMG are saying is that if, for example, Apple wanted to quadruple the number of pixels on the iPad3 and therefore (hypothetically) used a 543MP8 configuration, or four times the number of cores, those cores would use four times the bandwidth to do four times the work; they wouldn't incur any extra penalty.

The real difficulty lies in predicting what types of code your customers want to run, deciding what resources to spend on memory paths, and designing for maximum efficiency in terms of gates/power/cost. If you over-engineer, your design will be bloated with baggage that goes largely unused, and you leave an open window for your competitors to do more with less. On the other hand, you obviously want to provide the capabilities the customer may want, as well as juicy new IP to license. So IMG offers both cores at different levels of complexity and the possibility to widen many of them to fit perceived need.

What I would like to see is bandwidth usage data for different but typical tasks, for, say, IMG, Mali, and Tegra respectively.
 
This is getting strange - the amount of bandwidth needed per core will depend on what you're trying to do with it. And what the guys from IMG are saying is that if, for example, Apple wanted to quadruple the number of pixels on the iPad3 and therefore (hypothetically) used a 543MP8 configuration, or four times the number of cores, those cores would use four times the bandwidth to do four times the work; they wouldn't incur any extra penalty.
Yes. And I think some were suggesting that as you introduce more cores there's an overhead, so instead of scaling linearly the bandwidth grows faster than that. E.g. with linear scaling, 1 core would use 5 GB/s, say; 2 cores 10 GB/s; 4 cores 20 GB/s. Whereas the suggestion is that with overhead, where 1 core consumes 5 GB/s, a second would push the total up to 12 GB/s, and a quad-core would go to 28 GB/s.

Both Rys and JohnH are telling us that memory usage is linear, based on workload. You'll clearly need X times as much memory to drive X number of cores as they all consume data, but there's no additional penalty.
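Just to make the two hypotheses concrete, a throwaway sketch; the 5 GB/s per-core figure and the 2 GB/s overhead term are made-up numbers for illustration, not anything from IMG:

```c
/* Toy comparison of the two scaling hypotheses above: purely linear scaling
 * versus a hypothetical fixed overhead for every core added.  All numbers
 * are invented for illustration; they are not IMG figures. */
#include <stdio.h>

int main(void)
{
    const double per_core_gbps = 5.0;  /* assumed bandwidth one core needs for its share of the work */
    const double overhead_gbps = 2.0;  /* hypothetical extra bandwidth cost per additional core */

    for (int cores = 1; cores <= 8; cores *= 2) {
        double linear        = per_core_gbps * cores;
        double with_overhead = per_core_gbps * cores + overhead_gbps * (cores - 1);
        printf("%d core(s): linear %.0f GB/s, with per-core overhead %.0f GB/s\n",
               cores, linear, with_overhead);
    }
    return 0;
}
```

What Rys and JohnH seem to be saying is that, for SGX MP, the overhead term is effectively zero, so only the linear column applies, and only when the workload actually grows along with the core count.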
 
For the same number of pixels, bandwidth goes up almost negligibly when you add cores to work on them. So it's not linear at all (for us anyway) when doing MP.
 
Yes. And I think some were suggesting that as you introduce more cores there's an overhead, so instead of scaling linearly the bandwidth grows faster than that. E.g. with linear scaling, 1 core would use 5 GB/s, say; 2 cores 10 GB/s; 4 cores 20 GB/s. Whereas the suggestion is that with overhead, where 1 core consumes 5 GB/s, a second would push the total up to 12 GB/s, and a quad-core would go to 28 GB/s.

Both Rys and JohnH are telling us that memory usage is linear, based on workload. You'll clearly need X times as much memory to drive X number of cores as they all consume data, but there's no additional penalty.

Why would you need X times as much memory for X cores, assuming that their job is to render a scene X times faster than a single core would? The amount of data assets wouldn't increase; same framebuffer size, same textures, and as far as binning is concerned I expect that to be the same too - you'd just be distributing the bins between different cores.

As far as bandwidth goes, there'd be no increase in outgoing traffic to render targets, since this is subdivided between cores with no overlap. There may be some instances where the same data needs to be loaded into separate cores where it would have been retained in the cache of a single core, but in that case it'll either stay in the cache of all the cores that loaded it, or it wouldn't have stayed in the cache of the single core either. In fact, the multiple cores should cache textures better because they'll have smaller working sets but the same amount of cache each (presumably). And if they have a shared L2 cache that's even better; I'd fully expect bandwidth requirements to go down with this, not up. Maybe someone can tell me if Series5XT MP has anything like this (like Mali-400MP does).

(then again, someone please tell me if there's some glaring flaw in my reasoning)
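A back-of-the-envelope version of that working-set argument, with every figure (resolution, fetches per pixel, miss rates) assumed purely for illustration:

```c
/* Toy texture-bandwidth estimate: splitting the screen between cores shrinks
 * each core's working set, which can only keep or improve the cache hit rate.
 * All figures below are assumptions for illustration, not measured data. */
#include <stdio.h>

int main(void)
{
    const double pixels_per_frame  = 1024.0 * 768.0;
    const double frames_per_second = 60.0;
    const double fetches_per_pixel = 4.0;   /* assumed texel fetches per shaded pixel */
    const double bytes_per_texel   = 4.0;

    /* assumed miss rates: same cache size per core, smaller working set -> fewer misses */
    const double miss_single_core  = 0.30;
    const double miss_per_mp_core  = 0.25;

    /* total fetch rate is the same either way; only the miss rate changes */
    double fetch_rate = pixels_per_frame * frames_per_second * fetches_per_pixel;
    double bw_single  = fetch_rate * bytes_per_texel * miss_single_core / 1e9;
    double bw_mp4     = fetch_rate * bytes_per_texel * miss_per_mp_core / 1e9; /* summed over 4 cores */

    printf("single core : %.2f GB/s of texture traffic\n", bw_single);
    printf("MP4 (total) : %.2f GB/s of texture traffic\n", bw_mp4);
    return 0;
}
```

The absolute numbers are meaningless; the point is just that the per-core miss rate can only stay the same or improve when each core sees a smaller slice of the screen.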
 
For the same number of pixels, bandwidth goes up almost negligibly when you add cores to work on them. So it's not linear at all (for us anyway) when doing MP.

Yes, but if the workload increases for a config with N more cores, then naturally the bandwidth requirements should increase too. I know it's an obvious clarification since it's self-explanatory, but some of us unfortunately need a KISS approach to grasp it more easily.

Otherwise, for a given workload X it's in theory irrelevant whether you have N pipelines in a hypothetical single core or the same N pipelines spread over Y cores; the bandwidth requirements are fairly similar, yes?
 
Why would you need X times as much memory for X cores,
I missed a 'bandwidth' there.

For the same number of pixels, bandwidth goes up almost negligibly when you add cores to work on them. So it's not linear at all (for us anyway) when doing MP.

Yes, but if the workload increases for a config with N more cores, then naturally the bandwidth requirements should increase too.
Yeah, that's what I was getting at. If you are targeting a more powerful GPU, you'll feed it more data needing more BW. If you have the same graphics workload and start adding more cores, it won't change the bandwidth usage as it's the same amount of data, just being processed faster. Clearly the choice to go with a four or eight core SGX in a device would have to be coupled with a choice to increase RAM BW accordingly to feed them, but not with any overhead that means more cores = less memory efficiency.
 
Isn't the main benefit of TBDRs that you don't need to go off-chip as IMRs do, meaning you need less bandwidth for a given workload?
 
Clearly the choice to go with a four or eight core SGX in a device would have to be coupled with a choice to increase RAM BW accordingly to feed them, but not with any overhead that means more cores = less memory efficiency.

That was exactly the info I was looking for.

By increasing the number of cores, one assumes the purpose is to also increase the number of rendered pixels, the amount of geometry, post-processing effects, texture resolution, etc.
And in doing so, the SGX5 architecture would naturally need an increase in the memory bandwidth available to the whole GPU (not memory bandwidth per core), or the graphics subsystem would eventually face a bottleneck.


Maybe because I wasn't using the right terms, the answers I was getting weren't to the question I was asking, hence all the confusion.
 
That was exactly the info I was looking for.

By increasing the number of cores, one assumes the purpose is to also increase the number of rendered pixels, the amount of geometry, post-processing effects, texture resolution, etc.
And in doing so, the SGX5 architecture would naturally need an increase in the memory bandwidth available to the whole GPU (not memory bandwidth per core), or the graphics subsystem would eventually face a bottleneck.


Maybe because I wasn't using the right terms, the answers I was getting weren't to the question I was asking, hence all the confusion.

It's perfectly reasonable to ask for an explanation in such a case. Unless I've understood something wrong, bandwidth requirements will increase with N times more workload, but it's fairly irrelevant whether you have a single GPU with, say, 12 pipelines or an MP3 with 4 pipelines per core.
 
To put this simply, generally more pixels/s = more bytes/s; however, the relationship isn't linear due to sharing of data within caches.

MP itself, or at least our implementation of it, has negligible additional cost, so its use doesn't particularly influence how efficiently we use memory.

John.
 
Isn't the main benefit of TBDRs that you don't need to go off-chip as IMRs do, meaning you need less bandwidth for a given workload?

I'm assuming you meant to say "as much as IMRs" there, not claim that you don't have to go off-chip at all. That's the benefit of the "TB" part, which saves on render-target bandwidth, mainly by reducing standard workloads to a single stream to the framebuffer instead of multiple writes per overdrawn pixel or read-modify-writes for alpha blending, and by not having to update a depth/stencil buffer off-chip (note that depth buffer compression techniques on modern IMRs will help reduce the number of times you have to update the raw buffer due to overdraw, but you can still count on at least once per pixel, I think).

Render to texture probably still goes off-chip then back on; I wouldn't expect there's any optimization bypassing this in the case where the texture is used in the same tile, which is probably not all that common.

The "DR" part saves compute resources and bandwidth, mainly by preventing texture lookups for occluded pixels. Of course this only translates to bandwidth savings when the texture isn't in cache.

On the flip side, more bandwidth is used for vertex processing vs. an IMR, because the vertex stream has to be output back to main memory in the form of binned data, then read back in the tile processing stages. Some claim that this is something like "doubling" the bandwidth, but in reality it's a lot less, because the binned data doesn't include anything that is frustum or backface culled (the latter is usually somewhere near 50% of the triangles) and it's compressed. And not all data has to be read for every stage. Also, with indexed vertexes on an IMR you can't stream all the vertexes straight to the GPU: what you're indexing has to be resident in memory first, and for static data it has to be read from memory anyway. With a shared memory device I think it'd look something like this (see the rough binning sketch after the two lists):

IMR:
- CPU writes vertexes with all data to RAM
- CPU writes vertex indexes to GPU command FIFO
- GPU reads + shades vertexes with all data from vertex cache which reads from RAM where necessary, dispatches for rendering

Tile based:
- CPU writes vertexes with all data to RAM
- CPU writes vertex indexes to GPU command FIFO
- GPU reads vertex clip-space coordinates from RAM, culls/clips, then for passing triangles reads the rest of the vertex data, compresses it, and writes it to the tile bins (plus some additional bandwidth for maintaining the tiling data structures, which I'm sure are cached to some extent)
- GPU reads from tile bin to render the tile
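Here's the rough binning sketch mentioned above, just to make that flow concrete; the structures and tile size are made up and nothing here is IMG-specific:

```c
/* Hypothetical sketch of the tile-binning step in the list above: take a
 * screen-space triangle, cull it if backfacing (winding convention assumed),
 * then reference it from every tile its bounding box overlaps (no splitting).
 * Real hardware does this in fixed function and writes compressed per-tile
 * parameter data to RAM rather than bumping a simple counter. */
#include <stdint.h>

#define TILE_SIZE 32
#define SCREEN_W  960
#define SCREEN_H  544
#define TILES_X   ((SCREEN_W + TILE_SIZE - 1) / TILE_SIZE)
#define TILES_Y   ((SCREEN_H + TILE_SIZE - 1) / TILE_SIZE)

typedef struct { float x, y; } ScreenPos;    /* vertex already projected to pixel coordinates */

uint32_t bin_count[TILES_Y][TILES_X];        /* stand-in for the per-tile bins in RAM */

void bin_triangle(const ScreenPos v[3])
{
    /* backface/degenerate cull via signed area: culled triangles never
     * reach the bins, so they cost no binning bandwidth at all */
    float area = (v[1].x - v[0].x) * (v[2].y - v[0].y)
               - (v[2].x - v[0].x) * (v[1].y - v[0].y);
    if (area <= 0.0f)
        return;

    /* bounding box of the triangle, converted to tile coordinates */
    float minx = v[0].x, maxx = v[0].x, miny = v[0].y, maxy = v[0].y;
    for (int i = 1; i < 3; ++i) {
        if (v[i].x < minx) minx = v[i].x;
        if (v[i].x > maxx) maxx = v[i].x;
        if (v[i].y < miny) miny = v[i].y;
        if (v[i].y > maxy) maxy = v[i].y;
    }
    int tx0 = (int)(minx / TILE_SIZE), tx1 = (int)(maxx / TILE_SIZE);
    int ty0 = (int)(miny / TILE_SIZE), ty1 = (int)(maxy / TILE_SIZE);
    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;
    if (tx1 >= TILES_X) tx1 = TILES_X - 1;
    if (ty1 >= TILES_Y) ty1 = TILES_Y - 1;

    /* the same triangle is simply referenced from every tile it touches */
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bin_count[ty][tx]++;
}
```

The bin writes here (and the corresponding reads at tile-render time) are the extra vertex-path bandwidth in question.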

There is some bandwidth increase as well where the tile binner has to create new vertexes, or where vertexes get duplicated when triangles are split across/included in multiple tiles. It'd actually be interesting to see some figures on just how many additional vertexes are created for a tiler that splits triangles. I've heard that Mali and maybe others don't split triangles at all and render the whole thing in each tile they're present in, with guard-band clipping. If true, I wonder what the actual cost of the guard band is - whether it's something that scales with the number of scanlines or, even worse, has to reject individual pixels that fall outside of it (there's no way that'd be true for a tiler, right?)
 
There is some bandwidth increase as well where the tile binner has to create new vertexes, or where vertexes get duplicated when triangles are split across/included in multiple tiles. It'd actually be interesting to see some figures on just how many additional vertexes are created for a tiler that splits triangles. I've heard that Mali and maybe others don't split triangles at all and render the whole thing in each tile they're present in, with guard-band clipping. If true, I wonder what the actual cost of the guard band is - whether it's something that scales with the number of scanlines or, even worse, has to reject individual pixels that fall outside of it (there's no way that'd be true for a tiler, right?)

I'm not aware of any tiler that splits triangles that cross tiles, although triangles will be read and set up for each overlapped tile. There is also no need to evaluate pixels outside of the tile: crudely, for each pixel in the tile you just test whether it lies within the triangle, and there are many ways to minimise the number of tests, i.e. you don't actually have to evaluate every pixel within a tile if a triangle doesn't cover them.

John.
 
Ah I see, yeah I don't know why I figured you'd need geometric clipping against the tile edges...

So I guess it's something like this:
- For tile coordinate x/y, for each triangle in bin (that isn't optimized out from the test somehow) convert to barycentric depth representation of triangle
- If it's outside of 0 to 1, reject
- Convert to screen-space depth and compare to see if it's the front-most pixel, also storing triangle ID
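Something like this in code, I guess; just my own sketch of the idea, using edge functions to get the barycentric weights, not how SGX actually implements its HSR:

```c
/* Rough sketch of that per-tile test: for each pixel, compute barycentric
 * weights from edge functions, reject the pixel if it's outside the triangle,
 * otherwise interpolate depth and keep the front-most triangle's ID. */
#include <stdint.h>

#define TILE_SIZE 32

typedef struct { float x, y, z; } Vtx;     /* screen-space position plus depth */

typedef struct {
    float    depth[TILE_SIZE][TILE_SIZE];  /* assumed initialised to +infinity per tile */
    uint32_t tri_id[TILE_SIZE][TILE_SIZE];
} Tile;

static float edge(const Vtx *a, const Vtx *b, float px, float py)
{
    return (px - a->x) * (b->y - a->y) - (py - a->y) * (b->x - a->x);
}

void test_triangle_in_tile(Tile *tile, float tile_x0, float tile_y0,
                           const Vtx v[3], uint32_t tri_id)
{
    float area = edge(&v[0], &v[1], v[2].x, v[2].y);
    if (area == 0.0f)
        return;                            /* degenerate triangle */

    for (int y = 0; y < TILE_SIZE; ++y) {
        for (int x = 0; x < TILE_SIZE; ++x) {
            float px = tile_x0 + x + 0.5f;
            float py = tile_y0 + y + 0.5f;

            /* normalised barycentric weights; any negative weight means
             * the pixel lies outside the triangle, so reject it */
            float w0 = edge(&v[1], &v[2], px, py) / area;
            float w1 = edge(&v[2], &v[0], px, py) / area;
            float w2 = edge(&v[0], &v[1], px, py) / area;
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f)
                continue;

            /* depth-test against what's already in the on-chip tile buffer */
            float z = w0 * v[0].z + w1 * v[1].z + w2 * v[2].z;
            if (z < tile->depth[y][x]) {
                tile->depth[y][x]  = z;
                tile->tri_id[y][x] = tri_id;
            }
        }
    }
}
```

And as JohnH says, you wouldn't actually walk every pixel like this; coarser tests can reject whole blocks the triangle can't touch.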
 
I'll just... leave this here. *shrug*

For instance, HDR is the latest thing in PC GPU rendering: 16-bit-per-channel floating point (FP16) in a standard 64-bit buffer, i.e. 64 bits per pixel, which is not practical for SGX543 in real systems. So the "+" in NGP's SGX543MP4+ seems to indicate support for a 9995 buffer that delivers decent high dynamic range in 32 bits: 9 bits for each RGB channel, with a shared 5-bit exponent stored for all three, and the format can be used both as a texture and as a render target.
http://translate.googleusercontent....7.html&usg=ALkJrhhMQojA4k72Jm9GPVJ4--nNwLYw2w

http://translate.googleusercontent....7.html&usg=ALkJrhj-9L_I5ujH81-31GDQgtfm0QMJig


Nothing new I'm sure... but eh.
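For reference, that "9995" buffer sounds like the usual RGB9E5 shared-exponent layout; here's a minimal packing sketch along those lines (my own illustration of the standard format, nothing confirmed about how NGP exposes it):

```c
/* Sketch of packing three floats into a "9995" (RGB9E5) value: 9-bit
 * mantissas for R, G, B sharing a single 5-bit exponent, following the
 * standard shared-exponent conventions (bias 15).  Illustration only. */
#include <math.h>
#include <stdint.h>

uint32_t pack_rgb9e5(float r, float g, float b)
{
    const int N = 9, B = 15;                             /* mantissa bits, exponent bias */
    const float max_val = (511.0f / 512.0f) * 65536.0f;  /* largest representable value */

    /* clamp components to the representable range */
    float rc = fminf(fmaxf(r, 0.0f), max_val);
    float gc = fminf(fmaxf(g, 0.0f), max_val);
    float bc = fminf(fmaxf(b, 0.0f), max_val);
    float maxc = fmaxf(rc, fmaxf(gc, bc));
    if (maxc == 0.0f)
        return 0;                                        /* black packs to zero */

    /* shared exponent is chosen from the largest component */
    int exp_shared = (int)fmaxf(0.0f, floorf(log2f(maxc)) + 1.0f + B);
    float scale = exp2f((float)(exp_shared - B - N));

    /* bump the exponent if rounding would overflow the 9-bit mantissa */
    if ((int)(maxc / scale + 0.5f) == (1 << N)) {
        scale *= 2.0f;
        exp_shared += 1;
    }

    uint32_t rm = (uint32_t)(rc / scale + 0.5f);
    uint32_t gm = (uint32_t)(gc / scale + 0.5f);
    uint32_t bm = (uint32_t)(bc / scale + 0.5f);

    /* bit layout as in the standard RGB9E5 format: R low, exponent in the top 5 bits */
    return rm | (gm << 9) | (bm << 18) | ((uint32_t)exp_shared << 27);
}
```

Each channel ends up with 9 bits of precision at whatever scale the brightest channel dictates, which is why it works as a 32-bit HDR render format as long as the three channels don't differ too wildly in magnitude.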
 
Yeah, NGP uses the OGL ES-centric SGX543, not the DX9-level SGX544, so that kind of thing suffers. I sure as hell hope they're not suggesting using 9-bit integer channels for HDR though; that's not really an option.

Interestingly they've got programmable blending so NAO32 should work extremely well. But it's not clear how you would handle MSAA resolve there - unlike on an IMR, it would be very expensive (performance *and* memory footprint) to do resolve in a pixel shader so here's hoping you can do better than that...
 
Is there a comparison around of the different 32-bit encoded HDR formats? I recall the Bungie one, but it wasn't really an interesting image or a real-world example (just... "such and such format is sucky at this range").

....though I wonder if it really matters for the portable screen.
 
AlStrong, that sounds like it's the writer's conjecture, but I can't tell due to the wonderful quality of the auto-translation. Where is our friendly neighborhood console forum translator when we need him? :p
 