What to expect from Adreno 225 & Krait

Mmm, that seems a bit low considering it is supposed to be a brand new architecture, and also a 'quad core?' (which, let's face it, could mean any one of a number of things these days... I will take it as full GPU cores, most likely 4x305 or something).

You would assume they would be able to multiply the ALUs up cheaply and get much higher marketing numbers. I mean, a quad core new architecture that is only 90% faster than its old-generation single core??:???:
AFAICT, 225/305/320 have 2/1/4 TMUs respectively. So with only twice as many TMUs and possibly not twice as much effective memory bandwidth, ~2x performance seems like a good guess to me given that Adreno 2xx was already very strong in the ALU department. That doesn't mean it cannot be a better/more balanced/whatever architecture for other reasons but it seems to me that Qualcomm's performance estimates make sense.
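Just to make that back-of-envelope reasoning explicit, here's a minimal toy model (Python; all scaling factors are assumptions for illustration, not measured Adreno figures) of why doubling the TMUs while bandwidth grows somewhat less than 2x still lands near the ~2x claim for a texture/bandwidth-bound workload:

[code]
# Toy bottleneck model: for a texture/bandwidth-bound workload the speedup
# is capped by whichever of those resources scales least. The scaling
# factors below are assumptions for illustration, not measured figures.

def speedup(tmu_scale, bw_scale):
    # Adreno 2xx was already ALU-heavy, so assume the ALUs aren't the
    # limiter and frame time is dominated by texturing and memory traffic.
    return min(tmu_scale, bw_scale)

print(speedup(tmu_scale=2.0, bw_scale=1.8))  # -> 1.8, i.e. roughly the ~2x claim
[/code]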
 
AFAICT, 225/305/320 have 2/1/4 TMUs respectively. So with only twice as many TMUs and possibly not twice as much effective memory bandwidth, ~2x performance seems like a good guess to me given that Adreno 2xx was already very strong in the ALU department. That doesn't mean it cannot be a better/more balanced/whatever architecture for other reasons but it seems to me that Qualcomm's performance estimates make sense.

Well, I suppose I was comparing the 225-320 projections to ARM's Mali-400 MP4 to T604 jump, which was '5' times, and then the T658, which they say is 2-4 times faster than that with the same number of unified shaders as Adreno 220!?:???:

According to Anand the 220 has a lot of untapped performance under the hood, so even with the same TMUs and the same number of shaders, the efficiency increase a new architecture should bring would deliver a 2x improvement without any new hardware... just my take, and that's without bringing IMG Tech's Series 5 - Rogue into it.

Nevertheless, I agree Qualcomm usually delivers a solid 2x when they say they will.
 
The Adreno design hasn't changed much since the first Snapdragon. Additional ALUs have been added and frequencies have increased, but the base shader architecture and (relatively bad) drivers have remained. This makes performance increases relatively predictable compared to others.

The 300 series will be the first time a new shader architecture is used, so we'll see how well that fares. A surprisingly big limitation thus far has been the CPU cost of binning scenes. That can be taken care of either by brute force (a faster CPU) or by having the driver filter out the cases where someone calls draw 50k times more than they should.
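Purely as an illustration of that second option, here's a minimal Python sketch of the kind of redundant-draw coalescing a driver could do before binning; DrawCall and its fields are hypothetical names, not any real driver interface:

[code]
# Collapse consecutive draw calls that share identical state into one batch
# before binning, so the CPU has far fewer commands to bin.
from dataclasses import dataclass

@dataclass
class DrawCall:                 # hypothetical, for illustration only
    state_key: int              # hash of shader + render state
    first_vertex: int
    vertex_count: int

def coalesce(draws):
    batched = []
    for d in draws:
        if d.vertex_count == 0:
            continue            # nothing to render, drop it
        prev = batched[-1] if batched else None
        if (prev and prev.state_key == d.state_key
                and prev.first_vertex + prev.vertex_count == d.first_vertex):
            prev.vertex_count += d.vertex_count   # merge contiguous ranges
        else:
            batched.append(DrawCall(d.state_key, d.first_vertex, d.vertex_count))
    return batched

# 50k tiny draws with the same state collapse into a single binned command.
draws = [DrawCall(0xABCD, i * 3, 3) for i in range(50_000)]
print(len(coalesce(draws)))     # -> 1
[/code]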
 
Between the jump in clock speed and the refined drivers, graphics on S4 platforms can definitely compete with Tegra 3.
 
Between the jump in clock speed and the refined drivers, graphics on S4 platforms can definitely compete with Tegra 3.

I think the only reason NVIDIA is able to put out any competitive graphics chips with such crummy innards is because they are so much better than everyone else with their drivers, especially in the mobile space where no one else has the experience of a GPU dogfight:smile:

I would love to see NVIDIA write drivers for Adreno... I think we would be surprised...
 
I think the only reason NVIDIA is able to put out any competitive graphics chips with such crummy innards is because they are so much better than everyone else at sending their own software engineers to developer studios in order to "assist" in developing the games

There, I fixed that for you.
 
NVIDIA's marketing claimed two things when they started out with the initial Tegra:

1. Tiling stinks for anything >DX7.
2. USC is questionable for anything up to DX9.

No, it wasn't worded exactly like that, but I don't recall the claims word for word either. Marketing wash aside, and any funky excuses in the direction of "what we don't have sucks", I always try to keep a broader perspective just in case. Assuming there's some truth to both, and given that the ULP GeForces in Tegras haven't come across as "weak" in terms of performance, I'd gladly stand corrected and find there's no truth whatsoever behind the above. If there is, then of course it's software related, but not in exactly the same sense as implied so far.

One good indication that could point in the above direction would be if USC-based tilers of the current generation suddenly fare quite a bit better against Tegra GPUs under OGL_ES3.0.

Have a look here: http://www.codeplay.com/company/partners.html

What would a company like Qualcomm need third party graphics compilers for if things aren't as complicated as I suspect them to be?

Between the jump in clock speed and the refined drivers, graphics on S4 platforms can definitely compete with Tegra 3.

Definitely; but I'd still expect 8 Vec4 USC ALUs @400MHz to deliver quite a bit more than 2 Vec4 PS + 1 Vec4 VS ALUs @520MHz.
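For what it's worth, the raw arithmetic (assuming one 4-wide MAD per Vec4 ALU per clock, which is an assumption rather than a confirmed figure for either part):

[code]
# Peak Vec4 throughput comparison; assumes each Vec4 ALU retires one
# 4-wide MAD per clock (8 flops), purely for illustration.

def gflops(alus, mhz, flops_per_alu_clock=8):
    return alus * mhz * 1e6 * flops_per_alu_clock / 1e9

usc   = gflops(alus=8, mhz=400)      # 8 unified Vec4 ALUs @ 400MHz -> 25.6
split = gflops(alus=2 + 1, mhz=520)  # 2 PS + 1 VS Vec4 ALUs @ 520MHz -> ~12.5

print(usc, split, usc / split)       # roughly a 2x gap on paper
[/code]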
 
NVIDIA's marketing claimed two things when they started out with the initial Tegra:

1. Tiling stinks for anything >DX7.
2. USC is questionable for anything up to DX9.

No, it wasn't worded exactly like that, but I don't recall the claims word for word either. Marketing wash aside, and any funky excuses in the direction of "what we don't have sucks", I always try to keep a broader perspective just in case. Assuming there's some truth to both, and given that the ULP GeForces in Tegras haven't come across as "weak" in terms of performance, I'd gladly stand corrected and find there's no truth whatsoever behind the above.
I'd rather see the original statements you got those impressions from : )

But I'll still comment on 2: the only truth it may have would be entirely power-savings related (e.g. limiting fragment cores to only mid/lowp, etc). Performance-wise, it's generally wrong - I'd always take a gpu that has N+M USCs over one that has N vertex and M fragment cores.
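A toy utilisation model of that last point, with made-up workload numbers just to show the mechanism:

[code]
# With a fixed vertex/fragment split, whichever pool runs out of its own
# kind of work idles; a unified pool keeps every core busy regardless of
# the mix. Workload numbers are made up for illustration.

def split_time(vertex_work, fragment_work, n_vertex, m_fragment):
    return max(vertex_work / n_vertex, fragment_work / m_fragment)

def unified_time(vertex_work, fragment_work, total_cores):
    return (vertex_work + fragment_work) / total_cores

# A fragment-heavy frame: 10 units of vertex work, 90 of fragment work.
print(split_time(10, 90, n_vertex=1, m_fragment=3))   # -> 30.0
print(unified_time(10, 90, total_cores=4))            # -> 25.0
[/code]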
 
I'd rather see the original statements you got those impressions from : )

http://forum.beyond3d.com/showpost.php?p=1267164&postcount=2

Not a tiling architecture
- Tiling works reasonably well for DX7-style content
- For DX9-style content the increased vertex and state traffic was a net loss
Not a unified architecture
- Unified hardware is a win for DX10 and compute
- For DX9-style graphics, however, non-unified is more efficient.

My memory isn't as weak as I think it is most of the time.

But I'll still comment on 2: the only truth it may have would be entirely power-savings related (e.g. limiting fragment cores to only mid/lowp, etc). Performance-wise, it's generally wrong - I'd always take a gpu that has N+M USCs over one that has N vertex and M fragment cores.

I won't disagree one bit; however, what I'm asking here is whether (apart from the above) <DX10-equivalent environments (and specifically OGL_ES2.0) could be complicating things for USCs. It doesn't make sense from where I stand, but it doesn't hurt to ask.

Nothing directly comparable of course, but one thing that made me raise an eyebrow was computerbase's recent article on how older and newer GPUs of different generations compare nowadays: http://www.computerbase.de/artikel/...rten-evolution/3/#abschnitt_leistung_mit_aaaf

They used a collection of games released from 2005 up to recently, and while I recall the G80 having a sizeable advantage over the G71 with AA/AF, it never seemed to be as big as this review shows. Almost a 6x difference with AA/AF for just one generation, landing right at the turn from DX9 to DX10, doesn't sound like a coincidence.
 
I don't get nVidia's argument that increased per-vertex state adds to binning overhead. Binning should only have to touch coordinate data, so no increase there.. and on a tiler that's a good incentive to keep the coordinate data from being interleaved with everything else.

There probably is a much lower fragment to vertex ratio on newer games, I'll give them that. But I wonder what the tiling they used for evaluation was like, vs IMG's.
 
I don't get nVidia's argument that increased per-vertex state adds to binning overhead. Binning should only have to touch coordinate data, so no increase there.. and on a tiler that's a good incentive to keep the coordinate data from being interleaved with everything else.

There probably is a much lower fragment to vertex ratio on newer games, I'll give them that. But I wonder what the tiling they used for evaluation was like, vs IMG's.

I'm guessing that, as a tiler can revisit multiple states from tile to tile, they incorrectly assumed this would significantly impact bandwidth, when in reality it remains a tiny proportion of overall BW.

The vertex BW "issue" is also typically hugely overstated by IMR guys. Although vertex BW has increased, pixel-related BW has typically increased by an order of magnitude more over the same period, which tends to sway things even more in favour of a TBDR.
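A back-of-envelope frame budget makes the proportions obvious; every figure below is an assumption for illustration, not a measured workload:

[code]
# Rough per-frame traffic on an IMR with assumed (not measured) numbers.
width, height   = 1280, 720
overdraw        = 2.5          # shaded layers per pixel
bytes_per_shade = 4 + 8 + 8    # colour write + Z read/write + texel fetches

vertices         = 100_000
bytes_per_vertex = 32          # position plus a few attributes
state_changes    = 2_000
bytes_per_state  = 256

pixel_bw  = width * height * overdraw * bytes_per_shade
vertex_bw = vertices * bytes_per_vertex
state_bw  = state_changes * bytes_per_state
total     = pixel_bw + vertex_bw + state_bw

print(f"pixel {pixel_bw/total:.1%}, vertex {vertex_bw/total:.1%}, state {state_bw/total:.1%}")
# -> pixel traffic dominates vertex traffic by over an order of magnitude,
#    and state traffic stays around 1%
[/code]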
 
I don't get nVidia's argument that increased per-vertex state adds to binning overhead. Binning should only have to touch coordinate data, so no increase there.. and on a tiler that's a good incentive to keep the coordinate data from being interleaved with everything else.
I guess the binning scheme they evaluated stored the vertex attributes in the parameter buffer as well, not just position.
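If so, the difference in parameter-buffer traffic is easy to picture; a quick sketch with assumed attribute sizes (nothing here reflects a real implementation):

[code]
# Parameter-buffer bytes written during binning: position-only vs. copying
# every vertex attribute as well. Sizes are assumptions for illustration.

def binning_bytes(vertices, position_bytes, attribute_bytes, store_attributes):
    per_vertex = position_bytes + (attribute_bytes if store_attributes else 0)
    return vertices * per_vertex

v = 200_000
print(binning_bytes(v, position_bytes=16, attribute_bytes=48, store_attributes=False))  # 3.2 MB
print(binning_bytes(v, position_bytes=16, attribute_bytes=48, store_attributes=True))   # 12.8 MB
[/code]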
 
The vertex BW "issue" is also typically hugely overstated by IMR guys. Although vertex BW has increased, pixel-related BW has typically increased by an order of magnitude more over the same period, which tends to sway things even more in favour of a TBDR.

And we are still patiently waiting for the promised article about the bandwidth advantages of a TBDR. :)

edit: ARRGGG..stupid missing "i" :oops:
 
http://forum.beyond3d.com/showpost.php?p=1267164&postcount=2
My memory isn't as weak as I think it is most of the time.
Apologies, but I just had to ask as my memory plays the occasional trick on me.

I won't disagree one bit; however, what I'm asking here is whether (apart from the above) <DX10-equivalent environments (and specifically OGL_ES2.0) could be complicating things for USCs. It doesn't make sense from where I stand, but it doesn't hurt to ask.
IMO USCs handle gles2 just fine (keep in mind gles2 is just a very streamlined gl). The level of differentiation that USCs provide over split architectures is rather orthogonal to what most modern APIs expect from the hw. Perhaps the one API requirement most relevant to USCs is that the order of draw calls (and related draw state changes) should be effectively preserved on the output as that order arrives from the client on the input (which apparently could be a stronger limitation for scene capturers than for USCs). But drivers and hw are usually free to do whatever they like in the span between the client's draw/state emits and fragments reaching the framebuffer (which is why scene capturers exist in the first place). Now, USCs, by virtue of being more flexible workload schedulers, might face the dilemma of 'Will this thread I can schedule right now be a problem WRT framebuffer consistency with draw emit order?' more often than split architectures. But I'm yet to see a combination of driver and hw that manages to break there, as these are things that are usually taken care of with high priority.
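To put the ordering constraint into a trivially small model (everything below is illustrative; the names and the per-pixel resolve have nothing to do with real hardware): fragments can finish in whatever order the scheduler produced, as long as each pixel ends up as if its fragments had been applied in draw-call order.

[code]
# Toy model of the ordering rule for opaque overwrites: the last draw in
# submission order must win per pixel, regardless of scheduling order.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:                 # hypothetical, for illustration only
    pixel: tuple                # (x, y)
    draw_seq: int               # position of the owning draw in submission order
    colour: str

def resolve(fragments):
    per_pixel = defaultdict(list)
    for f in fragments:         # arrival order is whatever the scheduler produced
        per_pixel[f.pixel].append(f)
    # re-impose draw order per pixel before the final write
    return {p: max(frags, key=lambda f: f.draw_seq).colour
            for p, frags in per_pixel.items()}

# Fragments arrive out of order, yet the later draw (seq=1) still wins.
print(resolve([Fragment((0, 0), 1, "red"), Fragment((0, 0), 0, "blue")]))
# -> {(0, 0): 'red'}
[/code]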

Nothing directly comparable of course, but one thing that made me raise an eyebrow was computerbase's recent article on how older and newer GPUs of different generations compare nowadays: http://www.computerbase.de/artikel/...rten-evolution/3/#abschnitt_leistung_mit_aaaf

They used a collection of games released from 2005 up to recently, and while I recall the G80 having a sizeable advantage over the G71 with AA/AF, it never seemed to be as big as this review shows. Almost a 6x difference with AA/AF for just one generation, landing right at the turn from DX9 to DX10, doesn't sound like a coincidence.
I think the above could be mainly attributed to the advancements of FSAA and AF implementations during that timespan.
 