NVIDIA Tegra Architecture

General consensus is often formed from people who don't really know what they're talking about >_> People are probably just seeing that it's non-unified and DX9 level and drawing conclusions from there. There's a lot more to it than that.
Well to be fair, DX9 IS GeForce 6/7 class, is it not?... of course it's not really going to resemble a proper GeForce 6800 Ultra... because it's so cut down and simple... but I think the 6 series analogy is apt.
-After all, Nvidia said so themselves when Tegra 1 hit...
 
Cache maintenance operations include flushes and invalidations. There are versions that work by set/way and therefore you need to know the cache size and associativity to use them. It's important that they actually work instead of NOPing (much less faulting, an implementation of ARMv7a can't add faults where the spec doesn't call for it) because otherwise you can end up with a stale cache causing incoherencies.
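For reference, the usual ARMv7-A sequence looks roughly like this - just a sketch with GCC inline asm, with the bit fields as given in the ARMv7-A ARM (CSSELR/CCSIDR/DCCISW), so treat it as illustrative rather than production code. The point is that the set/way value is built out of the geometry the software reads, so that geometry has to match the real cache:

```c
#include <stdint.h>

/* Read the geometry of a given cache level from CCSIDR (after selecting the
 * level in CSSELR), then clean+invalidate every line by set/way (DCCISW). */
static inline uint32_t read_ccsidr(uint32_t level)
{
    uint32_t ccsidr, csselr = level << 1;         /* bit 0 = 0: data/unified cache */
    __asm__ volatile("mcr p15, 2, %0, c0, c0, 0" :: "r"(csselr)); /* write CSSELR */
    __asm__ volatile("isb");
    __asm__ volatile("mrc p15, 1, %0, c0, c0, 0" : "=r"(ccsidr)); /* read CCSIDR  */
    return ccsidr;
}

static inline uint32_t log2_up(uint32_t x)
{
    uint32_t r = 0;
    while ((1u << r) < x)
        r++;
    return r;
}

static void dcache_clean_inv_level(uint32_t level)
{
    uint32_t ccsidr    = read_ccsidr(level);
    uint32_t line_log  = (ccsidr & 0x7) + 4;            /* log2(line size in bytes) */
    uint32_t ways      = ((ccsidr >> 3) & 0x3ff) + 1;   /* associativity            */
    uint32_t sets      = ((ccsidr >> 13) & 0x7fff) + 1; /* number of sets           */
    uint32_t way_shift = 32 - log2_up(ways);

    for (uint32_t way = 0; way < ways; way++) {
        for (uint32_t set = 0; set < sets; set++) {
            uint32_t sw = (way ? (way << way_shift) : 0u)  /* avoid <<32 on 1-way */
                        | (set << line_log)
                        | (level << 1);
            /* DCCISW: data cache clean and invalidate by set/way */
            __asm__ volatile("mcr p15, 0, %0, c7, c14, 2" :: "r"(sw) : "memory");
        }
    }
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb");
}
```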

With a write-back, inclusive cache, I don't see this being much of a problem as evictions would occur on their own. If hardware enforces evictions-on-conflict, I don't see what routines the OS could be doing in which it relies on being able to invalidate a set/way before write-back. That seems fairly dangerous...

Should they really be trying to take that market with IB and not Atom? At what point does perf/W scaling on IB reach diminishing returns and make no sense vs the much cheaper Atom chips?

Unless they come out with a drastically improved Atom uarch, I'd wager that point is well below the 5W range.

We'll see - I personally don't think that market potential is that much bigger than the market for Windows 7 tablets is right now. Because I don't think that many people are looking for tablets that are both top of class as far as conventional tablets go and run legacy x86 apps. People view the two things as pretty separate..

My point isn't that IB will be advantageous in that it'll support legacy apps. My point is that IB at the high-end tablet's power envelope will be competitive in its own right. Both in terms of performance and power consumption.

And good luck finding a 17W (or less) IB laptop for sub $400. That market is dedicated to Brazos, Atom, maybe higher end AMD APUs, and maybe heavily crippled Celerons.

Aren't Celerons just whatever design-of-the-year they're currently selling on the high-end, but with certain features and a bunch of cache disabled? It's a different name, but that'd be IB. And performance-wise, it'd still be on par if not ahead of A15 based SoC's.

What price issue with ARM SoCs? You're saying ARM SoCs are too expensive? Or tablets are too expensive?

The latter.

Because the latter is obviously already shifting with tablets like Kindle Fire. It isn't any longer a matter of not being able to accept razor thin margins and high volume; once you're already on that model going with a $100+ high end (non-Atom) Intel chip is going to be a hard sell vs a $20 ARM SoC.

I'm going to go out on a limb here and say there is definitely a market for more expensive tablets than the Kindle Fire.....

Call me crazy.

Yes, I don't think IB will ever make it into the new wave of super-cheap tablets. But I think there's a sufficient market out there for pricier tablets and I think IB will fit into that pretty well.
 
Exophase,

http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/3

Adreno 225 should be 8 Vec4 (ignore the SFU, it isn't any sort of Vec4+1).

I think this is wrong, because Adreno 200 definitely consists of only one vec4 FP32 ALU. I don't think anyone would be publicizing an increase in 2x and 4x performance if Adreno 205 and 220 changed to 4 and 8 vec4 ALUs respectively.

As for the ULP GeForce, it's an embedded design and shares parts from several NV DX9 desktop architectures; it's a mistake to expect an embedded design to be a direct shrink/copy of a desktop design cluster.

And really just a cursory review of GeForce 6's design as presented in GPU Gems shows so many differences from GeForce ULP. It's really hard to call it "derived" from that on the information given so far.

metafor said:
With a write-back, inclusive cache, I don't see this being much of a problem as evictions would occur on their own. If hardware enforces evictions-on-conflict, I don't see what routines the OS could be doing in which it relies on being able to invalidate a set/way before write-back. That seems fairly dangerous...

So what you're saying is that you don't see a problem if the flush operations fail to actually flush what you think you're flushing. It's not enough for a cache to be write-back when you need dirty data to be flushed NOW, for instance when other devices (not coherent with the core) need to see what changed. If it were write through that may be another story but it isn't always.

metafor said:
Unless they come out with a drastically improved Atom uarch, I'd wager that point is well below the 5W range.

Maybe, but until Intel actually does release < 17W CPUs it's speculation. I don't think it's a lock that IB keeps scaling down to 10W at your specifications. Maybe someone should try measuring power consumption at full load with these hard clock limits.

metafor said:
My point isn't that IB will be advantageous in that it'll support legacy apps. My point is that IB at the high-end tablet's power envelope will be competitive in its own right. Both in terms of performance and power consumption.

Competitive while running what? How much of a demand do you think there currently is for vastly more power than tablets currently provide, but while running the same programs?

metafor said:
Aren't Celerons just whatever design-of-the-year they're currently selling on the high-end, but with certain features and a bunch of cache disabled? It's a different name, but that'd be IB. And performance-wise, it'd still be on par if not ahead of A15 based SoC's.

When I say crippled I mean a lot of its die space disabled, not just cache. For instance there are single core Celerons. I didn't mean to imply it wouldn't be the same uarch, but we're talking a major difference in performance levels. Lack of turbo makes a really big difference too, when you're talking about base clocks around 1GHz.

metafor said:
The latter.

Which is what has been changing, since it's the cheap tablets that are now getting the lion's share of non-iOS sales.

Yes there's a market for more expensive stuff, but you're losing focus of your original claim. I asked why the list only has 10" stuff with a bunch of 4 + 1 Cortex-A15s. You said it's because nVidia is looking to compete against Ivy Bridge. This isn't saying there's a market here, that's saying it's the ONLY market nVidia sees its next gen tablets as playing in. Which I really, really doubt. What do you see nVidia speccing for those cheaper tablets? Obviously not Grey.

So far the market for more-expensive-than-Kindle Fire (and not an iPad) really has been pretty marginal. Especially if you take out turns-into-a-laptop-form-factor as a side feature.
 
So what you're saying is that you don't see a problem if the flush operations fail to actually flush what you think you're flushing. It's not enough for a cache to be write-back when you need dirty data to be flushed NOW, for instance when other devices (not coherent with the core) need to see what changed. If it were write through that may be another story but it isn't always.

If you're writing to a device lock or non-coherent region, why wouldn't you mark it as non-cacheable? Ignoring that, as long as you flush L2 on a switch to the companion core, you're fine.

Maybe, but until Intel actually does release < 17W CPUs it's speculation. I don't think it's a lock that IB keeps scaling down to 10W at your specifications. Maybe someone should try measuring power consumption at full load with these hard clock limits.

Competitive while running what? How much of a demand do you think there currently is for vastly more power than tablets currently provide, but while running the same programs?

With Windows 8 going to tablets? Probably quite a few. In either case, just about every SoC vendor is aiming for higher performance designs. I don't think they're all planning on something that won't be needed...

When I say crippled I mean a lot of its die space disabled, not just cache. For instance there are single core Celerons. I didn't mean to imply it wouldn't be the same uarch, but we're talking a major difference in performance levels. Lack of turbo makes a really big difference too, when you're talking about base clocks around 1GHz.

Even at a base of 1GHz, even a crippled IB will still be quite competitive performance-wise with A15.

Which is what has been changing, since it's the cheap tablets that are now getting the lion's share of non-iOS sales.

Yes there's a market for more expensive stuff, but you're losing focus of your original claim. I asked why the list only has 10" stuff with a bunch of 4 + 1 Cortex-A15s. You said it's because nVidia is looking to compete against Ivy Bridge. This isn't saying there's a market here, that's saying it's the ONLY market nVidia sees its next gen tablets as playing in. Which I really, really doubt. What do you see nVidia speccing for those cheaper tablets? Obviously not Grey.

nVidia only aiming for the high-end and somehow finding a way to scale it down isn't new. In fact, that's par for the course.

So far the market for more-expensive-than-Kindle Fire (and not an iPad) really has been pretty marginal. Especially if you take out turns-into-a-laptop-form-factor as a side feature.

Not that I disagree, but the anticipation is that Windows 8 will blow up that market.
 
I think this is wrong, because Adreno 200 definitely consists of only one vec4 FP32 ALU. I don't think anyone would be publicizing an increase in 2x and 4x performance if Adreno 205 and 220 changed to 4 and 8 vec4 ALUs respectively.
Links? The Adreno 200 was pathetic, Adreno 205 was more than double the performance in practice, plus performance wouldn't just scale with shaders... I don't think TMUs and ROPs scaled up that much?
http://www.glbenchmark.com/compare.jsp
 
french toast said:
Links? The Adreno 200 was pathetic, Adreno 205 was more than double the performance in practice, plus performance wouldn't just scale with shaders... I don't think TMUs and ROPs scaled up that much?
http://www.glbenchmark.com/compare.jsp

The point isn't how much it DID improve, it's that there's no way they'd claim only a 2x improvement if they quadrupled the shading resources. This isn't how marketing works. They'd say "up to 4x improvement." That is if it were the same clock, which obviously it wasn't, hence why you got better than 2x (and yes, I'm sure they doubled the TMUs on the 205 over 200 as well. Adreno 200 had a single TMU)

You can find this information in the documentation for i.MX51 or i.MX53, and darkblu has confirmed it with me.

If you're writing to a device lock or non-coherent region, why wouldn't you mark it as non-cacheable? Ignoring that, as long as you flush L2 on a switch to the companion core, you're fine.

Because sometimes you still want to benefit from caching before writing to the external device.

You're not really fine if you flush L2 after switching to the companion core, if the app still thinks that the L2 is a different size because you didn't communicate to it that you changed it. Of course having to flush L2 on a switch is highly undesirable to begin with..

I think your expectations for Windows 8 are really high. You seem confident that it WILL be successful. I can see it going either way. One thing I know is that this is not an easy market for a new ecosystem to enter into, just as phones weren't, and WP7 has been struggling. If it IS successful, it's definitely not going to be centered around nothing but IB-level CPUs, but will be all over the place, and the lowest common denominator will have a big impact on performance expectations and demands - a lot more than it does on PC. The point here isn't really that the market doesn't move up in performance, it's that it's harder to differentiate by offering a product with much more of it if it costs a ton more at the same time, and a majority of the things you run on it don't really scale up to take advantage of it as well as they do on PC.

As for nVidia releasing high and moving down - that strategy works okay for Tegra 3 when the cores are tiny, but thus far I haven't actually seen them scaling down at all. Their market share hasn't been all that amazing either, and the initial advantage they got by pushing the Honeycomb reference in tablets is going to be quickly vanishing, so it's important that they compete on more levels. They may well go with nothing but 5-core Cortex-A15s next gen, but I think that kind of decision would bite them in the ass.
 
The point isn't how much it DID improve, it's that there's no way they'd claim only a 2x improvement if they quadrupled the shading resources. This isn't how marketing works. They'd say "up to 4x improvement." That is if it were the same clock, which obviously it wasn't, hence why you got better than 2x (and yes, I'm sure they doubled the TMUs on the 205 over 200 as well. Adreno 200 had a single TMU)
I've tested the ALU performance of Adreno 220 and it's definitely 4 Vec4 ALUs. It's bottlenecked by an awful compiler and a lack of fillrate.

And yes, big kudos to Qualcomm's PR for not pushing unrealistic performance expectations. I should point out that Imagination did the same for Series 5XT by claiming 40% faster performance for shader-heavy applications despite having twice as many flops.
 
I've tested the ALU performance of Adreno 220 and it's definitely 4 Vec4 ALUs. It's bottlenecked by an awful compiler and a lack of fillrate.

And yes, big kudos to Qualcomm's PR for not pushing unrealistic performance expectations. I should point out that Imagination did the same for Series 5XT by claiming 40% faster performance for shader-heavy applications despite having twice as many flops.

Yes, 4x vec4 FP32 ALU's.. that's what I originally claimed, which - like the graph in the Anandtech review - is 2x Adreno 205's, which is 2x Adreno 200 which had a single vec4 FP32 ALU. And Adreno 225 is the same as Adreno 220, just higher clocks and better drivers.

That doesn't mean it's an unrealistic claim even though shader ALUs "only" increased by the amount they indicate, since they also increased everything else..
 
Yes, 4x vec4 FP32 ALU's.. that's what I originally claimed, which - like the graph in the Anandtech review - is 2x Adreno 205's, which is 2x Adreno 200 which had a single vec4 FP32 ALU. And Adreno 225 is the same as Adreno 220, just higher clocks and better drivers.

That doesn't mean it's an unrealistic claim even though shader ALUs "only" increased by the amount they indicate, since they also increased everything else..
Gah, this is extremely embarrassing, I'm sorry but I definitely meant 8 Vec4 ALUs (4 per TMU). And yes, it's basically the most ALU-intensive architecture out there today (SGX554 will be effectively more ALU-intensive given its higher efficiency per flop though, and that's before even considering Rogue...)

You'd be surprised at how many shaders do benefit from those ALUs. Adreno's performance is actually very good for the complex shaders used for e.g. characters or water. They lose out badly for the more simple shaders that are still used for the majority of the pixels though from my experience/analysis so far...
 
Well then that's highly confusing from Anand, as he does state that he enquired with Qualcomm about the Vec4+1... still, if you've only got scraps of info to work with, you're going to come up short occasionally...

Yeah, marketing-wise Qualcomm has always been bang on, could teach Nvidia a thing or two ;)

EDIT: Ha, just read Arun's post, no worries, simple mistake.
 
Gah, this is extremely embarrassing, I'm sorry but I definitely meant 8 Vec4 ALUs (4 per TMU). And yes, it's basically the most ALU-intensive architecture out there today (SGX554 will be effectively more ALU-intensive given its higher efficiency per flop though, and that's before even considering Rogue...)

You'd be surprised at how many shaders do benefit from those ALUs. Adreno's performance is actually very good for the complex shaders used for e.g. characters or water. They lose out badly for the more simple shaders that are still used for the majority of the pixels though from my experience/analysis so far...

Then can you confirm Adreno 205 is 4 vec4 ALUs? Is it 1 TMU or 2? 1:4 TMU to ALU ratio seems pretty unbalanced for a mainstream phone SoC today. Are the TMUs more sophisticated than you would expect? They were promoting AF when the AF capabilities for SGX was unclear (to me, at least), but I'm not sure if it actually had better filtering.

The ALU/TMU balancing, along with a lack of SIMD within vector lines (ie, 2x16-bit and 4x8/10-bit like on USSE/USSE2) seem like weaknesses in the balancing of the architecture. Everyone else is either using multi-precision (IMG) or flat out lower precision for fragments (nVidia, ARM). This makes the relative area spent on all those ALUs even higher than the SIMD-width suggests..
 
Then can you confirm Adreno 205 is 4 vec4 ALUs? Is it 1 TMU or 2?
I've not actually tested Adreno 205 but all the evidence I have ever seen points towards it being 1 TMU with 4 Vec4 ALUs.

1:4 TMU to ALU ratio seems pretty unbalanced for a mainstream phone SoC today.
It's compensated by the compiler which is clearly less efficient than ours or Tegra's AFAICT. For example, it doesn't do any vectorisation whatsoever, even in the most trivial cases. Also one thing I like to point out is that the optimal TMU vs ALU ratio depends on the relative die size of the two - if their ALUs were more efficient (BIG if) then it would make sense to have more of them.

I think it's pretty typical for different shaders in the same game/benchmark to have quite different ALU:TEX ratios. There are certainly a fair number of existing shaders where a 4:1 ratio would be a clear benefit compared to a 2:1 ratio even with a good compiler. Despite all this, I definitely agree Adreno wouldn't be balanced for its timeframe if it wasn't for the low compiler efficiency, and that's not a great justification...
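To put some very rough numbers on that (purely illustrative figures, assuming per-pixel throughput is simply limited by whichever unit is busier):

\[ t_{\text{pixel}} \approx \max\!\left(\frac{N_{\text{ALU}}}{R_{\text{ALU}}},\ \frac{N_{\text{TEX}}}{R_{\text{TEX}}}\right) \]

So a "complex" shader with, say, 12 vec4 ALU ops and 2 texture fetches per fragment runs at max(12/8, 2/2) = 1.5 cycles/pixel on an 8-ALU/2-TMU (4:1) layout versus max(12/4, 2/2) = 3 cycles/pixel on a 4-ALU/2-TMU (2:1) layout, i.e. twice as fast - while a simple 2-op/2-fetch shader is TEX-limited at 1 cycle/pixel either way and the extra ALUs buy nothing.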

Are the TMUs more sophisticated than you would expect? They were promoting AF when the AF capabilities for SGX was unclear (to me, at least), but I'm not sure if it actually had better filtering.
No idea about Adreno's AF quality but I don't see why it would be unusually fast at it, especially given its lack of texture fillrate.

The ALU/TMU balancing, along with a lack of SIMD within vector lines (ie, 2x16-bit and 4x8/10-bit like on USSE/USSE2) seem like weaknesses in the balancing of the architecture. Everyone else is either using multi-precision (IMG) or flat out lower precision for fragments (nVidia, ARM). This makes the relative area spent on all those ALUs even higher than the SIMD-width suggests..
Agreed. The ALU itself is not necessarily a huge part of the total area but it's obviously significant. All things considered, I think Adreno is clearly too ALU-heavy for its timeframe.
 
Because sometimes you still want to benefit from caching before writing to the external device.

You're not really fine if you flush L2 after switching to the companion core, if the app still thinks that the L2 is a different size because you didn't communicate to it that you changed it.

Why wouldn't you be. The line will never be dirty again (since it's not there) and invalidate-by-set/way can be no-op'ed. Also, is this really a common practice on the OS or driver level when it comes to using caches? What do these procedures do when faced with a write-through cache?

Of course having to flush L2 on a switch is highly undesirable to begin with..

Not really. Typically, swapping to the companion core or back involves a drastic change in what the user is doing. Usually a change of the primary running application and its working set. I don't see a cache flush having much of a negative impact here.

I think your expectations for Windows 8 are really high. You seem confident that it WILL be successful.

I didn't say *I* was expecting it to be wildly successful. I said the major ARM SoC guys look to this as their blow-out in a marginal market dominated by Apple. Android isn't exactly bringing nVidia tons of volume sales in tablets.

I can see it going either way. One thing I know is that this is not an easy market for a new ecosystem to enter into, just as phones weren't, and WP7 has been struggling. If it IS successful, it's definitely not going to be centered around nothing but IB-level CPUs, but will be all over the place, and the lowest common denominator will have a big impact on performance expectations and demands - a lot more than it does on PC. The point here isn't really that the market doesn't move up in performance, it's that it's harder to differentiate by offering a product with much more of it if it costs a ton more at the same time, and a majority of the things you run on it don't really scale up to take advantage of it as well as they do on PC.

How well does it really scale on PCs? And in what cases would those same things not scale on a tablet running Windows 8? Other than workstation workloads, that is.

As for nVidia releasing high and moving down - that strategy works okay for Tegra 3 when the cores are tiny, but thus far I haven't actually seen them scaling down at all. Their market share hasn't been all that amazing either, and the initial advantage they got by pushing the Honeycomb reference in tablets is going to be quickly vanishing, so it's important that they compete on more levels. They may well go with nothing but 5-core Cortex-A15s next gen, but I think that kind of decision would bite them in the ass.

No argument there. My point was simply that this pattern -- and their motivations behind it -- are par for the course.
 
Why wouldn't you be. The line will never be dirty again (since it's not there) and invalidate-by-set/way can be no-op'ed. Also, is this really a common practice on the OS or driver level when it comes to using caches? What do these procedures do when faced with a write-through cache?

Okay, what is it that you think is the purpose of the cache flush operations that they can just be dropped or are okay to refer to the wrong location? They aren't performance hints. Sometimes you need to force coherency within a predictable time frame and not just when the core happens to need to do it.

Using cached memory + flushes instead of writing to uncached memory isn't as pathological as you make it sound. For instance, it's a pretty common pattern for when you're preparing memory for a DMA engine. Sometimes you actually do want to benefit from the spatial and temporal locality of cache when preparing data before sending it off externally. Think of something like a tile-based renderer; you might write straight to the framebuffer, but rely on the cache to give you high performance repeat access before you're done rendering the tile. Or maybe going through the cache gives you better performance than the write buffer gives you on uncached access. Or maybe it's just more natural this way because cached memory is the default view of memory allocated in user space.
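A minimal sketch of that DMA pattern, using the Linux streaming DMA API for concreteness (fill_tile(), start_dma() and struct my_engine are made-up placeholders; the real point is that dma_map_single() with DMA_TO_DEVICE is what cleans the dirty lines out to memory before the non-coherent engine reads them):

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

struct my_engine;                                           /* hypothetical device */
void fill_tile(void *buf, size_t len);                      /* hypothetical: repeated cached writes */
void start_dma(struct my_engine *eng, dma_addr_t src, size_t len); /* hypothetical kick-off */

static int send_tile(struct device *dev, struct my_engine *eng, size_t len)
{
    void *buf = kmalloc(len, GFP_KERNEL);   /* ordinary cacheable memory */
    dma_addr_t handle;

    if (!buf)
        return -ENOMEM;

    /* Build the data through the cache, benefiting from spatial/temporal locality. */
    fill_tile(buf, len);

    /* Clean (write back) the dirty cache lines covering buf so the
     * non-coherent DMA engine sees what the CPU wrote. */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle)) {
        kfree(buf);
        return -ENOMEM;
    }

    start_dma(eng, handle, len);
    /* ...later: dma_unmap_single(dev, handle, len, DMA_TO_DEVICE); kfree(buf); */
    return 0;
}
```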

Write-through cache doesn't somehow pose a new problem. It doesn't matter if the memory happened to end up coherent. You just have to ensure that it isn't incoherent. Of course, write-through for last level cache tends to be a pretty bad idea.

Not really. Typically, swapping to the companion core or back involves a drastic change in what the user is doing. Usually a change of the primary running application and its working set. I don't see a cache flush having much of a negative impact here.

ARM disagrees with you, given the lengths they went through to optimize big.LITTLE to switch in a few hundred microseconds. They'd never get that if they had to flush the whole L2 cache.
 
Okay, what is it that you think is the purpose of the cache flush operations that they can just be dropped or are okay to refer to the wrong location? They aren't performance hints. Sometimes you need to force coherency within a predictable time frame and not just when the core happens to need to do it.

A swap to companion core will do this for the initial line. Subsequent writes to that address wouldn't be in that set. The underlying hardware can then choose to either make those addresses non-cacheable, thus guaranteeing write-through behavior. Or remap the cache ops.

Using cached memory + flushes instead of writing to uncached memory isn't as pathological as you make it sound. For instance, it's a pretty common pattern for when you're preparing memory for a DMA engine. Sometimes you actually do want to benefit from the spatial and temporal locality of cache when preparing data before sending it off externally. Think of something like a tile-based renderer; you might write straight to the framebuffer, but rely on the cache to give you high performance repeat access before you're done rendering the tile. Or maybe going through the cache gives you better performance than the write buffer gives you on uncached access. Or maybe it's just more natural this way because cached memory is the default view of memory allocated in user space.

Fair enough. But again, in these situations, what is a show-stopper about cached operations bypassing the cache?

Write-through cache doesn't somehow pose a new problem. It doesn't matter if the memory happened to end up coherent. You just have to ensure that it isn't incoherent. Of course, write-through for last level cache tends to be a pretty bad idea.

So...no-op'ing cache invalidate ops because they were write-through to begin with is not going to break anything.

ARM disagrees with you, given the lengths they went through to optimize big.LITTLE to switch in a few hundred microseconds. They'd never get that if they had to flush the whole L2 cache.

They may be more ambitious with the type of workloads they can swap in between for big.Little then.
 
A swap to companion core will do this for the initial line. Subsequent writes to that address wouldn't be in that set. The underlying hardware can then choose to either make those addresses non-cacheable, thus guaranteeing write-through behavior. Or remap the cache ops.

Fair enough. But again, in these situations, what is a show-stopper about cached operations bypassing the cache?

I don't think you're following the scenario here. This isn't about maintaining coherency between a switch, this is about the CPU not having a consistent view for how to manage the cache. Bear with me on this:

1) CPU gets cache size in order to perform maintenance operations
2) CPU is switched and suddenly it has a different L2 cache size, this is not somehow communicated to the CPU
3) New stuff enters the cache to be flushed
4) CPU performs flush operation using cache sizing to calculate cache line index from physical address, ends up getting it wrong
5) Flush operation doesn't happen like the CPU expects it to
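In other words (a made-up fragment just to illustrate step 4, nothing Tegra-specific), the set/way value built from the geometry cached at step 1 is only valid for that geometry:

```c
#include <stdint.h>

/* Geometry probed once at boot (step 1) and never refreshed. */
struct cache_geom {
    uint32_t sets;        /* from CCSIDR NumSets + 1   */
    uint32_t way_shift;   /* 32 - log2(associativity)  */
    uint32_t line_shift;  /* log2(line size in bytes)  */
};

static struct cache_geom l2_geom;

/* Step 4: clean one L2 line by set/way using the boot-time geometry.
 * If the real L2 changed after the switch (say 1MB -> 512KB), 'sets' is
 * twice the real number of sets: half the loop builds set indices that
 * don't exist, the set field is the wrong width, and the lines the CPU
 * thinks it's cleaning may not be the ones actually being cleaned. */
static void clean_l2_line(uint32_t set, uint32_t way)
{
    uint32_t level = 1;                                   /* L2 = level 1 (zero-based) */
    uint32_t sw = (way << l2_geom.way_shift)
                | (set << l2_geom.line_shift)
                | (level << 1);
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 2" :: "r"(sw) : "memory"); /* DCCSW */
}
```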

So...no-op'ing cache invalidate ops because they were write-through to begin with is not going to break anything.

Unless you're suggesting to make the LLCs always be write-through to solve coherency problems I don't know what you're getting at.

They may be more ambitious with the type of workloads they can swap in between for big.Little then.

Then wouldn't that mean there actually is a use case for such swapping? Then it's more that the latency defines the use cases, not the other way around, and you want to avoid higher latencies if you can.
 
I don't think you're following the scenario here. This isn't about maintaining coherency between a switch, this is about the CPU not having a consistent view for how to manage the cache. Bear with me on this:

1) CPU gets cache size in order to perform maintenance operations
2) CPU is switched and suddenly it has a different L2 cache size, this is not somehow communicated to the CPU
3) New stuff enters the cache to be flushed
4) CPU performs flush operation using cache sizing to calculate cache line index from physical address, ends up getting it wrong
5) Flush operation doesn't happen like the CPU expects it to

When you say CPU, you mean the OS's view, right? Because the companion core knows full well what its cache size is. Moreover, the L2 cache controller knows full well what cache size the OS thinks it's dealing with and the actual cache size. Its job is either to not cache those addresses (partial set association) or to remap the cache ops it gets. If it doesn't cache those addresses, it can effectively no-op the cache ops.

Then wouldn't that mean there actually is a use case for such swapping? Then it's more that the latency defines the use cases, not the other way around, and you want to avoid higher latencies if you can.

It wouldn't really be swapping in the case of big.Little. The OS is aware of the heterogeneous configuration and assigns workloads appropriately. The two cores aren't muxed like they are in vSMP.
 
When you say CPU, you mean the OS's view, right? Because the companion core knows full well what its cache size is. Moreover, the L2 cache controller knows full well what cache size the OS thinks it's dealing with and the actual cache size. Its job is either to not cache those addresses (partial set association) or to remap the cache ops it gets. If it doesn't cache those addresses, it can effectively no-op the cache ops.

Yes, when I say CPU I mean the software.

Doing a virtual cache mapping is problematic because the associativity and line size and what have you actually have a practical impact on how the software works, just as the size does. They have to emulate the actual behavior, and the only way to really do that in a cache is to have the same cache arrangement..

Of course, it's not like nVidia is using a custom cache controller to begin with.

It wouldn't really be swapping in the case of big.Little. The OS is aware of the heterogeneous configuration and assigns workloads appropriately. The two cores aren't muxed like they are in vSMP.

Full swapping is a standard usage model for big.LITTLE, and ARM provides firmware code for it - and it's the swapping latency that ARM is describing (what else would they be?). Sure, the software triggers the swapping in big.LITTLE, but why would that increase the latency requirements? If anything, you would want something that transparently swaps you to be as low latency as possible..
 
You'd be surprised at how many shaders do benefit from those ALUs. Adreno's performance is actually very good for the complex shaders used for e.g. characters or water. They lose out badly for the more simple shaders that are still used for the majority of the pixels though from my experience/analysis so far...

Do you think Adreno's complex shader performance goes some way to explain why it dominates Basemark? As I have heard people say that Basemark is very 'shader heavy'?

Or is Qualcomm just optimising for that benchmark so it can 'win'?
 