NVIDIA Tegra Architecture

An ARM cluster goes only up to 4 cores, and the CCI-400 can only connect 2 clusters. Anything more than that needs a CCN-50x interconnect, which is much more complex and power hungry.
Besides that, the software support for more than 2 clusters just isn't there.
 
It's far too OT, but Mediatek should do itself a favour and get a wee bit more aggressive on the GPU side. If the rumors are true, the successor of the 6595 also gets a G6200, but clocked at 700MHz instead of 600MHz....pfffff :rolleyes:
 
Nvidia seems to be making the most efficient SoC with regard to the GPU (they claim 2x performance/watt in benchmarks), but they need to push harder to get their Denver CPUs out as well.

My theory is that while Denver is very efficient thanks to its innovative instruction profiling & caching, it pays a big price in die area. Therefore it makes sense for Nvidia to wait for 16nm before releasing a new Denver-based SoC.

To me it really looks like they are the only serious competition for Apple in the high-end tablet space, with Denver plus the best GPUs.
 
Denver is considerably larger than the standard ARM CPU designs, but so is Apple's Cyclone. If the AnandTech analysis is correct, Cyclone is closer to Haswell than to the standard ARM CPU designs by many measures: three cache levels, a big 64 KB L1 cache, a huge 4 MB L3 cache, a massive 192-entry ROB, 6-wide issue, etc. These numbers handily beat AMD's desktop chip numbers. IPC is certainly higher, but Apple runs very conservative clock rates (1.3 GHz).

See these two articles for more details:
Cyclone: http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed
Enhanced Cyclone: http://www.anandtech.com/show/8554/the-iphone-6-review/3
 
The L3 cache is part of the interconnect, so I wouldn't call that part of the CPU design.
Apple has a custom interconnect and custom memory controllers. They have been the leader in memory bandwidth in the mobile space for a long time already. I haven't heard any rumors about Apple sharing the L3 with the GPU. Their L3 is solely designed to help the CPU (reduce the CPU's memory traffic -> higher perf + lower power usage). We can of course fight over the semantics of whether the L3 caches (and even the memory controllers) count as part of the CPU in the custom Intel and Apple designs.
 
Why are you so certain it's a CPU-only cache? It looks to act exactly the same as the L3 cache in ARM's CCN interconnects, and Anand's been pretty clear about that (http://anandtech.com/show/7460/apple-ipad-air-review/3)

Apple isn't really the leader in bandwidth (at least not by the magnitude it's portrayed at); it's just that ARM's CPUs are misunderstood and misrepresented in bandwidth benchmarks. I hope to have the 5433 article up soon, where I'll explain that in a bit more depth.
 
That's great! More technical details are always welcome!

It's hard to tell the exact reason why the ARM designs in, for example, Samsung, HTC and other Android flagship phones have historically lagged behind Apple flagships in bandwidth. Mobile manufacturers (chip/SoC designers and integrators) don't publish specifications of their CPU and memory architectures in as much detail as PC chip manufacturers (Intel and AMD). The mobile chip market is a business-to-business field, and businesses love their secrecy. Obviously you don't need to publish detailed public information about your chips, because you are not selling them to customers directly. Apple is practically the only mobile manufacturer with their own CPUs and SoCs (and Imagination GPUs). However, their marketing has always hidden the technical details of their products. Their marketing is all about the image of the product, and it seems to be working, so I don't expect big changes there.
 
Ok, for the sake of discussion here:

ARM's CPUs have separate read and write ports on the CPU interconnect (and I think even on the cache levels; I'll have to double-check, but it's hard). Common benchmarks only ever have a single read or write test. If you actually have simultaneous read/write activity going on, which is what happens in 90% of real-world use-cases, then you can basically double all the bandwidth numbers out there.

You can theorize about the benefits/downsides of that yourself... ARM basically says that most bandwidth bottlenecks are in stream/copy scenarios, so this might be a benefit to latency or power.
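
To make that concrete, here's a rough sketch of my own (not anything from ARM or the upcoming article; the buffer size and timing code are purely illustrative) of why a read-only or write-only test under-reports a design with separate read and write paths, while a copy loop drives both at once:

```c
/* Rough sketch (illustrative only): compare a read-only stream, a
 * write-only stream and a copy (read + write) stream. On a memory path
 * with separate read and write ports, the copy case can report roughly
 * the sum of the two single-direction numbers rather than their max. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (4 * 1024 * 1024)          /* 4M doubles = 32 MB per array */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    volatile double sink = 0.0;
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double t = now(), s = 0.0;
    for (size_t i = 0; i < N; i++) s += a[i];        /* read-only stream  */
    sink += s;
    double rd = N * sizeof(double) / (now() - t) / 1e9;

    t = now();
    for (size_t i = 0; i < N; i++) b[i] = 3.0;       /* write-only stream */
    double wr = N * sizeof(double) / (now() - t) / 1e9;

    t = now();
    for (size_t i = 0; i < N; i++) b[i] = a[i];      /* read + write      */
    double cp = 2.0 * N * sizeof(double) / (now() - t) / 1e9;

    printf("read %.1f GB/s, write %.1f GB/s, copy %.1f GB/s\n", rd, wr, cp);
    free(a); free(b);
    return (int)sink;                                /* defeat dead-code elimination */
}
```

On hardware with a single shared path you'd expect the copy number to land near the single-direction numbers; with separate ports it can approach their sum.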
 
In stream-like benchmarks you have read and write streams running at the same time. And even if a single CPU fails to saturate the memory controller and its separate R and W buses, you'd expect several instances running on different CPUs to saturate both R/W buses.

Anyway, it seems to me recent ARM CPUs' bandwidth is already close to Apple's: http://browser.primatelabs.com/geekbench3/compare/1730581?baseline=1730423
 
Pure memory copy benchmarks (stream copy and stream copy multicore) are 2x faster on iPad (in the comparison link you provided). Apple seems to still have considerably higher peak numbers. I do admit that ARM has been catching up nicely. The difference used to be much bigger some years ago.

Note that Mediatek also seems to have caught up significantly on the software side with the 6595; until recently, driver overhead offscreen was in the 25-30 fps ballpark for all of their SoCs. With the 84 fps of the 6595 http://gfxbench.com/device.jsp?benchmark=gfx30&os=Android&api=gl&D=Meizu MX4 they're also getting close to the typical >100 fps scores Apple's i-devices land.

The highest overhead score goes to the Nexus 9 at >166 fps: http://gfxbench.com/result.jsp?benchmark=gfx30&test=553&order=score&base=gpu&ff-check-desktop=0
 
Any word on how the Nexus 9 has sold?

Google should be reporting their 2014 fourth-quarter results soon. Not sure if they ever report Nexus sales.

Nvidia should be reporting as well, but I wonder if they'll say how many of their devices they've sold.
 
Why are you so certain it's a CPU-only cache? It looks to act exactly the same as the L3 cache in ARM's CCN interconnects, and Anand's been pretty clear about that (http://anandtech.com/show/7460/apple-ipad-air-review/3)

Apple isn't really the leader in bandwidth (at least not by the magnitude it's portrayed at); it's just that ARM's CPUs are misunderstood and misrepresented in bandwidth benchmarks. I hope to have the 5433 article up soon, where I'll explain that in a bit more depth.

Hopefully the Denver article hasn't been abandoned...
 
Pure memory copy benchmarks (stream copy and stream copy multicore) are 2x faster on iPad (in the comparison link you provided). Apple seems to still have considerably higher peak numbers. I do admit that ARM has been catching up nicely. The difference used to be much bigger some years ago.
The stream copy number is probably misleading: the stream copy loop gets optimized into memcpy, and it's possible Apple's memcpy is heavily optimized. Let's wait for Nebuchadnezzar's results :)
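
To illustrate the point (my own sketch, not the actual benchmark code): GCC and Clang can recognise a plain element-wise copy loop and lower it to a call into the platform memcpy (GCC does this with -ftree-loop-distribute-patterns at -O3), so a "stream copy" score partly reflects how well the libc memcpy happens to be tuned:

```c
/* Sketch: the STREAM-style copy loop below is a pattern compilers can
 * recognize and turn into a single memcpy call, so its score partly
 * reflects the quality of the platform's memcpy implementation. */
#include <stddef.h>
#include <string.h>

void stream_copy(double *restrict dst, const double *restrict src, size_t n)
{
    /* At -O3, GCC's loop-distribution pass (and Clang's loop-idiom pass)
     * may lower this loop to memcpy(dst, src, n * sizeof *dst). */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

void explicit_copy(double *restrict dst, const double *restrict src, size_t n)
{
    /* Hands the copy straight to libc's (hopefully tuned) memcpy. */
    memcpy(dst, src, n * sizeof *dst);
}
```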
 
Tomorrow, Xiaomi will announce a new gaming device, maybe an Android-based console (well, at least that's what reports say). Who will power it? Qualcomm, Intel or Tegra?
 
The stream copy number is probably misleading: the stream copy loop gets optimized into memcpy, and it's possible Apple's memcpy is heavily optimized. Let's wait for Nebuchadnezzar's results :)
But isn't a heavily optimized memcpy the best way to measure memory bandwidth? Assuming you want to know the actual maximum bandwidth of the SoC? Another way to get the maximum bandwidth is to run an optimized GPU memory copy shader. If you add some math there, the benchmark might become ALU bound (= not a true measurement of SoC memory bandwidth).

Using CPU-bound "more realistic" benchmarks to measure bandwidth is misleading, unless you are also running the GPU at the same time and measuring its bandwidth as well. I am not talking about how much bandwidth certain ARM CPU cores can consume. I am talking about the SoC bandwidth (including the GPU, of course). If a CPU-based benchmark cannot eat all the bandwidth, the CPU + GPU at full steam surely can.
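
As a rough illustration of the CPU side of that (my own sketch; the 64 MB buffer size and repeat count are arbitrary), you can simply time the platform memcpy over buffers much larger than the last-level cache. It only captures what the CPU can pull through the memory system, not what the GPU could add on top:

```c
/* Bare-bones sketch: time the platform memcpy over buffers much larger
 * than the last-level cache to estimate achievable CPU-side copy
 * bandwidth. Counts read + write traffic (2 bytes moved per byte copied). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (64u * 1024 * 1024)    /* 64 MB, well past typical L2/L3 sizes */
#define REPS  10

int main(void)
{
    char *src = malloc(BYTES), *dst = malloc(BYTES);
    memset(src, 1, BYTES);           /* touch the pages before timing */
    memset(dst, 0, BYTES);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        memcpy(dst, src, BYTES);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("memcpy bandwidth: %.1f GB/s\n", 2.0 * BYTES * REPS / sec / 1e9);
    free(src);
    free(dst);
    return 0;
}
```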
 
I don't see the point in that. You know the memory speed and the internal bus widths and speeds of all the components, so of course they'll saturate the memory controller (verified). It'd be an utter failure from the manufacturer if they didn't.
 
But isn't a heavily optimized memcpy the best way to measure memory bandwidth? Assuming you want to know the actual maximum bandwidth of the SoC?
Sorry, I didn't make it clear: it's possible the memcpy function on the Android devices is not tuned enough, while it is on Apple's.
 
I don't see the point in that. You know the memory speed and the internal bus widths and speeds of all the components, so of course they'll saturate the memory controller (verified). It'd be an utter failure from the manufacturer if they didn't.

Have to say I don't quite agree with this. I've seen large discrepancies between (speed x width) and the ultimately achievable bandwidth. Since different usage scenarios result in different bus utilization efficiency, ideally you would test using several different (and transparent to interpret) methods. Which of course was originally the reason why John McCalpin's STREAM didn't just use one test but four different ones.
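
For reference, the four STREAM kernels look roughly like this (sketched from McCalpin's published definition; the timing harness, array sizes and scalar value are omitted). Each one exercises a different mix of read and write streams, which is exactly why a single copy number doesn't tell the whole story:

```c
#include <stddef.h>

/* The four STREAM kernels, each with a different read/write mix. */
void stream_kernels(double *a, double *b, double *c, size_t n, double q)
{
    for (size_t i = 0; i < n; i++) c[i] = a[i];            /* Copy:  1 read, 1 write          */
    for (size_t i = 0; i < n; i++) b[i] = q * c[i];        /* Scale: 1 read, 1 write, 1 mul   */
    for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i];     /* Add:   2 reads, 1 write         */
    for (size_t i = 0; i < n; i++) a[i] = b[i] + q * c[i]; /* Triad: 2 reads, 1 write, 1 FMA  */
}
```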
 