PowerVR Furian Architecture

So now we should count floating point output per ALU per cycle as ALU number x1.5?
 

Instead of 2 MADDs (4 FLOPs) per SIMD lane, you now have 1 MADD + 1 MUL (3 FLOPs) per SIMD lane (if I haven't misunderstood their slides).

Rogue, 4 clusters @ 500 MHz:
4 * SIMD16 * 4 FLOPs * 0.5 GHz = 128 GFLOPS FP32
Furian, 4 clusters @ 500 MHz:
4 * SIMD32 * 3 FLOPs * 0.5 GHz = 192 GFLOPS FP32
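
If anyone wants to plug in other cluster counts or clocks, here is a minimal Python sketch of the same arithmetic. The function name and parameter names are my own, and the per-lane FLOP counts are just my reading of the slides as stated above, so treat it as back-of-the-envelope only.

```python
def peak_gflops(clusters, lanes_per_cluster, flops_per_lane, clock_ghz):
    """Theoretical peak FP32 throughput in GFLOPS (hypothetical helper)."""
    return clusters * lanes_per_cluster * flops_per_lane * clock_ghz

# Rogue: SIMD16 per cluster, 2 MADDs per lane = 4 FLOPs per lane per clock
rogue = peak_gflops(clusters=4, lanes_per_cluster=16, flops_per_lane=4, clock_ghz=0.5)

# Furian: SIMD32 per cluster, 1 MADD + 1 MUL per lane = 3 FLOPs per lane per clock
furian = peak_gflops(clusters=4, lanes_per_cluster=32, flops_per_lane=3, clock_ghz=0.5)

print(f"Rogue : {rogue:.0f} GFLOPS")
print(f"Furian: {furian:.0f} GFLOPS ({furian / rogue:.2f}x Rogue)")
```

At equal cluster count and clock, the ratio works out to 1.5x, which is where the "x1.5" in the question comes from.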
 
That's.. exactly the same as I wrote, no?

Each cluster has 32 ALUs, 16 of which are MADD and the other 16 are MUL. 2*16 + 16 = 48.
1 cluster does 32*1.5 = 48 floating point operations per clock.

Clusters now have twice the ALUs, but half of them only do the multiply operation.
 
The author indicates that cores won't be in end-user products until at least the end of 2018, which raises the question: what will Apple be using in the next-gen iPhone? The same as last year with higher clocks? A rework of last year's design? Or has Apple already gotten the raw IP and is now designing its own graphics blocks based on Furian?
 
That's.. exactly the same as I wrote, no?

Each cluster has 32 ALUs, 16 of which are MADD and the other 16 are MUL. 2*16 + 16 = 48.
1 cluster does 32*1.5 = 48 floating point operations per clock.

It's 96 FLOPs/cluster:
32 lanes * 2 FLOPs (MADD) = 64
32 lanes * 1 FLOP (MUL) = 32
---------------------------------------------
64 + 32 = 96 FLOPs/cluster
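
To make the disagreement explicit, here is a small Python sketch of the two ways of counting. The variable names are mine; reading A is the quoted post (16 MADD units plus 16 MUL-only units), and reading B is what I'm describing (every one of the 32 lanes has a MADD and a MUL).

```python
LANES_PER_CLUSTER = 32
MADD_FLOPS = 2  # multiply + add
MUL_FLOPS = 1   # multiply only

# Reading A: the 32 ALUs are split into 16 MADD units and 16 MUL-only units
reading_a = 16 * MADD_FLOPS + 16 * MUL_FLOPS

# Reading B: each of the 32 lanes issues one MADD and one MUL per clock
reading_b = LANES_PER_CLUSTER * MADD_FLOPS + LANES_PER_CLUSTER * MUL_FLOPS

print(f"Reading A: {reading_a} FLOPs/cluster/clock")
print(f"Reading B: {reading_b} FLOPs/cluster/clock")
```

Reading B matches the 4 * SIMD32 * 3 FLOPs figure used for the 192 GFLOPS number earlier in the thread.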

Now I need something for fillrates :p

Clusters now have twice the ALUs, but half of them only do the multiply operation.

Hairsplitting: twice the lanes or stream processors. In my weird book an ALU = SIMD. I wonder if they kept the vec2 FP16 stuff, there's no reason to get rid of it.

The author indicates that cores won't be in end-user products until at least the end of 2018, which raises the question: what will Apple be using in the next-gen iPhone? The same as last year with higher clocks? A rework of last year's design? Or has Apple already gotten the raw IP and is now designing its own graphics blocks based on Furian?

9 or 10 (7XT-derived) clusters at roughly the same frequency? I doubt anyone has ever said when RTL for Furian was delivered. If partners got it a year ago then the answer is obvious. In any other case Ryan is right.
 
One thing I've noticed is they're not saying how high the cluster count can go with Furian.
They claim Rogue could go up to 16 clusters but above 12 clusters it would lose scalability, but Furian has a higher limit. They don't say how high it is, though.
Could this scale up to 32+ clusters and match notebook/desktop-level APUs?
 
One thing I've noticed is they're not saying how high the cluster count can go with Furian.
They claim Rogue could go up to 16 clusters but above 12 clusters it would lose scalability, but Furian has a higher limit. They don't say how high it is, though.
Could this scale up to 32+ clusters and match notebook/desktop-level APUs?

The exact phrasing for Rogue is: "successful up to 12 clusters, with theoretical limit of 16". Apple has integrated 12 clusters into the A9X SoC, and although there is a 16-cluster GT7900, it never ended up materializing in a product. I seriously doubt that going beyond 16 clusters wouldn't make sense for Rogue, and a GT7900 is in theory already at low-end notebook/desktop level even today: https://www.imgtec.com/blog/powervr-gt7900-redefining-performance-efficiency/

All that the slide in question (http://www.anandtech.com/Gallery/Album/5508#12) claims, IMHO, is that they could theoretically scale even higher in performance with Furian, but as long as there's no interested licensee for it, it's just theoretical marketing wash.
 
I haven't checked, but my gut instinct is that 8XT in products late 2018 at the earliest would mean a bigger gap than previously seen in IMG's high-end IP coming to market.

If true, that would suggest it's a reaction to a lack of desire from their licensees for ever-increasing graphics/GPU compute performance.

Also, assuming Apple is a customer that has had very early deliverables, one wonders who the other customer is who is also in at an extremely early stage (the announcement cites "multiple partners").
 
One thing I've noticed is they're not saying how high the cluster count can go with Furian.
They claim Rogue could go up to 16 clusters but above 12 clusters it would lose scalability, but Furian has a higher limit. They don't say how high it is, though.
Could this scale up to 32+ clusters and match notebook/desktop-level APUs?

The EE Times article states up to 64 clusters, although they expect not to see that initially.

http://www.eetimes.com/document.asp?doc_id=1331445&page_number=2
Theoretically, SoC designers can connect as many as 64 PowerVR clusters with Furian. Rogue was limited to about 12 to 16 clusters in an SoC, a level that will probably still be observed for initial SoCs using Furian cores.

So, working from the marketing numbers, Furian could extend to around 10x the gaming performance of an A9X at similar frequencies, assuming bandwidth wasn't a bottleneck.

Apple might look at a high-ish cluster count Furian for future MacBooks.
 
Also, assuming Apple is a customer that has had very early deliverables, one wonders who the other customer is who is also in at an extremely early stage (the announcement cites "multiple partners").

Could even be some super expensive future TV set SoC with something like a dual cluster Furian...

The EE Times article states up to 64 clusters, although they expect not to see that initially.

http://www.eetimes.com/document.asp?doc_id=1331445&page_number=2
I'm too bored now to look it up, but I'm pretty certain that when Rogue was announced, company officials claimed 32 clusters and beyond for it. EE Times also makes the mistake of interpreting it as if the max sensible design latency of Rogue is just 16 clusters.

So, working from the marketing numbers, Furian could extend to around 10x the gaming performance of an A9X at similar frequencies, assuming bandwidth wasn't a bottleneck.
Apple might look at a high-ish cluster count Furian for future MacBooks.

With what kind of CPU exactly? In theory the bandwidth problem could be solved with HBM if needed, but it all sounds too complicated for my taste. I'd rather believe Kyle Bennett's wild theory for a SoC with an Intel CPU and an AMD GPU before that.
 
Without making any product or customer claims, there are significant changes in Furian that make it (a lot) easier for the architecture to scale up to high SPU counts. That said, really the up-to numbers are mostly marketing; if a customer comes and asks for something big and commits, we will build it, announced or not.

In the early stages of a new architecture or revision, before the RTL is final and customers have committed to licensing something, we pick a set of target cores we think are likely to be popular and start to build those. That set can and does change as the first customers get on board, and from then the roadmap is customer driven. The base architecture is configurable enough that it's practically impossible for us to build and verify every possible scaling configuration, so the customer-driven approach is necessarily the correct one.
 
* I assume baseline featureset for Furian cores is still DX10?
* Rogue could optionally be DX11.2 if the customer wanted it; is Furian scaling up to DX12.x if needed?
* I still can't figure out how to calculate texel fillrate for Furian vs. Rogue at the same amount of TMUs for each (yes I know it sounds dumb...)
* Front end triangles are at 0.5 Tris/clock if I'm reading your table accurately; assume I have 4 FEs in something like a 12 cluster 7XT GPU. Is the geometry throughput still the same?

Thank you in advance for whatever you can answer.
 
Baseline API support is DX10 if you take the Direct3D view of things. It's better to take the Vulkan view of things though for the base cores. There's a tessellator, for example. Peak feature set compliance is DX12.

For the same TPU count, Furian sample rate is 2x Rogue.

Geometry throughput is the same per front end, but the number of front ends is substantially different potentially, especially as the GPU gets bigger.
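
For anyone trying to turn those two answers into numbers, here is a rough Python sketch. Everything in it beyond the 2x sample-rate ratio and the 0.5 tris/clock per front end quoted in the question is an assumption for illustration: the function names are mine, `ROGUE_SAMPLES_PER_TPU_CLK` is a placeholder rather than a published figure, and the front-end counts are hypothetical.

```python
ROGUE_SAMPLES_PER_TPU_CLK = 1.0  # placeholder, NOT a published figure
FURIAN_SAMPLES_PER_TPU_CLK = 2 * ROGUE_SAMPLES_PER_TPU_CLK  # "2x Rogue" per TPU
TRIS_PER_FE_CLK = 0.5  # per front end, as read from the table in the question

def sample_rate_msps(tpus, samples_per_tpu_clk, clock_mhz):
    """Peak texture sample rate in Msamples/s (hypothetical helper)."""
    return tpus * samples_per_tpu_clk * clock_mhz

def tri_rate_mtps(front_ends, clock_mhz, tris_per_fe_clk=TRIS_PER_FE_CLK):
    """Peak geometry rate in Mtris/s; scales with the number of front ends."""
    return front_ends * tris_per_fe_clk * clock_mhz

# Same TPU count and clock: only the per-TPU sample rate differs
rogue_tex = sample_rate_msps(tpus=4, samples_per_tpu_clk=ROGUE_SAMPLES_PER_TPU_CLK, clock_mhz=500)
furian_tex = sample_rate_msps(tpus=4, samples_per_tpu_clk=FURIAN_SAMPLES_PER_TPU_CLK, clock_mhz=500)
print(f"Furian/Rogue sample rate at equal TPUs: {furian_tex / rogue_tex:.1f}x")

# Geometry: same per front end, so total throughput follows the front-end count
for fes in (4, 8):  # 4 FEs as in the question; 8 is a made-up larger config
    print(f"{fes} front ends @ 500 MHz: {tri_rate_mtps(fes, 500):.0f} Mtris/s")
```

The only parts that don't depend on the placeholder are the 2x ratio and the linear scaling with front-end count; the absolute sample rates are not meaningful until a real per-TPU figure is substituted.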
 
Yep, last bit of work before I disappear. If anyone has questions, please ask.

A little late but I'll try anyways.. in Rogue what was the utilization rate on the 2nd ALU and how often was it a MUL operation?
 