Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

OK. So let me try to wrap my head around this. Depending on the instruction, you could get execution on the Core ALU, the Core ALU + the associated Side ALU, the associated Side ALU alone, or the associated Side ALU + the independent Side ALU? Can you execute on all three simultaneously?

From what I understand of the original patent, the answer is no.
 
I think the mid-gen consoles have skewed expectations for the next-gen consoles. The reality is that an 8-8.5 TF console would be nearly a 4.5X jump over the base PS4 GPU and nearly a 6.5X jump over the base Xbox One GPU... not even taking architectural advantages into account.

If you do some napkin math to figure out how many active CUs would likely fit on a next-gen chip (using the current-gen non-slim consoles' APU die size as a reference), it's somewhere in the range of 48-52 active CUs.

That's only likely to give 8-8.5 TFLOPs... unless you think the clocks are going to go way beyond what we have seen on consoles.
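To make that napkin math explicit, here's a minimal sketch. It assumes GCN-style CUs (64 lanes, 2 FLOPs per FMA per cycle); the clock values are illustrative assumptions, not leaks.

# Peak FP32 throughput for a GCN-style GPU (napkin math).
# TFLOPs = CUs * 64 lanes/CU * 2 FLOPs per FMA * clock (GHz) / 1000
def peak_tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000.0

for cus in (48, 52):
    for clock_ghz in (1.3, 1.4):          # assumed console-ish clocks
        print(f"{cus} CUs @ {clock_ghz} GHz -> {peak_tflops(cus, clock_ghz):.1f} TF")
# 48 CUs @ 1.3 GHz -> 8.0 TF ... 52 CUs @ 1.4 GHz -> 9.3 TF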
 
More importantly, 7nmFF can only do so much vs 16nmFF. Scorpio further muddies things with its seemingly exceptional GPU clock boost vs the other three SKUs. That may have implications for MS's higher-end SKU, but we've had a fairly good idea of what to expect for a while if you've been paying attention. :V

It's only somewhat recently that we've had any sort of hint/indication of how 7nm performs (Ryzen density & power consumption, less so on clocks), but it still remains to be seen how GPUs will fare in the coming transition.

/promptly deletes every personal console tech post in the last 2 years.
 
Surely there are going to be some big architectural improvements with Navi, otherwise why has it taken so long to come out? I refuse to believe it's just going to be an RX 580 at 7nm.

Weren't there rumours that some Ryzen guys were sent over to improve energy usage?
 
It wasn't just a rumour; it was confirmed by AMD in a Zen retrospective round table.

 
Surely there are going to be some big architectural improvements with Navi, otherwise why has it taken so long to come out? I refuse to believe it's just going to be an RX 580 at 7nm.

Let's not confuse architecture (how performance is extracted, for example) with what sort of HW can physically be fabricated (in a ballpark sense). Obviously, we have no good reference point for what any new architecture would look like in terms of die space, but there's an upper ceiling on what is possible, knowing that we have roughly double the density and some further unknowns for the uncore.

A similar die size for an Nvidia architecture clearly has different end-user performance characteristics in the desktop space, where drivers are much more important, but that might not change the theoretical numbers all that much (as a broad example).
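As a toy illustration of why "double density" doesn't simply halve a whole SoC, here's a sketch; every number in it is a made-up placeholder, not a real die measurement:

# Only logic shrinks ~2x at 7nm; analog/IO (PHYs, memory interfaces) barely scale.
logic_mm2, uncore_mm2 = 250.0, 110.0   # assumed 16nm logic/uncore split (placeholder)
logic_7nm = logic_mm2 / 2.0            # ~2x logic density
uncore_7nm = uncore_mm2 * 0.9          # uncore shrinks only slightly
print(logic_7nm + uncore_7nm)          # 224.0 mm2, well short of a naive 180 mm2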
 
I was bored.

[image: ps5arieldiagram.png (speculative PS5 "Ariel" architecture diagram)]

I'm unclear on whether this is split across multiple chips, or if it's one die. The CAKE is used to translate packets on the SDF into a form to be passed over the SERDES of an off-chip link, and on the other chip a SERDES and then another CAKE would convert the packet into the form used by the SDF.
At least from data given for EPYC, the path is SDF->CAKE-> serial link (IFOP, IFIS, etc.) ->CAKE->SDF. Units like the GPU wouldn't plug directly into a CAKE, and if this is one chip the CAKE wouldn't be necessary.
(https://www.slideshare.net/AMD/amd-epyc-microprocessor-architecture slide 17).
A CAKE is a Coherent AMD Socket Extender (https://www.slideshare.net/AMD/isscc-2018-zeppelin-an-soc-for-multichip-architectures, slide 8), and at least in part the GPU front end still works as a PCIe slave device over a non-coherent connection. For an APU like Raven Ridge, the GPU supposedly has two interfaces onto the SDF, though what type of interfaces they are isn't clear. GPUs tend to be treated as IO devices behind the IOMMU, which may mean a GPU plugged into the SDF is likely using an IOMS rather than a CCM.

For Zen, CCM is what serves as a CCX's port onto the SDF, and at least from that there's no need for the IO Memory Management Unit to intercede. The cores have their own MMUs and TLBs as hosts, whereas slave processors are kept behind the IOMMU. Most likely, going from the diagrams already given and from AMD's PPR document (https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf pg 28), the IOMMU is part of the IO complex, and is part of or linked to the IOMS. The IOMS would serve as the point of entry of IO into the SDF, with the CPUs not linking to it.
As ancillary controllers/IO, I'm not sure the ACP or display engine would link directly into the SDF rather than the IO Hub or some kind of IOMS.
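To keep the attach points above straight, here's a toy model of the Zeppelin-style fabric as described in those slides. The block names are AMD's; which of them a single-die console APU would actually use (and whether a CAKE is present at all) is my assumption:

# Clients on the Scalable Data Fabric (SDF) and the ports they use.
fabric_clients = {
    "CCX":             "CCM",   # CPU core complexes use a cache-coherent master port
    "GPU":             "IOMS",  # GPU likely enters the SDF as an IO master/slave
    "IO Hub / ACP":    "IOMS",  # PCIe, USB, audio, display hang off the IO complex
    "DRAM controller": "UMC",   # unified memory controllers terminate requests
}

# The cross-die hop, only needed if the design is split across two chips:
cross_die_path = ["SDF", "CAKE", "SerDes link (IFOP/IFIS)", "CAKE", "SDF"]
print(" -> ".join(cross_die_path))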

As for the GPU L2s, it might be plausible at the TFLOP levels in question that more than four slices would be present.

The relationship between the command processor and ACE blocks and the L2/memory isn't entirely spelled out for GCN, although they should have the ability to use the L2. I'm not sure what other paths they have, though a CAKE is unlikely to be something they'd link to directly.

CAKE means Coherent AMD socKet Extender.

And well... I decided to make a speculative diagram of the Super-SIMD unit. I'm sure it's wrong, but this is what I understood from reading the latest patents.

[image: navisupersimdspeculation.png (speculative Navi Super-SIMD diagram)]
The patent has an embodiment where the ALU block is more like one Core ALU that can pair with a Side ALU, plus another Core ALU. The patent doesn't limit itself to any one specific combination, although in that embodiment a Core ALU has a multiplier that makes it more suited for the primary set of instructions, while the Side ALU only has an adder and ancillary hardware for assisting a Core ALU on a complex op. A Core+Side pair is needed to get the full range of operations of a traditional GCN unit.
As far as the diagram goes, the two Core ALUs have straight links to the operand network, while the Side ALU hangs off their ports. The destination cache also feeds into the register file or operand network, rather than the scheduler. As far as an L0 cache goes, the related patent on the register file has the output flops of the register file labelled as an L0.
Part of the motivation for the destination cache is that it can feed back into the ALU operand ports, since the register file itself has lower peak bandwidth than the ALU blocks require.
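A minimal sketch of the destination-cache idea as I read it: recent ALU results sit in a small cache that can feed the operand ports directly, so not every read has to come from the register file. The capacity and replacement policy below are arbitrary assumptions.

from collections import OrderedDict

class DestCache:
    """Tiny result cache sitting between the ALUs and the register file."""
    def __init__(self, entries=8):
        self.entries, self.data = entries, OrderedDict()
    def write(self, reg, value):            # ALU results land here first
        self.data[reg] = value
        self.data.move_to_end(reg)
        if len(self.data) > self.entries:
            self.data.popitem(last=False)   # oldest entry spills to the register file
    def read(self, reg):                    # -> (value, hit?); a miss costs an RF read
        if reg in self.data:
            return self.data[reg], True
        return None, False

# e.g. dc = DestCache(); dc.write("v0", 1.0); dc.read("v0") -> (1.0, True)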
 
Well said. To put it shortly, data locality is king, and a big reason for the rise of memory-centric architectures. This is also in the spirit of the recent Cerny patent on the heavy use of caches for GPU operations.

https://patents.justia.com/patent/20190035050
 

Thanks for the feedback.

I plan to update the diagram, correcting it with the new info.
 
I feel architecture will play a big part; yes, there's a physical limit to what a new node can bring. What I'm saying is that using the current-gen chips to estimate what next gen will bring is the best we've got right now, but I'm hopeful for some big improvements from the next chips.

Don't AMD RX chips run hotter and use more power than Nvidia chips?
 
At stock settings and equal performance, in general yes. Other than that, it all depends on the configurations you want to compare them at.
 
French website JeuxVideo.com has a concrete rumor about two Scarlett consoles:

- Two Scarlett devices announced this E3: Lockhart and Anaconda
- Lockhart = entry level device, no disc drive, digital focused machine
- Anaconda = high end device, price similar to Xbox One X's launch price
- Xbox should announce a disc-drive-less Xbox One in the coming months
- Launch fall 2020
- For the specs sheet, they say that this is what was rumored to be but can only confirm the presence of SSDs on both machines. They're NOT confirming the full sheet.
- Halo Infinite is a cross-gen launch game
- Ninja Theory's game is slated for early 2020

Mentioned specs:
- Lockhart: Navi 4 TF / discless / 8-core Zen 2 / 12 GB GDDR6 / 1 TB NVMe SSD - One S pricing and an emphasis on streaming abilities
- Anaconda: Navi 12 TF / 8-core Zen 2 / 16 GB GDDR6 / 1 TB NVMe SSD - One X pricing


via ResetEra

Tom Warren from The Verge commented on this leak: "Not all of those specs are accurate."
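For what it's worth, if the 4 TF and 12 TF figures are taken at face value and Navi still uses 64-lane CUs (both assumptions), this is the kind of clock/CU trade-off they imply; the CU counts below are purely illustrative:

# Solve for the clock a given CU count would need to hit a TFLOP target.
def clock_ghz_for(tflops: float, cus: int) -> float:
    return tflops * 1000.0 / (cus * 64 * 2)

print(f"Lockhart  4 TF, 20 CUs: {clock_ghz_for(4, 20):.2f} GHz")   # ~1.56 GHz
print(f"Anaconda 12 TF, 52 CUs: {clock_ghz_for(12, 52):.2f} GHz")  # ~1.80 GHz
print(f"Anaconda 12 TF, 60 CUs: {clock_ghz_for(12, 60):.2f} GHz")  # ~1.56 GHz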
 
If MS is indeed launching its next console in late 2020, this E3 unveil [and GDC dev briefing] seems premature to me. But they most likely have nothing else big planned for the Xbone, and they may want to seize every marketing advantage they can.
 
But if they don't reveal it this E3, when are they going to reveal it if it's coming in 2020?
They usually show something a year before launch, at the latest.
 
Nothing was known about gen 8 until February 20, 2013, when Andrew House presented the PS4. MS followed with the May black-tent event and E3, and both consoles launched in November.
 
It seems consoles with next-gen VR headsets will be attractive to a wider public than gamers, since 4K 3D films could also be viewed on them... I think Sony, at least, has understood that potential. We may eventually see a Netflix-like 3D streaming service, but one running only on capable HW (PCs and some consoles)... So, in this view, console HW has to be lower power and lower cost, and pluggable into next-gen headsets...
 