Xbox Series X [XBSX] [Release November 10 2020]

function · Aug 18, 2020

This slide references a "multi core command processor". Can't find any mention of the command processor in RDNA 1 being multi core.

In the past MS do seem to have liked their custom command processors, perhaps this one is too...

scently · Aug 18, 2020

function said:
This slide references a "multi core command processor". Can't find any mention of the command processor in RDNA 1 being multi core.

In the past MS do seem to have liked their custom command processors, perhaps this one is too...

They customized the Command Processor on X1 and customized it further in the X1X so I would expect further customizations here too.

Kugai Calo · Aug 18, 2020

function said:
This slide references a "multi core command processor". Can't find any mention of the command processor in RDNA 1 being multi core.

In the past MS do seem to have liked their custom command processors, perhaps this one is too...

The "Geometry Engine" appearing here is also interesting...

This is some interesting insight into RDNA 2 too, 7-issue superscalar?

disco_ · Aug 18, 2020

Kugai Calo said:
The "Geometry Engine" appearing here is also interesting...

Not really. It's part of RDNA 1, so it makes sense that it's in RDNA 2 as well.

iroboto · Aug 18, 2020

Kugai Calo said:
This is some interesting insight into RDNA 2 too, 7-issue superscalar?

I think RDNA 1 does this as well, but perhaps not as broken down as this.
RDNA issues 2 Vector ALU and 2 scalar as per what the white paper says. The 1 Vector Data and 2 Control may belong to the CU. So that's still 7.

pTmdfx · Aug 18, 2020

Kugai Calo said:
This is some interesting insight into RDNA 2 too, 7-issue superscalar?

This is a similar rate to GCN and RDNA. Basically it means the arbitrator selects up to N instructions from all non-blocked wavefronts, but only one instruction will ever be selected per wavefront. Otherwise, either the ISA needs to be compiler scheduled VLIW (definitely not the RDNA ISA) for co-issuing from the same wavefront, or the hardware itself needs to do scoreboarding which most GPU vendors deliberately avoid.

So it is not superscalar by the books, and that is unlikely to change.

iroboto · Aug 18, 2020

pTmdfx said:
So it is not superscalar by the books, and it is unlikely to change.

A bit OT, but how is this normally defined? Doesn't superscalar just mean multiple instructions being executed in parallel on different subsystems?

Kugai Calo · Aug 18, 2020

iroboto said:
I think RDNA 1 does this as well, but perhaps not as broken down as this.
RDNA issues 2 Vector ALU and 2 scalar as per what the white paper says. The 1 Vector Data and 2 Control may belong to the CU. So that's still 7.

Yeah, so still new insight, not necessarily RDNA2-specific.

pTmdfx said:
This is a similar rate to GCN and RDNA. Basically it means the arbitrator selects up to N instructions from all non-blocked wavefronts, but only one instruction will ever be selected per wavefront. Otherwise, either the ISA needs to be compiler scheduled VLIW (definitely not the RDNA ISA) for co-issuing from the same wavefront, or the hardware itself needs to do scoreboarding which most GPU vendors deliberately avoid.

So it is not superscalar by the books, and that is unlikely to change.

Is scoreboarding less costly to implement than Tomasulo's algorithm? Although it seems unnecessary, like you said, since GPUs don't need to deal with name and data dependences...
Also it doesn't have to be VLIW to be software scheduled, just look at Nvidia's [post-] Volta ISAs.

Kugai Calo · Aug 18, 2020

iroboto said:
A bit OT, but how is this normally defined? Doesn't superscalar just mean multiple instructions being executed in parallel on different subsystems?

Traditionally I think it implies multiple instructions that belong to the same thread, therefore control and data hazards arise?

iroboto · Aug 18, 2020

Kugai Calo said:
Traditionally I think it implies multiple instructions that belong to the same thread, therefore control and data hazards arise?

Yea, I think I understand where he's going with this.
The dispatcher needs to look at all the commands in the queue and figure out which ones to run in parallel and which ones individually. So in effect there is some form of scoreboarding of instructions happening.
I believe that in RDNA 1 there is no real dispatcher for the CU, nor a single instruction queue for it to look at; the instructions (vector, memory and scalar) are split up into separate memory pools on the CU, iirc.

Kugai Calo · Aug 18, 2020

iroboto said:
Yea, I think I understand where he's going with this.
The dispatcher needs to look at all the commands in the queue and figure out which ones to run in parallel and which ones individually. So in effect there is some form of scoreboarding of instructions happening.
I believe that in RDNA 1 there is no real dispatcher for the CU, nor a single instruction queue for it to look at; the instructions (vector, memory and scalar) are split up into separate memory pools on the CU, iirc.

As another example, an n-way SMT core launches <= n instructions per cycle, but it's not called superscalar.
Yes, there will be an instruction dispatcher in the CU. You can't split the instruction stream into several streams by instruction class and put them in different address ranges; it just doesn't make sense, programs don't work like that. On the other hand, there's the VLIW approach, where you may put instructions of different classes into their corresponding 'slot' within an instruction 'packet'.

pTmdfx · Aug 18, 2020

It is not “superscalar by the books” because the motivation of superscalar is exploiting ILP in one instruction stream, through tracking data dependencies to discover opportunities for co-issuing. GCN and RDNA do no such thing — they are instead multiplexing up to 20 independent instruction streams into a multi-issue pipeline, while enforcing a 1 inst/clock issue limit on each stream. So from the perspective of each inst. stream, instructions are strictly sequentially issued one by one with no superscalar capability.
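To make the scheme concrete, here is a toy model of that kind of arbiter. It is only a sketch: the port mix (2 vector ALU, 2 scalar, 1 vector data, 2 control) is taken from the 7-issue breakdown discussed earlier in the thread, while the `Wavefront` class, the `select_issue` helper and the fixed-priority policy are assumptions for illustration, not how the real hardware arbitrates.

```python
from dataclasses import dataclass


@dataclass
class Wavefront:
    wave_id: int
    next_op: str           # class of the next instruction in program order
    blocked: bool = False  # e.g. waiting on memory or a dependency


def select_issue(waves, ports):
    """Pick at most ONE instruction per wavefront, limited by free ports per class."""
    free = dict(ports)
    issued = []
    for w in waves:  # assumed fixed priority order; real policy is unknown
        if w.blocked:
            continue
        if free.get(w.next_op, 0) > 0:
            free[w.next_op] -= 1
            issued.append((w.wave_id, w.next_op))
    return issued


# Hypothetical port mix summing to 7 issue slots per clock
PORTS = {"valu": 2, "salu": 2, "vmem": 1, "control": 2}

waves = [
    Wavefront(0, "valu"),
    Wavefront(1, "valu"),
    Wavefront(2, "valu"),                # loses out: both VALU ports are taken
    Wavefront(3, "salu", blocked=True),  # skipped: blocked this clock
    Wavefront(4, "vmem"),
    Wavefront(5, "salu"),
]

issued = select_issue(waves, PORTS)
print(issued)
```

No wavefront ever appears twice in `issued`, so each stream still sees strictly sequential issue; the width comes purely from having many streams to choose from.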

pTmdfx · Aug 18, 2020

I stand corrected a bit on RDNA scheduling: according to the ISA documentation, RDNA actually does track data dependencies in hardware and does not require manually inserted wait states, even though it now exposes the multi-cycle execution pipeline (unlike GCN).

So while I can’t rule out that it can do superscalar issue “by the books”, my two cents is that it probably doesn't, since it can already issue from a huge pool of wavefronts even with each of them contributing just 1 per clock.
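A minimal sketch of what hardware dependency tracking means for a single wavefront, assuming in-order issue of at most one instruction per clock. The `LATENCY` values and the `issue_cycles` helper are made up for illustration and are not taken from the ISA documentation; the point is only that a dependent instruction stalls by itself, with no software-inserted wait states.

```python
def issue_cycles(program, latency):
    """Return the clock at which each instruction of one in-order wave issues.

    program: list of (dst_register, [src_registers], op_class)
    latency: op_class -> clocks until the result is available
    """
    ready_at = {}  # register -> clock at which its value becomes available
    cycle = 0      # earliest clock the next instruction could issue
    issued_at = []
    for dst, srcs, op in program:
        # Hardware interlock: wait until every source operand is ready.
        start = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        issued_at.append(start)
        ready_at[dst] = start + latency[op]
        cycle = start + 1  # strictly one instruction per clock for this wave
    return issued_at


LATENCY = {"valu": 5, "salu": 2}  # hypothetical numbers

program = [
    ("v0", [], "valu"),
    ("v1", ["v0"], "valu"),        # depends on v0, so it stalls
    ("s0", [], "salu"),            # independent, issues right after
    ("v2", ["v1", "s0"], "valu"),  # waits for the slower of v1 and s0
]

cycles = issue_cycles(program, LATENCY)
print(cycles)
```

The stall clocks in a wave like this are exactly the slots other wavefronts can fill, which is why co-issuing from the same wave buys relatively little.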

iroboto · Aug 18, 2020

conference live blog
https://www.anandtech.com/show/1599...ft-xbox-series-x-system-architecture-600pm-pt

09:39PM EDT - Q: With 20 channels GDDR6, is that really cheaper than 2 stacks HBM? A: We're not religious about which DRAM tech to use. We needed the GPU to have a ton of bandwidth. Lots of channels allows for low latency requests to be serviced. HBM did have an MLC model thought about, but people voted with their feet and JEDEC decided not to go with it.

09:38PM EDT - Q: Says Zen 2 is server class, but you use L3 mobile class? A: Yeah our caches are different, but I won't say any more, that's more AMD.

09:37PM EDT - Q: TSMC 7nm enhanced, is it N7P, N7+, or something else? A: It's not base 7nm, it's progressed over time. Lots of work between AMD and TSMC to hit our targets and what we needed

some items we were discussing:

09:22PM EDT - supports up to 2x2
09:22PM EDT - VRS
09:21PM EDT - CUs have 25% better perf/clock compared to last gen (my edit: I assume last gen is with respect to RDNA 1 -- so this is not the advertised 50% better perf/watt as AMD specifies)
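On the "supports up to 2x2" VRS item, a back-of-the-envelope sketch of what a 2x2 coarse rate does to pixel shader invocation counts. It assumes one invocation per 2x2 block in coarse regions and one per pixel elsewhere; the `shader_invocations` helper, the 4K resolution and the 50% coarse fraction are illustrative assumptions, not figures from the talk.

```python
def shader_invocations(width, height, coarse_fraction, rate=(2, 2)):
    """Count pixel shader invocations when part of the frame is shaded coarsely."""
    pixels = width * height
    coarse_pixels = int(pixels * coarse_fraction)
    fine_pixels = pixels - coarse_pixels
    # One invocation covers rate[0] * rate[1] pixels in coarse regions.
    return fine_pixels + coarse_pixels // (rate[0] * rate[1])


full = shader_invocations(3840, 2160, 0.0)   # everything at 1x1
mixed = shader_invocations(3840, 2160, 0.5)  # half the frame at 2x2
saving = 1 - mixed / full
print(full, mixed, saving)
```

Fewer invocations does not translate one-to-one into frame time, since only the pixel shading portion of the frame shrinks.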

anexanhume · Aug 18, 2020

scently said:
I am not sure how you can prove this. Neither did I remark on its nature as hardware or software; I simply stated that it's custom....

I didn’t say you said it was hardware

Software is inherently custom unless it’s OSS.

Once AMD confirmed RDNA 2 was full Tier 2 VRS, I went and re-read the MS VRS patent and came away with the above interpretation you quoted.

Kugai Calo · Aug 18, 2020

pTmdfx said:
...and does not require manually inserted wait states, even though it now exposes the multi-cycle execution pipeline (unlike GCN).

On that you can play with shader disassembly here, just select Radeon GPU Analyzer as compiler.

iroboto · Aug 18, 2020

anexanhume said:
I didn’t say you said it was hardware

Software is inherently custom unless it’s OSS.

Once AMD confirmed RDNA 2 was full Tier 2 VRS, I went and re-read the MS VRS patent and came away with the above interpretation you quoted.

But hardware is involved. So perhaps I'm not understanding how their patent is only software based

Tiny area cost for 10-30% performance gain

PSman1700 · Aug 18, 2020

Well, that confirms their custom VRS is hardware.

anexanhume · Aug 18, 2020

iroboto said:
But hardware is involved. So perhaps I'm not understanding how their patent is only software based

Tiny area cost for 10-30% performance gain

Of course hardware is involved. RDNA 2 is VRS tier 2 compatible.

The patent in question consistently refers to the methods described as “application-directed”, as in, the operations performed by the GPU are not completely self-determined in hardware.

Rangers · Aug 18, 2020

BRiT said:
Hot Chips is not about marketing or about consumers, it's about technical matters, and that slide is entirely accurate.

Much of this appears to be marketing TBH. It's great stuff to dig into and legitimate technical info, of course, but they're not going to tell us about any potential hidden drawbacks of the hardware here.

I don't think there will be many native 8K games, and it was definitely about marketing to put that on the Scarlett chip etc. 4K is this generation (if that).
