Xbox Series X [XBSX] [Release November 10 2020]

This slide references a "multi core command processor". Can't find any mention of the command processor in RDNA 1 being multi core.

In the past, MS does seem to have liked their custom command processors; perhaps this one is custom too...


abHvXEjBo5CvTr8SpwsAdR-1920-80.jpg
 
The "Geometry Engine" appearing here is also interesting...

O89Q8DY.png

This is some interesting insight into RDNA 2 too, 7-issue superscalar?
 
This is a similar rate to GCN and RDNA. Basically it means the arbiter selects up to N instructions from all non-blocked wavefronts, but only one instruction will ever be selected per wavefront. Otherwise, either the ISA needs to be compiler-scheduled VLIW (definitely not the RDNA ISA) for co-issuing from the same wavefront, or the hardware itself needs to do scoreboarding, which most GPU vendors deliberately avoid.

So it is not superscalar by the books, and that is unlikely to change.
 
I think RDNA 1 does this as well, but perhaps not as broken down as this.
RDNA issues 2 vector-ALU and 2 scalar instructions, per the whitepaper. The 1 vector-data and 2 control slots may belong to the CU, so that's still 7.
Yeah, so still new insight, not necessarily RDNA2-specific.

Is scoreboarding less costly to implement than Tomasulo's algorithm? Although it seems unnecessary; like you said, GPUs don't need to deal with name and data dependences...
Also, it doesn't have to be VLIW to be software scheduled; just look at Nvidia's [post-]Volta ISAs.
 
Traditionally I think it implies multiple instructions that belong to the same thread, therefore control and data hazards arise?
Yea, I think I understand where he's going with this.
The dispatcher needs to look at all the commands in queue and figure out which ones to run in parallel together and which ones individually. So in effect there is some form of scoreboarding of instructions happening.
I believe there is no real dispatcher for the CU or a single instruction queue for it to look at wrt RDNA 1, the instructions (vector, memory and scalar) are split up into separate memory pools on the CU iirc..
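For what it's worth, the "some form of scoreboarding" idea being debated here can be sketched in a few lines. This is a generic textbook scoreboard, not a claim about what RDNA actually does; the register names are made up:

```python
# Minimal register scoreboard sketch (illustrative only; register
# names and structure are made up, not any vendor's actual design).
busy = set()   # destination registers with results still in flight

def can_issue(srcs, dst):
    """Block on RAW (a source is still being written) and WAW (the
    destination is already being written) hazards."""
    return not ((set(srcs) | {dst}) & busy)

def issue(dst):
    busy.add(dst)      # result now in flight

def writeback(dst):
    busy.discard(dst)  # result available again

issue("r3")                        # some instruction writing r3
print(can_issue(["r3"], "r4"))     # False: RAW hazard on r3
writeback("r3")
print(can_issue(["r3"], "r4"))     # True: r3 has completed
```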
 
In another example, an n-way SMT core launches <= n instructions per cycle, but it's not called superscalar.
Yes, there will be an instruction dispatcher in the CU. You can't split the instruction stream into multiple streams of different instruction classes and put them under different address ranges; that just doesn't make sense, since programs don't work like that. On the other hand, there's the VLIW approach, where you may put instructions of different classes into their corresponding 'slot' within an instruction 'packet'.
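To make the contrast concrete, here is a toy illustration (the mnemonics are invented) of a VLIW packet with fixed per-class slots versus a single interleaved instruction stream:

```python
# Illustrative contrast (made-up mnemonics): a VLIW packet carries one
# instruction per class in a fixed compile-time slot, while a
# conventional stream just interleaves classes in program order.
vliw_packet = {
    "vector": "v_add v0, v1, v2",
    "scalar": "s_cmp s0, 0",
    "memory": "ld v3, [s4]",   # an empty slot would be an explicit no-op
}

single_stream = [
    "v_add v0, v1, v2",   # classes interleave in program order;
    "ld v3, [s4]",        # the hardware, not the compiler, decides
    "s_cmp s0, 0",        # what goes down which pipe each cycle
]

print(sorted(vliw_packet))   # ['memory', 'scalar', 'vector']
print(len(single_stream))    # 3
```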
 
It is not “superscalar by the books” because the motivation of superscalar is exploiting ILP in one instruction stream, through tracking data dependencies to discover opportunities for co-issuing. GCN and RDNA do no such thing — they are instead multiplexing up to 20 independent instruction streams into a multi-issue pipeline, while enforcing a 1 inst/clock issue limit on each stream. So from the perspective of each inst. stream, instructions are strictly sequentially issued one by one with no superscalar capability.
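The multiplexing described above can be sketched as a toy simulator. This is a hypothetical model for illustration only, not a description of actual RDNA hardware; the issue width and wavefront count are taken from the figures quoted in this thread:

```python
# Toy model of multi-wavefront issue arbitration (illustrative only).
# Each cycle: select <= ISSUE_WIDTH instructions in total, and at most
# one instruction per wavefront, so a single instruction stream can
# never issue superscalar even though the machine is multi-issue.

ISSUE_WIDTH = 7  # issue slots per cycle (vector, scalar, memory, ...)

def issue_cycle(wavefronts):
    """wavefronts: list of per-wavefront instruction queues (lists).
    Returns the (wavefront_id, instruction) pairs issued this cycle."""
    issued = []
    for wf_id, queue in enumerate(wavefronts):
        if len(issued) == ISSUE_WIDTH:
            break
        if queue:                                 # wavefront is ready
            issued.append((wf_id, queue.pop(0)))  # at most one per wavefront
    return issued

# 20 wavefronts, each with its own strictly sequential stream
wfs = [[f"wf{w}_op{i}" for i in range(3)] for w in range(20)]
first = issue_cycle(wfs)
print(len(first))                  # 7 instructions issued in one cycle
print(len({w for w, _ in first}))  # ...from 7 different wavefronts
```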
 
I stand corrected a bit on RDNA scheduling. According to the ISA documentation, RDNA actually does track data dependencies in hardware, and does not require manually inserted wait states even though it now exposes the multi-cycle execution pipeline (unlike GCN).

So while I can’t rule out that it can do superscalar issue “by the books”, my two cents is that it probably isn’t, since it can already issue from a huge pool of wavefronts even with each of them contributing just 1 per clock.
 
conference live blog
https://www.anandtech.com/show/1599...ft-xbox-series-x-system-architecture-600pm-pt

09:39PM EDT - Q: With 20 channels GDDR6, is that really cheaper than 2 stacks HBM? A: We're not religious about which DRAM tech to use. We needed the GPU to have a ton of bandwidth. Lots of channels allows for low latency requests to be serviced. HBM did have an MLC model thought about, but people voted with their feet and JEDEC decided not to go with it.
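For reference, the back-of-envelope bandwidth math behind the 20-channel choice, using the publicly reported 16-bit GDDR6 channels and 14 Gbps per pin (treat those per-pin figures as assumptions here):

```python
# Peak bandwidth from channel count (figures per public XSX specs;
# treat them as assumptions for this back-of-envelope calculation).
channels = 20
bits_per_channel = 16          # GDDR6 channel width
gbps_per_pin = 14              # reported transfer rate per pin

bus_width = channels * bits_per_channel          # 320-bit bus
bandwidth_GBps = bus_width * gbps_per_pin / 8    # bits -> bytes
print(bus_width)        # 320
print(bandwidth_GBps)   # 560.0 GB/s peak
```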

09:38PM EDT - Q: Says Zen 2 is server class, but you use L3 mobile class? A: Yeah our caches are different, but I won't say any more, that's more AMD.

09:37PM EDT - Q: TSMC 7nm enhanced, is it N7P, N7+, or something else? A: It's not base 7nm, it's progressed over time. Lots of work between AMD and TSMC to hit our targets and what we needed


some items we were discussing:
  • 09:22PM EDT - supports up to 2x2
  • 09:22PM EDT - VRS
  • 09:21PM EDT - CUs have 25% better perf/clock compared to last gen (my edit: I assume last gen is with respect to RDNA 1 -- so this is not the advertised 50% better perf/watt as AMD specifies)
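To make the note's point concrete: perf/clock and perf/watt are different ratios, so a 25% perf/clock gain neither implies nor contradicts a 50% perf/watt gain. With purely hypothetical numbers:

```python
# Hypothetical numbers, purely to show the two metrics are independent.
perf_per_clock_gain = 1.25            # the quoted 25% per-clock figure

# perf/watt also depends on clock and power, neither of which
# perf/clock captures; with made-up clock and power deltas:
clock_gain = 1.10                     # hypothetical clock increase
power_ratio = 0.95                    # hypothetical new/old power
perf_per_watt_gain = perf_per_clock_gain * clock_gain / power_ratio
print(round(perf_per_watt_gain, 3))   # 1.447 with these made-up inputs
```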
 
I am not sure how you can prove this; nor did I remark on its nature as hardware or software. I simply stated that it's custom...

I didn’t say you said it was hardware :)

Software is inherently custom unless it’s OSS.

Once AMD confirmed RDNA 2 was full Tier 2 VRS, I went and re-read the MS VRS patent and came away with the above interpretation you quoted.
 
But hardware is involved, so perhaps I'm not understanding how their patent is only software-based.
  • Tiny area cost for 10-30% performance gain
202008180222121.jpg
 
Of course hardware is involved. RDNA 2 is VRS tier 2 compatible.

The patent in question consistently refers to the methods described as “application-directed”, as in, the operations performed by the GPU are not completely self-determined in hardware.
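For intuition, "application-directed" at Tier 2 means the application computes a per-tile rate map (a shading-rate image) and hands it to the GPU rather than the hardware deciding on its own. A toy sketch of building such a map; the tile size and the motion heuristic are made up for illustration:

```python
# Toy shading-rate image: the application assigns a coarse shading
# rate per screen tile, e.g. based on estimated motion (all values
# here are made up for illustration; real tile sizes vary by GPU).
TILE = 16                           # pixels per tile edge (assumption)

RATE_1X1, RATE_2X2 = "1x1", "2x2"   # full rate vs. quarter rate

def build_rate_map(width, height, motion):
    """motion: dict mapping (tile_x, tile_y) -> motion magnitude.
    Fast-moving tiles get coarse 2x2 shading; the rest stay 1x1."""
    tiles_x, tiles_y = width // TILE, height // TILE
    return {
        (tx, ty): RATE_2X2 if motion.get((tx, ty), 0.0) > 0.5 else RATE_1X1
        for ty in range(tiles_y) for tx in range(tiles_x)
    }

rates = build_rate_map(64, 32, {(0, 0): 0.9})
print(rates[(0, 0)])   # 2x2  (fast-moving tile shaded coarsely)
print(rates[(1, 0)])   # 1x1
```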
 
Hot Chips is not about marketing or about consumers; it's about technical matters, and that slide is entirely accurate.


Much of this appears to be marketing, TBH. It's great material for digging deeper, and legitimate technical info of course, but they're not going to tell us about any potential hidden drawbacks of the hardware here.

I don't think there will be many native 8K games, and it was definitely about marketing to put that on the Scarlett chip etc. 4K is this generation (if that).
 