From AMD's standpoint, a more programmable pipeline looks like a large potential strategic advantage: they have the contracts for both upcoming "new" generation consoles that will have raytracing, and are thus able to de facto set whatever standard they can get through both Microsoft and Sony as a least common denominator.
It's programmable at the cost of relying on hardware that has some undesirable properties for the workload.
RDNA does improve things a bit over GCN by having 32 instead of 64 lanes subject to divergence, and I'd assume RT hardware can be significantly narrower.
RT cores may have their own caches and buffers. I'm not sure if they interface with the L1 or L2 for Nvidia, but any local storage would probably be tuned for lower latency.
AMD's L1 takes ~100 cycles for a hit; that's the cache a TMU intersection engine would be interfacing with, and one the programmable pipeline will need to be careful to use as little as possible.
L1 instruction cache behavior is another pain point, since the misses there are long-latency and worsen under load. In that regard, single-purpose functionality avoids thrashing in the L1, making it a better neighbor to other workloads and less affected by bad neighbors.
It's actually an area where I wonder if an instruction RDNA introduced could have significance. There's a prefetch instruction that changes the fetch behavior so the prefetcher doesn't discard the previous three cache lines when fetching the next one. The documentation describes this as affecting an L1 of 4 64-byte lines, but I think it makes more sense as a description of a wavefront instruction buffer. A loop that fits in that space could be more self-contained, and if a shader using BVH instructions can be subdivided into phases with outer loops that fit in 256 bytes, with the intersection engine helping condense the inner node/intersection evaluation, then maybe AMD's method could more closely approach the instruction cache footprint Nvidia touts as an advantage of its method.
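To make the 256-byte idea concrete, here's a rough plain-C++ sketch of what I mean by a compact outer loop, with a made-up bvh_intersect_node() standing in for the hardware test; nothing here reflects AMD's actual ISA or API, it's just the loop shape:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a hardware BVH node test of the kind AMD's TMU approach
// describes: hand it a node, get back a short list of child nodes worth visiting.
// This is NOT a real instruction or API, just a placeholder so the loop shape is visible.
struct NodeResult {
    uint32_t children[4];
    uint32_t count;        // number of valid entries in children[]
    bool     leaf_hit;     // placeholder: a leaf primitive was hit
};
NodeResult bvh_intersect_node(uint32_t /*node_addr*/) { return {}; }  // stub

// The point of the sketch: keep the outer loop tiny. It only pops a node, fires the
// hardware test, and pushes the children that come back; all the box/triangle math is
// behind bvh_intersect_node(). A body this small could plausibly stay inside a
// 256-byte (4 x 64-byte line) instruction window.
uint32_t traverse(uint32_t root) {
    std::vector<uint32_t> stack{root};
    uint32_t leaf_hits = 0;
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = bvh_intersect_node(node);
        for (uint32_t i = 0; i < r.count; ++i)
            stack.push_back(r.children[i]);
        if (r.leaf_hit)
            ++leaf_hits;
    }
    return leaf_hits;
}
```

The only thing that matters in the sketch is that the per-iteration code stays small enough to live in those four lines while the heavy math sits behind the hardware call.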
Going back to what I discussed earlier about bugs getting in the way of the hardware: the prefetch instruction is currently not documented outside of LLVM bug flags because it apparently will freeze the shader. There's also another RDNA branching bug that seems to occur with branch offsets of 63 bytes, and the workaround (long streams of NOPs) seems apt to blow out said instruction buffer.
Other bugs related to workgroup processing mode or Wave32 versus Wave64 indicate there are possible failure cases in and around the TMU, which by happenstance is also where a BVH unit would sit.
Maybe if there is some internal evaluation hardware for this, it's vastly less useful because of how buggy the hardware is.
Programmable intersection tests could give developers advantages Nvidia's hardware can't deliver, with non-box testing already showing benefits of its own.
Nvidia also gives the option of custom intersection tests, at some indeterminate performance cost. My question would be whether this means involving SM hardware in the same way AMD's method would. That could create a scenario where it's either fixed-function and faster, or programmable and the same speed.
AMD's TMU method has intersection hardware of its own, so I think intersection testing is an area where the two approaches are similar.
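For what "custom intersection test" means in practice, here's an illustrative plain-C++ sphere test of the kind a programmable intersection stage (or a DXR intersection shader) can run but a box/triangle-only fixed-function unit can't; it's a generic textbook test, not anyone's actual shader code:

```cpp
#include <cmath>

struct Ray    { float ox, oy, oz, dx, dy, dz; };   // origin + (normalized) direction
struct Sphere { float cx, cy, cz, r; };

// A custom primitive test: ray/sphere hit returned as a distance along the ray,
// or a negative value on miss. Fixed-function box/triangle testers can't evaluate
// this directly; a programmable stage (or an intersection shader on the SMs) can.
float intersect_sphere(const Ray& ray, const Sphere& s) {
    float ocx = ray.ox - s.cx, ocy = ray.oy - s.cy, ocz = ray.oz - s.cz;
    float b = ocx * ray.dx + ocy * ray.dy + ocz * ray.dz;       // oc . d
    float c = ocx * ocx + ocy * ocy + ocz * ocz - s.r * s.r;    // |oc|^2 - r^2
    float disc = b * b - c;                 // quadratic discriminant (direction normalized)
    if (disc < 0.0f) return -1.0f;          // miss
    float t = -b - std::sqrt(disc);         // nearest intersection distance
    return (t >= 0.0f) ? t : -1.0f;
}
```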
On top of that, a programmable traversal stage, already proposed by Intel, has many potential advantages as well, with things like stochastic tracing and easily selectable LODs coming into play. As far as is known, neither is available on current Nvidia raytracing hardware. Outdating their larger competitor's lineup would be a major victory for AMD, though certainly bad for any Nvidia customers expecting their gaming hardware to last longer.
AMD's method is more programmable, but the fixed-function pipeline still generates the list of nodes to traverse, which may constrain what traversal methods the programmable portion can employ, since the number of nodes and the method of finding them sit in the intersection engine rather than the shader. AMD's method does offer the ability to skip the hardware, but at that point it's straight compute that doesn't differentiate itself from other compute methods.
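To show what that "straight compute" fallback looks like: skipping the BVH hardware leaves the shader doing ordinary box math like the slab test below, the same ALU work any existing compute-based raytracer already pays for. This is a generic sketch under that assumption, not AMD's code:

```cpp
#include <algorithm>

struct Aabb { float min[3], max[3]; };

// Software slab test: does the ray overlap the box within [0, t_max]?
// inv_dir holds 1/direction per component, precomputed by the caller.
// Nothing here is raytracing-specific hardware; it's plain compute.
bool slab_test(const Aabb& box, const float origin[3], const float inv_dir[3],
               float t_max) {
    float t_near = 0.0f, t_far = t_max;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (box.min[axis] - origin[axis]) * inv_dir[axis];
        float t1 = (box.max[axis] - origin[axis]) * inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);
        t_near = std::max(t_near, t0);
        t_far  = std::min(t_far, t1);
    }
    return t_near <= t_far;
}
```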
Potential yield advantages? What is that comment based on? Sounds like it was pulled from someone's nether regions.
Each mask used adds another chance for alignment errors or defects during the exposure process; every step in lithography has a small but non-zero defect rate.
The large number of masks for the quad or octal patterning steps also creates concerns about the variability of the resulting patterns, even if no singular defects manifest.
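A back-of-the-envelope way to see the compounding, with a per-exposure defect probability that is purely invented to show the shape of the math, not a real process number:

```cpp
#include <cmath>
#include <cstdio>

// If each exposure/alignment step independently comes out clean with probability (1 - p),
// a layer that needs n exposures comes out clean with roughly (1 - p)^n.
int main() {
    const double p = 0.002;               // assumed per-exposure defect chance (illustrative only)
    const int exposures[] = {1, 4, 8};    // single, quad, and octal patterning
    for (int n : exposures)
        std::printf("exposures=%d  clean-layer probability ~ %.3f\n", n, std::pow(1.0 - p, n));
    return 0;
}
```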
The downside for EUV is that there is much less maturity along a number of other components of the process, and much higher sensitivity to things like mask defects, so I don't think there's an unambiguous winner at present.
EUV tends to struggle with exposure power relative to standard lithography, but on the other hand if the standard process needs 4-8 times the number of exposures it might still be worthwhile.
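A crude way to frame that trade-off, with tool rates that are entirely made up for illustration rather than real specs:

```cpp
#include <cstdio>

// Even if an EUV tool exposes wafers more slowly, needing one pass instead of 4-8
// can still come out ahead per finished layer.
int main() {
    const double duv_rate = 250.0;        // assumed DUV wafer exposures per hour (illustrative)
    const double euv_rate = 120.0;        // assumed EUV wafer exposures per hour (illustrative)
    const int passes[] = {4, 8};
    for (int n : passes)
        std::printf("%d-pass DUV layer: %.0f/hr  vs  single-pass EUV layer: %.0f/hr\n",
                    n, duv_rate / n, euv_rate);
    return 0;
}
```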
Turnaround time is one area where EUV is expected to be better or at least not as bad as standard lithography is expected to become.