Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

milk · Jan 29, 2019

vipa899 said:
Everything so far has been unreal speculations as we have no single clue whats in ps5. Ppl come here and dream away till specs land.

It was like this, until party poopers shifty came here and spilled the real unadulterated beans first hand. You see, the human brain accepts every information as true until proven false, specially if it is convenient, so don't you go and tell me ps5 won't be 20TF.
While at it. How many Tera-Rays can ps5 GPU do? 4, 8, 12?
I'm betting on 12+...

vipa899 · Jan 29, 2019

PS5 better have RT yes, so things calm down abit here.

BRiT · Jan 29, 2019

There's some really big steps to see before I'll wager that next-gen consoles have RealTime RayTracing:
1) AMD announces RT-RT GPUs.
2) Prices of RT-RT GPUs are reasonable at $300 or less.

AlNom · Jan 29, 2019

----

I'm sure 20TF FP16 is about there.

fehu · Jan 29, 2019

BTW, FP16 precision has been used for some real performance boost on pc?
Or the kind of we reduced by 10% the latency in this specific pass of the rendering pipeline?

chris1515 · Jan 29, 2019

fehu said:
BTW, FP16 precision has been used for some real performance boost on pc?
Or the kind of we reduced by 10% the latency in this specific pass of the rendering pipeline?

A post about using mixed precision and some possible use case

https://gpuopen.com/first-steps-implementing-fp16/

Common Targets
A reliable target for FP16 optimisation is the blending of colour and normal maps. These operations are typically heavy on data-parallel ALU operations. What’s more, such data frequently originates from a low-precision texture and therefore fits comfortably within FP16’s limitations. A typical game frame has a plentiful supply of these operations in gbuffer export and post-process shaders, all ripe for optimisation.

BRDFs are an attractive but difficult candidate. The portion of a BRDF that computes specular response is typically very register- and ALU-intensive. This would seem a promising target. However, caution must be exercised. BRDFs typically contain exponent and division operations. There are currently no FP16 instructions for these operations. This means that at best there will be no parallelisation of those operations; at worst it will introduce conversion overhead between FP16 and FP32.

All is not lost. There is a suitable optimization candidate in the typical BRDF equation: the large number of vectors and dot products typically present. Whilst individual dot products are more a data reduction operation than a data parallel operation, many dot products can be performed in parallel using SIMD code. These dot products often feed back into FP32 BRDF code, so care must be taken not to introduce FP16 to FP32 conversion overhead that exceeds the gains made.

Finally, TAA or checker-boarding systems offer strong potential for optimisation alongside surprising risks. These systems perform a great deal of colour processing, and ALU can indeed be the primary bottleneck. UV calculations often consume much of this ALU work. It is tempting to assume these screen-space UVs are well within the limits of FP16. Surprisingly, the combination of small pixel velocities and high resolutions such as 4K can cause artefacts when using FP16. Exercise care when optimising similar code.

At present, FP16 is typically introduced to a shader retrospectively to improve its performance. The new FP16 code requires conversion instructions to integrate and coexist with FP32 code. The programmer must take care to ensure these instructions do not equal or exceed the time saved. It is important to keep large blocks of computation as purely FP16 or FP32 in order to limit this overhead. Indeed, shaders such as post-process or gbuffer exports as FP16 can run entirely in FP16 mode.
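To make the TAA warning in the excerpt concrete, here is a minimal sketch (not from the article) that rounds values to IEEE 754 half precision using Python's standard `struct` module. The helper names `to_fp16`, `fp16_spacing` and `reproject_uv_fp16` are made up for illustration, and a 3840-pixel-wide target is assumed for "4K". Near uv = 0.75 the gap between adjacent FP16 values is 2^-11 (about 0.00049), which is wider than one 4K pixel in UV space (about 0.00026), so a quarter-pixel velocity simply disappears when added in FP16.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_spacing(x: float) -> float:
    """Gap between adjacent FP16 values near x (normal range, 10 stored mantissa bits)."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 11)


def reproject_uv_fp16(uv: float, velocity_px: float, width: int) -> float:
    """Add a pixel velocity to a screen-space UV, rounding every value to FP16."""
    velocity_uv = to_fp16(velocity_px / width)
    return to_fp16(to_fp16(uv) + velocity_uv)


WIDTH_4K = 3840  # assumed horizontal resolution for "4K"

print(f"one 4K pixel in UV space:  {1.0 / WIDTH_4K:.6f}")
print(f"FP16 spacing near uv=0.75: {fp16_spacing(0.75):.6f}")
# A quarter-pixel motion is smaller than half the FP16 spacing, so the UV does not move.
print("uv unchanged after 0.25px motion:",
      reproject_uv_fp16(0.75, 0.25, WIDTH_4K) == 0.75)
```

This only models rounding, not a real shader, but it shows why screen-space UVs at high resolutions are a risky thing to demote to FP16.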

AlNom · Jan 29, 2019

fehu said:
BTW, FP16 precision has been used for some real performance boost on pc?
Or the kind of we reduced by 10% the latency in this specific pass of the rendering pipeline?

On PC it's probably more related to the other boons of reduced register pressure and potentially internal bandwidth because of less cache thrashing. There's certainly an abundance of FP32/ALU throughput, so that's less likely to be a particular bottleneck. It's murky because of the various hardware profiles & settings at play.

Metal_Spirit · Jan 29, 2019

milk said:
You see, the human brain accepts every information as true until proven false, specially if it is convenient, so don't you go and tell me ps5 won't be 20TF.

Remember... A Console has two sides. Performance and Price.
A 20 Tflops console would have a way too huge price...

I think the unit we could use to relate those two would be... a flop!

So 20 Tflops = 1 flop...

3dilettante · Jan 29, 2019

MrFox said:
What I am curious about is whether we will ever see dram moving to some disabled capacity for yield, like flash have done for decades now. There was a statistic years ago saying consumer dram yield is around 50%, not sure if it went up or down. So imagine if they were to use all the imperfect 16gb and map the failed banks orthogonally to make 12gb chips out of it.

Redundant rows are used in DRAM. I haven't turned up the specifics of current DRAM manufacturers' methods so far, but there's a range of research papers on implementing fault recovery and it's referenced in Wikipedia going back to comparatively smaller array sizes than we have presently.
https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Row_and_column_redundancy

That 50% figure sounds very low, especially for consumer DRAM with spot prices in the single digits. The data is confidential, but the last time (admittedly long ago) I saw a rough comparison between IC types, DRAM was near the top of the acceptable yield hierarchy, significantly over 90% or 95%. DRAM is highly redundant and regular, and it's a mass-produced commodity made by multiple manufacturers.

iroboto said:
Unlikely credible at all.
Partner solutions teams are isolated from the main company. Meaning special access requirements NDAs etc. Within that there are isolations between partners, so people working with MS cannot talk to the people working with Sony.

In order for a single leaker to get news on both would require some 3 levels deep of leaks. The first getting by the standard AMD employee and getting a semi custom person to speak out. In my company partner solutions peoples are super guarded because of the amount of money they are working with, plus, incentives to stay quiet.

During the leadup to the current generation there were some rumors that tried to make cross-platform comparisons. The informational worth of some of them seemed somewhat limited, but they came with the same leaks about AMD's semi-custom division. The need for compartmentalization is paramount, but there were some hints AMD had not been able to solidify its internal barriers sufficiently. It was something I noted as something of a detraction of AMD's position at the time, and then there's been a trickle of Linkedin profiles with code names and explicit or strongly implied links to specific clients. Hopefully things have been refined since then.

Metal_Spirit said:
So, a guy claiming he knows something about both consoles has found both manuals?

Or various documents wind up being sent to a common outlet, as happened with Vgleaks for Orbis and Durango.

BRiT · Jan 29, 2019

Or they use the same Kinko's Copy and Print center before they send out the large booklets...

Lukebg · Jan 29, 2019

from here

https://www.reddit.com/r/xboxone/comments/al55zv

also

https://twitter.com/x/status/1090371730290429953

AlNom · Jan 29, 2019

Scarlett has two t's :V

1GB L4 is strange unless they somehow reserve it from a unified pool of RAM. Strange to report that as such. What about the other GB?!

also... 3-way SMT? Even weirder.

The Opus name is a funny coincidence (was one of the Xbox 360 revision codenames too).

1650MHz GDDR6 on 352-bit bus would be 580GB/s. Strange to see that width for an AMD GPU too. That sort of partial bus width is more nV's alley. If that's due to some funky reservation (see CPU L4) for a 384-bit APU, then I guess total bandwidth would be ~633GB/s. Weird.

22GB -> 352-bit bus + 16Gbit density chips
+ 1GB L4 CPU
+ 1GB ??? WEIRD WEIRD WEIRD
----
= 24GB unified ????
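For anyone wanting to check the bus arithmetic above, a rough sketch follows. It assumes the quoted "1650MHz" corresponds to 8 data transfers per clock per pin (13.2Gbps), which is the multiplier that reproduces the 580GB/s and ~633GB/s figures, and it assumes one 16Gbit chip per 32-bit channel. The function names are made up for illustration.

```python
def gddr6_bandwidth_gbs(clock_mhz: float, bus_bits: int,
                        transfers_per_clock: int = 8) -> float:
    """Peak bandwidth in GB/s. transfers_per_clock=8 is an assumption (see lead-in)."""
    gbps_per_pin = clock_mhz * transfers_per_clock / 1000.0
    return gbps_per_pin * bus_bits / 8.0


def capacity_gb(bus_bits: int, chip_gbit: int = 16, chip_bus_bits: int = 32) -> float:
    """Total capacity in GB, assuming one chip per 32-bit slice of the bus."""
    chips = bus_bits // chip_bus_bits
    return chips * chip_gbit / 8.0


for width in (352, 384):
    print(f"{width}-bit: {gddr6_bandwidth_gbs(1650, width):.1f} GB/s, "
          f"{width // 32} chips, {capacity_gb(width):.0f} GB")
```

Under those assumptions the 352-bit bus works out to 11 chips and 22GB, and the full 384-bit bus to 12 chips and 24GB, which matches the tally in the post.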

4096 shaders @ 1400MHz would imply 11.47 TF.

(9x Durango = 11.79TF FWIW)
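The TF figures follow from the usual peak-FP32 formula of 2 FLOPs (one fused multiply-add) per shader per clock. A quick sketch, where the Durango inputs (768 shaders at 853MHz) are the commonly cited Xbox One numbers rather than anything stated in this post:

```python
def peak_tflops(shaders: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak FP32 TFLOPS, counting one FMA (2 FLOPs) per shader per clock."""
    return shaders * flops_per_clock * clock_mhz / 1e6


rumoured = peak_tflops(4096, 1400)
durango = peak_tflops(768, 853)  # assumed Xbox One figures, not from the post

print(f"4096 shaders @ 1400MHz: {rumoured:.2f} TF")
print(f"9x Durango:             {9 * durango:.2f} TF")
```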

----

2.1 to 3.3GHz range of CPU clocks seemingly. *shrug*

---------------

salt

vipa899 · Jan 29, 2019

22GB Arcturus, Sony has a problem.
Easy to manufacture such a screen dump?

AlNom · Jan 30, 2019

vipa899 said:
Easy to manufacture such a screen dump?

Don't see why not. A developer could very easily troll us.

3dilettante · Jan 30, 2019

Arcturus, at least from some of the statements concerning the Linux drivers that mentioned it, was supposedly a return to an older naming scheme where it applied to a specific chip in a family. If that interpretation is accurate, it seems odd to add a number to a unique chip's name. More curious would be why a code name in open-source Linux drivers would be shared with a device that is not open. The custom architectures have not shared code names with other products in the past.
The shader count and "Arcturus engines" stuff seems odd, particularly given the use as a chip name and the redundancy of streaming processors versus CUs.

The trueaudio line is odd since to my knowledge it wasn't given as a feature for a console, since custom audio solutions were made for each, and modern GCN doesn't have trueaudio hardware.

Having a memory bus with 11 chips is a bit off. There are some rather high-end products with that kind of bus, but even then those have partially disabled DRAM partitions.

Why a dump would concern itself with thread scheduling, particularly with SMT being off, seems questionable. The L4 seems questionable.

Kaotik · Jan 30, 2019

AMD also already replaced UVD & VCE with VCN in Raven Ridge, why would chip coming in year or two still have VCE & UVD?
TrueAudio has been replaced by shader based TrueAudio Next, why would they still include the old one relying on 3rd party IP?
Also 352-bit membus? Really? (Yes, I know it's possible to do but still)

McHuj · Jan 30, 2019

I think there could be a possibility of a cut-down bus next gen with a giant APU on 7nm, they may have to sacrifice that for yields. NVIDIA does this, AMD might as well.

352-bit would imply one 32-bit path has been disabled. I think it's possible. Maybe it's further cut down for the low cost version of the console. I think yields will be a big problem for big 7nm chips and why AMD is even going with chiplets for consumer CPUs.

I'm not sure I understand the SMT disabled, but up to 24 thread hyper threading? Some sort of hardware assist for a software scheduler?

BRiT · Jan 30, 2019

Kaotik said:
AMD also already replaced UVD & VCE with VCN in Raven Ridge, why would chip coming in year or two still have VCE & UVD?
TrueAudio has been replaced by shader based TrueAudio Next, why would they still include the old one relying on 3rd party IP?

Backwards compatibility with Original Xbox, Xbox360, and Xbox One games -- maybe it's not hardware identical but offers a shim-layer to the newer hardware functions?

Metal_Spirit · Jan 30, 2019

Arcturus is not the name of the next AMD architecture.

bridgmanAMD

Just a reminder... the Arcturus code name refers to a specific chip, not an architecture. Somebody misread my post on Phoronix and thought it was a new architecture, and that has been echoing around the internet ever since.

Arcturus is just the first GPU we started after going back to using code names in the driver that are completely unrelated to the marketing name. The change should mean one less thing to worry about when we push Linux drivers out to a public repo on the way to upstream.

Besides 1 GB L4?

SMT Disabled?

Max CPU Speed: 3.3
Max GPU Speed 1650

Proportion 2:1 - Same as current gen!

Minimum CPU Speed: 2100
Minimum GPU Speed: 1400

Not the same proportion!

CPU and GPU downclock in different proportions???
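The ratios are easy to verify, treating the 3.3 in the dump as 3300MHz (an assumption, since the dump mixes units):

```python
def clock_ratio(cpu_mhz: float, gpu_mhz: float) -> float:
    """CPU-to-GPU clock ratio."""
    return cpu_mhz / gpu_mhz


print("max:", clock_ratio(3300, 1650))  # 3.3 read as 3300MHz (assumed)
print("min:", clock_ratio(2100, 1400))
```

So the top clocks sit at 2:1 while the minimums sit at 1.5:1.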

Deleted member 13524 · Jan 30, 2019

I can't wrap my head around that 1GB L4. Nothing that would make sense comes to mind.
Maybe a large-ish scratchpad (64bit LPDDR4?) for the CPU to avoid doing GDDR6 access requests?

I mean if this was a two-chip solution, with the GPU acting as northbridge and using HBCC to take over the main GDDR6, and the CPU accessing the main GDDR6 through the GPU.
