Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

Status
Not open for further replies.
Anyone want to do a rundown on the potential pros and cons (e.g., clocks, TDP, size, cost, etc.) of a monolithic APU housing an 8-core Ryzen CPU and a 64 CU GPU, as well as the pros/cons of a discrete CPU/GPU design?
That would be really interesting to see.
Would have to take into account Lockhart and Azure APU usage.
Something I should've asked for earlier
 
Not sure I understand the double cost on dead area.
Why would the memory controller be on the chiplets? I would expect the IO and memory controllers to be on a single die?

I'm also unsure if it would be possible to move some of the front end off the chiplets as well.

Having ~64 CUs and then having to clock them high must generate a lot of heat, compared to chiplets that add up to many more CUs and can be clocked lower for less heat. Not sure how power usage would compare.

What is Dante?
I would go for off-the-shelf Zen chiplets and custom GPU chiplets.

As you stated:

Anaconda
48 CU @ 1.15 GHz = 7.07 TF per chiplet (2 CUs disabled) × 2 chiplets = 14.14 TF

So, you are using 96 CUs at 1.15 GHz to get 14.14 TFLOPS

I guess the important thing here is:

Which one is cheaper? Your solution or:

60 CU @ 1850 MHz = 14.2 TFLOPS

This solution kills 4 CU, you kill 8.

This solution is one GPU, yours is two.

This solution is hotter. Your solution uses more power.

Now, with Navi expected to go above 2 GHz, and Gonzalo working at 1.8 GHz, I do not see thermals as a real problem on Navi... might be wrong though!
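For reference, all the TFLOPS figures in this thread follow from the standard AMD shader-math formula (64 shaders per CU, 2 ops per clock from FMA). A quick sketch to check the numbers being compared:

```python
# Peak FP32 throughput for a GCN/Navi-style GPU:
# CUs x 64 shaders/CU x 2 ops/clock (FMA) x clock.
def gpu_tflops(cus: int, mhz: float) -> float:
    return cus * 64 * 2 * mhz * 1e6 / 1e12

# Chiplet proposal: 2 chiplets, 48 active CUs each, at 1.15 GHz.
per_chiplet = gpu_tflops(48, 1150)
print(round(per_chiplet, 2))       # 7.07 TF per chiplet
print(round(2 * per_chiplet, 2))   # 14.13 TF total

# Monolithic alternative: 60 CUs at 1850 MHz.
print(round(gpu_tflops(60, 1850), 2))  # 14.21 TF
```

(The 14.14 TF quoted above comes from rounding the per-chiplet figure to 7.07 before doubling; the exact total is 14.13 TF.)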
 
As you stated:

Anaconda
48 CU @ 1.15 GHz = 7.07 TF per chiplet (2 CUs disabled) × 2 chiplets = 14.14 TF

So, you are using 96 CUs at 1.15 GHz to get 14.14 TFLOPS

I guess the important thing here is:

Which one is cheaper? Your solution or:

60 CU @ 1850 MHz = 14.2 TFLOPS

This solution kills 4 CU, you kill 8.

This solution is one GPU, yours is two.

This solution is hotter. Your solution uses more power.

Now, with Navi expected to go above 2 GHz, and Gonzalo working at 1.8 GHz, I do not see thermals as a real problem on Navi... might be wrong though!
The answer may depend fully on the yields. The higher the clock speed, the lower the yield you're going to get; CUs are just one factor. When you approach desktop speeds, the cost savings of running a weaker chip should melt away.

This is likely why Scorpio came in 1 year later despite being on the same node as 4Pro. They ran 40 CUs at a higher clock speed vs many CUs at a lower clock speed, which was against what we thought was possible; they came in 300 MHz higher than expected. The Hovis method helped, I'm sure, but likely they really were waiting for the 16nm process to mature for better yields at that clock speed, to reach an achievable price point.
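The yield side of this argument can be made concrete with the classic Poisson die-yield model, Y = e^(-A·D). The die areas and defect density below are purely illustrative assumptions, not figures for any real chip, but they show why smaller chiplets yield so much better than one big monolithic die:

```python
import math

def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of fully defect-free dies under the Poisson model Y = exp(-A*D)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

D = 0.2  # defects per cm^2 -- an assumed, early-node-ish value

mono    = poisson_yield(350, D)  # one big monolithic die (assumed 350 mm^2)
chiplet = poisson_yield(180, D)  # one smaller GPU chiplet (assumed 180 mm^2)

print(f"monolithic: {mono:.0%}")  # ~50% of dies fully defect-free
print(f"chiplet:    {chiplet:.0%}")  # ~70%, and disabling CUs salvages even more
```

Note this only models defect yield; the clock-speed point in the post is about parametric yield (how many good dies also hit the target frequency), which this sketch doesn't capture.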
 
All loads lead to Rome? (even those descended from the heavens). /flees

It does sound a bit like decoupling the front-end of the GPU, and the upshot is that GPUs care less about latency than CPUs do. That would have some interesting implications for PCs, although I think the elephant in the room is the bandwidth required between hypothetical GPU chiplets (i.e. the interconnects). I could be mistaken, but AMD doesn't seem to have as good an analogue to Intel's EMIB, at least without resorting to TSVs.

On the other hand, maybe the GPU L2 & ROPs would be on such an I/O die?

hm.
There would be no GPU interconnect because everything would go through the IO chip. I mean, adding a second GPU chiplet would just appear as more CUs in one GPU, I guess? There would be no AFR rendering because everything would appear as one GPU... I guess?
 
The answer may depend fully on the yields. The higher the clock speed, the lower the yield you're going to get; CUs are just one factor. When you approach desktop speeds, the cost savings of running a weaker chip should melt away.

This is likely why Scorpio came in 1 year later despite being on the same node as 4Pro. They ran 40 CUs at a higher clock speed vs many CUs at a lower clock speed, which was against what we thought was possible. The Hovis method helped, I'm sure, but likely they really were waiting for the 16nm process to mature for better yields at that clock speed, to reach an achievable price point.

That is why I do not believe in 60 CUs, but 56. At 12.9 TFLOPS!
 
That is why I do not believe in 60 CUs, but 56. At 12.9 TFLOPS!
I also like higher clock speeds because the whole pipeline speeds up. If you go slow and wide, everything else needs to go wide as well. Lots of costs there and I'm not sure what the savings will be on a chip that large. They need the silicon space for things like ray tracing and other items.
 
This solution kills 4 CU, you kill 8.
Wouldn't mine be killing 4 CUs in total?
Although I did also say that Anaconda may be able to have all CUs fully enabled, and Lockhart uses the ones that needed some disabled.

1.8 GHz in console and cloud sounds high to me, but as you say, maybe it's not.
If running that high isn't a problem, then maybe my chiplet frequency examples were way too low?

Remember you would also have to factor cooling into the cost
 
I also like higher clock speeds because the whole pipeline speeds up. If you go slow and wide, everything else needs to go wide as well. Lots of costs there and I'm not sure what the savings will be on a chip that large. They need the silicon space for things like ray tracing and other items.

Yes... I totally agree!
 
I also like higher clock speeds because the whole pipeline speeds up. If you go slow and wide, everything else needs to go wide as well. Lots of costs there and I'm not sure what the savings will be on a chip that large. They need the silicon space for things like ray tracing and other items.
Depending on how RT is implemented, having lots of CUs may actually be a lot better.
I kept frequency low for heat reasons etc., but it sounds like it could be faster.
 
Wouldn't mine be killing 4 CUs in total?
Although I did also say that Anaconda may be able to have all CUs fully enabled, and Lockhart uses the ones that needed some disabled.

1.8 GHz in console and cloud sounds high to me, but as you say, maybe it's not.
If running that high isn't a problem, then maybe my chiplet frequency examples were way too low?

Remember you would also have to factor cooling into the cost

Oh sorry... I forgot. You were talking about binning GPUs.
 
Depending on how RT is implemented, having lots of CUs may actually be a lot better.
I kept frequency low for heat reasons etc., but it sounds like it could be faster.
I think RT cores are separate from the CUs; the CUs only do the work after the intersections are identified.
If this is the case, more CUs would be great for lots of parallel, coherent work, but ray tracing intersections tend to be incoherent, so I would suspect that a faster clock speed will help more in those scenarios.
 
Oh sorry... I forgot. You were talking about binning GPUs.
Yeah, binning for defects more than speed, as you can run everything a lot slower, which also means more usable dies.

Lockhart gets the dies that needed CUs disabled.
I included some in Anaconda as I have no idea regarding yield output. Just throwing ideas out and wanting input from everyone

Worth noting I would be surprised if it was chiplet based.
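The defect-binning idea above can be sketched with a simple binomial model: assume each CU on a chiplet independently has some small chance of containing a defect, then count how many chiplets land in each bin. The physical CU count and per-CU defect probability here are made-up illustrative numbers, not claims about any real die:

```python
from math import comb

def p_exactly_k_bad(n_cus: int, p_bad: float, k: int) -> float:
    """Binomial probability that exactly k of n CUs are defective."""
    return comb(n_cus, k) * p_bad**k * (1 - p_bad)**(n_cus - k)

N, P = 52, 0.02  # assumed: 52 physical CUs, 2% chance any given CU is defective

full    = p_exactly_k_bad(N, P, 0)                       # fully enabled die
salvage = sum(p_exactly_k_bad(N, P, k) for k in (1, 2))  # 1-2 bad: disable and bin down

print(f"fully enabled (Anaconda-grade): {full:.0%}")   # ~35%
print(f"salvageable with <=2 disabled:  {salvage:.0%}")  # ~56%
```

Under these assumed numbers only about a third of chiplets come out fully clean, but well over half more can still ship with up to 2 CUs disabled, which is the economics behind feeding salvage dies to a Lockhart-class SKU.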
 
Worth noting I would be surprised if it was chiplet based.

It might... but I have serious doubts about that!
If rumours talked about future Navis using multi-GPU, I could believe it was worth it, but there is nothing on the horizon, or even a mention of it being used.
 
I also like higher clock speeds because the whole pipeline speeds up. If you go slow and wide, everything else needs to go wide as well. Lots of costs there and I'm not sure what the savings will be on a chip that large. They need the silicon space for things like ray tracing and other items.
Save the clock boost for mid gen. :p

There is some merit to going wide in the sense that they can just boost things later in the generation. At the start they’ll be somewhat more concerned with yields, and both a modest clock and wide design can serve that (obviously needs balancing so we aren’t making a small pizza sized chip).
 
There would be no GPU interconnect because everything would go through the IO chip. I mean, adding a second GPU chiplet would just appear as more CUs in one GPU, I guess? There would be no AFR rendering because everything would appear as one GPU... I guess?
The interconnect between the chiplets and the IO die is what I mean; you still need IO perimeter on there, e.g. the HSIO link between the 360's mother-daughter dies.

Things may matter in terms of where the GPU’s last level cache is? Hence wondering about putting that on a hypothetical IO die.

Anyways, I feel like we’re all straying far away from a console with all this chiplet stuff.
 
Save the clock boost for mid gen. :p

There is some merit to going wide in the sense that they can just boost things later in the generation. At the start they’ll be somewhat more concerned with yields, and both a modest clock and wide design can serve that (obviously needs balancing so we aren’t making a small pizza sized chip).
eh ;)
The mid-gen refresh came with both a CU increase and a clock-speed increase ;)
 
Yeaaah, but the situation surrounding yield & node improvements may be very different (28nm to 16nmFF vs 7nmFF to EUV 5nm).
 
Yeaaah, but the situation surrounding yield & node improvements may be very different (28nm to 16nmFF vs 7nmFF to EUV 5nm).
for sure ;) 5nm doesn't seem all that plausible for a big upgrade ;) go for dual GPUs now!
 
For the X1X CPU, Microsoft made some changes to the architecture. Do you think they will do the same for the next gen, or will they use stock Zen 2?
 
For the X1X CPU, Microsoft made some changes to the architecture. Do you think they will do the same for the next gen, or will they use stock Zen 2?

Depends: did AMD already integrate MS's X1X Jaguar tweaks into Zen 2?
 