Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

You don't. You're either doing a 320-bit access or a 192-bit access depending on which client is asking for that particular segment of data and which pool it exists in.

I'm not getting your point. Can you explain a bit better?

Each module is connected over a 32-bit bus. All 10 modules accessing 1 GB each gives you a 320-bit bus. The extra gigabyte on each of the six 2 GB modules is accessed over the same 32-bit bus. But if you are accessing both pools, the bandwidth is divided. The 32 bits are shared, so you will not get a separate 192-bit bus available to that memory.

Hope I'm making myself clear.
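To put numbers on it, here is a quick back-of-the-envelope sketch of my own (Python, assuming the publicly quoted 14 Gbps GDDR6 and treating each chip's two 16-bit channels as one combined 32-bit interface):

Code:
# Back-of-the-envelope Series X bandwidth, assuming 14 Gbps GDDR6 and the
# 10-chip layout (6 x 2 GB + 4 x 1 GB), each chip on a combined 32-bit interface.

GBIT_PER_PIN = 14            # GDDR6 data rate in Gb/s per data pin
BITS_PER_CHIP = 32
CHIPS_TOTAL = 10
CHIPS_WITH_EXTRA_GB = 6      # the 2 GB chips whose upper halves form the "slow" 6 GB

per_chip_gbs = GBIT_PER_PIN * BITS_PER_CHIP / 8            # 56 GB/s per chip
print(per_chip_gbs * CHIPS_TOTAL)                          # 560.0 -> striding all 10 chips
print(per_chip_gbs * CHIPS_WITH_EXTRA_GB)                  # 336.0 -> ceiling for the slow 6 GB,
                                                           #          which sits behind only 6 chips
print(per_chip_gbs * (CHIPS_TOTAL - CHIPS_WITH_EXTRA_GB))  # 224.0 -> the 4 chips holding
                                                           #          fast-pool data only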

To get 560 GB/s you would have to exclusively read from one pool at full load for an entire second. Where you are getting tripped up is that you are taking a metric for capacity to do work over a period of time and trying to apply it to moment-to-moment, cycle-to-cycle usage. If you were to precisely track the amount of data actually transferred over any given second while running a game, I'll bet it would be lower than the theoretical max bandwidth. So whether this theoretical max number goes up or down with any particular usage pattern is irrelevant. What's relevant is whether this particular setup delivers sufficient bandwidth to meet the needs of the system as it needs it.

Never questioned that.

But having 560 GB/s available is not the same thing as 292+168 or 244+366! Especially when the bandwidth of the fast and the slow memory can each change at any moment.
 
I think you'd run into these issues anyway whether it was 10x1GB chips, 10x2GB chips, PS5's setup, or XSX's current setup. Both of you are describing a challenge of memory contention between CPU and GPU, not a contention problem between 2GB and 1GB chips. Memory contention by _default_ is suboptimal. Having to share memory between the CPU and GPU has its pros and cons.
Cons: you have contention.
Pros: you waste fewer cycles copying back and forth, you can access the same memory locations, and you can efficiently have two processors work on the same memory locations without needing a copy and burdening other buses. You guys were absolutely there when hUMA was the big thing to talk about for PS4 and XBO didn't have it because of split pools of memory. Didn't seem like an issue then, why is it now?

The advantages outweigh the cons and that's why we do it.

The only difference here is that with PS5 you run into full memory contention, and we have a graph of how bandwidth drops significantly as the CPU uses more bandwidth on PS4. Which makes sense: while the CPU is getting the data it needs, it's not going to somehow interleave the request and fill the gaps with GPU data. It's going to get its pull, then the GPU gets what it needs, and vice versa.

On XSX it's just very obvious what is happening: you know exactly which chips will be under contention and which memory won't be.

It's not about GB/s; that doesn't even make sense. You're never going to maintain 560 GB/s every single second anyway, you're just going to pull data as you request it. It's not like you can keep pulling and pushing 560 GB of data back and forth for a full second. That would mean your memory is moving a full 32 bits of data every single clock cycle from every single chip for an entire second, with no breaks and no imperfect pulls.
If you can't keep a CPU or GPU at 100% saturation, there's no way your memory will manage it either. You'll always grab an imperfect amount of data. Developers can make things better by sizing textures to match exactly what the memory would grab from each chip, of course, but there is only so much you can do.

You're going to have inefficiency somewhere.
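As a tiny sketch of what an imperfect pull costs: if every request gets rounded up to whole DRAM bursts (32 bytes on a 16-bit GDDR6 channel at burst length 16), the wasted bytes come straight out of the usable bandwidth. The numbers below are only an illustration, not measurements.

Code:
# Bus efficiency when a request doesn't fill whole bursts.
BYTES_PER_BURST = 32   # 16-bit GDDR6 channel, burst length 16

def bus_efficiency(useful_bytes):
    bursts = -(-useful_bytes // BYTES_PER_BURST)          # ceiling division
    return useful_bytes / (bursts * BYTES_PER_BURST)

for n in (64, 48, 40, 20):
    print(n, f"{bus_efficiency(n):.0%}")                  # 100%, 75%, 62%, 62%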

On PS5, when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips it can still pull from.

On XSX, when the CPU needs its data, the GPU will be locked out until the CPU gets what it needs, except for the chips that are still available to pull from. In this case, there will always be 4 chips dedicated to the GPU that the CPU can't touch.

There is just one thing. On PS5, memory bandwidth varies depending on utilization. Here it also changes with allocation, since using more of one pool changes the bandwidth available to the other.
Imagine you are doing anisotropic filtering, with intensive bandwidth usage (PS4 had games limited in aniso because of this). On PS5, the GPU's bandwidth is whatever is available. But on the Series X it is not... it's partly on the fast memory and partly on the slow memory.
Since you have to split work across both pools, don't you see the system suffering from this?
 
On Xbox Series X you are also going to have the usual memory contention problem on top of the aforementioned reduced bandwidth when the CPU is accessing memory.
You guys need to stop looking at this as split pools. Would you use this argument if it was 10 x 2GB chips on a 320-bit bus? It's still 560 GB/s with both CPU and GPU contending for it. So where would the aforementioned bandwidth be lost then?

Because as of right now you guys are making the argument that having 2GB chips gives less bandwidth than 1GB chips. That's what this is boiling down to.
 
Just like the more the CPU reads from the 448 GB/s pool, the more the GPU's bandwidth will be reduced.

That's something that happens on every system.

Both systems will behave the same, except for one thing. If you want to use the remaining bandwidth on PS5, you just use it. On Xbox you will have to optimize usage of both the fast and slow RAM, otherwise you will not use the full available bandwidth.
 
I'm not getting your point. Can you explain a bit better?

Each module is connected over a 32-bit bus. All 10 modules accessing 1 GB each gives you a 320-bit bus. The extra gigabyte on each of the six 2 GB modules is accessed over the same 32-bit bus. But if you are accessing both pools, the bandwidth is divided. The 32 bits are shared, so you will not get a separate 192-bit bus available to that memory.

Hope I'm making myself clear.
Actually, each GDDR6 chip has two independent 16-bit buses.
 
No. The GPU only has to read from the 10 GB pool to get the full bandwidth that's left. This is no different from the PS5. Why does everyone think it's any different?

Yes, you can do that... with 10 GB available! Not the 16. If you use the remaining 6 GB, you will keep a maximum of 560 GB/s, but to get there you must read from both pools at once.
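To illustrate that point, here is a toy model of my own: assume the fast 10 GB is interleaved across all 10 chips, the slow 6 GB across its 6 chips, ignore contention and DRAM overheads, and see how the achievable total drops as more of the traffic targets the slow pool.

Code:
# Toy model: peak aggregate bandwidth vs. the fraction of bytes hitting the slow 6 GB.
PER_CHIP = 56.0     # GB/s per chip at 14 Gbps over 32 bits
SHARED = 6          # chips carrying both fast- and slow-pool data
FAST_ONLY = 4       # chips carrying fast-pool data only

def peak_bandwidth(slow_fraction):
    fast_fraction = 1.0 - slow_fraction
    # The 6 shared chips see all slow traffic plus 6/10 of the fast traffic.
    shared_load = slow_fraction + fast_fraction * SHARED / (SHARED + FAST_ONLY)
    # The 4 fast-only chips see the remaining 4/10 of the fast traffic.
    fast_only_load = fast_fraction * FAST_ONLY / (SHARED + FAST_ONLY)
    limits = []
    if shared_load > 0:
        limits.append(SHARED * PER_CHIP / shared_load)
    if fast_only_load > 0:
        limits.append(FAST_ONLY * PER_CHIP / fast_only_load)
    return min(limits)

for s in (0.0, 0.25, 0.5, 1.0):
    print(s, round(peak_bandwidth(s), 1))
# 0.0 -> 560.0, 0.25 -> 480.0, 0.5 -> 420.0, 1.0 -> 336.0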

Actually, each GDDR6 chip has two independent 16-bit buses.

I'm talking globally. But yes... you are correct.
 
Yeah, I get that. It's not something that's going to happen overnight. But someone is buying AMD hardware; over time, with a large enough population, ideally people may build more libraries for it.

I don't think you understand. Having a larger user base isn't as useful as appealing to the academic or professional community. There's a reason why AMD rolled out its own compute API much like CUDA: they knew they could not rely on their community to do anything for them if they wanted to move into anything more than just graphics. With key libraries, these projects get ported to their platforms through corporate efforts rather than community efforts.

Also, it's not about the quantity of supported ML frameworks, it's about supporting the quality ML frameworks: 90%+ of the data science done with them today is on either TensorFlow or PyTorch, and that is where the future is headed, with more fields to follow.

As for now, yes, I'm stuck on Nvidia if I want to stay with high-level libraries that have GPU support.

Only if you're willing to move beyond Macs; otherwise having Nvidia hardware won't help you there. CUDA is deprecated on macOS, so people with those systems have no options in the future.
 
There is just one thing. On PS5, memory bandwidth varies depending on utilization. Here it also changes with allocation, since using more of one pool changes the bandwidth available to the other.
Imagine you are doing anisotropic filtering, with intensive bandwidth usage (PS4 had games limited in aniso because of this). On PS5, the GPU's bandwidth is whatever is available. But on the Series X it is not... it's partly on the fast memory and partly on the slow memory.
Since you have to split work across both pools, don't you see the system suffering from this?
Why would it suffer? Just because PS5 works with one uniform 16 GB across all its chips and XSX works with a 10 GB fast region?
I think this is the crux of your argument.

I think you're placing too much emphasis on where the data resides as being the deciding factor of bandwidth.

Bandwidth is a function of bus width multiplied by clock rate.

Say we built a hypothetical single memory chip that was 16 GB in size, on a 32-bit bus, with a clock rate so high that it somehow gave the total system 560 GB/s, and both the CPU and GPU were accessing it. If I told you that I would restrict the CPU's memory access to the last 6 GB of that chip, what would you say the bandwidth of this system is?
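To put the thought experiment in numbers (a sketch of my own; the transfer rate below is made up purely so the arithmetic lands on 560 GB/s):

Code:
# The hypothetical single 16 GB chip: bandwidth is bus width times data rate,
# independent of which address range a given client is allowed to touch.

BUS_BYTES = 32 / 8                       # 32-bit bus = 4 bytes per transfer
TARGET_GBS = 560.0
required_rate_gtps = TARGET_GBS / BUS_BYTES
print(required_rate_gtps)                # 140 GT/s -- absurd, but it's a thought experiment

# Restricting the CPU to the last 6 GB changes the capacity it can address,
# not the width or rate of the bus, so the answer is still 560 GB/s for
# whichever client is being served at any given moment.
print(BUS_BYTES * required_rate_gtps)    # 560.0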
 
I don't think you understand. Having a larger user base isn't as useful as appealing to the academic or professional community. There's a reason why AMD rolled out its own compute API much like CUDA: they knew they could not rely on their community to do anything for them if they wanted to move into anything more than just graphics. With key libraries, these projects get ported to their platforms through corporate efforts rather than community efforts.

Also, it's not about the quantity of supported ML frameworks, it's about supporting the quality ML frameworks: 90%+ of the data science done with them today is on either TensorFlow or PyTorch, and that is where the future is headed, with more fields to follow.



Only if you're willing to move beyond Macs; otherwise having Nvidia hardware won't help you there. CUDA is deprecated on macOS, so people with those systems have no options in the future.
Yeah, I just use cloud-based solutions on my Mac, or I terminal into a compute cluster.

I don't have dreams of these things, I'm just a user. If I have to upgrade a GPU, I'll get one that lets me do both. Otherwise my company provides P100s and V100s for use. I'm just thinking out loud since MS seems to be buying up a lot of AMD hardware for machine learning. Like, if I chose Azure for compute work, how do I know if it's running on AMD or Nvidia hardware? I don't think we know? (Though, yeah, I can request an Nvidia machine on Azure, but anyway.)

I think I'm just looking at the overall industry to see if a company like Google or MS would use the GPUs they use for streaming to also provide GPU compute services in the cloud, and if so, how they would be used.
 
Since new information regarding AMD's newer APUs isn't readily available or discussed much on the internet, I dug up some of the old data. While some of it is no longer relevant due to later advancements, some is still relevant and may be useful for this discussion.

https://www.realworldtech.com/fusion-llano/2/

The CPU’s cacheable memory relies on AMD’s MOESI protocol, and has the standard x86 consistency semantics with strong ordering of memory references. The CPU’s uncacheable memory also has the same behavior as before. The GPU memory region behaves in a totally different fashion. By default it has relaxed consistency (and thus is not x86 coherent), so that loads and stores can be freely re-ordered for higher memory bandwidth.

The Fusion GPUs have a dedicated non-coherent interface to the memory controller (the Radeon Memory Bus or Garlic, shown with a dotted line) for commands and data. The bus is 256-bits (32B) wide in each direction and is replicated for each memory channel (2x32B read and 2x32B write for Llano, half for Zacate). Garlic operates on the Northbridge clock – up to 720MHz for notebook versions of Llano and 492MHz for Zacate. This is a factor of 2-3X more bandwidth than memory can provide (roughly 17GB/s measured), which is needed to handle bursts of memory transactions (e.g. texture reads).

The GPU has a separate interface for sending memory requests that target the coherent system memory. The Fusion Control Link (or Onion) is a 128-bit (16B) bi-directional bus that feeds into a memory ordering queue shared with the coherent requests from each of the 4 cores. Onion runs at up to 650MHz for notebook variants of Llano (10.4GB/s read + 10.4GB/s write) and 492MHz for Zacate. An arbiter in the IFQ is responsible for selecting coherent requests (based on memory ordering) to send to the memory controller. Desktop versions of Llano will probably run Garlic and Onion faster still, given the extra power budget.

The memory controller arbitrates between coherent (i.e. ordered) and non-coherent accesses to memory. Llano has two 64-bit channels of DDR3 memory that must operate independently, while the smaller Fusion cousin only has a single channel. The GPU memory is interleaved across both channels for maximum streaming bandwidth and requests will close DRAM pages after an access. In contrast, system memory is optimized for latency and locality; contiguous requests will tend to stay to one memory channel and keep DRAM pages open. The memory can run up to 1.86GT/s for a total of 29.8GB/s memory bandwidth on Llano. It also contains an improved hardware prefetcher that tracks 8 different strides or sequence of strides and speculatively fills into the memory controller (rather than the caches).

The GPU can access uncacheable system memory using the Garlic bus, but the memory must be pinned since there is no demand based paging for graphics (yet). System memory is generally slower than frame buffer memory because there is no interleaving (12GB/s versus 17GB/s for framebuffer). However, it is substantially faster than accessing cacheable shared memory, since there is no coherency overhead. For example, this approach could be used to read in data from the CPU to start an OpenCL kernel on the GPU.

https://developer.amd.com/wordpress/media/2013/06/1004_final.pdf

It provides information on Llano's CPU/GPU bandwidth across the different regions of memory.
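As a sanity check, the widths and clocks quoted in the article multiply out to the stated figures (bandwidth = width in bytes x transfer rate):

Code:
# Sanity-checking the quoted Llano numbers: bandwidth = width (bytes) x rate (GT/s).

def gbs(width_bytes, rate_gtps):
    return width_bytes * rate_gtps

print(gbs(16, 0.650))      # Onion: 16 B at 650 MHz -> 10.4 GB/s each way, as stated
print(gbs(32, 0.720))      # Garlic: 32 B per direction per channel at 720 MHz -> ~23 GB/s
print(2 * gbs(8, 1.86))    # Two 64-bit DDR3 channels at 1.86 GT/s -> ~29.8 GB/s total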
 
So on PS5 if both the CPU and GPU try to access memory at the same time, how much bandwidth does the GPU get?
That can vary based on the patterns and the policies set by the system and memory controllers. Managing competing clients with different latency tolerances is one of the areas vendors put a lot of proprietary effort into.
One question that would need answering is the bandwidth the CPU path can have. Two CCXs would generate up to 64 bytes of read traffic per cycle, but without knowing the speed and width of the fabric to the memory controllers, it can be far less. Desktop CPUs see a fabric clock topping out ~1.8GHz, which is 115.2 GB/s assuming the system allows the CPU such a wide path to memory. If the fabric gave a sufficiently wide path to the GDDR6 controllers and the cores are at 3.8 GHz, that's 243.2 GB/s max for the CPU. This is probably why Microsoft says the CPU cannot see the difference between GPU-optimized and regular memory, since only the GPU can generate enough traffic to know the difference.

That aside, the controllers will try their best to balance client requests, while at the same time trying to coalesce and re-order accesses as best they can to avoid costly DRAM penalties due to banking conflicts or turnaround penalties. This is where GPU and CPU can interfere with each other more often than not, since they are likely accessing different DRAM banks and the CPU's latency intolerance can force the memory controllers to suspend their re-ordering and coalescing in order to fulfill urgent requests. This can burn bandwidth, while the GPU memory subsystem has many hundreds of cycles to find a more ideal pattern for maximum bandwidth utilization.
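For what it's worth, the two CPU-side figures above are just 64 bytes per cycle multiplied by whichever clock is assumed to be the limiter (neither clock is confirmed for the consoles):

Code:
# 64 B/cycle of CPU read traffic, multiplied by whichever clock is assumed to limit it.
# Neither clock is confirmed for the consoles; these are the figures used above.
BYTES_PER_CYCLE = 64

print(BYTES_PER_CYCLE * 1.8)   # 115.2 GB/s if capped by a ~1.8 GHz fabric clock
print(BYTES_PER_CYCLE * 3.8)   # 243.2 GB/s if the 3.8 GHz cores set the pace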



You don't. You're either doing a 320-bit access or a 192-bit access depending on which client is asking for that particular segment of data and which pool it exists in.
In reality, I think it's generally that you'll see a broad stream of 64 byte accesses from either client, and the memory subsystem puts them in the appropriate memory controller queues and the controllers try to satisfy them in a way that doesn't waste access cycles while not taking too long for latency-sensitive clients.
Cache lines are generally 64B or so, and the 16-bit channels have a burst length of 16. That usually means 2 DRAM bursts per cache-line fill, although having more than one access in a row can hide some DRAM latencies or relax pressure on the command bus.
320 and 192 bits on their own don't reflect what an access usually sees.
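Spelling out the burst arithmetic in that last paragraph (assuming the usual 64-byte cache line and GDDR6's 16-bit channels at burst length 16):

Code:
# GDDR6 burst size vs. a 64 B cache line.
CHANNEL_BITS = 16        # each GDDR6 chip exposes two independent 16-bit channels
BURST_LENGTH = 16        # BL16
CACHE_LINE_BYTES = 64

bytes_per_burst = CHANNEL_BITS * BURST_LENGTH // 8    # 32 bytes
print(bytes_per_burst)
print(CACHE_LINE_BYTES // bytes_per_burst)            # 2 bursts per cache-line fill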
 
That can vary based on the patterns and the policies set by the system and memory controllers. Managing competing clients with different latency tolerances is one of the areas vendors put a lot of proprietary effort into.
One question that would need answering is the bandwidth the CPU path can have. Two CCXs would generate up to 64 bytes of read traffic per cycle, but without knowing the speed and width of the fabric to the memory controllers, it can be far less. Desktop CPUs see a fabric clock topping out ~1.8GHz, which is 115.2 GB/s assuming the system allows the CPU such a wide path to memory. If the fabric gave a sufficiently wide path to the GDDR6 controllers and the cores are at 3.8 GHz, that's 243.2 GB/s max for the CPU. This is probably why Microsoft says the CPU cannot see the difference between GPU-optimized and regular memory, since only the GPU can generate enough traffic to know the difference.

That aside, the controllers will try their best to balance client requests, while at the same time trying to coalesce and re-order accesses as best they can to avoid costly DRAM penalties due to banking conflicts or turnaround penalties. This is where GPU and CPU can interfere with each other more often than not, since they are likely accessing different DRAM banks and the CPU's latency intolerance can force the memory controllers to suspend their re-ordering and coalescing in order to fulfill urgent requests. This can burn bandwidth, while the GPU memory subsystem has many hundreds of cycles to find a more ideal pattern for maximum bandwidth utilization.




In reality, I think it's generally that you'll see a broad stream of 64 byte accesses from either client, and the memory subsystem puts them in the appropriate memory controller queues and the controllers try to satisfy them in a way that doesn't waste access cycles while not taking too long for latency-sensitive clients.
Cache lines are generally 64B or so, and the 16-bit channels have a burst length of 16. That usually means 2 DRAM bursts per cache-line fill, although having more than one access in a row can hide some DRAM latencies or relax pressure on the command bus.
320 and 192 bits on their own don't reflect what an access usually sees.

Thank you! I wondered if you could combine multiple non-interfering memory requests into a single request. I'm assuming that's what you mean by coalesce?
 
Yeah, I just use cloud-based solutions on my Mac, or I terminal into a compute cluster.

I don't have dreams of these things, I'm just a user. If I have to upgrade a GPU, I'll get one that lets me do both. Otherwise my company provides P100s and V100s for use. I'm just thinking out loud since MS seems to be buying up a lot of AMD hardware for machine learning. Like, if I chose Azure for compute work, how do I know if it's running on AMD or Nvidia hardware? I don't think we know? (Though, yeah, I can request an Nvidia machine on Azure, but anyway.)

I think I'm just looking at the overall industry to see if a company like Google or MS would use the GPUs they use for streaming to also provide GPU compute services in the cloud, and if so, how they would be used.

I think Microsoft Azure uses Linux? If they are using AMD hardware then it would be running on their ROCm driver stack. There's even a ROCm fork of TensorFlow that they are working on upstreaming.

AMD's big customers like Google (maintainer of TensorFlow) and Facebook (maintainer of PyTorch) have been happy with the work AMD has been putting into their frameworks so far.
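For what it's worth, from the framework user's side the vendor mostly disappears. A minimal sketch, assuming a ROCm build of PyTorch, which as far as I know exposes the AMD GPU through the same torch.cuda API:

Code:
import torch

# On a ROCm build of PyTorch the AMD GPU is (as far as I know) exposed through the
# same torch.cuda API, so the same code runs unchanged on Nvidia or AMD hardware.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device, getattr(torch.version, "hip", None))   # torch.version.hip is set on ROCm builds

x = torch.randn(1024, 1024, device=device)
y = x @ x.t()          # runs on whichever accelerator (or the CPU) was found
print(y.shape)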
 
Thank you! I wondered if you could combine multiple non-interfering memory requests into a single request. I'm assuming by coalesce, this is what you mean?
I was using it as a catch-all term for combining traffic of various kinds: satisfying requests for the same address at the same time, reordering accesses to the same bank so that they are all satisfied before closing the bank and moving on, collecting writes and reads into contiguous stretches, etc. It can be subject to limits on how long the controller can delay an access in the hope of collecting enough requests to fill the available cycles, and on how stringent the client is about the ordering of accesses. CPUs on average are less forgiving on both, which is part of why there's additional overhead when the CPU generates a significant amount of traffic while the GPU is doing the same.
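As a toy illustration of that batching and reordering trade-off (nothing like a real controller, just my own sketch of grouping requests by bank/row while bounding how long a latency-sensitive CPU request can wait):

Code:
from collections import deque

# Toy scheduler: batch pending requests that hit the same (bank, row) so the row is
# opened once, but never let a latency-sensitive CPU request wait too many rounds.
# Real controllers are vastly more sophisticated; this only illustrates the trade-off.
MAX_CPU_AGE = 4   # arbitrary limit on how many scheduling rounds a CPU request may wait

def schedule(pending):
    """pending: deque of dicts {client, bank, row, age}; returns one batch to issue."""
    urgent = [r for r in pending if r["client"] == "cpu" and r["age"] >= MAX_CPU_AGE]
    target = urgent[0] if urgent else pending[0]
    batch = [r for r in pending if (r["bank"], r["row"]) == (target["bank"], target["row"])]
    for r in batch:
        pending.remove(r)
    for r in pending:
        r["age"] += 1     # everything left behind gets older
    return batch

queue = deque([
    {"client": "gpu", "bank": 0, "row": 7, "age": 0},
    {"client": "cpu", "bank": 2, "row": 1, "age": 0},
    {"client": "gpu", "bank": 0, "row": 7, "age": 0},
])
while queue:
    print(schedule(queue))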
 
I was using it as a catch-all term for combining traffic of various kinds: satisfying requests for the same address at the same time, reordering accesses to the same bank so that they are all satisfied before closing the bank and moving on, collecting writes and reads into contiguous stretches, etc. It can be subject to limits on how long the controller can delay an access in the hope of collecting enough requests to fill the available cycles, and on how stringent the client is about the ordering of accesses. CPUs on average are less forgiving on both, which is part of why there's additional overhead when the CPU generates a significant amount of traffic while the GPU is doing the same.

So, if there were requests from the CPU to data contained in the 6 chips that make up the slow pool and requests from the GPU for data contained in the 4 chips not included in the slow pool, could these theoretically be fulfilled concurrently?
 
It all adds up, but probably more on the cooling cost than the PSU. There are almost always standardized designs from PSU manufacturers; they adapt well-tested designs (already passed every regulation worldwide, dirt cheap to make), and there's very little cost difference between, say, a 300 W and a 350 W unit.

The problem I see with cooling is that it can reach a threshold where different materials and assembly become required. Like going from plain aluminum, to aluminum plus a copper slug, then adding heatpipes, then more heatpipes, until a large vapor chamber is required.

So maybe Sony decided to stay with two or three heatpipes instead of a vapor chamber?
So no major thermal win from a more efficient power supply, then. I don't know how efficient the average console PSU is, but 8 to 10% more might not get you much.
Yeah, material and design/manufacturing costs, and it has to be pretty quiet. I was going to say maybe the cooling doesn't have to be expensive as long as you move the air quickly enough, but then that means noise. Maybe they did something there?
The PS5 might be an intriguingly baffling system. :LOL:
 