Apple is an existential threat to the PC

Both AMD and Nvidia make components for server farms that cost almost as much as the most expensive Mac Studio configuration.

Boutique, which they will price down massively for supercomputer design wins and hyperscaler orders. The boutique prices we see are for boutique items, like PC workstations ... we never get to see the negotiated prices for specified uses, which come much closer to the per-mm² prices than we can ever get (and which are not the same as COTS volume prices either). The time when workstations were big-volume, big-publicity items is long gone for PCs, though not so much for Apple due to their unique circumstances. Apple owns more and more of the high-end market, and without the PC's fragmentation, so you can't compare it against PCs as a whole.

When ILM wants something to run their renderers for their good software, they will not build it out of boutique-priced items. They build it down to cost ... and that cost can't pay for massive interconnects to scale bad software a little better across a couple of tens of cores, since they need to scale to thousands anyway, so a faster interconnect between tens of cores does nothing to help them. It's a very limited market where that kind of interconnect is really necessary for good software.

Sometimes good software can make use of Fugaku, Cerebras or M1 Ultra bandwidth, but eking a little more performance out of bad software, that's always been the true domain of SMP. Which, combined with the willingness of Apple customers to just spend a little bit more, is what will drive the use of the Studio too (where OS X plays no small part either; if you're going to pay for quality you want a quality stack, hardware is not enough). Running poorly scaling software just a little bit faster, and the occasional artist doing off-line rendering on it when he would probably be better off renting some render-farm time, but can't be bothered to learn how.
 
Boutique, which they will price down massively for supercomputer design wins and hyperscaler orders. The boutique prices we see are for boutique items, like PC workstations

This is utter drivel. Nvidia, AMD, Intel and ARM are all competing in very competitive markets where the ability to deliver maximum performance at cost is critical to competitiveness. For decades, one of the biggest issues with parallelisation of hardware has been decreasing efficiency: a theoretical 2x maximum performance is closer to 1.4-1.6x deliverable maximum performance.

Software can mitigate some of this by allocating processes requiring inter-core communication and shared memory within clusters, but there is no optimum software solution; it's about mitigating the loss of achievable performance. Removing the hardware barriers by introducing massively wider and faster interconnects is the solution. Software being a band-aid for inefficient hardware design is not the way forward. Fixing the inefficient hardware design is.
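
To put rough numbers on that 1.4-1.6x figure, here is a minimal toy model (my own illustration, not from either poster): it charges each extra processor an assumed serial fraction plus an assumed per-processor communication overhead, which is one common way the gap between theoretical and deliverable scaling is reasoned about.

    # Toy scaling model (illustrative assumptions only): a fraction of the work
    # is serial, and each extra processor adds a relative communication cost.
    def speedup(n, serial_fraction=0.05, comm_overhead=0.15):
        parallel_time = (1 - serial_fraction) / n      # perfectly parallel part
        comm_time = comm_overhead * (n - 1) / n        # interconnect penalty
        return 1.0 / (serial_fraction + parallel_time + comm_time)

    for n in (1, 2, 4):
        print(f"{n} processors: {speedup(n):.2f}x")
    # With these assumed numbers, 2 processors land around 1.7x rather than 2x,
    # the kind of gap described above; 4 processors fall further behind at 2.5x.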
 
Fugaku isn't selling very well to cloud providers.

The biggest issue with parallelisation has always been bad software more than bad hardware. The moment bandwidth needs to move off a single die the cost shoots up; most markets won't bear the cost of massively parallel buses on interposers or cross-reticle interconnects. AMD uses an organic substrate purely because it's cheap and good enough.
 
Right, that would be a decision made by the SSD manufacturer, unless of course Apple is building their own? Even then, I would suspect they're using an existing (eg non-Apple) flash controller chip. I dunno, it just seems a strange decision to me. I guess it's another way to upsell the size?

They have been using their own controller design since the T2 chip, or have licensed something that they integrate into their own silicon.
 
The biggest issue with parallelisation has always been bad software more than bad hardware.
No, this is plain wrong. There are many parallelisation problems where it's simply not possible to break down jobs at the software process level to account for deficiencies in the hardware. It's like putting a hat on a person profusely bleeding from the head. You aren't solving the problem, you are trying to hide it.

Do you honestly believe the reason that SLI configurations don't deliver twice the performance is because of software? :???:
 
Boutique, which they will price down massively for supercomputer design wins and hyperscaler orders. The boutique prices we see are for boutique items, like PC workstations ... we never get to see the negotiated prices for specified uses, which come much closer to the per-mm² prices than we can ever get (and which are not the same as COTS volume prices either). The time when workstations were big-volume, big-publicity items is long gone for PCs, though not so much for Apple due to their unique circumstances. Apple owns more and more of the high-end market, and without the PC's fragmentation, so you can't compare it against PCs as a whole.

When ILM wants something to run their renderers for their good software, they will not build it out of boutique-priced items. They build it down to cost ... and that cost can't pay for massive interconnects to scale bad software a little better across a couple of tens of cores, since they need to scale to thousands anyway, so a faster interconnect between tens of cores does nothing to help them. It's a very limited market where that kind of interconnect is really necessary for good software.

Sometimes good software can make use of Fugaku, Cerebras or M1 Ultra bandwidth, but eking a little more performance out of bad software, that's always been the true domain of SMP. Which, combined with the willingness of Apple customers to just spend a little bit more, is what will drive the use of the Studio too (where OS X plays no small part either; if you're going to pay for quality you want a quality stack, hardware is not enough). Running poorly scaling software just a little bit faster, and the occasional artist doing off-line rendering on it when he would probably be better off renting some render-farm time, but can't be bothered to learn how.

Any source for the claim that workstations aren't attractive in the x86 space, in favor of Apple? I tried to Google these claims but am not finding anything workstation-related (yet). What I did find was that Apple covers around 7.7% of the total PC market (yes, they are PCs, even the M1 Studio, like it or not). That's workstations, laptops, everything excluding mobile devices like phones, watches etc.
It's certainly not what I am seeing, at least; most corporations use Intel-based workstations, but that might be different where you are, of course.
Share some sources on the workstation thing, since I really have no clue about just that specific market. Anyway, Apple has a very tiny share of the total PC market (and that is including Intel-based Mac PCs).
 
Do you honestly believe the reason that SLI configurations don't deliver twice the performance is because of software? :???:

I believe if SLI is implemented well, the bandwidth requirements vastly diminish. Most of the implementation belongs in the engine, not drivers.

PS. Also, of course, you would interleave sections, not scan lines. True scan-line interleaving is bad software, period. Alternate-frame "SLI" has lots of latency-related issues, due to the way the renderer is haphazardly divided between engine and driver, which bandwidth won't help, and it's also bad software ... but it's what the GPU manufacturers on the PC have to work with: bad software.

PPS. With Apple's foray into AR and possibly VR, I could see them launching their own engine, with blackjack, hookers and proper parallel tile rendering.
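
As a concrete (and entirely hypothetical) illustration of "sections, not scan lines": split the frame into a handful of horizontal macro-sections and hand them out round-robin, so each GPU renders large contiguous bands rather than alternating lines. The section height and GPU count below are arbitrary assumptions.

    # Sketch of macro-section interleaving across GPUs.
    def assign_sections(frame_height=2160, section_height=270, num_gpus=2):
        """Return {gpu: [(y_start, y_end), ...]} with round-robin macro-sections."""
        assignments = {gpu: [] for gpu in range(num_gpus)}
        for i, y in enumerate(range(0, frame_height, section_height)):
            assignments[i % num_gpus].append((y, min(y + section_height, frame_height)))
        return assignments

    print(assign_sections())
    # GPU 0 gets sections 0, 2, 4, 6 and GPU 1 gets 1, 3, 5, 7 -- large coherent
    # regions, not per-scan-line interleaving.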
 
PowerVR used to split odd and even tiles in multiple-core configs like the Naomi 2 arcade systems. Since Series5, when they moved to multi-core configs again, they started using a form of macro/micro tiling, and I don't see why Apple would use something different. While deferring one frame at a time, it shouldn't be any witchcraft to divide viewports/workloads evenly between N different cores.
 
Right, that would be a decision made by the SSD manufacturer, unless of course Apple is building their own? Even then, I would suspect they're using an existing (eg non-Apple) flash controller chip. I dunno, it just seems a strange decision to me. I guess it's another way to upsell the size?
I guess that to benefit from all channels you need to attach flash chips to them, so perhaps lower capacities needing fewer flash chips means less bandwidth (sorry if that was stupidly obvious, I might have missed part of the discussion).
 
I believe if SLI is implemented well, the bandwidth requirements vastly diminish. Most of the implementation belongs in the engine, not drivers.
You have a lot of beliefs that share little common ground with reality.
 
I guess that to benefit from all channels you need to attach flash chips to them, so perhaps lower capacities needing fewer flash chips means less bandwidth (sorry if that was stupidly obvious, I might have missed part of the discussion).
That is the general gist of it :)
 
I guess that to benefit from all channels you need to attach flash chips to them, so perhaps lower capacities needing fewer flash chips means less bandwidth (sorry if that was stupidly obvious, I might have missed part of the discussion).
Indeed.

SSD manufacturers have basically three options:
  • Use the same controller and link different numbers of flash dies to it. This means bandwidth and size increase quasi-linearly with respect to total connected die count.
  • Use the same controller and use different-size dies to keep the populated channel count equal. This means bandwidth is roughly the same regardless of total disk capacity.
  • Use the same controller but different combinations of both die size and die count depending on the capacity of the disk. This typically results in the lowest-capacity part having lower bandwidth, while beyond a certain capacity the bandwidth levels off.
  • Technically there's a fourth option: use different controllers for different capacity drives. Most manufacturers have stopped doing this because the performance differences between controllers can make the same product line look quite different across capacities in reviews.
Since I used the Samsung 980 Pro as an example before, I'll use it again here -- they are doing the 3rd option above: Samsung 980 Pro M.2 NVMe SSD Review: Redefining Gen4 Performance | Tom's Hardware (tomshardware.com)
Toms Hardware said:
There are just two NAND packages onboard the 980 Pro’s PCB, which applies to all capacities. The 250GB to 1TB 980 Pros come with 256Gb dies while the 2TB model, when available, will feature 512Gb dies. This means that both the 1TB and 2TB models feature 32 dies in total for optimal interleaving and peak performance characteristics. To boost performance, Samsung’s V-NAND features two planes per die (independent regions of die access) for further interleaving.
The 256GB, 512GB and 1024GB drives all use the same controller but just keep stacking more 256Gb dies -- hence performance continues to increase with size. Beyond 1TB, Sammy flips to using larger dies and keeps the same channel count. Thus, the 2TB model is basically the same speed as the 1TB model.
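
As a rough sketch of why that third option behaves this way (the per-die and per-channel throughput figures below are invented round numbers, not Samsung's): bandwidth is limited by whichever saturates first, the attached dies or the controller's channels, so it grows with die count and then flattens.

    # Toy model of SSD sequential bandwidth vs. populated die count.
    # per_die_MBps, per_channel_MBps and the channel count are assumed values.
    def ssd_bandwidth_MBps(total_dies, channels=8,
                           per_die_MBps=400, per_channel_MBps=1600):
        die_limited = total_dies * per_die_MBps          # dies are the bottleneck
        channel_limited = channels * per_channel_MBps    # controller is the bottleneck
        return min(die_limited, channel_limited)

    for dies in (4, 8, 16, 32, 64):
        print(f"{dies:2d} dies: ~{ssd_bandwidth_MBps(dies)} MB/s")
    # Bandwidth climbs with die count until the channels saturate (32 dies here),
    # after which extra capacity no longer adds speed.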
 
It's a conscious choice by Apple, IMO.

They could attach 8 packages to all their SSD SKUs and just use fewer dies in each package (or smaller-capacity dies).

It is a way to justify the price hike for the larger capacity models; they also come with better performance.

Cheers
 
You have a lot of beliefs that share little common ground with reality.

Maybe, let's do an overly simplified model. Let's say you map objects to individual GPUs exactly where it should be done in good software, during frustum culling on a hierarchical representation, and you render section by section (i.e. not all the sections for a single GPU at once). That way you can send the finished shadow/environment maps for a section to the other GPU while you're rendering the next one, instead of having to hurry at the end of rendering the entire buffer, or, equally bad software-wise, having to communicate the writes as they happen (there's still a small bubble after the rendering of the last section, but let's ignore that for simplicity).

Let's say a 4×4K×4K×32-bit split shadow map in 10 ms. So the necessary bandwidth to communicate the results is 25.6 GB/s ... bugger all.

Immediate-mode whole-frame/buffer rendering is bad software for parallelisation. (Macro-)tiling, if implemented directly in the engine, can be good software for parallelisation.
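
Spelling out the 25.6 GB/s arithmetic above (reading "4K" as 4000 pixels per side, which is how that figure falls out):

    # Bandwidth needed to hand four 4K x 4K, 32-bit shadow-map splits to the
    # other GPU within a 10 ms frame budget.
    maps = 4
    width = height = 4000        # "4K" per side
    bytes_per_pixel = 4          # 32-bit depth
    frame_budget_s = 0.010       # 10 ms

    payload_bytes = maps * width * height * bytes_per_pixel   # 256,000,000 bytes
    print(payload_bytes / frame_budget_s / 1e9, "GB/s")       # 25.6 GB/s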
 
Maybe, let's do an overly simplified model.

Fixing technical problems in theory does not fix them in reality. There are well-established engineering models in practice, on both a hardware and software level, that determine the maximum and achievable efficiency of introducing increasing levels of processing parallelisation across distinct processing elements like CPUs and GPUs in SLI mode.

If you think you can fix SLI in software and make it more efficient than Nvidia, then put your money where your mouth is and do that. But please stop posting bollocks in this thread. Implying that companies like Nvidia, Intel and AMD aren't doing their engineering properly, and that this can be solved in software compared to Apple's 10,000-signal interconnect solution, has to be among the most absurd posits I've ever read here.

Clearly an armchair guru has the answers and all the engineers in all these companies are just idiots. Sure.
 
SLI was theoretically interesting, it just didn't scale well (hardware-wise mostly) for most gaming-oriented things. For Apple's M1 Ultra it's less of a problem because its main use will be computing (Lightroom, media export/creation etc). SLI was quite impressive in benchmarks, but that's not real-world usage.
 
Fixing technical problems in theory does not fix them in reality. There are well-established engineering models in practice, on both a hardware and software level, that determine the maximum and achievable efficiency of introducing increasing levels of processing parallelisation across distinct processing elements like CPUs and GPUs in SLI mode.

If you think you can fix SLI in software and make it more efficient than Nvidia, then put your money where your mouth is and do that. But please stop posting bollocks in this thread. Implying that companies like Nvidia, Intel and AMD aren't doing their engineering properly, and that this can be solved in software compared to Apple's 10,000-signal interconnect solution, has to be among the most absurd posits I've ever read here.

Clearly an armchair guru has the answers and all the engineers in all these companies are just idiots. Sure.
There's a small outside chance mfa has a platform to stand on, though whether it will ever come to fruition is a separate debate (I think not likely anytime soon).
The APIs were never very good at letting developers really have control over multi-GPU.
But with DX12 there is much better control over it. Still, I haven't seen any developer spend significant time to really do it.

https://developer.nvidia.com/explicit-multi-gpu-programming-directx-12

In the right hands of a developer willing to put in the investment, DX12 can create a very different pipeline for multi-GPU that will result in improved performance. Generating a flat-out 2x performance gain may be very challenging for some titles unless the goal is to target only mGPU. That being said, this is a demo, but it's worth reviewing; if you could imagine a future where everything was mGPU, things would be better in this field. I had a chance to see this DX12 slide deck presented at Build 2015. This was an interesting bit and I spoke to Max McCullen about it. There are quite interesting things you can do with EMA, but I don't think it'll ever see use.



That brings us to today’s article, an initial look into the multi-GPU capabilities of DirectX 12. Developer Oxide Games, who is responsible for the popular Star Swarm demo we looked at earlier this year, has taken the underlying Nitrous engine and are ramping up for the 2016 release of the first retail game using the engine, Ashes of the Singularity. As part of their ongoing efforts to Nitrous as a testbed for DirectX 12 technologies and in conjunction with last week’s Steam Early Access release of the game, Oxide has sent over a very special build of Ashes.

What makes this build so special is that it’s the first game demo for DirectX 12’s multi-GPU Explicit Multi-Adapter (AKA Multi Display Adapter) functionality. We’ll go into a bit more on Explicit Multi-Adapter in a bit, but in short it is one of DirectX 12’s two multi-GPU modes, and thanks to the explicit controls offered by the API, allows for disparate GPUs to be paired up. More than SLI and more than Crossfire, EMA allows for dissimilar GPUs to be used in conjunction with each other, and productively at that.

https://www.anandtech.com/show/9740/directx-12-geforce-plus-radeon-mgpu-preview


[Benchmark chart from the AnandTech mGPU preview linked above]

Not quite 2x (far from it), but none of the micro-stutter that usually plagues mGPU setups. This is a good step in that direction, but I don't think we are taking any real steps to develop these types of pipelines. Perhaps if consoles ever went mGPU we'd see better development here.
 
There's a small outside chance mfa has a platform to stand on, though whether it will ever come to fruition is a separate debate (I think not likely anytime soon).
As somebody with a lot of compute server experience - where the limits of parallelisation manifest themselves heavily - I can assure you that software cannot solve this problem.

That's not to say that the underlying operating system, the APIs and the software/applications themselves cannot make better use of hardware, but fundamentally, if you have two disparate processing elements needing to work more closely together than the connection between them can support in realtime without introducing latency, then that's a barrier to efficiency, and it's why doubling the processing elements doesn't result in double the performance. It's why many servers have many more levels of cache and interconnect than desktop computers and consoles. But it's also why it requires increasingly complex silicon to manage cache coherence - because cache becomes the method through which you share data more and more widely as parallelisation increases at each level beyond whatever you consider the local high-speed bus, but obviously at increasingly greater latencies the wider (and higher up) the cache architecture you go.

This is only a software-solvable problem for a very small number of limited problems: those which can be broken down up front into conveniently separate sub-jobs that do not demand inter-processor communication exceeding what the interconnect bus can provide.

If you could solve this, you would be worth billions to Google, Apple, Facebook, Amazon and everybody else who runs servers.
 
As somebody with a lot of compute server experience - where the limits of parallelisation manifest themselves heavily - I can assure you that software cannot solve this problem.

That's not to say that the underlying operating system, the APIs and the software/applications themselves cannot make better use of hardware, but fundamentally, if you have two disparate processing elements needing to work more closely together than the connection between them can support in realtime without introducing latency, then that's a barrier to efficiency, and it's why doubling the processing elements doesn't result in double the performance. It's why many servers have many more levels of cache and interconnect than desktop computers and consoles. But it's also why it requires increasingly complex silicon to manage cache coherence - because cache becomes the method through which you share data more and more widely as parallelisation increases at each level beyond whatever you consider the local high-speed bus, but obviously at increasingly greater latencies the wider (and higher up) the cache architecture you go.

This is only a software-solvable problem for a very small number of limited problems: those which can be broken down up front into conveniently separate sub-jobs that do not demand inter-processor communication exceeding what the interconnect bus can provide.

If you could solve this, you would be worth billions to Google, Apple, Facebook, Amazon and everybody else who runs servers.
For sure, and I respect that. We're having to move more of our compute environments to the cloud for this reason; it's a trade-off in latency etc. in exchange for bigger numbers, with RAM amount in particular being a big limiting factor for large data jobs.

With respect to mGPU, you're right. The crux of the issue is how one divides the work appropriately, but you are unlikely to obtain 2x scaling for a game unless you're setting up a particular synthetic bench where the GPUs don't need to share anything with each other, and that's just not realistic for video games. But to be fair, at the core of the mGPU debate we've very much been stuck with implicit multi-adapter, and typically that's driver-based AFR. I think we'd probably get better performance out of explicit developer work in this space. But you're right in saying one should not expect 2x. We don't see 2x by doubling compute units in a GPU, you won't see it by doubling the clock rate (though you're much closer), and you won't see it with mGPU either.

But we can do better than SLI and Crossfire.
 