Is current computer architecture a limiting factor in the development of an ideal games machine? *sp

Shifty Geezer

I feel pretty confident in saying the primary problems for general game workloads right now are on the BVH building/streaming/LOD side, but I don't think there's a silver bullet.
On a wider gamedev level, it strikes me that this spatial representation, and the larger game-world representation, is a crux. Games are effectively a big arse database, with queries based on positions for 'physical' interactions, queries of properties to enact data changes such as 'health', and queries of objects to add/remove. To see which objects have collided with a given object, that object is tested against all the others. You end up hammering RAM for everything, just to see if a property has a given value, and we develop acceleration structures to work around this. Either you do lots of random access, or you structure your workloads as linear jobs accessing streamed data structures.
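As a rough sketch of the kind of 'database query' I mean, here's the position-based case with a uniform grid standing in for whatever acceleration structure you'd actually use (toy types, all names illustrative):

Code:
// Minimal sketch of a position-keyed 'database query', assuming a toy Object
// type and a uniform grid cell size. Callers must keep the Objects alive.
#include <cstdint>
#include <cmath>
#include <unordered_map>
#include <vector>

struct Object { float x, y, z, radius; int id; };

struct SpatialHash {
    float cellSize = 4.0f;
    std::unordered_map<uint64_t, std::vector<const Object*>> cells;

    uint64_t key(int cx, int cy, int cz) const {
        // Pack three 21-bit cell coordinates into one 64-bit key.
        auto u = [](int c) { return uint64_t(uint32_t(c) & 0x1FFFFF); };
        return (u(cx) << 42) | (u(cy) << 21) | u(cz);
    }

    void insert(const Object& o) {
        cells[key(int(std::floor(o.x / cellSize)),
                  int(std::floor(o.y / cellSize)),
                  int(std::floor(o.z / cellSize)))].push_back(&o);
    }

    // "SELECT * FROM objects WHERE distance(position, p) < r" -- but instead
    // of scanning every row, we only visit the grid cells the sphere overlaps.
    std::vector<const Object*> query(float x, float y, float z, float r) const {
        std::vector<const Object*> hits;
        int x0 = int(std::floor((x - r) / cellSize)), x1 = int(std::floor((x + r) / cellSize));
        int y0 = int(std::floor((y - r) / cellSize)), y1 = int(std::floor((y + r) / cellSize));
        int z0 = int(std::floor((z - r) / cellSize)), z1 = int(std::floor((z + r) / cellSize));
        for (int cx = x0; cx <= x1; ++cx)
        for (int cy = y0; cy <= y1; ++cy)
        for (int cz = z0; cz <= z1; ++cz) {
            auto it = cells.find(key(cx, cy, cz));
            if (it == cells.end()) continue;
            for (const Object* o : it->second) {
                float dx = o->x - x, dy = o->y - y, dz = o->z - z;
                float rr = r + o->radius;
                if (dx * dx + dy * dy + dz * dz <= rr * rr) hits.push_back(o);
            }
        }
        return hits;
    }
};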

RT is just more of the same. You have to test which objects lie on a trajectory, so you have to test all the objects. BVHs help reduce the search and test requirements for RT, but it's still really a question of a database query and optimising for that. How do we find which 'objects' satisfy certain criteria out of all the objects? BVHs are a workaround for our database modelling. If it were hypothetically possible to store every object's relationship with every other, we wouldn't have to worry about BVH generation and could just select from our database. That requires more resources than we have, so we are looking for workarounds, but I wonder if those workarounds are missing a trick somewhere?

I'm starting to think the whole computer topology and working models are far from ideal. Object-oriented development seems a dreadful fit for game optimisation; everything should be stored in RAM in a way that maximises query efficiency, rather than in a way that makes it easy for devs to visualise what's going on. This is where we have data-oriented engines, but we aren't extending that to data-oriented hardware. Hardware is still driven at the conceptual level by the concepts laid down in creating general-purpose, random-access compute machines, with the focus on calculations, as opposed to data manipulation machines designed for finding and modifying data in massive datasets.

Conceptually, maybe the ultimate game machine has a relatively small amount of calculation capability but a much stronger and more capable way of dealing with data? Instead of spending time computing acceleration structures to find data, could we have a more direct way of finding that data?

/morning pondering

Edit: In short, the solution for fast ray tracing optimisations should also be leveraged for fast physics queries etc. That is, rather than dealing with RT like one problem, and physics like another, and AI like another, perhaps a homogenised view can simplify workloads into a specific architectural solution that'd benefit from a different hardware philosophy?
 
Conceptually, maybe the ultimate game machine has a relatively small amount of calculation capability but a much stronger and more capable way of dealing with data? Instead of spending time computing acceleration structures to find data, could we have a more direct way of finding that data?

/morning pondering

Don't consoles already allow that, kind of?

They don't have the compute performance of a PC, but the APIs and optimisations allow them to be more efficient with memory handling.
 
Conceptually, maybe the ultimate game machine has a relatively small amount of calculation capability but a much stronger and more capable way of dealing with data? Instead of spending time computing acceleration structures to find data, could we have a more direct way of finding that data?
Well, acceleration structures (or alternatives like hash maps) are the way to find our data efficiently; it's just that making them can't be free.
I think all we can do is parallelize. Instead of taking one game object (or ray) and doing a query of the data, then repeating the same thing for the next object, we want to process groups of objects so we don't have to run through the data for each single object.
GPUs almost enforce this by design. We can store many objects in thread registers or LDS, so we only need to make sure the grouped objects have similar access patterns, which may require building another acceleration structure to get good groupings before we can start the real work.
In this regard GPUs are ahead of CPUs, and things look pretty good.
But we cannot take our objects and call another function with them as a parameter. To do so, we have to store them to VRAM, make our second function its own dispatch controlled from the CPU, and take care of synchronisation as well. Then the second function needs to load the objects from VRAM, and only after all that can we continue work.
In this regard GPUs (or their APIs, to be precise) are still stuck in a prehistoric programming model that expects to brute-force over huge workloads one after another. Doing any kind of fine-grained work reduction and optimization is hard and cumbersome.
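A CPU-side caricature of that pattern (all names made up, nothing GPU-specific): instead of pass one simply calling pass two per element, everything has to round-trip through a buffer between two separate dispatches.

Code:
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };
struct Hit { int rayIndex; float t; };

// Dispatch 1: intersect rays, but instead of calling the shading function
// directly we must append results to an intermediate buffer ('VRAM').
void IntersectPass(const std::vector<Ray>& rays, std::vector<Hit>& hitBuffer) {
    for (size_t i = 0; i < rays.size(); ++i) {
        // ... traversal omitted; pretend every ray hits something at t = 1.
        hitBuffer.push_back({int(i), 1.0f});
    }
}

// Dispatch 2: shade the hits written by the previous pass.
void ShadePass(const std::vector<Hit>& hitBuffer, std::vector<float>& radiance) {
    for (const Hit& h : hitBuffer)
        radiance[h.rayIndex] = h.t;   // ... shading omitted
}

std::vector<float> Frame(const std::vector<Ray>& rays) {
    std::vector<Hit>   hitBuffer;                    // the round trip through memory
    std::vector<float> radiance(rays.size(), 0.0f);
    IntersectPass(rays, hitBuffer);
    // <-- on a real GPU this boundary is a separate dispatch plus a CPU-driven
    //     barrier; pass one cannot just call ShadePass per hit.
    ShadePass(hitBuffer, radiance);
    return radiance;
}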

But don't worry. Just make GPUs three times bigger, then doing three times more work than needed is no problem at all. Problem solved. We can't produce enough chips anyway, so we'll sell them even at three times the cost.
OK, I'm being sarcastic, but this mindset was visible decades before the current chip crisis. It is all there is and ever was, so most people do not even consider that there might be a better way.
 
Don't consoles already allow that, kind of?

They don't have the compute performance of a PC, but the APIs and optimisations allow them to be more efficient with memory handling.
They are still tied to the same memory-access architecture dating back to the 70s. The processor requests some data from a memory location, then data from a different memory location, then compares that data to decide what to do. The design is built around fast processing of data, and we've just worked around the slow RAM access as best as possible, using caches and then OoO execution and multithreading etc. It's hard to say whether the von Neumann model has shaped the development of tech to create what we have now, or if the evolution of tech forced us into this way of being.
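A trivial illustration of the access pattern I mean (toy code, nothing more): when each load decides the next address, the processor spends its time waiting on RAM rather than computing.

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node { int value; std::size_t next; };   // 'next' is the index of the next node

// Dependent loads: each iteration must finish its load before the next
// address is even known, so the CPU mostly waits on memory.
int SumByChasing(const std::vector<Node>& nodes, std::size_t start) {
    int sum = 0;
    for (std::size_t i = start; i != SIZE_MAX; i = nodes[i].next)
        sum += nodes[i].value;
    return sum;
}

// The same data walked linearly: streamable and prefetch-friendly.
int SumLinearly(const std::vector<Node>& nodes) {
    int sum = 0;
    for (const Node& n : nodes) sum += n.value;
    return sum;
}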
 
But don't worry. Just make GPUs three times bigger, then doing three times more work than needed is no problem at all. Problem solved. We can't produce enough chips anyway, so we'll sell them even at three times the cost.
OK, I'm being sarcastic, but this mindset was visible decades before the current chip crisis. It is all there is and ever was, so most people do not even consider that there might be a better way.
Had a quick nose at the old SaarCOR work:

https://www.google.com/url?sa=t&rct...ep1&type=pdf&usg=AOvVaw3COJryAEtrpNeL2SfJwXLh

I remember that being small and power-efficient. I get the impression from a scan of the document that their results were focussed on static scenes, but they were deliberately using different caches at the hardware level, treating memory access as part of the problem.
 
I do think the PC has played a big part in this problem; for decades the platform has allowed and kept all types of legacy hardware and software support rather than getting rid of it.

It's the same for gaming on PC: you can play games released in 1995 on a GPU released in 2021. The fact you can do that is amazing, but it has come at a cost.

And that cost has made moving to newer, more efficient methods and hardware much more difficult.

This is less of an issue for a console as long as

1. The hardware has good tools
2. They're willing to restrict backwards compat
3. They're willing to fund the R&D

/earlymorningbrainfart
 
perhaps a homogenised view can simplify workloads into a specific architectural solution that'd benefit from a different hardware philosophy?
It’s how we approach our challenges not so much the issue with the silicon or architecture. If we come up with a better software method of handling things, then the hardware should follow. The main challenge today is that as things get more complex and we want additional fidelity, the cost scaling is massive. Computational costs are increasing so significantly that different models will start outperform pure computation. So this is why I believe we are seeing more ML in games to come.

And with it, AI acceleration.

 
I do think the PC has played a big part in this problem; for decades the platform has allowed and kept all types of legacy hardware and software support rather than getting rid of it.

It's the same for gaming on PC: you can play games released in 1995 on a GPU released in 2021. The fact you can do that is amazing, but it has come at a cost.

And that cost has made moving to newer, more efficient methods and hardware much more difficult.

This is less of an issue for a console as long as

1. The hardware has good tools
2. They're willing to restrict backwards compat
3. They're willing to fund the R&D

/earlymorningbrainfart
I really don't see it as that strict. Yes, BC is more work, but in the case of Windows, almost only more work for Microsoft. The hardware manufacturers more or less introduce new features and also kill them off, but those features remain usable because of the software layer.

The bigger "problem" for PCs is that not every potential customers buys new hardware every year. So if you create a game only focusing on the newest stuff, you won't have a big potential audience. No publisher wants that. So PC games must always run on as many hardware configurations as possible. That is also why something e.g. like NVMe as base "SSD" won't be a thing the next few years. It would already be a great step if game developers would require an SSD at all. This would already make the door wide open for smaller games and new streaming systems (used in the future on consoles).
 
IHVs can only iterate on features that already fit in with the current rendering paradigms. E.g. hardware support for HOS or SDFs: if no game is going to use them because they are niche features, that's wasted silicon while your rival invests in better rasterising and wins all the benchmarks.
 
I really don't see it as that strict. Yes, BC is more work, but in the case of Windows, almost only more work for Microsoft. The hardware manufacturers more or less introduce new features and also kill them off, but those features remain usable because of the software layer.

The bigger "problem" for PCs is that not every potential customers buys new hardware every year. So if you create a game only focusing on the newest stuff, you won't have a big potential audience. No publisher wants that. So PC games must always run on as many hardware configurations as possible. That is also why something e.g. like NVMe as base "SSD" won't be a thing the next few years. It would already be a great step if game developers would require an SSD at all. This would already make the door wide open for smaller games and new streaming systems (used in the future on consoles).

The issue is, if Nvidia (say) created a whole new way of rendering that's more efficient than what we have now, they couldn't release it on PC because of the need to support legacy systems and software.

Whereas, as long as back compat isn't a priority, they could easily release it in a console.
 
The issue is, if Nvidia (say) created a whole new way of rendering that's more efficient than what we have now, they couldn't release it on PC because of the need to support legacy systems and software.

Whereas, as long as back compat isn't a priority, they could easily release it in a console.
That is the classic chicken-and-egg problem.
If you add new features and no one buys them, no one will write software for the hardware.
But nonetheless, there is so far nothing really exceptional that is worth investing in. E.g. x86 is just "software"-level compatibility (well, software inside the hardware). AMD has always used its own internal architecture and converts the x86 commands to its own command pipeline. Compatibility is there to ensure someone buys their CPUs.
Also, "old" paradigms are not automatically "bad". It is much better to stay with something that has proved to work and add new stuff, until there is new stuff that really can replace the "old" way of doing it. If there are no standards (basically that is what makes things compatible) we only get random stuff popping up and disappearing. You can currently see this in the node.js community: first many frameworks pop up, then they get abandoned because something else has proved to be better. The problem is, if you have decided to stick with one of these, you have a lot of work ahead of you, because you must replace those old things once they become insecure (btw, npm packages are the new DLL hell).

If you refer e.g. to RTRT, there is really not enough power in those chips to go with it completely. And then RTRT is also just an approximation, and you need many tricks to get it looking good (like the techniques before it). The hardware to really rely completely on RTRT is not "born" yet. Even now the entry point at about $400 (if prices were normal) is much too high. So it won't attract many consumers -> less interest from publishers -> no games.
 
I think over-emphasizing other areas of computer science runs the risk of overlooking the serious advances in rendering and games in general. Games are full of state-of-the-art data structures for fast retrieval, data coherency, etc. Rendering a super-complex scene in 13 ms is very unlike the kinds of problems you use a database for. And technology has advanced rapidly through the years, typically _following_ engineering advances. APIs and hardware have rapidly caught up to provide more flexibility, more compute, and hardware-accelerated alternatives to whatever the emerging trend is at any time.

I'm sure that if it wasn't shared with other computing tasks, gaming machine design might be a little faster, but we're pretty spoiled already.
 
If you refer e.g. to RTRT, there is really not enough power in those chips to go with it completely.
That's kinda missed the point that spawned this thread. ;) It's not really a 'power to trace rays' problem but a 'power to create the acceleration structures', which are all about trying to solve the problem of searching RAM for relevant data.
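To make that concrete: even the cheapest case, refitting an existing BVH after things move, walks every node every frame before a single ray goes out. A toy sketch (the layout and names are mine, assuming children are stored after their parent, not how any particular API or driver does it):

Code:
#include <algorithm>
#include <vector>

struct AABB { float min[3], max[3]; };
struct BVHNode {
    AABB bounds;
    int  left = -1, right = -1;   // -1 means leaf
    int  primitive = -1;          // valid when leaf
};

AABB Union(const AABB& a, const AABB& b) {
    AABB r;
    for (int i = 0; i < 3; ++i) {
        r.min[i] = std::min(a.min[i], b.min[i]);
        r.max[i] = std::max(a.max[i], b.max[i]);
    }
    return r;
}

// Every frame the geometry moves, every node's bounds must be recomputed --
// this is per-frame 'acceleration structure' cost, before any ray is traced.
// Assumes children always have higher indices than their parent, so a
// reverse sweep visits children first.
void Refit(std::vector<BVHNode>& nodes, const std::vector<AABB>& primBounds) {
    for (int i = int(nodes.size()) - 1; i >= 0; --i) {
        BVHNode& n = nodes[i];
        if (n.left < 0)
            n.bounds = primBounds[n.primitive];
        else
            n.bounds = Union(nodes[n.left].bounds, nodes[n.right].bounds);
    }
}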
 
That's kinda missed the point that spawned this thread. ;) It's not really a 'power to trace rays' problem but a 'power to create the acceleration structures', which are all about trying to solve the problem of searching RAM for relevant data.
like a very large 1-2GB cache?
 
I have some hope that on-chip RAM could help a bit with memory access (but I'm not sure if we want to use it as caches or to manage it ourselves).
But I have even bigger hope that it enables SoCs with GPUs powerful enough for games, which end up cheaper than a dGPU + CPU.
I hope the M1 ignites a race for other companies to catch up. I would switch to a new low-power platform rather quickly, and 5 TF would be enough.
Companies like AMD and Intel really seem to be sleeping through the current opportunity :/
 
That's kinda missed the point that spawned this thread. ;) It's not really a 'power to trace rays' problem but a 'power to create the acceleration structures', which are all about trying to solve the problem of searching RAM for relevant data.

It may be ignorance on my part, but I'm not sure where this leads that's better than current acceleration structures. Doing a linear search over a bunch of well-chosen tris is always going to be worse than doing a log search over the same well-chosen tris. If you mean caching the result of that search (maybe even at build time when possible) to avoid doing so many lookups in the first place, that's already what we do -- tracing to a surface with some kind of cache of the bounced light, or looking up a nearby probe and only updating that probe by casting rays when necessary, etc. Or using an ML optimization to guess at the same lookup.

More fast memory would expand how much we could put in data structures and caches, but not change the basic approach. In fact we more often see the opposite -- a clever new approach to storing data on the same old hardware enables huge wins, as with Nanite.
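The probe case is a decent example of what I mean by caching the lookup, something roughly like this (a toy sketch, all names made up; real systems interpolate between probes, handle visibility, etc.):

Code:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Probe {
    float irradiance[3] = {0, 0, 0};
    int   lastUpdateFrame = -1;
};

struct ProbeGrid {
    int   dim;
    float spacing;
    std::vector<Probe> probes;

    ProbeGrid(int d = 32, float s = 2.0f)
        : dim(d), spacing(s), probes(std::size_t(d) * d * d) {}

    Probe& At(float x, float y, float z) {
        auto cell = [&](float v) {
            int i = int(std::floor(v / spacing));
            return std::max(0, std::min(dim - 1, i));
        };
        return probes[(cell(z) * dim + cell(y)) * dim + cell(x)];
    }

    // The expensive query (casting rays) only happens when the cached answer
    // is too old; most shading just reuses the stored result.
    const float* Lookup(float x, float y, float z, int frame, int maxAge) {
        Probe& p = At(x, y, z);
        if (frame - p.lastUpdateFrame > maxAge) {
            // TraceProbeRays(p, x, y, z);   // hypothetical: update the cache here
            p.lastUpdateFrame = frame;
        }
        return p.irradiance;
    }
};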
 
It may be ignorance on my part, but I'm not sure where this leads that's better than current acceleration structures.
What about 1,000 processors, each with their own RAM pool which isn't fast but doesn't have to deal with lots of redundant searches, fed by a Data Coordination Engine, coupled with a three-dimensional RAM topology where the position in RAM, its address, corresponds to a valuable selection criterion?

This is really an exercise in thinking outside the box about what the real problem is. It's like needing to look up a thousand addresses from the phone book where the name has a vowel as its third letter. Do you train one person to read through quickly? Do you rewrite the phone book to fit the job? Do you get 1,000 people to each check their portion? Do you have teams of people doing different refinement jobs, so one bunch blacking out names you're not interested in, another group photocopying just the necessary pages, and a third group getting the addresses from this copy?

We have various measures analogous to these sorts of solutions in software, but we aren't designing our hardware any differently from: a large pool of RAM > fat pipe > a few worker bees (some thousands in the case of GPU compute) removing and replacing data in RAM.
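In today's software terms, the '1,000 people checking their portion' option is just this sort of thing (toy sketch, nothing clever); the question is whether the hardware itself could be organised around the query rather than us writing it every time:

Code:
#include <cctype>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct Entry { std::string name, address; };

static bool ThirdLetterIsVowel(const std::string& name) {
    if (name.size() < 3) return false;
    char c = char(std::tolower(static_cast<unsigned char>(name[2])));
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

// Split the 'phone book' between workers; each scans its own portion.
std::vector<std::string> FindAddresses(const std::vector<Entry>& book, unsigned workers) {
    std::vector<std::vector<std::string>> partial(workers);
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            for (std::size_t i = w; i < book.size(); i += workers)  // strided portion
                if (ThirdLetterIsVowel(book[i].name))
                    partial[w].push_back(book[i].address);
        });
    }
    for (auto& t : threads) t.join();

    std::vector<std::string> result;
    for (auto& p : partial)
        for (auto& a : p) result.push_back(std::move(a));
    return result;
}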
 
....
We have various measures analogous to these sorts of solutions in software, but we aren't designing our hardware any differently from: a large pool of RAM > fat pipe > a few worker bees (some thousands in the case of GPU compute) removing and replacing data in RAM.
The reason for that is cost. Yes, it would be much better (in many cases) if the GPU had separate memory pools, some with faster memory and some with slower. This would get rid of the memory-contention problem, and e.g. "sleeping" data that is only accessed every now and then wouldn't hurt the fast memory. But it is much easier to just have one pool of memory. This reduces design/fabrication costs and development costs.
With RTRT you can throw an almost "infinite" amount of really fast memory (depending on view distance, resolution, ...) and lots of processing power (a whole lot more than currently available) at the problem. You don't have that kind of problem with "normal" RT (without the need to finish it in 16-33 ms). Getting it into that tiny time frame is still a problem. Even if you have many, many processors with their own memory, in the end you have so many locks, because one processor needs the result of another's calculation, that it gets really hard to squash all that into that tiny time window. Light speed can be a real problem for real-time processes :)
 
three-dimensional RAM topology where the position in RAM, its address, corresponds to a valuable selection criterion?
Turning this into a naive idea like having a voxel cube of RAM, where voxels map spatially to the scene data of that area, we would get the problem that some voxels don't have enough memory for dense detail while others are underutilized.
We would end up at the same solution as now, requiring some indirection to map 3D coordinates to multiple voxels at random positions in RAM.
So there seems to be no software-side benefit to having multidimensional RAM - we would still treat it as a 1D sequence.
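That indirection is exactly the sparse mapping we already write in software anyway, something like this (toy names, the software status quo rather than new hardware):

Code:
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Brick { std::vector<float> payload; };   // dense detail goes here, any size

struct SparseVoxelStore {
    // 'Virtual' 3D address space; actual storage is 1D, variably sized per
    // cell, and found through a hash map -- i.e. an indirection.
    std::unordered_map<uint64_t, Brick> bricks;

    static uint64_t Key(int x, int y, int z) {
        auto u = [](int c) { return uint64_t(uint32_t(c) & 0x1FFFFF); };
        return (u(x) << 42) | (u(y) << 21) | u(z);
    }

    Brick& At(int x, int y, int z) { return bricks[Key(x, y, z)]; }

    const Brick* Find(int x, int y, int z) const {
        auto it = bricks.find(Key(x, y, z));
        return it == bricks.end() ? nullptr : &it->second;
    }
};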
The same goes for a many-SoC architecture. The sync and bus traffic would probably outweigh the benefit if all the processors work on the same problem. Generally I don't think that different HW can solve SW problems, but who knows.

Recently I had an aha moment by reversing the ray tracing problem in regard to reflections. Instead of thinking about rays from the surface, I was thinking about processing the scene just once and projecting it onto the surfaces. So, more like how environment maps and rasterization work.
The problem is: the same patch of scene may be visible on multiple surfaces, so we need to search for those surfaces. But after that, we could just 'rasterize' our scene patch onto those surfaces, which sounds fast even if the projection is not planar.
Potential optimization: Each surface shows only one reflection. (Unlike shadows or GI, where we need multiple rays to support multiple lights or parts of the scene.)
Interesting: We may not need an acceleration structure for the scene, as we process the whole of it just once, and like with rasterization we can do so in random order.

I kept thinking and ended up at an algorithm similar to packet traversal, like the paper you posted. Pretty boring, but I feel like I'm still missing something...
 