Old 19-Apr-2013, 08:25   #76
keldor314
Junior Member
 
Join Date: Feb 2010
Posts: 92

Quote:
Originally Posted by rpg.314 View Post
But that's a very inefficient way to transfer a multilevel data structure like a BVH, which is where you started out. For something like a BVH, you really need a cache coherence protocol. Abusing page fault handlers just won't do.
What I'm describing is a cache coherency protocol, just one operating at memory page granularity rather than cache line granularity. Tracking at the cache line level would be far too slow: cache lines are so small that per-transfer overhead would dominate the data transfer. In addition, the directory would be huge.
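
To put rough numbers on the directory-size point, here is a minimal sketch. The 64-byte cache line, 4 KiB page and 1 GiB of shared memory are assumed values picked for illustration, not figures from any particular hardware.

```python
# Count how many entries a coherence directory needs when it tracks
# memory at a given granularity. All sizes below are assumptions.

def directory_entries(shared_bytes: int, granule_bytes: int) -> int:
    """Number of tracked granules, rounding up for a partial last granule."""
    return -(-shared_bytes // granule_bytes)  # ceiling division

SHARED = 1 << 30   # 1 GiB of shared memory (assumption)
CACHE_LINE = 64    # bytes per cache line (assumption)
PAGE = 4096        # bytes per page (assumption)

line_entries = directory_entries(SHARED, CACHE_LINE)
page_entries = directory_entries(SHARED, PAGE)

print(line_entries)                  # 16777216
print(page_entries)                  # 262144
print(line_entries // page_entries)  # 64
```

With these assumed sizes, tracking per page needs 64 times fewer entries than tracking per cache line.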

The reason to do it at the page level in particular is that you already have dedicated hardware for it. The open question is read-only sharing, which needs a fault on the first write to a page. CPU page tables have a write-protect bit that does exactly that (it is how copy-on-write works), so this hinges on whether the GPU's MMU can do the same. If it can't, the handler could emulate read-only sharing by simply not requesting an invalidation whenever a page belonging to a read-sharable memory space is brought in, though the programmer would then have to respect the no-writes rule themselves.
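
For what it's worth, here is a toy Python model of the bookkeeping such a fault handler would do. It is only a sketch: `PageDirectory`, `access` and the `read_shared` flag are names invented for illustration, a real page fault is replaced by a method call, and writes to read-shared pages are not policed at all (that is the part left to the programmer).

```python
# Toy model of page-granular coherence between two devices. A "fault" is
# any access to a page the device does not currently hold.

class PageDirectory:
    def __init__(self, devices=("cpu", "gpu")):
        self.devices = tuple(devices)
        self.holders = {}        # page number -> set of devices with a valid copy
        self.transfers = 0       # pages copied between devices
        self.invalidations = 0   # copies dropped on other devices

    def access(self, device, page, read_shared=False):
        """Model one access; returns True if it faulted."""
        holders = self.holders.setdefault(page, set())
        if device in holders:
            return False             # hit: no fault, no traffic
        if holders:
            self.transfers += 1      # fetch the page from a current holder
            if not read_shared:
                # Exclusive page: every other copy is invalidated.
                self.invalidations += len(holders)
                holders.clear()
        holders.add(device)          # first touch or completed transfer
        return True

# Exclusive page bouncing between devices: every switch costs a transfer.
d = PageDirectory()
for dev in ("cpu", "gpu", "cpu"):
    d.access(dev, 7)
print(d.transfers, d.invalidations)  # 2 2

# Same access pattern on a read-shared page: one transfer, then hits.
s = PageDirectory()
for dev in ("cpu", "gpu", "cpu"):
    s.access(dev, 7, read_shared=True)
print(s.transfers, s.invalidations)  # 1 0
```

The first loop is the ping-pong case that makes false sharing so costly at page granularity.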

How efficient it is to transfer a multilevel structure would depend on how well malloc can cluster objects together. If a memory page holds a large number of the structure's other elements, transferring the entire page is much more efficient than transferring each element separately and updating pointers (a BVH is, admittedly, an extreme example, but there are many cases where you have small multilevel structures you want to use on both devices). However, false sharing would be *bad*.

What you'd need, then, is some concept of memory spaces for malloc, so that it doesn't inadvertently allocate some CPU scratch variable in the middle of a memory page containing GPU data. A memory space would simply be a set of memory allocations guaranteed not to share pages with allocations from any other memory space. It'd also be important to ensure that any page from a memory space that can be read by multiple devices is small.
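
A minimal sketch of that memory-space idea, as a simulated bump allocator in Python. Addresses are plain integers, the 4 KiB page size is an assumption, and `SpaceAllocator` is a hypothetical name rather than any real malloc interface.

```python
# Bump allocator that hands out addresses from page-aligned chunks, with
# each chunk owned by exactly one memory space, so two spaces never
# share a page. Leftover bytes in a full chunk are simply wasted.

PAGE = 4096  # assumed page size in bytes

class SpaceAllocator:
    def __init__(self, page_size=PAGE):
        self.page_size = page_size
        self.next_page = 0     # next unused page number
        self.cursor = {}       # space -> (next free address, end of its chunk)
        self.page_owner = {}   # page number -> owning space

    def malloc(self, space, size):
        """Return an address for `size` bytes inside pages owned by `space`."""
        if size <= 0:
            raise ValueError("size must be positive")
        addr, end = self.cursor.get(space, (0, 0))
        if addr + size > end:
            # Current chunk is full (or absent): start a fresh run of whole pages.
            pages = -(-size // self.page_size)  # ceiling division
            addr = self.next_page * self.page_size
            end = addr + pages * self.page_size
            for p in range(self.next_page, self.next_page + pages):
                self.page_owner[p] = space
            self.next_page += pages
        self.cursor[space] = (addr + size, end)
        return addr

# Interleaved requests from two spaces still land on disjoint pages.
a = SpaceAllocator()
g1 = a.malloc("gpu", 64)
c1 = a.malloc("cpu", 16)   # CPU scratch variable
g2 = a.malloc("gpu", 64)
print(g1 // PAGE, g2 // PAGE, c1 // PAGE)  # 0 0 1
```

The CPU scratch allocation ends up on its own page even though it was requested between two GPU allocations, which is exactly the aliasing guarantee described above.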

The end result would be a system that, while not an ideal coherency protocol, could still retain many of the usability benefits and would not make anything slower in the traditional usage pattern. The important part is that this could be implemented purely in software, so only the OS and the drivers would have to be updated.

Last edited by keldor314; 19-Apr-2013 at 08:43.
Old 19-Apr-2013, 18:26   #77
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,885

Quote:
Originally Posted by keldor314 View Post
Trying to keep track of it at a cache line level would be far too slow: cache lines are so small that overhead would dominate the data transfer. In addition, the directory would be huge.
Maybe for a discrete GPU, but I don't see a problem on integrated, where it's using the same memory hierarchy anyway... both AMD and Intel are going in this direction.

How much effort to throw at discrete GPUs/memory spaces depends on how much you believe they are going to matter in the future. With all of the consoles going unified, and arguably everything at laptop level and below as well, only the very high-end desktop stuff will be left as discrete. One could argue that those systems could take a more brute-force path and still be acceptable. I have a hard time accepting that APIs should be designed around their constraints going forward, even though I love my massive discrete GPUs.
__________________
The content of this message is my personal opinion only.
Old 20-Apr-2013, 03:15   #78
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by keldor314 View Post
What I'm describing is a cache coherency protocol, just one operating at memory page granularity rather than cache line granularity. Tracking at the cache line level would be far too slow: cache lines are so small that per-transfer overhead would dominate the data transfer. In addition, the directory would be huge.
Too high latency. Every page fault kicks to the kernel. This is not a solution to any real problem. Much better to batch up the stuff and do a single copy call.
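
A back-of-the-envelope way to see the gap, as a hedged sketch: the two functions below are a deliberately simple cost model, and every number plugged into them (10 µs per fault or copy call, 4 KiB pages, 8 GB/s link, 1000 pages) is a made-up placeholder, not a measurement.

```python
# Simple cost model: fault-driven paging pays a fixed overhead per page,
# while a batched copy pays it once for the whole transfer.

def fault_driven_time(pages, fault_overhead_s, page_bytes, bandwidth_bps):
    """Every page pays a kernel round trip plus its own transfer time."""
    return pages * (fault_overhead_s + page_bytes / bandwidth_bps)

def batched_copy_time(pages, call_overhead_s, page_bytes, bandwidth_bps):
    """One call overhead, then all pages streamed back to back."""
    return call_overhead_s + pages * page_bytes / bandwidth_bps

# Placeholder parameters, chosen only to illustrate the shape of the result.
PAGES = 1000
OVERHEAD_S = 10e-6     # assumed cost of one fault or one copy call
PAGE_BYTES = 4096      # assumed page size
BANDWIDTH = 8e9        # assumed link bandwidth in bytes per second

faulting = fault_driven_time(PAGES, OVERHEAD_S, PAGE_BYTES, BANDWIDTH)
batched = batched_copy_time(PAGES, OVERHEAD_S, PAGE_BYTES, BANDWIDTH)
print(f"fault-driven: {faulting * 1e3:.3f} ms")
print(f"batched copy: {batched * 1e3:.3f} ms")
```

In this model the two approaches cost the same for a single page; the difference grows with the page count because the fixed overhead is paid per page on one side and once on the other.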
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 21-Apr-2013, 09:49   #79
keldor314
Junior Member
 
Join Date: Feb 2010
Posts: 92

Quote:
Originally Posted by rpg.314 View Post
Too high latency. Every page fault kicks to the kernel. This is not a solution to any real problem. Much better to batch up the stuff and do a single copy call.
Well, of course. An explicit copy will always be faster. That said, batching up and copying anything but trivial structures is quite ugly. Also, index-based indirection doesn't really cut it, since it forces you to use an extra register to access anything (base address + offset), plus extra address math. There really needs to be some way of dealing with pointers without jumping through hoops.
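
To make the "ugly" part concrete, here is a rough Python sketch of what batching up a pointer-based tree involves: walk it, replace every pointer with an index, and pack the result into one contiguous buffer that a single copy call could move. The `Node` layout, the `-1` sentinel and the 12-byte record format are assumptions made up for this example.

```python
import struct
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

NO_CHILD = -1
RECORD = struct.Struct("<fii")  # value, left index, right index (12 bytes)

def flatten(root: Node) -> bytes:
    """Pack the tree into one buffer, with pointers replaced by indices."""
    order, index_of = [], {}
    stack = [root]
    while stack:                      # preorder walk assigns the indices
        node = stack.pop()
        index_of[id(node)] = len(order)
        order.append(node)
        for child in (node.right, node.left):
            if child is not None:
                stack.append(child)

    def idx(child):
        return NO_CHILD if child is None else index_of[id(child)]

    return b"".join(RECORD.pack(n.value, idx(n.left), idx(n.right)) for n in order)

def value_at(buf: bytes, index: int) -> float:
    """Index-based access: base address + index * record size."""
    return RECORD.unpack_from(buf, index * RECORD.size)[0]

tree = Node(1.0, Node(2.0), Node(3.0, Node(4.0)))
buf = flatten(tree)
print(len(buf))                    # 48
print(RECORD.unpack_from(buf, 0))  # (1.0, 1, 2)
print(value_at(buf, 3))            # 4.0
```

Every access on the receiving side goes through `value_at`-style arithmetic instead of a plain pointer dereference, which is the extra register and address math mentioned above.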

HSA is interesting, but until the CPU socket catches up and closes the order-of-magnitude memory bandwidth gap with the GPU socket, it's useless for high-end stuff.
Old 23-Apr-2013, 03:30   #80
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079

Quote:
Originally Posted by keldor314 View Post
Well, of course. An explicit copy will always be faster. That said, batching up and copying anything but trivial structures is quite ugly. Also, index-based indirection doesn't really cut it, since it forces you to use an extra register to access anything (base address + offset), plus extra address math. There really needs to be some way of dealing with pointers without jumping through hoops.

HSA is interesting, but until the CPU socket catches up and closes the order-of-magnitude memory bandwidth gap with the GPU socket, it's useless for high-end stuff.
It's ugly vs slower. Ugly will win because performance matters.