Old 18-Apr-2009, 18:45   #26
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,235

Would be nice ... and almost free as far as transistors are concerned.
Old 18-Apr-2009, 19:11   #27
Rayne
Junior Member
 
Join Date: Jun 2007
Posts: 91

Quote:
Originally Posted by rpg.314 View Post
Can we have some more details please for that?
I barely remember anything, because it was 2006/7, but I remember that with a single buffer for the DMA transfer, the SPE was idle most of the time. With multiple buffers, overlapping the computation on one buffer with the data transfer into another, the results were better. This is a must for any Cell app imho. I also remember problems trying to multi-thread my code because there was shared data across the threads.
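
A minimal sketch of the buffer rotation described above, in portable C++ rather than SPE code. The names `start_fetch`, `wait_fetch`, `process` and `double_buffered_sum` are hypothetical, and the "transfer" is a plain synchronous copy standing in for an asynchronous DMA, so this only shows the order of operations, not a real speed-up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for issuing a DMA "get" of one chunk. On real hardware
// this call would return at once and the copy would run in the background;
// here it copies synchronously so the sketch stays portable.
void start_fetch(const std::vector<float>& src, std::size_t chunk,
                 std::size_t chunk_size, std::vector<float>& dst) {
    const std::size_t begin = chunk * chunk_size;
    const std::size_t end = std::min(begin + chunk_size, src.size());
    dst.assign(src.begin() + static_cast<std::ptrdiff_t>(begin),
               src.begin() + static_cast<std::ptrdiff_t>(end));
}

// Hypothetical stand-in for blocking until the transfer into `dst` is done.
void wait_fetch(const std::vector<float>& dst) { (void)dst; }

// Hypothetical compute step: sums the elements of one buffer.
double process(const std::vector<float>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0.0);
}

// Processes `src` chunk by chunk while alternating between two buffers.
double double_buffered_sum(const std::vector<float>& src, std::size_t chunk_size) {
    assert(chunk_size > 0);
    const std::size_t chunks = (src.size() + chunk_size - 1) / chunk_size;
    std::vector<float> buffers[2];
    if (chunks > 0) start_fetch(src, 0, chunk_size, buffers[0]);  // prime buffer 0
    double total = 0.0;
    for (std::size_t i = 0; i < chunks; ++i) {
        const std::size_t cur = i % 2;
        const std::size_t nxt = 1 - cur;
        wait_fetch(buffers[cur]);  // data for chunk i must have arrived
        if (i + 1 < chunks) {
            start_fetch(src, i + 1, chunk_size, buffers[nxt]);  // request chunk i+1
        }
        total += process(buffers[cur]);  // compute while the next chunk is in flight
    }
    return total;
}
```

With a truly asynchronous `start_fetch`, the `process` call on the current buffer would run while the next chunk is still arriving, which is where the idle time goes away.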

Quote:
Originally Posted by rpg.314 View Post
It's really a shame that SSEx isn't orthogonal. Why does Intel have to design every ISA (save LRBni) like they are brain dead or something? For your problem, maybe loading the four values into an aligned float[4] array and then doing a single load would be faster.
Yeah, I did something like that, but with the shuffle instructions and several registers.

Quote:
Originally Posted by rpg.314 View Post
That would be fun. They do have an optimized 4x4 transpose macro available if you use the intrinsics.
I remember using the _MM_TRANSPOSE4_PS macro, but it needed the 8 XMM registers, and sometimes you are using some registers to store some 'previously shuffled' data.
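
A small sketch of the two SSE ideas from this exchange, assuming an x86 target with SSE available. `_mm_load_ps`, `_mm_loadu_ps`, `_mm_storeu_ps` and `_MM_TRANSPOSE4_PS` come from `<xmmintrin.h>`; the function names `load_four` and `transpose4x4` are made up for illustration.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics and the _MM_TRANSPOSE4_PS macro

// Packs four scalars from unrelated locations into one XMM register by
// staging them in a 16-byte aligned array and issuing a single aligned load.
__m128 load_four(const float* a, const float* b, const float* c, const float* d) {
    assert(a && b && c && d);
    alignas(16) float staged[4] = {*a, *b, *c, *d};
    return _mm_load_ps(staged);
}

// Transposes a row-major 4x4 float matrix in place. Unaligned loads and
// stores are used, so `m` does not need to be 16-byte aligned.
void transpose4x4(float* m) {
    assert(m != nullptr);
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rewrites r0..r3 with the transposed rows
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

The macro works on the four row registers plus temporaries of its own, which fits the register pressure mentioned above when only 8 XMM registers are available.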

Quote:
Originally Posted by rpg.314 View Post
That would make a lot of the effort put into exploiting the memory hierarchy useless.
Don't shoot the messenger
Old 19-Apr-2009, 04:35   #28
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297

Quote:
Originally Posted by TimothyFarrar View Post
How about banked vs cached memory in terms of vector scatter/gather. Clearly read only vector gather can be covered by the texture units, so I'm talking about R/W vector scatter/gather only.
Cache memories are banked as well, so I am not sure the premise is entirely correct, though I still get what you are pointing at.
Also, a gather done via the texture unit doesn't sound like a good idea if you are going to re-use your data multiple times.

Quote:
A LRB hyperthread doing scatter/gather can only access one L1 line per clock, so if you actually do a scatter/gather which doesn't simply load from the same vector sized cache line, then performance suffers proportional to the number of L1 lines hit. In my eyes, this vastly marginalizes the usefulness of scatter/gather.
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Unfortunately scratchpad memory based programming models don't scale so nicely.

Quote:
One can compare and contrast to current NVidia hardware where scatter/gather to random memory locations runs at full speed as long as each lane of the vector accesses a separate bank of memory. This opens up a lot more flexibility with scatter/gather compared to the LRB approach in terms of keeping up ALU throughput.
It's certainly fast, but I wouldn't call gather/scatter from a minuscule memory that you have to manage on your own 'flexible'.
__________________
[twitter]
More samples, we need more samples! [Dean Calver]
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way
Old 19-Apr-2009, 11:01   #29
MrGaribaldi
Member
 
Join Date: Nov 2002
Location: In transit
Posts: 604

Quote:
Originally Posted by Jawed View Post
Quote:
Originally Posted by rpg.314 View Post
What about DCT, eh?
It's in the AMD presentation. Firmly stuck in CPU land.
But why is that? They don't give any reasons in the slides, although it looks like the ME is taking "too long".

Nvidia has a nice CUDA sample for doing DCT on the GPU with 2 different kernels that are enjoyably fast of themselves, and can be further tweaked, so I don't see why you couldn't offload it to the GPU as well.

Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing some of the picture for doing .x264 encoding...
__________________
"Artificial Intelligence can never replace Human Stupidity"
Old 19-Apr-2009, 11:07   #30
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Blockwise DCT should be very nicely parallelizable. Also, CUDA helps as it exposes the dedicated video decode hardware, at least on Windows, so only the last lossless compression needs to be done on the CPU. ME and DCT can both be on the GPU.
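
To illustrate why a blockwise DCT parallelizes so easily, here is a minimal CPU sketch in C++. It uses a textbook floating-point 8x8 DCT-II as in JPEG/MJPEG, not the integer transform an H.264 encoder would use, and the names `dct_block` and `dct_image` are hypothetical. Each block reads and writes only its own 64 values, so the outer loops could be handed out one block per thread group.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kBlock = 8;  // block edge of the JPEG-style transform

// Naive 2-D DCT-II of the 8x8 block at block coordinates (bx, by).
// Reads only that block of `image` and writes only that block of `coeffs`.
void dct_block(const std::vector<float>& image, int width, int bx, int by,
               std::vector<float>& coeffs) {
    const double pi = std::acos(-1.0);
    const int x0 = bx * kBlock;
    const int y0 = by * kBlock;
    for (int v = 0; v < kBlock; ++v) {
        for (int u = 0; u < kBlock; ++u) {
            double sum = 0.0;
            for (int y = 0; y < kBlock; ++y) {
                for (int x = 0; x < kBlock; ++x) {
                    const double pixel =
                        image[static_cast<std::size_t>((y0 + y) * width + x0 + x)];
                    sum += pixel * std::cos((2 * x + 1) * u * pi / 16.0)
                                 * std::cos((2 * y + 1) * v * pi / 16.0);
                }
            }
            const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
            const double cv = (v == 0) ? std::sqrt(0.5) : 1.0;
            coeffs[static_cast<std::size_t>((y0 + v) * width + x0 + u)] =
                static_cast<float>(0.25 * cu * cv * sum);
        }
    }
}

// Transforms every 8x8 block of a row-major image. No iteration of the
// block loops depends on any other, so they can run in any order.
std::vector<float> dct_image(const std::vector<float>& image, int width, int height) {
    assert(width > 0 && height > 0);
    assert(width % kBlock == 0 && height % kBlock == 0);
    assert(image.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<float> coeffs(image.size(), 0.0f);
    for (int by = 0; by < height / kBlock; ++by) {
        for (int bx = 0; bx < width / kBlock; ++bx) {
            dct_block(image, width, bx, by, coeffs);
        }
    }
    return coeffs;
}
```

A production kernel would use a separable or fast DCT instead of this O(N^4) per-block loop; the point here is only the independence between blocks.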
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 19-Apr-2009, 11:10   #31
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Unfortunately scratchpad memory based programming models don't scale so nicely.
I'd like a detailed explanation of this, please. It seems to me that you are implying that hardware-managed caches (with s/w hints like on LRB if need be) scale better than purely software-managed caches like on GPUs.
Old 19-Apr-2009, 11:21   #32
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,864

Quote:
Originally Posted by MrGaribaldi View Post
But why is that? They don't give any reasons in the slides, although it looks like the ME is taking "too long".
Yeah, shame there's nothing more detailed.

Is there an NVidia presentation on the details of h.264 encoding? Anyone else done one for CUDA encoding?

Quote:
Nvidia has a nice CUDA sample for doing DCT on the GPU with 2 different kernels that are enjoyably fast of themselves, and can be further tweaked, so I don't see why you couldn't offload it to the GPU as well.
There's a CAL sample for DCT too.

Quote:
Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing some of the picture for doing .x264 encoding...
I don't understand the h.264 encoding pipeline at all well, so I don't know how realistic ME and DCT both on the GPU is.

Jawed
__________________
Can it play WoW?
Old 19-Apr-2009, 21:05   #33
TimothyFarrar
Member
 
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427

Quote:
This is a scalable approach, which can get faster in the future. Doesn't nvidia operate in a very similar way for (uncached) global memory accesses?
Not to my knowledge, global accesses (i.e. uncached) are also bank addressed (just like shared memory) with CUDA. I'd argue that in bandwidth-limited cases, having the extra 16x addressing capacity (16 bank addresses vs. 1 cache line) per global access can be quite a performance benefit (well, assuming the programmer can make use of it).
__________________
Timothy Farrar :: blog
Old 19-Apr-2009, 21:30   #34
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809

Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However, the segment size can be either 32, 64 or 128 bytes. So global memory access works similarly to the LRB cache, where there is one read per segment (cache line).
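
The "one read per segment" cost model can be sketched in C++ as a count of distinct aligned segments touched by one vector access. This is a simplified model, not the hardware's actual rule set: it assumes each lane's access fits inside one segment, and the names `segments_touched` and `strided_addresses` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct aligned segments (or cache lines) of `segment_bytes`
// touched by a set of per-lane byte addresses. Under the simplified model,
// this is the number of reads the access costs.
std::size_t segments_touched(const std::vector<std::uint64_t>& addresses,
                             std::uint64_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t address : addresses) {
        segments.insert(address / segment_bytes);
    }
    return segments.size();
}

// Byte addresses for `lanes` lanes, each reading an element of `elem_bytes`,
// with `stride` elements between consecutive lanes.
std::vector<std::uint64_t> strided_addresses(std::uint64_t base, std::size_t lanes,
                                             std::uint64_t elem_bytes,
                                             std::uint64_t stride) {
    std::vector<std::uint64_t> addresses;
    addresses.reserve(lanes);
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        addresses.push_back(base + lane * stride * elem_bytes);
    }
    return addresses;
}
```

For example, 16 lanes reading consecutive 4-byte words from an aligned base touch a single 64-byte segment, while the same lanes with a stride of 16 words touch 16 segments.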
__________________
What the deuce!?
Old 19-Apr-2009, 22:34   #35
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297

Quote:
Originally Posted by TimothyFarrar View Post
Not to my knowledge, global accesses (i.e. uncached) are also bank addressed (just like shared memory) with CUDA. I'd argue that in bandwidth-limited cases, having the extra 16x addressing capacity (16 bank addresses vs. 1 cache line) per global access can be quite a performance benefit (well, assuming the programmer can make use of it).
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
Old 19-Apr-2009, 22:41   #36
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348

Quote:
Originally Posted by nAo View Post
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Old 19-Apr-2009, 22:48   #37
nAo
Nutella Nutellae
 
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297

Quote:
Originally Posted by pcchen View Post
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Yep, that's pretty cool.
Old 20-Apr-2009, 00:25   #38
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,864

An example of the effect of the improved memory controller in GT200:

http://dl.getdropbox.com/u/484203/Le...lutionSoup.pdf

Jawed
Old 20-Apr-2009, 05:19   #39
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However, the segment size can be either 32, 64 or 128 bytes. So global memory access works similarly to the LRB cache, where there is one read per segment (cache line).
If 64 threads issue contiguous reads, each thread reading 16 bytes, then you have fetched 1 KB of data from memory. Timothy is probably referring to the fact that that 1 KB will be fetched by different memory controllers and then merged together in the register file, hence the banking. Elements of this technique are already there in CPUs with multi-channel memory controllers, so I suppose it is an issue of semantics.
Old 20-Apr-2009, 09:32   #40
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,382

Quote:
Originally Posted by pcchen View Post
For example, I am not sure if CABAC can be done fast on a GPU.
It's damned difficult to do fast on anything

FWIW, encoding is probably easier than decoding, which is ironic as the latter is surely going to be more common.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Old 20-Apr-2009, 10:17   #41
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

For decoding we already have dedicated hardware in GPUs today, so transcoders will most likely be taking advantage of it.
Old 20-Apr-2009, 12:36   #42
T.B.
Member
 
Join Date: Mar 2008
Posts: 154

Quote:
Originally Posted by nAo View Post
Yep, that's pretty cool.
It really simplifies the code in many cases, which can give you a nice speed-up. For my biggest algorithm, I got some 30% just by replacing the ungodly-complex-GF8-coalescing-code with a very straightforward partially coalesced one.

Staying on topic, I'd say having a little bit of smarts in the way you process memory accesses is often quite useful, especially if you are not on a fixed platform. Cell suffers from not having this, but makes up for it with 6 cycles latency. It would be interesting to know how much the GT200's reorder-buffer costs in terms of latency.
Old 20-Apr-2009, 13:14   #43
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

I suspect it is about 50-100 cycles. The initial CUDA docs said that a latency of 400-600 clock cycles should be expected. But when Volkov published his benchmarks, he found more like 500-700 clocks on GT200. This is a rather crude estimate, I admit.
Old 20-Apr-2009, 13:39   #44
T.B.
Member
 
Join Date: Mar 2008
Posts: 154

Quote:
Originally Posted by rpg.314 View Post
I suspect it is about 50-100 cycles.
That sounds waaaaay too high for me. Like an order of magnitude too high.
OK, the reorder is at base-clock, not shader clock, which I assume you meant, so that brings it down a good bit.
Still, I may be totally off here, but even 20cy is a long time.
Old 20-Apr-2009, 13:48   #45
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

The numbers there are definitely in terms of shader clocks. Considering the logic involved, though, 100 cycles is indeed high. It may well be the case that NV wanted to minimize the area used (as it is CUDA specific), so they run a smaller coalescer over several passes to save area on an already bloated die. Like I said, it is a crude estimate.
Old 20-Apr-2009, 13:55   #46
CouldntResist
Member
 
Join Date: Aug 2004
Posts: 244

Only Arm and Core will prevail. Other architectures (nvidia, amd, larrabee, itanium, cell) will be assimilated or obsoleted.

The resulting polarity on the industrial scene will be the seed of an epic conflict within human civilisation, destined to last for eons. Eventually spreading over the entire galaxy, the conflict will outlast the biology-based life form of humanity. Our consciousnesses, now encased in machine shells, will be occupied with the ultimate goal: complete annihilation of the opponent, before the Heat Death of the Universe happens...
Old 20-Apr-2009, 13:59   #47
T.B.
Member
 
Join Date: Mar 2008
Posts: 154

Quote:
Originally Posted by CouldntResist View Post
Only Arm and Core will prevail... Eventually spreading over entire galaxy...
Galactic Arm vs. Galactic Core? Hmm....
Old 20-Apr-2009, 16:04   #48
TimothyFarrar
Member
 
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427

Quote:
Originally Posted by pcchen View Post
Yes, global memory access is basically just following the memory controller pattern. However, in the case of GT200, it seems that there's some sort of a reorder buffer between the memory controller and the ALU, so the coalescing rules are much more relaxed than G80.
Oops, that's what I get for a quick post in the local Apple Store this weekend. I was referring to the ability of the GT2xx series to reduce access size. Specifically the ability for the hardware to reduce global access requests from 128 byte segments to 64 byte segments or 32 byte segments. My mistaken 16x addressing factor is really a 4x addressing factor with the reduction from a 128 byte segment to four 32 byte segments.

The 32 byte segment turns out to only be 1/2 the size of a LRB cache line. Marco's point is indeed a good one for global memory accesses.
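
The segment-size reduction described here can be sketched as picking the smallest aligned transaction that still covers every requested byte. This is a simplified C++ model, not the exact hardware procedure, and `min_transaction_bytes` is a hypothetical name. Going from one 128 byte segment to 32 byte segments is the 4x factor (128 / 32 = 4).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest aligned transaction (32, 64 or 128 bytes) that covers every
// element of `elem_bytes` at the given byte addresses. Returns 0 if the
// accesses do not fit inside a single aligned 128-byte segment.
unsigned min_transaction_bytes(const std::vector<std::uint64_t>& addresses,
                               std::uint64_t elem_bytes) {
    assert(!addresses.empty() && elem_bytes > 0);
    const std::uint64_t first = *std::min_element(addresses.begin(), addresses.end());
    const std::uint64_t last =
        *std::max_element(addresses.begin(), addresses.end()) + elem_bytes - 1;
    const unsigned sizes[] = {32u, 64u, 128u};
    for (unsigned size : sizes) {
        // First and last byte fall in the same aligned block of this size.
        if (first / size == last / size) return size;
    }
    return 0;
}
```

Under this model, eight lanes reading consecutive 4-byte words from an aligned base need only a 32 byte transaction, while 32 such lanes need the full 128 bytes.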

Last edited by TimothyFarrar; 20-Apr-2009 at 16:23.
Old 20-Apr-2009, 16:43   #49
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809

Quote:
Originally Posted by nAo View Post
Just checked CUDA docs and global memory accesses are "simply" coalesced, which sounds similar to what LRB does (caching aside..).
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank. Aren't single-ported cache reads rigidly limited to pulling a single contiguous cache line at once?
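
A minimal C++ sketch of the banked-access rule being described, assuming 16 banks of 4-byte words with consecutive words mapped to consecutive banks. Those parameters and the name `bank_conflict_degree` are assumptions for illustration, and the model ignores the broadcast case where several lanes read the same word.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Largest number of lanes that map to the same bank for one vector access.
// A result of 1 means every lane hits a different bank (no serialization);
// a result of k means the access is replayed k times under this model.
std::size_t bank_conflict_degree(const std::vector<std::uint32_t>& byte_addresses,
                                 unsigned banks = 16, unsigned word_bytes = 4) {
    assert(banks > 0 && word_bytes > 0);
    std::vector<std::size_t> hits(banks, 0);
    for (std::uint32_t address : byte_addresses) {
        ++hits[(address / word_bytes) % banks];  // word index selects the bank
    }
    return byte_addresses.empty() ? 0 : *std::max_element(hits.begin(), hits.end());
}
```

Lanes can sit at unrelated offsets inside their banks and still get a degree of 1, as long as no two lanes share a bank. A stride of 2 words gives a degree of 2, and a stride of 16 words puts all 16 lanes in one bank.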
Old 20-Apr-2009, 16:59   #50
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,864

Quote:
Originally Posted by trinibwoy View Post
Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank.
What algorithms need to do this? Why? When are these offsets arbitrary, but not random?

Jawed