|
|
#26 |
|
Regular
|
Would be nice ... and almost free as far as transistors are concerned.
|
|
|
|
|
|
#27 | |||
|
Junior Member
Join Date: Jun 2007
Posts: 91
|
I barely remember the details, because it was back in 2006/7, but I remember that with a single buffer for the DMA transfer, the SPE was idle most of the time. With multiple buffers, overlapping the computation on one buffer with the data transfer into another, the results were better. This is a must for any Cell app, IMHO. I also remember problems trying to multi-thread my code because data was shared across the threads.
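A rough timing model of the single-buffer versus double-buffer difference described above. This is only a sketch: the function names and the per-chunk costs `t_dma` and `t_compute` are made up for illustration, and it assumes every chunk costs the same and that one DMA transfer can fully overlap one compute step.

```python
def single_buffer_time(n_chunks, t_dma, t_compute):
    """One buffer: each chunk is fetched, then processed; nothing overlaps."""
    return n_chunks * (t_dma + t_compute)


def double_buffer_time(n_chunks, t_dma, t_compute):
    """Two buffers: fetch chunk i+1 while computing on chunk i."""
    if n_chunks == 0:
        return 0
    # The first fetch cannot be hidden and the last compute has no fetch
    # left to overlap with; in between, each step costs the slower of the two.
    return t_dma + (n_chunks - 1) * max(t_dma, t_compute) + t_compute


def idle_fraction(total_time, n_chunks, t_compute):
    """Share of the total time the compute unit spends waiting on data."""
    return 1.0 - (n_chunks * t_compute) / total_time
```

With a transfer three times as long as the compute step, the single-buffer version leaves the compute unit idle 75% of the time. Double buffering helps most when the two costs are close, and once compute dominates the transfers are almost entirely hidden.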
Quote:
Quote:
Quote:
|
|||
|
|
|
|
|
#28 | |||
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
|
Quote:
Also gather done via texture unit doesn't sound like a good idea if you are going to re-use your data multiple times. Quote:
Unfortunately scratchpad memory based programming models don't scale so nicely. Quote:
__________________
[twitter] More samples, we need more samples! [Dean Calver] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|||
|
|
|
|
|
#29 | |
|
Member
Join Date: Nov 2002
Location: In transit
Posts: 604
|
Quote:
Nvidia has a nice CUDA sample for doing DCT on the GPU with 2 different kernels that are enjoyably fast in themselves and can be tweaked further, so I don't see why you couldn't offload it to the GPU as well. Granted, I've only played around with doing MJPEG compression on the Cell and GPU, so it could very well be that I'm missing some of the picture for x264 encoding...
__________________
"Artificial Intelligence can never replace Human Stupidity" |
|
|
|
|
|
|
#30 |
|
Senior Member
|
Blockwise DCT should parallelize very nicely. CUDA also helps, as it exposes the dedicated video decode hardware, at least on Windows, so only the final lossless compression stage needs to be done on the CPU. ME (motion estimation) and DCT can both run on the GPU.
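To illustrate why blockwise DCT parallelizes so easily, here is a minimal pure-Python sketch. It is not the CUDA sample mentioned earlier in the thread; it uses the direct, unoptimized orthonormal 8x8 DCT-II, and the helper names are invented for this example. The point is only that each block is transformed without touching any other block.

```python
import math

N = 8  # block edge length, as in JPEG-style 8x8 blocks


def dct_8x8(block):
    """Orthonormal 2D DCT-II of one 8x8 block (direct O(N^4) form)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * acc
    return out


def split_blocks(image):
    """Cut an image (list of rows, dimensions divisible by 8) into 8x8 blocks."""
    h, w = len(image), len(image[0])
    return [[row[bx:bx + N] for row in image[by:by + N]]
            for by in range(0, h, N) for bx in range(0, w, N)]


def blockwise_dct(image):
    # No block reads another block's pixels, so this map could be spread
    # over threads, SPEs or GPU thread blocks without any synchronization.
    return list(map(dct_8x8, split_blocks(image)))
```

A flat 16x16 image of value 10 yields four blocks, each with a DC coefficient of 80 and all other coefficients at (numerically) zero.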
|
|
|
|
|
|
#31 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#32 | |||
|
Regular
|
Quote:
Is there an NVIDIA presentation on the details of H.264 encoding? Has anyone else done one for CUDA encoding? Quote:
Quote:
Jawed
__________________
Can it play WoW? |
|||
|
|
|
|
|
#33 | |
|
Member
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
|
Quote:
__________________
Timothy Farrar :: blog |
|
|
|
|
|
|
#34 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Global memory accesses don't seem to be banked. Coalescing is contingent on all required addresses being present in the same contiguous memory segment. However, the segment size can be 32, 64 or 128 bytes. So global memory access works similarly to the LRB cache, where there is one read per segment (cache line).
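A toy model of the "one read per segment" idea, as a sketch only. It just counts how many aligned segments a half-warp's addresses fall into; the real hardware rules for choosing the segment size and splitting requests are more involved, and the function names, the 16-thread half-warp and the 4-byte word size are assumptions made for this example.

```python
def half_warp_addresses(base, stride_words, word_bytes=4, threads=16):
    """Byte address read by each thread for a simple strided access."""
    return [base + t * stride_words * word_bytes for t in range(threads)]


def segments_touched(addresses, segment_bytes):
    """Number of distinct aligned segments the addresses fall into.

    In this simplified model each touched segment costs one memory read.
    """
    return len({addr // segment_bytes for addr in addresses})
```

Unit-stride, aligned 4-byte reads from 16 threads fit in a single 64-byte segment; shifting the base by one word or doubling the stride makes the same access span two segments, and a stride of 16 words puts every thread in its own segment.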
__________________
What the deuce!? |
|
|
|
|
|
#35 | |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
|
Quote:
Also this cache vs bank memory stuff doesn't make any sense, cache memories *are* banked.
__________________
[twitter] More samples, we need more samples! [Dean Calver] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
|
#36 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,348
|
Yes, global memory access basically just follows the memory controller's access pattern. However, in the case of GT200, it seems that there's some sort of reorder buffer between the memory controller and the ALUs, so the coalescing rules are much more relaxed than on G80.
|
|
|
|
|
|
#37 | |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
|
Quote:
__________________
[twitter] More samples, we need more samples! [Dean Calver] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
|
#38 |
|
Regular
|
An example of the effect of the improved memory controller in GT200:
http://dl.getdropbox.com/u/484203/Le...lutionSoup.pdf Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#39 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#40 |
|
Tea maker
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,382
|
It's damned difficult to do fast on anything.
FWIW, encoding is probably easier than decoding, which is ironic, as the latter is surely going to be more common.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson "I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay |
|
|
|
|
|
#41 |
|
Senior Member
|
For decoding we already have dedicated hardware in GPUs today, so transcoders will most likely take advantage of it.
|
|
|
|
|
|
#42 |
|
Member
Join Date: Mar 2008
Posts: 154
|
It really simplifies the code in many cases, which can give you a nice speed-up. For my biggest algorithm, I got some 30% just by replacing the ungodly-complex-GF8-coalescing-code with a very straightforward, partially coalesced version.
Staying on topic, I'd say having a little bit of smarts in the way you process memory accesses is often quite useful, especially if you are not on a fixed platform. Cell suffers from not having this, but makes up for it with a 6-cycle latency. It would be interesting to know how much the GT200's reorder buffer costs in terms of latency. |
|
|
|
|
|
#43 |
|
Senior Member
|
I suspect it is about 50-100 cycles. The initial CUDA docs said to expect a latency of 400-600 clock cycles, but when Volkov published his benchmarks, he found more like 500-700 clocks on GT200. This is a rather crude estimate, I admit.
|
|
|
|
|
|
#44 |
|
Member
Join Date: Mar 2008
Posts: 154
|
That sounds waaaaay too high for me. Like an order of magnitude too high.
OK, the reorder runs at base clock, not shader clock, which I assume is what you meant, so that brings it down a good bit. Still, I may be totally off here, but even 20 cycles is a long time. |
|
|
|
|
|
#45 |
|
Senior Member
|
The numbers there are definitely in terms of shader clocks. Considering the logic involved, though, 100 cycles is indeed high. It may well be that NVIDIA wanted to minimize the area used (as it is CUDA-specific) and so ran a smaller coalescer over multiple passes rather than grow an already bloated die. Like I said, it is a crude estimate.
|
|
|
|
|
|
#46 |
|
Member
Join Date: Aug 2004
Posts: 244
|
Only ARM and Core will prevail. Other architectures (NVIDIA, AMD, Larrabee, Itanium, Cell) will be assimilated or made obsolete.
The resulting polarity on the industrial scene will be the seed of an epic conflict within human civilisation, destined to last for eons. Eventually spreading over the entire galaxy, the conflict will outlast the biology-based life form of humanity. Our consciousnesses, now encased in machine shells, will be occupied with the ultimate goal: complete annihilation of the opponent before the Heat Death of the Universe happens... |
|
|
|
|
|
#47 |
|
Member
Join Date: Mar 2008
Posts: 154
|
|
|
|
|
|
|
#48 | |
|
Member
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
|
Quote:
The 32-byte segment turns out to be only half the size of an LRB cache line. Marco's point is indeed a good one for global memory accesses.
__________________
Timothy Farrar :: blog Last edited by TimothyFarrar; 20-Apr-2009 at 16:23. |
|
|
|
|
|
|
#49 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Sure, but with the shared memory stuff a single read can pull data from all banks simultaneously at arbitrary offsets within each bank. Aren't single-ported cache reads rigidly limited to pulling a single contiguous cache line at once?
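The shared-memory behaviour described above can be sketched with a small bank-conflict model. The assumptions are 16 banks of 32-bit words serving a 16-thread half-warp, and that threads reading the same word do not conflict; the last one is a simplification of the hardware's broadcast behaviour. The function name is invented for this example.

```python
from collections import defaultdict


def bank_conflict_degree(word_indices, n_banks=16):
    """Worst-case number of distinct words requested from any one bank.

    1 means the whole request can be served in a single pass; k means the
    slowest bank needs k passes in this simplified model.
    """
    per_bank = defaultdict(set)
    for w in word_indices:
        per_bank[w % n_banks].add(w)
    return max((len(words) for words in per_bank.values()), default=0)
```

As long as each thread lands in a different bank, the offsets within the banks can be arbitrary and the degree stays at 1. A stride of 2 words gives a 2-way conflict and a stride of 16 words serializes all 16 threads on one bank.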
__________________
What the deuce!? |
|
|
|
|
|
#50 | |
|
Regular
|
Quote:
Jawed
__________________
Can it play WoW? |
|
|
|
|
| Tags |
| intel |
|
|