NVIDIA GT200 Rumours & Speculation Thread


From the Wikipedia link you provided:

"NUMA attempts to address this problem by providing separate memory for each processor"

Yeah, NUMA stands for non-uniform memory access, which of course would mean a greatly extended ring bus for the MCs; but if the 4870x2 is indeed one memory space for two GPUs, wouldn't that contradict the sentence I quoted above?

I don't mean to "disagree"; I don't know enough about this to disagree with anybody, I'm just trying to understand. :smile:
 
Any moment now one of the mods is going to notice we now have two R7xx threads.

Before that happens though ... the most straightforward way to do better than CrossFire/"SLI" is simply to add a bridge on the ring bus and connect the chips through it; this requires minimal invasiveness in the rest of the design (everything else could be done in software). You would still need the bridge chip though.

Come to think of it, my previous estimates of the needed bandwidth were inflated ... you would only need enough bandwidth for vertices and to copy finalized render buffers (or alternatively to texture from them directly).
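For a sense of scale, here is a back-of-envelope sketch of that copy traffic; the resolution, pixel format and frame rate below are my own illustrative assumptions, not figures from the thread:

```python
# Rough estimate of the inter-GPU bandwidth needed just to copy finished render
# buffers once per frame (illustrative assumptions, not measured data).

def copy_bandwidth_gb_s(width, height, bytes_per_pixel, fps):
    """GB/s required to move one full render target per frame."""
    return width * height * bytes_per_pixel * fps / 1e9

# Assumed cases: a 2560x1600 RGBA8 back buffer and an FP16 HDR buffer, both at 60 fps.
print(copy_bandwidth_gb_s(2560, 1600, 4, 60))   # ~0.98 GB/s
print(copy_bandwidth_gb_s(2560, 1600, 8, 60))   # ~1.97 GB/s
```

Either figure is small next to each chip's local memory bandwidth, which is the point: the bridge only has to carry vertices and finished buffers, not general texture traffic.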
 
From the Wikipedia link you provided:

"NUMA attempts to address this problem by providing separate memory for each processor"

Yeah, NUMA stands for non-uniform memory access, which of course would mean a greatly extended ring bus for the MCs; but if the 4870x2 is indeed one memory space for two GPUs, wouldn't that contradict the sentence I quoted above?

I don't mean to "disagree"; I don't know enough about this to disagree with anybody, I'm just trying to understand. :smile:

Two different views, physical vs. logical.

Physical path: 2+ different paths.
Logical view:
Opteron: single address space, virtualized. Applications need no recompile, but may suffer performance degradation.
Hypothetical RV770x2: possibly a virtual address space. I think I heard R600 has fully virtual addresses. Possibly the application will suffer degradation (<2x performance).

Don't know if CF/SLI uses a message passing paradigm (cluster of CPUs, RPCs) or shared virtual memory (NUMA, Single System Image, threads jumping nodes). I guess this matters for driver development time.
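As a toy illustration of that "single address space, no recompile, possible degradation" point (entirely my own sketch, with made-up page size and latency figures):

```python
# Toy model of a NUMA-style flat address space: addresses are interleaved across
# two nodes, and a "remote" access simply costs more than a "local" one.
# Purely illustrative; the page size and latency numbers are made up.

PAGE = 4096                       # assumed granularity of the interleave
LOCAL_NS, REMOTE_NS = 100, 180    # assumed access latencies

def home_node(addr):
    """Which node's memory backs this address (page-interleaved)."""
    return (addr // PAGE) % 2

def access_cost(addr, running_on_node):
    return LOCAL_NS if home_node(addr) == running_on_node else REMOTE_NS

# The code addressing memory never changes -- only the cost of each access does,
# which is why applications run unmodified but can still lose performance.
print(access_cost(0x0000, running_on_node=0))   # local:  100
print(access_cost(0x1000, running_on_node=0))   # remote: 180
```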
 
From the Wikipedia link you provided:

"NUMA attempts to address this problem by providing separate memory for each processor"

Yeah, NUMA stands for non-uniform memory access, which of course would mean a greatly extended ring bus for the MCs; but if the 4870x2 is indeed one memory space for two GPUs, wouldn't that contradict the sentence I quoted above?

I don't mean to "disagree"; I don't know enough about this to disagree with anybody, I'm just trying to understand. :smile:
The context is "Multi-processor systems make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access memory at a time".

There are three primary kinds of memory usage in a GPU: vertex data, textures and render targets (though D3D10 blurs the boundaries, making them all mutable). Anyway, dedicated parts of each GPU make their own accesses to these kinds of data, independently of other clients. Within a single GPU you have a merry dance of conflicting requests against memory. It's up to the memory system to regiment these requests to make efficient use of memory bandwidth/latency - particularly as the burst size of GPU memory is comparatively large (which is how GPU memory can muster such high bandwidth).

One part of the solution is to utilise multiple memory channels, e.g. four channels that are 64 bits wide. If you provide each channel with a set of queues and fuzzy logic you can construct something that's load-balanced with high throughput for all clients (processors).
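A minimal sketch of that idea (my own illustration; the four channels match the example above, but the burst-granularity interleave is an assumption, not the actual hardware's address hashing):

```python
# Minimal sketch of spreading memory requests across channels (my own illustration;
# the channel count and burst size are assumptions, not the real GPU mapping).

NUM_CHANNELS = 4      # e.g. 4 x 64-bit channels
BURST_BYTES  = 64     # assumed burst granularity

def channel_of(addr):
    """Interleave addresses across channels one burst at a time."""
    return (addr // BURST_BYTES) % NUM_CHANNELS

# Per-channel request queues: each client (texture units, ROPs, vertex fetch, ...)
# just issues addresses; the distributed MCs sort requests into queues and schedule
# them to keep every channel's bandwidth busy.
queues = {c: [] for c in range(NUM_CHANNELS)}
for addr in (0x0000, 0x0040, 0x0080, 0x00C0, 0x0100):
    queues[channel_of(addr)].append(addr)
print(queues)   # consecutive bursts land on different channels
```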

The ring bus is a scalable architecture to connect clients to MCs. With the memory system distributed across the nodes of the ring bus there's no bottleneck in a single unit that determines how clients access memory.

So, in theory, multiple GPUs in NUMA can be made to cooperate in their use of available memory channels, no matter where the memory chips are attached. This cooperation is enforced by the distributed memory system, in theory - using the same techniques as deployed within a single GPU.

You can peruse the patent for the fully distributed ring bus based memory controller here:

http://forum.beyond3d.com/showthread.php?p=1165766#post1165766

Jawed
 
OK, I understand perfectly. Thanks.
 
And Lux_, uhhhh, I have no idea why you think it's even theoretically possible NOT to render each frame from scratch (at least for the main framebuffer).
Correct me if I'm wrong, but currently games are rendered like an MJPEG-encoded movie rather than an MPEG-4-encoded one.
 
Correct me if I'm wrong, but currently games are rendered like an MJPEG-encoded movie rather than an MPEG-4-encoded one.
It's pretty much impossible to reuse 3D data from frame to frame. Lighting may change, effects may change, position will most likely change whether due to shifts in camera position or shifts in object position.

When you're playing your favorite 3D game, chances are something (or everything) is in motion at all times.

-FUDie
 
Is this the same info in German that we saw ripped from the 'editor's day' article?

http://www.hardware-infos.com/news.php?news=2100

Today's date, of course. I found this elsewhere also, and Hal posted it here but no one commented.

My German is not so good [but better than those translators]. Are these new figures, or a rehash? This is only part of it, and there is more as well. Help please!

Zum Einsatz kamen zwei Benchmarks: Zu allererst wird ein zwei minütiger Videoclip in 720p HD in ein ipod-taugliches Format transkodiert. Die konventionelle Methode, also die Kodierung via itunes und einem 20 US-Dollar teuren Mpg-2 Codec, nimmt dabei auf einem Quad-Core mit 3,0 GHz mehrere Stunden in Anspruch.
Unter Zuhilfenahme einer unbekannten GPU der neuen GT200-Generation (vermutlich GTX 280) via CUDA benötigt man für denselben Vorgang streng genommen nur Sekunden. Das Video wird in 5-facher Echtzeitgeschwindigkeit kodiert. Sprich ein Videoclip mit 30 Frames pro Sekunde wird mit 150 Frames pro Sekunde bearbeitet. Daher ist es auch möglich, mehrere Videos gleichzeitig in wenigen Minuten umzuwandeln. Die CPU wird dabei nur wenig belastet. Mehr als ein Core wird für das Processing nicht beansprucht.

Der zweite und mindestens genauso eindrucksvolle Test beschäftigt sich mit der Kodierung eines 1080i HD MPEG 2 Clips in Adobe Premier Pro. Der Clip wurde mit 25 Mbit Bitrate und 30 fps aufgenommen und soll nun in ein hi264 genanntes Format zur weiteren Bearbeitung exportiert werden. Während ein herkömmlicher Core 2 Duo E6400 das Video mit 2-6 fps kodiert, sprich in 1/6 Echtzeit rechnet, bringt es die GT200 auf 46 fps. Die GT200 rechnet also auch hier wieder schneller als Echtzeit.

I know we are discussing this, but I am unsure of the relevance of what is quoted. The figures themselves are impressive, I think, and portend what is possible. Of course, it may be that AMD is also going to do something similar with the multi-GPU 4xx0 series. Possibly?! I'd hope so.
 
It's pretty much impossible to reuse 3D data from frame to frame.
I agree that, as of today, there are probably limitations in current APIs and hardware that make it not worth the effort: how to manage some kind of general data structure that keeps track of changes, how to sync it between GPU and CPU.

Yet this approach is already in use on a smaller scale. For example (if I remember correctly), Crysis uses extrapolation for some lighting calculations, and recalculates only every N frames or when something significant happens. Also, instancing is a different face of the same cube.

I believe it's time for these capabilities to be built into engines/APIs more generally. If DX10 is restrictive, maybe CAL/CUDA could help? Like NVIDIA implementing PhysX in Unreal Engine 3 (of course, being an isolated path, it didn't really disrupt the main workflow and rendering as much as more general caching/reusing would). Or wait and hope for DX11. :smile:
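A hypothetical sketch of that "recalculate only every N frames or when something significant happens" pattern; the class, period and threshold below are purely my own illustration, not anything from Crysis:

```python
# Hypothetical sketch of caching an expensive result and recomputing it only
# every N frames, or sooner if a large enough change is detected.

class CachedResult:
    def __init__(self, period=4, threshold=0.25):
        self.period = period          # recompute at least every N frames
        self.threshold = threshold    # "something significant happened" measure
        self.age = 0
        self.value = None

    def get(self, compute, change_amount):
        stale = self.value is None or self.age >= self.period
        if stale or change_amount > self.threshold:
            self.value = compute()    # full recalculation
            self.age = 0
        else:
            self.age += 1             # reuse the cached (or extrapolated) result
        return self.value

cache = CachedResult()
print(cache.get(lambda: "expensive lighting pass", change_amount=0.1))
```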
 
That German looks like a summary of the presenter in the video itself. $20 for an MPEG2 codec or whatever, Quad Core 3GHz Core 2 vs. GTX 280 = 20 mins to transcode a ~5min 720p NV clip vs. 5x-faster-than-realtime [150fps] via CUDA-coded transcoder, etc. (I thought the presenter said the QC 3GHz was only loaded to 2 cores, though, which is odd for video processing--I'm probably wrong.) Someone present asked if porting this to GT200 (from G80, IIRC) was a big deal, he said no. He ended by futzing with some HD home video (I'm guessing the 1080i h264 bit) of an NV guy's kids.

As for AMD, didn't they release some transcoder thingy for the X1000 series? I remember a beta was released that ran only on the CPU that was still pretty quick, but not much else.
 
Thank you, it seemed like a summary ... I just wasn't sure if there was anything added.

And yes, Kyle Bennett evidently just confirmed it a few minutes ago for the 3800 series:
{sorry about that, but it is appropriate, just by serendipity}

http://www.hardforum.com/showthread.php?p=1032543925#post1032543925

Originally Posted by Kyle_Bennett
AMD showed 3800 series doing it in December of last year transcoding video about 10X faster than quad core...
 
I'll give it my best; my comments are in parentheses. My mother tongue is neither German nor English.

Zum Einsatz kamen zwei Benchmarks: Zu allererst wird ein zwei minütiger Videoclip in 720p HD in ein ipod-taugliches Format transkodiert. Die konventionelle Methode, also die Kodierung via itunes und einem 20 US-Dollar teuren Mpg-2 Codec, nimmt dabei auf einem Quad-Core mit 3,0 GHz mehrere Stunden in Anspruch.

Two benchmarks were used:

1) An approximately two-minute 720p HD video clip is transcoded to an iPod-compatible format. The conventional method, that is, encoding through iTunes and a $20 MPEG-2 codec, takes several hours on a 3 GHz quad-core.

(WTF?? NV PR at its best. FFmpeg 720p HD -> 320x240 H.264, 2-pass, won't take 1+ hours for a 2-minute 720p HD clip, sigh.)

Unter Zuhilfenahme einer unbekannten GPU der neuen GT200-Generation (vermutlich GTX 280) via CUDA benötigt man für denselben Vorgang streng genommen nur Sekunden. Das Video wird in 5-facher Echtzeitgeschwindigkeit kodiert. Sprich ein Videoclip mit 30 Frames pro Sekunde wird mit 150 Frames pro Sekunde bearbeitet. Daher ist es auch möglich, mehrere Videos gleichzeitig in wenigen Minuten umzuwandeln. Die CPU wird dabei nur wenig belastet. Mehr als ein Core wird für das Processing nicht beansprucht.

With the help of an unknown GPU of the new GT200 generation (presumably GTX 280) using CUDA, the same transcoding strictly speaking takes only seconds. The video is transcoded at 5x real-time. That means a video coded at 30 FPS is transcoded at 150 FPS. So it becomes possible to transcode several videos simultaneously (?) in a few minutes. (I understand that sentence as sequential transcoding, not parallel.) The CPU is only lightly loaded; no more than one core is used for the transcoding.

(Nitpick: say 720p HD at 150 fps, not 5x real-time. If the video is at 24, 29, or 60 fps, that relative speed is useless. Use absolute FPS transcoding speed, you PR weasels :p)

Der zweite und mindestens genauso eindrucksvolle Test beschäftigt sich mit der Kodierung eines 1080i HD MPEG 2 Clips in Adobe Premier Pro. Der Clip wurde mit 25 Mbit Bitrate und 30 fps aufgenommen und soll nun in ein hi264 genanntes Format zur weiteren Bearbeitung exportiert werden. Während ein herkömmlicher Core 2 Duo E6400 das Video mit 2-6 fps kodiert, sprich in 1/6 Echtzeit rechnet, bringt es die GT200 auf 46 fps. Die GT200 rechnet also auch hier wieder schneller als Echtzeit.

2) The second and at least equally impressive test is the transcoding of a 1080i HD MPEG-2 clip using Adobe Premiere Pro. The clip is 25 Mbit / 30 fps and is exported to a format called hi264 (H.264 High Profile?) for further editing. A conventional C2D E6400 transcodes at 2-6 FPS, that is roughly 1/6 of real time, while the GT200 reaches 46 fps. So the GT200 is again faster than real time.

(2 or 6 FPS?? More weaseling. Drop an exact bit-copy of the source videos on an FTP and give us ffmpeg/x264 times. I don't know Adobe Premiere Pro; is it fast? Any video gurus around?)

Lacking precise details, but impressive numbers anyway.
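For reference, the arithmetic implied by those quoted figures; the clip lengths and frame rates are from the article, everything else is my own back-of-envelope:

```python
# Rough arithmetic behind the quoted transcode figures.

def transcode_seconds(clip_seconds, clip_fps, encode_fps):
    """Wall-clock time to transcode a clip at a given encoder frame rate."""
    return clip_seconds * clip_fps / encode_fps

# Test 1: ~2 minute 720p clip at 30 fps, encoded at a claimed 150 fps (5x real time).
print(transcode_seconds(120, 30, 150))   # ~24 seconds for the whole clip

# Test 2: 1080i clip at 30 fps. 46 fps on the GPU vs. 2-6 fps on the E6400.
print(46 / 30)                           # ~1.5x real time on the GPU
print(30 / 2, 30 / 6)                    # the E6400 is 15x to 5x slower than real time
```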
 
It's pretty much impossible to reuse 3D data from frame to frame. Lighting may change, effects may change, position will most likely change whether due to shifts in camera position or shifts in object position.

Actually, it's not impossible; it's quite easy, since with 3D data you have exact information about the motion vector of each pixel. You know the movement (and acceleration) and rotation (and angular acceleration) of your objects and your camera. With this information you can do cheaper and more correct motion estimation than current codecs and HDTVs. The biggest problem is frame latency and stuttering, as we have to do this in real time. The motion-estimated frames are much cheaper to render (around 10x in my testing scenario) than real frames. This causes noticeable stuttering unless I queue the frames. Queuing frames, however, causes noticeable control latency (much like AFR SLI setups).

Also, on deferred rendering systems you can easily motion-compensate only your g-buffer creation while recalculating the lighting every frame. But if you only calculate one extra frame between the rendered frames (like I do), the lighting error between two frames is not usually noticeable. In 3D rendering, however, you can detect these "non-usual" scenarios: if your camera moves too much in one frame or some light turns off or on, you can do the motion estimation more precisely on that frame (or just render a real frame instead).
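A sketch of how a renderer gets those exact per-pixel motion vectors: project the same surface point with last frame's and this frame's view-projection matrices and take the screen-space difference. The matrix names and conventions below are my own illustration, not anyone's actual implementation:

```python
import numpy as np

# Project a world-space point with two view-projection matrices and take the
# difference in normalized device coordinates: an exact per-point motion vector.

def project(view_proj, p_world):
    """Project a world-space point to normalized device coordinates."""
    p = view_proj @ np.append(p_world, 1.0)
    return p[:2] / p[3]

def motion_vector(p_world, view_proj_prev, view_proj_curr):
    """Screen-space motion of one surface point between two frames."""
    return project(view_proj_curr, p_world) - project(view_proj_prev, p_world)

# Example with a hypothetical camera that pans slightly between frames:
vp_prev = np.eye(4)
vp_curr = np.eye(4); vp_curr[0, 3] = 0.1   # assumed small x-translation
print(motion_vector(np.array([0.0, 0.0, 2.0]), vp_prev, vp_curr))
```

With these vectors (plus depth), a cheap extrapolated frame can be synthesized by re-sampling the previous frame, or, in a deferred renderer, by warping the g-buffer and re-running only the lighting pass, as described above.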
 
Thank you so much, Karoshi! My German is rudimentary [if you need Portuguese, I am OK].

Excuse me ... but holy heck! That is impressive, at least to my little mind! We are talking about these same numbers, I guess, but it is a leap IMO. Is there anyone here disagreeing with the significance of these benchmarks, or arguing that they may be an exaggerated "best case"? I see you nitpicking, but it is still good even at half those numbers! And if AMD can also do this with the X2 or X3, it appears we all win!
 
[Slide: 03RV770part2.jpg]
IMHO, the "RV770 advantages vs G92" listed in the slide look like desperation.
 
When did being faster, more power efficient, better at video, more feature rich and using more advanced processes, supporting a higher revision of DirectX and using new memory technology count as desperation?

I can't even see how you've remotely got a case there.
 
Okay, did I just see "accelerated encoding"?

Now, is that some snarky, subtle answer to the paid CUDA-accelerated encoder, or just something else? :D

Hopefully this is done on the ALUs instead of dedicated hardware, given that we just went through the same format/capability concerns with GT200.
 