How many GFLOPs are actually in use to decode hd level VC-1/mpeg-4?

randycat99

Veteran
Has this ever been figured out? We've seen how easily it can tie up even the hardiest PC hardware...but how many GFLOPs are actually being burned to ensure a steady 1080p24 from VC-1 or mpeg-4? I am very curious, since this seems to have become the impromptu hardware stress test for the consumer space. Maybe not everyone is interested in the most elaborate PC game, but hd movies are sure to touch the life of nearly everyone in some way. It's the first (potentially) mainstream "app" that literally demands the highest-spec hardware one can buy to run it well.

Using the Toshiba HD-DVD player as an example, supposedly there is a 2.5 GHz P4-M with a whopping 1 GB of RAM to make HD movie playback possible. That is astounding, imo. That would constitute quite a chunk of hardware (though not the absolute highest spec, of course) for anybody's personal-use computer. Yet playing one HD movie is enough to gobble up all of this resource with no mercy. So how many GFLOPs do you think are at work to make this happen?
 
Very good profiling of H.264 at 1920x1080x25 here:
http://personals.ac.upc.edu/aramirez/papers/iiswc05-avc.pdf

On an IBM 970 PPC @ 1.6 GHz (theoretical peak of 12.8 GFLOPS), the average of 4 different videos gave these results:

Reference H.264 decoder on a 1080p frame:
2250mil Cycles
1720mil Instructions

Optimized H.264 decoder on a 1080p frame:
213mil Cycles
220mil Instructions

Mpeg4 decoder on a 1080p frame:
165mil Cycles
165mil Instructions

Mpeg2 decoder on a 1080p frame:
31mil Cycles
21mil Instructions

And like Simon F hints above, these are not dominated by floating-point instructions. For the reference decoder, 99% of the instructions are integer ops: 50% integer arithmetic, ~49% load/store/branch instructions.

The reference decoder results are fascinating because there are almost no SIMD instructions. Decoding 1080p24 in realtime using the reference decoder would require a ~55 GHz CPU!
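
As a rough check of that estimate, here is a minimal Python sketch that turns the cycles-per-frame figures above into the clock rate needed for realtime playback. It assumes the cycle count per frame stays the same at any clock speed (which ignores memory latency and other effects), and the function name `required_ghz` is just made up for illustration.

```python
# Cycles per 1080p frame on the 1.6 GHz PPC 970, in millions, as quoted above.
CYCLES_PER_FRAME_M = {
    "H.264 reference": 2250,
    "H.264 optimized": 213,
    "MPEG-4": 165,
    "MPEG-2": 31,
}


def required_ghz(cycles_per_frame_m, fps):
    """Clock rate in GHz needed to sustain `fps`.

    Assumption: cycles per frame do not change with clock speed.
    """
    return cycles_per_frame_m * 1e6 * fps / 1e9


if __name__ == "__main__":
    for name, cycles in CYCLES_PER_FRAME_M.items():
        print(f"{name:16s} {required_ghz(cycles, 24):6.2f} GHz for 1080p24")
```

By this arithmetic the reference decoder lands at 54 GHz for 24 fps, in line with the ~55 GHz figure, while the optimized decoder needs a bit over 5 GHz.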
 
That would be about 7 fps for that PPC with the optimized decoder...

I wonder if any single-core PC processor can decode 1080p/25 at maximum BR bitrate; the worst case in that article is 5.4 fps for a 1.6 GHz PPC.

HD-DVD players use hardware decoders.
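
The frame rates above can be sanity-checked with a short Python sketch. It assumes performance scales linearly with clock speed on the same core, which is optimistic; the function names are hypothetical and only the 1.6 GHz clock, the 213 million cycles per frame, and the 5.4 fps worst case come from the thread.

```python
CLOCK_HZ = 1.6e9  # PPC 970 clock from the profiling post


def fps_at_clock(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frames per second achievable when each frame costs `cycles_per_frame`."""
    return clock_hz / cycles_per_frame


def clock_needed(measured_fps, target_fps, clock_hz=CLOCK_HZ):
    """Clock (Hz) needed to reach `target_fps`, given `measured_fps` at `clock_hz`.

    Assumption: frame rate scales linearly with clock speed.
    """
    return clock_hz * target_fps / measured_fps


if __name__ == "__main__":
    # Optimized H.264 decoder: 213 million cycles per 1080p frame.
    print(f"average: {fps_at_clock(213e6):.1f} fps at 1.6 GHz")
    # Worst case quoted from the article: 5.4 fps at 1.6 GHz.
    print(f"worst case needs {clock_needed(5.4, 25) / 1e9:.1f} GHz for 25 fps")
```

Under that linear-scaling assumption, the worst case would need the same core running at roughly 7.4 GHz to hold 25 fps.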
 
Ok so, if HD playback really isn't dependent on floating-point operations, why is Cell so good at running multiple streams at once? If that's true of course...
 
london-boy said:
Ok so, if HD playback really isn't dependent on floating-point operations, why is Cell so good at running multiple streams at once? If that's true of course...

Wasn't that 48-stream demo just little 162x130 MPEG-2 videos tiled onto a 1920x1080 screen?
 
london-boy said:
Ok so, if HD playback really isn't dependent on floating-point operations, why is Cell so good at running multiple streams at once? If that's true of course...
Because Cell is also an integer monster, and the local stores are way faster than conventional cache architectures, as long as you can fit enough relevant data in there (stream decoding fits very well).
 
Ok so, if HD playback really isn't dependent on floating-point operations, why is Cell so good at running multiple streams at once? If that's true of course...
You do know that Cell can run integer arithmetic just as fast as FP, don't you? I ask because this fact gets missed a lot in discussions like this.
[edit]I was a bit late... but yeah, what Npl said.
 
Of course! I missed that small detail out :D

NANOTEC, I was talking about the 12 streams at 1920x1080 res running simultaneously, not the 48-video demo Toshiba showed...
 
london-boy said:
Of course! I missed that small detail out :D

NANOTEC, I was talking about the 12 streams at 1920x1080 res running simultaneously, not the 48-video demo Toshiba showed...

The 12 streams were also MPEG-2, which requires much less processing power than VC-1/MPEG-4, as seen in the numbers inefficient posted.

Reference H.264 decoder on a 1080p frame:
2250mil cycles
1720mil Instructions

Optimized H.264 decoder on a 1080p frame:
213mil cycles
220mil Instructions

Mpeg4 decoder on a 1080p frame:
165mil cycles
165mil Instructions

Mpeg2 decoder on a 1080p frame:
31mil cycles
21mil Instructions
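
To put the 12-stream case next to those figures, here is a back-of-the-envelope Python sketch. It is only a rough comparison: the cycle counts are the PPC 970 numbers quoted above, not measurements on Cell, and it assumes every stream costs the average MPEG-2 figure per 1080p frame. The helper name `aggregate_cycles_m` is invented for this example.

```python
# Cycles per 1080p frame (millions) from the PPC 970 profiling figures quoted above.
MPEG2_M = 31
H264_OPTIMIZED_M = 213
H264_REFERENCE_M = 2250


def aggregate_cycles_m(cycles_per_frame_m, streams):
    """Total cycles (millions) to decode one frame from each of `streams` streams."""
    return cycles_per_frame_m * streams


# Twelve simultaneous MPEG-2 1080p streams, one frame each.
twelve_mpeg2 = aggregate_cycles_m(MPEG2_M, 12)

if __name__ == "__main__":
    print(f"12 x MPEG-2 1080p: {twelve_mpeg2}M cycles per frame interval")
    print(f"vs one optimized H.264 stream: {twelve_mpeg2 / H264_OPTIMIZED_M:.2f}x")
    print(f"vs one reference H.264 stream: {twelve_mpeg2 / H264_REFERENCE_M:.2f}x")
```

On these numbers, one MPEG-2 stream costs about a seventh of an optimized H.264 stream, so twelve of them together come to a bit under twice the work of a single optimized H.264 stream.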
 
NANOTEC said:
Wasn't that 48-stream demo just little 162x130 MPEG-2 videos tiled onto a 1920x1080 screen?

They were SD streams, so it was more than 162x130 for each. They were resized to fit on the screen, after decoding.
 
Titanio said:
They were SD streams, so it was more than 162x130 for each. They were resized to fit on the screen, after decoding.

That sounds right but they also didn't mention quality or framerate.
 
deathkiller said:
DVD quality and framerate? They said 48 DVD streams, I think.

http://techon.nikkeibp.co.jp/english/NEWS_EN/20050425/104149/?ST=english

"48 SDTV format MPEG-2 streams"
So that would be 480i

On a different note: have we seen a single example of Cell doing an H.264 video decode?

On the PS3, at least it will get some hardware assist from Nvidia's RSX. But it seems that H.264 decoding is very branch-heavy, and the SPEs' inability to branch predict would appear to be a problem area. It will be interesting to see how they tackle this. Perhaps brute force will be enough.
 
http://en.wikipedia.org/wiki/DVD#DVD-Video

DVD movies look to be that res at 10 Mbps. I don't know how compression ratio affects performance requirements though. Presumably the increased compression ratio requires more processing power to decompress (?). Or less, as there's less data to get hold of. Or the same, because it's the same process. Beats me! And none of that helps with determining requirements for HD MPEG-4/VC-1 either.
 
inefficient said:
And the SPEs' inability to branch predict would appear to be a problem area. It will be interesting to see how they tackle this. Perhaps brute force will be enough.
The SPE's ability to prefetch the correct instructions in advance via branch hints, with no penalty at all, might be a big win, even over complex OOOE cores.
 
nAo said:
The SPE's ability to prefetch the correct instructions in advance via branch hints, with no penalty at all, might be a big win, even over complex OOOE cores.


But only in predictable situations (like this)?
 