GNOME Bugzilla – Bug 675772
byte_reader + adapter: _masked_scan_uin32() could be optimized
Last modified: 2014-12-22 15:16:44 UTC
Currently the scan will rotate byte into the state and recheck the state if it match the masked and value. While this ensure that no more then 1 load is done per-byte we can do better then that. If one byte does not match mask in any of word from it's current location to the first, it means we can directly skip over that byte, allowing bigger step. This would mean we would do fewer load, but increase the bit masked + check. As bit operation are much cheaper then load, it should give a good speed up. Note keeping in mind that adding many branches could make it as bad on certain CPU.
It was proposed in bug #702357 to simply add few optimized patch, like the mpeg case, where mask is 0xFFFFFF00 and value 0x00000100. I think this is a simple, and safe solution. It is likely that there exist a generalisation for the algorithm found in mpegvideoparse, but I didn't got time to look into it, yet.
(In reply to comment #1) > It is likely that there exist a generalisation for the algorithm found in > mpegvideoparse, but I didn't got time to look into it, yet. It is using a specialization of http://en.wikipedia.org/wiki/Boyer_moore
I just learned that by reading gstrtph264pay next_start_code function. Are you having a look, or shall I take care ?
(In reply to comment #3) > I just learned that by reading gstrtph264pay next_start_code function. Are you > having a look, or shall I take care ? Please go ahead, but I don't think it makes a lot of sense to implement Boyer-moore there...
(In reply to comment #4) > (In reply to comment #3) > > I just learned that by reading gstrtph264pay next_start_code function. Are you > > having a look, or shall I take care ? > > Please go ahead, but I don't think it makes a lot of sense to implement > Boyer-moore there... Agreed, the cost of generating the tables is probably too high, I had a simplier version in mind, just need to implement it.
I have experimented with ageneric solution: see http://cgit.collabora.com/git/user/nicolas/gstreamer.git/log/?h=boyer-moore But this endup being slower. Instead it was decided to optimize certain know case. The H264/MPEG/VC1 case has been fixed: commit de75bca9b387c404a29cb057046d0db1d601794c Author: Nicolas Dufresne <nicolas.dufresne@collabora.com> Date: Thu Aug 8 12:08:31 2013 +0200 bytereader: Accelerate MPEG/H264 start code scanning Accelerate MPEG/H264 start code scanning using Boyer-Moor bad character heuristic. https://bugzilla.gnome.org/show_bug.cgi?id=702357 I think we can also optimizing the cases with mask 0xFFFFFFFF and 0xFFFFFF00 (found in Matroska), where generating table of offset would be possible since we don't need to mask each bytes. Thus I'll keep this open until I'm done experimenting. p.s. the Adapter version has not been optimized since it's not used for 0xffffff00/0x00000100 start codes.
Nearly a year without any more complaint on performance. I think it's fine to close this.