Bug 730783 – bytereader: SIMD-based optimization to find start code on H.264 and H.265 streams

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 730783 - bytereader: SIMD-based optimization to find start code on H.264 and H.265 streams


Summary:	bytereader: SIMD-based optimization to find start code on H.264 and H.265 str...


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gstreamer (core)
Version:	git master
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-05-27 04:44 UTC by Sungho Bae
Modified:	2018-11-03 12:20 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
001_gstbytereader_neon_improvement_patch (1.97 KB, patch) 2014-05-27 04:44 UTC, Sungho Bae	rejected	Details \| Review
Doucument for SIMD-based Advanced Byte Scan Algorithm.txt (14.71 KB, text/plain) 2014-06-18 06:43 UTC, Sungho Bae		Details
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-Codes.patch (3.60 KB, patch) 2014-06-18 11:36 UTC, Sungho Bae	none	Details \| Review
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-C.patch (3.60 KB, patch) 2014-06-18 11:39 UTC, Sungho Bae	none	Details \| Review
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-Code.patch (3.86 KB, patch) 2014-07-01 11:56 UTC, Sungho Bae	none	Details \| Review
0002-configure-add-to-check-if-simd-optimization-is-used-or-not.patch (5.32 KB, patch) 2014-07-01 12:00 UTC, Sungho Bae	needs-work	Details \| Review
0002-configure-add-to-check-if-simd-optimization-is-used-or-not-v1.patch (5.24 KB, patch) 2014-07-09 06:25 UTC, Sungho Bae	none	Details \| Review

Description Sungho Bae 2014-05-27 04:44:24 UTC

Created attachment 277253 [details] [review]
001_gstbytereader_neon_improvement_patch

In the parse phase for video streams, bytescanning is performed to find the start and end of each NAL unit. This implementation is to improve the performance of bytescanning for start code using both ARM NEON instructions and pointer access instead of indexed access.
The advantages are to reduce CPU usage and to enhance the scanning performance.

This patch assumes that '0x0000' is unlikely to appear and the zeros are part of the start code, that is, '0x010000'.
Our proposed idea is based on the assumption.
If we quickly know whether or not there exists '0x0000' in the scanning area to find the start code, we can skip the scanning process for the start code.
We thus implemented the preprocessing to know whether or not there exists '0x0000'.
Because the probability of zeros to appear is low, its performance dramatically improved.

Comment 1 Edward Hervey 2014-05-27 05:41:51 UTC

The idea makes sense.

I'm not a neon expert, but isn't vceq.i8 checking for 0x00 (and not 0x0000) ?

Furthermore, aren't there restrictions on alignments for the input 'pdata' variable ? Looks as though it would need to be 16-bit-aligned.

In terms of code clarity, it would also be better if you could either:
* avoid duplicating the code-paths and variables (i.e. just #if the new variables and asm path instead of duplicating everything).
* or #if the whole function (i.e. add a new one instead of modifying the existing one).

Comment 2 Sebastian Dröge (slomo) 2014-05-27 06:58:15 UTC

Comment on attachment 277253 [details] [review]
001_gstbytereader_neon_improvement_patch

This also needs a check if a compiler is uesd that can handle gnu style inline assembler

Comment 3 Nicolas Dufresne (ndufresne) 2014-05-27 12:07:56 UTC

Note for H264 header parsing performance, h264parse does not keep it's parsing context, hence re-parse the same buffer over and over. I think there is a bug for that, which I can't find now, but I'm sure I've created it. As the h264 header parsing is pretty performant already (with the byte skipping algorithm), I'd focus on fixing this first, before going into having to maintain assembly code.

Comment 4 Nicolas Dufresne (ndufresne) 2014-05-27 12:57:44 UTC

Review of attachment 277253 [details] [review]:

Would it be possible to document here you're algorithm. The skip mark being on third byte, start code not being aligned, I have a hard time to understand where you can comup with multiple of 4 skips.

Comment 5 Nicolas Dufresne (ndufresne) 2014-05-27 13:01:28 UTC

(In reply to comment #0)
> This patch assumes that '0x0000' is unlikely to appear and the zeros are part
> of the start code, that is, '0x010000'.

Do you mean this can have false positive, or just that this would be the worst case ?

> Our proposed idea is based on the assumption.
> If we quickly know whether or not there exists '0x0000' in the scanning area to
> find the start code, we can skip the scanning process for the start code.
> We thus implemented the preprocessing to know whether or not there exists
> '0x0000'.
> Because the probability of zeros to appear is low, its performance dramatically
> improved.

Can you provide profiling, best case, mid case, worst case ?

Comment 6 Sungho Bae 2014-05-28 08:30:08 UTC

(In reply to comment #1)
> The idea makes sense.
> 
> I'm not a neon expert, but isn't vceq.i8 checking for 0x00 (and not 0x0000) ?
> 

That's correct! vceq.i8 is to check for 0x00.
but, the previous instruction(vld2.8) is to load data using interleaved access with 2 offset. 

it means the following:
inst: vld2.8 {d16-d19} [r1]
data: 0x   00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F ....
load: q8   00 02 04 06 08 0A 0C 0E  ......   (even indexed)
      q9   01 03 05 07 09 0B 0D 0F  ......   ( odd indexed)

The purpose of preprocessing using neon instructions is to skip noncandidates which have two consecutive zeros.
In contrast, candidates for start code have two consecutive zeros and we have to find 0x00 though it uses interleaved access (and not full access).

The above idea is to reduce the number of instructions in view of optimization.

> Furthermore, aren't there restrictions on alignments for the input 'pdata'
> variable ? Looks as though it would need to be 16-bit-aligned.

According to "ARM®Architecture Reference Manual for ARMv7-A and ARMv7-R edition or ARMv8", neon load instruction 'vldn' supports unaligned access.

if you use vldn with alignment qualifiers or vldm (multiple load),
you can improve the performance but you should consider restrictions on alignments for the input data.

in this patch, we do not use alignment qualifiers. it thus doesn't have to consider restrictions on alignment. it always works.

> 
> In terms of code clarity, it would also be better if you could either:
> * avoid duplicating the code-paths and variables (i.e. just #if the new
> variables and asm path instead of duplicating everything).
> * or #if the whole function (i.e. add a new one instead of modifying the
> existing one).

I concur with your opinion.
After modify the patch, let you know it.

Thanks.

Comment 7 Sungho Bae 2014-05-28 08:52:21 UTC

(In reply to comment #2)
> (From update of attachment 277253 [details] [review])
> This also needs a check if a compiler is uesd that can handle gnu style inline
> assembler

I compiled and tested the gstreamer core with my patch using gcc cross compiler for arm architecture.
Of course I checked whether or not the output binary object has correct assembly code using objdump.

I reviewed documents related to clang/llvm and intel c++ compiler.
Those compiler support gnu-style inline assembler as follows:
- clang/llvm: http://clang.llvm.org/compatibility.html
- intel c++ compiler: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-5100C4FC-BC2F-4E36-943A-120CFFFB4285.htm

If you know other compilers we need to consider, let me know them.

Comment 8 Sungho Bae 2014-05-30 06:55:05 UTC

(In reply to comment #3)
> Note for H264 header parsing performance, h264parse does not keep it's parsing
> context, hence re-parse the same buffer over and over. I think there is a bug
> for that, which I can't find now, but I'm sure I've created it. As the h264
> header parsing is pretty performant already (with the byte skipping algorithm),
> I'd focus on fixing this first, before going into having to maintain assembly
> code.

I am not familiar with h264parse.
As far as I know, h264parse uses extension fields for avc which does not have start code prefix to find the start of each nal unit.
In the case of mp4 container, it supports to explicitly denote the size of each nal unit by a field.
but, many of h264 streams adopt the start code based format and those need byte scanning algorithm to find start code. 
one of examples is a ts container format.

In addition, h264parse concentrates on the way to parse a nal unit depending on its nal type. but, gst_byte_reader_masked_scan_uint32() is used to extract nal units from media files before parsing them.
Therefore, because the purpose of gst_byte_reader_masked_scan_uint32() differs from h264parse's, I think this patch is useful although h264parse plugin exists.

I know you adopted Boyer-Moor bad character heuristic to _scan_for_start_code() function to gstreamer repository. I thought your algorithm was good. so, I investigated and proposed the improved idea.


You wrote as follows:
> h264parse does not keep it's parsing context, hence re-parse the same buffer over and over.

I am not familiar with h264parse. but, I will review h264parse plugin as much as I can. In case of extracting a nal unit from start code based media stream, it is needed two points of start codes: the first indexes of the current nal unit and the next nal unit.

+----------+--------------------+----------+---------------+--------+
| 0x000001 |  current nal unit  | 0x000001 | next nal unit | ....   |
+----------+--------------------+----------+---------------+--------+
                 ^----  it is placed between two start codes

If h264parse has a bug, I guess the bug is related to the above.
but, I'm not sure because I haven't yet reviewed.

Thanks.

Comment 9 Edward Hervey 2014-05-30 07:19:56 UTC

The problem that Nicolas is pointing out is the following:

Say your full stream is what you mentioned and it comes into h264parse as two buffers

+----------+--------------------+----------+---------------+--------+
| 0x000001 |  current nal unit  | 0x000001 | next nal unit | ....   |
+----------+--------------------+----------+---------------+--------+
\______________________________/ \________________________/ ...
          First Buffer               Second Buffer

What Nicolas meant that it doesn't keep context is that, for each buffer, h264parse will look for the first start_code (trivial), and then the next one (as you mentioned, to know how big the NALU is).

+----------+--------------------+
| 0x000001 |  current nal unit  |
+----------+--------------------+
                                ^
                                ^
   h264parse scans all the way to the end of the buffer...
but then doesn't remember it already scanned when the next buffer comes round

2nd buffer comes in, 1st one is still in the parser, we now have:

+----------+--------------------+----------+---------------+
| 0x000001 |  pending nal unit  | 0x000001 | next nal unit |
+----------+--------------------+----------+---------------+
           ^                    ^
           A                    B

h264parse will resume parsing from offset A instead of resuming it from offset B... resulting in scanning *twice* the nal unit

Comment 10 Edward Hervey 2014-05-30 07:20:46 UTC

So it's the classic example of "yes, this function is expensive, but instead of optimizing that function ... how about calling it less often ? :)"

Comment 11 Sungho Bae 2014-05-30 08:39:35 UTC

(In reply to comment #9)
> The problem that Nicolas is pointing out is the following:
> 
> Say your full stream is what you mentioned and it comes into h264parse as two
> buffers
> 
> +----------+--------------------+----------+---------------+--------+
> | 0x000001 |  current nal unit  | 0x000001 | next nal unit | ....   |
> +----------+--------------------+----------+---------------+--------+
> \______________________________/ \________________________/ ...
>           First Buffer               Second Buffer
> 
> What Nicolas meant that it doesn't keep context is that, for each buffer,
> h264parse will look for the first start_code (trivial), and then the next one
> (as you mentioned, to know how big the NALU is).
> 
> +----------+--------------------+
> | 0x000001 |  current nal unit  |
> +----------+--------------------+
>                                 ^
>                                 ^
>    h264parse scans all the way to the end of the buffer...
> but then doesn't remember it already scanned when the next buffer comes round
> 
> 2nd buffer comes in, 1st one is still in the parser, we now have:
> 
> +----------+--------------------+----------+---------------+
> | 0x000001 |  pending nal unit  | 0x000001 | next nal unit |
> +----------+--------------------+----------+---------------+
>            ^                    ^
>            A                    B
> 
> h264parse will resume parsing from offset A instead of resuming it from offset
> B... resulting in scanning *twice* the nal unit

That's correct! That is what I guess the cause of the problem.
I'm currently reviewing h264parse plugin related to the problem.
Once I solve the problem, I will report it by another thread with patch files.

Comment 12 Sungho Bae 2014-05-30 09:24:04 UTC

(In reply to comment #10)
> So it's the classic example of "yes, this function is expensive, but instead of
> optimizing that function ... how about calling it less often ? :)"

First of all, thanks to you and Nicolas.

I entirely agree with your opinions.
if the problem is fixed, I expect that it is improved approximately twice.

but, I think the performance of scanning function is also important.
I do unit tests with this patch and got the result as follows:

1) target environments:
- CortexA9 dual 1.0Ghz
- Input Size: H.265 Bitstream ( 9MB, includes 1203 nalus )

2) Result: 
+--------------------+--------+-------+--------+--------+-------------+
| method             | avg.   | std.  | max.   | min.   | enhancement |
+--------------------+--------+-------+--------+--------+-------------+
| origin             | 40.072 | 1.552 | 44.250 | 37.376 | 100.00%     |
| w/o mem idx        | 37.162 | 1.284 | 40.141 | 34.937 | 107.83%     |
| w/o mem idx + neon | 27.890 | 1.284 | 30.891 | 26.201 | 143.67%     |
+--------------------+--------+-------+--------+--------+-------------+

According to the result,
The enhancement of this patch is expected to average 43.67% for scanning start code.
It addresses that optimizing the scanning function as well as reducing the number of calling it is important.

In addition, I plan to solve the bug which mentioned by you and Nicolas.
but, I think it should be processed by another thread because the improving points are differed.

Thanks.

Comment 13 Nicolas Dufresne (ndufresne) 2014-05-30 13:22:33 UTC

Some notes, make sure to look at other parsers if you want to fix that. It was fixed a long time ago in MPEG2 Video, iirc by Mark. Ideally, I'd like the h264parser library, and other library to have consistent signature.

A second note, would be nice to share your test script, so we can reproduce. Will make the thing a little more scientific. Also, we have unit tests for this generic method in tests/check/libs/bytereader.c . I think they could be improved, but can already catch up if anything isn't completely reliable. You can run it using:

make -C tests/check libs/bytereader.check

And to debug it:

make -C tests/check libs/bytereader.gdb

I wouldn't create a new bug. This is about improving H264 parsing speed. Why we push so hard on the other case, is that it will optimize for all architectures without affecting the readability and maintainability of the code.

We also have bunch or other plan for this mask scan method where we can have higher gain, https://bugzilla.gnome.org/show_bug.cgi?id=675772

Final notes, thanks for your time and effort looking into this. We really appreciate that you are taking the time and making this effort in making GStreamer more efficient. Our review process can be exhausting sometimes, but I think you can understand that the last thing we'd like is to have code we don't fully understand, or too much assembly to maintain. So it's important for us to ask questions, take the time to reproduce, double check there is not error. Specially in core methods.

Comment 14 Sungho Bae 2014-06-09 05:01:42 UTC

(In reply to comment #13)
> Some notes, make sure to look at other parsers if you want to fix that. It was
> fixed a long time ago in MPEG2 Video, iirc by Mark. Ideally, I'd like the
> h264parser library, and other library to have consistent signature.

Thanks for your advice. I will look into MPEG2 Video, iirc by Mark.
If you know the related git repository name such as gst-plugin-bad or -good, let me know it.

> 
> A second note, would be nice to share your test script, so we can reproduce.
> Will make the thing a little more scientific. Also, we have unit tests for this

Of course, I think I must make you reproduce the test.
After removing unnecessary codes, I will upload the test source and script.

> generic method in tests/check/libs/bytereader.c . I think they could be
> improved, but can already catch up if anything isn't completely reliable. You
> can run it using:
> 
> make -C tests/check libs/bytereader.check
> 
> And to debug it:
> 
> make -C tests/check libs/bytereader.gdb
> 

I tested as follows:

# ./bytereader
Running suite(s): GstByteReader
100%: Checks: 11, Failures: 0, Errors: 0

I confirmed that it works correctly using the above unit test program.

> I wouldn't create a new bug. This is about improving H264 parsing speed. Why we
> push so hard on the other case, is that it will optimize for all architectures
> without affecting the readability and maintainability of the code.
> 
> We also have bunch or other plan for this mask scan method where we can have
> higher gain, https://bugzilla.gnome.org/show_bug.cgi?id=675772
> 
> Final notes, thanks for your time and effort looking into this. We really
> appreciate that you are taking the time and making this effort in making
> GStreamer more efficient. Our review process can be exhausting sometimes, but I
> think you can understand that the last thing we'd like is to have code we don't
> fully understand, or too much assembly to maintain. So it's important for us to
> ask questions, take the time to reproduce, double check there is not error.
> Specially in core methods.

I fully understand your opinions.

Comment 15 Sungho Bae 2014-06-18 05:54:14 UTC

I suggest that the patch split into the other two threads, because it consists of two kinds of optimization:


First, it enhances the performance by approximately 8% using pointer access instead of array indexed access.
The idea can be adopted for all architectures and it doesn't affect the readability and maintainability of the code.

https://bugzilla.gnome.org/show_bug.cgi?id=731442


Second, it improves the performance by about 30% using SIMD instructions which is the hot issue of this thread.
I agree that the implemented assembly code affects the readability and maintainability because of the dependency for ARM architecture.
but, I think it is also needed because the performance and effectiveness are important in resource-constraint environments, that is, embedded system. 
I want to discuss this issue in this thread continually.


Thanks.

Comment 16 Sungho Bae 2014-06-18 05:57:48 UTC

Review of attachment 277253 [details] [review]:

I split this patch into two parts: First is pointer access optimization and second is neon optimization.
So, I'll re-upload the modified patch.

Comment 17 Sungho Bae 2014-06-18 06:43:20 UTC

Created attachment 278655 [details]
Doucument for SIMD-based Advanced Byte Scan Algorithm.txt

(In reply to comment #4)
> Review of attachment 277253 [details] [review]:
> 
> Would it be possible to document here you're algorithm. The skip mark being on
> third byte, start code not being aligned, I have a hard time to understand
> where you can comup with multiple of 4 skips.

I attached the document to explain the proposed algorithm.
If you have any questions, please let me know it.

Thanks.

Comment 18 Sungho Bae 2014-06-18 11:36:18 UTC

Created attachment 278669 [details] [review]
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-Codes.patch

I attached the new patch which implements only SIMD-based byte scanning optimization to find start-codes.

The detail algorithm of this patch is described in 'Document for SIMD-based Advanced Byte Scan Algorithm.txt'

Please review the patch.

Comment 19 Sungho Bae 2014-06-18 11:39:27 UTC

Created attachment 278670 [details] [review]
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-C.patch

Comment 20 Sungho Bae 2014-07-01 11:56:51 UTC

Created attachment 279670 [details] [review]
0001-bytereader-SIMD-based-Advanced-Byte-Scan-for-Start-Code.patch

I consider how to merge the optimization without disturbing the policy for maintainability.

So, I suggest that the optimization can be enabled when 'configure' tool is performed.
Many architectures such as ARM, x86 have SIMD extensions.
The implementation focuses on ARM architecture with NEON Adv, even if the algorithm can be implemented on other architectures such as x86.
The implementations for other architectures remains for future works.

I attached the new patches which implements SIMD-based byte scanning optimization on ARM and configurations to enable the optimization.

Please review my patches.
If you have a question or comment, let me know it.

Thanks.

Comment 21 Sungho Bae 2014-07-01 12:00:14 UTC

Created attachment 279671 [details] [review]
0002-configure-add-to-check-if-simd-optimization-is-used-or-not.patch

This patch is about configuration before compiling.

If you want to enable SIMD-base optimizations,
you should run configure tool with '--with-simd=yes' option.

Comment 22 Sungho Bae 2014-07-09 05:16:01 UTC

Review of attachment 279671 [details] [review]:

Because '-mfloat-abi=softfp' option is not essential for neon extensions, the option should be removed from configuration related to neon extensions.

::: configure.ac
@@ +646,3 @@
+      AC_COMPILE_CHECK_NEON_SUPPORT_IFELSE(
+        [AC_MSG_RESULT([yes])
+

'-mfloat-abi=softfp' option is not essential for neon extensions

::: m4/simd_check.m4
@@ +5,3 @@
+  ac_compiler_support_neon=no
+  ac_saved_CFLAGS="$CFLAGS"
+  CFLAGS="-c -mfpu=neon -mfloat-abi=softfp"

'-mfloat-abi=softfp' option is not essential for neon extensions

Comment 23 Sungho Bae 2014-07-09 06:25:46 UTC

Created attachment 280207 [details] [review]
0002-configure-add-to-check-if-simd-optimization-is-used-or-not-v1.patch

'-mfloat-abi=softfp' option is not essential for neon extensions.
So, I removed '-mfloat-abi=softfp' option from the patch.

Please review it.

Comment 24 GStreamer system administrator 2018-11-03 12:20:52 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer/issues/59.