Bug 709826 - mpegtsmux: performance issue
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gst-plugins-bad
Version: git master
OS/Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: git master
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Reported: 2013-10-10 14:24 UTC by Edward Hervey
Modified: 2018-11-03 13:18 UTC


Attachments
callgrind graph of mpegtsmux usage (162.29 KB, image/png), 2013-10-10 14:24 UTC, Edward Hervey
Optimization of stuffing bytesd (879 bytes, patch), 2013-10-12 12:05 UTC, Jesper Larsen
Use buffer pool for mpegts packets (1.93 KB, patch), 2015-11-18 09:42 UTC, Alexander Vasiljev
Use buffer pool for mpegts packets 2 (2.70 KB, patch), 2015-11-18 14:57 UTC, Alexander Vasiljev
Use buffer pool for mpegts packets 3 (3.08 KB, patch), 2015-11-18 17:51 UTC, Alexander Vasiljev [needs-work]
tsmux_write_stream_packet can write larger buffer according to alignment (3.57 KB, patch), 2016-02-16 10:42 UTC, Alexander Vasiljev [needs-work]
tsmux: use memset to write stuffing bytes (865 bytes, patch), 2016-03-07 12:58 UTC, Jesper Larsen
[PATCH 1/3] mpegtsmux: Use bufferpool to allocate packets (13.22 KB, patch), 2016-09-08 12:46 UTC, Jesper Larsen
[PATCH 2/3] mpegtsmux: Remove output adapter (7.56 KB, patch), 2016-09-08 12:48 UTC, Jesper Larsen
[PATCH 3/3] mpegtsmux: Only map outgoing buffers when needed (2.39 KB, patch), 2016-09-08 12:52 UTC, Jesper Larsen
[PATCH 3/3 v2] mpegtsmux: Only map outgoing buffers when needed (2.68 KB, patch), 2016-09-26 09:09 UTC, Jesper Larsen

Description Edward Hervey 2013-10-10 14:24:51 UTC
Created attachment 256917 [details]
callgrind graph of mpegtsmux usage

mpegtsmux does a lot of allocation, mapping and adapter usage.

Ideally it should introduce very little overhead (it doesn't even need to read the input data, just mux it).

Right now it uses ~80,000 CPU instructions per incoming buffer.
Comment 1 Jesper Larsen 2013-10-12 12:05:01 UTC
Created attachment 257098 [details] [review]
Optimization of stuffing bytesd

It does seem like there should be room for some optimization. Do we have an efficient muxer that we can use for inspiration?

Meanwhile, looking at the callgrind output drew my attention to how stuffing bytes are written.

This patch uses memset to set all stuffing bytes to 0xFF instead of using a loop to set them byte-by-byte.
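A minimal sketch of the idea (variable names are assumed to match the tsmux packet-writing code, not the exact patch):

  /* Fill the rest of the packet with 0xFF stuffing bytes in one call,
   * instead of looping byte-by-byte. pos is how many bytes have been
   * written so far, min_length is the required packet length. */
  if (pos < min_length) {
    memset (buf + pos, 0xff, min_length - pos);
    pos = min_length;
  }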
Comment 2 Edward Hervey 2013-10-12 17:25:29 UTC
(In reply to comment #1)
> Created an attachment (id=257098) [details] [review]
> Optimization of stuffing bytesd
> 
> It does seem like there should be room for some optimization. Do we have a
> efficient muxer that we can use for inspiration?

  The problem is due to the very nature of mpeg-ts. In most container formats you can pass the incoming (audio/video) buffer as-is downstream and you "just" need to put headers and other things before/after.
  With mpeg-ts, you need to break it down into 188 byte chunks.

  The main performance issue is that we're shuffling around buffers in adapters, doing (expensive) copies, mallocs and so forth. We shouldn't do that. We should re-use the input memory as-is (i.e. without reading/copying it) and just put the required headers before/after.


> 
> Meanwhile, looking at the callgrind output brought my attention to how stuffing
> bytes are written.
> 
> This patch uses memset to set all stuffing bytes to 0xFF instead of using a
> loop to set them byte-by-byte.

  That patch is correct ... but you'll notice that the function you optimized doesn't even appear in the callgrind graph, so it's negligible :)
Comment 3 Edward Hervey 2013-10-13 06:11:30 UTC
I'm going to tackle this one on the plane today
Comment 4 Edward Hervey 2013-10-13 06:43:29 UTC
OMG ... OMG OMG OMG OMG

We are still internally creating 188-byte buffers (that alone is 22% of the muxing)... and then mapping/unmapping them individually (18%)... and then pushing them into an adapter to extract bigger buffers (12%)...

Might require some kind of custom ringbuffer-based memory/buffer pool for output buffers (write into that, grab buffers of the size you want out of it).
Comment 5 Jesper Larsen 2013-10-14 06:44:54 UTC
(In reply to comment #2)
> (In reply to comment #1)
> 
> > 
> > Meanwhile, looking at the callgrind output brought my attention to how stuffing
> > bytes are written.
> > 
> > This patch uses memset to set all stuffing bytes to 0xFF instead of using a
> > loop to set them byte-by-byte.
> 
>   That patch is correct ... but you'll notice that function you optimized
> doesn't even appear in the callgrind graph, so it's negligeable :)

I won't argue with that. ;)

It appeared in my callgrind graph (about 5%) because I accidentally used buffers without timestamps. This made the muxer write a PMT/PAT for every buffer, meaning a lot of stuffing bytes.

So in a situation where you write a stream containing a substantial amount of stuffing bytes, the difference is no longer negligible.

Anyway, I thought it was silly to just ignore the optimization, however small it is, when the readability of the code is not affected (IMO).
Comment 6 Edward Hervey 2013-10-14 16:37:22 UTC
Good catch. I'll add it once I'm done with the refactoring.

Essentially I'm getting rid of the "incoming" adapter and just pre-allocating one output buffer for the data to process. That should shave off 60%+ of the overhead.
Comment 7 Edward Hervey 2014-01-27 16:14:39 UTC
Seems like I lost the WIP patch *sigh*.

But essentially we'd need to do the following (in that order):
* Get rid of the adapter; it's pointless overhead
* Instead, don't create small output buffers, but create a big-ish output buffer
* Actually interleave input data for the output (based on timestamp and duration).
Comment 8 Jesper Larsen 2014-07-21 11:22:59 UTC
I have some time on my hands, so I took a peek at this one again.

Right now the mux works as follows:

 - A buffer is received through collectpads; this buffer is mapped and appended to an internal list.
 - A buffer for a single packet (188/192 bytes) is allocated, and payload is memcpy'ed from the internal list.
 - The buffer is pushed to an output GstAdapter.
 - If the alignment property is set, a multiple of align packets are read from the adapter and pushed on the src pad. If the alignment property is not set, all data in the adapter is pushed.

I'm working on a preliminary change that does something like this:

 - During initialisation, a buffer pool is created, and 3 buffers of size 188/192 * align are pre-allocated. Default alignment is 32 packets.
 - Buffers received through collectpads are still mapped and appended to the internal list
 - The muxer writes headers and memcpys payload from the internal list directly into a mapped buffer from the buffer pool
 - Once the outgoing buffer is full, or the incoming buffer is starved, the output buffer is unmapped and pushed on the src pad (rough sketch below)
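A rough sketch of that flow (hypothetical helper names, not the actual WIP code):

  GstBuffer *out;
  GstMapInfo map;
  gsize offset = 0;
  const gsize packet_size = 188;        /* 192 in m2ts mode */

  /* Grab a pre-allocated, alignment-sized buffer from the pool and map it once. */
  gst_buffer_pool_acquire_buffer (mux->pool, &out, NULL);
  gst_buffer_map (out, &map, GST_MAP_WRITE);

  /* Write TS packets straight into the mapped buffer until it is full or
   * the queued input payload runs dry (both helpers are hypothetical). */
  while (offset + packet_size <= map.size && have_queued_payload (mux)) {
    write_one_ts_packet (mux, map.data + offset);
    offset += packet_size;
  }

  gst_buffer_unmap (out, &map);
  gst_buffer_set_size (out, offset);
  gst_pad_push (mux->srcpad, out);      /* flow return handling omitted */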

A side effect is that the alignment property will no longer give you output buffers that are a multiple of the alignment, but rather exactly the alignment or less. IMO this is what the alignment property is expected to do, and it removes the need for chopmydata-like hacks to send output through udpsink (discussed here https://bugzilla.gnome.org/show_bug.cgi?id=722129).

I wanted to avoid the memcpy of the payload, since there is no need to read/write the actual payload data. However, we do need to allocate memory for the headers, and that means we need to make a buffer that contains multiple memories. As far as I can tell, the data in a GstBuffer containing multiple GstMemory will be merged (i.e. deep copied) if it is mapped, which means that the need to memcpy the payload is not removed. Please tell me that I'm ignorant and wrong in this regard!

I'm testing by muxing a single 1080p h264 stream (i.e. filesrc ! h264parse ! mpegtsmux ! fakesink).

The changes reduce the instruction count per incoming buffer by 17% - 80%. The 17% reduction is seen when alignment is set to 1 packet. The 80% reduction is with the default alignment of 32 packets. Alignment of 7 packets (suitable for UDP packets with an MTU of 1500 bytes) shows a ~73% reduction.

The output from the test is 100% binary identical with output made with current master.

There is still a lot of work to do to actually support everything as before, but it should not be things that will reduce the performance significantly. I'm not attaching a patch since it is very much WIP, but I wanted to share my findings.
Comment 9 Tim-Philipp Müller 2014-07-21 11:31:02 UTC
> I wanted to avoid the memcpy of payload, since there is no need to read/write
> the actual payload data. However, we do need to allocate memory for the
> headers, and that means we need to make a buffer that contains multiple
> memories. As far as I can tell, the data in a GstBuffer containing multiple
> GstMemory will be merged (i.e deep copied) if it is mapped, which means that
> the need to memcpy the payload is not removed. Please tell me that I'm ignorant
> and wrong in this regard!

This is somewhat correct. If someone does gst_buffer_map() on a buffer with multiple memories, those may have to be merged into one contiguous chunk of memory. However, it will depend on the next element whether it does gst_buffer_map() or gst_memory_map() for all individual memories. Elements such as filesink or udpsink should be able to write the memories individually using scatter/gather type API without merging the memories.
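As an illustration (assumed downstream-element usage for a received buffer buf, not code from any of these patches), a multi-memory buffer can be consumed without merging by mapping each GstMemory individually:

  guint i, n = gst_buffer_n_memory (buf);

  for (i = 0; i < n; i++) {
    GstMemory *mem = gst_buffer_peek_memory (buf, i);
    GstMapInfo info;

    if (gst_memory_map (mem, &info, GST_MAP_READ)) {
      /* e.g. collect info.data / info.size into an iovec array for writev() */
      gst_memory_unmap (mem, &info);
    }
  }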
Comment 10 Edward Hervey 2014-07-21 11:53:00 UTC
(In reply to comment #8)
> I have some time on my hands, so I took a peak at this one again.

  Yay \o/

> I'm working on a preliminary change that does something like this
> 
>  - During initialisation, a buffer pool is created, and 3 buffers of size
> 188/192 * align is pre-allocated. Default alignment is 32 packets.
>  - Buffers received through collectpads are still mapped and appended to the
> internal list
>  - The muxer writes headers and memcpy payload from the internal list directly
> into a mapped buffer from the buffer pool
>  - Once the outgoing buffer is full, or the incoming buffer is starved, the
> output buffer is unmapped and pushed on the src pad
> 
> A side effect is that the alignment property will no longer give you output
> buffers that are a multiple of the alignment, but rather exactly the alignment
> or less. IMO this is what the alignment property is expected to do, and it
> removes the need to do chopmydata like hacks to send output through udpsrc
> (discussed here https://bugzilla.gnome.org/show_bug.cgi?id=722129).

  Yes, that is indeed the requested behaviour.

  An interesting thing to take into account while doing this refactoring would be to:
  * move up the tsmux lib code to the mpegtsmux element itself. The split is no longer needed.
  * Figure out a better way of interleaving the incoming buffers. AFAIR, right now it muxes full buffers, instead of spreading them out evenly in the outgoing buffer. That would require "driving" the collection system the other way round (instead of pushing buffers, you have the outgoing system deciding what it wants to pick).

> 
> I wanted to avoid the memcpy of payload, since there is no need to read/write
> the actual payload data. However, we do need to allocate memory for the
> headers, and that means we need to make a buffer that contains multiple
> memories. As far as I can tell, the data in a GstBuffer containing multiple
> GstMemory will be merged (i.e deep copied) if it is mapped, which means that
> the need to memcpy the payload is not removed. Please tell me that I'm ignorant
> and wrong in this regard!

  It will actually work quite nicely if you don't memcpy, as Tim mentioned.

> There is still a lot of work to do, to actually support everything as before.
> It should not be things that will reduce the performance significantly though.
> I'm not attaching a patch since it is very much WIP, but I wanted to share my
> findings.

  Nice work.
Comment 11 chrisf 2015-07-16 21:36:24 UTC
Hey all. I know it has been a long time since anyone has posted on this bug, but I'm wondering if anyone has made any changes they could potentially push my way.

I was the one that originally asked to have this bug opened, back in 2013, in this forum post: http://gstreamer-devel.966125.n4.nabble.com/mpegtsmux-at-high-bitrates-td4662473.html

I got put onto other items, but this has actually recently come up again for me. I don't know if I'll have the time to tackle this myself, but I am curious if anyone has made progress limiting the memcpys so I don't have to reinvent the wheel here.

Thanks!
Comment 12 Jesper Larsen 2015-07-17 09:46:06 UTC
I lost the extra time I had on my hands, so I didn't get to prepare anything that was ready to be shared.

When you scratch the surface of the muxer, you will see that a lot of refactoring is needed, and I didn't find a good way to do it in "small" chunks.

As Edward mentioned earlier, the muxer should be more in control, and actively pick what to mux, instead of just grabbing full buffers on the sinks.

This could ultimately open the door to more decent "live" performance as well: pushing PATs, PMTs, PCR, what have you, based on the pipeline time instead of incoming buffers.
Comment 13 Alexander Vasiljev 2015-11-18 09:42:19 UTC
Created attachment 315811 [details] [review]
Use buffer pool for mpegts packets

I can suggest switching to a buffer pool. MPEG-TS packets are all the same size, so we can use a pool for them.
Comment 14 Nicolas Dufresne (ndufresne) 2015-11-18 13:58:47 UTC
Review of attachment 315811 [details] [review]:

This is of course not a bad idea (the implementation needs work though). It might not solve all the problems. Dealing with 188-byte buffers is a disaster performance-wise. The muxer should group the buffers into logical groups, or towards something near an MTU.

::: gst/mpegtsmux/mpegtsmux.c
@@ +322,3 @@
+  gst_buffer_pool_config_set_params (config, NULL, 188, 100, 1000);
+  gst_buffer_pool_set_config (mux->pool, config);
+  gst_buffer_pool_set_active (mux->pool, TRUE);

Cool, have you considered it might be useful to deactivate and free this pool? What about downstream providing a SHM allocator? What about downstream proposing a usable pool? (That's rare for this kind of format, though.)

@@ +1676,3 @@
+    gst_buffer_set_size (buf, NORMAL_TS_PACKET_LENGTH);
+  } else {
+//     buf = gst_buffer_new_and_alloc (NORMAL_TS_PACKET_LENGTH);

No need to explain.
Comment 15 Alexander Vasiljev 2015-11-18 14:57:22 UTC
Created attachment 315831 [details] [review]
Use buffer pool for mpegts packets 2

OK. I've remade it and added cleanup.
It doesn't solve all the problems, but with this patch the CPU usage of the pipeline (v4l2 -> h264enc -> mpegts -> udp) dropped from 60% to 50% on the ARM CPU, so I think it is worth applying. Anyway, someone may find it helpful.
Comment 16 Nicolas Dufresne (ndufresne) 2015-11-18 16:39:03 UTC
Hmm, ok, the remaining 50% is probably mostly due to splitting/copying the input into the output (as Edward described). I think we can try and do better; of course, let's see how much time we find and decide whether or not to merge this intermediate step.
Comment 17 Nicolas Dufresne (ndufresne) 2015-11-18 16:43:59 UTC
Review of attachment 315831 [details] [review]:

::: gst/mpegtsmux/mpegtsmux.c
@@ +372,3 @@
   GstBuffer *buf;
   GSList *walk;
+  GstStructure *poolConfig;

Coding style: use pool_config (or just config as everywhere else).

@@ +421,3 @@
+  poolConfig = gst_buffer_pool_get_config (mux->pool);
+  gst_buffer_pool_set_active (mux->pool, FALSE);
+  gst_buffer_pool_config_set_params (poolConfig, NULL, 188, 100, 0);

That is a bit of a waste of time. You free it only to reallocate the exact same thing. IMHO, I'd let the pool grow by itself, and use a start size of about a standard MTU.

You should probably deactivate when going to READY state, and activate when going to PAUSED state instead.
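Something along these lines (a sketch assuming the element's change_state vfunc and the usual parent_class chaining, not code from the patch):

static GstStateChangeReturn
mpegtsmux_change_state (GstElement * element, GstStateChange transition)
{
  MpegTsMux *mux = (MpegTsMux *) element;

  switch (transition) {
    case GST_STATE_CHANGE_READY_TO_PAUSED:
      /* start handing out pooled buffers when we are about to process data */
      gst_buffer_pool_set_active (mux->pool, TRUE);
      break;
    case GST_STATE_CHANGE_PAUSED_TO_READY:
      /* release the pooled buffers when going back down */
      gst_buffer_pool_set_active (mux->pool, FALSE);
      break;
    default:
      break;
  }

  return GST_ELEMENT_CLASS (parent_class)->change_state (element, transition);
}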
Comment 18 Alexander Vasiljev 2015-11-18 17:51:31 UTC
Created attachment 315849 [details] [review]
Use buffer pool for mpegts packets 3

I hope all is well now.
Comment 19 Alexander Vasiljev 2016-02-16 10:42:49 UTC
Created attachment 321353 [details] [review]
tsmux_write_stream_packet can write larger buffer according to alignment

The performance can be significantly increased if several mpegts packets can be created in one buffer.
1) tsmux_write_stream_packet fills the whole buffer with mpegts packets.
2) The size of the buffer is alignment * 188. This is set in the buffer pool config.
3) tsmux_section_write_packet still sends single packets.
Comment 20 Alexander Vasiljev 2016-02-16 11:05:28 UTC
The large buffers patch should be applied over the buffer pool patch.
Comment 21 Alexander Vasiljev 2016-03-04 17:38:47 UTC
About performance: I have a DM368 processor from Texas Instruments. It can do H.264 hardware encoding. It is an ARM core running at 430 MHz.
Here is the gst-launch command:

gst-launch-1.0 v4l2src input-src=Camera ! videorate drop-only=true  ! 'video/x-raw, format=(string)NV12, width=1920, height=1088, framerate=25/1' ! ce_h264enc  target-bitrate=6000000 max-bitrate=7000000 encoding-preset=2 rate-control=5 encquality=3 bytestream=1 idrinterval=25 rcalgo=5 mealgo=true intraframe-interval=60 ! queue ! perf print-arm-load=true ! mpegtsmux  alignment=7 ! udpsink host=192.168.0.1 port=3000 -v

ce_h264enc is the hardware encoder, so the main load on the processor comes from mpegtsmux and udpsink.

Without these patches, CPU (ARM) load is 90-100% at target-bitrate=6000000 (6 Mbit/s).
With these patches, CPU load is 25-30% at target-bitrate=6000000 (6 Mbit/s), and
CPU load is 90-100% at target-bitrate=50000000 (50 Mbit/s).
So these patches are worthwhile.
Comment 22 Edward Hervey 2016-03-07 08:25:47 UTC
Can you check with the default alignment setting (-1)? You might be setting negative values in the buffer pool config. Does make check pass?
Comment 23 Jesper Larsen 2016-03-07 08:35:47 UTC
Great to see some patches of a manageable size that will improve the performance; it is surely needed.

Just a little heads-up. If I'm not mistaken, these patches will break support for m2ts-mode. In m2ts-mode (Blu-Ray) packets are 192 bytes instead of the normal 188 bytes.
Comment 24 Edward Hervey 2016-03-07 11:26:33 UTC
Review of attachment 257098 [details] [review]:

::: gst/mpegtsmux/tsmux/tsmux.c
@@ +647,3 @@
+  if (pos < min_length) {
+    memset (buf + pos, 0xff, min_length - pos);
+    pos += min_length - pos;

hmm... so...
pos = pos + min_length - pos;

...

how about just pos = min_length :D
Comment 25 Edward Hervey 2016-03-07 11:28:44 UTC
Review of attachment 315849 [details] [review]:

::: gst/mpegtsmux/mpegtsmux.c
@@ +322,3 @@
+
+  pool_config = gst_buffer_pool_get_config (mux->pool);
+  gst_buffer_pool_config_set_params (pool_config, NULL, 188, 7, 0);

As mentioned in other comments, don't hardcode 188, but instead use the needed size.

Also, make sure it's reconfigured when the setting changes (it might have a default of 188, but later be reconfigured to 192 or 204)

Like that you can also use the pool for other output sizes.
Comment 26 Edward Hervey 2016-03-07 11:29:26 UTC
Review of attachment 321353 [details] [review]:

::: gst/mpegtsmux/mpegtsmux.c
@@ +1758,3 @@
+      pool_config = gst_buffer_pool_get_config (mux->pool);
+      gst_buffer_pool_config_set_params (pool_config, NULL,
+          188 * mux->alignment, 7, 0);

The default for mux->alignment is -1 ... you might have issues :D
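A hypothetical guard for this (names taken from the hunk above) would clamp the value before sizing the pool buffers:

  /* mux->alignment defaults to -1, so clamp it before computing the pool buffer size */
  gint align = mux->alignment > 0 ? mux->alignment : 1;

  pool_config = gst_buffer_pool_get_config (mux->pool);
  gst_buffer_pool_config_set_params (pool_config, NULL, 188 * align, 7, 0);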
Comment 27 Jesper Larsen 2016-03-07 12:58:26 UTC
Created attachment 323263 [details] [review]
tsmux: use memset to write stuffing bytes

Reduced complexity by applying some very advanced arithmetic.
Comment 28 Alexander Vasiljev 2016-03-09 07:57:30 UTC
It doesn't break m2ts-mode, because in this mode we use gst_buffer_new_and_alloc. So this mode is just not optimal.

static void
alloc_packet_cb (GstBuffer ** _buf, void *user_data)
{
  MpegTsMux *mux = (MpegTsMux *) user_data;
  GstBuffer *buf;
  gint offset = 0;

  if (mux->m2ts_mode == TRUE) {
    offset = 4;

    buf = gst_buffer_new_and_alloc (NORMAL_TS_PACKET_LENGTH + offset);
    gst_buffer_set_size (buf, NORMAL_TS_PACKET_LENGTH);
  } else {
    gst_buffer_pool_acquire_buffer (mux->pool, &buf, NULL);
  }
  *_buf = buf;
}
Comment 29 Alexander Vasiljev 2016-03-09 08:04:52 UTC
::: gst/mpegtsmux/mpegtsmux.c
@@ +1758,3 @@
+      pool_config = gst_buffer_pool_get_config (mux->pool);
+      gst_buffer_pool_config_set_params (pool_config, NULL,
+          188 * mux->alignment, 7, 0);

If alignment is -1, it only affects the minimum number of preallocated buffers. It doesn't hurt.
In gstbufferpool.c you can see this code 

  /* we need to prealloc buffers */
  for (i = 0; i < priv->min_buffers; i++) {
    GstBuffer *buffer;

So it will just not preallocate any buffer.
Comment 30 Alexander Vasiljev 2016-03-09 08:12:24 UTC
I was mistaken about the alignment. It surely needs a check for negative values.
Comment 31 Jesper Larsen 2016-09-08 12:46:13 UTC
I had a look at this again, since I needed better performance. I have extended Alexander's buffer pool approach and found some other small (diff-wise) improvements.

I will attach 3 patches now, which together with the small memset patch show a great deal of improvement on my system.

The muxer still does a memcpy on the payload data, which as discussed earlier is ridiculous. I tried to fix this, but it requires a bit more refactoring than I have time for right now.
Comment 32 Jesper Larsen 2016-09-08 12:46:32 UTC
Created attachment 335102 [details] [review]
[PATCH 1/3] mpegtsmux: Use bufferpool to allocate packets

A buffer pool approach that works for both 188 and 192 byte packets.

It changes the way we write 192-byte packets as well. Instead of pretending we have a normal 188-byte packet and doing a memmove at the end, we now write directly into the buffer at the correct place.

In fact, I think m2ts mode is broken at the moment because of the way PSI sections are written. Should be fixed up with this commit.

Alexander's approach appended NULL packets in case we could not fill the buffer up to the alignment. This approach will simply set the size of the buffer to match what we have, and push without padding. I do not know what we want here. Should alignment mean "exactly this many packets in one buffer", or should it mean "no more than this many packets in one buffer"? Maybe we need to add a property to discern between these two cases?
Comment 33 Jesper Larsen 2016-09-08 12:48:38 UTC
Created attachment 335103 [details] [review]
[PATCH 2/3] mpegtsmux: Remove output adapter

Previously, output packets were gathered in an adapter in order to make buffers with the proper alignment. With the buffer pool in place, we no longer need the adapter.
Comment 34 Jesper Larsen 2016-09-08 12:52:03 UTC
Created attachment 335104 [details] [review]
[PATCH 3/3] mpegtsmux: Only map outgoing buffers when needed

All output buffers were being mapped in order to look for a PAT/PMT for stream headers. However, we only look for the tables until we have sent the stream headers.

We should only map the buffers if we actually use the memory.
Comment 35 Jesper Larsen 2016-09-26 09:09:01 UTC
Created attachment 336251 [details] [review]
[PATCH 3/3 v2] mpegtsmux: Only map outgoing buffers when needed

Cleanup of some code paths we cannot reach.
Comment 36 GStreamer system administrator 2018-11-03 13:18:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/issues/112.