GNOME Bugzilla – Bug 620323
mpegaudioparse: Add support for LAME tags and adjust segments based on the padding information from it
Last modified: 2018-11-03 14:42:23 UTC
Currently, GStreamer is ignoring the LAME tags in MP3s during gapless playback. These tags contain information about the padding at the start and end of the files, which should be skipped to get perfect gapless playback of these files. Without using these tags, there'll always be a tiny silence or glitch between songs. Here's an example of the LAME tag in a file which should play back gaplessly:

$ eyeD3 --lametag Track01.mp3
Track01.mp3 [ 3.77 MB ]
Encoder Version   : LAME3.98r
LAME Tag Revision : 0
VBR Method        : Variable Bitrate method2 (mtrh)
Lowpass Filter    : 18500
Radio Replay Gain : -2.0 dB (Set automatically)
Encoding Flags    : --nspsytune --nssafejoint
ATH Type          : 4
Bitrate (Minimum) : 192
Encoder Delay     : 576 samples
Encoder Padding   : 960 samples
Noise Shaping     : 1
Stereo Mode       : Joint
Unwise Settings   : False
Sample Frequency  : 44.1 kHz
MP3 Gain          : 0 (+0.0 dB)
Preset            : V2
Surround Info     : None
Music Length      : 3.77 MB
Music CRC-16      : 07F9
LAME Tag CRC-16   : E5BF

The encoder delay and encoder padding fields describe how many samples were added at the start and end of the audio to complete the frames. Some more information is available on this page: http://gabriel.mp3-tech.org/mp3infotag.html
Yes, this would definitely be nice if it was implemented :) Shouldn't be too much work to add support for this in the mad plugin.
I think there's a bug for this elsewhere already too. Ideally we shouldn't have to implement this for each decoder individually.
As far as I know these tags apply specifically to MP3 files only, as a way of getting around limitations in the format. I had a bit of a look in the mad source, and it looks like the delay and padding are stored in tag->lame.start_delay and tag->lame.end_padding. xmms uses these tags to play the files gaplessly by:

data->samples_to_skip = lame->start_delay;
data->samples_to_play = ((guint64) xmms_xing_get_frames (data->xing) * 1152ULL)
    - lame->start_delay - lame->end_padding;
> Shouldn't be too much work to add support for this in the mad plugin

Shouldn't this rather be implemented in mpegaudioparse instead?
As these tags say how many samples should be dropped from the first and last frame, this has to be implemented in the decoders somewhere. Or: extract them in mpegaudioparse and then find some way to forward this information downstream (hm, would it be possible to drop these samples with a carefully created newsegment event? Probably).
I was assuming that the samples to be dropped are fewer than one frame, so maybe the parser could tweak the first newsegment event (that should be quite easy anyway, shouldn't it?) and send an update for the last one before pushing out the last frame or so.
Yes, I guess adding this to the segments of the parser will work (and yes, it's less than one frame). Who wants to add lame tag parsing and the segment adjustments to mpegaudioparse? :)
Trying it now
mpegaudioparse hangs out in -good these days.
It seems mpegaudioparse can now parse the lame tag / encoder delay. What would be needed to get this working?
I have been developing a plugin for MP3 playback using mpg123 to decode the frames, with GstAudioDecoder as the base class, and relied on mpegaudioparse for seeking. This has worked well, except for the gapless part, which has turned out to be fairly difficult to achieve. I have tried out several ideas, but they all fail when seeking occurs.

One big problem I have encountered is that the number of padding samples indicated by the LAME tag sometimes exceeds the frame size. Example: frame size 1152 samples, encoder padding 1941 samples. I looked at the mpg123 code, which can do gapless internally (I will refer to that later), and what it apparently does is estimate the total length of the MP3 and subtract the padding length from that. Once decoding reaches the calculated length, playback stops. However, I could be wrong, and maybe I misinterpreted the number 1941.

Either way, one possibility would be to somehow truncate the start and end of the segment, as Sebastian Dröge suggested above. That leaves one question open, though: do sinks that have their sync property set to FALSE still respect the start/end times of the segment? If not, the gap would return with sync=FALSE.
Sorry, I accidentally pressed the commit button too early :) Continuing here:

If I am wrong, and the LAME padding is always shorter than one frame, then perhaps mpegaudioparse could send the following in-band events downstream to help the decoder:

(1) an event containing the delay and padding information from the LAME tag,
(2) an event informing downstream that the frame that is about to be pushed is the first one,
(3) an event informing downstream that the frame that is about to be pushed is the last one.

(2) and (3) should respect seeking; for example, if I seek to the beginning of the song, (2) should be sent again. This way, the decoder could decode the frame as usual and then, with the information from (1), throw away the first N samples if it is the first frame, or shorten its duration if it is the last frame. In a pre_push() function, the timestamps could then be corrected by the length of the delay. If the padding can be more than one frame, then I don't know what could be done.

As for mpg123, it does have internal support for gapless. If enabled, it automatically throws away the null samples. However, that does not work well with seeking using GstAudioDecoder, because every time I seek, a flush event is generated, and as a consequence I have to reset mpg123 (= recreate the decoder handle). It loses all gapless information. This is because mpg123 is being used without letting it handle seeking itself - mpegaudioparse is doing that now. So mpg123 is oblivious to the seeking. Possible solutions would be to either feed the gapless information back into mpg123 (requiring modifications to the mpg123 library), or rewrite the plugin from scratch using GstElement as the base class (and that is something I really don't want to do unless it is absolutely necessary). Therefore, I don't think we could use mpg123's internal gapless support, or any decoder's internal support, for that matter.
Related bug: https://bugzilla.gnome.org/show_bug.cgi?id=704520
Still applies in 1.x
*** Bug 719867 has been marked as a duplicate of this bug. ***
Created attachment 298715 [details] [review]
Patch for mpegaudioparse to exclude padding samples from segment

With this patch, gapless playback can be achieved. It adjusts the start, stop, and duration fields of the segment before a segment event is sent downstream. This way, padding samples will be clipped. Depends on the patch in Bug 740196.
Review of attachment 298715 [details] [review]:

::: gst/audioparsers/gstmpegaudioparse.c
@@ +1139,3 @@
+   * calculations, because otherwise, the base parser's duration would be set
+   * twice (here and in gst_mpeg_audio_parse_adjust_segment) */
+  if (((mp3parse->encoder_delay == 0) && (mp3parse->encoder_padding == 0)) || (mp3parse->rate != -1)) {

Maybe just have a single function that sets these things?

@@ +1501,3 @@
+    segment->start = delay_in_ns;
+    segment->stop = duration - padding_in_ns;
+    segment->duration = duration - delay_in_ns - padding_in_ns;

This should maybe also adjust the segment.time and segment.base. You want to drop samples, but should the first non-dropped sample have running/stream time 0 or N?
"Maybe just have a single function that sets these things?"

I don't understand. What would you group into a function? The places which call gst_base_parse_set_duration() differ considerably. I do not see how they could be combined into one function.
Created attachment 299214 [details] [review]
Patch for mpegaudioparse to exclude padding samples from segment, v2

Decoder delay was not being applied to the padding value properly, and the segment base and time fields were not set.
Created attachment 300237 [details] [review]
Patch for mpegaudioparse to exclude padding samples from segment, v3

Patch updated to make use of the gst_base_parse_set_padding function in baseparse.
Any news on this? It's a feature that's really needed...
Created attachment 310584 [details] [review]
Drop first frame if it's a Xing or VBRI info tag

Here is a patch that drops Xing/VBRI info tags. Without this patch, the frame containing the info tag is "played", producing a short gap of zeroes in front of the actual audio data. The patch depends on attachment 300237 [details] [review].
Created attachment 310586 [details] [review]
Fix delay samples computation

Here is another patch which avoids an extra zero sample being played after the end of an MP3. I do not really know why this "+1" is necessary, but the output of the multifilesink element clearly gave me an extra 0 sample at the end. You can actually hear it as a click when playing a sine sample in a gapless loop, and there is no click with this patch applied. The patch depends on attachment 300237 [details] [review].
bump for this - I'm reading on LWN about all the cool things happening with GStreamer... but can't we have something basic like this patch applied? I use a non-GStreamer audio player (Audacious, which uses mpg123) to play back music only because of this issue.
It's not a simple patch, and there were discussions about how to implement it properly. For the Opus codec it is now implemented; for a legacy codec like MP3, someone will have to implement it similarly now. That is, doing these kinds of things via segment adjustments is not ideal, and there is now a GstMeta (GstAudioClippingMeta) that parsers can put on the encoded buffers for this and that decoders should use to remove the additional samples. Encoders should also put it on their output if they can. Any help with that would be appreciated :)
*** Bug 764267 has been marked as a duplicate of this bug. ***
Looked into GstAudioClippingMeta a bit. To correctly apply the clipping to each frame we need to know in which frame we are relative to the start or end of the stream, and at the moment we can only detect if we are in the first frame (offset=0) or in the last one (offset + bpf == xing_bytes).

A solution would be to buffer frames so that all the delay/padding samples fit in them; if we detect a first/last frame, we can attach the GstAudioClippingMeta to the affected buffered frames. On the mpg123 side we would need to make sure that we can link input frames with decoded ones, but that seems possible with some work. Any idea how to buffer frames in mpegaudioparse/GstBaseParse?

Alternative: only support one frame of delay/padding for now; that would at least reduce the gap. Thoughts?
Since mpegaudioparse knows the total number of frames (it is contained in the Xing header), it could keep track of the current frame (= the frame that contains the current playback position). Also, from the number of frames, mpegaudioparse computes the playback length. By using the LAME tag information and the total number of frames, it becomes possible to figure out the number of the last frame. This takes care of cases where the padding length in the LAME tag exceeds one frame length. Also, this way, it becomes possible to know which one is the last frame without any buffering. Does anybody see problems with this approach?

Of course this is only applicable to MP3 data which is finite, which means no web streams, for example. But I'd argue that LAME tags make little sense for these anyway (well, at least for the last frame).
(In reply to Carlos Rafael Giani from comment #28) > Since mpegaudioparse knows the total number of frames (it is contained in > the Xing header), it could keep track of the current frame (= the frame that > contains the current playback position). How would that work with seeking? > Also, out of the number of frames, > mpegaudioparse computes the playback length. By using the LAME tag > information and the total number of frames, it becomes possible to figure > out the number of the last frame. This takes care of cases where the padding > length in the LAME tag exceeds one frame length. Also, this way, it becomes > possible to know which one is the last frame without any buffering. The problem I see is that we need to know when we are in the second to last frame (we need to attach two GstAudioClippingMeta to the last two frames). With frames having variable frame size we can't guess based on the file offset.
My proposal indeed relies on constant frame sizes. However, from what I've seen in the mpegaudioparse source, the frame size depends on the MPEG version and layer numbers (see gstmpegaudioparse.c:733). Can these change in the middle of a bitstream? With constant frames, seeking is also not really a problem of course, because then it is easily possible to figure out the current frame number from the current position.
> My proposal indeed relies on constant frame sizes. However, from what I've seen in the mpegaudioparse source, the frame size depends on the MPEG version and layer numbers (see gstmpegaudioparse.c:733). Can these change in the middle of a bitstream?

It also depends on the bitrate, and that usually changes every frame. That's why we have Xing/VBRI tags to detect the duration and have a seek table despite variable-size frames. You can see the frame sizes using "GST_DEBUG=mpegaudioparse:5".
Ah, but you are talking about byte-level seeking. I was referring to operations in the time domain. It is my impression that the number of *samples* per frame is constant, but that in VBR MP3s, this fixed number of samples can be encoded with a varying bitrate. So, if downstream requests a time seek, then it should work. This does break down if downstream does a byte seek, yes. There is still the option to constrain the segment, though. So far, this one still seems like the least problematic option to me.
> Ah, but you are talking about byte-level seeking. I was referring to operations in the time domain. It is my impression that the number of *samples* per frame is constant, but that in VBR MP3s, this fixed number of samples can be encoded with a varying bitrate. So, if downstream requests a time seek, then it should work.

Ah, yes, samples are constant. But afaik time-based seeking just uses the Xing seek table and interpolation and thus isn't exact. If we are one frame off we might clip away the real signal.
(In reply to Christoph Reiter (lazka) from comment #33)
> Ah, yes, samples are constant. But afaik time-based seeking just uses the
> Xing seek table and interpolation and thus isn't exact. If we are one frame
> off we might clip away the real signal.

Then I'd stick with the segment-based approach. It works well (it is in use in production here already), does not need any additional buffering, and can clip exactly, because it relies on Xing/LAME/VBRI information. (For example, mpg123 also relies on the duration computed from Xing/LAME/VBRI data for trimming the padding samples, and it works flawlessly.) The baseparse patch needs cleanup, though.
Any news on this feature?
bump! ;-)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/31.