After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 769650 - Use libpsl to implement soup_tld_domain_is_public_suffix() and remove copy of effective_tld_names.dat
Use libpsl to implement soup_tld_domain_is_public_suffix() and remove copy of...
Status: RESOLVED FIXED
Product: libsoup
Classification: Core
Component: Misc
unspecified
Other All
: Normal normal
: ---
Assigned To: libsoup-maint@gnome.bugs
libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2016-08-08 21:16 UTC by Michael Catanzaro
Modified: 2018-04-18 00:23 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Update effective_tld_names.dat (149.15 KB, patch)
2016-08-08 21:16 UTC, Michael Catanzaro
accepted-commit_now Details | Review
soup-tld: use libpsl instead of our own copy of the public suffix list (220.28 KB, patch)
2018-02-22 13:27 UTC, Claudio Saavedra
none Details | Review
soup-tld: use libpsl instead of our own copy of the public suffix list (221.54 KB, patch)
2018-02-22 13:40 UTC, Claudio Saavedra
none Details | Review
soup-tld: use libpsl instead of our own copy of the public suffix list (221.58 KB, patch)
2018-02-23 12:35 UTC, Claudio Saavedra
committed Details | Review

Description Michael Catanzaro 2016-08-08 21:16:24 UTC
From https://publicsuffix.org/list/public_suffix_list.dat

Maybe we should do this more often than once every three years?
Comment 1 Michael Catanzaro 2016-08-08 21:16:27 UTC
Created attachment 332955 [details] [review]
Update effective_tld_names.dat
Comment 2 Dan Winship 2016-08-09 13:55:52 UTC
(In reply to Michael Catanzaro from comment #0)
> Maybe we should do this more often than once every three years?

Yes. Maybe have distcheck blow up if it's out of date?
Comment 3 Michael Catanzaro 2016-08-13 08:05:55 UTC
Or update it in a dist hook using 'wget'
Comment 4 Dan Winship 2016-08-29 16:44:47 UTC
Comment on attachment 332955 [details] [review]
Update effective_tld_names.dat

but leave the bug open
Comment 5 Dan Winship 2017-05-01 18:12:15 UTC
We should probably use libpsl (like curl and wget do)
Comment 6 Michael Catanzaro 2018-02-11 20:41:39 UTC
Claudio, until we have time to convert libsoup to start using libpsl, I recommend updating the public suffix list semi-regularly. The latest version is in the first comment, above.
Comment 7 Claudio Saavedra 2018-02-13 12:42:38 UTC
Thanks for the reminder, this had gone under my radar. I updated it now.
Comment 8 Claudio Saavedra 2018-02-14 08:56:09 UTC
I was checking yesterday a bit if we can use libpsl and started cooking a patch for it. In principle, it works, but the way we deal with TLD is slightly different so some tests break.

- libpsl doesn't support exception TLDs (and just considers them public), but we do deal with those. So either we would have to still handle them manually in libsoup or we would lose this functionality. In a straightforward implementation, breaks a few tests.

- libpsl doesn't give much information on to, for example, why a given hostname doesn't have a registrable domain (or base domain, in libsoup's lingo). Our API provides error information that we cannot get straightforwardly from libpsl. We would have to do that processing ourselves anyway in order to retain the GError information we currently provide. In the same straightforward implementation, this breaks tests (that expect detailed GError info).

I'll see if I can get around those issues so that we can maintain functionality.
Comment 9 Claudio Saavedra 2018-02-14 09:13:23 UTC
(It is also possible to just make libsoup use the public-suffix list shipped with the distribution instead of its own and keep our implementation. Debian ships it in the publicsuffix package, Fedora in the publicsuffix-info package, both lists installed under /usr/share/publicsuffix/. libpsl depends on the publicsuffix package and has API to fetch the location of the distributed file, so that could be used as well.)
Comment 10 Claudio Saavedra 2018-02-22 09:00:41 UTC
(In reply to Claudio Saavedra from comment #8)
> I was checking yesterday a bit if we can use libpsl and started cooking a
> patch for it. In principle, it works, but the way we deal with TLD is
> slightly different so some tests break.
> 
> - libpsl doesn't support exception TLDs (and just considers them public),
> but we do deal with those. So either we would have to still handle them
> manually in libsoup or we would lose this functionality. In a
> straightforward implementation, breaks a few tests.

I fixed this by adding API to libpsl to allow dropping the '*' rule when checking for TLDs. This was merged upstream yesterday and a new release containing this API should be released soon.
 
> - libpsl doesn't give much information on to, for example, why a given
> hostname doesn't have a registrable domain (or base domain, in libsoup's
> lingo). Our API provides error information that we cannot get
> straightforwardly from libpsl. We would have to do that processing ourselves
> anyway in order to retain the GError information we currently provide. In
> the same straightforward implementation, this breaks tests (that expect
> detailed GError info).

I fixed this by doing extra checks in libsoup's tld code. 

We will be able to use libpsl then. I'll clean up the patch and attach it here.  Because of the dependency in new API in an external dependency we probably want to wait a bit before applying this in master. I'd say after this cycle we can think of it.
Comment 11 Claudio Saavedra 2018-02-22 13:27:37 UTC
Created attachment 368767 [details] [review]
soup-tld: use libpsl instead of our own copy of the public suffix list

This adds a dependency on libpsl. For compatibility with our API, we
depend on a yet-unreleased version of libpsl that will probably be
0.19.2. All existing TLD tests are passing as expected so there
shouldn't be any regressions.
Comment 12 Claudio Saavedra 2018-02-22 13:40:57 UTC
Created attachment 368768 [details] [review]
soup-tld: use libpsl instead of our own copy of the public suffix list

This adds a dependency on libpsl. For compatibility with our API, we
depend on a yet-unreleased version of libpsl that will probably be
0.19.2. All existing TLD tests are passing as expected so there
shouldn't be any regressions.
Comment 13 Claudio Saavedra 2018-02-23 12:35:36 UTC
Created attachment 368821 [details] [review]
soup-tld: use libpsl instead of our own copy of the public suffix list

This adds a dependency on libpsl. For compatibility with our API, we
depend on libpsl 0.20.0, which is the first version to provide all
the features we need to pass all our tests as expected.

All existing TLD tests are passing as expected so there shouldn't be
any regressions.
Comment 14 Claudio Saavedra 2018-02-23 12:37:10 UTC
libpsl 0.20.0 was released, so we could merge this after branching. It would be great if someone could review the patch, though.
Comment 15 Michael Catanzaro 2018-02-23 20:15:43 UTC
Review of attachment 368821 [details] [review]:

Nice work getting the needed changes upstream so quickly.

::: configure.ac
@@ +99,3 @@
 AC_SUBST(SQLITE_LIBS)
 
+PKG_CHECK_MODULES(LIBPSL, libpsl >= 0.20.0)

Is https://github.com/rockdaboot/libpsl/commit/8fd480584e5eee7b60125e6ca89a51cdd93e7e77 important for us? If so, best depend on 0.20.1.

::: libsoup/soup-tld.c
@@ +86,3 @@
 
+	/* This will fail if libpsl's built-in data was disabled during compilation. */
+	g_assert (psl);

Use if (!psl) g_error() for this instead. Reason: g_assert() can be disabled at compile time, and this is checking for something that's not a libsoup programming error.
Comment 16 Tomas Popela 2018-02-26 08:00:47 UTC
(In reply to Claudio Saavedra from comment #14)
> libpsl 0.20.0 was released, so we could merge this after branching.

Claudio, can you please also fix the wip/meson after you merge this to the master?
Comment 17 Claudio Saavedra 2018-04-06 13:18:04 UTC
Attachment 368821 [details] pushed as 9d7559d - soup-tld: use libpsl instead of our own copy of the public suffix list

I also improved a bit the handling of the psl context in a subsequent commit.
Comment 18 Ignacio Casal Quinteiro (nacho) 2018-04-06 13:40:47 UTC
My feeling here is that we should revert these patches, ensure first that the library is properly building on the systems we support and then merge it...
Otherwise you will make a nightmare building with msvc in the meantime.
Comment 19 Tomas Popela 2018-04-06 13:43:42 UTC
(In reply to Ignacio Casal Quinteiro (nacho) from comment #18)
> My feeling here is that we should revert these patches, ensure first that
> the library is properly building on the systems we support and then merge
> it...
> Otherwise you will make a nightmare building with msvc in the meantime.

I feel the same.
Comment 20 Claudio Saavedra 2018-04-06 15:05:17 UTC
I don't have any particular inclination towards making libsoup work with msvc, so I think whoever cares will have to do that. The patch was here for a while now, so it seems that only by pushing it we've managed to make someone worry about it :) we're early enough in the cycle so that whatever you need to fix can be fixed, so I don't favor reverting anything.
Comment 21 Tomas Popela 2018-04-11 10:13:13 UTC
I found that soup-tld-private.h was not removed from docs/reference/Makefile.am so I removed it in commit d6ea55c8
Comment 22 Milan Crha 2018-04-17 08:18:40 UTC
It would be nice if the libpsl was an optional dependency, because you replaced ~200KB file with 42MB dependency, which is kind of impractical. And that library is incomplete, because it comes without that .dat file anyway, thus you only passed the responsibility to update the file to the packagers/people whom compile libsoup from sources.
Comment 23 Claudio Saavedra 2018-04-17 08:48:16 UTC
libpsl is not 42MB. Its size (plus libpsl-dev and publicsuffix) in Debian are abound 500KB when installed. The library also includes a builtin copy of the public suffix list, which libsoup uses if it's the most recently available. Also, in Debian these packages already depend on libpsl:

Reverse Depends:
  Depends: libpsl-dev (= 0.20.1-1)
  Depends: wget (>= 0.16.0)
  Depends: network-manager (>= 0.13.0)
  Depends: psl (>= 0.19.1)
  Depends: libcurl3 (>= 0.13.0)
  Depends: libcurl3-nss (>= 0.13.0)
  Depends: libcurl3-gnutls (>= 0.13.0)

So I don't think that having libpsl as an optional dependency and having to ship our own copy of the publicsuffix list is practical nor useful.
Comment 24 Milan Crha 2018-04-17 10:17:56 UTC
Maybe the tarball release of it is different from the snapshot download of git master from github, because that libpsl-master.zip file does report 42MB on the disk after unpacked and it does not contain the copy of public_suffix_list.dat in its list/ subdirectory, thus I've been forced to find it somehow. It can be just their issue in the build script, I'm not questioning it.

To not understand me wrong, I agree with responsibility delegation, I also do not like code duplication and similar things, it only wasn't that easy to get libsoup built even on Linux (it could be built under Windows more or less easily too, where the dependency hell is more problematic than on Linux machines, thus this new dependency makes it harder. It means actually 3 more dependencies, because to build libpsl I need libidn2 and libunistring development files, not counting publicsuffix-list package from which I copied the missing file).

I'd only see much simpler to have a `make dist` hook in libsoup which would `wget` the file and update it when the tarball is created, thus avoid to forget of it. That's all about it, right? To not forget to update that single file and regenerate something according to its content.
Comment 25 Claudio Saavedra 2018-04-17 10:49:21 UTC
You are not supposed to be downloading a git snapshot of a project. Just use git clone (the public-suffix list is actually a git submodule, run 'git submodule status' to see it, yet another reason why you should be using git instead of snapshots). The source code directory is indeed large as you describe, but that is mostly because libpsl is using a fuzzer for testing. An installed binary is not going to be more than a few hundred kilobytes.

It's not just a matter of having a copy of the list. The SoupTLD implementation used to be based on Mozilla's implementation of nsEffectiveTLDService. I don't remember how old that code is, but it certainly wasn't getting much maintainance in libsoup. I prefer to depend on a small and maintained library instead of having that aging code in the library.

I understand that this might be a bit painful for distributors initially (as libpsl had some build issues when srcdir!=builddir; but these have been fixed upstream), but I'm confident that in the long run this is the way to go.
Comment 26 Milan Crha 2018-04-17 10:55:55 UTC
(In reply to Claudio Saavedra from comment #25)
> I'm confident that in the long run this is the way to go.

True, I agree with you.
Comment 27 Michael Catanzaro 2018-04-17 15:59:34 UTC
FWIW, I'm still trying to get this building for GNOME, my approach is to disable everything:

./configure --disable-static --disable-runtime --disable-builtin

I haven't got a successful build yet, but I believe that disables the external libidn and ICU dependencies. I'm not sure if the library will actually work or not. If it somehow breaks Epiphany then I'll revisit.
Comment 28 Michael Catanzaro 2018-04-18 00:23:05 UTC
It built! Here's the result of the configure script:

  Version:           0.20.1
  Host OS:           linux-gnu
  Install prefix:    /usr
  Compiler:          gcc
  CFlags:            -O2 -g -fstack-protector-strong -D_FORTIFY_SOURCE=2 
  LDFlags:           -fstack-protector-strong -Wl,-z,relro,-z,now
  Libs:              
  Runtime:           no
  Builtin:           no
  PSL Dist File:     
  PSL File:          $(top_srcdir)/list/public_suffix_list.dat
  PSL Test File:     $(top_srcdir)/list/tests/tests.txt
  Sanitizers:        UBSan no, ASan no, CFI no
  Docs:              no
  Man pages:         no
  Tests:             Valgrind testing not enabled

Hopefully that's OK? I don't know what the runtime and builtin options are important for.