GNOME Bugzilla – Bug 769650
Use libpsl to implement soup_tld_domain_is_public_suffix() and remove copy of effective_tld_names.dat
Last modified: 2018-04-18 00:23:05 UTC
From https://publicsuffix.org/list/public_suffix_list.dat Maybe we should do this more often than once every three years?
Created attachment 332955 [details] [review] Update effective_tld_names.dat
(In reply to Michael Catanzaro from comment #0) > Maybe we should do this more often than once every three years? Yes. Maybe have distcheck blow up if it's out of date?
Or update it in a dist hook using 'wget'
Comment on attachment 332955 [details] [review] Update effective_tld_names.dat but leave the bug open
We should probably use libpsl (like curl and wget do)
Claudio, until we have time to convert libsoup to start using libpsl, I recommend updating the public suffix list semi-regularly. The latest version is in the first comment, above.
Thanks for the reminder, this had gone under my radar. I updated it now.
I was checking yesterday a bit if we can use libpsl and started cooking a patch for it. In principle, it works, but the way we deal with TLD is slightly different so some tests break. - libpsl doesn't support exception TLDs (and just considers them public), but we do deal with those. So either we would have to still handle them manually in libsoup or we would lose this functionality. In a straightforward implementation, breaks a few tests. - libpsl doesn't give much information on to, for example, why a given hostname doesn't have a registrable domain (or base domain, in libsoup's lingo). Our API provides error information that we cannot get straightforwardly from libpsl. We would have to do that processing ourselves anyway in order to retain the GError information we currently provide. In the same straightforward implementation, this breaks tests (that expect detailed GError info). I'll see if I can get around those issues so that we can maintain functionality.
(It is also possible to just make libsoup use the public-suffix list shipped with the distribution instead of its own and keep our implementation. Debian ships it in the publicsuffix package, Fedora in the publicsuffix-info package, both lists installed under /usr/share/publicsuffix/. libpsl depends on the publicsuffix package and has API to fetch the location of the distributed file, so that could be used as well.)
(In reply to Claudio Saavedra from comment #8) > I was checking yesterday a bit if we can use libpsl and started cooking a > patch for it. In principle, it works, but the way we deal with TLD is > slightly different so some tests break. > > - libpsl doesn't support exception TLDs (and just considers them public), > but we do deal with those. So either we would have to still handle them > manually in libsoup or we would lose this functionality. In a > straightforward implementation, breaks a few tests. I fixed this by adding API to libpsl to allow dropping the '*' rule when checking for TLDs. This was merged upstream yesterday and a new release containing this API should be released soon. > - libpsl doesn't give much information on to, for example, why a given > hostname doesn't have a registrable domain (or base domain, in libsoup's > lingo). Our API provides error information that we cannot get > straightforwardly from libpsl. We would have to do that processing ourselves > anyway in order to retain the GError information we currently provide. In > the same straightforward implementation, this breaks tests (that expect > detailed GError info). I fixed this by doing extra checks in libsoup's tld code. We will be able to use libpsl then. I'll clean up the patch and attach it here. Because of the dependency in new API in an external dependency we probably want to wait a bit before applying this in master. I'd say after this cycle we can think of it.
Created attachment 368767 [details] [review] soup-tld: use libpsl instead of our own copy of the public suffix list This adds a dependency on libpsl. For compatibility with our API, we depend on a yet-unreleased version of libpsl that will probably be 0.19.2. All existing TLD tests are passing as expected so there shouldn't be any regressions.
Created attachment 368768 [details] [review] soup-tld: use libpsl instead of our own copy of the public suffix list This adds a dependency on libpsl. For compatibility with our API, we depend on a yet-unreleased version of libpsl that will probably be 0.19.2. All existing TLD tests are passing as expected so there shouldn't be any regressions.
Created attachment 368821 [details] [review] soup-tld: use libpsl instead of our own copy of the public suffix list This adds a dependency on libpsl. For compatibility with our API, we depend on libpsl 0.20.0, which is the first version to provide all the features we need to pass all our tests as expected. All existing TLD tests are passing as expected so there shouldn't be any regressions.
libpsl 0.20.0 was released, so we could merge this after branching. It would be great if someone could review the patch, though.
Review of attachment 368821 [details] [review]: Nice work getting the needed changes upstream so quickly. ::: configure.ac @@ +99,3 @@ AC_SUBST(SQLITE_LIBS) +PKG_CHECK_MODULES(LIBPSL, libpsl >= 0.20.0) Is https://github.com/rockdaboot/libpsl/commit/8fd480584e5eee7b60125e6ca89a51cdd93e7e77 important for us? If so, best depend on 0.20.1. ::: libsoup/soup-tld.c @@ +86,3 @@ + /* This will fail if libpsl's built-in data was disabled during compilation. */ + g_assert (psl); Use if (!psl) g_error() for this instead. Reason: g_assert() can be disabled at compile time, and this is checking for something that's not a libsoup programming error.
(In reply to Claudio Saavedra from comment #14) > libpsl 0.20.0 was released, so we could merge this after branching. Claudio, can you please also fix the wip/meson after you merge this to the master?
Attachment 368821 [details] pushed as 9d7559d - soup-tld: use libpsl instead of our own copy of the public suffix list I also improved a bit the handling of the psl context in a subsequent commit.
My feeling here is that we should revert these patches, ensure first that the library is properly building on the systems we support and then merge it... Otherwise you will make a nightmare building with msvc in the meantime.
(In reply to Ignacio Casal Quinteiro (nacho) from comment #18) > My feeling here is that we should revert these patches, ensure first that > the library is properly building on the systems we support and then merge > it... > Otherwise you will make a nightmare building with msvc in the meantime. I feel the same.
I don't have any particular inclination towards making libsoup work with msvc, so I think whoever cares will have to do that. The patch was here for a while now, so it seems that only by pushing it we've managed to make someone worry about it :) we're early enough in the cycle so that whatever you need to fix can be fixed, so I don't favor reverting anything.
I found that soup-tld-private.h was not removed from docs/reference/Makefile.am so I removed it in commit d6ea55c8
It would be nice if the libpsl was an optional dependency, because you replaced ~200KB file with 42MB dependency, which is kind of impractical. And that library is incomplete, because it comes without that .dat file anyway, thus you only passed the responsibility to update the file to the packagers/people whom compile libsoup from sources.
libpsl is not 42MB. Its size (plus libpsl-dev and publicsuffix) in Debian are abound 500KB when installed. The library also includes a builtin copy of the public suffix list, which libsoup uses if it's the most recently available. Also, in Debian these packages already depend on libpsl: Reverse Depends: Depends: libpsl-dev (= 0.20.1-1) Depends: wget (>= 0.16.0) Depends: network-manager (>= 0.13.0) Depends: psl (>= 0.19.1) Depends: libcurl3 (>= 0.13.0) Depends: libcurl3-nss (>= 0.13.0) Depends: libcurl3-gnutls (>= 0.13.0) So I don't think that having libpsl as an optional dependency and having to ship our own copy of the publicsuffix list is practical nor useful.
Maybe the tarball release of it is different from the snapshot download of git master from github, because that libpsl-master.zip file does report 42MB on the disk after unpacked and it does not contain the copy of public_suffix_list.dat in its list/ subdirectory, thus I've been forced to find it somehow. It can be just their issue in the build script, I'm not questioning it. To not understand me wrong, I agree with responsibility delegation, I also do not like code duplication and similar things, it only wasn't that easy to get libsoup built even on Linux (it could be built under Windows more or less easily too, where the dependency hell is more problematic than on Linux machines, thus this new dependency makes it harder. It means actually 3 more dependencies, because to build libpsl I need libidn2 and libunistring development files, not counting publicsuffix-list package from which I copied the missing file). I'd only see much simpler to have a `make dist` hook in libsoup which would `wget` the file and update it when the tarball is created, thus avoid to forget of it. That's all about it, right? To not forget to update that single file and regenerate something according to its content.
You are not supposed to be downloading a git snapshot of a project. Just use git clone (the public-suffix list is actually a git submodule, run 'git submodule status' to see it, yet another reason why you should be using git instead of snapshots). The source code directory is indeed large as you describe, but that is mostly because libpsl is using a fuzzer for testing. An installed binary is not going to be more than a few hundred kilobytes. It's not just a matter of having a copy of the list. The SoupTLD implementation used to be based on Mozilla's implementation of nsEffectiveTLDService. I don't remember how old that code is, but it certainly wasn't getting much maintainance in libsoup. I prefer to depend on a small and maintained library instead of having that aging code in the library. I understand that this might be a bit painful for distributors initially (as libpsl had some build issues when srcdir!=builddir; but these have been fixed upstream), but I'm confident that in the long run this is the way to go.
(In reply to Claudio Saavedra from comment #25) > I'm confident that in the long run this is the way to go. True, I agree with you.
FWIW, I'm still trying to get this building for GNOME, my approach is to disable everything: ./configure --disable-static --disable-runtime --disable-builtin I haven't got a successful build yet, but I believe that disables the external libidn and ICU dependencies. I'm not sure if the library will actually work or not. If it somehow breaks Epiphany then I'll revisit.
It built! Here's the result of the configure script: Version: 0.20.1 Host OS: linux-gnu Install prefix: /usr Compiler: gcc CFlags: -O2 -g -fstack-protector-strong -D_FORTIFY_SOURCE=2 LDFlags: -fstack-protector-strong -Wl,-z,relro,-z,now Libs: Runtime: no Builtin: no PSL Dist File: PSL File: $(top_srcdir)/list/public_suffix_list.dat PSL Test File: $(top_srcdir)/list/tests/tests.txt Sanitizers: UBSan no, ASan no, CFI no Docs: no Man pages: no Tests: Valgrind testing not enabled Hopefully that's OK? I don't know what the runtime and builtin options are important for.