Bug 673802 - Add support for http://publicsuffix.org
Status: RESOLVED FIXED
Product: libsoup
Classification: Core
Component: API
Version: unspecified
Hardware: Other
OS: Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: libsoup-maint@gnome.bugs
QA Contact: libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2012-04-09 22:32 UTC by Sergio Villar
Modified: 2012-06-14 18:18 UTC
See Also:
GNOME target: ---
GNOME version: 3.5/3.6


Attachments
* soup-tld: added new object SoupTLD (93.44 KB, patch) - 2012-04-09 22:37 UTC, Sergio Villar (status: none)
* soup-cookie-jar: do not accept cookies for well known public domains (2.96 KB, patch) - 2012-04-09 22:37 UTC, Sergio Villar (status: none)
* Added soup_tld_* utils (95.23 KB, patch) - 2012-04-19 18:49 UTC, Sergio Villar (status: needs-work)
* SoupCookieJar: do not set cookies for public suffixes (1.33 KB, patch) - 2012-04-19 18:51 UTC, Sergio Villar (status: reviewed)
* Added soup_tld_* utils (95.49 KB, patch) - 2012-05-31 10:11 UTC, Sergio Villar (status: reviewed)

Description Sergio Villar 2012-04-09 22:32:12 UTC
From the site description:

"The Public Suffix List is an initiative of Mozilla, but is maintained as a community resource. It is available for use in any software, but was originally created to meet the needs of browser manufacturers. It allows browsers to, for example:
* Avoid privacy-damaging "supercookies" being set for high-level domain name suffixes
* Highlight the most important part of a domain name in the user interface
* Accurately sort history entries by site"

Having some API in libsoup will not only benefit potential clients (such as web browsers) but will also allow us to properly filter "supercookies" (see for example https://bugs.webkit.org/show_bug.cgi?id=64575#c7).

That public suffix list is currently used by Firefox, Chrome and Opera among others.
Comment 1 Sergio Villar 2012-04-09 22:37:15 UTC
Created attachment 211673 [details] [review]
soup-tld: added new object SoupTLD

SoupTLD is a new object that, with the help of the list of public suffixes
published at http://publicsuffix.org, allows API clients to get the base
domain of a given hostname or to check whether a given domain is a public
suffix (one under which internet users can register new names).
Comment 2 Sergio Villar 2012-04-09 22:37:17 UTC
Created attachment 211674 [details] [review]
soup-cookie-jar: do not accept cookies for well known public domains

SoupCookieJar uses the brand new SoupTLD to reject cookies whose domains are
registered public domains. This prevents sites from setting supercookies.
Comment 3 Sergio Villar 2012-04-09 22:41:41 UTC
The first patch includes the last available list of public suffixes (loaded using the new glib resources API) and unit tests based on the test cases available in the project web page.

The second patch is basically the fix for the supercookies issue.
Comment 4 Dan Winship 2012-04-10 14:03:38 UTC
Oh, awesome.

So my first question is whether libsoup is the right API level for this. Looking at http://publicsuffix.org/ and http://publicsuffix.org/learn/, the only not-really-web-specific use case is "Highlight the most important part of a domain name in the user interface"... which isn't so super important that this would be worth moving to glib. So I think libsoup is right.

Second question is, should it be possible to update the list outside of libsoup releases? (Like, it could download it to ~/.cache if the built-in list is too old or something.) I guess we can worry about that later.
Comment 5 Dan Winship 2012-04-10 14:23:10 UTC
Comment on attachment 211673 [details] [review]
soup-tld: added new object SoupTLD

>+++ b/data/effective_tld_names.dat

So... it would be awesome if "make distcheck" verified that we had the most up-to-date version of the list... (should be able to use tests/get to download it?)

>+soup_tld_new (void)
>+{
>+	SoupTLD *tld = g_object_new (SOUP_TYPE_TLD, NULL);
>+
>+	g_thread_create (soup_tld_parse_tld_list_in_thread, tld, TRUE, NULL);

you should do that in soup_tld_init() so it works if you do "g_object_new(SOUP_TYPE_TLD..." too.

Also, the way it's set up now, it only works if you create the SoupTLD at startup, and then don't use it until later. If you try to do:

    tld = soup_tld_new ();
    if (soup_tld_is_public (tld, blah)) ...

then it will fail... there needs to be some way to avoid that. One way would be to have it implement GInitable/GAsyncInitable, so you know for sure when the parsing thread finishes. Another (slightly lamer) way would be to just have a signal emitted (in the calling thread, not the work thread) when the thread finishes.

A better possibility would be to try to do the heavy lifting at compile time and store the data in some pre-parsed format so that it can be read in much faster in soup_tld_new() without needing a separate thread.

>+ * soup_tld_get_base_domain:

>+ * The returned string should be freed with g_free() when no longer needed.

It seems like it should be possible to have this just return a pointer into @hostname rather than returning a new string. (For use cases like "highlight the most important part of the domain name in the UI", you need that anyway, so if libsoup doesn't implement it, epiphany would have to [assuming it's going to do that at some point].)

>+	if (g_hostname_is_non_ascii (hostname)) {
>+		ascii_hostname = g_hostname_to_ascii (hostname);

the list is in utf8
Comment 6 Sergio Villar 2012-04-18 10:25:25 UTC
(In reply to comment #5)
> (From update of attachment 211673 [details] [review])
> >+++ b/data/effective_tld_names.dat
> 
> So... it would be awesome if "make distcheck" verified that we had the most
> up-to-date version of the list... (should be able to use tests/get to download
> it?)

Yeah I guess we can store the ETag somewhere to check that we have the most recent version.

> A better possibility would be to try to do the heavy lifting at compile time
> and store the data in some pre-parsed format so that it can be read in much
> faster in soup_tld_new() without needing a separate thread.

This is the approach used by Firefox. They just generate the initialization code for an array of structs at compile time. The problem with that approach is that the memory for the static array would always be there, apart from the hash table, which IMO uses a quite compact implementation (an integer hash as key and another integer as value).
 
> >+ * soup_tld_get_base_domain:
> 
> >+ * The returned string should be freed with g_free() when no longer needed.
> 
> It seems like it should be possible to have this just return a pointer into
> @hostname rather than returning a new string. (For use cases like "highlight
> the most important part of the domain name in the UI", you need that anyway, so
> if libsoup doesn't implement it, epiphany would have to [assuming it's going to
> do that at some point].)

If we want to do that then we must add an additional restriction to the API. It currently returns new memory because we "normalize" the hostname strings provided by the caller; that's the only reason.
 
> >+	if (g_hostname_is_non_ascii (hostname)) {
> >+		ascii_hostname = g_hostname_to_ascii (hostname);
> 
> the list is in utf8

Yeah but this method normalizes the hostnames passed by the user as arguments, it has nothing to do with the suffix list.
Comment 7 Dan Winship 2012-04-18 13:46:41 UTC
(In reply to comment #6)
> This is the approach used by Firefox. They just generate the initialization
> code for an array of structs at compile time. The problem with that approach
> is that the memory for the static array would always be there, apart from
> the hash table, which IMO uses a quite compact implementation (an integer
> hash as key and another integer as value).

If it's a const array of const strings/structs, then there'd only be a single copy of it in memory for all processes using libsoup. (And if none of them were actually using the SoupTLD API then the kernel wouldn't need to keep those pages in memory anyway.)

> > the list is in utf8
> 
> Yeah but this method normalizes the hostnames passed by the user as arguments,
> it has nothing to do with the suffix list.

But you're normalizing the passed-in hostname to ASCII and then comparing it to the UTF-8 hostnames in the suffix list. You need to either asciify both or unicodify both.
Comment 8 Sergio Villar 2012-04-19 18:49:04 UTC
Created attachment 212374 [details] [review]
Added soup_tld_* utils

These are the main changes:

* There is no SoupTLD object. Now we have a soup_tld_* set of functions
* Suffix data is not parsed directly from the public suffix list file. We parse it at compile time and generate an array of structs directly included in the source code. That array is indexed by a hash table following a lazy approach.
* make distcheck can download the latest version of the list if curl is detected (I think we cannot use tests/get because according to the docs, the distcheck-hook is called before configuring and building the package)
* The public domain is returned as a pointer to the original string instead of creating a new one
Comment 9 Sergio Villar 2012-04-19 18:51:01 UTC
Created attachment 212376 [details] [review]
SoupCookieJar: do not set cookies for public suffixes

I moved the check to soup_cookie_jar_add_cookie() so that we check it always and not only when processing the Set-Cookie header.
Comment 10 Sergio Villar 2012-05-21 11:48:24 UTC
What are we going to do with this Dan?
Comment 11 Dan Winship 2012-05-21 12:59:04 UTC
sorry, i'm behind on libsoup patch review...
Comment 12 Dan Winship 2012-05-24 23:08:53 UTC
Review of attachment 212374 [details] [review]:

::: Makefile.am
@@ +19,3 @@
+# download the latest available public suffix list
+distcheck-hook:
+	curl -s $(TLD_DATA_FILE_URI) > @TLD_DATA_FILE@

hm... this just silently downloads it without letting you know...

tell you what, just drop this part. I'll think about it some more later

::: configure.ac
@@ +350,3 @@
+dnl *********************
+TLD_DATA_FILE='$(top_srcdir)/data/effective_tld_names.dat'
+AC_SUBST(TLD_DATA_FILE)

this doesn't need to be in configure. Just put it in libsoup/Makefile.am

::: docs/reference/libsoup-2.4-sections.txt
@@ +1171,3 @@
 
+<SECTION>
+<FILE>soup-tld</FILE>

you need to add this into libsoup-2.4-docs.sgml somewhere too... I guess "Core API"

::: libsoup/soup-tld-private.h
@@ +20,3 @@
+	gint hash;
+	guint flags;
+} TLDEntry;

I'd prefer to have that be "SoupTLDEntry", even though this is a private header...

::: libsoup/soup-tld.c
@@ +23,3 @@
+static GHashTable *rules = NULL;
+static TLDEntry tld_entries[] =
+#include "tld_data.inc"

it's more common to have the surrounding {}s be here rather than inside the #included file

@@ +30,3 @@
+ * searches.
+ */
+static void soup_tld_initialize_rules_hashtable (void)

newline after void

@@ +38,3 @@
+		g_hash_table_insert (rules,
+				     &(tld_entries[i].hash),
+				     &(tld_entries[i].flags));

this assumes the hashes are all unique... does mozilla's code do that too?

(Given that g_str_hash is not cryptographically strong, it would be easy to generate domain names that have the same hash as some (unrelated) entry in the list... the question is, would that ever be useful as part of an attack?)

@@ +46,3 @@
+ * @hostname: a UTF-8 hostname in its canonical representation form
+ * @error: location to store the error occurring, or %NULL to ignore
+ * errors. See #SoupTLDError for the available error codes

it's preferred to indent the second (and further) lines of a multi-line description by a few spaces

also "the error occurring" isn't very idiomatic. I'd say "return location for a #GError".

@@ +49,3 @@
+ *
+ * Finds the base domain for a given @hostname (for example for
+ * myhost.mydomain.com it will return mydomain.com).

You need to define what "base domain" means more. You can describe it in terms of is_public(). Or say maybe "This is the highest-level domain that a web site from @hostname could set a cookie in"

@@ +58,3 @@
+ * an error occurs, %NULL will be returned and @error set.
+ *
+ * Since: 2.38.1

2.40 (likewise throughout the file)

@@ +67,3 @@
+
+	if (!rules)
+		soup_tld_initialize_rules_hashtable ();

you need to make the initialization thread-safe. Use a GOnce and g_once().

@@ +73,3 @@
+
+/**
+ * soup_tld_domain_is_public:

soup_tld_domain_is_public_suffix() would be more accurate

@@ +95,3 @@
+
+	g_return_val_if_fail (domain, FALSE);
+	g_return_val_if_fail (!g_hostname_is_ascii_encoded (domain), FALSE);

I would remove this and instead do a g_return_val_if_fail() if soup_tld_get_base_domain_internal() returns INVALID_HOSTNAME or IS_IP_ADDRESS

@@ +101,3 @@
+
+	if (*domain == '.' && !(++domain))
+		return FALSE;

if (*domain == '.') then it doesn't match the format that the documentation says it has to take...

@@ +112,3 @@
+	if (g_error_matches (error, SOUP_TLD_ERROR, SOUP_TLD_ERROR_IS_IP_ADDRESS) ||
+	    g_error_matches (error, SOUP_TLD_ERROR, SOUP_TLD_ERROR_INVALID_HOSTNAME))
+		return FALSE;

yeah, so I'd make this
    g_error_free (error);
    g_return_val_if_reached (FALSE);

(You're also leaking error if it's SOUP_TLD_ERROR_NOT_ENOUGH_DOMAINS)

@@ +127,3 @@
+
+static const char *
+soup_tld_get_base_domain_internal (const char *hostname, guint additional_domains, GError **error)

fwiw, you always call this with additional_domains = 1

@@ +132,3 @@
+	gint add_domains;
+
+	if (*hostname == '.') {

you don't need this check because the one at the start of the loop does the same thing the first time through

@@ +142,3 @@
+		g_set_error_literal (error, SOUP_TLD_ERROR,
+				     SOUP_TLD_ERROR_IS_IP_ADDRESS,
+				     "Hostname is an IP address");

All error strings should be _()'ed (include <glib/gi18n-lib.h> for that)

@@ +216,3 @@
+		g_set_error_literal (error, SOUP_TLD_ERROR,
+				     SOUP_TLD_ERROR_NOT_ENOUGH_DOMAINS,
+				     "Not enough domains");

I don't really like this error description, though I'm not sure what else to say...

::: libsoup/tld-parser.c
@@ +14,3 @@
+#include "soup-tld-private.h"
+
+int main(int argc, char **argv) {

newline after int and before {

@@ +64,3 @@
+		}
+
+		domain = rule;

You need to reset flags to 0 each time through the loop too...
(did you run tld-test? If so, it must be buggy...)

@@ +82,3 @@
+		hash = g_str_hash (domain);
+
+		size = g_sprintf (output_line, "{ %d, %d },\n", hash, flags);

Use g_snprintf. Various tools will complain about using sprintf.
Comment 13 Dan Winship 2012-05-24 23:10:16 UTC
Review of attachment 212376 [details] [review]:

> are registered public domains. This prevents sites from setting supercookies.

"are registered public suffixes"

otherwise good, once the other patch is ready
Comment 14 Sergio Villar 2012-05-30 18:07:27 UTC
(In reply to comment #12)
> Review of attachment 212374 [details] [review]:
> 
> ::: Makefile.am
> @@ +19,3 @@
> +# download the latest available public suffix list
> +distcheck-hook:
> +    curl -s $(TLD_DATA_FILE_URI) > @TLD_DATA_FILE@
> 
> hm... this just silently downloads it without letting you know...

that's just because of the -s argument; we might remove it

> tell you what, just drop this part. I'll think about it some more later
> 
> ::: configure.ac
> @@ +350,3 @@
> +dnl *********************
> +TLD_DATA_FILE='$(top_srcdir)/data/effective_tld_names.dat'
> +AC_SUBST(TLD_DATA_FILE)
> 
> this doesn't need to be in configure. Just put it in libsoup/Makefile.am

I just did that because I wanted to share the TLD_DATA_FILE variable between Makefiles
 
> ::: libsoup/soup-tld.c
> @@ +23,3 @@
> +static GHashTable *rules = NULL;
> +static TLDEntry tld_entries[] =
> +#include "tld_data.inc"
> 
> it's more common to have the surrounding {}s be here rather than inside the
> #included file

OK, will remove that from the parser

> @@ +38,3 @@
> +        g_hash_table_insert (rules,
> +                     &(tld_entries[i].hash),
> +                     &(tld_entries[i].flags));
> 
> this assumes the hashes are all unique... does mozilla's code do that too?

I have just checked and they use the domain as the key for their hash table

> (Given that g_str_hash is not cryptographically strong, it would be easy to
> generate domain names that have the same hash as some (unrelated) entry in the
> list... the question is, would that ever be useful as part of an attack?)

That's actually a good point. For the particular case of cookies, even if an attacker could craft a URL able to defeat soup-tld, the domain check in the soup cookie jar would intercept it. Anyway, in the general case I agree that this might lead to incorrect results.

> @@ +67,3 @@
> +
> +    if (!rules)
> +        soup_tld_initialize_rules_hashtable ();
> 
> you need to make the initialization thread-safe. Use a GOnce and g_once().

very true

> @@ +101,3 @@
> +
> +    if (*domain == '.' && !(++domain))
> +        return FALSE;
> 
> if (*domain == '.') then it doesn't match the format that the documentation
> says it has to take...

yeah, will remove that safety check
 
> @@ +127,3 @@
> +
> +static const char *
> +soup_tld_get_base_domain_internal (const char *hostname, guint
> additional_domains, GError **error)
> 
> fwiw, you always call this with additional_domains = 1

yes, I left the argument there because it might be useful for future API; I can remove it...
  
> @@ +216,3 @@
> +        g_set_error_literal (error, SOUP_TLD_ERROR,
> +                     SOUP_TLD_ERROR_NOT_ENOUGH_DOMAINS,
> +                     "Not enough domains");
> 
> I don't really like this error description, though I'm not sure what else to
> say...

Will come up with something...
 
> ::: libsoup/tld-parser.c
> @@ +14,3 @@
> +#include "soup-tld-private.h"
> +
> +int main(int argc, char **argv) {
> 
> newline after int and before {
> 
> @@ +64,3 @@
> +        }
> +
> +        domain = rule;
> 
> You need to reset flags to 0 each time through the loop too...
> (did you run tld-test? If so, it must be buggy...)

tld-test does not check that the parser works fine; the data to be included is properly generated, though...

> @@ +82,3 @@
> +        hash = g_str_hash (domain);
> +
> +        size = g_sprintf (output_line, "{ %d, %d },\n", hash, flags);
> 
> Use g_snprintf. Various tools will complain about using sprintf.

copy
Comment 15 Sergio Villar 2012-05-31 10:11:09 UTC
Created attachment 215319 [details] [review]
Added soup_tld_* utils

new version with suggested changes
Comment 16 Sergio Villar 2012-05-31 10:15:00 UTC
In this new version I use the domain as the hash table key, and I added a check to verify that the URL and the hash table key are equal (since g_str_hash may lead to collisions).

I am not uploading the SoupCookieJar patch because it's basically the same. I just had to add a !g_hostname_is_ip_address(cookie->domain) before the call to soup_tld_domain_is_public_suffix() as the latter does not accept IP addresses by contract.
Comment 17 Dan Winship 2012-06-14 12:07:26 UTC
Comment on attachment 215319 [details] [review]
Added soup_tld_* utils

>+	if (!rules)
>+		soup_tld_initialize_rules_hashtable ();

You have to call soup_tld_initialize_rules_hashtable() unconditionally, because otherwise this could succeed after the "rules = g_hash_table_new ..." but before it finishes filling it in. (Using a temporary variable and only assigning to rules at the end won't help because of the possibility of the compiler optimizing that out.)

Maybe make the function name "soup_tld_ensure_rules_hash_table()" to make it clearer that it only initializes it the first time?

(Also, maybe move the call to the _internal() function rather than having it in both of the public functions?)

>+	/* If additional_domains > 0 the we haven't found enough additional domains. */

typo "the we"

>+main(int argc, char **argv)

space before (

>+/* From http://publicsuffix.org/list/test.txt */
>+static struct {
>+  const char *hostname;
>+  const char *result;
>+} tld_tests[] = {

There should be at least one UTF-8 test in the list... (and maybe suggest that to upstream too).


OK to commit with those fixes
Comment 18 Sergio Villar 2012-06-14 18:18:05 UTC
(In reply to comment #17)
> OK to commit with those fixes

Committed in master