Bug 689001 – Add sunpinyin to white list




Summary:	Add sunpinyin to white list


Status:	RESOLVED FIXED

Product:	gnome-control-center
Classification:	Core
Component:	Region & Language
Version:	3.6.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Control-Center Maintainers
QA Contact:	Control-Center Maintainers


Reported:	2012-11-24 23:31 UTC by Weng Xuetian
Modified:	2013-03-06 19:51 UTC


Description Weng Xuetian 2012-11-24 23:31:14 UTC

Sunpinyin is also a popular Pinyin input method in China, and it is the default Pinyin input method for Chinese in Ubuntu.

The following statistics show that it may not be as popular as ibus-pinyin, because it appeared later, but technically its algorithm is more advanced than ibus-pinyin's.

http://qa.debian.org/popcon-graph.php?packages=libsunpinyin3%2Cibus-pinyin&show_installed=on&want_legend=on&want_ticks=on&from_date=&to_date=&hlght_date=&date_fmt=%25Y-%25m&beenhere=1

Comment 1 André Klapper 2012-11-25 12:05:38 UTC

Please provide *technical* reasons why to include it. 
Popularity per se is not an argument.

Comment 2 Bastien Nocera 2012-11-25 12:24:36 UTC

(In reply to comment #1)
> Please provide *technical* reasons why to include it. 
> Popularity per se is not an argument.

Actually, it is. We've added IMs based on their usage, though we do check for a minimum of quality.
And Mathieu's getting used to it too :)

Comment 3 Weng Xuetian 2012-11-25 13:14:21 UTC

Didn't I state the technical reason?

To be more concrete, Sunpinyin uses a modern algorithm based on a statistical n-gram language model, which converts Pinyin into Chinese more accurately than other Pinyin engines.

If you find that confusing, please use Wikipedia or Google to look up the technical terms in the sentence above. I can't explain it further without going into the details of the algorithm, and I don't think that would make it any easier to understand.

Comment 4 Mike Qin 2012-12-02 23:37:48 UTC

(In reply to comment #2)
> (In reply to comment #1)
> > Please provide *technical* reasons why to include it. 
> > Popularity per se is not an argument.
> 
> Actually, it is. We've added IMs based on their usage, though we do check for a
> minimum of quality.
> And Mathieu's getting used to it too :)

I am one of the maintainers of this engine, and I need to test how it works with 3.6. But as its developer I cannot even use it on GNOME, so for now I cannot guarantee its quality. You have to enable it, for me at least.

Apart from integration quality, the other qualities of sunpinyin are assured. It has a long history and the project is well maintained. As its current maintainer, I receive very few bug reports about this engine (most are actually about the sunpinyin core).

As for integration, I can take responsibility for the ibus-sunpinyin engine. But of course, you have to enable it so that I can test and maintain it.

Comment 5 Mike Qin 2012-12-03 00:04:31 UTC

(In reply to comment #1)
> Please provide *technical* reasons why to include it. 
> Popularity per se is not an argument.

I'm not sure why you need the technical details on this, but I can go over them here anyway. I thought the only thing you wanted to know about was its quality, right?

If you know the technical background of Chinese input engines, the problem itself is similar to speech recognition. Given a sequence of codes P1, P2, P3...Pn, the IME translates it into C1C2...Cm.

Pi is the Pinyin, while C1...Cm are the Chinese characters. Both carry meaning; however, for each Pi there are Ki possible translations. So, all together you'll get K1*K2*...*Km possible combinations. Ki can vary from fewer than 10 to 300-500, and m is about 20-30. So for each input session, you'll get at most 2^171 candidates, which is almost the number of particles in the universe. :) (Hell, yes, this is Chinese.) And 1 out of those 2^171 is the sentence that the user wants to input.
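
As a rough check on the size of that search space, the short Python sketch below computes log2(K1*K2*...*Km) for a given list of per-syllable candidate counts. The input used here (30 syllables with 52 candidates each) is an assumption chosen only because it lands near the 2^171 figure; the comment does not say which values produce it.

```python
import math


def search_space_bits(candidates_per_syllable):
    """Return log2 of K1*K2*...*Km, the number of candidate sentences."""
    return sum(math.log2(k) for k in candidates_per_syllable)


# Hypothetical input: 30 syllables, 52 candidate characters for each one.
bits = search_space_bits([52] * 30)
print(f"about 2^{bits:.0f} candidate sentences")  # prints: about 2^171 candidate sentences
```

Working in log2 keeps the numbers small and makes it easy to try other values of Ki and m.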

So, algorithms must prune to solve this problem. Current solutions like ibus-pinyin use a simple greedy approach. This can get stuck in a local maximum and basically will not work for very long n, because it doesn't know how to be greedy...
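
To illustrate what getting stuck in a local maximum means, here is a toy Python sketch. It is not taken from ibus-pinyin or sunpinyin; the two-slot model and all probabilities in it are made up for the example. A greedy left-to-right choice commits to the most likely first word and ends up with a less likely sentence than an exhaustive search over all combinations.

```python
from itertools import product

# Hypothetical toy model: two slots, candidate words labelled with letters.
FIRST = {"A": 0.6, "B": 0.4}  # P(first word)
NEXT = {                       # P(second word | first word)
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}


def sentence_prob(first, second):
    return FIRST[first] * NEXT[first][second]


def greedy():
    """Pick the locally best word at each step, never looking back."""
    first = max(FIRST, key=FIRST.get)
    second = max(NEXT[first], key=NEXT[first].get)
    return first, second


def exhaustive():
    """Score every combination; only feasible for tiny inputs."""
    return max(product(FIRST, ["x", "y"]), key=lambda pair: sentence_prob(*pair))


print(greedy(), sentence_prob(*greedy()))
print(exhaustive(), sentence_prob(*exhaustive()))
```

Here the greedy choice is ("A", "x") with probability about 0.33, while the best sentence is ("B", "x") with probability about 0.36.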

Sunpinyin's approach is the classical approach in speech recognition. Given a dictionary that contains words Wi=C1...Cwi, it trains itself to remember P(Wi|Wj) and P(Wi|WjWl). The problem then becomes: given P1P2P3...Pn, find W1W2...Wm = argmax{P(W1W2...Wm)}

P(W1W2...Wm) ~= P(W1)*P(W2|W1)*P(W3|W1W2)*...*P(Wm|W(m-2)W(m-1))
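
The factorization above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not sunpinyin's actual code: `prob` is a hypothetical callable returning P(word | up to two previous words), each slot is assumed to hold exactly one word (real engines also have to handle segmentation), and the toy table is invented. Because each factor depends only on the previous two words, dynamic programming over the last two words finds the argmax without enumerating every sentence.

```python
import math


def trigram_log_prob(words, prob):
    """Sum of log P(Wi | W(i-2) W(i-1)), with a shorter context at the start."""
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - 2):i])
        total += math.log(prob(word, context))
    return total


def best_sentence(candidates, prob):
    """Viterbi-style search; candidates[i] lists the possible words for slot i."""
    # Maps the last (up to two) chosen words to (log probability, best path so far).
    states = {(): (0.0, [])}
    for slot in candidates:
        next_states = {}
        for context, (score, path) in states.items():
            for word in slot:
                new_score = score + math.log(prob(word, context))
                key = (context + (word,))[-2:]
                if key not in next_states or new_score > next_states[key][0]:
                    next_states[key] = (new_score, path + [word])
        states = next_states
    return max(states.values(), key=lambda entry: entry[0])[1]


# Hypothetical probabilities, keyed by (context, word); unseen pairs get 0.1.
TOY = {
    ((), "A"): 0.6, ((), "B"): 0.4,
    (("A",), "x"): 0.55, (("A",), "y"): 0.45,
    (("B",), "x"): 0.9, (("B",), "y"): 0.1,
}


def toy_prob(word, context):
    return TOY.get((context, word), 0.1)


print(best_sentence([["A", "B"], ["x", "y"]], toy_prob))
```

The work grows with the number of slots times the number of (previous two words, next word) combinations, instead of with the product K1*K2*...*Km.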

The training material sunpinyin uses is collected from the internet (forums, articles, wiki pages), and we have published all the n-gram data in the open-gram project on GitHub. In this way, we actually teach sunpinyin to speak better and more modern Chinese than other dictionary-based approaches. We can deal with larger n, up to 48 (if I remember correctly; I haven't been able to use it myself ever since I upgraded to GNOME 3.6...)
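
For readers unfamiliar with how such probabilities are learned from text, the Python sketch below estimates P(Wi|Wj) by plain counting over a tiny, made-up, pre-segmented corpus. It is only an illustration of the idea: the corpus and function names are hypothetical, and it leaves out the smoothing or back-off that practical language models usually need for word pairs never seen in training.

```python
from collections import Counter


def train_bigram(sentences):
    """Return prob(cur, prev) = count(prev, cur) / count(prev as a context)."""
    context_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    def prob(cur, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context; a real model would back off instead
        return pair_counts[(prev, cur)] / context_counts[prev]

    return prob


# Hypothetical corpus, already split into words.
corpus = [
    ["我", "爱", "北京"],
    ["我", "爱", "你"],
    ["我", "是", "学生"],
]
prob = train_bigram(corpus)
print(prob("爱", "我"))    # 2 of the 3 sentences continue "我" with "爱"
print(prob("北京", "爱"))  # 1 of the 2 occurrences of "爱" is followed by "北京"
```

The trigram case P(Wi|WjWl) works the same way, with the previous two words as the context.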

One reason sunpinyin cannot become the default is that it was an OpenSolaris project led by Sun. Although sunpinyin was released under the LGPL long before Sun was purchased by Oracle, this still goes against ibus's preference for distributing engines under the GPL.

Comment 6 Mike Qin 2012-12-03 00:05:14 UTC

(In reply to comment #4)
<snip>
> I am one of the maintainers of this engine, and I need to test how it works
> with 3.6. But as its developer I cannot even use it on GNOME, so for now I
> cannot guarantee its quality. You have to enable it, for me at least.

I mean integration quality, sorry.


Comment 7 Mike Qin 2012-12-03 00:08:08 UTC

(In reply to comment #5)
<snip>
> The training material sunpinyin uses is collected from the internet (forums,
> articles, wiki pages), and we have published all the n-gram data in the
> open-gram project on GitHub. In this way, we actually teach sunpinyin to speak
> better and more modern Chinese than other dictionary-based approaches. We can
> deal with larger n, up to 48 (if I remember correctly; I haven't been able to
> use it myself ever since I upgraded to GNOME 3.6...)

I rechecked; it seems there is no limit on the length of the input.


Comment 8 Bastien Nocera 2012-12-03 07:15:15 UTC

(In reply to comment #4)
<snip>
> As for integration, I can take responsibility for the ibus-sunpinyin engine.
> But of course, you have to enable it so that I can test and maintain it.

You can enable it for yourself...
gsettings set org.gnome.desktop.input-sources show-all-sources true

Comment 9 Mike Qin 2012-12-22 23:05:32 UTC

Hey all

I tried ibus-sunpinyin recently. It works with the same quality as ibus-pinyin does.

But both of them still suffer from the C-<space> behavior. I could hack around this and release a newer version.

Comment 10 Mike Qin 2013-01-27 22:10:29 UTC

(In reply to comment #8)
> (In reply to comment #4)
> <snip>
> > As for integration, I can take responsibility for the ibus-sunpinyin engine.
> > But of course, you have to enable it so that I can test and maintain it.
> 
> You can enable it for yourself...
> gsettings set org.gnome.desktop.input-sources show-all-sources true

Hi all

I've finished the C-<space> trigger at the engine level. You can add sunpinyin to the whitelist now.

Thanks
Mike

Comment 11 Rui Matos 2013-03-06 19:51:38 UTC

There's no whitelist anymore.