19 August 2009

ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks but nonetheless enjoyed the conference. The highlight for me was the talk by Richard Sproat who wondered whether there exists a statistical test to check if a series of symbol sequences is actually a language? If this test would exist, we could use it to decide whether the set of symbols known as the Indus Valey Script is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.

The paper had some very cool papers; the first one I really liked was Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor based language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate for out of vocabulary words. Daichi et al extended the HPYLM so that the base distribution is now an character infinity-gram that is itself an HPYLM (over characters). They call this model the nested HPYLM or NPYLM. There is no need for a vocabulary of words in the NPYLM, rather, the HPYLM base distribution is a distribution over arbitrary long strings. In addition the model will perform automatic word segmentation. The results are really promising: from their paper, consider the following unsegmented English text

lastly,shepicturedtoherselfhowthissamelittlesisterofhersw
ould,intheafter-time,beherselfagrownwoman;andhowshe
wouldkeep,throughallherriperyears,thesimpleandlovingh
rlittlechildren,andmaketheireyesbrightandeagerwithmany
astrangetale,perhapsevenwiththedreamofwonderlandoo
ngago:andhowshewouldfeelwithalltheirsimplesorrows,an
dndapleasureinalltheirsimplejoys,rememberingherownc
hild-life,andthehappysummerdays. [...]

When the NPYLM is trained on this data, the following is found

last ly , she pictured to herself how this same little sister
of her s would , inthe after - time , be herself agrown woman ;
and how she would keep , through allher ripery ears , the simple
and loving heart of her child hood : and how she would gather
about her other little children ,and make theireyes bright and
eager with many a strange tale , perhaps even with the dream of
wonderland of longago : and how she would feel with all their
simple sorrow s , and a pleasure in all their simple joys ,
remember ing her own child - life , and thehappy summerday s .

A note on the implementation of Hierarchical Dirichlet Processes by Phil Blunsom et al. In this paper the authors discuss how previous approximate inference schemes to the HDP collapsed Gibbs sampler can turn out to be quite bad. In this paper they propose a more efficient and exact algorithm for a collapsed Gibbs sampler for the HDP.

A few other papers I really enjoyed: