ACL 09 & EMNLP 09
ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks but nonetheless enjoyed the conference. The highlight for me was the talk by Richard Sproat who wondered whether there exists a statistical test to check if a series of symbol sequences is actually a language? If this test would exist, we could use it to decide whether the set of symbols known as the Indus Valey Script is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.
The paper had some very cool papers; the first one I really liked was Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor based language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate for out of vocabulary words. Daichi et al extended the HPYLM so that the base distribution is now an character infinity-gram that is itself an HPYLM (over characters). They call this model the nested HPYLM or NPYLM. There is no need for a vocabulary of words in the NPYLM, rather, the HPYLM base distribution is a distribution over arbitrary long strings. In addition the model will perform automatic word segmentation. The results are really promising: from their paper, consider the following unsegmented English text
lastly,shepicturedtoherselfhowthissamelittlesisterofhersw ould,intheafter-time,beherselfagrownwoman;andhowshe wouldkeep,throughallherriperyears,thesimpleandlovingh eartofherchildhood:andhowshewouldgatheraboutherothe rlittlechildren,andmaketheireyesbrightandeagerwithmany astrangetale,perhapsevenwiththedreamofwonderlandoo ngago:andhowshewouldfeelwithalltheirsimplesorrows,an dndapleasureinalltheirsimplejoys,rememberingherownc hild-life,andthehappysummerdays. [...]
When the NPYLM is trained on this data, the following is found
last ly , she pictured to herself how this same little sister of her s would , inthe after - time , be herself agrown woman ; and how she would keep , through allher ripery ears , the simple and loving heart of her child hood : and how she would gather about her other little children ,and make theireyes bright and eager with many a strange tale , perhaps even with the dream of wonderland of longago : and how she would feel with all their simple sorrow s , and a pleasure in all their simple joys , remember ing her own child - life , and thehappy summerday s .
A note on the implementation of Hierarchical Dirichlet Processes by Phil Blunsom et al. In this paper the authors discuss how previous approximate inference schemes to the HDP collapsed Gibbs sampler can turn out to be quite bad. In this paper they propose a more efficient and exact algorithm for a collapsed Gibbs sampler for the HDP.
A few other papers I really enjoyed:
- Minimized Models for Unsupervised Part-of-Speech Tagging by Sujith Ravi et al.
- Polylingual Topic Models by David Mimno et al.
- Graphical Models over Multiple Strings by Markus Dreyer and Jason Eisner
- Bayesian Learning of a Tree Substitution Grammar by Matt Post and Daniel Gildea