organized by Laura Kallmeyer and Wolfgang Maier
Thursdays, 14:30-16:00. Room 23.21.U1.72.
This colloquium presents current work from computational linguistics research. There will be a series of guest lectures by international researchers, which participants are expected to prepare for by reading supplementary literature.
In addition to the MA students taking the seminar, all interested listeners are cordially invited to the guest lectures.
|23.10.2014||Talk by Wolfgang Maier (Düsseldorf):
Language variety identification in Spanish tweets
We study the problem of language variety identification, approximated as the problem of labeling tweets from Spanish-speaking countries with the country from which they were posted. While this task is closely related to “pure” language identification, it comes with additional complications. We build a balanced collection of tweets and apply techniques from language modeling. A simplified version of the task is also solved by human test subjects, who are outperformed by the automatic classification. Our best automatic system achieves an overall F-score of 67.7% on the 5-class classification task.
Joint work with Carlos Gómez-Rodríguez, Universidade da Coruña, Spain
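As a rough illustration of the language-modeling approach mentioned in the abstract, the sketch below classifies tweets with per-country character n-gram models. The corpus, the trigram order, and the add-one smoothing are illustrative assumptions, not details of the actual system.

```python
# Hypothetical sketch: per-country character trigram language models with
# add-one smoothing; a tweet is assigned to the country whose model gives
# it the highest probability. Not the system from the talk.
import math
from collections import Counter

def char_ngrams(text, n=3):
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(text))]

class CharNgramModel:
    def __init__(self, texts, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for text in texts:
            for gram in char_ngrams(text, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        # +1 reserves probability mass for unseen characters.
        self.vocab_size = len({g[-1] for g in self.ngram_counts}) + 1

    def log_prob(self, text):
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

def classify(tweet, models):
    # Pick the country whose model assigns the highest log-probability.
    return max(models, key=lambda country: models[country].log_prob(tweet))

# Toy usage; real training data would be the balanced tweet collection.
train = {"ES": ["vale, hasta luego", "me voy al piso"],
         "AR": ["che, vos venís?", "qué quilombo"],
         "MX": ["órale, qué padre", "ahorita vengo"]}
models = {country: CharNgramModel(texts) for country, texts in train.items()}
print(classify("che, venís mañana?", models))
```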
|06.11.2014||Talk by Jackie Chi Kit Cheung (Toronto):
Towards Large-Scale Natural Language Inference with Distributional Semantics
Language understanding and semantic inference are crucial for solving complex natural language applications, from intelligent personal assistants to automatic summarization systems. However, current systems often require hand-coded information about the domain of interest, an approach that will not scale up to the large array of possible domains and topics in today's text collections. In this talk, I demonstrate the potential of distributional semantics (DS), an approach to modelling meaning by using the contexts in which a word or phrase appears, to assist in acquiring domain knowledge and to support the desired inference, with applications to automatic summarization and natural language generation. I present a method that integrates phrasal DS representations into a probabilistic model in order to learn about the important events and slots in a domain, resulting in state-of-the-art performance on template induction and multi-document summarization among systems that do not rely on hand-coded domain knowledge. I also propose to evaluate DS models by their ability to support inference, the hallmark of any semantic formalism, and discuss their use in a text-to-text generation setting. These results demonstrate the utility of DS for current natural language applications, and provide a principled framework for measuring progress towards automated inference in any domain.
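To make "modelling meaning by using the contexts in which a word or phrase appears" concrete, here is a toy count-based distributional model with cosine similarity. It is a bag-of-words simplification over assumed toy data; the phrasal, syntax-aware representations discussed in the talk go well beyond this.

```python
# Toy distributional semantics: represent each word by a count vector over
# the words that co-occur with it in a fixed window, then compare words by
# cosine similarity. Illustrative only.
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values()))
    norm *= math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sentences = ["the judge sentenced the defendant",
             "the court sentenced the offender",
             "the chef seasoned the dish"]
vectors = cooccurrence_vectors(sentences)
# Words sharing contexts ("sentenced") come out more similar.
print(cosine(vectors["judge"], vectors["court"]))  # high
print(cosine(vectors["judge"], vectors["chef"]))   # lower
```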
Jackie CK Cheung is a graduating PhD candidate at the University of Toronto, and will begin at McGill University as an assistant professor in January 2015. His research interests span several areas of natural language processing, including computational semantics, automatic summarization, and natural language generation. His work has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), as well as by a Facebook Fellowship.
|20.11.2014||Talk by Anders Søgaard (Copenhagen):
NLP for the 1%?
NLP as a field is biased toward certain domains and text types, such as politics and newswire. A POS tagger, for instance, may get fewer than 1 word in 40 wrong (under 2.5% error) on newswire, but more than 1 in 5 (over 20%) on spoken language or social media content. This bias is unfortunate, since most of our potential users care more about spoken language and social media content than about newswire. Moreover, some recent experiments of ours show that there is also considerable bias within these domains: performance correlates positively with the age and income of the authors. This, I believe, is a democratic problem, and we discuss ways to correct this bias in NLP.
|04.12.2014||Talk by Yannick Versley (Heidelberg):
Grammarless parsing for discontinuous constituents
Traditionally, constituent parsing has relied on formal grammars, either in context-free form for CFG parsing or in the form of k-LCFRS for mildly context-sensitive parsing, while dependency parsing started out from the complete space of dependency trees (e.g. Maruyama 1990; Menzel and Schröder 1998). More recently, these distinctions have been deemphasized by the fact that inference techniques familiar from grammar-based parsing can be used for dependencies (Eisner 1996; Kuhlmann and Satta 2009), and that state-of-the-art constituent parsing uses rather impoverished representations as its starting point (Petrov et al. 2006; Hall et al. 2013).
In this talk, I will start from the argument of Becker, Rambow and Niv (1992) that, under certain assumptions, German exhibits "doubly unbounded" constructions that are outside the power of mildly context-sensitive formalisms, and present the parallel argument of McDonald and Pereira (2006) that dependency parsing with second-order factors is NP-hard.
After giving an overview of the approaches currently used in dependency parsing to handle non-projective dependencies, I present current results for EaFi, a parser based on easy-first search with online reordering.
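For readers unfamiliar with easy-first parsing, the toy sketch below shows its control strategy in the simpler setting of unlabeled dependency attachment: at each step, the single most confident attachment anywhere in the sentence is performed, rather than proceeding strictly left to right. This is not EaFi itself, which builds discontinuous constituents and adds online reordering, and the scorer is a stand-in for a trained model.

```python
# Toy easy-first dependency attachment (illustrative; not Versley's EaFi).
def easy_first_parse(tokens, score):
    pending = list(range(len(tokens)))   # indices of still-unattached words
    heads = [None] * len(tokens)
    while len(pending) > 1:
        best = None
        for i in range(len(pending) - 1):
            left, right = pending[i], pending[i + 1]
            # Two candidate actions per adjacent pair: attach left under
            # right, or attach right under left.
            for dep, head in ((left, right), (right, left)):
                s = score(tokens, dep, head)
                if best is None or s > best[0]:
                    best = (s, dep, head)
        _, dep, head = best
        heads[dep] = head      # perform the most confident action first
        pending.remove(dep)
    return heads               # the last pending word is the root (None)

# Stand-in scorer: prefer attaching shorter words under longer ones.
def toy_score(tokens, dep, head):
    return len(tokens[head]) - len(tokens[dep])

print(easy_first_parse(["a", "quick", "demonstration"], toy_score))
```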
|18.12.2014||Talk by Kilian Evang (Groningen):
Cross-lingual Transfer of a Semantic Parser via Parallel Corpora
Semantic parsers that translate text into logical form are gaining popularity. On specialized domains, their output can be directly executed, for example as a database query or as an instruction to a robot, enabling natural-language interfaces to these systems. On open-domain text, semantic parsers are becoming useful for automated reasoning about propositions expressed in natural language, and for providing logical-form-based features for systems that employ machine learning (Bjerva et al., 2014). One such open-domain semantic parser is formed by the C&C tools/Boxer pipeline (Curran, Clark and Bos, 2007), which produces state-of-the-art logical forms for English.
Analogous systems for other languages, such as Dutch or Italian, are not as readily available. Building them would require a large amount of annotated training data for CCG parsing as well as many person-years of work building a semantic lexicon. Given that an English system already exists, can a comparable system instead be produced at lower cost by exploiting parallel corpora?
I investigate this possibility by taking an English-Dutch sentence-aligned parallel corpus, using Boxer to produce logical forms for the English sentences, pairing them with the corresponding Dutch sentences, and applying a variant of the learning algorithm of Zettlemoyer and Collins (2007). This algorithm learns semantic grammars from sentences paired with logical forms, with no syntactic supervision. Although learning would normally be extremely hard on open-domain text, the additional supervision provided by the existing English lexicon and automatically produced English-Dutch word alignments makes it viable.
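One way to picture the role of the word alignments is the sketch below, which projects candidate lexical entries from the English lexicon onto aligned Dutch words. This is an assumed simplification for illustration, not the actual pipeline; real training would validate and refine such candidates inside the learning loop.

```python
# Hypothetical sketch: propose Dutch CCG lexical entries by copying the
# entries of aligned English words. Illustration only.
from collections import defaultdict

def project_lexicon(english_lexicon, aligned_pairs):
    """english_lexicon: word -> set of (CCG category, logical form) entries.
    aligned_pairs: (english_tokens, dutch_tokens, alignment) triples, where
    alignment is a set of (english_index, dutch_index) pairs."""
    dutch_candidates = defaultdict(set)
    for en_tokens, nl_tokens, alignment in aligned_pairs:
        for en_i, nl_i in alignment:
            for entry in english_lexicon.get(en_tokens[en_i], ()):
                # Hypothesis: the aligned Dutch word can carry the same
                # category and logical form as its English counterpart.
                dutch_candidates[nl_tokens[nl_i]].add(entry)
    return dutch_candidates

english_lexicon = {"love": {("(S\\NP)/NP", "\\y.\\x.love(x,y)")}}
pairs = [(["I", "love", "you"], ["Ik", "hou", "van", "je"],
          {(0, 0), (1, 1), (2, 3)})]
print(project_lexicon(english_lexicon, pairs)["hou"])
```

As the next paragraph discusses, such 1:1 projection is exactly what fails for multi-word cases like "hou van", which is where the flexibility of CCG comes in.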
Among the interesting challenges are cases where the same meaning is expressed in the two languages in structurally different ways, as in "I like to walk in the rain" vs. "Ik wandel graag in de regen" or "I love you" vs. "Ik hou van je". Here, not all appropriate lexical entries for the target language have 1:1 equivalents in the source language. Can they still be learned automatically from such pairs? I show that, thanks to the flexibility of CCG and some heuristics, this is indeed possible in many cases.
This talk will introduce the problem tackled and the proposed method in detail, and present results from current work in progress.
References:
Johannes Bjerva, Johan Bos, Rob van der Goot, Malvina Nissim (2014): The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. Proceedings of SemEval.
James R. Curran, Stephen Clark, Johan Bos (2007): Linguistically motivated large-scale NLP with C&C and Boxer. Proceedings of ACL.
Luke S. Zettlemoyer, Michael Collins (2007): Online learning of relaxed CCG grammars for parsing to logical form. Proceedings of EMNLP-CoNLL.
|22.01.2015||Talk by Sebastian Padó (Stuttgart):
Cross-lingual learning of syntax-based distributional semantics
Syntax-based distributional models provide a flexible and linguistically informed representation of word meaning. However, their construction requires large, accurately parsed corpora, which are unavailable for most languages.
In my talk, I will discuss ways to take advantage of the advanced state of the art in English NLP to induce syntax-based distributional models for other languages. I will focus on two methods, cross-lingual transfer via comparable corpora and via translation lexicons, and present evaluations on lexical-semantic benchmarks for German, Croatian, and Spanish.
The main findings are that (a) translation lexicons can be learned from comparable corpora; (b) translation lexicons are sufficient to construct high-precision syntax-based models for languages without any available parsed data, albeit at the expense of recall; and (c) relatively simple strategies can combine monolingual and cross-lingual models so that their strengths complement each other.
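As a concrete picture of the lexicon-based transfer method, the sketch below translates the syntactic context features of an English distributional vector through a word translation lexicon; features whose context word has no translation are dropped, which is where the precision-for-recall trade-off comes from. This is an illustrative reconstruction under assumed data structures, not the implementation discussed in the talk.

```python
# Hypothetical sketch: transfer a syntax-based distributional vector into a
# target language by translating its (relation, context word) features.
from collections import Counter

def transfer_vector(en_vector, lexicon):
    """en_vector: Counter over (dependency relation, context word) features.
    lexicon: English word -> target-language word."""
    target = Counter()
    for (relation, context_word), count in en_vector.items():
        if context_word in lexicon:
            target[(relation, lexicon[context_word])] += count
        # Untranslatable features are dropped: high precision, lower recall.
    return target

# Toy example: an English vector for "dog" transferred into German.
en_dog = Counter({("subj-of", "bark"): 12, ("obj-of", "feed"): 7,
                  ("mod", "stray"): 3})
lexicon = {"bark": "bellen", "feed": "füttern"}   # no entry for "stray"
print(transfer_vector(en_dog, lexicon))
```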
|05.02.2015||Talk by Manfred Sailer and Sascha Bargmann (Frankfurt):
The Syntactic Flexibility of Non-decomposable Idioms