10/4 Meilin Zhan (MIT BCS)
Speakers often have more than one way to express the same meaning. What general principles govern speaker choice in the face of optionality, when nearly semantically equivalent alternatives exist? Studies have shown that optional reduction in language is sensitive to contextual predictability: the more predictable a linguistic unit is, the more likely it is to be reduced. Yet it is unclear whether speaker choice is geared toward audience design, or toward facilitating production. Here we argue that for a different optionality phenomenon, namely classifier choice in Mandarin Chinese, Uniform Information Density and at least one plausible variant of availability-based production make opposite predictions about the relationship between the predictability of upcoming material and speaker choices. In a corpus analysis of Mandarin Chinese, we show that the distribution of speaker choices supports the availability-based production account, not Uniform Information Density.
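The notion of contextual predictability at play here is standardly quantified as surprisal. A minimal sketch with made-up probabilities (the talk's corpus estimates are not reproduced here):

```python
import math

def surprisal(p):
    """Surprisal in bits: the less predictable a unit, the more information it carries."""
    return -math.log2(p)

# Hypothetical probabilities of an upcoming noun given its context.
predictable_noun = surprisal(0.8)   # low information content
surprising_noun = surprisal(0.05)   # high information content

# Uniform Information Density predicts optional material (e.g., a classifier)
# is more likely before high-surprisal nouns, smoothing the information rate;
# an availability-based account ties the choice to retrieval speed instead.
```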
10/18 Reuben Cohn-Gordon (Stanford Linguistics)
Bayesian Pragmatic Models for Natural Language
The Rational Speech Acts model (RSA) formalizes Gricean reasoning through nested models of speakers and listeners. While this paradigm offers an elegant way to simulate pragmatic behaviour in NLP tasks such as image captioning and translation, scaling from simple models to natural language presents several challenges. In particular, I discuss the problem of choosing alternative utterances among an unbounded set of sentences, including work on image captioning and ongoing work on translation.
11/1 Keny Chatain (MIT Linguistics)
Is logic useful in the study of meaning? Pronouns can tell.
In this talk, I will show how the connection between logic and natural language has been used to shed light on the system that underlies meaning. I will start off by showing that standard predicate logic provides a remarkably adequate understanding of the behavior of pronouns. I will then present the famous case of so-called donkey sentences, which translations to predicate logic seem unable to capture. These cases are taken to argue in favor of a new take on the meaning of sentences, called Dynamic Semantics: the meaning of sentences, it is claimed, is more appropriately understood as the effect that sentences have on the context. I will show how this approach can capture donkey sentences and other pronoun-related phenomena. Time allowing, I will discuss alternatives.
11/15 Hao Tang (MIT CSAIL)
Automatic speech recognition without linguistic knowledge?
Building a state-of-the-art speech recognizer requires, besides a large corpus of transcribed speech, several additional ingredients, such as a phoneme inventory, a lexicon, and a language model. These ingredients carry linguistic constraints that make training more feasible and more sample-efficient. Recently, there has been a push towards building speech recognizers end to end, i.e., using few or even none of the aforementioned ingredients. This raises fundamental questions: Is it possible to train a speech recognizer without any linguistic constraint? How much data do we need to make it possible? What linguistic constraints are necessary for building a speech recognizer? In this talk, I will review the inner workings of conventional and end-to-end speech recognizers, and, to help answer some of these questions, I will present empirical results on training end-to-end speech recognizers without any linguistic constraints.
12/6 Emma Nguyen (UConn Linguistics)
Using developmental modeling to specify learning and representation of the passive in English children
Complete knowledge of the passive takes significant time to develop in English children. Building on prior work identifying the importance of lexical factors for this process, we specify a Bayesian learning model that can capture experimentally-observed passivization behavior in five-year-olds, given child-directed speech to learn from. Through this developmental model, we identify (i) how English children may be integrating lexical feature information, and (ii) how costly they may view the passive structure to be.
2/26 Thomas Schatz (MIT/UMD)
Leveraging automatic speech recognition technology to model cross-linguistic speech perception in humans
Existing theories of cross-linguistic phonetic category perception agree that listeners perceive foreign sounds by mapping them onto their native phonetic categories. Yet, none of the available theories specify a way to compute this mapping. As a result, they cannot provide systematic quantitative predictions and remain mainly descriptive. In this talk, I will present a new approach that leverages Automatic Speech Recognition (ASR) technology to obtain fully specified mapping between foreign and native sounds. Using the machine ABX evaluation method, we derive quantitative predictions from ASR systems and compare them to empirical observations in human cross-linguistic phonetic category perception. I will present results both where the proposed model successfully predicts empirical effects (for example on the American English /r/-/l/ distinction) and where it fails (for example on the Japanese vowel length contrasts) and discuss possible interpretations.
3/5 Ishita Dasgupta (Harvard)
Evaluating Compositionality in Sentence Embeddings
An important challenge for human-like AI is compositional semantics. Recent research has attempted to address this by using deep neural networks to learn vector space embeddings of sentences, which then serve as input to other tasks. We present a new dataset for one such task, “natural language inference” (NLI), that cannot be solved using only word-level knowledge and requires some compositionality. We find that the performance of state-of-the-art sentence embeddings (InferSent; Conneau et al., 2017) on our new dataset is poor. We analyze some of the decision rules learned by InferSent and find that they are largely driven by simple heuristics that are ecologically valid in its training dataset. Further, we find that augmenting training with our dataset improves test performance on our dataset without loss of performance on the original training dataset. This highlights the importance of structured datasets in better understanding and improving NLP systems.
4/2 Idan Blank (MIT BCS)
When we “know the meaning” of a word, what kind of knowledge do we have?
Understanding words seems to require both linguistic knowledge (stored form-meaning pairings and ways to combine them) and world knowledge (object properties, plausibility of events, etc.). In this talk, I will pose some challenges for common distinctions between these knowledge sources. First, I will ask whether rich information about concrete objects could be, in principle, learned from just the co-occurrence statistics of different words even in the absence of non-linguistic (e.g., perceptual) information. To this end, I will introduce a domain-general approach for leveraging such statistics (as captured by distributional semantic models, DSMs) to recover context-specific human judgments such that, e.g., “dolphin” and “alligator” appear relatively similar when considering size or habitat, but different when considering aggressiveness. Second, I will probe DSMs for “syntactic”, abstract compositional knowledge of verb-argument structure (e.g., “eat”, but not “devour”, can appear without an object). I will demonstrate that these syntactic properties of verbs can often be predicted from distributional information (i.e., without explicit access to “syntax”), indicating that DSMs capture those aspects of verb meaning that correlate with verb syntax. Nevertheless, only a small fraction of distributional information is needed for predicting verb argument structure -- the rest appears to capture semantic properties that are relatively divorced from syntax. In fact, the overall similarity structure across verbs in a DSM is independent of the similarity structure across verbs as determined by their syntax, and both kinds of similarity are needed for explaining human judgments. Together, these two studies attempt to push against the upper bound on the potential complexity of distributional word meanings.
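The context-specific similarity idea can be illustrated with cosine similarity over toy count vectors (the dimensions and numbers below are invented; real DSMs use high-dimensional corpus statistics):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented 3-d count vectors; dimensions stand in for contexts related to
# size, habitat, and aggressiveness, respectively.
vec = {
    "dolphin": [4.0, 5.0, 1.0],
    "alligator": [4.0, 5.0, 6.0],
}

# Restricted to the size/habitat dimensions, the two animals look alike;
# including the aggressiveness dimension pulls them apart.
size_habitat = cosine(vec["dolphin"][:2], vec["alligator"][:2])
overall = cosine(vec["dolphin"], vec["alligator"])
```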
4/23 Hendrik Strobelt (IBM/Harvard)
Visualization for Sequence Models for Debugging and Fun
Visual analysis is a great tool for exploring deep learning models when no strong mathematical hypothesis is yet available. I will present two visual tools where we used design study methodology to allow exploration of patterns in hidden state changes in RNNs/LSTMs (LSTMVis) and exploration of Sequence2Sequence models (Seq2Seq-Vis). Both model types have shown superior performance on NLP tasks like language modeling and language translation. Examples of both tasks will be shown on a variety of models.
As a beautiful distraction, we also use data science methods to investigate large data in a more artistic way. Formafluens is such a data experiment, in which we analyze a large collection of doodles made by humans in the Google Quickdraw tool.
9/28 - Ray Jackendoff (Tufts)
Morphology and Memory
We take Chomsky’s term “knowledge of language” very literally. “Knowledge” implies “stored in memory,” so the basic question of linguistics is reframed as
What do you store in memory such that you can use language, and in what form do you store it?
Traditionally – and in standard generative linguistics – what you store is divided into grammar and lexicon, where grammar contains all the rules, and the lexicon is an unstructured list of exceptions. We develop an alternative view in which rules of grammar are lexical items that contain variables, and in which rules have two functions. In their generative function, they are used to build novel structures, just as in traditional generative linguistics. In their relational function, they capture generalizations over stored items in the lexicon, a role not seriously explored in traditional linguistic theory. The result is a lexicon that is highly structured, with rich patterns among stored items.
We further explore the possibility that this sort of structuring is not peculiar to language, but appears in other cognitive domains as well. The differences among cognitive domains are not in this overall texture, but in the materials over which stored relations are defined – patterns of phonology and syntax in language, of pitches and rhythms in music, of geographical knowledge in navigation, and so on. The challenge is to develop theories of representation in these other domains comparable to that for language.
10/5 - Kasia Hitczenko (UMD/MIT)
Exploring the efficacy of normalization in the acquisition and processing of Japanese vowels
Infants must learn the sound categories of their language and adults need to map particular acoustic productions they hear to one of those learned categories. These tasks can be difficult because there is often a lot of overlap between the acoustic realizations of different categories that can mask which sounds should be grouped together. Previous work has proposed that this overlap is caused, at least in part, by systematic and predictable sources of variability, and that listeners could learn about the structure of this variability and normalize it out to help learn from and process the incoming sounds. In this work, we further explore this idea of normalization, by applying it to the problem of Japanese vowel length contrast – a contrast that current computational models fail to learn due to high overlap between short and long vowels. We find that, at least in the way it is implemented here, normalizing out systematic variability does not substantially improve categorization performance over leaving acoustics unnormalized. We then present an alternative path forward by showing that a strategy that uses both acoustic cues and non-acoustic top-down information in categorization is better able to separate the short and long vowels.
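One simple form of normalization is z-scoring durations within each conditioning context, which removes systematic, context-predictable variability. This is only a sketch of the general idea, not the exact implementation evaluated in the talk:

```python
from statistics import mean, stdev

def normalize(durations, contexts):
    """Z-score each vowel duration within its context, removing
    variability that is predictable from the context."""
    by_ctx = {}
    for d, c in zip(durations, contexts):
        by_ctx.setdefault(c, []).append(d)
    stats = {c: (mean(v), stdev(v)) for c, v in by_ctx.items()}
    return [(d - stats[c][0]) / stats[c][1] for d, c in zip(durations, contexts)]

# Invented durations (ms): fast speech shortens vowels across the board.
durations = [80, 100, 120, 160, 200, 240]
contexts = ["fast", "fast", "fast", "slow", "slow", "slow"]
z = normalize(durations, contexts)
# After normalization, a short-vs-long comparison is no longer confounded
# by speaking rate -- the question is whether this helps categorization.
```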
10/26 - Jon Gauthier (BCS MIT)
What does NLP tell us about language?
Many state-of-the-art models in natural language processing achieve top performance in challenging shared tasks while doing little to explicitly model the syntax or semantics of their input. In light of these results, I will attempt to pitch two of my own past projects in natural language processing with a philosophical spin. I first present a model which reduces syntactic understanding to a by-product of more general pressures of semantic accuracy. I next present evolutionary simulations in which word meanings spontaneously arise due to nonlinguistic cooperative pressures. We close with potentially heretical and hopefully interesting discussion on the ideal relationship between AI engineering and cognitive science.
11/3* - Ryan Cotterell (Johns Hopkins) *Joint with CBMM, Friday 4pm, special location 46-3189
Probabilistic Typology: Deep Generative Models of Vowel Inventories
Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have an [u] sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.
11/16 - David Alvarez Melis (CSAIL MIT)
Interpretability for black-box sequence-to-sequence models
Most current state-of-the-art models for sequence-to-sequence NLP tasks have complex architectures and millions, if not billions, of parameters, making them practically black-box systems. Such lack of transparency can limit their applicability to certain domains and can hamper our ability to diagnose and correct their flaws. Popular black-box interpretability approaches are inapplicable in this context since they assume scalar (or categorical) outputs. In this work, we propose a model to interpret the predictions of any black-box structured input-structured output model around a specific input-output pair. Our method returns an "explanation" consisting of groups of input-output tokens that are causally related. These dependencies are inferred by querying the black-box model with perturbed inputs, generating a graph over tokens from the responses, and solving a partitioning problem to select the most relevant components. We focus the general approach on sequence-to-sequence problems, adopting a variational autoencoder to yield meaningful input perturbations. We test our method across several NLP sequence generation tasks.
11/30 - Takashi Morita (Linguistics MIT) & Timothy O'Donnell
Bayesian learning of Japanese sublexica
Languages borrow words from each other. A borrower language often has a different inventory of possible sound sequences (phonotactics) from a lender's. While loanwords may be reshaped so that they fit the borrower's phonotactics, they can also introduce new sound patterns into the language. Accordingly, native and loanwords can exhibit different phonotactics within a single language, and linguists have proposed that such a language's lexicon is better explained by a mixture of multiple phonotactic grammars: words are classified into sublexica (e.g. native vs. loan), and words belonging to different sublexica are subject to different phonotactic constraints. This approach, however, raises a non-trivial learnability question: can learners classify words into the correct sublexica? Words are not labeled with their sublexicon, so learners need to infer the classification. In this study, we investigate Bayesian unsupervised learning of sublexica. We focus on Japanese data (coded in the International Phonetic Alphabet), whose sublexical phonotactics have been proposed in the linguistic literature. It turns out that even a simple Dirichlet process mixture of ngram models leads to remarkably successful classification.
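The classification step can be sketched with two toy bigram phonotactic grammars: a word is assigned to whichever sublexicon gives it higher probability. The real model is an unsupervised Dirichlet process mixture that also infers the grammars and the number of sublexica; all counts below are invented:

```python
import math

def bigram_logprob(word, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a segment string."""
    lp = 0.0
    padded = "#" + word + "#"  # word-boundary symbol
    for a, b in zip(padded, padded[1:]):
        num = counts.get((a, b), 0) + alpha
        den = sum(v for (x, _), v in counts.items() if x == a) + alpha * vocab_size
        lp += math.log(num / den)
    return lp

# Invented bigram counts for two sublexical phonotactic grammars.
native = {("#", "t"): 3, ("t", "a"): 3, ("a", "#"): 3}
loan = {("#", "f"): 3, ("f", "a"): 3, ("a", "#"): 3}

# Assign a word to the sublexicon under which it is more probable; the
# mixture model does this jointly with learning the grammars themselves.
word = "fa"
label = "loan" if bigram_logprob(word, loan, 4) > bigram_logprob(word, native, 4) else "native"
```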
2 February - Richard Futrell (MIT BCS)
Memory and Locality in Natural Language
I explore the hypothesis that the universal properties of languages can be explained in terms of efficient communication given fixed human information processing constraints. First, I show corpus evidence from 37 languages that word order in grammar and usage is shaped by working memory constraints in the form of dependency locality: a pressure for syntactically linked words to be close to one another in linear order. Next, I develop a new theory of language processing cost, based on rational inference in a noisy channel, that unifies surprisal and memory effects and goes beyond dependency locality to a principle of information locality: that words that predict each other should be close. I show corpus evidence for information locality. Finally, I show that the new processing model resolves a long-standing paradox in the psycholinguistic literature, structural forgetting, where the effects of memory appear to be language-dependent.
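Total dependency length is easy to compute from a parse; a minimal sketch with an invented particle-placement example:

```python
def total_dependency_length(heads):
    """Sum of linear distances between each word and its syntactic head.
    heads[i] is the index of word i's head, or None for the root."""
    return sum(abs(i - h) for i, h in enumerate(heads) if h is not None)

# "John threw out the trash": John->threw, out->threw, the->trash, trash->threw
early_particle = [1, None, 1, 4, 1]
# "John threw the trash out": John->threw, the->trash, trash->threw, out->threw
late_particle = [1, None, 3, 1, 1]

# Dependency locality predicts a preference for the ordering with the
# smaller total dependency length, a pressure that grows as the object
# noun phrase gets longer.
```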
16 February - Kevin Ellis (MIT BCS)
Inducing Phonological Rules: Insights from Bayesian Program Learning
How do linguists come up with phonological rules? How do kids learn pig latin? We develop computational models of these and other phenomena in language. The unifying element of the models is to treat grammars as programs, which lets us apply ideas from the field of program synthesis to learn grammars. This lets the models capture phonological phenomena like vowel harmony or stress patterns and learn synthetic grammars used in prior studies of rule learning. Going beyond individual grammar learning problems, we consider the problem of jointly inferring many related rule systems. By solving many textbook phonology problems, we can ask the model what kind of inductive bias best explains the attested phenomena.
2 March - Uriel Cohen Priva (Brown)
Segment informativity, cost, and universality
How costly are individual segments? Can this be tied to the actuation of lenition processes? I propose that segment informativity (its expected predictability) provides us with a window into these two questions. Using supporting evidence from 7 languages and 3 unrelated lenition processes, I propose that when the informativity of a segment is too low to be maintained faithfully, it is more likely to lenite, predicting the actuation of lenition processes. Extending this work gives insight into how this may be related to cost: Using over 40 languages, I show that segment informativity is universal and is more consistent cross-linguistically than other baselines such as typological frequency and within-language frequency. Such measurements correlate with phonetic and phonological cost, and seem to be sensitive to the morphological makeup of the languages that are investigated.
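Segment informativity is the expected surprisal of a segment over the contexts in which it occurs. A toy version using only the preceding segment as context (the measure discussed in the talk is estimated over richer contexts and real corpora):

```python
import math
from collections import Counter

def informativity(segment, corpus):
    """Expected surprisal of a segment over its preceding-segment contexts:
    I(s) = -sum_c p(c | s) * log2 p(s | c)."""
    pair_counts, ctx_counts = Counter(), Counter()
    for word in corpus:
        padded = "#" + word  # word-boundary symbol as initial context
        for c, s in zip(padded, padded[1:]):
            pair_counts[(c, s)] += 1
            ctx_counts[c] += 1
    seg_total = sum(v for (c, s), v in pair_counts.items() if s == segment)
    return -sum(
        (v / seg_total) * math.log2(v / ctx_counts[c])
        for (c, s), v in pair_counts.items()
        if s == segment
    )

# Invented mini-corpus: 'a' is more predictable from its contexts than 'i',
# so its informativity is lower, making it a likelier target of lenition.
low = informativity("a", ["ta", "ti", "ka"])
high = informativity("i", ["ta", "ti", "ka"])
```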
16 March - Yevgeni Berzak (MIT CSAIL)
From ESL to the Typology of the World's Languages (and Back)
Linguists and psychologists have long been studying cross-linguistic transfer, the influence of a native language on linguistic performance in a foreign language. In this work we provide empirical evidence for this process and demonstrate its use for typology learning, as well as for prediction of native language specific grammatical error distributions in English as a Second Language (ESL). First, we show a strong correlation between language similarities derived from structural features in ESL texts and equivalent similarities obtained from the typological features of the native languages. We leverage this finding to recover an approximation of the native language typological similarity structure directly from ESL text, and perform prediction of typological features in an unsupervised fashion with respect to the target languages. Second, we present a formalization of the Contrastive Analysis framework that uses typological information to predict native language specific distributions of grammatical errors in ESL. Finally, we demonstrate that these two tasks can be combined in a bootstrapping strategy. Time permitting, we will also discuss related ongoing work and resources for learner language.
6 April - Evelina Fedorenko (MIT BCS)
Human language as a code for thought
Although many animal species are endowed with the ability for complex thought, only humans can share these thoughts with one another, via language. My research aims to understand i) the system that supports our linguistic abilities, including its neural implementation, and ii) the relationship between the language system and the rest of the human cognitive arsenal.
I will begin by briefly introducing the “language network”, a set of interconnected brain regions that support language comprehension and production. Based on data from fMRI studies and investigations of patients with global aphasia, I will argue that the language network is functionally selective for language processing over a wide range of non-linguistic processes that have been previously argued to share computational demands with language, including arithmetic, executive functions, music, action observation and execution, and the processing of non-linguistic social signals like gestures. I will then talk about the internal structure of the language network. Although the higher-level language regions are robustly separable from lower-level perceptual and motor articulation regions, dividing up the high-level language network into component parts has proven difficult. In particular, the traditional "cuts" that have been proposed in the literature (including the most common one, based on the distinction between lexico-semantic and syntactic processing) do not seem to be supported by the available evidence.
Given the combination of i) functional selectivity of the language network for linguistic processing, and ii) the apparent lack of clear functional divisions within the network, I will tentatively propose that the language network stores domain-specific knowledge representations in a highly distributed fashion, perhaps with the meaning similarity reflected in the neural pattern similarity. To that end, I will present some recent evidence of robust ability to decode meanings of single words and sentences from the neural activity in the language network using distributional semantic models. I will thus suggest that our language system can be thought of as a robust and generic encoder (in production) and decoder (in comprehension) of non-linguistic conceptual representations.
20 April - Andrei Barbu (MIT CSAIL)
Understanding vision through language and language through vision
We will discuss a research program to ground language in perception and solve a diverse set of vision-language tasks. This allows language understanding to help vision by providing priors over potential interpretations. In addition, it allows us to understand language by means of perception and imagination. We will examine a range of vision and language tasks (description, QA, search, disambiguation, etc.) and show how a single generative model can address seemingly unrelated problems that span different research communities. This single model performs several tasks without being retrained, capturing the crucial human ability to use knowledge learned in one context and generalize it to another. Finally, I will present our ongoing work on developing a novel cognitively-plausible approach to grounded language acquisition, language translation, planning, and physics-based event understanding.
4 May - Oren Tsur (Harvard CS)
What do People Say When They Say Fake News?
It seems that everyone is talking about fake news lately. Academic conferences and symposiums dedicated to the issue pop up like mushrooms after the rain (I'm aware of a couple in Cambridge in recent weeks). Fake news, however, is hard to define and track, let alone fight. In this meeting I'll discuss some characteristics of fake news and show some (very) initial results related to tracking and identifying fake news and the social dynamics related to this phenomenon. Finally, I will muse (and you will share your take) on whether these dynamics are unique to fake news or are inherent to the way language is used and evolves over time.
11 May - Yonatan Belinkov (CSAIL)
On Learning Form and Meaning in Neural Machine Translation Models
Neural machine translation (MT) models obtain state-of-the-art performance while maintaining a simple, end-to-end architecture. However, interpreting such models remains challenging, and not much is known about what they learn about source and target languages during the training process. In this talk, I will present our recent work on investigating what neural MT models learn about morphology. We analyze the representations learned by neural MT models at various levels of granularity and empirically evaluate the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks. We conduct a thorough investigation along several parameters: word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations. Our data-driven, quantitative evaluation sheds light on important aspects of the neural MT system and its ability to capture word structure. As time permits, I will also discuss ongoing work on semantics in neural MT models and how distinctions between form and meaning are reflected in the neural MT representations.
22 September - Timothy J. O'Donnell (BCS)
Computation and Storage in Language
A much-celebrated aspect of language is the way in which it allows us to express and comprehend an unbounded number of thoughts. This property is made possible because language consists of several combinatorial systems which can be used to productively build novel forms using a large inventory of stored, reusable parts: the lexicon. For any given language, however, there are many more potentially storable units of structure than are actually used in practice --- each giving rise to many ways of forming novel expressions. For example, English contains suffixes which are highly productive and generalizable (e.g., -ness; Lady Gagaesqueness, pine-scentedness) and suffixes which can only be reused in specific words, and cannot be generalized (e.g., -th; truth, width, warmth). How are such differences in generalizability and reusability represented? What are the basic, stored building blocks at each level of linguistic structure? When is productive computation licensed and when is it not? How can the child acquire these systems of knowledge? I will discuss a theoretical framework designed to address these questions. The approach is based on the idea that the problem of productivity and reuse can be solved by optimizing a tradeoff between a pressure to store fewer, more reusable lexical items and a pressure to account for each linguistic expression with as little computation as possible. I will show how this approach addresses a number of problems in English morphology, phonology, and syntax.
6 October - Karthik Narasimhan (CSAIL)
Language understanding using reinforcement learning
In this talk, I will describe an approach to learning natural language semantics using reward-based feedback. This is in contrast to many NLP approaches that rely on non-trivial amounts of quality supervision, which is often expensive and difficult to obtain. We consider the task of learning control policies for text-based games where an agent needs to understand natural language to operate effectively in a virtual environment. In these games, all interactions in the virtual world are through text and the underlying state is not observed. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback, capturing semantics of the game states in the process. Experiments on two game worlds show that reinforcement learning can be used to learn expressive representations.
20 October - Emily Morgan (Tufts Psychology)
Generative and item-specific knowledge contribute to language processing and evolution
The ability to generate novel utterances compositionally using generative knowledge is a hallmark property of human language. At the same time, languages contain non-compositional or idiosyncratic items, such as irregular verbs, idioms, etc. In this talk I ask how and why language achieves a balance between these two systems--generative and item-specific--from both the synchronic and diachronic perspectives. Specifically, I focus on the case of word order preferences for binomial expressions of the form “X and Y”, e.g. "bread and butter" versus "butter and bread". I show that ordering preferences for these expressions arise in part from violable generative constraints on the phonological, semantic, and lexical properties of the constituent words--e.g. short-before-long, men-before-women, etc.--but that expressions also have their own idiosyncratic preferences. Using behavioral experiments, corpus data, and evolutionary modeling, I will argue that both the way these preferences manifest diachronically and the way they are processed synchronically is constrained by expression frequency: in other words, the ability to learn and transmit idiosyncratic preferences for an expression is constrained by how frequently it is used. Moreover, I argue for a regularization bias in language learning and production, which, together with the process of cultural transmission, shapes the language-wide distribution of binomial expression preferences.
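The frequency-dependent balance between generative constraints and item-specific preferences can be sketched as an interpolation whose weight on the item-specific term grows with expression frequency (a hypothetical parameterization, not the fitted model from the talk):

```python
def ordering_preference(generative_score, item_score, freq, k=10.0):
    """Interpolate a generative-constraint score (short-before-long, etc.)
    with an item-specific preference; the weight on the item-specific term
    grows with how often the expression is used."""
    lam = freq / (freq + k)
    return lam * item_score + (1 - lam) * generative_score

# A rare binomial is governed almost entirely by generative constraints;
# a frequent one can maintain an idiosyncratic preference that overrides them.
rare = ordering_preference(generative_score=0.9, item_score=0.1, freq=0.0)
frequent = ordering_preference(generative_score=0.9, item_score=0.1, freq=1000.0)
```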
10 November - Roger Levy (BCS)
Broad-coverage data and computational models for understanding human language comprehension
Understanding the nature of the knowledge deployed for prediction and interpretation in real time language comprehension is of both fundamental scientific interest and practical significance. There is a several-decades-long tradition of addressing this question with controlled, small-scale studies using artificially constructed sentences. However, these studies yield only small data sets and are of limited ecological validity. An increasingly popular alternative is using broad-coverage models from computational linguistics to analyze human data from comprehension of more naturalistic materials, such as reading of newspaper text. While this approach yields larger datasets with greater ecological validity, the higher dimensionality and correlational structure of the resulting data poses substantial analytic challenges. Here I survey some of the recent work on broad-coverage models and data for investigating natural language understanding. I cover two studies from my own lab. The first sought to characterize the precise functional form of the relationship between prediction and word-by-word reading times. The second followed up on a striking claim by Frank and Bod (2011) that reading times were better predicted by sequential (simple recurrent network) models than by hierarchical (probabilistic context-free grammar) models. More broadly, this talk seeks to foster discussion about what new questions we should be asking with broad-coverage language comprehension models and datasets, and how best to get the answers.
Induction of phonological grammars using Minimum Description Length
Speakers' knowledge of the sound pattern of their language -- their knowledge of morpho-phonology -- goes well beyond the plain phonetic forms of words. The English-speaking child knows, for example, that the aspiration of the first segment of khæt ‘cat’ is predictable and the French-speaking child knows that the final l of table ‘table’ is optional and can be deleted while that of parle ‘speak’ cannot. According to a long-standing model in linguistics, morpho-phonological knowledge is distributed between a lexicon with morphemes, usually referred to as Underlying Representations (URs), and a mapping that transforms URs to surface forms, implemented using context-sensitive rewrite rules or interacting constraints. We will present two unsupervised learners that acquire both URs and the phonological mapping, one which induces rule-based grammars and another which induces constraint-based grammars. Our learners are based on the principle of Minimum Description Length (MDL) which -- like the closely related Bayesian approach -- aims at balancing the complexity of the grammar and its fit of the data. We will discuss ways in which the framework of Minimum Description Length may allow us to compare competing phonological models in terms of their predictions regarding learning.
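The MDL objective itself can be illustrated with a toy comparison between a compact rule-based grammar and a larger grammar that memorizes surface forms (all bit counts and probabilities below are invented for illustration):

```python
import math

def description_length(grammar_bits, data, form_prob):
    """MDL score: bits to encode the grammar plus bits to encode the
    data under that grammar. Lower is better."""
    return grammar_bits + sum(-math.log2(form_prob(f)) for f in data)

data = ["khaet"] * 10  # ten tokens of aspirated 'cat'

# The rule-based grammar is compact but (here) compresses each form less
# tightly; the memorizing grammar is larger but assigns forms higher
# probability. MDL trades the two terms off against each other.
rule_based = description_length(20, data, lambda f: 0.25)
memorizing = description_length(100, data, lambda f: 0.5)
best = "rule" if rule_based < memorizing else "memorizing"
```

With enough data the balance can flip toward the grammar that fits the data more tightly, which is exactly the tradeoff that MDL, like the closely related Bayesian approach, negotiates.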