Past Meetings

Spring 2021

3/10 Aniello De Santo (University of Utah)
Minimalist Parsing as a Psycholinguistic Model

Computational models grounded in rich grammatical formalisms can be used to explore whether --- and to what degree --- the structural representations hypothesized by generative linguists are relevant to sentence processing. In this talk, I present a line of work exploring how a top-down parser for Minimalist grammars (MGs; Stabler, 1996) can explain well-known contrasts in off-line sentence processing in terms of subtle structural differences. This model is especially suited to probe the relation between syntactic and processing complexity, as it specifies: 1) a formalized theory of syntax; 2) a sound and complete parser for the grammatical formalism; 3) a linking theory between syntactic assumptions and processing behavior, in the form of metrics measuring memory usage. To exemplify the mechanism behind the model, I discuss its performance on the off-line processing asymmetries reported for Italian post-verbal subject constructions. Then, I present ongoing work evaluating the MG parser as a good, non-probabilistic formal model of how gradient acceptability can be derived from categorical grammars. By investigating the MG model’s performance across a diverse array of processing phenomena, this project aims to add support to the psychological plausibility of fine-grained grammatical knowledge contributing to processing cost. It thus highlights the MG parsing model as a valuable, empirically grounded, theoretically insightful reframing of the Derivational Theory of Complexity (Miller and Chomsky, 1963).
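
To make the linking-theory idea concrete, here is a toy sketch (my own illustration, not De Santo's implementation) of one commonly used memory metric in MG top-down parsing, tenure: how many steps a predicted node must be held in memory before the parser can flush it. The node names and step numbers below are invented.

```python
# Toy illustration of a tenure-style memory metric for a top-down parse.
# Each hypothetical derivation-tree node is paired with the step at which the
# parser predicts it (enters memory) and the step at which it is scanned or
# flushed (leaves memory).
toy_derivation = {
    "C":       (1, 2),
    "T":       (2, 4),
    "Subject": (3, 9),   # held for a long time, e.g. under a post-verbal-subject analysis
    "v":       (4, 5),
    "V":       (5, 6),
    "Object":  (6, 7),
}

def tenures(nodes):
    """Tenure of a node = flush step minus prediction step."""
    return {name: out - inn for name, (inn, out) in nodes.items()}

def max_tenure(nodes):
    """A simple complexity metric: the longest any node is kept in memory."""
    return max(tenures(nodes).values())

if __name__ == "__main__":
    print(tenures(toy_derivation))
    print("MaxTenure:", max_tenure(toy_derivation))
```

Comparing such metric values across the derivations of two competing constructions is, schematically, how structural differences are linked to off-line processing contrasts.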
 
4/7 Grusha Prasad (Johns Hopkins)
How do neural networks encode and use syntactic information?
 
Neural network models trained on text alone, without explicit syntactic supervision, have been surprisingly successful in tasks that require sensitivity to sentence structure. This raises the following questions: what syntactic information do these models encode in their sentence representations, and how do these models use this encoded information? In this talk, I will describe two projects that explore these questions and discuss why answering them can be of interest to (psycho)linguists. In the first project, drawing on the syntactic priming paradigm from psycholinguistics, we analyzed how neural networks (specifically LSTMs) represent English sentences with relative clauses (RCs). We found that the sentence representations in the LSTMs we tested were organized in a linguistically interpretable manner: the representations of sentences with a specific RC subtype (e.g., subject RCs) were more similar to each other than they were to representations of sentences with a different RC subtype (e.g., object RCs), and the representations of certain RC subtypes were similar to other RC subtypes to varying degrees (e.g., full object RCs were more similar to reduced object RCs than to full passive RCs). In the second project, we used counterfactual representations of English sentences to explore how different neural networks (specifically BERT models) use information about RC boundaries when predicting the grammatical number of the verb. In a sentence like “The skater that the officers love smiles”, the linguistically correct way of predicting the number of “smiles” is to recognize that the verb is outside the relative clause and should therefore agree with “skater” and not “officers”. If the models use information about RC boundaries in this manner, then altering the model’s representation of the word “smiles” to inaccurately encode that the verb is inside the RC should worsen the model’s agreement prediction. We found that such counterfactual representations impacted the agreement prediction as expected in the larger BERT models, but not in the smaller ones. We also examined the representational similarity of sentences in the subspace of the models’ representations that encodes information about RC boundaries and found that this similarity was explained by interpretable linguistic features.
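
As a rough illustration of the similarity analysis in the first project, the sketch below (with made-up vectors standing in for LSTM hidden states; not the authors' code) compares within-class and across-class cosine similarity for two hypothetical RC subtypes.

```python
import numpy as np

# Made-up "sentence representations" for two RC subtypes; if representations
# cluster by subtype, within-class similarity exceeds across-class similarity.
rng = np.random.default_rng(0)
src_center, orc_center = rng.normal(size=50), rng.normal(size=50)
reps = {
    "SRC": [src_center + 0.1 * rng.normal(size=50) for _ in range(20)],
    "ORC": [orc_center + 0.1 * rng.normal(size=50) for _ in range(20)],
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_similarity(group_a, group_b):
    # Skip identical items when comparing a group with itself.
    sims = [cosine(u, v) for i, u in enumerate(group_a) for j, v in enumerate(group_b)
            if group_a is not group_b or i != j]
    return float(np.mean(sims))

print("within SRC :", mean_similarity(reps["SRC"], reps["SRC"]))
print("SRC vs ORC :", mean_similarity(reps["SRC"], reps["ORC"]))
```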
 
4/28 Anthony Yacovone (Harvard)
Unexpected words or unexpected languages? Two ERP effects of code-switching in naturalistic discourse
 
Bilingual speakers often switch between languages in conversation without any advance notice. Psycholinguistic research has found that these language shifts (or code-switches) can be costly for comprehenders in certain situations. In this talk, I will discuss a recent project that explores the nature of these costs by comparing code-switches to other types of unexpected linguistic material. To do this, we used a novel EEG paradigm, the Storytime task, in which we record readings of natural texts, and experimentally manipulate their properties by splicing in words. In this study, we manipulated the language of our target words (English, Spanish) and their fit with the preceding context (strong-fit, weak-fit). If code-switching incurs a unique cost beyond that incurred by an unexpected word, then we should see an additive pattern in our ERP indices. If an effect is driven by lexical expectation alone, then there should be a non-additive interaction such that all unexpected forms incur a similar cost. We found three effects: a general prediction effect (a non-additive N400), a post-lexical recognition of the switch in languages (an LPC for code-switched words), and a prolonged integration difficulty associated with weak-fitting words regardless of language (a sustained negativity). We interpret these findings as suggesting that the processing difficulties experienced by bilinguals can largely be understood within more general frameworks for understanding language comprehension. Our findings are consistent with the broader literature demonstrating that bilinguals do not have two wholly separate language systems but rather a single language system capable of using two coding systems.
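
The additive-versus-interactive logic can be made concrete with a toy calculation over hypothetical mean ERP amplitudes for the 2x2 design (language of the target word x contextual fit); the numbers below are invented and stand in for a single component's amplitude.

```python
# Hypothetical mean amplitudes (microvolts) for the 2x2 design.
amps = {
    ("English", "strong-fit"): -1.0,
    ("English", "weak-fit"):   -4.0,
    ("Spanish", "strong-fit"): -3.5,
    ("Spanish", "weak-fit"):   -4.2,
}

lang_effect = amps[("Spanish", "strong-fit")] - amps[("English", "strong-fit")]
fit_effect  = amps[("English", "weak-fit")]   - amps[("English", "strong-fit")]
additive_prediction = amps[("English", "strong-fit")] + lang_effect + fit_effect
interaction = amps[("Spanish", "weak-fit")] - additive_prediction

print("predicted if additive:", additive_prediction)
print("interaction term     :", interaction)  # ~0 => additive; far from 0 => non-additive
```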
 
5/12 Ellie Pavlick (Brown)
You can lead a horse to water...: Representing vs. Using Features in Neural NLP
 
A wave of recent work has sought to understand how pretrained language models work. Such analyses have resulted in two seemingly contradictory sets of results. On one hand, work based on "probing classifiers" generally suggests that SOTA language models contain rich information about linguistic structure (e.g., parts of speech, syntax, semantic roles). On the other hand, work which measures performance on linguistic "challenge sets" shows that models consistently fail to use this information when making predictions. In this talk, I will present a series of results that attempt to bridge this gap. Our recent experiments suggest that the disconnect is not due to catastrophic forgetting nor is it (entirely) explained by insufficient training data. Rather, it is best explained in terms of how "accessible" features are to the model following pretraining, where "accessibility" can be quantified using an information-theoretic interpretation of probing classifiers.
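
As a minimal sketch of the probing-classifier setup (not Pavlick's experiments), the snippet below fits a linear probe on synthetic "activations." An information-theoretic notion of accessibility would additionally ask how cheaply the feature can be extracted (for example, via a codelength measure), not just whether a probe can find it at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Random vectors stand in for pretrained-model activations; a binary label
# stands in for a linguistic feature (e.g., subject vs. object). The feature
# is constructed to be linearly recoverable from the representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))
w = rng.normal(size=768)
y = (X @ w + 0.5 * rng.normal(size=2000)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```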
 

Spring 2019

Place: MIT 46-5165, Time: Thursdays, 5pm

Schedule 

2/14 Reading Group
Structures, not strings: linguistics as part of the cognitive sciences by Everaert et al. (2015)

2/21 Sherry Yong Chen (MIT Linguistics)

The linguistics of one, two, three

The interpretation of number words in context is of interest not only to linguists, but also logicians, psychologists, and computer scientists. In this talk, we will discuss some theoretical and developmental questions related to the linguistics of English numerals.
Bare numerals (e.g. two, three) present an interesting puzzle to semantic and pragmatic theories, as they seem to vary between several different interpretations: ‘at least n’, ‘exactly n’, and sometimes even ‘at most n’. We will examine how the availability of a particular interpretation seems to depend on the interaction between linguistic structure and contextual factors, and discuss three approaches that try to capture the relationship between these interpretations. 
Turning to the acquisition of bare numerals, developmental research suggests that by the age of 5, preschoolers are able to access ‘non-exact’ interpretations of a bare numeral in contexts where these interpretations are licensed, just like adult speakers. A natural hypothesis is that knowledge of the full range of interpretations comes through a prior understanding of the meaning of explicit expressions such as ‘at least/at most’ in English. This turns out to be questionable, however, since it has also been shown that 5-year-olds have not yet acquired the meaning of the expressions at least and at most. Time permitting, we will end with a discussion of what all this means for the development of numerical concepts and/or language development in general.

3/14 Reading Group: Danfeng Wu (MIT Linguistics), Syntactic Theory: A Formal Introduction Chapters 9.3-9.9

What is the role of psycholinguistic evidence (specifically, evidence from language processing) in the study of language? What is the relation between knowledge of language and use of language? We hope to explore these questions through a discussion of an HPSG textbook chapter. HPSG (Head-driven Phrase Structure Grammar) is a syntactic framework distinct from generative transformational grammar: it is surface-oriented, constraint-based, and strongly lexicalist. The chapter argues that HPSG is more compatible than transformational grammar with observed facts about language processing. For instance, language processing is incremental and rapid (e.g. Tanenhaus et al. 1995 & 1996, Arnold et al. 2002): the order of presentation of the words largely determines the order of the listener’s mental operations in comprehending them. Lexical choices also have a substantial influence on processing (MacDonald et al. 1994). For these reasons, the chapter argues, such psycholinguistic evidence supports an HPSG-style grammar and poses difficulties for transformational grammar.

3/21 Paola Merlo (University of Geneva)

In the computational study of intelligent behaviour, the domain of language is distinguished by the complexity of the representations and the sophistication of the domain theory that is available. It also has a large amount of observational data available for many languages. The main scientific challenge for computational approaches to language is the creation of theories and methods that fruitfully combine large-scale, corpus-based approaches with the linguistic depth of more theoretical methods. I report here on some recent and current work on word order universals and argument structure that exemplifies the quantitative computational syntax approach. First, we demonstrate that typological frequencies of noun phrase orderings (Universal 20) are systematically correlated with abstract syntactic principles at work in structure building and movement. Then, we investigate higher-level structural principles of efficiency and complexity. In a large-scale computational study, we confirm a trend towards minimization of the distance between words, in time and across languages. In the third case study, much like the comparative method in linguistics, cross-lingual corpus investigations take advantage of any corresponding annotation or linguistic knowledge across languages. We show that corpus data and typological data involving the causative alternation exhibit interesting correlations explained by the notion of spontaneity of an event. Finally, time permitting, I will discuss current work investigating whether the notion of similarity in the intervention theory of locality is related to current notions of similarity in word embedding space.
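
The "distance between words" measure at issue can be illustrated in a few lines (a generic dependency-length calculation, not Merlo's pipeline); the head indices below are for a toy sentence.

```python
# Total dependency length: sum of absolute linear distances between each word
# and its syntactic head. Shorter totals indicate more "local" orders.
def total_dependency_length(heads):
    """heads[i] is the 1-based index of word i+1's head, 0 for the root."""
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)

# "she gave him a book": gave is the root; she, him, book depend on gave; a on book.
heads = [2, 0, 2, 5, 2]
print(total_dependency_length(heads))  # 1 + 1 + 1 + 3 = 6
```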

4/4 Candace Ross (MIT CSAIL)

Grounding Language Acquisition by Training Semantic Parsers using Captioned Videos

We develop a semantic parser that is trained in a grounded setting using pairs of videos captioned with sentences. This setting is both data-efficient, requiring little annotation, and similar to the experience of children, who observe their environment and listen to speakers. The semantic parser recovers the meaning of English sentences despite not having access to any annotated sentences. It does so despite the ambiguity inherent in vision, where a sentence may refer to any combination of objects, object properties, relations, or actions taken by any agent in a video. For this task, we collected a new dataset for grounded language acquisition. Learning a grounded semantic parser — turning sentences into logical forms using captioned videos — can significantly expand the range of data that parsers can be trained on, lower the effort of training a semantic parser, and ultimately lead to a better understanding of child language acquisition.

4/11 Reading Group
Integration of visual and linguistic information in spoken language comprehension by Tanenhaus et al. (1995)

4/18 Tal Linzen (JHU Cognitive Science) (co-hosted with MIT CBMM)

Linguistics in the age of deep learning

Deep learning systems with minimal or no explicit linguistic structure have recently proved to be surprisingly successful in language technologies. What, then, is the role of linguistics in language technologies in the deep learning age? I will argue that the widespread use of these "black box" models provides an opportunity for a new type of contribution: characterizing the desired behavior of the system along interpretable axes of generalization from the training set, and identifying the areas in which the system falls short of that standard.

I will illustrate this approach in word prediction (language models) and natural language inference. I will show that recurrent neural network language models are able to process many syntactic dependencies in typical sentences with considerable success, but when evaluated on carefully controlled materials, their error rate increases sharply. Perhaps more strikingly, neural inference systems (including ones based on the widely popular BERT model), which appear to be quite accurate according to the standard evaluation criteria used in the NLP community, perform very poorly in controlled experiments; for example, they universally infer from “the judge chastised the lawyer” that “the lawyer chastised the judge”. Finally, if time permits, I will show how neural network models can be used to address classic questions in linguistics, in particular by providing a platform for testing for the necessity and sufficiency of explicit structural biases in the acquisition of syntactic transformations.
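
The controlled-evaluation logic is essentially a minimal-pair comparison: a model is counted as correct when it assigns higher probability to the grammatical continuation than to the ungrammatical one. The sketch below uses hand-set log-probabilities as a stand-in scorer (hypothetical numbers, not results from the talk); in practice the scorer would be an RNN or Transformer language model.

```python
import math

# Toy log-probabilities for verb continuations of agreement-attraction contexts.
TOY_LOGPROBS = {
    ("The keys to the cabinet", "are"): math.log(0.04),
    ("The keys to the cabinet", "is"):  math.log(0.07),   # an attraction error
    ("The key to the cabinets", "is"):  math.log(0.06),
    ("The key to the cabinets", "are"): math.log(0.02),
}

def scorer(context, word):
    return TOY_LOGPROBS[(context, word)]

items = [
    ("The keys to the cabinet", "are", "is"),
    ("The key to the cabinets", "is", "are"),
]

correct = sum(scorer(c, good) > scorer(c, bad) for c, good, bad in items)
print(f"accuracy: {correct}/{len(items)}")
```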

5/2 Rachel Ryskin (MIT BCS)

Lifelong learning of linguistic representations

Like much of the perceptual input that humans experience, language is highly variable. Two speakers may produce the same phoneme with different acoustics, and a sentence can have multiple syntactic parses. How do listeners (typically) choose the correct interpretation when the same sentence can map onto several potential meanings? In this talk, I will provide evidence that comprehenders navigate this variability by tracking distributional information in the input and using it to constrain the possible interpretations. For example, listeners can rapidly learn, from exposure to co-occurrence statistics, that a verb is much more likely to be followed by one syntactic structure than another structure (though both are grammatical). Thus, language representations are continuously shaped by experience even in adulthood. However, the underlying learning mechanisms and their neural underpinnings remain an open question. I will discuss recent efforts aimed at testing an error-based learning account, including work with patients with hippocampal amnesia, as well as evidence that the relevant linguistic representations may change over the lifespan.

5/9 Joshua Hartshorne (Boston College)

In popular culture, robots or other theory-of-mind-impaired fictional characters have difficulties with metaphors or indirect speech because they lack "common sense". In fact, common sense plays an even more central role in language processing than these depictions suggest. Compare:

(1) The city council denied the protesters a permit because they feared violence.

(2) The city council denied the protesters a permit because they advocated violence.

Most people interpret they as referring to the city council in (1) and the protesters in (2). It is hard to explain this without invoking common sense. Similarly, compare:

(3) The hat fit in the box because it was big.

(4) The hat didn't fit in the box because it was big.

In (3), it is most naturally taken to refer to the box; in (4), to the hat. Such examples are ubiquitous in natural language. In this talk, I will describe recent computational and experimental work that tries to make sense of this commonplace phenomenon.

5/16 Ethan Wilcox (Harvard)

Recurrent Neural Networks (RNNs) are one type of neural model that has achieved state-of-the-art scores on a variety of natural language tasks, including translation and language modeling (which is used in, for example, text prediction). However, the nature of the representations that these 'black boxes' learn is poorly understood, raising issues of accountability and controllability for NLP systems. In this talk, I will argue that one way to assess what these networks are learning is to treat them like subjects in a psycholinguistic experiment. By feeding them hand-crafted sentences that probe the model's underlying knowledge of language, I will demonstrate that they are able to learn the filler-gap dependency, and are even sensitive to the hierarchical constraints implicated in the dependency. Next, I turn to "island effects", structural configurations that block the filler-gap dependency and have been theorized to be unlearnable. I demonstrate that RNNs are able to learn some of the "island" constraints and even recover some of their pre-island gap expectations. These experiments demonstrate that sequence models trained on linear strings are able to learn some fine-grained syntactic rules; however, their behavior remains un-humanlike in many cases.
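
One common way such filler-gap effects are quantified in this line of work is a 2x2 surprisal interaction at the critical region; the sketch below uses invented surprisal values purely to show the arithmetic.

```python
# Made-up surprisal values (bits) at the critical region for the 2x2 design
# [±filler, ±gap]. A filler should make a gap less surprising, and a gap
# without a filler more surprising.
surprisal = {
    ("+filler", "+gap"): 3.0,
    ("-filler", "+gap"): 8.0,
    ("+filler", "-gap"): 7.5,
    ("-filler", "-gap"): 4.0,
}

gap_effect_with_filler    = surprisal[("+filler", "+gap")] - surprisal[("+filler", "-gap")]
gap_effect_without_filler = surprisal[("-filler", "+gap")] - surprisal[("-filler", "-gap")]
licensing_interaction = gap_effect_with_filler - gap_effect_without_filler

print("licensing interaction:", licensing_interaction)  # negative => the filler licenses the gap
```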

 

Fall 2018

 

10/4 Meilin Zhan (MIT BCS)

Comparing theories of speaker choice using classifier production in Mandarin Chinese

Speakers often have more than one way to express the same meaning. What general principles govern speaker choice in the face of optionality, when nearly semantically equivalent alternatives exist? Studies have shown that optional reduction in language is sensitive to contextual predictability: the more predictable a linguistic unit is, the more likely it is to be reduced. Yet it is unclear whether speaker choice is geared toward audience design, or toward facilitating production. Here we argue that for a different optionality phenomenon, namely classifier choice in Mandarin Chinese, Uniform Information Density and at least one plausible variant of availability-based production make opposite predictions regarding the relationship between the predictability of the upcoming material and speaker choices. In a corpus analysis of Mandarin Chinese, we show that the distribution of speaker choices supports the availability-based production account, and not Uniform Information Density.

10/18 Reuben Cohn-Gordon (Stanford Linguistics)

Bayesian Pragmatic Models for Natural Language

The Rational Speech Acts model (RSA) formalizes Gricean reasoning through nested models of speakers and listeners. While this paradigm offers an elegant way to simulate pragmatic behaviour in NLP tasks such as image captioning and translation, scaling from simple models to natural language presents several challenges. In particular, I discuss the problem of choosing alternative utterances from an unbounded set of sentences, including work on image captioning and ongoing work on translation.
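
For readers unfamiliar with RSA, here is a minimal reference-game implementation (a toy, far removed from the natural-language-scale models discussed in the talk).

```python
import numpy as np

# Three utterances, three referents. truth[u, s] = 1 iff utterance u is
# literally true of state s.
utterances = ["hat", "glasses", "friend"]
states = ["face", "face+hat", "face+glasses"]
truth = np.array([
    [0., 1., 0.],   # "hat"
    [0., 0., 1.],   # "glasses"
    [1., 1., 1.],   # "friend" (true of everyone)
])
alpha = 4.0  # speaker rationality

L0 = truth / truth.sum(axis=1, keepdims=True)   # literal listener P(s | u)
S1 = L0 ** alpha                                # softmax of alpha * log L0
S1 = S1 / S1.sum(axis=0, keepdims=True)         # pragmatic speaker P(u | s)
L1 = S1 / S1.sum(axis=1, keepdims=True)         # pragmatic listener P(s | u), uniform prior

# The unmodified "friend" is pragmatically taken to refer to the plain face.
print("P(state | 'friend') =", dict(zip(states, np.round(L1[2], 3))))
```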

11/1 Keny Chatain (MIT Linguistics)

Is logic useful in the study of meaning? Pronouns can tell.

Pronouns, like he, she, or it, are among the highest-frequency items in English and other languages that use overt pronouns. Within the same language, they have a variety of uses that do not form an obvious natural class. They can, for instance, be used to refer to a previously mentioned name (the anaphoric use), but also as variables in quantified statements (the bound use, e.g. every athlete thinks he will win). More intriguingly, it seems, in broad strokes, that no language distinguishes these uses by employing different forms, suggesting an underlying connection between them.

In this talk, I will show how this connection has been used to shed light on the system that underlies meaning. I will start off by showing that standard predicate logic provides a remarkably adequate understanding of the behavior of pronouns. I will then present the famous case of so-called donkey sentences, which translations into predicate logic seem unable to capture. These cases are taken to argue in favor of a new take on the meaning of sentences, called Dynamic Semantics: the meaning of a sentence, it is claimed, is more appropriately understood as the effect that the sentence has on the context. I will show how this approach can capture donkey sentences and other pronoun-related phenomena. Time allowing, I will discuss alternatives.
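
A standard textbook illustration (my addition, not taken from the talk) of why donkey sentences resist a straightforward predicate-logic translation:

```latex
% "Every farmer who owns a donkey beats it."
% The natural predicate-logic translation leaves the pronoun's variable unbound:
\[
\forall x \, \big[\, (\mathrm{farmer}(x) \land \exists y\,(\mathrm{donkey}(y) \land \mathrm{own}(x,y)))
  \rightarrow \mathrm{beat}(x,y) \,\big]
\]
% The final y falls outside the scope of \exists y, so it is free in beat(x,y).
% Replacing \exists y with a wide-scope \forall y gives the right truth conditions,
% but only by treating the indefinite "a donkey" as a universal quantifier.
```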

11/15 Hao Tang (MIT CSAIL)

Automatic speech recognition without linguistic knowledge?

Building a state-of-the-art speech recognizer requires, besides a large corpus of transcribed speech, several additional ingredients, such as a phoneme inventory, a lexicon, and a language model. These ingredients carry linguistic constraints that make training more feasible and more sample-efficient. Recently, there has been a push towards building speech recognizers end to end, i.e., using few or even none of the aforementioned ingredients. This raises fundamental questions: is it possible to train a speech recognizer without any linguistic constraints? How much data do we need to make it possible? What linguistic constraints are necessary for building a speech recognizer? In this talk, I will review the inner workings of conventional and end-to-end speech recognizers, and, to help answer some of these questions, I will present empirical results from training end-to-end speech recognizers without any linguistic constraints.

12/6 Emma Nguyen (UConn Linguistics)

Using developmental modeling to specify learning and representation of the passive in English children

Complete knowledge of the passive takes significant time to develop in English children. Building on prior work identifying the importance of lexical factors for this process, we specify a Bayesian learning model that can capture experimentally-observed passivization behavior in five-year-olds, given child-directed speech to learn from. Through this developmental model, we identify (i) how English children may be integrating lexical feature information, and (ii) how costly they may view the passive structure to be.

 

Spring 2018

 

2/26 Thomas Schatz (MIT/UMD)

Leveraging automatic speech recognition technology to model cross-linguistic speech perception in humans

Existing theories of cross-linguistic phonetic category perception agree that listeners perceive foreign sounds by mapping them onto their native phonetic categories. Yet none of the available theories specify a way to compute this mapping. As a result, they cannot provide systematic quantitative predictions and remain mainly descriptive. In this talk, I will present a new approach that leverages Automatic Speech Recognition (ASR) technology to obtain a fully specified mapping between foreign and native sounds. Using the machine ABX evaluation method, we derive quantitative predictions from ASR systems and compare them to empirical observations of human cross-linguistic phonetic category perception. I will present results both where the proposed model successfully predicts empirical effects (for example, on the American English /r/-/l/ distinction) and where it fails (for example, on the Japanese vowel length contrast), and discuss possible interpretations.
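
The machine ABX evaluation mentioned above reduces to a simple discrimination score; the sketch below computes it over random toy "representations" (not the ASR features used in the talk).

```python
import numpy as np

# ABX score: the proportion of (A, B, X) triples in which X, drawn from A's
# category, is closer to A than to B under some distance over representations.
rng = np.random.default_rng(0)
cat_r = rng.normal(loc=0.0, size=(30, 10))   # toy tokens of one category, e.g. /r/
cat_l = rng.normal(loc=1.0, size=(30, 10))   # toy tokens of the other, e.g. /l/

def abx_score(A_tokens, B_tokens, n_triples=2000):
    hits = 0
    for _ in range(n_triples):
        ia, ix = rng.choice(len(A_tokens), size=2, replace=False)
        a, x = A_tokens[ia], A_tokens[ix]
        b = B_tokens[rng.integers(len(B_tokens))]
        hits += np.linalg.norm(x - a) < np.linalg.norm(x - b)
    return hits / n_triples

print("ABX(/r/ vs /l/):", abx_score(cat_r, cat_l))  # 0.5 = chance, 1.0 = perfect
```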

3/5 Ishita Dasgupta (Harvard)

Evaluating Compositionality in Sentence Embeddings

An important challenge for human-like AI is compositional semantics. Recent research has attempted to address this by using deep neural networks to learn vector space embeddings of sentences, which then serve as input to other tasks. We present a new dataset for one such task, “natural language inference” (NLI), that cannot be solved using only word-level knowledge and requires some compositionality. We find that the performance of state-of-the-art sentence embeddings (InferSent, Conneau et al. (2017)) on our new dataset is poor. We analyze some of the decision rules learned by InferSent and find that they are largely driven by simple heuristics that are ecologically valid in its training dataset. Further, we find that augmenting training with our dataset improves test performance on our dataset without loss of performance on the original training dataset. This highlights the importance of structured datasets in better understanding and improving NLP systems.

4/2 Idan Blank (MIT BCS)

When we “know the meaning” of a word, what kind of knowledge do we have?

Understanding words seems to require both linguistic knowledge (stored form-meaning pairings and ways to combine them) and world knowledge (object properties, plausibility of events, etc.). In this talk, I will pose some challenges for common distinctions between these knowledge sources. First, I will ask whether rich information about concrete objects could be, in principle, learned from just the co-occurrence statistics of different words even in the absence of non-linguistic (e.g., perceptual) information. To this end, I will introduce a domain-general approach for leveraging such statistics (as captured by distributional semantic models, DSMs) to recover context-specific human judgments such that, e.g., “dolphin” and “alligator” appear relatively similar when considering size or habitat, but different when considering aggressiveness. Second, I will probe DSMs for “syntactic”, abstract compositional knowledge of verb-argument structure (e.g., “eat”, but not “devour”, can appear without an object). I will demonstrate that these syntactic properties of verbs can often be predicted from distributional information (i.e., without explicit access to “syntax”), indicating that DSMs capture those aspects of verb meaning that correlate with verb syntax. Nevertheless, only a small fraction of distributional information is needed for predicting verb argument structure - the rest appears to capture semantic properties that are relatively divorced from syntax. In fact, the overall similarity structure across verbs in a DSM is independent from the similarity structure across verbs as determined by their syntax, and both kinds of similarity are needed for explaining human judgments. Together, these two studies attempt to push against the upper bound on the potential complexity of distributional word meanings.
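
One plausible way (an assumption on my part, not necessarily the method used in the work described above) to recover context-specific similarity from distributional vectors is to project words onto an axis defined by anchor words; the vectors below are random placeholders.

```python
import numpy as np

# Project word vectors onto a "size" axis defined by the difference between
# anchor-word vectors; different anchors (e.g., "safe"-"dangerous") would give
# different, context-specific orderings of the same words.
rng = np.random.default_rng(1)
vocab = {w: rng.normal(size=100) for w in ["dolphin", "alligator", "large", "small"]}

def project_on_axis(word, pos_anchor, neg_anchor, vectors):
    axis = vectors[pos_anchor] - vectors[neg_anchor]
    axis = axis / np.linalg.norm(axis)
    return float(vectors[word] @ axis)

for w in ["dolphin", "alligator"]:
    print(w, "size score:", round(project_on_axis(w, "large", "small", vocab), 3))
```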

4/23 Hendrik Strobelt (IBM/Harvard)

Visualization for Sequence Models for Debugging and Fun
Visual analysis is a great tool for exploring deep learning models when there is not yet a strong mathematical hypothesis available. I will present two visual tools where we used design study methodology to allow exploration of patterns in hidden state changes in RNNs/LSTMs (LSTMVis) and exploration of Sequence2Sequence models (Seq2Seq-Vis). Both model types have shown superior performance on NLP tasks such as language modeling and translation. Examples of both tasks will be shown across a variety of models.
As a beautiful distraction, we also utilize data science methods to investigate large data in a more artistic way. Formafluens is such a data experiment, in which we analyze a large collection of doodles made by humans in the Google Quickdraw tool.

 

Fall 2017

 

9/28 - Ray Jackendoff (Tufts)

Morphology and Memory

We take Chomsky’s term “knowledge of language” very literally.  “Knowledge” implies “stored in memory,” so the basic question of linguistics is reframed as

What do you store in memory such that you can use language, and in what form do you store it?

Traditionally – and in standard generative linguistics – what you store is divided into grammar and lexicon, where grammar contains all the rules, and the lexicon is an unstructured list of exceptions.  We develop an alternative view in which rules of grammar are lexical items that contain variables, and in which rules have two functions.  In their generative function, they are used to build novel structures, just as in traditional generative linguistics.  In their relational function, they capture generalizations over stored items in the lexicon, a role not seriously explored in traditional linguistic theory.  The result is a lexicon that is highly structured, with rich patterns among stored items.

We further explore the possibility that this sort of structuring is not peculiar to language, but appears in other cognitive domains as well.  The differences among cognitive domains are not in this overall texture, but in the materials over which stored relations are defined – patterns of phonology and syntax in language, of pitches and rhythms in music, of geographical knowledge in navigation, and so on.  The challenge is to develop theories of representation in these other domains comparable to that for language. 

 

10/5 - Kasia Hitczenko (UMD/MIT)

Exploring the efficacy of normalization in the acquisition and processing of Japanese vowels

Infants must learn the sound categories of their language and adults need to map particular acoustic productions they hear to one of those learned categories. These tasks can be difficult because there is often a lot of overlap between the acoustic realizations of different categories that can mask which sounds should be grouped together. Previous work has proposed that this overlap is caused, at least in part, by systematic and predictable sources of variability, and that listeners could learn about the structure of this variability and normalize it out to help learn from and process the incoming sounds. In this work, we further explore this idea of normalization, by applying it to the problem of Japanese vowel length contrast – a contrast that current computational models fail to learn due to high overlap between short and long vowels. We find that, at least in the way it is implemented here, normalizing out systematic variability does not substantially improve categorization performance over leaving acoustics unnormalized. We then present an alternative path forward by showing that a strategy that uses both acoustic cues and non-acoustic top-down information in categorization is better able to separate the short and long vowels.
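
A minimal sketch of what "normalizing out systematic variability" can look like, assuming talker identity is the relevant predictable factor and using invented durations; the point of the talk is precisely to test whether such a step actually helps categorization.

```python
import numpy as np

# Z-score vowel durations within each talker before categorizing short vs. long
# vowels, so that fast and slow talkers fall on a comparable scale.
durations = {                     # talker -> vowel durations (ms), made up
    "talker_fast": [60, 65, 70, 120, 130],
    "talker_slow": [90, 95, 100, 180, 190],
}

def normalize_within_talker(data):
    out = {}
    for talker, durs in data.items():
        durs = np.asarray(durs, dtype=float)
        out[talker] = (durs - durs.mean()) / durs.std()
    return out

print(normalize_within_talker(durations))
```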

10/26 - Jon Gauthier (BCS MIT)

What does NLP tell us about language?

Many state-of-the-art models in natural language processing achieve top performance in challenging shared tasks while doing little to explicitly model the syntax or semantics of their input. In light of these results, I will attempt to pitch two of my own past projects in natural language processing with a philosophical spin. I first present a model which reduces syntactic understanding to a by-product of more general pressures of semantic accuracy. I next present evolutionary simulations in which word meanings spontaneously arise due to nonlinguistic cooperative pressures. We close with potentially heretical and hopefully interesting discussion on the ideal relationship between AI engineering and cognitive science.

 

11/3* - Ryan Cotterell (Johns Hopkins) *Joint with CBMM, Friday 4pm, special location 46-3189

Probabilistic Typology: Deep Generative Models of Vowel Inventories

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have an [u] sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.

 

11/16 - David Alvarez Melis (CSAIL MIT)

Interpretability for black-box sequence-to-sequence models

Most current state-of-the-art models for sequence-to-sequence NLP tasks have complex architectures and millions -- if not billions -- of parameters, making them practically black-box systems. Such lack of transparency can limit their applicability to certain domains and can hamper our ability to diagnose and correct their flaws. Popular black-box interpretability approaches are inapplicable in this context since they assume scalar (or categorical) outputs. In this work, we propose a model to interpret the predictions of any black-box structured input-structured output model around a specific input-output pair. Our method returns an "explanation" consisting of groups of input-output tokens that are causally related. These dependencies are inferred by querying the black-box model with perturbed inputs, generating a graph over tokens from the responses, and solving a partitioning problem to select the most relevant components. We focus the general approach on sequence-to-sequence problems, adopting a variational autoencoder to yield meaningful input perturbations. We test our method across several NLP sequence generation tasks.

11/30 - Takashi Morita (Linguistics MIT) & Timothy O'Donnell

Bayesian learning of Japanese sublexica

Languages borrow words from each other, and a borrower language often has a different inventory of possible sound sequences (phonotactics) from the lender's. While loanwords may be reshaped so that they fit the borrower's phonotactics, they can also introduce new sound patterns into the language. Accordingly, native words and loanwords can exhibit different phonotactics in a single language, and linguists have proposed that such a language's lexicon is better explained by a mixture of multiple phonotactic grammars: words are classified into sublexica (e.g. native vs. loan), and words belonging to different sublexica are subject to different phonotactic constraints. This approach, however, raises a non-trivial learnability question: can learners classify words into the correct sublexica? Words are not labeled with their sublexicon, so learners need to infer the classification. In this study, we investigate Bayesian unsupervised learning of sublexica. We focus on Japanese data (coded in the International Phonetic Alphabet), whose sublexical phonotactics have been described in the linguistic literature. It turns out that even a simple Dirichlet process mixture of n-gram models leads to remarkably successful classification.
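
As a drastically simplified sketch of the classification problem (two fixed components and invented training forms, rather than the unsupervised Dirichlet-process mixture described above), one can classify a form by its score under two smoothed segment-bigram models.

```python
import math
from collections import Counter

# Two fixed, add-one-smoothed segment-bigram models ("native-like" vs.
# "loan-like"), with invented training forms; classify a new form by which
# model assigns it higher probability.
native = ["kokoro", "sakana", "yama"]
loan   = ["sutoraiki", "konpyuuta", "batto"]

def bigram_counts(words):
    c = Counter()
    for w in words:
        for a, b in zip("#" + w, w + "#"):   # '#' marks word edges
            c[(a, b)] += 1
    return c

def logprob(word, counts, alphabet):
    totals = Counter()
    for (a, _b), n in counts.items():
        totals[a] += n
    V = len(alphabet) + 1                    # +1 for the edge symbol
    return sum(math.log((counts[(a, b)] + 1) / (totals[a] + V))
               for a, b in zip("#" + word, word + "#"))

alphabet = set("".join(native + loan))
models = {"native": bigram_counts(native), "loan": bigram_counts(loan)}
word = "arigatou"
scores = {k: logprob(word, c, alphabet) for k, c in models.items()}
print(word, "->", max(scores, key=scores.get), scores)
```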

 

Spring 2017

 

2 February - Richard Futrell (MIT BCS)

Memory and Locality in Natural Language

I explore the hypothesis that the universal properties of languages can be explained in terms of efficient communication given fixed human information processing constraints. First, I show corpus evidence from 37 languages that word order in grammar and usage is shaped by working memory constraints in the form of dependency locality: a pressure for syntactically linked words to be close to one another in linear order. Next, I develop a new theory of language processing cost, based on rational inference in a noisy channel, that unifies surprisal and memory effects and goes beyond dependency locality to a principle of information locality: that words that predict each other should be close. I show corpus evidence for information locality. Finally, I show that the new processing model resolves a long-standing paradox in the psycholinguistic literature, structural forgetting, where the effects of memory appear to be language-dependent.

 

16 February - Kevin Ellis (MIT BCS)

Inducing Phonological Rules: Insights from Bayesian Program Learning

How do linguists come up with phonological rules? How do kids learn Pig Latin?  We develop computational models of these and other phenomena in language.  The unifying element of the models is to treat grammars as programs, which lets us apply ideas from the field of program synthesis to learn grammars. This lets the models capture phonological phenomena like vowel harmony or stress patterns and learn synthetic grammars used in prior studies of rule learning.  Going beyond individual grammar learning problems, we consider the problem of jointly inferring many related rule systems. By solving many textbook phonology problems, we can ask the model what kind of inductive bias best explains the attested phenomena.

 

2 March - Uriel Cohen Priva (Brown)

Segment informativity, cost, and universality

How costly are individual segments? Can this be tied to the actuation of lenition processes? I propose that segment informativity (its expected predictability) provides us with a window into these two questions. Using supporting evidence from 7 languages and 3 unrelated lenition processes, I propose that when the informativity of a segment is too low to be maintained faithfully, it is more likely to lenite, predicting the actuation of lenition processes. Extending this work gives insight into how this may be related to cost: Using over 40 languages, I show that segment informativity is universal and is more consistent cross-linguistically than other baselines such as typological frequency and within-language frequency. Such measurements correlate with phonetic and phonological cost, and seem to be sensitive to the morphological makeup of the languages that are investigated.
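
Segment informativity in this sense can be illustrated with a toy corpus: the average of -log P(segment | context) over the segment's occurrences, here with the context reduced to the preceding segment (a simplification of the actual measure).

```python
import math
from collections import Counter

corpus = ["tata", "tika", "kata", "kiki", "taki"]   # invented word forms

bigrams, contexts = Counter(), Counter()
for w in corpus:
    for prev, seg in zip("#" + w, w):               # '#' = word-initial context
        bigrams[(prev, seg)] += 1
        contexts[prev] += 1

def informativity(segment):
    """Occurrence-weighted average of -log P(segment | preceding segment)."""
    num = sum(n * -math.log(bigrams[(ctx, s)] / contexts[ctx])
              for (ctx, s), n in bigrams.items() if s == segment)
    tot = sum(n for (ctx, s), n in bigrams.items() if s == segment)
    return num / tot

for seg in sorted(set("".join(corpus))):
    print(seg, round(informativity(seg), 3))
```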

 

16 March - Yevgeni Berzak (MIT CSAIL)

From ESL to the Typology of the World's Languages (and Back)

Linguists and psychologists have long been studying cross-linguistic transfer, the influence of a native language on linguistic performance in a foreign language. In this work we provide empirical evidence for this process and demonstrate its use for typology learning, as well as for prediction of native language specific grammatical error distributions in English as a Second Language (ESL). First, we show a strong correlation between language similarities derived from structural features in ESL texts and equivalent similarities obtained from the typological features of the native languages. We leverage this finding to recover an approximation of the native language typological similarity structure directly from ESL text, and perform prediction of typological features in an unsupervised fashion with respect to the target languages. Second, we present a formalization of the Contrastive Analysis framework that uses typological information to predict native language specific distributions of grammatical errors in ESL. Finally, we demonstrate that these two tasks can be combined in a bootstrapping strategy. Time permitting, we will also discuss related ongoing work and resources for learner language.

 

6 April - Evelina Fedorenko (MIT BCS)

Human language as a code for thought

Although many animal species are endowed with the ability for complex thought, only humans can share these thoughts with one another, via language. My research aims to understand i) the system that supports our linguistic abilities, including its neural implementation, and ii) the relationship between the language system and the rest of the human cognitive arsenal.

I will begin by briefly introducing the “language network”, a set of interconnected brain regions that support language comprehension and production. Based on data from fMRI studies and investigations of patients with global aphasia, I will argue that the language network is functionally selective for language processing over a wide range of non-linguistic processes that have been previously argued to share computational demands with language, including arithmetic, executive functions, music, action observation and execution, and the processing of non-linguistic social signals like gestures. I will then talk about the internal structure of the language network. Although the higher-level language regions are robustly separable from lower-level perceptual and motor articulation regions, dividing up the high-level language network into component parts has proven difficult. In particular, the traditional "cuts" that have been proposed in the literature (including the most common one, based on the distinction between lexico-semantic and syntactic processing) do not seem to be supported by the available evidence.

Given the combination of i) functional selectivity of the language network for linguistic processing, and ii) the apparent lack of clear functional divisions within the network, I will tentatively propose that the language network stores domain-specific knowledge representations in a highly distributed fashion, perhaps with the meaning similarity reflected in the neural pattern similarity. To that end, I will present some recent evidence of robust ability to decode meanings of single words and sentences from the neural activity in the language network using distributional semantic models. I will thus suggest that our language system can be thought of as a robust and generic encoder (in production) and decoder (in comprehension) of non-linguistic conceptual representations.

 

20 April - Andrei Barbu (MIT CSAIL)

Understanding vision through language and language through vision

We will discuss a research program to ground language in perception and solve a diverse set of vision-language tasks. This allows language understanding to help vision by providing priors over potential interpretations. In addition, it allows us to understand language by means of perception and imagination. We will examine a range of vision and language tasks (description, QA, search, disambiguation, etc.) and show how a single generative model can address seemingly unrelated problems that span different research communities. This single model performs several tasks without being retrained, capturing the crucial human ability to use knowledge learned in one context and generalize it to another. Finally, I will present our ongoing work on developing a novel cognitively-plausible approach to grounded language acquisition, language translation, planning, and physics-based event understanding.

 

4 May - Oren Tsur (Harvard CS)

What do People Say When They Say Fake News?

It seems that everyone is talking about fake news lately. Academic conferences and symposiums dedicated to the issue pop up like mushrooms after the rain (I'm aware of a couple in Cambridge in recent weeks). Fake news, however, is hard to define and track, let alone fight. In this meeting I'll discuss some characteristics of fake news and show some (very) initial results related to tracking and identifying fake news and the social dynamics that surround this phenomenon. Finally, I will muse (and you will share your take) on whether these dynamics are unique to fake news or are inherent to the way language is used and evolves over time.

 

11 May - Yonatan Belinkov (CSAIL)

On Learning Form and Meaning in Neural Machine Translation Models

Neural machine translation (MT) models obtain state-of-the-art performance while maintaining a simple, end-to-end architecture. However, interpreting such models remains challenging and not much is known about what they learn about source and target languages during the training process. In this talk, I will present our recent work on investigating what neural MT models learn about morphology. We analyze the representations learned by neural MT models at various levels of granularity and empirically evaluate the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks. We conduct a thorough investigation along several parameters: word-based vs. character-based representations, depth of the encoding layer, the identity of the target language, and encoder vs. decoder representations. Our data-driven, quantitative evaluation sheds light on important aspects of the neural MT system and its ability to capture word structure. As time permits, I will also discuss ongoing work on semantics in neural MT models and how distinctions between form and meaning are reflected in the neural MT representations.

 

Fall 2016

 

22 September - Timothy J. O'Donnell (BCS)

Computation and Storage in Language

A much-celebrated aspect of language is the way in which it allows us to express and comprehend an unbounded number of thoughts. This property is made possible because language consists of several combinatorial systems which can be used to productively build novel forms using a large inventory of stored, reusable parts: the lexicon. For any given language, however, there are many more potentially storable units of structure than are actually used in practice --- each giving rise to many ways of forming novel expressions. For example, English contains suffixes which are highly productive and generalizable (e.g., -ness; Lady Gagaesqueness, pine-scentedness) and suffixes which can only be reused in specific words, and cannot be generalized (e.g., -th; truth, width, warmth). How are such differences in generalizability and reusability represented? What are the basic, stored building blocks at each level of linguistic structure? When is productive computation licensed and when is it not? How can the child acquire these systems of knowledge? I will discuss a theoretical framework designed to address these questions. The approach is based on the idea that the problem of productivity and reuse can be solved by optimizing a tradeoff between a pressure to store fewer, more reusable lexical items and a pressure to account for each linguistic expression with as little computation as possible. I will show how this approach addresses a number of problems in English morphology, phonology, and syntax.

 

6 October - Karthik Narasimhan (CSAIL)

Language understanding using reinforcement learning

In this talk, I will describe an approach to learning natural language semantics using reward-based feedback. This is in contrast to many NLP approaches that rely on non-trivial amounts of quality supervision, which is often expensive and difficult to obtain. We consider the task of learning control policies for text-based games where an agent needs to understand natural language to operate effectively in a virtual environment. In these games, all interactions in the virtual world are through text and the underlying state is not observed. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback, capturing semantics of the game states in the process. Experiments on two game worlds show that reinforcement learning can be used to learn expressive representations.

 

20 October - Emily Morgan (Tufts Psychology)

Generative and item-specific knowledge contribute to language processing and evolution

The ability to generate novel utterances compositionally using generative knowledge is a hallmark property of human language. At the same time, languages contain non-compositional or idiosyncratic items, such as irregular verbs, idioms, etc. In this talk I ask how and why language achieves a balance between these two systems--generative and item-specific--from both the synchronic and diachronic perspectives. Specifically, I focus on the case of word order preferences for binomial expressions of the form “X and Y”, e.g. "bread and butter" versus "butter and bread". I show that ordering preferences for these expressions arise in part from violable generative constraints on the phonological, semantic, and lexical properties of the constituent words--e.g. short-before-long, men-before-women, etc.--but that expressions also have their own idiosyncratic preferences. Using behavioral experiments, corpus data, and evolutionary modeling, I will argue that both the way these preferences manifest diachronically and the way they are processed synchronically are constrained by expression frequency: in other words, the ability to learn and transmit idiosyncratic preferences for an expression is constrained by how frequently it is used. Moreover, I argue for a regularization bias in language learning and production, which, together with the process of cultural transmission, shapes the language-wide distribution of binomial expression preferences.

 

10 November - Roger Levy (BCS)

Broad-coverage data and computational models for understanding human language comprehension

Understanding the nature of the knowledge deployed for prediction and interpretation in real time language comprehension is of both fundamental scientific interest and practical significance.  There is a several-decades-long tradition of addressing this question with controlled, small-scale studies using artificially constructed sentences.  However, these studies yield only small data sets and are of limited ecological validity.  An increasingly popular alternative is using broad-coverage models from computational linguistics to analyze human data from comprehension of more naturalistic materials, such as reading of newspaper text.  While this approach yields larger datasets with greater ecological validity,  the higher dimensionality and correlational structure of the resulting data poses substantial analytic challenges.  Here I survey some of the recent work on broad-coverage models and data for investigating natural language understanding.  I cover two studies from my own lab.  The first sought to characterize the precise functional form of the relationship between prediction and word-by-word reading times.  The second followed up on a striking claim by Frank and Bod (2011) that reading times were better predicted by sequential (simple recurrent network) models than by hierarchical (probabilistic context-free grammar) models.  More broadly, this talk seeks to foster discussion about what new questions we should be asking with broad-coverage language comprehension models and datasets, and how best to get the answers.

 

17 November - Ezer Rasin and Roni Katzir (Linguistics)

Induction of phonological grammars using Minimum Description Length

Speakers' knowledge of the sound pattern of their language -- their knowledge of morpho-phonology -- goes well beyond the plain phonetic forms of words. The English-speaking child knows, for example, that the aspiration of the first segment of kʰæt ‘cat’ is predictable, and the French-speaking child knows that the final l of table ‘table’ is optional and can be deleted while that of parle ‘speak’ cannot. According to a long-standing model in linguistics, morpho-phonological knowledge is distributed between a lexicon of morphemes, usually referred to as Underlying Representations (URs), and a mapping that transforms URs into surface forms, implemented using context-sensitive rewrite rules or interacting constraints. We will present two unsupervised learners that acquire both URs and the phonological mapping, one which induces rule-based grammars and another which induces constraint-based grammars. Our learners are based on the principle of Minimum Description Length (MDL), which -- like the closely related Bayesian approach -- aims at balancing the complexity of the grammar against its fit to the data. We will discuss ways in which the framework of Minimum Description Length may allow us to compare competing phonological models in terms of their predictions regarding learning.
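
The MDL idea can be illustrated with deliberately crude numbers: an analysis is preferred when the grammar length plus the length of the data encoded given the grammar is smaller. The encodings below are invented for illustration only and are not the learners' actual encoding schemes.

```python
# Toy MDL comparison: description length = |grammar| + |data given grammar|, in bits.
data = ["kat", "kip", "kot"] * 20            # invented surface forms

def bits_for_grammar(n_symbols, bits_per_symbol=5):
    return n_symbols * bits_per_symbol

def bits_for_data(forms, rule_predicts_aspiration=False):
    # If the grammar states that /k/ is aspirated word-initially, that fact need
    # not be re-encoded per form; otherwise each form pays one extra bit.
    per_form = 3 * 5 + (0 if rule_predicts_aspiration else 1)
    return len(forms) * per_form

# Analysis A: no rule; aspiration listed form by form.
dl_A = bits_for_grammar(n_symbols=0) + bits_for_data(data, rule_predicts_aspiration=False)
# Analysis B: a small rule (say, 6 grammar symbols); aspiration predicted.
dl_B = bits_for_grammar(n_symbols=6) + bits_for_data(data, rule_predicts_aspiration=True)

print("DL(no rule)  :", dl_A)
print("DL(with rule):", dl_B)   # smaller total => preferred analysis
```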