DH Seminar Lecture - Network Inference from Textual Evidence

© 2019 Prof. David Smith

Tuesday 19 February 2019 - 16:00 to 17:00 - Room BC 420.
DH Seminar Lecture - Network Inference from Textual Evidence: Information Propagation, Translation, and Multi-Input Attention
By Prof. David Smith, Northeastern University

Abstract
Mass digitization has provided a mountain of source material for the humanities and social sciences, but its structure is unevenly mapped. Dependencies among documents arise when copying manuscripts, citing scholarly literature, speaking from talking points, reposting social networking content, popularizing scientific papers, or otherwise transforming earlier sources. While some dependencies are observable—e.g., by citations or links—we often need to infer them from textual evidence. In our Viral Texts and Oceanic Exchanges projects, we have built models to trace information flow within and across languages in poorly OCR'd newspapers. Other projects in our group infer and exploit such dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.

I discuss methods for inferring these dependency structures and exploiting them to improve other tasks. First, I describe a directed spanning tree model of information cascades and a new unsupervised contrastive training procedure that outperforms previous approaches to network inference. I then describe extracting parallel passages from non-parallel multilingual corpora by performing efficient search in the continuous document-topic simplex of a polylingual topic model; translation systems trained on these mined passages achieve greater accuracy than systems trained on smaller clean datasets. Finally, I describe methods for detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct the noisy text. These multi-input attention models provide efficient approximations to intractable multi-sequence alignment for collation and enable 75% reductions in error with unsupervised models.
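
The abstract's first method casts information cascades as directed spanning trees over documents. As a rough illustration only (not the model presented in the talk, and with entirely hypothetical pairwise scores), the sketch below recovers a cascade as a maximum spanning arborescence using Edmonds' algorithm as implemented in networkx; the artificial ROOT node and the document names are assumptions for the example.

```python
# Minimal sketch (illustrative, not the speaker's actual model): given
# hypothetical pairwise scores for "document j was reprinted from document i",
# recover one cascade as a maximum-weight directed spanning tree (arborescence).
import networkx as nx

# Hypothetical edge scores: (source, target, score). The dummy ROOT node lets a
# document originate a cascade instead of being copied from another document.
scores = [
    ("ROOT", "doc_a", 0.10), ("ROOT", "doc_b", 0.05), ("ROOT", "doc_c", 0.02),
    ("doc_a", "doc_b", 0.80), ("doc_a", "doc_c", 0.30), ("doc_b", "doc_c", 0.90),
]

G = nx.DiGraph()
G.add_weighted_edges_from(scores)

# Edmonds' algorithm finds the maximum-weight arborescence spanning all nodes.
tree = nx.maximum_spanning_arborescence(G, attr="weight")

for source, target in sorted(tree.edges()):
    print(f"{target} was (putatively) reprinted from {source}")
```

In this toy setting the recovered tree attaches each document to its highest-scoring plausible source, with ROOT marking the cascade's origin.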
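
The final method combines several noisy transcriptions ("witnesses") of the same passage with multi-input attention rather than full multi-sequence alignment. The following sketch, under stated assumptions (dot-product attention, simple averaging across witnesses, random stand-in encoder states), only illustrates the general idea of attending over multiple inputs; it is not the architecture described in the talk.

```python
# Minimal sketch of attending over multiple noisy witnesses of the same passage.
# Each witness is a hypothetical matrix of encoder states; one decoder state
# attends over every witness, and the per-witness context vectors are averaged,
# a cheap stand-in for jointly aligning all witnesses.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def multi_input_attention(decoder_state, witnesses):
    """decoder_state: (d,); witnesses: list of (length_i, d) encoder-state matrices."""
    contexts = []
    for enc in witnesses:
        scores = enc @ decoder_state      # dot-product attention scores, shape (length_i,)
        weights = softmax(scores)         # attention distribution over this witness
        contexts.append(weights @ enc)    # context vector for this witness, shape (d,)
    return np.mean(contexts, axis=0)      # pool evidence across all witnesses

rng = np.random.default_rng(0)
d = 8
decoder_state = rng.normal(size=d)
witnesses = [rng.normal(size=(n, d)) for n in (5, 7, 6)]  # three noisy transcriptions
print(multi_input_attention(decoder_state, witnesses).shape)  # -> (8,)
```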