CIS Colloquium, Nov 22, 2013, 02:00PM – 03:00PM, Wachman 1015D
Probabilistic Topic Models of Text and Users
David Blei, Princeton University
Abstract:
Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents. Traditional topic modeling algorithms take a document collection as input and analyze the texts to estimate its latent thematic structure. However, for many collections, there is an additional type of data: how people use the documents. For example, consider readers clicking on articles in a newspaper website or scientists placing articles in their personal libraries. User behavior data about documents is critical to building recommendation systems and gives new ways of understanding how a collection is implicitly organized. In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, which simultaneously analyze texts and corresponding user behavior data. We studied collaborative topic models on a large collection of 80,000 scientists’ libraries and the 250,000 abstracts of the corresponding articles. With this analysis, we can build recommendation systems that point scientists to articles they will like and, further, organize the scientific literature according to the discovered patterns of readership. As examples, we can identify articles that are important within a field and articles that transcend disciplinary boundaries. More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. Finally, I will survey some recent advances in this field. I will show how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.
Bio:
David Blei is an associate professor of Computer Science at Princeton University. He earned his Bachelor’s degree in Computer Science and Mathematics from Brown University and his PhD in Computer Science from the University of California, Berkeley. He has received several awards, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), and Blavatnik Faculty Award (2013). His research focuses on probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data.