D. Sculley, Ph.D.

Alum of Tufts University
Department of Computer Science
email: dsculley (at) cs.tufts.edu
Resume (.pdf)
Publications

My work lies at the intersection of data mining, machine learning, and information retrieval, including relevance and CTR prediction for online advertising, information filtering (such as email spam and blog comment spam filtering), text analysis, and digital libraries. I have also done work in computational biology and the digital humanities, and have a particular interest in scalable methods.

I have been employed in Google's Pittsburgh-based office since 2008, working on problems in online advertising and user experience. My daily work involves a mix of research, engineering, technical team leadership, and management.

The following papers on ad click-through prediction and large-scale learning with much smaller RAM footprints give a good picture of the kinds of things I work on these days.

Open Source Code

sofia-ml
The sofia-ml package contains a selection of fast online learners for classification, ranking, and ROC area optimization problems, including fast linear SVM variants such as Pegasos SVM and SGD-SVM. This code is freely available at the sofia-ml project homepage, and accompanies a recent workshop paper on large-scale learning to rank. This implementation is especially well suited to large, sparse learning problems, and can be used to learn linear SVM models on hundreds of thousands or millions of data points using only a fraction of a second of CPU time for training on a normal laptop.
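
As a sketch of the core idea, here is a Pegasos-style SGD update for a linear SVM in Python with NumPy. This is my illustration of the technique, not the sofia-ml implementation itself (which is C++ and handles sparse data):

    import numpy as np

    def pegasos_svm(X, y, lam=1e-4, iterations=100_000, seed=0):
        # Pegasos-style stochastic gradient training of a linear SVM.
        # X: (n, d) array of examples; y: (n,) array of +/-1 labels.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iterations + 1):
            i = rng.integers(n)              # sample a single example
            eta = 1.0 / (lam * t)            # Pegasos learning-rate schedule
            margin = y[i] * X[i].dot(w)
            w *= 1.0 - eta * lam             # shrink: gradient of the L2 term
            if margin < 1.0:                 # hinge loss is active
                w += eta * y[i] * X[i]
            radius = 1.0 / np.sqrt(lam)      # optional Pegasos projection step
            norm = np.linalg.norm(w)
            if norm > radius:
                w *= radius / norm
        return w

Each step touches only a single example, which is why a linear model can be trained on millions of points in well under a second of CPU time.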

This package also contains the combined regression and ranking method proposed at KDD 2010, and a fast k-means clustering implementation based on mini-batch stochastic gradient descent.
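
The mini-batch k-means idea is easy to sketch in Python; this version assumes dense NumPy data and random seeding, while the paper and the sofia-ml code handle sparse data and other refinements:

    import numpy as np

    def minibatch_kmeans(X, k, batch_size=100, iterations=1000, seed=0):
        # Mini-batch k-means: each iteration samples a small batch,
        # assigns points to their nearest centers, then moves each
        # center toward its points with a per-center rate 1/count.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        counts = np.zeros(k)
        for _ in range(iterations):
            batch = X[rng.integers(len(X), size=batch_size)]
            dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)    # cache assignments per batch
            for x, c in zip(batch, assign):
                counts[c] += 1
                eta = 1.0 / counts[c]        # decaying per-center rate
                centers[c] += eta * (x - centers[c])
        return centers

The decaying per-center learning rate is what gives convergence: early batches move a center quickly, later ones only refine it.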

Online Advertising

My current work at Google's Pittsburgh office, located near the campus of Carnegie Mellon University, centers on systems that ensure users have a positive experience when they click on advertisement links.

Our paper Detecting Adversarial Advertisements in the Wild reports a case study of the automated and semi-automated systems we built at Google to identify and block ads that users may find untrustworthy or harmful. This paper received the Best Paper Award in the industry track at KDD 2011.

Another example of research in this area is our KDD 2009 paper, Predicting Bounce Rates in Sponsored Search Advertisements. A further example is the paper Combined Regression and Ranking, which suggests that in situations where we care about both rank-based and regression-based performance metrics, a joint optimization method combining both criteria can improve overall performance.
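
A rough Python sketch of the combined objective: with probability alpha the learner takes an SGD step on a pairwise ranking loss, otherwise on a pointwise regression loss, both against a shared linear model. The specific losses below (logistic pairwise, squared error) are illustrative choices, not necessarily those of the paper:

    import numpy as np

    def crr_sgd(X, y, alpha=0.5, lam=1e-4, eta=0.01, iterations=100_000, seed=0):
        # Combined regression and ranking by SGD over one weight vector.
        rng = np.random.default_rng(seed)
        n = len(X)
        w = np.zeros(X.shape[1])
        for _ in range(iterations):
            w *= 1.0 - eta * lam                # L2 regularization
            if rng.random() < alpha:
                i, j = rng.integers(n, size=2)  # candidate pair for ranking
                if y[i] == y[j]:
                    continue                    # no preference to learn
                if y[i] < y[j]:
                    i, j = j, i                 # make i the preferred item
                diff = X[i] - X[j]
                p = 1.0 / (1.0 + np.exp(-w.dot(diff)))
                w += eta * (1.0 - p) * diff     # logistic pairwise gradient
            else:
                i = rng.integers(n)
                err = y[i] - X[i].dot(w)
                w += eta * err * X[i]           # squared-error gradient
        return w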

I spent the Spring of 2006 doing an internship at Yahoo!, on the Data Mining Research team under the direction of Pavel Berkhin, working with Scott Gaffney and Rajesh Parekh. My work there involved the problem of keyword expansion for sponsored search advertisers; these approaches were included in a patent filed by Yahoo!. As part of this project, I developed several methods of rank aggregation among similar items, which appeared as a poster paper at the 2007 SIAM data mining conference.
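
The poster's specific methods aside, the flavor of rank aggregation is easy to show with the classic Borda count, which lets each input ranking vote points by position. This baseline sketch is mine, not the paper's algorithm:

    from collections import defaultdict

    def borda_aggregate(rankings):
        # Borda count: an item at position i of a ranking of length n
        # earns n - i points; sum points across all input rankings.
        scores = defaultdict(float)
        for ranking in rankings:
            n = len(ranking)
            for i, item in enumerate(ranking):
                scores[item] += n - i
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. borda_aggregate([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
    # returns ["a", "b", "c"]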

Information Filtering

In August of 2008, I completed my Ph.D. in machine learning and data mining, under the direction of Carla Brodley. My dissertation work explored advances in online learning-based spam filtering.

The use of machine learning methods for filtering spam and abuse remains an ongoing interest of mine.

I've been interested in the problem of training effective filters in the presence of noisy label feedback, as in the case of email spam with noisy user feedback and a similar scenario in filtering blog comment abuse. Some of this work has been with Gordon Cormack of U. Waterloo, with whom I've also recently explored the problem of training mini-filters that rely on only dozens rather than millions of features. We've found that these mini-filters can give surprisingly effective lightweight personalization, while remaining small enough to be served in RAM even for very large user bases.
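
To make the size argument concrete, here is a hypothetical sketch of a mini-filter: hashing tokens into a few dozen weights (the hashing construction is my illustration, not necessarily the one in our papers) keeps each per-user model down to a few hundred bytes:

    import numpy as np

    N_FEATURES = 64  # dozens of weights per user, not millions

    def featurize(tokens):
        # Hash tokens into a tiny fixed-size vector. Python's hash()
        # is seeded per process; a real system needs a stable hash.
        x = np.zeros(N_FEATURES)
        for tok in tokens:
            x[hash(tok) % N_FEATURES] += 1.0
        return x

    def personalize(w, tokens, y, eta=0.1):
        # One perceptron-style update on a user's own feedback.
        # At 64 float32 weights a model is 256 bytes, so even
        # 100 million users fit in roughly 25 GB of RAM.
        x = featurize(tokens)
        if y * w.dot(x) <= 0:                # mistake: adjust the model
            w += eta * y * x
        return w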

At CEAS 2007, I presented work on online active learning for spam filtering, which I think is a more natural mode of active learning for spam filtering than pool-based active learning. This idea was adopted as a part of the TREC 2007 spam filtering track.
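
The flavor of online active learning is captured by a margin-based sampling rule: the filter classifies every message, but asks for a label only with probability b / (b + |margin|), so confident predictions are rarely queried. The sketch below uses the rule of Cesa-Bianchi et al. with a simple perceptron; the CEAS 2007 paper studies rules of this general kind, and the details here are illustrative:

    import numpy as np

    def online_active_learn(stream, d, b=1.0, eta=0.1, seed=0):
        # stream yields (x, y) pairs, x a length-d array, y in {-1, +1}.
        rng = np.random.default_rng(seed)
        w = np.zeros(d)
        labels_requested = 0
        for x, y in stream:
            margin = w.dot(x)                # prediction for this message
            if rng.random() < b / (b + abs(margin)):
                labels_requested += 1        # query the label (user feedback)
                if y * margin <= 0:          # mistake: perceptron update
                    w += eta * y * x
        return w, labels_requested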

With Gabriel M. Wachman, I developed a fast online SVM variant suitable for streaming data, called Relaxed Online SVMs. This work was named Best Student Paper at SIGIR 2007, and gave the best results on several tasks at the TREC 2007 spam filtering competition. Source code is available for download here.

I've also been interested in the idea of one-sided feedback for online learning, typified by the case of the lazy user who never checks the 'spam box' and only looks at messages in the 'inbox'. In this case, an online learner never receives feedback on the messages it files in the spam box. We examined this problem, prior theoretical solutions, and new practical methods in our 2007 paper at KDD.
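
A minimal simulation of the one-sided setting, assuming a simple perceptron for illustration: messages the filter routes to the spam box produce no label, so only mistakes visible in the inbox can drive updates:

    import numpy as np

    def one_sided_perceptron(stream, d, eta=0.1):
        # stream yields (x, y) with y = +1 for spam, -1 for ham.
        w = np.zeros(d)
        for x, y in stream:
            score = w.dot(x)
            if score > 0:
                continue                     # filed as spam: no feedback
            # Delivered to the inbox, where the user reveals the label,
            # so we can correct spam that slipped through.
            if y * score <= 0:
                w += eta * y * x
        return w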

In 2006, I worked on spam classification with inexact string matching methods and linear classifiers. I have filed a patent on this work on behalf of Tufts University, with Gabriel M. Wachman and Carla E. Brodley. Our team competed in the TREC 2006 Spam Filtering competition, with good results, and we presented our method at the TREC 2006 Conference in November 2006. Text of our competition report is available here.

Text Analysis and Digital Libraries

I am interested in digital libraries, and the power of automated text analysis to make new scholarly discoveries possible. I've collaborated with Brad Pasanek, a literary scholar at the University of Virginia, on two journal articles in Literary and Linguistic Computing and several conference talks. The paper Mining Millions of Metaphors explores the use of automated text analysis on a hand-curated digital collection of metaphors of the mind drawn from eighteenth-century English literature. Our second article, Meaning and Mining, explores the impact of implicit assumptions when applying data mining techniques to literary analysis. We have also given joint talks at the Chicago Colloquium on Digital Humanities (2006, 2007, 2008), the STOA consortium (2007), and the Digital Humanities conference (2007). This work has also received some attention in The Chronicle of Higher Education, The San Jose Mercury News, and The New York Times.

Other work in this area includes automatic word sense disambiguation using coarsely aligned parallel texts, with Gregory Crane of the Perseus digital library, which appeared as a contribution to a paper on the evolution of digital libraries.

Other Work

In 2010, I served as the general conference chair of CEAS.

In 2009, I served as a program committee co-chair of CEAS.

In 2008, I served as a program committee member of SIGIR.

In March, 2006, I presented the paper "Compression and Machine Learning: a New Perspective on Feature Space Vectors", co-authored with Carla E. Brodley. (Test and training splits of the Unix user data sets used in this paper are available here.)
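
For context, the compression-based similarity this line of work analyzes can be computed directly. Here is the classic normalized compression distance, computed with zlib; a sketch of the family of measures the paper maps onto feature-space vectors, not the paper's own code:

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance:
        #   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        # where C(s) is the compressed length of s. Values near 0
        # mean the strings share most of their structure.
        cx = len(zlib.compress(x))
        cy = len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # e.g. ncd(b"the cat sat on the mat", b"the cat sat on the hat")
    # is much smaller than ncd against an unrelated string.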

In May, 2005, I gave a talk on "Compression, Learning, and Metaphor" at Harvard's Graduate Student conference on Mind Brain and Behavior.

I received my M.S. in computer science in May, 2005 from Tufts University.

In the summer of 2004, I wrote a tutorial on basic electronics for Tufts Robotics.

I received my B.A. in Visual and Environmental Studies from Harvard University in 1997, and an M.Ed. in School Leadership and Development from the Harvard Graduate School of Education in 2002.

Before graduate school, I taught in high schools in Abu Dhabi, U.A.E.; Caracas, Venezuela; Lugano, Switzerland; and Claremont, California. I am a former member of the U.S. men's national team in indoor field hockey, winning a silver medal at the 2004 Pan-American competition, and spent seven seasons as an assistant coach of the Boston College field hockey team.

My other interests include chess, music, writing, the visual arts, hockey, and the Pittsburgh Steelers.

F.A.Q.
Q. Do you really go by D.?
A. Yes, thanks, I much prefer it.