D. Sculley, Ph.D.

Alum of Tufts University
Department of Computer Science
email: dsculley (at) cs.tufts.edu
Resume (.pdf)
Publications

My work lies at the intersection of data mining, machine learning, and information retrieval, and includes information filtering (such as email spam and blog comment spam filtering), text analysis, digital libraries, and online advertising systems. I have also worked in computational biology and the digital humanities.

In August of 2008, I completed my Ph.D. in machine learning and data mining, under the direction of Carla Brodley. My dissertation work explored advances in online learning-based spam filtering. I am currently employed in Google's Pittsburgh-based office, working on problems in online advertising and user experience.

Online Advertising

My current work at Google's Pittsburgh office, located on the campus of Carnegie Mellon University, is centered around systems to ensure that users have a positive experience when they click on advertisement links. An example of research in this area is our paper at KDD 2009, Predicting Bounce Rates in Sponsored Search Advertisements. This work is a continuation of my internship at the same office in the fall of 2007.

I spent the spring of 2006 doing an internship at Yahoo!, in the Data Mining Research team under the direction of Pavel Berkhin, working with Scott Gaffney and Rajesh Parekh. My work there involved the problem of keyword expansion for sponsored search advertisers; these approaches were included in a patent filed by Yahoo!. As part of this project, I developed several methods of rank aggregation among similar items, which appeared as a poster paper at the 2007 SIAM data mining conference.

Information Filtering

The use of machine learning methods for filtering spam and abuse remains a continuing interest for me.

I've recently been interested in the problem of training effective filters in the presence of noisy label feedback, as in the case of email spam with noisy user feedback, and in a similar filtering scenario. I have worked on this with Gordon Cormack, with whom I've also recently explored the problem of training mini-filters that rely on only dozens rather than millions of features. We've found that these mini-filters can give surprisingly effective lightweight personalization, while remaining small enough to be served in RAM even for very large user bases.

At CEAS 2007, I presented work on online active learning for spam filtering, which I think is a more natural mode of active learning for spam filtering than pool-based active learning. This idea was adopted as a part of the TREC 2007 spam filtering track.

I have developed a fast online SVM variant suitable for streaming data called Relaxed Online SVMs with Gabriel M. Wachman. This work was named Best Student Paper at SIGIR 2007, and gave best results on several tasks at the TREC 2007 spam filtering competition. Source code is available for download here.

I've also been interested in the idea of one-sided feedback for online learning, typified by the case of the lazy user who never checks the 'spam box' and only looks at messages in the 'inbox'. In this case, an online learner will never get feedback on predicted negatives. We examined this problem and some prior theoretical solutions, along with new practical methods, in our 2007 paper at KDD.
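To make the one-sided feedback setting concrete, here is a minimal sketch in Python. It is only an illustration of the problem, not any of the methods from the KDD paper: it assumes a plain perceptron over sparse feature dictionaries, with label +1 meaning "delivered to the inbox" and -1 meaning "sent to the spam box". The function names and the toy feature names are hypothetical.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def run_one_sided(stream, lr=1.0):
    """Online perceptron that only receives labels for inbox predictions.

    Assumption for this sketch: y = +1 is a message that belongs in the
    inbox, y = -1 is spam. A message predicted -1 goes to the spam box,
    which the lazy user never opens, so its true label is never revealed.
    """
    w = {}
    labels_seen = 0
    for x, y in stream:
        pred = 1 if dot(w, x) >= 0 else -1
        if pred == -1:
            # Predicted negative: no feedback arrives, so no update is possible.
            continue
        labels_seen += 1
        if y != pred:
            # Standard perceptron update on a revealed mistake.
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + lr * y * v
    return w, labels_seen


stream = [
    ({"offer": 1.0}, -1),  # spam shown in the inbox: label revealed, weight drops
    ({"hello": 1.0}, +1),  # good mail shown in the inbox: label revealed, no mistake
    ({"offer": 1.0}, -1),  # now predicted spam: no feedback
    ({"offer": 1.0}, +1),  # good mail wrongly sent to the spam box: never corrected
]
weights, labels_seen = run_one_sided(stream)
```

In this toy stream the last message is misclassified, but because it lands in the spam box the learner never finds out, which is the difficulty that one-sided feedback methods have to address.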

I've worked with Arthur Brady, Lenore Cowen, Donna Slonim, Xintao Wei and Carla Brodley to develop a method of detecting and defeating "genomic spam" -- that is, pathogenic sequences appearing in otherwise benign genomes. This work is currently under review at a leading computational biology journal.

In 2006, I worked on spam classification with inexact string matching methods and linear classifiers. I have filed a patent on this work on behalf of Tufts University, with Gabriel M. Wachman and Carla E. Brodley. Our team competed in the TREC 2006 Spam Filtering competitions, with good results. We presented our method at the TREC 2006 Conference in November 2006. The text of our competition report is available here.

Text Analysis and Digital Libraries

I am interested in digital libraries, and the power of automated text analysis to make new scholarly discoveries possible. The literary scholar Brad Pasanek and I have collaborated on two journal articles in Literary and Linguistic Computing. The paper Mining Millions of Metaphors explores the use of text analysis on a hand-curated digital collection of metaphors of the mind. Our second article, Meaning and Mining, explores the impact of implicit assumptions when applying data mining techniques to literary analysis. We have also given talks at the Chicago Colloquium on Digital Humanities (2006, 2007, 2008), the STOA consortium (2007), and the Digital Humanities conference (2007). This work has also received some attention in The Chronicle of Higher Education, The San Jose Mercury News, and The New York Times.

Other work in this area has included automatic word sense disambiguation using coarsely aligned parallel texts, with Gregory Crane of the Perseus digital library. This work appeared as a contribution to a paper on the evolution of digital libraries.

Other Work

In 2010, I will serve as the general conference chair of CEAS.

In 2009, I am serving as a program committee co-chair of CEAS.

In 2008, I served as a program committee member of SIGIR.

In March, 2006, I presented the paper "Compression and Machine Learning: a New Perspective on Feature Space Vectors", co-authored with Carla E. Brodley. (Test and training splits of the Unix user data sets used in this paper are available here.)

In May, 2005, I gave a talk on "Compression, Learning, and Metaphor" at Harvard's Graduate Student conference on Mind Brain and Behavior.

I received my M.S. in computer science in May, 2005 from Tufts University.

In the summer of 2004, I wrote a tutorial on basic electronics for Tufts Robotics.

I received my B.A. in Visual and Environmental Studies from Harvard University in 1997, and an M.Ed. in School Leadership and Development from the Harvard Graduate School of Education in 2002.

Before graduate school, I taught in high schools in Abu Dhabi, U.A.E.; Caracas, Venezuela; Lugano, Switzerland; and Claremont, California. I am a former member of the U.S. men's national team in indoor field hockey, winning a silver medal at the 2004 Pan-American competition, and spent seven seasons as an assistant coach of the Boston College field hockey team.

My other interests include chess, music, writing, the visual arts, ice hockey, and the Pittsburgh Steelers.

F.A.Q.
Q. Do you really go by D.?
A. Yes, thanks, I much prefer it.