D. Sculley

Ph.D. candidate
Department of Computer Science
Tufts University
email: dsculley (at) cs.tufts.edu
Resume (.pdf)
Publications

My current work lies at the intersection of data mining, machine learning, and information retrieval. It includes information filtering (such as email spam and blog comment spam filtering), text analysis, digital libraries, and online advertising systems. I have also done work in computational biology and the digital humanities. I am advised by Carla Brodley, and expect to graduate in August of 2008.

Information Filtering

With Gabriel M. Wachman, I have developed Relaxed Online SVMs, a fast online SVM variant suitable for streaming data. This work was named Best Student Paper at SIGIR 2007, and gave the best results on several tasks at the TREC 2007 spam filtering competition. Source code is available for download here.

At CEAS 2007, I presented work on online active learning for spam filtering, which I think is a more natural mode of active learning for this task than pool-based active learning. This idea was adopted as a part of the TREC 2007 spam filtering track.
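
As a rough illustration of the online setting (a sketch under stated assumptions, not the method from the CEAS paper): the learner sees each message once, in arrival order, and must decide on the spot whether to request its label, rather than choosing examples from a stored pool. The fixed uncertainty threshold `margin`, the perceptron-style update, and the function names below are all assumptions made for the example.

```python
def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def run_online_active(stream, n_features, margin=0.5, lr=1.0):
    """Toy online active learner (hypothetical, for illustration only).

    stream yields (x, y) pairs with y in {+1, -1}. The label y is only
    consulted when the learner requests it; a label is requested when the
    current score is within `margin` of the decision boundary.
    """
    w = [0.0] * n_features
    requested = 0
    for x, y in stream:
        score = dot(w, x)
        if abs(score) < margin:
            # Uncertain about this message: ask for its label now,
            # since the message will not be seen again.
            requested += 1
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        # Otherwise the message is classified without asking for a label.
    return w, requested


if __name__ == "__main__":
    stream = [
        ([1.0, 0.0], 1),
        ([1.0, 0.0], 1),
        ([0.0, 1.0], -1),
        ([0.0, 1.0], -1),
    ]
    weights, n_requests = run_online_active(stream, n_features=2)
    print(weights, n_requests)
```

In this toy stream the learner requests a label only for the first message of each kind; the repeats are already scored confidently and cost no labeling effort.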

I've also been interested in the idea of one-sided feedback for online learning, typified by the case of the lazy user who never checks the 'spam box', and only looks at messages in the 'inbox'. In this case, an online learner will never get feedback on predicted negatives. We examined this problem and some prior theoretical solutions, along with new practical methods, in our 2007 paper at KDD.
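
The one-sided feedback setting can be made concrete with a small simulation. This is a minimal sketch, not one of the methods from the KDD paper: it assumes label +1 means "deliver to inbox" and -1 means "send to spam box", uses a plain perceptron, and the function names are hypothetical. The simulator knows every true label, but the learner is shown a label only when it predicts +1.

```python
def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def run_one_sided(stream, n_features, lr=1.0):
    """Toy perceptron under one-sided feedback (hypothetical example).

    stream yields (x, y) pairs with y in {+1, -1}. Feedback is available
    only for messages predicted +1 (inbox); messages predicted -1 go to
    the spam box, which the lazy user never checks.
    """
    w = [0.0] * n_features
    labels_seen = 0
    hidden_errors = 0  # mistakes the learner is never told about
    for x, y in stream:
        pred = 1 if dot(w, x) >= 0 else -1
        if pred == 1:
            # The user reads the inbox, so the true label is revealed.
            labels_seen += 1
            if y != pred:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif y == 1:
            # A legitimate message was filed as spam; no feedback arrives,
            # so the learner cannot correct itself.
            hidden_errors += 1
    return w, labels_seen, hidden_errors


if __name__ == "__main__":
    stream = [
        ([1.0, 0.0], -1),  # predicted inbox, user reports spam
        ([1.0, 0.0], 1),   # now predicted spam, but actually legitimate
        ([0.0, 1.0], 1),   # predicted inbox, correct
    ]
    print(run_one_sided(stream, n_features=2))
```

The second message shows the difficulty: after one correction the learner files a legitimate message as spam, and because no feedback ever arrives for that prediction, the error goes unnoticed.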

I've worked with Arthur Brady, Lenore Cowen, Donna Slonim, Xintao Wei and Carla Brodley to develop a method of detecting and defeating "genomic spam" -- that is, pathogenic sequences appearing in otherwise benign genomes. This work is currently under review at a leading computational biology journal.

In 2006, I worked on spam classification with inexact string matching methods and linear classifiers. I have filed a patent on this work on behalf of Tufts University, with Gabriel M. Wachman and Carla E. Brodley. Our team competed in the TREC 2006 spam filtering competition, with good results, and presented our method at the TREC 2006 conference in November, 2006. The text of our competition report is available here.

Text Analysis and Digital Libraries

I am interested in digital libraries, and the power of automated text analysis to make new scholarly discoveries possible. The literary scholar Brad Pasanek and I have two papers forthcoming in Literary and Linguistic Computing. The first, "Mining Millions of Metaphors", explores the use of text analysis on a hand-curated digital collection of metaphors of the mind. The second, "Meaning and Mining", explores the impact of implicit assumptions when applying data mining techniques to literary analysis. We have also given talks at the Chicago Colloquium on Digital Humanities in 2006 and 2007, the STOA consortium (2007), the Digital Humanities conference (2007), and the MLA annual meeting (2007).

Other work in this area has included automatic word sense disambiguation using coarsely aligned parallel texts, with Gregory Crane of the Perseus digital library. This work appeared as a contribution to a paper on the evolution of digital libraries.

Online Advertising

In the fall of 2007, I was an intern at the Pittsburgh office of Google, located on the campus of Carnegie Mellon University. I worked on systems to ensure that users had a positive experience when they clicked on advertisement links.

I spent the spring of 2006 doing an internship at Yahoo!, in the Data Mining Research team under the direction of Pavel Berkhin, working with Scott Gaffney and Rajesh Parekh. My work there involved the problem of keyword expansion for sponsored search advertisers; these approaches were included in a patent filed by Yahoo!. As part of this project, I developed several methods of rank aggregation among similar items, which appeared as a poster paper at the 2007 SIAM data mining conference.

Other Work

In 2008, I am serving as a program committee member of SIGIR.

In March, 2006, I presented the paper "Compression and Machine Learning: a New Perspective on Feature Space Vectors", co-authored with Carla E. Brodley. (Test and training splits of the Unix user data sets used in this paper are available here.)

In May, 2005, I gave a talk on "Compression, Learning, and Metaphor" at Harvard's Graduate Student conference on Mind Brain and Behavior.

I received my M.S. in computer science in May, 2005 from Tufts University.

In the summer of 2004, I wrote a tutorial on basic electronics for Tufts Robotics.

I received my B.A. in Visual and Environmental Studies from Harvard University in 1997, and an M.Ed. in School Leadership and Development from the Harvard Graduate School of Education in 2002.

Before graduate school, I taught in high schools in Abu Dhabi, U.A.E.; Caracas, Venezuela; Lugano, Switzerland; and Claremont, California. I am a former member of the U.S. men's national team in indoor field hockey, winning a silver medal at the 2004 Pan-American competition, and spent seven seasons as an assistant coach of the Boston College field hockey team.

My other interests include chess, music, writing, the visual arts, ice hockey, and the Pittsburgh Steelers.

F.A.Q.
Q. Do you really go by D.?
A. Yes, thanks, I much prefer it.