K. Noto and C. Elkan

Learning to Find Relevant Biological Articles Without Negative Training Examples.

Accepted to the Australasian Joint Conference on Artificial Intelligence, 2008.

Download PDF

bibtex entry

Abstract

Classifiers are traditionally learned using sets of positive and negative training examples. However, often a classifier is required, but for training only an incomplete set of positive examples and a set of unlabeled examples are available. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluate a method for learning to rank the likely relevance to TCDB of newly published scientific articles, using the articles currently referenced in TCDB as positive training examples. The new method has succeeded in identifying 964 new articles relevant to TCDB in fewer than six months, which is a major practical success. From a general data mining perspective, the contributions of this paper are (i) devising and evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely-used, real-world biomedical database. All data and software are publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.

Data Sets

Each data set consists of a set of articles, and each corresponding file is a list of the articles' PubMed IDs.

File Records Description



TCDB.PubMedIDs 3,403 The PubMed ID of articles referenced in TCDB as of October 16, 2007 (Sections 2 and 3)
unlabeled.PubMedIDs 16,341 The PubMed ID of articles that appear recently in Medline from the set of journals referenced in TCDB. Roughly, these correspond to the period from October 1, 2007 to December 20, 2007 (Sections 2 and 3)



literature.positive.PubMedIDs 37 The PubMedIDs of articles hand-selected by the human expert from three journals (Section 4). The three journals are: The Journal of Bacteriology, The Journal of Biological Chemistry, and Nature.
literature.negative.PubMedIDs 381 The PubMedIDs of the remaining articles in the same journal issues as those in literature.positive.PubMedIDs (Section 4)
literature.all.PubMedIDs 15,125 Articles from 102 journals (i.e. 99 additional journals), corresponding to the same date range (September 7, 2007 - December 14, 2007) as literature.positive.PubMedIDs and literature.negative.PubMedIDs



1.october2007.PubMedIDs 6,108 Deployment set #1 (Section 5): Approximate date range October 1 2007 -- October 31 2007
2.november2007.PubMedIDs 10,233 Deployment set #2 (Section 5): Approximate date range November 1 2007 -- December 20 2007
3.january2008.PubMedIDs 3,885 Deployment set #3 (Section 5): Approximate date range December 21 2007 -- January 15 2008
4.february2008.PubMedIDs 6,975 Deployment set #4 (Section 5): Approximate date range January 15 2008 -- February 20 2008
5.march2008.PubMedIDs 2,544 Deployment set #5 (Section 5): Approximate date range February 21 2008 -- March 13 2008

Data Source

The abstracts of these articles are obtained using the following procedure: First, we download the article data using NCBI's entrez programming utilities (link). The positive examples are downloaded individually using their PubMed~ID number. (All Medline documents have a unique ID number in PubMed, which is NCBI's public interface to Medline.) The unlabeled examples are downloaded using the query term

journal [ta] & mindate= ... & maxdate= ...
for each journal in TCDB, for the date ranges of October 1 to 31, 2007 (retrieved on December 12, 2007) and November 1 to December 20, 2007 (retrieved on December 20, 2007). This results in a subset of the articles in the PubMed~ID range 17902656 to 18092361.

The subset contains 16,341 articles because it is restricted to the articles that appear in certain journals and have PubMed dates in the given range associated with them.

Data Representation

We represent each article as a vector of features corresponding to the words in the article's abstract. We apply the Porter stemming algorithm (Porter, 1980) to the abstract text, count the use of each word in each abstract using MALLET (McCallum, 2002), and limit the vectors to words that appear at least three times in the corpus. This reduces the vocabulary size in our training data set by 47% (17,060 words appear once, 6,534 words appear twice, and we retain a vocabulary of 26,364 words that appear at least three times). Finally, we represent each word by the value log(1+ni), where ni is the number of times word i appears in an abstract, and we normalize the vector for each article to have unit Euclidean length, i.e. to lie on the unit sphere. We do not use information associated with the articles (journal names, author names, etc.) other than the abstract text. We verify that the vocabulary does not contain any obvious "leaker" words, i.e. terms that discriminate between the training subsets but are not valid predictors for future articles. In particular, the trained models do not use article dates, directly or indirectly, to make predictions.

Pseudocode

You may view pseudocode illustrating the algorithms used to tune the SVM parameter C, and iteratively train and relabel our data sets (PDF). To evaluate each value of C, we use four-fold cross-validation and consider powers of 10. Figure 1 shows the tuning set precision for each of ten testing folds on our training set. Our approach chooses C=1 in nine of ten folds, and C=10 once.

Figure 1
Fig. 1: The tuning set precision associated with the value of the cost parameter in our SVM models. Each line corresponds to one of ten folds.

Results

These are a few figures and captions that supplement Sections 2-4.

Figure 2 shows test set ROC (receiver operating characteristic) curves comparing alternative learning algorithms on our original data set (i.e. the set of 3,403 positive examples from TCDB and 16,341 negative examples from MEDLINE). Each point on these curves represents a different value of k. The horizontal axis shows the false positive rate defined as the proportion of all irrelevant articles that are in the top k. The vertical axis shows the true positive rate, defined as the proportion of all relevant articles that are in the top k. The point on each line corresponds to the classifications of the highest-scoring k=3000 articles. Ideally, this point would lie on the vertical axis, indicating zero false positives. The following table gives the number of true positives in the top 3000 for each method.

True positives in the top k=3000 predictions using different ranking models. In each case, these consist of the 300 most highly-ranked articles from each of 10 test folds.
Method True Positives in the top k=3000 Predictions
SVM 2614
MaxEnt 2556
Naïve Bayes 2444
Winnow 1457

Figure 2
Fig. 2: Test set ROC curves for multiple methods on our data set.

Figure 3 shows the ROC curve for the recall experiments of Section 4.

Figure 3
Fig. 3: An ROC curve showing the performance of our classifier on 418 articles from three journals, 37 of which were manually identified as relevant to TCDB. The remaining 381 articles are from the same journal issues and are considered negative examples. The point highlighted on the curve corresponds to a reasonable number of articles to show to the human expert.

References

Porter, 1980: M. F. Porter. An algorithm for suffix stripping. Program 14:3 130-137, 1980.

McCallum, 2002: A. K. McCallum MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu, 2002.