Accepted to the Australasian Joint Conference on Artificial Intelligence, 2008.
Each data set consists of a set of articles, and each corresponding file is a list of the articles' PubMed IDs.
| File | Records | Description |
|---|---|---|
| TCDB.PubMedIDs | 3,403 | The PubMed ID of articles referenced in TCDB as of October 16, 2007 (Sections 2 and 3) |
| unlabeled.PubMedIDs | 16,341 | The PubMed ID of articles that appear recently in Medline from the set of journals referenced in TCDB. Roughly, these correspond to the period from October 1, 2007 to December 20, 2007 (Sections 2 and 3) |
| literature.positive.PubMedIDs | 37 | The PubMedIDs of articles hand-selected by the human expert from three journals (Section 4). The three journals are: The Journal of Bacteriology, The Journal of Biological Chemistry, and Nature. |
| literature.negative.PubMedIDs | 381 | The PubMedIDs of the remaining articles in the same journal issues as those in literature.positive.PubMedIDs (Section 4) |
| literature.all.PubMedIDs | 15,125 | Articles from 102 journals (i.e. 99 additional journals), corresponding to the same date range (September 7, 2007 - December 14, 2007) as literature.positive.PubMedIDs and literature.negative.PubMedIDs |
| 1.october2007.PubMedIDs | 6,108 | Deployment set #1 (Section 5): Approximate date range October 1 2007 -- October 31 2007 |
| 2.november2007.PubMedIDs | 10,233 | Deployment set #2 (Section 5): Approximate date range November 1 2007 -- December 20 2007 |
| 3.january2008.PubMedIDs | 3,885 | Deployment set #3 (Section 5): Approximate date range December 21 2007 -- January 15 2008 |
| 4.february2008.PubMedIDs | 6,975 | Deployment set #4 (Section 5): Approximate date range January 15 2008 -- February 20 2008 |
| 5.march2008.PubMedIDs | 2,544 | Deployment set #5 (Section 5): Approximate date range February 21 2008 -- March 13 2008 |
The abstracts of these articles are obtained using the following procedure: First, we download the article data using NCBI's entrez programming utilities (link). The positive examples are downloaded individually using their PubMed~ID number. (All Medline documents have a unique ID number in PubMed, which is NCBI's public interface to Medline.) The unlabeled examples are downloaded using the query term
journal [ta] & mindate= ... & maxdate= ...for each journal in TCDB, for the date ranges of October 1 to 31, 2007 (retrieved on December 12, 2007) and November 1 to December 20, 2007 (retrieved on December 20, 2007). This results in a subset of the articles in the PubMed~ID range 17902656 to 18092361.
The subset contains 16,341 articles because it is restricted to the articles that appear in certain journals and have PubMed dates in the given range associated with them.
We represent each article as a vector of features corresponding to the words in the article's abstract. We apply the Porter stemming algorithm (Porter, 1980) to the abstract text, count the use of each word in each abstract using MALLET (McCallum, 2002), and limit the vectors to words that appear at least three times in the corpus. This reduces the vocabulary size in our training data set by 47% (17,060 words appear once, 6,534 words appear twice, and we retain a vocabulary of 26,364 words that appear at least three times). Finally, we represent each word by the value log(1+ni), where ni is the number of times word i appears in an abstract, and we normalize the vector for each article to have unit Euclidean length, i.e. to lie on the unit sphere. We do not use information associated with the articles (journal names, author names, etc.) other than the abstract text. We verify that the vocabulary does not contain any obvious "leaker" words, i.e. terms that discriminate between the training subsets but are not valid predictors for future articles. In particular, the trained models do not use article dates, directly or indirectly, to make predictions.
You may view pseudocode illustrating the algorithms used to tune the SVM parameter C, and iteratively train and relabel our data sets (PDF). To evaluate each value of C, we use four-fold cross-validation and consider powers of 10. Figure 1 shows the tuning set precision for each of ten testing folds on our training set. Our approach chooses C=1 in nine of ten folds, and C=10 once.
These are a few figures and captions that supplement Sections 2-4.
Figure 2 shows test set ROC (receiver operating characteristic) curves comparing alternative learning algorithms on our original data set (i.e. the set of 3,403 positive examples from TCDB and 16,341 negative examples from MEDLINE). Each point on these curves represents a different value of k. The horizontal axis shows the false positive rate defined as the proportion of all irrelevant articles that are in the top k. The vertical axis shows the true positive rate, defined as the proportion of all relevant articles that are in the top k. The point on each line corresponds to the classifications of the highest-scoring k=3000 articles. Ideally, this point would lie on the vertical axis, indicating zero false positives. The following table gives the number of true positives in the top 3000 for each method.
| Method | True Positives in the top k=3000 Predictions |
|---|---|
| SVM | 2614 |
| MaxEnt | 2556 |
| Naïve Bayes | 2444 |
| Winnow | 1457 |
Figure 3 shows the ROC curve for the recall experiments of Section 4.
Porter, 1980: M. F. Porter. An algorithm for suffix stripping. Program 14:3 130-137, 1980.
McCallum, 2002: A. K. McCallum MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu, 2002.