Exponentiated and Gradient Descent for Text Classification Jason Kroll ========================================================== Oct 2003 Abstract -------- Gradient descent and exponentiated gradient are trained to classify text and applied to unseen real-world examples. GD and EG are testing on dense and sparse data, over a range of training sizes, eta values and training iterations. Tolerarnce to noise, performance with bounded weights, and longest survivor results are assessed. Introduction ------------ Common wisdom in machine learning says that gradient descent performs better on dense target vectors, while exponentiated descent excels on sparse target vectors. Intuitively we expect this, as the exponentiated gradient will heavily emphasize the highest vector values, while linear gradient will give representation to all. As a corollary, we expect that exponentiated gradients will tolerate noise better as they will ignore the low-magnitude noise while the linear gradient will treat such data as equally legitimate. However, we wish to verify this wisdom to our satisfaction, to determine under what conditions it holds. Additionally, we explore the effects of suggested modifications to these algorithms. Procedure --------- A gradient descent / exponentiated descent program is written which reads the data, trains the weight vectors, tests the hypotheses on the selected test data, and returns error results while in training and in the test case. Data includes dense and sparse targets of sizes ranging from 200 to 2000 examples. The program, gradient.cc, may be compiled with the included Makefile. To run the program, we simply enter on the command line as follows: gradient train_file test_file eta iterations g/e noise (eg. gradient DT.2000.train DT.test .05 16 e 8, where the noise value is the denominator of p, hence a value of 8 yields noise of 1/8th or 6.25%.) Values for eta are chosen from the Gaussian curve (0.05, 0.34, 0.50, 0.68, 0.95), while the training iterations are simply powers of 8. Note that 0.34 is close to 0.37 which is roughly 1/e, and important number in population dynamics, in spite of which seemd to have no special character in these cases. Results ------- The resulting data is somewhat voluminous, consequently our analyses are embedded within the data set. ============================================================== Reuters Data ============ Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 400 .34 8 785 18.9 % Gradient 400 .50 8 785 18.3 % Gradient 400 .68 8 785 15.9 % Exponent 400 .34 8 785 31.7 % Exponent 400 .50 8 785 14.3 % Exponent 400 .68 8 785 13.9 % Gradient 400 .34 16 785 14.9 % Gradient 400 .50 16 785 15.5 % Gradient 400 .68 16 785 15.7 % Exponent 400 .34 16 785 14.8 % Exponent 400 .50 16 785 15.0 % Exponent 400 .68 16 785 16.3 % Gradient 400 .34 64 785 15.9 % Gradient 400 .50 64 785 13.6 % Gradient 400 .68 64 785 16.8 % Exponent 400 .34 64 785 14.4 % Exponent 400 .50 64 785 14.7 % Exponent 400 .68 64 785 13.5 % Reuters data - The sparse target nature of the Reuters data suggests that EG should fare well. However, we can see above that EG outperformed GD only very slightly, and certainly not in all circumstances. This data seems to prefer large eta values, but requires relatively little training time. At this simple level, we are slightly worse than what is achievable with decision trees. It is worth perhaps not worth noting that error rates during reached significantly lower (though nonzero) levels than during testing. We infer that the data is not entirely linearly seperable. ============================================================== Dense Target Data 200 ===================== Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 200 .34 8 2000 28.2 % Gradient 200 .50 8 2000 27.7 % Gradient 200 .68 8 2000 27.2 % Exponent 200 .34 8 2000 24.5 % Exponent 200 .50 8 2000 27.3 % Exponent 200 .68 8 2000 26.7 % Gradient 200 .34 16 2000 28.2 % Gradient 200 .50 16 2000 27.7 % Gradient 200 .68 16 2000 27.2 % Exponent 200 .34 16 2000 24.5 % Exponent 200 .50 16 2000 27.6 % Exponent 200 .68 16 2000 27.8 % Gradient 200 .34 64 2000 28.2 % Gradient 200 .50 64 2000 27.7 % Gradient 200 .68 64 2000 27.2 % Exponent 200 .34 64 2000 24.5 % Exponent 200 .50 64 2000 27.6 % Exponent 200 .68 64 2000 28.7 % Dense target 200 - Dense targets should see the success of gradient descent. However, with the small amount of training data, EG was able to generalize just as effecitvely and we see comparable, actually slightly better results with our exponentiated approach. ============================================================== Dense Target Data 800 ===================== Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 800 .34 8 2000 3.4 % Gradient 800 .50 8 2000 5.0 % Gradient 800 .68 8 2000 5.0 % Exponent 800 .34 8 2000 10.1 % Exponent 800 .50 8 2000 19.6 % Exponent 800 .68 8 2000 24.8 % Gradient 800 .34 16 2000 3.6 % Gradient 800 .50 16 2000 5.0 % Gradient 800 .68 16 2000 4.9 % Exponent 800 .34 16 2000 18.5 % Exponent 800 .50 16 2000 24.4 % Exponent 800 .68 16 2000 24.0 % Gradient 800 .34 64 2000 3.6 % Gradient 800 .50 64 2000 5.0 % Gradient 800 .68 64 2000 4.9 % Exponent 800 .34 64 2000 11.1 % Exponent 800 .50 64 2000 23.9 % Exponent 800 .68 64 2000 24.1 % Dense target 800 - Once the amount of training data reached a slightly more meaningful amount, gradient descent, operating on linear motifs, was finally able to realize its potential and dramatically outperform EG. Error rates with gradient descent in this case are objectively remarkable. ============================================================== Dense Target Data 2000 ====================== Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 2000 .34 8 2000 0.4 % Gradient 2000 .50 8 2000 0.8 % Gradient 2000 .68 8 2000 0.7 % Exponent 2000 .34 8 2000 17.5 % Exponent 2000 .50 8 2000 23.9 % Exponent 2000 .68 8 2000 24.2 % Gradient 2000 .05 16 2000 2.5 % Gradient 2000 .34 16 2000 0.4 % Gradient 2000 .50 16 2000 0.8 % Gradient 2000 .68 16 2000 0.7 % Exponent 2000 .05 16 2000 0.2 % Exponent 2000 .34 16 2000 18.0 % Exponent 2000 .50 16 2000 23.8 % Exponent 2000 .68 16 2000 24.4 % Gradient 2000 .34 64 2000 0.4 % Gradient 2000 .50 64 2000 0.8 % Gradient 2000 .68 64 2000 0.7 % Exponent 2000 .34 64 2000 17.5 % Exponent 2000 .50 64 2000 22.9 % Exponent 2000 .68 64 2000 19.5 % Dense target 2000 - Again we see the same effects as above, only more pronounced. Gradient descent, taking all attributes into account in linear proportion, produced startling results. Error rates below 1% would be phenomenal if they could scale to real-world enviroment, however the data was specifically selected for its density so we cannot maintain such hopes. It is notable that EG can be made to behave like GD if the eta value is quite low, and this is intuive. ============================================================== Sparse Target Data 200 ====================== Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 200 .34 8 2000 26.1 % Gradient 200 .50 8 2000 23.8 % Gradient 200 .68 8 2000 22.7 % Exponent 200 .34 8 2000 1.3 % Exponent 200 .50 8 2000 31.6 % Exponent 200 .68 8 2000 15.7 % Gradient 200 .34 16 2000 26.1 % Gradient 200 .50 16 2000 23.8 % Gradient 200 .68 16 2000 22.4 % Exponent 200 .34 16 2000 1.3 % Exponent 200 .50 16 2000 15.9 % Exponent 200 .68 16 2000 18.1 % Gradient 200 .34 64 2000 26.1 % Gradient 200 .50 64 2000 23.8 % Gradient 200 .68 64 2000 22.4 % Exponent 200 .34 64 2000 1.3 % Exponent 200 .50 64 2000 5.6 % Exponent 200 .68 64 2000 15.0 % Sparse target 200 - Here we two extremes in that EG outperforms GD both in small training sets and in sparse target vectors. The results are error rates (with small eta) of around 1%. This is impressive scaling for such small training data with minimal training iterations. Note that selecting bad eta values can foil EG on this data. ============================================================== Sparse Target Data 800 ====================== Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 800 .34 8 2000 5.5 % Gradient 800 .50 8 2000 5.9 % Gradient 800 .68 8 2000 5.6 % Exponent 800 .34 8 2000 6.2 % Exponent 800 .50 8 2000 15.5 % Exponent 800 .68 8 2000 18.2 % Gradient 800 .34 16 2000 5.5 % Gradient 800 .50 16 2000 5.9 % Gradient 800 .68 16 2000 5.3 % Exponent 800 .34 16 2000 0.1 % Exponent 800 .50 16 2000 19.2 % Exponent 800 .68 16 2000 15.5 % Gradient 800 .34 64 2000 5.5 % Gradient 800 .50 64 2000 5.9 % Gradient 800 .68 64 2000 5.3 % Exponent 800 .34 64 2000 0.1 % Exponent 800 .50 64 2000 14.5 % Exponent 800 .68 64 2000 17.6 % Sparse data 800 - Increasing the amoung of training data available attentuates the faults of GD in this environment, but it is not sufficient. By 16 passes, EG with small eta produces error rates of 0.1%. ============================================================== Sparse Target Data 2000 ======================= Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Gradient 2000 .34 8 2000 0.5 % Gradient 2000 .50 8 2000 0.05 % Gradient 2000 .68 8 2000 0.2 % Exponent 2000 .34 8 2000 6.2 % Exponent 2000 .50 8 2000 15.5 % Exponent 2000 .68 8 2000 18.2 % Gradient 2000 .05 16 2000 0.4 % Gradient 2000 .34 16 2000 0.5 % Gradient 2000 .50 16 2000 0.05 % Gradient 2000 .68 16 2000 0.2 % Exponent 2000 .05 16 2000 17.2 % Exponent 2000 .34 16 2000 7.7 % Exponent 2000 .50 16 2000 19.2 % Exponent 2000 .68 16 2000 15.5 % Gradient 2000 .34 64 2000 0.5 % Gradient 2000 .50 64 2000 0.05 % Gradient 2000 .68 64 2000 0.2 % Exponent 2000 .34 64 2000 17.4 % Exponent 2000 .50 64 2000 15.2 % Exponent 2000 .68 64 2000 17.6 % Sparse data 2000 - Interestingly, gradient descent is able to outperform its exponentiated doppelganger. As EG reported low errors in training, we can speculate that this is due to overfitting. Under these circumstances, it is gradient descent achieving <1% error while EG flops about. Probably an extremely low eta for EG would have produced good results. ============================================================== Low Eta Exponents on Sparse Target Data ======================================= Algorithm Train Eta Itrns Test Error --------- ----- --- ----- ---- ----- Exponent 200 .01 5+ 2000 22 % Exponent 800 .01 8+ 2000 4.2 % Exponent 2000 .01 8+ 2000 0.25 % Low eta - Here we can verify this speculation by showing that by 8 iterations of low eta, we achieve 0% error on the training data and ultimately 0.25 % on the test set. Still, it is remarkable how well GD performed in this case. ============================================================== Noise on Dense Target Data ========================== Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 .34 16 2000 25 % 33.1 % Gradient 2000 .34 16 2000 12.5 % 23.1 % Gradient 2000 .34 16 2000 6.25 % 16.1 % Exponent 2000 .34 16 2000 25 % 31.7 % Exponent 2000 .34 16 2000 12.5 % 19.3 % Exponent 2000 .34 16 2000 6.25 % 14.5 % Gradient 2000 .05 16 2000 25 % 33.5 % Gradient 2000 .34 16 2000 25 % 33.1 % Gradient 2000 .50 16 2000 25 % 32.7 % Gradient 2000 .68 16 2000 25 % 32.7 % Gradient 2000 .95 16 2000 25 % 31.1 % Exponent 2000 .05 16 2000 25 % 34.5 % Exponent 2000 .34 16 2000 25 % 31.7 % Exponent 2000 .50 16 2000 25 % 31.2 % Exponent 2000 .68 16 2000 25 % 28.2 % Exponent 2000 .95 16 2000 25 % 29.6 % Dense noise - Neither algorithm appear immune to the effects of noise. When rates approach 25%, both are comparable. At the low noise levels, EG as expected is a little bette at ignoring the low magnitude rumblings to produce a slightly better result. Noise can seriously affect linear seperability and create nontrivial difficulties. ============================================================== Noise on Sparse Target Data =========================== Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 .05 16 2000 6.25 % 11.7 % Exponent 2000 .05 16 2000 6.25 % 17 % Gradient 2000 .05 16 2000 25 % 30 % Gradient 2000 .34 16 2000 25 % 31.7 % Gradient 2000 .50 16 2000 25 % 30.5 % Gradient 2000 .68 16 2000 25 % 33.3 % Gradient 2000 .95 16 2000 25 % 28.2 % Exponent 2000 .05 16 2000 25 % 26.8 % Exponent 2000 .34 16 2000 25 % 30.3 % Exponent 2000 .50 16 2000 25 % 31.1 % Exponent 2000 .68 16 2000 25 % 50.9 % Exponent 2000 .95 16 2000 25 % 31.6 % Sparse noise - Here we can comparable rates except with low eta on low noise where GD performs quite well, perhaps because it infers less from the noise that EG. However, EG performs not badly in the high error areas, operating just slightly above the natural error rate, which can be considered quite good, relatively speaking. ============================================================== Sqrt(Attributes)-Bounded Weights on Dense Targets ================================================= Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 .34 16 2000 6.25 % 14.5 % Gradient 2000 .68 16 2000 6.25 % 14.5 % Exponent 2000 .34 16 2000 6.25 % 14.4 % Exponent 2000 .34 16 2000 6.25 % 16.9 % Gradient 2000 .05 16 2000 12.5 % 18.1 % Gradient 2000 .34 16 2000 12.5 % 18.3 % Gradient 2000 .68 16 2000 12.5 % 22.4 % Exponent 2000 .05 16 2000 12.5 % 23.8 % Exponent 2000 .34 16 2000 12.5 % 21.9 % Exponent 2000 .68 16 2000 12.5 % 22.8 % Dense Bounded - Here we have more egalitarian weight vectors, favoring GD slightly. This is probably because GD has a natural predilection for dense targets. ============================================================== Sqrt(Attributes)-Bounded Weights on Sparse Targets ================================================== Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 .34 16 2000 6.25 % 36.5 % Gradient 2000 .68 16 2000 6.25 % 40.7 % Exponent 2000 .34 16 2000 6.25 % 35.6 % Exponent 2000 .34 16 2000 6.25 % 36.4 % Gradient 2000 .05 16 2000 6.25 % 9.3 % Gradient 2000 .34 16 2000 6.25 % 36.4 % Gradient 2000 .68 16 2000 6.25 % 42.7 % Gradient 2000 .95 16 2000 6.25 % 42.7 % Exponent 2000 .05 16 2000 6.25 % 29.1 % Exponent 2000 .34 16 2000 6.25 % 32.7 % Exponent 2000 .68 16 2000 6.25 % 45.1 % Exponent 2000 .95 16 2000 6.25 % 40.4 % Sparse Bounded - Here GD continues to outperform EG, suggesting that EG is limited by bounding the weights, and this is what we expect from something that over-emphasizes high values. ============================================================== Bouned Longest Survivor on Dense Targets with Bounds ============================================= Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 0.05 16 2000 0 % 3.4 % Exponent 2000 0.05 16 2000 0 % 0.5 % Gradient 2000 0.05 16 2000 6.25 % 12 % Exponent 2000 0.05 16 2000 6.25 % 14.1 % Gradient 2000 0.05 16 2000 12.5 % 21 % Exponent 2000 0.05 16 2000 12.5 % 22.6 % Sparse Bounded Longest - This is a questionable approach which yield passable results, but the noise interferes with runs of longest survivor so we do not recommend it. ============================================================== Longest Survivor on Sparse Targets with Bounds ============================================== Algorithm Train Eta Itrns Test Noise Error --------- ----- --- ----- ---- ----- ----- Gradient 2000 0.05 16 2000 0 % 0.7 % Exponent 2000 0.05 16 2000 0 % 15.5 % Gradient 2000 0.05 16 2000 6.25 % 12.8 % Exponent 2000 0.05 16 2000 6.25 % 20.4 % Gradient 2000 0.05 16 2000 12.5 % 18.5 % Exponent 2000 0.05 16 2000 12.5 % 24.8 % ============================================================== Dense Bounded Longest - This is comparable to sparse bounded when noisy, except that in the case of no noise GD has a curious success. Conclusion ---------- In general, the intuitive hypothesis that gradient descent performs well on dense targets while exponentiated descent performs well on sparse targets is born out. To allow these algorithms to perform best, it is necessary to select optimal eta values, which should be very small for EG and in cases of large training sets. Again, this is intuitive. Variations such as bounded weights are not recommended for EG where they can be self-defeating, and seem unnecessary for GD unless we have some prior knowledge indicating that our dense target vector should be more egalitarian than usual. Noise is a problem for both algorithms, probably because it interferes with linear seperability. EG is particularly susceptible on noisy dense data while GD performs equally badly in both environments. An amount of human intuition is useful in tuning these algorithms with optimal learning rates for their environments.