Exponentiated and Gradient Descent for Text Classification	Jason Kroll
==========================================================	Oct 2003

Abstract
--------

Gradient descent and exponentiated gradient are trained to classify text 
and applied to unseen real-world examples. GD and EG are testing on dense 
and sparse data, over a range of training sizes, eta values and training 
iterations. Tolerarnce to noise, performance with bounded weights, and 
longest survivor results are assessed.

Introduction
------------

Common wisdom in machine learning says that gradient descent performs
better on dense target vectors, while exponentiated descent excels on
sparse target vectors. Intuitively we expect this, as the exponentiated
gradient will heavily emphasize the highest vector values, while linear
gradient will give representation to all. As a corollary, we expect that
exponentiated gradients will tolerate noise better as they will ignore the
low-magnitude noise while the linear gradient will treat such data as
equally legitimate. However, we wish to verify this wisdom to our
satisfaction, to determine under what conditions it holds. Additionally,
we explore the effects of suggested modifications to these algorithms.

Procedure
---------

A gradient descent / exponentiated descent program is written which reads
the data, trains the weight vectors, tests the hypotheses on the selected
test data, and returns error results while in training and in the test
case. Data includes dense and sparse targets of sizes ranging from 200 to
2000 examples.

The program, gradient.cc, may be compiled with the included Makefile. To 
run the program, we simply enter on the command line as follows:

gradient train_file test_file eta iterations g/e noise

(eg. gradient DT.2000.train DT.test .05 16 e 8, where the noise value is 
the denominator of p, hence a value of 8 yields noise of 1/8th or 6.25%.)

Values for eta are chosen from the Gaussian curve (0.05, 0.34, 0.50, 0.68, 
0.95), while the training iterations are simply powers of 8. Note that 
0.34 is close to 0.37 which is roughly 1/e, and important number in 
population dynamics, in spite of which seemd to have no special character 
in these cases.

Results
-------

The resulting data is somewhat voluminous, consequently our analyses are  
embedded within the data set.

==============================================================
Reuters Data
============

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	400 	.34	8	785		18.9 %
Gradient        400     .50     8	785		18.3 %
Gradient        400     .68     8	785		15.9 %

Exponent	400 	.34	8	785		31.7 %
Exponent        400     .50     8	785		14.3 %
Exponent        400     .68     8	785		13.9 %

Gradient	400 	.34	16	785		14.9 %
Gradient        400     .50     16	785		15.5 %
Gradient        400     .68     16	785		15.7 %

Exponent	400 	.34	16	785		14.8 %
Exponent	400 	.50	16	785		15.0 %
Exponent	400 	.68	16	785		16.3 %

Gradient	400 	.34	64	785		15.9 %
Gradient        400     .50     64	785		13.6 %
Gradient        400     .68     64	785		16.8 %

Exponent	400 	.34	64	785		14.4 %
Exponent	400 	.50	64	785		14.7 %
Exponent	400 	.68	64	785		13.5 %

Reuters data - The sparse target nature of the Reuters data suggests that
EG should fare well. However, we can see above that EG outperformed GD
only very slightly, and certainly not in all circumstances. This data
seems to prefer large eta values, but requires relatively little training
time. At this simple level, we are slightly worse than what is achievable
with decision trees. It is worth perhaps not worth noting that error rates
during reached significantly lower (though nonzero) levels than during
testing. We infer that the data is not entirely linearly seperable.

==============================================================
Dense Target Data 200
=====================

Algorithm       Train   Eta     Itrns   Test            Error
---------       -----   ---     -----   ----            -----
Gradient	200	.34	8	2000		28.2 %
Gradient        200     .50     8	2000    	27.7 %
Gradient        200     .68     8	2000    	27.2 %

Exponent	200	.34	8	2000		24.5 %
Exponent	200	.50	8	2000		27.3 %
Exponent	200	.68	8	2000		26.7 %

Gradient	200	.34	16	2000		28.2 %
Gradient        200     .50     16	2000		27.7 %
Gradient        200     .68     16	2000		27.2 %

Exponent	200	.34	16	2000		24.5 %
Exponent	200	.50	16	2000		27.6 %
Exponent	200	.68	16	2000		27.8 %

Gradient	200	.34	64	2000		28.2 %
Gradient        200     .50     64	2000    	27.7 %
Gradient        200     .68     64	2000		27.2 %

Exponent	200	.34	64	2000		24.5 %
Exponent	200	.50	64	2000		27.6 %
Exponent	200	.68	64	2000		28.7 %

Dense target 200 - Dense targets should see the success of gradient 
descent. However, with the small amount of training data, EG was able to 
generalize just as effecitvely and we see comparable, actually slightly 
better results with our exponentiated approach.

==============================================================
Dense Target Data 800
=====================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	800	.34	8	2000		3.4 %
Gradient	800	.50	8	2000		5.0 %
Gradient	800	.68	8	2000		5.0 %

Exponent	800	.34	8	2000		10.1 %
Exponent	800	.50	8	2000		19.6 %
Exponent	800	.68	8	2000		24.8 %

Gradient	800	.34	16	2000		3.6 %
Gradient	800	.50	16	2000		5.0 %
Gradient	800	.68	16	2000		4.9 %

Exponent	800	.34	16	2000		18.5 %
Exponent	800	.50	16	2000		24.4 %
Exponent	800	.68	16	2000		24.0 %

Gradient	800	.34	64	2000		3.6 %
Gradient	800	.50	64	2000		5.0 %
Gradient	800	.68	64	2000		4.9 %

Exponent	800	.34	64	2000		11.1 %
Exponent	800	.50	64	2000		23.9 %
Exponent	800	.68	64	2000		24.1 %

Dense target 800 - Once the amount of training data reached a slightly 
more meaningful amount, gradient descent, operating on linear motifs, was 
finally able to realize its potential and dramatically outperform EG. 
Error rates with gradient descent in this case are objectively remarkable.

==============================================================
Dense Target Data 2000
======================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	2000	.34	8	2000		0.4 %
Gradient	2000	.50	8	2000		0.8 %
Gradient	2000	.68	8	2000		0.7 %

Exponent	2000	.34	8	2000		17.5 %
Exponent	2000	.50	8	2000		23.9 %
Exponent	2000	.68	8	2000		24.2 %

Gradient	2000	.05	16	2000		2.5 %
Gradient	2000	.34	16	2000		0.4 %
Gradient	2000	.50	16	2000		0.8 %
Gradient	2000	.68	16	2000		0.7 %

Exponent	2000	.05	16	2000		0.2 %
Exponent	2000	.34	16	2000		18.0 %
Exponent	2000	.50	16	2000		23.8 %
Exponent	2000	.68	16	2000		24.4 %

Gradient	2000	.34	64	2000		0.4 %
Gradient	2000	.50	64	2000		0.8 %
Gradient	2000	.68	64	2000		0.7 %

Exponent	2000	.34	64	2000		17.5 %
Exponent	2000	.50	64	2000		22.9 %
Exponent	2000	.68	64	2000		19.5 %

Dense target 2000 - Again we see the same effects as above, only more 
pronounced. Gradient descent, taking all attributes into account in linear 
proportion, produced startling results. Error rates below 1% would be 
phenomenal if they could scale to real-world enviroment, however the data 
was specifically selected for its density so we cannot maintain such 
hopes. It is notable that EG can be made to behave like GD if the eta 
value is quite low, and this is intuive.

==============================================================
Sparse Target Data 200
======================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	200	.34	8	2000		26.1 %
Gradient	200	.50	8	2000		23.8 %
Gradient	200	.68	8	2000		22.7 %

Exponent	200	.34	8	2000		1.3 %
Exponent	200	.50	8	2000		31.6 %
Exponent	200	.68	8	2000		15.7 %

Gradient	200	.34	16	2000		26.1 %
Gradient	200	.50	16	2000		23.8 %
Gradient	200	.68	16	2000		22.4 %

Exponent	200	.34	16	2000		1.3 %
Exponent	200	.50	16	2000		15.9 %
Exponent	200	.68	16	2000		18.1 %

Gradient	200	.34	64	2000		26.1 %
Gradient	200	.50	64	2000		23.8 %
Gradient	200	.68	64	2000		22.4 %

Exponent	200	.34	64	2000		1.3 %
Exponent	200	.50	64	2000		5.6 %
Exponent	200	.68	64	2000		15.0 %

Sparse target 200 - Here we two extremes in that EG outperforms GD both in 
small training sets and in sparse target vectors. The results are error 
rates (with small eta) of around 1%. This is impressive scaling for such 
small training data with minimal training iterations. Note that selecting 
bad eta values can foil EG on this data.

==============================================================
Sparse Target Data 800
======================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	800	.34	8	2000		5.5 %
Gradient	800	.50	8	2000		5.9 %
Gradient	800	.68	8	2000		5.6 %

Exponent	800	.34	8	2000		6.2 %
Exponent	800	.50	8	2000		15.5 %
Exponent	800	.68	8	2000		18.2 %

Gradient	800	.34	16	2000		5.5 %
Gradient	800	.50	16	2000		5.9 %
Gradient	800	.68	16	2000		5.3 %

Exponent	800	.34	16	2000		0.1 %
Exponent	800	.50	16	2000		19.2 %
Exponent	800	.68	16	2000		15.5 %

Gradient	800	.34	64	2000		5.5 %
Gradient	800	.50	64	2000		5.9 %
Gradient	800	.68	64	2000		5.3 %

Exponent	800	.34	64	2000		0.1 %
Exponent	800	.50	64	2000		14.5 %
Exponent	800	.68	64	2000		17.6 %

Sparse data 800 - Increasing the amoung of training data available 
attentuates the faults of GD in this environment, but it is not 
sufficient. By 16 passes, EG with small eta produces error rates of 0.1%. 

==============================================================
Sparse Target Data 2000
=======================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Gradient	2000	.34	8	2000		0.5 %
Gradient	2000	.50	8	2000		0.05 %
Gradient	2000	.68	8	2000		0.2 %

Exponent	2000	.34	8	2000		6.2 %
Exponent	2000	.50	8	2000		15.5 %
Exponent	2000	.68	8	2000		18.2 %

Gradient	2000	.05	16	2000		0.4 %
Gradient	2000	.34	16	2000		0.5 %
Gradient	2000	.50	16	2000		0.05 %
Gradient	2000	.68	16	2000		0.2 %

Exponent	2000	.05	16	2000		17.2 %
Exponent	2000	.34	16	2000		7.7 %
Exponent	2000	.50	16	2000		19.2 %
Exponent	2000	.68	16	2000		15.5 %

Gradient	2000	.34	64	2000		0.5 %
Gradient	2000	.50	64	2000		0.05 %
Gradient	2000	.68	64	2000		0.2 %

Exponent	2000	.34	64	2000		17.4 %
Exponent	2000	.50	64	2000		15.2 %
Exponent	2000	.68	64	2000		17.6 %

Sparse data 2000 - Interestingly, gradient descent is able to outperform 
its exponentiated doppelganger. As EG reported low errors in training, we 
can speculate that this is due to overfitting. Under these circumstances, 
it is gradient descent achieving <1% error while EG flops about. Probably 
an extremely low eta for EG would have produced good results.

==============================================================
Low Eta Exponents on Sparse Target Data
=======================================

Algorithm	Train	Eta	Itrns	Test		Error
---------	-----	---	-----	----		-----
Exponent	200	.01	5+	2000		22 %
Exponent	800	.01	8+	2000		4.2 %
Exponent	2000	.01	8+	2000		0.25 %

Low eta - Here we can verify this speculation by showing that by 8 
iterations of low eta, we achieve 0% error on the training data and 
ultimately 0.25 % on the test set. Still, it is remarkable how well GD 
performed in this case.

==============================================================
Noise on Dense Target Data
==========================

Algorithm	Train	Eta	Itrns	Test	Noise	Error
---------	-----	---	-----	----	-----	-----
Gradient	2000	.34	16	2000	25 %	33.1 %
Gradient	2000	.34	16	2000	12.5 %	23.1 %
Gradient	2000	.34	16	2000	6.25 %	16.1 %

Exponent	2000	.34	16	2000	25 %	31.7 %
Exponent	2000	.34	16	2000	12.5 %	19.3 %
Exponent	2000	.34	16	2000	6.25 %	14.5 %

Gradient	2000	.05	16	2000	25 %	33.5 %
Gradient        2000    .34     16      2000    25 %    33.1 %
Gradient	2000	.50	16	2000	25 %	32.7 %
Gradient	2000	.68	16	2000	25 %	32.7 %
Gradient	2000	.95	16	2000	25 %	31.1 %

Exponent	2000	.05	16	2000	25 %	34.5 %
Exponent	2000	.34	16	2000	25 %	31.7 %
Exponent	2000	.50	16	2000	25 %	31.2 %
Exponent	2000	.68	16	2000	25 %	28.2 %
Exponent	2000	.95	16	2000	25 %	29.6 %

Dense noise - Neither algorithm appear immune to the effects of noise. 
When rates approach 25%, both are comparable. At the low noise levels, EG 
as expected is a little bette at ignoring the low magnitude rumblings to 
produce a slightly better result. Noise can seriously affect linear 
seperability and create nontrivial difficulties.

==============================================================
Noise on Sparse Target Data
===========================

Algorithm       Train   Eta     Itrns   Test    Noise   Error 
---------       -----   ---     -----   ----    -----   -----

Gradient	2000	.05	16	2000	6.25 %	11.7 %
Exponent	2000	.05	16	2000	6.25 %	17 %

Gradient	2000	.05	16	2000	25 %	30 %
Gradient	2000	.34	16	2000	25 %	31.7 %
Gradient	2000	.50	16	2000	25 %	30.5 %
Gradient	2000	.68	16	2000	25 %	33.3 %
Gradient	2000	.95	16	2000	25 %	28.2 %

Exponent	2000	.05	16	2000	25 %	26.8 %
Exponent	2000	.34	16	2000	25 %	30.3 %
Exponent	2000	.50	16	2000	25 %	31.1 %
Exponent	2000	.68	16	2000	25 %	50.9 %
Exponent	2000	.95	16	2000	25 %	31.6 %

Sparse noise - Here we can comparable rates except with low eta on low
noise where GD performs quite well, perhaps because it infers less from 
the noise that EG. However, EG performs not badly in the high error areas, 
operating just slightly above the natural error rate, which can be 
considered quite good, relatively speaking.

==============================================================
Sqrt(Attributes)-Bounded Weights on Dense Targets
=================================================

Algorithm       Train   Eta     Itrns   Test    Noise   Error 
---------       -----   ---     -----   ----    -----   -----
Gradient	2000	.34	16	2000	6.25 %	14.5 %
Gradient	2000	.68	16	2000	6.25 %	14.5 %

Exponent	2000	.34	16	2000	6.25 %	14.4 %
Exponent	2000	.34	16	2000	6.25 %	16.9 %

Gradient	2000	.05	16	2000	12.5 %	18.1 %
Gradient	2000	.34	16	2000	12.5 %	18.3 %
Gradient	2000	.68	16	2000	12.5 %	22.4 %

Exponent	2000	.05	16	2000	12.5 %	23.8 %
Exponent	2000	.34	16	2000	12.5 %	21.9 %
Exponent	2000	.68	16	2000	12.5 %	22.8 %

Dense Bounded - Here we have more egalitarian weight vectors, favoring GD
slightly. This is probably because GD has a natural predilection for dense 
targets.

==============================================================
Sqrt(Attributes)-Bounded Weights on Sparse Targets
==================================================

Algorithm       Train   Eta     Itrns   Test    Noise   Error
---------       -----   ---     -----   ----    -----   ----- 
Gradient        2000    .34     16      2000    6.25 %	36.5 %
Gradient        2000    .68     16      2000    6.25 %	40.7 %

Exponent        2000    .34     16      2000    6.25 %	35.6 %
Exponent        2000    .34     16      2000    6.25 %	36.4 %

Gradient	2000	.05	16	2000	6.25 %	9.3 %
Gradient        2000    .34     16      2000    6.25 %	36.4 %
Gradient        2000    .68     16      2000    6.25 %	42.7 %
Gradient	2000	.95	16	2000	6.25 %	42.7 %

Exponent	2000	.05	16	2000	6.25 %	29.1 %
Exponent        2000    .34     16      2000    6.25 %	32.7 %
Exponent        2000    .68     16      2000    6.25 %	45.1 %
Exponent	2000	.95	16	2000	6.25 %	40.4 %

Sparse Bounded - Here GD continues to outperform EG, suggesting that EG is 
limited by bounding the weights, and this is what we expect from something 
that over-emphasizes high values.

==============================================================
Bouned Longest Survivor on Dense Targets with Bounds
=============================================          

Algorithm       Train   Eta     Itrns   Test    Noise   Error 
---------       -----   ---     -----   ----    -----   ----- 
Gradient        2000    0.05    16      2000    0 %     3.4 %
Exponent        2000    0.05    16      2000    0 %     0.5 %

Gradient        2000    0.05    16      2000    6.25 %	12 %
Exponent        2000    0.05    16      2000    6.25 %	14.1 %

Gradient        2000    0.05    16      2000    12.5 %	21 %
Exponent        2000    0.05    16      2000    12.5 %	22.6 %

Sparse Bounded Longest - This is a questionable approach which yield 
passable results, but the noise interferes with runs of longest survivor 
so we do not recommend it.

==============================================================
Longest Survivor on Sparse Targets with Bounds
==============================================

Algorithm       Train   Eta     Itrns   Test    Noise   Error
---------       -----   ---     -----   ----    -----   -----
Gradient	2000	0.05	16	2000	0 %	0.7 %
Exponent	2000	0.05	16	2000	0 %	15.5 %

Gradient        2000    0.05    16      2000    6.25 %  12.8 %
Exponent        2000    0.05    16      2000    6.25 %	20.4 %

Gradient	2000	0.05	16	2000	12.5 %	18.5 %
Exponent	2000	0.05	16	2000	12.5 %	24.8 %

==============================================================

Dense Bounded Longest - This is comparable to sparse bounded when noisy,
except that in the case of no noise GD has a curious success.

Conclusion
----------

In general, the intuitive hypothesis that gradient descent performs well
on dense targets while exponentiated descent performs well on sparse
targets is born out. To allow these algorithms to perform best, it is
necessary to select optimal eta values, which should be very small for EG
and in cases of large training sets. Again, this is intuitive. Variations
such as bounded weights are not recommended for EG where they can be
self-defeating, and seem unnecessary for GD unless we have some prior
knowledge indicating that our dense target vector should be more
egalitarian than usual. Noise is a problem for both algorithms, probably
because it interferes with linear seperability. EG is particularly
susceptible on noisy dense data while GD performs equally badly in both
environments. An amount of human intuition is useful in tuning these
algorithms with optimal learning rates for their environments.