.. _Test Learners:

Test Learners
=============

.. image:: ../icons/TestLearners.png

Tests learning algorithms on data.

Signals
-------

Inputs:
   - Data (ExampleTable)
      Data for training and, unless a separate test data set is used, testing
   - Separate Test Data (ExampleTable)
      Separate data for testing
   - Learner (orange.Learner)
      One or more learning algorithms

Outputs:
   - Evaluation results (orngTest.ExperimentResults)
      Results of testing the algorithms


Description
-----------

The widget tests learning algorithms on data. Different sampling schemes are available, including the use of a separate test data set. The widget does two things. First, it shows a table with various performance measures of the classifiers, such as classification accuracy and area under ROC. Second, it outputs a signal with data that can be used by other widgets for analyzing the performance of classifiers, such as `ROC Analysis <ROCAnalysis.htm>`_ or `Confusion Matrix <ConfusionMatrix.htm>`_.

The signal Learner has the rather uncommon property that it can be connected to more than one widget, which provides multiple learners to be tested with the same procedure. If the evaluation results are fed into further widgets, such as the one for ROC analysis, the learning algorithms are analyzed together.
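
For illustration, here is a minimal scripting sketch of the same kind of evaluation, assuming the Orange 2.x modules :obj:`orange`, :obj:`orngTest` and :obj:`orngStat` are available and using the bundled ``voting`` data set; both learners are evaluated together by a single procedure, just as when several Learner signals are connected to the widget::

   import orange, orngTest, orngStat

   data = orange.ExampleTable("voting")   # any two-class data set will do

   bayes = orange.BayesLearner()
   bayes.name = "naive bayes"
   knn = orange.kNNLearner()
   knn.name = "kNN"

   # ten-fold cross-validation of both learners at once
   results = orngTest.crossValidation([bayes, knn], data, folds=10)
   print "CA: ", orngStat.CA(results)    # one accuracy per learner
   print "AUC:", orngStat.AUC(results)   # one AUC per learner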

.. image:: images/TestLearners.png

The widget supports various sampling methods. :obj:`Cross-validation` splits the data into a given number of folds (usually 5 or 10). The algorithm is tested by holding out the examples from one fold at a time; the model is induced from the other folds, and the examples from the held-out fold are classified. :obj:`Leave-one-out` is similar, but it holds out one example at a time, inducing the model from all the others and then classifying the held-out example. This method is obviously very stable and reliable ... and very slow. :obj:`Random sampling` randomly splits the data into training and testing sets in a given proportion (e.g. 70:30); the whole procedure is repeated for a specified number of times. :obj:`Test on train data` uses the whole data set for both training and testing. This method practically always gives overly optimistic results.
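
The other sampling methods have scripting counterparts in :obj:`orngTest` (a sketch continuing the one above, under the same assumptions and with the same ``bayes``, ``knn`` and ``data``)::

   # leave-one-out: hold out a single example at a time
   res_loo = orngTest.leaveOneOut([bayes, knn], data)

   # random sampling: 70:30 split, repeated ten times
   res_rand = orngTest.proportionTest([bayes, knn], data, 0.7, times=10)

   # test on train data: induce and evaluate on the same examples
   res_train = orngTest.learnAndTestOnLearnData([bayes, knn], data)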

The above methods use only the data from the signal Data. To test on another data set (for instance, data from another file or data selected in another widget), we connect it to the input signal Separate Test Data and select :obj:`Test on test data`.
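
In a script, this corresponds to :obj:`orngTest.learnAndTestOnTestData` (continuing the sketch above; the test file name here is hypothetical)::

   test_data = orange.ExampleTable("voting-test")   # hypothetical file
   res_test = orngTest.learnAndTestOnTestData([bayes, knn], data, test_data)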

Any changes in the above settings are applied immediately if :obj:`Applied on any change` is checked. If not, the user has to press :obj:`Apply` for the changes to take effect.

The widget can compute a number of performance statistics; a scripting sketch that computes them is shown after the list.

   - :obj:`Classification accuracy` is the proportion of correctly classified examples
   - :obj:`Sensitivity` (also called true positive rate (TPR), hit rate and recall) is the proportion of detected positive examples among all positive examples, e.g. the proportion of sick people correctly diagnosed as sick
   - :obj:`Specificity` is the proportion of detected negative examples among all negative examples, e.g. the proportion of healthy people correctly recognized as healthy
   - :obj:`Area under ROC` is the area under the receiver operating characteristic curve
   - :obj:`Information score` is the average amount of information per classified instance, as defined by Kononenko and Bratko
   - :obj:`F-measure` is the harmonic mean of precision and recall (see below), 2*precision*recall/(precision+recall)
   - :obj:`Precision` is the proportion of true positives among all examples classified as positive, e.g. the proportion of sick among all diagnosed as sick, or the proportion of relevant documents among all retrieved documents
   - :obj:`Recall` is the same measure as sensitivity, except that the latter term is more common in medicine, while recall comes from text mining, where it means the proportion of relevant documents that are retrieved
   - :obj:`Brier score` measures the accuracy of probability assessments, i.e. the average deviation between the predicted probabilities of events and the actual outcomes
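
These statistics can also be computed from evaluation results in a script (again a sketch under the same Orange 2.x assumptions; each :obj:`orngStat` function returns one score per learner, and the threshold-based measures are derived from confusion matrices for the chosen target class)::

   results = orngTest.crossValidation([bayes, knn], data)
   print "CA:   ", orngStat.CA(results)
   print "AUC:  ", orngStat.AUC(results)
   print "IS:   ", orngStat.IS(results)          # information score
   print "Brier:", orngStat.BrierScore(results)

   # sensitivity, specificity, precision, recall and F-measure are
   # computed from a confusion matrix for the target class
   cms = orngStat.confusionMatrices(results, classIndex=1)
   print "sens:", [orngStat.sens(cm) for cm in cms]
   print "spec:", [orngStat.spec(cm) for cm in cms]
   print "prec:", [orngStat.precision(cm) for cm in cms]
   print "recl:", [orngStat.recall(cm) for cm in cms]
   print "F1:  ", [orngStat.F1(cm) for cm in cms]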

More comprehensive descriptions of the measures can be found at `http://en.wikipedia.org/wiki/Receiver_operating_characteristic <http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_ (from classification accuracy to area under ROC),
`http://www.springerlink.com/content/j21p620rw33xw773/ <http://www.springerlink.com/content/j21p620rw33xw773/>`_ (information score), `http://en.wikipedia.org/wiki/F-measure#Performance_measures <http://en.wikipedia.org/wiki/F-measure#Performance_measures>`_
(from F-measure to recall) and `http://en.wikipedia.org/wiki/Brier_score <http://en.wikipedia.org/wiki/Brier_score>`_ (Brier score).

Most measures require a target class, e.g. having the disease or being relevant. The target class can be selected at the bottom of the widget.

Example
-------

In a typical use of the widget, we give it a data set and a few learning algorithms, and we observe their performance in the table inside the Test Learners widget and in the ROC and Lift Curve widgets attached to it. The data is often preprocessed before testing; in this case we discretized it and did some manual feature selection. Note that this is done outside the cross-validation loop, so the testing results may be overly optimistic.

.. image:: images/TestLearners-Schema.png

Another example of using this widget is given in the documentation for the widget `Confusion Matrix <ConfusionMatrix.htm>`_.