.. automodule:: Orange.evaluation.scoring

############################
Method scoring (``scoring``)
############################

.. index:: scoring

Scoring plays an integral role in the evaluation of any prediction model. Orange
implements various scores for evaluating classification,
regression and multi-label models. Most of the methods need to be called
with an instance of :obj:`~Orange.evaluation.testing.ExperimentResults`.

.. literalinclude:: code/scoring-example.py


==============
Classification
==============

Calibration scores
==================
Many scores for evaluating classification models measure whether the
model assigns the correct class value to the test instances. Many of these
scores can be computed solely from a confusion matrix, which is constructed manually
with the :obj:`confusion_matrices` function. If the class variable has more than
two values, the index of the value for which to compute the confusion matrix should
be passed as well.

.. autofunction:: CA
.. autofunction:: Sensitivity
.. autofunction:: Specificity
.. autofunction:: PPV
.. autofunction:: NPV
.. autofunction:: Precision
.. autofunction:: Recall
.. autofunction:: F1
.. autofunction:: Falpha
.. autofunction:: MCC
.. autofunction:: AP
.. autofunction:: IS
.. autofunction:: confusion_chi_square

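Most of the scores listed above can be derived from the four cells of a binary
confusion matrix. The following is a minimal sketch in plain Python 3 that does
not use Orange; the function name ``binary_scores`` and the example counts are
hypothetical, and the sketch assumes that no denominator is zero.

```python
from math import sqrt


def binary_scores(tp, fp, fn, tn):
    """Compute common scores from the cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # also called precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "CA": (tp + tn) / (tp + fp + fn + tn),
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "F1": f1,
        "MCC": mcc,
    }


# Hypothetical counts for the chosen target class.
scores = binary_scores(tp=40, fp=10, fn=5, tn=45)
for name, value in sorted(scores.items()):
    print("%-12s %.3f" % (name, value))
```

For a class variable with more than two values, the same computation applies
once a single value is chosen as the positive class and all other values are
treated as negative.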

Discriminatory scores
=====================
Scores that measure how well a prediction model separates instances of
different classes are called discriminatory scores.

.. autofunction:: Brier_score

.. autoclass:: AUC
    :members: by_weighted_pairs, by_pairs,
              weighted_one_against_all, one_against_all, single_class, pair

.. autofunction:: AUCWilcoxon

.. autofunction:: compute_ROC

.. autofunction:: confusion_matrices

.. autoclass:: ConfusionMatrix

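To illustrate what such scores measure, the sketch below computes the area
under the ROC curve by counting correctly ordered (positive, negative) pairs,
and a Brier score for the binary case. It is plain Python 3 and does not use
Orange; the function names and data are hypothetical, and the binary Brier
form shown here is an assumption that may differ from the multi-class form
used by :obj:`Brier_score`.

```python
def auc_by_pair_counting(probabilities, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [p for p, y in zip(probabilities, labels) if y == 1]
    negatives = [p for p, y in zip(probabilities, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def binary_brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and outcome."""
    squared = [(p - y) ** 2 for p, y in zip(probabilities, labels)]
    return sum(squared) / len(squared)


# Hypothetical predicted probabilities of the positive class and true labels.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_pair_counting(probabilities, labels)
brier = binary_brier_score(probabilities, labels)
print("AUC   %.4f" % auc)
print("Brier %.4f" % brier)
```

An AUC of 1 means that every positive instance received a higher probability
than every negative one, while 0.5 corresponds to random ordering.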

Comparison of Algorithms
========================

.. autofunction:: McNemar

.. autofunction:: McNemar_of_two

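McNemar's test compares two classifiers on the same test instances by looking
only at the instances on which exactly one of them is correct. The sketch
below is plain Python 3 and does not use Orange; it uses the common
continuity-corrected form of the statistic, and it is an assumption that the
functions above use the same form. All names and data are hypothetical.

```python
def mcnemar_statistic(actual, predicted_a, predicted_b):
    """Continuity-corrected McNemar statistic for two classifiers."""
    only_a = 0  # instances where only classifier A is correct
    only_b = 0  # instances where only classifier B is correct
    for y, a, b in zip(actual, predicted_a, predicted_b):
        if a == y and b != y:
            only_a += 1
        elif a != y and b == y:
            only_b += 1
    disagreements = only_a + only_b
    if disagreements == 0:
        return 0.0
    return (abs(only_a - only_b) - 1) ** 2 / disagreements


# Hypothetical true classes and predictions of two classifiers.
actual      = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
predicted_b = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

statistic = mcnemar_statistic(actual, predicted_a, predicted_b)
# 3.841 is the chi-square critical value for 1 degree of freedom at 0.05.
print("statistic = %.3f, significant at 0.05: %s"
      % (statistic, statistic > 3.841))
```

The statistic is compared against the chi-square distribution with one degree
of freedom; with so few disagreements, as in this toy data, the difference is
not significant.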

==========
Regression
==========

Several alternative measures, as given below, can be used to evaluate
the success of numeric prediction:

.. image:: files/statRegression.png

.. autofunction:: MSE

.. autofunction:: RMSE

.. autofunction:: MAE

.. autofunction:: RSE

.. autofunction:: RRSE

.. autofunction:: RAE

.. autofunction:: R2


The script :download:`statExamples.py <code/statExamples.py>` uses most of the above measures to
score several regression methods.

It produces the following output::

    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2
    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002
    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526
    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748
    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715

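The relations between these measures can be made explicit with a short sketch
in plain Python 3 that does not use Orange. The function name
``regression_scores`` and the data are hypothetical. The relative measures
divide the model's error by the error of always predicting the mean of the
actual values, so RRSE is the square root of RSE and R2 equals 1 - RSE, which
agrees with the table above.

```python
from math import sqrt


def regression_scores(actual, predicted):
    """Compute absolute and relative error measures for numeric predictions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    squared_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    absolute_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    # Errors of the baseline that always predicts the mean actual value.
    squared_baseline = sum((a - mean_actual) ** 2 for a in actual)
    absolute_baseline = sum(abs(a - mean_actual) for a in actual)
    mse = squared_error / n
    rse = squared_error / squared_baseline
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": absolute_error / n,
        "RSE": rse,
        "RRSE": sqrt(rse),
        "RAE": absolute_error / absolute_baseline,
        "R2": 1 - rse,
    }


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

reg = regression_scores(actual, predicted)
for name in ("MSE", "RMSE", "MAE", "RSE", "RRSE", "RAE", "R2"):
    print("%-5s %.3f" % (name, reg[name]))
```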

==================
Plotting functions
==================

.. autofunction:: graph_ranks

The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph:

.. literalinclude:: code/statExamplesGraphRanks.py

The code produces the following graph:

.. image:: files/statExamplesGraphRanks1.png

.. autofunction:: compute_CD

.. autofunction:: compute_friedman

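The quantities behind a rank graph can be sketched in plain Python 3 without
Orange: average ranks of the methods over several data sets, the Friedman
chi-square statistic, and the Nemenyi critical difference. Whether
:obj:`compute_friedman` and :obj:`compute_CD` use exactly these formulas is an
assumption; the function names and scores below are hypothetical, and the
ranking helper assumes there are no tied scores within a data set.

```python
from math import sqrt

# Nemenyi critical values q_alpha for alpha = 0.05, indexed by the
# number of compared methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def average_ranks(score_table):
    """Average rank of each method (1 = best); rows are data sets."""
    k = len(score_table[0])
    totals = [0.0] * k
    for row in score_table:
        order = sorted(range(k), key=lambda j: -row[j])  # higher is better
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / len(score_table) for t in totals]


def friedman_chi_square(avg_ranks, n_datasets):
    k = len(avg_ranks)
    sum_sq = sum(r ** 2 for r in avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))
            * (sum_sq - k * (k + 1) ** 2 / 4.0))


def nemenyi_cd(k, n_datasets):
    """Smallest difference in average rank that is significant at 0.05."""
    return Q_ALPHA_005[k] * sqrt(k * (k + 1) / (6.0 * n_datasets))


# Hypothetical accuracies of three methods on six data sets.
table = [
    [0.80, 0.75, 0.70],
    [0.90, 0.85, 0.88],
    [0.65, 0.60, 0.55],
    [0.78, 0.80, 0.70],
    [0.92, 0.85, 0.80],
    [0.70, 0.66, 0.68],
]
avg_ranks = average_ranks(table)
chi2 = friedman_chi_square(avg_ranks, len(table))
cd = nemenyi_cd(len(avg_ranks), len(table))
print("average ranks:", ["%.3f" % r for r in avg_ranks])
print("Friedman chi-square: %.3f, critical difference: %.3f" % (chi2, cd))
```

Two methods whose average ranks differ by more than the critical difference
are considered significantly different; in a rank graph, groups of methods
that are not significantly different are connected.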

=================
Utility Functions
=================

.. autofunction:: split_by_iterations


.. _mt-scoring:


============
Multi-target
============

:doc:`Multi-target <Orange.multitarget>` classifiers predict values for
multiple target classes. They can be used with standard
:obj:`~Orange.evaluation.testing` procedures (e.g.
:obj:`~Orange.evaluation.testing.Evaluation.cross_validation`), but require
special scoring functions to compute a single score from the obtained
:obj:`~Orange.evaluation.testing.ExperimentResults`.
Since different targets can vary in importance depending on the experiment,
some methods have options to indicate this, e.g. through weights or customized
distance functions. These can also be used for normalization when target
values are not on the same scale.

.. autofunction:: mt_flattened_score
.. autofunction:: mt_average_score


The whole procedure of evaluating multi-target methods and computing
the scores (RMSE errors) is shown in the following example
(:download:`mt-evaluate.py <code/mt-evaluate.py>`). Because we consider
the first target to be more important and the last one less so, we
indicate this with appropriate weights.

.. literalinclude:: code/mt-evaluate.py

The script outputs::

    Weighted RMSE scores:
    Majority 0.8228
    MTTree 0.3949
    PLS 0.3021
    Earth 0.2880

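One simple way to turn per-target errors into a single weighted score is a
weighted mean of the per-target RMSE values. The sketch below shows this idea
in plain Python 3 without Orange. It only illustrates the role of target
weights; it is an assumption that the multi-target scoring functions above
combine scores in this way. All names and data are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error for a single target."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def weighted_average_rmse(actual_rows, predicted_rows, weights):
    """Weighted mean of per-target RMSE; each row holds one value per target."""
    per_target = []
    for t in range(len(weights)):
        actual = [row[t] for row in actual_rows]
        predicted = [row[t] for row in predicted_rows]
        per_target.append(rmse(actual, predicted))
    weighted_sum = sum(w * s for w, s in zip(weights, per_target))
    return weighted_sum / sum(weights)


# Hypothetical data: three instances with three targets each.
actual_rows = [[1.0, 10.0, 5.0], [2.0, 20.0, 6.0], [3.0, 30.0, 7.0]]
predicted_rows = [[2.0, 12.0, 5.0], [1.0, 18.0, 6.0], [4.0, 32.0, 7.0]]
weights = [2, 1, 1]  # the first target counts twice as much as the others

weighted = weighted_average_rmse(actual_rows, predicted_rows, weights)
print("Weighted RMSE: %.4f" % weighted)
```

Because the second target is on a larger scale than the first, its RMSE is
larger as well; choosing smaller weights for such targets is one way to use
weights for normalization.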

==========================
Multi-label classification
==========================

Multi-label classification requires different metrics than those used in
traditional single-label classification. This module presents several
metrics that have been proposed in the literature. Let :math:`D` be a
multi-label evaluation data set, consisting of :math:`|D|` multi-label examples
:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`, where :math:`L`
is the set of all labels. Let :math:`H`
be a multi-label classifier and :math:`Z_i=H(x_i)` be the set of labels
predicted by :math:`H` for example :math:`x_i`.

.. autofunction:: mlc_hamming_loss
.. autofunction:: mlc_accuracy
.. autofunction:: mlc_precision
.. autofunction:: mlc_recall


The following script demonstrates the use of those evaluation measures:

.. literalinclude:: code/mlc-evaluate.py

The output should look like this::

    loss= [0.9375]
    accuracy= [0.875]
    precision= [1.0]
    recall= [0.875]

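Using the notation introduced at the start of this section (true label sets
:math:`Y_i`, predicted label sets :math:`Z_i`, and the set of all labels
:math:`L`), the four measures can be sketched with Python sets. This is plain
Python 3 and does not use Orange. It follows the set-based definitions
commonly given in the multi-label literature, and it is an assumption that the
functions above match them exactly. The function name ``mlc_scores`` and the
data are hypothetical, and every label set is assumed to be non-empty.

```python
def mlc_scores(true_sets, predicted_sets, all_labels):
    """Example-based multi-label scores averaged over all examples."""
    n = len(true_sets)
    hamming = accuracy = precision = recall = 0.0
    for y, z in zip(true_sets, predicted_sets):
        common = len(y & z)
        hamming += len(y ^ z) / len(all_labels)  # symmetric difference
        accuracy += common / len(y | z)
        precision += common / len(z)
        recall += common / len(y)
    return {
        "hamming_loss": hamming / n,
        "accuracy": accuracy / n,
        "precision": precision / n,
        "recall": recall / n,
    }


# Hypothetical label sets for three examples.
all_labels = {"a", "b", "c", "d"}
true_sets = [{"a", "b"}, {"c"}, {"a", "c", "d"}]
predicted_sets = [{"a"}, {"c", "d"}, {"a", "c", "d"}]

mlc = mlc_scores(true_sets, predicted_sets, all_labels)
for name in ("hamming_loss", "accuracy", "precision", "recall"):
    print("%-13s %.4f" % (name, mlc[name]))
```

Hamming loss is an error, so lower values are better, whereas higher values
are better for accuracy, precision and recall.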

References
==========

Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification',
Pattern Recognition, vol. 37, no. 9, pp. 1757-71.

Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', paper
presented to Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2004).

Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization',
Machine Learning, vol. 39, no. 2/3, pp. 135-68.

