| 1 | .. py:currentmodule:: Orange.feature.discretization |
|---|
| 2 | |
|---|
| 3 | ################################### |
|---|
| 4 | Discretization (``discretization``) |
|---|
| 5 | ################################### |
|---|
| 6 | |
|---|
| 7 | .. index:: discretization |
|---|
| 8 | |
|---|
| 9 | .. index:: |
|---|
| 10 | single: feature; discretization |
|---|
| 11 | |
|---|
| 12 | Continues features can be discretized either one feature at a time, or, as demonstrated in the following script, |
|---|
| 13 | using a single discretization method on entire set of data features: |
|---|
| 14 | |
|---|
| 15 | .. literalinclude:: code/discretization-table.py |
|---|
| 16 | |
|---|
| 17 | Discretization introduces new categorical features and computes their values in accordance to |
|---|
| 18 | selected (or default) discretization method:: |
|---|
| 19 | |
|---|
| 20 | Original data set: |
|---|
| 21 | [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] |
|---|
| 22 | [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] |
|---|
| 23 | [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] |
|---|
| 24 | |
|---|
| 25 | Discretized data set: |
|---|
| 26 | ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] |
|---|
| 27 | ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa'] |
|---|
| 28 | ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] |
|---|
| 29 | |
|---|
| 30 | The following discretization methods are supported: |
|---|
| 31 | |
|---|
| 32 | * equal width discretization, where the domain of continuous feature is split to intervals of the same |
|---|
| 33 | width equal-sized intervals (:class:`EqualWidth`), |
|---|
| 34 | * equal frequency discretization, where each intervals contains equal number of data instances (:class:`EqualFreq`), |
|---|
| 35 | * entropy-based, as originally proposed by [FayyadIrani1993]_ that infers the intervals to minimize |
|---|
| 36 | within-interval entropy of class distributions (:class:`Entropy`), |
|---|
| 37 | * bi-modal, using three intervals to optimize the difference of the class distribution in |
|---|
| 38 | the middle with the distribution outside it (:class:`BiModal`), |
|---|
| 39 | * fixed, with the user-defined cut-off points. |
|---|
| 40 | |
|---|
| 41 | The above script used the default discretization method (equal frequency with three intervals). This can be changed |
|---|
| 42 | as demonstrated below: |
|---|
| 43 | |
|---|
| 44 | .. literalinclude:: code/discretization-table-method.py |
|---|
| 45 | :lines: 3-5 |
|---|
| 46 | |
|---|
| 47 | With exception to fixed discretization, discretization approaches infer the cut-off points from the |
|---|
| 48 | training data set and thus construct a discretizer to convert continuous values of this feature into categorical |
|---|
| 49 | value according to the rule found by discretization. In this respect, the discretization behaves similar to |
|---|
| 50 | :class:`Orange.classification.Learner`. |
|---|
| 51 | |
|---|
| 52 | Discretization Algorithms |
|---|
| 53 | ========================= |
|---|
| 54 | |
|---|
| 55 | Instances of discretization classes are all derived from :class:`Discretization`. |
|---|
| 56 | |
|---|
| 57 | .. class:: Discretization |
|---|
| 58 | |
|---|
| 59 | .. method:: __call__(feature, data[, weightID]) |
|---|
| 60 | |
|---|
| 61 | Given a continuous ``feature``, ``data`` and, optionally id of |
|---|
| 62 | attribute with example weight, this function returns a discretized |
|---|
| 63 | feature. Argument ``feature`` can be a descriptor, index or |
|---|
| 64 | name of the attribute. |
|---|
| 65 | |
|---|
| 66 | |
|---|
| 67 | .. class:: EqualWidth |
|---|
| 68 | |
|---|
| 69 | Discretizes the feature by spliting its domain to a fixed number |
|---|
| 70 | of equal-width intervals. The span of original domain is computed |
|---|
| 71 | from the training data and is defined by the smallest and the |
|---|
| 72 | largest feature value. |
|---|
| 73 | |
|---|
| 74 | .. attribute:: n |
|---|
| 75 | |
|---|
| 76 | Number of discretization intervals (default: 4). |
|---|
| 77 | |
|---|
| 78 | The following example discretizes Iris dataset features using six |
|---|
| 79 | intervals. The script constructs a :class:`Orange.data.Table` with discretized |
|---|
| 80 | features and outputs their description: |
|---|
| 81 | |
|---|
| 82 | .. literalinclude:: code/discretization.py |
|---|
| 83 | :lines: 38-43 |
|---|
| 84 | |
|---|
| 85 | The output of this script is:: |
|---|
| 86 | |
|---|
| 87 | D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> |
|---|
| 88 | D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> |
|---|
| 89 | D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> |
|---|
| 90 | D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> |
|---|
| 91 | |
|---|
| 92 | The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``:: |
|---|
| 93 | |
|---|
| 94 | >>> for attr in newattrs: |
|---|
| 95 | ... print "%s: first interval at %5.3f, step %5.3f" % \ |
|---|
| 96 | ... (attr.name, attr.get_value_from.transformer.first_cut, \ |
|---|
| 97 | ... attr.get_value_from.transformer.step) |
|---|
| 98 | D_sepal length: first interval at 4.900, step 0.600 |
|---|
| 99 | D_sepal width: first interval at 2.400, step 0.400 |
|---|
| 100 | D_petal length: first interval at 1.980, step 0.980 |
|---|
| 101 | D_petal width: first interval at 0.500, step 0.400 |
|---|
| 102 | |
|---|
| 103 | All discretizers have the method |
|---|
| 104 | ``construct_variable``: |
|---|
| 105 | |
|---|
| 106 | .. literalinclude:: code/discretization.py |
|---|
| 107 | :lines: 69-73 |
|---|
| 108 | |
|---|
| 109 | |
|---|
| 110 | .. class:: EqualFreq |
|---|
| 111 | |
|---|
| 112 | Infers the cut-off points so that the discretization intervals contain |
|---|
| 113 | approximately equal number of training data instances. |
|---|
| 114 | |
|---|
| 115 | .. attribute:: n |
|---|
| 116 | |
|---|
| 117 | Number of discretization intervals (default: 4). |
|---|
| 118 | |
|---|
| 119 | The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points`` |
|---|
| 120 | that store the inferred cut-offs. |
|---|
| 121 | |
|---|
| 122 | .. class:: Entropy |
|---|
| 123 | |
|---|
| 124 | Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most |
|---|
| 125 | appropriate number of intervals by recursively splitting the domain of continuous feature to minimize the |
|---|
| 126 | class-entropy of training examples. The splitting is repeated until the entropy decrease is smaller than the |
|---|
| 127 | increase of minimal descripton length (MDL) induced by the new cut-off point. |
|---|
| 128 | |
|---|
| 129 | Entropy-based discretization can reduce a continuous feature into |
|---|
| 130 | a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be |
|---|
| 131 | removed. This discretization can |
|---|
| 132 | therefore also serve for identification of non-informative features and thus used for feature subset selection. |
|---|
| 133 | |
|---|
| 134 | .. attribute:: force_attribute |
|---|
| 135 | |
|---|
| 136 | Forces the algorithm to induce at least one cut-off point, even when |
|---|
| 137 | its information gain is lower than MDL (default: ``False``). |
|---|
| 138 | |
|---|
| 139 | Part of :download:`discretization.py <code/discretization.py>`: |
|---|
| 140 | |
|---|
| 141 | .. literalinclude:: code/discretization.py |
|---|
| 142 | :lines: 77-80 |
|---|
| 143 | |
|---|
| 144 | The output shows that all attributes are discretized onto three intervals:: |
|---|
| 145 | |
|---|
| 146 | sepal length: <5.5, 6.09999990463> |
|---|
| 147 | sepal width: <2.90000009537, 3.29999995232> |
|---|
| 148 | petal length: <1.89999997616, 4.69999980927> |
|---|
| 149 | petal width: <0.600000023842, 1.0000004768> |
|---|
| 150 | |
|---|
| 151 | .. class:: BiModal |
|---|
| 152 | |
|---|
| 153 | Infers two cut-off points to optimize the difference of class distribution of data instances in the |
|---|
| 154 | middle and in the other two intervals. The |
|---|
| 155 | difference is scored by chi-square statistics. All possible cut-off |
|---|
| 156 | points are examined, thus the discretization runs in O(n^2). This discretization method is especially suitable |
|---|
| 157 | for the attributes in |
|---|
| 158 | which the middle region corresponds to normal and the outer regions to |
|---|
| 159 | abnormal values of the feature. |
|---|
| 160 | |
|---|
| 161 | .. attribute:: split_in_two |
|---|
| 162 | |
|---|
| 163 | Decides whether the resulting attribute should have three or two values. |
|---|
| 164 | If ``True`` (default), the feature will be discretized to three |
|---|
| 165 | intervals and the discretizer is of type :class:`BiModalDiscretizer`. |
|---|
| 166 | If ``False`` the result is the ordinary :class:`IntervalDiscretizer`. |
|---|
| 167 | |
|---|
| 168 | Iris dataset has three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that |
|---|
| 169 | sepal lenghts of versicolors are between lengths of setosas and virginicas. |
|---|
| 170 | |
|---|
| 171 | .. image:: files/bayes-iris.gif |
|---|
| 172 | |
|---|
| 173 | If we merge classes setosa and virginica, we can observe if |
|---|
| 174 | the bi-modal discretization would correctly recognize the interval in |
|---|
| 175 | which versicolors dominate. The following scripts peforms the merging and construction of new data set with class |
|---|
| 176 | that reports if iris is versicolor or not. |
|---|
| 177 | |
|---|
| 178 | .. literalinclude:: code/discretization.py |
|---|
| 179 | :lines: 84-87 |
|---|
| 180 | |
|---|
| 181 | The following script implements the discretization: |
|---|
| 182 | |
|---|
| 183 | .. literalinclude:: code/discretization.py |
|---|
| 184 | :lines: 97-100 |
|---|
| 185 | |
|---|
| 186 | The middle intervals are printed:: |
|---|
| 187 | |
|---|
| 188 | sepal length: (5.400, 6.200] |
|---|
| 189 | sepal width: (2.000, 2.900] |
|---|
| 190 | petal length: (1.900, 4.700] |
|---|
| 191 | petal width: (0.600, 1.600] |
|---|
| 192 | |
|---|
| 193 | Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense. |
|---|
| 194 | |
|---|
| 195 | Discretizers |
|---|
| 196 | ============ |
|---|
| 197 | |
|---|
| 198 | Discretizers construct a categorical feature from the continuous feature according to the method they implement and |
|---|
| 199 | its parameters. The most general is |
|---|
| 200 | :class:`IntervalDiscretizer` that is also used by most discretization |
|---|
| 201 | methods. Two other discretizers, :class:`EquiDistDiscretizer` and |
|---|
| 202 | :class:`ThresholdDiscretizer`> could easily be replaced by |
|---|
| 203 | :class:`IntervalDiscretizer` but are used for speed and simplicity. |
|---|
| 204 | The fourth discretizer, :class:`BiModalDiscretizer` is specialized |
|---|
| 205 | for discretizations induced by :class:`BiModalDiscretization`. |
|---|
| 206 | |
|---|
| 207 | .. class:: Discretizer |
|---|
| 208 | |
|---|
| 209 | A superclass implementing the construction of a new |
|---|
| 210 | attribute from an existing one. |
|---|
| 211 | |
|---|
| 212 | .. method:: construct_variable(feature) |
|---|
| 213 | |
|---|
| 214 | Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name`` |
|---|
| 215 | prefixed by "D\_". Its symbolic values are discretizer specific. |
|---|
| 216 | |
|---|
| 217 | .. class:: IntervalDiscretizer |
|---|
| 218 | |
|---|
| 219 | Discretizer defined with a set of cut-off points. |
|---|
| 220 | |
|---|
| 221 | .. attribute:: points |
|---|
| 222 | |
|---|
| 223 | The cut-off points; feature values below or equal to the first point will be mapped to the first interval, |
|---|
| 224 | those between the first and the second point |
|---|
| 225 | (including those equal to the second) are mapped to the second interval and |
|---|
| 226 | so forth to the last interval which covers all values greater than |
|---|
| 227 | the last value in ``points``. The number of intervals is thus |
|---|
| 228 | ``len(points)+1``. |
|---|
| 229 | |
|---|
| 230 | The script that follows is an examples of a manual construction of a discretizer with cut-off points |
|---|
| 231 | at 3.0 and 5.0: |
|---|
| 232 | |
|---|
| 233 | .. literalinclude:: code/discretization.py |
|---|
| 234 | :lines: 22-26 |
|---|
| 235 | |
|---|
| 236 | First five data instances of ``data2`` are:: |
|---|
| 237 | |
|---|
| 238 | [5.1, '>5.00', 'Iris-setosa'] |
|---|
| 239 | [4.9, '(3.00, 5.00]', 'Iris-setosa'] |
|---|
| 240 | [4.7, '(3.00, 5.00]', 'Iris-setosa'] |
|---|
| 241 | [4.6, '(3.00, 5.00]', 'Iris-setosa'] |
|---|
| 242 | [5.0, '(3.00, 5.00]', 'Iris-setosa'] |
|---|
| 243 | |
|---|
| 244 | The same discretizer can be used on several features by calling the function construct_var: |
|---|
| 245 | |
|---|
| 246 | .. literalinclude:: code/discretization.py |
|---|
| 247 | :lines: 30-34 |
|---|
| 248 | |
|---|
| 249 | Each feature has its own instance of :class:`ClassifierFromVar` stored in |
|---|
| 250 | ``get_value_from``, but all use the same :class:`IntervalDiscretizer`, |
|---|
| 251 | ``idisc``. Changing any element of its ``points`` affect all attributes. |
|---|
| 252 | |
|---|
| 253 | .. note:: |
|---|
| 254 | |
|---|
| 255 | The length of :obj:`~IntervalDiscretizer.points` should not be changed if the |
|---|
| 256 | discretizer is used by any attribute. The length of |
|---|
| 257 | :obj:`~IntervalDiscretizer.points` should always match the number of values |
|---|
| 258 | of the feature, which is determined by the length of the attribute's field |
|---|
| 259 | ``values``. If ``attr`` is a discretized attribute, than ``len(attr.values)`` must equal |
|---|
| 260 | ``len(attr.get_value_from.transformer.points)+1``. |
|---|
| 261 | |
|---|
| 262 | |
|---|
| 263 | .. class:: EqualWidthDiscretizer |
|---|
| 264 | |
|---|
| 265 | Discretizes to intervals of the fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first |
|---|
| 266 | interval. Otherwise, value ``val``'s interval is ``floor((val-first_cut)/step)``. Possible overflows are mapped to the |
|---|
| 267 | last intervals. |
|---|
| 268 | |
|---|
| 269 | |
|---|
| 270 | .. attribute:: first_cut |
|---|
| 271 | |
|---|
| 272 | The first cut-off point. |
|---|
| 273 | |
|---|
| 274 | .. attribute:: step |
|---|
| 275 | |
|---|
| 276 | Width of the intervals. |
|---|
| 277 | |
|---|
| 278 | .. attribute:: n |
|---|
| 279 | |
|---|
| 280 | Number of the intervals. |
|---|
| 281 | |
|---|
| 282 | .. attribute:: points (read-only) |
|---|
| 283 | |
|---|
| 284 | The cut-off points; this is not a real attribute although it behaves |
|---|
| 285 | as one. Reading it constructs a list of cut-off points and returns it, |
|---|
| 286 | but changing the list doesn't affect the discretizer. Only present to provide |
|---|
| 287 | the :obj:`EquiDistDiscretizer` the same interface as that of |
|---|
| 288 | :obj:`IntervalDiscretizer`. |
|---|
| 289 | |
|---|
| 290 | |
|---|
| 291 | .. class:: ThresholdDiscretizer |
|---|
| 292 | |
|---|
| 293 | Threshold discretizer converts continuous values into binary by comparing |
|---|
| 294 | them to a fixed threshold. Orange uses this discretizer for |
|---|
| 295 | binarization of continuous attributes in decision trees. |
|---|
| 296 | |
|---|
| 297 | .. attribute:: threshold |
|---|
| 298 | |
|---|
| 299 | The value threshold; values below or equal to the threshold belong to the first |
|---|
| 300 | interval and those that are greater go to the second. |
|---|
| 301 | |
|---|
| 302 | |
|---|
| 303 | .. class:: BiModalDiscretizer |
|---|
| 304 | |
|---|
| 305 | Bimodal discretizer has two cut off points and values are |
|---|
| 306 | discretized according to whether or not they belong to the region between these points |
|---|
| 307 | which includes the lower but not the upper boundary. The |
|---|
| 308 | discretizer is returned by :class:`BiModalDiscretization` if its |
|---|
| 309 | field :obj:`~BiModalDiscretization.split_in_two` is true (the default). |
|---|
| 310 | |
|---|
| 311 | .. attribute:: low |
|---|
| 312 | |
|---|
| 313 | Lower boundary of the interval (included in the interval). |
|---|
| 314 | |
|---|
| 315 | .. attribute:: high |
|---|
| 316 | |
|---|
| 317 | Upper boundary of the interval (not included in the interval). |
|---|
| 318 | |
|---|
| 319 | |
|---|
| 320 | Implementational details |
|---|
| 321 | ======================== |
|---|
| 322 | |
|---|
| 323 | Consider a following example (part of :download:`discretization.py <code/discretization.py>`): |
|---|
| 324 | |
|---|
| 325 | .. literalinclude:: code/discretization.py |
|---|
| 326 | :lines: 7-15 |
|---|
| 327 | |
|---|
| 328 | The discretized attribute ``sep_w`` is constructed with a call to |
|---|
| 329 | :class:`Entropy`; instead of constructing it and calling |
|---|
| 330 | it afterwards, we passed the arguments for calling to the constructor. We then constructed a new |
|---|
| 331 | :class:`Orange.data.Table` with attributes "sepal width" (the original |
|---|
| 332 | continuous attribute), ``sep_w`` and the class attribute:: |
|---|
| 333 | |
|---|
| 334 | Entropy discretization, first 5 data instances |
|---|
| 335 | [3.5, '>3.30', 'Iris-setosa'] |
|---|
| 336 | [3.0, '(2.90, 3.30]', 'Iris-setosa'] |
|---|
| 337 | [3.2, '(2.90, 3.30]', 'Iris-setosa'] |
|---|
| 338 | [3.1, '(2.90, 3.30]', 'Iris-setosa'] |
|---|
| 339 | [3.6, '>3.30', 'Iris-setosa'] |
|---|
| 340 | |
|---|
| 341 | The name of the new categorical variable derives from the name of original |
|---|
| 342 | continuous variable by adding a prefix ``D_``. The values of the new attributes |
|---|
| 343 | are computed automatically when they are needed using a transformation |
|---|
| 344 | function :obj:`~Orange.data.variable.Variable.get_value_from` |
|---|
| 345 | (see :class:`Orange.data.variable.Variable`) which encodes the discretization:: |
|---|
| 346 | |
|---|
| 347 | >>> sep_w |
|---|
| 348 | EnumVariable 'D_sepal width' |
|---|
| 349 | >>> sep_w.get_value_from |
|---|
| 350 | <ClassifierFromVar instance at 0x01BA7DC0> |
|---|
| 351 | >>> sep_w.get_value_from.whichVar |
|---|
| 352 | FloatVariable 'sepal width' |
|---|
| 353 | >>> sep_w.get_value_from.transformer |
|---|
| 354 | <IntervalDiscretizer instance at 0x01BA2100> |
|---|
| 355 | >>> sep_w.get_value_from.transformer.points |
|---|
| 356 | <2.90000009537, 3.29999995232> |
|---|
| 357 | |
|---|
| 358 | The ``select`` statement in the discretization script converted all data instances |
|---|
| 359 | from ``data`` to the new domain. This includes a new feature |
|---|
| 360 | ``sep_w`` whose values are computed on the fly by calling ``sep_w.get_value_from`` for each data instance. |
|---|
| 361 | The original, continuous sepal width |
|---|
| 362 | is passed to the ``transformer`` that determines the interval by its field |
|---|
| 363 | ``points``. Transformer returns the discrete value which is in turn returned |
|---|
| 364 | by ``get_value_from`` and stored in the new example. |
|---|
| 365 | |
|---|
| 366 | References |
|---|
| 367 | ========== |
|---|
| 368 | |
|---|
| 369 | .. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued |
|---|
| 370 | attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages |
|---|
| 371 | 1022--1029, Chambery, France, 1993. |
|---|