Timestamp: 02/07/12 20:18:43
Author: blaz <blaz.zupan@…>
Branch: default
Children: 9965:efd4185eeba1, 9973:1c9b252d3ff5, 9986:c6ae4843d946
Parents: 9962:bb2a2bb16104, 9963:7598327495d7
rebase_source: df3c6023ab8e6bc3dc87629444dd733b622c37df
Message: Merge.
Note: this is a merge changeset; the changes displayed below correspond to the merge itself.
Files: 2 edited

  • docs/reference/rst/Orange.feature.discretization.rst
   single: feature; discretization

Continuous features can be discretized either one feature at a time or, as demonstrated in the following script,
by applying a single discretization method to the entire set of data features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values in accordance with the
selected (or default) discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

The following discretization methods are supported:

* equal width discretization, where the domain of the continuous feature is split into a fixed number of
  equal-width intervals (:class:`EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (:class:`EqualFreq`),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals so as to
  minimize the within-interval entropy of class distributions (:class:`Entropy`),
* bi-modal discretization, which uses three intervals and optimizes the difference between the class distribution in
  the middle interval and the distribution outside it (:class:`BiModal`),
* fixed discretization, with user-defined cut-off points.

The above script used the default discretization method (equal frequency with three intervals). This can be changed
as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

With the exception of fixed discretization, discretization approaches infer the cut-off points from the
training data set and thus construct a discretizer that converts continuous values of the feature into categorical
values according to the inferred rule. In this respect, discretization behaves similarly to
:class:`Orange.classification.Learner`.

Discretization Algorithms
=========================

Instances of discretization classes are all derived from :class:`Discretization`.

.. class:: Discretization

    .. method:: __call__(feature, data[, weightID])

        Given a continuous ``feature``, ``data`` and, optionally, the id of an
        attribute with example weights, this function returns a discretized
        feature. The argument ``feature`` can be a descriptor, index or
        name of the attribute.


.. class:: EqualWidth

    Discretizes the feature by splitting its domain into a fixed number
    of equal-width intervals. The span of the original domain is computed
    from the training data and is defined by the smallest and the
    largest feature value.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The following example discretizes the features of the Iris dataset using six
intervals. The script constructs an :class:`Orange.data.Table` with discretized
features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

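The ``first_cut`` and ``step`` printed above follow directly from the span of the training data. A minimal plain-Python sketch of this computation (independent of Orange; ``equal_width_cuts`` is an illustrative name, not part of the API):

```python
def equal_width_cuts(values, n=4):
    """Interior cut-off points of an equal-width split into n intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    return [lo + step * i for i in range(1, n)]

# Sepal length in Iris spans 4.3-7.9, so six intervals give
# step 0.6 and a first cut at 4.9, as in the output above.
cuts = equal_width_cuts([4.3, 7.9], n=6)
# -> approximately [4.9, 5.5, 6.1, 6.7, 7.3]
```

Note that only the extremes of the training values matter here; the distribution between them has no effect on the cut-offs.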
All discretizers have the method
``construct_variable``:

.. literalinclude:: code/discretization.py
    :lines: 69-73

.. class:: EqualFreq

    Infers the cut-off points so that the discretization intervals contain
    approximately equal numbers of training data instances.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes the ``points``
that store the inferred cut-offs.

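A rough plain-Python sketch of how such quantile-style cut-offs can be derived (illustrative only; Orange's handling of ties and uneven splits may differ):

```python
def equal_freq_cuts(values, n=4):
    """Cut-off points that put roughly len(values)/n instances
    into each of the n intervals."""
    ordered = sorted(values)
    size = len(ordered) / float(n)
    # Cut halfway between the last instance of one block and the
    # first instance of the next.
    return [(ordered[int(size * i) - 1] + ordered[int(size * i)]) / 2.0
            for i in range(1, n)]

equal_freq_cuts([1, 2, 3, 4, 5, 6, 7, 8], n=4)   # -> [2.5, 4.5, 6.5]
```

Unlike equal width discretization, the cut-offs here depend on the whole distribution of values, not only on its extremes.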
.. class:: Entropy

    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most
    appropriate number of intervals by recursively splitting the domain of the continuous feature so as to minimize
    the class entropy of the training examples. The splitting stops when the entropy decrease is smaller than the
    increase in minimal description length (MDL) induced by the new cut-off point.

    Entropy-based discretization can reduce a continuous feature to
    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be
    removed. This discretization can
    therefore also serve to identify non-informative features and can thus be used for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than the MDL cost (default: ``False``).

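The recursion and the MDL stopping rule can be sketched in plain Python (a simplified, unweighted rendering of the Fayyad-Irani criterion; this is not Orange's actual implementation, and all names are illustrative):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = float(len(labels))
    return -sum(c / n * log(c / n, 2) for c in Counter(labels).values())

def mdl_split(pairs):
    """pairs: (value, class) tuples sorted by value. Returns the accepted
    cut-offs, splitting recursively while the information gain of the
    best cut exceeds its MDL coding cost."""
    labels = [c for _, c in pairs]
    n, base = len(pairs), entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # cannot cut between equal values
        e1, e2 = entropy(labels[:i]), entropy(labels[i:])
        gain = base - (i * e1 + (n - i) * e2) / n
        if best is None or gain > best[0]:
            best = (gain, i, e1, e2)
    if best is None:
        return []
    gain, i, e1, e2 = best
    k = len(set(labels))
    k1, k2 = len(set(labels[:i])), len(set(labels[i:]))
    delta = log(3 ** k - 2, 2) - (k * base - k1 * e1 - k2 * e2)
    if gain <= (log(n - 1, 2) + delta) / n:   # MDL stopping criterion
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
    return mdl_split(pairs[:i]) + [cut] + mdl_split(pairs[i:])
```

On two well-separated classes, e.g. values 1-3 of class ``a`` and 10-12 of class ``b``, the sketch accepts a single cut at 6.5; on class-uncorrelated values it accepts none, which corresponds to the single-interval case described above.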
Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModal

    Infers two cut-off points that optimize the difference between the class distribution of the data instances in the
    middle interval and the distribution in the other two intervals. The
    difference is scored by the chi-square statistic. All possible pairs of cut-off
    points are examined, so the discretization runs in O(n^2). This discretization method is especially suitable
    for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the feature.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two values.
        If ``True`` (default), the feature is discretized into three
        intervals and the discretizer is of type :class:`BiModalDiscretizer`.
        If ``False``, the result is an ordinary :class:`IntervalDiscretizer`.

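The exhaustive O(n^2) search can be sketched in plain Python. This is an illustrative rendering, not Orange code: it scores the middle region ``(low, high]`` (the convention of the printed middle intervals below) against the two outer regions with the raw chi-square statistic.

```python
from collections import Counter

def bimodal_cuts(pairs):
    """pairs: (value, class) tuples. Returns the (low, high) cut-off pair
    whose middle region (low, high] differs most from the two outer
    regions, scored by the chi-square statistic."""
    values = sorted(set(v for v, _ in pairs))
    classes = sorted(set(c for _, c in pairs))
    total = len(pairs)
    best, best_score = None, -1.0
    for i in range(len(values) - 1):
        for j in range(i + 1, len(values)):
            low, high = values[i], values[j]
            mid = Counter(c for v, c in pairs if low < v <= high)
            out = Counter(c for v, c in pairs if not (low < v <= high))
            n_mid, n_out = sum(mid.values()), sum(out.values())
            if n_mid == 0 or n_out == 0:
                continue
            score = 0.0
            for c in classes:
                row_total = mid[c] + out[c]
                for observed, col_total in ((mid[c], n_mid), (out[c], n_out)):
                    expected = row_total * col_total / float(total)
                    if expected:
                        score += (observed - expected) ** 2 / expected
            if score > best_score:
                best, best_score = (low, high), score
    return best
```

For instance, if values 4-6 belong to class ``v`` and values 1-3 and 7-9 to class ``x``, the sketch picks the middle region (3, 6], which perfectly isolates the "normal" class.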
The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows
that the sepal lengths of versicolors lie between those of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can check whether
bi-modal discretization correctly recognizes the interval in
which versicolors dominate. The following script performs the merging and constructs a new data set with a class
that reports whether an iris is versicolor or not:

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.

Discretizers
============

Discretizers construct a categorical feature from a continuous feature according to the method they implement and
its parameters. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EquiDistDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModalDiscretization`.

.. class:: Discretizer

    A superclass implementing the construction of a new
    attribute from an existing one.

    .. method:: construct_variable(feature)

        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name``
        prefixed by "D\_". Its symbolic values are discretizer specific.

.. class:: IntervalDiscretizer

    A discretizer defined by a set of cut-off points.

    .. attribute:: points

        The cut-off points; feature values below or equal to the first point are mapped to the first interval,
        those between the first and the second point
        (including those equal to the second) are mapped to the second interval, and
        so forth up to the last interval, which covers all values greater than
        the last value in ``points``. The number of intervals is thus
        ``len(points)+1``.

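This mapping rule amounts to a binary search over ``points``; a plain-Python equivalent (illustrative, not the Orange implementation):

```python
from bisect import bisect_left

def interval_index(points, value):
    """Index of the interval a value falls into: a value equal to a
    cut-off point belongs to the interval below that point."""
    return bisect_left(points, value)

points = [3.0, 5.0]                 # len(points) + 1 == 3 intervals
[interval_index(points, v) for v in (2.5, 3.0, 4.2, 5.0, 7.7)]
# -> [0, 0, 1, 1, 2]
```

``bisect_left`` implements exactly the "below or equal goes left" convention described above.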
The script that follows is an example of manual construction of a discretizer with cut-off points
at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of ``data2`` are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

The same discretizer can be used on several features by calling the function ``construct_var``:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in
``get_value_from``, but all use the same :class:`IntervalDiscretizer`,
``idisc``. Changing any element of its ``points`` affects all attributes.

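The effect of sharing one mutable list of cut-offs between several mappers can be illustrated in plain Python (a stand-in for the shared :class:`IntervalDiscretizer`, not Orange code):

```python
from bisect import bisect_left

points = [3.0, 5.0]                       # one shared list of cut-offs
mappers = [lambda v, p=points: bisect_left(p, v) for _ in range(4)]

print([m(4.0) for m in mappers])          # [1, 1, 1, 1]
points[0] = 4.5                           # mutate the shared list ...
print([m(4.0) for m in mappers])          # [0, 0, 0, 0] -- all change
```

Because every mapper holds a reference to the same list rather than a copy, a single in-place change redefines the intervals for all of them, which is precisely why the note below warns against changing the length of ``points``.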
.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed while the
    discretizer is used by any attribute, since it determines the number of values
    of the feature, given by the length of the attribute's field
    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal
    ``len(attr.get_value_from.transformer.points)+1``.

.. class:: EqualWidthDiscretizer

    Discretizes to intervals of fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first
    interval. Otherwise, the interval of a value ``val`` is determined by ``floor((val-first_cut)/step)``, with overflows mapped to the
    last interval.

    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: n

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute, although it behaves
        like one. Reading it constructs a list of cut-off points and returns it,
        but changing the list does not affect the discretizer. It is only present to give
        :obj:`EquiDistDiscretizer` the same interface as
        :obj:`IntervalDiscretizer`.

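One consistent reading of the rule above, sketched in plain Python (illustrative only: it treats the first interval as everything below ``first_cut`` and clamps the floor formula to the last of the ``n`` intervals):

```python
from math import floor

def equi_dist_index(value, first_cut, step, n):
    """Interval index for an equal-width discretizer with n intervals."""
    if value < first_cut:
        return 0                          # underflow: first interval
    index = int(floor((value - first_cut) / step)) + 1
    return min(index, n - 1)              # overflow: last interval

# With first_cut=4.9, step=0.6, n=6 (the sepal length bins above):
[equi_dist_index(v, 4.9, 0.6, 6) for v in (4.0, 5.2, 6.0, 9.0)]
# -> [0, 1, 2, 5]
```

Storing only ``first_cut``, ``step`` and ``n`` instead of an explicit list of points is what makes this discretizer faster and simpler than :class:`IntervalDiscretizer`.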

.. class:: ThresholdDiscretizer

    A threshold discretizer converts continuous values into binary ones by comparing
    them to a fixed threshold. Orange uses this discretizer for the
    binarization of continuous attributes in decision trees.

    .. attribute:: threshold

        The threshold value; values below or equal to the threshold belong to the first
        interval and those that are greater go to the second.

.. class:: BiModalDiscretizer

    A bimodal discretizer has two cut-off points; values are
    discretized according to whether or not they belong to the region between these points,
    which includes the lower but not the upper boundary. This
    discretizer is returned by :class:`BiModalDiscretization` if its
    field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).

Implementational details
========================
The feature discretization module provides routines that consider continuous features and
introduce new discretized features based on the training data set. Most often such a procedure would be executed
on all the features of the data set using implementations from :doc:`Orange.feature.discretization`. The
implementations in this module are concerned with the discretization of one feature at a time and do not provide
wrappers for whole-data-set discretization. Discretization is data-specific and consists of learning the
discretization procedure (see `Discretization Algorithms`_) and the actual discretization (see Discretizers_) of the
data. The split into these two phases is intentional,
since in machine learning the discretization may be learned from the training set and executed on the test set.

Consider the following example (part of :download:`discretization.py <code/discretization.py>`):
     
References
==========
  • docs/reference/rst/Orange.feature.discretization.rst

    r9944 r9964  
    1010   single: feature; discretization 
    1111 
    12 Continues features can be discretized either one feature at a time, or, as demonstrated in the following script, 
    13 using a single discretization method on entire set of data features: 
    14  
    15 .. literalinclude:: code/discretization-table.py 
    16  
    17 Discretization introduces new categorical features and computes their values in accordance to 
    18 selected (or default) discretization method:: 
    19  
    20     Original data set: 
    21     [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] 
    22     [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] 
    23     [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] 
    24  
    25     Discretized data set: 
    26     ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
    27     ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa'] 
    28     ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
    29  
    30 The following discretization methods are supported: 
    31  
    32 * equal width discretization, where the domain of continuous feature is split to intervals of the same 
    33   width equal-sized intervals (:class:`EqualWidth`), 
    34 * equal frequency discretization, where each intervals contains equal number of data instances (:class:`EqualFreq`), 
    35 * entropy-based, as originally proposed by [FayyadIrani1993]_ that infers the intervals to minimize 
    36   within-interval entropy of class distributions (:class:`Entropy`), 
    37 * bi-modal, using three intervals to optimize the difference of the class distribution in 
    38   the middle with the distribution outside it (:class:`BiModal`), 
    39 * fixed, with the user-defined cut-off points. 

The above script used the default discretization method (equal frequency with three
intervals). This can be changed as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

With the exception of fixed discretization, the discretization approaches infer the cut-off
points from the training data and construct a discretizer that converts continuous values
of the feature into categorical values according to the inferred rule. In this respect,
discretization behaves similarly to :class:`Orange.classification.Learner`.
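
This learn-then-apply split can be sketched in plain Python (a hypothetical illustration;
``EqualFreqLearner`` and ``IntervalMapper`` are made-up names, not Orange classes):

.. code-block:: python

    import bisect

    class IntervalMapper:
        """Toy discretizer: values <= cuts[0] fall into interval 0, etc."""
        def __init__(self, cuts):
            self.cuts = cuts

        def __call__(self, value):
            return bisect.bisect_left(self.cuts, value)

    class EqualFreqLearner:
        """Toy learner: infers cut-offs from training data (sketch only)."""
        def __init__(self, n=3):
            self.n = n

        def __call__(self, train_values):
            ordered = sorted(train_values)
            cuts = [ordered[len(ordered) * i // self.n - 1]
                    for i in range(1, self.n)]
            return IntervalMapper(cuts)

    train = [4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.0]
    discretizer = EqualFreqLearner(n=3)(train)        # "learning" phase
    print([discretizer(v) for v in [4.8, 5.6, 7.2]])  # application phase

The learner is called once on the training data; the returned mapper can then be applied
unchanged to test data, exactly as a trained classifier would be.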

Discretization Algorithms
=========================

Instances of discretization classes are all derived from :class:`Discretization`.

.. class:: Discretization

    .. method:: __call__(feature, data[, weightID])

        Given a continuous ``feature``, ``data`` and, optionally, the id of an
        attribute with example weights, this function returns a discretized
        feature. Argument ``feature`` can be a descriptor, index or
        name of the attribute.


.. class:: EqualWidth

    Discretizes the feature by splitting its domain into a fixed number
    of equal-width intervals. The span of the original domain is computed
    from the training data and is defined by the smallest and the
    largest feature value.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The following example discretizes the Iris dataset features using six
intervals. The script constructs an :class:`Orange.data.Table` with discretized
features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400
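
The printed values follow directly from the span of the training data: sepal length in the
Iris data ranges from 4.3 to 7.9, so six equal-width intervals give a step of 0.6 and a
first cut at 4.9. A plain-Python check of this arithmetic (not using Orange):

.. code-block:: python

    def equal_width_params(lo, hi, n):
        """First cut-off point and interval width for n equal-width intervals."""
        step = (hi - lo) / float(n)
        return lo + step, step

    # sepal length in Iris spans 4.3 .. 7.9
    first_cut, step = equal_width_params(4.3, 7.9, 6)
    print("first interval at %5.3f, step %5.3f" % (first_cut, step))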

All discretizers provide the method
``construct_variable``:

.. literalinclude:: code/discretization.py
    :lines: 69-73


.. class:: EqualFreq

    Infers the cut-off points so that the discretization intervals contain
    an approximately equal number of training data instances.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points``
that store the inferred cut-offs.
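
The equal-frequency idea can be checked with a small plain-Python sketch (not Orange's code)
that counts instances per interval, with the inferred cut-offs playing the role of
``transformer.points``:

.. code-block:: python

    import bisect

    def equal_freq_cuts(values, n):
        """Cut-offs placed so each interval holds about len(values)/n items."""
        ordered = sorted(values)
        return [ordered[len(ordered) * i // n - 1] for i in range(1, n)]

    values = list(range(1, 13))          # 12 training instances
    cuts = equal_freq_cuts(values, 4)    # plays the role of transformer.points
    counts = [0] * 4
    for v in values:
        # values below or equal to a cut-off fall into the interval below it
        counts[bisect.bisect_left(cuts, v)] += 1
    print(cuts, counts)

On real data with ties the counts are only approximately equal, since identical values cannot
be split across two intervals.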

.. class:: Entropy

    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most
    appropriate number of intervals by recursively splitting the domain of the continuous feature so as to
    minimize the class entropy of the training examples. The splitting is repeated until the entropy decrease
    is smaller than the increase of the minimal description length (MDL) induced by the new cut-off point.

    Entropy-based discretization can reduce a continuous feature to
    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be
    removed. This discretization can
    therefore also serve to identify non-informative features and can thus be used for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when
        its information gain is lower than the MDL threshold (default: ``False``).

Part of :download:`discretization.py <code/discretization.py>`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>
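
The recursive split-and-stop procedure can be sketched in plain Python (a simplified
illustration of the Fayyad-Irani idea, not Orange's implementation; ``mdl_cuts`` is a
made-up name):

.. code-block:: python

    from math import log

    def entropy(labels):
        """Class entropy of a list of labels, in bits."""
        total = float(len(labels))
        return -sum(labels.count(c) / total * log(labels.count(c) / total, 2)
                    for c in set(labels))

    def mdl_cuts(pairs):
        """Recursively split a sorted list of (value, class) pairs at the
        cut-off with the highest information gain, stopping when the gain
        no longer exceeds the MDL penalty of the extra cut-off point."""
        pairs = sorted(pairs)
        labels = [c for _, c in pairs]
        n = len(pairs)
        best = None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                       # no cut inside tied values
            left, right = labels[:i], labels[i:]
            gain = entropy(labels) - (i * entropy(left)
                                      + (n - i) * entropy(right)) / n
            if best is None or gain > best[0]:
                best = (gain, i)
        if best is None:
            return []
        gain, i = best
        left, right = labels[:i], labels[i:]
        k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
        delta = log(3 ** k - 2, 2) - (k * entropy(labels)
                                      - k1 * entropy(left) - k2 * entropy(right))
        if gain <= (log(n - 1, 2) + delta) / n:   # MDL stopping criterion
            return []
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        return mdl_cuts(pairs[:i]) + [cut] + mdl_cuts(pairs[i:])

    # two cleanly separated classes: a single cut-off should be induced
    data = ([(x / 10.0, "a") for x in range(10)]
            + [(1.0 + x / 10.0, "b") for x in range(10)])
    print(mdl_cuts(data))

On random labels the same function returns an empty list, which corresponds to the
single-interval (constant-feature) case described above.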

.. class:: BiModal

    Infers two cut-off points to maximize the difference between the class distribution of the data instances in the
    middle interval and the class distribution in the other two intervals. The
    difference is scored by the chi-square statistic. All possible pairs of cut-off
    points are examined, so the discretization runs in O(n^2). This discretization method is especially suitable
    for attributes in
    which the middle region corresponds to normal and the outer regions to
    abnormal values of the feature.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two values.
        If ``True`` (default), the feature will be discretized to three
        intervals and the discretizer is of type :class:`BiModalDiscretizer`.
        If ``False`` the result is the ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that
the sepal lengths of versicolors lie between the lengths of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can observe whether
the bi-modal discretization correctly recognizes the interval in
which versicolors dominate. The following script performs the merging and constructs a new data set with a class
that reports whether an iris is versicolor or not.

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.
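
The exhaustive chi-square search behind this result can be sketched in plain Python (an
illustration only; ``bimodal_cuts`` is a made-up name and the scoring is a plain chi-square
on middle-versus-outside class counts):

.. code-block:: python

    from itertools import combinations

    def chi_square(counts_a, counts_b):
        """Chi-square statistic for two class-count dictionaries."""
        classes = set(counts_a) | set(counts_b)
        n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
        total = float(n_a + n_b)
        chi = 0.0
        for c in classes:
            col = counts_a.get(c, 0) + counts_b.get(c, 0)
            for counts, n in ((counts_a, n_a), (counts_b, n_b)):
                expected = n * col / total
                if expected:
                    chi += (counts.get(c, 0) - expected) ** 2 / expected
        return chi

    def bimodal_cuts(pairs):
        """Examine all pairs of cut-off points (O(n^2)) and return the
        (low, high) whose middle-vs-outside class distributions differ most;
        the middle region includes low but excludes high."""
        pairs = sorted(pairs)
        values = sorted(set(v for v, _ in pairs))
        best = None
        for low, high in combinations(values, 2):
            mid, out = {}, {}
            for v, c in pairs:
                target = mid if low <= v < high else out
                target[c] = target.get(c, 0) + 1
            if not mid or not out:
                continue
            score = chi_square(mid, out)
            if best is None or score > best[0]:
                best = (score, low, high)
        return best[1], best[2]

    # "normal" values in the middle, "abnormal" values on both sides
    data = ([(x, "abnormal") for x in (1, 2, 3)]
            + [(x, "normal") for x in (4, 5, 6)]
            + [(x, "abnormal") for x in (7, 8, 9)])
    print(bimodal_cuts(data))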

Discretizers
============

Discretizers construct a categorical feature from a continuous feature according to the method they implement and
its parameters. The most general is
:class:`IntervalDiscretizer`, which is also used by most discretization
methods. Two other discretizers, :class:`EqualWidthDiscretizer` and
:class:`ThresholdDiscretizer`, could easily be replaced by
:class:`IntervalDiscretizer` but are used for speed and simplicity.
The fourth discretizer, :class:`BiModalDiscretizer`, is specialized
for discretizations induced by :class:`BiModal`.

.. class:: Discretizer

    A superclass implementing the construction of a new
    attribute from an existing one.

    .. method:: construct_variable(feature)

        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name``
        prefixed by "D\_". Its symbolic values are discretizer specific.

.. class:: IntervalDiscretizer

    A discretizer defined by a set of cut-off points.

    .. attribute:: points

        The cut-off points; feature values below or equal to the first point are mapped to the first interval,
        those between the first and the second point
        (including those equal to the second) are mapped to the second interval, and
        so forth to the last interval, which covers all values greater than
        the last value in ``points``. The number of intervals is thus
        ``len(points)+1``.

The script that follows is an example of manual construction of a discretizer with cut-off points
at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of ``data2`` are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']
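
The mapping performed by such a discretizer can be mimicked with ``bisect`` (an illustration,
not Orange's code); on cut-off points 3.0 and 5.0 it reproduces the labels shown above:

.. code-block:: python

    import bisect

    def interval_label(value, points):
        """Label for the interval of `value`; values equal to a cut-off
        belong to the interval below it, as with IntervalDiscretizer."""
        i = bisect.bisect_left(points, value)
        if i == 0:
            return "<=%.2f" % points[0]
        if i == len(points):
            return ">%.2f" % points[-1]
        return "(%.2f, %.2f]" % (points[i - 1], points[i])

    points = [3.0, 5.0]
    print([interval_label(v, points) for v in [5.1, 4.9, 4.7, 4.6, 5.0]])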

The same discretizer can be used on several features by calling the function ``construct_var``:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in
``get_value_from``, but all use the same :class:`IntervalDiscretizer`,
``idisc``. Changing any element of its ``points`` affects all attributes.

.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the
    discretizer is used by any attribute. The length of
    :obj:`~IntervalDiscretizer.points` should always match the number of values
    of the feature, which is determined by the length of the attribute's field
    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal
    ``len(attr.get_value_from.transformer.points)+1``.


.. class:: EqualWidthDiscretizer

    Discretizes to intervals of fixed width. All values lower than :obj:`~EqualWidthDiscretizer.first_cut` are mapped to the first
    interval. Otherwise, the interval of a value ``val`` is determined by ``floor((val-first_cut)/step)``; possible overflows are mapped to the
    last interval.

    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: n

        Number of intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute, although it behaves
        like one. Reading it constructs a list of cut-off points and returns it,
        but changing the list does not affect the discretizer. It is present only to give
        :obj:`EqualWidthDiscretizer` the same interface as
        :obj:`IntervalDiscretizer`.
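
Taking this description literally, the index computation can be sketched as follows (the
``+ 1`` offset that accounts for the leading below-``first_cut`` interval is an assumption
about the indexing convention, not Orange's actual code):

.. code-block:: python

    from math import floor

    def equal_width_interval(value, first_cut, step, n):
        """Interval index for a fixed-width discretizer (sketch only)."""
        if value < first_cut:
            return 0                           # everything below the first cut
        idx = 1 + int(floor((value - first_cut) / step))
        return min(idx, n - 1)                 # clamp overflow to the last interval

    # sepal length in the example above: first cut 4.9, step 0.6, six intervals
    print([equal_width_interval(v, 4.9, 0.6, 6) for v in [4.3, 4.9, 5.1, 7.0, 99.0]])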


.. class:: ThresholdDiscretizer

    A threshold discretizer converts continuous values into binary ones by comparing
    them to a fixed threshold. Orange uses this discretizer for
    binarization of continuous attributes in decision trees.

    .. attribute:: threshold

        The threshold value; values below or equal to the threshold belong to the first
        interval and those that are greater go to the second.


.. class:: BiModalDiscretizer

    A bimodal discretizer has two cut-off points; values are
    discretized according to whether or not they belong to the region between these points,
    which includes the lower but not the upper boundary. This
    discretizer is returned by :class:`BiModal` if its
    field :obj:`~BiModal.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).


Implementational details
========================
The feature discretization module provides routines that consider continuous features and
introduce new discretized features based on the training data set. Most often such a procedure would be executed
on all the features of the data set using implementations from :doc:`Orange.feature.discretization`. Implementations
in this module are concerned with discretization of one feature at a time, and do not provide wrappers for
whole-data-set discretization. The discretization is data-specific, and consists of learning the discretization
procedure (see `Discretization Algorithms`_) and the actual discretization (see Discretizers_) of the data.
The splitting of these two phases is intentional,
as in machine learning discretization may be learned from the training set and executed on the test set.
References
==========