Changeset 9812:a810dc61169e in orange


Timestamp:
02/06/12 18:25:09 (2 years ago)
Author:
blaz <blaz.zupan@…>
Branch:
default
Message:

polished discretization

Files:
3 added
5 edited

  • docs/reference/rst/Orange.feature.discretization.rst

    r9372 r9812  
    1 .. automodule:: Orange.feature.discretization 
     1.. py:currentmodule:: Orange.feature.discretization 
     2 
     3################################### 
     4Discretization (``discretization``) 
     5################################### 
     6 
     7.. index:: discretization 
     8 
     9.. index:: 
     10   single: feature; discretization 
     11 
     12Continuous features can be discretized either one feature at a time or, as demonstrated in the following script, 
     13using a single discretization method on the entire set of data features: 
     14 
     15.. literalinclude:: code/discretization-table.py 
     16 
     17Discretization introduces new categorical features and computes their values according to 
     18the selected (or default) discretization method:: 
     19 
     20    Original data set: 
     21    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] 
     22    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] 
     23    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] 
     24 
     25    Discretized data set: 
     26    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
     27    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa'] 
     28    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
     29 
     30The following discretization methods are supported: 
     31 
     32* equal width discretization, where the domain of a continuous feature is split into intervals of the same 
     33  width (:class:`EqualWidth`), 
     34* equal frequency discretization, where each interval contains an approximately equal number of data instances (:class:`EqualFreq`), 
     35* entropy-based, as originally proposed by [FayyadIrani1993]_, which infers the intervals to minimize 
     36  within-interval entropy of class distributions (:class:`Entropy`), 
     37* bi-modal, using three intervals to optimize the difference of the class distribution in 
     38  the middle with the distribution outside it (:class:`BiModal`), 
     39* fixed, with user-defined cut-off points. 
     40 
     41The above script used the default discretization method (equal frequency with three intervals). This can be changed 
     42as demonstrated below: 
     43 
     44.. literalinclude:: code/discretization-table-method.py 
     45    :lines: 3-5 
     46 
     47With the exception of fixed discretization, discretization approaches infer the cut-off points from the 
     48training data set and then construct a discretizer that converts continuous values of the feature into categorical 
     49values according to the rule found by discretization. In this respect, discretization behaves similarly to 
     50:class:`Orange.classification.Learner`. 
     51 
     52Utility functions 
     53================= 
     54 
     55This section lists functions and classes that can be used for 
     56categorization of continuous features. Besides several general classes that 
     57can help in this task, we also provide a function that may help in 
     58entropy-based discretization (Fayyad & Irani), and a wrapper around classes for 
     59categorization that can be used for learning. 
     60 
     61.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class 
     62 
     63.. autoclass:: DiscretizeTable 
     64 
     65.. rubric:: Example 
     66 
     67FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
     68for Beginners tutorial shows the use of DiscretizedLearner. Other 
     69discretization classes from core Orange are listed in chapter on 
     70`categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
     71 
     72Discretization Algorithms 
     73========================= 
     74 
     75All discretization classes are derived from :class:`Discretization`. 
     76 
     77.. class:: Discretization 
     78 
     79    .. method:: __call__(feature, data[, weightID]) 
     80 
     81        Given a continuous ``feature``, ``data`` and, optionally, the id of the 
     82        attribute with example weights, this function returns a discretized 
     83        feature. Argument ``feature`` can be a descriptor, index or 
     84        name of the attribute. 
     85 
     86 
     87.. class:: EqualWidth 
     88 
     89    Discretizes the feature by splitting its domain into a fixed number 
     90    of equal-width intervals. The span of the original domain is computed 
     91    from the training data and is defined by the smallest and the 
     92    largest feature value. 
     93 
     94    .. attribute:: n 
     95 
     96        Number of discretization intervals (default: 4). 
     97 
     98The following example discretizes the features of the Iris dataset using six 
     99intervals. The script constructs an :class:`Orange.data.Table` with discretized 
     100features and outputs their description: 
     101 
     102.. literalinclude:: code/discretization.py 
     103    :lines: 38-43 
     104 
     105The output of this script is:: 
     106 
     107    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
     108    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
     109    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
     110    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
     111 
     112The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``:: 
     113 
     114    >>> for attr in newattrs: 
     115    ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
     116    ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
     117    ...    attr.get_value_from.transformer.step) 
     118    D_sepal length: first interval at 4.900, step 0.600 
     119    D_sepal width: first interval at 2.400, step 0.400 
     120    D_petal length: first interval at 1.980, step 0.980 
     121    D_petal width: first interval at 0.500, step 0.400 
     122 
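The ``first_cut`` and ``step`` values printed above are consistent with splitting the observed range of a feature into ``n`` equal parts. The arithmetic can be sketched in plain Python 3; this is not Orange code, the helper name ``equal_width_cuts`` is hypothetical, and the inputs 4.3 and 7.9 are the smallest and largest sepal lengths in the Iris data set:

```python
def equal_width_cuts(values, n):
    """Return (first_cut, step, cuts) for n >= 2 equal-width intervals.

    Hypothetical helper for illustration; not part of Orange.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    # n intervals need n - 1 interior cut-off points
    cuts = [lo + step * i for i in range(1, n)]
    return cuts[0], step, cuts


first_cut, step, cuts = equal_width_cuts([4.3, 7.9], 6)
print("first interval at %5.3f, step %5.3f" % (first_cut, step))
print([round(c, 2) for c in cuts])
```

With these inputs the cut-offs fall at approximately 4.9, 5.5, 6.1, 6.7 and 7.3, in line with the intervals listed for ``D_sepal length``.
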
     123All discretizers have the method 
     124``construct_variable``: 
     125 
     126.. literalinclude:: code/discretization.py 
     127    :lines: 69-73 
     128 
     129 
     130.. class:: EqualFreq 
     131 
     132    Infers the cut-off points so that the discretization intervals contain 
     133    an approximately equal number of training data instances. 
     134 
     135    .. attribute:: n 
     136 
     137        Number of discretization intervals (default: 4). 
     138 
     139The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points`` 
     140that store the inferred cut-offs. 
     141 
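The idea behind equal frequency cut-offs can be sketched in plain Python 3. This is a simplified illustration, not the algorithm Orange implements: the helper name ``equal_freq_cuts`` is hypothetical, and placing each cut-off halfway between two neighbouring sorted values is an assumption made for this sketch:

```python
def equal_freq_cuts(values, n):
    """Return cut-offs that split values into n groups of similar size.

    Hypothetical helper for illustration; not part of Orange.
    """
    ordered = sorted(values)
    cuts = []
    for i in range(1, n):
        pos = i * len(ordered) // n
        if pos == 0:
            continue  # not enough data for this boundary
        below, above = ordered[pos - 1], ordered[pos]
        if below == above:
            continue  # a cut-off cannot separate equal values
        cut = (below + above) / 2
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


print(equal_freq_cuts(range(1, 13), 3))
```

For the twelve values 1 to 12 and three intervals, the sketch places cut-offs at 4.5 and 8.5, so each interval receives four values. Ties can make the groups unequal, which is why the class description says "approximately".
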
     142.. class:: Entropy 
     143 
     144    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most 
     145    appropriate number of intervals by recursively splitting the domain of the continuous feature to minimize the 
     146    class-entropy of training examples. The splitting is repeated until the entropy decrease is smaller than the 
     147    increase of minimal description length (MDL) induced by the new cut-off point. 
     148 
     149    Entropy-based discretization can reduce a continuous feature into 
     150    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be 
     151    removed. This discretization can 
     152    therefore also serve to identify non-informative features and can thus be used for feature subset selection. 
     153 
     154    .. attribute:: force_attribute 
     155 
     156        Forces the algorithm to induce at least one cut-off point, even when 
     157        its information gain is lower than MDL (default: ``False``). 
     158 
     159Part of :download:`discretization.py <code/discretization.py>`: 
     160 
     161.. literalinclude:: code/discretization.py 
     162    :lines: 77-80 
     163 
     164The output shows that all attributes are discretized into three intervals:: 
     165 
     166    sepal length: <5.5, 6.09999990463> 
     167    sepal width: <2.90000009537, 3.29999995232> 
     168    petal length: <1.89999997616, 4.69999980927> 
     169    petal width: <0.600000023842, 1.0000004768> 
     170 
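The recursive splitting with an MDL stopping rule described for :class:`Entropy` can be sketched in plain Python 3. This follows the criterion as published by Fayyad and Irani rather than Orange's source, and does not model ``force_attribute``; the function names are hypothetical, and candidate cut-offs are assumed to lie halfway between neighbouring distinct values:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())


def mdl_cut_points(pairs):
    """Return sorted cut-offs for a list of (value, label) pairs."""
    pairs = sorted(pairs)
    labels = [label for _, label in pairs]
    n = len(pairs)
    if n < 2:
        return []
    base = entropy(labels)
    best = None  # (information gain, split index)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - weighted
        if best is None or gain > best[0]:
            best = (gain, i)
    if best is None:
        return []
    gain, i = best
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * base - k1 * entropy(left) - k2 * entropy(right))
    # MDL criterion: accept the cut only if the gain pays for its cost
    if gain <= (math.log2(n - 1) + delta) / n:
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return mdl_cut_points(pairs[:i]) + [cut] + mdl_cut_points(pairs[i:])


separable = [(v, "a") for v in (1, 2, 3, 4)] + [(v, "b") for v in (10, 11, 12, 13)]
mixed = [(v, "ab"[v % 2]) for v in range(1, 9)]
print(mdl_cut_points(separable))
print(mdl_cut_points(mixed))
```

The first data set is cleanly separable and yields a single cut-off at 7.0. The second alternates classes, so no cut-off passes the MDL test and the result is an empty list, which corresponds to the constant, non-informative feature mentioned above.
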
     171.. class:: BiModal 
     172 
     173    Infers two cut-off points to optimize the difference of class distribution of data instances in the 
     174    middle and in the other two intervals. The 
     175    difference is scored by chi-square statistics. All possible cut-off 
     176    points are examined, thus the discretization runs in O(n^2). This discretization method is especially suitable 
     177    for the attributes in 
     178    which the middle region corresponds to normal and the outer regions to 
     179    abnormal values of the feature. 
     180 
     181    .. attribute:: split_in_two 
     182 
     183        Decides whether the resulting attribute should have three or two values. 
     184        If ``True`` (default), the feature will be discretized to three intervals and the discretizer 
     185        is of type :class:`BiModalDiscretizer`. If ``False`` the result is the 
     186        ordinary :class:`IntervalDiscretizer`. 
     187 
     188The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that 
     189sepal lengths of versicolors are between lengths of setosas and virginicas. 
     190 
     191.. image:: files/bayes-iris.gif 
     192 
     193If we merge classes setosa and virginica, we can observe if 
     194the bi-modal discretization would correctly recognize the interval in 
     195which versicolors dominate. The following script performs the merging and constructs a new data set with a class 
     196that reports whether an iris is versicolor or not. 
     197 
     198.. literalinclude:: code/discretization.py 
     199    :lines: 84-87 
     200 
     201The following script implements the discretization: 
     202 
     203.. literalinclude:: code/discretization.py 
     204    :lines: 97-100 
     205 
     206The middle intervals are printed:: 
     207 
     208    sepal length: (5.400, 6.200] 
     209    sepal width: (2.000, 2.900] 
     210    petal length: (1.900, 4.700] 
     211    petal width: (0.600, 1.600] 
     212 
     213Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense. 
     214 
     215Discretizers 
     216============ 
     217 
     218Discretizers construct a categorical feature from the continuous feature according to the method they implement and 
     219its parameters. The most general is 
     220:class:`IntervalDiscretizer` that is also used by most discretization 
     221methods. Two other discretizers, :class:`EquiDistDiscretizer` and 
     222:class:`ThresholdDiscretizer`, could easily be replaced by 
     223:class:`IntervalDiscretizer` but are used for speed and simplicity. 
     224The fourth discretizer, :class:`BiModalDiscretizer`, is specialized 
     225for discretizations induced by :class:`BiModal`. 
     226 
     227.. class:: Discretizer 
     228 
     229    A superclass implementing the construction of a new 
     230    attribute from an existing one. 
     231 
     232    .. method:: construct_variable(feature) 
     233 
     234        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name`` 
     235        prefixed by "D\_". Its symbolic values are discretizer specific. 
     236 
     237.. class:: IntervalDiscretizer 
     238 
     239    Discretizer defined with a set of cut-off points. 
     240 
     241    .. attribute:: points 
     242 
     243        The cut-off points; feature values below or equal to the first point will be mapped to the first interval, 
     244        those between the first and the second point 
     245        (including those equal to the second) are mapped to the second interval and 
     246        so forth to the last interval which covers all values greater than 
     247        the last value in ``points``. The number of intervals is thus 
     248        ``len(points)+1``. 
     249 
     250The script that follows is an example of manually constructing a discretizer with cut-off points 
     251at 3.0 and 5.0: 
     252 
     253.. literalinclude:: code/discretization.py 
     254    :lines: 22-26 
     255 
     256First five data instances of ``data2`` are:: 
     257 
     258    [5.1, '>5.00', 'Iris-setosa'] 
     259    [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
     260    [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
     261    [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
     262    [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
     263 
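The mapping from values to intervals described for ``points`` can be sketched in plain Python 3 with the standard ``bisect`` module. The helper names are hypothetical and the label format merely imitates the output shown above; ``points`` is assumed to be sorted and non-empty:

```python
from bisect import bisect_left


def interval_labels(points):
    """Build len(points) + 1 symbolic interval names."""
    labels = ["<=%.2f" % points[0]]
    labels += ["(%.2f, %.2f]" % (a, b) for a, b in zip(points, points[1:])]
    labels.append(">%.2f" % points[-1])
    return labels


def discretize(value, points):
    """Return the name of the interval that value falls into."""
    # bisect_left counts the points strictly smaller than value, so a value
    # equal to a cut-off stays in the interval that ends at that cut-off
    return interval_labels(points)[bisect_left(points, value)]


for sepal_length in (5.1, 4.9, 5.0):
    print([sepal_length, discretize(sepal_length, [3.0, 5.0])])
```

The number of labels is always ``len(points) + 1``, which matches the relation between ``points`` and the feature's values stated in the attribute description.
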
     264The same discretizer can be used on several features by calling its ``construct_variable`` method for each of them: 
     265 
     266.. literalinclude:: code/discretization.py 
     267    :lines: 30-34 
     268 
     269Each feature has its own instance of :class:`ClassifierFromVar` stored in 
     270``get_value_from``, but all use the same :class:`IntervalDiscretizer`, 
     271``idisc``. Changing any element of its ``points`` affects all attributes. 
     272 
     273.. note:: 
     274 
     275    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the 
     276    discretizer is used by any attribute. The length of 
     277    :obj:`~IntervalDiscretizer.points` should always match the number of values 
     278    of the feature, which is determined by the length of the attribute's field 
     279    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal 
     280    ``len(attr.get_value_from.transformer.points)+1``. 
     281 
     282 
     283.. class:: EqualWidthDiscretizer 
     284 
     285    Discretizes to intervals of fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first 
     286    interval. Otherwise, value ``val``'s interval is ``floor((val-first_cut)/step)``. Possible overflows are mapped to the 
     287    last interval. 
     288 
     289 
     290    .. attribute:: first_cut 
     291 
     292        The first cut-off point. 
     293 
     294    .. attribute:: step 
     295 
     296        Width of the intervals. 
     297 
     298    .. attribute:: n 
     299 
     300        Number of the intervals. 
     301 
     302    .. attribute:: points (read-only) 
     303 
     304        The cut-off points; this is not a real attribute although it behaves 
     305        as one. Reading it constructs a list of cut-off points and returns it, 
     306        but changing the list doesn't affect the discretizer. Only present to provide 
     307        the :obj:`EquiDistDiscretizer` the same interface as that of 
     308        :obj:`IntervalDiscretizer`. 
     309 
     310 
     311.. class:: ThresholdDiscretizer 
     312 
     313    Threshold discretizer converts continuous values into binary by comparing 
     314    them to a fixed threshold. Orange uses this discretizer for 
     315    binarization of continuous attributes in decision trees. 
     316 
     317    .. attribute:: threshold 
     318 
     319        The value threshold; values below or equal to the threshold belong to the first 
     320        interval and those that are greater go to the second. 
     321 
     322 
     323.. class:: BiModalDiscretizer 
     324 
     325    The bi-modal discretizer has two cut-off points, and values are 
     326    discretized according to whether or not they belong to the region between these points, 
     327    which includes the lower but not the upper boundary. The 
     328    discretizer is returned by :class:`BiModal` if its 
     329    field :obj:`~BiModal.split_in_two` is true (the default). 
     330 
     331    .. attribute:: low 
     332 
     333        Lower boundary of the interval (included in the interval). 
     334 
     335    .. attribute:: high 
     336 
     337        Upper boundary of the interval (not included in the interval). 
     338 
     339 
     340Implementation details 
     341====================== 
     342 
     343Consider the following example (part of :download:`discretization.py <code/discretization.py>`): 
     344 
     345.. literalinclude:: code/discretization.py 
     346    :lines: 7-15 
     347 
     348The discretized attribute ``sep_w`` is constructed with a call to 
     349:class:`Entropy`; instead of constructing it and calling 
     350it afterwards, we passed the arguments for calling to the constructor. We then constructed a new 
     351:class:`Orange.data.Table` with attributes "sepal width" (the original 
     352continuous attribute), ``sep_w`` and the class attribute:: 
     353 
     354    Entropy discretization, first 5 data instances 
     355    [3.5, '>3.30', 'Iris-setosa'] 
     356    [3.0, '(2.90, 3.30]', 'Iris-setosa'] 
     357    [3.2, '(2.90, 3.30]', 'Iris-setosa'] 
     358    [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
     359    [3.6, '>3.30', 'Iris-setosa'] 
     360 
     361The name of the new categorical variable is derived from the name of the original continuous variable by adding the prefix 
     362"D_". The values of the new attributes are computed automatically when they are needed using a transformation function 
     363:obj:`~Orange.data.variable.Variable.get_value_from` (see :class:`Orange.data.variable.Variable`) which encodes the 
     364discretization:: 
     365 
     366    >>> sep_w 
     367    EnumVariable 'D_sepal width' 
     368    >>> sep_w.get_value_from 
     369    <ClassifierFromVar instance at 0x01BA7DC0> 
     370    >>> sep_w.get_value_from.whichVar 
     371    FloatVariable 'sepal width' 
     372    >>> sep_w.get_value_from.transformer 
     373    <IntervalDiscretizer instance at 0x01BA2100> 
     374    >>> sep_w.get_value_from.transformer.points 
     375    <2.90000009537, 3.29999995232> 
     376 
     377The ``select`` statement in the discretization script converted all data instances 
     378from ``data`` to the new domain. This includes a new feature 
     379``sep_w`` whose values are computed on the fly by calling ``sep_w.get_value_from`` for each data instance. 
     380The original, continuous sepal width 
     381is passed to the ``transformer`` that determines the interval by its field 
     382``points``. The transformer returns the discrete value, which is in turn returned 
     383by ``get_value_from`` and stored in the new example. 
     384 
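The on-the-fly computation described above can be modelled in a few lines of plain Python 3. This is a toy model of the mechanism, not Orange's implementation: the classes ``IntervalTransformer`` and ``DerivedFeature`` and the function ``select`` are hypothetical stand-ins for the discretizer, the variable with its ``get_value_from``, and the table's ``select`` call:

```python
from bisect import bisect_left


class IntervalTransformer(object):
    """Maps a continuous value to an interval index using cut-off points."""

    def __init__(self, points):
        self.points = list(points)

    def __call__(self, value):
        return bisect_left(self.points, value)


class DerivedFeature(object):
    """A categorical feature whose values are computed from a source feature."""

    def __init__(self, name, source, transformer, values):
        self.name = name
        self.source = source            # name of the original feature
        self.transformer = transformer  # stands in for get_value_from.transformer
        self.values = values            # symbolic names of the intervals

    def compute(self, row):
        # look up the original value and let the transformer pick the interval
        return self.values[self.transformer(row[self.source])]


def select(rows, features):
    """Convert rows to a new domain, computing derived features on the fly."""
    return [[f.compute(row) if isinstance(f, DerivedFeature) else row[f]
             for f in features] for row in rows]


sep_w = DerivedFeature("D_sepal width", "sepal width",
                       IntervalTransformer([2.9, 3.3]),
                       ["<=2.90", "(2.90, 3.30]", ">3.30"])
rows = [{"sepal width": 3.5, "iris": "Iris-setosa"},
        {"sepal width": 3.0, "iris": "Iris-setosa"}]
for converted in select(rows, ["sepal width", sep_w, "iris"]):
    print(converted)
```

In this model nothing is stored for ``sep_w`` until ``select`` runs. Each converted row receives its discrete value by looking up the original sepal width and passing it through the transformer, which is the same order of steps as in the paragraph above.
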
     385References 
     386========== 
     387 
     388.. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued 
     389  attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages 
     390  1022--1029, Chambery, France, 1993. 
  • docs/reference/rst/code/discretization.py

    r9372 r9812  
    99 
    1010print "\nEntropy discretization, first 10 examples" 
    11 sep_w = Orange.feature.discretization.EntropyDiscretization("sepal width", data) 
     11sep_w = Orange.feature.discretization.Entropy("sepal width", data) 
    1212 
    1313data2 = data.select([data.domain["sepal width"], sep_w, data.domain.class_var]) 
     
    1919print "Cut-off points:", sep_w.get_value_from.transformer.points 
    2020 
    21 print "\nManual construction of IntervalDiscretizer - single attribute" 
    22 idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) 
     21print "\nManual construction of Interval discretizer - single attribute" 
     22idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) 
    2323sep_l = idisc.construct_variable(data.domain["sepal length"]) 
    2424data2 = data.select([data.domain["sepal length"], sep_l, data.domain.classVar]) 
     
    2727 
    2828 
    29 print "\nManual construction of IntervalDiscretizer - all attributes" 
    30 idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) 
     29print "\nManual construction of Interval discretizer - all attributes" 
     30idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) 
    3131newattrs = [idisc.construct_variable(attr) for attr in data.domain.attributes] 
    3232data2 = data.select(newattrs + [data.domain.class_var]) 
     
    3535 
    3636 
    37 print "\n\nEqual interval size discretization" 
    38 disc = Orange.feature.discretization.EquiDistDiscretization(numberOfIntervals = 6) 
     37print "\n\nDiscretization with equal width intervals" 
     38disc = Orange.feature.discretization.EqualWidth(numberOfIntervals = 6) 
    3939newattrs = [disc(attr, data) for attr in data.domain.attributes] 
    4040data2 = data.select(newattrs + [data.domain.classVar]) 
     
    5151 
    5252 
    53 print "\n\nQuartile discretization" 
    54 disc = Orange.feature.discretization.EquiNDiscretization(numberOfIntervals = 6) 
     53print "\n\nQuartile (equal frequency) discretization" 
     54disc = Orange.feature.discretization.EqualFreq(numberOfIntervals = 6) 
    5555newattrs = [disc(attr, data) for attr in data.domain.attributes] 
    5656data2 = data.select(newattrs + [data.domain.classVar]) 
     
    6666 
    6767 
    68 print "\nManual construction of EquiDistDiscretizer - all attributes" 
    69 edisc = Orange.feature.discretization.EquiDistDiscretizer(first_cut = 2.0, step = 1.0, number_of_intervals = 5) 
     68print "\nManual construction of EqualWidth - all attributes" 
     69edisc = Orange.feature.discretization.EqualWidth(first_cut = 2.0, step = 1.0, number_of_intervals = 5) 
    7070newattrs = [edisc.constructVariable(attr) for attr in data.domain.attributes] 
    7171data2 = data.select(newattrs + [data.domain.classVar]) 
     
    7474 
    7575 
    76 print "\nFayyad-Irani discretization" 
    77 entro = Orange.feature.discretization.EntropyDiscretization() 
     76print "\nFayyad-Irani entropy-based discretization" 
     77entro = Orange.feature.discretization.Entropy() 
    7878for attr in data.domain.attributes: 
    7979    disc = entro(attr, data) 
     
    8787data_v = Orange.data.Table(newdomain, data) 
    8888 
    89 print "\nBi-Modal discretization on binary problem" 
    90 bimod = Orange.feature.discretization.BiModalDiscretization(split_in_two = 0) 
     89print "\nBi-modal discretization on a binary problem" 
     90bimod = Orange.feature.discretization.BiModal(split_in_two = 0) 
    9191for attr in data_v.domain.attributes: 
    9292    disc = bimod(attr, data_v) 
     
    9494print 
    9595 
    96 print "\nBi-Modal discretization on binary problem" 
    97 bimod = Orange.feature.discretization.BiModalDiscretization() 
     96print "\nBi-modal discretization on a binary problem" 
     97bimod = Orange.feature.discretization.BiModal() 
    9898for attr in data_v.domain.attributes: 
    9999    disc = bimod(attr, data_v) 
     
    102102 
    103103 
    104 print "\nEntropy discretization on binary problem" 
     104print "\nEntropy-based discretization on a binary problem" 
    105105for attr in data_v.domain.attributes: 
    106106    disc = entro(attr, data_v) 
  • orange/Orange/feature/discretization.py

    r9349 r9812  
    1 """ 
    2 ################################### 
    3 Discretization (``discretization``) 
    4 ################################### 
    5  
    6 .. index:: discretization 
    7  
    8 .. index::  
    9    single: feature; discretization 
    10  
    11  
    12 Example-based automatic discretization is in essence similar to learning: 
    13 given a set of examples, discretization method proposes a list of suitable 
    14 intervals to cut the attribute's values into. For this reason, Orange 
    15 structures for discretization resemble its structures for learning. Objects 
    16 derived from ``orange.Discretization`` play a role of "learner" that,  
    17 upon observing the examples, construct an ``orange.Discretizer`` whose role 
    18 is to convert continuous values into discrete according to the rule found by 
    19 ``Discretization``. 
    20  
    21 Orange supports several methods of discretization; here's a 
    22 list of methods with belonging classes. 
    23  
    24 * Equi-distant discretization (:class:`EquiDistDiscretization`,  
    25   :class:`EquiDistDiscretizer`). The range of attribute's values is split 
    26   into prescribed number equal-sized intervals. 
    27 * Quantile-based discretization (:class:`EquiNDiscretization`, 
    28   :class:`IntervalDiscretizer`). The range is split into intervals 
    29   containing equal number of examples. 
    30 * Entropy-based discretization (:class:`EntropyDiscretization`, 
    31   :class:`IntervalDiscretizer`). Developed by Fayyad and Irani, 
    32   this method balances between entropy in intervals and MDL of discretization. 
    33 * Bi-modal discretization (:class:`BiModalDiscretization`, 
    34   :class:`BiModalDiscretizer`/:class:`IntervalDiscretizer`). 
    35   Two cut-off points set to optimize the difference of the distribution in 
    36   the middle interval and the distributions outside it. 
    37 * Fixed discretization (:class:`IntervalDiscretizer`). Discretization with  
    38   user-prescribed cut-off points. 
    39  
    40 Instances of classes derived from :class:`Discretization`. It define a 
    41 single method: the call operator. The object can also be called through 
    42 constructor. 
    43  
    44 .. class:: Discretization 
    45  
    46     .. method:: __call__(attribute, examples[, weightID]) 
    47  
    48         Given a continuous ``attribute`, ``examples`` and, optionally id of 
    49         attribute with example weight, this function returns a discretized 
    50         attribute. Argument ``attribute`` can be a descriptor, index or 
    51         name of the attribute. 
    52  
    53 Here's an example. Part of :download:`discretization.py <code/discretization.py>`: 
    54  
    55 .. literalinclude:: code/discretization.py 
    56     :lines: 7-15 
    57  
    58 The discretized attribute ``sep_w`` is constructed with a call to 
    59 :class:`EntropyDiscretization` (instead of constructing it and calling 
    60 it afterwards, we passed the arguments for calling to the constructor, as 
    61 is often allowed in Orange). We then constructed a new  
    62 :class:`Orange.data.Table` with attributes "sepal width" (the original  
    63 continuous attribute), ``sep_w`` and the class attribute. Script output is:: 
    64  
    65     Entropy discretization, first 10 examples 
    66     [3.5, '>3.30', 'Iris-setosa'] 
    67     [3.0, '(2.90, 3.30]', 'Iris-setosa'] 
    68     [3.2, '(2.90, 3.30]', 'Iris-setosa'] 
    69     [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
    70     [3.6, '>3.30', 'Iris-setosa'] 
    71     [3.9, '>3.30', 'Iris-setosa'] 
    72     [3.4, '>3.30', 'Iris-setosa'] 
    73     [3.4, '>3.30', 'Iris-setosa'] 
    74     [2.9, '<=2.90', 'Iris-setosa'] 
    75     [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
    76  
    77 :class:`EntropyDiscretization` named the new attribute's values by the 
    78 interval range (it also named the attribute as "D_sepal width"). The new 
    79 attribute's values get computed automatically when they are needed. 
    80  
    81 As those that have read about :class:`Orange.data.variable.Variable` know, 
    82 the answer to  
    83 "How this works?" is hidden in the field  
    84 :obj:`~Orange.data.variable.Variable.get_value_from`. 
    85 This little dialog reveals the secret. 
    86  
    87 :: 
    88  
    89     >>> sep_w 
    90     EnumVariable 'D_sepal width' 
    91     >>> sep_w.get_value_from 
    92     <ClassifierFromVar instance at 0x01BA7DC0> 
    93     >>> sep_w.get_value_from.whichVar 
    94     FloatVariable 'sepal width' 
    95     >>> sep_w.get_value_from.transformer 
    96     <IntervalDiscretizer instance at 0x01BA2100> 
    97     >>> sep_w.get_value_from.transformer.points 
    98     <2.90000009537, 3.29999995232> 
    99  
    100 So, the ``select`` statement in the above example converted all examples 
    101 from ``data`` to the new domain. Since the new domain includes the attribute 
    102 ``sep_w`` that is not present in the original, ``sep_w``'s values are 
    103 computed on the fly. For each example in ``data``, ``sep_w.get_value_from``  
    104 is called to compute ``sep_w``'s value (if you ever need to call 
    105 ``get_value_from``, you shouldn't call ``get_value_from`` directly but call 
    106 ``compute_value`` instead). ``sep_w.get_value_from`` looks for value of 
    107 "sepal width" in the original example. The original, continuous sepal width 
    108 is passed to the ``transformer`` that determines the interval by its field 
    109 ``points``. Transformer returns the discrete value which is in turn returned 
    110 by ``get_value_from`` and stored in the new example. 
    111  
    112 You don't need to understand this mechanism exactly. It's important to know 
    113 that there are two classes of objects for discretization. Those derived from 
    114 :obj:`Discretizer` (such as :obj:`IntervalDiscretizer` that we've seen above) 
    115 are used as transformers that translate continuous value into discrete. 
    116 Discretization algorithms are derived from :obj:`Discretization`. Their  
    117 job is to construct a :obj:`Discretizer` and return a new variable 
    118 with the discretizer stored in ``get_value_from.transformer``. 
    119  
    120 Discretizers 
    121 ============ 
    122  
    123 Different discretizers support different methods for conversion of 
    124 continuous values into discrete. The most general is  
    125 :class:`IntervalDiscretizer` that is also used by most discretization 
    126 methods. Two other discretizers, :class:`EquiDistDiscretizer` and  
    127 :class:`ThresholdDiscretizer`> could easily be replaced by  
    128 :class:`IntervalDiscretizer` but are used for speed and simplicity. 
    129 The fourth discretizer, :class:`BiModalDiscretizer` is specialized 
    130 for discretizations induced by :class:`BiModalDiscretization`. 
    131  
    132 .. class:: Discretizer 
    133  
    134     All discretizers support a handy method for construction of a new 
    135     attribute from an existing one. 
    136  
    137     .. method:: construct_variable(attribute) 
    138  
    139         Constructs a new attribute descriptor; the new attribute is discretized 
    140         ``attribute``. The new attribute's name equal ``attribute.name``  
    141         prefixed  by "D\_", and its symbolic values are discretizer specific. 
    142         The above example shows what comes out form :class:`IntervalDiscretizer`.  
    143         Discretization algorithms actually first construct a discretizer and 
    144         then call its :class:`construct_variable` to construct an attribute 
    145         descriptor. 
    146  
    147 .. class:: IntervalDiscretizer 
    148  
    149     The most common discretizer.  
    150  
    151     .. attribute:: points 
    152  
    153         Cut-off points. All values below or equal to the first point belong 
    154         to the first interval, those between the first and the second 
    155         (including those equal to the second) go to the second interval and 
    156         so forth to the last interval which covers all values greater than 
    157         the last element in ``points``. The number of intervals is thus  
    158         ``len(points)+1``. 
    159  
    160 Let us manually construct an interval discretizer with cut-off points at 3.0 
    161 and 5.0. We shall use the discretizer to construct a discretized sepal length  
    162 (part of :download:`discretization.py <code/discretization.py>`): 
    163  
    164 .. literalinclude:: code/discretization.py 
    165     :lines: 22-26 
    166  
    167 That's all. The first five examples of ``data2`` are now 
    168  
    169 :: 
    170  
    171     [5.1, '>5.00', 'Iris-setosa'] 
    172     [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
    173     [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
    174     [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
    175     [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
    176  
    177 Can you use the same discretizer for more than one attribute? Yes, as long 
    178 as they have the same cut-off points, of course. Simply call ``construct_variable`` for each 
    179 continuous attribute (part of :download:`discretization.py <code/discretization.py>`): 
    180  
    181 .. literalinclude:: code/discretization.py 
    182     :lines: 30-34 
    183  
    184 Each attribute now has its own (FIXME) ClassifierFromVar in its  
    185 ``get_value_from``, but all use the same :class:`IntervalDiscretizer`,  
    186 ``idisc``. Changing an element of its ``points`` affects all attributes. 
    187  
    188 Do not change the length of :obj:`~IntervalDiscretizer.points` if the 
    189 discretizer is used by any attribute. The length of 
    190 :obj:`~IntervalDiscretizer.points` should always match the number of values 
    191 of the attribute, which is determined by the length of the attribute's field 
    192 ``values``. Therefore, if ``attr`` is a discretized 
    193 attribute, then ``len(attr.values)`` must equal 
    194 ``len(attr.get_value_from.transformer.points)+1``. It always 
    195 does, unless you deliberately change it. If the sizes don't match, 
    196 Orange will probably crash, and it will be entirely your fault. 
    197  
    198  
    199  
    200 .. class:: EquiDistDiscretizer 
    201  
    202     More rigid than :obj:`IntervalDiscretizer`:  
    203     it uses intervals of fixed width. 
    204  
    205     .. attribute:: first_cut 
    206          
    207         The first cut-off point. 
    208      
    209     .. attribute:: step 
    210  
    211         Width of intervals. 
    212  
    213     .. attribute:: number_of_intervals 
    214          
    215         Number of intervals. 
    216  
    217     .. attribute:: points (read-only) 
    218          
    219         The cut-off points; this is not a real attribute although it behaves 
    220         as one. Reading it constructs a list of cut-off points and returns it, 
    221         but changing the list doesn't affect the discretizer - it's a separate 
    222         list. This attribute is here only to give the  
    223         :obj:`EquiDistDiscretizer` the same interface as that of  
    224         :obj:`IntervalDiscretizer`. 
    225  
    226 All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first 
    227 interval (including possible values smaller than ``firstVal``). Otherwise, 
    228 value ``val``'s interval is ``floor((val-firstVal)/step)``. If this turns 
    229 out to be greater or equal to :obj:`~EquiDistDiscretizer.number_of_intervals`,  
    230 it is decreased to ``number_of_intervals-1``. 
    231  
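The rule above can be written out as a short sketch. The text does not define ``firstVal``; the code below assumes it is ``first_cut - step`` (the lower end of the first interval), which is consistent with the Iris output in this section but is an assumption, not something taken from Orange's source. The function name ``equidist_index`` is hypothetical.

```python
import math


def equidist_index(val, first_cut, step, number_of_intervals):
    """Interval index for fixed-width intervals (illustrative sketch)."""
    if val < first_cut:
        # Everything below the first cut-off point goes to the first interval.
        return 0
    first_val = first_cut - step  # ASSUMPTION: firstVal is one step below first_cut
    index = int(math.floor((val - first_val) / step))
    # Values beyond the last cut-off point are clamped into the last interval.
    return min(index, number_of_intervals - 1)


# Sepal length settings reported in this section: first_cut 4.9, step 0.6, 6 intervals.
for v in (4.3, 5.1, 6.4, 10.0):
    print(v, equidist_index(v, 4.9, 0.6, 6))
```
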
    232 This discretizer is returned by :class:`EquiDistDiscretization`; you can 
    233 see an example in the corresponding section. You can also construct it  
    234 manually and call its ``construct_variable``, just as shown for the 
    235 :obj:`IntervalDiscretizer`. 
    236  
    237  
    238 .. class:: ThresholdDiscretizer 
    239  
    240     Threshold discretizer converts continuous values into binary by comparing 
    241     them with a threshold. This discretizer is actually not used by any 
    242     discretization method, but you can use it for manual discretization. 
    243     Orange needs this discretizer for binarization of continuous attributes 
    244     in decision trees. 
    245  
    246     .. attribute:: threshold 
    247  
    248         Threshold; values below or equal to the threshold belong to the first 
    249         interval and those that are greater go to the second. 
    250  
    251 .. class:: BiModalDiscretizer 
    252  
    253     This discretizer is the first discretizer that couldn't be replaced by 
    254     :class:`IntervalDiscretizer`. It has two cut-off points and values are 
    255     discretized according to whether they belong to the middle region 
    256     (which includes the lower but not the upper boundary) or not. The 
    257     discretizer is returned by :class:`BiModalDiscretization` if its 
    258     field :obj:`~BiModalDiscretization.split_in_two` is true (the default). 
    259  
    260     .. attribute:: low 
    261          
    262         Lower boundary of the interval (included in the interval). 
    263  
    264     .. attribute:: high 
    265  
    266         Upper boundary of the interval (not included in the interval). 
    267  
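The two simplest rules, the threshold test and the middle-region test, can be sketched as follows. The boundary handling follows the attribute descriptions above (threshold inclusive on the lower side; ``low`` included, ``high`` excluded); the function names are hypothetical and the sketch says nothing about which symbolic values Orange assigns.

```python
def threshold_index(value, threshold):
    """0 for values below or equal to the threshold, 1 for greater values."""
    return 0 if value <= threshold else 1


def in_middle_region(value, low, high):
    """True if `value` lies in [low, high), as described for low and high."""
    return low <= value < high


print(threshold_index(3.0, 3.0), threshold_index(3.1, 3.0))
print(in_middle_region(5.4, 5.4, 6.2), in_middle_region(6.2, 5.4, 6.2))
```
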
    268  
    269 Discretization Algorithms 
    270 ========================= 
    271  
    272 .. class:: EquiDistDiscretization  
    273  
    274     Discretizes the attribute by cutting it into the prescribed number 
    275     of intervals of equal width. The examples are needed to determine the  
    276     span of attribute values. The interval between the smallest and the 
    277     largest is then cut into equal parts. 
    278  
    279     .. attribute:: number_of_intervals 
    280  
    281         Number of intervals into which the attribute is to be discretized.  
    282         Default value is 4. 
    283  
    284 As an example, we shall discretize all attributes of the Iris dataset into 6 
    285 intervals. We shall construct an :class:`Orange.data.Table` with discretized 
    286 attributes and print a description of the attributes (part 
    287 of :download:`discretization.py <code/discretization.py>`): 
    288  
    289 .. literalinclude:: code/discretization.py 
    290     :lines: 38-43 
    291  
    292 The script prints 
    293  
    294 :: 
    295  
    296     D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
    297     D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
    298     D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
    299     D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
    300  
    301 Any more decent ways for a script to find the interval boundaries than  
    302 by parsing the symbolic values? Sure, they are hidden in the discretizer, 
    303 which is, as usual, stored in ``attr.get_value_from.transformer``. 
    304  
    305 Compare the following with the values above. 
    306  
    307 :: 
    308  
    309     >>> for attr in newattrs: 
    310     ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
    311     ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
    312     ...    attr.get_value_from.transformer.step) 
    313     D_sepal length: first interval at 4.900, step 0.600 
    314     D_sepal width: first interval at 2.400, step 0.400 
    315     D_petal length: first interval at 1.980, step 0.980 
    316     D_petal width: first interval at 0.500, step 0.400 
    317  
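The reported ``first_cut`` and ``step`` can be approximated from the span of the values alone. The sketch below is an assumption about how equal-width cut-off points are usually derived, not Orange's code; the printed petal length step of 0.980 suggests Orange applies some rounding that this sketch does not reproduce. The function name ``equal_width_parameters`` is hypothetical.

```python
def equal_width_parameters(values, number_of_intervals=4):
    """Return (first_cut, step) for equal-width intervals over the value span."""
    lowest, highest = min(values), max(values)
    step = (highest - lowest) / float(number_of_intervals)
    first_cut = lowest + step
    return first_cut, step


# Only the smallest and largest value matter; 4.3 and 7.9 span the Iris sepal length.
first_cut, step = equal_width_parameters([4.3, 5.8, 7.9], number_of_intervals=6)
print("first interval at %5.3f, step %5.3f" % (first_cut, step))
```
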
    318 Like all discretizers, :class:`EquiDistDiscretizer` also has the method  
    319 ``construct_variable`` (part of :download:`discretization.py <code/discretization.py>`): 
    320  
    321 .. literalinclude:: code/discretization.py 
    322     :lines: 69-73 
    323  
    324  
    325 .. class:: EquiNDiscretization 
    326  
    327     Discretization with Intervals Containing (Approximately) Equal Number 
    328     of Examples. 
    329  
    330     Discretizes the attribute by cutting it into the prescribed number of 
    331     intervals so that each of them contains an equal number of examples. The 
    332     examples are obviously needed for this discretization, too. 
    333  
    334     .. attribute:: number_of_intervals 
    335  
    336         Number of intervals into which the attribute is to be discretized. 
    337         Default value is 4. 
    338  
    339 The use of this discretization is the same as the use of  
    340 :class:`EquiDistDiscretization`. The resulting discretizer is  
    341 :class:`IntervalDiscretizer`, hence it has ``points`` instead of ``first_cut``/ 
    342 ``step``/``number_of_intervals``. 
    343  
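A simplified way to obtain such ``points`` is to sort the values and cut at evenly spaced positions. This is a sketch of the general idea only; how Orange places cut-off points and handles ties is not described here, so the midpoint rule and the name ``equal_freq_points`` are assumptions.

```python
def equal_freq_points(values, number_of_intervals=4):
    """Cut-off points that split the sorted values into roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i in range(1, number_of_intervals):
        pos = i * n // number_of_intervals
        if pos == 0 or pos >= n:
            continue
        below, above = ordered[pos - 1], ordered[pos]
        if below == above:
            # Cannot cut between two identical values; skip this position.
            continue
        cut = (below + above) / 2.0
        if not points or cut > points[-1]:
            points.append(cut)
    return points


print(equal_freq_points([1, 2, 3, 4, 5, 6, 7, 8], number_of_intervals=4))
```

With ties in the data the result may have fewer than ``number_of_intervals - 1`` points, which is one reason the interval sizes are only approximately equal.
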
    344 .. class:: EntropyDiscretization 
    345  
    346     Entropy-based Discretization (Fayyad-Irani). 
    347  
    348     Fayyad-Irani's discretization method works without a predefined number of 
    349     intervals. Instead, it recursively splits intervals at the cut-off point 
    350     that minimizes the entropy, until the entropy decrease is smaller than the 
    351     increase of MDL induced by the new point. 
    352  
    353     An interesting thing about this discretization technique is that an 
    354     attribute can be discretized into a single interval, if no suitable 
    355     cut-off points are found. If this is the case, the attribute is rendered 
    356     useless and can be removed. This discretization can therefore also serve 
    357     for feature subset selection. 
    358  
    359     .. attribute:: force_attribute 
    360  
    361         Forces the algorithm to induce at least one cut-off point, even when 
    362         its information gain is lower than MDL (default: false). 
    363  
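The core step of the method, finding the cut-off point that minimizes class entropy, can be sketched as below. The sketch covers a single split only and leaves out the recursion and the MDL-based stopping criterion; the function names are hypothetical and the code is not taken from Orange.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    total = float(len(labels))
    return -sum((c / total) * math.log(c / total, 2)
                for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut, weighted_entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut-off point between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, score)
    return best


print(best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
```

A full implementation would apply this step recursively to each resulting interval and stop when the entropy decrease no longer justifies the added cut-off point.
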
    364 Part of :download:`discretization.py <code/discretization.py>`: 
    365  
    366 .. literalinclude:: code/discretization.py 
    367     :lines: 77-80 
    368  
    369 The output shows that all attributes are discretized into three intervals:: 
    370  
    371     sepal length: <5.5, 6.09999990463> 
    372     sepal width: <2.90000009537, 3.29999995232> 
    373     petal length: <1.89999997616, 4.69999980927> 
    374     petal width: <0.600000023842, 1.0000004768> 
    375  
    376 .. class:: BiModalDiscretization 
    377  
    378     Bi-Modal Discretization 
    379  
    380     Sets two cut-off points so that the class distribution of examples in 
    381     between is as different from the overall distribution as possible. The 
    382     difference is measured by the chi-square statistic. All possible cut-off 
    383     points are tried, thus the discretization runs in O(n^2). 
    384  
    385     This discretization method is especially suitable for the attributes in 
    386     which the middle region corresponds to normal and the outer regions to 
    387     abnormal values of the attribute. Depending on the nature of the 
    388     attribute, we can treat the lower and higher values separately, thus 
    389     discretizing the attribute into three intervals, or together, in a 
    390     binary attribute whose values correspond to normal and abnormal. 
    391  
    392     .. attribute:: split_in_two 
    393          
    394         Decides whether the resulting attribute should have three or two intervals. 
    395         If true (default), we have three intervals and the discretizer is 
    396         of type :class:`BiModalDiscretizer`. If false the result is the  
    397         ordinary :class:`IntervalDiscretizer`. 
    398  
    399 The Iris dataset has a three-valued class attribute; the classes are setosa, virginica 
    400 and versicolor. As the picture below shows, sepal lengths of versicolors are 
    401 between lengths of setosas and virginicas (the picture itself is drawn using 
    402 LOESS probability estimation). 
    403  
    404 .. image:: files/bayes-iris.gif 
    405  
    406 If we merge classes setosa and virginica into one, we can observe whether 
    407 the bi-modal discretization would correctly recognize the interval in 
    408 which versicolors dominate. 
    409  
    410 .. literalinclude:: code/discretization.py 
    411     :lines: 84-87 
    412  
    413 In this script, we have constructed a new class attribute which tells whether 
    414 an iris is versicolor or not. We have told how this attribute's value is 
    415 computed from the original class value with a simple lambda function. 
    416 Finally, we have constructed a new domain and converted the examples. 
    417 Now for discretization. 
    418  
    419 .. literalinclude:: code/discretization.py 
    420     :lines: 97-100 
    421  
    422 The script prints out the middle intervals:: 
    423  
    424     sepal length: (5.400, 6.200] 
    425     sepal width: (2.000, 2.900] 
    426     petal length: (1.900, 4.700] 
    427     petal width: (0.600, 1.600] 
    428  
    429 Judging by the graph, the cut-off points for "sepal length" make sense. 
    430  
    431 Additional functions 
    432 ==================== 
    433  
    434 Some functions and classes that can be used for 
    435 categorization of continuous features. Besides several general classes that 
    436 can help in this task, we also provide a function that may help in 
    437 entropy-based discretization (Fayyad & Irani), and a wrapper around classes for 
    438 categorization that can be used for learning. 
    439  
    440 .. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper 
    441  
    442 .. autoclass:: Orange.feature.discretization.EntropyDiscretization_wrapper 
    443  
    444 .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class 
    445  
    446 .. rubric:: Example 
    447  
    448 FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
    449 for Beginners tutorial shows the use of DiscretizedLearner. Other 
    450 discretization classes from core Orange are listed in chapter on 
    451 `categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
    452  
    453 ========== 
    454 References 
    455 ========== 
    456  
    457 * UM Fayyad and KB Irani. Multi-interval discretization of continuous valued 
    458   attributes for classification learning. In Proceedings of the 13th 
    459   International Joint Conference on Artificial Intelligence, pages 
    460   1022--1029, Chambery, France, 1993. 
    461  
    462 """ 
    463  
     1import Orange 
    4642import Orange.core as orange 
    4653 
     
    4675    Discrete2Continuous, \ 
    4686    Discretizer, \ 
    469         BiModalDiscretizer, \ 
    470         EquiDistDiscretizer, \ 
    471         IntervalDiscretizer, \ 
    472         ThresholdDiscretizer, \ 
    473         EntropyDiscretization, \ 
    474         EquiDistDiscretization, \ 
    475         EquiNDiscretization, \ 
    476         BiModalDiscretization, \ 
    477         Discretization 
     7    BiModalDiscretizer, \ 
     8    EquiDistDiscretizer as EqualWidthDiscretizer, \ 
     9    IntervalDiscretizer, \ 
     10    ThresholdDiscretizer,\ 
     11    EntropyDiscretization as Entropy, \ 
     12    EquiDistDiscretization as EqualWidth, \ 
     13    EquiNDiscretization as EqualFreq, \ 
     14    BiModalDiscretization as BiModal, \ 
     15    Discretization, \ 
     16    Preprocessor_discretize 
    47817 
    479 ###### 
    480 # from orngDics.py 
    481 def entropyDiscretization_wrapper(table): 
    482     """Take the classified table set (table) and categorize all continuous 
    483     features using the entropy based discretization 
    484     :obj:`EntropyDiscretization`. 
     18 
     19 
     20def entropyDiscretization_wrapper(data): 
     21    """Discretize all continuous features in class-labeled data set with the entropy-based discretization 
     22    :obj:`Entropy`. 
    48523     
    486     :param table: data to discretize. 
    487     :type table: Orange.data.Table 
     24    :param data: data to discretize. 
     25    :type data: Orange.data.Table 
    48826    :rtype: :obj:`Orange.data.Table` includes all categorical and discretized\ 
    48927    continuous features from the original data table. 
     
    49533    """ 
    49634    orange.setrandseed(0) 
    497     tablen=orange.Preprocessor_discretize(table, method=EntropyDiscretization()) 
     35    data_new = orange.Preprocessor_discretize(data, method=Entropy()) 
    49836     
    499     attrlist=[] 
    500     nrem=0 
    501     for i in tablen.domain.attributes: 
     37    attrlist = [] 
     38    nrem = 0 
     39    for i in data_new.domain.attributes: 
    50240        if (len(i.values)>1): 
    50341            attrlist.append(i) 
     
    50543            nrem=nrem+1 
    50644    attrlist.append(tablen.domain.classVar) 
    507     return tablen.select(attrlist) 
     45    return data_new.select(attrlist) 
    50846 
    50947 
     
    565103 
    566104    """ 
    567     def __init__(self, baseLearner, discretizer=EntropyDiscretization(), **kwds): 
     105    def __init__(self, baseLearner, discretizer=Entropy(), **kwds): 
    568106        self.baseLearner = baseLearner 
    569107        if hasattr(baseLearner, "name"): 
     
    591129  def __call__(self, example, resultType = orange.GetValue): 
    592130    return self.classifier(example, resultType) 
     131 
     132class DiscretizeTable(object): 
     133    """Discretizes all continuous features of the data table. 
     134 
     135    :param data: data to discretize. 
     136    :type data: :class:`Orange.data.Table` 
     137 
     138    :param features: data features to discretize. None (default) to discretize all features. 
     139    :type features: list of :class:`Orange.data.variable.Variable` 
     140 
     141    :param method: feature discretization method. 
     142    :type method: :class:`Discretization` 
     143    """ 
     144    def __new__(cls, data=None, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): 
     145        if data is None: 
     146            self = object.__new__(cls, features=features, discretize_class=discretize_class, method=method) 
     147            return self 
     148        else: 
     149            self = cls(features=features, discretize_class=discretize_class, method=method) 
     150            return self(data) 
     151 
     152    def __init__(self, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): 
     153        self.features = features 
     154        self.discretize_class = discretize_class 
     155        self.method = method 
     156 
     157    def __call__(self, data): 
     158        pp = Preprocessor_discretize(attributes=self.features, discretizeClass=self.discretize_class) 
     159        pp.method = self.method 
     160        return pp(data) 
     161 
  • orange/Orange/feature/scoring.py

    r9349 r9812  
    206206    Assesses features' ability to distinguish between very similar 
    207207    instances from different classes. This scoring method was first 
    208     developed by Kira and Rendell and then improved by Kononenko. The 
     208    developed by Kira and Rendell and then improved by  Kononenko. The 
    209209    class :obj:`Relief` works on discrete and continuous classes and 
    210210    thus implements ReliefF and RReliefF. 
  • source/orange/discretize.hpp

    r9811 r9812  
    8181  __REGISTER_CLASS 
    8282 
    83   int   numberOfIntervals; //P(+n_intervals) number of intervals 
     83  int   numberOfIntervals; //P(+n) number of intervals 
    8484  float firstCut; //P the first cut-off point 
    8585  float step; //P step (width of interval) 