03/22/11 16:31:26 (3 years ago)

Orange 2.5. Discretization: added documentation from the reference.

1 edited


  • orange/Orange/feature/discretization.py

    r7660 r7770  
    66   single: feature; discretization 
    8 This module implements some functions and classes that can be used for 
     9Example-based automatic discretization is in essence similar to learning: 
     10given a set of examples, a discretization method proposes a list of suitable 
     11intervals to cut the attribute's values into. For this reason, Orange 
     12structures for discretization resemble its structures for learning. Objects 
     13derived from ``orange.Discretization`` play the role of a "learner" that,  
     14upon observing the examples, constructs an ``orange.Discretizer`` whose role 
     15is to convert continuous values into discrete according to the rule found by 
     18Orange supports several methods of discretization; here's a 
     19list of methods with their corresponding classes. 
     21* Equi-distant discretization (:class:`EquiDistDiscretization`,  
     22  :class:`EquiDistDiscretizer`). The range of the attribute's values is split 
     23  into a prescribed number of equal-sized intervals. 
     24* Quantile-based discretization (:class:`EquiNDiscretization`, 
     25  :class:`IntervalDiscretizer`). The range is split into intervals 
     26  containing an equal number of examples. 
     27* Entropy-based discretization (:class:`EntropyDiscretization`, 
     28  :class:`IntervalDiscretizer`). Developed by Fayyad and Irani, 
     29  this method balances between entropy in intervals and MDL of discretization. 
     30* Bi-modal discretization (:class:`BiModalDiscretization`, 
     31  :class:`BiModalDiscretizer`/:class:`IntervalDiscretizer`). 
     32  Two cut-off points are set to optimize the difference of the distribution in 
     33  the middle interval and the distributions outside it. 
     34* Fixed discretization (:class:`IntervalDiscretizer`). Discretization with  
     35  user-prescribed cut-off points. 
     37.. _discretization.py: code/discretization.py 
     39Instances of classes derived from :class:`Discretization` define a 
     40single method: the call operator. The object can also be called through 
     43.. class:: Discretization 
     45    .. method:: __call__(attribute, examples[, weightID]) 
     47        Given a continuous ``attribute``, ``examples`` and, optionally, the id of 
     48        the attribute with example weights, this function returns a discretized 
     49        attribute. Argument ``attribute`` can be a descriptor, index or 
     50        name of the attribute. 
     52Here's an example. Part of `discretization.py`_: 
     54.. literalinclude:: code/discretization.py 
     55    :lines: 7-15 
     57The discretized attribute ``sep_w`` is constructed with a call to 
     58:class:`EntropyDiscretization` (instead of constructing it and calling 
     59it afterwards, we passed the arguments for calling to the constructor, as 
     60is often allowed in Orange). We then constructed a new  
     61:class:`Orange.data.Table` with attributes "sepal width" (the original  
     62continuous attribute), ``sep_w`` and the class attribute. Script output is:: 
     64    Entropy discretization, first 10 examples 
     65    [3.5, '>3.30', 'Iris-setosa'] 
     66    [3.0, '(2.90, 3.30]', 'Iris-setosa'] 
     67    [3.2, '(2.90, 3.30]', 'Iris-setosa'] 
     68    [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
     69    [3.6, '>3.30', 'Iris-setosa'] 
     70    [3.9, '>3.30', 'Iris-setosa'] 
     71    [3.4, '>3.30', 'Iris-setosa'] 
     72    [3.4, '>3.30', 'Iris-setosa'] 
     73    [2.9, '<=2.90', 'Iris-setosa'] 
     74    [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
     76:class:`EntropyDiscretization` named the new attribute's values by the 
     77interval range (it also named the attribute as "D_sepal width"). The new 
     78attribute's values get computed automatically when they are needed. 
     80As those that have read about :class:`Orange.data.variable.Variable` know, 
     81the answer to  
     82"How does this work?" is hidden in the field  
     84This little dialog reveals the secret. 
     88    >>> sep_w 
     89    EnumVariable 'D_sepal width' 
     90    >>> sep_w.get_value_from 
     91    <ClassifierFromVar instance at 0x01BA7DC0> 
     92    >>> sep_w.get_value_from.whichVar 
     93    FloatVariable 'sepal width' 
     94    >>> sep_w.get_value_from.transformer 
     95    <IntervalDiscretizer instance at 0x01BA2100> 
     96    >>> sep_w.get_value_from.transformer.points 
     97    <2.90000009537, 3.29999995232> 
     99So, the ``select`` statement in the above example converted all examples 
     100from ``data`` to the new domain. Since the new domain includes the attribute 
     101``sep_w`` that is not present in the original, ``sep_w``'s values are 
     102computed on the fly. For each example in ``data``, ``sep_w.get_value_from``  
     103is called to compute ``sep_w``'s value (if you ever need to call 
     104``get_value_from``, you shouldn't call ``get_value_from`` directly but call 
     105``compute_value`` instead). ``sep_w.get_value_from`` looks up the value of 
     106"sepal width" in the original example. The original, continuous sepal width 
     107is passed to the ``transformer`` that determines the interval by its field 
     108``points``. The transformer returns the discrete value, which is in turn returned 
     109by ``get_value_from`` and stored in the new example. 
     111You don't need to understand this mechanism exactly. It's important to know 
     112that there are two classes of objects for discretization. Those derived from 
     113:obj:`Discretizer` (such as :obj:`IntervalDiscretizer` that we've seen above) 
     114are used as transformers that translate continuous values into discrete ones. 
     115Discretization algorithms are derived from :obj:`Discretization`. Their  
     116job is to construct a :obj:`Discretizer` and return a new variable 
     117with the discretizer stored in ``get_value_from.transformer``. 
     122Different discretizers support different methods for conversion of 
     123continuous values into discrete. The most general is  
     124:class:`IntervalDiscretizer` that is also used by most discretization 
     125methods. Two other discretizers, :class:`EquiDistDiscretizer` and  
     126:class:`ThresholdDiscretizer`, could easily be replaced by  
     127:class:`IntervalDiscretizer` but are used for speed and simplicity. 
     128The fourth discretizer, :class:`BiModalDiscretizer`, is specialized 
     129for discretizations induced by :class:`BiModalDiscretization`. 
     131.. class:: Discretizer 
     133    All discretizers support a handy method for construction of a new 
     134    attribute from an existing one. 
     136    .. method:: construct_variable(attribute) 
     138        Constructs a new attribute descriptor; the new attribute is the discretized 
     139        ``attribute``. The new attribute's name equals ``attribute.name``  
     140        prefixed by "D_", and its symbolic values are discretizer specific. 
     141        The above example shows what comes out from :class:`IntervalDiscretizer`.  
     142        Discretization algorithms actually first construct a discretizer and 
     143        then call its :obj:`construct_variable` to construct an attribute 
     144        descriptor. 
     146.. class:: IntervalDiscretizer 
     148    The most common discretizer.  
     150    .. attribute:: points 
     152        Cut-off points. All values below or equal to the first point belong 
     153        to the first interval, those between the first and the second 
     154        (including those equal to the second) go to the second interval and 
     155        so forth to the last interval which covers all values greater than 
     156        the last element in ``points``. The number of intervals is thus  
     157        ``len(points)+1``. 
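The interval rule described for ``points`` can be sketched in plain Python, independently of Orange. The function name ``interval_index`` is hypothetical and only illustrates the boundary handling (values equal to a cut-off point fall into the lower interval); it is not Orange's implementation.

```python
from bisect import bisect_left


def interval_index(points, value):
    """Return the 0-based interval index of `value` for sorted cut-off `points`.

    Hypothetical helper, not part of Orange.
    """
    # bisect_left counts the cut-off points strictly smaller than `value`,
    # so a value equal to a cut-off point stays in the lower interval.
    return bisect_left(points, value)


points = [3.0, 5.0]  # two cut-off points give len(points) + 1 == 3 intervals
indices = [interval_index(points, v) for v in (2.5, 3.0, 4.9, 5.0, 5.1)]
```

With cut-off points at 3.0 and 5.0, the values 4.9 and 5.0 land in the middle interval and 5.1 in the last one.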
     159Let us manually construct an interval discretizer with cut-off points at 3.0 
     160and 5.0. We shall use the discretizer to construct a discretized sepal length  
     161(part of `discretization.py`_): 
     163.. literalinclude:: code/discretization.py 
     164    :lines: 22-26 
     166That's all. The first five examples of ``data2`` are now 
     170    [5.1, '>5.00', 'Iris-setosa'] 
     171    [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
     172    [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
     173    [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
     174    [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
     176Can you use the same discretizer for more than one attribute? Yes, as long 
     177as they have the same cut-off points, of course. Simply call ``construct_variable`` for each 
     178continuous attribute (part of `discretization.py`_): 
     180.. literalinclude:: code/discretization.py 
     181    :lines: 30-34 
     183Each attribute now has its own (FIXME) ClassifierFromVar in its  
     184``get_value_from``, but all use the same :class:`IntervalDiscretizer`,  
     185``idisc``. Changing an element of its ``points`` affects all attributes. 
     187Do not change the length of :obj:`~IntervalDiscretizer.points` if the 
     188discretizer is used by any attribute. The length of 
     189:obj:`~IntervalDiscretizer.points` should always match the number of values 
     190of the attribute, which is determined by the length of the attribute's field 
     191``values``. Therefore, if ``attr`` is a discretized 
     192attribute, then ``len(attr.values)`` must equal 
     193``len(attr.get_value_from.transformer.points)+1``. It always 
     194does, unless you deliberately change it. If the sizes don't match, 
     195Orange will probably crash, and it will be entirely your fault. 
     199.. class:: EquiDistDiscretizer 
     201    More rigid than :obj:`IntervalDiscretizer`:  
     202    it uses intervals of fixed width. 
     204    .. attribute:: first_cut 
     206        The first cut-off point. 
     208    .. attribute:: step 
     210        Width of intervals. 
     212    .. attribute:: number_of_intervals 
     214        Number of intervals. 
     216    .. attribute:: points (read-only) 
     218        The cut-off points; this is not a real attribute although it behaves 
     219        as one. Reading it constructs a list of cut-off points and returns it, 
     220        but changing the list doesn't affect the discretizer - it's a separate 
     221        list. This attribute is here only to give the  
     222        :obj:`EquiDistDiscretizer` the same interface as that of  
     223        :obj:`IntervalDiscretizer`. 
     225All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first 
     226interval (including possible values smaller than ``firstVal``). Otherwise, 
     227value ``val``'s interval is ``floor((val-firstVal)/step)``. If this turns 
     228out to be greater than or equal to :obj:`~EquiDistDiscretizer.number_of_intervals`,  
     229it is decreased to ``number_of_intervals-1``. 
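A minimal sketch of this rule in plain Python follows. The text does not define ``firstVal``; the sketch assumes it is the lower end of the range, ``first_cut - step``, which is an assumption and not confirmed here. The function name ``equidist_index`` is hypothetical.

```python
import math


def equidist_index(value, first_cut, step, number_of_intervals):
    """Return the 0-based interval index of `value` (hypothetical helper)."""
    # Everything below the first cut-off point goes to the first interval.
    if value < first_cut:
        return 0
    # ASSUMPTION: firstVal is one step below the first cut-off point.
    first_val = first_cut - step
    index = int(math.floor((value - first_val) / step))
    # Values beyond the last cut-off point are clamped to the last interval.
    return min(index, number_of_intervals - 1)
```

For example, with ``first_cut=2.0``, ``step=1.0`` and four intervals, the value 3.5 falls into interval 2 and any value above 4.0 is clamped to interval 3.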
     231This discretizer is returned by :class:`EquiDistDiscretization`; you can 
     232see an example in the corresponding section. You can also construct it  
     233manually and call its ``construct_variable``, just as shown for the 
     237.. class:: ThresholdDiscretizer 
     239    Threshold discretizer converts continuous values into binary by comparing 
     240    them with a threshold. This discretizer is actually not used by any 
     241    discretization method, but you can use it for manual discretization. 
     242    Orange needs this discretizer for binarization of continuous attributes 
     243    in decision trees. 
     245    .. attribute:: threshold 
     247        Threshold; values below or equal to the threshold belong to the first 
     248        interval and those that are greater go to the second. 
     250.. class:: BiModalDiscretizer 
     252    This discretizer is the first discretizer that couldn't be replaced by 
     253    :class:`IntervalDiscretizer`. It has two cut-off points and values are 
     254    discretized according to whether they belong to the middle region 
     255    (which includes the lower but not the upper boundary) or not. The 
     256    discretizer is returned by :class:`BiModalDiscretization` if its 
     257    field :obj:`~BiModalDiscretization.split_in_two` is true (the default). 
     259    .. attribute:: low 
     261        Lower boundary of the interval (included in the interval). 
     263    .. attribute:: high 
     265        Upper boundary of the interval (not included in the interval). 
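The boundary handling of ``low`` and ``high`` can be illustrated with a short plain-Python sketch. The function ``in_middle_region`` is a hypothetical name; how Orange maps the result to symbolic values is not shown here.

```python
def in_middle_region(value, low, high):
    """Return True if `value` lies in the half-open interval [low, high).

    Hypothetical helper: the lower boundary is included, the upper is not.
    """
    return low <= value < high


# Classify a few values against the middle region [5.4, 6.2).
flags = [in_middle_region(v, 5.4, 6.2) for v in (5.0, 5.4, 6.0, 6.2)]
```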
     268Discretization Algorithms 
     271.. class:: EquiDistDiscretization  
     273    Discretizes the attribute by cutting it into the prescribed number 
     274    of intervals of equal width. The examples are needed to determine the  
     275    span of attribute values. The interval between the smallest and the 
     276    largest is then cut into equal parts. 
     278    .. attribute:: number_of_intervals 
     280        Number of intervals into which the attribute is to be discretized.  
     281        Default value is 4. 
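As a rough illustration of equal-width discretization, the following plain-Python sketch derives the first cut-off point, the step and the full list of cut-off points from a list of values. The function name ``equal_width_cuts`` is hypothetical; this is a sketch of the idea, not Orange's code.

```python
def equal_width_cuts(values, number_of_intervals=4):
    """Return (first_cut, step, points) for equal-width intervals.

    Hypothetical helper; assumes `values` is non-empty and not constant.
    """
    lowest, highest = min(values), max(values)
    # The span between the smallest and the largest value is cut into equal parts.
    step = (highest - lowest) / float(number_of_intervals)
    first_cut = lowest + step
    # n intervals are separated by n - 1 cut-off points.
    points = [first_cut + i * step for i in range(number_of_intervals - 1)]
    return first_cut, step, points


first_cut, step, points = equal_width_cuts([0.0, 4.0, 10.0], 4)
```

Here the span from 0.0 to 10.0 is cut into four parts of width 2.5, giving cut-off points at 2.5, 5.0 and 7.5.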
     283For an example, we shall discretize all attributes of Iris dataset into 6 
     284intervals. We shall construct an :class:`Orange.data.Table` with discretized 
     285attributes and print description of the attributes (part 
     286of `discretization.py`_): 
     288.. literalinclude:: code/discretization.py 
     289    :lines: 38-43 
     291Script's answer is 
     295    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
     296    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
     297    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
     298    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
     300Any more decent ways for a script to find the interval boundaries than  
     301by parsing the symbolic values? Sure, they are hidden in the discretizer, 
     302which is, as usual, stored in ``attr.get_value_from.transformer``. 
     304Compare the following with the values above. 
     308    >>> for attr in newattrs: 
     309    ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
     310    ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
     311    ...    attr.get_value_from.transformer.step) 
     312    D_sepal length: first interval at 4.900, step 0.600 
     313    D_sepal width: first interval at 2.400, step 0.400 
     314    D_petal length: first interval at 1.980, step 0.980 
     315    D_petal width: first interval at 0.500, step 0.400 
     317Like all discretizers, :class:`EquiDistDiscretizer` also has the method  
     318``construct_variable`` (part of `discretization.py`_): 
     320.. literalinclude:: code/discretization.py 
     321    :lines: 69-73 
     324.. class:: EquiNDiscretization 
     326    Discretization with Intervals Containing (Approximately) Equal Number 
     327    of Examples. 
     329    Discretizes the attribute by cutting it into the prescribed number of 
     330    intervals so that each of them contains an equal number of examples. The 
     331    examples are obviously needed for this discretization, too. 
     333    .. attribute:: number_of_intervals 
     335        Number of intervals into which the attribute is to be discretized. 
     336        Default value is 4. 
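The idea of quantile-based cut-off points can be sketched in plain Python as follows. The way Orange places the points and handles ties is not described here, so the midpoint placement and the tie handling below are assumptions; ``equal_frequency_cuts`` is a hypothetical name.

```python
def equal_frequency_cuts(values, number_of_intervals=4):
    """Return cut-off points so intervals hold roughly equal numbers of values.

    Hypothetical helper; midpoint placement is an assumption of this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i in range(1, number_of_intervals):
        pos = i * n // number_of_intervals
        if pos <= 0 or pos >= n:
            continue
        left, right = ordered[pos - 1], ordered[pos]
        if left == right:
            # A cut cannot separate equal values; skip this quantile.
            continue
        cut = (left + right) / 2.0
        # Avoid repeating a cut when several quantiles fall at the same place.
        if not points or cut > points[-1]:
            points.append(cut)
    return points


cuts = equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8], 4)
```

With eight distinct values and four intervals, each interval ends up with two values.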
     338The use of this discretization is the same as the use of  
     339:class:`EquiDistDiscretization`. The resulting discretizer is  
     340:class:`IntervalDiscretizer`, hence it has ``points`` instead of ``first_cut``/ 
     343.. class:: EntropyDiscretization 
     345    Entropy-based Discretization (Fayyad-Irani). 
     347    Fayyad-Irani's discretization method works without a predefined number of 
     348    intervals. Instead, it recursively splits intervals at the cut-off point 
     349    that minimizes the entropy, until the entropy decrease is smaller than the 
     350    increase of MDL induced by the new point. 
     352    An interesting thing about this discretization technique is that an 
     353    attribute can be discretized into a single interval, if no suitable 
     354    cut-off points are found. If this is the case, the attribute is rendered 
     355    useless and can be removed. This discretization can therefore also serve 
     356    for feature subset selection. 
     358    .. attribute:: force_attribute 
     360        Forces the algorithm to induce at least one cut-off point, even when 
     361        its information gain is lower than MDL (default: false). 
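The recursive splitting with an MDL-based stopping rule can be sketched in plain Python. This follows the commonly cited Fayyad-Irani criterion; placing the cut at the midpoint between neighbouring values is an assumption, and the names ``entropy`` and ``mdl_cuts`` are hypothetical, not Orange functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def mdl_cuts(pairs):
    """Return cut-off points for `pairs`, a list of (value, label) sorted by value."""
    n = len(pairs)
    labels = [label for _, label in pairs]
    total_entropy = entropy(labels) if n else 0.0
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        split_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or split_entropy < best[0]:
            best = (split_entropy, i)
    if best is None:
        return []
    split_entropy, i = best
    left, right = labels[:i], labels[i:]
    gain = total_entropy - split_entropy
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * total_entropy - k1 * entropy(left) - k2 * entropy(right))
    # Stop when the entropy decrease does not pay for the new cut-off point.
    if gain <= (math.log2(n - 1) + delta) / n:
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # ASSUMPTION: midpoint placement
    return mdl_cuts(pairs[:i]) + [cut] + mdl_cuts(pairs[i:])


separable = [(x, "a") for x in range(1, 11)] + [(x, "b") for x in range(11, 21)]
mixed = [(x, "ab"[x % 2]) for x in range(1, 21)]
```

For the cleanly separable data the sketch finds a single cut, while the alternating labels yield no cut at all, which corresponds to the case where the attribute is discretized into a single interval.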
     363Part of `discretization.py`_: 
     365.. literalinclude:: code/discretization.py 
     366    :lines: 77-80 
     368The output shows that all attributes are discretized into three intervals:: 
     370    sepal length: <5.5, 6.09999990463> 
     371    sepal width: <2.90000009537, 3.29999995232> 
     372    petal length: <1.89999997616, 4.69999980927> 
     373    petal width: <0.600000023842, 1.0000004768> 
     375.. class:: BiModalDiscretization 
     377    Bi-Modal Discretization 
     379    Sets two cut-off points so that the class distribution of examples in 
     380    between is as different from the overall distribution as possible. The 
     381    difference is measured by the chi-square statistic. All possible cut-off 
     382    points are tried, thus the discretization runs in O(n^2). 
     384    This discretization method is especially suitable for the attributes in 
     385    which the middle region corresponds to normal and the outer regions to 
     386    abnormal values of the attribute. Depending on the nature of the 
     387    attribute, we can treat the lower and higher values separately, thus 
     388    discretizing the attribute into three intervals, or together, in a 
     389    binary attribute whose values correspond to normal and abnormal. 
     391    .. attribute:: split_in_two 
     393        Decides whether the resulting attribute should have three values or two. 
     394        If true (default), we have three intervals and the discretizer is 
     395        of type :class:`BiModalDiscretizer`. If false the result is the  
     396        ordinary :class:`IntervalDiscretizer`. 
     398The Iris dataset has a three-valued class attribute; the classes are setosa, virginica 
     399and versicolor. As the picture below shows, sepal lengths of versicolors are 
     400between lengths of setosas and virginicas (the picture itself is drawn using 
     401LOESS probability estimation). 
     403.. image:: files/bayes-iris.gif 
     405If we merge classes setosa and virginica into one, we can observe whether 
     406the bi-modal discretization would correctly recognize the interval in 
     407which versicolors dominate. 
     409.. literalinclude:: code/discretization.py 
     410    :lines: 84-87 
     412In this script, we have constructed a new class attribute which tells whether 
     413an iris is versicolor or not. We have told how this attribute's value is 
     414computed from the original class value with a simple lambda function. 
     415Finally, we have constructed a new domain and converted the examples. 
     416Now for discretization. 
     418.. literalinclude:: code/discretization.py 
     419    :lines: 97-100 
     421The script prints out the middle intervals:: 
     423    sepal length: (5.400, 6.200] 
     424    sepal width: (2.000, 2.900] 
     425    petal length: (1.900, 4.700] 
     426    petal width: (0.600, 1.600] 
     428Judging by the graph, the cut-off points for "sepal length" make sense. 
     430Additional functions 
     433Some functions and classes that can be used for 
    9434categorization of continuous features. Besides several general classes that 
    10435can help in this task, we also provide a function that may help in 
    12437categorization that can be used for learning. 
    14 .. class:: Orange.feature.discretization.EntropyDiscretization 
    16     Discretize the given feature and return a discretized feature. The new 
    17     attribute's values get computed automatically when they are needed. 
    19     :param attribute: continuous feature to discretize 
    20     :type attribute: :obj:`Orange.data.variable.Variable` 
    21     :param examples: data to discretize 
    22     :type examples: :obj:`Orange.data.Table` 
    23     :param weight: meta feature that stores weights of individual data 
    24           instances 
    25     :type weight: Orange.data.variable.Variable 
    26     :rtype: :obj:`Orange.data.variable.Discrete` 
    28439.. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper 
    34445.. rubric:: Example 
    36 A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
     447FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
    37448for Beginners tutorial shows the use of DiscretizedLearner. Other 
    38449discretization classes from core Orange are listed in chapter on 
    39450`categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
    41 .. note:: 
    42     add from reference http://orange.biolab.si/doc/reference/discretization.htm 
    62470        IntervalDiscretizer, \ 
    63471        ThresholdDiscretizer, \ 
    64         EntropyDiscretization 
     472        EntropyDiscretization, \ 
     473        EquiDistDiscretization, \ 
     474        EquiNDiscretization, \ 
     475        BiModalDiscretization, \ 
     476        Discretization 