Changeset 7770:e1dafa5e0037 in orange


Ignore:
Timestamp:
03/22/11 16:31:26 (3 years ago)
Author:
markotoplak
Branch:
default
Convert:
9e5c89e58c499304849a08679aaf3bd00b280b0f
Message:

Orange 2.5. Discretization: added documentation from the reference.

Location:
orange
Files:
2 added
1 edited

Legend:

Unmodified
Added
Removed
  • orange/Orange/feature/discretization.py

    r7660 r7770  
    66   single: feature; discretization 
    77 
    8 This module implements some functions and classes that can be used for 
     8 
     9Example-based automatic discretization is in essence similar to learning: 
     10given a set of examples, discretization method proposes a list of suitable 
     11intervals to cut the attribute's values into. For this reason, Orange 
     12structures for discretization resemble its structures for learning. Objects 
     13derived from ``orange.Discretization`` play a role of "learner" that,  
     14upon observing the examples, construct an ``orange.Discretizer`` whose role 
     15is to convert continuous values into discrete according to the rule found by 
     16``Discretization``. 
     17 
     18Orange supports several methods of discretization; here's a 
     19list of methods with belonging classes. 
     20 
     21* Equi-distant discretization (:class:`EquiDistDiscretization`,  
     22  :class:`EquiDistDiscretizer`). The range of attribute's values is split 
     23  into prescribed number equal-sized intervals. 
     24* Quantile-based discretization (:class:`EquiNDiscretization`, 
     25  :class:`IntervalDiscretizer`). The range is split into intervals 
     26  containing equal number of examples. 
     27* Entropy-based discretization (:class:`EntropyDiscretization`, 
     28  :class:`IntervalDiscretizer`). Developed by Fayyad and Irani, 
     29  this method balances between entropy in intervals and MDL of discretization. 
     30* Bi-modal discretization (:class:`BiModalDiscretization`, 
     31  :class:`BiModalDiscretizer`/:class:`IntervalDiscretizer`). 
     32  Two cut-off points set to optimize the difference of the distribution in 
     33  the middle interval and the distributions outside it. 
     34* Fixed discretization (:class:`IntervalDiscretizer`). Discretization with  
     35  user-prescribed cut-off points. 
     36 
     37.. _discretization.py: code/discretization.py 
     38 
     39Instances of classes derived from :class:`Discretization`. It define a 
     40single method: the call operator. The object can also be called through 
     41constructor. 
     42 
     43.. class:: Discretization 
     44 
     45    .. method:: __call__(attribute, examples[, weightID]) 
     46 
     47        Given a continuous ``attribute`, ``examples`` and, optionally id of 
     48        attribute with example weight, this function returns a discretized 
     49        attribute. Argument ``attribute`` can be a descriptor, index or 
     50        name of the attribute. 
     51 
     52Here's an example. Part of `discretization.py`_: 
     53 
     54.. literalinclude:: code/discretization.py 
     55    :lines: 7-15 
     56 
     57The discretized attribute ``sep_w`` is constructed with a call to 
     58:class:`EntropyDiscretization` (instead of constructing it and calling 
     59it afterwards, we passed the arguments for calling to the constructor, as 
     60is often allowed in Orange). We then constructed a new  
     61:class:`Orange.data.Table` with attributes "sepal width" (the original  
     62continuous attribute), ``sep_w`` and the class attribute. Script output is:: 
     63 
     64    Entropy discretization, first 10 examples 
     65    [3.5, '>3.30', 'Iris-setosa'] 
     66    [3.0, '(2.90, 3.30]', 'Iris-setosa'] 
     67    [3.2, '(2.90, 3.30]', 'Iris-setosa'] 
     68    [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
     69    [3.6, '>3.30', 'Iris-setosa'] 
     70    [3.9, '>3.30', 'Iris-setosa'] 
     71    [3.4, '>3.30', 'Iris-setosa'] 
     72    [3.4, '>3.30', 'Iris-setosa'] 
     73    [2.9, '<=2.90', 'Iris-setosa'] 
     74    [3.1, '(2.90, 3.30]', 'Iris-setosa'] 
     75 
     76:class:`EntropyDiscretization` named the new attribute's values by the 
     77interval range (it also named the attribute as "D_sepal width"). The new 
     78attribute's values get computed automatically when they are needed. 
     79 
     80As those that have read about :class:`Orange.data.variable.Variable` know, 
     81the answer to  
     82"How this works?" is hidden in the field  
     83:obj:`~Orange.data.variable.Variable.get_value_from`. 
     84This little dialog reveals the secret. 
     85 
     86:: 
     87 
     88    >>> sep_w 
     89    EnumVariable 'D_sepal width' 
     90    >>> sep_w.get_value_from 
     91    <ClassifierFromVar instance at 0x01BA7DC0> 
     92    >>> sep_w.get_value_from.whichVar 
     93    FloatVariable 'sepal width' 
     94    >>> sep_w.get_value_from.transformer 
     95    <IntervalDiscretizer instance at 0x01BA2100> 
     96    >>> sep_w.get_value_from.transformer.points 
     97    <2.90000009537, 3.29999995232> 
     98 
     99So, the ``select`` statement in the above example converted all examples 
     100from ``data`` to the new domain. Since the new domain includes the attribute 
     101``sep_w`` that is not present in the original, ``sep_w``'s values are 
     102computed on the fly. For each example in ``data``, ``sep_w.get_value_from``  
     103is called to compute ``sep_w``'s value (if you ever need to call 
     104``get_value_from``, you shouldn't call ``get_value_from`` directly but call 
     105``compute_value`` instead). ``sep_w.get_value_from`` looks for value of 
     106"sepal width" in the original example. The original, continuous sepal width 
     107is passed to the ``transformer`` that determines the interval by its field 
     108``points``. Transformer returns the discrete value which is in turn returned 
     109by ``get_value_from`` and stored in the new example. 
     110 
     111You don't need to understand this mechanism exactly. It's important to know 
     112that there are two classes of objects for discretization. Those derived from 
     113:obj:`Discretizer` (such as :obj:`IntervalDiscretizer` that we've seen above) 
     114are used as transformers that translate continuous value into discrete. 
     115Discretization algorithms are derived from :obj:`Discretization`. Their  
     116job is to construct a :obj:`Discretizer` and return a new variable 
     117with the discretizer stored in ``get_value_from.transformer``. 
     118 
     119Discretizers 
     120============ 
     121 
     122Different discretizers support different methods for conversion of 
     123continuous values into discrete. The most general is  
     124:class:`IntervalDiscretizer` that is also used by most discretization 
     125methods. Two other discretizers, :class:`EquiDistDiscretizer` and  
     126:class:`ThresholdDiscretizer`> could easily be replaced by  
     127:class:`IntervalDiscretizer` but are used for speed and simplicity. 
     128The fourth discretizer, :class:`BiModalDiscretizer` is specialized 
     129for discretizations induced by :class:`BiModalDiscretization`. 
     130 
     131.. class:: Discretizer 
     132 
     133    All discretizers support a handy method for construction of a new 
     134    attribute from an existing one. 
     135 
     136    .. method:: construct_variable(attribute) 
     137 
     138        Constructs a new attribute descriptor; the new attribute is discretized 
     139        ``attribute``. The new attribute's name equal ``attribute.name``  
     140        prefixed  by "D_", and its symbolic values are discretizer specific. 
     141        The above example shows what comes out form :class:`IntervalDiscretizer`.  
     142        Discretization algorithms actually first construct a discretizer and 
     143        then call its :class:`construct_variable` to construct an attribute 
     144        descriptor. 
     145 
     146.. class:: IntervalDiscretizer 
     147 
     148    The most common discretizer.  
     149 
     150    .. attribute:: points 
     151 
     152        Cut-off points. All values below or equal to the first point belong 
     153        to the first interval, those between the first and the second 
     154        (including those equal to the second) go to the second interval and 
     155        so forth to the last interval which covers all values greater than 
     156        the last element in ``points``. The number of intervals is thus  
     157        ``len(points)+1``. 
     158 
     159Let us manually construct an interval discretizer with cut-off points at 3.0 
     160and 5.0. We shall use the discretizer to construct a discretized sepal length  
     161(part of `discretization.py`_): 
     162 
     163.. literalinclude:: code/discretization.py 
     164    :lines: 22-26 
     165 
     166That's all. First five examples of ``data2`` are now 
     167 
     168:: 
     169 
     170    [5.1, '>5.00', 'Iris-setosa'] 
     171    [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
     172    [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
     173    [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
     174    [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
     175 
     176Can you use the same discretizer for more than one attribute? Yes, as long 
     177as they have same cut-off points, of course. Simply call construct_var for each 
     178continuous attribute (part of `discretization.py`_): 
     179 
     180.. literalinclude:: code/discretization.py 
     181    :lines: 30-34 
     182 
     183Each attribute now has its own (FIXME) ClassifierFromVar in its  
     184``get_value_from``, but all use the same :class:`IntervalDiscretizer`,  
     185``idisc``. Changing an element of its ``points`` affect all attributes. 
     186 
     187Do not change the length of :obj:`~IntervalDiscretizer.points` if the 
     188discretizer is used by any attribute. The length of 
     189:obj:`~IntervalDiscretizer.points` should always match the number of values 
     190of the attribute, which is determined by the length of the attribute's field 
     191``values``. Therefore, if ``attr`` is a discretized 
     192attribute, than ``len(attr.values)`` must equal 
     193``len(attr.get_value_from.transformer.points)+1``. It always 
     194does, unless you deliberately change it. If the sizes don't match, 
     195Orange will probably crash, and it will be entirely your fault. 
     196 
     197 
     198 
     199.. class:: EquiDistDiscretizer 
     200 
     201    More rigid than :obj:`IntervalDiscretizer`:  
     202    it uses intervals of fixed width. 
     203 
     204    .. attribute:: first_cut 
     205         
     206        The first cut-off point. 
     207     
     208    .. attribute:: step 
     209 
     210        Width of intervals. 
     211 
     212    .. attribute:: number_of_intervals 
     213         
     214        Number of intervals. 
     215 
     216    .. attribute:: points (read-only) 
     217         
     218        The cut-off points; this is not a real attribute although it behaves 
     219        as one. Reading it constructs a list of cut-off points and returns it, 
     220        but changing the list doesn't affect the discretizer - it's a separate 
     221        list. This attribute is here only for to give the  
     222        :obj:`EquiDistDiscretizer` the same interface as that of  
     223        :obj:`IntervalDiscretizer`. 
     224 
     225All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first 
     226intervala (including possible values smaller than ``firstVal``. Otherwise, 
     227value ``val``'s interval is ``floor((val-firstVal)/step)``. If this is turns 
     228out to be greater or equal to :obj:`~EquiDistDiscretizer.number_of_intervals`,  
     229it is decreased to ``number_of_intervals-1``. 
     230 
     231This discretizer is returned by :class:`EquiDistDiscretization`; you can 
     232see an example in the corresponding section. You can also construct it  
     233manually and call its ``construct_variable``, just as shown for the 
     234:obj:`IntervalDiscretizer`. 
     235 
     236 
     237.. class:: ThresholdDiscretizer 
     238 
     239    Threshold discretizer converts continuous values into binary by comparing 
     240    them with a threshold. This discretizer is actually not used by any 
     241    discretization method, but you can use it for manual discretization. 
     242    Orange needs this discretizer for binarization of continuous attributes 
     243    in decision trees. 
     244 
     245    .. attribute:: threshold 
     246 
     247        Threshold; values below or equal to the threshold belong to the first 
     248        interval and those that are greater go to the second. 
     249 
     250.. class:: BiModalDiscretizer 
     251 
     252    This discretizer is the first discretizer that couldn't be replaced by 
     253    :class:`IntervalDiscretizer`. It has two cut off points and values are 
     254    discretized according to whether they belong to the middle region 
     255    (which includes the lower but not the upper boundary) or not. The 
     256    discretizer is returned by :class:`BiModalDiscretization` if its 
     257    field :obj:`~BiModalDiscretization.split_in_two` is true (the default). 
     258 
     259    .. attribute:: low 
     260         
     261        Lower boudary of the interval (included in the interval). 
     262 
     263    .. attribute:: high 
     264 
     265        Upper boundary of the interval (not included in the interval). 
     266 
     267 
     268Discretization Algorithms 
     269========================= 
     270 
     271.. class:: EquiDistDiscretization  
     272 
     273    Discretizes the attribute by cutting it into the prescribed number 
     274    of intervals of equal width. The examples are needed to determine the  
     275    span of attribute values. The interval between the smallest and the 
     276    largest is then cut into equal parts. 
     277 
     278    .. attribute:: number_of_intervals 
     279 
     280        Number of intervals into which the attribute is to be discretized.  
     281        Default value is 4. 
     282 
     283For an example, we shall discretize all attributes of Iris dataset into 6 
     284intervals. We shall construct an :class:`Orange.data.Table` with discretized 
     285attributes and print description of the attributes (part 
     286of `discretization.py`_): 
     287 
     288.. literalinclude:: code/discretization.py 
     289    :lines: 38-43 
     290 
     291Script's answer is 
     292 
     293:: 
     294 
     295    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
     296    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
     297    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
     298    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
     299 
     300Any more decent ways for a script to find the interval boundaries than  
     301by parsing the symbolic values? Sure, they are hidden in the discretizer, 
     302which is, as usual, stored in ``attr.get_value_from.transformer``. 
     303 
     304Compare the following with the values above. 
     305 
     306:: 
     307 
     308    >>> for attr in newattrs: 
     309    ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
     310    ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
     311    ...    attr.get_value_from.transformer.step) 
     312    D_sepal length: first interval at 4.900, step 0.600 
     313    D_sepal width: first interval at 2.400, step 0.400 
     314    D_petal length: first interval at 1.980, step 0.980 
     315    D_petal width: first interval at 0.500, step 0.400 
     316 
     317As all discretizers, :class:`EquiDistDiscretizer` also has the method  
     318``construct_variable`` (part of `discretization.py`_): 
     319 
     320.. literalinclude:: code/discretization.py 
     321    :lines: 69-73 
     322 
     323 
     324.. class:: EquiNDiscretization 
     325 
     326    Discretization with Intervals Containing (Approximately) Equal Number 
     327    of Examples. 
     328 
     329    Discretizes the attribute by cutting it into the prescribed number of 
     330    intervals so that each of them contains equal number of examples. The 
     331    examples are obviously needed for this discretization, too. 
     332 
     333    .. attribute:: number_of_intervals 
     334 
     335        Number of intervals into which the attribute is to be discretized. 
     336        Default value is 4. 
     337 
     338The use of this discretization is the same as the use of  
     339:class:`EquiDistDiscretization`. The resulting discretizer is  
     340:class:`IntervalDiscretizer`, hence it has ``points`` instead of ``first_cut``/ 
     341``step``/``number_of_intervals``. 
     342 
     343.. class:: EntropyDiscretization 
     344 
     345    Entropy-based Discretization (Fayyad-Irani). 
     346 
     347    Fayyad-Irani's discretization method works without a predefined number of 
     348    intervals. Instead, it recursively splits intervals at the cut-off point 
     349    that minimizes the entropy, until the entropy decrease is smaller than the 
     350    increase of MDL induced by the new point. 
     351 
     352    An interesting thing about this discretization technique is that an 
     353    attribute can be discretized into a single interval, if no suitable 
     354    cut-off points are found. If this is the case, the attribute is rendered 
     355    useless and can be removed. This discretization can therefore also serve 
     356    for feature subset selection. 
     357 
     358    .. attribute:: force_attribute 
     359 
     360        Forces the algorithm to induce at least one cut-off point, even when 
     361        its information gain is lower than MDL (default: false). 
     362 
     363Part of `discretization.py`_: 
     364 
     365.. literalinclude:: code/discretization.py 
     366    :lines: 77-80 
     367 
     368The output shows that all attributes are discretized onto three intervals:: 
     369 
     370    sepal length: <5.5, 6.09999990463> 
     371    sepal width: <2.90000009537, 3.29999995232> 
     372    petal length: <1.89999997616, 4.69999980927> 
     373    petal width: <0.600000023842, 1.0000004768> 
     374 
     375.. class:: BiModalDiscretization 
     376 
     377    Bi-Modal Discretization 
     378 
     379    Sets two cut-off points so that the class distribution of examples in 
     380    between is as different from the overall distribution as possible. The 
     381    difference is measure by chi-square statistics. All possible cut-off 
     382    points are tried, thus the discretization runs in O(n^2). 
     383 
     384    This discretization method is especially suitable for the attributes in 
     385    which the middle region corresponds to normal and the outer regions to 
     386    abnormal values of the attribute. Depending on the nature of the 
     387    attribute, we can treat the lower and higher values separately, thus 
     388    discretizing the attribute into three intervals, or together, in a 
     389    binary attribute whose values correspond to normal and abnormal. 
     390 
     391    .. attribute:: split_in_two 
     392         
     393        Decides whether the resulting attribute should have three or two. 
     394        If true (default), we have three intervals and the discretizer is 
     395        of type :class:`BiModalDiscretizer`. If false the result is the  
     396        ordinary :class:`IntervalDiscretizer`. 
     397 
     398Iris dataset has three-valued class attribute, classes are setosa, virginica 
     399and versicolor. As the picture below shows, sepal lenghts of versicolors are 
     400between lengths of setosas and virginicas (the picture itself is drawn using 
     401LOESS probability estimation). 
     402 
     403.. image:: files/bayes-iris.gif 
     404 
     405If we merge classes setosa and virginica into one, we can observe whether 
     406the bi-modal discretization would correctly recognize the interval in 
     407which versicolors dominate. 
     408 
     409.. literalinclude:: code/discretization.py 
     410    :lines: 84-87 
     411 
     412In this script, we have constructed a new class attribute which tells whether 
     413an iris is versicolor or not. We have told how this attribute's value is 
     414computed from the original class value with a simple lambda function. 
     415Finally, we have constructed a new domain and converted the examples. 
     416Now for discretization. 
     417 
     418.. literalinclude:: code/discretization.py 
     419    :lines: 97-100 
     420 
     421The script prints out the middle intervals:: 
     422 
     423    sepal length: (5.400, 6.200] 
     424    sepal width: (2.000, 2.900] 
     425    petal length: (1.900, 4.700] 
     426    petal width: (0.600, 1.600] 
     427 
     428Judging by the graph, the cut-off points for "sepal length" make sense. 
     429 
     430Additional functions 
     431==================== 
     432 
     433Some functions and classes that can be used for 
    9434categorization of continuous features. Besides several general classes that 
    10435can help in this task, we also provide a function that may help in 
     
    12437categorization that can be used for learning. 
    13438 
    14 .. class:: Orange.feature.discretization.EntropyDiscretization 
    15      
    16     Discretize the given feature's and return a discretized feature. The new 
    17     attribute's values get computed automatically when they are needed. 
    18      
    19     :param attribute: continuous feature to discretize 
    20     :type attribute: :obj:`Orange.data.variable.Variable` 
    21     :param examples: data to discretize 
    22     :type examples: :obj:`Orange.data.Table` 
    23     :param weight: meta feature that stores weights of individual data 
    24           instances 
    25     :type weight: Orange.data.variable.Variable 
    26     :rtype: :obj:`Orange.data.variable.Discrete` 
    27          
    28439.. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper 
    29440 
     
    34445.. rubric:: Example 
    35446 
    36 A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
     447FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
    37448for Beginners tutorial shows the use of DiscretizedLearner. Other 
    38449discretization classes from core Orange are listed in chapter on 
    39450`categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
    40  
    41 .. note:: 
    42     add from reference http://orange.biolab.si/doc/reference/discretization.htm 
    43451 
    44452========== 
     
    62470        IntervalDiscretizer, \ 
    63471        ThresholdDiscretizer, \ 
    64         EntropyDiscretization 
     472        EntropyDiscretization, \ 
     473        EquiDistDiscretization, \ 
     474        EquiNDiscretization, \ 
     475        BiModalDiscretization, \ 
     476        Discretization 
    65477 
    66478###### 
Note: See TracChangeset for help on using the changeset viewer.