source: orange/docs/reference/rst/Orange.data.discretization.rst @ 11477:c1fdcc4deb7b

Revision 11477:c1fdcc4deb7b, 3.6 KB checked in by Miha Stajdohar <miha.stajdohar@…>, 12 months ago (diff)

Removed broken links.

.. py:currentmodule:: Orange.data.discretization

########################################
Data discretization (``discretization``)
########################################

.. index:: discretization

.. index::
   single: data; discretization

Continuous features in the data can be discretized by applying one discretization method uniformly to all of them.
Discretization considers only continuous features and replaces them in the new data set with corresponding categorical features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features with discretized values::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
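
The interval labels in the output above can be reproduced with a few lines of plain Python. The following is
only an illustrative sketch and not Orange's implementation: the helper ``interval_label`` is a hypothetical
name, and the cut points 2.85 and 3.15 are read off the second column (sepal width) of the output shown above.
It assumes a non-empty, sorted list of cut points and intervals that are closed on the right.

.. code-block:: python

    from bisect import bisect_left

    def interval_label(value, cut_points):
        """Return the label of the interval that value falls into.

        cut_points must be sorted in ascending order and non-empty;
        each interval is closed on the right, as in '(2.85, 3.15]'.
        """
        index = bisect_left(cut_points, value)  # number of cut points < value
        if index == 0:
            return "<=%.2f" % cut_points[0]
        if index == len(cut_points):
            return ">%.2f" % cut_points[-1]
        return "(%.2f, %.2f]" % (cut_points[index - 1], cut_points[index])

    # Cut points taken from the sepal width column of the output above.
    sepal_width_cuts = [2.85, 3.15]
    labels = [interval_label(v, sepal_width_cuts) for v in (3.5, 3.0, 3.2)]
    print(labels)

For the three sepal width values of the original instances, the sketch yields the same labels as the second
column of the discretized data set.
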

Data discretization uses feature discretization classes from :doc:`Orange.feature.discretization`
and applies them to the entire data set. The supported discretization methods are:

* equal width discretization, where the domain of a continuous feature is split into intervals of the same
  width (uses :class:`Orange.feature.discretization.EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (uses
  :class:`Orange.feature.discretization.EqualFreq`),
* entropy-based, as originally proposed by [FayyadIrani1993]_, which infers the intervals to minimize
  within-interval entropy of class distributions (uses :class:`Orange.feature.discretization.Entropy`),
* bi-modal, using three intervals to optimize the difference between the class distribution in
  the middle interval and the distribution outside it (uses :class:`Orange.feature.discretization.BiModal`),
* fixed, with user-defined cut-off points.
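
The difference between the first two methods in the list is easiest to see on a skewed sample. The sketch below
is a simplified, standalone illustration in plain Python, not the code behind ``EqualWidth`` or ``EqualFreq``:
the function names are hypothetical, the sample values are made up, and ties between equal values are not
treated specially. It assumes that the number of intervals ``n`` does not exceed the number of values.

.. code-block:: python

    def equal_width_cuts(values, n):
        """Cut points that split [min, max] into n intervals of equal width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / float(n)
        return [lo + i * width for i in range(1, n)]

    def equal_freq_cuts(values, n):
        """Cut points that put about len(values) / n instances in each interval.

        Each cut is placed halfway between two neighbouring sorted values.
        """
        ordered = sorted(values)
        cuts = []
        for i in range(1, n):
            pos = len(ordered) * i // n  # index of the first value in interval i
            cuts.append((ordered[pos - 1] + ordered[pos]) / 2.0)
        return cuts

    # Made-up sample with one outlier.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    width_cuts = equal_width_cuts(values, 3)  # [34.0, 67.0]
    freq_cuts = equal_freq_cuts(values, 3)    # [3.5, 6.5]

With equal width, the outlier stretches the range so that eight of the nine values land in the first interval
and the middle interval stays empty. With equal frequency, each interval receives three values.
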

.. FIXME give a corresponding class for fixed discretization

The default discretization method (equal frequency with three intervals) can be replaced with other
discretization approaches, as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

Entropy-based discretization is special in that it may infer new features that are constant and have only one value. Such
features are redundant and provide no information about the class. By default,
:class:`DiscretizeTable` removes them, in effect performing feature subset selection. The removal of
non-informative features is demonstrated in the following script:

.. literalinclude:: code/discretization-entropy.py
    :lines: 3-

In the sampled data set above, three features were discretized to a constant and thus removed::

    Redundant features (3 of 13):
    cholesterol, rest SBP, age
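
The clean-up step can be pictured as a filter over the discretized columns. The following plain-Python sketch
only illustrates the idea and is not how :class:`DiscretizeTable` is implemented: the function
``drop_constant_features``, the feature names ``a``, ``b``, ``c`` and the toy rows are all hypothetical.

.. code-block:: python

    def drop_constant_features(names, rows):
        """Split feature names into kept and redundant ones.

        A feature is redundant when every row holds the same value for it,
        which is the case when discretization finds no cut point.
        """
        kept, redundant = [], []
        for column, name in enumerate(names):
            distinct = set(row[column] for row in rows)
            if len(distinct) <= 1:
                redundant.append(name)
            else:
                kept.append(name)
        return kept, redundant

    # Toy discretized rows: features "a" and "c" ended up with a single value.
    rows = [
        ("all", "<=1.50", "all"),
        ("all", ">1.50", "all"),
        ("all", "<=1.50", "all"),
    ]
    kept, redundant = drop_constant_features(["a", "b", "c"], rows)
    print("Redundant features (%d of %d):" % (len(redundant), len(rows[0])))
    print(", ".join(redundant))
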

.. note::
    Entropy-based and bi-modal discretization require class-labeled data sets.
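
The reason class labels are required is that the quality of a cut point is measured on the class distribution
inside each interval. The sketch below shows this for a single candidate cut, using the weighted average class
entropy of the two resulting intervals. It is a simplified illustration on made-up data with hypothetical
function names. It does not include the stopping criterion of [FayyadIrani1993]_ and is not the code behind
``Entropy``.

.. code-block:: python

    from collections import Counter
    from math import log

    def entropy(labels):
        """Shannon entropy (in bits) of a non-empty list of class labels."""
        total = float(len(labels))
        counts = Counter(labels).values()
        return -sum((c / total) * log(c / total, 2) for c in counts)

    def split_entropy(values, labels, cut):
        """Weighted average class entropy of the two intervals made by cut."""
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        total = float(len(labels))
        return sum(len(part) / total * entropy(part)
                   for part in (left, right) if part)

    # Made-up feature values and class labels.
    values = [1, 2, 3, 4]
    labels = ["a", "a", "b", "b"]
    good = split_entropy(values, labels, 2.5)  # both intervals are pure
    poor = split_entropy(values, labels, 1.5)  # right interval mixes classes

A cut at 2.5 separates the two classes completely and gives zero within-interval entropy, whereas a cut at 1.5
leaves a mixed interval and scores worse. Without the ``labels`` list, neither score could be computed.
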

Data discretization classes
===========================

.. .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. autoclass:: DiscretizeTable(features=None, discretize_class=False, method=EqualFreq(n=3), clean=True)

.. A chapter on feature subset selection in Orange
   for Beginners tutorial shows the use of DiscretizedLearner. Other
   discretization classes from core Orange are listed in chapter on
   categorization of the same tutorial. -> should put in classification/wrappers

.. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous-valued
  attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages
  1022--1029, Chambery, France, 1993.