Changeset 10171:94f7d4a7208d in orange


Timestamp:
02/12/12 01:37:54 (2 years ago)
Author:
janezd <janez.demsar@…>
Branch:
default
Message:

Cleaned up feature selection module and doc; not finished yet

Files:
3 edited

  • Orange/feature/selection.py

    r9878 r10171  
    1 """ 
    2 ######################### 
    3 Selection (``selection``) 
    4 ######################### 
    5  
    6 .. index:: feature selection 
    7  
    8 .. index:: 
    9    single: feature; feature selection 
    10  
    11 Some machine learning methods perform better if they learn only from a 
    12 selected subset of the most informative or "best" features. 
    13  
    14 This so-called filter approach can improve the learner's 
    15 predictive accuracy, speed up induction, and yield 
    16 simpler models. Feature scores are estimated before 
    17 modeling, without knowing which machine learning method will be 
    18 used to construct the predictive model. 
    19  
    20 :download:`Example script <code/selection-best3.py>` 
    21  
    22 .. literalinclude:: code/selection-best3.py 
    23     :lines: 7- 
    24  
    25 The script should output this:: 
    26  
    27     Best 3 features: 
    28     physician-fee-freeze 
    29     el-salvador-aid 
    30     synfuels-corporation-cutback 
    31  
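Other scoring measures can be plugged in the same way; a minimal 
sketch (assuming the bundled ``voting`` data set and the Orange 2.x 
scoring API)::

    import Orange

    voting = Orange.data.Table("voting")
    # score all features with information gain instead of the default measure
    scores = Orange.feature.scoring.score_all(voting, Orange.feature.scoring.InfoGain())
    print "Best 3 features by information gain:"
    for name in Orange.feature.selection.best_n(scores, 3):
        print name
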
    32 .. autoclass:: Orange.feature.selection.FilterAboveThreshold 
    33    :members: 
    34  
    35 .. autoclass:: Orange.feature.selection.FilterBestN 
    36    :members: 
    37  
    38 .. autoclass:: Orange.feature.selection.FilterRelief 
    39    :members: 
    40  
    41 .. autoclass:: Orange.feature.selection.FilteredLearner 
    42    :members: 
    43  
    44 .. autoclass:: Orange.feature.selection.FilteredClassifier 
    45    :members: 
    46  
    47 These functions support the design of feature subset selection for 
    48 classification problems. 
    49  
    50 .. automethod:: Orange.feature.selection.best_n 
    51  
    52 .. automethod:: Orange.feature.selection.above_threshold 
    53  
    54 .. automethod:: Orange.feature.selection.select_best_n 
    55  
    56 .. automethod:: Orange.feature.selection.select_above_threshold 
    57  
    58 .. automethod:: Orange.feature.selection.select_relief 
    59  
    60 .. rubric:: Examples 
    61  
    62 The following script defines a new naive Bayes classifier that 
    63 selects the five best features from the data set before learning. 
    64 The new classifier is wrapped in a special class (see the 
    65 `Building your own learner <../ofb/c_pythonlearner.htm>`_ lesson 
    66 in `Orange for Beginners <../ofb/default.htm>`_). The 
    67 script compares this filtered learner with one that uses the complete 
    68 set of features. 
    69  
    70 :download:`selection-bayes.py<code/selection-bayes.py>` 
    71  
    72 .. literalinclude:: code/selection-bayes.py 
    73     :lines: 7- 
    74  
    75 Interestingly, though somewhat expectedly, feature subset selection 
    76 helps. This is the output we get:: 
    77  
    78     Learner      CA 
    79     Naive Bayes  0.903 
    80     with FSS     0.940 
    81  
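The wrapper in that script might look roughly like this (a hypothetical 
sketch with assumed names, not the bundled ``selection-bayes.py``)::

    import Orange

    class FSSBayesLearner(object):
        """Hypothetical learner: keep the N best-scored features, then fit naive Bayes."""
        def __init__(self, N=5, name="bayes+fss"):
            self.N = N
            self.name = name

        def __call__(self, data, weight=None):
            scores = Orange.feature.scoring.score_all(data)
            reduced = Orange.feature.selection.select_best_n(data, scores, self.N)
            # naive Bayes trained on the reduced data; Orange converts incoming
            # instances to the reduced domain at prediction time
            return Orange.classification.bayes.NaiveLearner(reduced)
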
    82 We can do all of the above by wrapping the learner using 
    83 ``FilteredLearner``, thus 
    84 creating an object that is assembled from a data filter and a base learner. When 
    85 given a data table, this learner uses the attribute filter to construct a new 
    86 data set and the base learner to construct a corresponding 
    87 classifier. Attribute filters should be classes like 
    88 ``orngFSS.FilterAboveThresh`` or 
    89 ``orngFSS.FilterBestN`` that can be initialized with 
    90 arguments and later presented with data, returning a new, reduced data 
    91 set. 
    92  
    93 The following code fragment replaces the bulk of the code 
    94 from the previous example and compares the naive Bayesian classifier to the 
    95 same classifier when only the single most important attribute is 
    96 used. 
    97  
    98 :download:`selection-filtered-learner.py<code/selection-filtered-learner.py>` 
    99  
    100 .. literalinclude:: code/selection-filtered-learner.py 
    101     :lines: 13-16 
    102  
    103 Now, let's decide to retain three features (change the code in 
    104 :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` accordingly!) and observe how many times 
    105 each attribute was used. Remember, 10-fold cross-validation constructs 
    106 ten instances of each classifier, and each time we run 
    107 ``FilteredLearner`` a different set of features may be 
    108 selected. ``orngEval.CrossValidation`` stores the classifiers in the 
    109 ``results`` variable, and ``FilteredLearner`` 
    110 returns a classifier that can tell which features it used (how 
    111 convenient!), so the code to do all this is quite short. 
    112  
    113 .. literalinclude:: code/selection-filtered-learner.py 
    114     :lines: 25- 
    115  
    116 Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` with three features selected each 
    117 time a learner is run gives the following result:: 
    118  
    119     Learner      CA 
    120     bayes        0.903 
    121     filtered     0.956 
    122  
    123     Number of times features were used in cross-validation: 
    124      3 x el-salvador-aid 
    125      6 x synfuels-corporation-cutback 
    126      7 x adoption-of-the-budget-resolution 
    127     10 x physician-fee-freeze 
    128      4 x crime 
    129  
    130 Experiment yourself to see which attribute is most frequently 
    131 selected across the ten cross-validation folds if only a single 
    132 attribute is retained for the classifier! 
    133  
    134 ========== 
    135 References 
    136 ========== 
    137  
    138 * K. Kira and L. Rendell. A practical approach to feature selection. In 
    139   D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine 
    140   Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers. 
    141  
    142 * I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. 
    143   In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine 
    144   Learning (ECML-94), pages  171-182. Springer-Verlag, 1994. 
    145  
    146 * R. Kohavi, G. John: Wrappers for Feature Subset Selection, Artificial 
    147   Intelligence, 97 (1-2), pages 273-324, 1997. 
    148  
    149 """ 
    150  
    1511__docformat__ = 'restructuredtext' 
    1522 
     
    16010    by :obj:`Orange.feature.scoring.score_all`. 
    16111 
    162     :param scores: a list such as returned by 
    163       :obj:`Orange.feature.scoring.score_all` 
    164     :type scores: list 
    165     :param N: number of best features to select. 
     12    :param scores: a list such as the one returned by 
     13      :obj:`Orange.feature.scoring.score_all` 
     14    :type scores: list 
     15    :param N: number of features to select. 
    16616    :type N: int 
    16717    :rtype: :obj:`list` 
    16818 
    16919    """ 
    170     return map(lambda x:x[0], scores[:N]) 
     20    return [x[0] for x in sorted(scores)[:N]] 
    17121 
    17222bestNAtts = best_n 
     
    18131      :obj:`Orange.feature.scoring.score_all` 
    18232    :type scores: list 
    183     :param threshold: score threshold for attribute selection. Defaults to 0. 
     33    :param threshold: threshold for selection. Defaults to 0. 
    18434    :type threshold: float 
    18535    :rtype: :obj:`list` 
    18636 
    18737    """ 
    188     pairs = filter(lambda x, t=threshold: x[1] > t, scores) 
    189     return map(lambda x: x[0], pairs) 
     38    return [x[0] for x in scores if x[1] > threshold] 
     39 
    19040 
    19141attsAboveThreshold = above_threshold 
     
    19343 
    19444def select_best_n(data, scores, N): 
    195     """Construct and return a new set of examples that includes a 
     45    """Construct and return a new data table that includes a 
    19646    class and only the N best features from the list of scores. 
    19747 
    19848    :param data: an example table 
    19949    :type data: Orange.data.table 
    200     :param scores: a list such as one returned by 
     50    :param scores: a list such as the one returned by 
    20151      :obj:`Orange.feature.scoring.score_all` 
    20252    :type scores: list 
    20353    :param N: number of features to select 
    20454    :type N: int 
    205     :rtype: :class:`Orange.data.table` holding N best features 
     55    :rtype: new data table 
    20656 
    20757    """ 
     
    21262 
    21363def select_above_threshold(data, scores, threshold=0.0): 
    214     """Construct and return a new set of examples that includes a class and 
     64    """Construct and return a new data table that includes a class and 
    21565    features from the list returned by 
    21666    :obj:`Orange.feature.scoring.score_all` that have the score above or 
     
    21969    :param data: an example table 
    22070    :type data: Orange.data.table 
    221     :param scores: a list such as one returned by 
    222       :obj:`Orange.feature.scoring.score_all` 
    223     :type scores: list 
    224     :param threshold: score threshold for attribute selection. Defaults to 0. 
     71    :param scores: a list such as the one returned by 
     72      :obj:`Orange.feature.scoring.score_all` 
     73    :type scores: list 
     74    :param threshold: threshold for selection. Defaults to 0. 
    22575    :type threshold: float 
    226     :rtype: :obj:`list` first N features (without measures) 
     76    :rtype: new data table 
    22777 
    22878    """ 
     
    23484 
    23585def select_relief(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0): 
    236     """Take the data set and use an attribute measure to remove the worst 
    237     scored attribute (those below the margin). Repeats, until no attribute has 
    238     negative or zero score. 
    239  
    240     .. note:: Notice that this filter procedure was originally designed for \ 
    241     measures such as Relief, which are context dependent, i.e., removal of \ 
    242     features may change the scores of other remaining features. Hence the \ 
    243     need to re-estimate score every time an attribute is removed. 
     86    """Iteratively remove the worst scored feature until no feature 
     87    has a score below the margin. The filter procedure was originally 
     88    designed for measures such as Relief, which are context dependent, 
     89    i.e., removal of features may change the scores of other remaining 
     90    features. The score is thus recomputed in each iteration. 
    24491 
    24592    :param data: a data table 
    24693    :type data: Orange.data.table 
    247     :param measure: an attribute measure (derived from 
    248       :obj:`Orange.feature.scoring.Measure`). Defaults to 
    249       :obj:`Orange.feature.scoring.Relief` for k=20 and m=50. 
    250     :param margin: if score is higher than margin, attribute is not removed. 
    251       Defaults to 0. 
     94    :param measure: a feature scorer (derived from 
     95      :obj:`Orange.feature.scoring.Measure`) 
     96    :param margin: margin for removal. Defaults to 0. 
    25297    :type margin: float 
    25398 
     
    256101    while len(data.domain.attributes) > 0 and measl[-1][1] < margin: 
    257102        data = select_best_n(data, measl, len(data.domain.attributes) - 1) 
    258 #        print 'remaining ', len(data.domain.attributes) 
    259103        measl = score_all(data, measure) 
    260104    return data 
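
A usage sketch of ``select_relief`` (the ``voting`` data set bundled with
Orange; the margin value is an assumption for illustration)::

    import orange, Orange
    from Orange.feature.selection import select_relief

    voting = Orange.data.Table("voting")
    # repeatedly drop the worst-scored feature while its Relief score is below the margin
    reduced = select_relief(voting, orange.MeasureAttribute_relief(k=20, m=50), margin=0.01)
    print "kept %d of %d features" % (len(reduced.domain.attributes),
                                      len(voting.domain.attributes))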
     
    303147 
    304148    def __call__(self, data): 
    305         """Take data and return features with scores above given threshold. 
     149        """Return data table features with scores above given threshold. 
    306150 
    307151        :param data: data table 
     
    436280 
    437281class FilteredClassifier: 
     282    """A classifier returned by FilteredLearner.""" 
    438283    def __init__(self, **kwds): 
    439284        self.__dict__.update(kwds) 
  • docs/reference/rst/Orange.feature.selection.rst

    r9372 r10171  
    1 .. automodule:: Orange.feature.selection 
     1.. :py:currentmodule:: Orange.feature.selection 
     2 
     3######################### 
     4Selection (``selection``) 
     5######################### 
     6 
     7.. index:: feature selection 
     8 
     9.. index:: 
     10   single: feature; feature selection 
     11 
      12The feature selection module contains several functions for selecting features based on their scores. A typical example is the function :obj:`select_best_n`, which returns the best n features: 
     13 
     14    .. literalinclude:: code/selection-best3.py 
     15        :lines: 7- 
     16 
     17    The script outputs:: 
     18 
     19        Best 3 features: 
     20        physician-fee-freeze 
     21        el-salvador-aid 
     22        synfuels-corporation-cutback 
     23 
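The same scores can be used to build a reduced data table with
:obj:`select_best_n`; a minimal sketch (assuming the bundled ``voting``
data set)::

    import Orange

    voting = Orange.data.Table("voting")
    scores = Orange.feature.scoring.score_all(voting)
    # keep only the three best-scored features; the class attribute is kept as well
    best3 = Orange.feature.selection.select_best_n(voting, scores, 3)
    print [a.name for a in best3.domain.attributes]
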
     24.. autoclass:: Orange.feature.selection.FilterAboveThreshold(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), threshold=0.0) 
     25   :members: 
     26 
     27.. autoclass:: Orange.feature.selection.FilterBestN(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), n=5) 
     28   :members: 
     29 
     30.. autoclass:: Orange.feature.selection.FilterRelief(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), margin=0) 
     31   :members: 
     32 
      33.. autoclass:: Orange.feature.selection.FilteredLearner(baseLearner, filter=FilterAboveThreshold(), name=filtered) 
     34   :members: 
     35 
     36.. autoclass:: Orange.feature.selection.FilteredClassifier 
     37   :members: 
     38 
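Used directly, such a filter is constructed once and then applied to
data; a minimal sketch (assuming the bundled ``voting`` data set and an
assumed threshold value)::

    import Orange

    voting = Orange.data.Table("voting")
    # keep only the features whose Relief score exceeds the threshold
    flt = Orange.feature.selection.FilterAboveThreshold(threshold=0.01)
    reduced = flt(voting)
    print "kept %d of %d features" % (len(reduced.domain.attributes),
                                      len(voting.domain.attributes))
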
     39These functions support the design of feature subset selection for 
     40classification problems. 
     41 
     42.. automethod:: Orange.feature.selection.best_n 
     43 
     44.. automethod:: Orange.feature.selection.above_threshold 
     45 
     46.. automethod:: Orange.feature.selection.select_best_n 
     47 
     48.. automethod:: Orange.feature.selection.select_above_threshold 
     49 
     50.. automethod:: Orange.feature.selection.select_relief 
     51 
     52.. rubric:: Examples 
     53 
      54The following script defines a new naive Bayes classifier that 
      55selects the five best features from the data set before learning. 
      56The new classifier is wrapped in a special class (see the 
      57`Building your own learner <../ofb/c_pythonlearner.htm>`_ lesson 
      58in `Orange for Beginners <../ofb/default.htm>`_). The 
      59script compares this filtered learner with one that uses the complete 
      60set of features. 
     61 
     62:download:`selection-bayes.py<code/selection-bayes.py>` 
     63 
     64.. literalinclude:: code/selection-bayes.py 
     65    :lines: 7- 
     66 
      67Interestingly, though somewhat expectedly, feature subset selection 
      68helps. This is the output we get:: 
     69 
     70    Learner      CA 
     71    Naive Bayes  0.903 
     72    with FSS     0.940 
     73 
      74We can do all of the above by wrapping the learner using 
      75``FilteredLearner``, thus 
      76creating an object that is assembled from a data filter and a base learner. When 
      77given a data table, this learner uses the attribute filter to construct a new 
      78data set and the base learner to construct a corresponding 
      79classifier. Attribute filters should be classes like 
      80``orngFSS.FilterAboveThresh`` or 
      81``orngFSS.FilterBestN`` that can be initialized with 
      82arguments and later presented with data, returning a new, reduced data 
      83set. 
     84 
      85The following code fragment replaces the bulk of the code 
      86from the previous example and compares the naive Bayesian classifier to the 
      87same classifier when only the single most important attribute is 
      88used. 
     89 
     90:download:`selection-filtered-learner.py<code/selection-filtered-learner.py>` 
     91 
     92.. literalinclude:: code/selection-filtered-learner.py 
     93    :lines: 13-16 
     94 
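Spelled out with the module's current names, that construction might
look roughly like this (a sketch with assumed usage, mirroring the
signatures documented above)::

    import Orange
    from Orange.feature.selection import FilteredLearner, FilterBestN

    voting = Orange.data.Table("voting")
    nb = Orange.classification.bayes.NaiveLearner(name="bayes")
    # wrap naive Bayes so that training data is first reduced to the single best feature
    fl = FilteredLearner(nb, filter=FilterBestN(n=1), name="filtered")
    classifier = fl(voting)
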
      95Now, let's decide to retain three features (change the code in 
      96:download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` accordingly!) and observe how many times 
      97each attribute was used. Remember, 10-fold cross-validation constructs 
      98ten instances of each classifier, and each time we run 
      99``FilteredLearner`` a different set of features may be 
      100selected. ``orngEval.CrossValidation`` stores the classifiers in the 
      101``results`` variable, and ``FilteredLearner`` 
      102returns a classifier that can tell which features it used (how 
      103convenient!), so the code to do all this is quite short. 
     104 
     105.. literalinclude:: code/selection-filtered-learner.py 
     106    :lines: 25- 
     107 
     108Running :download:`selection-filtered-learner.py <code/selection-filtered-learner.py>` with three features selected each 
     109time a learner is run gives the following result:: 
     110 
     111    Learner      CA 
     112    bayes        0.903 
     113    filtered     0.956 
     114 
     115    Number of times features were used in cross-validation: 
     116     3 x el-salvador-aid 
     117     6 x synfuels-corporation-cutback 
     118     7 x adoption-of-the-budget-resolution 
     119    10 x physician-fee-freeze 
     120     4 x crime 
     121 
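For reference, the counting step might be sketched as follows. This is a
hypothetical fragment with assumed access paths: ``results`` is taken to be
the cross-validation output of the script above, with stored classifiers
available as ``results.classifiers``, the filtered classifier second in each
fold, and each filtered classifier exposing the domain it was trained on::

    from collections import defaultdict

    used = defaultdict(int)
    for fold_classifiers in results.classifiers:
        # fold_classifiers[1] is assumed to be the filtered classifier of that fold
        for feature in fold_classifiers[1].domain.attributes:
            used[feature.name] += 1
    for name, count in used.items():
        print "%2d x %s" % (count, name)
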
      122Experiment yourself to see which attribute is most frequently 
      123selected across the ten cross-validation folds if only a single 
      124attribute is retained for the classifier! 
     125 
     126========== 
     127References 
     128========== 
     129 
     130* K. Kira and L. Rendell. A practical approach to feature selection. In 
     131  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine 
      132  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers. 
     133 
     134* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. 
     135  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine 
     136  Learning (ECML-94), pages  171-182. Springer-Verlag, 1994. 
     137 
     138* R. Kohavi, G. John: Wrappers for Feature Subset Selection, Artificial 
      139  Intelligence, 97 (1-2), pages 273-324, 1997. 
  • docs/reference/rst/code/selection-best3.py

    r9644 r10171  
    1010n = 3 
    1111ma = Orange.feature.scoring.score_all(voting) 
    12 best = Orange.feature.selection.bestNAtts(ma, n) 
     12best = Orange.feature.selection.best_n(ma, n) 
    1313print 'Best %d features:' % n 
    1414for s in best: 