Changeset 7321:0be10815e8da in orange


Timestamp:
02/03/11 13:27:03 (3 years ago)
Author:
tomazc <tomaz.curk@…>
Branch:
default
Convert:
67ad65a015bc6adc7456e079b5c4e2c0c08f3a05
Message:

Documentation and code refactoring at Bohinj retreat.

Location:
orange/Orange/feature
Files:
2 edited

  • orange/Orange/feature/scoring.py

    r7290 r7321  
    11""" 
    2 Feature scoring is normally used in feature subset selection for classification 
    3 problems. 
    4  
    5 Let start with a simple script that reads the data, uses :obj:`attMeasure` to 
    6 derive attribute scores and prints out these for the first three best scored 
    7 attributes. Same scoring function is then used to report (only) on three best 
    8 score attributes. 
    9  
    10 `fss1.py`_ (uses `voting.tab`_):: 
    11  
    12     import orange, orngFSS 
    13     data = orange.ExampleTable("voting") 
    14  
    15     print 'Score estimate for first three attributes:' 
    16     ma = orngFSS.attMeasure(data) 
    17     for m in ma[:3]: 
    18         print "%5.3f %s" % (m[1], m[0]) 
    19  
    20     n = 3 
    21     best = orngFSS.bestNAtts(ma, n) 
    22     print '\\nBest %d attributes:' % n 
    23     for s in best: 
    24         print s 
     2Feature scoring is used in feature subset selection for classification 
     3problems. The goal is to find "good" features that are relevant for the given 
     4classification task. 
     5 
     6Here is a simple script that reads the data, uses :obj:`attMeasure` to 
     7derive feature scores, and prints the scores of the three best-scored 
     8features. The same scoring function is then used to report the names of 
     9the three best features. 
     10 
     11.. _scoring-all.py: code/scoring-all.py 
     12.. _voting.tab: code/voting.tab 
     13 
     14`scoring-all.py`_ (uses `voting.tab`_): 
     15 
     16.. literalinclude:: code/scoring-all.py 
     17    :lines: 7- 
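The file pulled in above is not reproduced in this changeset. As a rough sketch, using the older ``orngFSS`` names from the removed fss1.py listing (the shipped `scoring-all.py`_ may use the newer :obj:`Orange.feature.scoring` names instead), the script does approximately the following::

    import orange, orngFSS

    data = orange.ExampleTable("voting")

    # score all features; attMeasure returns (name, score) pairs sorted by score
    ma = orngFSS.attMeasure(data)
    print 'Feature scores for best three features:'
    for m in ma[:3]:
        print "%5.3f %s" % (m[1], m[0])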
    2518 
    2619The script should output this:: 
    2720 
    28     Attribute scores for best three attributes: 
    29     Attribute scores for best three attributes: 
    30     0.728 physician-fee-freeze 
    31     0.329 adoption-of-the-budget-resolution 
    32     0.321 synfuels-corporation-cutback 
    33  
    34     Best 3 attributes: 
    35     physician-fee-freeze 
    36     adoption-of-the-budget-resolution 
    37     synfuels-corporation-cutback 
    38  
    39 Functions and classes for feature scoring: 
     21    Feature scores for best three features: 
     22    0.613 physician-fee-freeze 
     23    0.255 adoption-of-the-budget-resolution 
     24    0.228 synfuels-corporation-cutback 
    4025 
    4126.. autoclass:: Orange.feature.scoring.OrderAttributesByMeasure 
     
    6348.. note: add links to gain ratio, relief and other feature scores 
    6449 
    65 The following script reports on gain ratio and relief attribute 
    66 scores. Notice that for our data set the ranks of the attributes 
    67 rather match well! 
    68  
    69 `fss2.py`_ (uses `voting.tab`_):: 
    70  
    71     import orange, orngFSS 
    72     data = orange.ExampleTable("voting") 
    73  
    74     print 'Relief GainRt Attribute' 
    75     ma_def = orngFSS.attMeasure(data) 
    76     gainRatio = orange.MeasureAttribute_gainRatio() 
    77     ma_gr  = orngFSS.attMeasure(data, gainRatio) 
    78     for i in range(5): 
    79         print "%5.3f  %5.3f  %s" % (ma_def[i][1], ma_gr[i][1], ma_def[i][0]) 
     50The following script reports on gain ratio and relief feature scores. 
     51 
     52`scoring-relief-gainRatio.py`_ (uses `voting.tab`_): 
     53 
     54.. literalinclude:: code/scoring-relief-gainRatio.py 
     55    :lines: 7- 
     56     
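Again, the included file is not shown in the changeset; a minimal sketch of what it does, written with the older ``orange``/``orngFSS`` names taken from the removed fss2.py listing above (the shipped `scoring-relief-gainRatio.py`_ may use the newer names), is::

    import orange, orngFSS

    data = orange.ExampleTable("voting")

    print 'Relief GainRt Feature'
    ma_def = orngFSS.attMeasure(data)   # default measure is Relief
    ma_gr = orngFSS.attMeasure(data, orange.MeasureAttribute_gainRatio())
    for i in range(5):
        print "%5.3f  %5.3f  %s" % (ma_def[i][1], ma_gr[i][1], ma_def[i][0])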
     57Notice that on this data the ranks of features match rather well:: 
     58     
     59    Relief GainRt Feature 
     60    0.613  0.752  physician-fee-freeze 
     61    0.255  0.444  el-salvador-aid 
     62    0.228  0.414  synfuels-corporation-cutback 
     63    0.189  0.382  crime 
     64    0.166  0.345  adoption-of-the-budget-resolution 
    8065 
    8166========== 
     
    8368========== 
    8469 
    85 * Kononeko: Strojno ucenje. 
     70* I. Kononenko: Strojno ucenje (Machine Learning). Zalozba FE in FRI, Ljubljana, 2005. 
     71 
     72.. _scoring-relief-gainRatio.py: code/scoring-relief-gainRatio.py 
     73.. _voting.tab: code/voting.tab 
    8674 
    8775""" 
    8876 
    8977import Orange.core as orange 
     78from orange import MeasureAttribute_gainRatio as GainRatio 
     79from orange import MeasureAttribute as Measure 
     80from orange import MeasureAttribute_relief as Relief 
     81from orange import MeasureAttribute_info as InfoGain 
     82from orange import MeasureAttribute_gini as Gini 
    9083 
    9184###### 
     
    9487    """Construct an instance that orders features by their scores. 
    9588     
    96     :param measure: an attribute measure, derived from  
    97       :obj:`orange.MeasureAttribute`. 
     89    :param measure: a feature measure, derived from  
     90      :obj:`Orange.feature.scoring.Measure`. 
    9891     
    9992    """ 
     
    10396    def __call__(self, data, weight): 
    10497        """Take :obj:`Orange.data.table` data table and an instance of 
    105         :obj:`orange.MeasureAttribute` to score and order features.   
     98        :obj:`Orange.feature.scoring.Measure` to score and order features.   
    10699 
    107100        :param data: a data table used to score features 
     
    114107        """ 
    115108        if self.measure: 
    116             measure=self.measure 
     109            measure = self.measure 
    117110        else: 
    118             measure=orange.MeasureAttribute_relief(m=5,k=10) 
    119  
    120         measured=[(attr, measure(attr, data, None, weight)) for attr in data.domain.attributes] 
     111            measure = Relief(m=5,k=10) 
     112 
     113        measured = [(attr, measure(attr, data, None, weight)) for attr in data.domain.attributes] 
    121114        measured.sort(lambda x, y: cmp(x[1], y[1])) 
    122115        return [x[0] for x in measured] 
     
    167160            dist += list(vals) 
    168161        classAttrEntropy = Entropy(numpy.array(dist)) 
    169         infoGain = orange.MeasureAttribute_info(attr, data) 
     162        infoGain = InfoGain(attr, data) 
    170163        if classAttrEntropy > 0: 
    171164            return float(infoGain) / classAttrEntropy 
     
    277270###### 
    278271# from orngFSS 
    279 def attMeasure(data, measure = orange.MeasureAttribute_relief(k=20, m=50)): 
    280     """Assess the quality of attributes using the given measure and return 
    281     a sorted list of tuples (attribute name, measure). 
     272def attMeasure(data, measure=Relief(k=20, m=50)): 
     273    """Assess the quality of features using the given measure and return 
     274    a sorted list of tuples (feature name, measure). 
    282275 
    283276    :param data: data table should include a discrete class. 
    284     :type data: Orange.data.table. 
    285     :param measure:  attribute scoring function. Derived from 
    286       :obj:`orange.MeasureAttribute`. Defaults to Defaults to  
    287       :obj:`orange.MeasureAttribute_relief` with k=20 and m=50. 
    288     :type measure: :obj:`orange.MeasureAttribute`  
    289     :rtype: :obj:`list` a sorted list of tuples (attribute name, score) 
     277    :type data: :obj:`Orange.data.table` 
      278    :param measure: feature scoring function. Derived from 
      279      :obj:`Orange.feature.scoring.Measure`. Defaults to 
     280      :obj:`Orange.feature.scoring.Relief` with k=20 and m=50. 
     281    :type measure: :obj:`Orange.feature.scoring.Measure`  
     282    :rtype: :obj:`list` a sorted list of tuples (feature name, score) 
    290283 
    291284    """ 
  • orange/Orange/feature/selection.py

    r7290 r7321  
    33 
    44Some machine learning methods may perform better if they learn only from a  
    5 selected subset of "best" features. Here we mostly implement filter approaches, 
    6 were feature scores are estimated prior to the modelling, that is, without 
    7 knowing of which machine learning method will be used to construct a predictive 
    8 model. 
     5selected subset of "best" features.  
     6 
      7The performance of some machine learning methods can be improved by learning 
      8only from a selected subset of the data that includes the most informative or 
      9"best" features. This so-called filter approach can boost the performance of 
      10a learner in terms of predictive accuracy, speed of induction, and the 
      11simplicity of the resulting models. Feature scores are estimated prior to 
      12modelling, that is, without knowing which machine learning method will be 
      13used to construct the predictive model. 
     14 
     15`selection-best3.py`_ (uses `voting.tab`_): 
     16 
     17.. literalinclude:: code/selection-best3.py 
     18    :lines: 7- 
     19 
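The included `selection-best3.py`_ is not reproduced in the changeset; a minimal sketch of the idea, again using the older ``orngFSS`` names from the removed listings (the shipped script may use the newer :obj:`Orange.feature` names), is::

    import orange, orngFSS

    data = orange.ExampleTable("voting")

    n = 3
    ma = orngFSS.attMeasure(data)      # score features, Relief by default
    best = orngFSS.bestNAtts(ma, n)    # names of the n best-scored features
    print 'Best %d features:' % n
    for name in best:
        print name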
     20The script should output this:: 
     21 
     22    Best 3 features: 
     23    physician-fee-freeze 
     24    el-salvador-aid 
     25    synfuels-corporation-cutback 
    926 
    1027.. automethod:: Orange.feature.selection.FilterAttsAboveThresh 
     
    4562 
    4663 
    47 ==================================== 
    48 Filter Approach for Machine Learning 
    49 ==================================== 
    50  
    51 Attribute scoring has at least two potential uses. One is 
    52 informative (or descriptive): the data analyst can use attribute 
    53 scoring to find "good" features and those that are irrelevant for 
    54 given classification task. The other use is in improving the 
    55 performance of machine learning by learning only from the data set 
    56 that includes the most informative features. This so-called filter 
    57 approach can boost the performance of learner both in terms of 
    58 predictive accuracy, speed-up of induction, and simplicity of 
    59 resulting models. 
     64======== 
     65Examples 
     66======== 
    6067 
    6168Following is a script that defines a new classifier that is based 
     
    6774set of features. 
    6875 
    69 `fss3.py`_ (uses `voting.tab`_):: 
    70  
    71     import orange, orngFSS 
    72  
    73     class BayesFSS(object): 
    74         def __new__(cls, examples=None, **kwds): 
    75             learner = object.__new__(cls, **kwds) 
    76             if examples: 
    77                 return learner(examples) 
    78             else: 
    79                 return learner 
    80      
    81         def __init__(self, name='Naive Bayes with FSS', N=5): 
    82             self.name = name 
    83             self.N = 5 
    84      
    85         def __call__(self, data, weight=None): 
    86             ma = orngFSS.attMeasure(data) 
    87             filtered = orngFSS.selectBestNAtts(data, ma, self.N) 
    88             model = orange.BayesLearner(filtered) 
    89             return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name) 
    90      
    91     class BayesFSS_Classifier: 
    92         def __init__(self, **kwds): 
    93             self.__dict__.update(kwds) 
    94      
    95         def __call__(self, example, resultType = orange.GetValue): 
    96             return self.classifier(example, resultType) 
    97      
    98     # test above wraper on a data set 
    99     import orngStat, orngTest 
    100     data = orange.ExampleTable("voting") 
    101     learners = (orange.BayesLearner(name='Naive Bayes'), BayesFSS(name="with FSS")) 
    102     results = orngTest.crossValidation(learners, data) 
    103      
    104     # output the results 
    105     print "Learner      CA" 
    106     for i in range(len(learners)): 
    107         print "%-12s %5.3f" % (learners[i].name, orngStat.CA(results)[i]) 
     76`selection-bayes.py`_ (uses `voting.tab`_): 
     77 
     78.. literalinclude:: code/selection-bayes.py 
     79    :lines: 7- 
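The full example is included from `selection-bayes.py`_ above; the heart of the wrapper, condensed from the removed fss3.py listing (the original also defines a ``__new__`` shortcut and a small classifier class, and the shipped script may use the newer :obj:`Orange.feature` names), looks roughly like this::

    import orange, orngFSS

    class BayesFSS(object):
        def __init__(self, name='Naive Bayes with FSS', N=5):
            self.name = name
            self.N = N   # the removed listing sets self.N = 5, ignoring the argument

        def __call__(self, data, weight=None):
            ma = orngFSS.attMeasure(data)                         # score all features
            filtered = orngFSS.selectBestNAtts(data, ma, self.N)  # keep the N best
            return orange.BayesLearner(filtered)                  # classifier built on reduced data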
    10880 
    10981Interestingly, and somewhat expectedly, feature subset selection 
     
    11486    with FSS     0.940 
    11587 
    116 ====================== 
    117 And a Much Simpler One 
    118 ====================== 
    119  
    120 Although perhaps educational, we can do all of the above by 
    121 wrapping the learner using <code>FilteredLearner</code>, thus creating 
    122 an object that is assembled from data filter and a base learner. When 
      88Now for a much simpler example. Although perhaps educational, we can do all of 
      89the above by wrapping the learner using :obj:`FilteredLearner`, thus 
      90creating an object that is assembled from a data filter and a base learner. When 
     12391given the data, this learner uses the attribute filter to construct a new 
     12492data set and the base learner to construct a corresponding 
     
    134102used. 
    135103 
    136 `fss4.py`_ (uses `voting.tab`_):: 
    137  
    138     nb = orange.BayesLearner() 
    139     learners = (orange.BayesLearner(name='bayes'), 
    140                 FilteredLearner(nb, filter=FilterBestNAtts(n=1), name='filtered')) 
    141     results = orngEval.CrossValidation(learners, data) 
     104`selection-filtered-learner.py`_ (uses `voting.tab`_): 
     105 
     106.. literalinclude:: code/selection-filtered-learner.py 
     107    :lines: 13-16 
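Lines 13-16 of the included file correspond roughly to this snippet from the removed fss4.py listing, here made self-contained with the older module names that appear elsewhere in this changeset (the shipped `selection-filtered-learner.py`_ may differ)::

    import orange, orngFSS, orngEval

    data = orange.ExampleTable("voting")
    nb = orange.BayesLearner()
    learners = (orange.BayesLearner(name='bayes'),
                orngFSS.FilteredLearner(nb, filter=orngFSS.FilterBestNAtts(n=1), name='filtered'))
    results = orngEval.CrossValidation(learners, data)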
    142108 
    143109Now, let's decide to retain three features (change the code in <a 
     
    151117convenient!), so the code to do all this is quite short. 
    152118 
    153 `fss4.py`_ (uses `voting.tab`_):: 
    154  
    155     print "\\nNumber of times features were used in cross-validation:\\n" 
    156     attsUsed = {} 
    157     for i in range(10): 
    158         for a in results.classifiers[i][1].atts(): 
    159             if a.name in attsUsed.keys(): 
    160                 attsUsed[a.name] += 1 
    161             else: 
    162                 attsUsed[a.name] = 1 
    163     for k in attsUsed.keys(): 
    164         print "%2d x %s" % (attsUsed[k], k) 
    165  
    166 Running `fss4.py`_ with three features selected each time a learner is run 
    167 gives the following result:: 
     119.. literalinclude:: code/selection-filtered-learner.py 
     120    :lines: 25- 
     121 
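The counting loop included above (lines 25 onward of the file) roughly matches this snippet from the removed fss4.py listing. It tallies how often each feature survived the filter across the ten folds, assuming ``results`` comes from the cross-validation snippet above and that the induced classifiers were stored in it::

    print "\nNumber of times features were used in cross-validation:\n"
    attsUsed = {}
    for i in range(10):
        for a in results.classifiers[i][1].atts():
            attsUsed[a.name] = attsUsed.get(a.name, 0) + 1
    for k in attsUsed.keys():
        print "%2d x %s" % (attsUsed[k], k)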
     122Running `selection-filtered-learner.py`_ with three features selected each 
     123time a learner is run gives the following result:: 
    168124 
    169125    Learner      CA 
     
    185141---------- 
    186142 
    187 * K. Kira and L. Rendell. A practical approach to feature 
    188   selection. In D. Sleeman and P. Edwards, editors, <em>Proc. 9th Int'l 
    189   Conf. on Machine Learning</em>, pages 249{256, Aberdeen, 1992. Morgan 
    190   Kaufmann Publishers. 
    191  
    192 * I. Kononenko. Estimating attributes: Analysis and extensions of 
    193   RELIEF. In F. Bergadano and L. De Raedt, editors, <em>Proc. European 
    194   Conf. on Machine Learning (ECML-94)</em>, pages 
    195   171-182. Springer-Verlag, 1994. 
    196  
    197 * R. Kohavi, G. John: Wrappers for Feature Subset Selection, 
    198   <em>Artificial Intelligence</em>, 97 (1-2), pages 273-324, 1997 
    199  
    200 .. _fss1.py: code/fss1.py 
    201 .. _fss2.py: code/fss2.py 
    202 .. _fss3.py: code/fss3.py 
    203 .. _fss4.py: code/fss4.py 
     143* K. Kira and L. Rendell. A practical approach to feature selection. In 
     144  D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine 
      145  Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers. 
     146 
     147* I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. 
     148  In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine 
     149  Learning (ECML-94), pages  171-182. Springer-Verlag, 1994. 
     150 
     151* R. Kohavi, G. John: Wrappers for Feature Subset Selection, Artificial 
     152  Intelligence, 97 (1-2), pages 273-324, 1997 
     153 
     154.. _selection-best3.py: code/selection-best3.py 
     155.. _selection-bayes.py: code/selection-bayes.py 
     156.. _selection-filtered-learner.py: code/selection-filtered-learner.py 
    204157.. _voting.tab: code/voting.tab 
    205158 
     
    277230    return data.select(attsAboveThreshold(scores, threshold)+[data.domain.classVar.name]) 
    278231 
    279 def filterRelieff(data, measure = orange.MeasureAttribute_relief(k=20, m=50), margin=0): 
    280     """Take the data set and use an attribute measure to removes the worst  
     232def filterRelieff(data, measure=orange.MeasureAttribute_relief(k=20, m=50), margin=0): 
     233    """Take the data set and use an attribute measure to remove the worst  
    281234    scored attribute (those below the margin). Repeats, until no attribute has 
    282235    negative or zero score. 
     
    290243    :type data: Orange.data.table 
    291244    :param measure: an attribute measure (derived from  
    292       :obj:`Orange.MeasureAttribute`). Defaults to  
    293       :obj:`Orange.MeasureAttribute_relief` for k=20 and m=50. 
     245      :obj:`Orange.feature.scoring.Measure`). Defaults to  
     246      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50. 
    294247    :param margin: if score is higher than margin, attribute is not removed. 
    295248      Defaults to 0. 
     
    315268   
    316269class FilterAttsAboveThresh_Class: 
    317     """FilterAttsAboveThresh([<em>measure</em>[<em>, threshold</em>]])</dt> 
    318     <dd class="ddfun">This is simply a wrapper around the function 
    319     <code>selectAttsAboveThresh</code>. It allows to create an object 
    320     which stores filter's parameters and can be later called with the data 
    321     to return the data set that includes only the selected 
    322     features. <em>measure</em> is a function that returns a list of 
    323     couples (attribute name, score), and it defaults to 
    324     <code>orange.MeasureAttribute_relief(k=20, m=50)</code>. The default 
    325     threshold is 0.0. Some examples of how to use this class are:: 
    326  
    327         filter = orngFSS.FilterAttsAboveThresh(threshold=.15) 
     270    """Stores filter's parameters and can be later called with the data to 
     271    return the data table with only selected features.  
     272     
     273    This class is used in the function :obj:`selectAttsAboveThresh`. 
     274     
     275    :param measure: an attribute measure (derived from  
     276      :obj:`Orange.feature.scoring.Measure`). Defaults to  
     277      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.   
     278    :param threshold: score threshold for attribute selection. Defaults to 0. 
     279    :type threshold: float 
     280      
     281    Some examples of how to use this class are:: 
     282 
     283        filter = Orange.feature.selection.FilterAttsAboveThresh(threshold=.15) 
    328284        new_data = filter(data) 
    329         new_data = orngFSS.FilterAttsAboveThresh(data) 
    330         new_data = orngFSS.FilterAttsAboveThresh(data, threshold=.1) 
    331         new_data = orngFSS.FilterAttsAboveThresh(data, threshold=.1, 
    332                      measure=orange.MeasureAttribute_gini()) 
     285        new_data = Orange.feature.selection.FilterAttsAboveThresh(data) 
     286        new_data = Orange.feature.selection.FilterAttsAboveThresh(data, threshold=.1) 
     287        new_data = Orange.feature.selection.FilterAttsAboveThresh(data, threshold=.1, 
     288                   measure=Orange.feature.scoring.Gini()) 
    333289 
    334290    """ 
     
    339295 
    340296    def __call__(self, data): 
     297        """Take data and return features with scores above given threshold. 
     298         
     299        :param data: an data table 
     300        :type data: Orange.data.table 
     301        """ 
    341302        ma = attMeasure(data, self.measure) 
    342303        return selectAttsAboveThresh(data, ma, self.threshold) 
    343304 
    344305def FilterBestNAtts(data=None, **kwds): 
    345     """FilterBestNAtts</b>([<em>measure</em>[<em>, n</em>]])</dt> 
    346     <dd class="ddfun">Similarly to <code>FilterAttsAboveThresh</code>, 
    347     this is a wrapper around the function 
    348     <code>selectBestNAtts</code>. Measure and the number of features to 
    349     retain are optional (the latter defaults to 5). 
     306    """Similarly to :obj:`FilterAttsAboveThresh`, wrap around class 
     307    :obj:`FilterBestNAtts_Class`. 
     308     
     309    :param measure: an attribute measure (derived from  
     310      :obj:`Orange.feature.scoring.Measure`). Defaults to  
     311      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.   
     312    :param n: number of best features to return. Defaults to 5. 
     313    :type n: int 
    350314 
    351315    """ 
     
    364328 
    365329def FilterRelief(data=None, **kwds): 
    366     """FilterRelieff</b>([<em>measure</em>[<em>, margin</em>]])</dt> 
    367     <dd class="ddfun">Similarly to <code>FilterBestNAtts</code>, this is a 
    368     wrapper around the function 
    369     <code>filterRelieff</code>. <em>measure</em> and <em>margin</em> are 
    370     optional attributes, where <em>measure</em> defaults to 
    371     <code>orange.MeasureAttribute_relief(k=20, m=50)</code> and 
    372     <em>margin</em> to 0.0. 
     330    """Similarly to :obj:`FilterBestNAtts`, wrap around class  
     331    :obj:`FilterRelief_Class`. 
     332     
     333    :param measure: an attribute measure (derived from  
     334      :obj:`Orange.feature.scoring.Measure`). Defaults to  
     335      :obj:`Orange.feature.scoring.Relief` for k=20 and m=50.   
     336    :param margin: margin for Relief scoring. Defaults to 0. 
     337    :type margin: float 
    373338 
    374339    """     
     
    390355 
    391356def FilteredLearner(baseLearner, examples = None, weight = None, **kwds): 
    392     """FilteredLearner</b>([<em>baseLearner</em>[<em>, 
    393     examples</em>[<em>, filter</em>[<em>, name</em>]]]])</dt> <dd>Wraps a 
    394     <em>baseLearner</em> using a data <em>filter</em>, and returns the 
    395     corresponding learner. When such learner is presented a data set, data 
    396     is first filtered and then passed to 
    397     <em>baseLearner</em>. <em>FilteredLearner</em> comes handy when one 
    398     wants to test the schema of feature-subset-selection-and-learning by 
    399     some repetitive evaluation method, e.g., cross validation. Filter 
    400     defaults to orngFSS.FilterAttsAboveThresh with default 
    401     attributes. Here is an example of how to set such learner (build a 
    402     wrapper around naive Bayesian learner) and use it on a data set:: 
    403  
    404         nb = orange.BayesLearner() 
    405         learner = orngFSS.FilteredLearner(nb, filter=orngFSS.FilterBestNAtts(n=5), name='filtered') 
     357    """Return the corresponding learner that wraps  
     358    :obj:`Orange.classification.baseLearner` and a data selection method.  
     359     
     360    When such learner is presented a data table, data is first filtered and  
     361    then passed to :obj:`Orange.classification.baseLearner`. This comes handy  
     362    when one wants to test the schema of feature-subset-selection-and-learning 
     363    by some repetitive evaluation method, e.g., cross validation.  
     364     
      365    :param filter: defaults to 
     366      :obj:`Orange.feature.selection.FilterAttsAboveThresh` 
     367    :type filter: Orange.feature.selection.FilterAttsAboveThresh 
     368 
      369    Here is an example of how to build a wrapper around the naive Bayesian learner 
     370    and use it on a data set:: 
     371 
     372        nb = Orange.classification.bayes.NaiveBayesLearner() 
     373        learner = Orange.feature.selection.FilteredLearner(nb,  
     374                  filter=Orange.feature.selection.FilterBestNAtts(n=5), name='filtered') 
    406375        classifier = learner(data) 
    407376 