Files: 562 added, 377 deleted, 97 edited

  • Orange/OrangeWidgets/Data/OWDataDomain.py

    r9671 r9996  
    629629                    mid = original_metas[meta] 
    630630                else: 
    631                     mid = Orange.data.new_meta_id() 
     631                    mid = Orange.feature.Descriptor.new_meta_id() 
    632632                domain.addmeta(mid, meta) 
    633633            newdata = Orange.data.Table(domain, self.data) 
  • Orange/OrangeWidgets/Data/OWPurgeDomain.py

    r9671 r9997  
    134134            for attr in self.data.domain.attributes: 
    135135                if attr.varType == orange.VarTypes.Continuous: 
    136                     if orange.RemoveRedundantOneValue.hasAtLeastTwoValues(self.data, attr): 
     136                    if orange.RemoveRedundantOneValue.has_at_least_two_values(self.data, attr): 
    137137                        newattrs.append(attr) 
    138138                    else: 
  • Orange/OrangeWidgets/Prototypes/OWCorrelations.py

    r9671 r9996  
    115115    domain = Orange.data.Domain(attrs, None) 
    116116    row_name = variable.String("Row name") 
    117     domain.addmeta(Orange.data.new_meta_id(), row_name) 
     117    domain.addmeta(Orange.feature.Descriptor.new_meta_id(), row_name) 
    118118     
    119119    table = Orange.data.Table(domain, [list(r) for r in matrix]) 
     
    445445                 
    446446                domain = Orange.data.Domain([pearson, spearman], None) 
    447                 domain.addmeta(Orange.data.new_meta_id(), row_name) 
     447                domain.addmeta(Orange.feature.Descriptor.new_meta_id(), row_name) 
    448448                table = Orange.data.Table(domain, self.target_correlations) 
    449449                for inst, name in zip(table, self.var_names): 
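
These hunks replace the removed Orange.data.new_meta_id() with the class method Orange.feature.Descriptor.new_meta_id(). A minimal sketch of registering a meta attribute with the renamed call (the variable names below are illustrative, not taken from the changeset)::

    import Orange

    domain = Orange.data.Domain([], None)
    row_name = Orange.feature.String("Row name")
    # allocate a fresh meta id with the new class method, then register the descriptor
    mid = Orange.feature.Descriptor.new_meta_id()
    domain.addmeta(mid, row_name)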
  • Orange/__init__.py

    r9929 r9986  
    1919_import("data.io") 
    2020_import("data.sample") 
     21_import("data.utils") 
     22_import("data.discretization") 
    2123 
    2224_import("network") 
  • Orange/associate/__init__.py

    r9919 r9988  
    1 """ 
    2 ============================== 
    3 Induction of association rules 
    4 ============================== 
    5  
    6 Orange provides two algorithms for induction of 
    7 `association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
    8 One is the basic Agrawal's algorithm with dynamic induction of supported 
    9 itemsets and rules that is designed specifically for datasets with a 
    10 large number of different items. This is, however, not really suitable 
    11 for feature-based machine learning problems. 
    12 We have adapted the original algorithm for efficiency 
    13 with the latter type of data, and to induce rules in which 
    14 both sides contain not only features 
    15 (like "bread, butter -> jam") but also their values 
    16 ("bread = wheat, butter = yes -> jam = plum"). 
    17  
    18 It is also possible to extract item sets instead of association rules. These 
    19 are often more interesting than the rules themselves. 
    20  
    21 Besides the association rule inducers, Orange also provides a rather simplified 
    22 method for classification by association rules. 
    23  
    24 =================== 
    25 Agrawal's algorithm 
    26 =================== 
    27  
    28 The class that induces rules with Agrawal's algorithm accepts examples 
    29 in two forms. The first is the standard form, in which each example is 
    30 described by values of a fixed list of features (defined in domain). 
    31 The algorithm, however, disregards the feature values and only checks whether 
    32 the value is defined or not. The rule shown above ("bread, butter -> jam") 
    33 actually means that if "bread" and "butter" are defined, then "jam" is defined 
    34 as well. It is expected that most of the values will be undefined - if this is not 
    35 so, use the :class:`~AssociationRulesInducer`. 
    36  
    37 :class:`AssociationRulesSparseInducer` can also use sparse data.  
    38 Sparse examples have no fixed 
    39 features - the domain is empty. All values assigned to an example are given as meta attributes. 
    40 All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
    41 The most suitable format for this kind of data is the basket format. 
    42  
    43 The algorithm first dynamically builds all itemsets (sets of features) that have 
    44 at least the prescribed support. Each of these is then used to derive rules 
    45 with the requested confidence. 
    46  
    47 If examples were given in the sparse form, so are the left and right sides 
    48 of the induced rules. If examples were given in the standard form, the rule 
    49 sides are represented in the standard form as well. 
    50  
    51 .. class:: AssociationRulesSparseInducer 
    52  
    53     .. attribute:: support 
    54      
    55         Minimal support for the rule. 
    56          
    57     .. attribute:: confidence 
    58      
    59         Minimal confidence for the rule. 
    60          
    61     .. attribute:: store_examples 
    62      
    63         Store the examples covered by each rule and 
    64         those confirming it. 
    65          
    66     .. attribute:: max_item_sets 
    67      
    68         The maximal number of itemsets. The algorithm's 
    69         running time (and its memory consumption) depends on the minimal support; 
    70         the lower the requested support, the more eligible itemsets will be found. 
    71         There is no general rule for setting support - perhaps it  
    72         should be around 0.3, but this depends on the data set. 
    73         If the support is set too low, the algorithm could run out of memory. 
    74         Therefore, Orange limits the number of generated itemsets to 
    75         :obj:`max_item_sets`. If Orange reports that the prescribed 
    76         :obj:`max_item_sets` was exceeded, increase the required support 
    77         or, alternatively, increase :obj:`max_item_sets` to as high as your 
    78         computer can handle. 
    79  
    80     .. method:: __call__(data, weight_id) 
    81  
    82         Induce rules from the data set. 
    83  
    84  
    85     .. method:: get_itemsets(data) 
    86  
    87         Returns a list of pairs. The first element of a pair is a tuple with  
    88         indices of features in the item set (negative for sparse data).  
    89         The second element is a list of indices of the examples that support 
    90         the item set. If :obj:`store_examples` is False, the second 
    91         element is None. 
    92  
    93 We shall test the rule inducer on a dataset consisting of a brief description 
    94 of the Spanish Inquisition, given by Palin et al.: 
    95  
    96     NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
    97  
    98     NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
    99      
    100 The text needs to be cleaned of punctuation marks and capital letters at the beginnings of sentences, each sentence needs to be put on its own line, and commas need to be inserted between the words. 
    101  
    102 Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
    103  
    104 .. literalinclude:: code/inquisition.basket 
    105     
    106 Inducing the rules is trivial (uses :download:`inquisition.basket <code/inquisition.basket>`):: 
    107  
    108     import Orange 
    109     data = Orange.data.Table("inquisition") 
    110  
    111     rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
    112  
    113     print "%5s   %5s" % ("supp", "conf") 
    114     for r in rules: 
    115         print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
    116  
    117 The induced rules are surprisingly fear-full: :: 
    118  
    119     0.500   1.000   fear -> surprise 
    120     0.500   1.000   surprise -> fear 
    121     0.500   1.000   fear -> surprise our 
    122     0.500   1.000   fear surprise -> our 
    123     0.500   1.000   fear our -> surprise 
    124     0.500   1.000   surprise -> fear our 
    125     0.500   1.000   surprise our -> fear 
    126     0.500   0.714   our -> fear surprise 
    127     0.500   1.000   fear -> our 
    128     0.500   0.714   our -> fear 
    129     0.500   1.000   surprise -> our 
    130     0.500   0.714   our -> surprise 
    131  
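
As a quick sanity check on these numbers, the reported support and confidence can be recomputed from the rule statistics documented further below (a hedged sketch; the ordering of rules in the output is not guaranteed)::

    r = rules[0]
    # support = n_applies_both / n_examples, confidence = n_applies_both / n_applies_left
    print r.n_applies_both / float(r.n_examples), r.n_applies_both / float(r.n_applies_left)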
    132 To get only a list of supported item sets, one should call the method 
    133 get_itemsets:: 
    134  
    135     inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
    136     itemsets = inducer.get_itemsets(data) 
    137      
    138 Now itemsets is a list of itemsets along with the examples supporting them 
    139 since we set store_examples to True. :: 
    140  
    141     >>> itemsets[5] 
    142     ((-11, -7), [1, 2, 3, 6, 9]) 
    143     >>> [data.domain[i].name for i in itemsets[5][0]] 
    144     ['surprise', 'our']    
    145      
    146 The sixth itemset contains features with indices -11 and -7, that is, the 
    147 words "surprise" and "our". The examples supporting it are those with 
    148 indices 1, 2, 3, 6 and 9. 
    149  
    150 This way of representing the itemsets is memory efficient and faster than using 
    151 objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
    152  
    153 .. _non-sparse-examples: 
    154  
    155 =================== 
    156 Non-sparse data 
    157 =================== 
    158  
    159 :class:`AssociationRulesInducer` works with non-sparse data. 
    160 Unknown values are ignored, while values of features are not (as opposed to 
    161 the algorithm for sparse rules). In addition, the algorithm 
    162 can be directed to search only for classification rules, in which the only 
    163 feature on the right-hand side is the class variable. 
    164  
    165 .. class:: AssociationRulesInducer 
    166  
    167     All attributes can be set with the constructor.  
    168  
    169     .. attribute:: support 
    170      
    171        Minimal support for the rule. 
    172      
    173     .. attribute:: confidence 
    174      
    175         Minimal confidence for the rule. 
    176      
    177     .. attribute:: classification_rules 
    178      
    179         If True (default is False), the classification rules are constructed instead 
    180         of general association rules. 
    181  
    182     .. attribute:: store_examples 
    183      
    184         Store the examples covered by each rule and those 
    185         confirming it. 
    186          
    187     .. attribute:: max_item_sets 
    188      
    189         The maximal number of itemsets. 
    190  
    191     .. method:: __call__(data, weight_id) 
    192  
    193         Induce rules from the data set. 
    194  
    195     .. method:: get_itemsets(data) 
    196  
    197         Returns a list of pairs. The first element of a pair is a tuple with  
    198         indices of features in the item set (negative for sparse data).  
    199         The second element is a list of indices of the examples that support 
    200         the item set. If :obj:`store_examples` is False, the second 
    201         element is None. 
    202  
    203 The example:: 
    204  
    205     import Orange 
    206  
    207     data = Orange.data.Table("lenses") 
    208  
    209     print "Association rules" 
    210     rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
    211     for r in rules: 
    212         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    213          
    214 The found rules are: :: 
    215  
    216     0.333  0.533  lenses=none -> prescription=hypermetrope 
    217     0.333  0.667  prescription=hypermetrope -> lenses=none 
    218     0.333  0.533  lenses=none -> astigmatic=yes 
    219     0.333  0.667  astigmatic=yes -> lenses=none 
    220     0.500  0.800  lenses=none -> tear_rate=reduced 
    221     0.500  1.000  tear_rate=reduced -> lenses=none 
    222      
    223 To limit the algorithm to classification rules, set classification_rules to 1: :: 
    224  
    225     print "\\nClassification rules" 
    226     rules = Orange.associate.AssociationRulesInducer(data, support = 0.3, classification_rules = 1) 
    227     for r in rules: 
    228         print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
    229  
    230 The found rules are, naturally, a subset of the above rules: :: 
    231  
    232     0.333  0.667  prescription=hypermetrope -> lenses=none 
    233     0.333  0.667  astigmatic=yes -> lenses=none 
    234     0.500  1.000  tear_rate=reduced -> lenses=none 
    235      
    236 Itemsets are induced in a similar fashion as for sparse data, except that the 
    237 first element of the tuple, the item set, is represented not by indices of 
    238 features, as before, but with tuples (feature-index, value-index): :: 
    239  
    240     inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
    241     itemsets = inducer.get_itemsets(data) 
    242     print itemsets[8] 
    243      
    244 This prints out :: 
    245  
    246     (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
    247      
    248 meaning that the ninth itemset contains the second value of the third feature 
    249 (2, 1), and the first value of the fifth (4, 0). 
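
To make such itemsets readable, the (feature-index, value-index) pairs can be mapped back to feature names and symbolic values; a small sketch continuing from the code above::

    for feature_idx, value_idx in itemsets[8][0]:
        feature = data.domain[feature_idx]
        # values holds the symbolic values of a discrete feature
        print "%s = %s" % (feature.name, feature.values[value_idx])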
    250  
    251 ======================= 
    252 Representation of rules 
    253 ======================= 
    254  
    255 An :class:`AssociationRule` represents a rule. In Orange, methods for  
    256 induction of association rules return the induced rules in 
    257 :class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
    258  
    259 .. class:: AssociationRule 
    260  
    261     .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
    262      
    263         Constructs an association rule and computes all the measures listed below. 
    264      
    265     .. method:: __init__(left, right, support, confidence) 
    266      
    267         Constructs an association rule and sets its support and confidence. If 
    268         you intend to pass such a rule on, you should set the other attributes 
    269         manually - AssociationRule's constructor cannot compute anything 
    270         from the arguments support and confidence. 
    271      
    272     .. method:: __init__(rule) 
    273      
    274         Given an association rule as the argument, the constructor makes a copy 
    275         of the rule. 
    276   
    277     .. attribute:: left, right 
    278      
    279         The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
    280         In rules created by :class:`AssociationRulesSparseInducer` from examples that 
    281         contain all values as meta-values, left and right are examples in the 
    282         same form. Otherwise, values in left that do not appear in the rule 
    283         are "don't care", and values in right are "don't know". Both can, 
    284         however, be tested by :meth:`~Orange.data.Value.is_special`. 
    285      
    286     .. attribute:: n_left, n_right 
    287      
    288         The number of features (i.e. defined values) on the left and on the 
    289         right side of the rule. 
    290      
    291     .. attribute:: n_applies_left, n_applies_right, n_applies_both 
    292      
    293         The number of (learning) examples that conform to the left, the right 
    294         and to both sides of the rule. 
    295      
    296     .. attribute:: n_examples 
    297      
    298         The total number of learning examples. 
    299      
    300     .. attribute:: support 
    301      
    302         n_applies_both/n_examples. 
    303  
    304     .. attribute:: confidence 
    305      
    306         n_applies_both/n_applies_left. 
    307      
    308     .. attribute:: coverage 
    309      
    310         n_applies_left/n_examples. 
    311  
    312     .. attribute:: strength 
    313      
    314         n_applies_right/n_applies_left. 
    315      
    316     .. attribute:: lift 
    317      
    318         n_examples * n_applies_both / (n_applies_left * n_applies_right). 
    319      
    320     .. attribute:: leverage 
    321      
    322         (n_applies_both * n_examples - n_applies_left * n_applies_right). 
    323      
    324     .. attribute:: examples, match_left, match_both 
    325      
    326         If store_examples was True during induction, examples contains a copy 
    327         of the example table used to induce the rules. Attributes match_left 
    328         and match_both are lists of integers, representing the indices of 
    329         examples which match the left-hand side of the rule and both sides, 
    330         respectively. 
    331     
    332     .. method:: applies_left(example) 
    333      
    334     .. method:: applies_right(example) 
    335      
    336     .. method:: applies_both(example) 
    337      
    338         Tells whether the example fits into the left, right or both sides of 
    339         the rule, respectively. If the rule is represented by sparse examples, 
    340         the given example must be sparse as well. 
    341      
    342 Association rule inducers do not store evidence about which example supports 
    343 which rule. Let us write a function that finds the examples that 
    344 confirm the rule (fit both sides of it) and those that contradict it (fit the 
    345 left-hand side but not the right). The example:: 
    346  
    347     import Orange 
    348  
    349     data = Orange.data.Table("lenses") 
    350  
    351     rules = Orange.associate.AssociationRulesInducer(data, support = 0.3) 
    352     rule = rules[0] 
    353  
    354     print 
    355     print "Rule: ", rule 
    356     print 
    357  
    358     print "Supporting examples:" 
    359     for example in data: 
    360         if rule.applies_both(example): 
    361             print example 
    362     print 
    363  
    364     print "Contradicting examples:" 
    365     for example in data: 
    366         if rule.applies_left(example) and not rule.applies_right(example): 
    367             print example 
    368     print 
    369  
    370 The latter printouts get simpler and faster if we instruct the inducer to 
    371 store the examples. We can then do, for instance, this: :: 
    372  
    373     print "Match left: " 
    374     print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
    375     print "\\nMatch both: " 
    376     print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
    377  
    378 The "contradicting" examples are then those whose indices are found in 
    379 match_left but not in match_both. A more memory-friendly and faster way 
    380 to compute this is as follows: :: 
    381  
    382     >>> [x for x in rule.match_left if not x in rule.match_both] 
    383     [0, 2, 8, 10, 16, 17, 18] 
    384     >>> set(rule.match_left) - set(rule.match_both) 
    385     set([0, 2, 8, 10, 16, 17, 18]) 
    386  
    387 =============== 
    388 Utilities 
    389 =============== 
    390  
    391 .. autofunction:: print_rules 
    392  
    393 .. autofunction:: sort 
    394  
    395 """ 
    396  
    3971from orange import \ 
    3982    AssociationRule, \ 
  • Orange/classification/knn.py

    r9724 r9994  
    136136into training (80%) and testing (20%) instances. We will use the former  
    137137for "training" the classifier and test it on five testing instances  
    138 randomly selected from a part of (:download:`knnlearner.py <code/knnlearner.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     138randomly selected from a part of (:download:`knnlearner.py <code/knnlearner.py>`): 
    139139 
    140140.. literalinclude:: code/knnExample1.py 
     
    157157decide to do so, the distance_constructor must be set to an instance 
    158158of one of the classes for distance measuring. This can be seen in the following 
    159 part of (:download:`knnlearner.py <code/knnlearner.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     159part of (:download:`knnlearner.py <code/knnlearner.py>`): 
    160160 
    161161.. literalinclude:: code/knnExample2.py 
     
    271271-------- 
    272272 
    273 The following script (:download:`knnInstanceDistance.py <code/knnInstanceDistance.py>`, uses :download:`lenses.tab <code/lenses.tab>`) 
     273The following script (:download:`knnInstanceDistance.py <code/knnInstanceDistance.py>`) 
    274274shows how to find the five nearest neighbors of the first instance 
    275275in the lenses dataset. 
  • Orange/classification/logreg.py

    r9936 r9959  
    188188        self.__dict__.update(kwds) 
    189189 
    190     def __call__(self, instance, resultType = Orange.classification.Classifier.GetValue): 
    191         # classification not implemented yet. For now its use is only to provide regression coefficients and its statistics 
    192         pass 
     190    def __call__(self, instance, result_type = Orange.classification.Classifier.GetValue): 
     191        # classification not implemented yet. For now its use is only to 
     192        # provide regression coefficients and its statistics 
     193        raise NotImplemented 
    193194     
    194195 
    195196class LogRegLearnerGetPriors(object): 
    196     def __new__(cls, instances=None, weightID=0, **argkw): 
     197    def __new__(cls, instances=None, weight_id=0, **argkw): 
    197198        self = object.__new__(cls) 
    198199        if instances: 
    199200            self.__init__(**argkw) 
    200             return self.__call__(instances, weightID) 
     201            return self.__call__(instances, weight_id) 
    201202        else: 
    202203            return self 
  • Orange/classification/lookup.py

    r9919 r9994  
    2121they usually reside in :obj:`~Orange.feature.Descriptor.get_value_from` fields of constructed 
    2222features to facilitate their automatic computation. For instance, 
    23 the following script shows how to translate the :download:`monks-1.tab <code/monks-1.tab>` data set 
     23the following script shows how to translate the `monks-1.tab` data set 
    2424features into a more useful subset that will only include the features 
    2525``a``, ``b``, ``e``, and features that will tell whether ``a`` and ``b`` are equal and 
    2626whether ``e`` is 1 (don't bother about the details, they follow later;  
    27 :download:`lookup-lookup.py <code/lookup-lookup.py>`, uses: :download:`monks-1.tab <code/monks-1.tab>`): 
     27:download:`lookup-lookup.py <code/lookup-lookup.py>`): 
    2828 
    2929.. literalinclude:: code/lookup-lookup.py 
     
    158158        Let's see some indices for randomly chosen examples from the original table. 
    159159         
    160         part of :download:`lookup-lookup.py <code/lookup-lookup.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`): 
     160        part of :download:`lookup-lookup.py <code/lookup-lookup.py>`: 
    161161 
    162162        .. literalinclude:: code/lookup-lookup.py 
     
    254254    is called and the resulting classifier is returned instead of the learner. 
    255255 
    256 part of :download:`lookup-table.py <code/lookup-table.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`): 
     256part of :download:`lookup-table.py <code/lookup-table.py>`: 
    257257 
    258258.. literalinclude:: code/lookup-table.py 
     
    323323the class_var. It doesn't set the :obj:`Orange.feature.Descriptor.get_value_from`, though. 
    324324 
    325 part of :download:`lookup-table.py <code/lookup-table.py>` (uses: :download:`monks-1.tab <code/monks-1.tab>`):: 
     325part of :download:`lookup-table.py <code/lookup-table.py>`:: 
    326326 
    327327    import Orange 
     
    336336alternative call arguments, it offers an easy way to observe feature 
    337337interactions. For this purpose, we shall omit e, and construct a 
    338 ClassifierByDataTable from a and b only (part of :download:`lookup-table.py <code/lookup-table.py>`; uses: :download:`monks-1.tab <code/monks-1.tab>`): 
     338ClassifierByDataTable from a and b only (part of :download:`lookup-table.py <code/lookup-table.py>`): 
    339339 
    340340.. literalinclude:: code/lookup-table.py 
     
    511511       
    512512 
    513 def lookup_from_data(examples, weight=0, learnerForUnknown=None): 
     513from Orange.misc import deprecated_keywords 
     514@deprecated_keywords({"learnerForUnknown":"learner_for_unknown"}) 
     515def lookup_from_data(examples, weight=0, learner_for_unknown=None): 
    514516    if len(examples.domain.attributes) <= 3: 
    515517        lookup = lookup_from_bound(examples.domain.class_var, 
     
    528530        # ClassifierByDataTable, let it deal with them 
    529531        return LookupLearner(examples, weight, 
    530                              learnerForUnknown=learnerForUnknown) 
     532                             learner_for_unknown=learner_for_unknown) 
    531533 
    532534    else: 
    533535        return LookupLearner(examples, weight, 
    534                              learnerForUnknown=learnerForUnknown) 
     536                             learner_for_unknown=learner_for_unknown) 
    535537         
    536538         
  • Orange/classification/majority.py

    r9671 r9994  
    6262This "learning algorithm" will most often be used as a baseline, 
    6363that is, to determine if some other learning algorithm provides 
    64 any information about the class (:download:`majority-classification.py <code/majority-classification.py>`, 
    65 uses: :download:`monks-1.tab <code/monks-1.tab>`): 
     64any information about the class (:download:`majority-classification.py <code/majority-classification.py>`): 
    6665 
    6766.. literalinclude:: code/majority-classification.py 
  • Orange/classification/rules.py

    r9936 r9994  
    3232Usage is consistent with typical learner usage in Orange: 
    3333 
    34 :download:`rules-cn2.py <code/rules-cn2.py>` (uses :download:`titanic.tab <code/titanic.tab>`) 
     34:download:`rules-cn2.py <code/rules-cn2.py>` 
    3535 
    3636.. literalinclude:: code/rules-cn2.py 
     
    155155in description of classes that follows it: 
    156156 
    157 part of :download:`rules-customized.py <code/rules-customized.py>` (uses :download:`titanic.tab <code/titanic.tab>`) 
     157part of :download:`rules-customized.py <code/rules-customized.py>` 
    158158 
    159159.. literalinclude:: code/rules-customized.py 
     
    181181different beam width. This is simply written as: 
    182182 
    183 part of :download:`rules-customized.py <code/rules-customized.py>` (uses :download:`titanic.tab <code/titanic.tab>`) 
     183part of :download:`rules-customized.py <code/rules-customized.py>` 
    184184 
    185185.. literalinclude:: code/rules-customized.py 
  • Orange/classification/wrappers.py

    r9671 r9961  
    55import Orange.evaluation.scoring 
    66 
     7from Orange.misc import deprecated_members 
     8 
    79class StepwiseLearner(Orange.core.Learner): 
    8   def __new__(cls, data=None, weightId=None, **kwargs): 
     10  def __new__(cls, data=None, weight_id=None, **kwargs): 
    911      self = Orange.core.Learner.__new__(cls, **kwargs) 
    1012      if data is not None: 
    1113          self.__init__(**kwargs) 
    12           return self(data, weightId) 
     14          return self(data, weight_id) 
    1315      else: 
    1416          return self 
    1517       
    1618  def __init__(self, **kwds): 
    17     self.removeThreshold = 0.3 
    18     self.addThreshold = 0.2 
     19    self.remove_threshold = 0.3 
     20    self.add_threshold = 0.2 
    1921    self.stat, self.statsign = scoring.CA, 1 
    20     self.__dict__.update(kwds) 
     22    for name, val in kwds.items(): 
     23        setattr(self, name, val) 
    2124 
    22   def __call__(self, examples, weightID = 0, **kwds): 
     25  def __call__(self, data, weight_id = 0, **kwds): 
    2326    import Orange.evaluation.testing, Orange.evaluation.scoring, statc 
    2427     
    2528    self.__dict__.update(kwds) 
    2629 
    27     if self.removeThreshold < self.addThreshold: 
    28         raise ValueError("'removeThreshold' should be larger or equal to 'addThreshold'") 
     30    if self.remove_threshold < self.add_threshold: 
     31        raise ValueError("'remove_threshold' should be larger or equal to 'add_threshold'") 
    2932 
    30     classVar = examples.domain.classVar 
     33    classVar = data.domain.classVar 
    3134     
    32     indices = Orange.core.MakeRandomIndicesCV(examples, folds = getattr(self, "folds", 10)) 
     35    indices = Orange.core.MakeRandomIndicesCV(data, folds = getattr(self, "folds", 10)) 
    3336    domain = Orange.data.Domain([], classVar) 
    3437 
    35     res = Orange.evaluation.testing.test_with_indices([self.learner], Orange.data.Table(domain, examples), indices) 
     38    res = Orange.evaluation.testing.test_with_indices([self.learner], Orange.data.Table(domain, data), indices) 
    3639     
    3740    oldStat = self.stat(res)[0] 
    38     oldStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.splitByIterations(res)] 
     41    oldStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.split_by_iterations(res)] 
    3942    print ".", oldStat, domain 
    4043    stop = False 
     
    4548            for attr in domain.attributes: 
    4649                newdomain = Orange.data.Domain(filter(lambda x: x!=attr, domain.attributes), classVar) 
    47                 res = Orange.evaluation.testing.test_with_indices([self.learner], (Orange.data.Table(newdomain, examples), weightID), indices) 
     50                res = Orange.evaluation.testing.test_with_indices([self.learner], (Orange.data.Table(newdomain, data), weight_id), indices) 
    4851                 
    4952                newStat = self.stat(res)[0] 
    50                 newStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.splitByIterations(res)]  
     53                newStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.split_by_iterations(res)]  
    5154                print "-", newStat, newdomain 
    5255                ## If stat has increased (ie newStat is better than bestStat) 
     
    5457                    if cmp(newStat, oldStat) == self.statsign: 
    5558                        bestStat, bestStats, bestAttr = newStat, newStats, attr 
    56                     elif statc.wilcoxont(oldStats, newStats)[1] > self.removeThreshold: 
     59                    elif statc.wilcoxont(oldStats, newStats)[1] > self.remove_threshold: 
    5760                            bestStat, bestAttr, bestStats = newStat, newStats, attr 
    5861            if bestStat: 
     
    6366 
    6467        bestStat, bestAttr = oldStat, None 
    65         for attr in examples.domain.attributes: 
     68        for attr in data.domain.attributes: 
    6669            if not attr in domain.attributes: 
    6770                newdomain = Orange.data.Domain(domain.attributes + [attr], classVar) 
    68                 res = Orange.evaluation.testing.test_with_indices([self.learner], (Orange.data.Table(newdomain, examples), weightID), indices) 
     71                res = Orange.evaluation.testing.test_with_indices([self.learner], (Orange.data.Table(newdomain, data), weight_id), indices) 
    6972                 
    7073                newStat = self.stat(res)[0] 
    71                 newStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.splitByIterations(res)]  
     74                newStats = [self.stat(x)[0] for x in Orange.evaluation.scoring.split_by_iterations(res)]  
    7275                print "+", newStat, newdomain 
    7376 
    7477                ## If stat has increased (ie newStat is better than bestStat) 
    75                 if cmp(newStat, bestStat) == self.statsign and statc.wilcoxont(oldStats, newStats)[1] < self.addThreshold: 
     78                if cmp(newStat, bestStat) == self.statsign and statc.wilcoxont(oldStats, newStats)[1] < self.add_threshold: 
    7679                    bestStat, bestStats, bestAttr = newStat, newStats, attr 
    7780        if bestAttr: 
     
    8184            print "added", bestAttr.name 
    8285 
    83     return self.learner(Orange.data.Table(domain, examples), weightID) 
     86    return self.learner(Orange.data.Table(domain, data), weight_id) 
    8487 
     88StepwiseLearner = deprecated_members( 
     89                    {"removeThreshold": "remove_threshold", 
     90                     "addThreshold": "add_threshold"}, 
     91                    )(StepwiseLearner) 
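
The deprecated_members wrapper at the end is there so that scripts still using the old camelCase attribute names keep working while new code uses the snake_case names; a rough sketch of the intent (the choice of base learner is purely illustrative)::

    import Orange
    from Orange.classification.wrappers import StepwiseLearner

    stepwise = StepwiseLearner(learner=Orange.classification.majority.MajorityLearner())
    stepwise.add_threshold = 0.25       # new snake_case name
    stepwise.removeThreshold = 0.35     # old camelCase name, mapped by deprecated_members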
  • Orange/clustering/kmeans.py

    r9725 r9994  
    1616 
    1717The following code runs k-means clustering and prints out the cluster indexes 
    18 for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     18for the last 10 data instances (:download:`kmeans-run.py <code/kmeans-run.py>`): 
    1919 
    2020.. literalinclude:: code/kmeans-run.py 
     
    2929o be computed at each iteration we have to set :obj:`minscorechange`, but we can 
    3030leave it at 0 or even set it to a negative value, which allows the score to deteriorate 
    31 by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     31by some amount (:download:`kmeans-run-callback.py <code/kmeans-run-callback.py>`): 
    3232 
    3333.. literalinclude:: code/kmeans-run-callback.py 
     
    4444    Iteration: 8, changes: 0, score: 9.8624 
    4545 
    46 Call-back above is used for reporting of the progress, but may as well call a function that plots a selection data projection with corresponding centroid at a given step of the clustering. This is exactly what we did with the following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     46Call-back above is used for reporting of the progress, but may as well call a function that plots a selection data projection with corresponding centroid at a given step of the clustering. This is exactly what we did with the following script (:download:`kmeans-trace.py <code/kmeans-trace.py>`): 
    4747 
    4848.. literalinclude:: code/kmeans-trace.py 
     
    8282and finds more optimal centroids. The following code compares three different  
    8383initialization methods (random, diversity-based and hierarchical clustering-based)  
    84 in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`, uses :download:`iris.tab <code/iris.tab>`, 
    85 :download:`housing.tab <code/housing.tab>`, :download:`vehicle.tab <code/vehicle.tab>`): 
     84in terms of how fast they converge (:download:`kmeans-cmp-init.py <code/kmeans-cmp-init.py>`): 
    8685 
    8786.. literalinclude:: code/kmeans-cmp-init.py 
     
    9695 
    9796The following code computes the silhouette score for k=2..7 and plots a  
    98 silhuette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`, uses :download:`iris.tab <code/iris.tab>`): 
     97silhuette plot for k=3 (:download:`kmeans-silhouette.py <code/kmeans-silhouette.py>`): 
    9998 
    10099.. literalinclude:: code/kmeans-silhouette.py 
     
    176175score_distance_to_centroids.minimize = True 
    177176 
    178 def score_conditionalEntropy(km): 
     177def score_conditional_entropy(km): 
    179178    """UNIMPLEMENTED cluster quality measured by conditional entropy""" 
    180     pass 
    181  
    182 def score_withinClusterDistance(km): 
     179    raise NotImplemented 
     180 
     181def score_within_cluster_distance(km): 
    183182    """UNIMPLEMENTED weighted average within-cluster pairwise distance""" 
    184     pass 
    185  
    186 score_withinClusterDistance.minimize = True 
    187  
    188 def score_betweenClusterDistance(km): 
     183    raise NotImplemented 
     184 
     185score_within_cluster_distance.minimize = True 
     186 
     187def score_between_cluster_distance(km): 
    189188    """Sum of distances from elements to 'nearest miss' centroids""" 
    190189    return sum(min(km.distance(c, d) for j,c in enumerate(km.centroids) if j!=km.clusters[i]) for i,d in enumerate(km.data)) 
     190 
     191from Orange.misc import deprecated_function_name 
     192score_betweenClusterDistance = deprecated_function_name(score_between_cluster_distance) 
    191193 
    192194def score_silhouette(km, index=None): 
  • Orange/clustering/mixture.py

    r9919 r9976  
    277277    """ Computes the gaussian mixture model from an Orange data-set. 
    278278    """ 
    279     def __new__(cls, data=None, weightId=None, **kwargs): 
     279    def __new__(cls, data=None, weight_id=None, **kwargs): 
    280280        self = object.__new__(cls) 
    281281        if data is not None: 
    282282            self.__init__(**kwargs) 
    283             return self.__call__(data, weightId) 
     283            return self.__call__(data, weight_id) 
    284284        else: 
    285285            return self 
     
    289289        self.init_function = init_function 
    290290         
    291     def __call__(self, data, weightId=None): 
     291    def __call__(self, data, weight_id=None): 
    292292        from Orange.preprocess import Preprocessor_impute, DomainContinuizer 
    293293#        data = Preprocessor_impute(data) 
  • Orange/data/sample.py

    r9697 r9994  
    265265Let us construct a list of indices that would assign half of examples 
    266266to the first set and a quarter to the second and third (part of 
    267 :download:`randomindicesn.py <code/randomindicesn.py>`, uses :download:`lenses.tab <code/lenses.tab>`): 
     267:download:`randomindicesn.py <code/randomindicesn.py>`): 
    268268 
    269269.. literalinclude:: code/randomindicesn.py 
     
    292292indices for 10 examples for 5-fold cross validation. For the latter, 
    293293we shall only pass the number of examples, which, of course, prevents 
    294 the stratification. Part of :download:`randomindicescv.py <code/randomindicescv.py>`, uses :download:`lenses.tab <code/lenses.tab>`): 
     294the stratification. Part of :download:`randomindicescv.py <code/randomindicescv.py>`): 
    295295 
    296296.. literalinclude:: code/randomindicescv.py 
  • Orange/data/utils.py

    r9936 r9986  
    1 """\ 
    2 ************************** 
    3 Data Utilities (``utils``) 
    4 ************************** 
    5  
    6 Common operations on :class:`Orange.data.Table`. 
    7  
    8 """ 
    91#from __future__ import absolute_import 
     2 
     3from Orange.core import TransformValue, \ 
     4    Ordinal2Continuous, \ 
     5    Discrete2Continuous, \ 
     6    NormalizeContinuous, \ 
     7    MapIntValue 
     8 
    109 
    1110import random 
  • Orange/ensemble/__init__.py

    r9671 r9994  
    4646validation and observe classification accuracy. 
    4747 
    48 :download:`ensemble.py <code/ensemble.py>` (uses :download:`lymphography.tab <code/lymphography.tab>`) 
     48:download:`ensemble.py <code/ensemble.py>` 
    4949 
    5050.. literalinclude:: code/ensemble.py 
     
    8282to a tree learner on a liver disorder (bupa) and housing data sets. 
    8383 
    84 :download:`ensemble-forest.py <code/ensemble-forest.py>` (uses :download:`bupa.tab <code/bupa.tab>`, :download:`housing.tab <code/housing.tab>`) 
     84:download:`ensemble-forest.py <code/ensemble-forest.py>` 
    8585 
    8686.. literalinclude:: code/ensemble-forest.py 
     
    106106and minExamples are both set to 5. 
    107107 
    108 :download:`ensemble-forest2.py <code/ensemble-forest2.py>` (uses :download:`bupa.tab <code/bupa.tab>`) 
     108:download:`ensemble-forest2.py <code/ensemble-forest2.py>` 
    109109 
    110110.. literalinclude:: code/ensemble-forest2.py 
     
    144144:class:`Orange.data.Table` for details). 
    145145 
    146 :download:`ensemble-forest-measure.py <code/ensemble-forest-measure.py>` (uses :download:`iris.tab <code/iris.tab>`) 
     146:download:`ensemble-forest-measure.py <code/ensemble-forest-measure.py>` 
    147147 
    148148.. literalinclude:: code/ensemble-forest-measure.py 
  • Orange/evaluation/scoring.py

    r10001 r10004  
    10081008@deprecated_keywords({"keepConcavities": "keep_concavities"}) 
    10091009def ROC_add_point(P, R, keep_concavities=1): 
    1010     if keepConcavities: 
     1010    if keep_concavities: 
    10111011        R.append(P) 
    10121012    else: 
  • Orange/feature/discretization.py

    r9927 r9944  
    9393 
    9494        from Orange.feature import discretization 
    95         bayes = Orange.classification.bayes.NaiveBayesLearner() 
     95        bayes = Orange.classification.bayes.Learner() 
    9696        disc = orange.Preprocessor_discretize(method=discretization.EquiNDiscretization(numberOfIntervals=10)) 
    9797        dBayes = discretization.DiscretizedLearner(bayes, name='disc bayes') 
     
    127127  def __call__(self, example, resultType = orange.GetValue): 
    128128    return self.classifier(example, resultType) 
    129  
    130 class DiscretizeTable(object): 
    131     """Discretizes all continuous features of the data table. 
    132  
    133     :param data: data to discretize. 
    134     :type data: :class:`Orange.data.Table` 
    135  
    136     :param features: data features to discretize. None (default) to discretize all features. 
    137     :type features: list of :class:`Orange.feature.Descriptor` 
    138  
    139     :param method: feature discretization method. 
    140     :type method: :class:`Discretization` 
    141     """ 
    142     def __new__(cls, data=None, features=None, discretize_class=False, method=EqualFreq(n=3)): 
    143         if data is None: 
    144             self = object.__new__(cls) 
    145             return self 
    146         else: 
    147             self = cls(features=features, discretize_class=discretize_class, method=method) 
    148             return self(data) 
    149  
    150     def __init__(self, features=None, discretize_class=False, method=EqualFreq(n=3)): 
    151         self.features = features 
    152         self.discretize_class = discretize_class 
    153         self.method = method 
    154  
    155     def __call__(self, data): 
    156         pp = Preprocessor_discretize(attributes=self.features, discretizeClass=self.discretize_class) 
    157         pp.method = self.method 
    158         return pp(data) 
    159  
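
The new DiscretizeTable wrapper can be used either as a one-shot call or as a reusable object; a hedged sketch of both forms (the dataset name is just an example)::

    import Orange
    from Orange.feature.discretization import DiscretizeTable, EqualFreq

    data = Orange.data.Table("iris")

    # one-shot: passing data to the constructor returns the discretized table directly
    disc_data = DiscretizeTable(data, method=EqualFreq(n=4))

    # reusable: construct first, then apply to any table
    discretizer = DiscretizeTable(method=EqualFreq(n=4))
    disc_iris = discretizer(data)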
  • Orange/feature/scoring.py

    r9919 r9988  
    1 """ 
    2 ##################### 
    3 Scoring (``scoring``) 
    4 ##################### 
    5  
    6 .. index:: feature scoring 
    7  
    8 .. index::  
    9    single: feature; feature scoring 
    10  
    11 Feature score is an assessment of the usefulness of the feature for  
    12 prediction of the dependent (class) variable. 
    13  
    14 To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``) use: 
    15  
    16     >>> meas = Orange.feature.scoring.InfoGain() 
    17     >>> print meas("tear_rate", data) 
    18     0.548794925213 
    19  
    20 Other scoring methods are listed in :ref:`classification` and 
    21 :ref:`regression`. Various ways to call them are described on 
    22 :ref:`callingscore`. 
    23  
    24 Instead of first constructing the scoring object (e.g. ``InfoGain``) and 
    25 then using it, it is usually more convenient to do both in a single step:: 
    26  
    27     >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
    28     0.548794925213 
    29  
    30 This way is much slower for Relief, which can efficiently compute scores 
    31 for all features in parallel. 
    32  
    33 It is also possible to score features that do not appear in the data 
    34 but can be computed from it. A typical case is discretized features: 
    35  
    36 .. literalinclude:: code/scoring-info-iris.py 
    37     :lines: 7-11 
    38  
    39 The following example computes feature scores, both with 
    40 :obj:`score_all` and by scoring each feature individually, and prints out  
    41 the best three features.  
    42  
    43 .. literalinclude:: code/scoring-all.py 
    44     :lines: 7- 
    45  
    46 The output:: 
    47  
    48     Feature scores for best three features (with score_all): 
    49     0.613 physician-fee-freeze 
    50     0.255 el-salvador-aid 
    51     0.228 synfuels-corporation-cutback 
    52  
    53     Feature scores for best three features (scored individually): 
    54     0.613 physician-fee-freeze 
    55     0.255 el-salvador-aid 
    56     0.228 synfuels-corporation-cutback 
    57  
    58 .. comment 
    59     The next script uses :obj:`GainRatio` and :obj:`Relief`. 
    60  
    61     .. literalinclude:: code/scoring-relief-gainRatio.py 
    62         :lines: 7- 
    63  
    64     Notice that on this data the ranks of features match:: 
    65          
    66         Relief GainRt Feature 
    67         0.613  0.752  physician-fee-freeze 
    68         0.255  0.444  el-salvador-aid 
    69         0.228  0.414  synfuels-corporation-cutback 
    70         0.189  0.382  crime 
    71         0.166  0.345  adoption-of-the-budget-resolution 
    72  
    73  
    74 .. _callingscore: 
    75  
    76 ======================= 
    77 Calling scoring methods 
    78 ======================= 
    79  
    80 To score a feature, use :obj:`Score.__call__`. There are different 
    81 function signatures, which enable optimization. For instance, 
    82 most scoring methods first compute contingency tables from the 
    83 data. If these are already computed, they can be passed to the scorer 
    84 instead of the data. 
    85  
    86 Not all classes accept all kinds of arguments. :obj:`Relief`, 
    87 for instance, only supports the form with instances on the input. 
    88  
    89 .. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID]) 
    90  
    91     :param attribute: the chosen feature, either as a descriptor,  
    92       index, or a name. 
    93     :type attribute: :class:`Orange.feature.Descriptor` or int or string 
    94     :param data: data. 
    95     :type data: `Orange.data.Table` 
    96     :param weightID: id for meta-feature with weight. 
    97  
    98     All scoring methods support the first signature. 
    99  
    100 .. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
    101  
    102     :param attribute: the chosen feature, either as a descriptor,  
    103       index, or a name. 
    104     :type attribute: :class:`Orange.feature.Descriptor` or int or string 
    105     :param domain_contingency:  
    106     :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
    107  
    108 .. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution]) 
    109  
    110     :param contingency: 
    111     :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
    112     :param class_distribution: distribution of the class 
    113       variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
    114       it should be computed on instances where feature value is 
    115       defined. Otherwise, class distribution should be the overall 
    116       class distribution. 
    117     :type class_distribution:  
    118       :obj:`Orange.statistics.distribution.Distribution` 
    119     :param apriori_class_distribution: Optional and most often 
    120       ignored. Useful if the scoring method makes any probability estimates 
    121       based on apriori class probabilities (such as the m-estimate). 
    122     :return: Feature score - the higher the value, the better the feature. 
    123       If the quality cannot be scored, return :obj:`Score.Rejected`. 
    124     :rtype: float or :obj:`Score.Rejected`. 
    125  
    126 The code below scores the same feature with :obj:`GainRatio`  
    127 using different calls. 
    128  
    129 .. literalinclude:: code/scoring-calls.py 
    130     :lines: 7- 
    131  
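
Since the literalinclude above is not reproduced here, the following is a hedged sketch of the three call forms listed above, using the voting data; exact scores depend on the data and the feature chosen::

    import Orange

    data = Orange.data.Table("voting")
    gain = Orange.feature.scoring.GainRatio()
    feature = data.domain.attributes[0]

    # 1. feature and data
    print gain(feature, data)

    # 2. feature and a precomputed domain contingency
    domain_cont = Orange.statistics.contingency.Domain(data)
    print gain(feature, domain_cont)

    # 3. contingency of the feature and the class distribution
    cont = Orange.statistics.contingency.VarClass(feature, data)
    class_dist = Orange.statistics.distribution.Distribution(data.domain.class_var, data)
    print gain(cont, class_dist)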
    132 .. _classification: 
    133  
    134 ========================================== 
    135 Feature scoring in classification problems 
    136 ========================================== 
    137  
    138 .. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
    139  
    140 .. index::  
    141    single: feature scoring; information gain 
    142  
    143 .. class:: InfoGain 
    144  
    145     Information gain; the expected decrease of entropy. See `page on wikipedia 
    146     <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    147  
    148 .. index::  
    149    single: feature scoring; gain ratio 
    150  
    151 .. class:: GainRatio 
    152  
    153     Information gain ratio; information gain divided by the entropy of the feature's 
    154     value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
    155     of multi-valued features. It has been shown, however, that it still 
    156     overestimates features with multiple values. See `Wikipedia 
    157     <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
    158  
    159 .. index::  
    160    single: feature scoring; gini index 
    161  
    162 .. class:: Gini 
    163  
    164     Gini index is the probability that two randomly chosen instances will have different 
    165     classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_. 
    166  
    167 .. index::  
    168    single: feature scoring; relevance 
    169  
    170 .. class:: Relevance 
    171  
    172     The potential value for decision rules. 
    173  
    174 .. index::  
    175    single: feature scoring; cost 
    176  
    177 .. class:: Cost 
    178  
    179     Evaluates features based on the cost decrease achieved by knowing the value of 
    180     the feature, according to the specified cost matrix. 
    181  
    182     .. attribute:: cost 
    183       
    184         Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
    185  
    186     If the cost of predicting the first class of an instance that is actually in 
    187     the second is 5, and the cost of the opposite error is 1, then an appropriate 
    188     score can be constructed as follows:: 
    189  
    190  
    191         >>> meas = Orange.feature.scoring.Cost() 
    192         >>> meas.cost = ((0, 5), (1, 0)) 
    193         >>> meas(3, data) 
    194         0.083333350718021393 
    195  
    196     Knowing the value of feature 3 would decrease the 
    197     classification cost by approximately 0.083 per instance. 
    198  
    199     .. comment   opposite error - is this term correct? TODO 
    200  
    201 .. index::  
    202    single: feature scoring; ReliefF 
    203  
    204 .. class:: Relief 
    205  
    206     Assesses features' ability to distinguish between very similar 
    207     instances from different classes. This scoring method was first 
    208     developed by Kira and Rendell and then improved by Kononenko. The 
    209     class :obj:`Relief` works on discrete and continuous classes and 
    210     thus implements ReliefF and RReliefF. 
    211  
    212     ReliefF is slow since it needs to find k nearest neighbours for 
    213     each of m reference instances. As we normally compute ReliefF for 
    214     all features in the dataset, :obj:`Relief` caches the results for 
    215     all features when called to score a certain feature. When called 
    216     again, it uses the stored results if the domain and the data table 
    217     have not changed (data table version and the data checksum are 
    218     compared). Caching will only work if you use the same object.  
    219     Constructing new instances of :obj:`Relief` for each feature, 
    220     like this:: 
    221  
    222         for attr in data.domain.attributes: 
    223             print Orange.feature.scoring.Relief(attr, data) 
    224  
    225     runs much slower than reusing the same instance:: 
    226  
    227         meas = Orange.feature.scoring.Relief() 
    228         for attr in data.domain.attributes: 
    229             print meas(attr, data) 
    230  
    231  
    232     .. attribute:: k 
    233      
    234        Number of neighbours for each instance. Default is 5. 
    235  
    236     .. attribute:: m 
    237      
    238         Number of reference instances. Default is 100. When -1, all 
    239         instances are used as reference. 
    240  
    241     .. attribute:: check_cached_data 
    242      
    243         Check if the cached data is changed, which may be slow on large 
    244         tables.  Defaults to :obj:`True`, but should be disabled when it 
    245         is certain that the data will not change while the scorer is used. 
    246  
    247 .. autoclass:: Orange.feature.scoring.Distance 
    248     
    249 .. autoclass:: Orange.feature.scoring.MDL 
    250  
    251 .. _regression: 
    252  
    253 ====================================== 
    254 Feature scoring in regression problems 
    255 ====================================== 
    256  
    257 .. class:: Relief 
    258  
    259     Relief is used for regression in the same way as for 
    260     classification (see :class:`Relief` in classification 
    261     problems). 
    262  
    263 .. index::  
    264    single: feature scoring; mean square error 
    265  
    266 .. class:: MSE 
    267  
    268     Implements the mean square error score. 
    269  
    270     .. attribute:: unknowns_treatment 
    271      
    272         What to do with unknown values. See :obj:`Score.unknowns_treatment`. 
    273  
    274     .. attribute:: m 
    275      
    276         Parameter for m-estimate of error. Default is 0 (no m-estimate). 
    277  
    278  
    279  
    280 ============ 
    281 Base Classes 
    282 ============ 
    283  
    284 Implemented methods for scoring relevances of features are subclasses 
    285 of :obj:`Score`. Those that compute statistics on conditional 
    286 distributions of class values given the feature values are derived from 
    287 :obj:`ScoreFromProbabilities`. 
    288  
    289 .. class:: Score 
    290  
    291     Abstract base class for feature scoring. Its attributes describe which 
    292     types of features it can handle and which kind of data it requires. 
    293  
    294     **Capabilities** 
    295  
    296     .. attribute:: handles_discrete 
    297      
    298         Indicates whether the scoring method can handle discrete features. 
    299  
    300     .. attribute:: handles_continuous 
    301      
    302         Indicates whether the scoring method can handle continuous features. 
    303  
    304     .. attribute:: computes_thresholds 
    305      
    306         Indicates whether the scoring method implements the :obj:`threshold_function`. 
    307  
    308     **Input specification** 
    309  
    310     .. attribute:: needs 
    311      
    312         The type of data needed, indicated by one of the constants 
    313         below. Classes that use :obj:`DomainContingency` will also handle 
    314         generators. Those based on :obj:`Contingency_Class` will be able 
    315         to take generators and domain contingencies. 
    316  
    317         .. attribute:: Generator 
    318  
    319             Constant. Indicates that the scoring method needs an instance 
    320             generator on the input as, for example, :obj:`Relief`. 
    321  
    322         .. attribute:: DomainContingency 
    323  
    324             Constant. Indicates that the scoring method needs 
    325             :obj:`Orange.statistics.contingency.Domain`. 
    326  
    327         .. attribute:: Contingency_Class 
    328  
    329             Constant. Indicates that the scoring method needs the contingency 
    330             (:obj:`Orange.statistics.contingency.VarClass`), feature 
    331             distribution and the apriori class distribution (as most 
    332             scoring methods). 
    333  
    334     **Treatment of unknown values** 
    335  
    336     .. attribute:: unknowns_treatment 
    337  
    338         Defined in classes that are able to treat unknown values. It 
    339         should be set to one of the constants below; a short usage sketch follows the list. 
    340  
    341         .. attribute:: IgnoreUnknowns 
    342  
    343             Constant. Instances for which the feature value is unknown are removed. 
    344  
    345         .. attribute:: ReduceByUnknown 
    346  
    347             Constant. Features with unknown values are  
    348             punished. The feature quality is reduced by the proportion of 
    349             unknown values. For impurity scores the impurity decreases 
    350             only where the value is defined and stays the same otherwise. 
    351  
    352         .. attribute:: UnknownsToCommon 
    353  
    354             Constant. Undefined values are replaced by the most common value. 
    355  
    356         .. attribute:: UnknownsAsValue 
    357  
    358             Constant. Unknown values are treated as a separate value. 
    359  
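        For instance, information gain can be told to replace unknown values
        with the feature's most common value (a minimal sketch; the data set
        name is only illustrative)::

            import Orange
            gain = Orange.feature.scoring.InfoGain()
            gain.unknowns_treatment = Orange.feature.scoring.Score.UnknownsToCommon
            data = Orange.data.Table("voting")
            print gain(data.domain.attributes[0], data)
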
    360     **Methods** 
    361  
    362     .. method:: __call__ 
    363  
    364         Abstract. See :ref:`callingscore`. 
    365  
    366     .. method:: threshold_function(attribute, instances[, weightID]) 
    367      
    368         Abstract.  
    369          
    370         Assess different binarizations of the continuous feature 
    371         :obj:`attribute`.  Return a list of tuples. The first element 
    372         is a threshold (between two existing values), the second is 
    373         the quality of the corresponding binary feature, and the third 
    374         the distribution of instances below and above the threshold. 
    375         Not all scorers return the third element. 
    376  
    377         To show the computation of thresholds, we shall use the Iris 
    378         data set: 
    379  
    380         .. literalinclude:: code/scoring-info-iris.py 
    381             :lines: 13-16 
    382  
    383     .. method:: best_threshold(attribute, instances) 
    384  
    385         Return the best threshold for binarization, that is, the threshold 
    386         with which the resulting binary feature will have the optimal 
    387         score. 
    388  
    389         The script below prints out the best threshold for 
    390         binarization of a feature. ReliefF is used for scoring:  
    391  
    392         .. literalinclude:: code/scoring-info-iris.py 
    393             :lines: 18-19 
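
        An inline sketch of both calls (assuming the Iris data set, as in the
        snippets above; the exact shape of the returned values may differ
        between scorers)::

            import Orange
            data = Orange.data.Table("iris")
            meas = Orange.feature.scoring.Relief()
            petal_length = data.domain["petal length"]
            for t in meas.threshold_function(petal_length, data)[:3]:
                print t
            print meas.best_threshold(petal_length, data)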
    394  
    395 .. class:: ScoreFromProbabilities 
    396  
    397     Bases: :obj:`Score` 
    398  
    399     Abstract base class for feature scoring methods that can be 
    400     computed from contingency matrices. 
    401  
    402     .. attribute:: estimator_constructor 
    403     .. attribute:: conditional_estimator_constructor 
    404      
    405         The classes that are used to estimate unconditional 
    406         and conditional probabilities of classes, respectively. 
    407         Defaults use relative frequencies; possible alternatives are, 
    408         for instance, :obj:`ProbabilityEstimatorConstructor_m` and 
    409         :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` 
    410         (with estimator constructor again set to 
    411         :obj:`ProbabilityEstimatorConstructor_m`), respectively. 
    412  
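    A hedged sketch of swapping in m-estimates (the constructor names are
    taken from the text above; the exact attribute spelling should be treated
    as an assumption)::

        import Orange
        import Orange.core as orange

        gain = Orange.feature.scoring.InfoGain()
        gain.estimator_constructor = orange.ProbabilityEstimatorConstructor_m(m=2)
        cond = orange.ConditionalProbabilityEstimatorConstructor_ByRows()
        cond.estimator_constructor = orange.ProbabilityEstimatorConstructor_m(m=2)
        gain.conditional_estimator_constructor = cond
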
    413 ============ 
    414 Other 
    415 ============ 
    416  
    417 .. autoclass:: Orange.feature.scoring.OrderAttributes 
    418    :members: 
    419  
    420 .. autofunction:: Orange.feature.scoring.score_all 
    421  
    422 .. rubric:: Bibliography 
    423  
    424 .. [Kononenko2007] Igor Kononenko, Matjaz Kukar: Machine Learning and Data Mining,  
    425   Woodhead Publishing, 2007. 
    426  
    427 .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986. 
    428  
    429 .. [Breiman1984] L Breiman et al: Classification and Regression Trees, Chapman and Hall, 1984. 
    430  
    431 .. [Kononenko1995] I Kononenko: On biases in estimating multi-valued attributes, International Joint Conference on Artificial Intelligence, 1995. 
    432  
    433 """ 
    434  
    4351import Orange.core as orange 
    4362import Orange.misc 
     
    44511from orange import MeasureAttribute_relief as Relief 
    44612from orange import MeasureAttribute_MSE as MSE 
    447  
    44813 
    44914###### 
  • Orange/fixes/fix_changed_names.py

    r9936 r9991  
    4141           "orange.PythonVariable": "Orange.feature.Python", 
    4242 
    43            "orange.newmetaid": "Orange.feature:Variable.new_meta_id" 
     43           "orange.newmetaid": "Orange.feature:Variable.new_meta_id", 
    4444 
    4545           "orange.SymMatrix": "Orange.misc.SymMatrix", 
     
    236236           "orange.MajorityLearner":"Orange.classification.majority.MajorityLearner", 
    237237           "orange.DefaultClassifier":"Orange.classification.ConstantClassifier", 
     238 
     239           "orngSQL.SQLReader": "Orange.data.sql.SQLReader", 
     240           "orngSQL.SQLWriter": "Orange.data.sql.SQLWriter", 
    238241 
    239242           "orange.LookupLearner":"Orange.classification.lookup.LookupLearner", 
     
    586589           "orange.RandomGenerator": "Orange.misc.Random", 
    587590 
     591           "orange.TransformValue": "Orange.data.utils.TransformValue", 
     592           "orange.Ordinal2Continuous": "Orange.data.utils.Ordinal2Continuous", 
     593           "orange.Discrete2Continuous": "Orange.data.utils.Discrete2Continuous", 
     594           "orange.NormalizeContinuous": "Orange.data.utils.NormalizeContinuous", 
     595           "orange.MapIntValue": "Orange.data.utils.MapIntValue", 
     596 
    588597           } 
    589598 
  • Orange/fixes/fix_orange_imports.py

    r9818 r9991  
    5858           "orngLinProj": "Orange.projection.linear", 
    5959           "orngEnviron": "Orange.misc.environ", 
     60           "orngSQL": "Orange.data.sql" 
    6061           } 
    6162 
  • Orange/misc/selection.py

    r9775 r9994  
    3232feature with the highest information gain. 
    3333 
    34 part of :download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` (uses :download:`lymphography.tab <code/lymphography.tab>`) 
     34part of :download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` 
    3535 
    3636.. literalinclude:: code/misc-selection-bestonthefly.py 
     
    4242like this: 
    4343 
    44 part of :download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` (uses :download:`lymphography.tab <code/lymphography.tab>`) 
     44part of :download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` 
    4545 
    4646.. literalinclude:: code/misc-selection-bestonthefly.py 
     
    5050The other way to do it is through indices. 
    5151 
    52 :download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` (uses :download:`lymphography.tab <code/lymphography.tab>`) 
     52:download:`misc-selection-bestonthefly.py <code/misc-selection-bestonthefly.py>` 
    5353 
    5454.. literalinclude:: code/misc-selection-bestonthefly.py 
  • Orange/multilabel/br.py

    r9671 r9994  
    4545 
    4646The following example demonstrates a straightforward invocation of 
    47 this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`, uses 
    48 :download:`emotions.tab <code/emotions.tab>`): 
     47this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`): 
    4948 
    5049.. literalinclude:: code/mlc-classify.py 
  • Orange/multilabel/brknn.py

    r9671 r9994  
    3030 
    3131The following example demonstrates a straightforward invocation of 
    32 this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`, uses 
    33 :download:`emotions.tab <code/emotions.tab>`): 
     32this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`): 
    3433 
    3534.. literalinclude:: code/mlc-classify.py 
  • Orange/multilabel/lp.py

    r9922 r9994  
    3434 
    3535The following example demonstrates a straightforward invocation of 
    36 this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`, uses 
    37 :download:`emotions.tab <code/emotions.tab>`): 
     36this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`): 
    3837 
    3938.. literalinclude:: code/mlc-classify.py 
  • Orange/multilabel/mlknn.py

    r9671 r9994  
    3636 
    3737The following example demonstrates a straightforward invocation of 
    38 this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`, uses 
    39 :download:`emotions.tab <code/emotions.tab>`): 
     38this algorithm (:download:`mlc-classify.py <code/mlc-classify.py>`): 
    4039 
    4140.. literalinclude:: code/mlc-classify.py 
  • Orange/multilabel/mulan.py

    r9927 r9994  
    4040 
    4141if __name__=="__main__": 
    42     table = trans_mulan_data("../../doc/datasets/emotions.xml","../../doc/datasets/emotions.arff") 
     42    table = trans_mulan_data("../doc/datasets/emotions.xml","../doc/datasets/emotions.arff") 
    4343     
    4444    for i in range(10): 
    4545        print table[i] 
    4646     
    47     table.save("emotions.tab") 
     47    table.save("/tmp/emotions.tab") 
  • Orange/multitarget/__init__.py

    r9671 r9994  
    2424:download:`generate_multitarget.py <code/generate_multitarget.py>`) to show 
    2525some basic functionalities (part of 
    26 :download:`multitarget.py <code/multitarget.py>`, uses 
    27 :download:`multitarget-synthetic.tab <code/multitarget-synthetic.tab>`). 
     26:download:`multitarget.py <code/multitarget.py>`). 
    2827 
    2928.. literalinclude:: code/multitarget.py 
  • Orange/multitarget/tree.py

    r9922 r9994  
    2424The following example demonstrates how to build a prediction model with 
    2525MultitargetTreeLearner and use it to predict (multiple) class values for 
    26 a given instance (:download:`multitarget.py <code/multitarget.py>`, 
    27 uses :download:`test-pls.tab <code/test-pls.tab>`): 
     26a given instance (:download:`multitarget.py <code/multitarget.py>`): 
    2827 
    2928.. literalinclude:: code/multitarget.py 
  • Orange/network/__init__.py

    r9671 r9994  
    2323Pajek (.net) or GML file format. 
    2424 
    25 :download:`network-read-nx.py <code/network-read-nx.py>` (uses: :download:`K5.net <code/K5.net>`): 
     25:download:`network-read-nx.py <code/network-read-nx.py>`: 
    2626 
    2727.. literalinclude:: code/network-read.py 
  • Orange/network/deprecated.py

    r9922 r9994  
    2929Pajek (.net) or GML file format. 
    3030 
    31 :download:`network-read.py <code/network-read.py>` (uses: :download:`K5.net <code/K5.net>`): 
     31:download:`network-read.py <code/network-read.py>`: 
    3232 
    3333.. literalinclude:: code/network-read.py 
  • Orange/projection/correspondence.py

    r9671 r9994  
    2121 
    2222Data table given below represents smoking habits of different employees 
    23 in a company (computed from :download:`smokers_ct.tab <code/smokers_ct.tab>`). 
     23in a company (computed from `smokers_ct.tab`). 
    2424 
    2525    ================  ====  =====  ======  =====  ========== 
     
    5656 
    5757So lets load the data, compute the contingency and do the analysis 
    58 (:download:`correspondence.py <code/correspondence.py>`, uses :download:`smokers_ct.tab <code/smokers_ct.tab>`):: 
     58(:download:`correspondence.py <code/correspondence.py>`):: 
    5959     
    6060    from Orange.projection import correspondence 
  • Orange/projection/mds.py

    r9916 r9994  
    5555(not included with orange, http://matplotlib.sourceforge.net/). 
    5656 
    57 Example (:download:`mds-scatterplot.py <code/mds-scatterplot.py>`, uses :download:`iris.tab <code/iris.tab>`) 
     57Example (:download:`mds-scatterplot.py <code/mds-scatterplot.py>`) 
    5858 
    5959.. literalinclude:: code/mds-scatterplot.py 
     
    7676time. 
    7777 
    78 Example (:download:`mds-advanced.py <code/mds-advanced.py>`, uses :download:`iris.tab <code/iris.tab>`) 
     78Example (:download:`mds-advanced.py <code/mds-advanced.py>`) 
    7979 
    8080.. literalinclude:: code/mds-advanced.py 
  • Orange/projection/som.py

    r9671 r9994  
    8282 
    8383Class :obj:`Map` stores the self-organizing map composed of :obj:`Node` objects. The code below 
    84 (:download:`som-node.py <code/som-node.py>`, uses :download:`iris.tab <code/iris.tab>`) shows an example how to access the information stored in the  
     84(:download:`som-node.py <code/som-node.py>`) shows an example how to access the information stored in the 
    8585node of the map: 
    8686 
     
    9898======== 
    9999 
    100 The following code  (:download:`som-mapping.py <code/som-mapping.py>`, uses :download:`iris.tab <code/iris.tab>`) infers self-organizing map from Iris data set. The map is rather small, and consists  
     100The following code  (:download:`som-mapping.py <code/som-mapping.py>`) infers self-organizing map from Iris data set. The map is rather small, and consists 
    101101of only 9 cells. We optimize the network, and then report how many data instances were mapped 
    102102into each cell. The second part of the code reports on data instances from one of the corner cells: 
  • Orange/regression/mean.py

    r9671 r9994  
    2626Here's a simple example. 
    2727 
    28 :download:`mean-regression.py <code/mean-regression.py>` (uses: :download:`housing.tab <code/housing.tab>`): 
     28:download:`mean-regression.py <code/mean-regression.py>`: 
    2929 
    3030.. literalinclude:: code/mean-regression.py 
  • Orange/regression/tree.py

    r9671 r9994  
    1212but uses a different set of functions to evaluate node splitting and stop 
    1313criteria. Usage of regression trees is straightforward as demonstrated on the 
    14 following example (:download:`regression-tree-run.py <code/regression-tree-run.py>`, uses :download:`servo.tab <code/servo.tab>`): 
     14following example (:download:`regression-tree-run.py <code/regression-tree-run.py>`): 
    1515 
    1616.. literalinclude:: code/regression-tree-run.py 
  • Orange/statistics/basic.py

    r9671 r9994  
    9797        variables in the domain. 
    9898     
    99     part of :download:`distributions-basic-stat.py <code/distributions-basic-stat.py>` (uses :download:`monks-1.tab <code/monks-1.tab>`) 
     99    part of :download:`distributions-basic-stat.py <code/distributions-basic-stat.py>` 
    100100     
    101101    .. literalinclude:: code/distributions-basic-stat.py 
     
    111111 
    112112 
    113     part of :download:`distributions-basic-stat.py <code/distributions-basic-stat.py>` (uses :download:`iris.tab <code/iris.tab>`) 
     113    part of :download:`distributions-basic-stat.py <code/distributions-basic-stat.py>` 
    114114     
    115115    .. literalinclude:: code/distributions-basic-stat.py 
  • Orange/testing/regression/results_reference/knnlearner.py.txt

    r9689 r9954  
     1Testing using euclidean distance 
    12Iris-setosa Iris-setosa 
    23Iris-versicolor Iris-versicolor 
     
    67 
    78 
    8  
     9Testing using hamming distance 
    910Iris-virginica Iris-virginica 
    1011Iris-setosa Iris-setosa 
  • Orange/testing/regression/results_reference/randomindicescv.py.txt

    r9689 r9954  
     1Indices for ordinary 10-fold CV 
    12<1, 1, 3, 8, 8, 3, 2, 7, 5, 0, 1, 5, 2, 9, 4, 7, 4, 9, 3, 6, 0, 2, 0, 6> 
     3Indices for 5 folds on 10 examples 
    24<3, 0, 1, 0, 3, 2, 4, 4, 1, 2> 
  • Orange/testing/regression/results_reference/treelearner.py.txt

    r9689 r9954  
    1 None 
    2 None 
    311.0 0.0 
    42 
    53 
    64Tree with minExamples = 5.0 
     5tear_rate=reduced: none (100.00%) 
     6tear_rate=normal 
     7|    astigmatic=no 
     8|    |    age=pre-presbyopic: soft (100.00%) 
     9|    |    age=presbyopic: none (50.00%) 
     10|    |    age=young: soft (100.00%) 
     11|    astigmatic=yes 
     12|    |    prescription=hypermetrope: none (66.67%) 
     13|    |    prescription=myope: hard (100.00%) 
    714 
    8 tear_rate (<15.000, 4.000, 5.000>)  
    9 : normal  
    10    astigmatic (<3.000, 4.000, 5.000>)  
    11    : no  
    12       age (<1.000, 0.000, 5.000>)  
    13       : pre-presbyopic --> soft (<0.000, 0.000, 2.000>)   
    14       : presbyopic --> none (<1.000, 0.000, 1.000>)   
    15       : young --> soft (<0.000, 0.000, 2.000>)   
    16    : yes  
    17       prescription (<2.000, 4.000, 0.000>)  
    18       : hypermetrope --> none (<2.000, 1.000, 0.000>)   
    19       : myope --> hard (<0.000, 3.000, 0.000>)   
    20 : reduced --> none (<12.000, 0.000, 0.000>)   
     15 
    2116 
    2217Tree with maxMajority = 0.5 
    23 --> none (<15.000, 4.000, 5.000>)  
     18none (62.50%) 
  • Orange/testing/regression/results_reference/treestructure.py.txt

    r9689 r9954  
    1 Tree size: 10 
     1Tree size: 15 
    22 
    33 
    44Unpruned tree 
     5tear_rate=reduced: none (100.00%) 
     6tear_rate=normal 
     7|    astigmatic=no 
     8|    |    age=pre-presbyopic: soft (100.00%) 
     9|    |    age=young: soft (100.00%) 
     10|    |    age=presbyopic 
     11|    |    |    prescription=hypermetrope: soft (100.00%) 
     12|    |    |    prescription=myope: none (100.00%) 
     13|    astigmatic=yes 
     14|    |    prescription=myope: hard (100.00%) 
     15|    |    prescription=hypermetrope 
     16|    |    |    age=pre-presbyopic: none (100.00%) 
     17|    |    |    age=presbyopic: none (100.00%) 
     18|    |    |    age=young: hard (100.00%) 
    519 
    6 tear_rate (<15.000, 4.000, 5.000>)  
    7 : normal  
    8    astigmatic (<3.000, 4.000, 5.000>)  
    9    : no  
    10       age (<1.000, 0.000, 5.000>)  
    11       : pre-presbyopic --> soft (<0.000, 0.000, 2.000>)   
    12       : presbyopic --> none (<1.000, 0.000, 1.000>)   
    13       : young --> soft (<0.000, 0.000, 2.000>)   
    14    : yes  
    15       prescription (<2.000, 4.000, 0.000>)  
    16       : hypermetrope --> none (<2.000, 1.000, 0.000>)   
    17       : myope --> hard (<0.000, 3.000, 0.000>)   
    18 : reduced --> none (<12.000, 0.000, 0.000>)   
     20 
    1921 
    2022Pruned tree 
     23tear_rate=reduced: none (100.00%) 
     24tear_rate=normal 
     25|    astigmatic=no: soft (83.33%) 
     26|    astigmatic=yes: hard (66.67%) 
    2127 
    22 tear_rate (<15.000, 4.000, 5.000>)  
    23 : normal  
    24    astigmatic (<3.000, 4.000, 5.000>)  
    25    : no --> soft (<1.000, 0.000, 5.000>)   
    26    : yes --> hard (<2.000, 4.000, 0.000>)   
    27 : reduced --> none (<12.000, 0.000, 0.000>)  
  • Orange/testing/regression/xtest.py

    r9873 r9972  
    1212platform = sys.platform 
    1313pyversion = sys.version[:3] 
    14 states = ["OK", "changed", "random", "error", "crash"] 
     14states = ["OK", "timedout", "changed", "random", "error", "crash"] 
    1515 
    1616def file_name_match(name, patterns): 
     
    2222 
    2323def test_scripts(complete, just_print, module="orange", root_directory=".", 
    24                 test_files=None, directories=None): 
     24                test_files=None, directories=None, timeout=5): 
    2525    """Test the scripts in the given directory.""" 
    2626    global error_status 
     
    123123                sys.stdout.flush() 
    124124 
    125                 for state in ["crash", "error", "new", "changed", "random1", "random2"]: 
     125                for state in states: 
    126126                    remname = "%s/%s.%s.%s.%s.txt" % \ 
    127127                              (outputsdir, name, platform, pyversion, state) 
     
    130130 
    131131                titerations = re_israndom.search(open(name, "rt").read()) and 1 or iterations 
    132                 os.spawnl(os.P_WAIT, sys.executable, "-c", regtestdir + "/xtest_one.py", name, str(titerations), outputsdir) 
    133  
    134                 result = open("xtest1_report", "rt").readline().rstrip() or "crash" 
     132                #os.spawnl(os.P_WAIT, sys.executable, "-c", regtestdir + "/xtest_one.py", name, str(titerations), outputsdir) 
     133                p = subprocess.Popen([sys.executable, regtestdir + "/xtest_one.py", name, str(titerations), outputsdir]) 
     134 
     135                passed_time = 0 
     136                while passed_time < timeout: 
     137                    time.sleep(0.01) 
     138                    passed_time += 0.01 
     139 
     140                    if p.poll() is not None: 
     141                        break 
     142 
     143                if p.poll() is None: 
     144                    p.kill() 
     145                    result2 = "timedout" 
     146                    print "timedout (use: --timeout #)" 
     147                    # remove output file and change it for *.timedout.* 
     148                    for state in states: 
     149                        remname = "%s/%s.%s.%s.%s.txt" % \ 
     150                                  (outputsdir, name, platform, pyversion, state) 
     151                        if os.path.exists(remname): 
     152                            os.remove(remname) 
     153 
     154                    timeoutname = "%s/%s.%s.%s.%s.txt" % (outputsdir, name, sys.platform, sys.version[:3], "timedout") 
     155                    open(timeoutname, "wt").close() 
     156                    result = "timedout" 
     157                else: 
     158                    stdout, stderr = p.communicate() 
     159                    result = open("xtest1_report", "rt").readline().rstrip() or "crash" 
     160 
    135161                error_status = max(error_status, states.index(result)) 
    136162                os.remove("xtest1_report") 
     
    139165 
    140166    os.chdir(caller_directory) 
    141  
    142167 
    143168iterations = 1 
     
    147172def usage(): 
    148173    """Print out help.""" 
    149     print "%s [update|test|report|report-html|errors] -[h|s] [--single|--module=[orange|obi|text]|--dir=<dir>|] <files>" % sys.argv[0] 
    150     print "  test:   regression tests on all scripts" 
    151     print "  update: regression tests on all previously failed scripts (default)" 
     174    print "%s [test|update|report|report-html|errors] -[h|s] [--single|--module=[all|orange|docs]|--timeout=<#>|--dir=<dir>|] <files>" % sys.argv[0] 
     175    print "  test:   regression tests on all scripts (default)" 
     176    print "  update: regression tests on all previously failed scripts" 
    152177    print "  report: report on testing results" 
    153178    print "  errors: report on errors from regression tests" 
     
    155180    print "-s, --single: runs a single test on each script" 
    156181    print "--module=<module>: defines a module to test" 
     182    print "--timeout=<#seconds>: defines max. execution time" 
    157183    print "--dir=<dir>: a comma-separated list of names where any should match the directory to be tested" 
    158184    print "<files>: space separated list of string matching the file names to be tested" 
     
    163189    global iterations 
    164190 
    165     command = "update" 
     191    command = "test" 
    166192    if argv: 
    167193        if argv[0] in ["update", "test", "report", "report-html", "errors", "help"]: 
     
    170196 
    171197    try: 
    172         opts, test_files = getopt.getopt(argv, "hs", ["single", "module=", "help", "files=", "verbose="]) 
     198        opts, test_files = getopt.getopt(argv, "hs", ["single", "module=", "timeout=", "help", "files=", "verbose="]) 
    173199    except getopt.GetoptError: 
    174200        print "Warning: Wrong argument" 
     
    183209 
    184210    module = opts.get("--module", "all") 
    185     if module in ["all"]: 
     211    if module == "all": 
    186212        root = "%s/.." % environ.install_dir 
    187213        module = "orange" 
    188         dirs = [("modules", "Orange/doc/modules"), 
    189                 ("reference", "Orange/doc/reference"), 
    190                 ("ofb", "docs/tutorial/rst/code"), 
    191                 ("orange25", "docs/reference/rst/code")] 
    192     elif module in ["orange"]: 
     214        dirs = [("tests", "Orange/testing/regression/tests"), 
     215                ("tests_20", "Orange/testing/regression/tests_20"), 
     216                ("tutorial", "docs/tutorial/rst/code"), 
     217                ("reference", "docs/reference/rst/code")] 
     218    elif module == "orange": 
     219        root = "%s" % environ.install_dir 
     220        module = "orange" 
     221        dirs = [("tests", "testing/regression/tests"), 
     222                ("tests_20", "testing/regression/tests_20")] 
     223    elif module == "docs": 
    193224        root = "%s/.." % environ.install_dir 
    194225        module = "orange" 
    195         dirs = [("modules", "Orange/doc/modules"), 
    196                 ("reference", "Orange/doc/reference"), 
    197                 ("ofb", "docs/tutorial/rst/code")] 
    198     elif module in ["ofb-rst"]: 
    199         root = "%s/.." % environ.install_dir 
    200         module = "orange" 
    201         dirs = [("ofb", "docs/tutorial/rst/code")] 
    202     elif module in ["orange25"]: 
    203         root = "%s/.." % environ.install_dir 
    204         module = "orange" 
    205         dirs = [("orange25", "docs/reference/rst/code")] 
    206     elif module == "obi": 
    207         root = environ.add_ons_dir + "/Bioinformatics/doc" 
    208         dirs = [("modules", "modules")] 
    209     elif module == "text": 
    210         root = environ.add_ons_dir + "/Text/doc" 
    211         dirs = [("modules", "modules")] 
     226        dirs = [("tutorial", "docs/tutorial/rst/code"), 
     227                ("reference", "docs/reference/rst/code")] 
    212228    else: 
    213         print "Error: %s is wrong name of the module, should be in [orange|obi|text]" % module 
     229        print "Error: %s is wrong name of the module, should be in [orange|docs]" % module 
     230        sys.exit(1) 
     231 
     232    timeout = 5 
     233    try: 
     234        _t = opts.get("--timeout", "5") 
     235        timeout = int(_t) 
     236        if timeout <= 0 or timeout >= 120: 
     237            raise AttributeError() 
     238    except AttributeError: 
     239        print "Error: timeout out of range (0 < # < 120)" 
     240        sys.exit(1) 
     241    except: 
     242        print "Error: %s wrong timeout" % opts.get("--timeout", "5") 
    214243        sys.exit(1) 
    215244 
    216245    test_scripts(command == "test", command == "report" or (command == "report-html" and command or False), 
    217246                 module=module, root_directory=root, 
    218                  test_files=test_files, directories=dirs) 
     247                 test_files=test_files, directories=dirs, timeout=timeout) 
    219248    # sys.exit(error_status) 
    220249 
  • Orange/testing/unit/tests/test_association.py

    r9679 r9979  
    1414         
    1515    self.assertLessEqual(len(rules), self.inducer.max_item_sets) 
    16     print "\n%5s   %5s" % ("supp", "conf") 
    1716    for r in rules: 
    18         print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
    1917        self.assertGreaterEqual(r.support, self.inducer.support) 
    2018        self.assertIsNotNone(r.left) 
  • Orange/testing/unit/tests/test_ensemble.py

    r9679 r9978  
    2323        testing.LearnerTestCase.test_pickling_on(self, dataset) 
    2424 
     25 
    2526@datasets_driven(datasets=testing.CLASSIFICATION_DATASETS) 
    2627class TestRandomForest(testing.LearnerTestCase): 
     
    3132    @test_on_datasets(datasets=["iris"]) 
    3233    def test_pickling_on(self, dataset): 
    33         testing.LearnerTestCase.test_pickling_on(self, dataset) 
     34        raise NotImplementedError("SmallTreeLearner pickling is not implemented") 
     35#        testing.LearnerTestCase.test_pickling_on(self, dataset) 
     36         
    3437         
    3538         
  • docs/reference/rst/Orange.associate.rst

    r9372 r9994  
    33==================================== 
    44 
    5 .. automodule:: Orange.associate 
     5============================== 
     6Induction of association rules 
     7============================== 
     8 
     9Orange provides two algorithms for induction of 
     10`association rules <http://en.wikipedia.org/wiki/Association_rule_learning>`_. 
     11One is the basic Agrawal's algorithm with dynamic induction of supported 
     12itemsets and rules that is designed specifically for datasets with a 
     13large number of different items. This is, however, not really suitable 
     14for feature-based machine learning problems. 
     15We have adapted the original algorithm for efficiency 
     16with the latter type of data, and to induce the rules where, 
     17both sides don't only contain features 
     18(like "bread, butter -> jam") but also their values 
     19("bread = wheat, butter = yes -> jam = plum"). 
     20 
     21It is also possible to extract item sets instead of association rules. These 
     22are often more interesting than the rules themselves. 
     23 
     24Besides association rule inducer, Orange also provides a rather simplified 
     25method for classification by association rules. 
     26 
     27=================== 
     28Agrawal's algorithm 
     29=================== 
     30 
     31The class that induces rules by Agrawal's algorithm, accepts the data examples 
     32of two forms. The first is the standard form in which each example is 
     33described by values of a fixed list of features (defined in domain). 
     34The algorithm, however, disregards the feature values and only checks whether 
     35the value is defined or not. The rule shown above ("bread, butter -> jam") 
     36actually means that if "bread" and "butter" are defined, then "jam" is defined 
     37as well. It is expected that most of the values will be undefined; if this is not 
     38so, use the :class:`~AssociationRulesInducer`. 
     39 
     40:class:`AssociationRulesSparseInducer` can also use sparse data. 
     41Sparse examples have no fixed 
     42features - the domain is empty. All values assigned to an example are given as meta attributes. 
     43All meta attributes need to be registered with the :obj:`~Orange.data.Domain`. 
     44The most suitable format for this kind of data is the basket format. 
     45 
     46The algorithm first dynamically builds all itemsets (sets of features) that have 
     47at least the prescribed support. Each of these is then used to derive rules 
     48with requested confidence. 
     49 
     50If examples were given in the sparse form, so are the left and right side 
     51of the induced rules. If examples were given in the standard form, so are 
     52the examples in association rules. 
     53 
     54.. class:: AssociationRulesSparseInducer 
     55 
     56    .. attribute:: support 
     57 
     58        Minimal support for the rule. 
     59 
     60    .. attribute:: confidence 
     61 
     62        Minimal confidence for the rule. 
     63 
     64    .. attribute:: store_examples 
     65 
     66        Store the examples covered by each rule and 
     67        those confirming it. 
     68 
     69    .. attribute:: max_item_sets 
     70 
     71        The maximal number of itemsets. The algorithm's 
     72        running time (and its memory consumption) depends on the minimal support; 
     73        the lower the requested support, the more eligible itemsets will be found. 
     74        There is no general rule for setting support - perhaps it 
     75        should be around 0.3, but this depends on the data set. 
     76        If the support is set too low, the algorithm can run out of memory. 
     77        Therefore, Orange limits the number of generated rules to 
     78        :obj:`max_item_sets`. If Orange reports that the prescribed 
     79        :obj:`max_item_sets` was exceeded, increase the required support 
     80        or, alternatively, increase :obj:`max_item_sets` to as high as your computer 
     81        can handle. 
     82 
     83    .. method:: __call__(data, weight_id) 
     84 
     85        Induce rules from the data set. 
     86 
     87 
     88    .. method:: get_itemsets(data) 
     89 
     90        Returns a list of pairs. The first element of a pair is a tuple with 
     91        indices of features in the item set (negative for sparse data). 
     92        The second element is a list of indices of the examples that support the item 
     93        set, that is, contain all the items in the set. If :obj:`store_examples` is False, the second 
     94        element is None. 
     95 
     96We shall test the rule inducer on a dataset consisting of a brief description 
     97of the Spanish Inquisition, given by Palin et al.: 
     98 
     99    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again. 
     100 
     101    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn! 
     102 
     103The text needs to be cleaned of punctuation marks and of capital letters at the beginnings of sentences, each sentence needs to be put on its own line, and commas need to be inserted between the words. 
     104 
     105Data example (:download:`inquisition.basket <code/inquisition.basket>`): 
     106 
     107.. literalinclude:: code/inquisition.basket 
     108 
     109Inducing the rules is trivial:: 
     110 
     111    import Orange 
     112    data = Orange.data.Table("inquisition") 
     113 
     114    rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5) 
     115 
     116    print "%5s   %5s" % ("supp", "conf") 
     117    for r in rules: 
     118        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r) 
     119 
     120The induced rules are surprisingly fear-full: :: 
     121 
     122    0.500   1.000   fear -> surprise 
     123    0.500   1.000   surprise -> fear 
     124    0.500   1.000   fear -> surprise our 
     125    0.500   1.000   fear surprise -> our 
     126    0.500   1.000   fear our -> surprise 
     127    0.500   1.000   surprise -> fear our 
     128    0.500   1.000   surprise our -> fear 
     129    0.500   0.714   our -> fear surprise 
     130    0.500   1.000   fear -> our 
     131    0.500   0.714   our -> fear 
     132    0.500   1.000   surprise -> our 
     133    0.500   0.714   our -> surprise 
     134 
     135To get only a list of supported item sets, one should call the method 
     136get_itemsets:: 
     137 
     138    inducer = Orange.associate.AssociationRulesSparseInducer(support = 0.5, store_examples = True) 
     139    itemsets = inducer.get_itemsets(data) 
     140 
     141Now itemsets is a list of itemsets along with the examples supporting them 
     142since we set store_examples to True. :: 
     143 
     144    >>> itemsets[5] 
     145    ((-11, -7), [1, 2, 3, 6, 9]) 
     146    >>> [data.domain[i].name for i in itemsets[5][0]] 
     147    ['surprise', 'our'] 
     148 
     149The sixth itemset contains features with indices -11 and -7, that is, the 
     150words "surprise" and "our". The examples supporting it are those with 
     151indices 1,2, 3, 6 and 9. 
     152 
     153This way of representing the itemsets is memory efficient and faster than using 
     154objects like :obj:`~Orange.feature.Descriptor` and :obj:`~Orange.data.Instance`. 
     155 
     156.. _non-sparse-examples: 
     157 
     158=================== 
     159Non-sparse data 
     160=================== 
     161 
     162:class:`AssociationRulesInducer` works with non-sparse data. 
     163Unknown values are ignored, while values of features are not (as opposed to 
     164the algorithm for sparse rules). In addition, the algorithm 
     165can be directed to search only for classification rules, in which the only 
     166feature on the right-hand side is the class variable. 
     167 
     168.. class:: AssociationRulesInducer 
     169 
     170    All attributes can be set with the constructor. 
     171 
     172    .. attribute:: support 
     173 
     174       Minimal support for the rule. 
     175 
     176    .. attribute:: confidence 
     177 
     178        Minimal confidence for the rule. 
     179 
     180    .. attribute:: classification_rules 
     181 
     182        If True (default is False), the classification rules are constructed instead 
     183        of general association rules. 
     184 
     185    .. attribute:: store_examples 
     186 
     187        Store the examples covered by each rule and those 
     188        confirming it 
     189 
     190    .. attribute:: max_item_sets 
     191 
     192        The maximal number of itemsets. 
     193 
     194    .. method:: __call__(data, weight_id) 
     195 
     196        Induce rules from the data set. 
     197 
     198    .. method:: get_itemsets(data) 
     199 
     200        Returns a list of pairs. The first element of a pair is a tuple with 
     201        indices of features in the item set (negative for sparse data). 
     202        The second element is a list of indices of the examples that support the item 
     203        set, that is, contain all the items in the set. If :obj:`store_examples` is False, the second 
     204        element is None. 
     205 
     206The example:: 
     207 
     208    import Orange 
     209 
     210    data = Orange.data.Table("lenses") 
     211 
     212    print "Association rules" 
     213    rules = Orange.associate.AssociationRulesInducer(data, support = 0.5) 
     214    for r in rules: 
     215        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     216 
     217The found rules are: :: 
     218 
     219    0.333  0.533  lenses=none -> prescription=hypermetrope 
     220    0.333  0.667  prescription=hypermetrope -> lenses=none 
     221    0.333  0.533  lenses=none -> astigmatic=yes 
     222    0.333  0.667  astigmatic=yes -> lenses=none 
     223    0.500  0.800  lenses=none -> tear_rate=reduced 
     224    0.500  1.000  tear_rate=reduced -> lenses=none 
     225 
     226To limit the algorithm to classification rules, set classification_rules to 1: :: 
     227 
     228    print "\\nClassification rules" 
     229    rules = Orange.associate.AssociationRulesInducer(data, support = 0.3, classification_rules = 1) 
     230    for r in rules: 
     231        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r) 
     232 
     233The found rules are, naturally, a subset of the above rules: :: 
     234 
     235    0.333  0.667  prescription=hypermetrope -> lenses=none 
     236    0.333  0.667  astigmatic=yes -> lenses=none 
     237    0.500  1.000  tear_rate=reduced -> lenses=none 
     238 
     239Itemsets are induced in a similar fashion as for sparse data, except that the 
     240first element of the tuple, the item set, is represented not by indices of 
     241features, as before, but with tuples (feature-index, value-index): :: 
     242 
     243    inducer = Orange.associate.AssociationRulesInducer(support = 0.3, store_examples = True) 
     244    itemsets = inducer.get_itemsets(data) 
     245    print itemsets[8] 
     246 
     247This prints out :: 
     248 
     249    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23]) 
     250 
     251meaning that the ninth itemset contains the second value of the third feature 
     252(2, 1), and the first value of the fifth (4, 0). 
     253 
     254======================= 
     255Representation of rules 
     256======================= 
     257 
     258An :class:`AssociationRule` represents a rule. In Orange, methods for 
     259induction of association rules return the induced rules in 
     260:class:`AssociationRules`, which is basically a list of :class:`AssociationRule` instances. 
     261 
     262.. class:: AssociationRule 
     263 
     264    .. method:: __init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples) 
     265 
     266        Constructs an association rule and computes all measures listed above. 
     267 
     268    .. method:: __init__(left, right, support, confidence) 
     269 
     270        Constructs an association rule and sets its support and confidence. If 
     271        you intend to pass on such a rule, you should set the other attributes 
     272        manually - AssociationRule's constructor cannot compute anything 
     273        from arguments support and confidence. 
     274 
     275    .. method:: __init__(rule) 
     276 
     277        Given an association rule as the argument, the constructor makes a copy of the 
     278        rule. 
     279 
     280    .. attribute:: left, right 
     281 
     282        The left and the right side of the rule. Both are given as :class:`Orange.data.Instance`. 
     283        In rules created by :class:`AssociationRulesSparseInducer` from examples that 
     284        contain all values as meta-values, left and right are examples in the 
     285        same form. Otherwise, values in left that do not appear in the rule 
     286        are "don't care", and values in right are "don't know". Both can, 
     287        however, be tested by :meth:`~Orange.data.Value.is_special`. 
     288 
     289    .. attribute:: n_left, n_right 
     290 
     291        The number of features (i.e. defined values) on the left and on the 
     292        right side of the rule. 
     293 
     294    .. attribute:: n_applies_left, n_applies_right, n_applies_both 
     295 
     296        The number of (learning) examples that conform to the left, the right 
     297        and to both sides of the rule. 
     298 
     299    .. attribute:: n_examples 
     300 
     301        The total number of learning examples. 
     302 
     303    .. attribute:: support 
     304 
     305        n_applies_both/n_examples. 
     306 
     307    .. attribute:: confidence 
     308 
     309        n_applies_both/n_applies_left. 
     310 
     311    .. attribute:: coverage 
     312 
     313        n_applies_left/n_examples. 
     314 
     315    .. attribute:: strength 
     316 
     317        n_applies_right/n_applies_left. 
     318 
     319    .. attribute:: lift 
     320 
     321        n_examples * n_applies_both / (n_applies_left * n_applies_right). 
     322 
     323    .. attribute:: leverage 
     324 
     325        n_applies_both * n_examples - n_applies_left * n_applies_right. 
     326 
     327    .. attribute:: examples, match_left, match_both 
     328 
     329        If store_examples was True during induction, examples contains a copy 
     330        of the example table used to induce the rules. Attributes match_left 
     331        and match_both are lists of integers, representing the indices of 
     332        examples which match the left-hand side of the rule and both sides, 
     333        respectively. 
     334 
     335    .. method:: applies_left(example) 
     336 
     337    .. method:: applies_right(example) 
     338 
     339    .. method:: applies_both(example) 
     340 
     341        Tells whether the example fits into the left, right or both sides of 
     342        the rule, respectively. If the rule is represented by sparse examples, 
     343        the given example must be sparse as well. 
     344 
     345Association rule inducers do not store evidence about which example supports 
     346which rule. Let us write a function that finds the examples that 
     347confirm the rule (fit both sides of it) and those that contradict it (fit the 
     348left-hand side but not the right). The example:: 
     349 
     350    import Orange 
     351 
     352    data = Orange.data.Table("lenses") 
     353 
     354    rules = Orange.associate.AssociationRulesInducer(data, support = 0.3) 
     355    rule = rules[0] 
     356 
     357    print 
     358    print "Rule: ", rule 
     359    print 
     360 
     361    print "Supporting examples:" 
     362    for example in data: 
     363        if rule.applies_both(example): 
     364            print example 
     365    print 
     366 
     367    print "Contradicting examples:" 
     368    for example in data: 
     369        if rule.applies_left(example) and not rule.applies_right(example): 
     370            print example 
     371    print 
     372 
     373The latter printouts get simpler and faster if we instruct the inducer to 
     374store the examples. We can then do, for instance, this: :: 
     375 
     376    print "Match left: " 
     377    print "\\n".join(str(rule.examples[i]) for i in rule.match_left) 
     378    print "\\nMatch both: " 
     379    print "\\n".join(str(rule.examples[i]) for i in rule.match_both) 
     380 
     381The "contradicting" examples are then those whose indices are found in 
     382match_left but not in match_both. A more memory-friendly and faster way 
     383to compute this is as follows: :: 
     384 
     385    >>> [x for x in rule.match_left if not x in rule.match_both] 
     386    [0, 2, 8, 10, 16, 17, 18] 
     387    >>> set(rule.match_left) - set(rule.match_both) 
     388    set([0, 2, 8, 10, 16, 17, 18]) 
     389 
     390=============== 
     391Utilities 
     392=============== 
     393 
     394.. autofunction:: print_rules 
     395 
     396.. autofunction:: sort 
  • docs/reference/rst/Orange.data.continuization.rst

    r9941 r9966  
    1111variable separately. 
    1212 
    13 .. class DomainContinuizer 
     13.. class:: DomainContinuizer 
    1414 
    1515    Returns a new domain containing only continuous attributes given a 
     
    2929      ``multinomial_treatment``. 
    3030 
    31     .. attribute zero_based 
     31    The typical use of the class is as follows:: 
     32 
     33        continuizer = Orange.data.continuization.DomainContinuizer() 
     34        continuizer.multinomial_treatment = continuizer.LowestIsBase 
     35        domain0 = continuizer(data) 
     36        data0 = data.translate(domain0) 
     37 
     38    .. attribute:: zero_based 
    3239 
    3340        Determines the value used as the "low" value of the variable. When 
     
    3845        following text assumes the default case. 
    3946 
    40     .. attribute multinomial_treatment 
     47    .. attribute:: multinomial_treatment 
    4148 
    4249       Decides the treatment of multinomial variables. Let N be the 
     
    5461           used (directly) in, for instance, linear or logistic regression. 
    5562 
     63           For example, data set "bridges" has feature "RIVER" with 
     64           values "M", "A", "O" and "Y", in that order. Its value for 
     65           the 15th row is "M". Continuization replaces the variable 
     66           with variables "RIVER=M", "RIVER=A", "RIVER=O" and 
     67           "RIVER=Y". For the 15th row, the first has value 1 and 
     68           others are 0. 
     69 
    5670       DomainContinuizer.LowestIsBase 
    5771           Similar to the above except that it creates only N-1 
     
    6377           specified value is used as base instead of the lowest one. 
    6478 
     79           Continuizing the variable "RIVER" gives similar results as 
     80           above except that it would omit "RIVER=M"; all three 
     81           variables would be zero for the 15th data instance. 
     82 
    6583       DomainContinuizer.FrequentIsBase 
    66  
    6784           Like above, except that the most frequent value is used as the 
    6885           base (this can again be overidden by setting the descriptor's 
     
    7188           extracted from data, so this option cannot be used if constructor 
    7289           is given only a domain. 
     90 
     91           Variable "RIVER" would be continuized similarly to above 
     92           except that it omits "RIVER=A", which is the most frequent value. 
    7393            
    7494       DomainContinuizer.Ignore 
     
    87107           variable. 
    88108 
    89     .. attribute normalize_continuous 
     109    .. attribute:: normalize_continuous 
    90110 
    91111        If ``False`` (default), continuous variables are left unchanged. If 
  • docs/reference/rst/Orange.data.discretization.rst

    r9900 r9963  
    1 .. py:currentmodule:: Orange.data 
     1.. py:currentmodule:: Orange.data.discretization 
    22 
    3 ################################### 
    4 Discretization (``discretization``) 
    5 ################################### 
     3######################################## 
     4Data discretization (``discretization``) 
     5######################################## 
    66 
    77.. index:: discretization 
     
    1010   single: data; discretization 
    1111 
    12 Continues features in the data can be discretized using a uniform discretization method. The approach will consider 
    13 only continues features, and replace them in the data set with corresponding categorical features: 
     12Continuous features in the data can be discretized using a uniform discretization method. Discretization considers 
     13only continuous features and replaces them in the new data set with corresponding categorical features: 
    1414 
    1515.. literalinclude:: code/discretization-table.py 
    1616 
    17 Discretization introduces new categorical features and computes their values in accordance to 
    18 a discretization method:: 
     17Discretization introduces new categorical features with discretized values:: 
    1918 
    2019    Original data set: 
     
    2827    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
    2928 
    30 The procedure uses feature discretization classes as define in XXX and applies them on entire data sets. 
    31 The suported discretization methods are: 
     29Data discretization uses feature discretization classes from :doc:`Orange.feature.discretization` 
      30and applies them to the entire data set. The supported discretization methods are: 
    3231 
    3332* equal width discretization, where the domain of continuous feature is split to intervals of the same 
    34   width equal-sized intervals (:class:`EqualWidth`), 
    35 * equal frequency discretization, where each intervals contains equal number of data instances (:class:`EqualFreq`), 
     33  width (uses :class:`Orange.feature.discretization.EqualWidth`), 
     34* equal frequency discretization, where each interval contains an equal number of data instances (uses 
     35  :class:`Orange.feature.discretization.EqualFreq`), 
    3636* entropy-based, as originally proposed by [FayyadIrani1993]_ that infers the intervals to minimize 
    37   within-interval entropy of class distributions (:class:`Entropy`), 
     37  within-interval entropy of class distributions (uses :class:`Orange.feature.discretization.Entropy`), 
    3838* bi-modal, using three intervals to optimize the difference of the class distribution in 
    39   the middle with the distribution outside it (:class:`BiModal`), 
     39  the middle with the distribution outside it (uses :class:`Orange.feature.discretization.BiModal`), 
    4040* fixed, with the user-defined cut-off points. 
    4141 
    42 The above script used the default discretization method (equal frequency with three intervals). This can be 
    43 changed while some selected discretization approach as demonstrated below: 
     42.. FIXME give a corresponding class for fixed discretization 
     43 
     44The default discretization method (equal frequency with three intervals) can be replaced with other 
     45discretization approaches, as demonstrated below: 
    4446 
    4547.. literalinclude:: code/discretization-table-method.py 
    4648    :lines: 3-5 
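
A minimal inline sketch of the same idea (the data set name and the number of
intervals are only illustrative)::

    import Orange
    from Orange.data.discretization import DiscretizeTable
    from Orange.feature.discretization import EqualWidth

    data = Orange.data.Table("iris")
    data_disc = DiscretizeTable(data, method=EqualWidth(n=4))
    print data_disc[0]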
    4749 
    48 Classes 
    49 ======= 
     50Entropy-based discretization is special as it may infer new features that are constant and have only one value. Such 
     51features are redundant and provide no information about the class. By default, 
     52:class:`DiscretizeTable` removes them, in this way performing a form of feature subset selection. The effect of removing 
     53non-informative features is demonstrated in the following script: 
    5054 
    51 Some functions and classes that can be used for 
    52 categorization of continuous features. Besides several general classes that 
    53 can help in this task, we also provide a function that may help in 
    54 entropy-based discretization (Fayyad & Irani), and a wrapper around classes for 
    55 categorization that can be used for learning. 
     55.. literalinclude:: code/discretization-entropy.py 
     56    :lines: 3- 
    5657 
    57 .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class 
     58In the sampled data set above, three features were discretized to a constant and thus removed:: 
     59 
     60    Redundant features (3 of 13): 
     61    cholesterol, rest SBP, age 
     62 
     63.. note:: 
     64    Entropy-based and bi-modal discretization require class-labeled data sets. 
     65 
     66Data discretization classes 
     67=========================== 
     68 
     69.. .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class 
    5870 
    5971.. autoclass:: DiscretizeTable 
    6072 
    61 .. rubric:: Example 
     73.. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
     74   for Beginners tutorial shows the use of DiscretizedLearner. Other 
     75   discretization classes from core Orange are listed in chapter on 
     76   `categorization <../ofb/o_categorization.htm>`_ of the same tutorial. -> should put in classification/wrappers 
    6277 
    63 FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
    64 for Beginners tutorial shows the use of DiscretizedLearner. Other 
    65 discretization classes from core Orange are listed in chapter on 
    66 `categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
     78.. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued 
     79  attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages 
     80  1022--1029, Chambery, France, 1993. 
  • docs/reference/rst/Orange.data.domain.rst

    r9936 r9958  
    308308         variable from the list is used as the class variable. :: 
    309309 
    310              >>> domain1 = orange.Domain([a, b]) 
    311              >>> domain2 = orange.Domain(["a", b, c], domain) 
     310             >>> domain1 = Orange.data.Domain([a, b]) 
     311             >>> domain2 = Orange.data.Domain(["a", b, c], domain) 
    312312 
    313313         :param variables: List of variables (strings or instances of :obj:`~Orange.feature.Descriptor`) 
     
    323323         last variable should be used as the class variable. :: 
    324324 
    325              >>> domain1 = orange.Domain([a, b], False) 
    326              >>> domain2 = orange.Domain(["a", b, c], False, domain) 
     325             >>> domain1 = Orange.data.Domain([a, b], False) 
     326             >>> domain2 = Orange.data.Domain(["a", b, c], False, domain) 
    327327 
    328328         :param variables: List of variables (strings or instances of :obj:`~Orange.feature.Descriptor`) 
  • docs/reference/rst/Orange.data.instance.rst

    r9936 r9958  
    9191passed along with the data:: 
    9292 
    93     bayes = orange.BayesLearner(data, id) 
     93    bayes = Orange.classification.bayes.NaiveLearner(data, id) 
    9494 
    9595Many other functions accept weights in similar fashion. 
     
    112112accessed:: 
    113113 
    114     w = orange.FloatVariable("w") 
     114    w = Orange.feature.Continuous("w") 
    115115    data.domain.addmeta(id, w) 
    116116 
     
    125125allows for conversion from Python native types:: 
    126126 
    127     ok = orange.EnumVariable("ok?", values=["no", "yes"]) 
     127    ok = Orange.feature.Discrete("ok?", values=["no", "yes"]) 
    128128    ok_id = Orange.feature.Descriptor.new_meta_id() 
    129129    data.domain.addmeta(ok_id, ok) 
     
    237237        Convert the instance into an ordinary Python list. If the 
    238238        optional argument `level` is 1 (default), the result is a list of 
    239         instances of :obj:`orange.data.Value`. If it is 0, it contains 
     239        instances of :obj:`Orange.data.Value`. If it is 0, it contains 
    240240        pure Python objects, that is, strings for discrete variables 
    241241        and numbers for continuous ones. 
     
    281281        attributes are returned. :: 
    282282 
    283             data = orange.ExampleTable("inquisition2") 
     283            data = Orange.data.Table("inquisition2") 
    284284            example = data[4] 
    285             print example.getmetas() 
    286             print example.getmetas(int) 
    287             print example.getmetas(str) 
    288             print example.getmetas(orange.Variable) 
     285            print example.get_metas() 
     286            print example.get_metas(int) 
     287            print example.get_metas(str) 
     288            print example.get_metas(Orange.feature.Descriptor) 
    289289 
    290290        :param key_type: the key type; either ``int``, ``str`` or :obj:`~Orange.feature.Descriptor` 
  • docs/reference/rst/Orange.data.rst

    r9941 r9992  
    1313    Orange.data.discretization 
    1414    Orange.data.continuization 
     15    Orange.data.utils 
     16    Orange.data.continuization 
     17    Orange.data.sql 
  • docs/reference/rst/Orange.data.value.rst

    r9927 r9958  
    133133    deg3 = Orange.feature.Discrete( 
    134134        "deg3", values=["little", "medium", "big"]) 
    135     deg4 = orange.feature.Discrete( 
     135    deg4 = Orange.feature.Discrete( 
    136136        "deg4", values=["tiny", "little", "big", "huge"]) 
    137     val3 = orange.Value(deg3) 
    138     val4 = orange.Value(deg4) 
     137    val3 = Orange.data.Value(deg3) 
     138    val4 = Orange.data.Value(deg4) 
    139139    val3.value = "medium" 
    140140    val4.value = "little" 
  • docs/reference/rst/Orange.evaluation.scoring.rst

    r10000 r10004  
    166166 
    167167Basic cross validation example is shown in the following part of 
    168 (:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`): 
     168(:download:`statExamples.py <code/statExamples.py>`): 
    169169 
    170170If instances are weighted, weights are taken into account. This can be 
     
    181181 
    182182So, let's compute all this in part of 
    183 (:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print it out: 
     183(:download:`statExamples.py <code/statExamples.py>`) and print it out: 
    184184 
    185185.. literalinclude:: code/statExample1.py 
     
    446446We shall use the following code to prepare suitable experimental results:: 
    447447 
    448     ri2 = Orange.core.MakeRandomIndices2(voting, 0.6) 
     448    ri2 = Orange.data.sample.SubsetIndices2(voting, 0.6) 
    449449    train = voting.selectref(ri2, 0) 
    450450    test = voting.selectref(ri2, 1) 
     
    542542 
    543543So, let's compute all this and print it out (part of 
    544 :download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses 
    545 :download:`emotions.tab <code/emotions.tab>`): 
     544:download:`mlc-evaluate.py <code/mlc-evaluate.py>`): 
    546545 
    547546.. literalinclude:: code/mlc-evaluate.py 
  • docs/reference/rst/Orange.evaluation.testing.rst

    r9696 r9994  
    3232list of learning algorithms is prepared. 
    3333 
    34 part of :download:`testing-test.py <code/testing-test.py>` (uses :download:`voting.tab <code/voting.tab>`) 
     34part of :download:`testing-test.py <code/testing-test.py>` 
    3535 
    3636.. literalinclude:: code/testing-test.py 
  • docs/reference/rst/Orange.feature.discretization.rst

    r9927 r9964  
    11.. py:currentmodule:: Orange.feature.discretization 
    22 
    3 ################################### 
    4 Discretization (``discretization``) 
    5 ################################### 
     3########################################### 
     4Feature discretization (``discretization``) 
     5########################################### 
    66 
    77.. index:: discretization 
     
    1010   single: feature; discretization 
    1111 
    12 Continues features can be discretized either one feature at a time, or, as demonstrated in the following script, 
    13 using a single discretization method on entire set of data features: 
    14  
    15 .. literalinclude:: code/discretization-table.py 
    16  
    17 Discretization introduces new categorical features and computes their values in accordance to 
    18 selected (or default) discretization method:: 
    19  
    20     Original data set: 
    21     [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] 
    22     [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] 
    23     [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] 
    24  
    25     Discretized data set: 
    26     ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
    27     ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa'] 
    28     ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa'] 
    29  
    30 The following discretization methods are supported: 
    31  
    32 * equal width discretization, where the domain of continuous feature is split to intervals of the same 
    33   width equal-sized intervals (:class:`EqualWidth`), 
    34 * equal frequency discretization, where each intervals contains equal number of data instances (:class:`EqualFreq`), 
    35 * entropy-based, as originally proposed by [FayyadIrani1993]_ that infers the intervals to minimize 
    36   within-interval entropy of class distributions (:class:`Entropy`), 
    37 * bi-modal, using three intervals to optimize the difference of the class distribution in 
    38   the middle with the distribution outside it (:class:`BiModal`), 
    39 * fixed, with the user-defined cut-off points. 
    40  
    41 The above script used the default discretization method (equal frequency with three intervals). This can be changed 
    42 as demonstrated below: 
    43  
    44 .. literalinclude:: code/discretization-table-method.py 
    45     :lines: 3-5 
    46  
    47 With exception to fixed discretization, discretization approaches infer the cut-off points from the 
    48 training data set and thus construct a discretizer to convert continuous values of this feature into categorical 
    49 value according to the rule found by discretization. In this respect, the discretization behaves similar to 
    50 :class:`Orange.classification.Learner`. 
    51  
    52 Discretization Algorithms 
    53 ========================= 
    54  
    55 Instances of discretization classes are all derived from :class:`Discretization`. 
    56  
    57 .. class:: Discretization 
    58  
    59     .. method:: __call__(feature, data[, weightID]) 
    60  
    61         Given a continuous ``feature``, ``data`` and, optionally id of 
    62         attribute with example weight, this function returns a discretized 
    63         feature. Argument ``feature`` can be a descriptor, index or 
    64         name of the attribute. 
    65  
    66  
    67 .. class:: EqualWidth 
    68  
    69     Discretizes the feature by spliting its domain to a fixed number 
    70     of equal-width intervals. The span of original domain is computed 
    71     from the training data and is defined by the smallest and the 
    72     largest feature value. 
    73  
    74     .. attribute:: n 
    75  
    76         Number of discretization intervals (default: 4). 
    77  
    78 The following example discretizes Iris dataset features using six 
    79 intervals. The script constructs a :class:`Orange.data.Table` with discretized 
    80 features and outputs their description: 
    81  
    82 .. literalinclude:: code/discretization.py 
    83     :lines: 38-43 
    84  
    85 The output of this script is:: 
    86  
    87     D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
    88     D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
    89     D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
    90     D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
    91  
    92 The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``:: 
    93  
    94     >>> for attr in newattrs: 
    95     ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
    96     ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
    97     ...    attr.get_value_from.transformer.step) 
    98     D_sepal length: first interval at 4.900, step 0.600 
    99     D_sepal width: first interval at 2.400, step 0.400 
    100     D_petal length: first interval at 1.980, step 0.980 
    101     D_petal width: first interval at 0.500, step 0.400 
    102  
    103 All discretizers have the method 
    104 ``construct_variable``: 
    105  
    106 .. literalinclude:: code/discretization.py 
    107     :lines: 69-73 
    108  
    109  
    110 .. class:: EqualFreq 
    111  
    112     Infers the cut-off points so that the discretization intervals contain 
    113     approximately equal number of training data instances. 
    114  
    115     .. attribute:: n 
    116  
    117         Number of discretization intervals (default: 4). 
    118  
    119 The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points`` 
    120 that store the inferred cut-offs. 
    121  
    122 .. class:: Entropy 
    123  
    124     Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most 
    125     appropriate number of intervals by recursively splitting the domain of continuous feature to minimize the 
    126     class-entropy of training examples. The splitting is repeated until the entropy decrease is smaller than the 
    127     increase of minimal descripton length (MDL) induced by the new cut-off point. 
    128  
    129     Entropy-based discretization can reduce a continuous feature into 
    130     a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be 
    131     removed. This discretization can 
    132     therefore also serve for identification of non-informative features and thus used for feature subset selection. 
    133  
    134     .. attribute:: force_attribute 
    135  
    136         Forces the algorithm to induce at least one cut-off point, even when 
    137         its information gain is lower than MDL (default: ``False``). 
    138  
    139 Part of :download:`discretization.py <code/discretization.py>`: 
    140  
    141 .. literalinclude:: code/discretization.py 
    142     :lines: 77-80 
    143  
    144 The output shows that all attributes are discretized onto three intervals:: 
    145  
    146     sepal length: <5.5, 6.09999990463> 
    147     sepal width: <2.90000009537, 3.29999995232> 
    148     petal length: <1.89999997616, 4.69999980927> 
    149     petal width: <0.600000023842, 1.0000004768> 
    150  
    151 .. class:: BiModal 
    152  
    153     Infers two cut-off points to optimize the difference of class distribution of data instances in the 
    154     middle and in the other two intervals. The 
    155     difference is scored by chi-square statistics. All possible cut-off 
    156     points are examined, thus the discretization runs in O(n^2). This discretization method is especially suitable 
    157     for the attributes in 
    158     which the middle region corresponds to normal and the outer regions to 
    159     abnormal values of the feature. 
    160  
    161     .. attribute:: split_in_two 
    162  
    163         Decides whether the resulting attribute should have three or two values. 
    164         If ``True`` (default), the feature will be discretized to three 
    165         intervals and the discretizer is of type :class:`BiModalDiscretizer`. 
    166         If ``False`` the result is the ordinary :class:`IntervalDiscretizer`. 
    167  
    168 Iris dataset has three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that 
    169 sepal lenghts of versicolors are between lengths of setosas and virginicas. 
    170  
    171 .. image:: files/bayes-iris.gif 
    172  
    173 If we merge classes setosa and virginica, we can observe if 
    174 the bi-modal discretization would correctly recognize the interval in 
    175 which versicolors dominate. The following scripts peforms the merging and construction of new data set with class 
    176 that reports if iris is versicolor or not. 
    177  
    178 .. literalinclude:: code/discretization.py 
    179     :lines: 84-87 
    180  
    181 The following script implements the discretization: 
    182  
    183 .. literalinclude:: code/discretization.py 
    184     :lines: 97-100 
    185  
    186 The middle intervals are printed:: 
    187  
    188     sepal length: (5.400, 6.200] 
    189     sepal width: (2.000, 2.900] 
    190     petal length: (1.900, 4.700] 
    191     petal width: (0.600, 1.600] 
    192  
    193 Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense. 
    194  
    195 Discretizers 
    196 ============ 
    197  
    198 Discretizers construct a categorical feature from the continuous feature according to the method they implement and 
    199 its parameters. The most general is 
    200 :class:`IntervalDiscretizer` that is also used by most discretization 
    201 methods. Two other discretizers, :class:`EquiDistDiscretizer` and 
    202 :class:`ThresholdDiscretizer`> could easily be replaced by 
    203 :class:`IntervalDiscretizer` but are used for speed and simplicity. 
    204 The fourth discretizer, :class:`BiModalDiscretizer` is specialized 
    205 for discretizations induced by :class:`BiModalDiscretization`. 
    206  
    207 .. class:: Discretizer 
    208  
    209     A superclass implementing the construction of a new 
    210     attribute from an existing one. 
    211  
    212     .. method:: construct_variable(feature) 
    213  
    214         Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name`` 
    215         prefixed by "D\_". Its symbolic values are discretizer specific. 
    216  
    217 .. class:: IntervalDiscretizer 
    218  
    219     Discretizer defined with a set of cut-off points. 
    220  
    221     .. attribute:: points 
    222  
    223         The cut-off points; feature values below or equal to the first point will be mapped to the first interval, 
    224         those between the first and the second point 
    225         (including those equal to the second) are mapped to the second interval and 
    226         so forth to the last interval which covers all values greater than 
    227         the last value in ``points``. The number of intervals is thus 
    228         ``len(points)+1``. 
    229  
    230 The script that follows is an examples of a manual construction of a discretizer with cut-off points 
    231 at 3.0 and 5.0: 
    232  
    233 .. literalinclude:: code/discretization.py 
    234     :lines: 22-26 
    235  
    236 First five data instances of ``data2`` are:: 
    237  
    238     [5.1, '>5.00', 'Iris-setosa'] 
    239     [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
    240     [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
    241     [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
    242     [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
    243  
    244 The same discretizer can be used on several features by calling the function construct_var: 
    245  
    246 .. literalinclude:: code/discretization.py 
    247     :lines: 30-34 
    248  
    249 Each feature has its own instance of :class:`ClassifierFromVar` stored in 
    250 ``get_value_from``, but all use the same :class:`IntervalDiscretizer`, 
    251 ``idisc``. Changing any element of its ``points`` affect all attributes. 
    252  
    253 .. note:: 
    254  
    255     The length of :obj:`~IntervalDiscretizer.points` should not be changed if the 
    256     discretizer is used by any attribute. The length of 
    257     :obj:`~IntervalDiscretizer.points` should always match the number of values 
    258     of the feature, which is determined by the length of the attribute's field 
    259     ``values``. If ``attr`` is a discretized attribute, than ``len(attr.values)`` must equal 
    260     ``len(attr.get_value_from.transformer.points)+1``. 
    261  
    262  
    263 .. class:: EqualWidthDiscretizer 
    264  
    265     Discretizes to intervals of the fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first 
    266     interval. Otherwise, value ``val``'s interval is ``floor((val-first_cut)/step)``. Possible overflows are mapped to the 
    267     last intervals. 
    268  
    269  
    270     .. attribute:: first_cut 
    271  
    272         The first cut-off point. 
    273  
    274     .. attribute:: step 
    275  
    276         Width of the intervals. 
    277  
    278     .. attribute:: n 
    279  
    280         Number of the intervals. 
    281  
    282     .. attribute:: points (read-only) 
    283  
    284         The cut-off points; this is not a real attribute although it behaves 
    285         as one. Reading it constructs a list of cut-off points and returns it, 
    286         but changing the list doesn't affect the discretizer. Only present to provide 
    287         the :obj:`EquiDistDiscretizer` the same interface as that of 
    288         :obj:`IntervalDiscretizer`. 
    289  
    290  
    291 .. class:: ThresholdDiscretizer 
    292  
    293     Threshold discretizer converts continuous values into binary by comparing 
    294     them to a fixed threshold. Orange uses this discretizer for 
    295     binarization of continuous attributes in decision trees. 
    296  
    297     .. attribute:: threshold 
    298  
    299         The value threshold; values below or equal to the threshold belong to the first 
    300         interval and those that are greater go to the second. 
    301  
    302  
    303 .. class:: BiModalDiscretizer 
    304  
    305     Bimodal discretizer has two cut off points and values are 
    306     discretized according to whether or not they belong to the region between these points 
    307     which includes the lower but not the upper boundary. The 
    308     discretizer is returned by :class:`BiModalDiscretization` if its 
    309     field :obj:`~BiModalDiscretization.split_in_two` is true (the default). 
    310  
    311     .. attribute:: low 
    312  
    313         Lower boundary of the interval (included in the interval). 
    314  
    315     .. attribute:: high 
    316  
    317         Upper boundary of the interval (not included in the interval). 
    318  
    319  
    320 Implementational details 
    321 ======================== 
      12The feature discretization module provides routines that take continuous features and 
      13introduce new, discretized features based on the training data set. Most often such a procedure is applied 
      14to all features of a data set at once, using the wrappers in :doc:`Orange.data.discretization`. The implementations 
      15in this module discretize one feature at a time and do not provide wrappers for 
      16whole-data set discretization. Discretization is data-specific: it consists of learning a discretization 
      17procedure (see `Discretization Algorithms`_) and the actual discretization (see Discretizers_) of the data. 
      18The split into 
      19these two phases is intentional, 
      20as in machine learning the discretization may be learned from the training set and then applied to the test set. 
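
A minimal sketch of this two-phase use (assuming the iris data set; the names are illustrative only)::

    import Orange

    data = Orange.data.Table("iris.tab")

    # phase 1: learn a discretization for one feature from the (training) data
    equal_freq = Orange.feature.discretization.EqualFreq(n=3)
    d_sepal_length = equal_freq("sepal length", data)

    # phase 2: the returned descriptor can discretize any further (e.g. test) data
    new_domain = Orange.data.Domain([d_sepal_length, data.domain.class_var])
    for inst in Orange.data.Table(new_domain, data)[:3]:
        print inst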
    32221 
     32322Consider the following example (part of :download:`discretization.py <code/discretization.py>`): 
     
    36463by ``get_value_from`` and stored in the new example. 
    36564 
      65With the exception of fixed discretization, discretization approaches infer the cut-off points from the 
      66training data set and construct a discretizer that converts continuous values of the feature into categorical 
      67values according to the inferred rule. In this respect, discretization behaves similarly to 
      68:class:`Orange.classification.Learner`. 
     69 
      70.. _`Discretization Algorithms`: 
     71 
     72Discretization Algorithms 
     73========================= 
     74 
     75Instances of discretization classes are all derived from :class:`Discretization`. 
     76 
     77.. class:: Discretization 
     78 
     79    .. method:: __call__(feature, data[, weightID]) 
     80 
      81        Given a continuous ``feature``, ``data`` and, optionally, the id of 
      82        an attribute with example weights, this function returns a discretized 
     83        feature. Argument ``feature`` can be a descriptor, index or 
     84        name of the attribute. 
     85 
     86 
     87.. class:: EqualWidth 
     88 
      89    Discretizes the feature by splitting its domain into a fixed number 
      90    of equal-width intervals. The span of the original domain is computed 
     91    from the training data and is defined by the smallest and the 
     92    largest feature value. 
     93 
     94    .. attribute:: n 
     95 
     96        Number of discretization intervals (default: 4). 
     97 
     98The following example discretizes Iris dataset features using six 
     99intervals. The script constructs a :class:`Orange.data.Table` with discretized 
     100features and outputs their description: 
     101 
     102.. literalinclude:: code/discretization.py 
     103    :lines: 38-43 
     104 
     105The output of this script is:: 
     106 
     107    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> 
     108    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> 
     109    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> 
     110    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> 
     111 
     112The cut-off values are hidden in the discretizer and stored in ``attr.get_value_from.transformer``:: 
     113 
     114    >>> for attr in newattrs: 
     115    ...    print "%s: first interval at %5.3f, step %5.3f" % \ 
     116    ...    (attr.name, attr.get_value_from.transformer.first_cut, \ 
     117    ...    attr.get_value_from.transformer.step) 
     118    D_sepal length: first interval at 4.900, step 0.600 
     119    D_sepal width: first interval at 2.400, step 0.400 
     120    D_petal length: first interval at 1.980, step 0.980 
     121    D_petal width: first interval at 0.500, step 0.400 
     122 
     123All discretizers have the method 
     124``construct_variable``: 
     125 
     126.. literalinclude:: code/discretization.py 
     127    :lines: 69-73 
     128 
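A possible sketch of such a call (assuming the iris data set; the discretizer learned for one feature is
reused on another feature purely to illustrate the call)::

    import Orange

    iris = Orange.data.Table("iris.tab")

    # obtain a discretizer by discretizing one feature ...
    d_sep = Orange.feature.discretization.EqualWidth(n=6)("sepal length", iris)
    discretizer = d_sep.get_value_from.transformer

    # ... and let it construct a discretized variable for another feature
    d_pet = discretizer.construct_variable(iris.domain["petal length"])
    print d_pet.name, d_pet.values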
     129 
     130.. class:: EqualFreq 
     131 
     132    Infers the cut-off points so that the discretization intervals contain 
     133    approximately equal number of training data instances. 
     134 
     135    .. attribute:: n 
     136 
     137        Number of discretization intervals (default: 4). 
     138 
     139The resulting discretizer is of class :class:`IntervalDiscretizer`. Its ``transformer`` includes ``points`` 
     140that store the inferred cut-offs. 
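
For example (a sketch assuming the iris data set), the inferred cut-offs can be inspected through the
transformer::

    import Orange

    data = Orange.data.Table("iris.tab")
    d_petal = Orange.feature.discretization.EqualFreq(n=3)("petal length", data)
    print d_petal.get_value_from.transformer.points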
     141 
     142.. class:: Entropy 
     143 
     144    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most 
      145    appropriate number of intervals by recursively splitting the domain of the continuous feature to minimize the 
      146    class-entropy of the training examples. The splitting is repeated until the entropy decrease is smaller than the 
      147    increase in minimal description length (MDL) induced by the new cut-off point. 
     148 
      149    Entropy-based discretization can reduce a continuous feature to 
      150    a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be 
      151    removed. This discretization can 
      152    therefore also serve to identify non-informative features and can thus be used for feature subset selection. 
     153 
     154    .. attribute:: force_attribute 
     155 
     156        Forces the algorithm to induce at least one cut-off point, even when 
     157        its information gain is lower than MDL (default: ``False``). 
     158 
     159Part of :download:`discretization.py <code/discretization.py>`: 
     160 
     161.. literalinclude:: code/discretization.py 
     162    :lines: 77-80 
     163 
      164The output shows that all attributes are discretized into three intervals:: 
     165 
     166    sepal length: <5.5, 6.09999990463> 
     167    sepal width: <2.90000009537, 3.29999995232> 
     168    petal length: <1.89999997616, 4.69999980927> 
     169    petal width: <0.600000023842, 1.0000004768> 
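
The included snippet is roughly equivalent to the following sketch (assuming the iris data set and the
``transformer.points`` access shown above)::

    import Orange

    data = Orange.data.Table("iris.tab")
    entropy = Orange.feature.discretization.Entropy()
    for attr in data.domain.attributes:
        disc_attr = entropy(attr, data)
        print "%s: %s" % (attr.name, disc_attr.get_value_from.transformer.points)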
     170 
     171.. class:: BiModal 
     172 
     173    Infers two cut-off points to optimize the difference of class distribution of data instances in the 
     174    middle and in the other two intervals. The 
     175    difference is scored by chi-square statistics. All possible cut-off 
     176    points are examined, thus the discretization runs in O(n^2). This discretization method is especially suitable 
     177    for the attributes in 
     178    which the middle region corresponds to normal and the outer regions to 
     179    abnormal values of the feature. 
     180 
     181    .. attribute:: split_in_two 
     182 
     183        Decides whether the resulting attribute should have three or two values. 
     184        If ``True`` (default), the feature will be discretized to three 
     185        intervals and the discretizer is of type :class:`BiModalDiscretizer`. 
     186        If ``False`` the result is the ordinary :class:`IntervalDiscretizer`. 
     187 
      188The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that 
      189sepal lengths of versicolors lie between those of setosas and virginicas. 
     190 
     191.. image:: files/bayes-iris.gif 
     192 
      193If we merge the classes setosa and virginica, we can check whether 
      194bi-modal discretization correctly recognizes the interval in 
      195which versicolors dominate. The following script performs the merging and constructs a new data set with a class 
      196that reports whether an iris is a versicolor or not. 
     197 
     198.. literalinclude:: code/discretization.py 
     199    :lines: 84-87 
     200 
     201The following script implements the discretization: 
     202 
     203.. literalinclude:: code/discretization.py 
     204    :lines: 97-100 
     205 
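A rough sketch of such a call (shown here, for simplicity, on the original three-valued iris data rather
than the merged-class table constructed above; the ``split_in_two`` keyword is an assumption)::

    import Orange

    data = Orange.data.Table("iris.tab")
    bimodal = Orange.feature.discretization.BiModal(split_in_two=True)
    d_sep = bimodal("sepal length", data)

    # the resulting BiModalDiscretizer stores the boundaries of the middle interval
    t = d_sep.get_value_from.transformer
    print "sepal length: (%.3f, %.3f]" % (t.low, t.high)
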
     206The middle intervals are printed:: 
     207 
     208    sepal length: (5.400, 6.200] 
     209    sepal width: (2.000, 2.900] 
     210    petal length: (1.900, 4.700] 
     211    petal width: (0.600, 1.600] 
     212 
     213Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense. 
     214 
     215.. _Discretizers: 
     216 
     217Discretizers 
     218============= 
     219 
     220Discretizers construct a categorical feature from the continuous feature according to the method they implement and 
     221its parameters. The most general is 
      222:class:`IntervalDiscretizer`, which is also used by most discretization 
      223methods. Two other discretizers, :class:`EquiDistDiscretizer` and 
      224:class:`ThresholdDiscretizer`, could easily be replaced by 
      225:class:`IntervalDiscretizer` but are used for speed and simplicity. 
      226The fourth discretizer, :class:`BiModalDiscretizer`, is specialized 
      227for discretizations induced by :class:`BiModal`. 
     228 
     229.. class:: Discretizer 
     230 
     231    A superclass implementing the construction of a new 
     232    attribute from an existing one. 
     233 
     234    .. method:: construct_variable(feature) 
     235 
     236        Constructs a descriptor for a new feature. The new feature's name is equal to ``feature.name`` 
     237        prefixed by "D\_". Its symbolic values are discretizer specific. 
     238 
     239.. class:: IntervalDiscretizer 
     240 
     241    Discretizer defined with a set of cut-off points. 
     242 
     243    .. attribute:: points 
     244 
     245        The cut-off points; feature values below or equal to the first point will be mapped to the first interval, 
     246        those between the first and the second point 
     247        (including those equal to the second) are mapped to the second interval and 
     248        so forth to the last interval which covers all values greater than 
     249        the last value in ``points``. The number of intervals is thus 
     250        ``len(points)+1``. 
     251 
      252The script that follows is an example of manual construction of a discretizer with cut-off points 
     253at 3.0 and 5.0: 
     254 
     255.. literalinclude:: code/discretization.py 
     256    :lines: 22-26 
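
The included lines presumably amount to something like the following sketch (assuming the iris data set
and the names ``idisc``, ``sep_l`` and ``data2``)::

    import Orange

    iris = Orange.data.Table("iris.tab")
    idisc = Orange.feature.discretization.IntervalDiscretizer(points=[3.0, 5.0])
    sep_l = idisc.construct_variable(iris.domain["sepal length"])
    data2 = Orange.data.Table(Orange.data.Domain(
        [iris.domain["sepal length"], sep_l, iris.domain.class_var]), iris)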
     257 
     258First five data instances of ``data2`` are:: 
     259 
     260    [5.1, '>5.00', 'Iris-setosa'] 
     261    [4.9, '(3.00, 5.00]', 'Iris-setosa'] 
     262    [4.7, '(3.00, 5.00]', 'Iris-setosa'] 
     263    [4.6, '(3.00, 5.00]', 'Iris-setosa'] 
     264    [5.0, '(3.00, 5.00]', 'Iris-setosa'] 
     265 
      266The same discretizer can be used on several features by calling the function ``construct_var``: 
     267 
     268.. literalinclude:: code/discretization.py 
     269    :lines: 30-34 
     270 
     271Each feature has its own instance of :class:`ClassifierFromVar` stored in 
     272``get_value_from``, but all use the same :class:`IntervalDiscretizer`, 
      273``idisc``. Changing any element of its ``points`` affects all attributes. 
     274 
     275.. note:: 
     276 
     277    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the 
     278    discretizer is used by any attribute. The length of 
     279    :obj:`~IntervalDiscretizer.points` should always match the number of values 
     280    of the feature, which is determined by the length of the attribute's field 
      281    ``values``. If ``attr`` is a discretized attribute, then ``len(attr.values)`` must equal 
     282    ``len(attr.get_value_from.transformer.points)+1``. 
     283 
     284 
     285.. class:: EqualWidthDiscretizer 
     286 
      287    Discretizes to intervals of fixed width. All values lower than :obj:`~EquiDistDiscretizer.first_cut` are mapped to the first 
      288    interval. Otherwise, value ``val`` falls into interval ``floor((val-first_cut)/step)``. Possible overflows are mapped to the 
      289    last interval. 
     290 
     291 
     292    .. attribute:: first_cut 
     293 
     294        The first cut-off point. 
     295 
     296    .. attribute:: step 
     297 
     298        Width of the intervals. 
     299 
     300    .. attribute:: n 
     301 
     302        Number of the intervals. 
     303 
     304    .. attribute:: points (read-only) 
     305 
     306        The cut-off points; this is not a real attribute although it behaves 
     307        as one. Reading it constructs a list of cut-off points and returns it, 
     308        but changing the list doesn't affect the discretizer. Only present to provide 
     309        the :obj:`EquiDistDiscretizer` the same interface as that of 
     310        :obj:`IntervalDiscretizer`. 
     311 
     312 
     313.. class:: ThresholdDiscretizer 
     314 
     315    Threshold discretizer converts continuous values into binary by comparing 
     316    them to a fixed threshold. Orange uses this discretizer for 
     317    binarization of continuous attributes in decision trees. 
     318 
     319    .. attribute:: threshold 
     320 
     321        The value threshold; values below or equal to the threshold belong to the first 
     322        interval and those that are greater go to the second. 
     323 
     324 
     325.. class:: BiModalDiscretizer 
     326 
      327    The bimodal discretizer has two cut-off points; values are 
      328    discretized according to whether or not they belong to the region between these points, 
      329    which includes the lower but not the upper boundary. The 
      330    discretizer is returned by :class:`BiModal` if its 
      331    field :obj:`~BiModal.split_in_two` is true (the default). 
     332 
     333    .. attribute:: low 
     334 
     335        Lower boundary of the interval (included in the interval). 
     336 
     337    .. attribute:: high 
     338 
     339        Upper boundary of the interval (not included in the interval). 
     340 
    366341References 
    367342========== 
  • docs/reference/rst/Orange.feature.scoring.rst

    r9372 r9988  
    1 .. automodule:: Orange.feature.scoring 
     1.. py:currentmodule:: Orange.feature.scoring 
     2 
     3##################### 
     4Scoring (``scoring``) 
     5##################### 
     6 
     7.. index:: feature scoring 
     8 
     9.. index:: 
     10   single: feature; feature scoring 
     11 
     12Feature score is an assessment of the usefulness of the feature for 
      13prediction of the dependent (class) variable. 
     14 
     15To compute the information gain of feature "tear_rate" in the Lenses data set (loaded into ``data``) use: 
     16 
     17    >>> meas = Orange.feature.scoring.InfoGain() 
     18    >>> print meas("tear_rate", data) 
     19    0.548794925213 
     20 
     21Other scoring methods are listed in :ref:`classification` and 
      22:ref:`regression`. Various ways to call them are described in 
     23:ref:`callingscore`. 
     24 
     25Instead of first constructing the scoring object (e.g. ``InfoGain``) and 
     26then using it, it is usually more convenient to do both in a single step:: 
     27 
     28    >>> print Orange.feature.scoring.InfoGain("tear_rate", data) 
     29    0.548794925213 
     30 
      31This way is much slower for Relief, which can efficiently compute scores 
     32for all features in parallel. 
     33 
     34It is also possible to score features that do not appear in the data 
      35but can be computed from it. A typical case is that of discretized features: 
     36 
     37.. literalinclude:: code/scoring-info-iris.py 
     38    :lines: 7-11 
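
A minimal sketch of this idea (assuming the iris data set)::

    import Orange

    data = Orange.data.Table("iris.tab")

    # a discretized feature that does not appear in the data ...
    d_petal = Orange.feature.discretization.EqualFreq(n=4)("petal length", data)

    # ... can still be scored; its values are computed from the original feature
    print Orange.feature.scoring.InfoGain(d_petal, data)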
     39 
     40The following example computes feature scores, both with 
     41:obj:`score_all` and by scoring each feature individually, and prints out 
     42the best three features. 
     43 
     44.. literalinclude:: code/scoring-all.py 
     45    :lines: 7- 
     46 
     47The output:: 
     48 
     49    Feature scores for best three features (with score_all): 
     50    0.613 physician-fee-freeze 
     51    0.255 el-salvador-aid 
     52    0.228 synfuels-corporation-cutback 
     53 
     54    Feature scores for best three features (scored individually): 
     55    0.613 physician-fee-freeze 
     56    0.255 el-salvador-aid 
     57    0.228 synfuels-corporation-cutback 
     58 
     59.. comment 
     60    The next script uses :obj:`GainRatio` and :obj:`Relief`. 
     61 
     62    .. literalinclude:: code/scoring-relief-gainRatio.py 
     63        :lines: 7- 
     64 
     65    Notice that on this data the ranks of features match:: 
     66 
     67        Relief GainRt Feature 
     68        0.613  0.752  physician-fee-freeze 
     69        0.255  0.444  el-salvador-aid 
     70        0.228  0.414  synfuels-corporation-cutback 
     71        0.189  0.382  crime 
     72        0.166  0.345  adoption-of-the-budget-resolution 
     73 
     74 
     75.. _callingscore: 
     76 
     77======================= 
     78Calling scoring methods 
     79======================= 
     80 
      81To score a feature, use :obj:`Score.__call__`. There are different 
     82function signatures, which enable optimization. For instance, 
     83most scoring methods first compute contingency tables from the 
     84data. If these are already computed, they can be passed to the scorer 
     85instead of the data. 
     86 
     87Not all classes accept all kinds of arguments. :obj:`Relief`, 
     88for instance, only supports the form with instances on the input. 
     89 
     90.. method:: Score.__call__(attribute, data[, apriori_class_distribution][, weightID]) 
     91 
     92    :param attribute: the chosen feature, either as a descriptor, 
     93      index, or a name. 
     94    :type attribute: :class:`Orange.feature.Descriptor` or int or string 
     95    :param data: data. 
     96    :type data: `Orange.data.Table` 
     97    :param weightID: id for meta-feature with weight. 
     98 
     99    All scoring methods support the first signature. 
     100 
     101.. method:: Score.__call__(attribute, domain_contingency[, apriori_class_distribution]) 
     102 
     103    :param attribute: the chosen feature, either as a descriptor, 
     104      index, or a name. 
     105    :type attribute: :class:`Orange.feature.Descriptor` or int or string 
     106    :param domain_contingency: 
     107    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain` 
     108 
     109.. method:: Score.__call__(contingency, class_distribution[, apriori_class_distribution]) 
     110 
     111    :param contingency: 
     112    :type contingency: :obj:`Orange.statistics.contingency.VarClass` 
     113    :param class_distribution: distribution of the class 
     114      variable. If :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, 
     115      it should be computed on instances where feature value is 
     116      defined. Otherwise, class distribution should be the overall 
     117      class distribution. 
     118    :type class_distribution: 
     119      :obj:`Orange.statistics.distribution.Distribution` 
     120    :param apriori_class_distribution: Optional and most often 
     121      ignored. Useful if the scoring method makes any probability estimates 
     122      based on apriori class probabilities (such as the m-estimate). 
     123    :return: Feature score - the higher the value, the better the feature. 
     124      If the quality cannot be scored, return :obj:`Score.Rejected`. 
     125    :rtype: float or :obj:`Score.Rejected`. 
     126 
     127The code below scores the same feature with :obj:`GainRatio` 
     128using different calls. 
     129 
     130.. literalinclude:: code/scoring-calls.py 
     131    :lines: 7- 
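
The calls in the included script are roughly of the following form (a sketch; the titanic data set and the
``status`` feature are used only for illustration)::

    import Orange

    data = Orange.data.Table("titanic")
    gain_ratio = Orange.feature.scoring.GainRatio()

    # 1: directly from the data
    print gain_ratio("status", data)

    # 2: from a precomputed domain contingency
    dom_cont = Orange.statistics.contingency.Domain(data)
    print gain_ratio("status", dom_cont)

    # 3: from the feature's contingency and the class distribution
    cont = Orange.statistics.contingency.VarClass("status", data)
    class_dist = Orange.statistics.distribution.Distribution(
        data.domain.class_var, data)
    print gain_ratio(cont, class_dist)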
     132 
     133.. _classification: 
     134 
     135========================================== 
     136Feature scoring in classification problems 
     137========================================== 
     138 
     139.. Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. 
     140 
     141.. index:: 
     142   single: feature scoring; information gain 
     143 
     144.. class:: InfoGain 
     145 
     146    Information gain; the expected decrease of entropy. See `page on wikipedia 
     147    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
     148 
     149.. index:: 
     150   single: feature scoring; gain ratio 
     151 
     152.. class:: GainRatio 
     153 
     154    Information gain ratio; information gain divided by the entropy of the feature's 
     155    value. Introduced in [Quinlan1986]_ in order to avoid overestimation 
     156    of multi-valued features. It has been shown, however, that it still 
     157    overestimates features with multiple values. See `Wikipedia 
     158    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_. 
     159 
     160.. index:: 
     161   single: feature scoring; gini index 
     162 
     163.. class:: Gini 
     164 
     165    Gini index is the probability that two randomly chosen instances will have different 
     166    classes. See `Gini coefficient on Wikipedia <http://en.wikipedia.org/wiki/Gini_coefficient>`_. 
     167 
     168.. index:: 
     169   single: feature scoring; relevance 
     170 
     171.. class:: Relevance 
     172 
     173    The potential value for decision rules. 
     174 
     175.. index:: 
     176   single: feature scoring; cost 
     177 
     178.. class:: Cost 
     179 
     180    Evaluates features based on the cost decrease achieved by knowing the value of 
     181    feature, according to the specified cost matrix. 
     182 
     183    .. attribute:: cost 
     184 
     185        Cost matrix, see :obj:`Orange.classification.CostMatrix` for details. 
     186 
     187    If the cost of predicting the first class of an instance that is actually in 
      188    the second is 5, and the cost of the opposite error is 1, then an appropriate 
     189    score can be constructed as follows:: 
     190 
     191 
     192        >>> meas = Orange.feature.scoring.Cost() 
     193        >>> meas.cost = ((0, 5), (1, 0)) 
     194        >>> meas(3, data) 
     195        0.083333350718021393 
     196 
     197    Knowing the value of feature 3 would decrease the 
      198    classification cost by approximately 0.083 per instance. 
     199 
     200    .. comment   opposite error - is this term correct? TODO 
     201 
     202.. index:: 
     203   single: feature scoring; ReliefF 
     204 
     205.. class:: Relief 
     206 
     207    Assesses features' ability to distinguish between very similar 
     208    instances from different classes. This scoring method was first 
      209    developed by Kira and Rendell and then improved by Kononenko. The 
     210    class :obj:`Relief` works on discrete and continuous classes and 
     211    thus implements ReliefF and RReliefF. 
     212 
     213    ReliefF is slow since it needs to find k nearest neighbours for 
     214    each of m reference instances. As we normally compute ReliefF for 
     215    all features in the dataset, :obj:`Relief` caches the results for 
     216    all features, when called to score a certain feature.  When called 
     217    again, it uses the stored results if the domain and the data table 
     218    have not changed (data table version and the data checksum are 
     219    compared). Caching will only work if you use the same object. 
     220    Constructing new instances of :obj:`Relief` for each feature, 
     221    like this:: 
     222 
     223        for attr in data.domain.attributes: 
     224            print Orange.feature.scoring.Relief(attr, data) 
     225 
     226    runs much slower than reusing the same instance:: 
     227 
     228        meas = Orange.feature.scoring.Relief() 
     229        for attr in table.domain.attributes: 
     230            print meas(attr, data) 
     231 
     232 
     233    .. attribute:: k 
     234 
     235       Number of neighbours for each instance. Default is 5. 
     236 
     237    .. attribute:: m 
     238 
     239        Number of reference instances. Default is 100. When -1, all 
     240        instances are used as reference. 
     241 
     242    .. attribute:: check_cached_data 
     243 
     244        Check if the cached data is changed, which may be slow on large 
     245        tables.  Defaults to :obj:`True`, but should be disabled when it 
     246        is certain that the data will not change while the scorer is used. 
     247 
     248.. autoclass:: Orange.feature.scoring.Distance 
     249 
     250.. autoclass:: Orange.feature.scoring.MDL 
     251 
     252.. _regression: 
     253 
     254====================================== 
     255Feature scoring in regression problems 
     256====================================== 
     257 
     258.. class:: Relief 
     259 
     260    Relief is used for regression in the same way as for 
     261    classification (see :class:`Relief` in classification 
     262    problems). 
     263 
     264.. index:: 
     265   single: feature scoring; mean square error 
     266 
     267.. class:: MSE 
     268 
     269    Implements the mean square error score. 
     270 
     271    .. attribute:: unknowns_treatment 
     272 
     273        What to do with unknown values. See :obj:`Score.unknowns_treatment`. 
     274 
     275    .. attribute:: m 
     276 
     277        Parameter for m-estimate of error. Default is 0 (no m-estimate). 
     278 
     279============ 
     280Base Classes 
     281============ 
     282 
     283Implemented methods for scoring relevances of features are subclasses 
     284of :obj:`Score`. Those that compute statistics on conditional 
     285distributions of class values given the feature values are derived from 
     286:obj:`ScoreFromProbabilities`. 
     287 
     288.. class:: Score 
     289 
     290    Abstract base class for feature scoring. Its attributes describe which 
     291    types of features it can handle which kind of data it requires. 
      292    types of features it can handle and which kind of data it requires. 
     293    **Capabilities** 
     294 
     295    .. attribute:: handles_discrete 
     296 
     297        Indicates whether the scoring method can handle discrete features. 
     298 
     299    .. attribute:: handles_continuous 
     300 
     301        Indicates whether the scoring method can handle continuous features. 
     302 
     303    .. attribute:: computes_thresholds 
     304 
     305        Indicates whether the scoring method implements the :obj:`threshold_function`. 
     306 
     307    **Input specification** 
     308 
     309    .. attribute:: needs 
     310 
      311        The type of data needed, indicated by one of the constants 
      312        below. Classes that use :obj:`DomainContingency` will also handle 
     313        generators. Those based on :obj:`Contingency_Class` will be able 
     314        to take generators and domain contingencies. 
     315 
     316        .. attribute:: Generator 
     317 
     318            Constant. Indicates that the scoring method needs an instance 
     319            generator on the input as, for example, :obj:`Relief`. 
     320 
     321        .. attribute:: DomainContingency 
     322 
     323            Constant. Indicates that the scoring method needs 
     324            :obj:`Orange.statistics.contingency.Domain`. 
     325 
     326        .. attribute:: Contingency_Class 
     327 
      328            Constant. Indicates that the scoring method needs the contingency 
     329            (:obj:`Orange.statistics.contingency.VarClass`), feature 
     330            distribution and the apriori class distribution (as most 
     331            scoring methods). 
     332 
     333    **Treatment of unknown values** 
     334 
     335    .. attribute:: unknowns_treatment 
     336 
     337        Defined in classes that are able to treat unknown values. It 
     338        should be set to one of the values below. 
     339 
     340        .. attribute:: IgnoreUnknowns 
     341 
     342            Constant. Instances for which the feature value is unknown are removed. 
     343 
     344        .. attribute:: ReduceByUnknown 
     345 
     346            Constant. Features with unknown values are 
     347            punished. The feature quality is reduced by the proportion of 
     348            unknown values. For impurity scores the impurity decreases 
     349            only where the value is defined and stays the same otherwise. 
     350 
     351        .. attribute:: UnknownsToCommon 
     352 
     353            Constant. Undefined values are replaced by the most common value. 
     354 
     355        .. attribute:: UnknownsAsValue 
     356 
     357            Constant. Unknown values are treated as a separate value. 
     358 
     359    **Methods** 
     360 
     361    .. method:: __call__ 
     362 
     363        Abstract. See :ref:`callingscore`. 
     364 
     365    .. method:: threshold_function(attribute, instances[, weightID]) 
     366 
     367        Abstract. 
     368 
     369        Assess different binarizations of the continuous feature 
     370        :obj:`attribute`.  Return a list of tuples. The first element 
     371        is a threshold (between two existing values), the second is 
     372        the quality of the corresponding binary feature, and the third 
     373        the distribution of instances below and above the threshold. 
     374        Not all scorers return the third element. 
     375 
     376        To show the computation of thresholds, we shall use the Iris 
     377        data set: 
     378 
     379        .. literalinclude:: code/scoring-info-iris.py 
     380            :lines: 13-16 
     381 
     382    .. method:: best_threshold(attribute, instances) 
     383 
     384        Return the best threshold for binarization, that is, the threshold 
     385        with which the resulting binary feature will have the optimal 
     386        score. 
     387 
     388        The script below prints out the best threshold for 
      389        binarization of a feature. ReliefF is used for scoring: 
     390 
     391        .. literalinclude:: code/scoring-info-iris.py 
     392            :lines: 18-19 
     393 
     394.. class:: ScoreFromProbabilities 
     395 
     396    Bases: :obj:`Score` 
     397 
     398    Abstract base class for feature scoring method that can be 
     399    computed from contingency matrices. 
     400 
     401    .. attribute:: estimator_constructor 
     402    .. attribute:: conditional_estimator_constructor 
     403 
     404    &