Files: 6 added, 5 deleted, 27 edited

Legend:

Lines that show both the old and the new line number are unmodified; a line
with only the old number was removed, and a line with only the new number was
added.
  • .hgignore

    r9881 r9898  
    1515source/orangeom/lib_vectors_auto.txt 
    1616 
    17 # Ignore build and dist dir, created by setup.py build or setup.py bdist_* . 
     17# Ignore files created by setup.py. 
    1818build 
    1919dist 
     20MANIFEST 
     21Orange.egg-info 
    2022 
    2123# Ignore dot files. 
     
    3436docs/reference/html 
    3537 
    36 # Images generated by tests. 
     38# Files generated by tests. 
     39Orange/testing/regression/*/*.changed.txt 
     40Orange/testing/regression/*/*.crash.txt 
     41Orange/testing/regression/*/*.new.txt 
    3742Orange/doc/modules/*.png 
    3843docs/reference/rst/code/*.png 
     44 
     45Orange/doc/modules/tree1.dot 
     46Orange/doc/reference/del2.tab 
     47Orange/doc/reference/undefined-saved-dc-dk.tab 
     48Orange/doc/reference/undefined-saved-na.tab 
     49Orange/testing/regression/results_orange25/unusedValues.py.txt 
     50docs/reference/rst/code/iris.testsave.arff 
     51docs/tutorial/rst/code/adult_sample_sampled.tab 
     52docs/tutorial/rst/code/tree.dot 
     53 
  • Orange/clustering/hierarchical.py

    r9752 r9906  
    8181         
    8282        :param matrix: A distance matrix to perform the clustering on. 
    83         :type matrix: :class:`Orange.core.SymMatrix` 
     83        :type matrix: :class:`Orange.misc.SymMatrix` 
    8484 
    8585 
     
    157157 
    158158Let us construct a simple distance matrix and run clustering on it. 
    159 :: 
    160  
    161     import Orange 
    162     from Orange.clustering import hierarchical 
    163     m = [[], 
    164          [ 3], 
    165          [ 2, 4], 
    166          [17, 5, 4], 
    167          [ 2, 8, 3, 8], 
    168          [ 7, 5, 10, 11, 2], 
    169          [ 8, 4, 1, 5, 11, 13], 
    170          [ 4, 7, 12, 8, 10, 1, 5], 
    171          [13, 9, 14, 15, 7, 8, 4, 6], 
    172          [12, 10, 11, 15, 2, 5, 7, 3, 1]] 
    173     matrix = Orange.core.SymMatrix(m) 
    174     root = hierarchical.HierarchicalClustering(matrix, 
    175             linkage=hierarchical.HierarchicalClustering.Average) 
     159 
     160.. literalinclude:: code/hierarchical-example.py 
     161    :lines: 1-14 
    176162     
    177163``root`` is the root of the cluster hierarchy. We can print it using a 
    178164simple recursive function.
    179 :: 
    180  
    181     def printClustering(cluster): 
    182         if cluster.branches: 
    183             return "(%s%s)" % (printClustering(cluster.left), printClustering(cluster.right)) 
    184         else: 
    185             return str(cluster[0]) 
     165 
     166.. literalinclude:: code/hierarchical-example.py 
     167    :lines: 16-20 
    186168             
    187169The output is not exactly nice, but it will have to do. Our clustering, 
     
    211193supposedly the only) element of cluster, cluster[0], we shall print 
    212194it out as a tuple.  
    213 :: 
    214  
    215     def printClustering2(cluster): 
    216         if cluster.branches: 
    217             return "(%s%s)" % (printClustering2(cluster.left), printClustering2(cluster.right)) 
    218         else: 
    219             return str(tuple(cluster)) 
     195 
     196.. literalinclude:: code/hierarchical-example.py 
     197    :lines: 22-26 
    220198             
    221199The distance matrix could have been given a list of objects. We could, 
    222200for instance, put 
    223 :: 
    224      
    225     matrix.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", 
    226                       "Fred", "Greg", "Hue", "Ivy", "Jon"] 
     201     
     202.. literalinclude:: code/hierarchical-example.py 
     203    :lines: 28-29 
    227204 
    228205before the call to HierarchicalClustering.
     
    234211If we've forgotten to store the objects into the matrix prior to clustering, 
    235212nothing is lost. We can add them to the clustering later, by
    236 :: 
    237  
    238     root.mapping.objects = ["Ann", "Bob", "Curt", "Danny", "Eve", "Fred", "Greg", "Hue", "Ivy", "Jon"] 
     213 
     214.. literalinclude:: code/hierarchical-example.py 
     215    :lines: 31 
    239216     
    240217So, what do these "objects" do? Call printClustering(root) again and you'll 
     
    269246of ``root.left`` and ``root.right``. 
    270247 
    271 Let us write a function for cluster pruning. ::
    272  
    273     def prune(cluster, togo): 
    274         if cluster.branches: 
    275             if togo<0: 
    276                 cluster.branches = None 
    277             else: 
    278                 for branch in cluster.branches: 
    279                     prune(branch, togo-cluster.height) 
     248Let us write a function for cluster pruning.
     249 
     250.. literalinclude:: code/hierarchical-example.py 
     251    :lines: 33-39 
    280252 
    281253We shall use ``printClustering2`` here, since we can have multiple elements 
     
    287259     
    288260We've ended up with four clusters. Need a list of clusters? 
    289 Here's the function. :: 
    290      
    291     def listOfClusters0(cluster, alist): 
    292         if not cluster.branches: 
    293             alist.append(list(cluster)) 
    294         else: 
    295             for branch in cluster.branches: 
    296                 listOfClusters0(branch, alist) 
    297                  
    298     def listOfClusters(root): 
    299         l = [] 
    300         listOfClusters0(root, l) 
    301         return l 
     261Here's the function. 
     262 
     263.. literalinclude:: code/hierarchical-example.py 
     264    :lines: 41-51 
    302265         
    303266The function returns a list of lists, in our case 
     
    313276and cluster it with average linkage. Since we don't need the matrix, 
    314277we shall let the clustering overwrite it (not that it's needed for 
    315 such a small data set as Iris). :: 
    316  
    317     import Orange 
    318     from Orange.clustering import hierarchical 
    319  
    320     data = Orange.data.Table("iris") 
    321     matrix = Orange.core.SymMatrix(len(data)) 
    322     matrix.setattr("objects", data) 
    323     distance = Orange.distance.Euclidean(data) 
    324     for i1, instance1 in enumerate(data): 
    325         for i2 in range(i1+1, len(data)): 
    326             matrix[i1, i2] = distance(instance1, data[i2]) 
    327              
    328     clustering = hierarchical.HierarchicalClustering() 
    329     clustering.linkage = clustering.Average 
    330     clustering.overwrite_matrix = 1 
    331     root = clustering(matrix) 
     278such a small data set as Iris). 
     279 
     280.. literalinclude:: code/hierarchical-example-2.py 
     281    :lines: 1-15 
    332282 
    333283Note that we haven't forgotten to set the ``matrix.objects``. We did it 
    334284through ``matrix.setattr`` to avoid the warning. Let us now prune the 
    335285clustering using the function we've written above, and print out the 
    336 clusters. :: 
    337      
    338     prune(root, 1.4) 
    339     for n, cluster in enumerate(listOfClusters(root)): 
    340         print "\n\n Cluster %i \n" % n 
    341         for instance in cluster: 
    342             print instance 
     286clusters. 
     287     
     288.. literalinclude:: code/hierarchical-example-2.py 
     289    :lines: 16-20 
    343290             
    344291Since the printout is pretty long, it might be more informative to just 
    345 print out the class distributions for each cluster. :: 
    346      
    347     for cluster in listOfClusters(root): 
    348         dist = Orange.core.get_class_distribution(cluster) 
    349         for e, d in enumerate(dist): 
    350             print "%s: %3.0f " % (data.domain.class_var.values[e], d), 
    351         print 
     292print out the class distributions for each cluster. 
     293     
     294.. literalinclude:: code/hierarchical-example-2.py 
     295    :lines: 22-26 
    352296         
    353297Here's what it shows. :: 
     
    365309instance, call a learning algorithm, passing a cluster as an argument. 
    366310It won't mind. If you, however, want to have a list of tables, you can
    367 easily convert the list by :: 
    368  
    369     tables = [Orange.data.Table(cluster) for cluster in listOfClusters(root)] 
     311easily convert the list by 
     312 
     313.. literalinclude:: code/hierarchical-example-2.py 
     314    :lines: 28 
    370315     
    371316Finally, if you are dealing with examples, you may want to take the function 
     
    502447    """ 
    503448    distance = distance_constructor(data) 
    504     matrix = orange.SymMatrix(len(data)) 
     449    matrix = Orange.misc.SymMatrix(len(data)) 
    505450    for i in range(len(data)): 
    506451        for j in range(i+1): 
     
    540485     
    541486    """ 
    542     matrix = orange.SymMatrix(len(data.domain.attributes)) 
     487    matrix = Orange.misc.SymMatrix(len(data.domain.attributes)) 
    543488    for a1 in range(len(data.domain.attributes)): 
    544489        for a2 in range(a1): 
     
    618563    :type tree: :class:`HierarchicalCluster` 
    619564    :param matrix: SymMatrix that was used to compute the clustering. 
    620     :type matrix: :class:`Orange.core.SymMatrix` 
     565    :type matrix: :class:`Orange.misc.SymMatrix` 
    621566    :param progress_callback: Function used to report on progress. 
    622567    :type progress_callback: function 
     
    811756    :type tree: :class:`HierarchicalCluster` 
    812757    :param matrix: SymMatrix that was used to compute the clustering. 
    813     :type matrix: :class:`Orange.core.SymMatrix` 
     758    :type matrix: :class:`Orange.misc.SymMatrix` 
    814759    :param progress_callback: Function used to report on progress. 
    815760    :type progress_callback: function 
     
    15111456 
    15121457def feature_distance_matrix(data, distance=None, progress_callback=None): 
    1513     """ A helper function that computes an :class:`Orange.core.SymMatrix` of 
     1458    """ A helper function that computes an :class:`Orange.misc.SymMatrix` of 
    15141459    all pairwise distances between features in `data`. 
    15151460     
     
    15241469    :type progress_callback: function 
    15251470     
    1526     :rtype: :class:`Orange.core.SymMatrix` 
     1471    :rtype: :class:`Orange.misc.SymMatrix` 
    15271472     
    15281473    """ 
    15291474    attributes = data.domain.attributes 
    1530     matrix = orange.SymMatrix(len(attributes)) 
     1475    matrix = Orange.misc.SymMatrix(len(attributes)) 
    15311476    iter_count = matrix.dim * (matrix.dim - 1) / 2 
    15321477    milestones = progress_bar_milestones(iter_count, 100) 
     
    15811526    :type cluster: :class:`HierarchicalCluster` 
    15821527     
    1583     :rtype: :class:`Orange.core.SymMatrix` 
     1528    :rtype: :class:`Orange.misc.SymMatrix` 
    15841529     
    15851530    """ 
    15861531 
    15871532    mapping = cluster.mapping   
    1588     matrix = Orange.core.SymMatrix(len(mapping)) 
     1533    matrix = Orange.misc.SymMatrix(len(mapping)) 
    15891534    for cluster in postorder(cluster): 
    15901535        if cluster.branches: 
     
    16241569     
    16251570     
    1626 if __name__=="__main__": 
    1627     data = orange.ExampleTable("doc//datasets//brown-selected.tab") 
    1628 #    data = orange.ExampleTable("doc//datasets//iris.tab") 
    1629     root = hierarchicalClustering(data, order=True) #, linkage=orange.HierarchicalClustering.Single) 
    1630     attr_root = hierarchicalClustering_attributes(data, order=True) 
    1631 #    print root 
    1632 #    d = DendrogramPlotPylab(root, data=data, labels=[str(ex.getclass()) for ex in data], dendrogram_width=0.4, heatmap_width=0.3,  params={}, cmap=None) 
    1633 #    d.plot(show=True, filename="graph.png") 
    1634  
    1635     dendrogram_draw("graph.eps", root, attr_tree=attr_root, data=data, labels=[str(e.getclass()) for e in data], tree_height=50, #width=500, height=500, 
    1636                           cluster_colors={root.right:(255,0,0), root.right.right:(0,255,0)},  
    1637                           color_palette=ColorPalette([(255, 0, 0), (0,0,0), (0, 255,0)], gamma=0.5,  
    1638                                                      overflow=(255, 255, 255), underflow=(255, 255, 255))) #, minv=-0.5, maxv=0.5) 
  • Orange/data/__init__.py

    r9671 r9891  
    1010 
    1111from orange import newmetaid as new_meta_id 
    12  
    13 from orange import SymMatrix 
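
    The removed import matches a move visible throughout this changeset: SymMatrix
    is now reached through Orange.misc rather than Orange.data or the old orange
    module. A minimal migration sketch::

        import Orange

        # was: matrix = Orange.data.SymMatrix(10)
        matrix = Orange.misc.SymMatrix(10)
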
  • Orange/evaluation/scoring.py

    r9725 r9892  
    1 """ 
    2 ############################ 
    3 Method scoring (``scoring``) 
    4 ############################ 
    5  
    6 .. index:: scoring 
    7  
    8 This module contains various measures of quality for classification and 
    9 regression. Most functions require an argument named :obj:`res`, an instance of 
    10 :class:`Orange.evaluation.testing.ExperimentResults` as computed by 
    11 functions from :mod:`Orange.evaluation.testing`, which contains 
    12 predictions obtained through cross-validation, 
    13 leave-one-out, testing on training data, or test set instances. 
    14  
    15 ============== 
    16 Classification 
    17 ============== 
    18  
    19 To prepare some data for examples on this page, we shall load the voting data 
    20 set (problem of predicting the congressman's party (republican, democrat) 
    21 based on a selection of votes) and evaluate a naive Bayesian learner, 
    22 classification trees and a majority classifier using cross-validation. 
    23 For examples requiring a multivalued class problem, we shall do the same 
    24 with the vehicle data set (telling whether a vehicle described by the features 
    25 extracted from a picture is a van, bus, or Opel or Saab car). 
    26  
    27 A basic cross-validation example is shown in the following part of  
    28 (:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`): 
    29  
    30 .. literalinclude:: code/statExample0.py 
    31  
    32 If instances are weighted, weights are taken into account. This can be 
    33 disabled by giving :obj:`unweighted=1` as a keyword argument. Another way of 
    34 disabling weights is to clear the 
    35 :class:`Orange.evaluation.testing.ExperimentResults`' weights flag. 
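
    For instance, a minimal sketch (assuming ``res`` holds the cross-validation 
    results prepared above)::

        # Classification accuracy with instance weights ignored.
        CAs = Orange.evaluation.scoring.CA(res, unweighted=1)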
    36  
    37 General Measures of Quality 
    38 =========================== 
    39  
    40 .. autofunction:: CA 
    41  
    42 .. autofunction:: AP 
    43  
    44 .. autofunction:: Brier_score 
    45  
    46 .. autofunction:: IS 
    47  
    48 So, let's compute all this in part of  
    49 (:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print it out: 
    50  
    51 .. literalinclude:: code/statExample1.py 
    52    :lines: 13- 
    53  
    54 The output should look like this:: 
    55  
    56     method  CA      AP      Brier    IS 
    57     bayes   0.903   0.902   0.175    0.759 
    58     tree    0.846   0.845   0.286    0.641 
    59     majrty  0.614   0.526   0.474   -0.000 
    60  
    61 Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out  
    62 the standard errors. 
    63  
    64 Confusion Matrix 
    65 ================ 
    66  
    67 .. autofunction:: confusion_matrices 
    68  
    69    **A positive-negative confusion matrix** is computed (a) if the class is 
    70    binary, unless the :obj:`classIndex` argument is -2, (b) if the class is 
    71    multivalued and the :obj:`classIndex` is non-negative. Argument 
    72    :obj:`classIndex` then tells which class is positive. In case (a), 
    73    :obj:`classIndex` may be omitted; the first class 
    74    is then negative and the second is positive, unless the :obj:`baseClass` 
    75    attribute in the object with results has a non-negative value. In that case, 
    76    :obj:`baseClass` is an index of the target class. :obj:`baseClass` 
    77    attribute of results object should be set manually. The result of a 
    78    function is a list of instances of class :class:`ConfusionMatrix`, 
    79    containing the (weighted) number of true positives (TP), false 
    80    negatives (FN), false positives (FP) and true negatives (TN). 
    81     
    82    We can also add the keyword argument :obj:`cutoff` 
    83    (e.g. confusion_matrices(results, cutoff=0.3)); if we do, :obj:`confusion_matrices` 
    84    will disregard the classifiers' class predictions and observe the predicted 
    85    probabilities, and consider the prediction "positive" if the predicted 
    86    probability of the positive class is higher than the :obj:`cutoff`. 
    87  
    88    The example (part of :download:`statExamples.py <code/statExamples.py>`) below shows how setting the 
    89    cut-off threshold from the default 0.5 to 0.2 affects the confusion matrices 
    90    for the naive Bayesian classifier:: 
    91     
    92        cm = Orange.evaluation.scoring.confusion_matrices(res)[0] 
    93        print "Confusion matrix for naive Bayes:" 
    94        print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
    95         
    96        cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0] 
    97        print "Confusion matrix for naive Bayes:" 
    98        print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
    99  
    100    The output:: 
    101     
    102        Confusion matrix for naive Bayes: 
    103        TP: 238, FP: 13, FN: 29.0, TN: 155 
    104        Confusion matrix for naive Bayes: 
    105        TP: 239, FP: 18, FN: 28.0, TN: 150 
    106     
    107    shows that the number of true positives increases (and hence the number of 
    108    false negatives decreases) by only a single instance, while five instances 
    109    that were originally true negatives become false positives due to the 
    110    lower threshold. 
    111     
    112    To observe how good the classifiers are at detecting vans in the vehicle 
    113    data set, we would compute the matrix like this:: 
    114     
    115       cm = Orange.evaluation.scoring.confusion_matrices(resVeh, \ 
    116 vehicle.domain.classVar.values.index("van")) 
    117     
    118    and get the results like these:: 
    119     
    120        TP: 189, FP: 241, FN: 10.0, TN: 406 
    121     
    122    while the same for class "opel" would give:: 
    123     
    124        TP: 86, FP: 112, FN: 126.0, TN: 522 
    125         
    126    The main difference is that there are only a few false negatives for the 
    127    van, meaning that the classifier seldom misses it (if it says it's not a 
    128    van, it's almost certainly not a van). Not so for the Opel car, where the 
    129    classifier missed 126 of them and correctly detected only 86. 
    130     
    131    **General confusion matrix** is computed (a) in case of a binary class, 
    132    when :obj:`classIndex` is set to -2, (b) when we have a multivalued class and 
    133    the caller doesn't specify the :obj:`classIndex` of the positive class. 
    134    When called in this manner, the function cannot use the argument 
    135    :obj:`cutoff`. 
    136     
    137    The function then returns a three-dimensional matrix, where the element 
    138    A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`] 
    139    gives the number of instances belonging to 'actual_class' for which the 
    140    'learner' predicted 'predictedClass'. We shall compute and print out 
    141    the matrix for the naive Bayesian classifier. 
    142     
    143    Here we see another example from :download:`statExamples.py <code/statExamples.py>`:: 
    144     
    145        cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0] 
    146        classes = vehicle.domain.classVar.values 
    147        print "\t"+"\t".join(classes) 
    148        for className, classConfusions in zip(classes, cm): 
    149            print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions)) 
    150     
    151    So, here's what this nice piece of code gives:: 
    152     
    153               bus   van  saab opel 
    154        bus     56   95   21   46 
    155        van     6    189  4    0 
    156        saab    3    75   73   66 
    157        opel    4    71   51   86 
    158         
    159    Vans are clearly simple: 189 vans were classified as vans (we know this 
    160    already, we've printed it out above), and the 10 misclassified pictures 
    161    were classified as buses (6) and Saab cars (4). In all other classes, 
    162    there were more instances misclassified as vans than correctly classified 
    163    instances. The classifier is obviously quite biased to vans. 
    164     
    165    .. method:: sens(confm)  
    166    .. method:: spec(confm) 
    167    .. method:: PPV(confm) 
    168    .. method:: NPV(confm) 
    169    .. method:: precision(confm) 
    170    .. method:: recall(confm) 
    171    .. method:: F1(confm) 
    172    .. method:: Falpha(confm, alpha=2.0) 
    173    .. method:: MCC(conf) 
    174  
    175    With the confusion matrix defined in terms of positive and negative 
    176    classes, you can also compute the  
    177    `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_ 
    178    [TP/(TP+FN)], `specificity \ 
    179 <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_ 
    180    [TN/(TN+FP)], `positive predictive value \ 
    181 <http://en.wikipedia.org/wiki/Positive_predictive_value>`_ 
    182    [TP/(TP+FP)] and `negative predictive value \ 
    183 <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)].  
    184    In information retrieval, positive predictive value is called precision 
    185    (the ratio of the number of relevant records retrieved to the total number 
    186    of irrelevant and relevant records retrieved), and sensitivity is called 
    187    `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_  
    188    (the ratio of the number of relevant records retrieved to the total number 
    189    of relevant records in the database). The  
    190    `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision 
    191    and recall is called an  
    192    `F-measure <http://en.wikipedia.org/wiki/F-measure>`_; depending on the 
    193    relative weighting of precision and recall, it is implemented 
    194    as F1 [2*precision*recall/(precision+recall)] or, for the general case, 
    195    Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. 
    196    The `Matthews correlation coefficient \ 
    197 <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_ 
    198    is in essence a correlation coefficient between 
    199    the observed and predicted binary classifications; it returns a value 
    200    between -1 and +1. A coefficient of +1 represents a perfect prediction, 
    201    0 an average random prediction and -1 an inverse prediction. 
    202     
    203    If the argument :obj:`confm` is a single confusion matrix, a single 
    204    result (a number) is returned. If confm is a list of confusion matrices, 
    205    a list of scores is returned, one for each confusion matrix. 
    206     
    207    Note that weights are taken into account when computing the matrix, so 
    208    these functions don't check the 'weighted' keyword argument. 
    209     
    210    Let us print out sensitivities and specificities of our classifiers in 
    211    part of :download:`statExamples.py <code/statExamples.py>`:: 
    212     
    213        cm = Orange.evaluation.scoring.confusion_matrices(res) 
    214        print 
    215        print "method\tsens\tspec" 
    216        for l in range(len(learners)): 
    217            print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l])) 
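
    In the same vein, a hedged sketch for the F-measure and the Matthews
    correlation coefficient (assuming the ``F1`` and ``MCC`` helpers described
    above)::

        for l in range(len(learners)):
            print "%s\tF1: %5.3f\tMCC: %5.3f" % (learners[l].name,
                Orange.evaluation.scoring.F1(cm[l]),
                Orange.evaluation.scoring.MCC(cm[l]))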
    218     
    219 ROC Analysis 
    220 ============ 
    221  
    222 `Receiver Operating Characteristic \ 
    223 <http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_  
    224 (ROC) analysis was initially developed for 
    225 binary-like problems and there is no consensus on how to apply it in 
    226 multi-class problems, nor do we know for sure how to do ROC analysis after 
    227 cross validation and similar multiple sampling techniques. If you are 
    228 interested in the area under the curve, function AUC will deal with those 
    229 problems as specifically described below. 
    230  
    231 .. autofunction:: AUC 
    232     
    233    .. attribute:: AUC.ByWeightedPairs (or 0) 
    234        
    235       Computes AUC for each pair of classes (ignoring instances of all other 
    236       classes) and averages the results, weighting them by the number of 
    237       pairs of instances from these two classes (e.g. by the product of 
    238       probabilities of the two classes). AUC computed in this way still 
    239       behaves as a concordance index, i.e., gives the probability that two 
    240       randomly chosen instances from different classes will be correctly 
    241       recognized (this is of course true only if the classifier knows 
    242       from which two classes the instances came). 
    243     
    244    .. attribute:: AUC.ByPairs (or 1) 
    245     
    246       Similar to the above, except that the average over class pairs is not 
    247       weighted. This AUC is, like the binary, independent of class 
    248       distributions, but it is not related to concordance index any more. 
    249        
    250    .. attribute:: AUC.WeightedOneAgainstAll (or 2) 
    251        
    252       For each class, it computes AUC for this class against all others (that 
    253       is, treating other classes as one class). The AUCs are then averaged by 
    254       the class probabilities. This is related to concordance index in which 
    255       we test the classifier's (average) capability for distinguishing the 
    256       instances from a specified class from those that come from other classes. 
    257       Unlike the binary AUC, the measure is not independent of class 
    258       distributions. 
    259        
    260    .. attribute:: AUC.OneAgainstAll (or 3) 
    261     
    262       As above, except that the average is not weighted. 
    263     
    264    In case of multiple folds (for instance if the data comes from cross 
    265    validation), the computation goes like this. When computing the partial 
    266    AUCs for individual pairs of classes or singled-out classes, AUC is 
    267    computed for each fold separately and then averaged (ignoring the number 
    268    of instances in each fold, it's just a simple average). However, if a 
    269    certain fold doesn't contain any instances of a certain class (from the 
    270    pair), the partial AUC is computed treating the results as if they came 
    271    from a single-fold. This is not really correct since the class 
    272    probabilities from different folds are not necessarily comparable, 
    273    yet this will most often occur in a leave-one-out experiments, 
    274    comparability shouldn't be a problem. 
    275     
    276    Computing and printing out the AUCs looks just like printing out 
    277    classification accuracies (except that we call AUC instead of 
    278    CA, of course):: 
    279     
    280        AUCs = Orange.evaluation.scoring.AUC(res) 
    281        for l in range(len(learners)): 
    282            print "%10s: %5.3f" % (learners[l].name, AUCs[l]) 
    283             
    284    For vehicle, you can run exactly this same code; it will compute AUCs 
    285    for all pairs of classes and return the average weighted by probabilities 
    286    of pairs. Or, you can specify the averaging method yourself, like this:: 
    287     
    288        AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll) 
    289     
    290    The following snippet tries out all four. (We don't claim that this is 
    291    how the function needs to be used; it's better to stay with the default.):: 
    292     
    293        methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"] 
    294        print " " *25 + "  \tbayes\ttree\tmajority" 
    295        for i in range(4): 
    296            AUCs = Orange.evaluation.scoring.AUC(resVeh, i) 
    297            print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs)) 
    298     
    299    As you can see from the output:: 
    300     
    301                                    bayes   tree    majority 
    302               by pairs, weighted:  0.789   0.871   0.500 
    303                         by pairs:  0.791   0.872   0.500 
    304            one vs. all, weighted:  0.783   0.800   0.500 
    305                      one vs. all:  0.783   0.800   0.500 
    306  
    307 .. autofunction:: AUC_single 
    308  
    309 .. autofunction:: AUC_pair 
    310  
    311 .. autofunction:: AUC_matrix 
    312  
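    These three are not demonstrated in the scripts above, so only a hypothetical
    sketch (assuming, as the surrounding functions suggest, that ``AUC_single``
    takes the experiment results and the index of the singled-out class)::

        van = vehicle.domain.classVar.values.index("van")
        AUCs = Orange.evaluation.scoring.AUC_single(resVeh, van)
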
    313 The remaining functions, which plot the curves and statistically compare 
    314 them, require that the results come from a test with a single iteration, 
    315 and they always compare one chosen class against all others. If you have 
    316 cross validation results, you can either use split_by_iterations to split the 
    317 results by folds, call the function for each fold separately and then sum 
    318 the results up however you see fit, or you can set the ExperimentResults' 
    319 attribute number_of_iterations to 1, to cheat the function - at your own 
    320 responsibility for the statistical correctness. Regarding multi-class 
    321 problems, if you don't choose a specific class, Orange.evaluation.scoring will use the class 
    322 attribute's baseValue at the time when results were computed. If baseValue 
    323 was not given at that time, 1 (that is, the second class) is used as the default. 
    324  
    325 We shall use the following code to prepare suitable experimental results:: 
    326  
    327     ri2 = Orange.core.MakeRandomIndices2(voting, 0.6) 
    328     train = voting.selectref(ri2, 0) 
    329     test = voting.selectref(ri2, 1) 
    330     res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test) 
    331  
    332  
    333 .. autofunction:: AUCWilcoxon 
    334  
    335 .. autofunction:: compute_ROC 
    336  
    337 Comparison of Algorithms 
    338 ------------------------ 
    339  
    340 .. autofunction:: McNemar 
    341  
    342 .. autofunction:: McNemar_of_two 
    343  
    344 ========== 
    345 Regression 
    346 ========== 
    347  
    348 General Measures of Quality 
    349 =========================== 
    350  
    351 Several alternative measures, as given below, can be used to evaluate 
    352 the success of numeric prediction: 
    353  
    354 .. image:: files/statRegression.png 
    355  
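    For readers without the image, the standard definitions (a reconstruction;
    :math:`y_i` are the true values, :math:`\hat{y}_i` the predictions and
    :math:`\bar{y}` the mean of :math:`y`):

    .. math::

        \mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2, \quad
        \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad
        \mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|

    .. math::

        \mathrm{RSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \quad
        \mathrm{RRSE} = \sqrt{\mathrm{RSE}}, \quad
        \mathrm{RAE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i - \bar{y}|}, \quad
        R^2 = 1 - \mathrm{RSE}
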
    356 .. autofunction:: MSE 
    357  
    358 .. autofunction:: RMSE 
    359  
    360 .. autofunction:: MAE 
    361  
    362 .. autofunction:: RSE 
    363  
    364 .. autofunction:: RRSE 
    365  
    366 .. autofunction:: RAE 
    367  
    368 .. autofunction:: R2 
    369  
    370 The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to 
    371 score several regression methods. 
    372  
    373 .. literalinclude:: code/statExamplesRegression.py 
    374  
    375 The code above produces the following output:: 
    376  
    377     Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2 
    378     maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002 
    379     rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526 
    380     knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748 
    381     lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715 
    382      
    383 ================== 
    384 Plotting functions 
    385 ================== 
    386  
    387 .. autofunction:: graph_ranks 
    388  
    389 The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph: 
    390  
    391 .. literalinclude:: code/statExamplesGraphRanks.py 
    392  
    393 The code produces the following graph: 
    394  
    395 .. image:: files/statExamplesGraphRanks1.png 
    396  
    397 .. autofunction:: compute_CD 
    398  
    399 .. autofunction:: compute_friedman 
    400  
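    Since the literalincluded script is not reproduced here, a sketch of how the
    two fit together (the names, average ranks and the count of 30 data sets are
    illustrative)::

        names = ["first", "third", "second", "fourth"]
        avranks = [1.9, 3.2, 2.8, 3.3]
        cd = Orange.evaluation.scoring.compute_CD(avranks, 30)  # 30 data sets
        Orange.evaluation.scoring.graph_ranks("ranks.png", avranks, names,
                                              cd=cd, width=6, textspace=1.5)
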
    401 ================= 
    402 Utility Functions 
    403 ================= 
    404  
    405 .. autofunction:: split_by_iterations 
    406  
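    For example, a minimal sketch (assuming cross-validation results ``res`` as
    above)::

        # One ExperimentResults object per cross-validation fold.
        for fold in Orange.evaluation.scoring.split_by_iterations(res):
            print Orange.evaluation.scoring.CA(fold)
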
    407 ===================================== 
    408 Scoring for multilabel classification 
    409 ===================================== 
    410  
    411 Multi-label classification requires different metrics than those used in traditional single-label 
    412 classification. This module presents the various metrics that have been proposed in the literature. 
    413 Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples 
    414 :math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \\subseteq L`. Let :math:`H` be a multi-label classifier  
    415 and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`. 
    416  
    417 .. autofunction:: mlc_hamming_loss  
    418 .. autofunction:: mlc_accuracy 
    419 .. autofunction:: mlc_precision 
    420 .. autofunction:: mlc_recall 
    421  
    422 So, let's compute all this and print it out (part of 
    423 :download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses 
    424 :download:`emotions.tab <code/emotions.tab>`): 
    425  
    426 .. literalinclude:: code/mlc-evaluate.py 
    427    :lines: 1-15 
    428  
    429 The output should look like this:: 
    430  
    431     loss= [0.9375] 
    432     accuracy= [0.875] 
    433     precision= [1.0] 
    434     recall= [0.875] 
    435  
    436 References 
    437 ========== 
    438  
    439 Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification', 
    440 Pattern Recognition, vol. 37, no. 9, pp. 1757-71. 
    441  
    442 Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', in 
    443 Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining 
    444 (PAKDD 2004). 
    445   
    446 Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization', 
    447 Machine Learning, vol. 39, no. 2/3, pp. 135-68. 
    448  
    449 """ 
    450  
    4511import operator, math 
    4522from operator import add 
     
    4555import Orange 
    4566from Orange import statc 
    457  
     7from Orange.misc import deprecated_keywords 
    4588 
    4599#### Private stuff 
     
    53383 
    53484 
    535 def statistics_by_folds(stats, foldN, reportSE, iterationIsOuter): 
     85@deprecated_keywords({ 
     86    "foldN": "fold_n", 
     87    "reportSE": "report_se", 
     88    "iterationIsOuter": "iteration_is_outer"}) 
     89def statistics_by_folds(stats, fold_n, report_se, iteration_is_outer): 
    53690    # remove empty folds, turn the matrix so that learner is outer 
    537     if iterationIsOuter: 
     91    if iteration_is_outer: 
    53892        if not stats: 
    53993            raise ValueError, "Cannot compute the score: no examples or sum of weights is 0.0." 
    54094        number_of_learners = len(stats[0]) 
    541         stats = filter(lambda (x, fN): fN>0.0, zip(stats,foldN)) 
     95        stats = filter(lambda (x, fN): fN>0.0, zip(stats,fold_n)) 
    54296        stats = [ [x[lrn]/fN for x, fN in stats] for lrn in range(number_of_learners)] 
    54397    else: 
    544         stats = [ [x/Fn for x, Fn in filter(lambda (x, Fn): Fn > 0.0, zip(lrnD, foldN))] for lrnD in stats] 
     98        stats = [ [x/Fn for x, Fn in filter(lambda (x, Fn): Fn > 0.0, zip(lrnD, fold_n))] for lrnD in stats] 
    54599 
    546100    if not stats: 
     
    549103        raise ValueError, "Cannot compute the score: no examples or sum of weights is 0.0." 
    550104     
    551     if reportSE: 
     105    if report_se: 
    552106        return [(statc.mean(x), statc.sterr(x)) for x in stats] 
    553107    else: 
     
    751305# Scores for evaluation of classifiers 
    752306 
    753 def CA(res, reportSE = False, **argkw): 
     307@deprecated_keywords({"reportSE": "report_se"}) 
     308def CA(res, report_se = False, **argkw): 
    754309    """ Computes classification accuracy, i.e. percentage of matches between 
    755310    predicted and actual class. The function returns a list of classification 
     
    793348            ca = [x/totweight for x in CAs] 
    794349             
    795         if reportSE: 
     350        if report_se: 
    796351            return [(x, x*(1-x)/math.sqrt(totweight)) for x in ca] 
    797352        else: 
     
    813368                foldN[tex.iteration_number] += tex.weight 
    814369 
    815         return statistics_by_folds(CAsByFold, foldN, reportSE, False) 
     370        return statistics_by_folds(CAsByFold, foldN, report_se, False) 
    816371 
    817372 
     
    820375    return CA(res, True, **argkw) 
    821376 
    822  
    823 def AP(res, reportSE = False, **argkw): 
     377@deprecated_keywords({"reportSE": "report_se"}) 
     378def AP(res, report_se = False, **argkw): 
    824379    """ Computes the average probability assigned to the correct class. """ 
    825380    if res.number_of_iterations == 1: 
     
    848403            foldN[tex.iteration_number] += tex.weight 
    849404 
    850     return statistics_by_folds(APsByFold, foldN, reportSE, True) 
    851  
    852  
    853 def Brier_score(res, reportSE = False, **argkw): 
     405    return statistics_by_folds(APsByFold, foldN, report_se, True) 
     406 
     407 
     408@deprecated_keywords({"reportSE": "report_se"}) 
     409def Brier_score(res, report_se = False, **argkw): 
    854410    """ Computes the Brier score, defined as the average (over test examples) 
    855411    of sum_x (t(x)-p(x))**2, where x is a class, t(x) is 1 for the correct class 
     
    881437            totweight = gettotweight(res) 
    882438        check_non_zero(totweight) 
    883         if reportSE: 
     439        if report_se: 
    884440            return [(max(x/totweight+1.0, 0), 0) for x in MSEs]  ## change this, not zero!!! 
    885441        else: 
     
    900456            foldN[tex.iteration_number] += tex.weight 
    901457 
    902     stats = statistics_by_folds(BSs, foldN, reportSE, True) 
    903     if reportSE: 
     458    stats = statistics_by_folds(BSs, foldN, report_se, True) 
     459    if report_se: 
    904460        return [(x+1.0, y) for x, y in stats] 
    905461    else: 
     
    915471    else: 
    916472        return -(-log2(1-P)+log2(1-Pc)) 
    917      
    918 def IS(res, apriori=None, reportSE = False, **argkw): 
     473 
     474 
     475@deprecated_keywords({"reportSE": "report_se"}) 
     476def IS(res, apriori=None, report_se = False, **argkw): 
    919477    """ Computes the information score as defined by  
    920478    `Kononenko and Bratko (1991) \ 
     
    941499                    ISs[i] += IS_ex(tex.probabilities[i][cls], apriori[cls]) * tex.weight 
    942500            totweight = gettotweight(res) 
    943         if reportSE: 
     501        if report_se: 
    944502            return [(IS/totweight,0) for IS in ISs] 
    945503        else: 
     
    964522            foldN[tex.iteration_number] += tex.weight 
    965523 
    966     return statistics_by_folds(ISs, foldN, reportSE, False) 
     524    return statistics_by_folds(ISs, foldN, report_se, False) 
    967525 
    968526 
     
    1026584 
    1027585 
    1028 def confusion_matrices(res, classIndex=-1, **argkw): 
     586@deprecated_keywords({"classIndex": "class_index"}) 
     587def confusion_matrices(res, class_index=-1, **argkw): 
    1029588    """ This function can compute two different forms of confusion matrix: 
    1030589    one in which a certain class is marked as positive and the other(s) 
     
    1035594    tfpns = [ConfusionMatrix() for i in range(res.number_of_learners)] 
    1036595     
    1037     if classIndex<0: 
     596    if class_index<0: 
    1038597        numberOfClasses = len(res.class_values) 
    1039         if classIndex < -1 or numberOfClasses > 2: 
     598        if class_index < -1 or numberOfClasses > 2: 
    1040599            cm = [[[0.0] * numberOfClasses for i in range(numberOfClasses)] for l in range(res.number_of_learners)] 
    1041600            if argkw.get("unweighted", 0) or not res.weights: 
     
    1056615             
    1057616        elif res.baseClass>=0: 
    1058             classIndex = res.baseClass 
    1059         else: 
    1060             classIndex = 1 
     617            class_index = res.baseClass 
     618        else: 
     619            class_index = 1 
    1061620             
    1062621    cutoff = argkw.get("cutoff") 
     
    1064623        if argkw.get("unweighted", 0) or not res.weights: 
    1065624            for lr in res.results: 
    1066                 isPositive=(lr.actual_class==classIndex) 
     625                isPositive=(lr.actual_class==class_index) 
    1067626                for i in range(res.number_of_learners): 
    1068                     tfpns[i].addTFPosNeg(lr.probabilities[i][classIndex]>cutoff, isPositive) 
     627                    tfpns[i].addTFPosNeg(lr.probabilities[i][class_index]>cutoff, isPositive) 
    1069628        else: 
    1070629            for lr in res.results: 
    1071                 isPositive=(lr.actual_class==classIndex) 
     630                isPositive=(lr.actual_class==class_index) 
    1072631                for i in range(res.number_of_learners): 
    1073                     tfpns[i].addTFPosNeg(lr.probabilities[i][classIndex]>cutoff, isPositive, lr.weight) 
     632                    tfpns[i].addTFPosNeg(lr.probabilities[i][class_index]>cutoff, isPositive, lr.weight) 
    1074633    else: 
    1075634        if argkw.get("unweighted", 0) or not res.weights: 
    1076635            for lr in res.results: 
    1077                 isPositive=(lr.actual_class==classIndex) 
     636                isPositive=(lr.actual_class==class_index) 
    1078637                for i in range(res.number_of_learners): 
    1079                     tfpns[i].addTFPosNeg(lr.classes[i]==classIndex, isPositive) 
     638                    tfpns[i].addTFPosNeg(lr.classes[i]==class_index, isPositive) 
    1080639        else: 
    1081640            for lr in res.results: 
    1082                 isPositive=(lr.actual_class==classIndex) 
     641                isPositive=(lr.actual_class==class_index) 
    1083642                for i in range(res.number_of_learners): 
    1084                     tfpns[i].addTFPosNeg(lr.classes[i]==classIndex, isPositive, lr.weight) 
     643                    tfpns[i].addTFPosNeg(lr.classes[i]==class_index, isPositive, lr.weight) 
    1085644    return tfpns 
    1086645 
     
    1090649 
    1091650 
    1092 def confusion_chi_square(confusionMatrix): 
    1093     dim = len(confusionMatrix) 
    1094     rowPriors = [sum(r) for r in confusionMatrix] 
    1095     colPriors = [sum([r[i] for r in confusionMatrix]) for i in range(dim)] 
     651@deprecated_keywords({"confusionMatrix": "confusion_matrix"}) 
     652def confusion_chi_square(confusion_matrix): 
     653    dim = len(confusion_matrix) 
     654    rowPriors = [sum(r) for r in confusion_matrix] 
     655    colPriors = [sum([r[i] for r in confusion_matrix]) for i in range(dim)] 
    1096656    total = sum(rowPriors) 
    1097657    rowPriors = [r/total for r in rowPriors] 
    1098658    colPriors = [r/total for r in colPriors] 
    1099659    ss = 0 
    1100     for ri, row in enumerate(confusionMatrix): 
     660    for ri, row in enumerate(confusion_matrix): 
    1101661        for ci, o in enumerate(row): 
    1102662            e = total * rowPriors[ri] * colPriors[ci] 
     
    1229789    return r 
    1230790 
    1231 def scotts_pi(confm, bIsListOfMatrices=True): 
     791 
     792@deprecated_keywords({"bIsListOfMatrices": "b_is_list_of_matrices"}) 
     793def scotts_pi(confm, b_is_list_of_matrices=True): 
    1232794   """Compute Scott's Pi for measuring inter-rater agreement for nominal data 
    1233795 
     
    1240802                           Orange.evaluation.scoring.compute_confusion_matrices and set the 
    1241803                           classIndex parameter to -2. 
    1242    @param bIsListOfMatrices: specifies whether confm is list of matrices. 
     804   @param b_is_list_of_matrices: specifies whether confm is list of matrices. 
    1243805                           This function needs to operate on non-binary 
    1244806                           confusion matrices, which are represented by python 
     
    1247809   """ 
    1248810 
    1249    if bIsListOfMatrices: 
     811   if b_is_list_of_matrices: 
    1250812       try: 
    1251            return [scotts_pi(cm, bIsListOfMatrices=False) for cm in confm] 
     813           return [scotts_pi(cm, b_is_list_of_matrices=False) for cm in confm] 
    1252814       except TypeError: 
    1253815           # Nevermind the parameter, maybe this is a "conventional" binary 
     
    1276838       return ret 
    1277839 
    1278 def AUCWilcoxon(res, classIndex=-1, **argkw): 
     840@deprecated_keywords({"classIndex": "class_index"}) 
     841def AUCWilcoxon(res, class_index=-1, **argkw): 
    1279842    """ Computes the area under ROC (AUC) and its standard error using 
    1280843    Wilcoxon's approach proposed by Hanley and McNeil (1982). If 
     
    1285848    import corn 
    1286849    useweights = res.weights and not argkw.get("unweighted", 0) 
    1287     problists, tots = corn.computeROCCumulative(res, classIndex, useweights) 
     850    problists, tots = corn.computeROCCumulative(res, class_index, useweights) 
    1288851 
    1289852    results=[] 
     
    1313876AROC = AUCWilcoxon # for backward compatibility, AROC is obsolete 
    1314877 
    1315 def compare_2_AUCs(res, lrn1, lrn2, classIndex=-1, **argkw): 
     878 
     879@deprecated_keywords({"classIndex": "class_index"}) 
     880def compare_2_AUCs(res, lrn1, lrn2, class_index=-1, **argkw): 
    1316881    import corn 
    1317     return corn.compare2ROCs(res, lrn1, lrn2, classIndex, res.weights and not argkw.get("unweighted")) 
     882    return corn.compare2ROCs(res, lrn1, lrn2, class_index, res.weights and not argkw.get("unweighted")) 
    1318883 
    1319884compare_2_AROCs = compare_2_AUCs # for backward compatibility, compare_2_AROCs is obsolete 
    1320885 
    1321      
    1322 def compute_ROC(res, classIndex=-1): 
     886 
     887@deprecated_keywords({"classIndex": "class_index"}) 
     888def compute_ROC(res, class_index=-1): 
    1323889    """ Computes a ROC curve as a list of (x, y) tuples, where x is  
    1324890    1-specificity and y is sensitivity. 
    1325891    """ 
    1326892    import corn 
    1327     problists, tots = corn.computeROCCumulative(res, classIndex) 
     893    problists, tots = corn.computeROCCumulative(res, class_index) 
    1328894 
    1329895    results = [] 
     
    1357923    return (P1y - P2y) / (P1x - P2x) 
    1358924 
    1359 def ROC_add_point(P, R, keepConcavities=1): 
     925 
     926@deprecated_keywords({"keepConcavities": "keep_concavities"}) 
     927def ROC_add_point(P, R, keep_concavities=1): 
    1360928    if keep_concavities: 
    1361929        R.append(P) 
     
    1374942    return R 
    1375943 
    1376 def TC_compute_ROC(res, classIndex=-1, keepConcavities=1): 
     944 
     945@deprecated_keywords({"classIndex": "class_index", 
     946                      "keepConcavities": "keep_concavities"}) 
     947def TC_compute_ROC(res, class_index=-1, keep_concavities=1): 
    1377948    import corn 
    1378     problists, tots = corn.computeROCCumulative(res, classIndex) 
     949    problists, tots = corn.computeROCCumulative(res, class_index) 
    1379950 
    1380951    results = [] 
     
    1399970                else: 
    1400971                    fpr = 0.0 
    1401                 curve = ROC_add_point((fpr, tpr, fPrev), curve, keepConcavities) 
     972                curve = ROC_add_point((fpr, tpr, fPrev), curve, keep_concavities) 
    1402973                fPrev = f 
    1403974            thisPos, thisNeg = prob[1][1], prob[1][0] 
     
    1412983        else: 
    1413984            fpr = 0.0 
    1414         curve = ROC_add_point((fpr, tpr, f), curve, keepConcavities) ## ugly 
     985        curve = ROC_add_point((fpr, tpr, f), curve, keep_concavities) ## ugly 
    1415986        results.append(curve) 
    1416987 
     
    14721043## for each (sub)set of input ROC curves 
    14731044## returns the average ROC curve and an array of (vertical) standard deviations 
    1474 def TC_vertical_average_ROC(ROCcurves, samples = 10): 
     1045@deprecated_keywords({"ROCcurves": "roc_curves"}) 
     1046def TC_vertical_average_ROC(roc_curves, samples = 10): 
    14751047    def INTERPOLATE((P1x, P1y, P1fscore), (P2x, P2y, P2fscore), X): 
    14761048        if (P1x == P2x) or ((X > P1x) and (X > P2x)) or ((X < P1x) and (X < P2x)): 
     
    15011073    average = [] 
    15021074    stdev = [] 
    1503     for ROCS in ROCcurves: 
     1075    for ROCS in roc_curves: 
    15041076        npts = [] 
    15051077        for c in ROCS: 
     
    15311103## for each (sub)set of input ROC curves 
    15321104## returns the average ROC curve, an array of vertical standard deviations and an array of horizontal standard deviations 
    1533 def TC_threshold_average_ROC(ROCcurves, samples = 10): 
     1105@deprecated_keywords({"ROCcurves": "roc_curves"}) 
     1106def TC_threshold_average_ROC(roc_curves, samples = 10): 
    15341107    def POINT_AT_THRESH(ROC, npts, thresh): 
    15351108        i = 0 
     
    15451118    stdevV = [] 
    15461119    stdevH = [] 
    1547     for ROCS in ROCcurves: 
     1120    for ROCS in roc_curves: 
    15481121        npts = [] 
    15491122        for c in ROCS: 
     
    15961169##  - yesClassRugPoints is an array of (x, 1) points 
    15971170##  - noClassRugPoints is an array of (x, 0) points 
    1598 def compute_calibration_curve(res, classIndex=-1): 
     1171@deprecated_keywords({"classIndex": "class_index"}) 
     1172def compute_calibration_curve(res, class_index=-1): 
    15991173    import corn 
    16001174    ## merge multiple iterations into one 
     
    16031177        mres.results.append( te ) 
    16041178 
    1605     problists, tots = corn.computeROCCumulative(mres, classIndex) 
     1179    problists, tots = corn.computeROCCumulative(mres, class_index) 
    16061180 
    16071181    results = [] 
     
    16581232## returns an array of curve elements, where: 
    16591233##  - curve is an array of points ((TP+FP)/(P + N), TP/P, (th, FP/N)) on the Lift Curve 
    1660 def compute_lift_curve(res, classIndex=-1): 
     1234@deprecated_keywords({"classIndex": "class_index"}) 
     1235def compute_lift_curve(res, class_index=-1): 
    16611236    import corn 
    16621237    ## merge multiple iterations into one 
     
    16651240        mres.results.append( te ) 
    16661241 
    1667     problists, tots = corn.computeROCCumulative(mres, classIndex) 
     1242    problists, tots = corn.computeROCCumulative(mres, class_index) 
    16681243 
    16691244    results = [] 
     
    16931268 
    16941269 
    1695 def compute_CDT(res, classIndex=-1, **argkw): 
     1270@deprecated_keywords({"classIndex": "class_index"}) 
     1271def compute_CDT(res, class_index=-1, **argkw): 
    16961272    """Obsolete, don't use""" 
    16971273    import corn 
    1698     if classIndex<0: 
     1274    if class_index<0: 
    16991275        if res.baseClass>=0: 
    1700             classIndex = res.baseClass 
    1701         else: 
    1702             classIndex = 1 
     1276            class_index = res.baseClass 
     1277        else: 
     1278            class_index = 1 
    17031279             
    17041280    useweights = res.weights and not argkw.get("unweighted", 0) 
     
    17091285        iterationExperiments = split_by_iterations(res) 
    17101286        for exp in iterationExperiments: 
    1711             expCDTs = corn.computeCDT(exp, classIndex, useweights) 
     1287            expCDTs = corn.computeCDT(exp, class_index, useweights) 
    17121288            for i in range(len(CDTs)): 
    17131289                CDTs[i].C += expCDTs[i].C 
     
    17161292        for i in range(res.number_of_learners): 
    17171293            if is_CDT_empty(CDTs[0]): 
    1718                 return corn.computeCDT(res, classIndex, useweights) 
     1294                return corn.computeCDT(res, class_index, useweights) 
    17191295         
    17201296        return CDTs 
    17211297    else: 
    1722         return corn.computeCDT(res, classIndex, useweights) 
     1298        return corn.computeCDT(res, class_index, useweights) 
    17231299 
    17241300## THIS FUNCTION IS OBSOLETE AND ITS AVERAGING OVER FOLDS IS QUESTIONABLE 
     
    17641340# are divided by 'divide_by_if_ite'. An additional flag is returned, which is True in 
    17651341# the former case, or False in the latter. 
    1766 def AUC_x(cdtComputer, ite, all_ite, divideByIfIte, computerArgs): 
    1767     cdts = cdtComputer(*(ite, ) + computerArgs) 
     1342@deprecated_keywords({"divideByIfIte": "divide_by_if_ite", 
     1343                      "computerArgs": "computer_args"}) 
     1344def AUC_x(cdtComputer, ite, all_ite, divide_by_if_ite, computer_args): 
     1345    cdts = cdtComputer(*(ite, ) + computer_args) 
    17681346    if not is_CDT_empty(cdts[0]): 
    1769         return [(cdt.C+cdt.T/2)/(cdt.C+cdt.D+cdt.T)/divideByIfIte for cdt in cdts], True 
     1347        return [(cdt.C+cdt.T/2)/(cdt.C+cdt.D+cdt.T)/divide_by_if_ite for cdt in cdts], True 
    17701348         
    17711349    if all_ite: 
    1772         cdts = cdtComputer(*(all_ite, ) + computerArgs) 
     1350        cdts = cdtComputer(*(all_ite, ) + computer_args) 
    17731351        if not is_CDT_empty(cdts[0]): 
    17741352            return [(cdt.C+cdt.T/2)/(cdt.C+cdt.D+cdt.T) for cdt in cdts], False 
     
    17781356     
    17791357# computes AUC between classes i and j as if there we no other classes 
    1780 def AUC_ij(ite, classIndex1, classIndex2, useWeights = True, all_ite = None, divideByIfIte = 1.0): 
     1358@deprecated_keywords({"classIndex1": "class_index1", 
     1359                      "classIndex2": "class_index2", 
     1360                      "useWeights": "use_weights", 
     1361                      "divideByIfIte": "divide_by_if_ite"}) 
     1362def AUC_ij(ite, class_index1, class_index2, use_weights = True, all_ite = None, divide_by_if_ite = 1.0): 
    17811363    import corn 
    1782     return AUC_x(corn.computeCDTPair, ite, all_ite, divideByIfIte, (classIndex1, classIndex2, useWeights)) 
     1364    return AUC_x(corn.computeCDTPair, ite, all_ite, divide_by_if_ite, (class_index1, class_index2, use_weights)) 
    17831365 
    17841366 
    17851367# computes AUC between class i and the other classes (treating them as the same class) 
    1786 def AUC_i(ite, classIndex, useWeights = True, all_ite = None, divideByIfIte = 1.0): 
     1368@deprecated_keywords({"classIndex": "class_index", 
     1369                      "useWeights": "use_weights", 
     1370                      "divideByIfIte": "divide_by_if_ite"}) 
     1371def AUC_i(ite, class_index, use_weights = True, all_ite = None, divide_by_if_ite = 1.0): 
    17871372    import corn 
    1788     return AUC_x(corn.computeCDT, ite, all_ite, divideByIfIte, (classIndex, useWeights)) 
    1789     
     1373    return AUC_x(corn.computeCDT, ite, all_ite, divide_by_if_ite, (class_index, use_weights)) 
     1374 
    17901375 
    17911376    # computes the average AUC over folds using an "AUCcomputer" (AUC_i or AUC_ij) 
     
    17931378# fold the computer has to resort to computing over all folds or even this failed; 
    17941379# in these cases the result is returned immediately 
    1795 def AUC_iterations(AUCcomputer, iterations, computerArgs): 
     1380 
     1381@deprecated_keywords({"AUCcomputer": "auc_computer", 
     1382                      "computerArgs": "computer_args"}) 
     1383def AUC_iterations(auc_computer, iterations, computer_args): 
    17961384    subsum_aucs = [0.] * iterations[0].number_of_learners 
    17971385    for ite in iterations: 
    1798         aucs, foldsUsed = AUCcomputer(*(ite, ) + computerArgs) 
     1386        aucs, foldsUsed = auc_computer(*(ite, ) + computer_args) 
    17991387        if not aucs: 
    18001388            return None 
     
    18061394 
    18071395# AUC for binary classification problems 
    1808 def AUC_binary(res, useWeights = True): 
     1396@deprecated_keywords({"useWeights": "use_weights"}) 
     1397def AUC_binary(res, use_weights = True): 
    18091398    if res.number_of_iterations > 1: 
    1810         return AUC_iterations(AUC_i, split_by_iterations(res), (-1, useWeights, res, res.number_of_iterations)) 
    1811     else: 
    1812         return AUC_i(res, -1, useWeights)[0] 
     1399        return AUC_iterations(AUC_i, split_by_iterations(res), (-1, use_weights, res, res.number_of_iterations)) 
     1400    else: 
     1401        return AUC_i(res, -1, use_weights)[0] 
    18131402 
    18141403# AUC for multiclass problems 
    1815 def AUC_multi(res, useWeights = True, method = 0): 
     1404@deprecated_keywords({"useWeights": "use_weights"}) 
     1405def AUC_multi(res, use_weights = True, method = 0): 
    18161406    numberOfClasses = len(res.class_values) 
    18171407     
     
    18331423        for classIndex1 in range(numberOfClasses): 
    18341424            for classIndex2 in range(classIndex1): 
    1835                 subsum_aucs = AUC_iterations(AUC_ij, iterations, (classIndex1, classIndex2, useWeights, all_ite, res.number_of_iterations)) 
     1425                subsum_aucs = AUC_iterations(AUC_ij, iterations, (classIndex1, classIndex2, use_weights, all_ite, res.number_of_iterations)) 
    18361426                if subsum_aucs: 
    18371427                    if method == 0: 
     
    18441434    else: 
    18451435        for classIndex in range(numberOfClasses): 
    1846             subsum_aucs = AUC_iterations(AUC_i, iterations, (classIndex, useWeights, all_ite, res.number_of_iterations)) 
     1436            subsum_aucs = AUC_iterations(AUC_i, iterations, (classIndex, use_weights, all_ite, res.number_of_iterations)) 
    18471437            if subsum_aucs: 
    18481438                if method == 0: 
     
    18661456# Computes AUC, possibly for multiple classes (the averaging method can be specified) 
    18671457# Results over folds are averages; if some folds have examples from one class only, the folds are merged 
    1868 def AUC(res, method = AUC.ByWeightedPairs, useWeights = True): 
     1458@deprecated_keywords({"useWeights": "use_weights"}) 
     1459def AUC(res, method = AUC.ByWeightedPairs, use_weights = True): 
    18691460    """ Returns the area under ROC curve (AUC) given a set of experimental 
    18701461    results. For multivalued class problems, it will compute some sort of 
     
    18741465        raise ValueError("Cannot compute AUC on a single-class problem") 
    18751466    elif len(res.class_values) == 2: 
    1876         return AUC_binary(res, useWeights) 
    1877     else: 
    1878         return AUC_multi(res, useWeights, method) 
     1467        return AUC_binary(res, use_weights) 
     1468    else: 
     1469        return AUC_multi(res, use_weights, method) 
    18791470 
    18801471AUC.ByWeightedPairs = 0 
     
    18861477# Computes AUC; in multivalued class problem, AUC is computed as one against all 
    18871478# Results over folds are averages; if some folds examples from one class only, the folds are merged 
    1888 def AUC_single(res, classIndex = -1, useWeights = True): 
     1479@deprecated_keywords({"classIndex": "class_index", 
     1480                      "useWeights": "use_weights"}) 
     1481def AUC_single(res, class_index = -1, use_weights = True): 
    18891482    """ Computes AUC where the class given by class_index is singled out, and 
    18901483    all other classes are treated as a single class. To find how good our 
     
    18951488classIndex = vehicle.domain.classVar.values.index("van")) 
    18961489    """ 
    1897     if classIndex<0: 
     1490    if class_index<0: 
    18981491        if res.baseClass>=0: 
    1899             classIndex = res.baseClass 
    1900         else: 
    1901             classIndex = 1 
     1492            class_index = res.baseClass 
     1493        else: 
     1494            class_index = 1 
    19021495 
    19031496    if res.number_of_iterations > 1: 
    1904         return AUC_iterations(AUC_i, split_by_iterations(res), (classIndex, useWeights, res, res.number_of_iterations)) 
    1905     else: 
    1906         return AUC_i( res, classIndex, useWeights)[0] 
     1497        return AUC_iterations(AUC_i, split_by_iterations(res), (class_index, use_weights, res, res.number_of_iterations)) 
     1498    else: 
     1499        return AUC_i( res, class_index, use_weights)[0] 
    19071500 
    19081501# Computes AUC for a pair of classes (as if there were no other classes) 
    19091502# Results over folds are averages; if some folds have examples from one class only, the folds are merged 
    1910 def AUC_pair(res, classIndex1, classIndex2, useWeights = True): 
     1503@deprecated_keywords({"classIndex1": "class_index1", 
     1504                      "classIndex2": "class_index2", 
     1505                      "useWeights": "use_weights"}) 
     1506def AUC_pair(res, class_index1, class_index2, use_weights = True): 
    19111507    """ Computes AUC between a pair of classes, ignoring instances from all 
    19121508    other classes. 
    19131509    """ 
    19141510    if res.number_of_iterations > 1: 
    1915         return AUC_iterations(AUC_ij, split_by_iterations(res), (classIndex1, classIndex2, useWeights, res, res.number_of_iterations)) 
    1916     else: 
    1917         return AUC_ij(res, classIndex1, classIndex2, useWeights) 
     1511        return AUC_iterations(AUC_ij, split_by_iterations(res), (class_index1, class_index2, use_weights, res, res.number_of_iterations)) 
     1512    else: 
     1513        return AUC_ij(res, class_index1, class_index2, use_weights) 
    19181514   
    19191515 
    19201516# AUC for multiclass problems 
    1921 def AUC_matrix(res, useWeights = True): 
     1517@deprecated_keywords({"useWeights": "use_weights"}) 
     1518def AUC_matrix(res, use_weights = True): 
    19221519    """ Computes a (lower diagonal) matrix with AUCs for all pairs of classes. 
    19231520    If there are empty classes, the corresponding elements in the matrix 
     
    19441541    for classIndex1 in range(numberOfClasses): 
    19451542        for classIndex2 in range(classIndex1): 
    1946             pair_aucs = AUC_iterations(AUC_ij, iterations, (classIndex1, classIndex2, useWeights, all_ite, res.number_of_iterations)) 
     1543            pair_aucs = AUC_iterations(AUC_ij, iterations, (classIndex1, classIndex2, use_weights, all_ite, res.number_of_iterations)) 
    19471544            if pair_aucs: 
    19481545                for lrn in range(number_of_learners): 
     
    20801677 
    20811678 
    2082 def plot_learning_curve_learners(file, allResults, proportions, learners, noConfidence=0): 
    2083     plot_learning_curve(file, allResults, proportions, [Orange.misc.getobjectname(learners[i], "Learner %i" % i) for i in range(len(learners))], noConfidence) 
    2084      
    2085 def plot_learning_curve(file, allResults, proportions, legend, noConfidence=0): 
     1679@deprecated_keywords({"allResults": "all_results", 
     1680                      "noConfidence": "no_confidence"}) 
     1681def plot_learning_curve_learners(file, all_results, proportions, learners, no_confidence=0): 
     1682    plot_learning_curve(file, all_results, proportions, [Orange.misc.getobjectname(learners[i], "Learner %i" % i) for i in range(len(learners))], no_confidence) 
     1683 
     1684 
     1685@deprecated_keywords({"allResults": "all_results", 
     1686                      "noConfidence": "no_confidence"}) 
     1687def plot_learning_curve(file, all_results, proportions, legend, no_confidence=0): 
    20861688    import types 
    20871689    fopened=0 
    2088     if (type(file)==types.StringType): 
     1690    if type(file)==types.StringType: 
    20891691        file=open(file, "wt") 
    20901692        fopened=1 
     
    20931695    file.write("set xrange [%f:%f]\n" % (proportions[0], proportions[-1])) 
    20941696    file.write("set multiplot\n\n") 
    2095     CAs = [CA_dev(x) for x in allResults] 
     1697    CAs = [CA_dev(x) for x in all_results] 
    20961698 
    20971699    file.write("plot \\\n") 
    20981700    for i in range(len(legend)-1): 
    2099         if not noConfidence: 
     1701        if not no_confidence: 
    21001702            file.write("'-' title '' with yerrorbars pointtype %i,\\\n" % (i+1)) 
    21011703        file.write("'-' title '%s' with linespoints pointtype %i,\\\n" % (legend[i], i+1)) 
    2102     if not noConfidence: 
     1704    if not no_confidence: 
    21031705        file.write("'-' title '' with yerrorbars pointtype %i,\\\n" % (len(legend))) 
    21041706    file.write("'-' title '%s' with linespoints pointtype %i\n" % (legend[-1], len(legend))) 
    21051707 
    21061708    for i in range(len(legend)): 
    2107         if not noConfidence: 
     1709        if not no_confidence: 
    21081710            for p in range(len(proportions)): 
    21091711                file.write("%f\t%f\t%f\n" % (proportions[p], CAs[p][i][0], 1.96*CAs[p][i][1])) 
     
    21621764 
    21631765 
    2164  
    2165 def plot_McNemar_curve_learners(file, allResults, proportions, learners, reference=-1): 
    2166     plot_McNemar_curve(file, allResults, proportions, [Orange.misc.getobjectname(learners[i], "Learner %i" % i) for i in range(len(learners))], reference) 
    2167  
    2168 def plot_McNemar_curve(file, allResults, proportions, legend, reference=-1): 
     1766@deprecated_keywords({"allResults": "all_results"}) 
     1767def plot_McNemar_curve_learners(file, all_results, proportions, learners, reference=-1): 
     1768    plot_McNemar_curve(file, all_results, proportions, [Orange.misc.getobjectname(learners[i], "Learner %i" % i) for i in range(len(learners))], reference) 
     1769 
     1770 
     1771@deprecated_keywords({"allResults": "all_results"}) 
     1772def plot_McNemar_curve(file, all_results, proportions, legend, reference=-1): 
    21691773    if reference<0: 
    21701774        reference=len(legend)-1 
     
    21881792    for i in tmap: 
    21891793        for p in range(len(proportions)): 
    2190             file.write("%f\t%f\n" % (proportions[p], McNemar_of_two(allResults[p], i, reference))) 
     1794            file.write("%f\t%f\n" % (proportions[p], McNemar_of_two(all_results[p], i, reference))) 
    21911795        file.write("e\n\n") 
    21921796 
     
    21971801default_line_types=("\\setsolid", "\\setdashpattern <4pt, 2pt>", "\\setdashpattern <8pt, 2pt>", "\\setdashes", "\\setdots") 
    21981802 
    2199 def learning_curve_learners_to_PiCTeX(file, allResults, proportions, **options): 
    2200     return apply(learning_curve_to_PiCTeX, (file, allResults, proportions), options) 
    2201      
    2202 def learning_curve_to_PiCTeX(file, allResults, proportions, **options): 
     1803@deprecated_keywords({"allResults": "all_results"}) 
     1804def learning_curve_learners_to_PiCTeX(file, all_results, proportions, **options): 
     1805    return apply(learning_curve_to_PiCTeX, (file, all_results, proportions), options) 
     1806 
     1807 
     1808@deprecated_keywords({"allResults": "all_results"}) 
     1809def learning_curve_to_PiCTeX(file, all_results, proportions, **options): 
    22031810    import types 
    22041811    fopened=0 
     
    22071814        fopened=1 
    22081815 
    2209     nexamples=len(allResults[0].results) 
    2210     CAs = [CA_dev(x) for x in allResults] 
     1816    nexamples=len(all_results[0].results) 
     1817    CAs = [CA_dev(x) for x in all_results] 
    22111818 
    22121819    graphsize=float(options.get("graphsize", 10.0)) #cm 
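
Throughout this changeset, camelCase keyword arguments are renamed to snake_case while staying backward compatible through the `deprecated_keywords` decorator from Orange.misc. A minimal sketch of how such a decorator can work (an illustration of the mechanism, not the actual Orange implementation)::

    import warnings
    from functools import wraps

    def deprecated_keywords(mapping):
        # map old keyword names to their snake_case replacements
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for old, new in mapping.items():
                    if old in kwargs:
                        warnings.warn("%r is deprecated; use %r instead" % (old, new),
                                      DeprecationWarning, stacklevel=2)
                        kwargs[new] = kwargs.pop(old)
                return func(*args, **kwargs)
            return wrapper
        return decorator

With this in place, a call such as AUC_binary(res, useWeights=False) keeps working: the old keyword is forwarded as use_weights=False and a DeprecationWarning is emitted.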
  • Orange/feature/__init__.py

    r9671 r9895  
    1010import imputation 
    1111 
     12from Orange.core import Variable as Descriptor 
     13from Orange.core import EnumVariable as Discrete 
     14from Orange.core import FloatVariable as Continuous 
     15from Orange.core import PythonVariable as Python 
     16from Orange.core import StringVariable as String 
     17 
     18from Orange.core import VarList as Descriptors 
     19 
     20from Orange.core import newmetaid as new_meta_id 
     21 
     22from Orange.core import Variable as V 
     23make = V.make 
     24retrieve = V.get_existing 
     25MakeStatus = V.MakeStatus 
     26del V 
     27 
    1228__docformat__ = 'restructuredtext' 
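
The aliases added above expose the core variable classes under Orange.feature. A short usage sketch (hedged; it assumes the Orange 2.5 namespace as defined in this hunk)::

    import Orange

    # a discrete descriptor, through the alias for EnumVariable
    gender = Orange.feature.Discrete("gender", values=["male", "female"])

    # a continuous descriptor, through the alias for FloatVariable
    age = Orange.feature.Continuous("age")

    # a fresh id for a meta attribute
    meta_id = Orange.feature.new_meta_id()

`make` and `retrieve` wrap Variable.make and Variable.get_existing, which reuse an existing descriptor of the same name and type when one is available; `MakeStatus` holds the constants describing how a descriptor was obtained.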
  • Orange/feature/discretization.py

    r9878 r9900  
    1515    Discretization, \ 
    1616    Preprocessor_discretize 
    17  
    18  
    1917 
    2018def entropyDiscretization_wrapper(data): 
  • Orange/misc/__init__.py

    r9698 r9891  
    33 
    44Module Orange.misc contains common functions and classes which are used in other modules. 
     5 
     6.. index: SymMatrix 
     7 
     8----------------------- 
     9SymMatrix 
     10----------------------- 
     11 
     12:obj:`SymMatrix` implements symmetric matrices of size fixed at  
     13construction time (and stored in :obj:`SymMatrix.dim`). 
     14 
     15.. class:: SymMatrix 
     16 
     17    .. attribute:: dim 
     18     
     19        Matrix dimension. 
     20             
     21    .. attribute:: matrix_type  
     22 
     23        Can be ``SymMatrix.Lower`` (0), ``SymMatrix.Upper`` (1),  
     24        ``SymMatrix.Symmetric`` (2, default), ``SymMatrix.Lower_Filled`` (3) or 
     25        ``SymMatrix.Upper_Filled`` (4).  
     26 
     27        If the matrix type is ``Lower`` or ``Upper``, indexing  
     28        above or below the diagonal, respectively, will fail.  
     29        With ``Lower_Filled`` and ``Upper_Filled``, 
     30        the elements above or below the diagonal, respectively, still  
     31        exist and are set to zero, but they cannot be modified. The  
     32        default matrix type is ``Symmetric``, but can be changed  
     33        at any time. 
     34 
     35        If matrix type is ``Upper``, it is printed as: 
     36 
     37        >>> m.matrix_type = m.Upper 
     38        >>> print m 
     39        (( 1.000,  2.000,  3.000,  4.000), 
     40         (         4.000,  6.000,  8.000), 
     41         (                 9.000, 12.000), 
     42         (                        16.000)) 
     43 
     44        Changing the type to ``Lower_Filled`` changes the printout to 
     45 
     46        >>> m.matrix_type = m.Lower_Filled 
     47        >>> print m 
     48        (( 1.000,  0.000,  0.000,  0.000), 
     49         ( 2.000,  4.000,  0.000,  0.000), 
     50         ( 3.000,  6.000,  9.000,  0.000), 
     51         ( 4.000,  8.000, 12.000, 16.000)) 
     52     
     53    .. method:: __init__(dim[, default_value]) 
     54 
     55        Construct a symmetric matrix of the given dimension. 
     56 
     57        :param dim: matrix dimension 
     58        :type dim: int 
     59 
     60        :param default_value: default value (0 by default) 
     61        :type default_value: double 
     62         
     63         
     64    .. method:: __init__(instances) 
     65 
     66        Construct a new symmetric matrix containing the given data instances.  
     67        These can be given as a Python list containing lists or tuples. 
     68 
     69        :param instances: data instances 
     70        :type instances: list of lists 
     71         
     72        The following example fills a matrix created above with 
     73        data in a list:: 
     74 
     75            import Orange 
     76            m = [[], 
     77                 [ 3], 
     78                 [ 2, 4], 
     79                 [17, 5, 4], 
     80                 [ 2, 8, 3, 8], 
     81                 [ 7, 5, 10, 11, 2], 
     82                 [ 8, 4, 1, 5, 11, 13], 
     83                 [ 4, 7, 12, 8, 10, 1, 5], 
     84                 [13, 9, 14, 15, 7, 8, 4, 6], 
     85                 [12, 10, 11, 15, 2, 5, 7, 3, 1]] 
     86                     
     87            matrix = Orange.misc.SymMatrix(m) 
     88 
     89        SymMatrix also stores diagonal elements. They are set 
     90        to zero if they are not specified. The missing elements 
     91        (shorter lists) are set to zero as well. If a list 
     92        spreads over the diagonal, the constructor checks 
     93        for asymmetries. For instance, the matrix 
     94 
     95        :: 
     96 
     97            m = [[], 
     98                 [ 3,  0, f], 
     99                 [ 2,  4]] 
     100     
     101        is only OK if f equals 4. Finally, no row can be longer  
     102        than the matrix dimension.   
     103 
     104    .. method:: get_values() 
     105     
     106        Return all matrix values in a Python list. 
     107 
     108    .. method:: get_KNN(i, k) 
     109     
     110        Return k columns with the lowest value in the i-th row.  
     111         
     112        :param i: i-th row 
     113        :type i: int 
     114         
     115        :param k: number of neighbors 
     116        :type k: int 
     117         
     118    .. method:: avg_linkage(clusters) 
     119     
     120        Return a symmetric matrix with average distances between given clusters.   
     121       
     122        :param clusters: list of clusters 
     123        :type clusters: list of lists 
     124         
     125    .. method:: invert(type) 
     126     
     127        Invert values in the symmetric matrix. 
     128         
     129        :param type: 0 (-X), 1 (1 - X), 2 (max - X), 3 (1 / X) 
     130        :type type: int 
     131 
     132    .. method:: normalize(type) 
     133     
     134        Normalize values in the symmetric matrix. 
     135         
     136        :param type: 0 (normalize to [0, 1] interval), 1 (Sigmoid) 
     137        :type type: int 
     138         
     139         
     140------------------- 
     141Indexing 
     142------------------- 
     143 
     144For symmetric matrices the order of indices is not important:  
     145if ``m`` is a SymMatrix, then ``m[2, 4]`` addresses the same element as ``m[4, 2]``. 
     146 
     147.. literalinclude:: code/symmatrix.py 
     148    :lines: 1-6 
     149 
     150Although only the lower left half of the matrix was set explicitly,  
     151the whole matrix is constructed. 
     152 
     153>>> print m 
     154(( 1.000,  2.000,  3.000,  4.000), 
     155 ( 2.000,  4.000,  6.000,  8.000), 
     156 ( 3.000,  6.000,  9.000, 12.000), 
     157 ( 4.000,  8.000, 12.000, 16.000)) 
     158  
     159Entire rows are indexed with a single index. They can be iterated 
     160over in a for loop or sliced (with, for example, ``m[:3]``): 
     161 
     162>>> print m[1] 
     163(3.0, 6.0, 9.0, 0.0) 
     164>>> m.matrix_type = m.Lower 
     165>>> for row in m: 
     166...     print row 
     167(1.0,) 
     168(2.0, 4.0) 
     169(3.0, 6.0, 9.0) 
     170(4.0, 8.0, 12.0, 16.0) 
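
The remaining methods can be exercised on the same matrix. A small sketch (hedged; it reuses ``m`` from the example above and first restores the symmetric view)::

    m.matrix_type = m.Symmetric

    # the two columns with the lowest values in row 2
    print m.get_KNN(2, 2)

    # replace each element x with max - x (invert type 2) ...
    m.invert(2)

    # ... and rescale all values to the [0, 1] interval (normalize type 0)
    m.normalize(0)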
    5171 
    6172.. index: Random number generator 
     
    84250 
    85251from Orange.core import RandomGenerator as Random 
     252from orange import SymMatrix 
    86253 
    87254# addons is intentionally not imported; if it were, add-ons' directories would 
  • docs/reference/rst/Orange.classification.rst

    r9820 r9887  
    8080   Orange.classification.svm 
    8181   Orange.classification.tree 
    82    Orange.classification.random    
  • docs/reference/rst/Orange.data.rst

    r9372 r9901  
    55.. toctree:: 
    66 
    7     Orange.data.variable 
    87    Orange.data.domain 
    98    Orange.data.value 
     
    1211    Orange.data.sample 
    1312    Orange.data.formats 
    14     Orange.data.symmatrix 
     13    Orange.data.discretization 
  • docs/reference/rst/Orange.evaluation.scoring.rst

    r9372 r9904  
    11.. automodule:: Orange.evaluation.scoring 
     2 
     3############################ 
     4Method scoring (``scoring``) 
     5############################ 
     6 
     7.. index: scoring 
     8 
     9This module contains various measures of quality for classification and 
     10regression. Most functions require an argument named :obj:`res`, an instance of 
     11:class:`Orange.evaluation.testing.ExperimentResults` as computed by 
     12functions from :mod:`Orange.evaluation.testing` and which contains 
     13predictions obtained through cross-validation, 
     14leave one-out, testing on training data or test set instances. 
     15 
     16============== 
     17Classification 
     18============== 
     19 
     20To prepare some data for examples on this page, we shall load the voting data 
     21set (problem of predicting the congressman's party (republican, democrat) 
     22based on a selection of votes) and evaluate naive Bayesian learner, 
     23classification trees and majority classifier using cross-validation. 
     24For examples requiring a multivalued class problem, we shall do the same 
     25with the vehicle data set (telling whether a vehicle described by the features 
     26extracted from a picture is a van, bus, or Opel or Saab car). 
     27 
     28A basic cross validation example is shown in the following part of 
     29(:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`): 
     30 
     31.. literalinclude:: code/statExample0.py 
     32 
     33If instances are weighted, weights are taken into account. This can be 
     34disabled by giving :obj:`unweighted=1` as a keyword argument. Another way of 
     35disabling weights is to clear the 
     36:class:`Orange.evaluation.testing.ExperimentResults`' flag weights. 
     37 
     38General Measures of Quality 
     39=========================== 
     40 
     41.. autofunction:: CA 
     42 
     43.. autofunction:: AP 
     44 
     45.. autofunction:: Brier_score 
     46 
     47.. autofunction:: IS 
     48 
     49So, let's compute all this in part of 
     50(:download:`statExamples.py <code/statExamples.py>`, uses :download:`voting.tab <code/voting.tab>` and :download:`vehicle.tab <code/vehicle.tab>`) and print it out: 
     51 
     52.. literalinclude:: code/statExample1.py 
     53   :lines: 13- 
     54 
     55The output should look like this:: 
     56 
     57    method  CA      AP      Brier    IS 
     58    bayes   0.903   0.902   0.175    0.759 
     59    tree    0.846   0.845   0.286    0.641 
     60    majorty  0.614   0.526   0.474   -0.000 
     61 
     62Script :download:`statExamples.py <code/statExamples.py>` contains another example that also prints out 
     63the standard errors. 
     64 
     65Confusion Matrix 
     66================ 
     67 
     68.. autofunction:: confusion_matrices 
     69 
     70   **A positive-negative confusion matrix** is computed (a) if the class is 
     71   binary unless the :obj:`classIndex` argument is -2, (b) if the class is 
     72   multivalued and the :obj:`classIndex` is non-negative. Argument 
     73   :obj:`classIndex` then tells which class is positive. In case (a), 
     74   :obj:`classIndex` may be omitted; the first class 
     75   is then negative and the second is positive, unless the :obj:`baseClass` 
     76   attribute in the object with results has non-negative value. In that case, 
     77   :obj:`baseClass` is an index of the target class. :obj:`baseClass` 
     78   attribute of the results object should be set manually. The result of the 
     79   function is a list of instances of class :class:`ConfusionMatrix`, 
     80   containing the (weighted) number of true positives (TP), false 
     81   negatives (FN), false positives (FP) and true negatives (TN). 
     82 
     83   We can also add the keyword argument :obj:`cutoff` 
     84   (e.g. confusion_matrices(results, cutoff=0.3)); if we do, :obj:`confusion_matrices` 
     85   will disregard the classifiers' class predictions and observe the predicted 
     86   probabilities, and consider the prediction "positive" if the predicted 
     87   probability of the positive class is higher than the :obj:`cutoff`. 
     88 
     89   The example (part of :download:`statExamples.py <code/statExamples.py>`) below shows how setting the 
     90   cutoff threshold from the default 0.5 to 0.2 affects the confusion matrix 
     91   for the naive Bayesian classifier:: 
     92 
     93       cm = Orange.evaluation.scoring.confusion_matrices(res)[0] 
     94       print "Confusion matrix for naive Bayes:" 
     95       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
     96 
     97       cm = Orange.evaluation.scoring.confusion_matrices(res, cutoff=0.2)[0] 
     98       print "Confusion matrix for naive Bayes:" 
     99       print "TP: %i, FP: %i, FN: %s, TN: %i" % (cm.TP, cm.FP, cm.FN, cm.TN) 
     100 
     101   The output:: 
     102 
     103       Confusion matrix for naive Bayes: 
     104       TP: 238, FP: 13, FN: 29.0, TN: 155 
     105       Confusion matrix for naive Bayes: 
     106       TP: 239, FP: 18, FN: 28.0, TN: 150 
     107 
     108   shows that the number of true positives increases (and hence the number of 
     109   false negatives decreases) by only a single instance, while five instances 
     110   that were originally true negatives become false positives due to the 
     111   lower threshold. 
     112 
     113   To observe how good the classifiers are at detecting vans in the vehicle 
     114   data set, we would compute the matrix like this:: 
     115 
     116      cm = Orange.evaluation.scoring.confusion_matrices(resVeh, vehicle.domain.classVar.values.index("van")) 
     117 
     118   and get the results like these:: 
     119 
     120       TP: 189, FP: 241, FN: 10.0, TN: 406 
     121 
     122   while the same for class "opel" would give:: 
     123 
     124       TP: 86, FP: 112, FN: 126.0, TN: 522 
     125 
     126   The main difference is that there are only a few false negatives for the 
     127   van, meaning that the classifier seldom misses it (if it says it's not a 
     128   van, it's almost certainly not a van). Not so for the Opel car, where the 
     129   classifier missed 126 of them and correctly detected only 86. 
     130 
     131   **General confusion matrix** is computed (a) in case of a binary class, 
     132   when :obj:`classIndex` is set to -2, (b) when we have a multivalued class and 
     133   the caller doesn't specify the :obj:`classIndex` of the positive class. 
     134   When called in this manner, the function cannot use the argument 
     135   :obj:`cutoff`. 
     136 
     137   The function then returns a three-dimensional matrix, where the element 
     138   A[:obj:`learner`][:obj:`actual_class`][:obj:`predictedClass`] 
     139   gives the number of instances belonging to 'actual_class' for which the 
     140   'learner' predicted 'predictedClass'. We shall compute and print out 
     141   the matrix for naive Bayesian classifier. 
     142 
     143   Here we see another example from :download:`statExamples.py <code/statExamples.py>`:: 
     144 
     145       cm = Orange.evaluation.scoring.confusion_matrices(resVeh)[0] 
     146       classes = vehicle.domain.classVar.values 
     147       print "\t"+"\t".join(classes) 
     148       for className, classConfusions in zip(classes, cm): 
     149           print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions)) 
     150 
     151   So, here's what this nice piece of code gives:: 
     152 
     153              bus   van  saab opel 
     154       bus     56   95   21   46 
     155       van     6    189  4    0 
     156       saab    3    75   73   66 
     157       opel    4    71   51   86 
     158 
     159   Vans are clearly simple: 189 vans were classified as vans (we know this 
     160   already, we've printed it out above), and the 10 misclassified pictures 
     161   were classified as buses (6) and Saab cars (4). In all other classes, 
     162   there were more instances misclassified as vans than correctly classified 
     163   instances. The classifier is obviously quite biased to vans. 
     164 
     165   .. method:: sens(confm) 
     166   .. method:: spec(confm) 
     167   .. method:: PPV(confm) 
     168   .. method:: NPV(confm) 
     169   .. method:: precision(confm) 
     170   .. method:: recall(confm) 
     171   .. method:: F2(confm) 
     172   .. method:: Falpha(confm, alpha=2.0) 
     173   .. method:: MCC(conf) 
     174 
     175   With the confusion matrix defined in terms of positive and negative 
     176   classes, you can also compute the 
     177   `sensitivity <http://en.wikipedia.org/wiki/Sensitivity_(tests)>`_ 
     178   [TP/(TP+FN)], `specificity <http://en.wikipedia.org/wiki/Specificity_%28tests%29>`_ 
     179   [TN/(TN+FP)], `positive predictive value <http://en.wikipedia.org/wiki/Positive_predictive_value>`_ 
     180   [TP/(TP+FP)] and `negative predictive value <http://en.wikipedia.org/wiki/Negative_predictive_value>`_ [TN/(TN+FN)]. 
     181   In information retrieval, positive predictive value is called precision 
     182   (the ratio of the number of relevant records retrieved to the total number 
     183   of irrelevant and relevant records retrieved), and sensitivity is called 
     184   `recall <http://en.wikipedia.org/wiki/Information_retrieval>`_ 
     185   (the ratio of the number of relevant records retrieved to the total number 
     186   of relevant records in the database). The 
     187   `harmonic mean <http://en.wikipedia.org/wiki/Harmonic_mean>`_ of precision 
     188   and recall is called an 
     189   `F-measure <http://en.wikipedia.org/wiki/F-measure>`_, which, depending 
     190   on the relative weighting of precision and recall, is implemented 
     191   as F1 [2*precision*recall/(precision+recall)] or, for the general case, 
     192   Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. 
     193   The `Matthews correlation coefficient <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_ 
     194   is in essence a correlation coefficient between 
     195   the observed and predicted binary classifications; it returns a value 
     196   between -1 and +1. A coefficient of +1 represents a perfect prediction, 
     197   0 an average random prediction and -1 an inverse prediction. 
     198 
     199   If the argument :obj:`confm` is a single confusion matrix, a single 
     200   result (a number) is returned. If confm is a list of confusion matrices, 
     201   a list of scores is returned, one for each confusion matrix. 
     202 
     203   Note that weights are taken into account when computing the matrix, so 
     204   these functions don't check the 'weighted' keyword argument. 
     205 
     206   Let us print out sensitivities and specificities of our classifiers in 
     207   part of :download:`statExamples.py <code/statExamples.py>`:: 
     208 
     209       cm = Orange.evaluation.scoring.confusion_matrices(res) 
     210       print 
     211       print "method\tsens\tspec" 
     212       for l in range(len(learners)): 
     213           print "%s\t%5.3f\t%5.3f" % (learners[l].name, Orange.evaluation.scoring.sens(cm[l]), Orange.evaluation.scoring.spec(cm[l])) 
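
   The other measures are obtained in the same way; for instance, the
   F-measure and the Matthews coefficient (a sketch; by the formula above,
   Falpha with alpha=1.0 corresponds to F1)::

       cm = Orange.evaluation.scoring.confusion_matrices(res)
       print "method\tF1\tMCC"
       for l in range(len(learners)):
           print "%s\t%5.3f\t%5.3f" % (learners[l].name,
               Orange.evaluation.scoring.Falpha(cm[l], alpha=1.0),
               Orange.evaluation.scoring.MCC(cm[l]))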
     214 
     215ROC Analysis 
     216============ 
     217 
     218`Receiver Operating Characteristic \ 
     219<http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_ 
     220(ROC) analysis was initially developed for 
     221binary-class problems and there is no consensus on how to apply it in 
     222multi-class problems, nor do we know for sure how to do ROC analysis after 
     223cross validation and similar multiple sampling techniques. If you are 
     224interested in the area under the curve, function AUC will deal with those 
     225problems as specifically described below. 
     226 
     227.. autofunction:: AUC 
     228 
     229   .. attribute:: AUC.ByWeightedPairs (or 0) 
     230 
     231      Computes AUC for each pair of classes (ignoring instances of all other 
     232      classes) and averages the results, weighting them by the number of 
     233      pairs of instances from these two classes (e.g. by the product of 
     234      probabilities of the two classes). AUC computed in this way still 
     235      behaves as a concordance index, i.e., it gives the probability that two 
     236      randomly chosen instances from different classes will be correctly 
     237      recognized (this is of course true only if the classifier knows 
     238      from which two classes the instances came). 
     239 
     240   .. attribute:: AUC.ByPairs (or 1) 
     241 
     242      Similar to the above, except that the average over class pairs is not 
     243      weighted. This AUC is, like the binary, independent of class 
     244      distributions, but it is not related to concordance index any more. 
     245 
     246   .. attribute:: AUC.WeightedOneAgainstAll (or 2) 
     247 
     248      For each class, it computes AUC for this class against all others (that 
     249      is, treating other classes as one class). The AUCs are then averaged by 
     250      the class probabilities. This is related to concordance index in which 
     251      we test the classifier's (average) capability for distinguishing the 
     252      instances from a specified class from those that come from other classes. 
     253      Unlike the binary AUC, the measure is not independent of class 
     254      distributions. 
     255 
     256   .. attribute:: AUC.OneAgainstAll (or 3) 
     257 
     258      As above, except that the average is not weighted. 
     259 
     260   In case of multiple folds (for instance if the data comes from cross 
     261   validation), the computation goes like this. When computing the partial 
     262   AUCs for individual pairs of classes or singled-out classes, AUC is 
     263   computed for each fold separately and then averaged (ignoring the number 
     264   of instances in each fold, it's just a simple average). However, if a 
     265   certain fold doesn't contain any instances of a certain class (from the 
     266   pair), the partial AUC is computed treating the results as if they came 
     267   from a single fold. This is not really correct, since the class 
     268   probabilities from different folds are not necessarily comparable; 
     269   yet, as this will most often occur in leave-one-out experiments, 
     270   comparability shouldn't be a problem. 
     271 
     272   Computing and printing out the AUC's looks just like printing out 
     273   classification accuracies (except that we call AUC instead of 
     274   CA, of course):: 
     275 
     276       AUCs = Orange.evaluation.scoring.AUC(res) 
     277       for l in range(len(learners)): 
     278           print "%10s: %5.3f" % (learners[l].name, AUCs[l]) 
     279 
     280   For vehicle, you can run exactly this same code; it will compute AUCs 
     281   for all pairs of classes and return the average weighted by probabilities 
     282   of pairs. Or, you can specify the averaging method yourself, like this:: 
     283 
     284       AUCs = Orange.evaluation.scoring.AUC(resVeh, Orange.evaluation.scoring.AUC.WeightedOneAgainstAll) 
     285 
     286   The following snippet tries out all four. (We don't claim that this is 
     287   how the function needs to be used; it's better to stay with the default.):: 
     288 
     289       methods = ["by pairs, weighted", "by pairs", "one vs. all, weighted", "one vs. all"] 
     290       print " " *25 + "  \tbayes\ttree\tmajority" 
     291       for i in range(4): 
     292           AUCs = Orange.evaluation.scoring.AUC(resVeh, i) 
     293           print "%25s: \t%5.3f\t%5.3f\t%5.3f" % ((methods[i], ) + tuple(AUCs)) 
     294 
     295   As you can see from the output:: 
     296 
     297                                   bayes   tree    majority 
     298              by pairs, weighted:  0.789   0.871   0.500 
     299                        by pairs:  0.791   0.872   0.500 
     300           one vs. all, weighted:  0.783   0.800   0.500 
     301                     one vs. all:  0.783   0.800   0.500 
     302 
     303.. autofunction:: AUC_single 
     304 
     305.. autofunction:: AUC_pair 
     306 
     307.. autofunction:: AUC_matrix 
     308 
     309The remaining functions, which plot the curves and statistically compare 
     310them, require that the results come from a test with a single iteration, 
     311and they always compare one chosen class against all others. If you have 
     312cross validation results, you can either use split_by_iterations to split the 
     313results by folds, call the function for each fold separately and then sum 
     314the results up however you see fit, or you can set the ExperimentResults' 
     315attribute number_of_iterations to 1 to cheat the function - at your own 
     316risk regarding the statistical correctness. Regarding multi-class 
     317problems, if you don't choose a specific class, Orange.evaluation.scoring will use the class 
     318attribute's baseValue at the time when results were computed. If baseValue 
     319was not given at that time, 1 (that is, the second class) is used as default. 
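
For illustration, per-fold scoring of cross-validation results could look like this (a sketch, assuming ``res`` holds cross-validation results as prepared earlier)::

    folds = Orange.evaluation.scoring.split_by_iterations(res)
    for fold in folds:
        # each fold now looks like a single-iteration experiment
        print Orange.evaluation.scoring.AUCWilcoxon(fold)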
     320 
     321We shall use the following code to prepare suitable experimental results:: 
     322 
     323    ri2 = Orange.core.MakeRandomIndices2(voting, 0.6) 
     324    train = voting.selectref(ri2, 0) 
     325    test = voting.selectref(ri2, 1) 
     326    res1 = Orange.evaluation.testing.learnAndTestOnTestData(learners, train, test) 
     327 
     328 
     329.. autofunction:: AUCWilcoxon 
     330 
     331.. autofunction:: compute_ROC 
     332 
     333Comparison of Algorithms 
     334------------------------ 
     335 
     336.. autofunction:: McNemar 
     337 
     338.. autofunction:: McNemar_of_two 
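
A usage sketch, comparing the first two learners on the single-iteration results prepared above::

    # McNemar's test statistic for learners 0 and 1
    print Orange.evaluation.scoring.McNemar_of_two(res1, 0, 1)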
     339 
     340========== 
     341Regression 
     342========== 
     343 
     344General Measure of Quality 
     345========================== 
     346 
     347Several alternative measures, as given below, can be used to evaluate 
     348the success of numeric prediction: 
     349 
     350.. image:: files/statRegression.png 
     351 
     352.. autofunction:: MSE 
     353 
     354.. autofunction:: RMSE 
     355 
     356.. autofunction:: MAE 
     357 
     358.. autofunction:: RSE 
     359 
     360.. autofunction:: RRSE 
     361 
     362.. autofunction:: RAE 
     363 
     364.. autofunction:: R2 
     365 
     366The following code (:download:`statExamples.py <code/statExamples.py>`) uses most of the above measures to 
     367score several regression methods. 
     368 
     369.. literalinclude:: code/statExamplesRegression.py 
     370 
     371The code above produces the following output:: 
     372 
     373    Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2 
     374    maj       84.585  9.197   6.653   1.002   1.001   1.001  -0.002 
     375    rt        40.015  6.326   4.592   0.474   0.688   0.691   0.526 
     376    knn       21.248  4.610   2.870   0.252   0.502   0.432   0.748 
     377    lr        24.092  4.908   3.425   0.285   0.534   0.515   0.715 
     378 
     379================== 
     380Plotting functions 
     381================== 
     382 
     383.. autofunction:: graph_ranks 
     384 
     385The following script (:download:`statExamplesGraphRanks.py <code/statExamplesGraphRanks.py>`) shows how to plot a graph: 
     386 
     387.. literalinclude:: code/statExamplesGraphRanks.py 
     388 
     389The code produces the following graph: 
     390 
     391.. image:: files/statExamplesGraphRanks1.png 
     392 
     393.. autofunction:: compute_CD 
     394 
     395.. autofunction:: compute_friedman 
     396 
     397================= 
     398Utility Functions 
     399================= 
     400 
     401.. autofunction:: split_by_iterations 
     402 
     403===================================== 
     404Scoring for multilabel classification 
     405===================================== 
     406 
     407Multi-label classification requires different metrics than those used in traditional single-label 
     408classification. This module presents the various metrics that have been proposed in the literature. 
     409Let :math:`D` be a multi-label evaluation data set, consisting of :math:`|D|` multi-label examples 
     410:math:`(x_i,Y_i)`, :math:`i=1..|D|`, :math:`Y_i \subseteq L`. Let :math:`H` be a multi-label classifier 
     411and :math:`Z_i=H(x_i)` be the set of labels predicted by :math:`H` for example :math:`x_i`. 
     412 
     413.. autofunction:: mlc_hamming_loss 
     414.. autofunction:: mlc_accuracy 
     415.. autofunction:: mlc_precision 
     416.. autofunction:: mlc_recall 
     417 
     418So, let's compute all this and print it out (part of 
     419:download:`mlc-evaluate.py <code/mlc-evaluate.py>`, uses 
     420:download:`emotions.tab <code/emotions.tab>`): 
     421 
     422.. literalinclude:: code/mlc-evaluate.py 
     423   :lines: 1-15 
     424 
     425The output should look like this:: 
     426 
     427    loss= [0.9375] 
     428    accuracy= [0.875] 
     429    precision= [1.0] 
     430    recall= [0.875] 
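
To make the definitions concrete, the Hamming loss can also be computed by
hand for label sets represented as Python sets (an illustration of the
formula, not the module's implementation)::

    def hamming_loss(Y, Z, n_labels):
        # average size of the symmetric difference |Y_i xor Z_i| / |L|
        total = sum(len(y.symmetric_difference(z)) for y, z in zip(Y, Z))
        return total / float(n_labels * len(Y))

    print hamming_loss([set([0, 1])], [set([1, 2])], n_labels=4)  # prints 0.5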
     431 
     432References 
     433========== 
     434 
     435Boutell, M.R., Luo, J., Shen, X. & Brown, C.M. (2004), 'Learning multi-label scene classification', 
     436Pattern Recogintion, vol.37, no.9, pp:1757-71 
     437 
     438Godbole, S. & Sarawagi, S. (2004), 'Discriminative Methods for Multi-labeled Classification', paper 
     439presented to Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining 
     440(PAKDD 2004) 
     441 
     442Schapire, R.E. & Singer, Y. (2000), 'BoosTexter: a boosting-based system for text categorization', 
     443Machine Learning, vol.39, no.2/3, pp:135-68. 
  • docs/reference/rst/Orange.feature.discretization.rst

    r9863 r9900  
    4949value according to the rule found by discretization. In this respect, the discretization behaves similarly to 
    5050:class:`Orange.classification.Learner`. 
    51  
    52 Utility functions 
    53 ================= 
    54  
    55 Some functions and classes that can be used for 
    56 categorization of continuous features. Besides several general classes that 
    57 can help in this task, we also provide a function that may help in 
    58 entropy-based discretization (Fayyad & Irani), and a wrapper around classes for 
    59 categorization that can be used for learning. 
    60  
    61 .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class 
    62  
    63 .. autoclass:: DiscretizeTable 
    64  
    65 .. rubric:: Example 
    66  
    67 FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in Orange 
    68 for Beginners tutorial shows the use of DiscretizedLearner. Other 
    69 discretization classes from core Orange are listed in chapter on 
    70 `categorization <../ofb/o_categorization.htm>`_ of the same tutorial. 
    7151 
    7252Discretization Algorithms 
  • docs/reference/rst/Orange.feature.imputation.rst

    r9853 r9905  
    2525 
    2626Imputers 
    27 ================= 
     27----------------- 
    2828 
    2929:obj:`ImputerConstructor` is the abstract root in a hierarchy of classes 
     
    5252    .. attribute::  defaults 
    5353 
    54     An instance :obj:`Orange.data.Instance` with the default values to be 
     54    An instance :obj:`~Orange.data.Instance` with the default values to be 
    5555    imputed instead of missing value. Examples to be imputed must be from the 
    5656    same :obj:`~Orange.data.Domain` as :obj:`defaults`. 
     
    7171pessimistic imputations. 
    7272 
    73 User-define defaults can be given when constructing a :obj:`~Orange.feature 
    74 .imputation.Imputer_defaults`. Values that are left unspecified do not get 
    75 imputed. In the following example "LENGTH" is the 
     73User-defined defaults can be given when constructing a 
     74:obj:`~Orange.feature.imputation.Imputer_defaults`. Values that are left 
     75unspecified do not get imputed. In the following example "LENGTH" is the 
    7676only attribute to get imputed with value 1234: 
    7777 
     
    164164:obj:`DefaultClassifier`. A float must be given, because integer values are 
    165165interpreted as indexes of discrete features. Discrete feature "T-OR-D" is 
    166 imputed using :class:`Orange.classification.ConstantClassifier` which is 
     166imputed using :class:`~Orange.classification.ConstantClassifier` which is 
    167167given the index of value "THROUGH" as an argument. 
    168168 
     
    277277 
    278278Using imputers 
    279 ============== 
    280  
    281 Imputation must run on training data only. Imputing the missing values 
    282 and subsequently using the data in cross-validation will give overly 
    283 optimistic results. 
    284  
    285 Learners with imputer as a component 
    286 ------------------------------------ 
    287  
    288 Learners that cannot handle missing values provide a slot for the imputer 
    289 component. An example of such a class is 
    290 :obj:`~Orange.classification.logreg.LogRegLearner` with an attribute called 
    291 :obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`. 
    292  
    293 When given learning instances, 
     279-------------- 
     280 
     281Imputation is also used by learning algorithms and other methods that are not 
     282capable of handling unknown values. 
     283 
     284Imputer as a component 
     285====================== 
     286 
     287Learners that cannot handle missing values should provide a slot 
     288for an imputer constructor. An example of such a class is 
     289:obj:`~Orange.classification.logreg.LogRegLearner` with attribute 
     290:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor`, 
     291which imputes the average value by default. When given learning instances, 
    294292:obj:`~Orange.classification.logreg.LogRegLearner` will pass them to 
    295293:obj:`~Orange.classification.logreg.LogRegLearner.imputer_constructor` to get 
    296 an imputer and used it to impute the missing values in the learning data. 
    297 Imputed data is then used by the actual learning algorithm. Also, when a 
    298 classifier :obj:`Orange.classification.logreg.LogRegClassifier` is constructed, 
    299 the imputer is stored in its attribute 
    300 :obj:`Orange.classification.logreg.LogRegClassifier.imputer`. At 
    301 classification, the same imputer is used for imputation of missing values 
     294an imputer and use it to impute the missing values in the learning data. 
     295Imputed data is then used by the actual learning algorithm. When a 
     296classifier :obj:`~Orange.classification.logreg.LogRegClassifier` is 
     297constructed, the imputer is stored in its attribute 
     298:obj:`~Orange.classification.logreg.LogRegClassifier.imputer`. During 
     299classification the same imputer is used for imputation of missing values 
    302300in (testing) examples. 
    303301 
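
In practice this only requires setting the component (a sketch; it assumes
the imputer constructors described elsewhere in this document)::

    import Orange

    lr = Orange.classification.logreg.LogRegLearner()
    lr.imputer_constructor = Orange.feature.imputation.ImputerConstructor_average()
    # lr now imputes training-data averages before fitting, and the
    # resulting classifier imputes the same way at prediction time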
     
    306304it is recommended to use imputation according to the described procedure. 
    307305 
    308 Wrapper for learning algorithms 
    309 =============================== 
    310  
    311 Imputation is also used by learning algorithms and other methods that are not 
    312 capable of handling unknown values. It imputes missing values, 
    313 calls the learner and, if imputation is also needed by the classifier, 
    314 it wraps the classifier that imputes missing values in instances to classify. 
     306The choice of the imputer depends on the problem domain. In this example the 
     307minimal value of each feature is imputed: 
    315308 
    316309.. literalinclude:: code/imputation-logreg.py 
     
    322315    With imputation: 0.954 
    323316 
    324 Even so, the module is somewhat redundant, as all learners that cannot handle 
    325 missing values should, in principle, provide the slots for imputer constructor. 
    326 For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an 
    327 attribute 
    328 :obj:`Orange.classification.logreg.LogRegLearner.imputer_constructor`, 
    329 and even if you don't set it, it will do some imputation by default. 
     317.. note:: 
     318 
     319   Just one instance of 
     320   :obj:`~Orange.classification.logreg.LogRegLearner` is constructed and then 
     321   used twice in each fold. Once it is given the original instances as they 
     322   are and returns an instance of 
     323   :obj:`~Orange.classification.logreg.LogRegClassifier`. The second time it is 
     324   called by :obj:`imra` and the 
     325   :obj:`~Orange.classification.logreg.LogRegClassifier` it returns gets wrapped 
     326   into :obj:`~Orange.feature.imputation.Classifier`. There is only one 
     327   learner, which produces two different classifiers in each round of 
     328   testing. 
     329 
     330Wrappers for learning 
     331===================== 
     332 
     333In a learning/classification process, imputation is needed on two occasions. 
     334Before learning, the imputer needs to process the training instances. 
     335Afterwards, the imputer is called for each instance to be classified. For 
     336example, in cross validation, imputation should be done on training folds 
     337only. Imputing the missing values on all data and subsequently performing 
     338cross-validation will give overly optimistic results. 
     339 
     340Most of Orange's learning algorithms do not use imputers because they can 
     341appropriately handle the missing values. Bayesian classifier, for instance, 
     342simply skips the corresponding attributes in the formula, while 
     343classification/regression trees have components for handling the missing 
     344values in various ways. A wrapper is provided for learning algorithms that 
     345require imputed data. 
    330346 
    331347.. class:: ImputeLearner 
    332348 
    333     Wraps a learner and performs data discretization before learning. 
    334  
    335     Most of Orange's learning algorithms do not use imputers because they can 
    336     appropriately handle the missing values. Bayesian classifier, for instance, 
    337     simply skips the corresponding attributes in the formula, while 
    338     classification/regression trees have components for handling the missing 
    339     values in various ways. 
    340  
    341     If for any reason you want to use these algorithms to run on imputed data, 
    342     you can use this wrapper. The class description is a matter of a separate 
    343     page, but we shall show its code here as another demonstration of how to 
    344     use the imputers - logistic regression is implemented essentially the same 
    345     as the below classes. 
     349    Wraps a learner and performs data imputation before learning. 
    346350 
    347351    This is basically a learner, so the constructor will return either an 
     352    instance of :obj:`ImputeLearner` or, if called with examples, an instance 
    349     of some classifier. There are a few attributes that need to be set, though. 
     353    of some classifier. 
    350354 
    351355    .. attribute:: base_learner 
     
    355359    .. attribute:: imputer_constructor 
    356360 
    357     An instance of a class derived from :obj:`ImputerConstructor` (or a class 
    358     with the same call operator). 
     361    An instance of a class derived from :obj:`ImputerConstructor` or a class 
     362    with the same call operator. 
    359363 
    360364    .. attribute:: dont_impute_classifier 
    361365 
    362     If given and set (this attribute is optional), the classifier will not be 
    363     wrapped into an imputer. Do this if the classifier doesn't mind if the 
    364     examples it is given have missing values. 
     366    If set and a table is given, the classifier will not be 
     367    wrapped into an imputer. This can be done if the classifier can handle 
     368    missing values. 
    365369 
    366370    The learner is best illustrated by its code - here's its complete 
     
    376380                return ImputeClassifier(base_classifier, trained_imputer) 
    377381 
    378     So "learning" goes like this. :obj:`ImputeLearner` will first construct 
    379     the imputer (that is, call :obj:`self.imputer_constructor` to get a (trained) 
    380     imputer. Than it will use the imputer to impute the data, and call the 
     382    During learning, :obj:`ImputeLearner` will first construct 
     383    the imputer. It will then impute the data and call the 
    381384    given :obj:`baseLearner` to construct a classifier. For instance, 
    382385    :obj:`baseLearner` could be a learner for logistic regression and the 
    383386    result would be a logistic regression model. If the classifier can handle 
    384     unknown values (that is, if :obj:`dont_impute_classifier`, we return it as 
    385     it is, otherwise we wrap it into :obj:`ImputeClassifier`, which is given 
    386     the base classifier and the imputer which it can use to impute the missing 
    387     values in (testing) examples. 
     387    unknown values (that is, if :obj:`dont_impute_classifier` is set), 
     388    it is returned as is; otherwise it is wrapped into 
     389    :obj:`ImputeClassifier`, which holds the base classifier and 
     390    the imputer used to impute the missing values in (testing) data. 
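
    For example (a sketch, assuming the voting data set used elsewhere in
    this documentation)::

        import Orange

        voting = Orange.data.Table("voting")
        imputing_nb = Orange.feature.imputation.ImputeLearner(
            base_learner=Orange.classification.bayes.NaiveLearner(),
            imputer_constructor=Orange.feature.imputation.ImputerConstructor_minimal())
        res = Orange.evaluation.testing.cross_validation([imputing_nb], voting)
        print Orange.evaluation.scoring.CA(res)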
    388391 
    389392.. class:: ImputeClassifier 
     
    401404    .. method:: __call__ 
    402405 
    403     This class is even more trivial than the learner. Its constructor accepts 
    404     two arguments, the classifier and the imputer, which are stored into the 
    405     corresponding attributes. The call operator which does the classification 
    406     then looks like this:: 
     406    This class's constructor accepts and stores two arguments, 
     407    the classifier and the imputer. The call operator for classification 
     408    looks like this:: 
    407409 
    408410        def __call__(self, ex, what=orange.GetValue): 
     
    413415 
    414416.. note:: 
    415    In this setup the imputer is trained on the training data - even if you do 
     417   In this setup the imputer is trained on the training data. Even during 
    416418   cross validation, the imputer will be trained on the right data. In the 
    417    classification phase we again use the imputer which was classified on the 
    418    training data only. 
     419   classification phase, the imputer will be used to impute testing data. 
    419420 
    420421.. rubric:: Code of ImputeLearner and ImputeClassifier 
    421422 
    422 :obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into 
    423 the instance's  dictionary. You are expected to call it like 
    424 :obj:`ImputeLearner(base_learner=<someLearner>, 
    425 imputer=<someImputerConstructor>)`. When the learner is called with 
    426 examples, it 
    427 trains the imputer, imputes the data, induces a :obj:`base_classifier` by the 
    428 :obj:`base_cearner` and constructs :obj:`ImputeClassifier` that stores the 
     423The learner is constructed as 
     424:obj:`Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer=<someImputerConstructor>)`. 
     425When given data, it trains the imputer, imputes the data, 
     426induces a :obj:`base_classifier` with the 
     427:obj:`base_learner`, and constructs an :obj:`ImputeClassifier` that stores the 
    429428:obj:`base_classifier` and the :obj:`imputer`. For classification, the missing 
    430429values are imputed and the classifier's prediction is returned. 
    431430 
    432 Note that this code is slightly simplified, although the omitted details handle 
     431This is slightly simplified code; the omitted details handle 
    433432non-essential technical issues that are unrelated to imputation:: 
    434433 
     
    456455            return self.base_classifier(self.imputer(ex), what) 
    457456 
    458 .. rubric:: Example 
    459  
    460 Although most Orange's learning algorithms will take care of imputation 
    461 internally, if needed, it can sometime happen that an expert will be able to 
    462 tell you exactly what to put in the data instead of the missing values. In this 
    463 example we shall suppose that we want to impute the minimal value of each 
    464 feature. We will try to determine whether the naive Bayesian classifier with 
    465 its  implicit internal imputation works better than one that uses imputation by 
    466 minimal values. 
    467  
    468 :download:`imputation-minimal-imputer.py <code/imputation-minimal-imputer.py>` (uses :download:`voting.tab <code/voting.tab>`): 
    469  
    470 .. literalinclude:: code/imputation-minimal-imputer.py 
    471     :lines: 7- 
    472  
    473 Should ouput this:: 
    474  
    475     Without imputation: 0.903 
    476     With imputation: 0.899 
    477  
    478 .. note:: 
    479    Note that we constructed just one instance of \ 
    480    :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is 
    481    used twice in each fold, once it is given the examples as they are (and 
    482    returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`. 
    483    The second time it is called by :obj:`imba` and the \ 
    484    :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped 
    485    into :obj:`Orange.feature.imputation.Classifier`. We thus have only one 
    486    learner, but which produces two different classifiers in each round of 
    487    testing. 
    488  
    489457Write your own imputer 
    490 ====================== 
    491  
    492 Imputation classes provide the Python-callback functionality (not all Orange 
    493 classes do so, refer to the documentation on `subtyping the Orange classes 
    494 in Python <callbacks.htm>`_ for a list). If you want to write your own 
    495 imputation constructor or an imputer, you need to simply program a Python 
    496 function that will behave like the built-in Orange classes (and even less, 
    497 for imputer, you only need to write a function that gets an example as 
    498 argument, imputation for example tables will then use that function). 
    499  
    500 You will most often write the imputation constructor when you have a special 
    501 imputation procedure or separate procedures for various attributes, as we've 
    502 demonstrated in the description of 
    503 :obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only 
    504 need to pack everything we've written there to an imputer constructor that 
    505 will accept a data set and the id of the weight meta-attribute (ignore it if 
    506 you will, but you must accept two arguments), and return the imputer (probably 
    507 :obj:`Orange.feature.imputation.Imputer_model`. The benefit of implementing an 
    508 imputer constructor as opposed to what we did above is that you can use such a 
    509 constructor as a component for Orange learners (like logistic regression) or 
    510 for wrappers from module orngImpute, and that way properly use the in 
    511 classifier testing procedures. 
     458---------------------- 
     459 
     460Imputation classes provide the Python-callback functionality. The simplest 
     461way to write custom imputation constructors or imputers is to write a Python 
     462function that behaves like the built-in Orange classes. For imputers it is 
     463enough to write a function that gets a data instance as its argument; imputation 
     464for data tables will then use that function. 
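
For instance, an imputer that replaces every missing value of a continuous
feature with zero can be a plain function (a minimal sketch; it assumes
Orange 2.5 names such as ``Orange.data.Instance`` and ``Value.is_special``)::

    def impute_zero(instance):
        # Work on a copy so the original instance stays untouched.
        instance = Orange.data.Instance(instance)
        for i, value in enumerate(instance):
            if value.is_special():
                instance[i] = 0.0  # only sensible for continuous features
        return instance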
     465 
     466Special imputation procedures or separate procedures for various attributes, 
     467as demonstrated in the description of 
     468:obj:`~Orange.feature.imputation.ImputerConstructor_model`, 
     469are achieved by encoding them in a constructor that accepts a data table and 
     470the id of the weight meta-attribute, and returns the imputer. The benefit of 
     471implementing an imputer constructor is that you can use it as a component 
     472for learners (for example, in logistic regression) or wrappers, and that way 
     473properly use the classifier in testing procedures. 
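
A matching constructor is then a function that computes whatever statistics
it needs on the training data and returns such an imputer. A sketch that
imputes per-feature minima, under the assumption that
``Orange.statistics.basic.Domain`` yields per-feature statistics with a
``min`` attribute (and ``None`` for discrete features)::

    def minimal_imputer_constructor(data, weight_id=0):
        # Gather basic statistics on the training data only.
        stats = Orange.statistics.basic.Domain(data)
        def imputer(instance):
            instance = Orange.data.Instance(instance)
            for i, stat in enumerate(stats):
                if stat and instance[i].is_special():
                    instance[i] = stat.min  # impute the feature's minimum
            return instance
        return imputer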
     474 
     475 
     476 
     477.. 
     478    This was commented out: 
     479    Examples 
     480    -------- 
     481 
     482    Missing values sometimes have a special meaning, so they need to be replaced 
     483    by a designated value. Sometimes we know what to replace the missing value 
     484    with; for instance, in a medical problem, some laboratory tests might not be 
     485    done when it is known what their results would be. In that case, we impute 
     486    a certain fixed value instead of the missing one. In the most complex case, 
     487    we assign values that are computed based on some model; we can, for instance, 
     488    impute the average or majority value, or even a value computed from the 
     489    values of other, known features, using a classifier. 
     490 
     491    In general, the imputer itself needs to be trained. This is, of course, not 
     492    needed when the imputer imputes a certain fixed value. However, when it 
     493    imputes the average or majority value, it needs to compute the statistics on 
     494    the training examples and use them afterwards for imputation of training and 
     495    testing examples. 
     496 
     497    While reading this document, bear in mind that imputation is a part of the 
     498    learning process. If we fit the imputation model, for instance, by learning 
     499    how to predict the feature's value from other features, or even if we 
     500    simply compute the average or the minimal value for the feature and use it 
     501    in imputation, this should only be done on learning data. Orange 
     502    provides simple means for doing that. 
     503 
     504    This page will first explain how to construct various imputers. Examples of 
     505    `proper use of imputers <#using-imputers>`_ then follow. Finally, quite 
     506    often you will want to use imputation with special requests, such as certain 
     507    features' missing values getting replaced by constants and other by values 
     508    computed using models induced from specified other features. For instance, 
     509    in one of the studies we worked on, the patient's pulse rate needed to be 
     510    estimated using regression trees that included the scope of the patient's 
     511    injuries, sex and age; some attributes' values were replaced by the most 
     512    pessimistic ones and others were computed with regression trees based on 
     513    values of all features. If you are using learners that need the imputer as a 
     514    component, you will need to `write your own imputer constructor 
     515    <#write-your-own-imputer-constructor>`_. This is trivial and is explained at 
     516    the end of this page. 
  • docs/reference/rst/Orange.feature.rst

    r9372 r9896  
    88   :maxdepth: 2 
    99 
     10   Orange.feature.descriptor 
    1011   Orange.feature.scoring 
    1112   Orange.feature.selection 
  • docs/reference/rst/code/distances-test.py

    r9823 r9889  
    55 
    66# Euclidean distance constructor 
    7 d2Constr = Orange.distance.instances.EuclideanConstructor() 
     7d2Constr = Orange.distance.Euclidean() 
    88d2 = d2Constr(iris) 
    99 
    1010# Constructs  
    11 dPears = Orange.distance.instances.PearsonRConstructor(iris) 
     11dPears = Orange.distance.PearsonR(iris) 
    1212 
    1313#reference instance 
  • docs/reference/rst/code/majority-classification.py

    r9823 r9894  
    1515 
    1616res = Orange.evaluation.testing.cross_validation(learners, monks) 
    17 CAs = Orange.evaluation.scoring.CA(res, reportSE=True) 
     17CAs = Orange.evaluation.scoring.CA(res, report_se=True) 
    1818 
    1919print "Tree:    %5.3f+-%5.3f" % CAs[0] 
  • docs/reference/rst/code/mds-advanced.py

    r9823 r9891  
    1313# Construct a distance matrix using Euclidean distance 
    1414dist = Orange.core.ExamplesDistanceConstructor_Euclidean(iris) 
    15 matrix = Orange.core.SymMatrix(len(iris)) 
     15matrix = Orange.misc.SymMatrix(len(iris)) 
    1616for i in range(len(iris)): 
    1717   for j in range(i+1): 
  • docs/reference/rst/code/mds-euclid-torgerson-3d.py

    r9866 r9891  
    1212# Construct a distance matrix using Euclidean distance 
    1313dist = Orange.distance.Euclidean(iris) 
    14 matrix = Orange.core.SymMatrix(len(iris)) 
     14matrix = Orange.misc.SymMatrix(len(iris)) 
    1515matrix.setattr('items', iris) 
    1616for i in range(len(iris)): 
  • docs/reference/rst/code/mds-scatterplot.py

    r9838 r9891  
    1212# Construct a distance matrix using Euclidean distance 
    1313euclidean = Orange.distance.Euclidean(iris) 
    14 distance = Orange.core.SymMatrix(len(iris)) 
     14distance = Orange.misc.SymMatrix(len(iris)) 
    1515for i in range(len(iris)): 
    1616   for j in range(i + 1): 
  • docs/reference/rst/code/outlier2.py

    r9823 r9889  
    33bridges = Orange.data.Table("bridges") 
    44outlier_det = Orange.preprocess.outliers.OutlierDetection() 
    5 outlier_det.set_examples(bridges, Orange.distance.instances.EuclideanConstructor(bridges)) 
     5outlier_det.set_examples(bridges, Orange.distance.Euclidean(bridges)) 
    66outlier_det.set_knn(3) 
    77z_values = outlier_det.z_values() 
  • docs/reference/rst/code/svm-custom-kernel.py

    r9823 r9889  
    33 
    44from Orange.classification.svm import SVMLearner, kernels 
    5 from Orange.distance.instances import EuclideanConstructor 
    6 from Orange.distance.instances import HammingConstructor 
     5from Orange.distance import Euclidean 
     6from Orange.distance import Hamming 
    77 
    88iris = data.Table("iris.tab") 
    99l1 = SVMLearner() 
    10 l1.kernel_func = kernels.RBFKernelWrapper(EuclideanConstructor(iris), gamma=0.5) 
     10l1.kernel_func = kernels.RBFKernelWrapper(Euclidean(iris), gamma=0.5) 
    1111l1.kernel_type = SVMLearner.Custom 
    1212l1.probability = True 
     
    1515 
    1616l2 = SVMLearner() 
    17 l2.kernel_func = kernels.RBFKernelWrapper(HammingConstructor(iris), gamma=0.5) 
     17l2.kernel_func = kernels.RBFKernelWrapper(Hamming(iris), gamma=0.5) 
    1818l2.kernel_type = SVMLearner.Custom 
    1919l2.probability = True 
     
    2323l3 = SVMLearner() 
    2424l3.kernel_func = kernels.CompositeKernelWrapper( 
    25     kernels.RBFKernelWrapper(EuclideanConstructor(iris), gamma=0.5), 
    26     kernels.RBFKernelWrapper(HammingConstructor(iris), gamma=0.5), l=0.5) 
     25    kernels.RBFKernelWrapper(Euclidean(iris), gamma=0.5), 
     26    kernels.RBFKernelWrapper(Hamming(iris), gamma=0.5), l=0.5) 
    2727l3.kernel_type = SVMLearner.Custom 
    2828l3.probability = True 
  • docs/reference/rst/code/symmatrix.py

    r9823 r9891  
    11import Orange 
    22 
    3 m = Orange.data.SymMatrix(4) 
     3m = Orange.misc.SymMatrix(4) 
    44for i in range(4): 
    55    for j in range(i+1): 
  • docs/reference/rst/code/testing-test.py

    r9823 r9894  
    1212 
    1313def printResults(res): 
    14     CAs = Orange.evaluation.scoring.CA(res, reportSE=1) 
     14    CAs = Orange.evaluation.scoring.CA(res, report_se=1) 
    1515    for name, ca in zip(res.classifierNames, CAs): 
    1616        print "%s: %5.3f+-%5.3f" % (name, ca[0], 1.96 * ca[1]), 
  • docs/reference/rst/code/variable-get_value_from.py

    r9823 r9897  
    22# Category:    core 
    33# Uses:        monks-1 
    4 # Referenced:  Orange.data.variable 
    5 # Classes:     Orange.data.variable.Discrete 
     4# Referenced:  Orange.feature 
     5# Classes:     Orange.feature.Discrete 
    66 
    77import Orange 
     
    1414 
    1515monks = Orange.data.Table("monks-1") 
    16 e2 = Orange.data.variable.Discrete("e2", values=["not 1", "1"])     
     16e2 = Orange.feature.Discrete("e2", values=["not 1", "1"])     
    1717e2.get_value_from = checkE  
    1818 
    19 print Orange.core.MeasureAttribute_info(e2, monks) 
     19print Orange.feature.scoring.InfoGain(e2, monks) 
    2020 
    2121dist = Orange.core.Distribution(e2, monks) 
  • docs/reference/rst/index.rst

    r9729 r9897  
    77 
    88   Orange.data 
     9 
     10   Orange.feature 
    911 
    1012   Orange.associate 
     
    1921 
    2022   Orange.evaluation 
    21  
    22    Orange.feature 
    2323 
    2424   Orange.multilabel 
  • setup.py

    r9879 r9893  
    391391        install.run(self) 
    392392         
    393         # Create a .pth file wiht a path inside the Orange/orng directory 
     393        # Create a .pth file with a path inside the Orange/orng directory 
    394394        # so the old modules are importable 
    395395        self.path_file, self.extra_dirs = ("orange-orng-modules", "Orange/orng") 
  • source/orange/_aliases.txt

    r9907 r9908  
    7272TransformValue 
    7373sub_transformer subtransformer 
     74 
     75ImputerConstructor 
     76impute_class imputeClass 
  • source/orange/discretize.hpp

    r9863 r9899  
    199199  __REGISTER_CLASS 
    200200 
    201   int maxNumberOfIntervals; //P maximal number of intervals; default = 0 (no limits) 
     201  int maxNumberOfIntervals; //P(+n) maximal number of intervals; default = 0 (no limits) 
    202202  bool forceAttribute; //P minimal number of intervals; default = 0 (no limits) 
    203203 
Note: See TracChangeset for help on using the changeset viewer.