# Changeset 8506:78e7ed99658a in orange

Timestamp: 07/28/11 10:42:38
Branch: default
Convert: fbec3dea907693a688e0ed16aa991a1ac5881479
Message: Merge changes from trunk
Files: 1 deleted, 58 edited

• ## orange/Orange/__init__.py

 r8042	_import("misc.render")
	_import("misc.selection")
	_import("misc.r") is now commented out: #_import("misc.r")
• ## orange/Orange/classification/bayes.py

 r8042	"""
.. index:: naive Bayes classifier

.. index::
   single: classification; naive Bayes classifier

**********************************
Naive Bayes classifier (bayes)
**********************************

The most primitive Bayesian classifier is :obj:`NaiveLearner`. The naive Bayes
classification algorithm estimates conditional probabilities from training
data and uses them for classification of new data instances. The algorithm
learns very fast if all features in the training data set are discrete. If a
number of features are continuous, though, the algorithm runs slower due to
the time spent estimating continuous conditional distributions.

The following example demonstrates a straightforward invocation of this
algorithm (bayes-run.py_, uses titanic.tab_):

.. literalinclude:: code/bayes-run.py
   :lines: 7-

.. index:: Naive Bayesian Learner

.. autoclass:: Orange.classification.bayes.NaiveLearner
   :members:
   :show-inheritance:

.. autoclass:: Orange.classification.bayes.NaiveClassifier
   :members:
   :show-inheritance:

Examples
========

:obj:`NaiveLearner` can estimate probabilities using relative frequencies or
an m-estimate (bayes-mestimate.py_, uses lenses.tab_):

.. literalinclude:: code/bayes-mestimate.py
   :lines: 7-

Observing conditional probabilities in an m-estimate based classifier shows a
shift towards the second class, as compared to the probabilities above, where
relative frequencies were used. Note that the change in error estimation did
not have any effect on apriori probabilities
(bayes-thresholdAdjustment.py_, uses adult-sample.tab_):

.. literalinclude:: code/bayes-thresholdAdjustment.py
   :lines: 7-

Setting the adjustThreshold parameter can sometimes improve the results.
These are the classification accuracies of 10-fold cross-validation of a
normal naive Bayesian classifier, and one with an adjusted threshold::

    [0.7901746265516516, 0.8280138859667578]

Probabilities for continuous features are estimated with
:class:`ProbabilityEstimatorConstructor_loess`
(bayes-plot-iris.py_, uses iris.tab_):

.. literalinclude:: code/bayes-plot-iris.py
   :lines: 4-

.. image:: code/bayes-iris.png
   :scale: 50 %

If petal lengths are shorter, the most probable class is "setosa". Irises
with middle petal lengths belong to "versicolor", while longer petal lengths
indicate "virginica". Critical values where the decision would change are at
about 5.4 and 6.3.

.. _bayes-run.py: code/bayes-run.py
.. _bayes-thresholdAdjustment.py: code/bayes-thresholdAdjustment.py
.. _bayes-mestimate.py: code/bayes-mestimate.py
.. _bayes-plot-iris.py: code/bayes-plot-iris.py
.. _adult-sample.tab: code/adult-sample.tab
.. _iris.tab: code/iris.tab
.. _titanic.tab: code/titanic.tab
.. _lenses.tab: code/lenses.tab

Implementation details
======================

The following two classes are implemented in C++ (*bayes.cpp*). They are not
intended to be used directly. Here we provide implementation details for
those interested.

Orange.core.BayesLearner
------------------------

Fields estimatorConstructor, conditionalEstimatorConstructor and
conditionalEstimatorConstructorContinuous are empty (None) by default.

If estimatorConstructor is left undefined, p(C) will be estimated by relative
frequencies of examples (see ProbabilityEstimatorConstructor_relative).

When conditionalEstimatorConstructor is left undefined, it will use the same
constructor as for estimating unconditional probabilities
(estimatorConstructor is used as an estimator in
ConditionalProbabilityEstimatorConstructor_ByRows). That is, by default, both
will use relative frequencies. But when estimatorConstructor is set to, for
instance, estimate probabilities by m-estimate with m=2.0, the same estimator
will be used for estimation of conditional probabilities, too.

P(c|vi) for continuous attributes are, by default, estimated with loess (a
variant of locally weighted linear regression), using
ConditionalProbabilityEstimatorConstructor_loess.

The learner first constructs an estimator for p(C).
It tries to get a precomputed distribution of probabilities; if the estimator
is capable of returning it, the distribution is stored in the classifier's
field distribution and the just constructed estimator is disposed of.
Otherwise, the estimator is stored in the classifier's field estimator, while
the distribution is left empty.

The same is then done for conditional probabilities. Different constructors
are used for discrete and continuous attributes. If the constructed estimator
can return all conditional probabilities in the form of a Contingency, the
contingency is stored and the estimator disposed of. If not, the estimator is
stored. If there are no contingencies when the learning is finished, the
resulting classifier's conditionalDistributions is None. Alternatively, if
all probabilities are stored as contingencies, the conditionalEstimators
field is None.

Field normalizePredictions is copied to the resulting classifier.

Orange.core.BayesClassifier
---------------------------

Class NaiveClassifier represents a naive Bayesian classifier. The probability
of class C, knowing that the values of features :math:`F_1, F_2, ..., F_n`
are :math:`v_1, v_2, ..., v_n`, is computed as

.. math::

   p(C|v_1, v_2, ..., v_n) = p(C) \cdot \frac{p(C|v_1)}{p(C)} \cdot
   \frac{p(C|v_2)}{p(C)} \cdot ... \cdot \frac{p(C|v_n)}{p(C)}.

Note that when relative frequencies are used to estimate probabilities, the
more usual formula (with factors of the form :math:`\frac{p(v_i|C)}{p(v_i)}`)
and the above formula are exactly equivalent (without any additional
assumptions of independence, as one could think at first glance). The
difference becomes important when using other ways to estimate probabilities,
like, for instance, the m-estimate. In this case, the above formula is much
more appropriate.

When computing the formula, probabilities p(C) are read from distribution,
which is of type Distribution, and stores a (normalized) probability of each
class.
When distribution is None, BayesClassifier calls estimator to assess the
probability. The former method is faster and is actually used by all existing
methods of probability estimation. The latter is more flexible.

Conditional probabilities are computed similarly. Field
conditionalDistributions is of type DomainContingency, which is basically a
list of instances of Contingency, one for each attribute; the outer variable
of the contingency is the attribute and the inner is the class. Contingency
can be seen as a list of normalized probability distributions. For attributes
for which there is no contingency in conditionalDistributions, a
corresponding estimator in conditionalEstimators is used. The estimator is
given the attribute value and returns distributions of classes. If neither a
precomputed contingency nor a conditional estimator exists, the attribute is
ignored without issuing any warning. The attribute is also ignored if its
value is undefined; this cannot be overridden by estimators.

Any field (distribution, estimator, conditionalDistributions,
conditionalEstimators) can be None. For instance, BayesLearner normally
constructs a classifier which has either distribution or estimator defined.
While it is not an error to have both, only distribution will be used in that
case. As for the other two fields, they can be both defined and used
complementarily; the elements which are missing in one are defined in the
other. However, if there is no need for estimators, BayesLearner will not
construct an empty list; it will not construct a list at all, but leave the
field conditionalEstimators empty.

If you only need the probability of an individual class, call
BayesClassifier's method p(class, example) to compute the probability of that
class only. Note that this probability will not be normalized and will thus,
in general, not equal the probability returned by the call operator.
"""

import Orange
from Orange.core import BayesClassifier as _BayesClassifier
..
:param adjustTreshold: sets the corresponding attribute :type adjustTreshold: boolean :param adjust_threshold: sets the corresponding attribute :type adjust_threshold: boolean :param m: sets the :obj:estimatorConstructor to :class:orange.ProbabilityEstimatorConstructor_m with specified m :type m: integer :param estimatorConstructor: sets the corresponding attribute :type estimatorConstructor: orange.ProbabilityEstimatorConstructor :param conditionalEstimatorConstructor: sets the corresponding attribute :type conditionalEstimatorConstructor: :param estimator_constructor: sets the corresponding attribute :type estimator_constructor: orange.ProbabilityEstimatorConstructor :param conditional_estimator_constructor: sets the corresponding attribute :type conditional_estimator_constructor: :class:orange.ConditionalProbabilityEstimatorConstructor :param conditionalEstimatorConstructorContinuous: sets the corresponding :param conditional_estimator_constructor_continuous: sets the corresponding attribute :type conditionalEstimatorConstructorContinuous: :type conditional_estimator_constructor_continuous: :class:orange.ConditionalProbabilityEstimatorConstructor Constructor parameters set the corresponding attributes. .. attribute:: adjustTreshold .. attribute:: adjust_threshold If set and the class is binary, the classifier's This attribute is ignored if you also set estimatorConstructor. .. attribute:: estimatorConstructor .. attribute:: estimator_constructor Probability estimator constructor for Setting this attribute disables the above described attribute m. .. attribute:: conditionalEstimatorConstructor .. attribute:: conditional_estimator_constructor Probability estimator constructor the estimator for prior probabilities will be used. .. attribute:: conditionalEstimatorConstructorContinuous .. 
attribute:: conditional_estimator_constructor_continuous Probability estimator constructor for conditional probabilities for "conditionalEstimatorConstructorContinuous":"conditional_estimator_constructor_continuous", "weightID": "weight_id" }, in_place=False)(NaiveLearner) }, in_place=True)(NaiveLearner) :class:Orange.core.BayesClassifier that does the actual classification. :param baseClassifier: an :class:Orange.core.BayesLearner to wrap. If :param base_classifier: an :class:Orange.core.BayesLearner to wrap. If not set, a new :class:Orange.core.BayesLearner is created. :type baseClassifier: :class:Orange.core.BayesLearner :type base_classifier: :class:Orange.core.BayesLearner .. attribute:: distribution An object that returns a probability of class p(C) for a given class C. .. attribute:: conditionalDistributions .. attribute:: conditional_distributions A list of conditional probabilities. .. attribute:: conditionalEstimators .. attribute:: conditional_estimators A list of estimators for conditional probabilities. .. attribute:: adjustThreshold .. 
attribute:: adjust_threshold For binary classes, this tells the learner to """ def __init__(self, baseClassifier=None): if not baseClassifier: baseClassifier = _BayesClassifier() self.nativeBayesClassifier = baseClassifier for k, v in self.nativeBayesClassifier.__dict__.items(): def __init__(self, base_classifier=None): if not base_classifier: base_classifier = _BayesClassifier() self.native_bayes_classifier = base_classifier for k, v in self.native_bayes_classifier.__dict__.items(): self.__dict__[k] = v :class:Orange.statistics.Distribution or a tuple with both """ return self.nativeBayesClassifier(instance, result_type, *args, **kwdargs) return self.native_bayes_classifier(instance, result_type, *args, **kwdargs) def __setattr__(self, name, value): if name == "nativeBayesClassifier": if name == "native_bayes_classifier": self.__dict__[name] = value return if name in self.nativeBayesClassifier.__dict__: self.nativeBayesClassifier.__dict__[name] = value if name in self.native_bayes_classifier.__dict__: self.native_bayes_classifier.__dict__[name] = value self.__dict__[name] = value """ return self.nativeBayesClassifier.p(class_, instance) return self.native_bayes_classifier.p(class_, instance) def __str__(self): """return classifier in human friendly format.""" nValues=len(self.classVar.values) frmtStr=' %10.3f'*nValues classes=" "*20+ ((' %10s'*nValues) % tuple([i[:10] for i in self.classVar.values])) """Return classifier in human friendly format.""" nvalues=len(self.class_var.values) frmtStr=' %10.3f'*nvalues classes=" "*20+ ((' %10s'*nvalues) % tuple([i[:10] for i in self.class_var.values])) return "\n".join([ ("%20s" % i.variable.values[v][:20]) + (frmtStr % tuple(i[v])) for v in xrange(len(i.variable.values)))] ) for i in self.conditionalDistributions])]) ) for i in self.conditional_distributions if i.variable.var_type == Orange.data.variable.Discrete])])
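The classification formula quoted above from *bayes.cpp* can be sketched in plain, modern Python. This is an illustrative re-implementation with relative-frequency estimates, not the Orange code; `naive_bayes_posterior` and the toy data are made up for the example:

```python
from collections import Counter, defaultdict

def naive_bayes_posterior(instance, examples):
    """Posterior p(C | v_1..v_n) via the factorization from the docs:
    p(C) * prod_i p(C | v_i) / p(C), with relative-frequency estimates."""
    n = float(len(examples))
    prior = Counter(c for _, c in examples)        # class counts
    cond = defaultdict(Counter)                    # (feature index, value) -> class counts
    for values, c in examples:
        for i, v in enumerate(values):
            cond[(i, v)][c] += 1

    scores = {}
    for c in prior:
        p_c = prior[c] / n
        score = p_c
        for i, v in enumerate(instance):
            seen = cond.get((i, v))
            if not seen:
                continue                           # unseen value: ignored, as the docs describe
            p_c_given_v = seen[c] / float(sum(seen.values()))
            score *= p_c_given_v / p_c
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

examples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
            (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
            (("rainy", "mild"), "yes")]
print(naive_bayes_posterior(("rainy", "mild"), examples))
```

Unlike the call operator described above, this sketch always normalizes; the unnormalized per-class scores correspond to what `p(class, example)` would return.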
• ## orange/Orange/classification/majority.py

 r8042	MajorityLearner will most often be used as is, without setting any
parameters. Nevertheless, it has two.

.. attribute:: estimator_constructor

    An estimator constructor that can be used for estimation of class
    probabilities.

.. attribute:: apriori_distribution

    Apriori class distribution that is passed to the estimator.

.. attribute:: default_val

    Value that is returned by the classifier.

.. attribute:: default_distribution

    Class probabilities returned by the classifier.

The ConstantClassifier's constructor can be called without arguments, or with
a value (for default_val) and a variable (for class_var). If the value is
given and is of type Orange.data.Value (alternatives are an integer index of
a discrete value or a continuous value), its field variable will either be
used for initializing class_var, if variable is not given as an argument, or
checked against the variable argument, if it is given.
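As described, the learner simply memorizes the majority class and the class distribution, and the resulting classifier ignores its input. A minimal sketch in plain Python (illustrative only; `majority_learner` and its return convention are invented here, not the Orange API):

```python
from collections import Counter

def majority_learner(class_values):
    """Build a 'constant classifier': always predicts the majority class
    together with the relative-frequency class distribution."""
    dist = Counter(class_values)
    n = float(len(class_values))
    default_val = dist.most_common(1)[0][0]                  # majority class
    default_distribution = {c: k / n for c, k in dist.items()}

    def classify(instance=None):
        # The instance is ignored: the prediction never changes.
        return default_val, default_distribution

    return classify

classify = majority_learner(["a", "b", "a", "a", "c"])
print(classify())
```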
• ## orange/Orange/feature/__init__.py

 r8042	"""
.. index:: feature

This module provides functionality for feature scoring, selection,
discretization, continuization, imputation, construction and feature
interaction analysis.

=======
Scoring
=======

.. automodule:: Orange.feature.scoring

=========
Selection
=========

.. automodule:: Orange.feature.selection

==============
Discretization
==============

.. automodule:: Orange.feature.discretization

==============
Continuization
==============

.. index:: continuization

.. automodule:: Orange.feature.continuization

==========
Imputation
==========

.. automodule:: Orange.feature.imputation
"""
• ## orange/Orange/feature/continuization.py

 r8042	"""
###################################
Continuization (continuization)
###################################
"""

from Orange.core import DomainContinuizer
• ## orange/Orange/feature/discretization.py

 r8042	"""
###################################
Discretization (discretization)
###################################

.. index:: discretization
• ## orange/Orange/feature/imputation.py

 r8042	"""
###########################
Imputation (imputation)
###########################

.. index:: imputation
• ## orange/Orange/feature/scoring.py

 r8042	"""
#####################
Scoring (scoring)
#####################

.. index:: feature scoring

.. index::
   single: feature; feature scoring

Feature scoring is used in feature subset selection for classification
problems. The goal is to find "good" features that are relevant for the
given classification task.

The following example computes feature scores, both with :obj:`score_all` and
by scoring each feature individually, and prints out the best three features.

.. _scoring-all.py: code/scoring-all.py
.. _voting.tab: code/voting.tab

scoring-all.py_ (uses voting.tab_):

.. literalinclude:: code/scoring-all.py
   :lines: 7-

The output::

    Feature scores for best three features (with score_all):
    0.613 physician-fee-freeze
    0.255 adoption-of-the-budget-resolution
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

    Feature scores for best three features (scored individually):
    0.613 physician-fee-freeze
    0.255 el-salvador-aid
    0.228 synfuels-corporation-cutback

.. autoclass:: Orange.feature.scoring.OrderAttributesByMeasure
   :members:

.. automethod:: Orange.feature.scoring.MeasureAttribute_Distance

.. autoclass:: Orange.feature.scoring.MeasureAttribute_DistanceClass
   :members:

.. automethod:: Orange.feature.scoring.MeasureAttribute_MDL

.. autoclass:: Orange.feature.scoring.MeasureAttribute_MDLClass
   :members:

.. automethod:: Orange.feature.scoring.mergeAttrValues

.. automethod:: Orange.feature.scoring.attMeasure

There are a number of different measures for assessing the relevance of
features with respect to how much information they contain about the
corresponding class.
These procedures are also known as feature scoring. Implemented methods for
scoring the relevance of features to the class are subclasses of
:obj:`Measure`. Those that compute statistics on conditional distributions of
class values given the feature values are derived from
:obj:`MeasureFromProbabilities`.

.. class:: Measure

    Abstract base class for feature scoring. Its attributes describe which
    features it can handle and the required data.

    **Capabilities**

    .. attribute:: handles_discrete

        Indicates whether the measure can handle discrete features.

    .. attribute:: handles_continuous

        Indicates whether the measure can handle continuous features.

    .. attribute:: computes_thresholds

        Indicates whether the measure implements the
        :obj:`threshold_function`.

    **Input specification**

    .. attribute:: needs

        The kind of data the measure needs: either :obj:`NeedsGenerator`,
        :obj:`NeedsDomainContingency`, or :obj:`NeedsContingency_Class`.
The first needs an instance generator (Relief is an example of such a
measure), the second can compute the quality from
:obj:`Orange.statistics.contingency.Domain`, and the last only needs the
contingency (:obj:`Orange.statistics.contingency.VarClass`), the feature
distribution and the apriori class distribution. Most measures only need the
latter.

Several (but not all) measures can treat unknown feature values in different
ways, depending on field :obj:`unknownsTreatment` (this field is not defined
in :obj:`Measure` but in many derived classes). Undefined values can be:

* ignored (:obj:`Measure.IgnoreUnknowns`); this has the same effect as if the
  examples for which the feature value is unknown were removed.

* punished (:obj:`Measure.ReduceByUnknown`); the feature quality is reduced
  by the proportion of unknown values. In impurity measures, this can be
  interpreted as if the impurity is decreased only on examples for which the
  value is defined and stays the same for the others, and the feature quality
  is the average impurity decrease.

* imputed (:obj:`Measure.UnknownsToCommon`); here, undefined values are
  replaced by the most common feature value. If you want a more clever
  imputation, you should do it in advance.

* treated as a separate value (:obj:`Measure.UnknownsAsValue`).

The default treatment is :obj:`ReduceByUnknown`, which is optimal in most
cases and does not make additional presumptions (unlike, for instance,
:obj:`UnknownsToCommon`, which supposes that missing values are not, for
instance, results of measurements that were not performed due to information
extracted from the other features). Use other treatments if you know that
they make better sense for your data.

The only method supported by all measures is the call operator, to which we
pass the data and get a number representing the quality of the feature. The
number does not have any absolute meaning and can vary widely for different
feature measures.
The only common characteristic is that the higher the value, the better the
feature. If the feature is so bad that its quality cannot be measured, the
measure returns :obj:`Measure.Rejected`. None of the measures described here
do so.

There are different sets of arguments that the call operator can accept. Not
all classes will accept all kinds of arguments. Relief, for instance, cannot
be computed from contingencies alone. Besides, the feature and the class need
to be of the correct type for a particular measure.

There are three call operators, just to make your life simpler and faster.
When working with the data, your method might have already computed, for
instance, the contingency matrix. If so, and if the quality measure you use
is OK with that (as most measures are), you can pass the contingency matrix
and the measure will compute much faster. If, on the other hand, you only
have examples and haven't computed any statistics on them, you can pass the
examples (and, optionally, an id for a meta-feature with weights) and the
measure will compute the contingency itself, if needed.

.. method:: __call__(attribute, examples[, apriori class distribution][, weightID])
.. method:: __call__(attribute, domain contingency[, apriori class distribution])
.. method:: __call__(contingency, class distribution[, apriori class distribution])

:param attribute: gives the feature whose quality is to be assessed. This can
    be either a descriptor, an index into the domain, or a name. In the first
    form, if the feature is given by descriptor, it doesn't need to be in the
    domain, but it needs to be computable from the features in the domain.

Data is given either as examples (and, optionally, an id for a meta-feature
with weight), contingency tables
(:obj:`Orange.statistics.contingency.Domain`) or distributions
(:obj:`Orange.statistics.distribution.Distribution`) for all attributes. In
the latter form, what is given as the class distribution depends upon what
you do with unknown values (if there are any).
If :obj:`unknownsTreatment` is :obj:`IgnoreUnknowns`, the class distribution
should be computed on examples for which the feature value is defined.
Otherwise, the class distribution should be the overall class distribution.

The optional argument with the apriori class distribution is most often
ignored. It comes in handy if the measure makes any probability estimates
based on apriori class probabilities (such as the m-estimate).

.. method:: thresholdFunction(attribute, examples[, weightID])

    This function computes the qualities for different binarizations of the
    continuous feature :obj:`attribute`. The feature should, of course, be
    continuous. The result of the function is a list of tuples, where the
    first element represents a threshold (all splits in the middle between
    two existing feature values), the second is the measured quality for the
    corresponding binary feature and the last one is the distribution which
    gives the number of examples below and above the threshold. The last
    element, though, may be missing; generally, if the particular measure can
    get the distribution without any computational burden, it will do so and
    the caller can use it. If not, the caller needs to compute it itself.

The type of data needed: :obj:`NeedsGenerator`,
:obj:`NeedsDomainContingency`, or :obj:`NeedsContingency_Class`.

.. attribute:: NeedsGenerator

    Constant. Indicates that the measure needs an instance generator on the
    input (as, for example, the :obj:`Relief` measure).

.. attribute:: NeedsDomainContingency

    Constant. Indicates that the measure needs
    :obj:`Orange.statistics.contingency.Domain`.

.. attribute:: NeedsContingency_Class

    Constant. Indicates that the measure needs the contingency
    (:obj:`Orange.statistics.contingency.VarClass`), the feature distribution
    and the apriori class distribution (as most measures do).

**Treatment of unknown values**

.. attribute:: unknowns_treatment

    Not defined in :obj:`Measure` but defined in classes that are able to
    treat unknown values. Either :obj:`IgnoreUnknowns`,
    :obj:`ReduceByUnknown`,
:obj:`UnknownsToCommon`, or :obj:`UnknownsAsValue`.

.. attribute:: IgnoreUnknowns

    Constant. Examples for which the feature value is unknown are removed.

.. attribute:: ReduceByUnknown

    Constant. Features with unknown values are punished. The feature quality
    is reduced by the proportion of unknown values. For impurity measures the
    impurity decreases only where the value is defined and stays the same
    otherwise.

.. attribute:: UnknownsToCommon

    Constant. Undefined values are replaced by the most common value.

.. attribute:: UnknownsAsValue

    Constant. Unknown values are treated as a separate value.

**Methods**

.. method:: __call__(attribute, instances[, apriori_class_distribution][, weightID])

    :param attribute: the chosen feature, either as a descriptor, index, or a
        name.
    :type attribute: :class:`Orange.data.variable.Variable` or int or string
    :param instances: data.
    :type instances: Orange.data.Table
    :param weightID: id for meta-feature with weight.

    Abstract. All measures need to support __call__ with these parameters.
    Described below.

.. method:: __call__(attribute, domain_contingency[, apriori_class_distribution])

    :param attribute: the chosen feature, either as a descriptor, index, or a
        name.
    :type attribute: :class:`Orange.data.variable.Variable` or int or string
    :param domain_contingency:
    :type domain_contingency: :obj:`Orange.statistics.contingency.Domain`

    Abstract. Described below.

.. method:: __call__(contingency, class_distribution[, apriori_class_distribution])

    :param contingency:
    :type contingency: :obj:`Orange.statistics.contingency.VarClass`
    :param class_distribution: distribution of the class variable. If
        :obj:`unknowns_treatment` is :obj:`IgnoreUnknowns`, it should be
        computed on instances where the feature value is defined. Otherwise,
        the class distribution should be the overall class distribution.
    :type class_distribution:
        :obj:`Orange.statistics.distribution.Distribution`
    :param apriori_class_distribution: Optional and most often ignored.
    Useful if the measure makes any probability estimates based on apriori
    class probabilities (such as the m-estimate).

    :return: Feature score; the higher the value, the better the feature. If
        the quality cannot be measured, return :obj:`Measure.Rejected`.
    :rtype: float or :obj:`Measure.Rejected`.

    Abstract. Different forms of __call__ enable optimization. For instance,
    if the contingency matrix has already been computed, you can speed up the
    computation by passing it to the measure (if it supports that form, as
    most do). Otherwise the measure will have to compute the contingency
    itself. Not all classes will accept all kinds of arguments.
    :obj:`Relief`, for instance, only supports the form with instances on the
    input.

The code sample below shows the use of :obj:`GainRatio` with different call
types.

.. literalinclude:: code/scoring-calls.py
   :lines: 7-

.. method:: threshold_function(attribute, examples[, weightID])

    Abstract. Assess different binarizations of the continuous feature
    :obj:`attribute`. Return a list of tuples, where the first element is a
    threshold (between two existing values), the second is the quality of the
    corresponding binary feature, and the last the distribution of examples
    below and above the threshold. The last element is optional.

.. method:: best_threshold

    Return the best threshold for binarization. Parameters?

The script below shows different ways to assess the quality of astigmatic,
tear rate and the first feature (whichever it is) in the dataset lenses.

.. literalinclude:: code/scoring-info-lenses.py

0.548794984818

You shouldn't use this shortcut with ReliefF, though; see the explanation in
the section on ReliefF. It is also possible to assess the quality of features
that do not exist in the domain. For instance, you can assess the quality of
discretized features without constructing a new domain and dataset that would
include them.
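The behaviour described for threshold_function and best_threshold can be sketched in plain Python, using information gain as the quality measure (an assumption for this sketch; any measure could be plugged in). The function names mirror the docs, but this is not the Orange implementation:

```python
from math import log

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = float(len(labels))
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum(k / n * log(k / n, 2) for k in counts.values())

def threshold_function(values, labels):
    """For every candidate threshold (midpoint between adjacent distinct
    values) return (threshold, info gain of the binarized feature,
    (n_below, n_above))."""
    base = entropy(labels)
    n = float(len(values))
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    results = []
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        below = [c for v, c in pairs if v <= t]
        above = [c for v, c in pairs if v > t]
        gain = (base - len(below) / n * entropy(below)
                     - len(above) / n * entropy(above))
        results.append((t, gain, (len(below), len(above))))
    return results

def best_threshold(values, labels):
    """The threshold whose binarization scores highest."""
    return max(threshold_function(values, labels), key=lambda r: r[1])

values = [1.0, 1.2, 3.0, 3.4, 5.0]
labels = ["a", "a", "b", "b", "b"]
print(best_threshold(values, labels))
```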
scoring-info-iris.py_ (uses iris.tab_):

You shouldn't use this with :obj:`Relief`; see :obj:`Relief` for the
explanation. It is also possible to score features that are not in the
domain. For instance, you can score discretized features on the fly:

.. literalinclude:: code/scoring-info-iris.py
   :lines: 7-11

The quality of the new feature d1 is assessed on data which does not include
the new feature at all. (Note that ReliefF won't do that, since it would be
too slow; ReliefF requires the feature to be present in the dataset.)

Finally, you can compute the quality of meta-features. The following script
adds a meta-feature to an example table, initializes it to random values and
measures its information gain.

scoring-info-lenses.py_ (uses lenses.tab_):

.. literalinclude:: code/scoring-info-lenses.py
   :lines: 54-

Note that this is not possible with :obj:`Relief`, as it would be too slow.

To show the computation of thresholds, we shall use the Iris data set. If we
hadn't constructed the feature in advance, we could write
Orange.feature.scoring.Relief().threshold_function("petal length", data).
This is not recommended for ReliefF, since it may be a lot slower.

The threshold at which the binarized feature has the optimal ReliefF (or any
other measure) score::

    thresh, score, distr = meas.best_threshold("petal length", data)
    print "Best threshold: %5.3f (score %5.3f)" % (thresh, score)

.. class:: MeasureFromProbabilities

    Bases: :obj:`Measure`

    Abstract base class for feature quality measures that can be computed
    from contingency matrices only. It relieves the derived classes from
    having to compute the contingency matrix by defining the first two forms
    of the call operator. (Well, that's not something you need to know if you
    only work in Python.)
An additional feature of this class is that you can set probability
estimators. If none are given, probabilities and conditional probabilities of
classes are estimated by relative frequencies.

.. attribute:: unknowns_treatment

    Defines what to do with unknown values. See
    :obj:`Measure.unknowns_treatment` and the possibilities described above.

.. attribute:: estimator_constructor
.. attribute:: conditional_estimator_constructor

    The classes that are used to estimate unconditional and conditional
    probabilities of classes, respectively. You can set this to, for
    instance, :obj:`ProbabilityEstimatorConstructor_m` and
    :obj:`ConditionalProbabilityEstimatorConstructor_ByRows` (with the
    estimator constructor again set to
    :obj:`ProbabilityEstimatorConstructor_m`), respectively. Both default to
    relative frequencies.

This script uses :obj:`GainRatio` and :obj:`Relief`.

scoring-relief-gainRatio.py_ (uses voting.tab_):

.. literalinclude:: code/scoring-relief-gainRatio.py
   :lines: 7-

Notice that on this data the ranks of features match::

    Relief GainRt Feature
    0.166  0.345  adoption-of-the-budget-resolution

The following section describes the feature quality measures suitable for
discrete features and outcomes. See scoring-info-lenses.py_,
scoring-info-iris.py_, scoring-diff-measures.py_ and scoring-regression.py_
for more examples of their use.
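Measures derived from MeasureFromProbabilities need only the contingency. As a sketch (plain Python, not the Orange implementation), information gain and gain ratio can be computed directly from a contingency matrix whose rows are feature values and whose columns are classes:

```python
from math import log

def _h(probs):
    """Entropy (base 2) of a probability vector; zero entries are skipped."""
    return -sum(p * log(p, 2) for p in probs if p > 0)

def info_gain_and_gain_ratio(contingency):
    """contingency[v][c] = number of examples with feature value v, class c."""
    n = float(sum(sum(row) for row in contingency))
    class_totals = [sum(col) for col in zip(*contingency)]
    h_class = _h([t / n for t in class_totals])              # H(C)
    h_cond = 0.0                                             # H(C | feature)
    for row in contingency:
        nv = float(sum(row))
        if nv:
            h_cond += nv / n * _h([k / nv for k in row])
    gain = h_class - h_cond                                  # information gain
    h_feature = _h([sum(row) / n for row in contingency])    # H(feature)
    gain_ratio = gain / h_feature if h_feature else 0.0
    return gain, gain_ratio

# A feature that perfectly separates two equally frequent classes:
print(info_gain_and_gain_ratio([[4, 0], [0, 4]]))
```

Dividing by the feature's entropy is what lets gain ratio compensate for multi-valued features, as described below.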
Undocumented: MeasureAttribute_IM, MeasureAttribute_chiSquare, MeasureAttribute_gainRatioA, MeasureAttribute_logOddsRatio, MeasureAttribute_splitGain. .. index:: .. class:: InfoGain Measures the expected decrease of entropy. .. index:: .. class:: GainRatio Information gain divided by the entropy of the feature's value. Introduced by Quinlan in order to avoid overestimation of multi-valued features. It has been shown, however, that it still overestimates features with multiple values. .. index:: .. class:: Gini The probability that two randomly chosen examples will have different classes; first introduced by Breiman. .. index:: .. class:: Relevance The potential value for decision rules. .. index:: .. attribute:: cost Cost matrix, see :obj:Orange.classification.CostMatrix for details.
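All of the discrete measures above are simple functions of class counts. A plain-Python sketch (illustrative only, not the Orange implementation) of information gain, gain ratio and the Gini index:

```python
from math import log
from collections import Counter, defaultdict

def entropy(counts):
    """Entropy (in bits) of a list of class counts."""
    total = float(sum(counts))
    return -sum(c / total * log(c / total, 2) for c in counts if c)

def gini(counts):
    """Probability that two randomly drawn examples differ in class."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)

def info_gain(feature_values, classes):
    """Class entropy minus expected entropy after splitting."""
    n = float(len(classes))
    by_value = defaultdict(Counter)
    for v, c in zip(feature_values, classes):
        by_value[v][c] += 1
    base = entropy(list(Counter(classes).values()))
    return base - sum(sum(cnt.values()) / n * entropy(list(cnt.values()))
                      for cnt in by_value.values())

def gain_ratio(feature_values, classes):
    """Information gain divided by the entropy of the feature itself."""
    h_feature = entropy(list(Counter(feature_values).values()))
    return info_gain(feature_values, classes) / h_feature if h_feature else 0.0

values  = ["a", "a", "b", "b"]
classes = ["y", "y", "n", "n"]
# the feature separates the classes perfectly:
# info_gain == 1.0 and gain_ratio == 1.0
```

Dividing by the feature's own entropy is exactly how gain ratio penalises many-valued features, as described above.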
If the cost of predicting the first class of an example that is actually in the second is 5, and the cost of the opposite error is 1, an appropriate measure can be constructed as follows:: >>> meas = Orange.feature.scoring.Cost() 0.083333350718021393 Knowing the value of feature 3 would decrease the classification cost by approximately 0.083 per example. .. index:: .. class:: Relief Assesses features' ability to distinguish between very similar examples from different classes. First developed by Kira and Rendell and then improved by Kononenko. .. attribute:: k Number of neighbours for each example. Default is 5. .. attribute:: m Number of reference examples. Default is 100. Set to -1 to take all the examples. .. attribute:: check_cached_data Check whether the cached data has changed, using a data checksum. Slow on large tables. Defaults to True. Disable it if you know that the data will not change. ReliefF is slow since it needs to find the k nearest neighbours for each of m reference examples. As we normally compute ReliefF for all features in the dataset, :obj:Relief caches the results. When called to score a certain feature, it computes all feature scores. When called again, it uses the stored results if the domain and the data table have not changed (the data table version and the data checksum are compared). Caching will only work if you use the same instance. So, don't do this:: for attr in data.domain.attributes: print Orange.feature.scoring.Relief(attr, data) But this:: meas = Orange.feature.scoring.Relief() for attr in table.domain.attributes: print meas(attr, data) Class :obj:Relief works on discrete and continuous classes and thus implements the functionality of the algorithms ReliefF and RReliefF. .. note:: Relief can also compute the threshold function, that is, the feature quality at different thresholds for binarization. ======================= ======================= :obj:Relief can also be used for regression; apart from it, the only feature quality measure available for regression problems is based on the mean square error. .. index:: Implements the mean square error measure. .. attribute:: unknowns_treatment What to do with unknown values. See :obj:Measure.unknowns_treatment. .. attribute:: m Parameter for the m-estimate of error. Default is 0 (no m-estimate). ============ Other ============ .. autoclass:: Orange.feature.scoring.OrderAttributes :members: .. autofunction:: Orange.feature.scoring.Distance .. autoclass:: Orange.feature.scoring.DistanceClass :members: ..
autofunction:: Orange.feature.scoring.MDL .. autoclass:: Orange.feature.scoring.MDLClass :members: .. autofunction:: Orange.feature.scoring.merge_values .. autofunction:: Orange.feature.scoring.score_all ========== import Orange.core as orange import Orange.misc from orange import MeasureAttribute as Measure from orange import MeasureAttributeFromProbabilities as MeasureFromProbabilities from orange import MeasureAttribute_info as InfoGain from orange import MeasureAttribute_gainRatio as GainRatio from orange import MeasureAttribute_MSE as MSE ###### # from orngEvalAttr.py class OrderAttributes: """Orders features by their scores. .. attribute:: measure A measure derived from :obj:~Orange.feature.scoring.Measure. If None, :obj:Relief will be used. """ def __call__(self, data, weight): """Score and order all features. :param data: a data table used to score features :type data: Orange.data.table :param weight: meta attribute that stores weights of instances :type weight: Orange.data.variable """ return [x[0] for x in measured] def Distance(attr=None, data=None): """Instantiate :obj:DistanceClass and use it to return the score of a given feature on given data.
""" m = MeasureAttribute_DistanceClass() m = DistanceClass() if attr != None and data != None: return m(attr, data) return m class MeasureAttribute_DistanceClass(orange.MeasureAttribute): """Implement the 1-D feature distance measure described in Kononenko.""" def __call__(self, attr, data, aprioriDist = None, weightID = None): class DistanceClass(Measure): """The 1-D feature distance measure described in Kononenko.""" @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) def __call__(self, attr, data, apriori_dist=None, weightID=None): """Take :obj:Orange.data.table data table and score the given :obj:Orange.data.variable. :type data: Orange.data.table :param aprioriDist: :type aprioriDist: :param apriori_dist: :type apriori_dist: :param weightID: meta feature used to weight individual data instances return 0 def MeasureAttribute_MDL(attr = None, data = None): """Instantiate :obj:MeasureAttribute_MDLClass and use it n given data to def MDL(attr=None, data=None): """Instantiate :obj:MDLClass and use it n given data to return the feature's score.""" m = MeasureAttribute_MDLClass() m = MDLClass() if attr != None and data != None: return m(attr, data) return m class MeasureAttribute_MDLClass(orange.MeasureAttribute): class MDLClass(Measure): """Score feature based on the minimum description length principle.""" def __call__(self, attr, data, aprioriDist = None, weightID = None): @Orange.misc.deprecated_keywords({"aprioriDist": "apriori_dist"}) def __call__(self, attr, data, apriori_dist=None, weightID=None): """Take :obj:Orange.data.table data table and score the given :obj:Orange.data.variable. 
:type data: Orange.data.table :param apriori_dist: :type apriori_dist: :param weightID: meta feature used to weight individual data instances """ return ret @Orange.misc.deprecated_keywords({"attrList": "attr_list", "attrMeasure": "attr_measure", "removeUnusedValues": "remove_unused_values"}) def merge_values(data, attr_list, attr_measure, remove_unused_values = 1): import orngCI #data = data.select([data.domain[attr] for attr in attr_list] + [data.domain.classVar]) newData = data.select(attr_list + [data.domain.class_var]) newAttr = orngCI.FeatureByCartesianProduct(newData, attr_list)[0] dist = orange.Distribution(newAttr, newData) activeValues = [] for i in range(len(newAttr.values)): if dist[newAttr.values[i]] > 0: activeValues.append(i) currScore = attr_measure(newAttr, newData) while 1: bestScore, bestMerge = currScore, None for i1, ind1 in enumerate(activeValues): oldInd1 = newAttr.get_value_from.lookupTable[ind1] for ind2 in activeValues[:i1]: newAttr.get_value_from.lookupTable[ind1] = ind2 score = attr_measure(newAttr, newData) if score >= bestScore: bestScore, bestMerge = score, (ind1, ind2) newAttr.get_value_from.lookupTable[ind1] = oldInd1 if bestMerge: ind1, ind2 = bestMerge currScore = bestScore for i, l in enumerate(newAttr.get_value_from.lookupTable): if not l.isSpecial() and int(l) == ind1: newAttr.get_value_from.lookupTable[i] = ind2 newAttr.values[ind2] = newAttr.values[ind2] + "+" + newAttr.values[ind1] del activeValues[activeValues.index(ind1)] break if not remove_unused_values: return newAttr reducedAttr = orange.EnumVariable(newAttr.name, values = [newAttr.values[i] for i in activeValues]) reducedAttr.get_value_from = newAttr.get_value_from reducedAttr.get_value_from.class_var = reducedAttr return reducedAttr ###### # from orngFSS def score_all(data, measure=Relief(k=20, m=50)): """Assess the quality of features using the given measure and return a sorted list of tuples (feature name, measure). :type data: :obj:Orange.data.table :param measure: feature scoring function, derived from :obj:Orange.feature.scoring.Measure. Defaults to :obj:Orange.feature.scoring.Relief with k=20 and m=50. :type measure: :obj:Orange.feature.scoring.Measure
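The score_all pattern, score every feature and sort by decreasing score, can be sketched in plain Python (the info-gain scorer and data layout here are illustrative, not Orange's):

```python
from math import log
from collections import Counter, defaultdict

def _entropy(counts):
    total = float(sum(counts))
    return -sum(c / total * log(c / total, 2) for c in counts if c)

def _info_gain(column, classes):
    n = float(len(classes))
    by_value = defaultdict(Counter)
    for v, c in zip(column, classes):
        by_value[v][c] += 1
    base = _entropy(list(Counter(classes).values()))
    return base - sum(sum(cnt.values()) / n * _entropy(list(cnt.values()))
                      for cnt in by_value.values())

def score_all(rows, feature_names):
    """Score every feature (all columns but the last, which holds the
    class) and return (name, score) pairs sorted by decreasing score."""
    classes = [r[-1] for r in rows]
    scored = [(name, _info_gain([r[i] for r in rows], classes))
              for i, name in enumerate(feature_names)]
    scored.sort(key=lambda ns: -ns[1])
    return scored

rows = [("a", "x", "y"), ("a", "x", "y"), ("b", "x", "n"), ("b", "y", "n")]
ranking = score_all(rows, ["f1", "f2"])
# f1 separates the classes perfectly, so it ranks first
```

The sorted (name, score) list is exactly the shape that the selection helpers in the next module consume.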
• ## orange/Orange/feature/selection.py

 r8042 """ ######################### Selection (selection) ######################### .. index:: feature selection import Orange.core as orange from Orange.feature.scoring import score_all # from orngFSS def bestNAtts(scores, N): """Return the best N features (without scores) from the list returned by :obj:Orange.feature.scoring.score_all. :param scores: a list such as returned by :obj:Orange.feature.scoring.score_all :type scores: list :param N: number of best features to select. def attsAboveThreshold(scores, threshold=0.0): """Return features (without scores) from the list returned by :obj:Orange.feature.scoring.score_all with score above or equal to a specified threshold. :param scores: a list such as returned by :obj:Orange.feature.scoring.score_all :type scores: list :param threshold: score threshold for attribute selection. Defaults to 0. :type data: Orange.data.table :param scores: a list such as returned by :obj:Orange.feature.scoring.score_all :type scores: list :param N: number of features to select """Construct and return a new set of examples that includes a class and features from the list returned by :obj:Orange.feature.scoring.score_all that have a score above or equal to a specified threshold. :type data: Orange.data.table :param scores: a list such as returned by :obj:Orange.feature.scoring.score_all :type scores: list :param threshold: score threshold for attribute selection. Defaults to 0. """ measl = score_all(data, measure) while len(data.domain.attributes)>0 and measl[-1][1]
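Selecting from the scored list is then straightforward; a plain-Python sketch of the bestNAtts / attsAboveThreshold idea (function names and scores here are illustrative):

```python
def best_n(scores, n):
    """Names of the n best features from a (name, score) list
    that is already sorted by decreasing score."""
    return [name for name, _score in scores[:n]]

def above_threshold(scores, threshold=0.0):
    """Names of features whose score is at least the threshold."""
    return [name for name, score in scores if score >= threshold]

scores = [("physician-fee-freeze", 0.43), ("synfuels", 0.21), ("mx-missile", 0.06)]
top_two  = best_n(scores, 2)             # ['physician-fee-freeze', 'synfuels']
filtered = above_threshold(scores, 0.1)  # ['physician-fee-freeze', 'synfuels']
```

Both helpers only filter the list; building a reduced data table from the kept names is a separate step, as in the functions above.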
• ## orange/OrangeCanvas/orngDoc.py

 r8052 return #with self.signalManager.freeze(): while widget.inLines != []: self.removeLine1(widget.inLines[0]) while widget.outLines != []: self.removeLine1(widget.outLines[0]) self.signalManager.removeWidget(widget.instance) self.widgets.remove(widget)
• ## orange/doc/Orange/rst/Orange.feature.rst

 r8264 ##################### Feature (feature) ##################### .. automodule:: Orange.feature .. toctree:: :maxdepth: 2 Orange.feature.scoring Orange.feature.selection Orange.feature.discretization Orange.feature.continuization Orange.feature.imputation
• ## orange/doc/Orange/rst/code/majority-classification.py

 r8042 # Description: Shows how to "learn" the majority class and compare other classifiers to the default classification # Category:    default classification accuracy, statistics # Classes:     MajorityLearner, Orange.evaluation.testing.cross_validation # Uses:        monks-1 # Referenced:  majority.htm import Orange table = Orange.data.Table("monks-1") learners = [treeLearner, bayesLearner, majorityLearner] res = Orange.evaluation.testing.cross_validation(learners, table) CAs = Orange.evaluation.scoring.CA(res, reportSE=True) print "Tree:    %5.3f+-%5.3f" % CAs[0]
• ## orange/doc/Orange/rst/code/mean-regression.py

 r8042 learners = [treeLearner, meanLearner] res = Orange.evaluation.testing.cross_validation(learners, table) MSEs = Orange.evaluation.scoring.MSE(res)
• ## orange/doc/Orange/rst/code/scoring-all.py

 r8042 # Category:    feature scoring # Uses:        voting # Referenced:  Orange.feature.scoring # Classes:     Orange.feature.scoring.score_all, Orange.feature.scoring.Relief import Orange table = Orange.data.Table("voting") def print_best_3(ma): for m in ma[:3]: print "%5.3f %s" % (m[1], m[0]) print 'Feature scores for best three features (with score_all):' ma = Orange.feature.scoring.score_all(table) print_best_3(ma) print print 'Feature scores for best three features (scored individually):' meas = Orange.feature.scoring.Relief(k=20, m=50) mr = [ (a.name, meas(a, table)) for a in table.domain.attributes ] mr.sort(key=lambda x: -x[1]) #sort decreasingly by the score print_best_3(mr)
• ## orange/doc/Orange/rst/code/scoring-diff-measures.py

 r8042 # Uses:        measure # Referenced:  Orange.feature.html#scoring # Classes:     Orange.features.scoring.Info, Orange.features.scoring.GainRatio, Orange.features.scoring.Gini, Orange.features.scoring.Relevance, Orange.features.scoring.Cost, Orange.features.scoring.Relief import Orange print fstr % (("- no unknowns:",) + tuple([meas(i, table) for i in range(attrs)])) meas.unknowns_treatment = meas.IgnoreUnknowns print fstr % (("- ignore unknowns:",) + tuple([meas(i, table2) for i in range(attrs)])) meas.unknowns_treatment = meas.ReduceByUnknowns print fstr % (("- reduce unknowns:",) + tuple([meas(i, table2) for i in range(attrs)])) meas.unknowns_treatment = meas.UnknownsToCommon print fstr % (("- unknowns to common:",) + tuple([meas(i, table2) for i in range(attrs)])) meas.unknowns_treatment = meas.UnknownsAsValue print fstr % (("- unknowns as value:",) + tuple([meas(i, table2) for i in range(attrs)])) print
• ## orange/doc/Orange/rst/code/scoring-info-iris.py

 r8042 meas = Orange.feature.scoring.Relief() for t in meas.threshold_function("petal length", table): print "%5.3f: %5.3f" % t thresh, score, distr = meas.best_threshold("petal length", table) print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)
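The threshold scan behind threshold_function and best_threshold can be sketched in plain Python (illustrative only; information gain is assumed as the score here, while Orange uses whichever measure the instance implements):

```python
from math import log

def _entropy(counts):
    total = float(sum(counts))
    return -sum(c / total * log(c / total, 2) for c in counts if c)

def _binary_gain(values, classes, threshold):
    """Information gain of binarizing the feature at the threshold."""
    left  = [c for v, c in zip(values, classes) if v <= threshold]
    right = [c for v, c in zip(values, classes) if v > threshold]
    n = float(len(classes))
    def cnts(group):
        return [group.count(c) for c in set(group)]
    base = _entropy(cnts(classes))
    return base - sum(len(g) / n * _entropy(cnts(g)) for g in (left, right) if g)

def best_threshold(values, classes):
    """Try the midpoint between each pair of adjacent distinct values
    and return (threshold, score) with the highest score."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return max(((t, _binary_gain(values, classes, t)) for t in candidates),
               key=lambda ts: ts[1])

values  = [1.0, 1.2, 3.9, 4.1]
classes = ["setosa", "setosa", "virginica", "virginica"]
t, score = best_threshold(values, classes)  # t around 2.55, score 1.0
```

Evaluating every candidate cut is also why this is expensive for ReliefF, as noted earlier.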
• ## orange/doc/Orange/rst/code/scoring-info-lenses.py

 r8042 # Classes:     Orange.feature.scoring.Measure, Orange.features.scoring.Info import Orange, random table = Orange.data.Table("lenses") print "Information gain of 'astigmatic': %6.4f" % meas(astigm, table) classdistr = Orange.statistics.distribution.Distribution(table.domain.class_var, table) cont = Orange.statistics.contingency.VarClass("tear_rate", table) print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr) dcont = Orange.statistics.contingency.Domain(table) print "Information gain of the first attribute: %6.4f" % meas(0, dcont) print print "Computing information gain from DomainContingency" dcont = Orange.statistics.contingency.Domain(table) print fstr % (("- by attribute number:",) + tuple([meas(i, dcont) for i in range(attrs)])) cdist = Orange.statistics.distribution.Distribution(table.domain.class_var, table) print fstr % (("- by attribute number:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in range(attrs)])) print fstr % (("- by attribute name:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in names])) print fstr % (("- by attribute descriptor:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in table.domain.attributes])) print values = ["v%i" % i for i in range(len(table.domain[2].values)*len(table.domain[3].values))] cartesian = Orange.data.variable.Discrete("cart", values = values) cartesian.get_value_from = Orange.classification.lookup.ClassifierByLookupTable(cartesian, table.domain[2], table.domain[3], values) print "Information gain of Cartesian product of %s and %s: %6.4f" % (table.domain[2].name, table.domain[3].name, meas(cartesian, table)) mid = Orange.core.newmetaid() table.domain.add_meta(mid, Orange.data.variable.Discrete(values = ["v0", "v1"])) table.add_meta_attribute(mid) rg = random.Random() rg.seed(0) for ex in table: ex[mid] = Orange.data.Value(rg.randint(0, 1)) print "Information gain for a random meta attribute: %6.4f" % meas(mid, table)
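The Cartesian-product feature built in the lenses script above only needs the joint values of the two source features; a plain-Python sketch of the construction (illustrative, not Orange's lookup classifier):

```python
def cartesian_feature(col_a, col_b):
    """Combine two discrete columns into one whose values are the
    pairs of original values."""
    return ["%s-%s" % (a, b) for a, b in zip(col_a, col_b)]

astigmatic = ["yes", "yes", "no", "no"]
tear_rate  = ["normal", "reduced", "normal", "reduced"]
combined = cartesian_feature(astigmatic, tear_rate)
# ['yes-normal', 'yes-reduced', 'no-normal', 'no-reduced']
```

The combined column can then be scored like any other discrete feature, which is what the script does via a lookup-table classifier.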
• ## orange/doc/Orange/rst/code/scoring-regression.py

 r8042 # Classes:     Orange.feature.scoring.MSE import Orange, random data = Orange.data.Table("measure-c") print fstr % (("- no unknowns:",) + tuple([meas(i, data) for i in range(attrs)])) meas.unknowns_treatment = meas.IgnoreUnknowns print fstr % (("- ignore unknowns:",) + tuple([meas(i, data2) for i in range(attrs)])) meas.unknowns_treatment = meas.ReduceByUnknowns print fstr % (("- reduce unknowns:",) + tuple([meas(i, data2) for i in range(attrs)])) meas.unknowns_treatment = meas.UnknownsToCommon print fstr % (("- unknowns to common:",) + tuple([meas(i, data2) for i in range(attrs)])) print
• ## orange/doc/Orange/rst/code/scoring-relief-caching.py

 r7510 # Description: Shows why ReliefF needs to check the cached neighbours # Category:    feature scoring # Classes:     Orange.feature.scoring.Relief # Uses:        iris # Referenced:  Orange.feature.html#scoring import orange r1 = orange.MeasureAttribute_relief() r2 = orange.MeasureAttribute_relief(check_cached_data = False) print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
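The caching trap this script demonstrates can be reproduced with any memoised scorer; a plain-Python sketch of checksum-guarded caching (all names are illustrative, not Orange's internals):

```python
import hashlib

def _checksum(rows):
    """A cheap fingerprint of the data; changes when the data does."""
    return hashlib.md5(repr(rows).encode("utf8")).hexdigest()

class CachedScorer:
    """Recompute scores only when the data's checksum changes;
    with check_cached_data=False, stale results are returned."""
    def __init__(self, scorer, check_cached_data=True):
        self.scorer = scorer
        self.check = check_cached_data
        self._key = None
        self._cache = {}

    def __call__(self, index, rows):
        key = _checksum(rows) if self.check else "static"
        if key != self._key:
            self._key = key
            self._cache = {i: self.scorer(i, rows)
                           for i in range(len(rows[0]) - 1)}
        return self._cache[index]

# toy scorer: fraction of rows where the feature equals the class
toy = lambda i, rows: sum(r[i] == r[-1] for r in rows) / float(len(rows))

safe = CachedScorer(toy)
rows = [[1, 1], [0, 0]]
first = safe(0, rows)     # 1.0
rows[0][0] = 0
second = safe(0, rows)    # 0.5 -- the checksum changed, so it recomputed
```

With check_cached_data disabled, the second call would return the stale 1.0, which is exactly the failure mode the Iris script above shows.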
• ## orange/doc/Orange/rst/code/scoring-relief-gainRatio.py

 r8042 # Uses:        voting # Referenced:  Orange.feature.html#scoring # Classes:     Orange.feature.scoring.score_all, Orange.features.scoring.GainRatio import Orange print 'Relief GainRt Feature' ma_def = Orange.feature.scoring.score_all(table) gr = Orange.feature.scoring.GainRatio() ma_gr  = Orange.feature.scoring.score_all(table, gr) for i in range(5): print "%5.3f  %5.3f  %s" % (ma_def[i][1], ma_gr[i][1], ma_def[i][0])
• ## orange/doc/Orange/rst/code/selection-bayes.py

 r8042 # Uses:        voting # Referenced:  Orange.feature.html#selection # Classes:     Orange.feature.scoring.score_all, Orange.feature.selection.bestNAtts import Orange class BayesFSS(object): def __call__(self, table, weight=None): ma = Orange.feature.scoring.score_all(table) filtered = Orange.feature.selection.selectBestNAtts(table, ma, self.N) model = Orange.classification.bayes.NaiveLearner(filtered) return self.classifier(example, resultType) # test the above wrapper on a data set table = Orange.data.Table("voting") learners = (Orange.classification.bayes.NaiveLearner(name='Naive Bayes'), BayesFSS(name="with FSS")) results = Orange.evaluation.testing.cross_validation(learners, table) # output the results print "Learner      CA" for i in range(len(learners)): print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])
• ## orange/doc/Orange/rst/code/selection-best3.py

 r7319 # Uses:        voting # Referenced:  Orange.feature.html#selection # Classes:     Orange.feature.scoring.score_all, Orange.feature.selection.bestNAtts import Orange n = 3 ma = Orange.feature.scoring.score_all(table) best = Orange.feature.selection.bestNAtts(ma, n) print 'Best %d features:' % n
• ## orange/doc/catalog-rst/rst/orange_theme/footer.html

 r8264

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r7013
• ## orange/doc/catalog-rst/rst/orange_theme/static/footer.html

 r8264

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r7013
• ## orange/doc/catalog/Classify/RandomForest.htm

 r6129

Random forest is a classification technique proposed by Leo Breiman (2001) that, given a set of class-labeled data, builds a set of classification trees. Each tree is developed from a bootstrap sample of the training data. When developing individual trees, an arbitrary subset of attributes is drawn (hence the term "random") from which the best attribute for the split is selected. The classification is based on the majority vote from the individually developed tree classifiers in the forest.

The Random Forest widget provides a GUI to Orange's own implementation of random forest (orngEnsemble module). The widget outputs the learner, and, given training data on its input, the random forest. An additional output channel is provided for a selected classification tree (from the forest) for the purpose of visualization or further analysis.
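The bootstrap-and-vote scheme described above can be sketched in plain Python (illustrative only; real decision trees are replaced here by stand-in classifiers):

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) examples with replacement, as each tree would."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(classifiers, example):
    """Classify by the most common prediction among the trees."""
    votes = Counter(clf(example) for clf in classifiers)
    return votes.most_common(1)[0][0]

# stand-in 'trees': constant classifiers instead of trained trees
forest = [lambda ex: "democrat", lambda ex: "democrat",
          lambda ex: "republican"]
prediction = majority_vote(forest, example=None)  # 'democrat'
```

In the real algorithm each classifier is a tree trained on its own bootstrap sample with a random attribute subset per split; only the aggregation step is shown here.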

• ## orange/doc/catalog/Visualize/SurveyPlot.htm

 r5811

The implementation in Orange supports sorting by two selected attributes (Sorting). The attributes shown in the plot are listed in the Shown attributes box; all others appear in the list of Hidden attributes.

Below is a snapshot of the Survey Plot widget on the Iris data set. The plot nicely shows that petal width, petal length and sepal length are correlated. It is also very clear that Iris-setosa can be classified based on petal length or width alone, while for Iris versicolor and virginica there is some ambiguity, with some potential outliers, one of which is highlighted in the snapshot.

• ## orange/doc/catalog/iconlist.html

 r5811


Catalog of widgets

Data

Visualize

Classify

Regression

Evaluate

Associate

Unsupervised

File Data Table Select Attributes Rank
Purge Domain Merge Data Concatenate Data Sampler
Select Data Save Discretize Continuize
Impute Outliers
Distributions Scatterplot Scatterplot matrix Attribute Statistics
Linear Projection Radviz Polyviz Parallel coordinates
Survey Plot Correspondence Analysis Time Data Visualizer Mosaic Display
Sieve Diagram Sieve multigram
Naive Bayes SVM Logistic Regression Majority
Classification Tree Viewer Classification Tree Graph CN2 Rules Viewer k Nearest Neighbours
Nomogram Classification Tree CN2 Random Forest
C4.5 Interactive Tree Builder
Regression Tree Regression Tree Graph Pade
Confusion Matrix ROC Analysis Lift Curve Calibration Plot
Test Learners Predictions
Association Rules Itemsets Itemsets Explorer Association Rules Filter
Association Rules Explorer
Distance File Matrix Transformation Save Distance File Distance Matrix Filter
Distance Map Example Distance Attribute Distance Hierarchical Clustering
Interaction Graph K-Means Clustering MDS Network File
Net Explorer Network from Distances SOM SOMVisualizer
• ## orange/doc/catalog/path.htm

 r5664 Catalog
• ## orange/doc/modules/orngClustering.htm

 r8264 sample = data.selectref(orange.MakeRandomIndices2(data, 20), 0) root = orngClustering.hierarchicalClustering(sample) dendrogram = orngClustering.dendrogram_draw("hclust-dendrogram.png", root, sample, labels=[str(d.getclass()) for d in sample]) orngClustering.dendrogram_draw("hclust-dendrogram.png", root, data=sample, labels=[str(d.getclass()) for d in sample])
• ## orange/doc/modules/path.htm

 r892 Modules
• ## orange/doc/ofb-rst/rst/orange_theme/footer.html

 r8264

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r6999
• ## orange/doc/ofb-rst/rst/orange_theme/static/footer.html

 r8264

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r7008
• ## orange/doc/ofb/default.htm

 r6538

 r892
Home

Documentation

Orange for Beginners

• ## orange/doc/ofb/path.htm

 r892 Orange for Beginners
• ## orange/doc/path.htm

 r892 Documentation
• ## orange/doc/reference/C45Learner.htm

 r6538 Download the sources from Rule Quest's site (http://www.rulequest.com/) and extract them into some temporary directory. The files will be modified in the build process, so don't use the copy of Quinlan's sources that you need for another purpose.
• Download buildC45.zip and unzip its contents into the directory R8/Src of Quinlan's sources (the directory that contains, for instance, the file average.c).
• Run buildC45.py, which will build the plug-in and put it next to orange.pyd (or orange.so on Linux/Mac).
• ## orange/doc/reference/path.htm

 r892 Reference Guide
• ## orange/doc/sphinx-ext/themes/orange_theme/footer.html

 r7026

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r8264
• ## orange/doc/sphinx-ext/themes/orange_theme/static/footer.html

 r7026

Home

Screenshots
Feature list
Extensions

Extensions
Subversion
Orange 1.0 (old)

News & Support

Blog
Forum

Documentation

Widget catalog
Widget catalog (Orange 1.0)
Data sets

Scripting

Quick start
Reference
Modules
Widget development
Example scripts

 r7026
• ## orange/doc/widgets/api.htm

 r6538

Note: This documentation is not complete; we provide it as is and are working to update it. Until then, you should find most information about widget APIs in the Tutorial.

• ## orange/doc/widgets/basics.htm

 r8264 reporting on the number of data items on the input, then does the data sampling using Orange's routines (see the chapter on Random Sampling, /doc/reference/RandomIndices.htm, in the Orange Reference Guide for more), and updates the interface reporting on the number of sampled instances. Finally, the

Now for the real test. We put the File widget on the schema (from the Data pane) and read the iris.tab data set (or any that comes handy; if you can find none, download iris from Orange's data set repository). We also put our Data Sampler widget on the pane and open it (double click on the icon, or right-click and choose

• ## orange/doc/widgets/channels.htm

 r6538 the rest of the widget does some simple GUI management, and calls learning curve routines from orngTest (/doc/modules/orngTest.htm) and performance scoring functions from orngStat (/doc/modules/orngStat.htm). I rather like the ease with which new scoring functions are added to the widget, since all that is needed is augmenting the list

• ## orange/doc/widgets/path.htm

 r6538 Orange Widgets
• ## orange/fixes/fix_changed_names.py

 r8378 "orange.MeasureAttribute_MSE": "Orange.feature.scoring.MSE", "orngFSS.attMeasure": "Orange.feature.scoring.attMeasure", "orngFSS.attMeasure": "Orange.feature.scoring.score_all", "orngFSS.bestNAtts": "Orange.feature.selection.bestNAtts", "orngFSS.attsAbovethreshold": "Orange.feature.selection.attsAbovethreshold",
• ## orange/orngEvalAttr.py

 r8042 ### Janez 03-02-14: Added weights ### Inform Blaz and remove this comment from Orange.feature.scoring import * from Orange.feature.scoring import * mergeAttrValues = merge_values MeasureAttribute_MDL = MDL MeasureAttribute_MDLClass = MDLClass MeasureAttribute_Distance = Distance MeasureAttribute_DistanceClass = DistanceClass OrderAttributesByMeasure = OrderAttributes
• ## orange/orngFSS.py

 r8042 #This was in the old module attsAbovethreshold = attsAboveThreshold attMeasure = score_all
• ## source/orange/distvars.cpp

 r8265 float ri = randomGenerator->randfloat(abs); const_iterator di(begin()); while (ri > (*di).first) ri -= (*(di++)).first; return (*di).second; while (ri > (*di).second) ri -= (*(di++)).second; return (*di).first; } float ri = (random & 0x7fffffff) / float(0x7fffffff); const_iterator di(begin()); while (ri > (*di).first) ri -= (*(di++)).first; return (*di).second; while (ri > (*di).second) ri -= (*(di++)).second; return (*di).first; }
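The fix above swaps `.first` and `.second`: the distribution is stored as (value, weight) pairs, so the cumulative walk must subtract the weights and return the value, not the other way around. A minimal sketch of the corrected sampling logic in Python (the function name and pair layout are illustrative, not Orange's actual API):

```python
import random

def sample_discrete(pairs, rnd=random.random):
    """Sample a value from a list of (value, weight) pairs.

    Draws a random number in [0, total_weight) and walks the pairs,
    subtracting each weight, returning the value whose weight
    interval the random number falls into -- the same cumulative
    walk as the corrected C++ loop.
    """
    total = sum(weight for _, weight in pairs)
    ri = rnd() * total
    for value, weight in pairs:
        if ri <= weight:        # landed inside this value's interval
            return value
        ri -= weight            # skip past this interval
    return pairs[-1][0]         # guard against floating-point drift
```

Subtracting `.first` (the value) instead of `.second` (the weight), as the old code did, skews the walk whenever values and weights differ, which is why the bug produced wrong samples rather than a crash.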
• ## source/orange/earth.cpp

 r8396 threshold = 0.001; prune = true; trace = 3; trace = 0.0; min_span = 0; fast_k = 20; PEarthClassifier classifier = mlnew TEarthClassifier(examples->domain, best_set, dirs, cuts, betas, num_preds, num_responses, num_terms, max_terms); std::string str = classifier->format_earth(); //  std::string str = classifier->format_earth(); // Free memory } TEarthClassifier::TEarthClassifier(PDomain _domain, bool * _best_set, int * _dirs, double * _cuts, double *_betas, int _num_preds, int _num_responses, int _num_terms, int _max_terms) TEarthClassifier::TEarthClassifier(PDomain _domain, bool * best_set, int * dirs, double * cuts, double *betas, int _num_preds, int _num_responses, int _num_terms, int _max_terms) { domain = _domain; classVar = domain->classVar; best_set = _best_set; dirs = _dirs; cuts = _cuts; betas = _betas; _best_set = best_set; _dirs = dirs; _cuts = cuts; _betas = betas; num_preds = _num_preds; num_responses = _num_responses; max_terms = _max_terms; computesProbabilities = false; init_members(); } TEarthClassifier::TEarthClassifier() { domain = NULL; classVar = NULL; _best_set = NULL; _dirs = NULL; _cuts = NULL; _betas = NULL; num_preds = 0; num_responses = 0; num_terms = 0; max_terms = 0; computesProbabilities = false; } } double round(double r) { return floor(r + 0.5); TEarthClassifier::~TEarthClassifier() { if (_best_set) free(_best_set); if (_dirs) free(_dirs); if (_cuts) free(_cuts); if (_betas) free(_betas); } double *x = to_xvector(example); double y = 0.0; PredictEarth(&y, x, best_set, dirs, cuts, betas, num_preds, num_responses, num_terms, max_terms); PredictEarth(&y, x, _best_set, _dirs, _cuts, _betas, num_preds, num_responses, num_terms, max_terms); free(x); if (classVar->varType == TValue::INTVAR) return TValue((int) std::max(0.0, round(y))); return TValue((int) std::max(0.0, floor(y + 0.5))); else return TValue((float) y); std::string TEarthClassifier::format_earth(){ FormatEarth(best_set, dirs, cuts, betas, num_preds, 1, num_terms, 
max_terms, 3, 0.0); FormatEarth(_best_set, _dirs, _cuts, _betas, num_preds, 1, num_terms, max_terms, 3, 0.0); // TODO: FormatEarth to a string. return ""; } TEarthClassifier::~TEarthClassifier() { free(best_set); free(dirs); free(cuts); free(betas); } } PBoolList TEarthClassifier::get_best_set() { PBoolList list = mlnew TBoolList(); for (bool * p=_best_set; p < _best_set + max_terms; p++) list->push_back(*p); return list; } PFloatListList TEarthClassifier::get_dirs() { PFloatListList list = mlnew TFloatListList(); for (int i=0; i < num_terms; i++) { PFloatList inner_list = mlnew TFloatList(); for (int j=0; j < num_preds; j++) inner_list->push_back(_dirs[i + j*max_terms]); list->push_back(inner_list); } return list; } PFloatListList TEarthClassifier::get_cuts() { PFloatListList list = mlnew TFloatListList(); for (int i=0; i < num_terms; i++) { PFloatList inner_list = mlnew TFloatList(); for (int j=0; j < num_preds; j++) inner_list->push_back(_cuts[i + j*max_terms]); list->push_back(inner_list); } return list; } PFloatList TEarthClassifier::get_betas() { PFloatList list = mlnew TFloatList(); for (double * p=_betas; p < _betas + max_terms; p++) list->push_back((float)*p); return list; } void TEarthClassifier::init_members() { best_set = get_best_set(); dirs = get_dirs(); cuts = get_cuts(); betas = get_betas(); } void TEarthClassifier::save_model(TCharBuffer& buffer) { buffer.writeInt(max_terms); buffer.writeInt(num_terms); buffer.writeInt(num_preds); buffer.writeInt(num_responses); buffer.writeBuf((void *) _best_set, sizeof(bool) * max_terms); buffer.writeBuf((void *) _dirs, sizeof(int) * max_terms * num_preds); buffer.writeBuf((void *) _cuts, sizeof(double) * max_terms * num_preds); buffer.writeBuf((void *) _betas, sizeof(double) * max_terms * num_responses); } void TEarthClassifier::load_model(TCharBuffer& buffer) { if (max_terms) raiseError("Cannot overwrite a model"); max_terms = buffer.readInt(); num_terms = buffer.readInt(); num_preds = buffer.readInt(); num_responses = buffer.readInt(); _best_set = (bool *) calloc(max_terms, sizeof(bool)); _dirs = (int *) calloc(max_terms * num_preds, sizeof(int)); _cuts = (double *) calloc(max_terms * num_preds, sizeof(double)); _betas =
(double *) calloc(max_terms * num_responses, sizeof(double)); buffer.readBuf((void *) _best_set, sizeof(bool) * max_terms); buffer.readBuf((void *) _dirs, sizeof(int) * max_terms * num_preds); buffer.readBuf((void *) _cuts, sizeof(double) * max_terms * num_preds); buffer.readBuf((void *) _betas, sizeof(double) * max_terms * num_responses); init_members(); }
• ## source/orange/earth.hpp

 r8378 }; #include "slist.hpp" class ORANGE_API TEarthClassifier: public TClassifierFD { public: __REGISTER_CLASS TEarthClassifier() {}; TEarthClassifier(); TEarthClassifier(PDomain domain, bool * best_set, int * dirs, double * cuts, double *betas, int num_preds, int num_responses, int num_terms, int max_terms); TEarthClassifier(const TEarthClassifier & other); virtual ~TEarthClassifier(); std::string format_earth(); int num_preds; //P int num_terms; //P int max_terms; //P int num_responses; //P int num_preds; //P Number of predictor variables int num_terms; //P Number of used terms int max_terms; //P Maximum number of terms int num_responses; //P Number of response variables PBoolList best_set; //P Used terms. PFloatListList dirs; //P max_preds x num_preds matrix PFloatListList cuts; //P max_preds x num_preds matrix of cuts PFloatList betas; //P Term coefficients; void save_model(TCharBuffer& buffer); void load_model(TCharBuffer& buffer); private: PBoolList get_best_set(); PFloatListList get_dirs(); PFloatListList get_cuts(); PFloatList get_betas(); void init_members(); double* to_xvector(const TExample&); bool* best_set; int * dirs; double * cuts; double * betas; bool* _best_set; int * _dirs; double * _cuts; double * _betas; };
• ## source/orange/lib_learner.cpp

 r8378 } PyObject *EarthClassifier__reduce__(PyObject *self) PYARGS(METH_VARARGS, "") { PyTRY CAST_TO(TEarthClassifier, classifier); TCharBuffer buffer(1024); classifier->save_model(buffer); return Py_BuildValue("O(s#)N", getExportedFunction("__pickleLoaderEarthClassifier"), buffer.buf, buffer.length(), packOrangeDictionary(self)); PyCATCH } PyObject *__pickleLoaderEarthClassifier(PyObject *self, PyObject *args) PYARGS(METH_VARARGS, "(buffer)") { PyTRY char * cbuf = NULL; int buffer_size = 0; if (!PyArg_ParseTuple(args, "s#:__pickleLoaderEarthClassifier", &cbuf, &buffer_size)) return NULL; TCharBuffer buffer(cbuf); PEarthClassifier classifier = mlnew TEarthClassifier(); classifier->load_model(buffer); return WrapOrange(classifier); PyCATCH } /************* BAYES ************/
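The `EarthClassifier__reduce__` hook above makes the classifier picklable by serializing its model into a byte buffer and registering a module-level loader (`__pickleLoaderEarthClassifier`) that rebuilds it on unpickling. The same `__reduce__` pattern can be sketched in pure Python; the class and its buffer format here are illustrative toys, not Orange's actual serialization:

```python
import pickle
import struct

class TinyModel:
    """Toy stand-in for a classifier whose state lives in a packed buffer."""
    def __init__(self, coeffs=None):
        self.coeffs = list(coeffs) if coeffs else []

    def save_model(self):
        # Pack the coefficient count, then the coefficients as doubles.
        return struct.pack("i", len(self.coeffs)) + struct.pack(
            "%dd" % len(self.coeffs), *self.coeffs)

    def load_model(self, buf):
        (n,) = struct.unpack_from("i", buf)
        self.coeffs = list(
            struct.unpack_from("%dd" % n, buf, struct.calcsize("i")))

    def __reduce__(self):
        # Like EarthClassifier__reduce__: return (loader, (buffer,)) so
        # unpickling calls the loader with the serialized model bytes.
        return (_load_tiny_model, (self.save_model(),))

def _load_tiny_model(buf):
    # Module-level loader, the counterpart of __pickleLoaderEarthClassifier.
    model = TinyModel()
    model.load_model(buf)
    return model

m = TinyModel([1.5, -2.0])
m2 = pickle.loads(pickle.dumps(m))
```

Routing the state through an explicit save/load buffer, rather than pickling attributes directly, is what lets the C++ side own the memory layout while Python only shuttles opaque bytes.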
• ## testing/regressionTests/results/orange25/mean-regression.py.txt

 r7749 Tree:    18.659 Tree:    18.834 Default: 84.777