# Changes in [9819:e11e2ff31f47:9821:6cc715432fa7] in orange

Files:
2 deleted
49 edited

## Orange/classification/logreg.py

r9671

"""
.. index: logistic regression

.. index:
   single: classification; logistic regression

********************************
Logistic regression (logreg)
********************************

Implements `logistic regression`_ with an extension for proper treatment of
discrete features. The algorithm can handle various anomalies in features,
such as constant variables and singularities, that could make fitting of
logistic regression almost impossible. Stepwise logistic regression, which
iteratively selects the most informative features, is also supported.

Logistic regression is a popular classification method that comes from
statistics. The model is described by a linear combination of coefficients,

.. math::

    F = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\ldots + \\beta_k X_k

and the probability (p) of a class value is computed as:

.. math::

    p = \\frac{\\exp(F)}{1 + \\exp(F)}

.. class:: LogRegClassifier

    :obj:`LogRegClassifier` stores estimated values of regression
    coefficients and their significances, and uses them to predict classes
    and class probabilities using the equations described above.

    .. attribute:: beta

        Estimated regression coefficients.

    .. attribute:: beta_se

        Estimated standard errors for regression coefficients.

    .. attribute:: wald_Z

        Wald Z statistics for beta coefficients. Wald Z is computed as
        beta/beta_se.

    .. attribute:: P

        List of P-values for beta coefficients, that is, the probability
        that a beta coefficient differs from 0.0. The probability is
        computed from the squared Wald Z statistic, which follows a
        chi-square distribution.

    .. attribute:: likelihood

        The probability of the sample (i.e. the learning examples) observed
        on the basis of the derived model, as a function of the regression
        parameters.

    .. attribute:: fitStatus

        Tells how the model fitting ended: either regularly
        (:obj:`LogRegFitter.OK`), or it was interrupted because one of the
        beta coefficients escaped towards infinity
        (:obj:`LogRegFitter.Infinity`), or because the values did not
        converge (:obj:`LogRegFitter.Divergence`). The value tells about
        the classifier's "reliability"; the classifier itself is useful in
        either case.

.. autoclass:: LogRegLearner

.. class:: LogRegFitter

    :obj:`LogRegFitter` is the abstract base class for logistic fitters. It
    defines the form of the call operator and the constants denoting its
    (un)success:

    .. attribute:: OK

        The fitter succeeded in converging to the optimal fit.

    .. attribute:: Infinity

        The fitter failed because one or more beta coefficients escaped
        towards infinity.

    .. attribute:: Divergence

        Beta coefficients failed to converge, but none of them escaped.

    .. attribute:: Constant

        There is a constant attribute that causes the matrix to be
        singular.

    .. attribute:: Singularity

        The matrix is singular.

    .. method:: __call__(examples, weightID)

        Performs the fitting. There are two possible cases: either the
        fitting succeeded in finding a set of beta coefficients (although
        possibly with difficulties), or the fitting failed altogether. The
        two cases return different results.

        `(status, beta, beta_se, likelihood)`
            The fitter managed to fit the model. The first element of the
            tuple, status, tells about any problems that occurred; it can
            be :obj:`OK`, :obj:`Infinity` or :obj:`Divergence`. In the
            latter two cases, the returned values may still be useful for
            making predictions, but it is recommended that you inspect the
            coefficients and their errors and decide whether to use the
            model or not.

        `(status, attribute)`
            The fitter failed and the returned attribute is responsible
            for it. The type of failure is reported in status, which can
            be either :obj:`Constant` or :obj:`Singularity`.

        The proper way of calling the fitter is to expect and handle all
        the situations described.
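The model equations and the Wald statistics above can be checked numerically. The sketch below is plain Python, independent of Orange; the `beta` and `beta_se` values are made-up illustrations, not the output of a real fit:

```python
import math

def logistic_prob(beta, x):
    # F = beta_0 + beta_1*x_1 + ... + beta_k*x_k
    F = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    # p = exp(F) / (1 + exp(F))
    return math.exp(F) / (1 + math.exp(F))

def wald_p(beta_i, se_i):
    # Wald Z = beta / beta_se; Z**2 follows a chi-square distribution with
    # one degree of freedom, so the P-value equals erfc(|Z| / sqrt(2))
    z = beta_i / se_i
    return math.erfc(abs(z) / math.sqrt(2))

# made-up intercept and two coefficients
beta = [-1.23, 0.86, 2.42]
# probability for an instance with both features set to 1
p = logistic_prob(beta, [1, 1])
```

For a fitted classifier, this is the same arithmetic that `dump` prints per row: beta, beta_se, wald_Z = beta/beta_se, P, and OR = exp(beta).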
        For instance, if fitter is an instance of some fitter and examples
        contain a set of suitable examples, a script should look like
        this::

            res = fitter(examples)
            if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
                status, beta, beta_se, likelihood = res
                < proceed by doing something with what you got >
            else:
                status, attr = res
                < remove the attribute or complain to the user or ... >

.. class:: LogRegFitter_Cholesky

    :obj:`LogRegFitter_Cholesky` is the sole fitter available at the
    moment. It is a C++ translation of `Alan Miller's logistic regression
    code`_. It uses the Newton-Raphson algorithm to iteratively minimize
    the least squares error computed from the learning examples.

.. autoclass:: StepWiseFSS

.. autofunction:: dump

Examples
--------

The first example shows a very simple induction of a logistic regression
classifier (:download:`logreg-run.py`, uses :download:`titanic.tab`).

.. literalinclude:: code/logreg-run.py

Result::

    Classification accuracy: 0.778282598819

    class attribute = survived
    class values =

        Attribute       beta  st. error     wald Z          P OR=exp(beta)

        Intercept      -1.23       0.08     -15.15      -0.00
     status=first       0.86       0.16       5.39       0.00       2.36
    status=second      -0.16       0.18      -0.91       0.36       0.85
     status=third      -0.92       0.15      -6.12       0.00       0.40
        age=child       1.06       0.25       4.30       0.00       2.89
       sex=female       2.42       0.14      17.04       0.00      11.25

The next example shows how to handle singularities in data sets
(:download:`logreg-singularities.py`, uses :download:`adult_sample.tab`).

.. literalinclude:: code/logreg-singularities.py

The first few lines of the output of this script are::

    <=50K <=50K
    <=50K <=50K
    <=50K <=50K
    >50K >50K
    <=50K >50K

    class attribute = y
    class values = <>50K, <=50K>

                               Attribute       beta  st. error     wald Z          P OR=exp(beta)

                               Intercept       6.62      -0.00       -inf       0.00
                                     age      -0.04       0.00       -inf       0.00       0.96
                                  fnlwgt      -0.00       0.00       -inf       0.00       1.00
                           education-num      -0.28       0.00       -inf       0.00       0.76
                 marital-status=Divorced       4.29       0.00        inf       0.00      72.62
            marital-status=Never-married       3.79       0.00        inf       0.00      44.45
                marital-status=Separated       3.46       0.00        inf       0.00      31.95
                  marital-status=Widowed       3.85       0.00        inf       0.00      46.96
    marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63
        marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19
                 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72

If :obj:`removeSingular` is set to 0, inducing a logistic regression
classifier would return an error::

    Traceback (most recent call last):
      File "logreg-singularities.py", line 4, in
        lr = classification.logreg.LogRegLearner(table, removeSingular=0)
      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
        return lr(examples, weightID)
      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
        lr = learner(examples, weight)
    orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked

We can see that the attribute workclass is causing a singularity.

The example below shows how the use of stepwise logistic regression can
help improve classification performance
(:download:`logreg-stepwise.py`, uses :download:`ionosphere.tab`):

.. literalinclude:: code/logreg-stepwise.py

The output of this script is::

    Learner      CA
    logistic     0.841
    filtered     0.846

    Number of times attributes were used in cross-validation:
     1 x a21
    10 x a22
     8 x a23
     7 x a24
     1 x a25
    10 x a26
    10 x a27
     3 x a28
     7 x a29
     9 x a31
     2 x a16
     7 x a12
     1 x a32
     8 x a15
    10 x a14
     4 x a17
     7 x a30
    10 x a11
     1 x a10
     1 x a13
    10 x a34
     2 x a19
     1 x a18
    10 x a3
    10 x a5
     4 x a4
     4 x a7
     8 x a6
    10 x a9
    10 x a8

"""

import math

from numpy import dot, array, identity, reshape, diagonal, \
    transpose, concatenate, sqrt, sign
from numpy.linalg import inv

import Orange
from Orange.misc import deprecated_keywords, deprecated_members
from Orange.core import LogRegClassifier, LogRegFitter, LogRegFitter_Cholesky

##########################################################################
## Print out methods

def dump(classifier):
    """ Return a formatted string of all major features in a logistic
    regression classifier.

    :param classifier: logistic regression classifier.
    """
    # print out class values
    out = ['']
    out.append("class attribute = " + classifier.domain.class_var.name)
    out.append("class values = " + str(classifier.domain.class_var.values))
    out.append('')

    # get the longest attribute name
    longest = 0
    for at in classifier.continuized_domain.features:
        if len(at.name) > longest:
            longest = len(at.name)

    # print out the head
    formatstr = "%" + str(longest) + "s %10.2f %10.2f %10.2f %10.2f"
    out.append(formatstr % ("Intercept", classifier.beta[0],
                            classifier.beta_se[0], classifier.wald_Z[0],
                            classifier.P[0]))
    formatstr = "%" + str(longest) + "s %10.2f %10.2f %10.2f %10.2f %10.2f"
    for i in range(len(classifier.continuized_domain.features)):
        out.append(formatstr % (classifier.continuized_domain.features[i].name,
                                classifier.beta[i + 1],
                                classifier.beta_se[i + 1],
                                classifier.wald_Z[i + 1],
                                abs(classifier.P[i + 1]),
                                math.exp(classifier.beta[i + 1])))

    return '\n'.join(out)


def has_discrete_values(domain):
    """ Return 1 if the given domain contains any discrete features,
    else 0.

    :param domain: domain.
    :type domain: :class:`Orange.data.Domain`
    """
    return any(at.var_type == Orange.data.Type.Discrete
               for at in domain.features)


class LogRegLearner(Orange.classification.Learner):
    """ Logistic regression learner.

    If data instances are provided to the constructor, the learning
    algorithm is called and the resulting classifier is returned instead
    of the learner.
    :param instances: data table with either discrete or continuous features
    :type instances: Orange.data.Table
    :param weight_id: the ID of the weight meta attribute
    :type weight_id: int
    :param remove_singular: set to 1 if you want automatic removal of
        disturbing features, such as constants and singularities
    :type remove_singular: bool
    :param fitter: the fitting algorithm (by default the Newton-Raphson
        fitting algorithm is used)
    :param stepwise_lr: set to 1 if you wish to use stepwise logistic
        regression
    :type stepwise_lr: bool
    :param add_crit: parameter for stepwise feature selection
    :type add_crit: float
    :param delete_crit: parameter for stepwise feature selection
    :type delete_crit: float
    :param num_features: parameter for stepwise feature selection
    :type num_features: int
    :rtype: :obj:`LogRegLearner` or :obj:`LogRegClassifier`
    """

    @deprecated_keywords({"weightID": "weight_id"})
    def __new__(cls, instances=None, weight_id=0, **argkw):
        self = Orange.classification.Learner.__new__(cls, **argkw)
        if instances:
            self.__init__(**argkw)
            return self.__call__(instances, weight_id)
        else:
            return self

    @deprecated_keywords({"removeSingular": "remove_singular"})
    def __init__(self, remove_singular=0, fitter=None, **kwds):
        self.__dict__.update(kwds)
        self.remove_singular = remove_singular
        self.fitter = fitter

    @deprecated_keywords({"examples": "instances"})
    def __call__(self, instances, weight=0):
        """Learn from the given table of data instances.

        :param instances: data instances to learn from.
        :type instances: :class:`~Orange.data.Table`
        :param weight: id of meta attribute with weights of instances
        :type weight: int
        :rtype: :class:`~Orange.classification.logreg.LogRegClassifier`
        """
        imputer = getattr(self, "imputer", None) or None
        if getattr(self, "remove_missing", 0):
            instances = Orange.core.Preprocessor_dropMissing(instances)
##        if hasDiscreteValues(examples.domain):
##            examples = createNoDiscTable(examples)
        if not len(instances):
            return None
        if getattr(self, "stepwise_lr", 0):
            add_crit = getattr(self, "add_crit", 0.2)
            delete_crit = getattr(self, "delete_crit", 0.3)
            num_features = getattr(self, "num_features", -1)
            attributes = StepWiseFSS(instances, add_crit=add_crit,
                                     delete_crit=delete_crit,
                                     imputer=imputer,
                                     num_features=num_features)
            tmp_domain = Orange.data.Domain(attributes,
                                            instances.domain.class_var)
            tmp_domain.addmetas(instances.domain.getmetas())
            instances = instances.select(tmp_domain)
        learner = Orange.core.LogRegLearner()  # Yes, it has to be from core.
        learner.imputer_constructor = imputer
        if imputer:
            instances = self.imputer(instances)(instances)
            instances = Orange.core.Preprocessor_dropMissing(instances)
        if self.fitter:
            learner.fitter = self.fitter
        if self.remove_singular:
            lr = learner.fit_model(instances, weight)
        else:
            lr = learner(instances, weight)
        while isinstance(lr, Orange.data.variable.Variable):
            if isinstance(lr.getValueFrom, Orange.core.ClassifierFromVar) and \
               isinstance(lr.getValueFrom.transformer, Orange.core.Discrete2Continuous):
                lr = lr.getValueFrom.variable
            attributes = instances.domain.features[:]
            if lr in attributes:
                attributes.remove(lr)
            else:
                attributes.remove(lr.getValueFrom.variable)
            new_domain = Orange.data.Domain(attributes,
                                            instances.domain.class_var)
            new_domain.addmetas(instances.domain.getmetas())
            instances = instances.select(new_domain)
            lr = learner.fit_model(instances, weight)
        return lr

LogRegLearner = deprecated_members({"removeSingular": "remove_singular",
                                    "weightID": "weight_id",
                                    "stepwiseLR": "stepwise_lr",
                                    "addCrit": "add_crit",
                                    "deleteCrit": "delete_crit",
                                    "numFeatures": "num_features",
                                    "removeMissing": "remove_missing"
                                    })(LogRegLearner)


class UnivariateLogRegLearner(Orange.classification.Learner):
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    @deprecated_keywords({"examples": "instances"})
    def __call__(self, instances):
        instances = createFullNoDiscTable(instances)
        classifiers = map(lambda x: LogRegLearner(
            Orange.core.Preprocessor_dropMissing(
                instances.select(Orange.data.Domain(x,
                    instances.domain.class_var)))),
            instances.domain.features)
        maj_classifier = LogRegLearner(Orange.core.Preprocessor_dropMissing(
            instances.select(Orange.data.Domain(instances.domain.class_var))))
        beta = [maj_classifier.beta[0]] + [x.beta[1] for x in classifiers]
        beta_se = [maj_classifier.beta_se[0]] + [x.beta_se[1] for x in classifiers]
        P = [maj_classifier.P[0]] + [x.P[1] for x in classifiers]
        wald_Z = [maj_classifier.wald_Z[0]] + [x.wald_Z[1] for x in classifiers]
        domain = instances.domain

        return UnivariateLogRegClassifier(beta=beta, beta_se=beta_se, P=P,
                                          wald_Z=wald_Z, domain=domain)


class UnivariateLogRegClassifier(Orange.classification.Classifier):
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, instance, resultType=Orange.classification.Classifier.GetValue):
        # classification is not implemented yet; for now its use is only to
        # provide regression coefficients and their statistics
        pass
        return self


class LogRegLearnerGetPriors(object):
    @deprecated_keywords({"removeSingular": "remove_singular"})
    def __init__(self, remove_singular=0, **kwds):
        self.__dict__.update(kwds)
        self.remove_singular = remove_singular

    @deprecated_keywords({"examples": "instances"})
    def __call__(self, instances, weight=0):
        # the next function changes the data set to one extended with
        # unknown values
        def createLogRegExampleTable(data, weight_id):
            sets_of_data = []
            for at in data.domain.features:
                # for each attribute create a new example table new_data;
                # add a new attribute -- a continuous variable -- to
                # dataOrig, dataFinal and new_data
                if at.var_type == Orange.data.Type.Continuous:
                    at_disc = Orange.data.variable.Continuous(at.name + "Disc")
                    new_domain = Orange.data.Domain(data.domain.features +
                        [at_disc, data.domain.class_var])
                    new_domain.addmetas(data.domain.getmetas())
                    new_data = Orange.data.Table(new_domain, data)
                    alt_data = Orange.data.Table(new_domain, data)
                    for i, d in enumerate(new_data):
                        d[at_disc] = 0
                        d[weight_id] = 1 * data[i][weight_id]
                    for i, d in enumerate(alt_data):
                        d[at_disc] = 1
                        d[at] = 0
                        d[weight_id] = 0.000001 * data[i][weight_id]
                elif at.var_type == Orange.data.Type.Discrete:
                    # in dataOrig, dataFinal and new_data add one more value
                    # to attribute "at", whose value is the attribute's
                    # name + "X"
                    at_new = Orange.data.variable.Discrete(at.name,
                        values=at.values + [at.name + "X"])
                    new_domain = Orange.data.Domain(
                        filter(lambda x: x != at, data.domain.features) +
                        [at_new, data.domain.class_var])
                    new_domain.addmetas(data.domain.getmetas())
                    new_data = Orange.data.Table(new_domain, data)
                    alt_data = Orange.data.Table(new_domain, data)
                    for i, d in enumerate(new_data):
                        d[at_new] = data[i][at]
                        d[weight_id] = 1 * data[i][weight_id]
                    for i, d in enumerate(alt_data):
                        d[at_new] = at.name + "X"
                        d[weight_id] = 0.000001 * data[i][weight_id]
                new_data.extend(alt_data)
                sets_of_data.append(new_data)
            return sets_of_data

        learner = LogRegLearner(
            imputer=Orange.feature.imputation.ImputerConstructor_average(),
            remove_singular=self.remove_singular)
        # get Original Model
        orig_model = learner(instances, weight)
        if orig_model.fit_status:
            print "Warning: model did not converge"

        if weight == 0:
            weight = Orange.data.new_meta_id()
            instances.addMetaAttribute(weight, 1.0)
        extended_set_of_examples = createLogRegExampleTable(instances, weight)
        extended_models = [learner(extended_examples, weight)
                           for extended_examples in extended_set_of_examples]

##        print orig_model.domain
##        print orig_model.beta
##        print orig_model.beta[orig_model.continuized_domain.features[-1]]
##        for i,m in enumerate(extended_models):
##            print examples.domain.features[i]
##            printOUT(m)

        betas_ap = []
        for m in extended_models:
            beta_add = m.beta[m.continuized_domain.features[-1]]
            betas_ap.append(beta_add)
            beta = beta + beta_add

        # compare it to the Bayes prior
        bayes = Orange.classification.bayes.NaiveLearner(instances)
        bayes_prior = math.log(bayes.distribution[1] / bayes.distribution[0])

##        print "lr", orig_model.beta[0]
##        print "lr2", logistic_prior
##        print "dist", Orange.statistics.distribution.Distribution(examples.domain.class_var,examples)
##        print "prej", betas_ap

        # return the original model and the corresponding a priori zeros
        return (orig_model, betas_ap)
        #return (bayes_prior,orig_model.beta[examples.domain.class_var],logistic_prior)

LogRegLearnerGetPriors = deprecated_members({"removeSingular":
                                             "remove_singular"}
                                            )(LogRegLearnerGetPriors)


class LogRegLearnerGetPriorsOneTable:
    @deprecated_keywords({"removeSingular": "remove_singular"})
    def __init__(self, remove_singular=0, **kwds):
        self.__dict__.update(kwds)
        self.remove_singular = remove_singular

    @deprecated_keywords({"examples": "instances"})
    def __call__(self, instances, weight=0):
        # the next function changes the data set to one extended with
        # unknown values
        def createLogRegExampleTable(data, weightID):
            finalData = Orange.data.Table(data)
            orig_data = Orange.data.Table(data)
            for at in data.domain.features:
                # for each attribute create a new example table newData;
                # add a new attribute -- a continuous variable -- to
                # dataOrig, dataFinal and newData
                if at.var_type == Orange.data.Type.Continuous:
                    atDisc = Orange.data.variable.Continuous(at.name + "Disc")
                    newDomain = Orange.data.Domain(orig_data.domain.features +
                        [atDisc, data.domain.class_var])
                    newDomain.addmetas(newData.domain.getmetas())
                    finalData = Orange.data.Table(newDomain, finalData)
                    newData = Orange.data.Table(newDomain, orig_data)
                    orig_data = Orange.data.Table(newDomain, orig_data)
                    for d in orig_data:
                        d[atDisc] = 0
                    for d in finalData:
                        d[weightID] = 100 * data[i][weightID]
                elif at.var_type == Orange.data.Type.Discrete:
                    # in dataOrig, dataFinal and newData add one more value
                    # to attribute "at", whose value is the attribute's
                    # name + "X"
                    at_new = Orange.data.variable.Discrete(at.name,
                        values=at.values + [at.name + "X"])
                    newDomain = Orange.data.Domain(
                        filter(lambda x: x != at, orig_data.domain.features) +
                        [at_new, orig_data.domain.class_var])
                    newDomain.addmetas(orig_data.domain.getmetas())
                    temp_finalData = Orange.data.Table(finalData)
                    finalData = Orange.data.Table(newDomain, finalData)
                    newData = Orange.data.Table(newDomain, orig_data)
                    temp_origData = Orange.data.Table(orig_data)
                    orig_data = Orange.data.Table(newDomain, orig_data)
                    for i, d in enumerate(orig_data):
                        d[at_new] = temp_origData[i][at]
                    for i, d in enumerate(finalData):
                        d[at_new] = temp_finalData[i][at]
                    for i, d in enumerate(newData):
                        d[at_new] = at.name + "X"
                        d[weightID] = 10 * data[i][weightID]
                    finalData.extend(newData)
            return finalData

        learner = LogRegLearner(
            imputer=Orange.feature.imputation.ImputerConstructor_average(),
            removeSingular=self.remove_singular)
        # get Original Model
        orig_model = learner(instances, weight)

        # get extended Model (you should not change data)
        if weight == 0:
            weight = Orange.data.new_meta_id()
            instances.addMetaAttribute(weight, 1.0)
        extended_examples = createLogRegExampleTable(instances, weight)
        extended_model = learner(extended_examples, weight)

        betas_ap = []
        for m in extended_models:
            beta_add = m.beta[m.continuized_domain.features[-1]]
            betas_ap.append(beta_add)
            beta = beta + beta_add

        # compare it to the Bayes prior
        bayes = Orange.classification.bayes.NaiveLearner(instances)
        bayes_prior = math.log(bayes.distribution[1] / bayes.distribution[0])

        #print "lr", orig_model.beta[0]
        #print "lr2", logistic_prior
        #print "dist", Orange.statistics.distribution.Distribution(examples.domain.class_var,examples)
        k = (bayes_prior - orig_model.beta[0]) / (logistic_prior - orig_model.beta[0])
        #print "prej", betas_ap

        # return the original model and the corresponding a priori zeros
        return (orig_model, betas_ap)
        #return (bayes_prior,orig_model.beta[data.domain.class_var],logistic_prior)

LogRegLearnerGetPriorsOneTable = deprecated_members({"removeSingular":
                                                     "remove_singular"}
                                                    )(LogRegLearnerGetPriorsOneTable)


def lh(x, y, betas):
    llh = 0.0
    for i, x_i in enumerate(x):
        pr = pr(x_i, betas)
        llh += y[i] * math.log(max(pr, 1e-6)) + \
               (1 - y[i]) * math.log(max(1 - pr, 1e-6))
    return llh


def diag(vector):
    mat = identity(len(vector))
    for i, v in enumerate(vector):
        mat[i][i] = v
    return mat


class SimpleFitter(LogRegFitter):
    def __init__(self, penalty=0, se_penalty=False):
        self.penalty = penalty
        self.se_penalty = se_penalty

    def __call__(self, data, weight=0):
        ml = data.native(0)
        for i in range(len(data.domain.features)):
            a = data.domain.features[i]
            if a.var_type == Orange.data.Type.Discrete:
                for m in ml:
                    m[i] = a.values.index(m[i])
        for m in ml:
            m[-1] = data.domain.class_var.values.index(m[-1])
        Xtmp = array(ml)
        y = Xtmp[:, -1]   # true probabilities (1's or 0's)
        one = reshape(array([1.] * len(data)), (len(data), 1))  # intercept column
        X = concatenate((one, Xtmp[:, :-1]), 1)  # intercept first, then data

        betas = array([0.0] * (len(data.domain.features) + 1))
        oldBetas = array([1.0] * (len(data.domain.features) + 1))
        N = len(data)

        pen_matrix = array([self.penalty] * (len(data.domain.features) + 1))
        if self.se_penalty:
            p = array([pr(X[i], betas) for i in range(len(data))])
            W = identity(len(data))
            pp = p * (1.0 - p)
            for i in range(N):
                W[i, i] = pp[i]
            se = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
            for i, p in enumerate(pen_matrix):
                pen_matrix[i] *= se[i]

        p = array([pr(X[i], betas) for i in range(len(data))])
        W = identity(len(data))
        pp = p * (1.0 - p)
        for i in range(N):
            W[i, i] = pp[i]
        WI = inv(W)
        z = dot(X, betas) + dot(WI, y - p)
        tmpA = inv(dot(transpose(X), dot(W, X)) + diag(pen_matrix))
        tmpB = dot(transpose(X), y - p)
        betas = oldBetas + dot(tmpA, tmpB)
#            betaTemp = dot(dot(dot(dot(tmpA,transpose(X)),W),X),oldBetas)
#            print betaTemp
#            tmpB = dot(transpose(X), dot(W, z))
#            betas = dot(tmpA, tmpB)
        likelihood_new = lh(X, y, betas) - self.penalty * sum([b * b for b in betas])
        print likelihood_new

##        XX = sqrt(diagonal(inv(dot(transpose(X),X))))
##        yhat = array([pr(X[i], betas) for i in range(len(data))])
##        ss = sum((y - yhat) ** 2) / (N - len(data.domain.features) - 1)
##        sigma = math.sqrt(ss)

        p = array([pr(X[i], betas) for i in range(len(data))])
        W = identity(len(data))
        pp = p * (1.0 - p)
        for i in range(N):
            W[i, i] = pp[i]
        diXWX = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
        xTemp = dot(dot(inv(dot(transpose(X), dot(W, X))), transpose(X)), y)
        beta = []
        beta_se = []

        return exp(bx) / (1 + exp(bx))


class BayesianFitter(LogRegFitter):
    def __init__(self, penalty=0, anch_examples=[], tau=0):
        self.penalty = penalty

    def create_array_data(self, data):
        # convert data to numeric
        ml = data.native(0)
        for i, a in enumerate(data.domain.features):
            if a.var_type == Orange.data.Type.Discrete:
                for m in ml:
                    m[i] = a.values.index(m[i])
        for m in ml:
            m[-1] = data.domain.class_var.values.index(m[-1])
        Xtmp = array(ml)
        y = Xtmp[:, -1]   # true probabilities (1's or 0's)

    def __call__(self, data, weight=0):
        (X, y) = self.create_array_data(data)

        exTable = Orange.data.Table(data.domain)
        for id, ex in self.anch_examples:
            exTable.extend(Orange.data.Table(ex, data.domain))
        (X_anch, y_anch) = self.create_array_data(exTable)

        betas = array([0.0] * (len(data.domain.features) + 1))

        likelihood, betas = self.estimate_beta(X, y, betas,
                                               [0] * (len(betas)),
                                               X_anch, y_anch)

        # get attribute groups atGroup = [(startIndex, number of values), ...)
ats = data.domain.attributes ats = data.domain.features atVec=reduce(lambda x,y: x+[(y,not y==x[-1][0])], [a.getValueFrom and a.getValueFrom.whichVar or a for a in ats],[(ats[0].getValueFrom and ats[0].getValueFrom.whichVar or ats[0],0)])[1:] atGroup=[[0,0]] print "betas", betas[0], betas_temp[0] sumB += betas[0]-betas_temp[0] apriori = Orange.core.Distribution(data.domain.classVar, data) apriori = Orange.statistics.distribution.Distribution(data.domain.class_var, data) aprioriProb = apriori[0]/apriori.abs for j in range(len(betas)): if const_betas[j]: continue dl = matrixmultiply(X[:,j],transpose(y-p)) dl = dot(X[:,j], transpose(y-p)) for xi,x in enumerate(X_anch): dl += self.penalty*x[j]*(y_anch[xi] - pr_bx(r_anch[xi]*self.penalty)) ddl = matrixmultiply(X_sq[:,j],transpose(p*(1-p))) ddl = dot(X_sq[:,j], transpose(p*(1-p))) for xi,x in enumerate(X_anch): ddl += self.penalty*x[j]*pr_bx(r[xi]*self.penalty)*(1-pr_bx(r[xi]*self.penalty)) #  Feature subset selection for logistic regression def get_likelihood(fitter, examples): res = fitter(examples) @deprecated_keywords({"examples": "instances"}) def get_likelihood(fitter, instances): res = fitter(instances) if res[0] in [fitter.OK]: #, fitter.Infinity, fitter.Divergence]: status, beta, beta_se, likelihood = res if sum([abs(b) for b in beta])=self.deleteCrit: if P>=self.delete_crit: attr.remove(worstAt) remain_attr.append(worstAt) nodeletion = 1 # END OF DELETION PART # if enough attributes has been chosen, stop the procedure if self.numFeatures>-1 and len(attr)>=self.numFeatures: if self.num_features>-1 and len(attr)>=self.num_features: remain_attr=[] # for each attribute in the remaining maxG=-1 for at in remain_attr: tempAttr = attr + [at] tempDomain = Orange.core.Domain(tempAttr,examples.domain.classVar) tempDomain = Orange.data.Domain(tempAttr,examples.domain.class_var) tempDomain.addmetas(examples.domain.getmetas()) # domain, calculate P for LL improvement. 
tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))) tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)) ll_New = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData) ll_New = get_likelihood(LogRegFitter_Cholesky(), tempData) length_New = float(len(tempData)) # get number of examples in tempData to normalize likelihood stop = 1 continue if bestAt.varType==Orange.core.VarTypes.Continuous: if bestAt.var_type==Orange.data.Type.Continuous: P=lchisqprob(maxG,1); else: P=lchisqprob(maxG,len(bestAt.values)-1); # Add attribute with smallest P to attributes(attr) if P<=self.addCrit: if P<=self.add_crit: attr.append(bestAt) remain_attr.remove(bestAt) length_Old = length_Best if (P>self.addCrit and nodeletion) or (bestAt == worstAt): if (P>self.add_crit and nodeletion) or (bestAt == worstAt): stop = 1 return attr StepWiseFSS = deprecated_members({"addCrit": "add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"})(StepWiseFSS) else: return self def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1): self.addCrit = addCrit self.deleteCrit = deleteCrit self.numFeatures = numFeatures def __call__(self, examples): attr = StepWiseFSS(examples, addCrit=self.addCrit, deleteCrit = self.deleteCrit, numFeatures = self.numFeatures) return examples.select(Orange.core.Domain(attr, examples.domain.classVar)) @deprecated_keywords({"addCrit": "add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"}) def __init__(self, add_crit=0.2, delete_crit=0.3, num_features = -1): self.add_crit = add_crit self.delete_crit = delete_crit self.num_features = num_features @deprecated_keywords({"examples": "instances"}) def __call__(self, instances): attr = StepWiseFSS(instances, add_crit=self.add_crit, delete_crit= self.delete_crit, num_features= self.num_features) return instances.select(Orange.data.Domain(attr, instances.domain.class_var)) StepWiseFSSFilter = deprecated_members({"addCrit": 
"add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"})\ (StepWiseFSSFilter) ####################################
• ## Orange/data/io.py

 r9671 MakeStatus = Variable.MakeStatus def loadARFF(filename, create_on_new = MakeStatus.Incompatible, **kwargs): def loadARFF(filename, create_on_new=MakeStatus.Incompatible, **kwargs): """Return class:Orange.data.Table containing data from file in Weka ARFF format if there exists no .xml file with the same name. If it does, a multi-label filename = filename[:-5] if os.path.exists(filename + ".xml") and os.path.exists(filename + ".arff"): xml_name = filename + ".xml" arff_name = filename + ".arff" return Orange.multilabel.mulan.trans_mulan_data(xml_name,arff_name,create_on_new) xml_name = filename + ".xml" arff_name = filename + ".arff" return Orange.multilabel.mulan.trans_mulan_data(xml_name, arff_name, create_on_new) else: return loadARFF_Weka(filename, create_on_new) def loadARFF_Weka(filename, create_on_new = Orange.data.variable.Variable.MakeStatus.Incompatible, **kwargs): def loadARFF_Weka(filename, create_on_new=Orange.data.variable.Variable.MakeStatus.Incompatible, **kwargs): """Return class:Orange.data.Table containing data from file in Weka ARFF format""" if not os.path.exists(filename) and os.path.exists(filename + ".arff"): filename = filename + ".arff" f = open(filename,'r') filename = filename + ".arff" f = open(filename, 'r') attributes = [] attributeLoadStatus = [] name = '' state = 0 # header for l in f.readlines(): l = l.rstrip("\n") # strip \n l = l.replace('\t',' ') # get rid of tabs l = l.replace('\t', ' ') # get rid of tabs x = l.split('%')[0] # strip comments if len(x.strip()) == 0: continue if state == 0 and x[0] != '@': print "ARFF import ignoring:",x print "ARFF import ignoring:", x if state == 1: if x[0] == '{':#sparse data format, begin with '{', ends with '}' r = [None]*len(attributes) r = [None] * len(attributes) dd = x[1:-1] dd = dd.split(',') y = xs.strip(" ") if len(y) > 0: if y[0]=="'" or y[0]=='"': if y[0] == "'" or y[0] == '"': r.append(xs.strip("'\"")) else: while y[idx][-1] != "'": idx += 1 atn += ' '+y[idx] atn += ' ' + 
y[idx] atn = atn.strip("' ") else: for y in w[0].split(','): sy = y.strip(" '\"") if len(sy)>0: if len(sy) > 0: vals.append(sy) a, s = Variable.make(atn, Orange.data.Type.Discrete, vals, [], create_on_new) # real... a, s = Variable.make(atn, Orange.data.Type.Continuous, [], [], create_on_new) attributes.append(a) attributeLoadStatus.append(s) lex = [] for dd in data: e = Orange.data.Instance(d,dd) e = Orange.data.Instance(d, dd) lex.append(e) t = Orange.data.Table(d,lex) t = Orange.data.Table(d, lex) t.name = name if hasattr(t, "attribute_load_status"): t.attribute_load_status = attributeLoadStatus #if hasattr(t, "attribute_load_status"): t.setattr("attribute_load_status", attributeLoadStatus) return t loadARFF = Orange.misc.deprecated_keywords( filename = filename[:-5] #print filename f = open(filename+'.arff','w') f.write('@relation %s\n'%t.domain.classVar.name) f = open(filename + '.arff', 'w') f.write('@relation %s\n' % t.domain.classVar.name) # attributes ats = [i for i in t.domain.attributes] iname = str(i.name) if iname.find(" ") != -1: iname = "'%s'"%iname if real==1: f.write('@attribute %s real\n'%iname) iname = "'%s'" % iname if real == 1: f.write('@attribute %s real\n' % iname) else: f.write('@attribute %s { '%iname) f.write('@attribute %s { ' % iname) x = [] for j in i.values: s = str(j) if s.find(" ") == -1: x.append("%s"%s) x.append("%s" % s) else: x.append("'%s'"%s) x.append("'%s'" % s) for j in x[:-1]: f.write('%s,'%j) f.write('%s }\n'%x[-1]) f.write('%s,' % j) f.write('%s }\n' % x[-1]) # examples s = str(j[i]) if s.find(" ") == -1: x.append("%s"%s) x.append("%s" % s) else: x.append("'%s'"%s) x.append("'%s'" % s) for i in x[:-1]: f.write('%s,'%i) f.write('%s\n'%x[-1]) def loadMULAN(filename, create_on_new = Orange.data.variable.Variable.MakeStatus.Incompatible, **kwargs): f.write('%s,' % i) f.write('%s\n' % x[-1]) def loadMULAN(filename, create_on_new=Orange.data.variable.Variable.MakeStatus.Incompatible, **kwargs): """Return class:Orange.data.Table 
containing data from file in Mulan ARFF and XML format""" if filename[-4:] == ".xml": filename = filename[:-4] if os.path.exists(filename + ".xml") and os.path.exists(filename + ".arff"): xml_name = filename + ".xml" arff_name = filename + ".arff" return Orange.multilabel.mulan.trans_mulan_data(xml_name,arff_name) xml_name = filename + ".xml" arff_name = filename + ".arff" return Orange.multilabel.mulan.trans_mulan_data(xml_name, arff_name) else: return None else: real = 0 if real==1: if real == 1: f.write('%s: continuous.\n' % i.name) else: # examples f.close() f = open('%s.data' % filename_prefix, 'w') for j in t: f.write('%s\n' % x[-1]) def toR(filename,t): def toR(filename, t): """Save class:Orange.data.Table to file in R format""" if str.upper(filename[-2:]) == ".R": filename = filename[:-2] f = open(filename+'.R','w') f = open(filename + '.R', 'w') atyp = [] for i in xrange(len(labels)): if atyp[i] == 2: # continuous f.write('"%s" = c('%(labels[i])) f.write('"%s" = c(' % (labels[i])) for j in xrange(len(t)): if t[j][i].isSpecial(): else: f.write(str(t[j][i])) if (j == len(t)-1): if (j == len(t) - 1): f.write(')') else: elif atyp[i] == 1: # discrete if aord[i]: # ordered f.write('"%s" = ordered('%labels[i]) f.write('"%s" = ordered(' % labels[i]) else: f.write('"%s" = factor('%labels[i]) f.write('"%s" = factor(' % labels[i]) f.write('levels=c(') for j in xrange(len(as0[i].values)): f.write('"x%s"'%(as0[i].values[j])) if j == len(as0[i].values)-1: f.write('"x%s"' % (as0[i].values[j])) if j == len(as0[i].values) - 1: f.write('),c(') else: f.write('NA') else: f.write('"x%s"'%str(t[j][i])) if (j == len(t)-1): f.write('"x%s"' % str(t[j][i])) if (j == len(t) - 1): f.write('))') else: else: raise "Unknown attribute type." 
if (i < len(labels)-1): if (i < len(labels) - 1): f.write(',\n') f.write(')\n') def toLibSVM(filename, example): """Save class:Orange.data.Table to file in LibSVM format""" import Orange.classification.svm Orange.classification.svm.tableToSVMFormat(example, open(filename, "wb")) def loadLibSVM(filename, create_on_new=MakeStatus.Incompatible, **kwargs): """Return class:Orange.data.Table containing data from file in LibSVM format""" attributeLoadStatus[attr] = s return attr def make_disc(name, unordered): attr, s = Orange.data.variable.make(name, Orange.data.Type.Discrete, [], unordered, create_on_new) attributeLoadStatus[attr] = s return attr data = [line.split() for line in open(filename, "rb").read().splitlines() if line.strip()] vars = type("attr", (dict,), {"__missing__": lambda self, key: self.setdefault(key, make_float(key))})() res.append(str[start:index]) start = find_start = index + 1 elif index == -1: res.append(str[start:]) return res def is_standard_var_def(cell): """Is the cell a standard variable definition (empty, cont, disc, string) except ValueError, ex: return False def is_var_types_row(row): """ Is the row a variable type definition row (as in the orange .tab file) """ return all(map(is_standard_var_def, row)) def var_type(cell): """ Return variable type from a variable type definition in cell. """ return map(var_type, row) def is_var_attributes_row(row): """ Is the row an attribute definition row (i.e. the third row in the else: raise ValueError("Unknown attribute label definition") def var_attributes(row): """ Return variable specifiers and label definitions for row """ return map(var_attribute, row) class _var_placeholder(object): """ A place holder for an arbitrary variable while it's values are still unknown. self.name = name self.values = set(values) class _disc_placeholder(_var_placeholder): """ A place holder for discrete variables while their values are not yet known. 
except ValueError: return False def is_variable_cont(values, n=None, cutoff=0.5): """ Is variable with values in column (n rows) a continuous variable. n = len(values) or 1 return (float(cont) / n) > cutoff def is_variable_discrete(values, n=None, cutoff=0.3): """ Is variable with values in column (n rows) a discrete variable. file.seek(0) # Rewind reader = csv.reader(file, dialect=dialect) header = types = var_attrs = None #    if not has_header: #        raise ValueError("No header in the data file.") header = reader.next() if header: # Try to get variable definitions if is_var_types_row(types_row): types = var_types(types_row) if types: # Try to get the variable attributes if is_var_attributes_row(labels_row): var_attrs = var_attributes(labels_row) # If definitions not present fill with blanks if not types: if not var_attrs: var_attrs = [None] * len(header) # start from the beginning file.seek(0) if any(defined): # skip definition rows if present in the file reader.next() variables = [] undefined_vars = [] variables.append(_var_placeholder(name)) undefined_vars.append((i, variables[-1])) data = [] for row in reader: for ind, var_def in undefined_vars: var_def.values.add(row[ind]) for ind, var_def in undefined_vars: values = var_def.values - set(["?", ""]) # TODO: Other unknown strings? 
values = sorted(values) values = sorted(values) if isinstance(var_def, _disc_placeholder): variables[ind] = variable.make(var_def.name, Orange.data.Type.Discrete, [], values, create_new_on) else: raise ValueError("Strange column in the data") vars = [] vars_load_status = [] vars.append(var) vars_load_status.append(status) attributes = [] class_var = [] attribute_load_status.append(status) attribute_indices.append(i) if len(class_var) > 1: raise ValueError("Multiple class variables defined") class_var = class_var[0] if class_var else None attribute_load_status += class_var_load_status variable_indices = attribute_indices + class_indices domain.add_metas(metas) normal = [[row[i] for i in variable_indices] for row in data] meta_part = [[row[i] for i,_ in meta_indices] for row in data] meta_part = [[row[i] for i, _ in meta_indices] for row in data] table = Orange.data.Table(domain, normal) for ex, m_part in zip(table, meta_part): for (column, var), val in zip(meta_indices, m_part): ex[var] = var(val) table.metaAttributeLoadStatus = meta_attribute_load_status table.attributeLoadStatus = attribute_load_status return table pass return file def save_csv(file, table, orange_specific=True, **kwargs): import csv names = [v.name for v in all_vars] writer.writerow(names) if orange_specific: type_cells = [] raise TypeError("Unknown variable type") writer.writerow(type_cells) var_attr_cells = [] for spec, var in [("", v) for v in attrs] + \ ([("class", class_var)] if class_var else []) +\ ([("class", class_var)] if class_var else []) + \ [("m", v) for v in metas]: labels = ["{0}={1}".format(*t) for t in var.attributes.items()] # TODO escape spaces var_attr_cells.append(" ".join([spec] if spec else [] + labels)) writer.writerow(var_attr_cells) for instance in table: instance = list(instance) + [instance[m] for m in metas] writer.writerow(instance) register_file_type("R", None, toR, ".R") register_file_type("Weka", loadARFF, toARFF, ".arff") """ Return a list of persistent 
registered (prefix, path) pairs """ global_settings_dir = Orange.misc.environ.install_dir user_settings_dir = Orange.misc.environ.orange_settings_dir if isinstance(path, list): path = os.path.pathsep.join(path) user_settings_dir = Orange.misc.environ.orange_settings_dir if not os.path.exists(user_settings_dir): except OSError: pass filename = os.path.join(user_settings_dir, "orange-search-paths.cfg") parser = SafeConfigParser() parser.read([filename]) if not parser.has_section("paths"): parser.add_section("paths") if path is not None: parser.set("paths", prefix, path) parser.remove_option("paths", prefix) parser.write(open(filename, "wb")) def search_paths(prefix=None): """ Return a list of the registered (prefix, path) pairs. else: return paths def set_search_path(prefix, path, persistent=False): """ Associate a search path with a prefix. """ global _session_paths if isinstance(path, list): path = os.path.pathsep.join(path) if persistent: save_persistent_search_path(prefix, path) else: _session_paths.append((prefix, path)) def expand_filename(prefixed_name): else: raise ValueError("Unknown prefix %r." % prefix) def find_file(prefixed_name): """ Find the prefixed filename and return its full path. if not os.path.exists(prefixed_name): if ":" not in prefixed_name: raise ValueError("Not a prefixed name.") prefix, filename = prefixed_name.split(":", 1) raise ValueError("Not a prefixed name.") prefix, filename = prefixed_name.split(":", 1) paths = search_paths(prefix) if paths: else: return prefixed_name
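The search-path hunks in io.py all follow one convention: a prefixed name has the form `prefix:filename`, is split once on the first colon, and is resolved against the directories registered for that prefix. A simplified stand-in for that logic — the dict here replaces Orange's persistent `SafeConfigParser` registry, and `resolve_prefixed` is an illustrative name, not Orange API:

```python
import os

def resolve_prefixed(prefixed_name, registry):
    """Resolve 'prefix:filename' against `registry`, a dict mapping
    prefix -> os.pathsep-joined directory list, mirroring find_file."""
    if os.path.exists(prefixed_name):
        return prefixed_name              # already a valid path
    if ":" not in prefixed_name:
        raise ValueError("Not a prefixed name.")
    prefix, filename = prefixed_name.split(":", 1)
    for directory in registry.get(prefix, "").split(os.pathsep):
        candidate = os.path.join(directory, filename)
        if directory and os.path.exists(candidate):
            return candidate
    return prefixed_name                  # fall back, as find_file does
```

Like `find_file` above, unresolvable names are returned unchanged rather than raising, so callers can still attempt to open them.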
• ## Orange/distance/__init__.py

 r9759 class PearsonR(DistanceConstructor): """Constructs an instance of :obj:PearsonRDistance. Not all the data needs to be given.""" def __new__(cls, data=None, **argkw): Pearson correlation coefficient _ _correlation_coefficient>_. """ Returns Pearson's dissimilarity between e1 and e2, i.e. (1-r)/2 where r is Sprearman's rank coefficient. i.e. (1-r)/2 where r is Pearson's rank coefficient. """ X1 = [] class SpearmanR(DistanceConstructor): """Constructs an instance of SpearmanR. Not all the data needs to be given.""" def __new__(cls, data=None, **argkw): """Spearman's rank correlation coefficient _""" correlation_coefficient>_.""" def __init__(self, **argkw): class Mahalanobis(DistanceConstructor): """ Construct instance of Mahalanobis. """ def __new__(cls, data=None, **argkw):
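The docstring fix above corrects the formula description: both distances return `(1 - r) / 2`, mapping perfect correlation to 0 and perfect anti-correlation to 1. A minimal NumPy sketch of the two measures (illustrative only, not Orange's implementation — in particular, the rank transform here ignores ties):

```python
import numpy as np

def pearson_dissimilarity(e1, e2):
    """(1 - r) / 2, where r is the Pearson correlation coefficient:
    0 for perfectly correlated vectors, 1 for anti-correlated ones."""
    r = np.corrcoef(e1, e2)[0, 1]
    return (1.0 - r) / 2.0

def spearman_dissimilarity(e1, e2):
    """Same formula, but r is computed on ranks (Spearman),
    using a simplified 0-based ranking with no tie averaging."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return pearson_dissimilarity(rank(e1), rank(e2))
```

For monotonically related but nonlinear vectors the Spearman variant reports zero dissimilarity while the Pearson variant does not, which is the practical difference between the two constructors.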
• ## Orange/doc/reference/undefineds.tab

 r9671 a   b   c d   d   d -dc X -dc UNK   -dc UNAVAILABLE 0   0   0 1   1   1
• ## Orange/evaluation/reliability.py

 r9725 :obj:Orange.classification.Classifier.GetBoth is passed) contain an additional attribute :obj:reliability_estimate, which is an instance of :class:~Orange.evaluation.reliability.Estimate. """
• ## Orange/feature/discretization.py

 r9671 """ ################################### Discretization (discretization) ################################### .. index:: discretization .. index:: single: feature; discretization Example-based automatic discretization is in essence similar to learning: given a set of examples, discretization method proposes a list of suitable intervals to cut the attribute's values into. For this reason, Orange structures for discretization resemble its structures for learning. Objects derived from orange.Discretization play a role of "learner" that, upon observing the examples, construct an orange.Discretizer whose role is to convert continuous values into discrete according to the rule found by Discretization. Orange supports several methods of discretization; here's a list of methods with belonging classes. * Equi-distant discretization (:class:EquiDistDiscretization, :class:EquiDistDiscretizer). The range of attribute's values is split into prescribed number equal-sized intervals. * Quantile-based discretization (:class:EquiNDiscretization, :class:IntervalDiscretizer). The range is split into intervals containing equal number of examples. * Entropy-based discretization (:class:EntropyDiscretization, :class:IntervalDiscretizer). Developed by Fayyad and Irani, this method balances between entropy in intervals and MDL of discretization. * Bi-modal discretization (:class:BiModalDiscretization, :class:BiModalDiscretizer/:class:IntervalDiscretizer). Two cut-off points set to optimize the difference of the distribution in the middle interval and the distributions outside it. * Fixed discretization (:class:IntervalDiscretizer). Discretization with user-prescribed cut-off points. Instances of classes derived from :class:Discretization. It define a single method: the call operator. The object can also be called through constructor. .. class:: Discretization .. 
method:: __call__(attribute, examples[, weightID]) Given a continuous attribute, examples and, optionally id of attribute with example weight, this function returns a discretized attribute. Argument attribute can be a descriptor, index or name of the attribute. Here's an example. Part of :download:discretization.py : .. literalinclude:: code/discretization.py :lines: 7-15 The discretized attribute sep_w is constructed with a call to :class:EntropyDiscretization (instead of constructing it and calling it afterwards, we passed the arguments for calling to the constructor, as is often allowed in Orange). We then constructed a new :class:Orange.data.Table with attributes "sepal width" (the original continuous attribute), sep_w and the class attribute. Script output is:: Entropy discretization, first 10 examples [3.5, '>3.30', 'Iris-setosa'] [3.0, '(2.90, 3.30]', 'Iris-setosa'] [3.2, '(2.90, 3.30]', 'Iris-setosa'] [3.1, '(2.90, 3.30]', 'Iris-setosa'] [3.6, '>3.30', 'Iris-setosa'] [3.9, '>3.30', 'Iris-setosa'] [3.4, '>3.30', 'Iris-setosa'] [3.4, '>3.30', 'Iris-setosa'] [2.9, '<=2.90', 'Iris-setosa'] [3.1, '(2.90, 3.30]', 'Iris-setosa'] :class:EntropyDiscretization named the new attribute's values by the interval range (it also named the attribute as "D_sepal width"). The new attribute's values get computed automatically when they are needed. As those that have read about :class:Orange.data.variable.Variable know, the answer to "How this works?" is hidden in the field :obj:~Orange.data.variable.Variable.get_value_from. This little dialog reveals the secret. :: >>> sep_w EnumVariable 'D_sepal width' >>> sep_w.get_value_from >>> sep_w.get_value_from.whichVar FloatVariable 'sepal width' >>> sep_w.get_value_from.transformer >>> sep_w.get_value_from.transformer.points <2.90000009537, 3.29999995232> So, the select statement in the above example converted all examples from data to the new domain. 
Since the new domain includes the attribute sep_w that is not present in the original, sep_w's values are computed on the fly. For each example in data, sep_w.get_value_from is called to compute sep_w's value (if you ever need to call get_value_from, you shouldn't call get_value_from directly but call compute_value instead). sep_w.get_value_from looks for value of "sepal width" in the original example. The original, continuous sepal width is passed to the transformer that determines the interval by its field points. Transformer returns the discrete value which is in turn returned by get_value_from and stored in the new example. You don't need to understand this mechanism exactly. It's important to know that there are two classes of objects for discretization. Those derived from :obj:Discretizer (such as :obj:IntervalDiscretizer that we've seen above) are used as transformers that translate continuous value into discrete. Discretization algorithms are derived from :obj:Discretization. Their job is to construct a :obj:Discretizer and return a new variable with the discretizer stored in get_value_from.transformer. Discretizers ============ Different discretizers support different methods for conversion of continuous values into discrete. The most general is :class:IntervalDiscretizer that is also used by most discretization methods. Two other discretizers, :class:EquiDistDiscretizer and :class:ThresholdDiscretizer> could easily be replaced by :class:IntervalDiscretizer but are used for speed and simplicity. The fourth discretizer, :class:BiModalDiscretizer is specialized for discretizations induced by :class:BiModalDiscretization. .. class:: Discretizer All discretizers support a handy method for construction of a new attribute from an existing one. .. method:: construct_variable(attribute) Constructs a new attribute descriptor; the new attribute is discretized attribute. 
The new attribute's name equals attribute.name prefixed by "D\_", and its symbolic values are discretizer specific. The above example shows what comes out from :class:IntervalDiscretizer. Discretization algorithms actually first construct a discretizer and then call its construct_variable to construct an attribute descriptor. .. class:: IntervalDiscretizer The most common discretizer. .. attribute:: points Cut-off points. All values below or equal to the first point belong to the first interval, those between the first and the second (including those equal to the second) go to the second interval and so forth to the last interval which covers all values greater than the last element in points. The number of intervals is thus len(points)+1. Let us manually construct an interval discretizer with cut-off points at 3.0 and 5.0. We shall use the discretizer to construct a discretized sepal length (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 22-26 That's all. First five examples of data2 are now :: [5.1, '>5.00', 'Iris-setosa'] [4.9, '(3.00, 5.00]', 'Iris-setosa'] [4.7, '(3.00, 5.00]', 'Iris-setosa'] [4.6, '(3.00, 5.00]', 'Iris-setosa'] [5.0, '(3.00, 5.00]', 'Iris-setosa'] Can you use the same discretizer for more than one attribute? Yes, as long as they have the same cut-off points, of course. Simply call construct_variable for each continuous attribute (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 30-34 Each attribute now has its own (FIXME) ClassifierFromVar in its get_value_from, but all use the same :class:IntervalDiscretizer, idisc. Changing an element of its points affects all attributes. Do not change the length of :obj:~IntervalDiscretizer.points if the discretizer is used by any attribute. The length of :obj:~IntervalDiscretizer.points should always match the number of values of the attribute, which is determined by the length of the attribute's field values.
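The cut-point semantics of points described above can be expressed compactly with the standard library's bisect module. A sketch of the documented rule, not Orange's actual C++ implementation:

```python
import bisect

def interval_index(val, points):
    """Interval of `val` given sorted cut-offs `points`: values less
    than or equal to points[i] (and greater than points[i-1]) fall
    into interval i; values above the last cut-off fall into interval
    len(points), for len(points)+1 intervals in total."""
    return bisect.bisect_left(points, val)

# With cut-offs at 3.0 and 5.0 (three intervals):
# interval_index(3.0, [3.0, 5.0]) -> 0   (below or equal to the first point)
# interval_index(5.0, [3.0, 5.0]) -> 1   (equal to the second point)
# interval_index(6.1, [3.0, 5.0]) -> 2   (above the last point)
```

`bisect_left` happens to implement exactly the "below or equal goes left" convention, which is why no explicit comparison loop is needed.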
Therefore, if attr is a discretized attribute, then len(attr.values) must equal len(attr.get_value_from.transformer.points)+1. It always does, unless you deliberately change it. If the sizes don't match, Orange will probably crash, and it will be entirely your fault. .. class:: EquiDistDiscretizer More rigid than :obj:IntervalDiscretizer: it uses intervals of fixed width. .. attribute:: first_cut The first cut-off point. .. attribute:: step Width of intervals. .. attribute:: number_of_intervals Number of intervals. .. attribute:: points (read-only) The cut-off points; this is not a real attribute although it behaves as one. Reading it constructs a list of cut-off points and returns it, but changing the list doesn't affect the discretizer - it's a separate list. This attribute is here only to give the :obj:EquiDistDiscretizer the same interface as that of :obj:IntervalDiscretizer. All values below :obj:~EquiDistDiscretizer.first_cut belong to the first interval (including possible values smaller than first_cut). Otherwise, value val's interval is floor((val-first_cut)/step). If this turns out to be greater than or equal to :obj:~EquiDistDiscretizer.number_of_intervals, it is decreased to number_of_intervals-1. This discretizer is returned by :class:EquiDistDiscretization; you can see an example in the corresponding section. You can also construct it manually and call its construct_variable, just as shown for the :obj:IntervalDiscretizer. .. class:: ThresholdDiscretizer Threshold discretizer converts continuous values into binary by comparing them with a threshold. This discretizer is actually not used by any discretization method, but you can use it for manual discretization. Orange needs this discretizer for binarization of continuous attributes in decision trees. .. attribute:: threshold Threshold; values below or equal to the threshold belong to the first interval and those that are greater go to the second. ..
class:: BiModalDiscretizer This discretizer is the first discretizer that couldn't be replaced by :class:IntervalDiscretizer. It has two cut-off points and values are discretized according to whether they belong to the middle region (which includes the lower but not the upper boundary) or not. The discretizer is returned by :class:BiModalDiscretization if its field :obj:~BiModalDiscretization.split_in_two is true (the default). .. attribute:: low Lower boundary of the interval (included in the interval). .. attribute:: high Upper boundary of the interval (not included in the interval). Discretization Algorithms ========================= .. class:: EquiDistDiscretization Discretizes the attribute by cutting it into the prescribed number of intervals of equal width. The examples are needed to determine the span of attribute values. The interval between the smallest and the largest is then cut into equal parts. .. attribute:: number_of_intervals Number of intervals into which the attribute is to be discretized. Default value is 4. For an example, we shall discretize all attributes of the Iris dataset into 6 intervals. We shall construct an :class:Orange.data.Table with discretized attributes and print descriptions of the attributes (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 38-43 Script's answer is :: D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> Any more decent ways for a script to find the interval boundaries than by parsing the symbolic values? Sure, they are hidden in the discretizer, which is, as usual, stored in attr.get_value_from.transformer. Compare the following with the values above. :: >>> for attr in newattrs: ...
print "%s: first interval at %5.3f, step %5.3f" % \ ...    (attr.name, attr.get_value_from.transformer.first_cut, \ ...    attr.get_value_from.transformer.step) D_sepal length: first interval at 4.900, step 0.600 D_sepal width: first interval at 2.400, step 0.400 D_petal length: first interval at 1.980, step 0.980 D_petal width: first interval at 0.500, step 0.400 As all discretizers, :class:EquiDistDiscretizer also has the method construct_variable (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 69-73 .. class:: EquiNDiscretization Discretization with Intervals Containing (Approximately) Equal Number of Examples. Discretizes the attribute by cutting it into the prescribed number of intervals so that each of them contains equal number of examples. The examples are obviously needed for this discretization, too. .. attribute:: number_of_intervals Number of intervals into which the attribute is to be discretized. Default value is 4. The use of this discretization is the same as the use of :class:EquiDistDiscretization. The resulting discretizer is :class:IntervalDiscretizer, hence it has points instead of first_cut/ step/number_of_intervals. .. class:: EntropyDiscretization Entropy-based Discretization (Fayyad-Irani). Fayyad-Irani's discretization method works without a predefined number of intervals. Instead, it recursively splits intervals at the cut-off point that minimizes the entropy, until the entropy decrease is smaller than the increase of MDL induced by the new point. An interesting thing about this discretization technique is that an attribute can be discretized into a single interval, if no suitable cut-off points are found. If this is the case, the attribute is rendered useless and can be removed. This discretization can therefore also serve for feature subset selection. .. 
attribute:: force_attribute Forces the algorithm to induce at least one cut-off point, even when its information gain is lower than MDL (default: false). Part of :download:discretization.py : .. literalinclude:: code/discretization.py :lines: 77-80 The output shows that all attributes are discretized onto three intervals:: sepal length: <5.5, 6.09999990463> sepal width: <2.90000009537, 3.29999995232> petal length: <1.89999997616, 4.69999980927> petal width: <0.600000023842, 1.0000004768> .. class:: BiModalDiscretization Bi-Modal Discretization Sets two cut-off points so that the class distribution of examples in between is as different from the overall distribution as possible. The difference is measure by chi-square statistics. All possible cut-off points are tried, thus the discretization runs in O(n^2). This discretization method is especially suitable for the attributes in which the middle region corresponds to normal and the outer regions to abnormal values of the attribute. Depending on the nature of the attribute, we can treat the lower and higher values separately, thus discretizing the attribute into three intervals, or together, in a binary attribute whose values correspond to normal and abnormal. .. attribute:: split_in_two Decides whether the resulting attribute should have three or two. If true (default), we have three intervals and the discretizer is of type :class:BiModalDiscretizer. If false the result is the ordinary :class:IntervalDiscretizer. Iris dataset has three-valued class attribute, classes are setosa, virginica and versicolor. As the picture below shows, sepal lenghts of versicolors are between lengths of setosas and virginicas (the picture itself is drawn using LOESS probability estimation). .. image:: files/bayes-iris.gif If we merge classes setosa and virginica into one, we can observe whether the bi-modal discretization would correctly recognize the interval in which versicolors dominate. .. 
literalinclude:: code/discretization.py :lines: 84-87 In this script, we have constructed a new class attribute which tells whether an iris is versicolor or not. We have told how this attribute's value is computed from the original class value with a simple lambda function. Finally, we have constructed a new domain and converted the examples. Now for discretization. .. literalinclude:: code/discretization.py :lines: 97-100 The script prints out the middle intervals:: sepal length: (5.400, 6.200] sepal width: (2.000, 2.900] petal length: (1.900, 4.700] petal width: (0.600, 1.600] Judging by the graph, the cut-off points for "sepal length" make sense. Additional functions ==================== Some functions and classes that can be used for categorization of continuous features. Besides several general classes that can help in this task, we also provide a function that may help in entropy-based discretization (Fayyad & Irani), and a wrapper around classes for categorization that can be used for learning. .. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper .. autoclass:: Orange.feature.discretization.EntropyDiscretization_wrapper .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class .. rubric:: Example FIXME. A chapter on feature subset selection <../ofb/o_fss.htm>_ in Orange for Beginners tutorial shows the use of DiscretizedLearner. Other discretization classes from core Orange are listed in chapter on categorization <../ofb/o_categorization.htm>_ of the same tutorial. ========== References ========== * UM Fayyad and KB Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022--1029, Chambery, France, 1993. 
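The first_cut/step arithmetic behind equal-width discretization can be sketched in plain Python. This is a standalone illustration of the idea, not the Orange implementation; the helper names `equal_width_cuts` and `discretize` are made up for this sketch.

```python
# Standalone sketch of equal-width discretization: the span [min, max]
# is cut into n intervals of equal width.

def equal_width_cuts(values, n_intervals=4):
    """Return the first cut-off point and the step between cut-offs."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals
    first_cut = lo + step          # lowest cut-off point
    return first_cut, step

def discretize(value, first_cut, step, n_intervals):
    """Map a continuous value to an interval index 0..n_intervals-1."""
    if value < first_cut:
        return 0
    idx = int((value - first_cut) / step) + 1
    return min(idx, n_intervals - 1)

# A sepal-length-like span 4.3..7.9 cut into 6 intervals gives
# a first cut at 4.9 and a step of 0.6, matching the output above.
first, step = equal_width_cuts([4.3, 7.9], 6)
```

Note that the cut points depend only on the smallest and largest value seen, which is why the examples are needed before discretization.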
""" import Orange import Orange.core as orange Discrete2Continuous, \ Discretizer, \ BiModalDiscretizer, \ EquiDistDiscretizer, \ IntervalDiscretizer, \ ThresholdDiscretizer, \ EntropyDiscretization, \ EquiDistDiscretization, \ EquiNDiscretization, \ BiModalDiscretization, \ Discretization BiModalDiscretizer, \ EquiDistDiscretizer as EqualWidthDiscretizer, \ IntervalDiscretizer, \ ThresholdDiscretizer,\ EntropyDiscretization as Entropy, \ EquiDistDiscretization as EqualWidth, \ EquiNDiscretization as EqualFreq, \ BiModalDiscretization as BiModal, \ Discretization, \ Preprocessor_discretize ###### # from orngDics.py def entropyDiscretization_wrapper(table): """Take the classified table set (table) and categorize all continuous features using the entropy based discretization :obj:EntropyDiscretization. def entropyDiscretization_wrapper(data): """Discretize all continuous features in class-labeled data set with the entropy-based discretization :obj:Entropy. :param table: data to discretize. :type table: Orange.data.Table :param data: data to discretize. :type data: Orange.data.Table :rtype: :obj:Orange.data.Table includes all categorical and discretized\ continuous features from the original data table. 
""" orange.setrandseed(0) tablen=orange.Preprocessor_discretize(table, method=EntropyDiscretization()) data_new = orange.Preprocessor_discretize(data, method=Entropy()) attrlist=[] nrem=0 for i in tablen.domain.attributes: attrlist = [] nrem = 0 for i in data_new.domain.attributes: if (len(i.values)>1): attrlist.append(i) nrem=nrem+1 attrlist.append(tablen.domain.classVar) return tablen.select(attrlist) return data_new.select(attrlist) """ def __init__(self, baseLearner, discretizer=EntropyDiscretization(), **kwds): def __init__(self, baseLearner, discretizer=Entropy(), **kwds): self.baseLearner = baseLearner if hasattr(baseLearner, "name"): def __call__(self, example, resultType = orange.GetValue): return self.classifier(example, resultType) class DiscretizeTable(object): """Discretizes all continuous features of the data table. :param data: data to discretize. :type data: :class:Orange.data.Table :param features: data features to discretize. None (default) to discretize all features. :type features: list of :class:Orange.data.variable.Variable :param method: feature discretization method. :type method: :class:Discretization """ def __new__(cls, data=None, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): if data is None: self = object.__new__(cls, features=features, discretize_class=discretize_class, method=method) return self else: self = cls(features=features, discretize_class=discretize_class, method=method) return self(data) def __init__(self, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): self.features = features self.discretize_class = discretize_class self.method = method def __call__(self, data): pp = Preprocessor_discretize(attributes=self.features, discretizeClass=self.discretize_class) pp.method = self.method return pp(data)
• ## Orange/feature/imputation.py

 r9671 """
###########################
Imputation (imputation)
###########################

.. index:: imputation

.. index::
    single: feature; value imputation

Imputation is a procedure for replacing missing feature values with appropriate values. Imputation is needed by methods (learning algorithms and others) that cannot handle unknown values, for instance logistic regression. Missing values sometimes have a special meaning, so they need to be replaced by a designated value. Sometimes we know what to replace a missing value with; for instance, in a medical problem, some laboratory tests might not be done when it is known what their results would be. In that case, we impute a certain fixed value instead of the missing one. In the most complex case, we assign values that are computed based on some model; we can, for instance, impute the average or majority value, or even a value computed from the values of other, known features, using a classifier.

In a learning/classification process, imputation is needed on two occasions. Before learning, the imputer needs to process the training examples. Afterwards, the imputer is called for each example to be classified. In general, the imputer itself needs to be trained. This is, of course, not needed when the imputer imputes a certain fixed value. However, when it imputes the average or majority value, it needs to compute the statistics on the training examples and use them afterwards for imputation of both training and testing examples.

While reading this document, bear in mind that imputation is a part of the learning process. If we fit the imputation model, for instance, by learning how to predict the feature's value from other features, or even if we simply compute the average or the minimal value for the feature and use it in imputation, this should only be done on the learning data. If cross validation is used for sampling, imputation should be done on the training folds only.
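The train-fold discipline can be sketched in plain Python, with None marking a missing value: the statistic is learned from the training rows only and then applied to both training and test rows. The helper names are hypothetical; this is not the Orange API.

```python
# Sketch: compute the imputation statistic on training data only,
# then apply it to training and test data alike.

def fit_mean_imputer(rows, col):
    """Learn the column mean from the training rows, ignoring None."""
    known = [r[col] for r in rows if r[col] is not None]
    return sum(known) / len(known)

def impute(rows, col, value):
    """Return new rows with None in `col` replaced by `value`."""
    return [r[:col] + ((value if r[col] is None else r[col]),) + r[col + 1:]
            for r in rows]

train = [(1.0,), (3.0,), (None,)]
test = [(None,)]
mean = fit_mean_imputer(train, 0)   # statistic from the training fold only
train_i = impute(train, 0, mean)
test_i = impute(test, 0, mean)      # test data reuses the train statistic
```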
Orange provides simple means for doing that. This page will first explain how to construct various imputers. Then follow the examples for proper use of imputers <#using-imputers>_. Finally, quite often you will want to use imputation with special requests, such as certain features' missing values getting replaced by constants and others by values computed using models induced from specified other features. For instance, in one of the studies we worked on, the patient's pulse rate needed to be estimated using regression trees that included the scope of the patient's injuries, sex and age; some attributes' values were replaced by the most pessimistic ones and others were computed with regression trees based on the values of all features. If you are using learners that need the imputer as a component, you will need to write your own imputer constructor <#write-your-own-imputer-constructor>_. This is trivial and is explained at the end of this page.

Wrapper for learning algorithms
===============================

This wrapper can be used with learning algorithms that cannot handle missing values: it will impute the missing values using the imputer, call the learning algorithm and, if the imputation is also needed by the classifier, wrap the resulting classifier into another wrapper that will impute the missing values in the examples to be classified. Even so, the module is somewhat redundant, as all learners that cannot handle missing values should, in principle, provide slots for an imputer constructor. For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

Wraps a learner and performs data imputation before learning. Most of Orange's learning algorithms do not use imputers because they can appropriately handle the missing values.
The Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling missing values in various ways. If for any reason you want these algorithms to run on imputed data, you can use this wrapper. The class description is a matter of a separate page, but we shall show its code here as another demonstration of how to use the imputers; logistic regression is implemented essentially the same as the classes below.

This is basically a learner, so the constructor will return either an instance of :obj:`ImputeLearner` or, if called with examples, an instance of some classifier. There are a few attributes that need to be set, though.

.. attribute:: base_learner

    A wrapped learner.

.. attribute:: imputer_constructor

    An instance of a class derived from :obj:`ImputerConstructor` (or a class with the same call operator).

.. attribute:: dont_impute_classifier

    If given and set (this attribute is optional), the classifier will not be wrapped into an imputer. Do this if the classifier doesn't mind if the examples it is given have missing values.

The learner is best illustrated by its code - here's its complete :obj:`__call__` method::

    def __call__(self, data, weight=0):
        trained_imputer = self.imputer_constructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        base_classifier = self.base_learner(imputed_data, weight)
        if self.dont_impute_classifier:
            return base_classifier
        else:
            return ImputeClassifier(base_classifier, trained_imputer)

So "learning" goes like this. :obj:`ImputeLearner` will first construct the imputer (that is, call :obj:`self.imputer_constructor` to get a trained imputer). Then it will use the imputer to impute the data, and call the given :obj:`base_learner` to construct a classifier. For instance, :obj:`base_learner` could be a learner for logistic regression and the result would be a logistic regression model.
If the classifier can handle unknown values (that is, if :obj:`dont_impute_classifier` is set), we return it as it is; otherwise we wrap it into an :obj:`ImputeClassifier`, which is given the base classifier and the imputer which it can use to impute the missing values in (testing) examples.

.. class:: ImputeClassifier

Objects of this class are returned by :obj:`ImputeLearner` when given data.

.. attribute:: baseClassifier

    A wrapped classifier.

.. attribute:: imputer

    An imputer for imputation of unknown values.

.. method:: __call__

    This class is even more trivial than the learner. Its constructor accepts two arguments, the classifier and the imputer, which are stored into the corresponding attributes. The call operator which does the classification then looks like this::

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

    It imputes the missing values by calling the :obj:`imputer` and passes the imputed example to the base classifier.

.. note:: In this setup the imputer is trained on the training data - even if you do cross validation, the imputer will be trained on the right data. In the classification phase we again use the imputer which was trained on the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:`Orange.feature.imputation.ImputeLearner` puts the keyword arguments into the instance's dictionary. You are expected to call it like :obj:ImputeLearner(base_learner=, imputer=). When the learner is called with examples, it trains the imputer, imputes the data, induces a :obj:`base_classifier` by the :obj:`base_learner` and constructs an :obj:`ImputeClassifier` that stores the :obj:`base_classifier` and the :obj:`imputer`. For classification, the missing values are imputed and the classifier's prediction is returned.
Note that this code is slightly simplified; the omitted details handle non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation internally, if needed, it can sometimes happen that an expert will be able to tell you exactly what to put in the data instead of the missing values. In this example we shall suppose that we want to impute the minimal value of each feature. We will try to determine whether the naive Bayesian classifier with its implicit internal imputation works better than one that uses imputation by minimal values. :download:`imputation-minimal-imputer.py` (uses :download:`voting.tab`):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

The script should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note:: Note that we constructed just one instance of :obj:`Orange.classification.bayes.NaiveLearner`, but this same instance is used twice in each fold: once it is given the examples as they are (and returns an instance of :obj:`Orange.classification.bayes.NaiveClassifier`).
The second time it is called by :obj:`imba`, and the :obj:`Orange.classification.bayes.NaiveClassifier` it returns is wrapped into :obj:`Orange.feature.imputation.Classifier`. We thus have only one learner, which produces two different classifiers in each round of testing.

Abstract imputers
=================

As is common in Orange, imputation is done by pairs of classes: one that does the work and another that constructs it. :obj:`ImputerConstructor` is the abstract root of a hierarchy of classes that get the training data (with an optional id for weight) and construct an instance of a class derived from :obj:`Imputer`. An :obj:`Imputer` can be called with an :obj:`Orange.data.Instance` and will return a new example with the missing values imputed (it will leave the original example intact!). If an imputer is called with an :obj:`Orange.data.Table`, it will return a new example table with imputed examples.

.. class:: ImputerConstructor

.. attribute:: imputeClass

    Tells whether to impute the class value (default) or not.

Simple imputation
=================

The simplest imputers always impute the same value for a particular attribute, disregarding the values of other attributes. They all use the same imputer class, :obj:`Imputer_defaults`.

.. class:: Imputer_defaults

.. attribute:: defaults

    An example with the default values to be imputed instead of the missing ones. Examples to be imputed must be from the same domain as :obj:`defaults`.

Instances of this class can be constructed by :obj:`Orange.feature.imputation.ImputerConstructor_minimal`, :obj:`Orange.feature.imputation.ImputerConstructor_maximal` and :obj:`Orange.feature.imputation.ImputerConstructor_average`. For continuous features, they will impute the smallest, largest or average value encountered in the training examples. For discrete features, they will impute the lowest value (the one with index 0, e.g. attr.values[0]), the highest (attr.values[-1]) or the most common value encountered in the data, respectively.
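The three rules can be mimicked in plain Python: for a continuous feature take the smallest, largest or average known value; for a discrete feature the first value, the last value or the most common one. A standalone sketch with a hypothetical helper name, not the Orange classes.

```python
from collections import Counter

def default_for(values, kind, rule):
    """Pick the value to impute for one feature.

    kind: 'continuous' or 'discrete';
    rule: 'minimal', 'maximal' or 'average' (majority for discrete).
    """
    if kind == "continuous":
        if rule == "minimal":
            return min(values)
        if rule == "maximal":
            return max(values)
        return sum(values) / len(values)
    # discrete: 'minimal' is the first value (like attr.values[0]),
    # 'maximal' the last, anything else the majority value
    if rule == "minimal":
        return values[0]
    if rule == "maximal":
        return values[-1]
    return Counter(values).most_common(1)[0][0]
```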
The first two imputers will mostly be used when the discrete values are ordered according to their impact on the class (for instance, possible values for symptoms of some disease can be ordered according to their seriousness). The minimal and maximal imputers will then represent optimistic and pessimistic imputations. The following code will load the bridges data and first impute the values in a single example and then in the whole table. :download:`imputation-complex.py` (uses :download:`bridges.tab`):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it is to be used. Don't impute all the data and then use it for cross-validation. As warned at the top of this page, see the instructions for actual use of imputers <#using-imputers>_.

.. note:: The :obj:`ImputerConstructor` is another class with a schizophrenic constructor: if you give the constructor the data, it will return an :obj:`Imputer` - the above call is equivalent to calling :obj:`Orange.feature.imputation.ImputerConstructor_minimal()(data)`.

You can also construct the :obj:`Orange.feature.imputation.Imputer_defaults` yourself and specify your own defaults. Or leave some values unspecified, in which case the imputer won't impute them, as in the following example. Here, the only attribute whose values will get imputed is "LENGTH"; the imputed value will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

:obj:`Orange.feature.imputation.Imputer_defaults`'s constructor will accept an argument of type :obj:`Orange.data.Domain` (in which case it will construct an empty instance for :obj:`defaults`) or an example. (Be careful with this: :obj:`Orange.feature.imputation.Imputer_defaults` will have a reference to the instance and not a copy. But you can make a copy yourself to avoid problems: instead of Imputer_defaults(data[0]) you may want to write Imputer_defaults(Orange.data.Instance(data[0])).)

Random imputation
=================

.. class:: Imputer_Random

Imputes random values. The corresponding constructor is :obj:`ImputerConstructor_Random`.

.. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

.. attribute:: deterministic

    If true (default is False), the random generator is initialized for each example using the example's hash value as a seed. This results in the same examples always being imputed the same values.

Model-based imputation
======================

.. class:: ImputerConstructor_model

Model-based imputers learn to predict the attribute's value from the values of other attributes. :obj:`ImputerConstructor_model` is given a learning algorithm (two, actually - one for discrete and one for continuous attributes) and constructs a classifier for each attribute. The constructed imputer :obj:`Imputer_model` stores a list of classifiers which are used when needed.

.. attribute:: learner_discrete, learner_continuous

    Learner for discrete and for continuous attributes. If either of them is missing, the attributes of the corresponding type won't get imputed.

.. attribute:: use_class

    Tells whether the imputer is allowed to use the class value. As this is most often undesired, this option is by default set to False. It can however be useful for a more complex design in which we would use one imputer for learning examples (this one would use the class value) and another for testing examples (which would not use the class value, as this is unavailable at that moment).

.. class:: Imputer_model

.. attribute:: models

    A list of classifiers, each corresponding to one attribute of the examples whose values are to be imputed. The :obj:`classVar` of each model should equal the corresponding attribute of the examples. If any of the classifiers is missing (that is, the corresponding element of the list is :obj:`None`), the corresponding attribute's values will not be imputed.

..
rubric:: Examples The following imputer predicts the missing attribute values using classification and regression trees with the minimum of 20 examples in a leaf. Part of :download:imputation-complex.py  (uses :download:bridges.tab ): .. literalinclude:: code/imputation-complex.py :lines: 74-76 We could even use the same learner for discrete and continuous attributes, as :class:Orange.classification.tree.TreeLearner checks the class type and constructs regression or classification trees accordingly. The common parameters, such as the minimal number of examples in leaves, are used in both cases. You can also use different learning algorithms for discrete and continuous attributes. Probably a common setup will be to use :class:Orange.classification.bayes.BayesLearner for discrete and :class:Orange.regression.mean.MeanLearner (which just remembers the average) for continuous attributes. Part of :download:imputation-complex.py  (uses :download:bridges.tab ): .. literalinclude:: code/imputation-complex.py :lines: 91-94 You can also construct an :class:Imputer_model yourself. You will do this if different attributes need different treatment. Brace for an example that will be a bit more complex. First we shall construct an :class:Imputer_model and initialize an empty list of models. The following code snippets are from :download:imputation-complex.py  (uses :download:bridges.tab ): .. literalinclude:: code/imputation-complex.py :lines: 108-109 Attributes "LANES" and "T-OR-D" will always be imputed values 2 and "THROUGH". Since "LANES" is continuous, it suffices to construct a :obj:DefaultClassifier with the default value 2.0 (don't forget the decimal part, or else Orange will think you talk about an index of a discrete value - how could it tell?). For the discrete attribute "T-OR-D", we could construct a :class:Orange.classification.ConstantClassifier and give the index of value "THROUGH" as an argument. 
But we shall do it more nicely, by constructing a :class:`Orange.data.Value`. Both classifiers will be stored at the appropriate places in :obj:`imputer.models`.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112

"LENGTH" will be computed with a regression tree induced from "MATERIAL", "SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of course). Note that we initialized the domain by simply giving a list with the names of the attributes, with the domain as an additional argument in which Orange will look for the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

We printed the tree just to see what it looks like. ::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for the "SPAN". Wooden bridges and walkways are short, while the others are mostly medium. This could be done with :class:`Orange.classifier.ClassifierByLookupTable` - this would be faster than what we plan here. See the corresponding documentation on the lookup classifier. Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:`compute_span` could also be written as a class, if you'd prefer it. It's important that it behaves like a classifier, that is, gets an example and returns a value. The second argument tells, as usual, what the caller expects the classifier to return - a value, a distribution or both. Since the caller, :obj:`Imputer_model`, always wants values, we shall ignore the argument (at the risk of having problems in the future, when imputers might handle distributions as well).

Missing values as special values
================================

Missing values sometimes have a special meaning. The fact that something was not measured can sometimes tell a lot.
Be, however, cautious when using such values in decision models; if the decision not to measure something (for instance, performing a laboratory test on a patient) is based on the expert's knowledge of the class value, such unknown values clearly should not be used in models.

.. class:: ImputerConstructor_asValue

Constructs a new domain in which each discrete attribute is replaced with a new attribute that has one value more: "NA". The new attribute will compute its values on the fly from the old one, copying the normal values and replacing the unknowns with "NA".

For continuous attributes, it will construct a two-valued discrete attribute with values "def" and "undef", telling whether the continuous attribute was defined or not. The attribute's name will equal the original's with "_def" appended. The original continuous attribute will remain in the domain and its unknowns will be replaced by averages.

:class:`ImputerConstructor_asValue` has no specific attributes. It constructs a :class:`Imputer_asValue` (I bet you wouldn't guess). It converts the example into the new domain, which imputes the values for discrete attributes. If continuous attributes are present, it will also replace their values by the averages.

.. class:: Imputer_asValue

.. attribute:: domain

    The domain with the new attributes constructed by :class:`ImputerConstructor_asValue`.

.. attribute:: defaults

    Default values for continuous attributes. Present only if there are any.

The following code shows what this imputer actually does to the domain. Part of :download:`imputation-complex.py` (uses :download:`bridges.tab`):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]
    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with :samp:`imputed` having a few additional ones). But if you check this with :samp:`original.domain[0] == imputed.domain[0]`, you will see that the comparison is False. The attributes only have the same names; they are different attributes. If you read this page (which is already a bit advanced), you know that Orange does not really care about attribute names. Therefore, if we wrote :samp:`imputed[i]`, the program would fail since :samp:`imputed` has no attribute :samp:`i`. But it has an attribute with the same name (which even usually has the same value). We therefore use :samp:`i.name` to index the attributes of :samp:`imputed`. (Using names for indexing is not fast, though; if you do it a lot, compute the integer index with :samp:`imputed.domain.index(i.name)`.)
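The transformation described above can be sketched per feature in plain Python: a missing discrete value becomes the extra value "NA", while a continuous feature gains a "def"/"undef" indicator and has its missing value replaced by the training average. The helper name `as_value` is made up for this sketch; this is not the Orange API.

```python
# Sketch of the asValue idea, with None marking a missing value.

def as_value(value, kind, average=None):
    """Return the transformed value(s) for one feature of one example."""
    if kind == "discrete":
        # unknowns become the additional value "NA"
        return "NA" if value is None else value
    # continuous: an (indicator, imputed value) pair; the unknown is
    # replaced by the average computed on the training data
    if value is None:
        return ("undef", average)
    return ("def", value)
```

This mirrors the output above, e.g. `CLEAR-G: ? -> NA` and `LENGTH: ? -> 1567 (undef)`.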

For continuous attributes, there is an additional attribute with "_def" appended; we get it by :samp:i.name+"_def". The first continuous attribute, "ERECTED" is defined. Its value remains 1874 and the additional attribute "ERECTED_def" has value "def". Not so for "LENGTH". Its undefined value is replaced by the average (1567) and the new attribute has value "undef". The undefined discrete attribute "CLEAR-G" (and all other undefined discrete attributes) is assigned the value "NA". Using imputers ============== To properly use the imputation classes in learning process, they must be trained on training examples only. Imputing the missing values and subsequently using the data set in cross-validation will give overly optimistic results. Learners with imputer as a component ------------------------------------ Orange learners that cannot handle missing values will generally provide a slot for the imputer component. An example of such a class is :obj:Orange.classification.logreg.LogRegLearner with an attribute called :obj:Orange.classification.logreg.LogRegLearner.imputerConstructor. To it you can assign an imputer constructor - one of the above constructors or a specific constructor you wrote yourself. When given learning examples, :obj:Orange.classification.logreg.LogRegLearner will pass them to :obj:Orange.classification.logreg.LogRegLearner.imputerConstructor to get an imputer (again some of the above or a specific imputer you programmed). It will immediately use the imputer to impute the missing values in the learning data set, so it can be used by the actual learning algorithm. Besides, when the classifier :obj:Orange.classification.logreg.LogRegClassifier is constructed, the imputer will be stored in its attribute :obj:Orange.classification.logreg.LogRegClassifier.imputer. At classification, the imputer will be used for imputation of missing values in (testing) examples. 
Although details may vary from algorithm to algorithm, this is how imputation is generally used in Orange's learners. Also, if you write your own learners, it is recommended that you use imputation according to the described procedure.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange classes do so; refer to the documentation on subtyping the Orange classes in Python _ for a list). If you want to write your own imputation constructor or imputer, you simply need to program a Python function that will behave like the built-in Orange classes (even less, actually: for an imputer you only need to write a function that gets an example as an argument; imputation for example tables will then use that function).

You will most often write the imputation constructor when you have a special imputation procedure or separate procedures for various attributes, as we've demonstrated in the description of :obj:`Orange.feature.imputation.ImputerConstructor_model`. You basically only need to pack everything we've written there into an imputer constructor that will accept a data set and the id of the weight meta-attribute (ignore it if you will, but you must accept two arguments), and return the imputer (probably an :obj:`Orange.feature.imputation.Imputer_model`). The benefit of implementing an imputer constructor, as opposed to what we did above, is that you can use such a constructor as a component for Orange learners (like logistic regression) or for wrappers from the module orngImpute, and that way properly use them in classifier testing procedures.

"""

import Orange.core as orange
from orange import ImputerConstructor_minimal
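The two-callable shape described above can be sketched in plain Python: the constructor takes the data and a weight id and returns an imputer, which is itself a callable mapping an example to a new, imputed example. Names are hypothetical and None marks a missing value; this is a sketch of the protocol, not the Orange classes.

```python
# Sketch of a custom imputer constructor: learn per-column averages,
# return an imputer that fills them in.

def my_imputer_constructor(data, weight_id=0):
    """Accept (data, weight id), return an imputer callable."""
    n_cols = len(data[0])
    means = []
    for c in range(n_cols):
        known = [row[c] for row in data if row[c] is not None]
        means.append(sum(known) / len(known))

    def imputer(example):
        # leave the original intact; return a new, fully defined example
        return tuple(m if v is None else v for v, m in zip(example, means))
    return imputer

imp = my_imputer_constructor([(1.0, 10.0), (3.0, None)])
fixed = imp((None, None))
```

Because the constructor has the (data, weight) signature, a learner can call it exactly where it would call a built-in `ImputerConstructor`.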
• ## Orange/feature/scoring.py

 r9671 Assesses features' ability to distinguish between very similar instances from different classes. This scoring method was first developed by Kira and Rendell and then improved by Kononenko. The class :obj:`Relief` works on discrete and continuous classes and thus implements ReliefF and RReliefF.
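The idea behind Relief can be illustrated with a stripped-down version for numeric features: each feature is rewarded when it differs on the nearest instance of the other class (nearest miss) and penalized when it differs on the nearest instance of the same class (nearest hit). This is a didactic sketch of the original Kira-Rendell scheme, not the ReliefF/RReliefF variants the class actually implements.

```python
def relief_scores(X, y):
    """Simplified Relief for numeric features. Assumes every instance
    has at least one same-class and one other-class neighbour."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        s = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(m):
            # reward difference on the miss, penalize difference on the hit
            w[f] += abs(X[i][f] - X[s][f]) - abs(X[i][f] - X[h][f])
    return [v / n for v in w]
```

On a toy set where the first feature determines the class and the second is noise, the first feature scores high and the noise feature scores below zero.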
• ## Orange/fixes/fix_orange_imports.py

 r9671 "orngSOM": "Orange.projection.som", "orngBayes":"Orange.classification.bayes", "orngLR":"Orange.classification.logreg", "orngNetwork":"Orange.network", "orngMisc":"Orange.misc",
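The table above maps the old `orng*` module names to their new `Orange.*` locations for the import fixer. The mechanical effect on import statements can be sketched at the string level (a simplified illustration, not the actual lib2to3-based fixer Orange ships):

```python
import re

# The old-name -> new-name entries shown in the diff above.
MAPPING = {
    "orngSOM": "Orange.projection.som",
    "orngBayes": "Orange.classification.bayes",
    "orngLR": "Orange.classification.logreg",
    "orngNetwork": "Orange.network",
    "orngMisc": "Orange.misc",
}

def fix_imports(line):
    """Replace whole-word occurrences of old module names in a source line."""
    for old, new in MAPPING.items():
        line = re.sub(r"\b%s\b" % old, new, line)
    return line
```

For example, `fix_imports("import orngLR")` yields `"import Orange.classification.logreg"`.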
• ## Orange/orng/orngCA.py

 r9671 # This has to be seriously outdated, as it uses matrixmultiply, which is not
# present in numpy since, like, 2006.     Matija Polajnar, 2012 a.d.
"""
Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and
• ## Orange/testing/regression/results_modules/outlier1.py.txt

 r9689 [1.9537338562966178, 1.4916207549367024, 2.3567027252768531, 1.3547757144824362, 1.3165608011293919, 0.77090435634545718, 2.5037293510604748, 1.5073157621637685, 3.1360623720829142, 0.69078079138014914, 2.5285764279332521, 0.7433781485122628, 1.0879648893391614, 1.5809701088397101, 0.95551967016414863, 0.64109928938427407, 1.3610443564385841, 0.59842805343843031, 0.5764828338874588, 0.65896660247970529, -0.11532119366264716, 0.5946079506443136, 0.2164048799939679, -0.13532808531244495, -0.77356783494291881, -0.75705635746691069, -0.86104391212529852, -1.2168053958148779, -0.8496508795832981, -0.50123387464222346, -0.77341815879931297, 0.16767396953609892, -0.35117712691223024, 0.48721479093189457, -1.7288308112675763, -0.41177669611696449, -0.78877602511487799, -1.3872717005506068, 2.106596327546038, -1.2871466419288893, -0.41878699142408998, -1.2405469480761979, -0.24337773166773838, -0.95375835103295281, 0.48201088266530023, -0.56333257217473698, -0.7443069891450993, -1.764166660649225, -1.7065434213509496, -0.50504384697095039, -1.3054185775296043, -0.80681324891277462, -0.63828120116943221, -1.1627861693976371, -0.45401443972098127, -0.61371961477796033, -0.4977304132654839, -0.028586481599992053, -0.30729060363501759, -0.88295861769597794, -1.4449272394895061, -0.096688263582083503, -0.59221286077400181, 0.60840114873648288, -0.65173597952659179, -0.29992196992479447, -1.2488097930212723, 1.199403408838386, -1.1695890709679937, 0.43695418246540901, -1.1042173101212782, -0.66509777841096018, -0.51396431962727951, -0.51984330801587042, 0.0043002748331546275, -0.55181190984179918, 1.3441195820509282, -0.71294892448637903, -0.018440170740114344, 0.36568005901789147, -0.5072588384978175, -1.1650625423904195, -0.38855801862147704, 0.30554976386664046, -0.69074060168648355, 1.5280992689860888, -0.10141847608741561, 0.47051948353017276, -0.53870459912488544, 0.97851359900596802, -0.40786746694014064, 0.48008263656307049, 0.2240552203634309, 
-0.95959950865225618, -0.83515808262514213, 0.47533200198642123, -0.57881353707992, 1.0640441766510356, -0.58475334818831626, -0.36677168526134823, -0.11674188951656596, 0.58513238481347862, 0.50799047774082506, -0.41123866164618883, 0.78235502874783658, 0.036505126098621624, 0.57776931524450592, 0.17483092285652521] [1.9537338562966178, 1.4916207549367024, 2.356702725276853, 1.3547757144824362, 1.3165608011293919, 0.7709043563454572, 2.503729351060475, 1.5073157621637685, 3.136062372082914, 0.6907807913801491, 2.528576427933252, 0.7433781485122628, 1.0879648893391614, 1.5809701088397101, 0.9555196701641486, 0.6410992893842741, 1.3610443564385841, 0.5984280534384303, 0.5764828338874588, 0.6589666024797053, -0.11532119366264716, 0.5946079506443136, 0.2164048799939679, -0.13532808531244495, -0.7735678349429188, -0.7570563574669107, -0.8610439121252985, -1.216805395814878, -0.8496508795832981, -0.5012338746422235, -0.773418158799313, 0.16767396953609892, -0.35117712691223024, 0.48721479093189457, -1.7288308112675763, -0.4117766961169645, -0.788776025114878, -1.3872717005506068, 2.106596327546038, -1.2871466419288893, -0.41878699142409, -1.240546948076198, -0.24337773166773838, -0.9537583510329528, 0.48201088266530023, -0.563332572174737, -0.7443069891450993, -1.764166660649225, -1.7065434213509496, -0.5050438469709504, -1.3054185775296043, -0.8068132489127746, -0.6382812011694322, -1.1627861693976371, -0.45401443972098127, -0.6137196147779603, -0.4977304132654839, -0.028586481599992053, -0.3072906036350176, -0.8829586176959779, -1.444927239489506, -0.0966882635820835, -0.5922128607740018, 0.6084011487364829, -0.6517359795265918, -0.2999219699247945, -1.2488097930212723, 1.199403408838386, -1.1695890709679937, 0.436954182465409, -1.1042173101212782, -0.6650977784109602, -0.5139643196272795, -0.5198433080158704, 0.0043002748331546275, -0.5518119098417992, 1.3441195820509282, -0.712948924486379, -0.018440170740114344, 0.36568005901789147, -0.5072588384978175, 
-1.1650625423904195, -0.38855801862147704, 0.30554976386664046, -0.6907406016864835, 1.5280992689860888, -0.10141847608741561, 0.47051948353017276, -0.5387045991248854, 0.978513599005968, -0.40786746694014064, 0.4800826365630705, 0.2240552203634309, -0.9595995086522562, -0.8351580826251421, 0.4753320019864212, -0.57881353707992, 1.0640441766510356, -0.5847533481883163, -0.36677168526134823, -0.11674188951656596, 0.5851323848134786, 0.5079904777408251, -0.41123866164618883, 0.7823550287478366, 0.036505126098621624, 0.5777693152445059, 0.1748309228565252]
• ## Orange/testing/regression/results_modules/som1.py.txt

 r9689 node: 0 0 [5.2, 2.7, 3.9, 1.4, 'Iris-versicolor'] [5.1, 3.5, 1.4, 0.3, 'Iris-setosa'] [5.0, 3.5, 1.3, 0.3, 'Iris-setosa'] node: 0 1 [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'] node: 0 2 [5.7, 3.8, 1.7, 0.3, 'Iris-setosa'] [5.1, 3.7, 1.5, 0.4, 'Iris-setosa'] node: 0 3 [5.1, 3.8, 1.5, 0.3, 'Iris-setosa'] [5.1, 3.8, 1.6, 0.2, 'Iris-setosa'] node: 0 4 [5.7, 4.4, 1.5, 0.4, 'Iris-setosa'] node: 0 5 [5.8, 4.0, 1.2, 0.2, 'Iris-setosa'] [5.2, 4.1, 1.5, 0.1, 'Iris-setosa'] node: 0 6 [5.4, 3.9, 1.3, 0.4, 'Iris-setosa'] node: 0 7 node: 0 8 [5.4, 3.7, 1.5, 0.2, 'Iris-setosa'] [5.5, 3.5, 1.3, 0.2, 'Iris-setosa'] [5.3, 3.7, 1.5, 0.2, 'Iris-setosa'] node: 0 9 node: 0 10 [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] [5.2, 3.5, 1.5, 0.2, 'Iris-setosa'] [5.2, 3.4, 1.4, 0.2, 'Iris-setosa'] node: 0 11 [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'] [5.1, 3.5, 1.4, 0.3, 'Iris-setosa'] node: 0 12 [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'] [5.0, 3.5, 1.3, 0.3, 'Iris-setosa'] [5.1, 3.4, 1.5, 0.2, 'Iris-setosa'] node: 0 13 [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'] node: 0 14 [5.0, 3.3, 1.4, 0.2, 'Iris-setosa'] node: 0 15 [5.0, 3.2, 1.2, 0.2, 'Iris-setosa'] node: 0 16 [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] node: 0 17 [4.4, 3.2, 1.3, 0.2, 'Iris-setosa'] [4.6, 3.2, 1.4, 0.2, 'Iris-setosa'] node: 0 18 [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'] node: 0 19 [4.6, 3.6, 1.0, 0.2, 'Iris-setosa'] node: 0 15 node: 0 16 [4.3, 3.0, 1.1, 0.1, 'Iris-setosa'] [4.4, 3.2, 1.3, 0.2, 'Iris-setosa'] node: 0 17 [4.4, 3.0, 1.3, 0.2, 'Iris-setosa'] node: 0 18 [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'] node: 0 19 [4.5, 2.3, 1.3, 0.3, 'Iris-setosa'] node: 1 0 [5.5, 2.3, 4.0, 1.3, 'Iris-versicolor'] [5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'] [5.0, 3.4, 1.6, 0.4, 'Iris-setosa'] node: 1 1 [5.1, 3.3, 1.7, 0.5, 'Iris-setosa'] [5.0, 3.5, 1.6, 0.6, 'Iris-setosa'] node: 1 2 [5.5, 4.2, 1.4, 0.2, 'Iris-setosa'] node: 1 3 node: 1 4 [5.2, 4.1, 1.5, 0.1, 'Iris-setosa'] [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'] [5.1, 3.8, 1.9, 0.4, 'Iris-setosa'] node: 1 5 [5.5, 4.2, 1.4, 0.2, 'Iris-setosa'] 
node: 1 6 [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'] [5.8, 4.0, 1.2, 0.2, 'Iris-setosa'] [5.7, 4.4, 1.5, 0.4, 'Iris-setosa'] node: 1 7 [5.4, 3.7, 1.5, 0.2, 'Iris-setosa'] node: 1 8 [5.3, 3.7, 1.5, 0.2, 'Iris-setosa'] [5.7, 3.8, 1.7, 0.3, 'Iris-setosa'] node: 1 9 [5.2, 3.5, 1.5, 0.2, 'Iris-setosa'] node: 1 10 [5.1, 3.4, 1.5, 0.2, 'Iris-setosa'] [5.4, 3.4, 1.7, 0.2, 'Iris-setosa'] [5.4, 3.4, 1.5, 0.4, 'Iris-setosa'] node: 1 11 [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'] node: 1 12 [5.0, 3.3, 1.4, 0.2, 'Iris-setosa'] [4.8, 3.4, 1.6, 0.2, 'Iris-setosa'] [4.8, 3.4, 1.9, 0.2, 'Iris-setosa'] node: 1 13 [5.0, 3.2, 1.2, 0.2, 'Iris-setosa'] [4.7, 3.2, 1.6, 0.2, 'Iris-setosa'] node: 1 14 [4.8, 3.1, 1.6, 0.2, 'Iris-setosa'] node: 1 15 [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'] node: 1 16 [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'] [4.6, 3.2, 1.4, 0.2, 'Iris-setosa'] node: 1 17 [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'] node: 1 18 node: 1 19 [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] [4.8, 3.0, 1.4, 0.1, 'Iris-setosa'] [4.8, 3.0, 1.4, 0.3, 'Iris-setosa'] node: 2 0 [4.9, 2.5, 4.5, 1.7, 'Iris-virginica'] node: 2 1 node: 2 2 [5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'] [5.6, 2.7, 4.2, 1.3, 'Iris-versicolor'] node: 2 3 node: 2 4 [5.1, 3.8, 1.9, 0.4, 'Iris-setosa'] node: 2 5 node: 2 6 [5.1, 3.8, 1.5, 0.3, 'Iris-setosa'] [5.1, 3.7, 1.5, 0.4, 'Iris-setosa'] [5.1, 3.8, 1.6, 0.2, 'Iris-setosa'] node: 2 7 node: 2 8 [5.4, 3.4, 1.7, 0.2, 'Iris-setosa'] node: 2 9 [5.4, 3.4, 1.5, 0.4, 'Iris-setosa'] node: 2 10 node: 2 11 node: 2 12 [5.1, 3.3, 1.7, 0.5, 'Iris-setosa'] node: 2 13 [5.0, 3.5, 1.6, 0.6, 'Iris-setosa'] node: 2 14 [5.0, 3.4, 1.6, 0.4, 'Iris-setosa'] node: 2 15 [4.8, 3.4, 1.6, 0.2, 'Iris-setosa'] node: 2 16 [4.7, 3.2, 1.6, 0.2, 'Iris-setosa'] node: 2 17 [4.8, 3.1, 1.6, 0.2, 'Iris-setosa'] node: 2 18 node: 2 19 [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'] [5.0, 3.0, 1.6, 0.2, 'Iris-setosa'] [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'] [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'] node: 1 16 [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'] [4.8, 3.0, 1.4, 0.1, 
'Iris-setosa'] node: 1 17 [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'] [4.8, 3.0, 1.4, 0.3, 'Iris-setosa'] node: 1 18 [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'] node: 1 19 [4.3, 3.0, 1.1, 0.1, 'Iris-setosa'] [4.4, 3.0, 1.3, 0.2, 'Iris-setosa'] node: 2 0 [5.7, 2.6, 3.5, 1.0, 'Iris-versicolor'] node: 2 1 node: 2 2 [5.5, 2.4, 3.7, 1.0, 'Iris-versicolor'] node: 2 3 [5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'] node: 2 4 [5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'] node: 2 5 node: 2 6 [5.1, 2.5, 3.0, 1.1, 'Iris-versicolor'] node: 2 7 node: 2 8 [4.9, 2.4, 3.3, 1.0, 'Iris-versicolor'] [5.0, 2.3, 3.3, 1.0, 'Iris-versicolor'] node: 2 9 node: 2 10 [5.0, 2.0, 3.5, 1.0, 'Iris-versicolor'] node: 2 11 node: 2 12 [4.5, 2.3, 1.3, 0.3, 'Iris-setosa'] node: 2 13 node: 2 14 node: 2 15 node: 2 16 node: 2 17 node: 2 18 node: 2 19 node: 3 0 [5.4, 3.0, 4.5, 1.5, 'Iris-versicolor'] [5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'] node: 3 1 [5.6, 3.0, 4.5, 1.5, 'Iris-versicolor'] node: 3 2 [5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'] [5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'] [5.7, 3.0, 4.2, 1.2, 'Iris-versicolor'] node: 3 3 [5.7, 2.9, 4.2, 1.3, 'Iris-versicolor'] [5.7, 2.8, 4.1, 1.3, 'Iris-versicolor'] node: 3 4 [5.7, 2.8, 4.1, 1.3, 'Iris-versicolor'] node: 3 5 [5.7, 2.9, 4.2, 1.3, 'Iris-versicolor'] [5.9, 3.0, 4.2, 1.5, 'Iris-versicolor'] node: 3 6 [5.7, 3.0, 4.2, 1.2, 'Iris-versicolor'] [6.1, 2.8, 4.0, 1.3, 'Iris-versicolor'] node: 3 7 [5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'] node: 3 8 [5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'] [5.8, 2.7, 4.1, 1.0, 'Iris-versicolor'] [5.8, 2.6, 4.0, 1.2, 'Iris-versicolor'] node: 3 9 node: 3 10 [4.8, 3.4, 1.9, 0.2, 'Iris-setosa'] node: 3 11 [5.5, 2.3, 4.0, 1.3, 'Iris-versicolor'] [5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'] node: 3 12 [7.1, 3.0, 5.9, 2.1, 'Iris-virginica'] [5.2, 2.7, 3.9, 1.4, 'Iris-versicolor'] node: 3 13 node: 3 14 [7.6, 3.0, 6.6, 2.1, 'Iris-virginica'] node: 3 15 [6.0, 2.2, 4.0, 1.0, 'Iris-versicolor'] node: 3 16 [7.7, 2.8, 6.7, 2.0, 'Iris-virginica'] [6.2, 2.2, 4.5, 1.5, 
'Iris-versicolor'] [6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'] node: 3 17 node: 3 18 node: 3 19 [7.7, 2.6, 6.9, 2.3, 'Iris-virginica'] [7.7, 2.8, 6.7, 2.0, 'Iris-virginica'] node: 4 0 [5.8, 2.7, 5.1, 1.9, 'Iris-virginica'] [5.7, 2.5, 5.0, 2.0, 'Iris-virginica'] [5.8, 2.7, 5.1, 1.9, 'Iris-virginica'] [5.4, 3.0, 4.5, 1.5, 'Iris-versicolor'] node: 4 1 [5.6, 3.0, 4.5, 1.5, 'Iris-versicolor'] node: 4 2 [5.6, 2.8, 4.9, 2.0, 'Iris-virginica'] [5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'] node: 4 3 node: 4 4 [5.9, 3.0, 4.2, 1.5, 'Iris-versicolor'] [6.0, 2.9, 4.5, 1.5, 'Iris-versicolor'] node: 4 5 [6.1, 3.0, 4.6, 1.4, 'Iris-versicolor'] node: 4 6 [6.0, 2.9, 4.5, 1.5, 'Iris-versicolor'] [6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'] node: 4 7 [6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'] node: 4 8 [6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'] [6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'] [5.6, 2.7, 4.2, 1.3, 'Iris-versicolor'] node: 4 9 node: 4 10 [7.4, 2.8, 6.1, 1.9, 'Iris-virginica'] node: 4 11 [5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'] node: 4 12 [7.7, 3.0, 6.1, 2.3, 'Iris-virginica'] [6.0, 2.2, 5.0, 1.5, 'Iris-virginica'] node: 4 13 node: 4 14 node: 4 15 [6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'] node: 4 16 node: 4 17 [6.7, 2.5, 5.8, 1.8, 'Iris-virginica'] node: 4 18 node: 4 19 node: 5 0 [5.8, 2.8, 5.1, 2.4, 'Iris-virginica'] [4.9, 2.5, 4.5, 1.7, 'Iris-virginica'] node: 5 1 node: 5 2 [6.1, 3.0, 4.9, 1.8, 'Iris-virginica'] node: 5 7 [6.2, 2.8, 4.8, 1.8, 'Iris-virginica'] node: 5 8 [6.2, 2.8, 4.8, 1.8, 'Iris-virginica'] [6.3, 2.7, 4.9, 1.8, 'Iris-virginica'] node: 5 9 [6.3, 2.7, 4.9, 1.8, 'Iris-virginica'] [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'] node: 5 10 [6.3, 2.5, 5.0, 1.9, 'Iris-virginica'] node: 5 11 [6.0, 2.7, 5.1, 1.6, 'Iris-versicolor'] node: 5 12 [6.4, 2.7, 5.3, 1.9, 'Iris-virginica'] node: 5 13 [6.1, 2.6, 5.6, 1.4, 'Iris-virginica'] node: 5 14 [6.7, 2.5, 5.8, 1.8, 'Iris-virginica'] node: 5 15 [7.4, 2.8, 6.1, 1.9, 'Iris-virginica'] node: 5 16 [7.2, 3.0, 5.8, 1.6, 'Iris-virginica'] node: 5 17 [7.2, 3.2, 
6.0, 1.8, 'Iris-virginica'] [7.3, 2.9, 6.3, 1.8, 'Iris-virginica'] node: 5 18 node: 5 19 [7.3, 2.9, 6.3, 1.8, 'Iris-virginica'] [7.6, 3.0, 6.6, 2.1, 'Iris-virginica'] node: 6 0 [6.4, 2.8, 5.6, 2.1, 'Iris-virginica'] [6.4, 2.8, 5.6, 2.2, 'Iris-virginica'] [5.6, 2.8, 4.9, 2.0, 'Iris-virginica'] node: 6 1 node: 6 2 [6.5, 3.2, 5.1, 2.0, 'Iris-virginica'] [5.7, 2.5, 5.0, 2.0, 'Iris-virginica'] node: 6 3 node: 6 4 [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'] [5.8, 2.7, 5.1, 1.9, 'Iris-virginica'] [5.8, 2.7, 5.1, 1.9, 'Iris-virginica'] node: 6 5 node: 6 6 [6.0, 2.2, 5.0, 1.5, 'Iris-virginica'] [6.4, 2.7, 5.3, 1.9, 'Iris-virginica'] node: 6 7 [6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'] node: 6 8 [6.3, 2.8, 5.1, 1.5, 'Iris-virginica'] node: 6 9 [6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'] node: 6 10 [7.2, 3.0, 5.8, 1.6, 'Iris-virginica'] node: 6 11 [6.7, 3.0, 5.0, 1.7, 'Iris-versicolor'] node: 6 12 [7.2, 3.2, 6.0, 1.8, 'Iris-virginica'] node: 6 13 [6.7, 3.1, 4.7, 1.5, 'Iris-versicolor'] node: 6 14 [7.1, 3.0, 5.9, 2.1, 'Iris-virginica'] node: 6 15 [6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'] node: 6 16 [7.7, 3.0, 6.1, 2.3, 'Iris-virginica'] node: 6 17 [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'] node: 6 18 node: 6 19 node: 7 0 [6.5, 3.0, 5.8, 2.2, 'Iris-virginica'] [5.8, 2.8, 5.1, 2.4, 'Iris-virginica'] node: 7 1 node: 7 2 node: 7 4 [6.9, 3.1, 5.1, 2.3, 'Iris-virginica'] [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'] node: 7 5 node: 7 6 [6.7, 3.0, 5.2, 2.3, 'Iris-virginica'] node: 7 7 node: 7 8 [6.5, 3.0, 5.2, 2.0, 'Iris-virginica'] node: 7 9 node: 7 10 [6.5, 3.2, 5.1, 2.0, 'Iris-virginica'] node: 7 11 node: 7 12 [6.7, 3.0, 5.0, 1.7, 'Iris-versicolor'] node: 7 13 node: 7 14 [6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'] node: 7 15 node: 7 16 [7.2, 3.6, 6.1, 2.5, 'Iris-virginica'] node: 7 17 node: 7 18 [7.9, 3.8, 6.4, 2.0, 'Iris-virginica'] node: 7 19 [7.7, 3.8, 6.7, 2.2, 'Iris-virginica'] node: 8 0 [6.3, 2.9, 5.6, 1.8, 'Iris-virginica'] node: 8 1 [6.4, 2.8, 5.6, 2.1, 'Iris-virginica'] node: 8 2 [6.4, 2.8, 5.6, 
2.2, 'Iris-virginica'] node: 8 3 [6.5, 3.0, 5.8, 2.2, 'Iris-virginica'] node: 8 4 node: 8 5 [6.9, 3.2, 5.7, 2.3, 'Iris-virginica'] node: 8 6 node: 8 7 [6.7, 3.1, 5.6, 2.4, 'Iris-virginica'] [6.7, 3.3, 5.7, 2.5, 'Iris-virginica'] node: 8 8 node: 8 9 [6.4, 3.2, 5.3, 2.3, 'Iris-virginica'] node: 8 10 node: 8 11 [6.7, 3.1, 4.7, 1.5, 'Iris-versicolor'] node: 8 12 node: 8 13 [6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'] node: 8 14 node: 8 15 [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'] node: 8 16 node: 8 17 [6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'] node: 8 18 node: 8 19 [6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'] node: 9 0 [6.5, 3.0, 5.5, 1.8, 'Iris-virginica'] [6.4, 3.1, 5.5, 1.8, 'Iris-virginica'] node: 7 7 node: 7 8 [6.3, 2.9, 5.6, 1.8, 'Iris-virginica'] node: 7 9 node: 7 10 [6.1, 2.6, 5.6, 1.4, 'Iris-virginica'] node: 7 11 node: 7 12 [6.0, 2.7, 5.1, 1.6, 'Iris-versicolor'] node: 7 13 [6.3, 2.8, 5.1, 1.5, 'Iris-virginica'] node: 7 14 node: 7 15 [6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'] node: 7 16 [6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'] node: 7 17 [6.7, 3.1, 4.4, 1.4, 'Iris-versicolor'] [6.6, 3.0, 4.4, 1.4, 'Iris-versicolor'] node: 7 18 node: 7 19 [6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'] [6.2, 2.9, 4.3, 1.3, 'Iris-versicolor'] node: 8 0 [6.7, 3.3, 5.7, 2.1, 'Iris-virginica'] node: 8 1 node: 8 2 [6.9, 3.2, 5.7, 2.3, 'Iris-virginica'] [6.8, 3.2, 5.9, 2.3, 'Iris-virginica'] node: 8 3 node: 8 4 [6.7, 3.1, 5.6, 2.4, 'Iris-virginica'] [6.7, 3.3, 5.7, 2.5, 'Iris-virginica'] node: 8 5 node: 8 6 [6.3, 3.4, 5.6, 2.4, 'Iris-virginica'] node: 8 7 [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'] node: 8 8 [6.4, 3.2, 5.3, 2.3, 'Iris-virginica'] node: 8 9 node: 8 10 [6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'] [6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'] [6.1, 3.0, 4.6, 1.4, 'Iris-versicolor'] node: 8 11 node: 8 12 [5.7, 2.6, 3.5, 1.0, 'Iris-versicolor'] node: 8 13 node: 8 14 [5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'] node: 8 15 [6.1, 2.8, 4.0, 1.3, 'Iris-versicolor'] node: 8 16 node: 8 17 [6.0, 2.2, 4.0, 1.0, 
'Iris-versicolor'] node: 8 18 node: 8 19 [5.8, 2.7, 4.1, 1.0, 'Iris-versicolor'] [5.8, 2.6, 4.0, 1.2, 'Iris-versicolor'] node: 9 0 [7.7, 3.8, 6.7, 2.2, 'Iris-virginica'] [7.9, 3.8, 6.4, 2.0, 'Iris-virginica'] node: 9 1 node: 9 2 [7.2, 3.6, 6.1, 2.5, 'Iris-virginica'] [6.7, 3.3, 5.7, 2.1, 'Iris-virginica'] node: 9 3 node: 9 4 [6.3, 3.3, 6.0, 2.5, 'Iris-virginica'] [6.8, 3.2, 5.9, 2.3, 'Iris-virginica'] node: 9 5 node: 9 6 [6.3, 3.3, 6.0, 2.5, 'Iris-virginica'] node: 9 7 node: 9 8 [6.3, 3.4, 5.6, 2.4, 'Iris-virginica'] node: 9 9 [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'] node: 9 10 node: 9 11 node: 9 12 [6.0, 3.4, 4.5, 1.6, 'Iris-versicolor'] node: 9 13 [6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'] node: 9 14 [6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'] [6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'] node: 9 7 [6.0, 3.4, 4.5, 1.6, 'Iris-versicolor'] node: 9 8 node: 9 9 node: 9 10 [5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'] node: 9 11 [5.1, 2.5, 3.0, 1.1, 'Iris-versicolor'] node: 9 12 node: 9 13 [4.9, 2.4, 3.3, 1.0, 'Iris-versicolor'] node: 9 14 [5.0, 2.3, 3.3, 1.0, 'Iris-versicolor'] node: 9 15 [5.0, 2.0, 3.5, 1.0, 'Iris-versicolor'] node: 9 16 [6.2, 2.9, 4.3, 1.3, 'Iris-versicolor'] node: 9 17 [5.5, 2.4, 3.7, 1.0, 'Iris-versicolor'] [6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'] node: 9 18 [5.5, 2.4, 3.8, 1.1, 'Iris-versicolor'] [6.6, 3.0, 4.4, 1.4, 'Iris-versicolor'] node: 9 19 [5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'] [6.7, 3.1, 4.4, 1.4, 'Iris-versicolor']
• ## Orange/testing/regression/results_modules/statExamples.py.txt

 r9767 method  CA  AP  Brier   IS bayes   0.903   0.902   0.176    0.758 tree    0.825   0.824   0.326    0.599 bayes   0.903   0.902   0.175    0.759 tree    0.846   0.845   0.286    0.641 majrty  0.614   0.526   0.474   -0.000 method  CA  AP  Brier   IS bayes   0.903+-0.008    0.902+-0.008    0.176+-0.016     0.758+-0.017 tree    0.825+-0.016    0.824+-0.016    0.326+-0.033     0.599+-0.034 bayes   0.903+-0.019    0.902+-0.019    0.175+-0.036     0.759+-0.039 tree    0.846+-0.016    0.845+-0.015    0.286+-0.030     0.641+-0.032 majrty  0.614+-0.003    0.526+-0.001    0.474+-0.001    -0.000+-0.000 Confusion matrix for naive Bayes: TP: 240, FP: 18, FN: 27.0, TN: 150 TP: 239, FP: 18, FN: 28.0, TN: 150 Confusion matrix for naive Bayes for 'van': TP: 192, FP: 151, FN: 7.0, TN: 496 TP: 189, FP: 241, FN: 10.0, TN: 406 Confusion matrix for naive Bayes for 'opel': TP: 79, FP: 75, FN: 133.0, TN: 559 TP: 86, FP: 112, FN: 126.0, TN: 522 bus van saab    opel bus 156 19  17  26 van 4   192 2   1 saab    8   68  93  48 opel    8   64  61  79 bus 56  95  21  46 van 6   189 4   0 saab    3   75  73  66 opel    4   71  51  86 Sensitivity and specificity for 'voting' method  sens    spec bayes   0.891   0.923 tree    0.801   0.863 tree    0.816   0.893 majrty  1.000   0.000 Sensitivity and specificity for 'vehicle=van' method  sens    spec bayes   0.965   0.767 tree    0.834   0.966 bayes   0.950   0.628 tree    0.809   0.966 majrty  0.000   1.000 AUC (voting) bayes: 0.974 tree: 0.926 tree: 0.930 majrty: 0.500 AUC for vehicle using weighted single-out method bayes   tree    majority 0.840   0.816   0.500 0.783   0.800   0.500 AUC for vehicle, using different methods bayes   tree    majority by pairs, weighted:  0.861   0.883   0.500 by pairs:  0.863   0.884   0.500 one vs. all, weighted:  0.840   0.816   0.500 one vs. all:  0.840   0.816   0.500 by pairs, weighted:  0.789   0.870   0.500 by pairs:  0.791   0.871   0.500 one vs. all, weighted:  0.783   0.800   0.500 one vs. 
all:  0.783   0.800   0.500 AUC for detecting class 'van' in 'vehicle' 0.923   0.900   0.500 0.858   0.888   0.500 AUCs for detecting various classes in 'vehicle' bus (218.000) vs others:    0.952   0.936   0.500 van (199.000) vs others:    0.923   0.900   0.500 saab (217.000) vs others:   0.737   0.707   0.500 opel (212.000) vs others:   0.749   0.718   0.500 bus (218.000) vs others:    0.894   0.932   0.500 van (199.000) vs others:    0.858   0.888   0.500 saab (217.000) vs others:   0.699   0.687   0.500 opel (212.000) vs others:   0.682   0.694   0.500 bus van saab van 0.987 saab    0.927   0.860 opel    0.921   0.894   0.587 van 0.933 saab    0.820   0.828 opel    0.822   0.825   0.519 AUCs for detecting various pairs of classes in 'vehicle' van vs bus:     0.987   0.976   0.500 saab vs bus:    0.927   0.936   0.500 saab vs van:    0.860   0.906   0.500 opel vs bus:    0.921   0.951   0.500 opel vs van:    0.894   0.915   0.500 opel vs saab:   0.587   0.622   0.500 van vs bus:     0.933   0.978   0.500 saab vs bus:    0.820   0.938   0.500 saab vs van:    0.828   0.879   0.500 opel vs bus:    0.822   0.932   0.500 opel vs van:    0.825   0.903   0.500 opel vs saab:   0.519   0.599   0.500 AUC and SE for voting bayes: 0.982+-0.008 tree: 0.888+-0.025 bayes: 0.968+-0.015 tree: 0.924+-0.022 majrty: 0.500+-0.045 Difference between naive Bayes and tree: 0.065+-0.066 Difference between naive Bayes and tree: 0.014+-0.062 ROC (first 20 points) for bayes on 'voting' 1.000   1.000 0.970   1.000 0.940   1.000 0.910   1.000 0.896   1.000 0.881   1.000 0.836   1.000 0.821   1.000 0.806   1.000 0.791   1.000 0.761   1.000 0.746   1.000 0.687   1.000 0.672   1.000 0.627   1.000 0.612   1.000 0.597   1.000 0.582   1.000 0.567   1.000 0.672   0.991 0.657   0.991 0.642   0.991 0.552   0.991 0.537   0.991 0.522   0.991 0.507   0.991
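The sensitivity and specificity figures in the expected output follow directly from confusion-matrix counts: sensitivity is TP/(TP+FN), specificity is TN/(TN+FP). A minimal sketch, with the counts taken from the naive Bayes confusion matrix printed above (the printed sens/spec rows come from separate runs, so the values differ slightly):

```python
def sens_spec(tp, fp, fn, tn):
    """Sensitivity (true positive rate) and specificity (true negative rate)
    from the four confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)
```

For TP=240, FP=18, FN=27, TN=150 this gives sensitivity 0.899 and specificity 0.893.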
• ## Orange/testing/regression/results_ofb/accuracy3.py.txt

 r9689 Classification accuracies: bayes 0.93119266055 tree 0.876146788991 tree 0.871559633028
• ## Orange/testing/regression/results_ofb/accuracy4.py.txt

 r9689 1: [0.90839694656488545, 0.9007633587786259] 2: [0.90839694656488545, 0.9007633587786259] 3: [0.90839694656488545, 0.9007633587786259] 4: [0.90839694656488545, 0.9007633587786259] 5: [0.90839694656488545, 0.9007633587786259] 6: [0.90839694656488545, 0.9007633587786259] 7: [0.90839694656488545, 0.9007633587786259] 8: [0.90839694656488545, 0.9007633587786259] 9: [0.90839694656488545, 0.9007633587786259] 10: [0.90839694656488545, 0.9007633587786259] 1: [0.9083969465648855, 0.9007633587786259] 2: [0.9083969465648855, 0.9007633587786259] 3: [0.9083969465648855, 0.9007633587786259] 4: [0.9083969465648855, 0.9007633587786259] 5: [0.9083969465648855, 0.9007633587786259] 6: [0.9083969465648855, 0.9007633587786259] 7: [0.9083969465648855, 0.9007633587786259] 8: [0.9083969465648855, 0.9007633587786259] 9: [0.9083969465648855, 0.9007633587786259] 10: [0.9083969465648855, 0.9007633587786259] Classification accuracies: bayes 0.908396946565
• ## Orange/testing/regression/results_ofb/accuracy5.py.txt

 r9689 1: [0.88636363636363635, 0.93181818181818177] 2: [0.88636363636363635, 0.93181818181818177] 3: [0.88636363636363635, 0.93181818181818177] 4: [0.93181818181818177, 1.0] 5: [0.95454545454545459, 1.0] 6: [0.88372093023255816, 0.97674418604651159] 7: [0.93023255813953487, 0.95348837209302328] 8: [0.88372093023255816, 0.90697674418604646] 9: [0.88372093023255816, 1.0] 10: [0.90697674418604646, 0.90697674418604646] 1: [0.8863636363636364, 0.9318181818181818] 2: [0.8863636363636364, 0.9318181818181818] 3: [0.8863636363636364, 0.9318181818181818] 4: [0.9318181818181818, 1.0] 5: [0.9545454545454546, 1.0] 6: [0.8837209302325582, 0.9767441860465116] 7: [0.9302325581395349, 0.9534883720930233] 8: [0.8837209302325582, 0.9069767441860465] 9: [0.8837209302325582, 1.0] 10: [0.9069767441860465, 0.9069767441860465] Classification accuracies: bayes 0.903382663848
• ## Orange/testing/regression/results_ofb/assoc2.py.txt

 r9689 5 most confident rules: conf    supp    lift    rule 1.000   0.585   1.015   drive-wheels=fwd -> engine-location=front 1.000   0.478   1.015   fuel-type=gas num-of-doors=four -> engine-location=front 1.000   0.478   1.015   fuel-type=gas aspiration=std drive-wheels=fwd -> engine-location=front 1.000   0.429   1.015   fuel-type=gas aspiration=std num-of-doors=four -> engine-location=front 1.000   0.507   1.015   aspiration=std drive-wheels=fwd -> engine-location=front 1.000   0.556   1.015   num-of-doors=four -> engine-location=front 1.000   0.541   1.015   fuel-type=gas drive-wheels=fwd -> engine-location=front 1.000   0.449   1.015   aspiration=std num-of-doors=four -> engine-location=front
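The conf/supp/lift columns are related: for a rule X -> Y over n transactions, support is n(X and Y)/n, confidence is n(X and Y)/n(X), and lift is confidence divided by support of Y. A minimal sketch; the counts in the example below are reconstructed to match the first rule's printed statistics (205 cars, 120 with drive-wheels=fwd, 202 with engine-location=front) rather than taken from the actual data set:

```python
def rule_stats(n_xy, n_x, n_y, n):
    """Support, confidence and lift of an association rule X -> Y,
    given the transaction counts for X-and-Y, X, Y, and the total n."""
    supp = n_xy / n
    conf = n_xy / n_x
    lift = conf / (n_y / n)
    return supp, conf, lift
```

With the reconstructed counts, `rule_stats(120, 120, 202, 205)` reproduces conf 1.000, supp 0.585, lift 1.015.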
• ## Orange/testing/regression/results_ofb/bagging_test.py.linux2.txt

 r9689 tree: 0.777 bagged classifier: 0.794 tree: 0.795 bagged classifier: 0.802
• ## Orange/testing/regression/results_ofb/domain13.py.txt

 r9689 original: 0.940, new: 0.960 original: 0.947, new: 0.960
• ## Orange/testing/regression/results_ofb/ensemble3.py.txt

 r9689 Learner   CA     Brier Score default:  0.473  0.501 k-NN (k=11):  0.881  0.233 bagged k-NN:  0.825  0.252 boosted k-NN:  0.853  0.238 k-NN (k=11):  0.861  0.231 bagged k-NN:  0.854  0.249 boosted k-NN:  0.843  0.261
• ## Orange/testing/regression/results_ofb/handful.py.txt

 r9689 (democrat  )   0.386         0.995         0.011         0.048 (democrat  )   0.386         0.002         0.015         0.000 (democrat  )   0.386         0.043         0.015         0.018 (democrat  )   0.386         0.228         0.015         0.192 (democrat  )   0.386         1.000         0.973         0.665 (democrat  )   0.386         0.043         0.015         0.015 (democrat  )   0.386         0.228         0.015         0.191 (democrat  )   0.386         1.000         0.973         0.776 (republican)   0.386         1.000         0.973         0.861 (republican)   0.386         1.000         0.973         1.000
• ## Orange/testing/regression/results_ofb/regression3.py.txt

 r9689 Learner        MSE default         84.777 regression tree 40.096 regression tree 39.705 k-NN (k=5)      17.532
• ## Orange/testing/regression/results_ofb/regression4.py.txt

 r9689 maj      84.777  9.207  6.659  1.004  1.002  1.002 -0.004 lr       23.729  4.871  3.413  0.281  0.530  0.513  0.719 rt       40.096  6.332  4.569  0.475  0.689  0.687  0.525 rt       39.705  6.301  4.549  0.470  0.686  0.684  0.530 knn      17.244  4.153  2.670  0.204  0.452  0.402  0.796
• ## Orange/testing/regression/results_orange25/linear-example.py.txt

 r9689 30.0 25.0 30.6 28.6 27.9 Actual: 24.00, predicted: 30.00 Actual: 21.60, predicted: 25.03 Actual: 34.70, predicted: 30.57 Actual: 33.40, predicted: 28.61 Actual: 36.20, predicted: 27.94 Variable  Coeff Est  Std Error    t-value          p Intercept     36.459      5.103      7.144      0.000   *** LSTAT     -0.525      0.051    -10.347      0.000   *** Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1 empty 1 Variable  Coeff Est  Std Error    t-value          p Intercept     36.341      5.067      7.171      0.000   *** LSTAT     -0.523      0.047    -11.019      0.000   *** RM      3.802      0.406      9.356      0.000   *** PTRATIO     -0.947      0.129     -7.334      0.000   *** DIS     -1.493      0.186     -8.037      0.000   *** NOX    -17.376      3.535     -4.915      0.000   *** CHAS      2.719      0.854      3.183      0.002    ** B      0.009      0.003      3.475      0.001   *** ZN      0.046      0.014      3.390      0.001   *** CRIM     -0.108      0.033     -3.307      0.001    ** RAD      0.300      0.063      4.726      0.000   *** TAX     -0.012      0.003     -3.493      0.001   *** Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1 empty 1
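The significance codes in the legend above ("0 *** 0.001 ** 0.01 * 0.05 . 0.1 empty 1") are the usual R-style markers for p-value ranges. The mapping can be sketched as:

```python
def signif_code(p):
    """R-style significance markers:
    p < 0.001 -> '***', < 0.01 -> '**', < 0.05 -> '*', < 0.1 -> '.', else ''."""
    for threshold, code in [(0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")]:
        if p < threshold:
            return code
    return ""
```

So the Intercept and LSTAT rows, with p printed as 0.000, get the `***` marker, while CHAS at p = 0.002 gets `**`.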
• ## Orange/testing/regression/results_orange25/simple_tree_random_forest.py.txt

 r9689 Learner  CA     Brier  AUC for_gain 0.933  0.080  0.994 for_gain 0.947  0.078  0.995 for_simp 0.933  0.079  0.995 Runtimes: for_gain 0.0959219932556 for_simp 0.0247950553894 for_gain 0.0934960842133 for_simp 0.0251078605652
• ## Orange/testing/regression/results_orange25/statExamplesRegression.py.txt

 r9689 Learner   MSE     RMSE    MAE     RSE     RRSE    RAE     R2 maj       84.585   9.197   6.653   1.002   1.001   1.001  -0.002 rt        21.685   4.657   3.024   0.257   0.507   0.455   0.743 rt        40.015   6.326   4.592   0.474   0.688   0.691   0.526 knn       21.248   4.610   2.870   0.252   0.502   0.432   0.748 lr        24.092   4.908   3.425   0.285   0.534   0.515   0.715
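The regression scores in the table are related: the relative measures (RSE, RRSE, RAE) compare a learner's errors against those of always predicting the mean of the true values, and R2 = 1 - RSE (note that RSE and R2 in each row above indeed sum to 1). A sketch of the definitions, not the evaluation code Orange uses:

```python
import math

def regression_scores(y_true, y_pred):
    """MSE, RMSE, MAE plus the relative variants (RSE, RRSE, RAE) and R2,
    where 'relative' compares against always predicting the mean of y_true."""
    n = len(y_true)
    mean = sum(y_true) / n
    se = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ae = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    se0 = sum((t - mean) ** 2 for t in y_true)  # squared error of the mean model
    ae0 = sum(abs(t - mean) for t in y_true)    # absolute error of the mean model
    mse, mae = se / n, ae / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "RSE": se / se0, "RRSE": math.sqrt(se / se0),
            "RAE": ae / ae0, "R2": 1 - se / se0}
```

A perfect predictor scores MSE 0 and R2 1; predicting the mean scores RSE 1 and R2 0, which is why the majority ("maj") rows above sit near 1.0 and 0.0.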
• ## Orange/testing/regression/results_orange25/svm-linear-weights.py.txt

 r9689 defaultdict(, {FloatVariable 'Elu 360': 0.16577258493775038, FloatVariable 'alpha 28': 0.04645238275160125, FloatVariable 'Elu 390': 0.1820761083768519, FloatVariable 'alpha 35': 0.21379911334384863, FloatVariable 'heat 80': 0.38909905042009185, FloatVariable 'cdc15 10': 0.11428450056762224, FloatVariable 'alpha 42': 0.13641650791763626, FloatVariable 'cdc15 30': 0.18270335600911874, FloatVariable 'alpha 49': 0.16085715739160683, FloatVariable 'Elu 150': 0.5965599387054419, FloatVariable 'alpha 7': 0.06557659077448137, FloatVariable 'alpha 63': 0.18433878873311124, FloatVariable 'cdc15 70': 0.13876265882583333, FloatVariable 'Elu 120': 0.5939764872379445, FloatVariable 'alpha 70': 0.26459268873021585, FloatVariable 'Elu 90': 0.3834942300173535, FloatVariable 'cdc15 90': 0.32729758144823035, FloatVariable 'heat 0': 0.19091815990337407, FloatVariable 'alpha 77': 0.20088381949723239, FloatVariable 'cdc15 110': 0.564756618474929, FloatVariable 'heat 160': 0.3192185000510415, FloatVariable 'alpha 56': 0.3367416107117914, FloatVariable 'cdc15 130': 0.3658301477295572, FloatVariable 'alpha 21': 0.030539018557891578, FloatVariable 'cdc15 150': 0.693249161777514, FloatVariable 'dtt 120': 0.55305024494988, FloatVariable 'cdc15 170': 0.44789694844623534, FloatVariable 'alpha 119': 0.1699365258012951, FloatVariable 'cold 0': 0.27980454530046744, FloatVariable 'cdc15 190': 0.16956982427965123, FloatVariable 'alpha 91': 0.1905674816738207, FloatVariable 'heat 10': 1.000320207469925, FloatVariable 'cdc15 210': 0.15183429673463036, FloatVariable 'cold 40': 0.3092287272528724, FloatVariable 'cdc15 230': 0.5474715182870784, FloatVariable 'alpha 105': 0.14088060621674625, FloatVariable 'dtt 15': 0.49451797411099035, FloatVariable 'cdc15 250': 0.3573777070361102, FloatVariable 'diau a': 0.14935761521655416, FloatVariable 'cold 20': 0.4097605043285248, FloatVariable 'cdc15 270': 0.21951931922594184, FloatVariable 'diau b': 0.23473821977067888, FloatVariable 'cdc15 290': 
0.24965577080121784, FloatVariable 'diau c': 0.13741762346585432, FloatVariable 'Elu 0': 0.8466167037587657, FloatVariable 'Elu 60': 0.36698713537474476, FloatVariable 'Elu 30': 0.4786304184632838, FloatVariable 'alpha 84': 0.1308234316119738, FloatVariable 'spo 2': 0.7078062232168753, FloatVariable 'heat 20': 0.9867456006798212, FloatVariable 'diau e': 0.864767371223623, FloatVariable 'spo 5': 1.0683498605674933, FloatVariable 'dtt 30': 0.5838556086306895, FloatVariable 'diau f': 1.4452997087693935, FloatVariable 'spo 0': 0.13024372486415015, FloatVariable 'spo 7': 0.8081568079276176, FloatVariable 'diau g': 2.248793727194904, FloatVariable 'spo 9': 0.2498282401589412, FloatVariable 'alpha 98': 0.21754881357923167, FloatVariable 'alpha 0': 0.19198054903386352, FloatVariable 'spo 11': 0.20615575833508282, FloatVariable 'diau d': 0.44932601596409105, FloatVariable 'Elu 180': 0.42425268429856355, FloatVariable 'alpha 14': 0.18310901108005986, FloatVariable 'spo5 2': 0.40417556232809326, FloatVariable 'Elu 210': 0.12396520046361383, FloatVariable 'heat 40': 0.4580618143377812, FloatVariable 'spo5 7': 0.26780459416067937, FloatVariable 'alpha 112': 0.19329749923741518, FloatVariable 'Elu 240': 0.2093390824178926, FloatVariable 'spo5 11': 1.200079459496442, FloatVariable 'Elu 270': 0.33471969574325466, FloatVariable 'dtt 60': 0.5951914850021424, FloatVariable 'spo- early': 1.9466556509082333, FloatVariable 'Elu 300': 0.15913983311663107, FloatVariable 'cold 160': 0.6947037871090163, FloatVariable 'spo- mid': 3.2086605964825132, FloatVariable 'Elu 330': 0.11474308886955724, FloatVariable 'cdc15 50': 0.24968263583989325}) defaultdict(, {FloatVariable 'Elu 30': 0.4786304184632838, FloatVariable 'spo 0': 0.13024372486415015, FloatVariable 'Elu 60': 0.36698713537474476, FloatVariable 'spo 2': 0.7078062232168753, FloatVariable 'Elu 90': 0.3834942300173535, FloatVariable 'spo 5': 1.0683498605674933, FloatVariable 'alpha 7': 0.06557659077448137, FloatVariable 'Elu 120': 
0.5939764872379445, FloatVariable 'spo 7': 0.8081568079276176, FloatVariable 'diau d': 0.44932601596409105, FloatVariable 'Elu 150': 0.5965599387054419, FloatVariable 'alpha 119': 0.1699365258012951, FloatVariable 'spo 9': 0.2498282401589412, FloatVariable 'Elu 180': 0.42425268429856355, FloatVariable 'spo 11': 0.20615575833508282, FloatVariable 'alpha 70': 0.26459268873021585, FloatVariable 'Elu 210': 0.12396520046361383, FloatVariable 'spo5 2': 0.40417556232809326, FloatVariable 'Elu 240': 0.2093390824178926, FloatVariable 'spo5 7': 0.26780459416067937, FloatVariable 'Elu 270': 0.33471969574325466, FloatVariable 'alpha 84': 0.1308234316119738, FloatVariable 'spo5 11': 1.200079459496442, FloatVariable 'diau e': 0.864767371223623, FloatVariable 'Elu 300': 0.15913983311663107, FloatVariable 'spo- early': 1.9466556509082333, FloatVariable 'Elu 330': 0.11474308886955724, FloatVariable 'alpha 42': 0.13641650791763626, FloatVariable 'spo- mid': 3.2086605964825132, FloatVariable 'Elu 360': 0.16577258493775038, FloatVariable 'alpha 14': 0.18310901108005986, FloatVariable 'Elu 390': 0.1820761083768519, FloatVariable 'alpha 21': 0.030539018557891578, FloatVariable 'cdc15 10': 0.11428450056762224, FloatVariable 'alpha 28': 0.04645238275160125, FloatVariable 'alpha 91': 0.1905674816738207, FloatVariable 'cdc15 30': 0.18270335600911874, FloatVariable 'alpha 35': 0.21379911334384863, FloatVariable 'heat 0': 0.19091815990337407, FloatVariable 'cdc15 50': 0.24968263583989325, FloatVariable 'cdc15 170': 0.44789694844623534, FloatVariable 'heat 80': 0.38909905042009185, FloatVariable 'diau f': 1.4452997087693935, FloatVariable 'cdc15 70': 0.13876265882583333, FloatVariable 'alpha 49': 0.16085715739160683, FloatVariable 'cdc15 90': 0.32729758144823035, FloatVariable 'alpha 56': 0.3367416107117914, FloatVariable 'cold 0': 0.27980454530046744, FloatVariable 'cdc15 110': 0.564756618474929, FloatVariable 'Elu 0': 0.8466167037587657, FloatVariable 'alpha 63': 0.18433878873311124, 
FloatVariable 'cdc15 130': 0.3658301477295572, FloatVariable 'dtt 60': 0.5951914850021424, FloatVariable 'alpha 105': 0.14088060621674625, FloatVariable 'cdc15 150': 0.693249161777514, FloatVariable 'dtt 120': 0.55305024494988, FloatVariable 'alpha 112': 0.19329749923741518, FloatVariable 'diau g': 2.248793727194904, FloatVariable 'heat 10': 1.000320207469925, FloatVariable 'cdc15 190': 0.16956982427965123, FloatVariable 'heat 160': 0.3192185000510415, FloatVariable 'dtt 15': 0.49451797411099035, FloatVariable 'cold 20': 0.4097605043285248, FloatVariable 'alpha 0': 0.19198054903386352, FloatVariable 'cdc15 210': 0.15183429673463036, FloatVariable 'cold 40': 0.3092287272528724, FloatVariable 'alpha 98': 0.21754881357923167, FloatVariable 'cdc15 230': 0.5474715182870784, FloatVariable 'cold 160': 0.6947037871090163, FloatVariable 'heat 40': 0.4580618143377812, FloatVariable 'cdc15 250': 0.3573777070361102, FloatVariable 'dtt 30': 0.5838556086306895, FloatVariable 'diau a': 0.14935761521655416, FloatVariable 'alpha 77': 0.20088381949723239, FloatVariable 'cdc15 270': 0.21951931922594184, FloatVariable 'diau b': 0.23473821977067888, FloatVariable 'heat 20': 0.9867456006798212, FloatVariable 'cdc15 290': 0.24965577080121784, FloatVariable 'diau c': 0.13741762346585432})
• ## Orange/testing/regression/results_reference/MeasureAttribute1a.py.txt

 r9689 0.793992996216 0.794493019581 1.100: 0.005 1.200: 0.015 4.500: 0.238 4.600: 0.235 4.700: 0.232 4.700: 0.233 4.800: 0.206 4.900: 0.166 4.900: 0.167 5.000: 0.153 5.100: 0.107
• ## Orange/testing/regression/results_reference/MeasureAttribute3.py.txt

 r9689 Relief - no unknowns:         0.1609         0.0337         0.1219         0.0266 - with unknowns:         0.0878         0.0989         0.3216         0.0830 - no unknowns:         0.6167         0.2816         0.1425         0.0532 - with unknowns:         0.5056         0.3881         0.1849         0.0857
• ## Orange/testing/regression/results_reference/contingency6.py.txt

 r9689 4.69999980927 <2.000, 0.000, 0.000> Contingency keys:  [4.3000001907348633, 4.4000000953674316, 4.5] Contingency keys:  [4.300000190734863, 4.400000095367432, 4.5] Contingency values:  [<1.000, 0.000, 0.000>, <3.000, 0.000, 0.000>, <1.000, 0.000, 0.000>] Contingency items:  [(4.3000001907348633, <1.000, 0.000, 0.000>), (4.4000000953674316, <3.000, 0.000, 0.000>), (4.5, <1.000, 0.000, 0.000>)] Contingency items:  [(4.300000190734863, <1.000, 0.000, 0.000>), (4.400000095367432, <3.000, 0.000, 0.000>), (4.5, <1.000, 0.000, 0.000>)] Error:  invalid index (%5.3f)
• ## Orange/testing/regression/results_reference/distributions.py.txt

 r9689 Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Self-emp-not-inc Self-emp-not-inc Private Private Self-emp-not-inc Self-emp-not-inc Private Self-emp-not-inc Private Private Self-emp-not-inc Self-emp-not-inc Private Private Self-emp-not-inc Self-emp-not-inc Private Private Self-emp-not-inc Private Self-emp-not-inc Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private Private:  685.0 Private:  685.0
• ## Orange/testing/regression/results_reference/imputation.py.txt

 r9689 ['M', 1874, 'RR', ?, 2, '?', 'THROUGH', 'IRON', '?', '?', 'SIMPLE-T'] Imputed: ['M', 1874, 'RR', 1000, 2, 'N', 'THROUGH', 'IRON', 'MEDIUM', 'S', 'SIMPLE-T'] ['M', 1874, 'RR', 1257, 2, 'N', 'THROUGH', 'IRON', 'MEDIUM', 'S', 'SIMPLE-T'] ['M', 1876, 'HIGHWAY', 1245, ?, '?', 'THROUGH', 'STEEL', 'LONG', 'F', 'SUSPEN'] ['O', 1878, 'RR', ?, 2, 'G', '?', 'STEEL', '?', '?', 'SIMPLE-T'] ['O', 1878, 'RR', 804, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'S', 'SIMPLE-T'] ['O', 1878, 'RR', 1257, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'S', 'SIMPLE-T'] ['M', 1882, 'RR', ?, 2, 'G', '?', 'STEEL', '?', '?', 'SIMPLE-T'] ['M', 1882, 'RR', 1000, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'F', 'SIMPLE-T'] ['M', 1882, 'RR', 1257, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'F', 'SIMPLE-T'] ['A', 1883, 'RR', ?, 2, 'G', 'THROUGH', 'STEEL', '?', 'F', 'SIMPLE-T'] ['A', 1883, 'RR', 1000, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'F', 'SIMPLE-T'] ['A', 1883, 'RR', 1257, 2, 'G', 'THROUGH', 'STEEL', 'MEDIUM', 'F', 'SIMPLE-T'] *** CUSTOM IMPUTATION BY MODELS *** SPAN=SHORT: 1158 SPAN=LONG: 1907 SPAN=MEDIUM |    ERECTED<1911.500: 1325 |    ERECTED>=1911.500: 1528 ERECTED<=1894.500: 1257 ERECTED>1894.500 |    SPAN=SHORT: |    SPAN=MEDIUM: 1571 |    SPAN=LONG: 1829 ['M', 1876, 'HIGHWAY', 1245, ?, '?', 'THROUGH', 'STEEL', 'LONG', 'F', 'SUSPEN']
• ## Orange/testing/regression/results_reference/undefineds.py.txt

 r9689 ['?', '?', '?'] ['~', '~', '~'] ['?', 'X', 'X'] ['?', 'UNK', 'UNK'] ['UNAVAILABLE', '?', 'UNAVAILABLE'] ['X', 'X', 'X'] ['UNK', 'UNK', 'UNK'] ['UNAVAILABLE', 'UNAVAILABLE', 'UNAVAILABLE'] Default saving a   b   c 0 1 UNAVAILABLE 0 1 X UNK   0 1 X UNK UNAVAILABLE 0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X 0   0   0 ?   ?   ? ~   ~   ~ ?   X   X ?   UNK UNK UNAVAILABLE ?   UNAVAILABLE X   X   X UNK UNK UNK UNAVAILABLE UNAVAILABLE UNAVAILABLE Saving with all undefined as NA a   b   c 0 1 UNAVAILABLE 0 1 X UNK   0 1 X UNK UNAVAILABLE 0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X 0   0   0 ?   ?   ? ~   ~   ~ ?   X   X ?   UNK UNK UNAVAILABLE ?   UNAVAILABLE X   X   X UNK UNK UNK UNAVAILABLE UNAVAILABLE UNAVAILABLE Saving with all undefined as NA a   b   c 0 1 UNAVAILABLE 0 1 X UNK   0 1 X UNK UNAVAILABLE 0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X   0 1 UNAVAILABLE UNK X 0   0   0 ?   ?   ? ~   ~   ~ ?   X   X ?   UNK UNK UNAVAILABLE ?   UNAVAILABLE X   X   X UNK UNK UNK UNAVAILABLE UNAVAILABLE UNAVAILABLE
• ## docs/reference/rst/Orange.classification.logreg.rst

 r9372 .. automodule:: Orange.classification.logreg .. index: logistic regression .. index: single: classification; logistic regression ******************************** Logistic regression (logreg) ******************************** Logistic regression _ is a statistical classification method that fits data to a logistic function. Orange's implementation of the algorithm can handle various anomalies in features, such as constant variables and singularities, that could make direct fitting of logistic regression almost impossible. Stepwise logistic regression, which iteratively selects the most informative features, is also supported. .. autoclass:: LogRegLearner :members: .. class :: LogRegClassifier A logistic regression classification model. Stores estimated values of regression coefficients and their significances, and uses them to predict classes and class probabilities. .. attribute :: beta Estimated regression coefficients. .. attribute :: beta_se Estimated standard errors for regression coefficients. .. attribute :: wald_Z Wald Z statistics for beta coefficients. Wald Z is computed as beta/beta_se. .. attribute :: P List of P-values for beta coefficients, that is, the probability that beta coefficients differ from 0.0. The probability is computed from the squared Wald Z statistic, which follows the chi-square distribution. .. attribute :: likelihood The probability of the sample (i.e., the learning examples) observed on the basis of the derived model, as a function of the regression parameters. .. attribute :: fit_status Tells how the model fitting ended: either regularly (:obj:LogRegFitter.OK), or it was interrupted because one of the beta coefficients escaped towards infinity (:obj:LogRegFitter.Infinity) or because the values did not converge (:obj:LogRegFitter.Divergence). The value tells about the classifier's "reliability"; the classifier itself is useful in either case. .. method:: __call__(instance, result_type) Classify a new instance.
:param instance: instance to be classified. :type instance: :class:~Orange.data.Instance :param result_type: :class:~Orange.classification.Classifier.GetValue or :class:~Orange.classification.Classifier.GetProbabilities or :class:~Orange.classification.Classifier.GetBoth :rtype: :class:~Orange.data.Value, :class:~Orange.statistics.distribution.Distribution or a tuple with both .. class:: LogRegFitter :obj:LogRegFitter is the abstract base class for logistic fitters. It defines the form of call operator and the constants denoting its (un)success: .. attribute:: OK Fitter succeeded to converge to the optimal fit. .. attribute:: Infinity Fitter failed due to one or more beta coefficients escaping towards infinity. .. attribute:: Divergence Beta coefficients failed to converge, but none of beta coefficients escaped. .. attribute:: Constant There is a constant attribute that causes the matrix to be singular. .. attribute:: Singularity The matrix is singular. .. method:: __call__(examples, weight_id) Performs the fitting. There can be two different cases: either the fitting succeeded to find a set of beta coefficients (although possibly with difficulties) or the fitting failed altogether. The two cases return different results. (status, beta, beta_se, likelihood) The fitter managed to fit the model. The first element of the tuple, result, tells about the problems occurred; it can be either :obj:OK, :obj:Infinity or :obj:Divergence. In the latter cases, returned values may still be useful for making predictions, but it's recommended that you inspect the coefficients and their errors and make your decision whether to use the model or not. (status, attribute) The fitter failed and the returned attribute is responsible for it. The type of failure is reported in status, which can be either :obj:Constant or :obj:Singularity. The proper way of calling the fitter is to expect and handle all the situations described. 
 For instance, if fitter is an instance of some fitter and examples contain a set of suitable examples, a script should look like this:: res = fitter(examples) if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]: status, beta, beta_se, likelihood = res < proceed by doing something with what you got > else: status, attr = res < remove the attribute or complain to the user or ... > .. class :: LogRegFitter_Cholesky The sole fitter available at the moment. It is a C++ translation of Alan Miller's logistic regression code _. It uses the Newton-Raphson algorithm to iteratively minimize the least squares error computed from the learning examples. .. autoclass:: StepWiseFSS :members: :show-inheritance: .. autofunction:: dump Examples -------- The first example shows a very simple induction of a logistic regression classifier (:download:logreg-run.py ). .. literalinclude:: code/logreg-run.py Result:: Classification accuracy: 0.778282598819 class attribute = survived class values = Attribute       beta  st. error     wald Z          P OR=exp(beta) Intercept      -1.23       0.08     -15.15      -0.00 status=first       0.86       0.16       5.39       0.00       2.36 status=second      -0.16       0.18      -0.91       0.36       0.85 status=third      -0.92       0.15      -6.12       0.00       0.40 age=child       1.06       0.25       4.30       0.00       2.89 sex=female       2.42       0.14      17.04       0.00      11.25 The next example shows how to handle singularities in data sets (:download:logreg-singularities.py ). .. literalinclude:: code/logreg-singularities.py The first few lines of the output of this script are:: <=50K <=50K <=50K <=50K <=50K <=50K >50K >50K <=50K >50K class attribute = y class values = <>50K, <=50K> Attribute       beta  st.
error     wald Z          P OR=exp(beta) Intercept       6.62      -0.00       -inf       0.00 age      -0.04       0.00       -inf       0.00       0.96 fnlwgt      -0.00       0.00       -inf       0.00       1.00 education-num      -0.28       0.00       -inf       0.00       0.76 marital-status=Divorced       4.29       0.00        inf       0.00      72.62 marital-status=Never-married       3.79       0.00        inf       0.00      44.45 marital-status=Separated       3.46       0.00        inf       0.00      31.95 marital-status=Widowed       3.85       0.00        inf       0.00      46.96 marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63 marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72 If :obj:remove_singular is set to 0, inducing a logistic regression classifier would return an error:: Traceback (most recent call last): File "logreg-singularities.py", line 4, in lr = classification.logreg.LogRegLearner(table, removeSingular=0) File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner return lr(examples, weightID) File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__ lr = learner(examples, weight) orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked We can see that the attribute workclass is causing a singularity. The example below shows how the use of stepwise logistic regression can help to gain in classification performance (:download:logreg-stepwise.py ): .. 
literalinclude:: code/logreg-stepwise.py The output of this script is:: Learner      CA logistic     0.841 filtered     0.846 Number of times attributes were used in cross-validation: 1 x a21 10 x a22 8 x a23 7 x a24 1 x a25 10 x a26 10 x a27 3 x a28 7 x a29 9 x a31 2 x a16 7 x a12 1 x a32 8 x a15 10 x a14 4 x a17 7 x a30 10 x a11 1 x a10 1 x a13 10 x a34 2 x a19 1 x a18 10 x a3 10 x a5 4 x a4 4 x a7 8 x a6 10 x a9 10 x a8
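The prediction equations that the classifier above applies (F as a linear combination of betas, then the logistic transform) can be sketched in plain Python. This is an illustrative sketch, not Orange's implementation; the function name and argument layout are assumptions for the example.

```python
import math

def logistic_predict(beta, x):
    """Probability of the positive class from fitted coefficients.

    beta: [beta_0, beta_1, ..., beta_k]  (intercept first)
    x:    [x_1, ..., x_k]                (feature values)
    """
    # F = beta_0 + beta_1*x_1 + ... + beta_k*x_k
    f = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    # p = exp(F) / (1 + exp(F))
    return math.exp(f) / (1.0 + math.exp(f))

# With all coefficients zero, F = 0 and the probability is exactly 0.5.
print(logistic_predict([0.0, 0.0], [1.0]))  # 0.5
```

A positive coefficient pushes the probability above 0.5 as the corresponding feature grows, which is why the sign of each beta (and its Wald Z significance) is worth inspecting before trusting the model.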
• ## docs/reference/rst/Orange.data.instance.rst

 r9525 ============================= Class Orange.data.Instance holds data instances. Each data instance Class Orange.data.Instance holds a data instance. Each data instance corresponds to a domain, which defines its length, data types and values for symbolic indices. -------- The data instance is described by a list of features, defined by the domain descriptor. Instances support indexing with either integer indices, strings or variable descriptors. The data instance is described by a list of features defined by the domain descriptor (:obj:Orange.data.domain). Instances support indexing with either integer indices, strings or variable descriptors. Since "age" is the the first attribute in dataset lenses, the below statements are equivalent.:: below statements are equivalent:: >>> data = Orange.data.Table("lenses") >>> age = data.domain["age"] >>> example = data[0] young Negative indices do not work as usual in Python, since they return Negative indices do not work as usual in Python, since they refer to the values of meta attributes. The last element of data instance is the class label, if it exists. It should be accessed using :obj:get_class and :obj:set_class. Data instances can be traversed using a for loop. The list has a fixed length, determined by the domain to which the instance corresponds. The last element of data instance is the class label, if the domain has a class. It should be accessed using :obj:~Orange.data.Instance.get_class() and :obj:~Orange.data.Instance.set_class(). The list has a fixed length that equals the number of variables. --------------- --------------- Meta attributes provide a way to attach additional information to examples. These attributes are treated specially, for instance, they are not used for learning, but can carry additional information, such as, for example, a name of a patient or the number of times the instance was missclassified during some test procedure. The most common additional information is the instance's weight. 
For contrast from ordinary features, instances from the same domain do not need to have the same meta attributes. Meta attributes are hence not addressed by positions, but by their id's, which are represented by negative indices. Id's are generated by function :obj:Orange.data.variable.new_meta_id(). Id's can be reused for multiple domains. If ordinary features resemble lists, meta attributes can be seen as a dictionaries. Meta attributes provide a way to attach additional information to data instances, such as, for example, an id of a patient or the number of times the instance was missclassified during some test procedure. The most common additional information is the instance's weight. These attributes do not appear in induced models. Instances from the same domain do not need to have the same meta attributes. Meta attributes are hence not addressed by positions, but by their id's, which are represented by negative indices. Id's are generated by function :obj:Orange.data.variable.new_meta_id(). Id's can be reused for multiple domains. Domain descriptor can, but doesn't need to know about for the domain, attribute or its name can also be used for indexing. When registering meta attributes with domains, it is recommended to used the same id for the same attribute in all domains. recommended to use the same id for the same attribute in all domains. Meta values can also be loaded from files in tab-delimited format. Meta attributes are often used as weights. Many procedures, such as learning algorithms, accept the id of the meta attribute defining the weights of instances as an additional argument besides the data. weights of instances as an additional argument. The following example adds a meta attribute with a random value to each data instance each data instance. .. 
 literalinclude:: code/instance-metavar.py :lines: 1- The code prints out something like:: The code prints out:: ['young', 'myope', 'no', 'reduced', 'none'], {-2:0.84} Data instance now consists of two parts, ordinary features that (except for a different random value). Data instance now consists of two parts, ordinary features that resemble a list since they are addressed by positions (e.g. the first value is "psby"), and meta values that are more like dictionaries, where the id (-2) is a key and 0.34 is a value (of type where the id (-2) is a key and 0.84 is a value (of type :obj:Orange.data.Value). Many other functions accept weights in similar fashion. Code:: Code :: print orange.getClassDistribution(data) print orange.getClassDistribution(data, id) prints out:: prints out :: <15.000, 5.000, 4.000> <9.691, 3.232, 1.969> Registering the meta attribute changes how the data instance is printed out and how it can be accessed:: where the first line is the actual distribution and the second a distribution with random weights assigned to the instances. Registering the meta attribute using :obj:Orange.data.Domain.add_meta changes how the data instance is printed out and how it can be accessed:: w = orange.FloatVariable("w") Construct a data instance with the given domain and initialize the values. Values should be given as a list containing the values. Values are given as a list of objects that can be converted into values of corresponding variables; generally, they can be given as strings and integer indices (for discrete variables) or numbers (for continuous variables), and also as instances of variables: strings and integer indices (for discrete variables), strings or numbers (for continuous variables), or instances of :obj:Orange.data.Value. Construct a new data instance as a shallow copy of the original. If a domain descriptor is given, the instance is converted; conversion can add or remove variables, including transformations, like discretization etc.
converted to another domain. :param domain: domain descriptor .. method:: __init__(domain, instances) Construct a new data instance for the given domain, where attribute values are taken from the provided instances, using both their ordinary features and meta attributes, which are registered with their corresponding domains. Meta attributes which appear in the provided instances and do not appear in the domain of the new instance, are copied as well. Construct a new data instance for the given domain, where the feature values are found in the provided instances using both their ordinary features and meta attributes that are registered with their corresponding domains. The new instance also includes the meta attributes that appear in the provided instances and whose values are not used for the instance's features. :param domain: domain descriptor .. method:: native([level]) Converts the instance into an ordinary Python list. If the optional argument is 1 (default), the result is a list of objects of :obj:orange.Data.value. If it is 0, it contains pure Pyhon objects, that is, strings for discrete variables Convert the instance into an ordinary Python list. If the optional argument level is 1 (default), the result is a list of instances of :obj:orange.data.Value. If it is 0, it contains pure Python objects, that is, strings for discrete variables and numbers for continuous ones. .. method:: compatible(other, ignore_class=0) Return :obj:True if the two instances are compatible, that .. method:: compatible(other, ignore_class=False) Return True if the two instances are compatible, that is, equal in all features which are not missing in one of them. The optional second argument can be used to omit the Return a dictionary containing meta values of the data instance. The key type can be :obj:int (default), :obj:str or :obj:Orange.data.variable.Variable and determines whether the dictionary keys will be meta ids, variables names or instance. 
The argument key_type can be int (default), str or :obj:Orange.data.variable.Variable and determines whether the dictionary keys are meta ids, variables names or variable descriptors. In the latter two cases, only registered attributes are returned. :: print example.getmetas(orange.Variable) :param key_type: the key type; either :obj:int, :obj:str or :obj:Orange.data.variable.Variable :type key_type: :obj:type :param key_type: the key type; either int, str or :obj:~Orange.data.variable.Variable :type key_type: type .. method:: get_metas(optional, [key_type]) Similar to above, but return a dictionary containing meta values of the data instance which are or which are not optional. Similar to above, but return a dictionary that contains only non-optional attributes (if optional is 0) or only optional attributes. :param optional: tells whether to return optional or non-optional attributes :type optional: :obj:bool :param key_type: the key type; either :obj:int, :obj:str or :obj:Orange.data.variable.Variable :type key_type: :obj:type :type optional: bool :param key_type: the key type; either int, str or :obj:~Orange.data.variable.Variable :type key_type: type .. method:: has_meta(meta_attr) Return :obj:True if the data instance has the specified meta attribute, specified by id, string or descriptor. Return True if the data instance has the specified meta attribute. :param meta_attr: meta attribute :type meta_attr: :obj:id, :obj:str or :obj:Orange.data.variable.Variable :type meta_attr: :obj:id, str or :obj:~Orange.data.variable.Variable .. method:: remove_meta(meta_attr) Remove meta attribute. Remove the specified meta attribute. :param meta_attr: meta attribute :type meta_attr: :obj:id, :obj:str or :obj:Orange.data.variable.Variable :type meta_attr: :obj:id, str or :obj:~Orange.data.variable.Variable .. method:: get_weight(meta_attr) Return the value of the specified meta attribute. The value must be continuous; it is returned as a :obj:float. 
Return the value of the specified meta attribute. The attribute's value must be continuous and is returned as float. :param meta_attr: meta attribute :type meta_attr: :obj:id, :obj:str or :obj:Orange.data.variable.Variable :type meta_attr: :obj:id, str or :obj:~Orange.data.variable.Variable .. method:: set_weight(meta_attr, weight=1) Set the value of the specified meta attribute to weight. The value must be continuous; it is returned as a :obj:float. Set the value of the specified meta attribute to weight. :param meta_attr: meta attribute :type meta_attr: :obj:id, :obj:str or :obj:Orange.data.variable.Variable :param weight: weight of the instance :type weight: :obj:float :type meta_attr: :obj:id, str or :obj:~Orange.data.variable.Variable :param weight: weight of instance :type weight: float
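The list-versus-dictionary duality described in this document (positional ordinary features, meta values keyed by negative ids) can be sketched without Orange at all. The class and method names below are illustrative assumptions, not Orange's API.

```python
class SketchInstance:
    """Toy model of a data instance: a fixed list of features plus
    meta values stored under negative integer ids."""

    def __init__(self, features):
        self.features = list(features)   # positional, domain-defined
        self.metas = {}                  # negative id -> value

    def __getitem__(self, index):
        if index < 0:                    # negative indices address metas,
            return self.metas[index]     # not positions from the end
        return self.features[index]

    def set_meta(self, meta_id, value):
        self.metas[meta_id] = value

inst = SketchInstance(['young', 'myope', 'no', 'reduced', 'none'])
inst.set_meta(-2, 0.84)                  # e.g. an instance weight
print(inst[0], inst[-2])                 # young 0.84
```

This also shows why negative indices cannot mean "count from the end" as in ordinary Python lists: that index space is reserved for meta ids.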
• ## docs/reference/rst/Orange.data.table.rst

 r9726 :obj:Table supports most list-like operations: getting, setting, removing data instances, as well as methods :obj:append and :obj:extend. The limitation is that table contain instances of :obj:Orange.data.Instance. When setting items, the item must be :obj:extend. When setting items, the item must be either the instance of the correct type or a Python list of appropriate length and content to be converted into a data instance of the corresponding domain. When retrieving data instances, what we get are references and not copies. Changing the retrieved instance changes the data in the table, too. Slicing returns ordinary Python lists containing the data instance, not a new Table. As usual in Python, the data table is considered False, when empty. the corresponding domain. Retrieving data instances returns references and not copies: changing the retrieved instance changes the data in the table. Slicing returns ordinary Python lists containing references to data instances, not a new :obj:Orange.data.Table. According to a Python convention, the data table is considered False when empty. .. class:: Table .. attribute:: owns_instances True, if the table contains the data instances, False if it contains just references to instances owned by another table. True, if the table contains the data instances and False if it contains references to instances owned by another table. .. attribute:: owner If the table does not own the data instances, this attribute gives the actual owner. The actual owner of the data when own_instances is False. .. attribute:: version An integer that is increased whenever the table is changed. This is not foolproof, since the object cannot detect when individual instances are changed. It will, however, catch any additions and removals from the table. An integer that is increased when instances are added or removed from the table. It does not detect changes of the data. .. 
attribute:: random_generator Random generator that is used by method :obj:random_instance. If the method is called and random_generator is None, a new generator is constructed with random seed 0, and stored here for subsequent use. random_generator is None, a new generator is constructed with random seed 0 and stored here for future use. .. attribute:: attribute_load_status If the table was loaded from a file, this list of flags tells whether the feature descriptors were reused and how they matched. See :ref:descriptor reuse  for details. matched. See :ref:descriptor reuse  for details. .. attribute:: meta_attribute_load_status Same as above, except that this is a dictionary for meta attributes, with keys corresponding to their ids. A dictionary holding this same information for meta attributes, with keys corresponding to their ids and values to load statuses. .. method:: __init__(filename[, create_new_on]) specified in the environment variable ORANGE_DATA_PATH. The optional flag create_new_on decides when variable The optional flag create_new_on decides when variable descriptors are reused. See :ref:descriptor reuse  for more details.
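The reference semantics described above (retrieval returns references, slicing returns a plain Python list, an empty table is falsy) behave exactly like nested Python lists, which makes for a compact sketch. The variable names are illustrative only.

```python
# A stand-in "table" of row-like instances.
table = [['a', 1], ['b', 2], ['c', 3]]

row = table[0]      # a reference, not a copy ...
row[1] = 99         # ... so mutating it changes the table too
print(table[0])     # ['a', 99]

chunk = table[:2]   # slicing yields an ordinary list of references,
                    # not a new table object
print(type(chunk))  # <class 'list'>

print(bool([]))     # an empty table is considered False
```

If an independent copy is wanted, it must be made explicitly (e.g. by constructing a new row from the old one) rather than by plain indexing or slicing.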
• ## docs/reference/rst/Orange.distance.rst

 r9819 ########################################## The following example demonstrates how to compute distances between two instances: .. literalinclude:: code/distance-simple.py :lines: 1-7 A matrix with all pairwise distances can be computed with :obj:distance_matrix: .. literalinclude:: code/distance-simple.py :lines: 9-11 Unknown values are treated correctly only by Euclidean and Relief distance.  For other measures, a distance between unknown and known or between two unknown values is always 0.5. =================== Computing distances =================== Distance measures typically have to be adjusted to the data. For instance, when the data set contains continuous features, the distances should be normalized so that all features have similar impacts, e.g. by dividing the distance by the feature's range. Distance measures thus appear in pairs - a class that measures the distance (:obj:Distance) and a class that constructs it based on the data (:obj:DistanceConstructor). Distance measures thus appear in pairs: Since most measures work on normalized distances between corresponding features, an abstract class DistanceNormalized takes care of normalizing. Unknown values are treated correctly only by Euclidean and Relief distance.  For other measures, a distance between unknown and known or between two unknown values is always 0.5. .. autofunction:: distance_matrix .. class:: Distance .. method:: __call__(instance1, instance2) Return a distance between the given instances (as a floating point number). - a class that constructs the distance measure based on the data (subclass of :obj:DistanceConstructor, for example :obj:Euclidean), and returns it as - a class that measures the distance between two instances (subclass of :obj:Distance, for example :obj:EuclideanDistance). .. class:: DistanceConstructor not given, instances or distributions can be used. .. class:: DistanceNormalized .. class:: Distance An abstract class that provides normalization. .. method:: __call__(instance1, instance2) ..
attribute:: normalizers Return a distance between the given instances (as a floating point number). A precomputed list of normalizing factors for feature values. They are: Pairwise distances ================== - 1/(max_value-min_value) for continuous and 1/number_of_values for ordinal features. If either feature is unknown, the distance is 0.5. Such factors are used to multiply differences in feature's values. - -1 for nominal features; the distance between two values is 0 if they are same (or at least one is unknown) and 1 if they are different. - 0 for ignored features. .. autofunction:: distance_matrix .. attribute:: bases, averages, variances ========= Measures ========= The minimal values, averages and variances (continuous features only). .. attribute:: domain_version The domain version changes each time a domain description is changed (i.e. features are added or removed). .. method:: feature_distances(instance1, instance2) Return a list of floats representing normalized distances between pairs of feature values of the two instances. Distance measures are defined with two classes: a subclass of obj:DistanceConstructor and a subclass of :obj:Distance. .. class:: Hamming The maximal distance between two feature values. If dist is the result of ~:obj:DistanceNormalized.feature_distances, :obj:~DistanceNormalized.feature_distances, then :class:Maximal returns max(dist). The sum of absolute values of distances between pairs of features, e.g. sum(abs(x) for x in dist) where dist is the result of ~:obj:DistanceNormalized.feature_distances. where dist is the result of :obj:~DistanceNormalized.feature_distances. .. class:: Euclidean The square root of sum of squared per-feature distances, i.e. sqrt(sum(x*x for x in dist)), where dist is the result of ~:obj:DistanceNormalized.feature_distances. :obj:~DistanceNormalized.feature_distances. .. method:: distributions This class is derived directly from :obj:Distance. .. autoclass:: PearsonR :members: :members: .. 
autoclass:: Mahalanobis :members: .. autoclass:: MahalanobisDistance :members: ========= Utilities ========= .. class:: DistanceNormalized An abstract class that provides normalization. .. attribute:: normalizers A precomputed list of normalizing factors for feature values. They are: - 1/(max_value-min_value) for continuous and 1/number_of_values for ordinal features. If either feature is unknown, the distance is 0.5. Such factors are used to multiply differences in feature's values. - -1 for nominal features; the distance between two values is 0 if they are same (or at least one is unknown) and 1 if they are different. - 0 for ignored features. .. attribute:: bases, averages, variances The minimal values, averages and variances (continuous features only). .. attribute:: domain_version The domain version changes each time a domain description is changed (i.e. features are added or removed). .. method:: feature_distances(instance1, instance2) Return a list of floats representing normalized distances between pairs of feature values of the two instances.
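The normalization scheme described for :obj:`DistanceNormalized` can be sketched in plain Python: continuous differences are multiplied by 1/(max-min), nominal values map to 0 (equal) or 1 (different), and any unknown value contributes 0.5. This is a minimal illustration, not the Orange API; all names below are made up for the example.

```python
from math import sqrt

def feature_distances(inst1, inst2, normalizers):
    """normalizers[i] is 1/(max-min) for a continuous feature, -1 for nominal."""
    dists = []
    for x, y, norm in zip(inst1, inst2, normalizers):
        if x is None or y is None:
            dists.append(0.5)                      # unknown values count as 0.5
        elif norm == -1:
            dists.append(0.0 if x == y else 1.0)   # nominal: same or different
        else:
            dists.append(abs(x - y) * norm)        # continuous: range-scaled diff
    return dists

def euclidean(inst1, inst2, normalizers):
    # Euclidean distance over the normalized per-feature distances.
    d = feature_distances(inst1, inst2, normalizers)
    return sqrt(sum(x * x for x in d))
```

With a continuous feature spanning [0, 2] (normalizer 0.5) and one nominal feature, `euclidean([0.0, 'a'], [2.0, 'b'], [0.5, -1])` yields sqrt(2).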
• ## docs/reference/rst/Orange.feature.discretization.rst

 r9372 
.. automodule:: Orange.feature.discretization

.. py:currentmodule:: Orange.feature.discretization

###################################
Discretization (discretization)
###################################

.. index:: discretization

.. index::
   single: feature; discretization

Continuous features can be discretized either one feature at a time or, as demonstrated in the following script, using a single discretization method on an entire set of data features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values according to the selected (or default) discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

The following discretization methods are supported:

* equal width discretization, where the domain of the continuous feature is split into a fixed number of equal-width intervals (:class:`EqualWidth`),
* equal frequency discretization, where each interval contains an equal number of data instances (:class:`EqualFreq`),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals that minimize the within-interval entropy of class distributions (:class:`Entropy`),
* bi-modal discretization, which uses three intervals to optimize the difference between the class distribution in the middle interval and the distribution outside it (:class:`BiModal`),
* fixed discretization, with user-defined cut-off points.

The above script used the default discretization method (equal frequency with three intervals). This can be changed as demonstrated below:

.. literalinclude:: code/discretization-table-method.py
    :lines: 3-5

With the exception of fixed discretization, discretization approaches infer the cut-off points from the training data set and thus construct a discretizer that converts continuous values of the feature into categorical values according to the rule found by discretization. In this respect, discretization behaves similarly to :class:`Orange.classification.Learner`.

Utility functions
=================

Some functions and classes that can be used for categorization of continuous features. Besides several general classes that can help in this task, we also provide a function that may help in entropy-based discretization (Fayyad & Irani), and a wrapper around classes for categorization that can be used for learning.

.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. autoclass:: DiscretizeTable

.. rubric:: Example

FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the Orange for Beginners tutorial shows the use of DiscretizedLearner. Other discretization classes from core Orange are listed in the chapter on `categorization <../ofb/o_categorization.htm>`_ of the same tutorial.

Discretization Algorithms
=========================

Instances of discretization classes are all derived from :class:`Discretization`.

.. class:: Discretization

    .. method:: __call__(feature, data[, weightID])

        Given a continuous feature, data and, optionally, the id of an attribute with example weights, this function returns a discretized feature. The argument feature can be a descriptor, index or name of the attribute.

.. class:: EqualWidth

    Discretizes the feature by splitting its domain into a fixed number of equal-width intervals. The span of the original domain is computed from the training data and is defined by the smallest and the largest feature value.

    .. attribute:: n

        Number of discretization intervals (default: 4).

The following example discretizes the Iris dataset features using six intervals.
The script constructs an :class:`Orange.data.Table` with discretized features and outputs their descriptions:

.. literalinclude:: code/discretization.py
    :lines: 38-43

The output of this script is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

The cut-off values are hidden in the discretizer and stored in attr.get_value_from.transformer::

    >>> for attr in newattrs:
    ...    print "%s: first interval at %5.3f, step %5.3f" % \
    ...    (attr.name, attr.get_value_from.transformer.first_cut, \
    ...    attr.get_value_from.transformer.step)
    D_sepal length: first interval at 4.900, step 0.600
    D_sepal width: first interval at 2.400, step 0.400
    D_petal length: first interval at 1.980, step 0.980
    D_petal width: first interval at 0.500, step 0.400

All discretizers have the method construct_variable:

.. literalinclude:: code/discretization.py
    :lines: 69-73

.. class:: EqualFreq

    Infers the cut-off points so that the discretization intervals contain approximately equal numbers of training data instances.

    .. attribute:: n

        Number of discretization intervals (default: 4).

    The resulting discretizer is of class :class:`IntervalDiscretizer`. Its transformer includes points that store the inferred cut-offs.

.. class:: Entropy

    Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most appropriate number of intervals by recursively splitting the domain of the continuous feature to minimize the class entropy of training examples. The splitting is repeated until the entropy decrease is smaller than the increase of the minimal description length (MDL) induced by the new cut-off point.
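The quantity that entropy-based discretization minimizes at each step can be illustrated with a small, self-contained sketch (plain Python, not Orange code): the weighted class entropy of a candidate binary split. The real algorithm applies this recursively and uses the MDL test as its stopping criterion; the function names below are illustrative.

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, cut):
    """Weighted class entropy after splitting at `cut` (<= cut vs > cut)."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_cut(values, labels):
    # Candidate cut-offs are midpoints between adjacent distinct values.
    pts = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return min(cuts, key=lambda c: split_entropy(values, labels, c))
```

For perfectly separable data such as values [1, 2, 8, 9] with classes ['a', 'a', 'b', 'b'], the best cut is 5.0 and the resulting split entropy is zero.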
    Entropy-based discretization can reduce a continuous feature to a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be removed. This discretization can therefore also serve to identify non-informative features and can thus be used for feature subset selection.

    .. attribute:: force_attribute

        Forces the algorithm to induce at least one cut-off point, even when its information gain is lower than MDL (default: False).

Part of :download:`discretization.py`:

.. literalinclude:: code/discretization.py
    :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModal

    Infers two cut-off points to optimize the difference between the class distribution of data instances in the middle interval and in the other two intervals. The difference is scored by the chi-square statistic. All possible cut-off points are examined, so the discretization runs in O(n^2). This discretization method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values of the feature.

    .. attribute:: split_in_two

        Decides whether the resulting attribute should have three or two values. If True (default), the feature will be discretized into three intervals and the discretizer is of type :class:`BiModalDiscretizer`. If False, the result is an ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that the sepal lengths of versicolors lie between the lengths of setosas and virginicas.

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica, we can observe whether bi-modal discretization correctly recognizes the interval in which versicolors dominate.
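The exhaustive O(n^2) search behind bi-modal discretization can be sketched in plain Python (this is an illustration, not Orange's implementation): every pair of cut-offs is tried, and the class distribution inside the middle interval (low, high] is scored against the distribution outside it with the chi-square statistic. All names below are made up for the example.

```python
from collections import Counter

def chi_square(inside, outside):
    # Chi-square statistic for the 2 x k table: (inside/outside) x classes.
    classes = set(inside) | set(outside)
    ci, co = Counter(inside), Counter(outside)
    ni, no = len(inside), len(outside)
    n = ni + no
    stat = 0.0
    for cls in classes:
        total = ci[cls] + co[cls]
        for observed, size in ((ci[cls], ni), (co[cls], no)):
            expected = total * size / n
            if expected:
                stat += (observed - expected) ** 2 / expected
    return stat

def bimodal_cuts(values, labels):
    # Try every pair of cut-offs; keep the pair with the largest chi-square.
    pts = sorted(set(values))
    best, best_stat = None, -1.0
    for i, low in enumerate(pts):
        for high in pts[i + 1:]:
            inside = [l for v, l in zip(values, labels) if low < v <= high]
            outside = [l for v, l in zip(values, labels) if not low < v <= high]
            if inside and outside:
                stat = chi_square(inside, outside)
                if stat > best_stat:
                    best, best_stat = (low, high), stat
    return best
```

On data where one class occupies the middle of the range, e.g. values [1..6] with classes ['a', 'a', 'b', 'b', 'a', 'a'], the search recovers the cut-offs (2, 4) that isolate the 'b' region.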
The following script performs the merging and constructs a new data set with a class that reports whether an iris is versicolor or not:

.. literalinclude:: code/discretization.py
    :lines: 84-87

The following script implements the discretization:

.. literalinclude:: code/discretization.py
    :lines: 97-100

The middle intervals are printed::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points inferred by discretization for "sepal length" make sense.

Discretizers
============

Discretizers construct a categorical feature from a continuous feature according to the method they implement and its parameters. The most general is :class:`IntervalDiscretizer`, which is also used by most discretization methods. Two other discretizers, :class:`EqualWidthDiscretizer` and :class:`ThresholdDiscretizer`, could easily be replaced by :class:`IntervalDiscretizer` but are used for speed and simplicity. The fourth discretizer, :class:`BiModalDiscretizer`, is specialized for discretizations induced by :class:`BiModalDiscretization`.

.. class:: Discretizer

    A superclass implementing the construction of a new attribute from an existing one.

    .. method:: construct_variable(feature)

        Constructs a descriptor for a new feature. The new feature's name is equal to feature.name prefixed by "D\_". Its symbolic values are discretizer specific.

.. class:: IntervalDiscretizer

    A discretizer defined with a set of cut-off points.

    .. attribute:: points

        The cut-off points; feature values below or equal to the first point are mapped to the first interval, those between the first and the second point (including those equal to the second) are mapped to the second interval, and so forth to the last interval, which covers all values greater than the last value in points. The number of intervals is thus len(points)+1.

The script that follows is an example of manual construction of a discretizer with cut-off points at 3.0 and 5.0:

.. literalinclude:: code/discretization.py
    :lines: 22-26

The first five data instances of data2 are::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

The same discretizer can be used on several features by calling the function construct_var:

.. literalinclude:: code/discretization.py
    :lines: 30-34

Each feature has its own instance of :class:`ClassifierFromVar` stored in get_value_from, but all use the same :class:`IntervalDiscretizer`, idisc. Changing any element of its points affects all attributes.

.. note::

    The length of :obj:`~IntervalDiscretizer.points` should not be changed if the discretizer is used by any attribute. The length of :obj:`~IntervalDiscretizer.points` should always match the number of values of the feature, which is determined by the length of the attribute's field values. If attr is a discretized attribute, then len(attr.values) must equal len(attr.get_value_from.transformer.points)+1.

.. class:: EqualWidthDiscretizer

    Discretizes to intervals of fixed width. All values lower than :obj:`~EqualWidthDiscretizer.first_cut` are mapped to the first interval. Otherwise, the interval of value val is floor((val-first_cut)/step). Possible overflows are mapped to the last interval.

    .. attribute:: first_cut

        The first cut-off point.

    .. attribute:: step

        Width of the intervals.

    .. attribute:: n

        Number of the intervals.

    .. attribute:: points (read-only)

        The cut-off points; this is not a real attribute although it behaves like one. Reading it constructs a list of cut-off points and returns it, but changing the list does not affect the discretizer. It is only present to give :obj:`EqualWidthDiscretizer` the same interface as :obj:`IntervalDiscretizer`.

.. class:: ThresholdDiscretizer

    Threshold discretizer converts continuous values into binary values by comparing them to a fixed threshold.
    Orange uses this discretizer for binarization of continuous attributes in decision trees.

    .. attribute:: threshold

        The threshold value; values below or equal to the threshold belong to the first interval and those that are greater go to the second.

.. class:: BiModalDiscretizer

    A bimodal discretizer has two cut-off points; values are discretized according to whether or not they belong to the region between these points, which includes the lower but not the upper boundary. The discretizer is returned by :class:`BiModalDiscretization` if its field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

    .. attribute:: low

        Lower boundary of the interval (included in the interval).

    .. attribute:: high

        Upper boundary of the interval (not included in the interval).

Implementational details
========================

Consider the following example (part of :download:`discretization.py`):

.. literalinclude:: code/discretization.py
    :lines: 7-15

The discretized attribute sep_w is constructed with a call to :class:`Entropy`; instead of constructing it and calling it afterwards, we passed the arguments for the call to the constructor. We then constructed a new :class:`Orange.data.Table` with the attributes "sepal width" (the original continuous attribute), sep_w and the class attribute::

    Entropy discretization, first 5 data instances
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']

The name of the new categorical variable is derived from the name of the original continuous variable by adding the prefix "D_".
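The mapping from a continuous value to one of the len(points)+1 intervals, as performed by a points-based transformer such as :class:`IntervalDiscretizer`, can be sketched in plain Python with bisect. Values equal to a cut-off fall into the lower interval, matching the semantics described above; the function names are illustrative, not part of Orange.

```python
from bisect import bisect_left

def interval_index(value, points):
    # Index of the interval the value falls into; values equal to a
    # cut-off go to the lower interval (bisect_left gives exactly this).
    return bisect_left(points, value)

def interval_label(value, points):
    # Build a symbolic value like the ones Orange prints, e.g. '(3.00, 5.00]'.
    i = interval_index(value, points)
    if i == 0:
        return "<=%.2f" % points[0]
    if i == len(points):
        return ">%.2f" % points[-1]
    return "(%.2f, %.2f]" % (points[i - 1], points[i])
```

With cut-off points [3.0, 5.0], a sepal length of 5.1 is labelled '>5.00' and 4.9 is labelled '(3.00, 5.00]', matching the example output shown earlier.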
The values of the new attribute are computed automatically when they are needed, using a transformation function :obj:`~Orange.data.variable.Variable.get_value_from` (see :class:`Orange.data.variable.Variable`), which encodes the discretization::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

The select statement in the discretization script converted all data instances from data to the new domain. This includes the new feature sep_w, whose values are computed on the fly by calling sep_w.get_value_from for each data instance. The original continuous sepal width is passed to the transformer, which determines the interval from its field points. The transformer returns the discrete value, which is in turn returned by get_value_from and stored in the new instance.

References
==========

.. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages 1022--1029, Chambery, France, 1993.