# Changes in [9827:71697b7e31c7:9838:bde35afd7a6a] in orange

Files: 84 edited
## Orange/classification/logreg.py

 r9671 """ .. index: logistic regression .. index: single: classification; logistic regression ******************************** Logistic regression (logreg) ******************************** Implements logistic regression _ with an extension for proper treatment of discrete features.  The algorithm can handle various anomalies in features, such as constant variables and singularities, that could make fitting of logistic regression almost impossible. Stepwise logistic regression, which iteratively selects the most informative features, is also supported. Logistic regression is a popular classification method that comes from statistics. The model is described by a linear combination of coefficients, .. math:: F = \\beta_0 + \\beta_1*X_1 + \\beta_2*X_2 + ... + \\beta_k*X_k and the probability (p) of a class value is  computed as: .. math:: p = \\frac{\exp(F)}{1 + \exp(F)} .. class :: LogRegClassifier :obj:LogRegClassifier stores estimated values of regression coefficients and their significances, and uses them to predict classes and class probabilities using the equations described above. .. attribute :: beta Estimated regression coefficients. .. attribute :: beta_se Estimated standard errors for regression coefficients. .. attribute :: wald_Z Wald Z statistics for beta coefficients. Wald Z is computed as beta/beta_se. .. attribute :: P List of P-values for beta coefficients, that is, the probability that beta coefficients differ from 0.0. The probability is computed from squared Wald Z statistics that is distributed with Chi-Square distribution. .. attribute :: likelihood The probability of the sample (ie. learning examples) observed on the basis of the derived model, as a function of the regression parameters. .. 
attribute :: fitStatus Tells how the model fitting ended - either regularly (:obj:LogRegFitter.OK), or it was interrupted due to one of beta coefficients escaping towards infinity (:obj:LogRegFitter.Infinity) or since the values didn't converge (:obj:LogRegFitter.Divergence). The value tells about the classifier's "reliability"; the classifier itself is useful in either case. .. autoclass:: LogRegLearner .. class:: LogRegFitter :obj:LogRegFitter is the abstract base class for logistic fitters. It defines the form of call operator and the constants denoting its (un)success: .. attribute:: OK Fitter succeeded to converge to the optimal fit. .. attribute:: Infinity Fitter failed due to one or more beta coefficients escaping towards infinity. .. attribute:: Divergence Beta coefficients failed to converge, but none of beta coefficients escaped. .. attribute:: Constant There is a constant attribute that causes the matrix to be singular. .. attribute:: Singularity The matrix is singular. .. method:: __call__(examples, weightID) Performs the fitting. There can be two different cases: either the fitting succeeded to find a set of beta coefficients (although possibly with difficulties) or the fitting failed altogether. The two cases return different results. (status, beta, beta_se, likelihood) The fitter managed to fit the model. The first element of the tuple, result, tells about the problems occurred; it can be either :obj:OK, :obj:Infinity or :obj:Divergence. In the latter cases, returned values may still be useful for making predictions, but it's recommended that you inspect the coefficients and their errors and make your decision whether to use the model or not. (status, attribute) The fitter failed and the returned attribute is responsible for it. The type of failure is reported in status, which can be either :obj:Constant or :obj:Singularity. The proper way of calling the fitter is to expect and handle all the situations described. 
For instance, if fitter is an instance of some fitter and examples contain a set of suitable examples, a script should look like this:: res = fitter(examples) if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]: status, beta, beta_se, likelihood = res < proceed by doing something with what you got > else: status, attr = res < remove the attribute or complain to the user or ... > .. class :: LogRegFitter_Cholesky :obj:LogRegFitter_Cholesky is the sole fitter available at the moment. It is a C++ translation of Alan Miller's logistic regression code _. It uses Newton-Raphson algorithm to iteratively minimize least squares error computed from learning examples. .. autoclass:: StepWiseFSS .. autofunction:: dump Examples -------- The first example shows a very simple induction of a logistic regression classifier (:download:logreg-run.py , uses :download:titanic.tab ). .. literalinclude:: code/logreg-run.py Result:: Classification accuracy: 0.778282598819 class attribute = survived class values = Attribute       beta  st. error     wald Z          P OR=exp(beta) Intercept      -1.23       0.08     -15.15      -0.00 status=first       0.86       0.16       5.39       0.00       2.36 status=second      -0.16       0.18      -0.91       0.36       0.85 status=third      -0.92       0.15      -6.12       0.00       0.40 age=child       1.06       0.25       4.30       0.00       2.89 sex=female       2.42       0.14      17.04       0.00      11.25 The next examples shows how to handle singularities in data sets (:download:logreg-singularities.py , uses :download:adult_sample.tab ). .. literalinclude:: code/logreg-singularities.py The first few lines of the output of this script are:: <=50K <=50K <=50K <=50K <=50K <=50K >50K >50K <=50K >50K class attribute = y class values = <>50K, <=50K> Attribute       beta  st. 
error     wald Z          P OR=exp(beta) Intercept       6.62      -0.00       -inf       0.00 age      -0.04       0.00       -inf       0.00       0.96 fnlwgt      -0.00       0.00       -inf       0.00       1.00 education-num      -0.28       0.00       -inf       0.00       0.76 marital-status=Divorced       4.29       0.00        inf       0.00      72.62 marital-status=Never-married       3.79       0.00        inf       0.00      44.45 marital-status=Separated       3.46       0.00        inf       0.00      31.95 marital-status=Widowed       3.85       0.00        inf       0.00      46.96 marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63 marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72 If :obj:removeSingular is set to 0, inducing a logistic regression classifier would return an error:: Traceback (most recent call last): File "logreg-singularities.py", line 4, in lr = classification.logreg.LogRegLearner(table, removeSingular=0) File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner return lr(examples, weightID) File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__ lr = learner(examples, weight) orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked We can see that the attribute workclass is causing a singularity. The example below shows, how the use of stepwise logistic regression can help to gain in classification performance (:download:logreg-stepwise.py , uses :download:ionosphere.tab ): .. 
literalinclude:: code/logreg-stepwise.py The output of this script is:: Learner      CA logistic     0.841 filtered     0.846 Number of times attributes were used in cross-validation: 1 x a21 10 x a22 8 x a23 7 x a24 1 x a25 10 x a26 10 x a27 3 x a28 7 x a29 9 x a31 2 x a16 7 x a12 1 x a32 8 x a15 10 x a14 4 x a17 7 x a30 10 x a11 1 x a10 1 x a13 10 x a34 2 x a19 1 x a18 10 x a3 10 x a5 4 x a4 4 x a7 8 x a6 10 x a9 10 x a8 """ from Orange.core import LogRegLearner, LogRegClassifier, LogRegFitter, LogRegFitter_Cholesky import Orange import math, os import warnings from numpy import * from numpy.linalg import * ########################################################################## ## Print out methods from Orange.misc import deprecated_keywords, deprecated_members import math from numpy import dot, array, identity, reshape, diagonal, \ transpose, concatenate, sqrt, sign from numpy.linalg import inv from Orange.core import LogRegClassifier, LogRegFitter, LogRegFitter_Cholesky def dump(classifier): """ Formatted string of all major features in logistic regression classifier. :param classifier: logistic regression classifier """ Return a formatted string of all major features in logistic regression classifier. :param classifier: logistic regression classifier. 
""" # print out class values out = [''] out.append("class attribute = " + classifier.domain.classVar.name) out.append("class values = " + str(classifier.domain.classVar.values)) out.append("class attribute = " + classifier.domain.class_var.name) out.append("class values = " + str(classifier.domain.class_var.values)) out.append('') # get the longest attribute name longest=0 for at in classifier.continuizedDomain.attributes: for at in classifier.continuized_domain.features: if len(at.name)>longest: longest=len(at.name); longest=len(at.name) # print out the head out.append(formatstr % ("Intercept", classifier.beta[0], classifier.beta_se[0], classifier.wald_Z[0], classifier.P[0])) formatstr = "%"+str(longest)+"s %10.2f %10.2f %10.2f %10.2f %10.2f" for i in range(len(classifier.continuizedDomain.attributes)): out.append(formatstr % (classifier.continuizedDomain.attributes[i].name, classifier.beta[i+1], classifier.beta_se[i+1], classifier.wald_Z[i+1], abs(classifier.P[i+1]), math.exp(classifier.beta[i+1]))) for i in range(len(classifier.continuized_domain.features)): out.append(formatstr % (classifier.continuized_domain.features[i].name, classifier.beta[i+1], classifier.beta_se[i+1], classifier.wald_Z[i+1], abs(classifier.P[i+1]), math.exp(classifier.beta[i+1]))) return '\n'.join(out) def has_discrete_values(domain): for at in domain.attributes: if at.varType == Orange.core.VarTypes.Discrete: return 1 return 0 """ Return 1 if the given domain contains any discrete features, else 0. :param domain: domain. :type domain: :class:Orange.data.Domain """ return any(at.var_type == Orange.data.Type.Discrete for at in domain.features) class LogRegLearner(Orange.classification.Learner): """ Logistic regression learner. Implements logistic regression. If data instances are provided to If data instances are provided to the constructor, the learning algorithm is called and the resulting classifier is returned instead of the learner. 
:param table: data table with either discrete or continuous features :type table: Orange.data.Table :param weightID: the ID of the weight meta attribute :type weightID: int :param removeSingular: set to 1 if you want automatic removal of disturbing features, such as constants and singularities :type removeSingular: bool :param fitter: the fitting algorithm (by default the Newton-Raphson fitting algorithm is used) :param stepwiseLR: set to 1 if you wish to use stepwise logistic regression :type stepwiseLR: bool :param addCrit: parameter for stepwise feature selection :type addCrit: float :param deleteCrit: parameter for stepwise feature selection :type deleteCrit: float :param numFeatures: parameter for stepwise feature selection :type numFeatures: int :param instances: data table with either discrete or continuous features :type instances: Orange.data.Table :param weight_id: the ID of the weight meta attribute :type weight_id: int :param remove_singular: set to 1 if you want automatic removal of disturbing features, such as constants and singularities :type remove_singular: bool :param fitter: the fitting algorithm (by default the Newton-Raphson fitting algorithm is used) :param stepwise_lr: set to 1 if you wish to use stepwise logistic regression :type stepwise_lr: bool :param add_crit: parameter for stepwise feature selection :type add_crit: float :param delete_crit: parameter for stepwise feature selection :type delete_crit: float :param num_features: parameter for stepwise feature selection :type num_features: int :rtype: :obj:LogRegLearner or :obj:LogRegClassifier """ def __new__(cls, instances=None, weightID=0, **argkw): @deprecated_keywords({"weightID": "weight_id"}) def __new__(cls, instances=None, weight_id=0, **argkw): self = Orange.classification.Learner.__new__(cls, **argkw) if instances: self.__init__(**argkw) return self.__call__(instances, weightID) return self.__call__(instances, weight_id) else: return self def __init__(self, removeSingular=0, 
fitter = None, **kwds): @deprecated_keywords({"removeSingular": "remove_singular"}) def __init__(self, remove_singular=0, fitter = None, **kwds): self.__dict__.update(kwds) self.removeSingular = removeSingular self.remove_singular = remove_singular self.fitter = None def __call__(self, examples, weight=0): @deprecated_keywords({"examples": "instances"}) def __call__(self, instances, weight=0): """Learn from the given table of data instances. :param instances: Data instances to learn from. :type instances: :class:~Orange.data.Table :param weight: Id of meta attribute with weights of instances :type weight: int :rtype: :class:~Orange.classification.logreg.LogRegClassifier """ imputer = getattr(self, "imputer", None) or None if getattr(self, "removeMissing", 0): examples = Orange.core.Preprocessor_dropMissing(examples) if getattr(self, "remove_missing", 0): instances = Orange.core.Preprocessor_dropMissing(instances) ##        if hasDiscreteValues(examples.domain): ##            examples = createNoDiscTable(examples) if not len(examples): if not len(instances): return None if getattr(self, "stepwiseLR", 0): addCrit = getattr(self, "addCrit", 0.2) removeCrit = getattr(self, "removeCrit", 0.3) numFeatures = getattr(self, "numFeatures", -1) attributes = StepWiseFSS(examples, addCrit = addCrit, deleteCrit = removeCrit, imputer = imputer, numFeatures = numFeatures) tmpDomain = Orange.core.Domain(attributes, examples.domain.classVar) tmpDomain.addmetas(examples.domain.getmetas()) examples = examples.select(tmpDomain) learner = Orange.core.LogRegLearner() learner.imputerConstructor = imputer if getattr(self, "stepwise_lr", 0): add_crit = getattr(self, "add_crit", 0.2) delete_crit = getattr(self, "delete_crit", 0.3) num_features = getattr(self, "num_features", -1) attributes = StepWiseFSS(instances, add_crit= add_crit, delete_crit=delete_crit, imputer = imputer, num_features= num_features) tmp_domain = Orange.data.Domain(attributes, instances.domain.class_var) 
tmp_domain.addmetas(instances.domain.getmetas()) instances = instances.select(tmp_domain) learner = Orange.core.LogRegLearner() # Yes, it has to be from core. learner.imputer_constructor = imputer if imputer: examples = self.imputer(examples)(examples) examples = Orange.core.Preprocessor_dropMissing(examples) instances = self.imputer(instances)(instances) instances = Orange.core.Preprocessor_dropMissing(instances) if self.fitter: learner.fitter = self.fitter if self.removeSingular: lr = learner.fitModel(examples, weight) if self.remove_singular: lr = learner.fit_model(instances, weight) else: lr = learner(examples, weight) while isinstance(lr, Orange.core.Variable): lr = learner(instances, weight) while isinstance(lr, Orange.data.variable.Variable): if isinstance(lr.getValueFrom, Orange.core.ClassifierFromVar) and isinstance(lr.getValueFrom.transformer, Orange.core.Discrete2Continuous): lr = lr.getValueFrom.variable attributes = examples.domain.attributes[:] attributes = instances.domain.features[:] if lr in attributes: attributes.remove(lr) else: attributes.remove(lr.getValueFrom.variable) newDomain = Orange.core.Domain(attributes, examples.domain.classVar) newDomain.addmetas(examples.domain.getmetas()) examples = examples.select(newDomain) lr = learner.fitModel(examples, weight) new_domain = Orange.data.Domain(attributes, instances.domain.class_var) new_domain.addmetas(instances.domain.getmetas()) instances = instances.select(new_domain) lr = learner.fit_model(instances, weight) return lr LogRegLearner = deprecated_members({"removeSingular": "remove_singular", "weightID": "weight_id", "stepwiseLR": "stepwise_lr", "addCrit": "add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features", "removeMissing": "remove_missing" })(LogRegLearner) class UnivariateLogRegLearner(Orange.classification.Learner): self.__dict__.update(kwds) def __call__(self, examples): examples = createFullNoDiscTable(examples) classifiers = map(lambda x: 
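The linear model and the logistic link described in the module docstring above can be sketched in plain Python. `logistic_p` is a hypothetical helper written for illustration; it is not part of the module:

```python
import math

def logistic_p(beta, x):
    """Class probability for the docstring's model:
    F = beta_0 + beta_1*x_1 + ... + beta_k*x_k,  p = exp(F)/(1 + exp(F))."""
    # beta[0] is the intercept; the remaining betas pair up with the features
    F = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(F) / (1.0 + math.exp(F))
```

With the titanic coefficients shown above (intercept -1.23, sex=female 2.42), `logistic_p([-1.23, 2.42], [1.0])` gives a survival probability of roughly 0.77.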
The diff continues with the univariate learner and the prior-extraction helper, shown here in the renamed (post-changeset) form, with the Slovenian comments translated:

    @deprecated_keywords({"examples": "instances"})
    def __call__(self, instances):
        instances = createFullNoDiscTable(instances)
        classifiers = map(lambda x: LogRegLearner(Orange.core.Preprocessor_dropMissing(
            instances.select(Orange.data.Domain(x,
            instances.domain.class_var)))), instances.domain.features)
        maj_classifier = LogRegLearner(Orange.core.Preprocessor_dropMissing(
            instances.select(Orange.data.Domain(instances.domain.class_var))))
        beta = [maj_classifier.beta[0]] + [x.beta[1] for x in classifiers]
        beta_se = [maj_classifier.beta_se[0]] + [x.beta_se[1] for x in classifiers]
        P = [maj_classifier.P[0]] + [x.P[1] for x in classifiers]
        wald_Z = [maj_classifier.wald_Z[0]] + [x.wald_Z[1] for x in classifiers]
        domain = instances.domain

        return Univariate_LogRegClassifier(beta=beta, beta_se=beta_se, P=P,
                                           wald_Z=wald_Z, domain=domain)

    class UnivariateLogRegClassifier(Orange.classification.Classifier):
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, instance, resultType=Orange.classification.Classifier.GetValue):
            # classification is not implemented yet; for now this class only
            # provides regression coefficients and their statistics
            pass

    class LogRegLearnerGetPriors(object):
        def __new__(cls, instances=None, weight_id=0, **argkw):
            self = object.__new__(cls)
            if instances:
                self.__init__(**argkw)
                return self.__call__(instances, weight_id)
            else:
                return self

        @deprecated_keywords({"removeSingular": "remove_singular"})
        def __init__(self, remove_singular=0, **kwds):
            self.__dict__.update(kwds)
            self.remove_singular = remove_singular

        @deprecated_keywords({"examples": "instances"})
        def __call__(self, instances, weight=0):
            # the next function extends the data set with unknown values
            def createLogRegExampleTable(data, weight_id):
                sets_of_data = []
                for at in data.domain.features:
                    # for each attribute, create a new example table new_data and
                    # add a new continuous attribute to dataOrig, dataFinal and new_data
                    if at.var_type == Orange.data.Type.Continuous:
                        at_disc = Orange.data.variable.Continuous(at.name + "Disc")
                        new_domain = Orange.data.Domain(data.domain.features + [at_disc, data.domain.class_var])
                        new_domain.addmetas(data.domain.getmetas())
                        new_data = Orange.data.Table(new_domain, data)
                        alt_data = Orange.data.Table(new_domain, data)
                        for i, d in enumerate(new_data):
                            d[at_disc] = 0
                            d[weight_id] = 1 * data[i][weight_id]
                        for i, d in enumerate(alt_data):
                            d[at_disc] = 1
                            d[at] = 0
                            d[weight_id] = 0.000001 * data[i][weight_id]
                    elif at.var_type == Orange.data.Type.Discrete:
                        # in dataOrig, dataFinal and new_data add one more value to
                        # attribute "at"; its value is the attribute name plus "X"
                        at_new = Orange.data.variable.Discrete(at.name, values=at.values + [at.name + "X"])
                        new_domain = Orange.data.Domain(filter(lambda x: x != at, data.domain.features) + [at_new, data.domain.class_var])
                        new_domain.addmetas(data.domain.getmetas())
                        new_data = Orange.data.Table(new_domain, data)
                        alt_data = Orange.data.Table(new_domain, data)
                        for i, d in enumerate(new_data):
                            d[at_new] = data[i][at]
                            d[weight_id] = 1 * data[i][weight_id]
                        for i, d in enumerate(alt_data):
                            d[at_new] = at.name + "X"
                            d[weight_id] = 0.000001 * data[i][weight_id]
                    new_data.extend(alt_data)
                    sets_of_data.append(new_data)
                return sets_of_data

            learner = LogRegLearner(imputer=Orange.feature.imputation.ImputerConstructor_average(),
                                    remove_singular=self.remove_singular)
            # get Original Model
            orig_model = learner(instances, weight)
            if orig_model.fit_status:
                print "Warning: model did not converge"

            if weight == 0:
                weight = Orange.data.new_meta_id()
                instances.addMetaAttribute(weight, 1.0)
            extended_set_of_examples = createLogRegExampleTable(instances, weight)
            extended_models = [learner(extended_examples, weight) \
                               for extended_examples in extended_set_of_examples]

    ##        print orig_model.domain
    ##        print orig_model.beta
    ##        print orig_model.beta[orig_model.continuized_domain.features[-1]]
    ##        for i,m in enumerate(extended_models):
    ##            print instances.domain.features[i]
    ##            printOUT(m)

            betas_ap = []
            for m in extended_models:
                beta_add = m.beta[m.continuized_domain.features[-1]]
                betas_ap.append(beta_add)
                beta = beta + beta_add

            # compare it to bayes prior
            bayes = Orange.classification.bayes.NaiveLearner(instances)
            bayes_prior = math.log(bayes.distribution[1] / bayes.distribution[0])

    ##        print "lr", orig_model.beta[0]
    ##        print "lr2", logistic_prior
    ##        print "dist", Orange.statistics.distribution.Distribution(instances.domain.class_var, instances)
    ##        print "prej", betas_ap

            # return the original model and the corresponding a priori zeros
            return (orig_model, betas_ap)
            #return (bayes_prior, orig_model.beta[instances.domain.class_var], logistic_prior)

    LogRegLearnerGetPriors = deprecated_members({"removeSingular":
                                                 "remove_singular"}
                                                )(LogRegLearnerGetPriors)

    class LogRegLearnerGetPriorsOneTable:
        @deprecated_keywords({"removeSingular": "remove_singular"})
        def __init__(self, remove_singular=0, **kwds):
            self.__dict__.update(kwds)
            self.remove_singular = remove_singular
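The renaming pattern running through this changeset keeps old camelCase call sites working via `deprecated_keywords`. A minimal sketch of how such a decorator can be written (an illustration only, not Orange's actual implementation; `make_learner` is a hypothetical example function):

```python
import functools
import warnings

def deprecated_keywords(mapping):
    """Map deprecated keyword-argument names to their new names,
    emitting a DeprecationWarning when an old name is used."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old, new in mapping.items():
                if old in kwargs:
                    warnings.warn("%s is deprecated, use %s" % (old, new),
                                  DeprecationWarning, stacklevel=2)
                    kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated_keywords({"removeSingular": "remove_singular"})
def make_learner(remove_singular=0):
    # stands in for LogRegLearner.__init__ and friends
    return remove_singular
```

Calling `make_learner(removeSingular=1)` still works, but warns and forwards the value to `remove_singular`.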
The remainder of the diff touches the one-table prior helper, the pure-Python fitters, and the stepwise feature-subset selection. Lines that the changeset view does not show are marked with `# ...`; fragments are shown in the renamed, post-changeset form, with Slovenian comments translated:

        @deprecated_keywords({"examples": "instances"})
        def __call__(self, instances, weight=0):
            # the next function extends the data set with unknown values
            def createLogRegExampleTable(data, weightID):
                finalData = Orange.data.Table(data)
                orig_data = Orange.data.Table(data)
                for at in data.domain.features:
                    # for each attribute, create a new example table newData and
                    # add a new continuous attribute to dataOrig, dataFinal and newData
                    if at.var_type == Orange.data.Type.Continuous:
                        atDisc = Orange.data.variable.Continuous(at.name + "Disc")
                        newDomain = Orange.data.Domain(orig_data.domain.features + [atDisc, data.domain.class_var])
                        newDomain.addmetas(newData.domain.getmetas())
                        finalData = Orange.data.Table(newDomain, finalData)
                        newData = Orange.data.Table(newDomain, orig_data)
                        orig_data = Orange.data.Table(newDomain, orig_data)
                        for d in orig_data:
                            d[atDisc] = 0
                        # ...
                        for i, d in enumerate(finalData):
                            d[weightID] = 100 * data[i][weightID]
                    elif at.var_type == Orange.data.Type.Discrete:
                        # in dataOrig, dataFinal and newData add one more value to
                        # attribute "at"; its value is the attribute name plus "X"
                        at_new = Orange.data.variable.Discrete(at.name, values=at.values + [at.name + "X"])
                        newDomain = Orange.data.Domain(filter(lambda x: x != at, orig_data.domain.features) + [at_new, orig_data.domain.class_var])
                        newDomain.addmetas(orig_data.domain.getmetas())
                        temp_finalData = Orange.data.Table(finalData)
                        finalData = Orange.data.Table(newDomain, finalData)
                        newData = Orange.data.Table(newDomain, orig_data)
                        temp_origData = Orange.data.Table(orig_data)
                        orig_data = Orange.data.Table(newDomain, orig_data)
                        for i, d in enumerate(orig_data):
                            d[at_new] = temp_origData[i][at]
                        for i, d in enumerate(finalData):
                            d[at_new] = temp_finalData[i][at]
                        for i, d in enumerate(newData):
                            d[at_new] = at.name + "X"
                            d[weightID] = 10 * data[i][weightID]
                    finalData.extend(newData)
                return finalData

            learner = LogRegLearner(imputer=Orange.feature.imputation.ImputerConstructor_average(),
                                    remove_singular=self.remove_singular)

            # get Original Model
            orig_model = learner(instances, weight)

            # get extended Model (you should not change data)
            if weight == 0:
                weight = Orange.data.new_meta_id()
                instances.addMetaAttribute(weight, 1.0)
            extended_examples = createLogRegExampleTable(instances, weight)
            extended_model = learner(extended_examples, weight)

            betas_ap = []
            for m in extended_models:
                beta_add = m.beta[m.continuized_domain.features[-1]]
                betas_ap.append(beta_add)
                beta = beta + beta_add

            # compare it to bayes prior
            bayes = Orange.classification.bayes.NaiveLearner(instances)
            bayes_prior = math.log(bayes.distribution[1] / bayes.distribution[0])

            #print "lr", orig_model.beta[0]
            #print "lr2", logistic_prior
            #print "dist", Orange.statistics.distribution.Distribution(instances.domain.class_var, instances)
            k = (bayes_prior - orig_model.beta[0]) / (logistic_prior - orig_model.beta[0])
            #print "prej", betas_ap

            # return the original model and the corresponding a priori zeros
            return (orig_model, betas_ap)
            #return (bayes_prior, orig_model.beta[data.domain.class_var], logistic_prior)

    LogRegLearnerGetPriorsOneTable = deprecated_members({"removeSingular":
                                                         "remove_singular"}
                                                        )(LogRegLearnerGetPriorsOneTable)

    def pr(x, betas):
        bx = dot(x, betas)
        return math.exp(bx) / (1 + math.exp(bx))

    def lh(x, y, betas):
        llh = 0.0
        for i, x_i in enumerate(x):
            p = pr(x_i, betas)
            llh += y[i] * math.log(max(p, 1e-6)) + (1 - y[i]) * math.log(max(1 - p, 1e-6))
        return llh

    def diag(vector):
        mat = identity(len(vector))
        for i, v in enumerate(vector):
            mat[i][i] = v
        return mat

    class SimpleFitter(LogRegFitter):
        def __init__(self, penalty=0, se_penalty=False):
            self.penalty = penalty
            self.se_penalty = se_penalty

        def __call__(self, data, weight=0):
            ml = data.native(0)
            for i in range(len(data.domain.features)):
                a = data.domain.features[i]
                if a.var_type == Orange.data.Type.Discrete:
                    for m in ml:
                        m[i] = a.values.index(m[i])
            for m in ml:
                m[-1] = data.domain.class_var.values.index(m[-1])
            Xtmp = array(ml)
            y = Xtmp[:, -1]   # true probabilities (1's or 0's)
            one = reshape(array([1] * len(data)), (len(data), 1))
            X = concatenate((one, Xtmp[:, :-1]), 1)  # intercept first, then data

            betas = array([0.0] * (len(data.domain.features) + 1))
            oldBetas = array([1.0] * (len(data.domain.features) + 1))
            N = len(data)

            pen_matrix = array([self.penalty] * (len(data.domain.features) + 1))
            if self.se_penalty:
                p = array([pr(X[i], betas) for i in range(len(data))])
                W = identity(len(data))
                pp = p * (1.0 - p)
                for i in range(N):
                    W[i, i] = pp[i]
                se = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
                for i, p in enumerate(pen_matrix):
                    pen_matrix[i] *= se[i]
            # ...
            p = array([pr(X[i], betas) for i in range(len(data))])
            W = identity(len(data))
            pp = p * (1.0 - p)
            for i in range(N):
                W[i, i] = pp[i]
            WI = inv(W)
            z = dot(X, betas) + dot(WI, y - p)
            tmpA = inv(dot(transpose(X), dot(W, X)) + diag(pen_matrix))
            tmpB = dot(transpose(X), y - p)
            betas = oldBetas + dot(tmpA, tmpB)
    #            betaTemp = dot(dot(dot(dot(tmpA,transpose(X)),W),X),oldBetas)
    #            print betaTemp
    #            tmpB = dot(transpose(X), dot(W, z))
    #            betas = dot(tmpA, tmpB)
            likelihood_new = lh(X, y, betas) - self.penalty * sum([b * b for b in betas])
            print likelihood_new
            # ...
    ##        XX = sqrt(diagonal(inv(dot(transpose(X),X))))
    ##        yhat = array([pr(X[i], betas) for i in range(len(data))])
    ##        ss = sum((y - yhat) ** 2) / (N - len(data.domain.features) - 1)
    ##        sigma = math.sqrt(ss)
            p = array([pr(X[i], betas) for i in range(len(data))])
            W = identity(len(data))
            pp = p * (1.0 - p)
            for i in range(N):
                W[i, i] = pp[i]
            diXWX = sqrt(diagonal(inv(dot(transpose(X), dot(W, X)))))
            xTemp = dot(dot(inv(dot(transpose(X), dot(W, X))), transpose(X)), y)
            beta = []
            beta_se = []
            # ...

    class BayesianFitter(LogRegFitter):
        def __init__(self, penalty=0, anch_examples=[], tau=0):
            self.penalty = penalty
            self.anch_examples = anch_examples
            self.tau = tau

        def create_array_data(self, data):
            # convert data to numeric
            ml = data.native(0)
            for i, a in enumerate(data.domain.features):
                if a.var_type == Orange.data.Type.Discrete:
                    for m in ml:
                        m[i] = a.values.index(m[i])
            for m in ml:
                m[-1] = data.domain.class_var.values.index(m[-1])
            Xtmp = array(ml)
            y = Xtmp[:, -1]   # true probabilities (1's or 0's)
            # ...

        def __call__(self, data, weight=0):
            (X, y) = self.create_array_data(data)

            exTable = Orange.data.Table(data.domain)
            for id, ex in self.anch_examples:
                exTable.extend(Orange.data.Table(ex, data.domain))
            (X_anch, y_anch) = self.create_array_data(exTable)

            betas = array([0.0] * (len(data.domain.features) + 1))

            likelihood, betas = self.estimate_beta(X, y, betas, [0] * (len(betas)), X_anch, y_anch)

            # get attribute groups atGroup = [(startIndex, number of values), ...)
            ats = data.domain.features
            atVec = reduce(lambda x, y: x + [(y, not y == x[-1][0])],
                           [a.getValueFrom and a.getValueFrom.whichVar or a for a in ats],
                           [(ats[0].getValueFrom and ats[0].getValueFrom.whichVar or ats[0], 0)])[1:]
            atGroup = [[0, 0]]
            # ...
                print "betas", betas[0], betas_temp[0]
                sumB += betas[0] - betas_temp[0]
            # ...
            apriori = Orange.statistics.distribution.Distribution(data.domain.class_var, data)
            aprioriProb = apriori[0] / apriori.abs
            # ...

        def estimate_beta(self, X, y, betas, const_betas, X_anch, y_anch):
            # ...
            for j in range(len(betas)):
                if const_betas[j]: continue
                dl = dot(X[:, j], transpose(y - p))
                for xi, x in enumerate(X_anch):
                    dl += self.penalty * x[j] * (y_anch[xi] - pr_bx(r_anch[xi] * self.penalty))
                ddl = dot(X_sq[:, j], transpose(p * (1 - p)))
                for xi, x in enumerate(X_anch):
                    ddl += self.penalty * x[j] * pr_bx(r[xi] * self.penalty) * (1 - pr_bx(r[xi] * self.penalty))
            # ...

    ##########################################################################
    #  Feature subset selection for logistic regression

    @deprecated_keywords({"examples": "instances"})
    def get_likelihood(fitter, instances):
        res = fitter(instances)
        if res[0] in [fitter.OK]: #, fitter.Infinity, fitter.Divergence]:
            status, beta, beta_se, likelihood = res
            if sum([abs(b) for b in beta]) < sum([abs(b) for b in beta_se]):
                return -100 * len(instances)
            return likelihood
        else:
            return -100 * len(instances)

    # StepWiseFSS.__call__ (fragment; the class header and surrounding lines
    # are not shown in the changeset view):
            if P >= self.delete_crit:
                attr.remove(worstAt)
                remain_attr.append(worstAt)
                nodeletion = 1
            # END OF DELETION PART

            # if enough attributes have been chosen, stop the procedure
            if self.num_features > -1 and len(attr) >= self.num_features:
                remain_attr = []

            # for each attribute in the remaining domain, calculate P for
            # the log-likelihood improvement
            maxG = -1
            for at in remain_attr:
                tempAttr = attr + [at]
                tempDomain = Orange.data.Domain(tempAttr, examples.domain.class_var)
                tempDomain.addmetas(examples.domain.getmetas())
tempDomain  = continuizer(Orange.core.Preprocessor_dropMissing(examples.select(tempDomain))) tempData = Orange.core.Preprocessor_dropMissing(examples.select(tempDomain)) ll_New = get_likelihood(Orange.core.LogRegFitter_Cholesky(), tempData) ll_New = get_likelihood(LogRegFitter_Cholesky(), tempData) length_New = float(len(tempData)) # get number of examples in tempData to normalize likelihood stop = 1 continue if bestAt.varType==Orange.core.VarTypes.Continuous: if bestAt.var_type==Orange.data.Type.Continuous: P=lchisqprob(maxG,1); else: P=lchisqprob(maxG,len(bestAt.values)-1); # Add attribute with smallest P to attributes(attr) if P<=self.addCrit: if P<=self.add_crit: attr.append(bestAt) remain_attr.remove(bestAt) length_Old = length_Best if (P>self.addCrit and nodeletion) or (bestAt == worstAt): if (P>self.add_crit and nodeletion) or (bestAt == worstAt): stop = 1 return attr StepWiseFSS = deprecated_members({"addCrit": "add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"})(StepWiseFSS) else: return self def __init__(self, addCrit=0.2, deleteCrit=0.3, numFeatures = -1): self.addCrit = addCrit self.deleteCrit = deleteCrit self.numFeatures = numFeatures def __call__(self, examples): attr = StepWiseFSS(examples, addCrit=self.addCrit, deleteCrit = self.deleteCrit, numFeatures = self.numFeatures) return examples.select(Orange.core.Domain(attr, examples.domain.classVar)) @deprecated_keywords({"addCrit": "add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"}) def __init__(self, add_crit=0.2, delete_crit=0.3, num_features = -1): self.add_crit = add_crit self.delete_crit = delete_crit self.num_features = num_features @deprecated_keywords({"examples": "instances"}) def __call__(self, instances): attr = StepWiseFSS(instances, add_crit=self.add_crit, delete_crit= self.delete_crit, num_features= self.num_features) return instances.select(Orange.data.Domain(attr, instances.domain.class_var)) StepWiseFSSFilter = deprecated_members({"addCrit": 
"add_crit", "deleteCrit": "delete_crit", "numFeatures": "num_features"})\ (StepWiseFSSFilter) ####################################
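The changeset above migrates the fitter from Numeric's matrixmultiply/inverse to numpy's dot/linalg.inv. As a minimal sketch of what one such iteratively reweighted least squares (Newton-Raphson) step computes, written with the new numpy idiom — the four-row dataset is made up for illustration and is not part of the changeset:

```python
import numpy as np

# One IRLS step for logistic regression, using the numpy calls the
# patch migrates to (dot, linalg.inv) instead of Numeric's
# matrixmultiply/inverse. The data below is illustrative only.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])  # intercept first
y = np.array([0.0, 0.0, 1.0, 1.0])
betas = np.zeros(X.shape[1])

p = 1.0 / (1.0 + np.exp(-X.dot(betas)))  # predicted probabilities pr(X[i], betas)
W = np.diag(p * (1.0 - p))               # diagonal weight matrix
# Newton-Raphson update: betas += (X'WX)^-1 X'(y - p)
betas = betas + np.linalg.inv(X.T.dot(W).dot(X)).dot(X.T.dot(y - p))
```

After this single step the intercept is negative and the slope positive, as expected for data where larger feature values go with class 1.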
• ## Orange/evaluation/reliability.py

 r9725 :obj:`Orange.classification.Classifier.GetBoth` is passed) contain an additional attribute :obj:`reliability_estimate`, which is an instance of :class:`~Orange.evaluation.reliability.Estimate`. """
• ## Orange/feature/discretization.py

 r9671 """ ################################### Discretization (discretization) ################################### .. index:: discretization .. index:: single: feature; discretization Example-based automatic discretization is in essence similar to learning: given a set of examples, discretization method proposes a list of suitable intervals to cut the attribute's values into. For this reason, Orange structures for discretization resemble its structures for learning. Objects derived from orange.Discretization play a role of "learner" that, upon observing the examples, construct an orange.Discretizer whose role is to convert continuous values into discrete according to the rule found by Discretization. Orange supports several methods of discretization; here's a list of methods with belonging classes. * Equi-distant discretization (:class:EquiDistDiscretization, :class:EquiDistDiscretizer). The range of attribute's values is split into prescribed number equal-sized intervals. * Quantile-based discretization (:class:EquiNDiscretization, :class:IntervalDiscretizer). The range is split into intervals containing equal number of examples. * Entropy-based discretization (:class:EntropyDiscretization, :class:IntervalDiscretizer). Developed by Fayyad and Irani, this method balances between entropy in intervals and MDL of discretization. * Bi-modal discretization (:class:BiModalDiscretization, :class:BiModalDiscretizer/:class:IntervalDiscretizer). Two cut-off points set to optimize the difference of the distribution in the middle interval and the distributions outside it. * Fixed discretization (:class:IntervalDiscretizer). Discretization with user-prescribed cut-off points. Instances of classes derived from :class:Discretization. It define a single method: the call operator. The object can also be called through constructor. .. class:: Discretization .. 
   .. method:: __call__(attribute, examples[, weightID])

      Given a continuous attribute, examples and, optionally, the id of an attribute with example weights, this function returns a discretized attribute. Argument attribute can be a descriptor, index or name of the attribute.

Here is an example. Part of :download:`discretization.py`:

.. literalinclude:: code/discretization.py
   :lines: 7-15

The discretized attribute sep_w is constructed with a call to :class:`EntropyDiscretization` (instead of constructing it and calling it afterwards, we passed the arguments for the call to the constructor, as is often allowed in Orange). We then constructed a new :class:`Orange.data.Table` with the attributes "sepal width" (the original continuous attribute), sep_w and the class attribute. The script's output is::

    Entropy discretization, first 10 examples
    [3.5, '>3.30', 'Iris-setosa']
    [3.0, '(2.90, 3.30]', 'Iris-setosa']
    [3.2, '(2.90, 3.30]', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']
    [3.6, '>3.30', 'Iris-setosa']
    [3.9, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [3.4, '>3.30', 'Iris-setosa']
    [2.9, '<=2.90', 'Iris-setosa']
    [3.1, '(2.90, 3.30]', 'Iris-setosa']

:class:`EntropyDiscretization` named the new attribute's values by the interval range (it also named the attribute "D_sepal width"). The new attribute's values get computed automatically when they are needed. As those who have read about :class:`Orange.data.variable.Variable` know, the answer to "How does this work?" is hidden in the field :obj:`~Orange.data.variable.Variable.get_value_from`. This little dialog reveals the secret::

    >>> sep_w
    EnumVariable 'D_sepal width'
    >>> sep_w.get_value_from
    >>> sep_w.get_value_from.whichVar
    FloatVariable 'sepal width'
    >>> sep_w.get_value_from.transformer
    >>> sep_w.get_value_from.transformer.points
    <2.90000009537, 3.29999995232>

So, the select statement in the above example converted all examples from data to the new domain.
Since the new domain includes the attribute sep_w that is not present in the original, sep_w's values are computed on the fly. For each example in data, sep_w.get_value_from is called to compute sep_w's value (if you ever need this, you shouldn't call get_value_from directly but call compute_value instead). sep_w.get_value_from looks for the value of "sepal width" in the original example. The original, continuous sepal width is passed to the transformer, which determines the interval by its field points. The transformer returns the discrete value, which is in turn returned by get_value_from and stored in the new example.

You don't need to understand this mechanism exactly. It is important to know that there are two classes of objects for discretization. Those derived from :obj:`Discretizer` (such as :obj:`IntervalDiscretizer` that we've seen above) are used as transformers that translate a continuous value into a discrete one. Discretization algorithms are derived from :obj:`Discretization`. Their job is to construct a :obj:`Discretizer` and return a new variable with the discretizer stored in get_value_from.transformer.

Discretizers
============

Different discretizers support different methods for conversion of continuous values into discrete ones. The most general is :class:`IntervalDiscretizer`, which is also used by most discretization methods. Two other discretizers, :class:`EquiDistDiscretizer` and :class:`ThresholdDiscretizer`, could easily be replaced by :class:`IntervalDiscretizer` but are used for speed and simplicity. The fourth discretizer, :class:`BiModalDiscretizer`, is specialized for discretizations induced by :class:`BiModalDiscretization`.

.. class:: Discretizer

   All discretizers support a handy method for construction of a new attribute from an existing one.

   .. method:: construct_variable(attribute)

      Constructs a new attribute descriptor; the new attribute is the discretized attribute.
The new attribute's name equals attribute.name prefixed by "D\_", and its symbolic values are discretizer specific. The above example shows what comes out from :class:`IntervalDiscretizer`. Discretization algorithms actually first construct a discretizer and then call its :meth:`construct_variable` to construct an attribute descriptor.

.. class:: IntervalDiscretizer

   The most common discretizer.

   .. attribute:: points

      Cut-off points. All values below or equal to the first point belong to the first interval, those between the first and the second (including those equal to the second) go to the second interval, and so forth to the last interval, which covers all values greater than the last element in points. The number of intervals is thus len(points)+1.

Let us manually construct an interval discretizer with cut-off points at 3.0 and 5.0. We shall use the discretizer to construct a discretized sepal length (part of :download:`discretization.py`):

.. literalinclude:: code/discretization.py
   :lines: 22-26

That's all. The first five examples of data2 are now::

    [5.1, '>5.00', 'Iris-setosa']
    [4.9, '(3.00, 5.00]', 'Iris-setosa']
    [4.7, '(3.00, 5.00]', 'Iris-setosa']
    [4.6, '(3.00, 5.00]', 'Iris-setosa']
    [5.0, '(3.00, 5.00]', 'Iris-setosa']

Can you use the same discretizer for more than one attribute? Yes, as long as they have the same cut-off points, of course. Simply call construct_variable for each continuous attribute (part of :download:`discretization.py`):

.. literalinclude:: code/discretization.py
   :lines: 30-34

Each attribute now has its own (FIXME) ClassifierFromVar in its get_value_from, but all use the same :class:`IntervalDiscretizer`, idisc. Changing an element of its points affects all attributes.

Do not change the length of :obj:`~IntervalDiscretizer.points` if the discretizer is used by any attribute. The length of :obj:`~IntervalDiscretizer.points` should always match the number of values of the attribute, which is determined by the length of the attribute's field values.
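The points mapping just described can be sketched in plain Python; bisect_left happens to reproduce the "below or equal goes to the lower interval" rule. This is an illustration of the rule, not Orange's implementation, and interval_index is a made-up name:

```python
import bisect

def interval_index(value, points):
    # Sketch of the described mapping: value <= points[0] -> interval 0,
    # points[i-1] < value <= points[i] -> interval i, and values greater
    # than points[-1] fall into the last interval, len(points).
    return bisect.bisect_left(points, value)
```

With the cut-off points 3.0 and 5.0 from the example above, the values 2.5, 3.0, 4.9, 5.0 and 5.1 map to intervals 0, 0, 1, 1 and 2, matching the printed labels '(3.00, 5.00]' and '>5.00'.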
Therefore, if attr is a discretized attribute, then len(attr.values) must equal len(attr.get_value_from.transformer.points)+1. It always does, unless you deliberately change it. If the sizes don't match, Orange will probably crash, and it will be entirely your fault.

.. class:: EquiDistDiscretizer

   More rigid than :obj:`IntervalDiscretizer`: it uses intervals of fixed width.

   .. attribute:: first_cut

      The first cut-off point.

   .. attribute:: step

      Width of the intervals.

   .. attribute:: number_of_intervals

      Number of intervals.

   .. attribute:: points (read-only)

      The cut-off points; this is not a real attribute although it behaves as one. Reading it constructs a list of cut-off points and returns it, but changing the list doesn't affect the discretizer - it's a separate list. This attribute is here only to give :obj:`EquiDistDiscretizer` the same interface as that of :obj:`IntervalDiscretizer`.

All values below :obj:`~EquiDistDiscretizer.first_cut` belong to the first interval (including possible values smaller than firstVal). Otherwise, value val's interval is floor((val-firstVal)/step). If this turns out to be greater than or equal to :obj:`~EquiDistDiscretizer.number_of_intervals`, it is decreased to number_of_intervals-1.

This discretizer is returned by :class:`EquiDistDiscretization`; you can see an example in the corresponding section. You can also construct it manually and call its construct_variable, just as shown for :obj:`IntervalDiscretizer`.

.. class:: ThresholdDiscretizer

   A threshold discretizer converts continuous values into binary ones by comparing them with a threshold. This discretizer is actually not used by any discretization method, but you can use it for manual discretization. Orange needs this discretizer for binarization of continuous attributes in decision trees.

   .. attribute:: threshold

      Threshold; values below or equal to the threshold belong to the first interval and those that are greater go to the second.
.. class:: BiModalDiscretizer

   This is the first discretizer on this page that could not be replaced by :class:`IntervalDiscretizer`. It has two cut-off points and values are discretized according to whether or not they belong to the middle region (which includes the lower but not the upper boundary). The discretizer is returned by :class:`BiModalDiscretization` if its field :obj:`~BiModalDiscretization.split_in_two` is true (the default).

   .. attribute:: low

      Lower boundary of the interval (included in the interval).

   .. attribute:: high

      Upper boundary of the interval (not included in the interval).

Discretization Algorithms
=========================

.. class:: EquiDistDiscretization

   Discretizes the attribute by cutting it into the prescribed number of intervals of equal width. The examples are needed to determine the span of the attribute's values. The interval between the smallest and the largest value is then cut into equal parts.

   .. attribute:: number_of_intervals

      Number of intervals into which the attribute is to be discretized. The default value is 4.

For an example, we shall discretize all attributes of the Iris dataset into 6 intervals. We shall construct an :class:`Orange.data.Table` with discretized attributes and print descriptions of the attributes (part of :download:`discretization.py`):

.. literalinclude:: code/discretization.py
   :lines: 38-43

The script's answer is::

    D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30>
    D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00>
    D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90>
    D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10>

Is there a more decent way for a script to find the interval boundaries than by parsing the symbolic values? Sure; they are hidden in the discretizer, which is, as usual, stored in attr.get_value_from.transformer. Compare the following with the values above::

    >>> for attr in newattrs:
    ...    
print "%s: first interval at %5.3f, step %5.3f" % \ ...    (attr.name, attr.get_value_from.transformer.first_cut, \ ...    attr.get_value_from.transformer.step) D_sepal length: first interval at 4.900, step 0.600 D_sepal width: first interval at 2.400, step 0.400 D_petal length: first interval at 1.980, step 0.980 D_petal width: first interval at 0.500, step 0.400 As all discretizers, :class:EquiDistDiscretizer also has the method construct_variable (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 69-73 .. class:: EquiNDiscretization Discretization with Intervals Containing (Approximately) Equal Number of Examples. Discretizes the attribute by cutting it into the prescribed number of intervals so that each of them contains equal number of examples. The examples are obviously needed for this discretization, too. .. attribute:: number_of_intervals Number of intervals into which the attribute is to be discretized. Default value is 4. The use of this discretization is the same as the use of :class:EquiDistDiscretization. The resulting discretizer is :class:IntervalDiscretizer, hence it has points instead of first_cut/ step/number_of_intervals. .. class:: EntropyDiscretization Entropy-based Discretization (Fayyad-Irani). Fayyad-Irani's discretization method works without a predefined number of intervals. Instead, it recursively splits intervals at the cut-off point that minimizes the entropy, until the entropy decrease is smaller than the increase of MDL induced by the new point. An interesting thing about this discretization technique is that an attribute can be discretized into a single interval, if no suitable cut-off points are found. If this is the case, the attribute is rendered useless and can be removed. This discretization can therefore also serve for feature subset selection. .. 
   .. attribute:: force_attribute

      Forces the algorithm to induce at least one cut-off point, even when its information gain is lower than the MDL (default: false).

Part of :download:`discretization.py`:

.. literalinclude:: code/discretization.py
   :lines: 77-80

The output shows that all attributes are discretized into three intervals::

    sepal length: <5.5, 6.09999990463>
    sepal width: <2.90000009537, 3.29999995232>
    petal length: <1.89999997616, 4.69999980927>
    petal width: <0.600000023842, 1.0000004768>

.. class:: BiModalDiscretization

   Bi-modal discretization sets two cut-off points so that the class distribution of the examples in between is as different from the overall distribution as possible. The difference is measured by the chi-square statistic. All possible cut-off points are tried, thus the discretization runs in O(n^2). This discretization method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values of the attribute. Depending on the nature of the attribute, we can treat the lower and higher values separately, thus discretizing the attribute into three intervals, or together, into a binary attribute whose values correspond to normal and abnormal.

   .. attribute:: split_in_two

      Decides whether the resulting attribute should have three or two values. If true (default), we have three intervals and the discretizer is of type :class:`BiModalDiscretizer`. If false, the result is an ordinary :class:`IntervalDiscretizer`.

The Iris dataset has a three-valued class attribute; the classes are setosa, virginica and versicolor. As the picture below shows, sepal lengths of versicolors are between the lengths of setosas and virginicas (the picture itself is drawn using LOESS probability estimation).

.. image:: files/bayes-iris.gif

If we merge the classes setosa and virginica into one, we can observe whether the bi-modal discretization would correctly recognize the interval in which versicolors dominate.
.. literalinclude:: code/discretization.py
   :lines: 84-87

In this script, we have constructed a new class attribute which tells whether an iris is versicolor or not. We have specified how this attribute's value is computed from the original class value with a simple lambda function. Finally, we have constructed a new domain and converted the examples. Now for the discretization.

.. literalinclude:: code/discretization.py
   :lines: 97-100

The script prints out the middle intervals::

    sepal length: (5.400, 6.200]
    sepal width: (2.000, 2.900]
    petal length: (1.900, 4.700]
    petal width: (0.600, 1.600]

Judging by the graph, the cut-off points for "sepal length" make sense.

Additional functions
====================

Some functions and classes that can be used for categorization of continuous features. Besides several general classes that can help in this task, we also provide a function that may help in entropy-based discretization (Fayyad & Irani), and a wrapper around classes for categorization that can be used for learning.

.. automethod:: Orange.feature.discretization.entropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.EntropyDiscretization_wrapper

.. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class

.. rubric:: Example

FIXME. A chapter on `feature subset selection <../ofb/o_fss.htm>`_ in the Orange for Beginners tutorial shows the use of DiscretizedLearner. Other discretization classes from core Orange are listed in the chapter on `categorization <../ofb/o_categorization.htm>`_ of the same tutorial.

==========
References
==========

* UM Fayyad and KB Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029, Chambery, France, 1993.
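The bi-modal search described above — trying all O(n^2) cut-off pairs for the middle interval — can be sketched in plain Python. This is a simplification, not Orange's implementation: it scores candidates by the difference of class rates instead of the chi-square statistic, and best_middle_interval is a made-up name:

```python
from itertools import combinations

def best_middle_interval(values, labels):
    # Brute-force sketch of the bi-modal search: try every pair of
    # cut-off points and keep the middle region [low, high) whose
    # positive-class rate differs most from the rest of the data.
    # (BiModalDiscretization scores candidates with chi-square instead.)
    points = sorted(set(values))
    best, best_score = None, -1.0
    for low, high in combinations(points, 2):
        inside = [l for v, l in zip(values, labels) if low <= v < high]
        outside = [l for v, l in zip(values, labels) if not (low <= v < high)]
        if not inside or not outside:
            continue
        score = abs(sum(inside) / float(len(inside))
                    - sum(outside) / float(len(outside)))
        if score > best_score:
            best, best_score = (low, high), score
    return best
```

On data where the middle values belong to one class and the outer values to the other (as with the versicolor example above), the search recovers exactly the middle region.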
""" import Orange import Orange.core as orange Discrete2Continuous, \ Discretizer, \ BiModalDiscretizer, \ EquiDistDiscretizer, \ IntervalDiscretizer, \ ThresholdDiscretizer, \ EntropyDiscretization, \ EquiDistDiscretization, \ EquiNDiscretization, \ BiModalDiscretization, \ Discretization BiModalDiscretizer, \ EquiDistDiscretizer as EqualWidthDiscretizer, \ IntervalDiscretizer, \ ThresholdDiscretizer,\ EntropyDiscretization as Entropy, \ EquiDistDiscretization as EqualWidth, \ EquiNDiscretization as EqualFreq, \ BiModalDiscretization as BiModal, \ Discretization, \ Preprocessor_discretize ###### # from orngDics.py def entropyDiscretization_wrapper(table): """Take the classified table set (table) and categorize all continuous features using the entropy based discretization :obj:EntropyDiscretization. def entropyDiscretization_wrapper(data): """Discretize all continuous features in class-labeled data set with the entropy-based discretization :obj:Entropy. :param table: data to discretize. :type table: Orange.data.Table :param data: data to discretize. :type data: Orange.data.Table :rtype: :obj:Orange.data.Table includes all categorical and discretized\ continuous features from the original data table. 
""" orange.setrandseed(0) tablen=orange.Preprocessor_discretize(table, method=EntropyDiscretization()) data_new = orange.Preprocessor_discretize(data, method=Entropy()) attrlist=[] nrem=0 for i in tablen.domain.attributes: attrlist = [] nrem = 0 for i in data_new.domain.attributes: if (len(i.values)>1): attrlist.append(i) nrem=nrem+1 attrlist.append(tablen.domain.classVar) return tablen.select(attrlist) return data_new.select(attrlist) """ def __init__(self, baseLearner, discretizer=EntropyDiscretization(), **kwds): def __init__(self, baseLearner, discretizer=Entropy(), **kwds): self.baseLearner = baseLearner if hasattr(baseLearner, "name"): def __call__(self, example, resultType = orange.GetValue): return self.classifier(example, resultType) class DiscretizeTable(object): """Discretizes all continuous features of the data table. :param data: data to discretize. :type data: :class:Orange.data.Table :param features: data features to discretize. None (default) to discretize all features. :type features: list of :class:Orange.data.variable.Variable :param method: feature discretization method. :type method: :class:Discretization """ def __new__(cls, data=None, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): if data is None: self = object.__new__(cls, features=features, discretize_class=discretize_class, method=method) return self else: self = cls(features=features, discretize_class=discretize_class, method=method) return self(data) def __init__(self, features=None, discretize_class=False, method=EqualFreq(n_intervals=3)): self.features = features self.discretize_class = discretize_class self.method = method def __call__(self, data): pp = Preprocessor_discretize(attributes=self.features, discretizeClass=self.discretize_class) pp.method = self.method return pp(data)
• ## Orange/feature/imputation.py

 r9671 """ ########################### Imputation (imputation) ########################### .. index:: imputation .. index:: single: feature; value imputation Imputation is a procedure of replacing the missing feature values with some appropriate values. Imputation is needed because of the methods (learning algorithms and others) that are not capable of handling unknown values, for instance logistic regression. Missing values sometimes have a special meaning, so they need to be replaced by a designated value. Sometimes we know what to replace the missing value with; for instance, in a medical problem, some laboratory tests might not be done when it is known what their results would be. In that case, we impute certain fixed value instead of the missing. In the most complex case, we assign values that are computed based on some model; we can, for instance, impute the average or majority value or even a value which is computed from values of other, known feature, using a classifier. In a learning/classification process, imputation is needed on two occasions. Before learning, the imputer needs to process the training examples. Afterwards, the imputer is called for each example to be classified. In general, imputer itself needs to be trained. This is, of course, not needed when the imputer imputes certain fixed value. However, when it imputes the average or majority value, it needs to compute the statistics on the training examples, and use it afterwards for imputation of training and testing examples. While reading this document, bear in mind that imputation is a part of the learning process. If we fit the imputation model, for instance, by learning how to predict the feature's value from other features, or even if we simply compute the average or the minimal value for the feature and use it in imputation, this should only be done on learning data. If cross validation is used for sampling, imputation should be done on training folds only. 
Orange provides simple means for doing that.

This page will first explain how to construct various imputers. Then follow the `examples for proper use of imputers <#using-imputers>`_. Finally, quite often you will want to use imputation with special requests, such as certain features' missing values being replaced by constants and others by values computed using models induced from specified other features. For instance, in one of the studies we worked on, the patient's pulse rate needed to be estimated using regression trees that included the scope of the patient's injuries, sex and age; some attributes' values were replaced by the most pessimistic ones and others were computed with regression trees based on the values of all features. If you are using learners that need the imputer as a component, you will need to `write your own imputer constructor <#write-your-own-imputer-constructor>`_. This is trivial and is explained at the end of this page.

Wrapper for learning algorithms
===============================

This wrapper can be used with learning algorithms that cannot handle missing values: it will impute the missing values using the imputer, call the learning algorithm and, if imputation is also needed by the classifier, wrap the resulting classifier into another wrapper that will impute the missing values in the examples to be classified.

Even so, the module is somewhat redundant, as all learners that cannot handle missing values should, in principle, provide the slots for an imputer constructor. For instance, :obj:`Orange.classification.logreg.LogRegLearner` has an attribute :obj:`Orange.classification.logreg.LogRegLearner.imputerConstructor`, and even if you don't set it, it will do some imputation by default.

.. class:: ImputeLearner

   Wraps a learner and performs data imputation before learning.

Most of Orange's learning algorithms do not use imputers because they can appropriately handle the missing values.
The Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling the missing values in various ways. If for any reason you want to use these algorithms to run on imputed data, you can use this wrapper. The class description is a matter of a separate page, but we shall show its code here as another demonstration of how to use the imputers - logistic regression is implemented essentially the same as the classes below.

This is basically a learner, so the constructor will return either an instance of :obj:`ImputeLearner` or, if called with examples, an instance of some classifier. There are a few attributes that need to be set, though.

.. attribute:: base_learner

   A wrapped learner.

.. attribute:: imputer_constructor

   An instance of a class derived from :obj:`ImputerConstructor` (or a class with the same call operator).

.. attribute:: dont_impute_classifier

   If given and set (this attribute is optional), the classifier will not be wrapped into an imputer. Do this if the classifier doesn't mind if the examples it is given have missing values.

The learner is best illustrated by its code - here's its complete :obj:`__call__` method::

    def __call__(self, data, weight=0):
        trained_imputer = self.imputer_constructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        base_classifier = self.base_learner(imputed_data, weight)
        if self.dont_impute_classifier:
            return base_classifier
        else:
            return ImputeClassifier(base_classifier, trained_imputer)

So "learning" goes like this. :obj:`ImputeLearner` will first construct the imputer (that is, call :obj:`self.imputer_constructor` to get a trained imputer). Then it will use the imputer to impute the data and call the given :obj:`base_learner` to construct a classifier. For instance, :obj:`base_learner` could be a learner for logistic regression and the result would be a logistic regression model.
If the classifier can handle unknown values (that is, if :obj:dont_impute_classifier is set), we return it as it is; otherwise we wrap it into :obj:ImputeClassifier, which is given the base classifier and the imputer which it can use to impute the missing values in (testing) examples.

.. class:: ImputeClassifier

Objects of this class are returned by :obj:ImputeLearner when given data.

.. attribute:: baseClassifier

    A wrapped classifier.

.. attribute:: imputer

    An imputer for imputation of unknown values.

.. method:: __call__

This class is even more trivial than the learner. Its constructor accepts two arguments, the classifier and the imputer, which are stored into the corresponding attributes. The call operator which does the classification then looks like this::

    def __call__(self, ex, what=orange.GetValue):
        return self.base_classifier(self.imputer(ex), what)

It imputes the missing values by calling the :obj:imputer and passes the imputed example to the base classifier.

.. note:: In this setup the imputer is trained on the training data - even if you do cross validation, the imputer will be trained on the right data. In the classification phase we again use the imputer which was trained on the training data only.

.. rubric:: Code of ImputeLearner and ImputeClassifier

:obj:Orange.feature.imputation.ImputeLearner puts the keyword arguments into the instance's dictionary. You are expected to call it like :obj:ImputeLearner(base_learner=, imputer=). When the learner is called with examples, it trains the imputer, imputes the data, induces a :obj:base_classifier by the :obj:base_learner and constructs an :obj:ImputeClassifier that stores the :obj:base_classifier and the :obj:imputer. For classification, the missing values are imputed and the classifier's prediction is returned.
Note that this code is slightly simplified, although the omitted details handle non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example

Although most of Orange's learning algorithms will take care of imputation internally, if needed, it can sometimes happen that an expert will be able to tell you exactly what to put in the data instead of the missing values. In this example we shall suppose that we want to impute the minimal value of each feature. We will try to determine whether the naive Bayesian classifier with its implicit internal imputation works better than one that uses imputation by minimal values.

:download:imputation-minimal-imputer.py (uses :download:voting.tab):

.. literalinclude:: code/imputation-minimal-imputer.py
    :lines: 7-

Should output this::

    Without imputation: 0.903
    With imputation: 0.899

.. note:: Note that we constructed just one instance of \ :obj:Orange.classification.bayes.NaiveLearner, but this same instance is used twice in each fold: once it is given the examples as they are (and returns an instance of :obj:Orange.classification.bayes.NaiveClassifier).
The second time it is called by :obj:imba, and the :obj:Orange.classification.bayes.NaiveClassifier it returns is wrapped into :obj:Orange.feature.imputation.ImputeClassifier. We thus have only one learner, which produces two different classifiers in each round of testing.

Abstract imputers
=================

As is common in Orange, imputation is done by a pair of classes: one that does the work and another that constructs it. :obj:ImputerConstructor is an abstract root of the hierarchy of classes that get the training data (with an optional id for weight) and construct an instance of a class derived from :obj:Imputer. An :obj:Imputer can be called with an :obj:Orange.data.Instance and it will return a new example with the missing values imputed (it will leave the original example intact!). If an imputer is called with an :obj:Orange.data.Table, it will return a new example table with imputed examples.

.. class:: ImputerConstructor

.. attribute:: imputeClass

    Tells whether to impute the class value (default) or not.

Simple imputation
=================

The simplest imputers always impute the same value for a particular attribute, disregarding the values of other attributes. They all use the same imputer class, :obj:Imputer_defaults.

.. class:: Imputer_defaults

.. attribute:: defaults

    An example with the default values to be imputed instead of the missing ones. Examples to be imputed must be from the same domain as :obj:defaults.

Instances of this class can be constructed by :obj:Orange.feature.imputation.ImputerConstructor_minimal, :obj:Orange.feature.imputation.ImputerConstructor_maximal and :obj:Orange.feature.imputation.ImputerConstructor_average. For continuous features, they will impute the smallest, largest or average value encountered in the training examples. For discrete features, they will impute the lowest (the one with index 0, e.g. attr.values[0]), the highest (attr.values[-1]) or the most common value encountered in the data, respectively.
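Stripped of the Orange API, the minimal-value strategy above amounts to a few lines of plain Python. The sketch below is illustrative only: the helper name, the dict-based rows and the use of None for missing values are assumptions of this example, not Orange's representation. Like Orange's imputers, it leaves the original example intact and returns an imputed copy.

```python
def train_minimal_imputer(rows):
    """Learn per-feature minima from training rows (dicts; None = missing)
    and return a function that imputes a *copy* of a row."""
    defaults = {}
    for key in rows[0]:
        observed = [r[key] for r in rows if r[key] is not None]
        defaults[key] = min(observed)

    def impute(row):
        # Build a new dict so the caller's row is left untouched.
        return {k: (defaults[k] if v is None else v) for k, v in row.items()}

    return impute

# Train on three rows; "LENGTH" and "LANES" each have one missing value.
train = [{"LENGTH": 1000, "LANES": 2},
         {"LENGTH": 1200, "LANES": None},
         {"LENGTH": None, "LANES": 4}]
imputer = train_minimal_imputer(train)
print(imputer({"LENGTH": None, "LANES": None}))  # {'LENGTH': 1000, 'LANES': 2}
```

The maximal and average variants differ only in replacing `min` with `max` or a mean.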
The first two imputers will mostly be used when the discrete values are ordered according to their impact on the class (for instance, possible values for symptoms of some disease can be ordered according to their seriousness). The minimal and maximal imputers will then represent optimistic and pessimistic imputations.

The following code will load the bridges data, and first impute the values in a single example and then in the whole table.

:download:imputation-complex.py (uses :download:bridges.tab):

.. literalinclude:: code/imputation-complex.py
    :lines: 9-23

This example shows what the imputer does, not how it is to be used. Don't impute all the data and then use it for cross-validation. As warned at the top of this page, see the instructions for actual use of imputers <#using-imputers>_.

.. note:: The :obj:ImputerConstructor classes are another example of classes with a schizophrenic constructor: if you give the constructor the data, it will return an \ :obj:Imputer - the above call is equivalent to calling \ :obj:Orange.feature.imputation.ImputerConstructor_minimal()(data).

You can also construct the :obj:Orange.feature.imputation.Imputer_defaults yourself and specify your own defaults. Or leave some values unspecified, in which case the imputer won't impute them, as in the following example. Here, the only attribute whose values will get imputed is "LENGTH"; the imputed value will be 1234.

.. literalinclude:: code/imputation-complex.py
    :lines: 56-69

:obj:Orange.feature.imputation.Imputer_defaults's constructor will accept an argument of type :obj:Orange.data.Domain (in which case it will construct an empty instance for :obj:defaults) or an example. (Be careful with this: :obj:Orange.feature.imputation.Imputer_defaults will have a reference to the instance and not a copy. But you can make a copy yourself to avoid problems: instead of Imputer_defaults(data[0]) you may want to write Imputer_defaults(Orange.data.Instance(data[0])).)

Random imputation
=================

..
.. class:: Imputer_Random

Imputes random values. The corresponding constructor is :obj:ImputerConstructor_Random.

.. attribute:: impute_class

    Tells whether to impute the class values or not. Defaults to True.

.. attribute:: deterministic

    If true (default is False), the random generator is initialized for each example using the example's hash value as a seed. This results in the same examples always being imputed the same values.

Model-based imputation
======================

.. class:: ImputerConstructor_model

Model-based imputers learn to predict the attribute's value from values of other attributes. :obj:ImputerConstructor_model is given a learning algorithm (two, actually - one for discrete and one for continuous attributes) and constructs a classifier for each attribute. The constructed imputer :obj:Imputer_model stores a list of classifiers which are used when needed.

.. attribute:: learner_discrete, learner_continuous

    Learners for discrete and for continuous attributes. If either of them is missing, the attributes of the corresponding type won't get imputed.

.. attribute:: use_class

    Tells whether the imputer is allowed to use the class value. As this is most often undesired, this option is by default set to False. It can however be useful for a more complex design in which we would use one imputer for learning examples (this one would use the class value) and another for testing examples (which would not use the class value, as this is unavailable at that moment).

.. class:: Imputer_model

.. attribute:: models

    A list of classifiers, each corresponding to one attribute of the examples whose values are to be imputed. The :obj:classVar's of the models should equal the examples' attributes. If any classifier is missing (that is, the corresponding element of the list is :obj:None), the corresponding attribute's values will not be imputed.

..
.. rubric:: Examples

The following imputer predicts the missing attribute values using classification and regression trees with a minimum of 20 examples in a leaf. Part of :download:imputation-complex.py (uses :download:bridges.tab):

.. literalinclude:: code/imputation-complex.py
    :lines: 74-76

We could even use the same learner for discrete and continuous attributes, as :class:Orange.classification.tree.TreeLearner checks the class type and constructs regression or classification trees accordingly. The common parameters, such as the minimal number of examples in leaves, are used in both cases.

You can also use different learning algorithms for discrete and continuous attributes. Probably a common setup will be to use :class:Orange.classification.bayes.BayesLearner for discrete and :class:Orange.regression.mean.MeanLearner (which just remembers the average) for continuous attributes. Part of :download:imputation-complex.py (uses :download:bridges.tab):

.. literalinclude:: code/imputation-complex.py
    :lines: 91-94

You can also construct an :class:Imputer_model yourself. You will do this if different attributes need different treatment. Brace for an example that will be a bit more complex. First we shall construct an :class:Imputer_model and initialize an empty list of models. The following code snippets are from :download:imputation-complex.py (uses :download:bridges.tab):

.. literalinclude:: code/imputation-complex.py
    :lines: 108-109

Attributes "LANES" and "T-OR-D" will always be imputed the values 2 and "THROUGH". Since "LANES" is continuous, it suffices to construct a :obj:DefaultClassifier with the default value 2.0 (don't forget the decimal part, or else Orange will think you are talking about an index of a discrete value - how could it tell?). For the discrete attribute "T-OR-D", we could construct a :class:Orange.classification.ConstantClassifier and give the index of value "THROUGH" as an argument.
But we shall do it nicer, by constructing a :class:Orange.data.Value. Both classifiers will be stored at the appropriate places in :obj:imputer.models.

.. literalinclude:: code/imputation-complex.py
    :lines: 110-112

"LENGTH" will be computed with a regression tree induced from "MATERIAL", "SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of course). Note that we initialized the domain by simply giving a list with the names of the attributes, with the domain as an additional argument in which Orange will look for the named attributes.

.. literalinclude:: code/imputation-complex.py
    :lines: 114-119

We printed the tree just to see what it looks like. ::

    SPAN=SHORT: 1158
    SPAN=LONG: 1907
    SPAN=MEDIUM
    |    ERECTED<1908.500: 1325
    |    ERECTED>=1908.500: 1528

Small and nice. Now for the "SPAN". Wooden bridges and walkways are short, while the others are mostly medium. This could be done with :class:Orange.classifier.ClassifierByLookupTable - it would be faster than what we plan here. See the corresponding documentation on the lookup classifier. Here we are going to do it with a Python function.

.. literalinclude:: code/imputation-complex.py
    :lines: 121-128

:obj:compute_span could also be written as a class, if you'd prefer it. It's important that it behaves like a classifier, that is, gets an example and returns a value. The second argument tells, as usual, what the caller expects the classifier to return - a value, a distribution or both. Since the caller, :obj:Imputer_model, always wants values, we shall ignore the argument (at the risk of having problems in the future when imputers might handle distributions as well).

Missing values as special values
================================

Missing values sometimes have a special meaning. The fact that something was not measured can sometimes tell a lot.
Be, however, cautious when using such values in decision models; if the decision not to measure something (for instance, performing a laboratory test on a patient) is based on the expert's knowledge of the class value, such unknown values clearly should not be used in models.

.. class:: ImputerConstructor_asValue

Constructs a new domain in which each discrete attribute is replaced with a new attribute that has one value more: "NA". The new attribute will compute its values on the fly from the old one, copying the normal values and replacing the unknowns with "NA".

For continuous attributes, it will construct a two-valued discrete attribute with values "def" and "undef", telling whether the continuous attribute was defined or not. The attribute's name will equal the original's with "_def" appended. The original continuous attribute will remain in the domain and its unknowns will be replaced by averages.

:class:ImputerConstructor_asValue has no specific attributes. It constructs :class:Imputer_asValue (I bet you wouldn't guess). It converts the example into the new domain, which imputes the values for discrete attributes. If continuous attributes are present, it will also replace their values by the averages.

.. class:: Imputer_asValue

.. attribute:: domain

    The domain with the new attributes constructed by :class:ImputerConstructor_asValue.

.. attribute:: defaults

    Default values for continuous attributes. Present only if there are any.

The following code shows what this imputer actually does to the domain. Part of :download:imputation-complex.py (uses :download:bridges.tab):

.. literalinclude:: code/imputation-complex.py
    :lines: 137-151

The script's output looks like this::

    [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

    RIVER: M -> M
    ERECTED: 1874 -> 1874 (def)
    PURPOSE: RR -> RR
    LENGTH: ? -> 1567 (undef)
    LANES: 2 -> 2 (def)
    CLEAR-G: ? -> NA
    T-OR-D: THROUGH -> THROUGH
    MATERIAL: IRON -> IRON
    SPAN: ? -> NA
    REL-L: ? -> NA
    TYPE: SIMPLE-T -> SIMPLE-T

Seemingly, the two examples have the same attributes (with :samp:imputed having a few additional ones). If you check this with :samp:original.domain[0] == imputed.domain[0], you will see that this first glance is misleading: the comparison returns False. The attributes only have the same names, but they are different attributes. (If you have read this page - which is already a bit advanced - you know that Orange does not really care about attribute names.) Therefore, if we wrote :samp:imputed[i] the program would fail, since :samp:imputed has no attribute :samp:i. But it has an attribute with the same name (which even usually has the same value). We therefore use :samp:i.name to index the attributes of :samp:imputed. (Using names for indexing is not fast, though; if you do it a lot, compute the integer index with :samp:imputed.domain.index(i.name).)

For continuous attributes, there is an additional attribute with "_def" appended; we get it by :samp:i.name+"_def". The first continuous attribute, "ERECTED", is defined. Its value remains 1874 and the additional attribute "ERECTED_def" has the value "def". Not so for "LENGTH". Its undefined value is replaced by the average (1567) and the new attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (and all other undefined discrete attributes) is assigned the value "NA".

Using imputers
==============

To properly use the imputation classes in the learning process, they must be trained on training examples only. Imputing the missing values and subsequently using the data set in cross-validation will give overly optimistic results.

Learners with imputer as a component
------------------------------------

Orange learners that cannot handle missing values will generally provide a slot for the imputer component. An example of such a class is :obj:Orange.classification.logreg.LogRegLearner with an attribute called :obj:Orange.classification.logreg.LogRegLearner.imputerConstructor. To it you can assign an imputer constructor - one of the above constructors or a specific constructor you wrote yourself. When given learning examples, :obj:Orange.classification.logreg.LogRegLearner will pass them to :obj:Orange.classification.logreg.LogRegLearner.imputerConstructor to get an imputer (again, some of the above or a specific imputer you programmed). It will immediately use the imputer to impute the missing values in the learning data set, so they can be used by the actual learning algorithm. Besides, when the classifier :obj:Orange.classification.logreg.LogRegClassifier is constructed, the imputer will be stored in its attribute :obj:Orange.classification.logreg.LogRegClassifier.imputer. At classification, the imputer will be used for imputation of missing values in (testing) examples.
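The discipline just described - fit the imputer on the learning examples only, then reuse that same trained imputer at classification time - can be sketched independently of Orange. Every name in this snippet is hypothetical; the toy imputer and model merely stand in for an Orange imputer constructor and learner.

```python
def train_with_imputation(train_rows, fit_model, fit_imputer):
    """Fit the imputer on training rows only, impute them, fit the model,
    and return a predictor that imputes with the *same* trained imputer."""
    imputer = fit_imputer(train_rows)
    model = fit_model([imputer(r) for r in train_rows])

    def predict(row):
        # Test-time rows are imputed with the imputer trained above,
        # never refitted on test data.
        return model(imputer(row))

    return predict

# Toy components: impute a missing "x" with the training mean,
# then predict the sign of x.
def fit_mean_imputer(rows):
    xs = [r["x"] for r in rows if r["x"] is not None]
    mean = sum(xs) / len(xs)
    return lambda r: {"x": mean if r["x"] is None else r["x"]}

def fit_sign_model(rows):
    return lambda r: "pos" if r["x"] >= 0 else "neg"

predict = train_with_imputation(
    [{"x": 2.0}, {"x": 4.0}, {"x": None}], fit_sign_model, fit_mean_imputer)
print(predict({"x": None}))  # "pos": the missing x is imputed with mean 3.0
```

This mirrors what :obj:LogRegLearner does internally with its imputerConstructor slot, only without the Orange machinery.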
Although details may vary from algorithm to algorithm, this is how imputation is generally used in Orange's learners. Also, if you write your own learners, it is recommended that you use imputation according to the described procedure.

Write your own imputer
======================

Imputation classes provide the Python-callback functionality (not all Orange classes do so; refer to the documentation on subtyping the Orange classes in Python _ for a list). If you want to write your own imputation constructor or imputer, you simply need to program a Python function that will behave like the built-in Orange classes. (For an imputer it is even less: you only need to write a function that gets an example as an argument; imputation for example tables will then use that function.)

You will most often write the imputation constructor when you have a special imputation procedure or separate procedures for various attributes, as we've demonstrated in the description of :obj:Orange.feature.imputation.ImputerConstructor_model. You basically only need to pack everything we've written there into an imputer constructor that will accept a data set and the id of the weight meta-attribute (ignore it if you will, but you must accept two arguments), and return the imputer (probably :obj:Orange.feature.imputation.Imputer_model). The benefit of implementing an imputer constructor as opposed to what we did above is that you can use such a constructor as a component for Orange learners (like logistic regression) or for wrappers from module orngImpute, and that way properly use it in classifier testing procedures.
"""

import Orange.core as orange
from orange import ImputerConstructor_minimal
• ## Orange/feature/scoring.py

 r9671 Assesses features' ability to distinguish between very similar instances from different classes. This scoring method was first developed by Kira and Rendell and then improved by Kononenko. The class :obj:Relief works on discrete and continuous classes and thus implements ReliefF and RReliefF.
• ## Orange/fixes/fix_orange_imports.py

 r9671 "orngSOM": "Orange.projection.som", "orngBayes":"Orange.classification.bayes", "orngLR":"Orange.classification.logreg", "orngNetwork":"Orange.network", "orngMisc":"Orange.misc",
• ## Orange/orng/orngCA.py

 r9671 # This has to be seriously outdated, as it uses matrixmultiply, which is not # present in numpy since, like, 2006.     Matija Polajnar, 2012 a.d. """ Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and
• ## docs/reference/rst/Orange.classification.logreg.rst

 r9372 .. automodule:: Orange.classification.logreg

.. index: logistic regression
.. index:
   single: classification; logistic regression

********************************
Logistic regression (logreg)
********************************

Logistic regression _ is a statistical classification method that fits data to a logistic function. Orange's implementation of the algorithm can handle various anomalies in features, such as constant variables and singularities, that could make direct fitting of logistic regression almost impossible. Stepwise logistic regression, which iteratively selects the most informative features, is also supported.

.. autoclass:: LogRegLearner
   :members:

.. class :: LogRegClassifier

    A logistic regression classification model. Stores estimated values of regression coefficients and their significances, and uses them to predict classes and class probabilities.

    .. attribute :: beta

        Estimated regression coefficients.

    .. attribute :: beta_se

        Estimated standard errors for regression coefficients.

    .. attribute :: wald_Z

        Wald Z statistics for beta coefficients. Wald Z is computed as beta/beta_se.

    .. attribute :: P

        List of P-values for beta coefficients, that is, the probability that beta coefficients differ from 0.0. The probability is computed from squared Wald Z statistics that are distributed with the chi-square distribution.

    .. attribute :: likelihood

        The probability of the sample (i.e. the learning examples) observed on the basis of the derived model, as a function of the regression parameters.

    .. attribute :: fit_status

        Tells how the model fitting ended - either regularly (:obj:LogRegFitter.OK), or it was interrupted due to one of the beta coefficients escaping towards infinity (:obj:LogRegFitter.Infinity) or because the values didn't converge (:obj:LogRegFitter.Divergence). The value tells about the classifier's "reliability"; the classifier itself is useful in either case.

    .. method:: __call__(instance, result_type)

        Classify a new instance.
:param instance: instance to be classified.
:type instance: :class:~Orange.data.Instance
:param result_type: :class:~Orange.classification.Classifier.GetValue or :class:~Orange.classification.Classifier.GetProbabilities or :class:~Orange.classification.Classifier.GetBoth
:rtype: :class:~Orange.data.Value, :class:~Orange.statistics.distribution.Distribution or a tuple with both

.. class:: LogRegFitter

:obj:LogRegFitter is the abstract base class for logistic fitters. It defines the form of the call operator and the constants denoting its (un)success:

.. attribute:: OK

    The fitter succeeded in converging to the optimal fit.

.. attribute:: Infinity

    The fitter failed due to one or more beta coefficients escaping towards infinity.

.. attribute:: Divergence

    Beta coefficients failed to converge, but none of the beta coefficients escaped.

.. attribute:: Constant

    There is a constant attribute that causes the matrix to be singular.

.. attribute:: Singularity

    The matrix is singular.

.. method:: __call__(examples, weight_id)

    Performs the fitting. There can be two different cases: either the fitting succeeded in finding a set of beta coefficients (although possibly with difficulties) or the fitting failed altogether. The two cases return different results.

    (status, beta, beta_se, likelihood)
        The fitter managed to fit the model. The first element of the tuple, status, tells about the problems that occurred; it can be either :obj:OK, :obj:Infinity or :obj:Divergence. In the latter cases, the returned values may still be useful for making predictions, but it's recommended that you inspect the coefficients and their errors and decide whether to use the model or not.

    (status, attribute)
        The fitter failed and the returned attribute is responsible for it. The type of failure is reported in status, which can be either :obj:Constant or :obj:Singularity.

    The proper way of calling the fitter is to expect and handle all the situations described.
For instance, if fitter is an instance of some fitter and examples contain a set of suitable examples, a script should look like this::

    res = fitter(examples)
    if res[0] in [fitter.OK, fitter.Infinity, fitter.Divergence]:
        status, beta, beta_se, likelihood = res
        < proceed by doing something with what you got >
    else:
        status, attr = res
        < remove the attribute or complain to the user or ... >

.. class :: LogRegFitter_Cholesky

    The sole fitter available at the moment. It is a C++ translation of Alan Miller's logistic regression code _. It uses the Newton-Raphson algorithm to iteratively minimize the least squares error computed from the learning examples.

.. autoclass:: StepWiseFSS
   :members:
   :show-inheritance:

.. autofunction:: dump

Examples
--------

The first example shows a very simple induction of a logistic regression classifier (:download:logreg-run.py).

.. literalinclude:: code/logreg-run.py

Result::

    Classification accuracy: 0.778282598819

    class attribute = survived
    class values =

        Attribute       beta  st. error     wald Z          P OR=exp(beta)

        Intercept      -1.23       0.08     -15.15      -0.00
     status=first       0.86       0.16       5.39       0.00       2.36
    status=second      -0.16       0.18      -0.91       0.36       0.85
     status=third      -0.92       0.15      -6.12       0.00       0.40
        age=child       1.06       0.25       4.30       0.00       2.89
       sex=female       2.42       0.14      17.04       0.00      11.25

The next example shows how to handle singularities in data sets (:download:logreg-singularities.py).

.. literalinclude:: code/logreg-singularities.py

The first few lines of the output of this script are::

    <=50K <=50K
    <=50K <=50K
    <=50K <=50K
    >50K >50K
    <=50K >50K
    >50K >50K

    class attribute = y
    class values = <>50K, <=50K>

                               Attribute       beta  st. error     wald Z          P OR=exp(beta)

                               Intercept       6.62      -0.00       -inf       0.00
                                     age      -0.04       0.00       -inf       0.00       0.96
                                  fnlwgt      -0.00       0.00       -inf       0.00       1.00
                           education-num      -0.28       0.00       -inf       0.00       0.76
                 marital-status=Divorced       4.29       0.00        inf       0.00      72.62
            marital-status=Never-married       3.79       0.00        inf       0.00      44.45
                marital-status=Separated       3.46       0.00        inf       0.00      31.95
                  marital-status=Widowed       3.85       0.00        inf       0.00      46.96
    marital-status=Married-spouse-absent       3.98       0.00        inf       0.00      53.63
        marital-status=Married-AF-spouse       4.01       0.00        inf       0.00      55.19
                 occupation=Tech-support      -0.32       0.00       -inf       0.00       0.72

If :obj:remove_singular is set to 0, inducing a logistic regression classifier would return an error::

    Traceback (most recent call last):
      File "logreg-singularities.py", line 4, in <module>
        lr = classification.logreg.LogRegLearner(table, removeSingular=0)
      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 255, in LogRegLearner
        return lr(examples, weightID)
      File "/home/jure/devel/orange/Orange/classification/logreg.py", line 291, in __call__
        lr = learner(examples, weight)
    orange.KernelException: 'orange.LogRegLearner': singularity in workclass=Never-worked

We can see that the attribute workclass is causing a singularity.

The example below shows how the use of stepwise logistic regression can help to gain in classification performance (:download:logreg-stepwise.py):

.. literalinclude:: code/logreg-stepwise.py

The output of this script is::

    Learner      CA
    logistic     0.841
    filtered     0.846

    Number of times attributes were used in cross-validation:
     1 x a21
    10 x a22
     8 x a23
     7 x a24
     1 x a25
    10 x a26
    10 x a27
     3 x a28
     7 x a29
     9 x a31
     2 x a16
     7 x a12
     1 x a32
     8 x a15
    10 x a14
     4 x a17
     7 x a30
    10 x a11
     1 x a10
     1 x a13
    10 x a34
     2 x a19
     1 x a18
    10 x a3
    10 x a5
     4 x a4
     4 x a7
     8 x a6
    10 x a9
    10 x a8
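The stepwise selection loop itself is generic. The skeleton below is a plain-Python sketch of greedy forward selection, not the StepWiseFSS implementation; the function names are hypothetical and the toy scoring function merely stands in for cross-validated classification accuracy.

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature that most
    improves score(subset); stop when no addition helps.

    score is any callable mapping a feature subset to a number
    (higher is better), e.g. cross-validated accuracy in practice."""
    selected = []
    best = score(selected)
    while max_features is None or len(selected) < max_features:
        gains = [(score(selected + [f]), f)
                 for f in features if f not in selected]
        if not gains:
            break  # every feature already selected
        top_score, top_f = max(gains)
        if top_score <= best:
            break  # no candidate improves the current subset
        selected.append(top_f)
        best = top_score
    return selected

# Toy score: reward subsets containing "a3" and "a5", penalize size.
def toy_score(subset):
    return 2 * len({"a3", "a5"} & set(subset)) - 0.1 * len(subset)

print(forward_select(["a1", "a3", "a5"], toy_score))  # selects a3 and a5
```

StepWiseFSS additionally supports backward removal steps with separate add/remove significance thresholds; the skeleton shows only the forward half.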
• ## docs/reference/rst/Orange.classification.rst

 r9754 ###################################

To facilitate correct evaluation, all classifiers in Orange consist of two parts, a Learner and a Classifier. A learner is constructed with all parameters that will be used for learning. When a data table is passed to its __call__ method, a model is fitted to the data and returned in the form of a Classifier, which is then used for predicting the dependent variable(s) of new instances.

.. class:: Learner()

You can often program learners and classifiers as classes or functions written entirely in Python and independent from Orange. Such classes can participate, for instance, in the common evaluation functions like those available in modules :obj:Orange.evaluation.testing and :obj:Orange.evaluation.scoring. On the other hand, these classes can't be used as components for pure C++ classes. For instance, :obj:Orange.classification.tree.TreeLearner's attribute nodeLearner should contain a (wrapped) C++ object derived from :obj:Learner, such as :obj:Orange.classification.majority.MajorityLearner or :obj:Orange.classification.bayes.NaiveLearner. They cannot accommodate Python's classes or even functions.

When developing new prediction models, one should extend :obj:Learner and :obj:Classifier\. Code that infers the model from the data should be placed in the Learner's :obj:~Learner.__call__ method. This method should return a :obj:Classifier. Classifiers' :obj:~Classifier.__call__ method should return the prediction: :class:~Orange.data.Value, :class:~Orange.statistics.distribution.Distribution or a tuple with both, based on the value of the parameter :obj:return_type.
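The two-part contract described above reduces to: the learner's __call__ fits a model and returns a classifier; the classifier's __call__ predicts. A minimal pure-Python analogue of a majority learner/classifier pair (hypothetical names, not the Orange implementation, and with (instance, class) tuples standing in for Orange data tables):

```python
class MajorityLearner:
    """Learner half: fitting counts the class values in the training
    examples and remembers the most common one."""
    def __call__(self, examples):
        classes = [cls for _, cls in examples]
        majority = max(set(classes), key=classes.count)
        return MajorityClassifier(majority)

class MajorityClassifier:
    """Classifier half: always predicts the class seen most often
    at fitting time, ignoring the instance."""
    def __init__(self, majority):
        self.majority = majority

    def __call__(self, instance):
        return self.majority

learner = MajorityLearner()
classifier = learner([({"x": 1}, "yes"), ({"x": 2}, "yes"), ({"x": 3}, "no")])
print(classifier({"x": 9}))  # "yes"
```

Separating the two halves is what lets evaluation code construct one learner and fit a fresh classifier per cross-validation fold.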
There's a workaround, though. You can subtype the Orange classes :obj:Learner or :obj:Classifier as if the two classes were defined in Python, but later use your derived Python classes as if they were written in Orange's core. That is, you can define your class in a Python script like this::

    class MyLearner(Orange.classification.Learner):
        def __call__(self, examples, weightID=0):
            ...

Such a learner can then be used as any regular learner written in Orange. You can, for instance, construct a tree learner and use your learner to learn node classifiers::

    treeLearner = Orange.classification.tree.TreeLearner()
    treeLearner.nodeLearner = MyLearner()

-----

Orange implements various classifiers that are described in detail on separate pages.

.. toctree::
• ## docs/reference/rst/Orange.feature.discretization.rst

 r9372 .. automodule:: Orange.feature.discretization

.. py:currentmodule:: Orange.feature.discretization

###################################
Discretization (discretization)
###################################

.. index:: discretization

.. index::
   single: feature; discretization

Continuous features can be discretized either one feature at a time or, as demonstrated in the following script, using a single discretization method on an entire set of data features:

.. literalinclude:: code/discretization-table.py

Discretization introduces new categorical features and computes their values in accordance with the selected (or default) discretization method::

    Original data set:
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']

    Discretized data set:
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '(2.85, 3.15]', '<=2.45', '<=0.80', 'Iris-setosa']
    ['<=5.45', '>3.15', '<=2.45', '<=0.80', 'Iris-setosa']

The following discretization methods are supported:

* equal width discretization, where the domain of a continuous feature is split into intervals of equal width (:class:EqualWidth),
* equal frequency discretization, where each interval contains an equal number of data instances (:class:EqualFreq),
* entropy-based discretization, as originally proposed by [FayyadIrani1993]_, which infers the intervals that minimize the within-interval entropy of class distributions (:class:Entropy),
* bi-modal discretization, using three intervals to optimize the difference between the class distribution in the middle interval and the distribution outside it (:class:BiModal),
* fixed discretization, with user-defined cut-off points.

The above script used the default discretization method (equal frequency with three intervals). This can be changed as demonstrated below:

..
literalinclude:: code/discretization-table-method.py :lines: 3-5 With the exception of fixed discretization, discretization approaches infer the cut-off points from the training data set and thus construct a discretizer that converts continuous values of the feature into categorical values according to the rule found by discretization. In this respect, discretization behaves similarly to :class:Orange.classification.Learner. Utility functions ================= Some functions and classes that can be used for categorization of continuous features. Besides several general classes that can help in this task, we also provide a function that may help in entropy-based discretization (Fayyad & Irani), and a wrapper around classes for categorization that can be used for learning. .. autoclass:: Orange.feature.discretization.DiscretizedLearner_Class .. autoclass:: DiscretizeTable .. rubric:: Example FIXME. A chapter on feature subset selection <../ofb/o_fss.htm>_ in the Orange for Beginners tutorial shows the use of DiscretizedLearner. Other discretization classes from core Orange are listed in the chapter on categorization <../ofb/o_categorization.htm>_ of the same tutorial. Discretization Algorithms ========================= Instances of discretization classes are all derived from :class:Discretization. .. class:: Discretization .. method:: __call__(feature, data[, weightID]) Given a continuous feature, data and, optionally, the id of the attribute with example weights, this function returns a discretized feature. The argument feature can be a descriptor, index or name of the attribute. .. class:: EqualWidth Discretizes the feature by splitting its domain into a fixed number of equal-width intervals. The span of the original domain is computed from the training data and is defined by the smallest and the largest feature value. .. attribute:: n Number of discretization intervals (default: 4). The following example discretizes the Iris dataset features using six intervals.
The script constructs a :class:Orange.data.Table with discretized features and outputs their description: .. literalinclude:: code/discretization.py :lines: 38-43 The output of this script is:: D_sepal length: <<4.90, [4.90, 5.50), [5.50, 6.10), [6.10, 6.70), [6.70, 7.30), >7.30> D_sepal width: <<2.40, [2.40, 2.80), [2.80, 3.20), [3.20, 3.60), [3.60, 4.00), >4.00> D_petal length: <<1.98, [1.98, 2.96), [2.96, 3.94), [3.94, 4.92), [4.92, 5.90), >5.90> D_petal width: <<0.50, [0.50, 0.90), [0.90, 1.30), [1.30, 1.70), [1.70, 2.10), >2.10> The cut-off values are hidden in the discretizer and stored in attr.get_value_from.transformer:: >>> for attr in newattrs: ...    print "%s: first interval at %5.3f, step %5.3f" % \ ...    (attr.name, attr.get_value_from.transformer.first_cut, \ ...    attr.get_value_from.transformer.step) D_sepal length: first interval at 4.900, step 0.600 D_sepal width: first interval at 2.400, step 0.400 D_petal length: first interval at 1.980, step 0.980 D_petal width: first interval at 0.500, step 0.400 All discretizers have the method construct_variable: .. literalinclude:: code/discretization.py :lines: 69-73 .. class:: EqualFreq Infers the cut-off points so that the discretization intervals contain an approximately equal number of training data instances. .. attribute:: n Number of discretization intervals (default: 4). The resulting discretizer is of class :class:IntervalDiscretizer. Its transformer includes points that store the inferred cut-offs. .. class:: Entropy Entropy-based discretization as originally proposed by [FayyadIrani1993]_. The approach infers the most appropriate number of intervals by recursively splitting the domain of the continuous feature to minimize the class entropy of training examples. The splitting is repeated until the entropy decrease is smaller than the increase of minimal description length (MDL) induced by the new cut-off point.
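The split-selection criterion at the heart of :class:Entropy can be sketched in plain Python. The following simplified, hypothetical illustration finds a single cut-off that minimizes the weighted class entropy of the two resulting intervals; the real algorithm, as described above, applies this recursively and stops using the MDL criterion.

```python
import math

def entropy(labels):
    """Class entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * math.log(c / n, 2) for c in counts.values())

def best_cut(pairs):
    """Return the cut-off minimizing the weighted class entropy of the
    two intervals. pairs is a list of (value, class) tuples."""
    pairs = sorted(pairs)
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut-off between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or score < best[0]:
            best = (score, cut)
    return best[1]

# a perfectly separable toy sample: the best cut falls between 2.0 and 10.0
print(best_cut([(1.0, "a"), (2.0, "a"), (10.0, "b"), (11.0, "b")]))  # 6.0
```

Placing the cut-off midway between the two neighbouring values is one common convention; Orange's own interval boundaries may differ in detail.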
Entropy-based discretization can reduce a continuous feature to a single interval if no suitable cut-off points are found. In this case the new feature is constant and can be removed. This discretization can therefore also serve for identification of non-informative features and thus be used for feature subset selection. .. attribute:: force_attribute Forces the algorithm to induce at least one cut-off point, even when its information gain is lower than the MDL (default: False). Part of :download:discretization.py : .. literalinclude:: code/discretization.py :lines: 77-80 The output shows that all attributes are discretized into three intervals:: sepal length: <5.5, 6.09999990463> sepal width: <2.90000009537, 3.29999995232> petal length: <1.89999997616, 4.69999980927> petal width: <0.600000023842, 1.0000004768> .. class:: BiModal Infers two cut-off points to optimize the difference between the class distribution of data instances in the middle interval and in the other two intervals. The difference is scored by the chi-square statistic. All possible cut-off points are examined, thus the discretization runs in O(n^2). This discretization method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values of the feature. .. attribute:: split_in_two Decides whether the resulting attribute should have three or two values. If True (default), the feature will be discretized into three intervals and the discretizer is of type :class:BiModalDiscretizer. If False, the result is an ordinary :class:IntervalDiscretizer. The Iris dataset has a three-valued class attribute. The figure below, drawn using LOESS probability estimation, shows that sepal lengths of versicolors lie between the lengths of setosas and virginicas. .. image:: files/bayes-iris.gif If we merge the classes setosa and virginica, we can observe whether the bi-modal discretization correctly recognizes the interval in which versicolors dominate.
The following script performs the merging and constructs a new data set with a class that reports whether an iris is a versicolor or not. .. literalinclude:: code/discretization.py :lines: 84-87 The following script implements the discretization: .. literalinclude:: code/discretization.py :lines: 97-100 The middle intervals are printed:: sepal length: (5.400, 6.200] sepal width: (2.000, 2.900] petal length: (1.900, 4.700] petal width: (0.600, 1.600] Judging by the graph, the cut-off points inferred by the discretization for "sepal length" make sense. Discretizers ============ Discretizers construct a categorical feature from a continuous feature according to the method they implement and its parameters. The most general is :class:IntervalDiscretizer, which is also used by most discretization methods. Two other discretizers, :class:EquiDistDiscretizer and :class:ThresholdDiscretizer, could easily be replaced by :class:IntervalDiscretizer but are used for speed and simplicity. The fourth discretizer, :class:BiModalDiscretizer, is specialized for discretizations induced by :class:BiModalDiscretization. .. class:: Discretizer A superclass implementing the construction of a new attribute from an existing one. .. method:: construct_variable(feature) Constructs a descriptor for a new feature. The new feature's name is equal to feature.name prefixed by "D\_". Its symbolic values are discretizer specific. .. class:: IntervalDiscretizer A discretizer defined with a set of cut-off points. .. attribute:: points The cut-off points; feature values below or equal to the first point will be mapped to the first interval, those between the first and the second point (including those equal to the second) are mapped to the second interval, and so forth to the last interval, which covers all values greater than the last value in points. The number of intervals is thus len(points)+1. The script that follows is an example of manual construction of a discretizer with cut-off points at 3.0 and 5.0: ..
literalinclude:: code/discretization.py :lines: 22-26 First five data instances of data2 are:: [5.1, '>5.00', 'Iris-setosa'] [4.9, '(3.00, 5.00]', 'Iris-setosa'] [4.7, '(3.00, 5.00]', 'Iris-setosa'] [4.6, '(3.00, 5.00]', 'Iris-setosa'] [5.0, '(3.00, 5.00]', 'Iris-setosa'] The same discretizer can be used on several features by calling the method construct_variable: .. literalinclude:: code/discretization.py :lines: 30-34 Each feature has its own instance of :class:ClassifierFromVar stored in get_value_from, but all use the same :class:IntervalDiscretizer, idisc. Changing any element of its points affects all attributes. .. note:: The length of :obj:~IntervalDiscretizer.points should not be changed if the discretizer is used by any attribute. The length of :obj:~IntervalDiscretizer.points should always match the number of values of the feature, which is determined by the length of the attribute's field values. If attr is a discretized attribute, then len(attr.values) must equal len(attr.get_value_from.transformer.points)+1. .. class:: EqualWidthDiscretizer Discretizes to intervals of a fixed width. All values lower than :obj:~EquiDistDiscretizer.first_cut are mapped to the first interval. Otherwise, the interval of value val is floor((val-first_cut)/step). Possible overflows are mapped to the last interval. .. attribute:: first_cut The first cut-off point. .. attribute:: step Width of the intervals. .. attribute:: n Number of the intervals. .. attribute:: points (read-only) The cut-off points; this is not a real attribute although it behaves as one. Reading it constructs a list of cut-off points and returns it, but changing the list doesn't affect the discretizer. It is only present to give :obj:EquiDistDiscretizer the same interface as :obj:IntervalDiscretizer. .. class:: ThresholdDiscretizer A threshold discretizer converts continuous values into binary ones by comparing them to a fixed threshold.
Orange uses this discretizer for binarization of continuous attributes in decision trees. .. attribute:: threshold The threshold value; values below or equal to the threshold belong to the first interval and those that are greater go to the second. .. class:: BiModalDiscretizer A bimodal discretizer has two cut-off points and values are discretized according to whether or not they belong to the region between these points, which includes the lower but not the upper boundary. The discretizer is returned by :class:BiModalDiscretization if its field :obj:~BiModalDiscretization.split_in_two is true (the default). .. attribute:: low Lower boundary of the interval (included in the interval). .. attribute:: high Upper boundary of the interval (not included in the interval). Implementational details ======================== Consider the following example (part of :download:discretization.py ): .. literalinclude:: code/discretization.py :lines: 7-15 The discretized attribute sep_w is constructed with a call to :class:Entropy; instead of constructing it and calling it afterwards, we passed the arguments for the call to the constructor. We then constructed a new :class:Orange.data.Table with attributes "sepal width" (the original continuous attribute), sep_w and the class attribute:: Entropy discretization, first 5 data instances [3.5, '>3.30', 'Iris-setosa'] [3.0, '(2.90, 3.30]', 'Iris-setosa'] [3.2, '(2.90, 3.30]', 'Iris-setosa'] [3.1, '(2.90, 3.30]', 'Iris-setosa'] [3.6, '>3.30', 'Iris-setosa'] The name of the new categorical variable derives from the name of the original continuous variable by adding the prefix "D_".
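The interval lookup that such a transformer performs amounts to locating a value among the cut-off points. The following plain-Python sketch uses the cut-off points 3.0 and 5.0 from the manual-construction example above; it is a simplified, hypothetical stand-in for the actual :obj:IntervalDiscretizer logic (the "<=3.00" label is illustrative), matching the rule that values equal to a point fall into the interval below it.

```python
import bisect

POINTS = [3.0, 5.0]  # cut-offs, as in the manual-construction example
LABELS = ["<=3.00", "(3.00, 5.00]", ">5.00"]  # len(POINTS)+1 intervals

def discretize(value, points=POINTS, labels=LABELS):
    """Map a continuous value to its interval label. bisect_left
    places values equal to a cut-off point into the lower interval,
    exactly as described for IntervalDiscretizer.points."""
    return labels[bisect.bisect_left(points, value)]

print(discretize(4.9))  # prints "(3.00, 5.00]", as in the output above
print(discretize(5.1))  # prints ">5.00"
print(discretize(3.0))  # prints "<=3.00" (equal to a point -> lower interval)
```

Keeping the points in one shared list mirrors the behaviour noted above: changing an element of points would change the mapping for every feature using the same discretizer.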
The values of the new attributes are computed automatically when they are needed, using a transformation function :obj:~Orange.data.variable.Variable.get_value_from (see :class:Orange.data.variable.Variable) which encodes the discretization:: >>> sep_w EnumVariable 'D_sepal width' >>> sep_w.get_value_from >>> sep_w.get_value_from.whichVar FloatVariable 'sepal width' >>> sep_w.get_value_from.transformer >>> sep_w.get_value_from.transformer.points <2.90000009537, 3.29999995232> The select statement in the discretization script converted all data instances from data to the new domain. This includes the new feature sep_w, whose values are computed on the fly by calling sep_w.get_value_from for each data instance. The original, continuous sepal width is passed to the transformer, which determines the interval from its field points. The transformer returns the discrete value, which is in turn returned by get_value_from and stored in the new example. References ========== .. [FayyadIrani1993] UM Fayyad and KB Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proc. 13th International Joint Conference on Artificial Intelligence, pages 1022--1029, Chambery, France, 1993.
• ## docs/reference/rst/Orange.feature.imputation.rst

 r9372 .. automodule:: Orange.feature.imputation .. py:currentmodule:: Orange.feature.imputation .. index:: imputation .. index:: single: feature; value imputation *************************** Imputation (imputation) *************************** Imputation replaces missing feature values with appropriate values, in this case with minimal values: .. literalinclude:: code/imputation-values.py :lines: 7- The output of this code is:: Example with missing values ['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD'] Imputed values: ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD'] ['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD'] Imputers ================= :obj:ImputerConstructor is the abstract root in a hierarchy of classes that accept training data and construct an instance of a class derived from :obj:Imputer. When an :obj:Imputer is called with an :obj:Orange.data.Instance, it returns a new instance with the missing values imputed (leaving the original instance intact). If an imputer is called with an :obj:Orange.data.Table, it returns a new data table with imputed instances. .. class:: ImputerConstructor .. attribute:: impute_class Indicates whether to impute the class value. Defaults to True. Simple imputation ================= Simple imputers always impute the same value for a particular feature, disregarding the values of other features. They all use the same class :obj:Imputer_defaults. .. class:: Imputer_defaults .. attribute:: defaults An instance of :obj:Orange.data.Instance with the default values to be imputed instead of missing values. Examples to be imputed must be from the same :obj:~Orange.data.Domain as :obj:defaults. Instances of this class can be constructed by :obj:~Orange.feature.imputation.ImputerConstructor_minimal, :obj:~Orange.feature.imputation.ImputerConstructor_maximal, :obj:~Orange.feature.imputation.ImputerConstructor_average.
For continuous features, they will impute the smallest, largest or average value encountered in the training examples. For discrete features, they will impute the lowest value (the one with index 0, e.g. attr.values[0]), the highest (attr.values[-1]), or the most common value encountered in the data, respectively. If the values of a discrete feature are ordered according to their impact on the class (for example, possible values for symptoms of some disease can be ordered according to their seriousness), the minimal and maximal imputers will represent optimistic and pessimistic imputations. User-defined defaults can be given when constructing an :obj:~Orange.feature.imputation.Imputer_defaults. Values that are left unspecified do not get imputed. In the following example "LENGTH" is the only attribute to get imputed, with value 1234: .. literalinclude:: code/imputation-complex.py :lines: 56-69 If :obj:~Orange.feature.imputation.Imputer_defaults's constructor is given an argument of type :obj:~Orange.data.Domain, it constructs an empty instance for :obj:defaults. If an instance is given, the reference to the instance will be kept. To avoid problems associated with Imputer_defaults(data[0]), it is better to provide a copy of the instance: Imputer_defaults(Orange.data.Instance(data[0])). Random imputation ================= .. class:: Imputer_Random Imputes random values. The corresponding constructor is :obj:ImputerConstructor_Random. .. attribute:: impute_class Tells whether to impute the class values or not. Defaults to True. .. attribute:: deterministic If true (defaults to False), the random generator is initialized for each instance using the instance's hash value as a seed. This results in the same instances always being imputed with the same (random) values. Model-based imputation ====================== .. class:: ImputerConstructor_model Model-based imputers learn to predict the feature's value from the values of other features.
:obj:ImputerConstructor_model is given two learning algorithms and constructs a classifier for each attribute. The constructed imputer :obj:Imputer_model stores a list of classifiers that are used for imputation. .. attribute:: learner_discrete, learner_continuous Learners for discrete and for continuous attributes. If either of them is missing, the attributes of the corresponding type will not get imputed. .. attribute:: use_class Tells whether the imputer can use the class attribute. Defaults to False. It is useful in more complex designs in which one imputer is used on learning instances, where it uses the class value, and a second imputer on testing instances, where the class is not available. .. class:: Imputer_model .. attribute:: models A list of classifiers, each corresponding to one attribute to be imputed. The :obj:class_var's of the models should equal the instances' attributes. If an element is :obj:None, the corresponding attribute's values are not imputed. .. rubric:: Examples Examples are taken from :download:imputation-complex.py . The following imputer predicts the missing attribute values using classification and regression trees with a minimum of 20 examples in a leaf. .. literalinclude:: code/imputation-complex.py :lines: 74-76 A common setup, where different learning algorithms are used for discrete and continuous features, is to use :class:~Orange.classification.bayes.NaiveLearner for discrete and :class:~Orange.regression.mean.MeanLearner (which just remembers the average) for continuous attributes: .. literalinclude:: code/imputation-complex.py :lines: 91-94 To construct a user-defined :class:Imputer_model: .. literalinclude:: code/imputation-complex.py :lines: 108-112 A list of empty models is first stored in :obj:Imputer_model.models. The continuous feature "LANES" is imputed with value 2 using :obj:DefaultClassifier. A float must be given, because integer values are interpreted as indices of discrete features.
The discrete feature "T-OR-D" is imputed using :class:Orange.classification.ConstantClassifier, which is given the index of the value "THROUGH" as an argument. The feature "LENGTH" is imputed with a regression tree induced from "MATERIAL", "SPAN" and "ERECTED" (feature "LENGTH" is used as the class attribute here). The domain is initialized by giving a list of feature names and a domain as an additional argument in which Orange will look for the features. .. literalinclude:: code/imputation-complex.py :lines: 114-119 This is how the inferred tree should look:: SPAN=SHORT: 1158 SPAN=LONG: 1907 SPAN=MEDIUM |    ERECTED<1908.500: 1325 |    ERECTED>=1908.500: 1528 Wooden bridges and walkways are short, while the others are mostly medium. This could be encoded in feature "SPAN" using :class:Orange.classifier.ClassifierByLookupTable, which is faster than the Python function used here: .. literalinclude:: code/imputation-complex.py :lines: 121-128 If :obj:compute_span is written as a class, it must behave like a classifier: it accepts an example and returns a value. The second argument tells what the caller expects the classifier to return - a value, a distribution or both. Currently, :obj:Imputer_model always expects values and the argument can be ignored. Missing values as special values ================================ Missing values sometimes have a special meaning. Caution is needed when using such values in decision models. When the decision not to measure something (for example, performing a laboratory test on a patient) is based on the expert's knowledge of the class value, such missing values clearly should not be used in models. .. class:: ImputerConstructor_asValue Constructs a new domain in which each discrete feature is replaced with a new feature that has one more value: "NA". The new feature computes its values on the fly from the old one, copying the normal values and replacing the unknowns with "NA".
For continuous attributes, it constructs a two-valued discrete attribute with values "def" and "undef", telling whether the value is defined or not. The feature's name will equal the original's with "_def" appended. The original continuous feature will remain in the domain and its unknowns will be replaced by averages. :class:ImputerConstructor_asValue has no specific attributes. It constructs :class:Imputer_asValue, which converts examples into the new domain. .. class:: Imputer_asValue .. attribute:: domain The domain with the new features constructed by :class:ImputerConstructor_asValue. .. attribute:: defaults Default values for continuous features. The following code shows what the imputer actually does to the domain: .. literalinclude:: code/imputation-complex.py :lines: 137-151 The script's output looks like this:: [RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] [RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE] RIVER: M -> M ERECTED: 1874 -> 1874 (def) PURPOSE: RR -> RR LENGTH: ? -> 1567 (undef) LANES: 2 -> 2 (def) CLEAR-G: ? -> NA T-OR-D: THROUGH -> THROUGH MATERIAL: IRON -> IRON SPAN: ? -> NA REL-L: ? -> NA TYPE: SIMPLE-T -> SIMPLE-T The two instances have the same attributes, with :samp:imputed having a few additional ones. Comparing :samp:original.domain[0] == imputed.domain[0] will result in False. While the names are the same, they represent different features. Writing :samp:imputed[i] would fail, since :samp:imputed has no attribute :samp:i, but it has an attribute with the same name. Using :samp:i.name to index the attributes of :samp:imputed will work, yet it is not fast. If frequently used, it is better to compute the index with :samp:imputed.domain.index(i.name). For continuous features, there is an additional feature with the "_def" name suffix, which is accessible by :samp:i.name+"_def".
The value of the first continuous feature "ERECTED" remains 1874, and the additional attribute "ERECTED_def" has the value "def". The undefined value in "LENGTH" is replaced by the average (1567) and the new attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (and all other undefined discrete attributes) is assigned the value "NA". Using imputers ============== Imputation must run on training data only. Imputing the missing values and subsequently using the data in cross-validation will give overly optimistic results. Learners with imputer as a component ------------------------------------ Learners that cannot handle missing values provide a slot for the imputer component. An example of such a class is :obj:~Orange.classification.logreg.LogRegLearner with an attribute called :obj:~Orange.classification.logreg.LogRegLearner.imputer_constructor. When given learning instances, :obj:~Orange.classification.logreg.LogRegLearner will pass them to :obj:~Orange.classification.logreg.LogRegLearner.imputer_constructor to get an imputer and use it to impute the missing values in the learning data. The imputed data is then used by the actual learning algorithm. Also, when a classifier :obj:Orange.classification.logreg.LogRegClassifier is constructed, the imputer is stored in its attribute :obj:Orange.classification.logreg.LogRegClassifier.imputer. At classification, the same imputer is used for imputation of missing values in (testing) examples. Details may vary from algorithm to algorithm, but this is how imputation is generally used. When writing user-defined learners, it is recommended to use imputation according to the described procedure. Wrapper for learning algorithms =============================== Imputation is also used by learning algorithms and other methods that are not capable of handling unknown values.
It imputes the missing values, calls the learner and, if imputation is also needed by the classifier, wraps the classifier so that it imputes the missing values in instances to be classified. .. literalinclude:: code/imputation-logreg.py :lines: 7- The output of this code is:: Without imputation: 0.945 With imputation: 0.954 Even so, the module is somewhat redundant, as all learners that cannot handle missing values should, in principle, provide the slots for an imputer constructor. For instance, :obj:Orange.classification.logreg.LogRegLearner has an attribute :obj:Orange.classification.logreg.LogRegLearner.imputer_constructor, and even if you don't set it, it will do some imputation by default. .. class:: ImputeLearner Wraps a learner and performs data imputation before learning. Most of Orange's learning algorithms do not use imputers because they can appropriately handle the missing values themselves. The Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling missing values in various ways. If for any reason you want these algorithms to run on imputed data, you can use this wrapper. The class description is a matter of a separate page, but we show its code here as another demonstration of how to use the imputers - logistic regression is implemented in essentially the same way as the classes below. This is basically a learner, so the constructor will return either an instance of :obj:ImputeLearner or, if called with examples, an instance of some classifier. There are a few attributes that need to be set, though. .. attribute:: base_learner A wrapped learner. .. attribute:: imputer_constructor An instance of a class derived from :obj:ImputerConstructor (or a class with the same call operator). .. attribute:: dont_impute_classifier If given and set (this attribute is optional), the classifier will not be wrapped into an imputer.
Do this if the classifier doesn't mind if the examples it is given have missing values. The learner is best illustrated by its code - here's its complete :obj:__call__ method::

    def __call__(self, data, weight=0):
        trained_imputer = self.imputer_constructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        base_classifier = self.base_learner(imputed_data, weight)
        if self.dont_impute_classifier:
            return base_classifier
        else:
            return ImputeClassifier(base_classifier, trained_imputer)

So "learning" goes like this. :obj:ImputeLearner will first construct the imputer (that is, call :obj:self.imputer_constructor to get a trained imputer). Then it will use the imputer to impute the data, and call the given :obj:baseLearner to construct a classifier. For instance, :obj:baseLearner could be a learner for logistic regression and the result would be a logistic regression model. If the classifier can handle unknown values (that is, if :obj:dont_impute_classifier is set), we return it as it is; otherwise we wrap it into :obj:ImputeClassifier, which is given the base classifier and the imputer it can use to impute the missing values in (testing) examples. .. class:: ImputeClassifier Objects of this class are returned by :obj:ImputeLearner when given data. .. attribute:: baseClassifier A wrapped classifier. .. attribute:: imputer An imputer for imputation of unknown values. .. method:: __call__ This class is even more trivial than the learner. Its constructor accepts two arguments, the classifier and the imputer, which are stored into the corresponding attributes. The call operator which does the classification then looks like this::

    def __call__(self, ex, what=orange.GetValue):
        return self.base_classifier(self.imputer(ex), what)

It imputes the missing values by calling the :obj:imputer and passes the imputed example to the base classifier. ..
note:: In this setup the imputer is trained on the training data - even if you do cross-validation, the imputer will be trained on the right data. In the classification phase we again use the imputer which was trained on the training data only. .. rubric:: Code of ImputeLearner and ImputeClassifier :obj:Orange.feature.imputation.ImputeLearner puts the keyword arguments into the instance's dictionary. You are expected to call it like :obj:ImputeLearner(base_learner=, imputer=). When the learner is called with examples, it trains the imputer, imputes the data, induces a :obj:base_classifier with the :obj:base_learner and constructs an :obj:ImputeClassifier that stores the :obj:base_classifier and the :obj:imputer. For classification, the missing values are imputed and the classifier's prediction is returned. Note that this code is slightly simplified; the omitted details handle non-essential technical issues that are unrelated to imputation::

    class ImputeLearner(orange.Learner):
        def __new__(cls, examples = None, weightID = 0, **keyw):
            self = orange.Learner.__new__(cls, **keyw)
            self.__dict__.update(keyw)
            if examples:
                return self.__call__(examples, weightID)
            else:
                return self

        def __call__(self, data, weight=0):
            trained_imputer = self.imputer_constructor(data, weight)
            imputed_data = trained_imputer(data, weight)
            base_classifier = self.base_learner(imputed_data, weight)
            return ImputeClassifier(base_classifier, trained_imputer)

    class ImputeClassifier(orange.Classifier):
        def __init__(self, base_classifier, imputer):
            self.base_classifier = base_classifier
            self.imputer = imputer

        def __call__(self, ex, what=orange.GetValue):
            return self.base_classifier(self.imputer(ex), what)

.. rubric:: Example Although most of Orange's learning algorithms will take care of imputation internally if needed, it can sometimes happen that an expert will be able to tell you exactly what to put in the data instead of the missing values.
In this example we shall suppose that we want to impute the minimal value of each feature. We will try to determine whether the naive Bayesian classifier with its implicit internal imputation works better than one that uses imputation by minimal values. :download:imputation-minimal-imputer.py  (uses :download:voting.tab ): .. literalinclude:: code/imputation-minimal-imputer.py :lines: 7- Should output this:: Without imputation: 0.903 With imputation: 0.899 .. note:: Note that we constructed just one instance of :obj:Orange.classification.bayes.NaiveLearner, but this same instance is used twice in each fold: once it is given the examples as they are (and returns an instance of :obj:Orange.classification.bayes.NaiveClassifier); the second time it is called by :obj:imba and the :obj:Orange.classification.bayes.NaiveClassifier it returns is wrapped into :obj:Orange.feature.imputation.ImputeClassifier. We thus have only one learner, which produces two different classifiers in each round of testing. Write your own imputer ====================== Imputation classes provide the Python-callback functionality (not all Orange classes do so; refer to the documentation on subtyping the Orange classes in Python _ for a list). If you want to write your own imputation constructor or an imputer, you simply need to program a Python function that will behave like the built-in Orange classes (and even less: for an imputer, you only need to write a function that gets an example as argument; imputation for example tables will then use that function). You will most often write the imputation constructor when you have a special imputation procedure or separate procedures for various attributes, as we've demonstrated in the description of :obj:Orange.feature.imputation.ImputerConstructor_model.
You basically only need to pack everything we've written there into an imputer constructor that will accept a data set and the id of the weight meta-attribute (ignore it if you will, but you must accept two arguments), and return the imputer (probably :obj:Orange.feature.imputation.Imputer_model). The benefit of implementing an imputer constructor, as opposed to what we did above, is that you can use such a constructor as a component for Orange learners (like logistic regression) or for wrappers from module orngImpute, and that way properly use it in classifier testing procedures.
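The two-argument convention just described can be sketched without Orange at all. The function below is a hypothetical, plain-Python illustration (all names are illustrative, not part of the Orange API): it plays the role of an imputer constructor, takes a data set and a weight id, and returns an imputer that fills unknown values (represented here by None) with the per-feature minima, as the minimal imputer does.

```python
def minimal_imputer_constructor(data, weight_id=0):
    """Plays the role of an imputer constructor: accepts a data set
    (a list of rows) and a weight id (ignored here, but the
    two-argument signature is required) and returns an imputer."""
    # per-column minima over the known (non-None) values
    minima = [min(v for v in column if v is not None)
              for column in zip(*data)]

    def imputer(row):
        # return a new row with unknowns replaced by the column minimum,
        # leaving the original row intact
        return [m if v is None else v for v, m in zip(row, minima)]

    return imputer

data = [[1.0, None], [3.0, 5.0], [2.0, 8.0]]
impute = minimal_imputer_constructor(data)
print(impute([None, None]))  # prints [1.0, 5.0], the column minima
```

Note that, like the Orange imputers described above, the constructor is trained once on the training data and the returned imputer is then applied instance by instance.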
• ## docs/reference/rst/code/discretization.py

 r9372 print "\nEntropy discretization, first 10 examples" sep_w = Orange.feature.discretization.EntropyDiscretization("sepal width", data) sep_w = Orange.feature.discretization.Entropy("sepal width", data) data2 = data.select([data.domain["sepal width"], sep_w, data.domain.class_var]) print "Cut-off points:", sep_w.get_value_from.transformer.points print "\nManual construction of IntervalDiscretizer - single attribute" idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) print "\nManual construction of Interval discretizer - single attribute" idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) sep_l = idisc.construct_variable(data.domain["sepal length"]) data2 = data.select([data.domain["sepal length"], sep_l, data.domain.classVar]) print "\nManual construction of IntervalDiscretizer - all attributes" idisc = Orange.feature.discretization.IntervalDiscretizer(points = [3.0, 5.0]) print "\nManual construction of Interval discretizer - all attributes" idisc = Orange.feature.discretization.Interval(points = [3.0, 5.0]) newattrs = [idisc.construct_variable(attr) for attr in data.domain.attributes] data2 = data.select(newattrs + [data.domain.class_var]) print "\n\nEqual interval size discretization" disc = Orange.feature.discretization.EquiDistDiscretization(numberOfIntervals = 6) print "\n\nDiscretization with equal width intervals" disc = Orange.feature.discretization.EqualWidth(numberOfIntervals = 6) newattrs = [disc(attr, data) for attr in data.domain.attributes] data2 = data.select(newattrs + [data.domain.classVar]) print "\n\nQuartile discretization" disc = Orange.feature.discretization.EquiNDiscretization(numberOfIntervals = 6) print "\n\nQuartile (equal frequency) discretization" disc = Orange.feature.discretization.EqualFreq(numberOfIntervals = 6) newattrs = [disc(attr, data) for attr in data.domain.attributes] data2 = data.select(newattrs + [data.domain.classVar]) print "\nManual construction of EquiDistDiscretizer - all attributes" edisc = Orange.feature.discretization.EquiDistDiscretizer(first_cut = 2.0, step = 1.0, number_of_intervals = 5) print "\nManual construction of EqualWidth - all attributes" edisc = Orange.feature.discretization.EqualWidth(first_cut = 2.0, step = 1.0, number_of_intervals = 5) newattrs = [edisc.constructVariable(attr) for attr in data.domain.attributes] data2 = data.select(newattrs + [data.domain.classVar]) print "\nFayyad-Irani discretization" entro = Orange.feature.discretization.EntropyDiscretization() print "\nFayyad-Irani entropy-based discretization" entro = Orange.feature.discretization.Entropy() for attr in data.domain.attributes: disc = entro(attr, data) data_v = Orange.data.Table(newdomain, data) print "\nBi-Modal discretization on binary problem" bimod = Orange.feature.discretization.BiModalDiscretization(split_in_two = 0) print "\nBi-modal discretization on a binary problem" bimod = Orange.feature.discretization.BiModal(split_in_two = 0) for attr in data_v.domain.attributes: disc = bimod(attr, data_v) print print "\nBi-Modal discretization on binary problem" bimod = Orange.feature.discretization.BiModalDiscretization() print "\nBi-modal discretization on a binary problem" bimod = Orange.feature.discretization.BiModal() for attr in data_v.domain.attributes: disc = bimod(attr, data_v) print "\nEntropy discretization on binary problem" print "\nEntropy-based discretization on a binary problem" for attr in data_v.domain.attributes: disc = entro(attr, data_v)
• ## docs/reference/rst/code/distances-test.py

 r9724 # Euclidean distance constructor d2Constr = Orange.distance.Euclidean() d2Constr = Orange.distance.instances.EuclideanConstructor() d2 = d2Constr(iris) # Constructs dPears = Orange.distance.PearsonR(iris) dPears = Orange.distance.instances.PearsonRConstructor(iris) #reference instance
• ## docs/reference/rst/code/ensemble-forest.py

 r9734 forest = Orange.ensemble.forest.RandomForestLearner(trees=50, name="forest") tree = Orange.classification.tree.TreeLearner(min_examples=2, m_prunning=2, \ same_majority_pruning=True, name='tree') tree = Orange.classification.tree.TreeLearner(minExamples=2, mForPrunning=2, \ sameMajorityPruning=True, name='tree') learners = [tree, forest]
• ## docs/reference/rst/code/knnExample2.py

 r9724 knn = Orange.classification.knn.kNNLearner() knn.k = 10 knn.distance_constructor = Orange.distance.Hamming() knn.distance_constructor = Orange.core.ExamplesDistanceConstructor_Hamming() knn = knn(iris) for i in range(5):
• ## docs/reference/rst/code/knnInstanceDistance.py

 r9724 nnc = Orange.classification.knn.FindNearestConstructor() nnc.distanceConstructor = Orange.distance.Euclidean() nnc.distanceConstructor = Orange.core.ExamplesDistanceConstructor_Euclidean() did = Orange.data.new_meta_id()
• ## docs/reference/rst/code/knnlearner.py

 r9724 knn = Orange.classification.knn.kNNLearner(train, k=10) for i in range(5): instance = test.random_instance() instance = test.randomexample() print instance.getclass(), knn(instance) knn = Orange.classification.knn.kNNLearner() knn.k = 10 knn.distance_constructor = Orange.distance.Hamming() knn.distanceConstructor = Orange.core.ExamplesDistanceConstructor_Hamming() knn = knn(train) for i in range(5): instance = test.random_instance() instance = test.randomexample() print instance.getclass(), knn(instance)
• ## docs/reference/rst/code/logreg-singularities.py

 r9372 from Orange import * import Orange table = data.Table("adult_sample") lr = classification.logreg.LogRegLearner(table, removeSingular=1) adult = Orange.data.Table("adult_sample") lr = Orange.classification.logreg.LogRegLearner(adult, removeSingular=1) for ex in table[:5]: for ex in adult[:5]: print ex.getclass(), lr(ex) classification.logreg.dump(lr) Orange.classification.logreg.dump(lr)
• ## docs/reference/rst/code/logreg-stepwise.py

 r9372 ionosphere = Orange.data.Table("ionosphere.tab") lr = Orange.classification.logreg.LogRegLearner(removeSingular=1) lr = Orange.classification.logreg.LogRegLearner(remove_singular=1) learners = ( Orange.classification.logreg.LogRegLearner(name='logistic', removeSingular=1), Orange.classification.logreg.LogRegLearner(name='logistic', remove_singular=1), Orange.feature.selection.FilteredLearner(lr, filter=Orange.classification.logreg.StepWiseFSSFilter(addCrit=0.05, deleteCrit=0.9), name='filtered') filter=Orange.classification.logreg.StepWiseFSSFilter(add_crit=0.05, delete_crit=0.9), name='filtered') ) results = Orange.evaluation.testing.cross_validation(learners, ionosphere, store_classifiers=1) print "\nNumber of times features were used in cross-validation:" featuresUsed = {} features_used = {} for i in range(10): for a in results.classifiers[i][1].atts(): if a.name in featuresUsed.keys(): featuresUsed[a.name] += 1 if a.name in features_used.keys(): features_used[a.name] += 1 else: featuresUsed[a.name] = 1 for k in featuresUsed: print "%2d x %s" % (featuresUsed[k], k) features_used[a.name] = 1 for k in features_used: print "%2d x %s" % (features_used[k], k)
• ## docs/reference/rst/code/lookup-lookup.py

 r9372 import Orange table = Orange.data.Table("monks-1") monks = Orange.data.Table("monks-1") a, b, e = table.domain["a"], table.domain["b"], table.domain["e"] a, b, e = monks.domain["a"], monks.domain["b"], monks.domain["e"] ab = Orange.data.variable.Discrete("a==b", values = ["no", "yes"]) ["yes", "no", "no", "no", "?"]) table2 = table.select([a, b, ab, e, e1, table.domain.class_var]) monks2 = monks.select([a, b, ab, e, e1, monks.domain.class_var]) for i in range(5): print table2.random_example() print monks2.random_example() for i in range(5): ex = table.random_example() ex = monks.random_example() print "%s: ab %i, e1 %i " % (ex, ab.get_value_from.get_index(ex), e1.get_value_from.get_index(ex))
• ## docs/reference/rst/code/majority-classification.py

 r9372 import Orange table = Orange.data.Table("monks-1") monks = Orange.data.Table("monks-1") treeLearner = Orange.classification.tree.TreeLearner() learners = [treeLearner, bayesLearner, majorityLearner] res = Orange.evaluation.testing.cross_validation(learners, table) res = Orange.evaluation.testing.cross_validation(learners, monks) CAs = Orange.evaluation.scoring.CA(res, reportSE=True)

 r9724 # Load some data table = Orange.data.Table("iris.tab") iris = Orange.data.Table("iris.tab") # Construct a distance matrix using Euclidean distance dist = Orange.distance.Euclidean(table) matrix = Orange.core.SymMatrix(len(table)) for i in range(len(table)): dist = Orange.core.ExamplesDistanceConstructor_Euclidean(iris) matrix = Orange.core.SymMatrix(len(iris)) for i in range(len(iris)): for j in range(i+1): matrix[i, j] = dist(table[i], table[j]) matrix[i, j] = dist(iris[i], iris[j]) # Run the Torgerson approximation and calculate stress # Print the points out for (p, e) in zip(mds.points, table): for (p, e) in zip(mds.points, iris): print p, e
• ## docs/reference/rst/code/mds-euclid-torgerson-3d.py

 r9724 # Load some data table = Orange.data.Table("iris.tab") iris = Orange.data.Table("iris.tab") # Construct a distance matrix using Euclidean distance dist = Orange.distance.Euclidean(table) matrix = Orange.core.SymMatrix(len(table)) matrix.setattr('items', table) for i in range(len(table)): dist = Orange.distance.instances.EuclideanConstructor(iris) matrix = Orange.core.SymMatrix(len(iris)) matrix.setattr('items', iris) for i in range(len(iris)): for j in range(i+1): matrix[i, j] = dist(table[i], table[j]) matrix[i, j] = dist(iris[i], iris[j]) # Run the MDS
• ## docs/reference/rst/code/mds-scatterplot.py

 r9827 # Load some data table = Orange.data.Table("iris.tab") iris = Orange.data.Table("iris.tab") # Construct a distance matrix using Euclidean distance euclidean = Orange.distance.Euclidean(table) distance = Orange.core.SymMatrix(len(table)) for i in range(len(table)): euclidean = Orange.distance.Euclidean(iris) distance = Orange.core.SymMatrix(len(iris)) for i in range(len(iris)): for j in range(i + 1): distance[i, j] = euclidean(table[i], table[j]) distance[i, j] = euclidean(iris[i], iris[j]) # Run 100 steps of MDS optimization # Construct points (x, y, instanceClass) points = [] for (i, d) in enumerate(table): for (i, d) in enumerate(iris): points.append((mds.points[i][0], mds.points[i][1], d.getclass())) # Paint each class separately for c in range(len(table.domain.class_var.values)): for c in range(len(iris.domain.class_var.values)): sel = filter(lambda x: x[-1] == c, points) x = [s[0] for s in sel]
• ## docs/reference/rst/code/mean-regression.py

 r9372 import Orange table = Orange.data.Table("housing") housing = Orange.data.Table("housing") treeLearner = Orange.classification.tree.TreeLearner() #Orange.regression.TreeLearner() learners = [treeLearner, meanLearner] res = Orange.evaluation.testing.cross_validation(learners, table) res = Orange.evaluation.testing.cross_validation(learners, housing) MSEs = Orange.evaluation.scoring.MSE(res)
• ## docs/reference/rst/code/misc-selection-bestonthefly.py

 r9372 import Orange table = Orange.data.Table("lymphography") lymphography = Orange.data.Table("lymphography") find_best = Orange.misc.selection.BestOnTheFly(call_compare_on_1st = True) for attr in table.domain.attributes: find_best.candidate((Orange.feature.scoring.GainRatio(attr, table), attr)) for attr in lymphography.domain.attributes: find_best.candidate((Orange.feature.scoring.GainRatio(attr, lymphography), attr)) print "%5.3f: %s" % find_best.winner() find_best = Orange.misc.selection.BestOnTheFly(Orange.misc.selection.compare_first_bigger) for attr in table.domain.attributes: find_best.candidate((Orange.feature.scoring.GainRatio(attr, table), attr)) for attr in lymphography.domain.attributes: find_best.candidate((Orange.feature.scoring.GainRatio(attr, lymphography), attr)) print "%5.3f: %s" % find_best.winner() find_best = Orange.misc.selection.BestOnTheFly() for attr in table.domain.attributes: find_best.candidate(Orange.feature.scoring.GainRatio(attr, table)) for attr in lymphography.domain.attributes: find_best.candidate(Orange.feature.scoring.GainRatio(attr, lymphography)) best_index = find_best.winner_index() print "%5.3f: %s" % (find_best.winner(), table.domain[best_index]) print "%5.3f: %s" % (find_best.winner(), lymphography.domain[best_index])
• ## docs/reference/rst/code/mlc-classify.py

 r9505 import Orange data = Orange.data.Table('emotions') emotions = Orange.data.Table('emotions') learner = Orange.multilabel.BRkNNLearner(k=5) classifier = learner(data) print classifier(data[0]) classifier = learner(emotions) print classifier(emotions[0]) learner = Orange.multilabel.MLkNNLearner(k=5) classifier = learner(data) print classifier(data[0]) classifier = learner(emotions) print classifier(emotions[0]) learner = Orange.multilabel.BinaryRelevanceLearner() classifier = learner(data) print classifier(data[0]) classifier = learner(emotions) print classifier(emotions[0]) learner = Orange.multilabel.LabelPowersetLearner() classifier = learner(data) print classifier(data[0]) classifier = learner(emotions) print classifier(emotions[0]) def test_mlc(data, learners):
• ## docs/reference/rst/code/mlc-evaluate.py

 r9505 learners = [Orange.multilabel.MLkNNLearner(k=5)] data = Orange.data.Table("emotions.tab") emotions = Orange.data.Table("emotions.tab") res = Orange.evaluation.testing.cross_validation(learners, data) res = Orange.evaluation.testing.cross_validation(learners, emotions) print_results(res) res = Orange.evaluation.testing.leave_one_out(learners, data) res = Orange.evaluation.testing.leave_one_out(learners, emotions) print_results(res) res = Orange.evaluation.testing.proportion_test(learners, data, 0.5) res = Orange.evaluation.testing.proportion_test(learners, emotions, 0.5) print_results(res) reses = Orange.evaluation.testing.learning_curve(learners, data) reses = Orange.evaluation.testing.learning_curve(learners, emotions) for res in reses: print_results(res)
• ## docs/reference/rst/code/network-constructor-nx.py

 r9372 # plot vertices plt.plot(x, y, 'ro') plt.savefig("network-constructor-nx.py.png") plt.savefig("network-constructor-nx.png")
• ## docs/reference/rst/code/network-constructor.py

 r9372 # plot vertices plt.plot(x, y, 'ro') plt.savefig("network-constructor.py.png") plt.savefig("network-constructor.png")
• ## docs/reference/rst/code/network-graph-analysis.py

 r9372 # plot vertices of subnetwork plt.plot(x, y, 'ro') plt.savefig("network-graph-analysis.py.png") plt.savefig("network-graph-analysis.png")
• ## docs/reference/rst/code/network-optimization-nx.py

 r9372 # plot vertices plt.plot(x, y, 'ro') plt.savefig("network-optimization-nx.py.png") plt.savefig("network-optimization-nx.png")
• ## docs/reference/rst/code/network-optimization.py

 r9372 networkOptimization = Orange.network.NetworkOptimization(net) # optimize verices layout with one of included algorithms networkOptimization.radial_fruchterman_reingold(100, 1000) # read all edges and plot a line for u, v in net.get_edges(): # plot vertices plt.plot(x, y, 'ro') plt.savefig("network-optimization.py.png") plt.savefig("network-optimization.png")
• ## docs/reference/rst/code/optimization-thresholding1.py

 r9372 import Orange table = Orange.data.Table("bupa") bupa = Orange.data.Table("bupa") learner = Orange.classification.bayes.NaiveLearner() thresh80 = Orange.optimization.ThresholdLearner_fixed(learner=learner, threshold=0.8) res = Orange.evaluation.testing.cross_validation([learner, thresh, thresh80], table) res = Orange.evaluation.testing.cross_validation([learner, thresh, thresh80], bupa) CAs = Orange.evaluation.scoring.CA(res)
• ## docs/reference/rst/code/optimization-thresholding2.py

 r9372 import Orange table = Orange.data.Table("bupa") ri2 = Orange.core.MakeRandomIndices2(table, 0.7) train = table.select(ri2, 0) test = table.select(ri2, 1) bupa = Orange.data.Table("bupa") ri2 = Orange.core.MakeRandomIndices2(bupa, 0.7) train = bupa.select(ri2, 0) test = bupa.select(ri2, 1) bayes = Orange.classification.bayes.NaiveLearner(train)
• ## docs/reference/rst/code/optimization-tuning1.py

 r9372 learner = Orange.classification.tree.TreeLearner() data = Orange.data.Table("voting") voting = Orange.data.Table("voting") tuner = Orange.optimization.Tune1Parameter(object=learner, parameter="minSubset", values=[1, 2, 3, 4, 5, 10, 15, 20], evaluate = Orange.evaluation.scoring.AUC, verbose=2) classifier = tuner(data) classifier = tuner(voting) print "Optimal setting: ", learner.minSubset untuned = Orange.classification.tree.TreeLearner() res = Orange.evaluation.testing.cross_validation([untuned, tuner], data) res = Orange.evaluation.testing.cross_validation([untuned, tuner], voting) AUCs = Orange.evaluation.scoring.AUC(res) learner = Orange.classification.tree.TreeLearner(minSubset=10).instance() data = Orange.data.Table("voting") voting = Orange.data.Table("voting") tuner = Orange.optimization.Tune1Parameter(object=learner, parameter=["split.continuousSplitConstructor.minSubset", evaluate = Orange.evaluation.scoring.AUC, verbose=2) classifier = tuner(data) classifier = tuner(voting) print "Optimal setting: ", learner.split.continuousSplitConstructor.minSubset
• ## docs/reference/rst/code/optimization-tuningm.py

 r9372 learner = Orange.classification.tree.TreeLearner() data = Orange.data.Table("voting") voting = Orange.data.Table("voting") tuner = Orange.optimization.TuneMParameters(object=learner, parameters=[("minSubset", [2, 5, 10, 20]), evaluate = Orange.evaluation.scoring.AUC) classifier = tuner(data) classifier = tuner(voting)
• ## docs/reference/rst/code/orngTree1.py

 r9800 import Orange data = Orange.data.Table("iris") tree = Orange.classification.tree.TreeLearner(data, max_depth=3) iris = Orange.data.Table("iris") tree = Orange.classification.tree.TreeLearner(iris, max_depth=3) formats = ["", "%V (%M out of %N)", "%V (%^MbA%, %^MbP%)", data = Orange.data.Table("housing") tree = Orange.classification.tree.TreeLearner(data, max_depth=3) housing = Orange.data.Table("housing") tree = Orange.classification.tree.TreeLearner(housing, max_depth=3) formats = ["", "%V"] for format in formats:
• ## docs/reference/rst/code/orngTree2.py

 r9800 import re data = Orange.data.Table("iris") tree = Orange.classification.tree.TreeLearner(data, max_depth=3) iris = Orange.data.Table("iris") tree = Orange.classification.tree.TreeLearner(iris, max_depth=3) def get_margin(dist):
• ## docs/reference/rst/code/outlier1.py

 r9372 import Orange data = Orange.data.Table("bridges") bridges = Orange.data.Table("bridges") outlierDet = Orange.preprocess.outliers.OutlierDetection() outlierDet.set_examples(data) outlierDet.set_examples(bridges) print outlierDet.z_values()
• ## docs/reference/rst/code/outlier2.py

 r9724 import Orange data = Orange.data.Table("bridges") bridges = Orange.data.Table("bridges") outlier_det = Orange.preprocess.outliers.OutlierDetection() outlier_det.set_examples(data, Orange.distance.Euclidean(data)) outlier_det.set_examples(bridges, Orange.distance.instances.EuclideanConstructor(bridges)) outlier_det.set_knn(3) z_values = outlier_det.z_values() for ex,zv in sorted(zip(data, z_values), key=lambda x: x[1])[-5:]: for ex,zv in sorted(zip(bridges, z_values), key=lambda x: x[1])[-5:]: print ex, "Z-score: %5.3f" % zv
• ## docs/reference/rst/code/pca-scree.py

 r9372 import Orange table = Orange.data.Table("iris.tab") iris = Orange.data.Table("iris.tab") pca = Orange.projection.pca.Pca()(table) pca = Orange.projection.pca.Pca()(iris) pca.scree_plot("pca-scree.png")
• ## docs/reference/rst/code/randomindices2.py

 r9696 import Orange data = Orange.data.Table("lenses") lenses = Orange.data.Table("lenses") indices2 = Orange.data.sample.SubsetIndices2(p0=6) ind = indices2(data) ind = indices2(lenses) print ind data0 = data.select(ind, 0) data1 = data.select(ind, 1) print len(data0), len(data1) lenses0 = lenses.select(ind, 0) lenses1 = lenses.select(ind, 1) print len(lenses0), len(lenses1) print "\nIndices without playing with random generator" for i in range(5): print indices2(data) print indices2(lenses) print "\nIndices with random generator" indices2.random_generator = Orange.misc.Random(42) indices2.random_generator = Orange.core.RandomGenerator(42) for i in range(5): print indices2(data) print indices2(lenses) print "\nIndices with randseed" indices2.randseed = 42 for i in range(5): print indices2(data) print indices2(lenses) print "\nIndices with p0 set as probability (not 'a number of')" indices2.p0 = 0.25 print indices2(data) print indices2(lenses) print "\n... with stratification" indices2.stratified = indices2.Stratified ind = indices2(data) ind = indices2(lenses) print ind data2 = data.select(ind) od = Orange.core.getClassDistribution(data) sd = Orange.core.getClassDistribution(data2) lenses2 = lenses.select(ind) od = Orange.core.getClassDistribution(lenses) sd = Orange.core.getClassDistribution(lenses2) od.normalize() sd.normalize() print "\n... and without stratification" indices2.stratified = indices2.NotStratified print indices2(data) ind = indices2(data) print indices2(lenses) ind = indices2(lenses) print ind data2 = data.select(ind) od = Orange.core.getClassDistribution(data) sd = Orange.core.getClassDistribution(data2) lenses2 = lenses.select(ind) od = Orange.core.getClassDistribution(lenses) sd = Orange.core.getClassDistribution(lenses2) od.normalize() sd.normalize() print "\n... stratified 'if possible'" indices2.stratified = indices2.StratifiedIfPossible print indices2(data) print indices2(lenses) print "\n... stratified 'if possible', after removing the first example's class" data[0].setclass("?") print indices2(data) lenses[0].setclass("?") print indices2(lenses)
• ## docs/reference/rst/code/randomindicescv.py

 r9372 import Orange data = Orange.data.Table("lenses") lenses = Orange.data.Table("lenses") print "Indices for ordinary 10-fold CV" print Orange.data.sample.SubsetIndicesCV(data) print Orange.data.sample.SubsetIndicesCV(lenses) print "Indices for 5 folds on 10 examples" print Orange.data.sample.SubsetIndicesCV(10, folds=5)
• ## docs/reference/rst/code/randomindicesn.py

 r9372 import Orange data = Orange.data.Table("lenses") lenses = Orange.data.Table("lenses") indicesn = Orange.data.sample.SubsetIndicesN(p=[0.5, 0.25]) ind = indicesn(data) ind = indicesn(lenses) print ind indicesn = Orange.data.sample.SubsetIndicesN(p=[12, 6]) ind = indicesn(data) ind = indicesn(lenses) print ind
• ## docs/reference/rst/code/regression-tree-run.py

 r9372 import Orange table = Orange.data.Table("servo.tab") tree = Orange.regression.tree.TreeLearner(table) servo = Orange.data.Table("servo.tab") tree = Orange.regression.tree.TreeLearner(servo) print tree
• ## docs/reference/rst/code/reliability-basic.py

 r9681 # Description: Reliability estimation - basic & fast # Category:    evaluation # Uses:        housing # Referenced:  Orange.evaluation.reliability # Classes:     Orange.evaluation.reliability.Mahalanobis, Orange.evaluation.reliability.LocalCrossValidation, Orange.evaluation.reliability.Learner import Orange data = Orange.data.Table("housing.tab") housing = Orange.data.Table("housing.tab") knn = Orange.classification.knn.kNNLearner() reliability = Orange.evaluation.reliability.Learner(knn, estimators = estimators) restimator = reliability(data) instance = data[0] restimator = reliability(housing) instance = housing[0] value, probability = restimator(instance, result_type=Orange.core.GetBoth)
• ## docs/reference/rst/code/reliability-long.py

 r9683 # Description: Reliability estimation # Category:    evaluation # Uses:        prostate # Referenced:  Orange.evaluation.reliability # Classes:     Orange.evaluation.reliability.Learner import Orange import Orange Orange.evaluation.reliability.select_with_repeat.random_generator = None Orange.evaluation.reliability.select_with_repeat.randseed = 42 import Orange table = Orange.data.Table("prostate.tab") prostate = Orange.data.Table("prostate.tab") knn = Orange.classification.knn.kNNLearner() reliability = Orange.evaluation.reliability.Learner(knn) res = Orange.evaluation.testing.cross_validation([reliability], table) res = Orange.evaluation.testing.cross_validation([reliability], prostate) reliability_res = Orange.evaluation.reliability.get_pearson_r(res) print "Estimate               r       p" for estimate in reliability_res: print "%-20s %7.3f %7.3f" % (Orange.evaluation.reliability.METHOD_NAME[estimate[3]], print "%-20s %7.3f %7.3f" % (Orange.evaluation.reliability.METHOD_NAME[estimate[3]], \ estimate[0], estimate[1]) reliability = Orange.evaluation.reliability.Learner(knn, estimators=[Orange.evaluation.reliability.SensitivityAnalysis()]) res = Orange.evaluation.testing.cross_validation([reliability], table) res = Orange.evaluation.testing.cross_validation([reliability], prostate) reliability_res = Orange.evaluation.reliability.get_pearson_r(res) print "Estimate               r       p" for estimate in reliability_res: print "%-20s %7.3f %7.3f" % (Orange.evaluation.reliability.METHOD_NAME[estimate[3]], print "%-20s %7.3f %7.3f" % (Orange.evaluation.reliability.METHOD_NAME[estimate[3]], \ estimate[0], estimate[1]) indices = Orange.core.MakeRandomIndices2(table, p0=0.7) train = table.select(indices, 0) test = table.select(indices, 1) indices = Orange.core.MakeRandomIndices2(prostate, p0=0.7) train = prostate.select(indices, 0) test = prostate.select(indices, 1) reliability = Orange.evaluation.reliability.Learner(knn, icv=True)
• ## docs/reference/rst/code/reliability-run.py

 r9683 # Description: Reliability estimation with cross-validation # Category:    evaluation # Uses:        housing # Referenced:  Orange.evaluation.reliability # Classes:     Orange.evaluation.reliability.Learner import Orange Orange.evaluation.reliability.select_with_repeat.random_generator = None Orange.evaluation.reliability.select_with_repeat.randseed = 42 import Orange table = Orange.data.Table("housing.tab") housing = Orange.data.Table("housing.tab") knn = Orange.classification.knn.kNNLearner() reliability = Orange.evaluation.reliability.Learner(knn) results = Orange.evaluation.testing.cross_validation([reliability], table) results = Orange.evaluation.testing.cross_validation([reliability], housing) for i, instance in enumerate(results.results[:10]): print "Instance", i for estimate in instance.probabilities[0].reliability_estimate: print "  ", estimate.method_name, estimate.estimate for estimate in results.results[0].probabilities[0].reliability_estimate: print estimate.method_name, estimate.estimate
• ## docs/reference/rst/code/rules-cn2.py

 r9372 import Orange # Read some data table = Orange.data.Table("titanic") titanic = Orange.data.Table("titanic") # construct the learning algorithm and use it to induce a classifier cn2_learner = Orange.classification.rules.CN2Learner() cn2_clasifier = cn2_learner(table) cn2_clasifier = cn2_learner(titanic) # ... or, in a single step. cn2_classifier = Orange.classification.rules.CN2Learner(table) cn2_classifier = Orange.classification.rules.CN2Learner(titanic) # All rule-base classifiers can have their rules printed out like this:
• ## docs/reference/rst/code/rules-customized.py

 r9372 learner.rule_finder.evaluator = Orange.classification.rules.MEstimateEvaluator(m=50) table =  Orange.data.Table("titanic") classifier = learner(table) titanic =  Orange.data.Table("titanic") classifier = learner(titanic) for r in classifier.rules: Orange.classification.rules.RuleBeamFilter_Width(width = 50) classifier = learner(table) classifier = learner(titanic) for r in classifier.rules:
• ## docs/reference/rst/code/scoring-all.py

 r9372 import Orange table = Orange.data.Table("voting") voting = Orange.data.Table("voting") def print_best_3(ma): print 'Feature scores for best three features (with score_all):' ma = Orange.feature.scoring.score_all(table) ma = Orange.feature.scoring.score_all(voting) print_best_3(ma) print 'Feature scores for best three features (scored individually):' meas = Orange.feature.scoring.Relief(k=20, m=50) mr = [ (a.name, meas(a, table)) for a in table.domain.attributes ] mr = [ (a.name, meas(a, voting)) for a in voting.domain.attributes] mr.sort(key=lambda x: -x[1]) #sort decreasingly by the score print_best_3(mr)
• ## docs/reference/rst/code/scoring-calls.py

 r9372 import Orange table = Orange.data.Table("titanic") titanic = Orange.data.Table("titanic") meas = Orange.feature.scoring.GainRatio() print "Call with variable and data table" print meas(0, table) print meas(0, titanic) print "Call with variable and domain contingency" domain_cont = Orange.statistics.contingency.Domain(table) domain_cont = Orange.statistics.contingency.Domain(titanic) print meas(0, domain_cont) print "Call with contingency and class distribution" cont = Orange.statistics.contingency.VarClass(0, table) cont = Orange.statistics.contingency.VarClass(0, titanic) class_dist = Orange.statistics.distribution.Distribution( \ table.domain.class_var, table) titanic.domain.class_var, titanic) print meas(cont, class_dist)
• ## docs/reference/rst/code/scoring-diff-measures.py

 r9372 import Orange import random table = Orange.data.Table("measure") data = Orange.data.Table("measure") table2 = Orange.data.Table(table) data2 = Orange.data.Table(data) nulls = [(0, 1, 24, 25), (24, 25), range(24, 34), (24, 25)] for attr in range(len(nulls)): for e in nulls[attr]: table2[e][attr]="?" data2[e][attr]="?" names = [a.name for a in table.domain.attributes] names = [a.name for a in data.domain.attributes] attrs = len(names) print def printVariants(meas): print fstr % (("- no unknowns:",) + tuple([meas(i, table) for i in range(attrs)])) print fstr % (("- no unknowns:",) + tuple([meas(i, data) for i in range(attrs)])) meas.unknowns_treatment = meas.IgnoreUnknowns print fstr % (("- ignore unknowns:",) + tuple([meas(i, table2) for i in range(attrs)])) print fstr % (("- ignore unknowns:",) + tuple([meas(i, data2) for i in range(attrs)])) meas.unknowns_treatment = meas.ReduceByUnknowns print fstr % (("- reduce unknowns:",) + tuple([meas(i, table2) for i in range(attrs)])) print fstr % (("- reduce unknowns:",) + tuple([meas(i, data2) for i in range(attrs)])) meas.unknowns_treatment = meas.UnknownsToCommon print fstr % (("- unknowns to common:",) + tuple([meas(i, table2) for i in range(attrs)])) print fstr % (("- unknowns to common:",) + tuple([meas(i, data2) for i in range(attrs)])) meas.unknowns_treatment = meas.UnknownsAsValue print fstr % (("- unknowns as value:",) + tuple([meas(i, table2) for i in range(attrs)])) print fstr % (("- unknowns as value:",) + tuple([meas(i, data2) for i in range(attrs)])) print print "Relief" meas = Orange.feature.scoring.Relief() print fstr % (("- no unknowns:",) + tuple([meas(i, table) for i in range(attrs)])) print fstr % (("- with unknowns:",) + tuple([meas(i, table2) for i in range(attrs)])) print fstr % (("- no unknowns:",) + tuple([meas(i, data) for i in range(attrs)])) print fstr % (("- with unknowns:",) + tuple([meas(i, data2) for i in range(attrs)])) print
• ## docs/reference/rst/code/scoring-info-iris.py

 r9372 import Orange table = Orange.data.Table("iris") iris = Orange.data.Table("iris") d1 = Orange.feature.discretization.EntropyDiscretization("petal length", table) print Orange.feature.scoring.InfoGain(d1, table) d1 = Orange.feature.discretization.EntropyDiscretization("petal length", iris) print Orange.feature.scoring.InfoGain(d1, iris) table = Orange.data.Table("iris") iris = Orange.data.Table("iris") meas = Orange.feature.scoring.Relief() for t in meas.threshold_function("petal length", table): for t in meas.threshold_function("petal length", iris): print "%5.3f: %5.3f" % t thresh, score, distr = meas.best_threshold("petal length", table) thresh, score, distr = meas.best_threshold("petal length", iris) print "\nBest threshold: %5.3f (score %5.3f)" % (thresh, score)
• ## docs/reference/rst/code/scoring-info-lenses.py

 r9525

```diff
 import Orange, random
-table = Orange.data.Table("lenses")
+lenses = Orange.data.Table("lenses")
 meas = Orange.feature.scoring.InfoGain()
-astigm = table.domain["astigmatic"]
-print "Information gain of 'astigmatic': %6.4f" % meas(astigm, table)
+astigm = lenses.domain["astigmatic"]
+print "Information gain of 'astigmatic': %6.4f" % meas(astigm, lenses)
-classdistr = Orange.statistics.distribution.Distribution(table.domain.class_var, table)
-cont = Orange.statistics.contingency.VarClass("tear_rate", table)
+classdistr = Orange.statistics.distribution.Distribution(lenses.domain.class_var, lenses)
+cont = Orange.statistics.contingency.VarClass("tear_rate", lenses)
 print "Information gain of 'tear_rate': %6.4f" % meas(cont, classdistr)
-dcont = Orange.statistics.contingency.Domain(table)
+dcont = Orange.statistics.contingency.Domain(lenses)
 print "Information gain of the first attribute: %6.4f" % meas(0, dcont)
 print
 print "*** A set of more exhaustive tests for different way of passing arguments to MeasureAttribute ***"
-names = [a.name for a in table.domain.attributes]
+names = [a.name for a in lenses.domain.attributes]
 attrs = len(names)
 print "Computing information gain directly from examples"
-print fstr % (("- by attribute number:",) + tuple([meas(i, table) for i in range(attrs)]))
-print fstr % (("- by attribute name:",) + tuple([meas(i, table) for i in names]))
-print fstr % (("- by attribute descriptor:",) + tuple([meas(i, table) for i in table.domain.attributes]))
+print fstr % (("- by attribute number:",) + tuple([meas(i, lenses) for i in range(attrs)]))
+print fstr % (("- by attribute name:",) + tuple([meas(i, lenses) for i in names]))
+print fstr % (("- by attribute descriptor:",) + tuple([meas(i, lenses) for i in lenses.domain.attributes]))
 print
-dcont = Orange.statistics.contingency.Domain(table)
+dcont = Orange.statistics.contingency.Domain(lenses)
 print "Computing information gain from DomainContingency"
 print fstr % (("- by attribute number:",) + tuple([meas(i, dcont) for i in range(attrs)]))
 print fstr % (("- by attribute name:",) + tuple([meas(i, dcont) for i in names]))
-print fstr % (("- by attribute descriptor:",) + tuple([meas(i, dcont) for i in table.domain.attributes]))
+print fstr % (("- by attribute descriptor:",) + tuple([meas(i, dcont) for i in lenses.domain.attributes]))
 print
 print "Computing information gain from DomainContingency"
-cdist = Orange.statistics.distribution.Distribution(table.domain.class_var, table)
-print fstr % (("- by attribute number:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in range(attrs)]))
-print fstr % (("- by attribute name:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in names]))
-print fstr % (("- by attribute descriptor:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, table), cdist) for i in table.domain.attributes]))
+cdist = Orange.statistics.distribution.Distribution(lenses.domain.class_var, lenses)
+print fstr % (("- by attribute number:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, lenses), cdist) for i in range(attrs)]))
+print fstr % (("- by attribute name:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, lenses), cdist) for i in names]))
+print fstr % (("- by attribute descriptor:",) + tuple([meas(Orange.statistics.contingency.VarClass(i, lenses), cdist) for i in lenses.domain.attributes]))
 print
-values = ["v%i" % i for i in range(len(table.domain[2].values)*len(table.domain[3].values))]
+values = ["v%i" % i for i in range(len(lenses.domain[2].values)*len(lenses.domain[3].values))]
 cartesian = Orange.data.variable.Discrete("cart", values = values)
-cartesian.get_value_from = Orange.classification.lookup.ClassifierByLookupTable(cartesian, table.domain[2], table.domain[3], values)
+cartesian.get_value_from = Orange.classification.lookup.ClassifierByLookupTable(cartesian, lenses.domain[2], lenses.domain[3], values)
-print "Information gain of Cartesian product of %s and %s: %6.4f" % (table.domain[2].name, table.domain[3].name, meas(cartesian, table))
+print "Information gain of Cartesian product of %s and %s: %6.4f" % (lenses.domain[2].name, lenses.domain[3].name, meas(cartesian, lenses))
 mid = Orange.data.new_meta_id()
-table.domain.add_meta(mid, Orange.data.variable.Discrete(values = ["v0", "v1"]))
-table.add_meta_attribute(mid)
+lenses.domain.add_meta(mid, Orange.data.variable.Discrete(values = ["v0", "v1"]))
+lenses.add_meta_attribute(mid)
 rg = random.Random()
 rg.seed(0)
-for ex in table:
+for ex in lenses:
     ex[mid] = Orange.data.Value(rg.randint(0, 1))
-print "Information gain for a random meta attribute: %6.4f" % meas(mid, table)
+print "Information gain for a random meta attribute: %6.4f" % meas(mid, lenses)
```
• ## docs/reference/rst/code/scoring-relief-caching.py

 r9372

```diff
 import orange
-data = orange.ExampleTable("iris")
+iris = orange.ExampleTable("iris")
 r1 = orange.MeasureAttribute_relief()
 r2 = orange.MeasureAttribute_relief(check_cached_data = False)
-print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
-for ex in data:
+print "%.3f\t%.3f" % (r1(0, iris), r2(0, iris))
+for ex in iris:
     ex[0] = 0
-print "%.3f\t%.3f" % (r1(0, data), r2(0, data))
+print "%.3f\t%.3f" % (r1(0, iris), r2(0, iris))
```
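The relief-caching example above exists because ReliefF caches computed scores per dataset, and a cache keyed only by object identity returns stale values once the data is mutated in place. A minimal plain-Python sketch of that pitfall, with a made-up `CachedScorer` standing in for the Orange measure (all names hypothetical, not Orange API):

```python
# Sketch of the caching pitfall: a scorer that caches per dataset object
# returns a stale score after the data is changed in place, unless it
# re-checks the contents (the role of check_cached_data).
class CachedScorer:
    def __init__(self, check_cached_data=True):
        self.check_cached_data = check_cached_data
        self._cache = {}  # id(data) -> (content fingerprint, score)

    def __call__(self, data):
        key = id(data)
        fp = tuple(tuple(row) for row in data)  # cheap content fingerprint
        if key in self._cache:
            cached_fp, score = self._cache[key]
            # reuse the cached score unless asked to verify the contents
            if not self.check_cached_data or cached_fp == fp:
                return score
        score = sum(row[0] for row in data) / len(data)  # stand-in "score"
        self._cache[key] = (fp, score)
        return score

data = [[1.0], [2.0], [3.0]]
safe = CachedScorer(check_cached_data=True)
fast = CachedScorer(check_cached_data=False)
s1, f1 = safe(data), fast(data)   # both compute 2.0 and cache it
for row in data:
    row[0] = 0.0                  # mutate in place, like `ex[0] = 0` above
s2, f2 = safe(data), fast(data)
# safe recomputes (s2 == 0.0); fast returns the stale cached 2.0 (f2 == s1)
```

The real measure's caching is internal to Orange; this only illustrates why turning the content check off trades correctness under in-place mutation for speed.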
• ## docs/reference/rst/code/scoring-relief-gainRatio.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("voting")
+voting = Orange.data.Table("voting")
 print 'Relief GainRt Feature'
-ma_def = Orange.feature.scoring.score_all(table)
+ma_def = Orange.feature.scoring.score_all(voting)
 gr = Orange.feature.scoring.GainRatio()
-ma_gr  = Orange.feature.scoring.score_all(table, gr)
+ma_gr  = Orange.feature.scoring.score_all(voting, gr)
 for i in range(5):
     print "%5.3f  %5.3f  %s" % (ma_def[i][1], ma_gr[i][1], ma_def[i][0])
```
• ## docs/reference/rst/code/selection-bayes.py

 r9661

```diff
 # Uses:        voting
 # Referenced:  Orange.feature.html#selection
-# Classes:     Orange.feature.scoring.score_all, Orange.feature.selection.best_n
+# Classes:     Orange.feature.scoring.score_all, Orange.feature.selection.bestNAtts
 import Orange
     else:
         return learner
     def __init__(self, name='Naive Bayes with FSS', N=5):
         self.name = name
         self.N = 5
-    def __call__(self, data, weight=None):
-        ma = Orange.feature.scoring.score_all(data)
-        filtered = Orange.feature.selection.select_best_n(data, ma, self.N)
+    def __call__(self, table, weight=None):
+        ma = Orange.feature.scoring.score_all(table)
+        filtered = Orange.feature.selection.selectBestNAtts(table, ma, self.N)
         model = Orange.classification.bayes.NaiveLearner(filtered)
         return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name)
     def __init__(self, **kwds):
         self.__dict__.update(kwds)
-    def __call__(self, example, resultType=Orange.core.GetValue):
+    def __call__(self, example, resultType = Orange.core.GetValue):
         return self.classifier(example, resultType)
```
• ## docs/reference/rst/code/selection-filtered-learner.py

 r9661

```diff
 # Classes:     Orange.feature.selection.FilteredLearner
-import Orange
+import Orange, orngTest, orngStat
 voting = Orange.data.Table("voting")
 nb = Orange.classification.bayes.NaiveLearner()
-fl = Orange.feature.selection.FilteredLearner(nb, filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')
+fl = Orange.feature.selection.FilteredLearner(nb, filter=Orange.feature.selection.FilterBestNAtts(n=1), name='filtered')
 learners = (Orange.classification.bayes.NaiveLearner(name='bayes'), fl)
-results = Orange.evaluation.testing.cross_validation(learners, voting, storeClassifiers=1)
+results = orngTest.crossValidation(learners, voting, storeClassifiers=1)
 # output the results
 print "Learner      CA"
 for i in range(len(learners)):
-    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])
+    print "%-12s %5.3f" % (learners[i].name, orngStat.CA(results)[i])
 # find out which attributes were retained by filtering
```
• ## docs/reference/rst/code/simple_tree_random_forest.py

 r9372

```diff
 learners = [ rf_def, rf_simple ]
-table = Orange.data.Table("iris")
-results = Orange.evaluation.testing.cross_validation(learners, table, folds=3)
+iris = Orange.data.Table("iris")
+results = Orange.evaluation.testing.cross_validation(learners, iris, folds=3)
 print "Learner  CA     Brier  AUC"
 for i in range(len(learners)):
 for l in learners:
     t = time.time()
-    l(table)
+    l(iris)
     print l.name, time.time() - t
```
• ## docs/reference/rst/code/simple_tree_regression.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("housing.tab")
+housing = Orange.data.Table("housing.tab")
 learner = Orange.regression.tree.SimpleTreeLearner
-res = Orange.evaluation.testing.cross_validation([learner], table)
+res = Orange.evaluation.testing.cross_validation([learner], housing)
 print Orange.evaluation.scoring.MSE(res)[0]
```
• ## docs/reference/rst/code/som-classifier.py

 r9372

```diff
 import random
 learner = Orange.projection.som.SOMSupervisedLearner(map_shape=(4, 4))
-data = Orange.data.Table("iris.tab")
-classifier = learner(data)
+iris = Orange.data.Table("iris.tab")
+classifier = learner(iris)
 random.seed(50)
-for d in random.sample(data, 5):
+for d in random.sample(iris, 5):
     print "%-15s originally %-15s" % (classifier(d), d.getclass())
```
• ## docs/reference/rst/code/statistics-contingency.py

 r9372

```diff
-import Orange.statistics.contingency
+import Orange
-table = Orange.data.Table("monks-1.tab")
-cont = Orange.statistics.contingency.VarClass("e", table)
+monks = Orange.data.Table("monks-1.tab")
+cont = Orange.statistics.contingency.VarClass("e", monks)
 for val, dist in cont.items():
     print val, dist
```
• ## docs/reference/rst/code/statistics-contingency2.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("monks-1.tab")
-cont = Orange.statistics.contingency.Table(table.domain["e"], table.domain.classVar)
-for ins in table:
+monks = Orange.data.Table("monks-1.tab")
+cont = Orange.statistics.contingency.Table(monks.domain["e"], monks.domain.classVar)
+for ins in monks:
     cont[ins["e"]][ins.get_class()] += 1
 print
-for ins in table:
+for ins in monks:
     cont.add(ins["e"], ins.get_class())
```
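The statistics-contingency2 example fills a feature-by-class contingency table one instance at a time. The same bookkeeping can be sketched in plain Python with the standard library; the `(feature value, class value)` pairs below are made up, not the monks-1 data:

```python
# Plain-Python sketch of what the contingency loop above does:
# count occurrences of each (feature value, class value) pair.
from collections import Counter, defaultdict

instances = [
    ("1", "yes"), ("1", "no"), ("2", "yes"),
    ("2", "yes"), ("3", "no"), ("1", "yes"),
]

cont = defaultdict(Counter)  # cont[feature_value][class_value] -> count
for e_value, cls in instances:
    cont[e_value][cls] += 1  # mirrors: cont[ins["e"]][ins.get_class()] += 1

print(dict(cont["1"]))       # -> {'yes': 2, 'no': 1}
```

Orange's `contingency.Table` additionally knows the two variables' value sets up front and can normalize into distributions; the sketch only shows the raw counting.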
• ## docs/reference/rst/code/statistics-contingency3.py

 r9372

```diff
 import Orange.statistics.contingency
-table = Orange.data.Table("monks-1.tab")
-cont = Orange.statistics.contingency.VarClass("e", table)
+monks = Orange.data.Table("monks-1.tab")
+cont = Orange.statistics.contingency.VarClass("e", monks)
 print "Inner variable: ", cont.inner_variable.name
 print
-cont = Orange.statistics.contingency.VarClass(table.domain["e"], table.domain.class_var)
-for ins in table:
+cont = Orange.statistics.contingency.VarClass(monks.domain["e"], monks.domain.class_var)
+for ins in monks:
     cont.add_var_class(ins["e"], ins.getclass())
```
• ## docs/reference/rst/code/statistics-contingency4.py

 r9372

```diff
 import Orange.statistics.contingency
-table = Orange.data.Table("monks-1.tab")
-cont = Orange.statistics.contingency.ClassVar("e", table)
+monks = Orange.data.Table("monks-1.tab")
+cont = Orange.statistics.contingency.ClassVar("e", monks)
 print "Inner variable: ", cont.inner_variable.name
 print
-cont = Orange.statistics.contingency.ClassVar(table.domain["e"], table.domain.class_var)
-for ins in table:
+cont = Orange.statistics.contingency.ClassVar(monks.domain["e"], monks.domain.class_var)
+for ins in monks:
     cont.add_var_class(ins["e"], ins.get_class())
```
• ## docs/reference/rst/code/statistics-contingency5.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("bridges.tab")
-cont = Orange.statistics.contingency.VarVar("SPAN", "MATERIAL", table)
+bridges = Orange.data.Table("bridges.tab")
+cont = Orange.statistics.contingency.VarVar("SPAN", "MATERIAL", bridges)
 print "Distributions:"
 print
-cont = Orange.statistics.contingency.VarVar(table.domain["SPAN"], table.domain["MATERIAL"])
-for ins in table:
+cont = Orange.statistics.contingency.VarVar(bridges.domain["SPAN"], bridges.domain["MATERIAL"])
+for ins in bridges:
     cont.add(ins["SPAN"], ins["MATERIAL"])
```
• ## docs/reference/rst/code/statistics-contingency6.py

 r9749

```diff
 import Orange
-table = Orange.data.Table("iris.tab")
-cont = Orange.statistics.contingency.VarClass(0, table)
+iris = Orange.data.Table("iris.tab")
+cont = Orange.statistics.contingency.VarClass(0, iris)
 print "Contingency items:"
-midkey = (cont.keys()[0] + cont.keys()[1]) / 2.0
+midkey = (cont.keys()[0] + cont.keys()[1])/2.0
 print "cont[%5.3f] =" % midkey, cont[midkey]
```
• ## docs/reference/rst/code/statistics-contingency7.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("iris")
-cont = Orange.statistics.contingency.ClassVar("sepal length", table)
+iris = Orange.data.Table("iris")
+cont = Orange.statistics.contingency.ClassVar("sepal length", iris)
 print "Inner variable: ", cont.inner_variable.name
 print
-cont = Orange.statistics.contingency.ClassVar(table.domain["sepal length"], table.domain.class_var)
-for ins in table:
+cont = Orange.statistics.contingency.ClassVar(iris.domain["sepal length"], iris.domain.class_var)
+for ins in iris:
     cont.add_var_class(ins["sepal length"], ins.get_class())
```
• ## docs/reference/rst/code/statistics-contingency8.py

 r9372

```diff
 import Orange
-table = Orange.data.Table("monks-1.tab")
+monks = Orange.data.Table("monks-1.tab")
 print "Distributions of classes given the feature value"
-dc = Orange.statistics.contingency.Domain(table)
+dc = Orange.statistics.contingency.Domain(monks)
 print "a: ", dc["a"]
 print "b: ", dc["b"]
 print "Distributions of feature values given the class value"
-dc = Orange.statistics.contingency.Domain(table, classIsOuter = 1)
+dc = Orange.statistics.contingency.Domain(monks, classIsOuter = 1)
 print "a: ", dc["a"]
 print "b: ", dc["b"]
```
• ## docs/reference/rst/code/svm-custom-kernel.py

 r9724

```diff
 from Orange.classification.svm import SVMLearner, kernels
-from Orange.distance import Euclidean
-from Orange.distance import Hamming
+from Orange.distance.instances import EuclideanConstructor
+from Orange.distance.instances import HammingConstructor
-table = data.Table("iris.tab")
+iris = data.Table("iris.tab")
 l1 = SVMLearner()
-l1.kernel_func = kernels.RBFKernelWrapper(Euclidean(table), gamma=0.5)
+l1.kernel_func = kernels.RBFKernelWrapper(EuclideanConstructor(iris), gamma=0.5)
 l1.kernel_type = SVMLearner.Custom
 l1.probability = True
-c1 = l1(table)
+c1 = l1(iris)
 l1.name = "SVM - RBF(Euclidean)"
 l2 = SVMLearner()
-l2.kernel_func = kernels.RBFKernelWrapper(Hamming(table), gamma=0.5)
+l2.kernel_func = kernels.RBFKernelWrapper(HammingConstructor(iris), gamma=0.5)
 l2.kernel_type = SVMLearner.Custom
 l2.probability = True
-c2 = l2(table)
+c2 = l2(iris)
 l2.name = "SVM - RBF(Hamming)"
 l3 = SVMLearner()
 l3.kernel_func = kernels.CompositeKernelWrapper(
-    kernels.RBFKernelWrapper(Euclidean(table), gamma=0.5),
-    kernels.RBFKernelWrapper(Hamming(table), gamma=0.5), l=0.5)
+    kernels.RBFKernelWrapper(EuclideanConstructor(iris), gamma=0.5),
+    kernels.RBFKernelWrapper(HammingConstructor(iris), gamma=0.5), l=0.5)
 l3.kernel_type = SVMLearner.Custom
 l3.probability = True
-c3 = l1(table)
+c3 = l1(iris)
 l3.name = "SVM - Composite"
-tests = evaluation.testing.cross_validation([l1, l2, l3], table, folds=5)
+tests = evaluation.testing.cross_validation([l1, l2, l3], iris, folds=5)
 [ca1, ca2, ca3] = evaluation.scoring.CA(tests)
```
• ## docs/reference/rst/code/svm-easy.py

 r9372

```diff
 from Orange.classification import svm
-table = data.Table("vehicle.tab")
+vehicle = data.Table("vehicle.tab")
 svm_easy = svm.SVMLearnerEasy(name="svm easy", folds=3)
 from Orange.evaluation import testing, scoring
-results = testing.cross_validation(learners, table, folds=5)
+results = testing.cross_validation(learners, vehicle, folds=5)
 print "Name     CA        AUC"
 for learner,CA,AUC in zip(learners, scoring.CA(results), scoring.AUC(results)):
```
• ## docs/reference/rst/code/svm-linear-weights.py

 r9372

```diff
 from Orange.classification import svm
-table = data.Table("brown-selected")
-classifier = svm.SVMLearner(table,
+brown = data.Table("brown-selected")
+classifier = svm.SVMLearner(brown,
     kernel_type=svm.kernels.Linear, normalization=False)
```
• ## docs/reference/rst/code/svm-recursive-feature-elimination.py

 r9372

```diff
 from Orange.classification import svm
-table = data.Table("brown-selected")
-print table.domain
+brown = data.Table("brown-selected")
+print brown.domain
 rfe = svm.RFE()
-newdata = rfe(table, 10)
+newdata = rfe(brown, 10)
 print newdata.domain
```
• ## docs/reference/rst/code/symmatrix.py

 r9372

```diff
-import Orange.data
+import Orange
 m = Orange.data.SymMatrix(4)
```
• ## docs/reference/rst/code/testing-test.py

 r9696

```diff
 import Orange.evaluation.testing
-table = Orange.data.Table("voting")
+voting = Orange.data.Table("voting")
 bayes = Orange.classification.bayes.NaiveLearner(name="bayes")
 print "\nproportionsTest that will always give the same results"
 for i in range(3):
-    res = Orange.evaluation.testing.proportion_test(learners, table, 0.7)
+    res = Orange.evaluation.testing.proportion_test(learners, voting, 0.7)
     printResults(res)
 print "\nproportionsTest that will give different results, \
 but the same each time the script is run"
-myRandom = Orange.misc.Random()
+myRandom = Orange.core.RandomGenerator()
 for i in range(3):
-    res = Orange.evaluation.testing.proportion_test(learners, table, 0.7, random_generator=myRandom)
+    res = Orange.evaluation.testing.proportion_test(learners, voting, 0.7, randomGenerator=myRandom)
     printResults(res)
 # End
 print "\nproportionsTest that will give different results each time it is run"
 for i in range(3):
-    res = Orange.evaluation.testing.proportion_test(learners, table, 0.7,
+    res = Orange.evaluation.testing.proportion_test(learners, voting, 0.7,
         randseed=random.randint(0, 100))
     printResults(res)
 print "\nproportionsTest + storing classifiers"
-res = Orange.evaluation.testing.proportion_test(learners, table, 0.7, 100,
+res = Orange.evaluation.testing.proportion_test(learners, voting, 0.7, 100,
     storeClassifiers=1)
 print "#iter %i, #classifiers %i" % \
 print "\nGood old 10-fold cross validation"
-res = Orange.evaluation.testing.cross_validation(learners, table)
+res = Orange.evaluation.testing.cross_validation(learners, voting)
 printResults(res)
 print "\nLearning curve"
 prop = Orange.core.frange(0.2, 1.0, 0.2)
-res = Orange.evaluation.testing.learning_curve_n(learners, table, folds=5,
+res = Orange.evaluation.testing.learning_curve_n(learners, voting, folds=5,
     proportions=prop)
 for i in range(len(prop)):
 print "\nLearning curve with pre-separated data"
-indices = Orange.core.MakeRandomIndices2(table, p0=0.7)
-train = table.select(indices, 0)
-test = table.select(indices, 1)
+indices = Orange.core.MakeRandomIndices2(voting, p0=0.7)
+train = voting.select(indices, 0)
+test = voting.select(indices, 1)
 res = Orange.evaluation.testing.learning_curve_with_test_data(learners,
     train, test, times=5, proportions=prop)
```
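The testing-test.py example contrasts three seeding behaviours of `proportion_test`: a fixed default seed (identical splits every run), a caller-supplied generator that advances between calls (varying within a run, reproducible across runs), and a fresh `randseed` each call (different every run). The same three behaviours can be sketched with the standard library's `random.Random`; `proportion_split` below is a hypothetical stand-in, not Orange's function:

```python
# Sketch of the three seeding behaviours, using a plain train/test split.
import random

def proportion_split(data, p0, rng):
    """Shuffle a copy of `data` and split it p0 : (1 - p0)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * p0)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))

# 1. A fixed seed gives the same split on every call (the default-seed case).
a, _ = proportion_split(data, 0.7, random.Random(0))
b, _ = proportion_split(data, 0.7, random.Random(0))
assert a == b

# 2. A private generator advances between calls: splits differ within one
#    run but the whole sequence is reproducible across runs (randomGenerator=).
rng = random.Random(42)
c, _ = proportion_split(data, 0.7, rng)
d, _ = proportion_split(data, 0.7, rng)  # almost surely differs from c

# 3. Seeding from a fresh random value gives a different split each run
#    (the randseed=random.randint(...) case).
e, _ = proportion_split(data, 0.7, random.Random())
```

The design point the original script makes is the same: reproducibility lives in who owns the seed, not in the splitting routine itself.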