Preprocessing (`preprocess`)¶

class Orange.data.preprocess.DiscretizeEntropy(method=Orange.feature.discretization.Entropy())¶: A discretizer that uses the orange.EntropyDiscretization method but, unlike the Preprocessor_discretize class, also removes unused attributes from the domain.

class Orange.data.preprocess.RemoveContinuous¶: A preprocessor that removes all continuous features.

class Orange.data.preprocess.Continuize(zeroBased=True, multinomialTreatment=NValues, continuousTreatment=Leave, classTreatment=Ignore, **kwargs)¶: A preprocessor that continuizes a discrete domain (and optionally normalizes it). See Orange.data.continuization.DomainContinuizer for the list of accepted arguments.

class Orange.data.preprocess.RemoveDiscrete(zeroBased=True, multinomialTreatment=NValues, continuousTreatment=Leave, classTreatment=Ignore, **kwargs)¶: A preprocessor that removes all discrete attributes from the domain.

class Orange.data.preprocess.Impute(model=None, **kwargs)¶

A preprocessor that imputes unknown values using a learner.

Parameters:	model – a learner class.

class Orange.data.preprocess.FeatureSelection(measure=Orange.feature.scoring.Relief(), filter=None, limit=10)¶

A preprocessor that runs feature selection using a feature scoring function.

Parameters:
- measure – a scoring function (default: orange.MeasureAttribute_relief)
- filter – a filter function to use for selection (default: Preprocessor_featureSelection.bestN)
- limit – the limit for the filter function (default: 10)

Orange.data.preprocess.bestP(attrMeasures, P=10)¶: Return best P percent of attributes

Orange.data.preprocess.bestN(attrMeasures, N=10)¶: Return best N attributes
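The two filters above can be illustrated without Orange. The following is a minimal sketch, not the library's implementation: it assumes that attrMeasures is a list of (attribute, score) pairs in which a higher score is better, and that a percentage is rounded up to at least one attribute. The names best_n and best_p and the sample scores are hypothetical.

```python
import math


def best_n(attr_measures, n=10):
    """Return the n highest-scoring (attribute, score) pairs."""
    ranked = sorted(attr_measures, key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def best_p(attr_measures, p=10):
    """Return the top p percent of the pairs (assumed: rounded up, at least one)."""
    count = max(1, int(math.ceil(len(attr_measures) * p / 100.0)))
    return best_n(attr_measures, count)


# Hypothetical scores, for illustration only.
scores = [("age", 0.12), ("income", 0.40), ("zip", 0.01), ("height", 0.25)]
top_two = best_n(scores, 2)
top_half = best_p(scores, 50)
```

With four scored attributes, asking for the best two and for the best 50 percent selects the same pairs.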

Orange.data.preprocess.selectNRandom(examples, N=10)¶: Select N random examples.

Orange.data.preprocess.selectPRandom(examples, P=10)¶: Select P percent random examples.
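As a rough illustration of the two sampling filters, the sketch below draws examples without replacement from a plain Python list. It is not Orange's code: whether the library samples with or without replacement, and how it rounds a percentage, are assumptions here, and select_n_random, select_p_random and the rng argument are hypothetical names.

```python
import random


def select_n_random(examples, n=10, rng=random):
    """Pick n distinct examples at random (fewer if the data is smaller)."""
    examples = list(examples)
    return rng.sample(examples, min(n, len(examples)))


def select_p_random(examples, p=10, rng=random):
    """Pick p percent of the examples at random (assumed: rounded to nearest)."""
    examples = list(examples)
    count = int(round(len(examples) * p / 100.0))
    return select_n_random(examples, count, rng)


rows = list(range(50))        # stand-in for a data table
rng = random.Random(0)        # seeded generator for repeatable draws
subset = select_n_random(rows, 5, rng)
tenth = select_p_random(rows, 10, rng)
```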

class Orange.data.preprocess.RFE(filter=None, limit=10)¶

A preprocessor that runs RFE (Recursive Feature Elimination) using attribute weights derived from a linear SVM.

Parameters:
- filter – a filter function to use for selection (default: Preprocessor_featureSelection.bestN)
- limit – the limit for the filter function (default: 10)
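The general shape of recursive feature elimination can be sketched as a loop that re-computes weights and drops the weakest feature until limit features remain. This is only an outline under stated assumptions: the weight function is pluggable, and the fixed toy weights below merely stand in for the coefficients a linear SVM would produce after retraining. All names are hypothetical and none of this is taken from Orange's source.

```python
def recursive_feature_elimination(features, weight_fn, limit=10):
    """Drop the lowest-weighted feature until only `limit` features remain.

    weight_fn(remaining) must return {feature: weight}; it is called again
    after every removal, as a retrained model would be.
    """
    remaining = list(features)
    while len(remaining) > limit:
        weights = weight_fn(remaining)
        weakest = min(remaining, key=lambda feature: abs(weights[feature]))
        remaining.remove(weakest)
    return remaining


# Toy weights standing in for linear SVM coefficients (assumption).
TOY_WEIGHTS = {"a": 0.9, "b": -0.05, "c": 0.3, "d": -0.7}


def toy_weights(remaining):
    return dict((feature, TOY_WEIGHTS[feature]) for feature in remaining)


kept = recursive_feature_elimination(["a", "b", "c", "d"], toy_weights, limit=2)
```

Ranking by absolute weight keeps features with large negative coefficients, which is why "d" survives in this toy run.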

class Orange.data.preprocess.Sample(filter=None, limit=10)¶

A preprocessor that samples a subset of the data.

Parameters:
- filter – a filter function to use for selection (default: Preprocessor_sample.selectNRandom)
- limit – the limit for the filter function (default: 10)

class Orange.data.preprocess.PreprocessorList(preprocessors=())¶

A preprocessor wrapping a sequence of other preprocessors.

Parameters:	preprocessors – a list of `Preprocessor` instances
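The idea of wrapping a sequence of preprocessors can be sketched with plain callables. The class below is a hypothetical stand-in, not PreprocessorList itself, and it assumes that the preprocessors are applied in the order given, each receiving the output of the previous one.

```python
class PreprocessorChain(object):
    """Hypothetical stand-in that applies preprocessors one after another."""

    def __init__(self, preprocessors=()):
        self.preprocessors = list(preprocessors)

    def __call__(self, data):
        for preprocessor in self.preprocessors:
            data = preprocessor(data)
        return data


def drop_missing(rows):
    """Remove rows that contain an unknown (None) value."""
    return [row for row in rows if None not in row]


def keep_first_two(rows):
    """Keep only the first two rows."""
    return rows[:2]


chain = PreprocessorChain([drop_missing, keep_first_two])
result = chain([(1, 2), (None, 3), (4, 5), (6, 7)])
```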

class Orange.data.preprocess.RemoveUnusedValues(variable, data, remove_one_valued=False)¶

Removes unused values and reduces the variable, if a variable declares values that do not appear in the data.

Parameters:
- variable – `Descriptor`
- data – `Table`
- remove_one_valued – decides whether to remove or to retain the attributes with only one value defined (default: False)

Example:

import Orange

data = Orange.data.Table("unusedValues.tab")

# Apply RemoveUnusedValues to every variable in the domain. Each result is
# the original variable, a reduced copy of it, or None.
new_variables = [Orange.data.preprocess.RemoveUnusedValues(var, data)
                 for var in data.domain.variables]

print
# Report what happened to each variable.
for variable, new_variable in zip(data.domain.variables, new_variables):
    print variable,
    if new_variable == variable:
        print "retained as is"
    elif new_variable:
        print "reduced, new values are", new_variable.values
    else:
        print "removed"

There are four possible outcomes:

1. The variable does not have any used values in the data; its value is undefined for all examples. The variable is thus useless and the class returns None.

2. The variable has only one used value (or, possibly, only one value at all). Such a variable is in fact useless, and can probably be removed without harm. Nevertheless, its fate is decided by the flag remove_one_valued, which is False by default, so such variables are retained unless explicitly specified otherwise.

3. All of the variable’s values occur in the data (and the variable has more than one value; otherwise the above case applies). The original variable is returned.

4. There are some unused values. A new variable is constructed and the unused values are omitted. The value of the new variable is computed automatically from the value of the original variable; ClassifierByLookupTable is used for the mapping.

Results of example:

a	b	c	d	y
0 1	0 1 2 3	discrete	discrete	discrete
				class
0	0	?	0	0
1	2	?	0	0
0	0	?	0	1

Variables a and y are OK and are left alone. In b, value 1 is not used and is removed (not in the original variable, of course; a new variable is created). c is useless and is removed altogether. d is retained since remove_one_valued was left at False; if we set it to True, this variable would be removed as well.
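The four outcomes described above can be mirrored by a small decision function over plain lists. This is a sketch for illustration, not the class's implementation: a variable is represented only by its list of declared values, unknown values are None, and a one-valued variable that declares extra values is assumed to be reduced to its single used value when it is retained. The function name and sample columns are hypothetical.

```python
def reduce_values(declared, column, remove_one_valued=False):
    """Return None (remove), `declared` itself (retain) or a reduced value list."""
    observed = set(value for value in column if value is not None)
    used = [value for value in declared if value in observed]
    if not used:
        return None          # outcome 1: no value is ever used
    if len(used) == 1 and remove_one_valued:
        return None          # outcome 2, with the flag set to True
    if len(used) == len(declared):
        return declared      # outcome 3 (or a retained one-valued variable)
    return used              # outcome 4: unused values are omitted


binary = ["0", "1"]
kept_as_is = reduce_values(binary, ["0", "1", "0"])
reduced = reduce_values(["red", "green", "blue"], ["red", "blue", "red"])
removed = reduce_values(["x", "y"], [None, None, None])
one_valued = reduce_values(["0"], ["0", "0", "0"])
one_valued_dropped = reduce_values(["0"], ["0", "0", "0"], remove_one_valued=True)
```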

Data Scaling (`scaling`)¶

This module is a conglomerate of Orange 2.0 modules orngScaleData, orngScaleLinProjData, orngScaleLinProjData3D, orngScalePolyvizData and orngScaleScatterPlotData. The documentation is poor and has to be improved in the future.

class Orange.data.preprocess.scaling.ScaleData¶

getValidIndices(indices)¶: Get array with numbers that represent the instance indices that have a valid data value.

getValidList(indices, also_class_if_exists=1)¶: Get an array of 0s and 1s of length len(self.raw_data). If an instance has a missing value at any attribute in indices, the entry for that instance is 0.

getValidSubsetIndices(indices)¶: Get array with numbers that represent the instance indices that have a valid data value.

getValidSubsetList(indices, also_class_if_exists=1)¶: Get an array of 0s and 1s of length len(self.raw_subset_data). If an instance has a missing value at any attribute in indices, the entry for that instance is 0.

get_valid_indices(indices)¶: Get array with numbers that represent the instance indices that have a valid data value.

get_valid_list(indices, also_class_if_exists=1)¶: Get an array of 0s and 1s of length len(self.raw_data). If an instance has a missing value at any attribute in indices, the entry for that instance is 0.

get_valid_subset_indices(indices)¶: Get array with numbers that represent the instance indices that have a valid data value.

get_valid_subset_list(indices, also_class_if_exists=1)¶: Get an array of 0s and 1s of length len(self.raw_subset_data). If an instance has a missing value at any attribute in indices, the entry for that instance is 0.
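The relation between the validity list and the valid indices can be sketched on plain tuples, with None marking a missing value. This is an illustration only: it ignores the also_class_if_exists argument, and the function names and sample rows are hypothetical rather than taken from ScaleData.

```python
def valid_list(rows, indices):
    """1 for each row with no missing value in the given columns, else 0."""
    return [int(all(row[i] is not None for i in indices)) for row in rows]


def valid_indices(rows, indices):
    """Positions of the rows that valid_list marks with 1."""
    return [n for n, ok in enumerate(valid_list(rows, indices)) if ok]


# Hypothetical data: three instances, three attributes.
raw_data = [(1.0, 2.0, None), (None, 5.0, 3.0), (4.0, 1.0, 0.5)]
flags = valid_list(raw_data, [0, 1])
positions = valid_indices(raw_data, [0, 1])
all_columns = valid_list(raw_data, [0, 1, 2])
```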

mergeDataSets(data, subset_data)¶: Take examples from data and subset_data and merge them into one dataset.

merge_data_sets(data, subset_data)¶: Take examples from data and subset_data and merge them into one dataset.

rescaleData()¶: Force the existing data to be rescaled after changes to settings such as jitter_continuous or jitter_size.

rescale_data()¶: Force the existing data to be rescaled after changes to settings such as jitter_continuous or jitter_size.

rndCorrection(max)¶: Return a number from -max to max.

rnd_correction(max)¶: Return a number from -max to max.

scaleExampleValue(instance, index)¶: Scale instance’s value at index index to a range between 0 and 1 with respect to self.raw_data.

scale_example_value(instance, index)¶: Scale instance’s value at index index to a range between 0 and 1 with respect to self.raw_data.
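Scaling a value to the range between 0 and 1 with respect to a data column can be sketched as min-max scaling. That choice is an assumption: the text above does not say how the scaling is computed, and discrete attributes may be handled differently. The function name, the sample column and the fallback value for a constant column are hypothetical.

```python
def scale_value(value, column):
    """Min-max scale `value` against the known values in `column`."""
    known = [v for v in column if v is not None]
    low, high = min(known), max(known)
    if high == low:
        return 0.5  # arbitrary choice for a constant column (assumption)
    return (value - low) / float(high - low)


column = [2.0, 4.0, None, 10.0]   # None marks a missing value
scaled = scale_value(4.0, column)
```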

class Orange.data.preprocess.scaling.ScaleLinProjData¶

Bases: Orange.data.preprocess.scaling.ScaleData

createAnchors(num_of_attr, labels=None)¶: Create anchors around the circle.

createProjectionAsExampleTable(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

create_anchors(num_of_attr, labels=None)¶: Create anchors around the circle.

create_projection_as_example_table(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

getProjectedPointPosition(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods.

get_projected_point_position(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods.

saveProjectionAsTabData(filename, attrlist, use_anchor_data=0)¶: Save the projection (xattr, yattr, classval) into the file filename.

save_projection_as_tab_data(filename, attrlist, use_anchor_data=0)¶: Save the projection (xattr, yattr, classval) into the file filename.
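Placing anchors around a circle and projecting a point against them can be sketched as follows. The sketch assumes that anchors are evenly spaced on the unit circle, starting at angle 0, and that a point is the sum of the anchor positions weighted by the scaled attribute values; neither detail is confirmed by the text above, and radviz-style methods may additionally normalize the weights. The function names are hypothetical.

```python
import math


def create_anchors(num_of_attr, labels=None):
    """Return (x, y, label) anchors evenly spaced on the unit circle."""
    anchors = []
    for i in range(num_of_attr):
        angle = 2 * math.pi * i / num_of_attr
        label = labels[i] if labels else None
        anchors.append((math.cos(angle), math.sin(angle), label))
    return anchors


def project_point(scaled_values, anchors):
    """Weighted sum of anchor positions (unnormalized linear projection)."""
    x = sum(v * anchor[0] for v, anchor in zip(scaled_values, anchors))
    y = sum(v * anchor[1] for v, anchor in zip(scaled_values, anchors))
    return x, y


anchors = create_anchors(4, ["a", "b", "c", "d"])
point = project_point([1.0, 0.0, 1.0, 0.0], anchors)
```

In this toy case the first and third anchors lie opposite each other, so equal weights on them place the point at the centre of the circle, up to floating-point error.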

class Orange.data.preprocess.scaling.ScaleLinProjData3D¶

Bases: Orange.data.preprocess.scaling.ScaleData

createAnchors(num_of_attrs, labels=None)¶: Create anchors on the sphere.

createProjectionAsExampleTable(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

create_anchors(num_of_attrs, labels=None)¶: Create anchors on the sphere.

create_projection_as_example_table(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

getProjectedPointPosition(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods.

get_projected_point_position(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods.

saveProjectionAsTabData(filename, attrlist, use_anchor_data=0)¶: Save the projection (xattr, yattr, zattr, classval) into the file filename.

save_projection_as_tab_data(filename, attrlist, use_anchor_data=0)¶: Save the projection (xattr, yattr, zattr, classval) into the file filename.

class Orange.data.preprocess.scaling.ScalePolyvizData¶: Bases: Orange.data.preprocess.scaling.ScaleLinProjData

class Orange.data.preprocess.scaling.ScaleScatterPlotData¶

Bases: Orange.data.preprocess.scaling.ScaleData

createProjectionAsExampleTable(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

createProjectionAsExampleTable3D(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

create_projection_as_example_table(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

create_projection_as_example_table_3D(attr_indices, **settings_dict)¶: Create the projection of attribute indices given in attr_indices and create an example table with it.

getProjectedPointPosition(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods. settings_dict has to be accepted because radviz and polyviz have this parameter.

getXYDataPositions(xattr, yattr)¶: Create the x-y projection of the attributes xattr and yattr.

getXYSubsetDataPositions(xattr, yattr)¶: Create the x-y projection of the attributes xattr and yattr.

get_projected_point_position(attr_indices, values, **settings_dict)¶: For attributes in attr_indices and values of these attributes in values, compute point positions. This function makes more sense in the radviz and polyviz methods. settings_dict has to be accepted because radviz and polyviz have this parameter.

get_xy_data_positions(xattr, yattr)¶: Create the x-y projection of the attributes xattr and yattr.

get_xy_subset_data_positions(xattr, yattr)¶: Create the x-y projection of the attributes xattr and yattr.

Orange.data.preprocess.scaling.get_variable_values_sorted(variable)¶

Return a list of sorted values for given attribute.

Explanation: if a variable has the values 1, 2, 3, 4, ..., their order in Orange depends on when each value first appears in the data. This function returns them as a sorted list.

Orange.data.preprocess.scaling.get_variable_value_indices(variable, sort_values_for_discrete_attrs=1)¶: Create a dictionary for the given variable. Keys are the variable's values and values are their indices (transformed from string to int); if all values are integers, they are also sorted.
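The sorting and index lookup described for these two functions can be sketched on a plain list of value strings. This is an illustration under assumptions: values are sorted numerically only when every one of them parses as an integer, and are otherwise left in their original order. The function names are hypothetical and operate on lists rather than on Orange variables.

```python
def values_sorted(values):
    """Sort numerically if every value is an integer string, else keep the order."""
    try:
        return sorted(values, key=int)
    except ValueError:
        return list(values)


def value_indices(values, sort_values=True):
    """Map each value to its index in the (optionally sorted) value list."""
    ordered = values_sorted(values) if sort_values else list(values)
    return dict((value, index) for index, value in enumerate(ordered))


declared = ["10", "2", "1", "3"]   # order of first appearance in the data
ordered = values_sorted(declared)
lookup = value_indices(declared)
```

Sorting with key=int matters here: a plain string sort would place "10" before "2".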

Orange.data.preprocess.scaling.discretize_domain(data, remove_unused_values=1, number_of_intervals=2)¶: Discretize the domain. If we have a class, remove the instances with a missing class value, discretize a continuous class into a discrete class with two values, and discretize continuous attributes using entropy discretization (or equiN if we don’t have a class or the class is continuous).
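For the case without a usable class, the following sketch shows equal-frequency discretization, which is what equiN is taken to mean here (an assumption). Cut points are placed halfway between neighbouring sorted values so that each interval receives roughly the same number of instances. The function names and the sample values are hypothetical, and this is not the code that discretize_domain runs.

```python
def equal_frequency_cut_points(values, number_of_intervals=2):
    """Return cut points that split the known values into equally sized groups."""
    ordered = sorted(v for v in values if v is not None)
    if len(ordered) < number_of_intervals:
        return []
    cuts = []
    for k in range(1, number_of_intervals):
        position = k * len(ordered) // number_of_intervals
        # Place the cut halfway between the two neighbouring sorted values.
        cut = (ordered[position - 1] + ordered[position]) / 2.0
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def discretize(value, cut_points):
    """Index of the interval that `value` falls into."""
    return sum(1 for cut in cut_points if value > cut)


sample = [1, 2, 3, 4, 10, 20]
cuts = equal_frequency_cut_points(sample)
bins = [discretize(v, cuts) for v in sample]
```

With the default of two intervals, the single cut point falls between the third and fourth sorted values, so each interval holds three of the six sample values.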
