Basic Data Exploration
Until now (in Loading-in the data) we have looked only at data files that contain solely nominal (discrete) attributes. Let's now make things more interesting and look at a file with a mixture of attribute types. We will use the adult data set from the UCI ML Repository. The prediction task associated with this data set is to determine whether a person characterized by 14 attributes such as education, race, occupation, etc., makes over $50K/year. Because the original set adult.tab is rather big (32561 data instances, about 4 MB), we will first create a smaller sample of about 3% of the instances and use it in our examples. If you are curious how we do this, here is the code:
The code above loads the data and prepares a selection vector whose length equals the number of data instances. The vector contains 0's and 1's, with about 3% of the entries being 0. The instances that have a corresponding 0 in the selection vector are then selected and stored in an object called “sample”, and the sampled data is saved to a file. Note that MakeRandomIndices2 performs a stratified selection, i.e., the class distributions of the original and sampled data should be nearly the same.
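The Orange script itself is not reproduced here, but the idea behind a stratified selection vector can be sketched in plain Python (the function name, the toy class list, and the seed are illustrative assumptions, not Orange's actual implementation):

```python
import random

def make_random_indices2(classes, p0, seed=42):
    """Build a 0/1 selection vector with about p0 of 0's,
    stratified so each class contributes ~p0 of its instances."""
    rnd = random.Random(seed)
    selection = [1] * len(classes)
    # group instance indices by class value
    by_class = {}
    for i, c in enumerate(classes):
        by_class.setdefault(c, []).append(i)
    # within each class, mark ~p0 of the indices with 0
    for indices in by_class.values():
        k = round(p0 * len(indices))
        for i in rnd.sample(indices, k):
            selection[i] = 0
    return selection

# toy example: 100 instances, 70 of one class and 30 of the other
classes = ["<=50K"] * 70 + [">50K"] * 30
sel = make_random_indices2(classes, 0.03)
sample = [i for i, s in enumerate(sel) if s == 0]  # keep instances marked 0
print(len(sample))
```

Because the 3% quota is applied inside each class separately, the class distribution of the sample tracks that of the original data, which is what “stratified” means here.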
Basic Data Set Characteristics
For classification data sets, the basic characteristics are most often the number of classes, the number of attributes (and of these, how many are nominal and how many continuous), whether the data contains missing values, and the class distribution. Below is a script that reports all of this:
The first part is the one we know already: the script imports the Orange library into Python and loads the data. The information on the domain (class and attribute names, types, values, etc.) is stored in data.domain. Information on the class variable is accessible through the data.domain.classVar object, where data.domain.classVar.values stores a vector of its values. Its length is obtained using the function len(). Similarly, the list of attributes is stored in data.domain.attributes. Notice that to obtain information on the i-th attribute, this list can be indexed, e.g., data.domain.attributes[i].
To count the number of continuous and discrete attributes, we first initialized two counters (ncont, ndisc), and then iterated through the attributes (variable a is an iteration variable that is, in each pass of the loop, associated with a single attribute). The field varType contains the type of the attribute: for discrete attributes, varType is equal to orange.VarTypes.Discrete, and for continuous attributes it is equal to orange.VarTypes.Continuous.
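The counting logic can be illustrated in plain Python without Orange (the type labels below stand in for Orange's varType field, and the attribute names are made up for illustration):

```python
# each attribute paired with its type, mimicking data.domain.attributes
attributes = [("age", "continuous"), ("workclass", "discrete"),
              ("fnlwgt", "continuous"), ("education", "discrete")]

ncont, ndisc = 0, 0                      # initialize the two counters
for name, var_type in attributes:        # one attribute per pass, like variable a
    if var_type == "continuous":
        ncont += 1
    else:
        ndisc += 1

print("Attributes: %d, %d continuous, %d discrete"
      % (len(attributes), ncont, ndisc))
```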
To obtain the number of instances for each class, we first initialized a vector c of length equal to the number of different classes. Then, we iterated through the data; e.getclass() returns the class of an instance e, which is then turned into a class index (a number in the range from 0 to n-1, where n is the number of classes) and used as the index of the element of c that should be incremented.
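Here is that counting loop sketched in plain Python (a toy list of class labels replaces the Orange data table; the labels and their order are assumptions for illustration):

```python
class_values = [">50K", "<=50K"]       # plays the role of data.domain.classVar.values
data = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]  # toy class labels

c = [0] * len(class_values)            # one counter per class
for label in data:                     # e.getclass() would return this label
    c[class_values.index(label)] += 1  # class index selects the counter

print("Instances:", len(data), "total")
for value, count in zip(class_values, c):
    print(" ", count, "with class", value)
```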
Throughout the code, notice that a print statement in Python prints whatever items follow it on the line. The items are separated with commas, and Python will by default put a blank between them when printing. It will also print a newline, unless the print statement ends with a comma. It is possible to use the print statement with formatting directives, just like in C or C++, but this is beyond the scope of this text.
Running the above script, we obtain the following output (the run below uses the Command Prompt, but you may equally use PythonWin to load and run the script; see one of our previous lessons on this):
> python data_characteristics.py
Classes: 2
Attributes: 14, 6 continuous, 8 discrete
Instances: 977 total
  236 with class >50K
  741 with class <=50K
>
If you would like the class distribution printed as a proportion of each class in the data set, the last part of the script needs to be changed slightly. This time, we have used string formatting with print as well:
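The changed part is not shown above, but the formatting it describes can be sketched in plain Python using the class counts from this sample (236 and 741 out of 977):

```python
counts = {">50K": 236, "<=50K": 741}   # class counts from the sampled data
total = sum(counts.values())

print("Instances:", total, "total")
for value, n in counts.items():
    # C-style formatting directives, as mentioned in the text
    print("  %d(%4.1f%%) with class %s" % (n, 100.0 * n / total, value))
```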
The new script outputs the following information:
> python data_characteristics2.py
Classes: 2
Attributes: 14, 6 continuous, 8 discrete
Instances: 977 total
  236(24.2%) with class >50K
  741(75.8%) with class <=50K
>
As it turns out, there are more people who earn less than people who earn more… On a more technical note, such information may be important when you build your classifier: the baseline error for this data set is 1 − 0.758 = 0.242, and any model you construct should do better than that.
Contingency matrix for nominal and means for continuous attributes
Another interesting piece of information that we can obtain from the data is the distribution of classes for each value of a discrete attribute, and the means of the continuous attributes (we leave the computation of standard deviations and other statistics to you). Let's compute the means of the continuous attributes first:
This script iterates through the attributes (outer for loop) and, for attributes that are continuous (first if statement), computes a sum over all instances. The one new trick the script uses is checking whether an instance has a defined value for the attribute. Namely, for instance e and attribute a, e[a].isSpecial() is true if the value is not defined (unknown). Variable n stores the number of instances with a defined value for the attribute. For our sampled adult data set, this part of the code outputs:
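In plain Python, the averaging with a check for undefined values looks roughly like this (None stands in for Orange's unknown values, and the toy numbers are made up for illustration):

```python
values = [25, None, 40, 35, None, 60]  # one continuous attribute; None = unknown

total, n = 0.0, 0
for value in values:                   # "value is None" plays the role of e[a].isSpecial()
    if value is not None:
        total += value
        n += 1

mean = total / n                       # average over defined values only
print("mean over %d defined values: %.1f" % (n, mean))
```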
For nominal attributes, we could now write code that computes, for each attribute, how many times a specific value occurs for each class. Instead, we use the built-in method DomainContingency, which does just that. All our script has to do, then, is print the result in a readable form.
Notice that the first part of this script is similar to the one dealing with continuous attributes, except that the for loop is a little simpler. With continuous attributes, the iterator in the loop was an attribute index, whereas in the script above we iterate through the members of data.domain.attributes, which are objects that represent the attributes. Data structures in Orange that are addressed by attribute can most often be indexed either by attribute index, by attribute name (a string), or by the object that represents the attribute.
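DomainContingency itself needs Orange, but the table it builds can be sketched in plain Python (the toy sex/income pairs below are made up for illustration):

```python
# (attribute value, class) pairs for one nominal attribute
instances = [("Male", ">50K"), ("Male", "<=50K"), ("Female", "<=50K"),
             ("Male", ">50K"), ("Female", "<=50K"), ("Female", ">50K")]
class_values = [">50K", "<=50K"]

# contingency: attribute value -> vector of class counts
contingency = {}
for value, cls in instances:
    counts = contingency.setdefault(value, [0] * len(class_values))
    counts[class_values.index(cls)] += 1

for value, counts in contingency.items():
    print(value, counts)   # first number: >50K, second: <=50K
```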
The output of the code above is rather long (this data set has some attributes that have rather large sets of values), so we show only the output for two attributes:
First, notice that in these vectors the first number refers to the higher income class and the second number to the lower income class (e.g., from this data it looks like women earn less than men). Notice also that Orange prints the tuples in the form “< tuple-data >”. To change this, we would need another loop that iterates through the members of the tuples. You may also foresee that it would be interesting to compute proportions rather than numbers of instances in the above contingency matrix, but we leave that as an exercise.
It is often interesting to see, for a given attribute, what proportion of the instances have an unknown value for that attribute. We have already learned that the function isSpecial() can be used to determine whether, for a specific instance and attribute, the value is undefined. Let us use this function to compute the proportion of missing values for each attribute:
The integer variable natt stores the number of attributes in the data set. An array missing stores the number of missing values per attribute; its size is therefore equal to natt, and all of its elements are initially 0 (in fact, 0.0, since we purposely initialized it with real numbers, which helps later when we convert the counts to percentages).
The only line that possibly looks (very?) strange is missing = map(lambda x, l=len(data): x/l*100., missing). This line could be replaced with a for loop, but we wanted to include it to show how coding in Python may look strange yet gain in conciseness. The function map takes a vector (in our case missing) and executes a function on each of its elements, thus obtaining a new vector. The function it executes is, in our case, defined inline, in what Python calls a lambda expression. You can see that our lambda function takes a single argument (when mapped, an element of the vector missing) and returns its value normalized by the number of data instances (len(data)), multiplied by 100 to turn it into a percentage. Thus, the map call in fact normalizes the elements of missing to express the proportion of missing values across the instances of the data set.
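The whole computation can be sketched in plain Python (toy data with None marking unknowns; note that in Python 3, map returns an iterator, so the result is wrapped in list()):

```python
# toy data: rows of attribute values, None marking an unknown value
data = [[25, "Private"], [None, "Private"], [40, None], [35, "State-gov"]]
natt = 2                          # number of attributes
attr_names = ["age", "workclass"]

missing = [0.0] * natt            # 0.0 so later division stays floating-point
for row in data:
    for i in range(natt):
        if row[i] is None:        # plays the role of e[a].isSpecial()
            missing[i] += 1

# the map/lambda line from the text, adapted to Python 3
missing = list(map(lambda x, l=len(data): x / l * 100., missing))

print("Missing values per attribute:")
for name, pct in zip(attr_names, missing):
    print("  %4.1f%% %s" % (pct, name))
```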
Finally, let us see the output of the script we have just been working on:
> python report_missing.py
Missing values per attribute:
   0.0% age
   4.5% workclass
   0.0% fnlwgt
   0.0% education
   0.0% education-num
   0.0% marital-status
   4.5% occupation
   0.0% relationship
   0.0% race
   0.0% sex
   0.0% capital-gain
   0.0% capital-loss
   0.0% hours-per-week
   1.9% native-country
In our sampled data set, just three attributes contain missing values.
Basic Data Analysis with orange.DomainDistributions
For some of the tasks above, Orange provides a shortcut by means of the orange.DomainDistributions function, which returns an object that holds averages and mean square errors for continuous attributes, value frequencies for discrete attributes, and, for both, the number of instances where the attribute has a missing value. The use of this object is exemplified in the following script:
Check this script out. Its results should match the results we derived with the other scripts in this lesson.
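To show what such a shortcut computes, here is a plain-Python sketch of the same summaries on toy data (Orange's actual DomainDistributions object is richer; the data below is made up for illustration):

```python
# toy data: one continuous and one discrete attribute, None = missing
ages = [25, 40, None, 35]
workclass = ["Private", "Private", "State-gov", None]

# average over the defined continuous values
defined = [v for v in ages if v is not None]
average = sum(defined) / len(defined)

# value frequencies for the discrete attribute
freq = {}
for v in workclass:
    if v is not None:
        freq[v] = freq.get(v, 0) + 1

# number of missing values, for both kinds of attributes
missing = [ages.count(None), workclass.count(None)]

print("age: average %.2f, %d missing" % (average, missing[0]))
print("workclass: %s, %d missing" % (freq, missing[1]))
```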
This lesson taught some basics of Orange scripting and, not really intentionally, some basics of Python programming. Perhaps the most important part was accessing the important pieces of the data: data instances, attribute values of data instances, the class variable and its properties, the vector of objects that store information about the attributes, etc. What we have shown here was very much inclined toward the classical machine learning type of data analysis, where the data is classified and the classes are nominal. Of course, data need not be like that: it may be labelled with continuous classes, or may not be classified at all. In any case, the concepts we have presented apply to those types of data sets as well.
From here, your pathway through our tutorial need not follow the order presented in the list of topics. Instead of learning how to build classification models, you may want to branch off to see how Orange deals with regression or other tasks. Though, we have to admit, Orange and its authors are currently highly biased toward predictive data mining and building classification models, so those sections of this tutorial may be more elaborate than others.