source: orange/orange/doc/reference/matrix.htm @ 6538:a5f65d7f0b2c

Revision 6538:a5f65d7f0b2c, 11.7 KB checked in by Mitar <Mitar@…>, 4 years ago (diff)

Made XPM version of the icon 32x32.

3<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
4<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
10<H1>Converting data to Numeric/numarray/numpy</H1>
11<index name="numeric">
12<index name="numarray">
13<index name="numpy">
14<index name="conversion to/from arrays">
16<P>Besides being a great language by itself, Python features a huge library of scientific functions called SciPy. SciPy is centered around a module called <code>numpy</code>, which evolved from <code>numarray</code> and <code>Numeric</code> (sometimes mistakenly referred to as <code>NumPy</code>), where <code>numarray</code> is a compatibility-breaking variation of <code>Numeric</code> (the one called <code>NumPy</code>) to which all numerical code Python was supposed to migrate. When it did not, module <code>numpy</code> again joined <code>numarray</code> and <code>Numeric</code> (aka <code>NumPy</code>). Or so they say. Orange proudly supports an integration with this mess in hope to make a lasting contribution to it. ;)</P>
18<P>Data from an <code>ExampleTable</code> can be converted to and from Numeric, numarray and numpy's arrays. Numpy also has a new type, record array. Due to lack of documentation for the C API and, even more, doubting that this type would be very useful, Orange at the moment doesn't know about it.</P>
20<P><SMALL>Note: Binary distributions do not necessarily include the listed modules; conversion will yield an error if they are not installed on your system. To build Orange from sources you will need to have numpy installed. Header files for Numeric and numarray are not needed since the necessary objects are binary compatible (or so it seems).</SMALL></P>
22<H2>From ExampleTable to an Array</H2>
24<P><code>ExampleTable</code>'s methods for conversion into various array types are called <code>toNumeric</code>, <code>toNumericMA</code>, <code>toNumarray</code>, <code>toNumarrayMA</code>, <code>toNumpy</code> and <code>toNumpyMA</code>. The functions with the "MA" prefix create mask arrays, where the mask denotes the defined values. Functions without the prefix will yield an error if the data includes any undefined values.
26<P>All functions accept the same set of three optional arguments: contents string, weight ID and the flag that tells how to treat the multinomial attributes.</P>
28<P>Contents string can include 'A' or 'a' for attribute values (excluding the class), 'C' or 'c' for class value, 'W' or 'w' for weight and '0' and '1' for constants 0.0 and 1.0. The same symbol can occur more than once. For instance, to convert an example table 'data' to a Numeric array that will have the class value in the first column, followed by the attributes and finally by two columns of zero's (to which we can later set some other data), we would call <CODE>data.toNumeric("CA00")</CODE></P>
30<P>If the data set doesn't have a class attribute, but the content string includes 'C', an exception is raised. If it includes only the lower case 'c', the corresponding column is omitted without an error or even a warning. Similar goes for 'W' and 'w': the former raises an exception if the weightID (the second argument to a call) is omitted or zero, while the latter simply omits the column. Finally, if 'A' is given and there are no attributes (except, possibly, the class attribute and ignored discrete attributes) an exception is raised.</P>
32<P>In addition to returning the matrix, the functions can return vectors of classes or weights. This is requested by putting a slash to the contents string, followed by a c, C, w and/or W. Like before, capital letters will yield an exception if the class or weight is absent, while in case of lower cases <CODE>None</CODE> is returned instead of the corresponding vector.</P>
34<P>The result of the function is a tuple containing the array and the requested vectors. If certain element is requested, but unavailable (e.g., we want the class, but the data is classless), <code>None</code> is used as a placeholder. If slash is the first character of the contents string, there will be no array. If there's no slash or it is the last character, we will have a one-element tuple containing only the array...</P>
36<P>The default contents is "a/cw" - a matrix with attribute values and separate vectors with classes and weights. Specifying an empty string has the same effect. If you would, for some reason, want a matrix with two columns with class values and three columns of 0's, and, besides that, a separate vector of classes and three vectors of weights, you would request this by "acc000/cwww". The three weight vectors will, however, be one and the same Python object, so modifying one will change all three of them. You can also repeat a's: the combination "ACC1Aw/ccw" will put two copies of attributes values in the matrix, they will be separated by two columns of classes and a column of 1's, and followed by the weights if they exist. In addition, it will return two copies of vector of classes and a vector of weights (or None if there are none).</P>
38<P>The third argument to the function specifies the treatment of non-continuous non-binary values (for binary values we have no problem: they are translated to continuous 0.0 or 1.0). The argument's value can be <CODE>ExampleTable.Multinomial_Ignore</CODE> (such attributes are omitted), <CODE>ExampleTable.Multinomial_AsOrdinal</CODE> (the attribute's values' indices are treated as continuous numbers) or <CODE>ExampleTable.Multinomial_Error</CODE> (such attributes are forbidden, so an exception is raised if they are encountered). Default treatment is <CODE>ExampleTable.Multinomial_AsOrdinal</CODE>.</P>
40<P>When the class attribute is discrete and has more than two values, an exception is raised unless multinomial attributes are treated as ordinal.</P>
42<P>The treatment of multinomial attributes offered by these functions is very limited. There are way more versatile <a href="TransformValue.htm><code>Continuizer</code> classes</a> for converting multinomial attributes into discrete, with which you can, for instance replace a discrete attribute with multiple indicator variables ets.</P>
44<P>Attributes other than the usual discrete and continuous values are represented by <code>orange.IllegalFloat</code> and also masked out if a masked array is being created. Note that <code>orange.IllegalFloat</code> is not NaN since NaNs cannot be compared (e.g. NaN sometimes equals any number, but sometimes two NaNs don't equal each other...). The reason why attributes of other types are stored but masked instead of simply ignored is that otherwise one would need to setup a map for translating attribute indices from the domain to the array.</P>
46<P>We shall show a few examples on the Iris data set with 150 examples described by four attributes and a three-valued class attribute.</P>
48<P>The default behaviour of the function is "a/cw" - returning a three-element tuple with an array, class vector (if there is a class attribute) and a vector of weights (which, by default, cannot be produced since it requires the second argument, a weight attribute ID).</P>
50<P class="header">a part of <a href=""></A></P>
51<XMP class=code>>>> data = orange.ExampleTable("../datasets/iris")
52>>> a, c, w = data.toNumpy()
53>>> a.shape
54(150, 4)
55>>> c.shape
57>>> w
58>>> a[0]
59array([ 5.0999999 ,  3.5       ,  1.39999998,  0.2       ])
60>>> c[0]
62>>> c[120]
66<P>When the array is to be used in linear regression, one would typically want the array to include a column of 1's, say as the first column.</P>
68<P class="header">a part of <a href=""></A></P>
69<XMP class=code>>>> a, c, w = data.toNumpy("1A/cw")
70>>> print a.shape
71(150, 5)
72>>> print a[0]
73[ 1.          5.0999999   3.5         1.39999998  0.2       ]
76<P>For a more perverse example, let's pack the array with a few additional columns: a column with class values will be followed by attributes, than a column of 1's, two more class columns and a column of zeros. This is just an exercise - probably nobody will ever need anything like this.</P>
78<P class="header">a part of <a href=""></A></P>
79<XMP class=code>>>> a, = data.toNumpy("ca1cc0")
80>>> a[0]
81array([ 0.        ,  5.0999999 ,  3.5       ,  1.39999998,  0.2       ,
82        1.        ,  0.        ,  0.        ,  0.        ])
83>>> a[130]
84array([ 2.        ,  7.4000001 ,  2.79999995,  6.0999999 ,  1.89999998,
85        1.        ,  2.        ,  2.        ,  0.        ])
88<P>If you prefer one of the other two numerical modules for Python, Numeric or numarray, you just need to call a different function, and they will wrap the array into a different class. Everything else stays the same.</P>
90<P>Finally, when there is missing data, you should use <code>toNumpyMA</code> (or its equivalents for other modules).</P>
94<H2>From a matrix to an ExampleTable</H2>
96<P>Arrays can be converted into <code>ExampleTable</code>s. This conversion can be explicit or implicit - generally any method that requires an example table will also accept an array and convert it on the fly. This method may not be desirable, though, since the attributes will get generic names and types, and won't be related to any other attributes. Most methods will fail if you attempt this without knowing what you are doing.</P>
98<P>There are two general scenarios for interfacing numeric libraries and Orange: the data can origin from an <code>ExampleTable</code>, from where it is converted into an array, then something is done to/with it and then we want to convert it back to an <code>ExampleTable</code>. In the other case the data comes from somewhere else, we have it into an array and finally want to put it into an <code>ExampleTable</code>.
100<P>Our examples will suppose that <code>a</code> is an array with attribute and class values from the Iris data set:</P>
101<xmp class="code">>>> data = orange.ExampleTable("../datasets/iris")
102>>> a = data.toNumarray("ac")[0]
105<P>The cleaner way to create an <code>ExampleTable</code> is to construct or reuse a <A HREF="Domain.htm"><CODE>Domain</CODE></A>, and call the ExampleTable's constructor, giving it a domain and the matrix. If the attribute is discrete, the value from the matrix is rounded to the closest integer which is then used as the attribute value's index.</P>
107<P>Constructing a domain is trivial.</P>
109<P class="header">a part of <a href=""></A></P>
110<XMP class=code>columns = "sep length", "sep width", "pet length", "pet width"
111classValues = "setosa", "versicolor", "virginica"
112d4 = orange.Domain(map(orange.FloatVariable, columns),
113                   orange.EnumVariable("type", values=classValues))
114t4 = orange.ExampleTable(d4, a)
117<P>This approach is suitable when the data doesn't come from an existing <code>ExampleTable</code>. When it does, we should reuse the domain, like this.</P>
119<P class="header">a part of <a href=""></A></P>
120<XMP class=code>t3 = orange.ExampleTable(data.domain, a)
123<P>There is another, quick and dirty conversion from an array to an <code>ExampleTable</code>: just call the <code>ExampleTable</code>'s constructor with the array as the only argument.</P>
125<P class="header">a part of <a href=""></A></P>
126<XMP class=code>>>> t2 = orange.ExampleTable(a)
127>>> print t2.domain.attributes, t2.domain.classVar
128<FloatVariable 'a1', FloatVariable 'a2', FloatVariable 'a3', FloatVariable 'a4', FloatVariable 'a5'> None
129>>> print t2[0]
130[5.100, 3.500, 1.400, 0.200, 0.000]
133<P>Lacking any information on attributes' names and types, all attributes are continuous (<code>FloatVariable</code>) and have generic names (a1, a2...). There is no class attribute. Note that if you construct two such tables (even if you do it from the same matrix) the attributes will have the same names but will be essentially different attributes. Avoid doing this, it's almost as bad as implicit conversions.</P>
Note: See TracBrowser for help on using the repository browser.