source: orange/Orange/doc/reference/PythonVariable.htm @ 9671:a7b056375472

Revision 9671:a7b056375472, 24.4 KB checked in by anze <anze.staric@…>, 2 years ago (diff)

Moved orange to Orange (part 2)

3<LINK REL=StyleSheet HREF="../style.css" TYPE="text/css">
4<LINK REL=StyleSheet HREF="style-print.css" TYPE="text/css" MEDIA=print>
10<H1>Attribute Types Defined in Python</H1>
11<index name="attribute types defined in Python">
13<P><SMALL><B>Note</B>: this page includes some advanced technical details. The recommended approach is that you read it and ignore the parts you don't understand. If the things later don't work as expected, read it again...</SMALL></P>
15<P><SMALL><B>Warning</B>: at the time of writing this (Aug 24 2004), this stuff is relatively untested, but we will use it in our own work as a kind of beta-testing. Please report any bugs (or remind as to remove this notice eventually :)</SMALL></P>
17<P>Besides the usual discrete and continuous attributes, which are used by learning algorithms, and strings and distributions that are here for convenience, Orange also supports arbitrary attribute types which defined in Python, that is, attributes with descriptors that are Python classes derived from <CODE><INDEX name="classes/PythonVariable">PythonVariable</CODE> (which is itself derived from <CODE>Variable</CODE>).
19<P>Such attributes cannot be used by Orange's learning methods, since most learning algorithms only handle discrete and continuous attributes (with many of them covering only one of the two types). You can, however, use attributes defined in Python in your specific learning algorithms. Another use for such attributes can be describing the examples: by using Python-defined attributes as meta attributes, you can attach arbitrary descriptors to examples. These descriptors won't be used while learning, but can be useful when presenting the examples, or by any auxiliary processes, such as example subset selections. Finally, Python-defined attributes can store data that is converted to ordinary (discrete and continuous attributes) when needed. If a Python-defined value is a list with the dates of patient's visits to the doctor, it can be used for constructing a continuous attribute that will tell the number of visits, the longest span between two consecutive visits or the time between the first and the second visit.</P>
21<P>Python attributes can be constructed in a script or loaded from the old-style tab-delimited file, as described below. No other file formats can accommodate for these attributes.</P>
23<P class=section>Attributes</P>
24<DL class=attributes>
26<DD>Affects the way the data is saved and loaded. See the section on loading/saving values. Default is false (using <CODE>__str__</CODE> is preferred over pickling).</DD>
29<DD>Tells what kind of data will the overloaded methods get and return. If true (default), the methods will deal with pure Python objects; if false, the methods will get and should return objects of type <CODE>Value</CODE>, and the corresponding Python objects can be stored into the field <CODE>svalue</CODE>. The following documentation is written as if <CODE>useSomeValue</CODE> is set. If it's not (which you'll seldom need), you need to modify the <CODE>str2val</CODE>, <CODE>val2str</CODE> and similar functions accordingly.</DD>
32<P>Associated with <CODE>PythonVariable</CODE> is a type <CODE>PythonValue</CODE>. <CODE>PythonValue</CODE> is a class derived from <CODE>SomeValue</CODE> (therefore a sibling of <CODE>StringValue</CODE> and <CODE>Distribution</CODE>) and stores a Python object. You will most often do without explicitly using <CODE>PythonValue</CODE>, since Orange will usually do the conversion for you, except in the cases where this could lead to ambiguity and hard-to-find errors in your scripts. Read on, and you shall see where and why.</P>
34<H2>Simple attribute values in Python</H2>
36<P>Say we have some data loaded and would like to add an attribute with some Python values to the examples. The easiest way to do this is to attach a meta attribute, like this.</P>
38<P class="header">part of <A href=""></A></P>
39<XMP class=code>import orange
41data = orange.ExampleTable("lenses")
43newattr = orange.PythonVariable("foo")
44data.domain.addmeta(orange.newmetaid(), newattr)
46data[0]["foo"] = ("a", "tuple")
47data[1]["foo"] = "a string"
48data[2]["foo"] = orange
49data[3]["foo"] = data
52<P>The example is certainly weird and senseless, but it shows that value of such attribute can be just anything, from tuples and strings to arbitrary Python objects, such as models and even the example table itself.</P>
54<P>If you now check the value of <CODE>data[1]["foo"]</CODE> you will discover that it is not a string but <CODE>orange.Value</CODE>. Sure, examples store values, not just anything you throw into them. Orange did the conversion automatically at the above assignments. The actual value can be <I>read</I> through the <CODE>Value</CODE>'s field <CODE>value</CODE>. Therefore, <CODE>data[1]["foo"].value</CODE> will return the string <CODE>"a string"</CODE>. And if you, for any perverse reason, want to use the Bayesian learner through the module and data stored in the attributes, you would write <CODE>data[2]["foo"].value.BayesLearner(data[3].value)</CODE> (which is, of course, equivalent to <CODE>orange.BayesLearner(data)</CODE>).
56<P>There is a subtlety using <CODE>value</CODE> field; assigning, say, <CODE>data[1]["foo"].value = 15</CODE> won't work as intended - see the beginning of documentation on <A href="Value.htm"><CODE>Value</CODE></A> for explanation.</P>
58<P>Like any attribute, <CODE>PythonVariable</CODE> can compute its values from values of other attributes, as described in the documentation on the <CODE>Variable</CODE>'s method <a href="Variable.htm#getValueFrom"><CODE>getValueFrom</CODE></A>. Let us show how this is done on another not necessarily useful example: we shall construct a an attribute whose value will be a list of indices, representing the values of other example's attributes.</P>
60<P class="header">part of <A href=""></A></P>
61<XMP class=code>def extolist(ex, wh=0):
62    return orange.PythonValue(map(int, ex))
64listvar = orange.PythonVariable("a_list")
65listvar.getValueFrom = extolist
67newdomain = orange.Domain(data.domain.attributes + [listvar], data.domain.classVar)
68newdata = orange.ExampleTable(newdomain, data)
71<P>The first few examples in the <CODE>newdata</CODE> look like this.</P>
73<XMP CLASS="code">['young', 'myope', 'no', 'reduced', '[0, 0, 0, 0, 0]', 'none']
74['young', 'myope', 'no', 'normal', '[0, 0, 0, 1, 1]', 'soft']
75['young', 'myope', 'yes', 'reduced', '[0, 0, 1, 0, 0]', 'none']
76['young', 'myope', 'yes', 'normal', '[0, 0, 1, 1, 2]', 'hard']
79<P>Each element of the list corresponds to an index of the attribute value.</P>
81<P>Note that the function <CODE>extolist</CODE>, which we used as a classifier to put in <CODE>listvar</CODE>'s <CODE>getValueFrom</CODE> explicitly constructs a <CODE>PythonValue</CODE>. Couldn't we just write <CODE>return map(int, ex)</CODE>, and let Orange treat this a value? Well, it's time to describe the story behind <CODE>PythonValue</CODE>.</P>
83<P>As you've probably read in the documentation on <A href="Value.htm"><CODE>Value</CODE></a>, <CODE>Value</CODE> can store an integer used as an index of a discrete attribute value, a floating-point value of continuous attribute or a value derived from <CODE>SomeValue</CODE>. The latter is stored in the value's field <CODE>svalue</CODE> (or so it seems from Python; the actual C++ field is named differently). Field <CODE>value</CODE> is a kind of synonym for all three - it can return an integer, a float of <CODE>SomeValue</CODE>, depending upon the attribute (value) type.</P>
85<P>In the first example, we have set the value of <CODE>data[i]["foo"]</CODE>. Orange knows that the corresponding attribute (<CODE>data.domain["foo"]</CODE>) is of type <CODE>PythonVariable</CODE> and converts the passed value (a tuple, string, module, object) accordingly. If <CODE>data.domain["foo"]</CODE> was a discrete attribute (<CODE>EnumVariable</CODE>) it would attempt accept the value <CODE>"string"</CODE> (if "string" was among the possible attribute's values) and raise a type error in other cases.</P>
87<P>No such check can be done in function <CODE>extolist</CODE>. Classifiers are expected to return values and Orange would be all to happy to convert a list returned by <CODE>map(int, ex)</CODE> to a <CODE>Value</CODE> if it only knew how. But it has no idea about which type of attribute's value is this supposed to be. If this is a value of <CODE>PythonVariable</CODE>, it's alright, but if it's a discrete attribute, we'd have to raise an exception. Orange could, in principle, observe the value's type, conclude that this cannot be anything else than a <CODE>PythonVariable</CODE> and return a <CODE>PythonValue</CODE>, but this would be dangerous: anytime you would misconstruct a value, Orange would silently convert it to <CODE>PythonValue</CODE>, which would cause troubles God knows where.</P>
89<P>There is however a workaround. You can do this as follows.</P>
91<XMP class=code>def extolist(ex, wh=0):
92    return map(int, ex)
94listvar = orange.PythonVariable("a_list")
95listvar.getValueFrom = extolist
96listvar.getValueFrom.classVar = listvar
99<P>Orange now knows that the classifier returns values of attribute <CODE>listvar</CODE>, which is of type <CODE>PythonVariable</CODE>, so it can convert <CODE>map(int, ex)</CODE> into a value. (OK, could you write <CODE>extolist.classVar = listvar</CODE>? See documentation on <A href="callbacks.htm">deriving classes from Orange classes</A> for an explanation why not. And, again, if you don't understand something here on this page, just skip it.)</P>
102<H2>Storing/Reading from Files and Deriving new attribute types</H2>
104<P>Storing Python values to files and reading them, and deriving new attribute types from <CODE>PythonVariable</CODE> are two very related topics. The basic job of attribute descriptors, that is, instances of classes derived from <CODE>Variable</CODE> is to convert the values to and from a string representation, so that they can be saved and loaded from text-based files (in whatever format and with whichever delimiters), and printed and set by the user.</P>
106<P>All attribute descriptors define methods <CODE>str2val</CODE> for converting a string to a <CODE>Value</CODE> and <CODE>val2str</CODE> for the opposite, the first getting a string and returning the value and the other is just the opposite. You don't need to know about these two methods for other attribute types (and even have no direct access to them), but you indirectly use them all the time. If you inquire about the value of <CODE>data[0]["age"]</CODE> and see it's "young" or if you set it to "presbyopic", this goes through <CODE>data.domain["age"]</CODE>'s <CODE>str2val</CODE> and <CODE>val2str</CODE>, respectively.</P>
108<P>If you want to define a special syntax for your Python-based attribute, you will need to derive a new Python class from <CODE>PythonVariable</CODE> and define the two functions.</P>
110<P class="header">part of <A href=""></A></P>
111<XMP class=code>import orange, time
113class DateVariable(orange.PythonVariable):
114    def str2val(self, str):
115        return time.strptime(str, "%b %d %Y")
117    def val2str(self, val):
118        return time.strftime("%b %d %Y (%a)", val)
121<P>Here we defined an attribute to represent a date. We used Python's module <CODE>time</CODE> whose functions <CODE>strptime</CODE> and <CODE>strftime</CODE> convert a date, represented as a string in a given format to an instance of <CODE>time.struct_time</CODE>, used for representing dates, and back. The string formats for <CODE>str2val</CODE> and <CODE>val2str</CODE> do not need to match. See this.</P>
123<XMP class=code>>>> birth = DateVariable("birth")
124>>> val = birth("Aug 19 2003")
125>>> print val
126Aug 19 2003 (Tue)
129<P>When giving a value, we specify a month (a three-letter abbreviation), a day of month and a year. When the value is printed, a weekday is added.</P>
131<P>Special values are treated separately: empty strings, question marks and tildes are converted to values without calling <CODE>str2val</CODE> and special values are converted to string without <CODE>val2str</CODE>. However, <CODE>str2val</CODE> can still return a special value it the string denotes one in some special syntax used. To do this, it should return <CODE>PythonValueSpecial(type)</CODE>, where type is <CODE>orange.ValueTypes.DC</CODE> (which equals 1 and means don't care), <CODE>orange.ValueTypes.DK</CODE> (2, don't know) or any other non-zero integer (which will denote a special value of other types you need).</P>
133<P>Let us construct an example table that will include a new attribute: we shall load the lenses data set, add the new attribute and set its value for the first example.</P>
135<P class="header">part of <A href=""></A> (uses <a href="">lenses</a>)</P>
136<XMP class=code>data = orange.ExampleTable("lenses")
138newdomain = orange.Domain(data.domain.attributes + [birth], data.domain.classVar)
139newdata = orange.ExampleTable(newdomain, data)
141newdata[0]["birth"] = "Aug 19 2003"
142print newdata[0]
145<P>You can also save the <CODE>newdata</CODE> to a tab-delimited file (other formats do not support Python-based attributes).</P>
147<P>If <CODE>val2str</CODE> is not defined, Orange will "print" the value to a string. The alternative to defining the <CODE>DateVariable</CODE>'s <CODE>val2str</CODE> is defining a special Python class that will represent a date and overload its method <CODE>__str__</CODE>, like in the following example.</P>
149<P class="header">part of <A href=""></A></P>
150<XMP class=code>class DateValue(orange.SomeValue):
151    def __init__(self, date):
152 = date
154    def __str__(self):
155        return time.strftime("%b %d %Y (%a)",
157class DateVariable(orange.PythonVariable):
158    def str2val(self, str):
159        return DateValue(time.strptime(str, "%b %d %Y"))
163<P>You may sometimes want to use a different string representation for saving and loading from files. This will be useful when the object is rather complex, so you would need a simpler (yet possibly inaccurate) form for printing the value and a more complex form for storing it. Also, it may be sometimes inconvenient or even impossible to parse the human-readable strings. Finally, we would even have problems saving the above attribute since <CODE>str2val</CODE> and <CODE>val2str</CODE> use different date formats.</P>
165<P>To define a different representation for saving values to files, you need to define methods <CODE>filestr2val</CODE> and <CODE>val2filestr</CODE>. They are similar to <CODE>str2val</CODE> and <CODE>val2str</CODE>, except that they get an additional argument: an example that is being read or written. In the former case, the example may be half constructed: the line in a file is always interpreted from left to right, so some values are already set while other are random (you may notice they are actually not, but refer from using them to avoid incompatibilities with future versions of Orange).</P>
167<P>For our <CODE>DateVariable</CODE>, the two additional functions could, for instance, look as follows.</P>
169<P class="header">part of <A href=""></A></P>
170<XMP class=code>    def filestr2val(self, str, example):
171        if str == "unknown":
172            return orange.PythonValueSpecial(orange.ValueTypes.DK)
173        return DateValue(time.strptime(str, "%m/%d/%Y"))
175    def val2filestr(self, val, example):
176        return time.strftime("%m/%d/%Y", val)
179<P>We have added a new representation for unknown values: string "unknown" translated to <CODE>DK</CODE>. Just for fun, we use a different date format - month (given numerically), day and year, divided by slashes.</P>
181<P><CODE>PythonVariable</CODE> has a flag <CODE>usePickle</CODE>. If set and <CODE>val2filestr</CODE> is undefined, Orange will pickle values when saving to a file. To accommodate for the file's limitations, newlines in the pickled string are changed to "\n" (if you attempt to manually unpickle the strings you find in files, you'll need to convert this back). See Python documentation on module "pickle" for details on pickling; basically, Orange will use <CODE>pickle.dumps</CODE> function which can convert practically any Python object to a string (a concept which is also known as serialization (in Java) or marshalling).</P>
183<P>Finally, here's how loading and saving from files goes. Converting a value read from the file goes like this:
185<LI>If value is an empty string or a question mark, it's don't know. If it's a tilde, it's don't care. You cannot override that.</LI>
186<LI>If <CODE>filestr2val</CODE> is defined, it's called. Error is reported on error.</LI>
187<LI>If <CODE>usePickled</CODE> is not set and <CODE>str2val</CODE> is defined, <CODE>str2val</CODE> is called. Error is reported if this fails.</LI>
188<LI>If <CODE>usePickled</CODE> is set, Python attempts to unpickle the string. If this fails, Orange continues with the next step.</LI>
189<LI>As a final attempt, Orange will treat the string as a Python expression. The scope (local and global variables) will be the same as at the call of <CODE>ExampleTable(filename)</CODE>. The example that is being constructed (the same object as the last argument of <CODE>filestr2val</CODE>) will be present as a local variable <CODE>__fileExample</CODE>).</LI>
190<LI>If none of this succeeds, Orange reports an error.</LI>
194<P>Writing to a file goes like follows.
196<LI>Special values are represented with question marks and tildes.</LI>
197<LI>If <CODE>val2filestr</CODE> is defined, it's used.</LI>
198<LI>If <CODE>usePickle</CODE> is set, the value is pickled.</LI>
199<LI><CODE>val2str</CODE> is called if defined.</LI>
200<LI>The value is printed. This can never fail, but usually won't give useful results if you don't redefined the value's <CODE>__str__</CODE>, as shown above.</LI>
203<P>Although it may seem complicated, this order is natural and will work seamlessly - if you just redefine what you think sensible, Orange will probably work it out fine. Only if it doesn't, check the above steps to determine what went wrong.</P>
205<H2>Other Methods You Can Redefine</H2>
207<P>Attribute descriptors derived from <CODE>Variable</CODE> may support methods for returning a sequence of values, random values and the number of different values. You don't need to provide those methods. If you want, here are the methods you will need to define.</P>
209<P class=section>Overloadable methods of <CODE>PythonVariable</CODE></P>
210<DL class=attributes>
212<DD>Returns the first value of the attribute.</DD>
214<DT>nextValue(self, value)</DT>
215<DD>Returns the next value after <CODE>value.</CODE></DD>
217<DT>randomValue(self, int)</DT>
218<DD>Returns a random value. It is desirable that the method uses the given integer as an argument for constructing the value, <I>i.e.</I> for initializing the random number generator or through some hashing scheme.</DD>
221<DD>Returns the number of different values.</DD>
224<P>You can also redefine two <CODE>PythonValue</CODE>'s methods. Besides <CODE>__str__</CODE> which we examined above, you can also redefine <CODE>__cmp__(self, other)</CODE> which should return a negative integer is <CODE>self</CODE> is smaller than <CODE>other</CODE>, zero if they are equal and positive integer if <CODE>self</CODE> is greater. The meaning of "smaller" and "greater" depends upon the type of the attribute. If you leave the function undefined, Orange will use the Python's built-in comparison function. Which is great, since all decent Python objects support sensible comparisons. For instance, let us continue with the example in which we first defined <CODE>DateVariable</CODE> (but haven't defined <CODE>DateValue</CODE>).</P>
226<P class="header">part of <A href=""></A></P>
227<XMP class=code>newdata[0]["birth"] = "Aug 19 2003"
228newdata[1]["birth"] = "Jan 12 1998"
229newdata[2]["birth"] = "Sep 1 1995"
230newdata[3]["birth"] = "May 25 2001"
233print "\nSorted data"
234for i in newdata:
235    print i
238<P>This sorts the examples according to the date of birth (it might well be that people born in 2003 do not wear contact lenses at the time of writing this documentation, but we expect Orange to be around for a while :).</P>
240<P>This won't work with <CODE>PythonValue</CODE> as we defined it (the second example, where we showed how to redefine <CODE>__str__</CODE>). To fix it, we need to define a method <CODE>__cmp__</CODE> for <CODE>PythonValue</CODE>. The easiest way is to pass the work to Python, like this.</CODE>
242<P class="header">part of <A href=""></A></P>
243<XMP class=code>    def __cmp__(self, other):
244        return cmp(,
248<H2>Tab-delimited format extensions</H2>
250<P>Tab-delimited format is the only format that supports Python-based attribute values. (Some other, especially Excel, may follow soon.) The attribute type (the second row) can be given in three ways.
252<LI>If you define the attribute as <CODE>python</CODE>, a descriptor of type <CODE>PythonVariable</CODE> will be constructed. Orange will try to unpickle the attribute values and treat them as Python expressions if this fails. This way you can given strings, lists and tuples, and as you will see soon, also more complex types.</LI>
254<LI>If you define the attribute as <CODE>python:AttributeDescriptorType</CODE>, the attribute of the corresponding type will be constructed. Type may be <CODE>PythonVariable</CODE> (in which case writing only <CODE>python</CODE> would have the same effect), it can be <CODE>DateVariable</CODE> or even <CODE>orange.EnumVariable</CODE> (having the same effect as writing <CODE>d</CODE> or <CODE>discrete</CODE>). The descriptor type must be defined in the Python scope from which the example loading was called. In addition, you must give a full name of the type. If <CODE>DateVariable</CODE> is defined in a module <CODE>orngDates</CODE> and you imported it using <CODE>import orngDates</CODE> (and not <CODE>from orngDates import *</CODE>), the corresponding type definition would be <CODE>python:orngDates.DateVariable</CODE>.</LI>
256<LI>Finally, the attribute can be defined by <CODE>python:<I>expression</I></CODE>, where <CODE><I>expression</I></CODE> is any Python expression whose evaluation results in an attribute descriptor. You will most often use it to call the descriptor's constructor, for instance <CODE>python:DateVariable(arg1, arg2, arg3=15)</CODE> or <CODE>python:orange.FloatVariable(numberOfDecimals=5)</CODE>. You can however use arbitrary expressions here. The type can be defined as <CODE>python:evar</CODE>, and you would need to execute something like <CODE>evar = orange.EnumVariable()</CODE> prior to loading the data. Or, you can call a function that returns an attribute descriptor, or select an attribute descriptor from a list...</LI>
259<P>The latter tricks - putting arbitrary expressions into can sometimes come handy for specifying the values as well. Here's an example of a file with Python expressions used for specifying values.</P>
261<P class="header"><A href=""></A></P>
262<XMP class=code>tear_rate    foo                  lenses
263discrete     python               none soft hard
264                                  class
265reduced      [0, 0, 0, 0, 0]      none
266normal       A(3.14)              soft
267reduced      math.sqrt(4+a)       none
268normal       a*5                  hard
269reduced      [0, 1, 0, 0, 0]      none
270normal       perfectSquares(100)  soft
271reduced      [0, a, 1, 0, 0]      none
274<P>Loading this file requires defining <CODE>A</CODE> (this shall be some class), import <CODE>math</CODE> so we can compute <CODE>math.sqrt</CODE>, define <CODE>a</CODE> to be some number, and <CODE>perfectSquare(n)</CODE> to be a function (which will, in our case, return a list of perfect squares for up to <CODE>n</CODE>).</P>
276<P class="header">part of <A href=""></A></P>
277<XMP class=code>import orange, math
279def perfectSquares(x):
280    return filter(lambda x:math.floor(math.sqrt(x)) == math.sqrt(x), range(x+1))
282class A:
283    def __init__(self, x):
284        self.x = x
285    def __str__(self):
286        return "value: %s" % self.x
288a = 12
290data = orange.ExampleTable("")
291for i in data:
292    print i
295<P>And here's the output:</P>
296<XMP class=code>['reduced', '[0, 0, 0, 0, 0]', 'none']
297['normal', 'value: 3.14', 'soft']
298['reduced', '4.0', 'none']
299['normal', '60', 'hard']
300['reduced', '[0, 1, 0, 0, 0]', 'none']
301['normal', '[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]', 'soft']
302['reduced', '[0, 12, 1, 0, 0]', 'none']
Note: See TracBrowser for help on using the repository browser.