Orange Forum • View topic - Corrupted C45 trees when unpickling

Report bugs (or imagined bugs).
(Archived/read-only, please use our ticketing system for reporting bugs and their discussion.)

Corrupted C45 trees when unpickling

Post by eliasz » Mon Mar 22, 2010 14:52

Some decision trees from C45Learner are corrupted when unpickled. To make things worse, the outcome depends on the order of unpickling. The corruption can be observed when trying to display the tree with orngC45.printTree. The underlying error is that some leaves end up with an out-of-range value of the class var (that is, node.leaf >= len(tree.classVar.values)).
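The out-of-range condition above can be checked mechanically. The following is only a minimal sketch: it assumes that each node exposes `leaf` (the index of the predicted class value) and `branch` (a list of child nodes, or nothing for a leaf). The `Node` class is a hypothetical stand-in so that the snippet runs without Orange installed.

```python
class Node(object):
    """Hypothetical stand-in for a tree node; not an Orange class."""
    def __init__(self, leaf, branch=None):
        self.leaf = leaf      # index of the predicted class value
        self.branch = branch  # list of child nodes (entries may be None), or None


def find_out_of_range(node, n_class_values, path=()):
    """Return the paths (tuples of branch indices) of all nodes whose
    class index does not fit into a value list of length n_class_values."""
    bad = []
    if int(node.leaf) >= n_class_values:
        bad.append(path)
    for i, child in enumerate(getattr(node, "branch", None) or []):
        if child is not None:  # skip missing children
            bad.extend(find_out_of_range(child, n_class_values, path + (i,)))
    return bad


# A toy tree with three class values; the node at branch 1 points at index 3.
root = Node(0, [Node(1), Node(3, [Node(2), None])])
bad_paths = find_out_of_range(root, 3)
```

With a real classifier the call would presumably look like find_out_of_range(tree.tree, len(tree.classVar.values)), assuming the root node is exposed as tree.tree; this has not been verified against the Orange sources.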

How to reproduce: the following script leads to the described behaviour. Commenting out one of the load_tree calls or changing their order results in no error.

Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pickle

import orange, orngC45

fd1 = 'zoo2.tab'
fd2 = 'zoo.tab'

def make_tree(fname):
   # Train a C4.5 tree on the data set and pickle it to <fname>.tree.
   data = orange.ExampleTable(fname, createNewOn = 0)
   print 'Data loaded:\t%s\t%s' % (fname, repr(data.domain.classVar.values))
   tree = orange.C45Learner(data)
   tfname = fname + '.tree'
   f = open(tfname, 'wb')
   pickle.dump(tree, f)
   f.close()
   print 'Tree saved:\t%s\t%s' % (tfname, repr(tree.classVar.values))
   return tree

def load_tree(fname):
   # Unpickle the tree stored in <fname>.tree and print it.
   f = open(fname + '.tree', 'rb')
   tree = pickle.load(f)
   f.close()
   print 'Tree loaded\t%s\t%s' % (fname, repr(tree.classVar.values))
   orngC45.printTree(tree)
   print
   return tree

def make_them():
   # Build and pickle both trees.
   t1 = make_tree(fd1)
   t2 = make_tree(fd2)

def dump_them():
   # Unpickle both trees in the same session; the order matters for the bug.
   t1 = load_tree(fd1)
   t2 = load_tree(fd2)

make_them()
dump_them()


Data files:
zoo.tab: http://pastebin.com/xwJxes1T
zoo2.tab: http://pastebin.com/kPWFpche
the above script: http://pastebin.com/adJgfasf

Note: the bug is perhaps related to shared domains (http://www.ailab.si/orange/forum/viewtopic.php?t=877), but this is not clear, since creating the ExampleTable with createNewOn = 0 doesn't help here. This is why I'm starting a new thread.

TreeLearner trees also get corrupted

Post by eliasz » Wed Mar 24, 2010 15:00

This bug is not limited to C45Learner. TreeLearner-induced trees also get corrupted, depending on the order of unpickling. Some class var values go out of range (displayed as #RNGE in orngTree.printTree).

To reproduce, substitute C45 with TreeLearner above or just use the script: http://pastebin.com/uKWt5awP (the same datasets). The behaviour is the same: running load_tree(fd2) alone or first prints a correct tree (with invertebrate and insect class var values), while running both load_tree calls yields #RNGEs in tree nodes, as below.

Code:
Tree loaded   zoo.tab   <food, mammal, monster>
type=1
|    type=1: #RNGE (100.00%)
|    type=0
|    |    type=1: mammal (100.00%)
|    |    type=0
|    |    |    type=1: #RNGE (100.00%)
|    |    |    type=0
|    |    |    |    type=0: #RNGE (100.00%)
|    |    |    |    type=1
|    |    |    |    |    type=0: #RNGE (100.00%)
|    |    |    |    |    type=1: #RNGE (50.00%)
type=0
|    type=1: #RNGE (100.00%)
|    type=0
|    |    type=1: #RNGE (100.00%)
|    |    type=0
|    |    |    type=0: #RNGE (100.00%)
|    |    |    type=4<null node>: <null node>
|    |    |    type=2<null node>: <null node>
|    |    |    type=5<null node>: <null node>
|    |    |    type=6: #RNGE (100.00%)
|    |    |    type=8<null node>: <null node>

Post by Janez » Mon Mar 29, 2010 21:25

I believe I fixed the problem. It's on svn and should be in tomorrow's snapshot. Thanks for the excellent test script!

In general, unpickling discrete attributes is a problem and it's possible to construct a case in which it just can't be done. What you discovered was fortunately just a bug, though.

Post by eliasz » Wed Mar 31, 2010 10:33

Thanks a lot, now it works like a miracle! Both TreeLearner and C45Learner now behave correctly, regardless of whether the table is loaded with createNewOn=0 and regardless of the order of unpickling (in the test script as well as in my real app).

Janez wrote: In general, unpickling discrete attributes is a problem and it's possible to construct a case in which it just can't be done.

Can you predict what would happen in such a case? An explicit error or silent degradation of performance? I'd also be glad to get a hint on when it may happen or how to avoid it (is it related to reusing of attributes?).

Post by Janez » Wed Mar 31, 2010 13:46

The bug was over-fixed. I made a mistake, which has already been fixed; the correction will appear in tomorrow's build. Your outputs won't look as nice as they do now -- one of the effects of my bug was the same as if you told Orange to always create a new attribute, even at unpickling. But your code will still work.

With regard to your question: what happens depends upon the context, but you can get either an error or a silent but very noticeable degradation of performance. In most cases, though, everything will just work.

The scenario goes like this: you have data with an attribute Y with values a, b, c; you construct a tree and pickle it. You close Python, open it again and load another data set, with another attribute Y whose values are d, e, f, or a, b, d, or a, c, b, or anything else which does not begin with a, b, c (but a, b, c, d would be OK). You construct a tree and pickle it.

Now you close and open Python again and load the first tree. Y gets unpickled and everything works fine. Then you load the second tree -- without closing Python. The new tree again has an attribute Y, but its values (or their order) differ from those of the first tree. The existing Y cannot be reused, so a new Y is constructed. Therefore, you have two attributes which incidentally have the same name.
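The reuse condition in this scenario can be illustrated with a small sketch. It is not Orange's actual logic from vars.cpp; it only encodes the rule as stated in this post, namely that an existing attribute can be reused when the new value list begins with the existing values in the same order. The function name can_reuse is hypothetical.

```python
def can_reuse(existing_values, new_values):
    """True if new_values begins with existing_values, in the same order."""
    existing = list(existing_values)
    return list(new_values)[:len(existing)] == existing


# Value lists taken from the scenario described in the post.
first_y = ["a", "b", "c"]
candidates = [
    ["a", "b", "c", "d"],  # extends the existing list
    ["d", "e", "f"],       # entirely different values
    ["a", "b", "d"],       # third value differs
    ["a", "c", "b"],       # same values, different order
]
results = [can_reuse(first_y, values) for values in candidates]
```

Only the first candidate passes the check; for the other three a second attribute named Y would have to be constructed, which is the situation described above.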

Sometimes this is just OK. Name Y (or y) is not that uncommon and these actually *are* two different attributes. If this is so, everything will work just fine.

If the two Y's are intended to be the same attribute, two things may happen. Some methods which expect the first Y will raise an exception. Others will treat the value of Y as unknown, which will certainly degrade the performance.

Orange just can't do anything here. We used to handle this in two other ways, both of which were much worse than what we have now. The only thing Orange does is sort the values when loading the data. But this does not help if you have a, b, d in one data set and a, b, c in another.

To avoid these problems you can declare all possible attribute values (the second line of the .tab file), even those which do not appear in a particular data set. Besides, you should avoid having multiple attributes with the same name.
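As an illustration of declaring all values up front, here is a minimal sketch that builds the header of a .tab file. It assumes the three-line header of Orange's tab-delimited format (names, then types or space-separated value lists, then flags such as class); the helper name tab_header and the example values are hypothetical, and the exact format should be checked against the Orange documentation.

```python
def tab_header(columns, class_name):
    """Build a three-line .tab header.

    columns is a list of (name, declared_values) pairs; the declared values
    are written to the second line, separated by spaces, so that every data
    set sharing this header declares the same values in the same order.
    """
    names = [name for name, _ in columns]
    types = [" ".join(values) for _, values in columns]
    flags = ["class" if name == class_name else "" for name in names]
    rows = (names, types, flags)
    return "\n".join("\t".join(row) for row in rows) + "\n"


# Y declares a, b, c, d even if a particular data set only contains a and b.
header = tab_header([("X", ["0", "1"]), ("Y", ["a", "b", "c", "d"])], "Y")
```

Writing the same header to every data set that shares attribute Y keeps the value lists identical, so the situation with two incompatible Y's should not arise.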

Post by eliasz » Thu Jun 03, 2010 15:00

Janez wrote: Your outputs won't look as nice as they do now -- one of the effects of my bug was the same as if you told Orange to always create a new attribute, even at unpickling. But your code will still work.


The example script indeed still works, although the real code with many classifiers doesn't. The changes made to vars.cpp in r8548 bring back the previous problems with unpickling (values out of range), even though I have set all the example tables to createNewOn = 0.

The thing is that the variables with the same name are intended to be the same or a very similar variable. Unfortunately, listing the whole domain is hardly possible, because in some cases the values include sets of elements, so the theoretically possible domain is huge. So I'll try to rename the attributes artificially (e.g. by adding prefixes).
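The renaming workaround could be sketched as follows. This is only an illustration operating on the text of a .tab file, under the assumption that attribute names are the tab-separated fields of the first line; the function name and the prefix are hypothetical.

```python
def prefix_attribute_names(tab_text, prefix):
    """Return tab_text with prefix prepended to every name in the first line."""
    header, sep, rest = tab_text.partition("\n")
    renamed = "\t".join(prefix + name for name in header.split("\t"))
    return renamed + sep + rest


sample = "hair\ttype\nd\td\n\tclass\n1\tmammal\n"
renamed = prefix_attribute_names(sample, "zoo2_")
```

Using a different prefix per data set makes the attribute names distinct, so nothing is reused between them; whether that is acceptable depends on whether the attributes really need to be treated as the same variable.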

I'd be glad if there were a simple way to turn off the unfortunate reusing of symbols globally (besides reverting vars.cpp@8537). Perhaps it yields faster performance, but reliability is definitely more important...

Post by Janez » Thu Jun 03, 2010 15:25

Unfortunately, r8548 fixed a real and serious bug. The line which was commented out in r8548 was indeed very wrong; it's not just a question of speed. If this made the #RNGE reappear, then the problem was not fixed in the first place. I'll try to look into it as soon as I have time.

