Orange Forum • View topic - Domains are shared, causing unpredictable C45 behaviour

Report bugs (or imagined bugs).
(Archived/read-only, please use our ticketing system for reporting bugs and their discussion.)

Postby eliasz » Thu Mar 18, 2010 18:09

I'm working on an application (morpho-syntactic tagger) that trains and uses loads of classifiers at the same time -- currently they are decision trees. This way I need to hold over 100 different trained trees in memory, each possibly operating on different domains.

The bug (?) appears when there are two data tables whose class variables have the same name but different domains. When they are loaded one after another, the domain of the second table's classVar grows to contain all values encountered so far. This seems to happen only when the previous data table is still in scope. I'd call it a bug, although it could theoretically be a conscious decision related to optimisation: holding one domain per attribute name. In the latter case, I think it would be nice to say so explicitly in the docstrings of DataTable and Domain.
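To illustrate the behaviour described above, here is a toy model in plain Python. It is not Orange's code: the registry, the `ToyVariable` class and `load_class_var` are all hypothetical names, and the weak-reference registry is only an assumption about why the effect disappears once the earlier table goes out of scope.

```python
import gc
import weakref

# Toy model only; NOT Orange's implementation. Every name here is hypothetical.
# Variables are looked up by name, and the registry holds them only weakly,
# so an entry vanishes once nothing else refers to the variable.
_registry = weakref.WeakValueDictionary()


class ToyVariable(object):
    def __init__(self, name):
        self.name = name
        self.values = []


def load_class_var(name, values, reuse=True):
    """Return the class variable for a 'table' whose class column holds `values`."""
    var = _registry.get(name) if reuse else None
    if var is None:
        var = ToyVariable(name)
        if reuse:
            _registry[name] = var
    # Values seen in this table are appended to whatever the variable already has.
    for v in values:
        if v not in var.values:
            var.values.append(v)
    return var


first = load_class_var("tag", ["noun", "verb"])
second = load_class_var("tag", ["adj"])
print(second is first)   # True: same name, so the variable is shared
print(second.values)     # ['noun', 'verb', 'adj']: the domain has grown

# Once the earlier variables are gone, a fresh load starts from scratch.
del first, second
gc.collect()
third = load_class_var("tag", ["adj"])
print(third.values)      # ['adj'] on CPython
```

In this sketch, passing `reuse=False` always builds a fresh variable, so two tables never share a domain.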

Unfortunately it does influence the behaviour of C45Learner (and possibly affects other classifiers as well). I have observed a difference between starting from scratch and starting after having loaded the other data table and generated the other decision tree. The difference in trees seems to appear only when pickling and unpickling the generated trees. I've prepared a script and two data sets for reproducing this unexpected behaviour. When I run it with the argument "21", it will first train and save a tree with the second set, then do the same with the first set, performing these operations in different scopes (so the interpreter is free to dispose of the first tree, and it seems to do so). When run with the "a" argument, it will load the first tree and dump it; similarly, "b" causes it to load and dump the second tree. When the argument also contains the letter "r", it will keep all the trees and data tables generated so far in memory. The output is then different when firing:
./ 21r # train 2 and 1, remembering all the trees
./ a # dump 1
than when issuing:
./ 21 # train 2, then 1, not remembering the state
./ a # dump 1
The last call yields the same results as when training the first tree only:
./ 1
./ a

I have put the employed data sets and the script in the paste bin: (as in orange distro): (modified
The testing file:
My outputs:

P.S. I've got some bigger problem with the usage of C45Learner -- in the application some trees get corrupted. I've found out that some of their leaves end up having out-of-range value of the class var (the 'leaf' field is bigger than len(tree.classVar.values) -- also leading to an error when trying to dump the tree with orngC45.printTree). Unfortunately this error is hard to reproduce without referring to the whole code, which possibly has got some other bugs -- that's why I'm reporting the easily testable one.
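A sanity check for the out-of-range leaves mentioned above could look roughly like the sketch below. It runs on a stand-in `ToyNode` class rather than on real Orange objects. The `leaf` field is taken from the post, while `nodeType` (with 0 meaning a leaf) and `branch` are assumptions about how the C4.5 node is laid out, and `find_bad_leaves` is a hypothetical helper.

```python
class ToyNode(object):
    """Stand-in for a C4.5 tree node; only the fields the check needs."""
    LEAF = 0  # assumption: node type 0 marks a leaf

    def __init__(self, node_type, leaf=0, branch=None):
        self.nodeType = node_type
        self.leaf = leaf            # index into the class variable's values
        self.branch = branch or []  # child nodes of an internal node


def find_bad_leaves(node, n_class_values, path=()):
    """Return (path, leaf) pairs for leaves whose class index is out of range."""
    bad = []
    if node.nodeType == ToyNode.LEAF:
        idx = int(node.leaf)
        if not 0 <= idx < n_class_values:
            bad.append((path, idx))
    else:
        for i, child in enumerate(node.branch):
            bad.extend(find_bad_leaves(child, n_class_values, path + (i,)))
    return bad


# A small tree with one valid leaf (1) and one corrupted leaf (3).
tree = ToyNode(1, branch=[
    ToyNode(0, leaf=1),
    ToyNode(1, branch=[ToyNode(0, leaf=3)]),
])
print(find_bad_leaves(tree, 3))  # [((1, 0), 3)]
```

On a real tree, the second argument would be something like len(tree.classVar.values); note that a leaf index equal to that length is already out of range.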

Postby Ales » Fri Mar 19, 2010 9:41

This is not a bug but, as you said, a conscious decision. The reasoning for this is described at (Reuse of attributes section).

Postby eliasz » Fri Mar 19, 2010 16:06

Thanks a lot for your reply, and sorry for not having read that section. It perfectly explains why and how the domains are shared when constructing ExampleTables or Variables. Indeed, adding createNewOn = 0 to ExampleTable's init fixes the described problem with different C4.5 outputs in the posted toy script.

EDIT: cut the rest as irrelevant, since I've managed to distil the bug. It's not related to shared domains, so I will re-post it as a new topic.
