Orange Forum • View topic - uncontrolled changes to data domain

Report bugs (or imagined bugs).
(Archived/read-only, please use our ticketing system for reporting bugs and their discussion.)

Postby Jonna » Wed Oct 31, 2007 18:26

Hi,

I have a worrying problem with the simple construction of an orange.ExampleTable from tab-format data on disk. I have two data sets, toOrangeLarge.tab and toOrangeSmall.tab. Their domains are the same, except that classVar has 2 possible values in the large data set and 3 values in the small data set. I do the following:

    import orange

    trainData = orange.ExampleTable("/home/kgvf414/projects/M-Lab/data/testSuite/toOrangeLarge.tab")
    print trainData.domain.classVar.values  # first print: the 2 expected class values
    smallTestData = orange.ExampleTable("/home/kgvf414/projects/M-Lab/data/testSuite/toOrangeSmall.tab")
    print trainData.domain.classVar.values  # second print: trainData now reports 3 class values

The first time I print the class values I get the 2 expected classes. However, after loading the small data set, the domain of trainData has changed and classVar now has 3 possible values. If I replace the loading of the small data set with the loading of a completely different data set, the problem disappears.

If anyone wants to look at this I can provide you with the data sets.
Thanks
Jonna  :?

Postby Janez » Wed Oct 31, 2007 22:13

Hi Jonna,

this is how it's supposed to work.

Orange used to (long ago!) consider each data set separately. What could happen then was that you loaded a train and test data set, trained the classifier on the first and ... when testing it on the second, it did not do anything because the data sets (and their attributes) were unrelated. Well, the attributes did have the same name, but they were not considered the same attributes.

Since this was a very common problem we decided that Orange should verify whether the attribute names and types match those of any of the existing data sets and reuse them if possible. (Now we go a step further - unless explicitly forced, Orange won't construct a new attribute with the same name and type if it can reuse an old one. This is needed for pickling.)

In your case, the first data set's class attribute is reused in the second data set. They are one and the same attribute. When the second data set is loaded, a new value is added to the attribute.

Are you just worried about this or do you have practical problems because of it?

The documentation has a section on the reuse of attributes (http://www.ailab.si/orange/doc/referenc ... ormats.htm), although I see it's more than a bit cryptic (sorry, it was written in a hurry; complain and I'll fix it). To avoid reusing the attributes, you should add an argument like createNewOn=orange.Variable.MakeStatus.OK when calling ExampleTable to load the data.
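
For readers who want to see why the first table's class values change, here is a minimal sketch of the reuse idea in plain Python. It is a toy model, not Orange's implementation: the `Variable` class, the `load_class_var` function, the `registry` dict, the `reuse` flag and the class values "A", "B", "C" are all made up for illustration. The `reuse=False` path only stands in for what passing createNewOn is described as doing above.

```python
# Toy model of attribute reuse. All names here are hypothetical;
# this does not call Orange and is not how Orange is implemented.

class Variable:
    """A named discrete attribute with a growing list of values."""
    def __init__(self, name):
        self.name = name
        self.values = []


def load_class_var(name, values, registry, reuse=True):
    """Return the class variable for a table being 'loaded'.

    With reuse=True, an existing variable of the same name is shared.
    With reuse=False, a fresh variable is always created.
    """
    var = registry.get(name) if reuse else None
    if var is None:
        var = Variable(name)
        registry.setdefault(name, var)  # keep the first one registered
    for v in values:
        if v not in var.values:
            var.values.append(v)  # new values extend the (possibly shared) variable
    return var


# Default behaviour: both tables end up with one and the same variable.
registry = {}
shared_train = load_class_var("y", ["A", "B"], registry)
shared_small = load_class_var("y", ["A", "B", "C"], registry)
print(shared_train is shared_small, shared_train.values)

# Opting out of reuse: the first table's variable is left untouched.
registry = {}
own_train = load_class_var("y", ["A", "B"], registry)
own_small = load_class_var("y", ["A", "B", "C"], registry, reuse=False)
print(own_train is own_small, own_train.values)
```

In the first case the two "tables" hold the same object, so the value added by the second load is visible through the first. In the second case each table keeps its own list.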

Bye,
Janez

Postby Jonna » Thu Nov 01, 2007 9:28

Hi Janez,

Thanks for your fast reply. The behaviour is slightly unintuitive to me, and the report was mainly a concern about a serious bug in ExampleTable. My rather artificial situation was constructed for debugging purposes, and I can't think of a real situation where the shared domains actually create a problem.

Thanks a lot
Jonna
