data file size limitation

Post by johannct » Fri Oct 21, 2005 6:44

I plan to use classification trees on a very large data file: 260 columns and potentially tens, if not hundreds, of thousands of rows.
With the current flat, tab-separated file format, Orange chokes completely even on a version of my data file trimmed to 10%.
Do you have any plans or solutions for large data files?
Our storage format is ROOT, a package that comes with a Python interface, but it is not obvious to me that combining the two (using Python as a glue language) would alleviate the problem, since an ExampleTable still needs to be generated somehow.

Any thoughts?
Thanks in advance.

Post by Janez » Fri Oct 21, 2005 12:15

This would be too complicated.

Orange has some classes capable of processing the data directly from the file. ExampleTable is derived from ExampleGenerator, but there is another class, FileExampleGenerator (I think that's the name), which is also derived from ExampleGenerator but reads examples from the file on the fly. Some classes do not require an ExampleTable and can work with any ExampleGenerator. I think that the function which computes, for instance, the information gain of attributes only requires an ExampleGenerator; if you pass it a FileExampleGenerator instead of an ExampleTable, the data table won't be loaded into memory (that is, the examples will be loaded one by one).
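To illustrate the idea of processing examples one by one, here is a minimal sketch in plain Python. It does not use Orange's classes or reproduce its implementation; the function names are made up for this example. It assumes a tab-separated file with a single header row of column names (which may differ from Orange's own tab format) and discrete attributes only. Only the contingency counts are kept in memory, never the rows themselves.

```python
import csv
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def streaming_info_gain(lines, class_column):
    """Information gain of every discrete column, computed in one pass.

    `lines` is any iterable of tab-separated text lines, for example an
    open file object. The first line is assumed to hold the column names.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    class_idx = header.index(class_column)
    class_counts = Counter()
    # joint[column][attribute value][class value] -> count
    joint = {
        name: defaultdict(Counter)
        for i, name in enumerate(header)
        if i != class_idx
    }
    for row in reader:  # only the current row is held in memory
        label = row[class_idx]
        class_counts[label] += 1
        for i, name in enumerate(header):
            if i != class_idx:
                joint[name][row[i]][label] += 1

    n = sum(class_counts.values())
    base = entropy(class_counts.values())
    gains = {}
    for name, by_value in joint.items():
        # Expected class entropy after splitting on this column
        remainder = sum(
            sum(c.values()) / n * entropy(c.values())
            for c in by_value.values()
        )
        gains[name] = base - remainder
    return gains
```

With a file on disk this would be called as `streaming_info_gain(open("data.tab"), "y")`, where the file name and the class column name `y` are placeholders. Memory use grows with the number of distinct attribute values, not with the number of rows, so continuous attributes would have to be discretized first.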

In the beginning we used ExampleGenerators a lot, but later we opted for ExampleTable, which is easier to use (because you have random access, you can store pointers to examples, etc.) and much faster. I would say that you can still induce a naive Bayesian classifier directly from the file (though I'd have to check), but not a classification tree.
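The reason a naive Bayesian classifier lends itself to this is that training needs only class counts and per-attribute value counts, which can be collected in a single sequential pass. Tree induction, by contrast, repeatedly revisits subsets of the examples at every node, which is where random access helps. The following is a rough sketch of the single-pass idea in plain Python, not Orange's code; `train_naive_bayes` and `predict` are hypothetical names, and the input is assumed to be tab-separated text with one header row and discrete attributes.

```python
import csv
import math
from collections import Counter, defaultdict


def train_naive_bayes(lines, class_column):
    """Collect the counts a naive Bayes classifier needs in one pass."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    class_idx = header.index(class_column)
    class_counts = Counter()
    cond = defaultdict(Counter)  # cond[(column, value)][class] -> count
    values = defaultdict(set)    # distinct values seen per column
    for row in reader:  # rows are consumed one at a time and discarded
        label = row[class_idx]
        class_counts[label] += 1
        for i, name in enumerate(header):
            if i != class_idx:
                cond[(name, row[i])][label] += 1
                values[name].add(row[i])
    return class_counts, cond, values


def predict(model, example):
    """Return the most probable class for `example` (dict column -> value)."""
    class_counts, cond, values = model
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)  # log prior
        for name, value in example.items():
            seen = cond.get((name, value), {}).get(label, 0)
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs
            score += math.log((seen + 1) / (count + len(values[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The model returned by `train_naive_bayes` is just a tuple of count tables, so its size depends on the number of attributes and distinct values rather than on the number of lines in the file.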

We do not have any plans to change this; it would require too much work, and we need to focus on more urgent things. Sorry...

Post by johannct » Mon Oct 24, 2005 3:52

Thanks... classification trees are precisely what we are interested in.


