Orange Forum • View topic - two question, flexibility of data and scalablity of system

Postby vim » Sat May 06, 2006 15:34

The tab-delimited data format is intuitive and easy to handle, but for some special cases a sparse representation may be more suitable. For example, text classification may involve thousands of features, of which only a small subset is non-zero. libsvm, for instance, uses the format below:
0.0 1:4.236298e+00 2:2.198210e+01 3:-3.503797e-01 4:9.752163e+01
Translating from a libsvm-like format into Orange's tab-delimited format is not a big problem with Python, but a lot of zeros may end up stored on disk...
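The translation mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the class label is assumed to be the first token of each libsvm line and is written as the last column, feature indices are assumed to be 1-based, and any header lines the .tab file needs are not written here.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-style line into (label, {index: value})."""
    parts = line.split()
    label = parts[0]
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def libsvm_to_tab_rows(lines, n_features):
    """Yield dense tab-delimited rows, filling missing features with 0.0.

    Assumption: indices run from 1 to n_features, and the label goes
    into the last column.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, features = parse_libsvm_line(line)
        values = [features.get(i, 0.0) for i in range(1, n_features + 1)]
        yield "\t".join([repr(v) for v in values] + [label])


if __name__ == "__main__":
    sample = ["0.0 1:4.236298e+00 2:2.198210e+01 3:-3.503797e-01 4:9.752163e+01"]
    for row in libsvm_to_tab_rows(sample, 4):
        print(row)
```

Every absent feature becomes an explicit 0.0 in the output, which is exactly the disk overhead described above.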

Another question: I want to use Orange for text classification. How well does it scale?
Can it handle several thousand training samples with several thousand features, or several thousand training samples with several hundred features? I am doing the preprocessing tonight, so maybe I can answer this myself tomorrow.

Thanks in advance, and happy May Day.

Postby Blaz » Sun May 07, 2006 7:53

Orange does not have a specialized data representation for text mining like the one you describe above. Perhaps the closest is its basket format:
http://www.ailab.si/orange/doc/reference/basket.htm

Matrices of several thousand x several thousand should in principle not be a problem. We have recently been working with matrices of that size for cancer microarray data, which may have several hundred columns and several tens of thousands of rows. Still, let us know if you encounter problems with your particular data sets.

Postby vim » Sun May 07, 2006 16:07

Thanks for the help.
Now, with 6000 (samples) * 100 (features):
kNN needs 10 seconds per learning run,
and SVM needs about 1 minute per learning run.
Bayes is still learning, much more slowly than kNN and SVM...
----
Today: Bayes needed 20388 seconds.
This is just the simplest script, shown below.
Is BayesLearner written in pure Python? Is that why it is so much slower than SVM, which is written in a combination of C++ and Python?

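import orange
from time import clock

t1 = clock()
data = orange.ExampleTable("svd100")
# Train naive Bayes on the full data set
classifier = orange.BayesLearner(data)
# Count correct predictions on the training data itself
kk = 0
for i in range(len(data)):
    c = classifier(data[i])
    if c == data[i].getclass():
        kk += 1
print float(kk)/len(data)  # accuracy on the training data
print clock()-t1  # total time: loading, learning and classification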

Postby Blaz » Mon May 08, 2006 7:27

The naive Bayes classifier is written in pure C++ (nothing in Python). Your features are most probably continuous; for these, NBC in Orange uses a LOESS approximation to estimate probabilities (see http://www.ailab.si/orange/doc/referenc ... earner.htm). If you discretize your data, NBC should be the fastest of all learners.

The runtime for NBC that you are reporting is of course too long. Try adjusting the parameters for LOESS (http://www.ailab.si/orange/doc/reference/ProbabilityEstimation.htm#ConditionalProbabilityEstimatorConstructor_loess) to reduce the computation time (or discretize the data instead).
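To illustrate what discretizing a continuous feature means, here is a minimal pure-Python sketch of equal-frequency binning. It is not Orange's own discretization code: the function names and the choice of equal-frequency bins are assumptions made for the example. Once each value is replaced by a bin index, naive Bayes only has to count how often each bin occurs per class.

```python
from bisect import bisect_right


def equal_frequency_cutoffs(values, n_bins):
    """Return n_bins - 1 cut points that split the sorted values
    into groups of roughly equal size (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(values, cutoffs):
    """Map each continuous value to the index of the bin it falls in.

    A value equal to a cut point goes into the upper bin.
    """
    return [bisect_right(cutoffs, v) for v in values]


if __name__ == "__main__":
    feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    cutoffs = equal_frequency_cutoffs(feature, 3)
    print(cutoffs)
    print(discretize(feature, cutoffs))
```

In practice the cut points would be computed on the training data only and then reused for any new examples.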

Postby vim » Mon May 08, 2006 8:14

Thanks, looks like I didn't read carefully enough.

