Too many Attributes

Post by fable » Thu Aug 31, 2006 16:34

I have a tab file with more than 300 attributes (my attributes are site keywords, and their number can grow to 2000). I wonder, is there any way to map the attributes in one file to my main attributes in another file, and to define my classes in the second tab file? (e.g. attributes: ski, sport, news, ...; class: sitetype: sport, news)

Thanks in advance for your help. :)

Post by Blaz » Fri Sep 01, 2006 15:06

To merge two data files (which must have the same number of examples), use ExampleTable(list-of-tables) (see http://www.ailab.si/orange/doc/referenc ... eTable.htm).

You did not specify which file formats you use. I will assume you have used the .basket format for keywords, that is, something like (one.basket):

Code: Select all
nobody, expects, the, Spanish, Inquisition=5
our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
to, the, Pope, and, nice, red, uniforms, oh damn


If you create a second file that includes the class information, e.g. something like (two.tab):

Code: Select all
c
d
class
a
a
b
b


then the following code merges the two data sets:

Code: Select all
import orange

# load the keyword (basket) data and the class data
tdata = orange.ExampleTable("one.basket")
cdata = orange.ExampleTable("two.tab")

print tdata[0]
print cdata[0]

# merge the two tables example by example
data = orange.ExampleTable([tdata, cdata])
print data[0]


Running the script, one gets:

Code: Select all
[], {"nobody":1.000, "expects":1.000, "the":1.000, "Spanish":1.000, "Inquisition":5.000}
['a']
['a'], {"nobody":1.000, "expects":1.000, "the":1.000, "Spanish":1.000, "Inquisition":5.000, "our":?, "chief":?, "weapon":?, "is":?, "surprise":?, "and":?, "fear":?, "two":?, "weapons":?, "are":?, "ruthless":?, "efficiency":?, "to":?, "Pope":?, "nice":?, "red":?, "uniforms":?, "oh damn":?}


The two files were merged, but with the problem that every meta attribute used in one.basket is now declared in the domain descriptor of the data file; hence the many unknowns. You can work around this by constructing your algorithm so that it ignores all undefined meta attributes (those equal to "?"). We are aware of this annoyance and will fix it next week (after the fix we will place a notice here).
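For illustration, a minimal sketch of such filtering (assuming the old Orange API, where example.getmetas() returns a dictionary mapping meta ids to values and Value.isSpecial() tests for "?"):

Code: Select all
import orange

tdata = orange.ExampleTable("one.basket")
cdata = orange.ExampleTable("two.tab")
data = orange.ExampleTable([tdata, cdata])

# keep only the meta attributes actually defined for this example
for id, value in data[0].getmetas().items():
    if value.isSpecial():   # undefined values print as "?"
        continue
    print data.domain.getmetas()[id].name, float(value)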

Post by Janez » Tue Sep 05, 2006 17:25

fable,

you may now use a better solution: you can have a tab-delimited file where one column is a "basket" containing a set of keywords. Each keyword can also be assigned a numeric value.
Code: Select all
category    keywords
d           basket
class
sport       skiing wengen stenmark
sport       football real
news        cnn orange
sport       climbing rope

You can read more about it at http://www.ailab.si/orange/doc/reference/tabdelimited.htm#basket.
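For illustration, a minimal sketch of reading such a file (assuming the old Orange API; "keywords.tab" is a hypothetical name for the file above, saved with a .tab extension):

Code: Select all
import orange

data = orange.ExampleTable("keywords.tab")
print data.domain.classVar.name   # category
print data[0]                     # e.g. ['sport'], {"skiing":1.000, ...}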

Unless you update Orange from CVS, you will have to wait for the next snapshot (Sep 06).

We will add more features useful for text mining in the near future.

Post by Janez » Tue Sep 05, 2006 17:40

@blaz,

you did not use the most recent version of Orange, which has optional meta attributes and uses them to store basket data. As of a week or so ago, the first example printed out by your code is correct:
Code: Select all
['a'], {"nobody":1.000, "expects":1.000, "the":1.000, "Spanish":1.000, "Inquisition":5.000}

(and so are the others)

Post by Viktor » Thu Sep 07, 2006 19:06

Janez wrote: you may now use a better solution: you can have a tab-delimited file where one column is a "basket" containing a set of keywords. [...]


Janez,

Firstly, many thanks for making it possible to input sparse data, I am sure many people would appreciate that.

I installed the most recent version of Orange and the snapshot, but the example you gave above won't read:

Code: Select all
data = orange.ExampleTable("sparse_data2.txt")
for i in range(len(data)): print data[i]
...
['d', 'basket']
['class', '?']
['sport', 'skiing wengen stenmark']
['sport', 'football real']
['news', 'cnn orange']
['sport', 'climbing rope']


Changing the format for attribute descriptions to

Code: Select all
B#keywords   D#category
skiing wengen stenmark   sport
....


gives the error "Orange internal error: NULL pointer to 'BasketFeeder'".

Post by Janez » Fri Sep 08, 2006 15:08

For the first file, is there any chance you accidentally named it .txt instead of .tab and ignored the warning? (Or was there even no warning?)

For the second (B#keywords D#category), there was a stupid bug. Tomorrow's snapshot will work.

Thanks for the report.

Post by Viktor » Mon Sep 11, 2006 0:55

Many thanks for that; the file reads successfully now. Still, I have a problem: I get different results on the same data when it is input in the sparse format vs. the "normal" format. Using the 10-fold cross-validation script with kNN:
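(For reference, a minimal sketch of such a testing script, assuming the old orngTest/orngStat modules; the file name is hypothetical:)

Code: Select all
import orange, orngTest, orngStat

data = orange.ExampleTable("mydata.tab")     # hypothetical file name
learners = [orange.kNNLearner()]
results = orngTest.crossValidation(learners, data, folds=10)
print "knn %f" % orngStat.CA(results)[0]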

"Normal" format:

Code: Select all
Data:
[1, 3, 0, 0, '1']
[0, 0, 2, 5, '0']
[3, 3, 0, 0, '1']
[0, 0, 4, 5, '0']
[5, 3, 0, 0, '1']
[0, 0, 6, 5, '0']
[0, 0, 7, 5, '0']
[8, 3, 0, 0, '1']
[9, 3, 0, 0, '1']
[0, 0, 10, 5, '0']

Classifications:
[9, 3, 0, 0, '1']: predicted: 1, actual: 1
[1, 3, 0, 0, '1']: predicted: 1, actual: 1
[8, 3, 0, 0, '1']: predicted: 1, actual: 1
[3, 3, 0, 0, '1']: predicted: 1, actual: 1
[5, 3, 0, 0, '1']: predicted: 1, actual: 1
[0, 0, 7, 5, '0']: predicted: 0, actual: 0
[0, 0, 2, 5, '0']: predicted: 0, actual: 0
[0, 0, 4, 5, '0']: predicted: 0, actual: 0
[0, 0, 10, 5, '0']: predicted: 0, actual: 0
[0, 0, 6, 5, '0']: predicted: 0, actual: 0

Classification accuracies:
knn 1.000000


Sparse format (c1,c2,... correspond to the 1st, 2nd,... column above):

Code: Select all
Data:
['1'], {"c1":1.000, "c2":3.000}
['0'], {"c3":2.000, "c4":5.000}
['1'], {"c1":3.000, "c2":3.000}
['0'], {"c3":4.000, "c4":5.000}
['1'], {"c1":5.000, "c2":3.000}
['0'], {"c3":6.000, "c4":5.000}
['0'], {"c3":7.000, "c4":5.000}
['1'], {"c1":8.000, "c2":3.000}
['1'], {"c1":9.000, "c2":3.000}
['0'], {"c3":10.000, "c4":5.000}

Classifications:
['1'], {"c1":9.000, "c2":3.000}: predicted: 0, actual: 1
['1'], {"c1":1.000, "c2":3.000}: predicted: 0, actual: 1
['1'], {"c1":8.000, "c2":3.000}: predicted: 0, actual: 1
['1'], {"c1":3.000, "c2":3.000}: predicted: 0, actual: 1
['1'], {"c1":5.000, "c2":3.000}: predicted: 0, actual: 1
['0'], {"c3":7.000, "c4":5.000}: predicted: 1, actual: 0
['0'], {"c3":2.000, "c4":5.000}: predicted: 1, actual: 0
['0'], {"c3":4.000, "c4":5.000}: predicted: 1, actual: 0
['0'], {"c3":10.000, "c4":5.000}: predicted: 0, actual: 0
['0'], {"c3":6.000, "c4":5.000}: predicted: 1, actual: 0

Classification accuracies:
knn 0.100000

Post by Janez » Mon Sep 11, 2006 9:21

We must have raised your hopes too high: the sparse data (basket) goes into meta attributes, and no "classical" machine learning methods can use them (see, for instance, http://www.ailab.si/orange/doc/reference/basket.htm). They are meant mostly for basket analysis and for text mining. Association rules work on them, and for text mining you can add your own processing (some of it will be provided with Orange soon, too).
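For example, a minimal sketch of inducing association rules on basket data (assuming the old Orange API's AssociationRulesSparseInducer; the support threshold is an arbitrary choice):

Code: Select all
import orange

data = orange.ExampleTable("one.basket")
rules = orange.AssociationRulesSparseInducer(data, support=0.3)
for rule in rules:
    print "%.3f  %.3f  %s" % (rule.support, rule.confidence, rule)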

In a domain where you would use algorithms like classification trees, the number of attributes would typically be small, so you don't need a sparse data representation. Also, using a Bayesian classifier on sparse data would probably require a different implementation of the algorithm; simply modifying the existing one to read meta attributes probably wouldn't work.
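To see where the keywords end up, a quick sketch (old Orange API; the file name is hypothetical): ordinary learners only look at data.domain.attributes, while basket keywords are stored as meta attributes.

Code: Select all
import orange

data = orange.ExampleTable("keywords.tab")   # hypothetical basket file
print data.domain.attributes   # no regular attributes from the basket here
print data.domain.classVar     # the class variable is a regular attribute
print data[0].getmetas()       # the keywords live here, unseen by learners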

Post by Viktor » Mon Sep 11, 2006 10:40

OK, but if classification algorithms were able to handle sparse data, that would be a big plus for your toolkit, I think. Especially since Python is a good choice for text-mining tasks.

