Orange Forum • View topic - full text clustering

full text clustering

Postby happy_broccoli » Fri Jan 14, 2005 1:59

The politics example seems to operate only on numerical data.

Can Orange be used, for example, to find texts similar to a given text?

Postby Blaz » Fri Jan 21, 2005 21:39

Orange does not have a dedicated text mining module. However, implementing something like a Bayesian classifier for text documents should not be too hard (I have had several students do that in the past, but polishing the code and putting it into a module would take quite some time).
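To give an idea of how little code such a classifier needs, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not the students' code and not part of Orange; the class name NaiveBayesText, its train and classify methods, and the whitespace tokenization are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Hypothetical multinomial naive Bayes over lowercase whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()                  # all words seen in training

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability (None if untrained)."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            counts = self.word_counts[label]
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Train it with a few labelled texts per class, e.g. nb.train("sports", "the team won the match"), and then call nb.classify(...) on a new text. Real documents would need better tokenization (punctuation, stop words) than the plain split() used here.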

Also, you may use a package like Reverend (http://divmod.org/projects/reverend) for text classification; it runs in Python and should be easy to integrate with Orange.

There is also a quick-and-dirty trick I like to use, published in the paper by Benedetto et al.: Language Trees and Zipping (Physical Review Letters, 28 Jan 2002). The idea is that a compressor that has already seen document A needs only a few extra bytes for X if X repeats the patterns found in A. If you have a text X and text documents {A, B, C, ...}, to find which one is most similar to X, simply

- find the length of zipped A
- find the length of zipped A+X (the text of X is appended to document A)
- compute the difference, len(zip(A+X)) - len(zip(A))

Do the same for B, C, ... The document with the smallest difference is the one X is most similar to.

In Python, zipping text is easy. Use

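import zlib

def clen(s):
    # Length in bytes of s after zlib compression (level 9 = best compression).
    if isinstance(s, str):
        s = s.encode("utf-8")  # Python 3: zlib.compress needs bytes, not str
    return len(zlib.compress(s, 9))

def delta(a, b):
    # Extra compressed bytes needed to append b to a; smaller means more similar.
    return clen(a + b) - clen(a)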


Store your text in a string field of an Orange data set, and call delta to compute the "text distances". Note that this makes it rather straightforward to compute a distance matrix, which you can then use for, say, hierarchical clustering (see http://www.ailab.si/orange/doc/reference/clustering.htm).
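As a rough sketch of both uses, the block below picks the document most similar to a text X and builds a distance matrix over a list of texts. The helpers most_similar and distance_matrix are hypothetical names, not Orange functions. Since delta(a, b) and delta(b, a) generally differ, the matrix averages the two directions; that symmetrization is an assumption of this sketch, not something taken from the paper.

```python
import zlib


def clen(text):
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def delta(a, b):
    """Extra compressed bytes needed when b is appended to a."""
    return clen(a + b) - clen(a)


def most_similar(x, documents):
    """Return the key in documents (name -> text) whose text gives the smallest delta for x."""
    return min(documents, key=lambda name: delta(documents[name], x))


def distance_matrix(texts):
    """Symmetric list-of-lists matrix; entry [i][j] averages both delta directions."""
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = (delta(texts[i], texts[j]) + delta(texts[j], texts[i])) / 2.0
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

The resulting matrix is a plain list of lists, so it would still have to be converted to whatever structure your clustering routine expects. Very short texts give noisy results, because the fixed overhead of the compressed stream dominates.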
Last edited by Blaz on Tue Apr 25, 2006 9:44, edited 1 time in total.

Text mining

Postby shashi » Fri Sep 23, 2005 21:12

Blaz

I would be interested in more details about how your students implemented text classification. Are you in a position to point us to some publicly available papers on this topic (presumably written up by your students)?

best,
shashi

shashi.mit gmail

