## full text clustering

### full text clustering

The politics example seems to operate only on numerical data.

Can Orange be used, for example, to find texts similar to a given text?

Orange does not have a dedicated text mining module. However, implementing something like a Bayesian classifier for text documents should not be too hard (several of my students have done that in the past, but polishing the code and packaging it as a module would take quite some time).
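To give a rough idea of how little code such a classifier needs, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not the students' code and does not use Orange; the names `tokenize`, `train` and `classify`, the whitespace tokenizer and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude lowercase whitespace tokenizer (an assumption; real text
    # would need punctuation handling, stop words, and so on).
    return text.lower().split()

def train(labeled_docs):
    """Build per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        words = tokenize(text)
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

You would call `train` once on your labeled documents and then pass the returned model to `classify` for each new text.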

Alternatively, you could use a package like Reverend (http://divmod.org/projects/reverend) for text classification; it runs in Python and should be easy to integrate with Orange.

There is also a quick-and-dirty trick I like to use, published in the paper by Benedetto et al., "Language Trees and Zipping" (Physical Review Letters, 28 Jan 2002). If you have a text X and text documents {A, B, C, ...}, then to find which document is most similar to X, simply:

- find the length of zipped A

- find the length of zipped A+X (the text of X appended to document A)

- compute the difference of the two zipped lengths, len(A+X)-len(A)

Do the same for B, C, ... The document with the smallest difference is the one X is most similar to.

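In Python, zipping text is easy. Use:

```python
import zlib

def clen(s):
    # Length in bytes of s after zlib compression at the maximum level (9).
    # s must be a byte string; on Python 3, encode text first,
    # e.g. text.encode("utf-8").
    return len(zlib.compress(s, 9))

def delta(a, b):
    # Extra compressed bytes needed when b is appended to a.
    return clen(a + b) - clen(a)
```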

Store your text in a string field in the Orange data set, and call delta to compute the "text distances". Note that this makes it rather straightforward to compute a distance matrix that you can then use for, say, hierarchical clustering (see http://www.ailab.si/orange/doc/reference/clustering.htm).
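As a rough sketch of both uses, the snippet below picks the document most similar to a given text and builds a pairwise distance matrix. It works on plain Python byte strings rather than an Orange data set. The helper names `most_similar` and `distance_matrix` are made up for this example, and averaging the two directions of delta to make the matrix symmetric is my assumption, not something taken from the paper.

```python
import zlib

def clen(data):
    # Compressed length in bytes; data must be a byte string.
    return len(zlib.compress(data, 9))

def delta(a, b):
    # Extra compressed bytes needed when b is appended to a.
    return clen(a + b) - clen(a)

def most_similar(x, documents):
    """Return the name of the document with the smallest delta(doc, x).

    documents: dict mapping a name to the document's bytes (hypothetical layout).
    """
    return min(documents, key=lambda name: delta(documents[name], x))

def distance_matrix(texts):
    """Return a symmetric list-of-lists matrix of text distances.

    delta is not symmetric, so each entry averages both directions
    (an assumption). The diagonal is set to 0 by definition.
    """
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = (delta(texts[i], texts[j]) + delta(texts[j], texts[i])) / 2.0
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

The resulting matrix can then be handed to whatever hierarchical clustering routine you use, after converting it to the distance-matrix type that routine expects.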

Last edited by Blaz on Tue Apr 25, 2006 9:44, edited 1 time in total.

### Text mining

Blaz

I would be interested in more details about how your students implemented text classification. Are you in a position to point us to some publicly available papers on this topic (presumably written up by your students)?

best,

shashi

shashi.mit gmail
