Orange Forum • View topic - Improve Orange's Clustering algorithms

Improve Orange's Clustering algorithms

Postby sumith » Sun Mar 27, 2011 14:23

Hi All,

I'm Sumith, a postgraduate research student working on data mining and text clustering techniques. My research focuses on SOM-based algorithms and text clustering. I would like to suggest two ideas to improve Orange's capabilities in these areas.

Idea 1: In SOM clustering, the user has to specify the map size. To address the issues with this predefined, static architecture, the Growing Self-Organizing Map (GSOM) has been proposed and is widely used in many domains, including bioinformatics and text mining. Since I am working with both SOM and GSOM, I think implementing GSOM in Orange would be really advantageous to a lot of people using the tool.

Idea 2: Include a text pre-processing module in Orange. I think having a text pre-processing module in Orange would really enhance its use among text mining researchers. Different frequency-based term weighting schemes could be implemented, and the module would produce a file that can be fed directly into any of the clustering algorithms.

These are my general ideas for improving Orange's capabilities as a data mining tool. I would be really happy to hear your input on them and on their suitability for GSoC 2011. Since I have worked in both areas, I have a strong theoretical and practical background for implementing them. Please let me know what you think.

Re: Improve Orange's Clustering algorithms

Postby Blaz » Sun Mar 27, 2011 16:56

Personally, I like both ideas. We were thinking of adding the text mining idea to the collation of ideas anyway. The text mining module is a bit old and it has not been touched for a while, so anything to improve it would be most welcome.

I would suggest that you go with whichever of the two ideas you like. Both would be welcome additions to Orange.

Best wishes,

Re: Improve Orange's Clustering algorithms

Postby sumith » Mon Mar 28, 2011 1:23

Great. I will think it over and come up with a good proposal.


Re: Improve Orange's Clustering algorithms

Postby sumith » Tue Mar 29, 2011 6:32

Hi Blaz,

Just wondering where I can find the documentation for the existing text mining module. That would definitely help me organize my new proposal.


Re: Improve Orange's Clustering algorithms

Postby Blaz » Tue Mar 29, 2011 6:47

Sumith, this documentation is not on the web site, but comes with the installation of the module. It's in doc/modules for the scripting part and in widgets/catalog/Text for the widgets.

Re: Improve Orange's Clustering algorithms

Postby Mitar » Tue Mar 29, 2011 13:56

You can also check it out from our SVN repository.

Re: Improve Orange's Clustering algorithms

Postby sumith » Wed Mar 30, 2011 6:58

Great. Thanks a lot.

Re: Improve Orange's Clustering algorithms

Postby sumith » Mon Apr 04, 2011 9:37

Hi Blaz,

I have put together the following suggestions for improving Orange's text clustering. Please give me your feedback on them, on the suitability of what I propose, and on any other things you are interested in improving.

1) Improvements to file type support - Only XML and SGM file formats are supported at the moment, but many text sources are delimited files such as CSV or tab-separated files. Also, sometimes each document resides in its own file, so being able to browse for a folder containing all the files would be advantageous.

2) Preprocessing - It would be advantageous to customize the stop word list based on user requirements. The full list of features (words) could be listed together with the stop word list, allowing the user to customize and finalize the word set as necessary.

3) Bag of words - It seems the TF-IDF value is calculated as log(1/frequency). The text mining literature suggests that term frequency * inverse document frequency, as defined below, would be a better option: TF-IDF = TF * IDF, where
TF = (number of occurrences of a particular term) / (total number of terms in the file)
IDF = log(D / (1 + d)), where D is the total number of documents in the text corpus and d is the number of documents containing the term

4) Also, dimension reduction techniques such as Latent Semantic Indexing could be integrated to reduce the feature space to a low-dimensional feature vector. This would definitely help in text clustering, given the high dimensionality of text data. (This might go under matrix factorizations as well.)
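
To make the TF-IDF definition in point 3 concrete, here is a minimal sketch in plain Python. It is not Orange's implementation: the function name tf_idf, the toy corpus, and the assumption that documents arrive already tokenized are all illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper following the definition in point 3:
    TF = count of term in document / total terms in document
    IDF = log(D / (1 + d))
    """
    total_docs = len(documents)  # D
    # d: number of documents that contain each term at least once
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        n_terms = len(tokens)
        row = {}
        for term, count in counts.items():
            tf = count / n_terms
            idf = math.log(total_docs / (1 + doc_freq[term]))
            row[term] = tf * idf
        weights.append(row)
    return weights


# Toy corpus, made up for illustration only
corpus = [
    "orange is a data mining tool".split(),
    "text mining needs text preprocessing".split(),
    "clustering groups similar documents".split(),
    "self organizing maps are used for clustering".split(),
]
weights = tf_idf(corpus)
```

One thing to keep in mind with the 1 + d smoothing: a term that occurs in every document gets IDF = log(D / (D + 1)), which is slightly negative, so an implementation may want to clip such weights at zero or drop those terms.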

Please let me know your feedback on the above, as well as any new ideas you would like to see in the text clustering module.
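
The stop word customization in point 2 could look roughly like the sketch below. The default stop word list and all function names are hypothetical and are not taken from the existing Orange text module; the idea is only to show the full word list being presented next to an editable stop word list.

```python
from collections import Counter

# Hypothetical default list; a real module would ship a much longer one
DEFAULT_STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def list_features(documents):
    """Count every word in the corpus so the user can review the full list."""
    return Counter(token for tokens in documents for token in tokens)


def customize_stop_words(defaults, add=(), remove=()):
    """Return the user's stop word list: defaults plus additions, minus removals."""
    return (set(defaults) | set(add)) - set(remove)


def filter_tokens(tokens, stop_words):
    """Drop stop words while keeping the original token order."""
    return [token for token in tokens if token not in stop_words]


docs = ["the map of the data".split(), "a map is a grid".split()]
features = list_features(docs)
# The user decides that "map" is uninformative here but "of" should be kept
stop_words = customize_stop_words(DEFAULT_STOP_WORDS, add={"map"}, remove={"of"})
filtered = [filter_tokens(tokens, stop_words) for tokens in docs]
```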


Re: Improve Orange's Clustering algorithms

Postby salmonix » Sat Jan 07, 2012 11:21

Hi there,
There might also be a demand for token substitution. E.g., a number of tokens (words) are collected under a superior category (like: dog, cat -> domestic_animal), and during the analysis the substituted values must be counted. It can happen that a text and a category list (say, a YAMLed hash) are loaded separately.
It might also be useful for reducing the number of elements by filtering out dialectal variants using a similar dictionary.
The question is, of course, what the design policy is for the text mining / processing module set.
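
A minimal sketch of the token substitution idea, assuming the category list has already been loaded into a plain dict (for example from a YAML file). The function names and the sample data are made up for illustration.

```python
from collections import Counter


def invert_categories(categories):
    """Turn {category: [tokens]} into a {token: category} lookup table."""
    lookup = {}
    for category, members in categories.items():
        for token in members:
            lookup[token] = category
    return lookup


def count_with_substitution(tokens, lookup):
    """Count tokens after replacing each listed token with its category."""
    return Counter(lookup.get(token, token) for token in tokens)


# Category list as it might look after loading a YAML hash (hypothetical data)
categories = {"domestic_animal": ["dog", "cat"]}
lookup = invert_categories(categories)
counts = count_with_substitution("the dog chased the cat".split(), lookup)
```

The same lookup table would cover the dialect case: mapping each variant spelling to one canonical form merges the variants into a single feature.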
