GSoC: Text-mining add-on for Orange

Post by aparna » Tue Mar 20, 2012 21:11

I am Aparna N. from India, and I am interested in contributing to open source. I am currently working on an automatic essay grader (AEG), which involves text mining, so I am very interested in working on the text-mining features.

I was looking into the text-mining features of Orange and had a few ideas in mind:
1) Export of raw rotated singular value decomposition (SVD) topic values (for use in predictive modeling nodes such as coherence detection).
2) Automatic stemming to identify root words.
3) Automatic part-of-speech tagging based on sentence context.
4) Out-of-the-box support for many different entity types, including person and company names, locations, dates, addresses, measurements, and email and URL addresses.
5) User-customized and default synonym lists.
I would like to know more about this project.

I am hoping to contribute to open source this summer :)
Awaiting your reply!

6th Sem,

Post by crtg » Wed Mar 21, 2012 16:44


Thanks for showing interest in developing the text-mining add-on for Orange. Your ideas sound interesting, but keep in mind that every new feature also requires documentation with examples, tests, and a widget implementation or modification. If you plan to submit a proposal, we encourage you to install the current version of the text-mining add-on and try it for yourself. Some features from your list (for example, stemming) are already implemented. Also keep in mind that we prefer a simpler but working text-mining add-on with documentation, installation scripts, examples, tests, tutorials and widgets over an implementation of the latest state-of-the-art text-mining algorithms without any documentation.

Thanks, Crt

Post by aparna » Thu Mar 22, 2012 19:47

Thank you for directing me. :) I installed the text-mining add-on and looked into the features that have already been implemented. Only .xml and SGM files are supported. The widget did not load my .xml file, but I could still use the other widgets by loading .tab files that have string fields. I am exploring the API right now. How about making the text file widget load text files in other formats, so that the existing features can be put to their best use? As of now, the text-mining add-on has no feature for extracting email addresses and telephone numbers from text fields, either in direct form or in an indirect form (e.g. abc at xyz dot com). Am I thinking along the right lines? Please let me know so that I can be clear with my ideas.
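
To make the email idea concrete, here is a rough sketch in plain Python using only the standard `re` module. It does not use the add-on's API. The names `DIRECT`, `OBFUSCATED`, `DOT_WORD` and `extract_emails` are made up for illustration, and the patterns are deliberately simplified, so they are an assumption about what "direct" and "indirect" forms look like rather than a complete email grammar. Telephone numbers are not covered.

```python
import re

# Direct form, e.g. "abc@xyz.com" (simplified, not a full RFC 5322 pattern).
DIRECT = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")

# Indirect form, e.g. "abc at xyz dot com" or "abc (at) xyz [dot] com".
# The separators must be surrounded by whitespace, so that the "at" inside
# a word such as "format" is not mistaken for a separator.
OBFUSCATED = re.compile(
    r"\b([A-Za-z0-9._%+-]+)\s+(?:at|\(at\)|\[at\])\s+"
    r"([A-Za-z0-9-]+(?:\s+(?:dot|\(dot\)|\[dot\])\s+[A-Za-z0-9-]+)+)",
    re.IGNORECASE,
)

# Matches one written-out "dot" separator inside the captured domain part.
DOT_WORD = re.compile(r"\s+(?:dot|\(dot\)|\[dot\])\s+", re.IGNORECASE)


def extract_emails(text):
    """Return direct addresses first, then normalized indirect ones."""
    found = DIRECT.findall(text)
    for match in OBFUSCATED.finditer(text):
        local, domain = match.group(1), match.group(2)
        # Turn "xyz dot com" into "xyz.com" before joining with "@".
        found.append(local + "@" + DOT_WORD.sub(".", domain))
    return found


if __name__ == "__main__":
    print(extract_emails("Mail abc at xyz dot com or foo@bar.org"))
```

The indirect pattern will also fire on ordinary phrases such as "look at polka dot dresses", so a real feature would probably need a list of known top-level domains or some other filter on top of this.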

Thanking you,

