Orange Forum • View topic - Text Mining Add On Newbie Question

Text Mining Add On Newbie Question

Post by greenbjb » Thu Oct 24, 2013 0:15

We have deployed Orange and the Text Mining add-on in a HUBzero environment. Our hope is to use it with undergraduate social science students to do basic social media text analysis. I am having little luck getting this add-on to do much of anything, and I suspect I'm missing some basics. I can't find much of anything online in the docs or forums to help. Can someone point me in the right direction? For example, I can read in a text file using the Text File widget, but I can't do any actual text analysis with the other tools in this add-on (e.g. Bag of Words, Word n-Grams, Feature Selection). They don't seem to do anything. What am I missing?

Thanks in advance.

Re: Text Mining Add On Newbie Question

Post by Ales » Tue Oct 29, 2013 12:02

As you noticed, the Text Mining add-on is practically undocumented (and unmaintained, so it is reasonable to expect the situation will not improve on that front). So it might not be a good choice for teaching.

However, a basic example of the functionality would be to load the bookexcerpts.tab file (using the regular "File" widget, not "Text File"; see this post for hints on using that widget). The loaded dataset should have a 'string' attribute column with the full text of the document.

Then use the Preprocess widget to lemmatize and/or remove stop words for the "text" attribute (note: use the Preprocess widget from the Text Mining category and not the one from Data, which has the same name).

Then you would typically apply "Bag Of Words" or "Letter/Word n-Grams". This appends the term/n-gram frequency/count columns to the dataset.
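
To make concrete what kind of columns this step produces, here is a minimal plain-Python sketch of bag-of-words and word n-gram counting. It is not the add-on's code: the function names `tokenize`, `bag_of_words` and `word_ngrams`, the whitespace tokenization, and the two toy documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Return a Counter of n-grams, each a tuple of n consecutive tokens."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy documents (hypothetical, not taken from bookexcerpts.tab).
docs = ["the cat sat on the mat", "the dog sat"]
for doc in docs:
    print(bag_of_words(doc))
    print(word_ngrams(doc, n=2))
```

Each distinct term or n-gram corresponds to one appended column, and the count is the value of that column for the document.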

After that you pass the output through the Feature Selection widget to filter out the least informative features constructed in the previous step.
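
The thread does not say which criteria the Feature Selection widget uses, so the following is only a sketch of one common filtering idea: dropping terms that occur in too few documents. The function name `select_features`, the `min_docs` threshold, and the sample counts are hypothetical.

```python
from collections import Counter


def select_features(doc_counts, min_docs=2):
    """Keep only terms that occur in at least `min_docs` documents.

    `doc_counts` is a list of Counters, one per document.
    """
    doc_freq = Counter()
    for counts in doc_counts:
        # Each document contributes at most 1 to a term's document frequency.
        doc_freq.update(counts.keys())
    kept = {term for term, df in doc_freq.items() if df >= min_docs}
    return [Counter({t: c for t, c in counts.items() if t in kept})
            for counts in doc_counts]


# Hypothetical per-document term counts.
doc_counts = [
    Counter({"the": 2, "cat": 1, "sat": 1}),
    Counter({"the": 1, "dog": 1, "sat": 1}),
]
print(select_features(doc_counts, min_docs=2))
```

With `min_docs=2`, terms that appear in only one of the two documents ("cat", "dog") are removed, while shared terms are kept with their original counts.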

The Text Mining/Distance widget can be used on the output of Feature Selection (or of Bag of Words/Letter n-Grams directly) to compute the cosine distance between the documents. After that you can use the standard Orange MDS, Hierarchical Clustering or Distance Map widgets on the distance matrix output.
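
As a rough illustration of what a cosine distance matrix over term counts looks like, here is a small standard-library sketch. It is not the widget's implementation; the names `cosine_distance` and `distance_matrix`, the handling of empty documents, and the toy documents are assumptions.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here: an empty document is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(doc_counts):
    """Return the symmetric matrix of pairwise cosine distances as nested lists."""
    return [[cosine_distance(a, b) for b in doc_counts] for a in doc_counts]


# Toy documents (hypothetical), already reduced to term counts.
docs = [
    Counter("the cat sat on the mat".split()),
    Counter("the dog sat".split()),
    Counter("stock prices fell".split()),
]
for row in distance_matrix(docs):
    print(["%.3f" % d for d in row])
```

Documents that share no terms get a distance of 1, and a document compared with itself gets a distance of (approximately) 0; a matrix of this kind is what clustering or MDS then operates on.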

Note that the attributes created by the Bag of Words/Letter n-Grams widgets are added as meta attributes, and as such are of limited use to the rest of the Orange widgets unless you move them to the regular attributes. For instance, to plot the constructed features with the Visualize widgets (Scatter Plot, ...), they need to be ordinary features. So use the "Select Attributes" widget to move the "Meta Attributes" (bottom right box) to the "Attributes" (top right box).

Re: Text Mining Add On Newbie Question

Post by greenbjb » Wed Oct 30, 2013 11:29

Ales,

Your thoughtful reply is very much appreciated and full of very valuable tips that I need to make some progress. I'll work on this more and let you know how I make out. Again, thanks very much.

Jim

Re: Text Mining Add On Newbie Question

Post by axanthos » Thu Oct 31, 2013 19:56

Hi Jim,

I believe the text-mining add-on is geared toward supervised text classification tasks. You might want to have a look at the Textable add-on, which I have developed for teaching computerized text analysis to students in Humanities: http://langtech.ch/textable. The basic idea behind Textable is to bridge the gap between heterogeneous raw text sources and tabular data amenable to all sorts of analyses in Orange Canvas.

Equipped only with a basic knowledge of regular expressions, students typically use Textable for exploratory analyses such as concordances, collocations, and the like, before moving on to more advanced multivariate analyses using Orange facilities for correspondence analysis, hierarchical or k-means clustering, and so on.

I am in the (rather slow) process of translating the extensive documentation from French to English, but you can already get an idea by reading the Getting started section: https://orange-textable.readthedocs.org/en/latest/ (or by going to the project homepage referenced above and reading the French documentation on Moodle if you can). I'm happy to answer any questions on the (so far empty) forum: http://langtech.ch/forum/textable/

Aris

Re: Text Mining Add On Newbie Question

Post by Perugini » Fri Nov 01, 2013 17:23

Hello,
My name is Nick Perugini and I am a SUNY Oneonta student working with the Big Data team. I have gotten the Textable add-on installed and am having trouble figuring out what each widget is used for. Is there any tutorial I can work off of? Any help would be much appreciated.
Thank you,
Nick Perugini

Re: Text Mining Add On Newbie Question

Post by axanthos » Sat Nov 02, 2013 10:33

Hi Jim,

Have a look at https://orange-textable.readthedocs.org/en/latest/, section Getting started (translation from French is in progress; if you can read French, you'll find a lot of documentation on http://moodle2.unil.ch/course/view.php?id=574, where you can log in as Guest). Any questions are welcome on http://langtech.ch/forum/textable/.

Aris

Re: Text Mining Add On Newbie Question

Post by Perugini » Sun Nov 03, 2013 18:03

Aris,
Thank you for all of your help. From what you posted, I have been able to read in the file, preprocess it, bag of words it, and select the attributes I want. The only problem I am having is that when I go to graph it with the Distributions widget, I don't see any results. I'm not sure if I am doing it right, but what I am trying to do is create a bar graph of the count of each word in the file.
Thank you again for your help,
Nick Perugini

Re: Text Mining Add On Newbie Question

Post by axanthos » Mon Nov 04, 2013 11:11

Nick,

Could you please post your question (ideally along with a screenshot) on http://langtech.ch/forum/textable/ instead?

Thanks,
Aris

