Orange Forum • View topic - Behaviour of EntropyDiscretization()?

Post by RichardMR » Tue May 05, 2009 19:21

Dear Orange,

I would be grateful if anyone would clarify the functioning of EntropyDiscretization() for me.

I've read: http://www.ailab.si/Orange/doc/referenc ... zation.htm
and: http://www.ailab.si/Orange/doc/ofb/o_categorization.htm

I've also had a look at the source code: http://www.ailab.si/svn/orange/branches ... retize.cpp (I'm much more comfortable programming in Python. I think I can get the gist of what most functions are doing there, but I wasn't able to answer my own question.)

So, I understand that this method of discretization works by selecting candidate split points from the N - 1 midpoints between adjacent values of the N-member data set, ordered by the attribute to be discretized. A split point is accepted if its information gain exceeds the MDL threshold (with probabilities calculated from simple frequency counts).
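To make the criterion concrete, here is a minimal sketch in plain Python of a single split decision. It assumes the threshold is the standard Fayyad-Irani MDL criterion; it is not taken from Orange's source, and the names `entropy`, `best_split` and `mdl_threshold` are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels, from frequency counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (gain, cut, left_labels, right_labels) for the best midpoint, or None."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    n = len(pairs)
    base = entropy(ys)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal neighbouring values have no midpoint between them
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (gain, cut, left, right)
    return best


def mdl_threshold(labels, left, right):
    """Fayyad-Irani threshold (assumed here); a split is accepted if gain exceeds it."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return (math.log2(n - 1) + delta) / n
```

For example, with values 1 to 8 and labels `aaaabbbb`, `best_split` picks the midpoint 4.5 with a gain of 1 bit, which is above the threshold, so the cut is kept. With values 1 to 4 and labels `abab`, the best gain falls below the threshold, so no cut is made.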

If, however, forceAttribute=True, I understood that the split point with the highest information gain would be selected even if its gain < MDL, if and only if the normal procedure would otherwise select no split points.
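That reading of forceAttribute can be written down as a small, self-contained sketch. This is only an illustration of the interpretation described above, not Orange's implementation: the recursive procedure and the Fayyad-Irani threshold are assumptions, and `discretize`, `cut_points` and the `force` argument are hypothetical names.

```python
import math
from collections import Counter


def _entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())


def _best(pairs):
    """Best midpoint split of sorted (value, label) pairs: (gain, index, cut) or None."""
    ys = [y for _, y in pairs]
    n, base, best = len(ys), _entropy(ys), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no midpoint between equal values
        gain = base - i / n * _entropy(ys[:i]) - (n - i) / n * _entropy(ys[i:])
        if best is None or gain > best[0]:
            best = (gain, i, (pairs[i - 1][0] + pairs[i][0]) / 2.0)
    return best


def _mdl(pairs, i):
    """Assumed Fayyad-Irani threshold for splitting the pairs at index i."""
    ys = [y for _, y in pairs]
    left, right = ys[:i], ys[i:]
    n, k = len(ys), len(set(ys))
    delta = math.log2(3 ** k - 2) - (
        k * _entropy(ys)
        - len(set(left)) * _entropy(left)
        - len(set(right)) * _entropy(right)
    )
    return (math.log2(n - 1) + delta) / n


def cut_points(pairs):
    """Normal procedure: keep a cut only if gain > MDL, then recurse on both halves."""
    found = _best(pairs)
    if found is None:
        return []
    gain, i, cut = found
    if gain <= _mdl(pairs, i):
        return []
    return cut_points(pairs[:i]) + [cut] + cut_points(pairs[i:])


def discretize(values, labels, force=False):
    """Hypothetical reading: force one cut only when the normal procedure finds none."""
    pairs = sorted(zip(values, labels))
    points = cut_points(pairs)
    if not points and force:
        found = _best(pairs)
        if found is not None:
            points = [found[2]]  # highest-gain midpoint, accepted regardless of MDL
    return points
```

Under this reading, an attribute that already gets a cut is unaffected by the flag: `discretize(range(1, 9), list("aaaabbbb"))` gives `[4.5]` with or without `force=True`, while `discretize([1, 2, 3, 4], list("abab"))` gives no cut normally and exactly one cut when forced.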

However, I've found that forceAttribute=True also generates extra split points for other attributes that were already being split without it.

I'm confused. Does forceAttribute=True mean that MDL is simply ignored? If so, I would presume this would generate N-1 split points?

Or does it mean MDL is ignored once, giving rise to an extra split in all attributes that were split before?

The Python code I used to observe this:

    import orange

    train_data = orange.ExampleTable(train_tab_output_name)

    entro = orange.EntropyDiscretization()

    if ForceOneCut:  # My own command-line-adjustable boolean
        entro = orange.EntropyDiscretization(forceAttribute=True)

    for attr in train_data.domain.attributes:
        # Don't discretize the IDs!
        if attr.name != IDTag:
            disc = entro(attr, train_data)
            # Store the cut-off points found for this attribute
            Desc2Points[attr.name] = disc.getValueFrom.transformer.points

Thank you so much if you can clarify my understanding of this method!

Richard
