Orange Forum • View topic - binarization

binarization

A place to ask questions about methods in Orange and how they are used and other general support.

binarization

Hi,
Is there a reference to documentation about how SplitConstructor_ExhaustiveBinary works? For example, is there a paper or a wiki article that explains how exhaustive binarization works? I couldn't understand the documentation here -

"Finds the binarization with the highest score among all features. In case of ties, a random feature is selected.

The constructed branch_selector is an instance orange.ClassifierFromVarFD, that returns a value of the selected feature. Its transformer contains a MapIntValue that maps values of the feature into a binary feature. Branches with a single feature value are described with that value and branches with more than one are described with [<val1>, <val2>, ..., <valn>]. Only binary features are marked as spent."

Thanks,

Re: binarization

It just tries all possible splits of attribute values into two subsets and computes the attribute score (e.g. information gain) for each split. The one with the highest score is selected.

The second paragraph is technical and you can ignore it.

Re: binarization

Hi,

Exhaustive binarization (ie, if binarization == 1, aka SplitConstructor_ExhaustiveBinary()) should split discrete variables into subsets. On my tests for splitting information gain, gain ratio and gini index don't produce this in my tests (orange 2.0b Mac OS X, GUI).

My simple test data is a set of 80 numbers {1,2,3,4}, classified as {odd, even}, ie, a binary classification.

The root node is 40:40 odd and even.

The natural split into {1,3} vs {2,4} goes from information = 1 to information = 0 (perfect information). Information gain 1.0 at first split.

Each of information gain, gain ratio and gini index give a split like: 1 vs in(2,3,4), then 2 vs (3,4), then split the 3 and the 4. The information in the (2,3,4) branch is 0.918296 (via -p*log2(p) etc), so gain is way less than the natural split.

Friedman et al (in ch 9 of The Elements of Statistical Learning) and Quinlan in the Binarization in C4.5 describe the splitting, which is simple for a binary classification, but computationally complex for m-way (usually dealt with by heuristic methods).

The binary splits is to sort levels (values) from highest proportion of target class downward. The potential splits are the boundaries between each level after they were sorted, and choose the split which has the best gain for that variable. The select the variable with the best gain of all variables.

Can you assist please. Is this a bug? Am I not understanding how C4.5 does splitting?

The Python code looks simple and fine, so it there possibly a bug in the cpp?

Also, can you provide a reference for the relieff method of splitting?

Tom.

Re: binarization

Should have said I was talking about tree.py, not actual C4.5. Tom.

Re: binarization

You're right. I've found the bug and fixed it. Should be in tomorrow's snapshot.

Thanks!

Re: binarization

Thanks.

BTW, I'd like to know where the bug fix is (.py, .cpp, elsewhere), whether I need to install Orange afresh from source. Using Mac OS X.

Trying to juggle my time and whether I wait for the fix to be in the Mac OS X bundle, or in tall.

Suggest the download page can indicate the date (or versions) of what is in the latest full packages and packed source.

Cheers, Tom.

Thanks.

Re: binarization

I've checked binarization against my initial test and indeed, the edit to the cpp code (in tdidt.cpp) now yields splits into subsets.

However I have built some trees on real data which don't appear to be splitting for max InfoGain (or Gain Ration, or other setting).

Here's a simple illustration:

Data is two classes {0,1}, predictor set (one variable) is {1,2,3,4,5} -> 0, {6,7} -> 1. I've run both with weights and without. Example case, weight for 1 thru 5 was 100, weight for 6 and 7 was 250, so 1000 "records".

Ie, variable = 1,2,3,4,5 are homogeneous (all 0's) and 6,7 all 1's. [The real data had many regions (subtrees) homogeneous in 0's, and a small number of regions with a mix of 0's and 1's].

Root node is 500 0's and 500 1's. Info of root node is 0.693.

Binarization = 1 should yield a split of the root node of [1,2,3,4,5] vs [6,7] with Info is 0.

Here's what actually happened:

>>> trw = Orange.classification.tree.TreeLearner(test, "w", binarization =1, measures = "InfoGain")
>>> print trw
v1=2: 0 (100.00%)
v1=in [1, 3, 4, 5, 6, 7]
| v1=3: 0 (100.00%)
| v1=in [1, 4, 5, 6, 7]
| | v1=4: 0 (100.00%)
| | v1=in [1, 5, 6, 7]
| | | v1=5: 0 (100.00%)
| | | v1=in [1, 6, 7]
| | | | v1=1: 0 (100.00%)
| | | | v1=in [6, 7]: 1 (100.00%)

The first split is [2] vs [1,3,4,5,6,7] with 500/900 in the right node, Info is 0.687.

Here's the .to_string version:

>>> print trm.to_string(leaf_str="%V (%M of %N)", node_str='.')
root: 0 (500.000 of 1000.000)
| v1=2: 0 (100.000 of 100.000)
| v1=in [1, 3, 4, 5, 6, 7]: 1 (500.000 of 900.000)
| | v1=3: 0 (100.000 of 100.000)
| | v1=in [1, 4, 5, 6, 7]: 1 (500.000 of 800.000)
| | | v1=4: 0 (100.000 of 100.000)
| | | v1=in [1, 5, 6, 7]: 1 (500.000 of 700.000)
| | | | v1=5: 0 (100.000 of 100.000)
| | | | v1=in [1, 6, 7]: 1 (500.000 of 600.000)
| | | | | v1=1: 0 (100.000 of 100.000)
| | | | | v1=in [6, 7]: 1 (500.000 of 500.000)

Here's the data...

v1 r w
d d c
class meta
1 0 100
2 0 100
3 0 100
4 0 100
5 0 100
6 1 250
7 1 250

Cheers, Tom.

Re: binarization

Bumping this: Classification Tree Binarization still not 100% for Orange inline in Python. Is correct for GUI version. Running in Python, tree is:

v1=2: 0 (100.00%)
v1=in [1, 3, 4, 5, 6, 7]
| v1=3: 0 (100.00%)
| v1=in [1, 4, 5, 6, 7]
| | v1=4: 0 (100.00%)
| | v1=in [1, 5, 6, 7]
| | | v1=5: 0 (100.00%)
| | | v1=in [1, 6, 7]
| | | | v1=1: 0 (100.00%)
| | | | v1=in [6, 7]: 1 (100.00%)

Running in GUI is (equiv to):

v1=in[1,2,3,4,5]: 0
v1=in[6,7]: 1

GUI not does not use metaclass for weights or full control, so I can't use it. Inline does not give the compact tree, so not correct (pruning is conceptually impaired, tree model is complex).

Tom.

Re: binarization

Use 'measure' keyword parameter not 'measures'
Code: Select all
`trw = Orange.classification.tree.TreeLearner(data, "w", binarization=1,                                             measure=Orange.feature.scoring.InfoGain())`