Orange Forum • View topic - How can I limit decision tree resplits on same variable?

## How can I limit decision tree resplits on same variable?



I have a continuous variable with a very high gain ratio relative to my other variables. It's called "coexpression" in the following excerpt:

```
coexpression<0.437
| +-2<10.500
| | -pp2<0.500
| | | -p-n2<0.500
| | | | +Gp-2<0.500
| | | | | Pn-n2<0.500
| | | | | | -n--2<0.500
| | | | | | | coexpression<0.379
| | | | | | | | n+GP2>=0.500: True (100.00%)
| | | | | | | | n+GP2<0.500
| | | | | | | | | coexpression<0.105
| | | | | | | | | | coexpression<-0.098
| | | | | | | | | | | coexpression<-0.241
| | | | | | | | | | | | coexpression<-0.278
```

I get better cross-validation performance when I leave coexpression out of the data, since splitting on it overfits.
What I would like to do is limit a TreeLearner to only use any particular variable up to some maximum number of times on a branch and then abandon it for the rest of the branch. For example, if I set that limit to 2 and tried to learn the above tree again, then coexpression<0.379 would "use up" coexpression for its subtree and the chain of coexpression splits below n+GP2 would not appear.
I don't see a way that a TreeSplitConstructor can get information about what nodes are above it, but I may be overlooking some way of getting to that information.
If there is a way for a TreeSplitConstructor to see the node's ancestors, how do I do that? If not, how else might I get the behavior I'm looking for?

Decision trees are usually not pruned like this, so Orange doesn't support it by default.

You can write your own tree inducer that reuses the components from Orange. This is fairly easy, but not without documentation to follow, and there's practically no documentation on how to do that.
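For what it's worth, the per-branch limit itself is simple to express once you control the recursion. Below is a toy sketch in plain Python (an ID3-style builder over discrete attributes; this is not Orange's actual TreeLearner internals, and all names are made up for illustration). The key idea is passing a per-branch usage counter down the recursion and dropping attributes that hit the limit:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, candidates):
    """Pick the candidate attribute with the largest information gain.
    rows: list of dicts mapping attribute name -> (discrete) value."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for attr in candidates:
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        rem = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        if base - rem > best_gain:
            best, best_gain = attr, base - rem
    return best

def induce(rows, labels, used=None, max_reuse=2):
    """Recursive tree builder. `used` counts splits per attribute on the
    current branch; attributes at the limit are excluded from candidates."""
    used = used or Counter()
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    candidates = [a for a in rows[0] if used[a] < max_reuse]
    attr = best_split(rows, labels, candidates)
    if attr is None:
        return Counter(labels).most_common(1)[0][0]   # majority leaf
    node = {"split": attr, "branches": {}}
    for value in set(r[attr] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*sub)
        # `used + Counter([attr])` makes a fresh counter, so the budget
        # is spent per branch, not globally across the tree.
        node["branches"][value] = induce(
            list(sub_rows), list(sub_labels), used + Counter([attr]), max_reuse)
    return node
```

Handling continuous thresholds the way Orange does would take more machinery, but the budget-passing pattern is the same.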

A more complicated approach (though easier for you) might be to induce the entire tree and then traverse it to find the attributes that are reused too many times, cut those branches, and rebuild the tree at each such node with the "spent" attribute removed. Not very elegant.
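The inspection step of this induce-then-examine approach can be sketched generically. The function below walks a nested-dict tree (an assumed structure for illustration, not Orange's real node objects) and reports every split whose attribute already appeared a given number of times higher up the same branch:

```python
def over_reused(node, limit=2, counts=None, path=()):
    """Yield (path, attribute) for every split whose attribute has already
    been used `limit` times on the path from the root to this node."""
    if not isinstance(node, dict):          # leaf: nothing to report
        return
    counts = dict(counts or {})             # copy: counts are per-branch
    attr = node["split"]
    counts[attr] = counts.get(attr, 0) + 1
    if counts[attr] > limit:
        yield path, attr                    # candidate node to cut and relearn
    for value, child in node["branches"].items():
        yield from over_reused(child, limit, counts, path + (value,))
```

Each reported path marks a node where one could cut the subtree and relearn with the spent attribute removed from the candidate set.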

Finally, the tree nodes do not know their parents. As you can imagine, this would be trivial to add, but I don't want to, in order to prevent misconstructed trees: one could (re)attach nodes without correctly setting the parent. Worse, somebody might already be doing this. Since the parent link isn't there now, I prefer not to add it.

I think you should solve your problem in some other way: if the tree needs to reuse an attribute like this, then using a tree learner might not be a good idea. Preventing the reuse is unnatural.

Janez

The lack of tree-inducer documentation was indeed the main reason I asked in the first place. The generally sparse documentation of Orange seems like its Achilles' heel to me; I'd rather have fewer features and more documentation than vice versa.

As for my current problem, I do realize that forcing an arbitrary split limit onto a tree-learning algorithm is "unnatural", but in my situation the distribution of the variables is a bit unnatural as well: most of them are ratios of two unbounded but usually small integer counts. The result is that they're "continuous" as far as their range goes, but they mostly clump at a few values and don't offer the large number of split-point choices that a physically continuous measurement like coexpression does.
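The clumping effect is easy to see in plain Python (hypothetical data, not from the actual set): a tree can only place a threshold between adjacent distinct values, so a ratio-of-small-counts feature yields only a handful of candidate split points.

```python
def candidate_splits(values):
    """Midpoints between adjacent distinct sorted values -- the only
    thresholds a univariate split can meaningfully choose from."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Eight observations of a ratio feature clump at just 0, 1/3, 1/2, and 1,
# so only three thresholds are available to the learner.
ratio_feature = [0/1, 1/2, 1/2, 1/1, 1/3, 0/2, 1/2, 2/2]
print(len(candidate_splits(ratio_feature)))
```

A truly continuous feature with eight distinct measurements would offer seven thresholds from the same number of rows.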

I'm thinking I could discretize coexpression so that it only offers a few split points, although then the split points won't necessarily be the best choices at a given depth. It might also make sense not to use coexpression as a tree input at all, and instead do some multi-step learning in which coexpression is fused with the tree's output. I'll have to figure out metalearners or supervised discretization, but both of those seem to have enough documentation to get started on.
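To make the discretization idea concrete, here is a minimal sketch of equal-frequency binning (a hypothetical helper written from scratch, not Orange's own discretization classes) that caps how many split points a continuous variable like coexpression can offer a tree learner:

```python
def equal_freq_cuts(values, n_bins=4):
    """Return n_bins - 1 cut points at the empirical quantile boundaries,
    so each bin holds roughly the same number of observations."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def bin_index(value, cuts):
    """Map a value to its bin: the number of cut points it meets or exceeds."""
    return sum(value >= c for c in cuts)
```

With `n_bins=4` the learner sees at most three split points for the variable, regardless of depth; a supervised (class-aware) discretization would place the cuts more informatively but bounds the choices the same way.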