Orange Forum • View topic - Non-linear regression problem

## Non-linear regression problem

A place to ask questions about methods in Orange and how they are used and other general support.

### Non-linear regression problem

Hello.

We have a non-linear regression problem that we're considering using Orange for. The problem is unusual in some respects.

I've appended a description of the problem (especially 'what makes it unusual') written by someone who's much more familiar with the theory than I am.

I'm hoping it'll make sense to you and you'll be able to say "yes, you can do that in Orange by subclassing Foo and adding Bar" etc.

Let me know if things aren't clear enough and I'll expand from my more practical perspective.

Thanks.

Tim.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Our problem domain is defined by feature vectors of 5 to 20 nominal
features. Our goal is to predict a continuous value for each feature
vector presented to the system. Thus our problem is a non-linear
regression problem.

The main difference from a standard setting is that
the raw feature vectors are not labeled with a corresponding continuous
value, but with a binary one. The continuous value is the rate (or
average) value within a group of feature vectors. In decision-tree terms,
the predicted continuous value would be calculated from the feature
vectors within a leaf node (or any intermediate node).
We also wish to compute a secondary continuous value from the instance
labels within each tree node.

Our splitting rule is not based on the least-squared deviation (LSD) as
the vectors themselves are not labeled with the continuous value, but
instead on an ad hoc rule that tries to maximize the variance of the
computed continuous values across all the child nodes. It's possible LSD
would work using the binary values of each feature vector.

We also have ad hoc stopping rules based on statistical measures. One
rule determines the validity of each proposed split. If a split is not
acceptable the split is undone and the splitting process stopped on that
branch. A second stopping rule applies to each node. Nodes that do not
meet this rule are deleted from the tree.

We introduced a special "other" feature value to handle cases when a
test vector has a feature value that is not represented in the tree.
Not being represented can happen because the value was not present in
the training set (a common occurrence) or because the number of training
instances was too low to include it in the tree. When training, the
"other" feature value is added to all node splits after the optimal
split is decided. All the data of the parent node is considered present
in the "other" child node. When on the testing phase the "other" value
is a match when no other values are.

We're considering having a pruning stage. It will eliminate all "other"
nodes that are leaves. And it will merge leaves whose corresponding
continuous values are close enough.

When searching the tree (performance phase) the process starts at the
root and proceeds as usual, with the caveat of the "other" value (as
described), but if at some time a match is not possible the current node
is the result of the search and the output continuous value is computed
from the instances within it.

The number of training feature vectors is large, in the millions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Decision tree induction in Orange is component-oriented, so you can either subclass the existing components or, the other way around, if your problem is weird enough, rewrite the skeleton of the algorithm but reuse the existing components where you can. Part of the documentation on how to do that is at http://www.ailab.si/orange/doc/referenc ... earner.htm, and there is some more at http://www.ailab.si/orange/doc/reference/callbacks.htm, but in general the documentation for the tree learner is incomplete.

> Our splitting rule is not based on the least-squared deviation (LSD) as the vectors themselves are not labeled with the continuous value, but instead on an ad hoc rule that tries to maximize the variance of the computed continuous values across all the child nodes. It's possible LSD would work using the binary values of each feature vector.

I do not understand what you mean by the variance of the computed continuous values across the child nodes. If the optimal split (your split criterion is still a value of a single attribute, right?) can be decided from the values of the learning examples that reach a particular node, you will only have to subclass the class MeasureAttribute. If, as I suspect from the above description, the selection of the attribute is global, then you will need to rewrite the skeleton (I'd say it's about 20-30 lines of Python - I'll probably have to do it today for another problem, so I can put the code in the forum), reuse most of the components, and modify others (and the skeleton) correspondingly.
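For concreteness, here is a plain-Python sketch of the kind of score a local criterion like that could compute - the function name and the (features, label) representation are illustrative, not Orange's API:

```python
from collections import defaultdict

def split_score(examples, attr):
    """Score a candidate split on `attr` as the variance of the
    per-branch rates (means of the binary labels), as described above.
    `examples` is a list of (feature_dict, binary_label) pairs.
    Illustrative only -- in Orange this logic would live inside a
    MeasureAttribute subclass."""
    branches = defaultdict(list)
    for features, label in examples:
        branches[features[attr]].append(label)
    # rate = mean of the binary labels in each branch
    rates = [sum(labels) / len(labels) for labels in branches.values()]
    mean = sum(rates) / len(rates)
    # variance of the branch rates: higher = branches differ more
    return sum((r - mean) ** 2 for r in rates) / len(rates)
```

A split that separates the labels perfectly scores higher than one that leaves every branch with the same rate, which matches the "maximize the variance across child nodes" idea.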

> We also have ad hoc stopping rules based on statistical measures. One rule determines the validity of each proposed split. If a split is not acceptable the split is undone and the splitting process stopped on that branch. A second stopping rule applies to each node. Nodes that do not meet this rule are deleted from the tree.

Orange's tree learner uses two classes for that. One is the split constructor, which is given the examples at the node and returns the 'optimal' split. In your case, it should also check that the found split is acceptable; if it is not, it can return None. As for the second stopping rule, I understand that it is unrelated to the splits and depends only on the examples in the node, so you can apply it prior to splitting? If so, you need to use (or subclass) Orange's stopping criteria. If you can only know it after the tree is constructed, this can go into post-pruning. (I might have misunderstood this question, I fear.)
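Roughly, the two hooks play these roles - sketched here in plain Python with placeholder names and placeholder tests (plug in your own statistical measures; these are not Orange classes):

```python
def find_split(examples, candidate_attrs, score, is_valid):
    """Stand-in for a custom split constructor: pick the best-scoring
    attribute, but return None if the proposed split fails the validity
    test.  Returning None stops growth on this branch."""
    best = max(candidate_attrs, key=lambda a: score(examples, a))
    return best if is_valid(examples, best) else None

def stop_here(examples, min_examples=30):
    """Stand-in for a stop criterion applied before splitting, based
    only on the examples that reached the node.  The threshold is an
    arbitrary placeholder."""
    return len(examples) < min_examples
```

The learner would call `stop_here` first, then `find_split`; either one saying "no" ends that branch.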

> When training, the "other" feature value is added to all node splits after the optimal split is decided.

This is one of the possible treatments for unknown values, the corresponding class name is TreeExampleSplitter_UnknownsToBranch.
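To illustrate the test-time behaviour you describe (fall through to the "other" branch, and stop at the current node when nothing at all matches), here is a plain-Python sketch - the `Node` class and the `OTHER` sentinel are made-up names for the structures in your description, not part of Orange:

```python
OTHER = "<other>"  # sentinel for the special "other" feature value

class Node:
    def __init__(self, attr=None, children=None, rates=None):
        self.attr = attr                 # feature tested here (None = leaf)
        self.children = children or {}   # feature value -> child Node
        self.rates = rates or []         # binary labels of instances here

    def predict(self, vector):
        """Walk down the tree; fall back to the "other" branch, and
        stop at the current node when no branch (not even "other")
        matches.  `vector` is a feature-name -> value dict."""
        node = self
        while node.attr is not None:
            value = vector.get(node.attr)
            child = node.children.get(value) or node.children.get(OTHER)
            if child is None:
                break                    # no match: answer from this node
            node = child
        # the continuous output is the rate (mean of the binary labels)
        return sum(node.rates) / len(node.rates)
```

For example, `Node(attr="color", children={"red": Node(rates=[1, 1, 0, 1]), OTHER: Node(rates=[0, 0, 1])}).predict({"color": "blue"})` would route through the "other" branch.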

> We're considering having a pruning stage. It will eliminate all "other" nodes that are leaves. And it will merge leaves whose corresponding continuous values are close enough.

Post-pruning is a matter of a simple script in Python - just traverse the tree and do with it whatever you want. We haven't programmed anything that does something like this.
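Such a traversal could look roughly like this, on a hypothetical dict-based tree (`attr`, `children`, `rates` are illustrative names, not Orange structures, and the merge tolerance is a placeholder):

```python
OTHER = "<other>"  # sentinel for the special "other" feature value

def rate(node):
    """Continuous output of a node: mean of its binary labels."""
    return sum(node["rates"]) / len(node["rates"])

def prune(node, tol=0.05):
    """Post-order pruning pass: drop "other" children that are leaves,
    then collapse a node whose remaining children are all leaves with
    rates within `tol` of each other."""
    if node["attr"] is None:
        return node                      # already a leaf
    for child in node["children"].values():
        prune(child, tol)
    # 1. eliminate "other" nodes that are leaves
    node["children"] = {v: c for v, c in node["children"].items()
                        if not (v == OTHER and c["attr"] is None)}
    # 2. merge leaves whose rates are close enough
    kids = list(node["children"].values())
    if kids and all(c["attr"] is None for c in kids):
        rates = [rate(c) for c in kids]
        if max(rates) - min(rates) <= tol:
            node["rates"] = [x for c in kids for x in c["rates"]]
            node["attr"], node["children"] = None, {}
    return node
```

Since the whole tree is just a Python structure, both pruning rules are an ordinary recursive function rather than anything Orange-specific.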

As I said, I'll be working on some special trees today, so I'll hopefully be able to post some example scripts that will also be useful for you.