Ticket #1165 (new check)

Opened 2 years ago

Last modified 2 years ago

What does earth do if it encounters unknown class values?

Reported by: janez
Owned by:
Milestone: 3.0
Component: library
Severity: minor
Keywords:
Cc:
Blocking:
Blocked By:

Description

Nothing, currently. It assumes that there are no missing values.

Change History

comment:1 Changed 2 years ago by lanz

This will be a problem for many learners so it would be nice to solve it systematically. I am postponing a fix for this in multi-target trees (see #1111) because the solution really should be implemented only once for Learner (or some MultiLearner base class) and every algorithm gets the solution through inheritance.

This probably needs to be discussed, but I think we should have many more of these "helper functions" implemented in Learner or different subclasses of Learner and use multiple inheritance. For example, we have imputation and continuization in Orange.regression.base.BaseRegressionLearner, although it is by no means limited to regression. A lot of methods convert the data to numpy and then do centering and standardization, but have this implemented individually in many places...
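As a rough illustration of the "helper functions" idea, the sketch below puts centering and standardization into a single mixin that any learner could inherit. The names StandardizeMixin, fit_scaler, standardize and ToyLearner are hypothetical and not part of Orange; plain Python lists stand in for the numpy matrix.

```python
from statistics import fmean, pstdev


class StandardizeMixin:
    """Hypothetical shared helper: centering and scaling written once."""

    def fit_scaler(self, rows):
        # Transpose the rows into columns to get per-feature statistics.
        columns = list(zip(*rows))
        self.means = [fmean(col) for col in columns]
        # A constant column has zero deviation; scale it by 1 instead.
        self.scales = [pstdev(col) or 1.0 for col in columns]
        return self

    def standardize(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.scales)]
            for row in rows
        ]


class ToyLearner(StandardizeMixin):
    """Hypothetical learner that reuses the mixin instead of its own code."""

    def fit(self, rows, targets):
        self.fit_scaler(rows)
        self.train_rows = self.standardize(rows)
        self.train_targets = list(targets)
        return self
```

Any other learner would get the same preprocessing by adding the mixin to its base classes, which is the multiple-inheritance arrangement suggested above.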

I don't know how to cc other people, but it might be better to discuss this in a meeting anyway.

comment:2 Changed 2 years ago by janez

I don't think there is a general solution. Imputation is much like prediction: you're predicting the missing values, typically using rather dumb algorithms. So if you impute missing classes before learning, the learning algorithm has to learn what the other model, that is, the one used for imputation, has constructed. Imputation just makes life harder for the learner.

I thus believe that for most learning algorithms it's best if they just ignore such examples. An algorithm should use them only if it can really handle them (e.g. through transduction).
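A minimal sketch of "just ignore such examples", assuming the data arrives as parallel lists and that an unknown class is represented by None or NaN. Both the representation and the names is_missing and drop_unknown_classes are assumptions made for illustration, not Orange API.

```python
import math


def is_missing(value):
    # Assumption: an unknown class value is either None or a float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_unknown_classes(rows, targets):
    """Return the examples whose class is known, plus the number dropped."""
    kept = [(row, t) for row, t in zip(rows, targets) if not is_missing(t)]
    kept_rows = [row for row, _ in kept]
    kept_targets = [t for _, t in kept]
    return kept_rows, kept_targets, len(targets) - len(kept)
```

Returning the number of dropped examples lets the caller report it rather than discard data without notice.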

My ticket is not formulated well; what I meant is, basically, just that the algorithm counts on all classes being defined and that this should be fixed.

comment:3 Changed 2 years ago by lanz

OK, I agree about imputation. My previous (actually current) solution was to just raise an error if the data has missing class values; the user can then do imputation or something smarter. With multiple classes, ignoring examples would usually remove most of the data set, so I don't think doing this silently is the best option either.
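The raise-an-error approach could look roughly like the sketch below. It assumes each example carries a list of class values with None marking an unknown one; check_class_values is a hypothetical name, not an existing Orange function.

```python
def check_class_values(targets):
    """Raise ValueError if any example has an unknown class value.

    Assumption: `targets` is a list of rows, one value per class variable,
    and None marks an unknown value.
    """
    incomplete = [
        i for i, row in enumerate(targets) if any(v is None for v in row)
    ]
    if incomplete:
        raise ValueError(
            "%d of %d examples have missing class values"
            % (len(incomplete), len(targets))
        )


# Three class variables, one unknown value per example: every example is
# incomplete, so ignoring such examples would leave no data at all.
multi_target = [
    [None, 1, 0],
    [1, None, 0],
    [0, 1, None],
]
```

Calling check_class_values(multi_target) raises the error for all three examples, even though only a third of the individual class values are unknown.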

However continuization is such a common requirement for algorithms (before transforming to a numpy matrix) that the previous comment still applies. But this is not the ticket for that.
