Ticket #937 (accepted bug)

Opened 3 years ago

Last modified 3 years ago

Possible bug in handling missing attribute-values in MeasureAttribute

Reported by: nkarthiks Owned by: janez
Milestone: Future Component: other
Severity: minor Keywords:
Cc: Blocking:
Blocked By:

Description

I came across this issue when attempting to construct regression trees (using TreeLearner) the simple way (without weights, pruning, etc.). Some of the attributes have missing values, and are getting up-weighted in the tree instead of being down-weighted by ReduceByUnknowns. In particular, I'm referring to code here: http://orange.biolab.si/trac/browser/trunk/source/orange/measures.cpp#L1052

I inserted log statements and found that outer.unknowns was 0 even though the attribute being evaluated for the split had 149 missing values out of a total of 179 examples (20 missing). Hence the MSE was unchanged. Logged values: outer.cases=40; outer.unknowns=0

Thanks!

Change History

comment:1 Changed 3 years ago by ales

  • Status changed from new to assigned
  • Owner set to janez

comment:2 Changed 3 years ago by janez

  • Status changed from assigned to accepted

I'm sorry: I kept postponing this until I ran into the same problem while preparing a major overhaul of Orange's core. You're right: the treatment of unknown values is so wrong that an attribute without any known values would have maximal information gain.

I'm gonna backport this to the trunk, but if you're in a hurry you can fix your version yourself:

outer.cases / (outer.unknowns + outer.cases)

should be replaced by

1 - outer.unknowns/outer.cases

or by something like the following (I cannot verify right now whether this check is needed, since I'm working on a different branch)

outer.cases > 1e-6 ? (1 - outer.unknowns/outer.cases) : 0

This occurs in several places in measures.cpp.
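
For anyone patching by hand, here is a minimal standalone sketch of the two expressions from this comment. The Counts struct and both function names are hypothetical stand-ins, not Orange's actual classes; only the field names cases and unknowns come from the ticket. It also assumes that cases counts all examples, including those with unknown values, which is what the replacement formula seems to imply.

```cpp
#include <cassert>

// Hypothetical stand-in for the counters named in the ticket;
// this is NOT the class used in measures.cpp.
struct Counts {
    double cases;     // assumed: total example weight, unknowns included
    double unknowns;  // weight of examples whose attribute value is missing
};

// The expression the ticket says should be replaced.
// Under the assumption above it never drops below 0.5.
double old_factor(const Counts &outer) {
    return outer.cases / (outer.unknowns + outer.cases);
}

// The suggested replacement, including the guard against
// a (near-)empty distribution.
double fixed_factor(const Counts &outer) {
    assert(outer.cases >= 0 && outer.unknowns >= 0);
    return outer.cases > 1e-6 ? (1 - outer.unknowns / outer.cases) : 0;
}
```

Under these assumptions, an attribute whose values are all unknown (cases == unknowns) gets a factor of 0 from fixed_factor, while old_factor still returns 0.5, so the measure is only halved rather than reduced to zero.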
