Ticket #974 (new task)

Opened 3 years ago

Last modified 3 years ago

Faster threshold search

Reported by: janez Owned by: janez
Milestone: 3.0 Component: other
Severity: minor Keywords:
Cc: Blocking:
Blocked By:

Description

Threshold search could be sped up by implementing special cases of the search-related procedures for binary attributes that would use a pair of doubles instead of a TDistribution.

An alternative is to avoid TDistribution altogether and use a double * instead. TScoreFeature would then have to implement a corresponding method.

A third alternative is to keep TDistribution, but only one of them (e.g. the left one), and to implement a TScoreFeature operator that accepts the left distribution and the class distribution. This would simplify the threshold search loop (no maintaining of the counts for the right side) and possibly allow a more efficient computation of scores as well.
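The loop described above could look roughly like the following sketch. It is only an illustration under stated assumptions: it uses plain double arrays for a binary class rather than TDistribution, and the names Example, entropy2, infoGain and bestThreshold are hypothetical and not part of the existing code. The examples are assumed to be already sorted by attribute value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical example type (not an existing class): one training instance
// with a continuous attribute value and a binary class label (0 or 1).
struct Example {
    double value;
    int cls;
};

// Entropy (in bits) of a two-class distribution given as raw counts.
static double entropy2(double c0, double c1) {
    const double n = c0 + c1;
    if (n <= 0.0) return 0.0;
    double h = 0.0;
    if (c0 > 0.0) h -= (c0 / n) * std::log2(c0 / n);
    if (c1 > 0.0) h -= (c1 / n) * std::log2(c1 / n);
    return h;
}

// Information gain of a binary split, computed from the left-side counts and
// the total class counts only. The right side is derived by subtraction here,
// so the search loop does not have to maintain it.
double infoGain(const double left[2], const double total[2]) {
    const double right0 = total[0] - left[0];
    const double right1 = total[1] - left[1];
    const double nLeft = left[0] + left[1];
    const double nRight = right0 + right1;
    const double n = nLeft + nRight;
    if (n <= 0.0) return 0.0;
    return entropy2(total[0], total[1])
         - (nLeft / n) * entropy2(left[0], left[1])
         - (nRight / n) * entropy2(right0, right1);
}

// Scans examples sorted by value and returns {threshold, gain} of the best
// cut. If no cut exists (fewer than two distinct values), the gain is -1.
std::pair<double, double> bestThreshold(const std::vector<Example>& sorted) {
    double total[2] = {0.0, 0.0};
    for (const Example& e : sorted) {
        assert(e.cls == 0 || e.cls == 1);
        total[e.cls] += 1.0;
    }

    double left[2] = {0.0, 0.0};
    double bestGain = -1.0;
    double bestCut = 0.0;
    for (std::size_t i = 0; i + 1 < sorted.size(); ++i) {
        left[sorted[i].cls] += 1.0;  // the only per-step update
        if (sorted[i].value == sorted[i + 1].value) continue;  // no cut between equal values
        const double gain = infoGain(left, total);
        if (gain > bestGain) {
            bestGain = gain;
            bestCut = 0.5 * (sorted[i].value + sorted[i + 1].value);
        }
    }
    return {bestCut, bestGain};
}
```

Calling bestThreshold on the examples (1, class 0), (2, class 0), (3, class 1), (4, class 1) should pick the cut between 2 and 3. Note that the right-side counts are still computed inside the score; only the loop itself is simpler.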

Change History

comment:1 Changed 3 years ago by janez

I tried to simplify the formulas for information gain and the Gini index for the binary case (where we know the left-side distribution and the total distribution), but nothing nicer comes out. Also, not keeping the right-hand side distribution does not help much, since we still need it to compute the score (e.g. the entropy).
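For reference, the scores in question can be written in terms of the left and total counts as follows. The notation is introduced here only for illustration: l_c and n_c are the left-side and total counts for class c, L and N are their sums over the classes.

```latex
\begin{align*}
H(p) &= -\sum_c p_c \log_2 p_c, \qquad G(p) = 1 - \sum_c p_c^2 \\
\mathrm{Gain}(l, n) &= H\!\left(\frac{n}{N}\right)
  - \frac{L}{N}\, H\!\left(\frac{l}{L}\right)
  - \frac{N-L}{N}\, H\!\left(\frac{n-l}{N-L}\right) \\
\Delta\mathrm{Gini}(l, n) &= G\!\left(\frac{n}{N}\right)
  - \frac{L}{N}\, G\!\left(\frac{l}{L}\right)
  - \frac{N-L}{N}\, G\!\left(\frac{n-l}{N-L}\right)
\end{align*}
```

The last term in each score is the right-hand side distribution n - l, normalized by N - L, which is why it has to be reconstructed even when it is not stored.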

The code would still run faster because the data representation would be simpler, because we would have fewer arguments, because we would not update TDistribution::abs and cases if we used double *, etc.
