source: orange/orange/Orange/classification/tree.py @ 7660:5f432be7cddc

1"""
2
3.. index:: classification tree
4
5.. index::
6   single: classification; tree
7
8********************
9Classification trees
10********************
11
This page describes Orange's classification trees. It first covers
the basic components and procedures: the structure that represents
the tree, how the tree is used for classification, and how the tree
is built and pruned. The order may seem odd, but since the parts are
rather complex, it is easier to follow once you have some idea of
what the principal components do. The concrete classes that you can
use as components for a tree learner are described afterwards.
21
Classification trees are represented as a tree-like hierarchy of
:obj:`Node` objects.
24
25.. class:: Node
26
27    Node stores information about the learning examples belonging
28    to the node, a branch selector, a list of branches (if the node is
29    not a leaf) with their descriptions and strengths, and a classifier.
30
31    .. attribute:: distribution
32   
33        Stores a distribution for learning examples belonging to the node.
        Storing distributions can be disabled by setting the
        :obj:`TreeLearnerBase`'s :obj:`storeDistributions` flag to false.
36
37    .. attribute:: contingency
38
39        Stores complete contingency matrices for the learning examples
40        belonging to the node. Storing contingencies can be enabled by
41        setting :obj:`TreeLearnerBase`'s :obj:`storeContingencies`
        flag to true. Note that even when the flag is not
        set, the contingencies are computed and stored in the
        :obj:`Node`, but are removed shortly afterwards. The
        details are given in the description of the
        :obj:`TreeLearnerBase` object.
47
48    .. attribute:: examples, weightID
49
        Store a set of learning examples for the node and the
        ID of the corresponding weight meta attribute. The root of
        the tree stores a "master table" of examples, while other
        nodes' :obj:`Orange.data.Table` contain references to
        examples in the root's :obj:`Orange.data.Table`. Examples
        are stored only if a corresponding flag
        (:obj:`storeExamples`) has been set while building the
        tree; to conserve space, storing is disabled by default.
58
59    .. attribute:: nodeClassifier
60
        A classifier (usually, but not necessarily, a
        :obj:`DefaultClassifier`) that can be used to classify
        examples coming to the node. If the node is a leaf, it is
        used to decide the final class (or class distribution) of an
        example. For an internal node, the classifier is stored only
        if the :obj:`TreeLearnerBase`'s flag
        :obj:`storeNodeClassifier` is set. Since the
        :obj:`nodeClassifier` is needed by :obj:`Descender` and for
        pruning (see far below), storing it is the default
        behaviour; the space consumption of the default
        :obj:`DefaultClassifier` is rather small. You should never
        disable this if you intend to prune the tree later.
72
73    If the node is a leaf, the remaining fields are None.
74    If it's an internal node, there are several additional fields.
75
76    .. attribute:: branches
77
        Stores a list of subtrees, given as :obj:`Node` objects.
        An element can be None; in this case the node is empty.
80
81    .. attribute:: branchDescriptions
82
83        A list with string descriptions for branches, constructed by
84        :obj:`SplitConstructor`. It can contain different kinds
85        of descriptions, but basically, expect things like 'red' or '>12.3'.
86
87    .. attribute:: branchSizes
88
89        Gives a (weighted) number of training examples that went into
90        each branch. This can be used later, for instance, for
91        modeling probabilities when classifying examples with
92        unknown values.
93
94    .. attribute:: branchSelector
95
        Gives a branch for each example. The same object is used
        during learning and classifying. The :obj:`branchSelector`
        is of type :obj:`orange.Classifier`, since its job is
        similar to that of a classifier: it gets an example and
        returns a discrete :obj:`Orange.data.Value` in the range
        :samp:`[0, len(branches)-1]`. When an example cannot be
        classified into any single branch, the selector can return
        an :obj:`Orange.data.Value` whose special value (sVal) is a
        discrete distribution (:obj:`DiscDistribution`),
        representing the :obj:`branchSelector`'s opinion on how to
        divide the example among the branches. Whether this proposal
        is used depends upon the chosen :obj:`ExampleSplitter`
        (when learning) or :obj:`Descender` (when classifying).
109
    The lists :obj:`branches`, :obj:`branchDescriptions` and
    :obj:`branchSizes` are of the same length; all of them are
    defined if the node is internal and none of them if it is a leaf.
113
114    .. method:: treeSize()
115       
116        Return the number of nodes in the subtrees (including the
117        node, excluding null-nodes).
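
    For illustration, here is a minimal sketch (ours, not part of the
    API) that counts the non-null leaves under a :obj:`Node`, using
    only the attributes described above; a commented version of this
    kind of traversal is developed in the Examples section below::

        def countLeaves(node):
            if not node:
                return 0    # a null node
            if node.branchSelector:
                # internal node: recurse into the subtrees
                return sum(countLeaves(branch)
                           for branch in node.branches)
            return 1        # a leaf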
118
119==============
120Classification
121==============
122
123.. class:: TreeClassifier
124
125    Classifies examples according to a tree stored in :obj:`tree`.
126
127    .. attribute:: tree
128
129        The root of the tree, represented as a :class:`Node`.
130   
131    Classification would be straightforward if there were no unknown
132    values or, in general, examples that cannot be placed into a
133    single branch. The response in such cases is determined by a
134    component :obj:`descender`.
135
    :obj:`Descender` is an abstract object which is given an example
    and whose basic job is to descend as far down the tree as
    possible, according to the values of the example's attributes.
    The :obj:`Descender` calls the node's :obj:`branchSelector` to
    get the branch index. If it's a simple index, the corresponding
    branch is followed. If not, it's up to the descender to decide
    what to do, and that's where descenders differ. A descender can
    choose a single branch (for instance, the one most strongly
    recommended by the :obj:`branchSelector`) or it can let the
    branches vote.
145
146    In general there are three possible outcomes of a descent.
147
    #. The descender reaches a leaf. This happens when nothing went
       wrong (there are no unknown or out-of-range values in the
       example) or when things went wrong, but the descender smoothed
       them over by selecting a single branch and continuing the
       descent. In this case, the descender returns the reached
       :obj:`Node`.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` decided to stop the descent at this
       (internal) node. Again, the descender returns the current
       :obj:`Node` and nothing else.
    #. :obj:`branchSelector` returned a distribution and the
       :obj:`Descender` wants to split the example (i.e., to decide
       the class by voting). It returns a :obj:`Node` and the
       vote-weights for the branches. The weights can correspond to
       the distribution returned by :obj:`branchSelector`, to the
       number of learning examples that were assigned to each
       branch, or to something else.
165
    :obj:`TreeClassifier` uses the descender to descend from the
    root. If the descender returns only a :obj:`Node` and no
    distribution, the descent stops; it does not matter whether it
    reached a leaf (the first case above) or an internal node (the
    second case). The node's :obj:`nodeClassifier` is used to decide
    the class. If the descender returns a :obj:`Node` and a
    distribution, the :obj:`TreeClassifier` recursively calls itself
    for each of the subtrees and the predictions are weighted as
    requested by the descender.
174
175    When voting, subtrees do not predict the class but probabilities
176    of classes. The predictions are multiplied by weights, summed and
177    the most probable class is returned.
178
179    .. method:: vote()
180
181    .. method:: classDistribution()
182
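    As a toy numeric illustration of this voting step (plain Python
    with made-up numbers, not Orange code)::

        # class probabilities predicted by two subtrees
        p_left = [0.8, 0.2]
        p_right = [0.1, 0.9]
        weights = [0.75, 0.25]    # vote weights from the descender

        combined = [weights[0] * l + weights[1] * r
                    for l, r in zip(p_left, p_right)]
        # combined == [0.625, 0.375]; the first class wins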
183
184The rest of this section is only for those interested in the C++ code.
185======================================================================
186
If you'd like to understand how the classification works in C++,
start reading at :obj:`TTreeClassifier::vote`. It gets a
:obj:`Node`, an :obj:`Orange.data.Instance` and a distribution of
vote weights. For each node, it calls
:obj:`TTreeClassifier::classDistribution` and then multiplies
and sums the distributions. :obj:`vote` returns a normalized
distribution of predictions.
194
A new overload of :obj:`TTreeClassifier::classDistribution` gets
an additional parameter, a :obj:`Node`. This is done
for the sake of recursion. The normal version of
:obj:`classDistribution` simply calls the overloaded version with
the tree root as the additional parameter. :obj:`classDistribution`
uses :obj:`descender`. If the descender reaches a leaf, it calls
:obj:`nodeClassifier`; otherwise it calls :obj:`vote`.
202
Thus, the :obj:`TreeClassifier`'s :obj:`vote` and
:obj:`classDistribution` are written in the form of a double
recursion. The recursive calls do not happen at each node of the
tree but only at nodes where a vote is needed (that is, at nodes
where the descender halts).
208
For predicting a class, :obj:`operator()` calls the
descender. If it reaches a leaf, the class is predicted by the
leaf's :obj:`nodeClassifier`. Otherwise, it calls
:obj:`vote`. From there on, :obj:`vote` and
:obj:`classDistribution` interweave down the tree and return
a distribution of predictions. :obj:`operator()` then simply
chooses the most probable class.
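
The same logic, rendered as a hedged Python sketch (simplified:
class distributions are treated as plain lists, and the helper
names are ours, not Orange's)::

    import Orange

    def class_distribution(node, example, descender):
        # descend as far as possible; the descender returns the
        # reached node and, when a vote is needed, the vote weights
        node, weights = descender(node, example)
        if weights is None:
            return node.nodeClassifier(example,
                Orange.classification.Classifier.GetProbabilities)
        return vote(node, example, weights, descender)

    def vote(node, example, weights, descender):
        # weighted, normalized sum of the subtrees' distributions
        dists = [(w, class_distribution(branch, example, descender))
                 for branch, w in zip(node.branches, weights) if branch]
        n = len(dists[0][1])
        combined = [sum(w * d[i] for w, d in dists) for i in range(n)]
        total = sum(combined)
        return [c / total for c in combined]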
216
217========
218Learning
219========
220
221The main learning object is :obj:`TreeLearnerBase`. It is basically
222a skeleton into which the user must plug the components for particular
223functions. For easier use, defaults are provided.
224
Components that govern the structure of the tree are :obj:`split`
(of type :obj:`SplitConstructor`), :obj:`stop` (of
type :obj:`StopCriteria`) and :obj:`exampleSplitter`
(of type :obj:`ExampleSplitter`).
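
For orientation, a minimal usage sketch (it uses the `lenses.tab`
dataset, as the Examples section below does)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearnerBase()
    treeClassifier = learner(data)    # defaults fill all the slots
    print treeClassifier(data[0])     # classify the first example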
229
230.. class:: SplitConstructor
231
    Finds a suitable criterion for dividing the learning (and later
    testing) examples coming to the node. The data it gets is a set
    of examples (and, optionally, an ID of the weight
    meta-attribute), a domain contingency computed from the
    examples, apriori class probabilities, a list of candidate
    attributes it should consider and a node classifier (if it was
    constructed, that is, if :obj:`storeNodeClassifier` is left
    true).
239
    The :obj:`SplitConstructor` should use the domain contingency
    when possible. The reasons are two-fold: it is faster, and the
    contingency matrices are not necessarily constructed by simply
    counting the examples (why and how is explained later). There
    are, however, cases when the domain contingency does not
    suffice; for example, when ReliefF is used as a measure of
    attribute quality. In this case, there's no other way but to use
    the examples and ignore the precomputed contingencies.
248
249    The split constructor should consider only the attributes in the
250    candidate list (the list is a vector of booleans, one for each
251    attribute).
252
253    :obj:`SplitConstructor` returns most of the data we talked
254    about when describing the :obj:`Node`. It returns a classifier
255    to be used as :obj:`Node`'s :obj:`branchSelector`, a list of
256    branch descriptions and a list with the number of examples that
257    go into each branch. Just what we need for the :obj:`Node`.
    It can return an empty list for the number of examples in
    branches; in this case, the :obj:`TreeLearnerBase` will find the
    numbers itself after splitting the example set into subsets.
    However, if a split constructor can provide the numbers at no
    extra computational cost, it should do so.
263
    In addition, it returns the quality of the split: a number
    without any fixed meaning except that higher numbers mean better
266
267    If the constructed splitting criterion uses an attribute in such
268    a way that the attribute is 'completely spent' and should not be
269    considered as a split criterion in any of the subtrees (the
270    typical case of this are discrete attributes that are used
271    as-they-are, that is, without any binarization or subsetting),
272    then it should report the index of this attribute. Some splits
273    do not spend any attribute; this is indicated by returning a
274    negative index.
275
    A :obj:`SplitConstructor` can veto further tree induction by
    returning no classifier. This can happen for many reasons.
    A general one is related to the number of examples in the
    branches. :obj:`SplitConstructor` has a field :obj:`minSubset`,
    which sets the minimal number of examples in a branch; null
    nodes, however, are allowed. If there is no split where this
    condition is met, the :obj:`SplitConstructor` stops the
    induction.
283
284    .. attribute:: minSubset
285
286        Sets the minimal number of examples in non-null leaves. As
287        always in Orange (where not specified otherwise), "number of
288        examples" refers to the weighted number of examples.
289   
290    .. method:: __call__(examples, [weightID=0, apriori_distribution, candidates])
291
        Construct a split. Returns a tuple (:obj:`branchSelector`,
        :obj:`branchDescriptions`, :obj:`subsetSizes`, :obj:`quality`,
        :obj:`spentAttribute`). :obj:`spentAttribute` is -1 if no
        attribute is completely spent by the split criterion. If no
        split is constructed, the :obj:`branchSelector`,
        :obj:`branchDescriptions` and :obj:`subsetSizes` are None,
        while :obj:`quality` is 0.0 and :obj:`spentAttribute` is -1.
299
        :param examples: Examples can be given in any acceptable form
            (an :obj:`ExampleGenerator`, such as :obj:`ExampleTable`,
            or a list of examples).
        :param weightID: Optional; the default of 0 means that all
            examples have a weight of 1.0.
        :param apriori_distribution: Should be of type
            :obj:`orange.Distribution`.
        :param candidates: A Python list of objects which are
            interpreted as booleans.
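
    A hedged sketch of calling a split constructor directly (class
    and attribute names as documented on this page; the exact set of
    optional arguments accepted may differ)::

        import Orange

        data = Orange.data.Table("lenses")
        sc = Orange.classification.tree.SplitConstructor_Attribute()
        sc.measure = Orange.feature.scoring.GainRatio()

        (selector, descriptions, sizes,
         quality, spent) = sc(data)
        print descriptions, quality, spent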
308
309
310.. class:: StopCriteria
311
312    Given a set of examples, weight ID and contingency matrices, decide
313    whether to continue the induction or not. The basic criterion checks
314    whether there are any examples and whether they belong to at least
315    two different classes (if the class is discrete). Derived components
316    check things like the number of examples and the proportion of
317    majority classes.
318
319    As opposed to :obj:`SplitConstructor` and similar basic classes,
320    :obj:`StopCriteria` is not an abstract but a fully functional
321    class that provides the basic stopping criteria. That is, the tree
322    induction stops when there is at most one example left; in this case,
323    it is not the weighted but the actual number of examples that counts.
324    Besides that, the induction stops when all examples are in the same
325    class (for discrete problems) or have the same value of the outcome
326    (for regression problems).
327
    .. method:: __call__(examples[, weightID, contingencies])

        Decides whether to stop (true) or continue (false) the
        induction. If contingencies are given, they are used for
        checking whether the examples are in the same class (but not
        for counting the examples); :obj:`StopCriteria` works
        without them as well. Derived classes should use the
        contingencies whenever they are available, and otherwise
        compute them only when they really need them.
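
    Since the stopping criterion can be replaced by a Python callable
    (see the note on callbacks in the Classes section below), here is
    a hedged sketch of a custom criterion, assuming the callback
    receives the arguments listed above::

        import Orange

        defStop = Orange.classification.tree.StopCriteria()

        def stopWhenFew(examples, weightID, contingency):
            # stop on the basic criterion, or when fewer than ten
            # (unweighted) examples remain
            return (defStop(examples, weightID, contingency)
                    or len(examples) < 10)

        learner = Orange.classification.tree.TreeLearnerBase()
        learner.stop = stopWhenFew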
338
339
340.. class:: ExampleSplitter
341
342    Just like the :obj:`Descender` decides the branch for an
343    example during classification, the :obj:`ExampleSplitter`
344    sorts the learning examples into branches.
345
    :obj:`ExampleSplitter` is given a :obj:`Node` (from which it can
    use various data, though most splitters use only the
    :obj:`branchSelector`), a set of examples to be divided, and
    the weight ID. The result is a list of subsets of examples
    and, optionally, a list of new weight IDs.
351
352    Subsets are usually stored as :obj:`ExamplePointerTable`'s. Most
353    of :obj:`ExampleSplitters` simply call the node's
354    :obj:`branchSelector` and assign examples to corresponding
355    branches. When the value is unknown they choose a particular
356    branch or simply skip the example.
357
    Some enhanced splitters can split examples: an example (actually,
    a pointer to it) is copied to more than one subset. To facilitate
    real splitting, weights are needed. Each branch is assigned a
    weight ID (each would usually have its own ID) and all examples
    that are in that branch (either completely or partially) have
    this meta attribute. If an example hasn't been split, it has
    only one additional attribute, with the weight ID corresponding
    to the subset to which it went. An example that is split among,
    say, three subsets has three new meta attributes, one for each
    subset. The IDs of the weight meta attributes are returned by
    the :obj:`ExampleSplitter` to be used at induction of the
    corresponding subtrees.

    Note that weights are used only when needed. When no splitting
    occurred - because the splitter is not able to do it or because
    there was no need for splitting - no weight IDs are returned.
374
    To summarize: :obj:`ExampleSplitter` is an abstract base class
    for objects that split sets of examples into subsets. The
    derived classes differ in the treatment of examples which cannot
    be unambiguously placed into a single branch (usually due to an
    unknown value of the crucial attribute).
379
380    .. method:: __call__(node, examples[, weightID])
381       
        Use the information in :obj:`node` (particularly the
        :obj:`branchSelector`) to split the given set of examples
        into subsets. Return a tuple with a list of example
        generators and a list of weights. The list of weights is
        either an ordinary Python list of integers or None when no
        splitting of examples occurs and thus no weights are needed.
388
389.. class:: TreeLearnerBase
390
391    TreeLearnerBase has a number of components.
392
393    .. attribute:: split
394
        Object of type :obj:`SplitConstructor`. The default value,
        provided by :obj:`TreeLearnerBase`, is
        :obj:`SplitConstructor_Combined` with separate constructors
        for discrete and continuous attributes. Discrete attributes
        are used as they are, while continuous attributes are
        binarized. Gain ratio is used to select attributes. A
        minimum of two examples in a leaf is required for discrete
        attributes and five for continuous attributes.
402   
403    .. attribute:: stop
404
405        Object of type :obj:`StopCriteria`. The default stopping
406        criterion stops induction when all examples in a node belong
407        to the same class.
408
409    .. attribute:: splitter
410
411        Object of type :obj:`ExampleSplitter`. The default splitter
412        is :obj:`ExampleSplitter_UnknownsAsSelector` that splits
413        the learning examples according to distributions given by the
414        selector.
415
416    .. attribute:: contingencyComputer
417   
418        By default, this slot is left empty and ordinary contingency
419        matrices are computed for examples at each node. If need arises,
420        one can change the way the matrices are computed. This can be
421        used to change the way that unknown values are treated when
422        assessing qualities of attributes. As mentioned earlier,
423        the computed matrices can be used by split constructor and by
424        stopping criteria. On the other hand, they can be (and are)
425        ignored by some splitting constructors.
426
427    .. attribute:: nodeLearner
428
429        Induces a classifier from examples belonging to a node. The
430        same learner is used for internal nodes and for leaves. The
431        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
432
433    .. attribute:: descender
434
        Descending component that the induced :obj:`TreeClassifier`
        will use. The default descender is
        :obj:`Descender_UnknownMergeAsSelector`, which votes using
        the :obj:`branchSelector`'s distribution for vote weights.
439
440    .. attribute:: maxDepth
441
        Gives the maximal tree depth; 0 means that only the root is
        generated. The default is 100, to prevent infinite tree
        induction due to a misconfigured stopping criterion. If you
        are sure you need larger trees, increase it; if you want to
        lower this hard limit, you can do that as well.
447
448    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
449
        Decide whether to store class distributions, contingencies
        and examples in :obj:`Node`, and whether the
        :obj:`nodeClassifier` should be built for internal nodes.
        By default, distributions and node classifiers are stored,
        while contingencies and examples are not. You won't save any
        memory by storing contingencies but not distributions, since
        the distribution actually points to the same object that is
        stored in :obj:`contingency.classes`.
458
459    The :obj:`TreeLearnerBase` first sets the defaults for missing
460    components. Although stored in the actual :obj:`TreeLearnerBase`'s
461    fields, they are removed when the induction is finished.
462
    Then it ensures that the examples are stored in a table. This is
    needed because the algorithm juggles with pointers to examples.
    If examples are in a file or are fed through a filter, they are
    copied to a table. Even if they are already in a table, they are
    copied if :obj:`storeExamples` is set, to ensure that the
    pointers keep pointing to the right examples even if the user
    later changes the example table. If they are in a table and the
    :obj:`storeExamples` flag is clear, they are used as they are.
    This will obviously crash in a multi-threaded system if the
    table is changed during tree induction. Well... don't do it.
473
474    Apriori class probabilities are computed. At this point we check
475    the sum of example weights; if it's zero, there are no examples and
476    we cannot proceed. A list of candidate attributes is set; in the
477    beginning, all attributes are candidates for the split criterion.
478
    Now comes the recursive part of the :obj:`TreeLearnerBase`. Its
    arguments are a set of examples, a weight meta-attribute ID (a
    tricky thing: it can always be the same as the original or can
    change to accommodate the splitting of examples among branches),
    the apriori class distribution and a list of candidates
    (represented as a vector of Boolean values).
485
486    The contingency matrix is computed next. This happens
487    even if the flag :obj:`storeContingencies` is false.
488    If the :obj:`contingencyComputer` is given we use it,
489    otherwise we construct just an ordinary contingency matrix.
490
    :obj:`stop` is called to see whether it is worth continuing. If
    not, a :obj:`nodeClassifier` is built and the :obj:`Node` is
    returned. Otherwise, a :obj:`nodeClassifier` is built only if
    the :obj:`storeNodeClassifier` flag is set.
495
496    To get a :obj:`Node`'s :obj:`nodeClassifier`, the
497    :obj:`nodeLearner`'s :obj:`smartLearn` function is called with
498    the given examples, weight ID and the just computed matrix. If
    the learner can use the matrix (and the default,
    :obj:`Orange.classification.majority.MajorityLearner`, can), it
    won't touch the examples. Thus, the choice of
    :obj:`contingencyComputer` will, in many cases, affect the
    :obj:`nodeClassifier`. The :obj:`nodeLearner` can return no
    classifier; if so, and if the classifier would be needed for
    classification, the :obj:`TreeClassifier`'s function returns DK
    ("don't know") or an empty distribution. If you're writing your
    own tree classifier, pay attention to this.
507
508    If the induction is to continue, a :obj:`split` component is called.
509    If it fails to return a branch selector, induction stops and the
510    :obj:`Node` is returned.
511
    :obj:`TreeLearnerBase` then uses the :obj:`ExampleSplitter` to
    divide the examples as described above.
514
    The contingency is removed at this point if it is not to be
    stored. Thus, the :obj:`split`, :obj:`stop` and
    :obj:`exampleSplitter` can use the contingency matrices if they
    wish.
518
    The :obj:`TreeLearnerBase` then recursively calls itself for
    each of the non-empty subsets. If the splitter returned a list
    of weights, the corresponding weight is used for each branch.
    Besides that, the attribute spent by the splitter (if any) is
    removed from the list of candidates for the subtree.
524
525    A subset of examples is stored in its corresponding tree node,
526    if so requested. If not, the new weight attributes are removed (if
527    any were created).
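
To tie the learner's flags together, a short hedged sketch
(attribute names as documented above)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearnerBase()
    learner.maxDepth = 3            # hard limit on the tree depth
    learner.storeExamples = 1       # keep example subsets in the nodes
    learner.storeContingencies = 1
    treeClassifier = learner(data)
    print treeClassifier.tree.examples    # the root's "master table"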
528
=======
Pruning
=======
531
532Tree pruners derived from :obj:`Pruner` can be given either a
533:obj:`Node` (presumably, but not necessarily a root) or a
534:obj:`TreeClassifier`. The result is a new, pruned :obj:`Node`
535or a new :obj:`TreeClassifier` with a pruned tree. The original
536tree remains intact.
537
Note, however, that pruners construct only a shallow copy of the
tree. The pruned tree's nodes contain references to the same
contingency matrices, node classifiers, branch selectors, ...
as the original tree. You may therefore modify the pruned tree's
structure (manually cut it, add new nodes, replace components), but
modifying, for instance, some node's :obj:`nodeClassifier` (the
classifier object itself, not the reference to it!) would also
modify the corresponding node of the original tree.
547
Speaking of node classifiers: pruners cannot construct a
:obj:`nodeClassifier`, nor merge the node classifiers of the pruned
subtrees into classifiers for the new leaves. Thus, if you want to
build a prunable tree, the internal nodes must have their
:obj:`nodeClassifier` defined. Fortunately, all you need to do is
nothing: if you leave the :obj:`TreeLearnerBase`'s flags at their
defaults, the node classifiers are created.
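
A hedged usage sketch (:obj:`Pruner_m` and its parameter :obj:`m`
are documented in the Pruner section below)::

    import Orange

    data = Orange.data.Table("lenses")
    treeClassifier = Orange.classification.tree.TreeLearnerBase()(data)

    pruner = Orange.classification.tree.Pruner_m()
    pruner.m = 2.0
    prunedClassifier = pruner(treeClassifier)    # a new TreeClassifier
    prunedRoot = pruner(treeClassifier.tree)     # ... or a new Node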
555
556=======
557Classes
558=======
559
Several classes described above are already functional and can
(and mostly will) be used as they are: :obj:`Node`,
:obj:`TreeLearnerBase` and :obj:`TreeClassifier`. This section
describes the other classes.
564
The classes :obj:`SplitConstructor`, :obj:`StopCriteria`,
:obj:`ExampleSplitter`, :obj:`Descender`, :obj:`orange.Learner`
and :obj:`Classifier` are among the Orange classes that can be
subtyped in Python and have the call operator overloaded in such a
way that it is called back from C++ code. You can thus program your
own components for :obj:`TreeLearnerBase` and :obj:`TreeClassifier`.
Detailed information on how this is done, and what can go wrong, is
given in a separate page dedicated to callbacks to Python XXXXXXXXXX.
573
574SplitConstructors
575=====================
576
577Split construction is almost as exciting as waiting for a delayed flight.
578Boring, that is. Split constructors are programmed as spaghetti code
579that juggles with contingency matrices, with separate cases for discrete
580and continuous classes... Most split constructors work either for
581discrete or for continuous attributes. The suggested practice is
582to use a :obj:`SplitConstructor_Combined` that can handle
583both by simply delegating attributes to specialized split constructors.
584
Note: split constructors that cannot handle attributes of a
particular type (discrete, continuous) do not report an error or a
warning, but simply skip the attribute. It is your responsibility to
use a split constructor appropriate for your dataset. (May we again
suggest using :obj:`SplitConstructor_Combined`?)
590
The same components can be used for inducing both classification and
regression trees. The only component that needs to be chosen
accordingly is the 'measure' attribute for the
:obj:`SplitConstructor_Measure` class (and derived classes).
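
A hedged wiring sketch (component and class names as documented in
this section)::

    import Orange

    split = Orange.classification.tree.SplitConstructor_Combined()
    split.discreteSplitConstructor = \
        Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
    split.continuousSplitConstructor = \
        Orange.classification.tree.SplitConstructor_Threshold()
    # both delegates score candidate splits with a scoring measure
    split.discreteSplitConstructor.measure = \
        Orange.feature.scoring.GainRatio()
    split.continuousSplitConstructor.measure = \
        Orange.feature.scoring.GainRatio()

    learner = Orange.classification.tree.TreeLearnerBase()
    learner.split = split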
595
596.. class:: SplitConstructor_Measure
597
598    Bases: :class:`SplitConstructor`
599
    An abstract base class for split constructors that employ a
    :obj:`Orange.feature.scoring.Measure` to assess the quality of a
    split. At present, all split constructors except for
    :obj:`SplitConstructor_Combined` are derived from this class.
604
605    .. attribute:: measure
606
        A component of type :obj:`Orange.feature.scoring.Measure`
        used for the evaluation of a split. Note that you must
        select a subclass of :obj:`Orange.feature.scoring.Measure`
        capable of handling your class type: you cannot use
        :obj:`Orange.feature.scoring.GainRatio` for building
        regression trees, nor :obj:`Orange.feature.scoring.MSE` for
        classification trees.
613
614    .. attribute:: worstAcceptable
615
        The lowest required split quality for a split to be
        acceptable. Note that this value makes sense only in
        connection with a :obj:`measure` component. Default is 0.0.
619
620.. class:: SplitConstructor_Attribute
621
622    Bases: :class:`SplitConstructor_Measure`
623
    Attempts to use a discrete attribute as a split; each value of
    the attribute corresponds to a branch in the tree. Attributes
    are evaluated with the :obj:`measure` and the one with the
    highest score is used for the split. If more than one attribute
    shares the highest score, one of them is selected at random.
629
    The constructed :obj:`branchSelector` is an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. If the attribute is
    :obj:`Orange.data.variable.Discrete`, the
    :obj:`branchDescriptions` are the attribute's values. The
    attribute is marked as spent, so that it cannot reappear in the
    node's subtrees.
636
637.. class:: SplitConstructor_ExhaustiveBinary
638
639    Bases: :class:`SplitConstructor_Measure`
640
    Works on discrete attributes. For each attribute, it determines
    which binarization gives the split with the highest score. If
    more than one split has the highest score, one of them is
    selected at random. After trying all the attributes, it returns
    one of those with the highest score.
646
    The constructed :obj:`branchSelector` is again an instance of
    :obj:`orange.ClassifierFromVarFD` that returns a value of the
    selected attribute. This time, however, its :obj:`transformer`
    contains an instance of :obj:`MapIntValue` that maps the values
    of the attribute onto a binary attribute. Branch descriptions
    are of the form "[<val1>, <val2>, ...<valn>]" for branches that
    correspond to more than one value of the attribute. Branches
    that correspond to a single value of the attribute are described
    with that value. If the attribute was originally binary, it is
    spent and cannot be used in the node's subtrees. Otherwise, it
    can reappear in the subtrees.
658
659
660.. class:: SplitConstructor_Threshold
661
662    Bases: :class:`SplitConstructor_Measure`
663
    This is currently the only constructor for splits on continuous
    attributes. It divides the range of attribute values with a
    threshold that maximizes the split's quality. As always, if more
    than one split has the highest score, a random threshold is
    selected. The attribute that yields the best binary split is
    returned.
669
670    The constructed :obj:`branchSelector` is again an instance of
671    :obj:`orange.ClassifierFromVarFD` with an attached
672    :obj:`transformer`. This time, :obj:`transformer` is of type
673    :obj:`orange.ThresholdDiscretizer`. The branch descriptions are
674    "<threshold" and ">=threshold". The attribute is not spent.
675
676.. class:: SplitConstructor_OneAgainstOthers
677   
678    Bases: :class:`SplitConstructor_Measure`
679
680    Undocumented.
681
682.. class:: SplitConstructor_Combined
683
684    Bases: :class:`SplitConstructor`
685
686    This constructor delegates the task of finding the optimal split
687    to separate split constructors for discrete and for continuous
688    attributes. Each split constructor is called, given only attributes
689    of appropriate types as candidates. Both construct a candidate for
690    a split; the better of them is selected.
691
    (Note that there is a problem when more candidates have the same
    score. Suppose there are nine discrete attributes with the
    highest score; the split constructor for discrete attributes
    will select one of them. Now suppose that there is also a single
    continuous attribute with the same score.
    :obj:`SplitConstructor_Combined` would randomly select between
    the proposed discrete attribute and the continuous attribute,
    unaware of the fact that the discrete attribute has already
    competed with eight other discrete attributes. So, the
    probability of selecting (each) discrete attribute would be 1/18
    instead of 1/10. Although not really correct, we doubt that this
    would affect the tree's performance; many other machine learning
    systems simply choose the first attribute with the highest score
    anyway.)
705
706    The :obj:`branchSelector`, :obj:`branchDescriptions` and whether
707    the attribute is spent is decided by the winning split constructor.
708
    .. attribute:: discreteSplitConstructor

        Split constructor for discrete attributes; can be, for
        instance, :obj:`SplitConstructor_Attribute` or
        :obj:`SplitConstructor_ExhaustiveBinary`.

    .. attribute:: continuousSplitConstructor

        Split constructor for continuous attributes; at the moment,
        it can be either :obj:`SplitConstructor_Threshold` or a
        split constructor you programmed in Python.
726
727
728StopCriteria and StopCriteria_common
729============================================
730
:obj:`StopCriteria` determines when to stop the induction of
subtrees, as described in detail in the description of the learning
process. XXXXXXXXXX
733
734.. class:: StopCriteria_common
735
    :obj:`StopCriteria_common` contains additional criteria for
    pre-pruning: it checks the proportion of the majority class and
    the number of weighted examples.
739
    .. attribute:: maxMajority
741
742        Maximal proportion of majority class. When this is exceeded,
743        induction stops.
744
745    .. attribute:: minExamples
746
        Minimal number of examples for a node to be split further;
        subsets with fewer than :obj:`minExamples` examples are not
        split. The example count is weighted.
750
751.. class:: StopCriteria_Python
752
753    Undocumented.
754
755Classes derived from ExampleSplitter
756========================================
757
:obj:`ExampleSplitter` is the third crucial component of
:obj:`TreeLearnerBase`. Its function is described in the
description of the learning process. XXXXXXXXXX
761
762.. class:: ExampleSplitter_IgnoreUnknowns
763
764    Bases: :class:`ExampleSplitter`
765
766    Simply ignores the examples for which no single branch can be determined.
767
768.. class:: ExampleSplitter_UnknownsToCommon
769
770    Bases: :class:`ExampleSplitter`
771
    Places all such examples into the branch with the highest number
    of examples. If there is more than one such branch, one is
    selected at random and then used for all examples.
775
776.. class:: ExampleSplitter_UnknownsToAll
777
778    Bases: :class:`ExampleSplitter`
779
780    Places examples with unknown value of the attribute into all branches.
781
782.. class:: ExampleSplitter_UnknownsToRandom
783
784    Bases: :class:`ExampleSplitter`
785
786    Selects a random branch for such examples.
787
788.. class:: ExampleSplitter_UnknownsToBranch
789
790    Bases: :class:`ExampleSplitter`
791
792    Constructs an additional branch to contain all such examples.
793    The branch's description is "unknown".
794
795.. class:: ExampleSplitter_UnknownsAsBranchSizes
796
797    Bases: :class:`ExampleSplitter`
798
799    Splits examples with unknown value of the attribute according to
800    proportions of examples in each branch.
801
802.. class:: ExampleSplitter_UnknownsAsSelector
803
804    Bases: :class:`ExampleSplitter`
805
806    Splits examples with unknown value of the attribute according to
807    distribution proposed by selector (which is in most cases the same
808    as proportions of examples in branches).
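
For instance, a hedged sketch of routing unknown values into their
own branch (pair it with :obj:`Descender_UnknownsToBranch`,
described in the next section, when classifying)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearnerBase()
    learner.splitter = \
        Orange.classification.tree.ExampleSplitter_UnknownsToBranch()
    treeClassifier = learner(data)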
809
810Descender and derived classes
811=================================
812
This is a classifier's counterpart to the :obj:`ExampleSplitter`. It
decides the destiny of examples that need to be classified and
cannot be unambiguously put into a single branch. The detailed
function of this class is given in the description of classification
with trees. XXXXXX
817
818.. class:: Descender
819
820    An abstract base object for tree descenders.
821
822    .. method:: __call__(node, example)
823
        Descends down the tree until it reaches a leaf or a node in
        which a vote of the subtrees is required. In both cases, a
        tuple of two elements is returned; in the former case it
        contains the reached node and None, in the latter it
        contains a node and the weights of the votes for the
        subtrees (a list of floats).

        :obj:`Descender`'s that never split examples always descend
        to a leaf, but they differ in the treatment of examples with
        unknown values (or, in general, examples for which a branch
        cannot be determined at some node(s) of the tree).
        :obj:`Descender`'s that do split examples differ in the
        returned vote weights.
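
A hedged sketch of choosing a descender for the induced classifier
(:obj:`Descender_MergeAsBranchSizes` is described below)::

    import Orange

    data = Orange.data.Table("lenses")
    learner = Orange.classification.tree.TreeLearnerBase()
    learner.descender = \
        Orange.classification.tree.Descender_MergeAsBranchSizes()
    treeClassifier = learner(data)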
836
837.. class:: Descender_UnknownsToNode
838
839    Bases: :obj:`Descender`
840
    When an example cannot be classified into a single branch, the
    current node is returned. Thus, the node's :obj:`nodeClassifier`
    will be used to make a decision. It is your responsibility to
    make sure that even the internal nodes have their
    :obj:`nodeClassifier` (i.e., don't disable creating node
    classifiers, and don't remove them manually after the induction).
847
848.. class:: Descender_UnknownsToBranch
849
850    Bases: :obj:`Descender`
851
852    Classifies examples with unknown value to a special branch. This
853    makes sense only if the tree itself was constructed with
854    :obj:`ExampleSplitter_UnknownsToBranch`.
855
856.. class:: Descender_UnknownsToCommonBranch
857
858    Bases: :obj:`Descender`
859
    Classifies examples with unknown values into the branch with the
    highest number of examples. If there is more than one such
    branch, a random branch is chosen for each example that is to be
    classified.
863
864.. class:: Descender_UnknownsToCommonSelector
865
866    Bases: :obj:`Descender`
867
    Classifies examples with unknown values into the branch that
    received the highest recommendation from the selector.
870
871.. class:: Descender_MergeAsBranchSizes
872
873    Bases: :obj:`Descender`
874
875    Makes the subtrees vote for the example's class; the vote is
876    weighted according to the sizes of the branches.
877
878.. class:: Descender_MergeAsSelector
879
880    Bases: :obj:`Descender`
881
    Makes the subtrees vote for the example's class; the vote is
    weighted according to the selector's proposal.
884
885Pruner and derived classes
886==============================
887
888.. index::
889    pair: classification trees; pruning
890
Classes derived from :obj:`Pruner` prune the trees as described in
the section on pruning XXXXXXXX - make sure you read it to
understand what the pruners will do to your trees.
894
895.. class:: Pruner
896
897    This is an abstract base class which defines nothing useful, only
898    a pure virtual call operator.
899
900    .. method:: __call__(tree)
901
902        Prunes a tree. The argument can be either a tree classifier or
903        a tree node; the result is of the same type as the argument.
904
905.. class:: Pruner_SameMajority
906
907    Bases: :class:`Pruner`
908
    In Orange, a tree can have non-trivial subtrees (i.e. subtrees
    with more than one leaf) in which all the leaves have the same
    majority class. (This is allowed because those leaves can still
    have different distributions of classes and thus predict
    different probabilities.) However, this can be undesired when
    we're only interested in the class prediction or in a simple
    tree interpretation. The :obj:`Pruner_SameMajority` prunes the
    tree so that there is no subtree in which all the nodes have
    the same majority class.
917
    This pruner will only prune the nodes in which the node
    classifier is of class :obj:`orange.DefaultClassifier` (or a
    derived class).
920
921    Note that the leaves with more than one majority class require some
922    special handling. The pruning goes backwards, from leaves to the root.
923    When siblings are compared, the algorithm checks whether they
924    have (at least one) common majority class. If so, they can be pruned.
925
926.. class:: Pruner_m
927
928    Bases: :class:`Pruner`
929
930    Prunes a tree by comparing m-estimates of static and dynamic
931    error as defined in (Bratko, 2002).
932
933    .. attribute:: m
934
935        Parameter m for m-estimation.
936
937========
938Examples
939========
940
941This page does not provide examples for programming your own components,
942such as, for instance, a :obj:`SplitConstructor`. Those examples
943can be found on a page dedicated to callbacks to Python XXXXXXXX.
944
945Tree Structure
946==============
947
To have something to work on, we'll take the data from the lenses
dataset and build a tree using the default components (part of
`treestructure.py`_, uses `lenses.tab`_):
950
951.. literalinclude:: code/treestructure.py
952   :lines: 7-10
953
954How big is our tree (part of `treestructure.py`_, uses `lenses.tab`_)?
955
956.. _lenses.tab: code/lenses.tab
957.. _treestructure.py: code/treestructure.py
958
959.. literalinclude:: code/treestructure.py
960   :lines: 12-21
961
If the node is None, we have a null node; null nodes don't count,
so we return 0. Otherwise, the size is 1 (this node) plus the sizes
of all subtrees. The node is an internal node if it has a
:obj:`branchSelector`; if there's no selector, it's a leaf. Don't
attempt to skip the if statement: leaves don't have an empty list
of branches, they don't have a list of branches at all.
968
969    >>> treeSize(treeClassifier.tree)
970    10
971
Don't forget that this was only an exercise - :obj:`Node` has a
built-in method :obj:`Node.treeSize` that does exactly the same.
974
975Let us now write a simple script that prints out a tree. The recursive
976part of the function will get a node and its level (part of `treestructure.py`_, uses `lenses.tab`_).
977
978.. literalinclude:: code/treestructure.py
979   :lines: 26-41
980
Don't waste time studying the formatting tricks (\n's etc.); this is
just for nicer output. What matters is everything but the print
statements. First, we check whether the node is a null node (a node
to which no learning examples were classified). If so, we just print
out "<null node>" and return.
986
After handling null nodes, what remains are internal nodes and
leaves. For internal nodes, we print a node description consisting
of the attribute's name and the distribution of classes. The
:obj:`Node`'s :obj:`branchSelector` is, for all currently defined
splits, an instance of a class derived from :obj:`orange.Classifier`
(in fact, it is an :obj:`orange.ClassifierFromVarFD`, but an
:obj:`orange.Classifier` would suffice), and its :obj:`classVar`
XXXXX points to the attribute we seek, so we print its name. We will
also assume that storing class distributions has not been disabled,
and print them as well. A more able tree-printing function (like the
one defined in XXXXXXXXXX) has an alternative means to get the
distribution when this fails. Then we iterate through the branches;
for each, we print the branch description and recursively call
:obj:`printTree0` with a level increased by 1 (to increase the
indent).
1001
1002Finally, if the node is a leaf, we print out the distribution of
1003learning examples in the node and the class to which the examples in
1004the node would be classified. We again assume that the :obj:`nodeClassifier`
1005is the default one - a :obj:`DefaultClassifier`. A better print
1006function should be aware of possible alternatives.
1007
1008Now, we just need to write a simple function to call our printTree0.
1009We could write something like...
1010
1011::
1012
1013    def printTree(x):
1014        printTree0(x.tree, 0)
1015
1016... but we won't. Let us learn how to handle arguments of different
1017types. Let's write a function that will accept either a :obj:`TreeClassifier`
1018or a :obj:`Node`; just like Pruners XXXXXX, remember? Part of `treestructure.py`_, uses `lenses.tab`_.
1019
1020.. literalinclude:: code/treestructure.py
1021   :lines: 43-49
1022
It's fairly straightforward: if :obj:`x` is of a type derived from
:obj:`TreeClassifier`, we print :obj:`x.tree`; if it's a
:obj:`Node`, we just call :obj:`printTree0` with :obj:`x`. If it's
of some other type, we don't know how to handle it and thus raise
an exception. (Note that we could also use

::

    if type(x) == Orange.classification.tree.TreeClassifier:

but this would only work if :obj:`x` were of type
:obj:`TreeClassifier` and not of any derived type. The latter,
however, do not exist yet...)
1036
1037    >>> printTree(treeClassifier)
1038    tear_rate (<15.000, 5.000, 4.000>)
1039    : reduced --> none (<12.000, 0.000, 0.000>)
1040    : normal
1041       astigmatic (<3.000, 5.000, 4.000>)
1042       : no
1043          age (<1.000, 5.000, 0.000>)
1044          : young --> soft (<0.000, 2.000, 0.000>)
1045          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
1046          : presbyopic --> none (<1.000, 1.000, 0.000>)
1047       : yes
1048          prescription (<2.000, 0.000, 4.000>)
1049          : myope --> hard (<0.000, 0.000, 3.000>)
1050          : hypermetrope --> none (<2.000, 0.000, 1.000>)
1051
1052For a final exercise, let us write a simple pruning procedure. It will
1053be written entirely in Python, unrelated to any :obj:`Pruner`. Our
1054procedure will limit the tree depth - the maximal depth (here defined
1055as the number of internal nodes on any path down the tree) shall be
1056given as an argument. For example, to get a two-level tree, we would
1057call cutTree(root, 2). The function will be recursive, with the second
1058argument (level) decreasing at each call; when zero, the current node
1059will be made a leaf (part of `treestructure.py`_, uses `lenses.tab`_):
1060
1061.. literalinclude:: code/treestructure.py
1062   :lines: 54-62
1063
1064There's nothing to prune at null-nodes or leaves, so we act only when
1065:obj:`node` and :obj:`node.branchSelector` are defined. If level is
1066not zero, we call the function for each branch. Otherwise, we clear
1067the selector, branches and branch descriptions.
1068
1069    >>> cutTree(tree.tree, 2)
1070    >>> printTree(tree)
1071    tear_rate (<15.000, 5.000, 4.000>)
1072    : reduced --> none (<12.000, 0.000, 0.000>)
1073    : normal
1074       astigmatic (<3.000, 5.000, 4.000>)
1075       : no --> soft (<1.000, 5.000, 0.000>)
1076       : yes --> hard (<2.000, 0.000, 4.000>)
1077
1078Learning
1079========
1080
You've already seen a simple example of using a
:obj:`TreeLearnerBase`. You can just call it and let it fill the
empty slots with the default components. This section will teach
you three things: what the missing components are (and how to set
them yourself), how to use alternative components to get a
different tree and, finally, how to write a skeleton for tree
induction in Python.
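
As a preview of that skeleton, here is a heavily simplified, hedged
sketch of the induction loop described in the Learning section above
(no contingencies, no candidate bookkeeping; whether :obj:`Node` can
be instantiated and filled like this from Python is an assumption of
the sketch)::

    import Orange

    def induce(examples, weightID, split, stop, splitter,
               nodeLearner, depth=0, maxDepth=100):
        node = Orange.classification.tree.Node()
        node.nodeClassifier = nodeLearner(examples, weightID)
        if stop(examples, weightID) or depth == maxDepth:
            return node                     # make a leaf
        (node.branchSelector, node.branchDescriptions,
         node.branchSizes, quality, spent) = split(examples, weightID)
        if not node.branchSelector:
            return node                     # the split was vetoed
        subsets, weights = splitter(node, examples, weightID)
        node.branches = [induce(sub, weightID, split, stop, splitter,
                                nodeLearner, depth + 1, maxDepth)
                         if sub else None
                         for sub in subsets]
        return node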
1087
1088Default components for TreeLearnerBase
1089======================================
1090
1091Let us construct a :obj:`TreeLearnerBase` to play with.
1092
1093.. _treelearner.py: code/treelearner.py
1094
1095`treelearner.py`_, uses `lenses.tab`_:
1096
1097.. literalinclude:: code/treelearner.py
1098   :lines: 7-10
1099
1100There are three crucial components in learning: the split and stop
1101criteria, and the :obj:`exampleSplitter` (there are some others,
1102which become important during classification; we'll talk about them
1103later). They are not defined; if you use the learner, the slots are
1104filled temporarily but later cleared again.
1105
1106::
1107
1108    >>> print learner.split
1109    None
1110    >>> learner(data)
1111    <TreeClassifier instance at 0x01F08760>
1112    >>> print learner.split
1113    None
1114
1115Stopping criteria
1116=================
1117
1118The stop is trivial. The default is set by
1119
::

    >>> learner.stop = Orange.classification.tree.StopCriteria_common()
1122
Well, this is actually done in C++ and it uses a global component
that is constructed once and for all, but apart from that we did
effectively the same thing.
1126
1127We can now examine the default stopping parameters.
1128
1129    >>> print learner.stop.maxMajority, learner.stop.minExamples
1130    1.0 0.0
1131
1132Not very restrictive. This keeps splitting the examples until
1133there's nothing left to split or all the examples are in the same
1134class. Let us set the minimal subset that we allow to be split to
1135five examples and see what comes out.
1136
1137    >>> learner.stop.minExamples = 5.0
1138    >>> tree = learner(data)
1139    >>> printTree(tree)
1140    tear_rate (<15.000, 5.000, 4.000>)
1141    : reduced --> none (<12.000, 0.000, 0.000>)
1142    : normal
1143       astigmatic (<3.000, 5.000, 4.000>)
1144       : no
1145          age (<1.000, 5.000, 0.000>)
1146          : young --> soft (<0.000, 2.000, 0.000>)
1147          : pre-presbyopic --> soft (<0.000, 2.000, 0.000>)
1148          : presbyopic --> soft (<1.000, 1.000, 0.000>)
1149       : yes
1150          prescription (<2.000, 0.000, 4.000>)
1151          : myope --> hard (<0.000, 0.000, 3.000>)
1152          : hypermetrope --> none (<2.000, 0.000, 1.000>)
1153
1154OK, that's better. If we want an even smaller tree, we can also limit
1155the maximal proportion of majority class.
1156
1157    >>> learner.stop.maxMajority = 0.5
    >>> tree = learner(data)
1159    >>> printTree(tree)
1160    --> none (<15.000, 5.000, 4.000>)
1161
1162
1163Undocumented
1164============
1165
1166.. class:: NodeList
1167
1168    Undocumented.
1169
1170.. class:: C45NodeList
1171
1172    Undocumented.
1173
1174===========================
1175C4.5 Classifier and Learner
1176===========================
1177
1178As C4.5 is a standard benchmark in machine learning,
1179it is incorporated in Orange, although Orange has its own
1180implementation of decision trees.
1181
The implementation uses the original Quinlan's code for learning, so
the tree you get is exactly like the one that would be built by the
standalone C4.5. Upon return, however, the original tree is copied
to Orange components that contain exactly the same information plus
what is needed to make them visible from Python. To be sure that the
algorithm behaves just like the original, we use a dedicated class
:class:`C45Node` instead of reusing the components used by Orange's
tree inducer (i.e., :class:`Node`). This, however, could be done and
probably will be done in the future; we shall still retain
:class:`C45Node` but offer a transformation to :class:`Node`, so
that routines that work on Orange trees will also be usable for
C4.5 trees.
1193
:class:`C45Learner` and :class:`C45Classifier` behave like any
other Orange learner and classifier. Unlike most of Orange's
learning algorithms, C4.5 does not accept weighted examples.
1197
1198Building the C4.5 plug-in
1199=========================
1200
1201We haven't been able to obtain the legal rights to distribute
1202C4.5 and therefore couldn't statically link it into Orange. Instead,
1203it's incorporated as a plug-in which you'll need to build yourself.
1204The procedure is trivial, but you'll need a C compiler. On Windows,
1205the scripts we provide work with MS Visual C and the files CL.EXE
1206and LINK.EXE must be on the PATH. On Linux you're equipped with
1207gcc. Mac OS X comes without gcc, but you can download it for
1208free from Apple.
1209
1210Orange must be installed prior to building C4.5. (This is because
1211the build script will copy the created file next to Orange,
1212which it obviously can't if Orange isn't there yet.)
1213
1214#. Download the
1215   `C4.5 (Release 8) sources <http://www.rulequest.com/Personal/c4.5r8.tar.gz>`_
1216   from the `Rule Quest's site <http://www.rulequest.com/>`_ and extract
1217   them into some temporary directory. The files will be modified in the
1218   further process, so don't use your copy of Quinlan's sources that you
1219   need for another purpose.
1220#. Download
1221   `buildC45.zip <http://orange.biolab.si/orange/download/buildC45.zip>`_
1222   and unzip its contents into the directory R8/Src of the Quinlan's
1223   stuff (it's the directory that contains, for instance, the file
1224   average.c).
#. Run buildC45.py, which will build the plug-in and put it next to
   orange.pyd (or orange.so on Linux/Mac).
#. Run python, type :samp:`import Orange` and
   create :samp:`Orange.classification.tree.C45Learner()`.
   If this fails, something went wrong; see the diagnostic messages
   from buildC45.py and read the paragraph below.
1231#. Finally, you can remove the Quinlan's stuff, along with everything
1232   created by buildC45.py.
1233
1234If the process fails, here's what buildC45.py really does: it creates
1235.h files that wrap Quinlan's .i files and ensure that they are not
1236included twice. It modifies C4.5 sources to include .h's instead of
1237.i's. This step can hardly fail. Then follows the platform dependent
1238step which compiles ensemble.c (which includes all the Quinlan's .c
1239files it needs) into c45.dll or c45.so and puts it next to Orange.
1240If this fails, but you do have a C compiler and linker, and you know
1241how to use them, you can compile the ensemble.c into a dynamic
1242library yourself. See the compile and link steps in buildC45.py,
1243if it helps. Anyway, after doing this check that the built C4.5
1244gives the same results as the original.
1245
1246.. class:: C45Learner
1247
1248    :class:`C45Learner`'s attributes have double names - those that
1249    you know from C4.5 command lines and the corresponding names of C4.5's
1250    internal variables. All defaults are set as in C4.5; if you change
1251    nothing, you are running C4.5.
1252
1253    .. attribute:: gainRatio (g)
1254       
        Determines whether to use information gain (false, default)
        or gain ratio for selection of attributes (true).
1257
1258    .. attribute:: batch (b)
1259
1260        Turn on batch mode (no windows, no iterations); this option is
1261        not documented in C4.5 manuals. It conflicts with "window",
1262        "increment" and "trials".
1263
1264    .. attribute:: subset (s)
1265       
        Enables subsetting (default: false, no subsetting).
1267 
1268    .. attribute:: probThresh (p)
1269
1270        Probabilistic threshold for continuous attributes (default: false).
1271
1272    .. attribute:: minObjs (m)
1273       
1274        Minimal number of objects (examples) in leaves (default: 2).
1275
1276    .. attribute:: window (w)
1277
        Initial window size (default: maximum of 20% and twice the
1279        square root of the number of data objects).
1280
1281    .. attribute:: increment (i)
1282
1283        The maximum number of objects that can be added to the window
1284        at each iteration (default: 20% of the initial window size).
1285
1286    .. attribute:: cf (c)
1287
        Pruning confidence level (default: 25%).
1289
1290    .. attribute:: trials (t)
1291
1292        Set the number of trials in iterative (i.e. non-batch) mode (default: 10).
1293
1294    .. attribute:: prune
1295       
1296        Return pruned tree (not an original C4.5 option) (default: true)
1297
1298
:class:`C45Learner` also offers another way of setting
the arguments: it provides a method :obj:`C45Learner.commandline`
which is given a string and parses it the same way as C4.5 would
parse its command line.
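
For example, the following two learners should end up configured
identically (a sketch: both request gain ratio and a minimal leaf size
of 10, using the attributes documented above)::

    lrn1 = Orange.classification.tree.C45Learner()
    lrn1.commandline("-g -m 10")

    lrn2 = Orange.classification.tree.C45Learner()
    lrn2.gainRatio = True
    lrn2.minObjs = 10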
1303
1304.. class:: C45Classifier
1305
1306    A faithful reimplementation of Quinlan's function from C4.5. The only
1307    difference (and the only reason it's been rewritten) is that it uses
1308    a tree composed of :class:`C45Node` instead of C4.5's
1309    original tree structure.
1310
1311    .. attribute:: tree
1312
1313        C4.5 tree stored as a tree of :obj:`C45Node`.
1314
1315
1316.. class:: C45Node
1317
1318    This class is a reimplementation of the corresponding *struct* from
1319    Quinlan's C4.5 code.
1320
1321    .. attribute:: nodeType
1322
        Type of the node:  :obj:`C45Node.Leaf` (0),
        :obj:`C45Node.Branch` (1), :obj:`C45Node.Cut` (2),
        :obj:`C45Node.Subset` (3). "Leaves" are leaves, "branches"
        split examples based on values of a discrete attribute,
        "cuts" cut them according to a threshold value of a continuous
        attribute, and "subsets" use discrete attributes but with subsetting,
        so that several values can go into the same branch.
1330
1331    .. attribute:: leaf
1332
1333        Value returned by that leaf. The field is defined for internal
1334        nodes as well.
1335
1336    .. attribute:: items
1337
1338        Number of (learning) examples in the node.
1339
1340    .. attribute:: classDist
1341
1342        Class distribution for the node (of type
1343        :obj:`orange.DiscDistribution`).
1344
1345    .. attribute:: tested
1346       
        The attribute used in the node's test. If the node is a leaf,
        :obj:`tested` is None; if the node is of type :obj:`Branch` or
        :obj:`Subset`, :obj:`tested` is a discrete attribute; and if the
        node is of type :obj:`Cut`, :obj:`tested` is a continuous attribute.
1351
1352    .. attribute:: cut
1353
1354        A threshold for continuous attributes, if node is of type :obj:`Cut`.
1355        Undefined otherwise.
1356
1357    .. attribute:: mapping
1358
        Mapping for nodes of type :obj:`Subset`. Element :samp:`mapping[i]`
        gives the index of the branch for examples whose value of :obj:`tested` is *i*.
        Here, *i* denotes an index of the value, not a :class:`Orange.data.Value`.
1362
1363    .. attribute:: branch
1364       
1365        A list of branches stemming from this node.
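
Knowing the attributes above suffices for simple recursive traversal of
a C4.5 tree. As a minimal sketch (the function name is made up for
illustration), this counts the leaves of a tree given its root
:obj:`C45Node`::

    def c45_count_leaves(node):
        # nodeType 0 (C45Node.Leaf) denotes a leaf
        if node.nodeType == 0:
            return 1
        # internal nodes: sum over all subtrees
        return sum(c45_count_leaves(branch) for branch in node.branch)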
1366
1367Examples
1368========
1369
1370.. _tree_c45.py: code/tree_c45.py
1371.. _iris.tab: code/iris.tab
1372
1373The simplest way to use :class:`C45Learner` is to call it. This
1374script constructs the same learner as you would get by calling
1375the usual C4.5 (`tree_c45.py`_, uses `iris.tab`_):
1376
1377.. literalinclude:: code/tree_c45.py
1378   :lines: 7-14
1379
Arguments can be set by the usual mechanism (the two lines below do the
same, except that one uses command-line symbols and the other internal
variable names)
1383
1384::
1385
1386    tree = Orange.classification.tree.C45Learner(data, m=100)
1387    tree = Orange.classification.tree.C45Learner(data, minObjs=100)
1388
The way preferred by veteran C4.5 users might be through the
method :obj:`C45Learner.commandline`.
1391
1392::
1393
    lrn = Orange.classification.tree.C45Learner()
1395    lrn.commandline("-m 1 -s")
1396    tree = lrn(data)
1397
There's nothing special about using :obj:`C45Classifier` - it's
just like any other classifier. To demonstrate what the structure of
:class:`C45Node` looks like, we will show a script that prints
it out in the same format as C4.5 does.
1402
1403.. literalinclude:: code/tree_c45_printtree.py
1404
Leaves are the simplest. We just print out the value contained
in :samp:`node.leaf`. Since this is not a qualified value (i.e.,
:obj:`C45Node` does not know to which attribute it belongs), we need to
convert it to a string through :obj:`classVar`, which is passed as an
extra argument to the recursive part of printTree.
1410
1411For discrete splits without subsetting, we print out all attribute values
1412and recursively call the function for all branches. Continuous splits are
1413equally easy to handle.
1414
For discrete splits with subsetting, we iterate through the branches,
retrieve the corresponding values that go into each branch (collected
in :samp:`inset`), turn the values into strings and print them out,
treating separately the case when only a single value goes into the branch.
1419
1420Printing out C45 Tree
1421=====================
1422
1423.. autofunction:: printTreeC45
1424
1425=======================
1426orngTree module XXXXXXX
1427=======================
1428
1429.. autoclass:: TreeLearner
1430    :members:
1431
For a bit more complex example, here's how to write your own stop
function. The example itself is more funny than useful. It constructs
and prints two trees. For the first one we define the *defStop*
function, which is used by default, and combine it with a random function
so that the stop criteria will also be met in an additional 20% of the
cases when *defStop* is false. The second tree is built so that it
considers only the random function as the stopping criterion. Note that in
the second case the lambda function still has three parameters, since this
is the required number of parameters for the stop function (part of
`tree3.py`_, uses `iris.tab`_):
1442
1443.. _tree3.py: code/tree3.py
1444
1445.. literalinclude:: code/tree3.py
1446   :lines: 8-23
1447
1448The output is not shown here since the resulting trees are rather
1449big.
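
In short, the two stop functions described above look roughly like this
(a condensed sketch of the code in `tree3.py`_; see the script for the
exact version)::

    from random import randint
    import Orange

    data = Orange.data.Table("iris")

    # the default criterion, extended to also fire in about 20% of
    # the cases in which it would otherwise be false
    defStop = Orange.classification.tree.StopCriteria()
    f = lambda examples, weightID, contingency: \
        defStop(examples, weightID, contingency) or randint(1, 5) == 1
    tree = Orange.classification.tree.TreeLearner(data, stop=f)

    # a stop criterion that considers only the random function
    f2 = lambda examples, weightID, contingency: randint(1, 5) == 1
    tree2 = Orange.classification.tree.TreeLearner(data, stop=f2)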
1450
1451Tree size
1452=========
1453
1454.. autofunction:: countNodes
1455
1456.. autofunction:: countLeaves
1457
1458Printing the Tree
1459=================
1460
The included printing functions can
print out practically anything you'd like to
know, from the number of examples, the proportion of examples of the
majority class in nodes and similar, to more complex statistics like the
proportion of examples in a particular class divided by the proportion
of examples of this class in a parent node. And even more: you can
define your own callback functions to be used for printing.
1468
1469
1470.. autofunction:: dumpTree
1471
1472.. autofunction:: printTree
1473
1474.. autofunction:: printTxt
1475
1476Before we go on: you can read all about the function and use it to its
1477full extent, or you can just call it, giving it the tree as the sole
1478argument and it will print out the usual textual representation of the
1479tree. If you're satisfied with that, you can stop here.
1480
The magic is in the format string. It is a string which is printed
out at every leaf or internal node, with certain format specifiers
replaced by data from the tree node. Specifiers are generally of the form
**%[^]<precision><quantity><divisor>**.
1485
**^** at the start indicates that the number should be multiplied by 100.
This is useful for printing proportions as percentages.
1488
1489**<precision>** is in the same format as in Python (or C) string
1490formatting. For instance, :samp:`%N` denotes the number of examples in the node,
1491hence :samp:`%6.2N` would mean output to two decimal digits and six places
1492altogether. If left out, a default format :samp:`5.3` is used, unless you
1493multiply the numbers by 100, in which case the default is :samp:`.0`
1494(no decimals, the number is rounded to the nearest integer).
1495
**<divisor>** tells what to divide the quantity in that node with.
:samp:`bP` means division by the same quantity in the parent node; for instance,
:samp:`%NbP` will give the number of examples in the node divided by the
number of examples in the parent node. You can also use precision formatting,
e.g. :samp:`%6.2NbP`. :samp:`bA` is division by the same quantity over the
entire data set, so :samp:`%NbA` will give the proportion of examples (out of
the entire training data set) that fell into that node. If division is
impossible because the parent node does not exist or some data is missing,
a dot is printed out instead of the quantity.
1505
1506**<quantity>** is the only required element. It defines what to print.
1507For instance, :samp:`%N` would print out the number of examples in the node.
1508Possible quantities are
1509
1510:samp:`V`
1511    The value predicted at that node. You cannot define the precision
1512    or divisor here.
1513
1514:samp:`N`
1515    The number of examples in the node.
1516
1517:samp:`M`
1518    The number of examples in the majority class (that is, the class
1519    predicted by the node).
1520
1521:samp:`m`
1522    The proportion of examples in the majority class.
1523
:samp:`A`
    The average class for examples in the node; this is available only for
    regression trees.
1527
1528:samp:`E`
1529    Standard error for class of examples in the node; available for
1530    regression trees.
1531
:samp:`I`
    Print out the confidence interval. The modifier is used as
    :samp:`%I(95)` or (more complicated) :samp:`%5.3I(95)bP`.
1535
:samp:`C`
    The number of examples in the given class. For classification trees,
    this modifier is used as, for instance, :samp:`%5.3C="Iris-virginica"bP`
    - this will give the number of examples of Iris-virginica divided by the
    number of examples of this class in the parent node. If you are
    interested in examples that are *not* Iris-virginica, say
    :samp:`%5.3CbP!="Iris-virginica"`
1543
1544    For regression trees, you can use operators =, !=, <, <=, >, and >=,
1545    as in :samp:`%C<22` - add the precision and divisor if you will. You can also
1546    check the number of examples in a certain interval: :samp:`%C[20, 22]`
1547    will give you the number of examples between 20 and 22 (inclusive)
1548    and :samp:`%C(20, 22)` will give the number of such examples excluding the
1549    boundaries. You can of course mix the parentheses, e.g. :samp:`%C(20, 22]`.
1550    If you would like the examples outside the interval, add a :samp:`!`,
1551    like :samp:`%C!(20, 22]`.
1552 
1553:samp:`c`
1554    Same as above, except that it computes the proportion of the class
1555    instead of the number of examples.
1556
:samp:`D`
    Prints out the number of examples in each class. You can use both
    the precision (it is applied to each number in the distribution) and
    the divisor. This quantity cannot be computed for regression trees.
1561
1562:samp:`d`
1563    Same as above, except that it shows proportions of examples. This
1564    again doesn't work with regression trees.
1565
1566<user defined formats>
1567    You can add more, if you will. Instructions and examples are given at
1568    the end of this section.
1569
1570
1571Examples
1572========
1573
We shall build a small tree from the iris data set, limiting the
depth to three levels (part of `orngTree1.py`_, uses `iris.tab`_):
1576
1577.. literalinclude:: code/orngTree1.py
1578   :lines: 0-4
1579
1580.. _orngTree1.py: code/orngTree1.py
1581
1582The easiest way to call the function is to pass the tree as the only
1583argument::
1584
1585    >>> Orange.classification.tree.printTree(tree)
1586    petal width<0.800: Iris-setosa (100.00%)
1587    petal width>=0.800
1588    |    petal width<1.750
1589    |    |    petal length<5.350: Iris-versicolor (94.23%)
1590    |    |    petal length>=5.350: Iris-virginica (100.00%)
1591    |    petal width>=1.750
1592    |    |    petal length<4.850: Iris-virginica (66.67%)
1593    |    |    petal length>=4.850: Iris-virginica (100.00%)
1594
Let's now print out the predicted class at each node, the number
of examples in the majority class together with the total number of
examples in the node::
1598
1599    >>> Orange.classification.tree.printTree(tree, leafStr="%V (%M out of %N)")
1600    petal width<0.800: Iris-setosa (50.000 out of 50.000)
1601    petal width>=0.800
1602    |    petal width<1.750
1603    |    |    petal length<5.350: Iris-versicolor (49.000 out of 52.000)
1604    |    |    petal length>=5.350: Iris-virginica (2.000 out of 2.000)
1605    |    petal width>=1.750
1606    |    |    petal length<4.850: Iris-virginica (2.000 out of 3.000)
1607    |    |    petal length>=4.850: Iris-virginica (43.000 out of 43.000)
1608
1609Would you like to know how the number of examples declines as
1610compared to the entire data set and to the parent node? We find
1611it with this::
1612
    >>> Orange.classification.tree.printTree(tree, leafStr="%V (%^MbA%, %^MbP%)")
1614    petal width<0.800: Iris-setosa (100%, 100%)
1615    petal width>=0.800
1616    |    petal width<1.750
1617    |    |    petal length<5.350: Iris-versicolor (98%, 100%)
1618    |    |    petal length>=5.350: Iris-virginica (4%, 40%)
1619    |    petal width>=1.750
1620    |    |    petal length<4.850: Iris-virginica (4%, 4%)
1621    |    |    petal length>=4.850: Iris-virginica (86%, 96%)
1622
Let us first read the format string. :samp:`%M` is the number of
examples in the majority class. We want it divided by the number of
all examples from this class in the entire data set, hence :samp:`%MbA`.
To have it multiplied by 100, we say :samp:`%^MbA`. The percent sign *after*
that is just printed out literally, just as the comma and parentheses
(see the output). The string for showing the proportion of this class
in the parent is the same except that we have :samp:`bP` instead
of :samp:`bA`.
1631
And now for the output: all examples of setosa fell into the first node.
For versicolor, we have 98% in one node; the rest is certainly
not in the neighbouring node (petal length>=5.350) since all
versicolors from the node petal width<1.750 went to petal length<5.350
(we know this from the 100% in that line). Virginica is the
majority class in the three nodes that together contain 94% of this
class (4+4+86). The rest must have gone to the same node as versicolor.
1639
1640If you find this guesswork annoying - so do I. Let us print out the
1641number of versicolors in each node, together with the proportion of
1642versicolors among the examples in this particular node and among all
1643versicolors. So,
1644
1645::
1646
1647    '%C="Iris-versicolor" (%^c="Iris-versicolor"% of node, %^CbA="Iris-versicolor"% of versicolors)
1648
1649gives the following output::
1650
1651    petal width<0.800: 0.000 (0% of node, 0% of versicolors)
1652    petal width>=0.800
1653    |    petal width<1.750
1654    |    |    petal length<5.350: 49.000 (94% of node, 98% of versicolors)
1655    |    |    petal length>=5.350: 0.000 (0% of node, 0% of versicolors)
1656    |    petal width>=1.750
1657    |    |    petal length<4.850: 1.000 (33% of node, 2% of versicolors)
1658    |    |    petal length>=4.850: 0.000 (0% of node, 0% of versicolors)
1659
1660Finally, we may want to print out the distributions, using a simple
1661string :samp:`%D`::
1662
1663    petal width<0.800: [50.000, 0.000, 0.000]
1664    petal width>=0.800
1665    |    petal width<1.750
1666    |    |    petal length<5.350: [0.000, 49.000, 3.000]
1667    |    |    petal length>=5.350: [0.000, 0.000, 2.000]
1668    |    petal width>=1.750
1669    |    |    petal length<4.850: [0.000, 1.000, 2.000]
1670    |    |    petal length>=4.850: [0.000, 0.000, 43.000]
1671
What is the order of numbers here? If you check
:samp:`data.domain.classVar.values`, you'll learn that the order is setosa,
versicolor, virginica; so in the node at petal length<5.350 we have 49
versicolors and 3 virginicae. To print out the proportions, we can use
:samp:`%.2d` - this gives us proportions within the node, rounded to
two decimals::
1678
1679    petal width<0.800: [1.00, 0.00, 0.00]
1680    petal width>=0.800
1681    |    petal width<1.750
1682    |    |    petal length<5.350: [0.00, 0.94, 0.06]
1683    |    |    petal length>=5.350: [0.00, 0.00, 1.00]
1684    |    petal width>=1.750
1685    |    |    petal length<4.850: [0.00, 0.33, 0.67]
1686    |    |    petal length>=4.850: [0.00, 0.00, 1.00]
1687
1688We haven't tried printing out any information for internal nodes.
1689To start with the most trivial case, we shall print the prediction
1690at each node.
1691
1692::
1693
1694    Orange.classification.tree.printTree(tree, leafStr="%V", nodeStr=".")
1695   
1696says that the nodeStr should be the same as leafStr (not very useful
1697here, since leafStr is trivial anyway).
1698
1699::
1700
1701    root: Iris-setosa
1702    |    petal width<0.800: Iris-setosa
1703    |    petal width>=0.800: Iris-versicolor
1704    |    |    petal width<1.750: Iris-versicolor
1705    |    |    |    petal length<5.350: Iris-versicolor
1706    |    |    |    petal length>=5.350: Iris-virginica
1707    |    |    petal width>=1.750: Iris-virginica
1708    |    |    |    petal length<4.850: Iris-virginica
1709    |    |    |    petal length>=4.850: Iris-virginica
1710
Note that the output is somewhat different now: another node called
*root* has appeared and the tree looks one level deeper. This is
needed to print out the data for that node too.
1714
1715Now for something more complicated: let us observe how the number
1716of virginicas decreases down the tree::
1717
1718    Orange.classification.tree.printTree(tree, leafStr='%^.1CbA="Iris-virginica"% (%^.1CbP="Iris-virginica"%)', nodeStr='.')
1719
Let's first interpret the format string: :samp:`CbA="Iris-virginica"` is
the number of examples from class virginica, divided by the total number
of examples in this class. Add :samp:`^.1` and the result will be
multiplied by 100 and printed with one decimal. The trailing :samp:`%` is
printed out literally. In parentheses we print the same thing except that
we divide by the examples in the parent node. Note the use of single
quotes, so that we can use double quotes inside the string when we
specify the class.
1727
1728::
1729
1730    root: 100.0% (.%)
1731    |    petal width<0.800: 0.0% (0.0%)
1732    |    petal width>=0.800: 100.0% (100.0%)
1733    |    |    petal width<1.750: 10.0% (10.0%)
1734    |    |    |    petal length<5.350: 6.0% (60.0%)
1735    |    |    |    petal length>=5.350: 4.0% (40.0%)
1736    |    |    petal width>=1.750: 90.0% (90.0%)
1737    |    |    |    petal length<4.850: 4.0% (4.4%)
1738    |    |    |    petal length>=4.850: 86.0% (95.6%)
1739
See what's in the parentheses in the root node? If :func:`printTree`
cannot compute something (in this case it's because the root has no parent),
it prints out a dot. You can also replace :samp:`=` by :samp:`!=` and it
will count all classes *except* virginica.
1744
For one final example with classification trees, we shall print the
distributions in the nodes, the distribution compared to the parent
and the proportions compared to the parent (the latter two are not
the same - think why). In the leaves we shall also add the predicted
class. So now we'll have to call the function like this.
1750
1751::
1752
    >>> Orange.classification.tree.printTree(tree, leafStr='"%V   %D %.2DbP %.2dbP"', nodeStr='"%D %.2DbP %.2dbP"')
1754    root: [50.000, 50.000, 50.000] . .
1755    |    petal width<0.800: [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]:
1756    |        Iris-setosa   [50.000, 0.000, 0.000] [1.00, 0.00, 0.00] [3.00, 0.00, 0.00]
1757    |    petal width>=0.800: [0.000, 50.000, 50.000] [0.00, 1.00, 1.00] [0.00, 1.50, 1.50]
1758    |    |    petal width<1.750: [0.000, 49.000, 5.000] [0.00, 0.98, 0.10] [0.00, 1.81, 0.19]
1759    |    |    |    petal length<5.350: [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]:
1760    |    |    |        Iris-versicolor   [0.000, 49.000, 3.000] [0.00, 1.00, 0.60] [0.00, 1.04, 0.62]
1761    |    |    |    petal length>=5.350: [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]:
1762    |    |    |        Iris-virginica   [0.000, 0.000, 2.000] [0.00, 0.00, 0.40] [0.00, 0.00, 10.80]
1763    |    |    petal width>=1.750: [0.000, 1.000, 45.000] [0.00, 0.02, 0.90] [0.00, 0.04, 1.96]
1764    |    |    |    petal length<4.850: [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]:
1765    |    |    |        Iris-virginica   [0.000, 1.000, 2.000] [0.00, 1.00, 0.04] [0.00, 15.33, 0.68]
1766    |    |    |    petal length>=4.850: [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]:
1767    |    |    |        Iris-virginica   [0.000, 0.000, 43.000] [0.00, 0.00, 0.96] [0.00, 0.00, 1.02]
1768
1769To explore the possibilities when printing regression trees, we are going
1770to induce a tree from the housing data set. Called with the tree as the
1771only argument, :func:`printTree` prints the tree like this::
1772
1773    RM<6.941
1774    |    LSTAT<14.400
1775    |    |    DIS<1.385: 45.6
1776    |    |    DIS>=1.385: 22.9
1777    |    LSTAT>=14.400
1778    |    |    CRIM<6.992: 17.1
1779    |    |    CRIM>=6.992: 12.0
1780    RM>=6.941
1781    |    RM<7.437
1782    |    |    CRIM<7.393: 33.3
1783    |    |    CRIM>=7.393: 14.4
1784    |    RM>=7.437
1785    |    |    TAX<534.500: 45.9
1786    |    |    TAX>=534.500: 21.9
1787
1788Let us add the standard error in both internal nodes and leaves, and the
178990% confidence intervals in the leaves::
1790
1791    >>> Orange.classification.tree.printTree(tree, leafStr="[SE: %E]\t %V %I(90)", nodeStr="[SE: %E]")
1792    root: [SE: 0.409]
1793    |    RM<6.941: [SE: 0.306]
1794    |    |    LSTAT<14.400: [SE: 0.320]
1795    |    |    |    DIS<1.385: [SE: 4.420]:
1796    |    |    |        [SE: 4.420]   45.6 [38.331-52.829]
1797    |    |    |    DIS>=1.385: [SE: 0.244]:
1798    |    |    |        [SE: 0.244]   22.9 [22.504-23.306]
1799    |    |    LSTAT>=14.400: [SE: 0.333]
1800    |    |    |    CRIM<6.992: [SE: 0.338]:
1801    |    |    |        [SE: 0.338]   17.1 [16.584-17.691]
1802    |    |    |    CRIM>=6.992: [SE: 0.448]:
1803    |    |    |        [SE: 0.448]   12.0 [11.243-12.714]
1804    |    RM>=6.941: [SE: 1.031]
1805    |    |    RM<7.437: [SE: 0.958]
1806    |    |    |    CRIM<7.393: [SE: 0.692]:
1807    |    |    |        [SE: 0.692]   33.3 [32.214-34.484]
1808    |    |    |    CRIM>=7.393: [SE: 2.157]:
1809    |    |    |        [SE: 2.157]   14.4 [10.862-17.938]
1810    |    |    RM>=7.437: [SE: 1.124]
1811    |    |    |    TAX<534.500: [SE: 0.817]:
1812    |    |    |        [SE: 0.817]   45.9 [44.556-47.237]
1813    |    |    |    TAX>=534.500: [SE: 0.000]:
1814    |    |    |        [SE: 0.000]   21.9 [21.900-21.900]
1815
What's the difference between :samp:`%V`, the predicted value, and
:samp:`%A`, the average? Doesn't a regression tree always predict the
leaf average anyway? Not necessarily: the tree predicts whatever the
:attr:`TreeClassifier.nodeClassifier` in a leaf returns.
Besides, :samp:`%V` uses the
:obj:`Orange.data.variable.Continuous`' function for printing out the value,
so the printed number has the same number of decimals
as in the data file.
1824
1825Regression trees cannot print the distributions in the same way
1826as classification trees. They instead offer a set of operators for
1827observing the number of examples within a certain range. For instance,
let us check the number of examples with values below 22, and compare
it with the corresponding numbers in the parent nodes::
1830
1831    >>> Orange.classification.tree.printTree(tree, leafStr="%C<22 (%cbP<22)", nodeStr=".")
1832    root: 277.000 (.)
1833    |    RM<6.941: 273.000 (1.160)
1834    |    |    LSTAT<14.400: 107.000 (0.661)
1835    |    |    |    DIS<1.385: 0.000 (0.000)
1836    |    |    |    DIS>=1.385: 107.000 (1.020)
1837    |    |    LSTAT>=14.400: 166.000 (1.494)
1838    |    |    |    CRIM<6.992: 93.000 (0.971)
1839    |    |    |    CRIM>=6.992: 73.000 (1.040)
1840    |    RM>=6.941: 4.000 (0.096)
1841    |    |    RM<7.437: 3.000 (1.239)
1842    |    |    |    CRIM<7.393: 0.000 (0.000)
1843    |    |    |    CRIM>=7.393: 3.000 (15.333)
1844    |    |    RM>=7.437: 1.000 (0.633)
1845    |    |    |    TAX<534.500: 0.000 (0.000)
    |    |    |    TAX>=534.500: 1.000 (30.000)
1847
The last line, for instance, says that among the examples with tax above
534, the proportion of examples with the class below 22 is 30 times
higher than the proportion of such examples in its parent node.
1851
For another exercise, let's count the same for all examples *outside*
the interval [20, 22] (given like this, the interval includes the bounds).
And let us print out the proportions as percentages.
1855
1856::
1857
1858    >>> Orange.classification.tree.printTree(tree, leafStr="%C![20,22] (%^cbP![20,22]%)", nodeStr=".")
1859
OK, let's observe the format string one last time. :samp:`%c![20, 22]`
would be the proportion of examples (within the node) whose values are
below 20 or above 22. With :samp:`%cbP![20, 22]` we divide this by the same
statistic computed on the parent. Add a :samp:`^` and you have the percentages.
1864
1865::
1866
1867    root: 439.000 (.%)
1868    |    RM<6.941: 364.000 (98%)
1869    |    |    LSTAT<14.400: 200.000 (93%)
1870    |    |    |    DIS<1.385: 5.000 (127%)
1871    |    |    |    DIS>=1.385: 195.000 (99%)
1872    |    |    LSTAT>=14.400: 164.000 (111%)
1873    |    |    |    CRIM<6.992: 91.000 (96%)
1874    |    |    |    CRIM>=6.992: 73.000 (105%)
1875    |    RM>=6.941: 75.000 (114%)
1876    |    |    RM<7.437: 46.000 (101%)
1877    |    |    |    CRIM<7.393: 43.000 (100%)
1878    |    |    |    CRIM>=7.393: 3.000 (100%)
1879    |    |    RM>=7.437: 29.000 (98%)
1880    |    |    |    TAX<534.500: 29.000 (103%)
1881    |    |    |    TAX>=534.500: 0.000 (0%)
1882
1883
1884Defining Your Own Printout functions
1885====================================
1886
1887:func:`dumpTree`'s argument :obj:`userFormats` can be used to print out
1888some other information in the leaves or nodes. If provided,
1889:obj:`userFormats` should contain a list of tuples with a regular expression
1890and a callback function to be called when that expression is found in the
1891format string. Expressions from :obj:`userFormats` are checked before
1892the built-in expressions discussed above, so you can override the built-ins
1893if you want to.
1894
1895The regular expression should describe a string like those we used above,
1896for instance the string :samp:`%.2DbP`. When a leaf or internal node
1897is printed out, the format string (:obj:`leafStr` or :obj:`nodeStr`)
1898is checked for these regular expressions and when the match is found,
1899the corresponding callback function is called.
1900
1901The callback function will get five arguments: the format string
1902(:obj:`leafStr` or :obj:`nodeStr`), the match object, the node which is
1903being printed, its parent (can be None) and the tree classifier.
1904The function should return the format string in which the part described
1905by the match object (that is, the part that is matched by the regular
1906expression) is replaced by whatever information your callback function
1907is supposed to give.
1908
The function can use several utility functions provided in the module.
1910
1911.. autofunction:: insertStr
1912
1913.. autofunction:: insertDot
1914
1915.. autofunction:: insertNum
1916
1917.. autofunction:: byWhom
1918
1919
There are also a few regular expression fragments that you may want to
reuse. The two you are most likely to use are:
1922
1923.. autodata:: fs
1924
1925.. autodata:: by
1926
1927For a trivial example, :samp:`%V` is implemented like this. There is the
1928following tuple in the list of built-in formats::
1929
1930    (re.compile("%V"), replaceV)
1931
1932:obj:`replaceV` is a function defined by::
1933
1934    def replaceV(strg, mo, node, parent, tree):
1935        return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
1936
It therefore takes the value predicted at the node
(:samp:`node.nodeClassifier.defaultValue`), converts it to a string
and passes it to *insertStr* to do the replacement.
1940
1941A more complex regular expression is the one for the proportion of
1942majority class, defined as :samp:`"%"+fs+"M"+by`. It uses the
1943two partial expressions defined above.
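
Following the same pattern, here is a sketch of a custom specifier
(the specifier :samp:`%B`, its callback and the call below are made up
for illustration, assuming the printing functions pass
:obj:`userFormats` on to :func:`dumpTree`): it prints the number of
branches of a node::

    import re

    def replaceNumBranches(strg, mo, node, parent, tree):
        # leaves have no branches; print a dot for them
        if not node.branches:
            return Orange.classification.tree.insertDot(strg, mo)
        return Orange.classification.tree.insertStr(strg, mo,
            str(len(node.branches)))

    myFormats = [(re.compile("%B"), replaceNumBranches)]
    Orange.classification.tree.printTree(tree, leafStr="%V",
        nodeStr="%V (%B branches)", userFormats=myFormats)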
1944
Let's say we would like to print the classification margin for each node,
that is, the difference between the proportion of the largest and the
second largest class in the node (part of `orngTree2.py`_):
1948
1949.. _orngTree2.py: code/orngTree2.py
1950
1951.. literalinclude:: code/orngTree2.py
1952   :lines: 7-31
1953
We first defined getMargin, which gets the distribution and computes the
margin. The callback function, replaceB, computes the margin for the node.
If we need to divide the quantity by something (that is, if the :data:`by`
group is present), we call :func:`byWhom` to get the node with whose margin
this node's margin is to be divided. If this node (usually the parent)
does not exist or if its margin is zero, we call :func:`insertDot`
to insert a dot; otherwise we call :func:`insertNum`, which will insert
the number, obeying the format specified by the user. myFormat is a list
containing the regular expression and the callback function.
1963
1964
1965We can now print out the iris tree:
1966
1967.. literalinclude:: code/orngTree2.py
1968    :lines: 33
1969
1970And we get::
1971
1972    petal width<0.800: Iris-setosa 100% (100.00%)
1973    petal width>=0.800
1974    |    petal width<1.750
1975    |    |    petal length<5.350: Iris-versicolor 88% (108.57%)
1976    |    |    petal length>=5.350: Iris-virginica 100% (122.73%)
1977    |    petal width>=1.750
1978    |    |    petal length<4.850: Iris-virginica 33% (34.85%)
1979    |    |    petal length>=4.850: Iris-virginica 100% (104.55%)
1980
1981
1982Plotting the Tree using Dot
1983===========================
1984
1985.. autofunction:: printDot
1986
1987.. autofunction:: dotTree
1988
1989Suppose you saved the tree in a file "tree5.dot". You can then
1990print it out as a gif if you execute the following command line
1991
1992::
1993   
1994    dot -Tgif tree5.dot -otree5.gif
1995
1996GraphViz's dot has quite a few other output formats, check
1997its documentation to learn which.
1998
1999References
2000==========
2001
Bratko, I. (2002). `Prolog Programming for Artificial Intelligence`,
Addison Wesley.
2004
2005E Koutsofios, SC North. Drawing Graphs with dot. AT&T Bell Laboratories,
2006Murray Hill NJ, U.S.A., October 1993.
2007
2008`Graphviz - open source graph drawing software <http://www.research.att.com/sw/tools/graphviz/>`_
2009A home page of AT&T's dot and similar software packages.
2010
2011"""
2012
2013from Orange.core import \
2014     TreeLearner as TreeLearnerBase, \
2015         TreeClassifier, \
2016         C45Learner, \
2017         C45Classifier, \
2018         C45TreeNode as C45Node, \
2019         C45TreeNodeList as C45NodeList, \
2020         TreeDescender as Descender, \
2021              TreeDescender_UnknownMergeAsBranchSizes as Descender_UnknownMergeAsBranchSizes, \
2022              TreeDescender_UnknownMergeAsSelector as Descender_UnknownMergeAsSelector, \
2023              TreeDescender_UnknownToBranch as Descender_UnknownToBranch, \
2024              TreeDescender_UnknownToCommonBranch as Descender_UnknownToCommonBranch, \
2025              TreeDescender_UnknownToCommonSelector as Descender_UnknownToCommonSelector, \
2026         TreeExampleSplitter as Splitter, \
2027              TreeExampleSplitter_IgnoreUnknowns as Splitter_IgnoreUnknowns, \
2028              TreeExampleSplitter_UnknownsAsBranchSizes as Splitter_UnknownsAsBranchSizes, \
2029              TreeExampleSplitter_UnknownsAsSelector as Splitter_UnknownsAsSelector, \
2030              TreeExampleSplitter_UnknownsToAll as Splitter_UnknownsToAll, \
2031              TreeExampleSplitter_UnknownsToBranch as Splitter_UnknownsToBranch, \
2032              TreeExampleSplitter_UnknownsToCommon as Splitter_UnknownsToCommon, \
2033              TreeExampleSplitter_UnknownsToRandom as Splitter_UnknownsToRandom, \
2034         TreeNode as Node, \
2035         TreeNodeList as NodeList, \
2036         TreePruner as Pruner, \
2037              TreePruner_SameMajority as Pruner_SameMajority, \
2038              TreePruner_m as Pruner_m, \
2039         TreeSplitConstructor as SplitConstructor, \
2040              TreeSplitConstructor_Combined as SplitConstructor_Combined, \
2041              TreeSplitConstructor_Measure as SplitConstructor_Measure, \
2042                   TreeSplitConstructor_Attribute as SplitConstructor_Feature, \
2043                   TreeSplitConstructor_ExhaustiveBinary as SplitConstructor_ExhaustiveBinary, \
2044                   TreeSplitConstructor_OneAgainstOthers as SplitConstructor_OneAgainstOthers, \
2045                   TreeSplitConstructor_Threshold as SplitConstructor_Threshold, \
2046         TreeStopCriteria as StopCriteria, \
2047              TreeStopCriteria_Python as StopCriteria_Python, \
2048              TreeStopCriteria_common as StopCriteria_common
2049
2050import Orange.core
2051import operator
2052import base64
2053import re
2054import Orange.data
2055import Orange.feature.scoring
2056import Orange.classification.tree
2057
def _c45_showBranch(node, classvar, lev, i):
    # Print the i-th branch of a C45 node (recursive helper for printTreeC45)
    var = node.tested
2060    if node.nodeType == 1:
2061        print ("\n"+"|   "*lev + "%s = %s:") % (var.name, var.values[i]),
2062        _c45_printTree0(node.branch[i], classvar, lev+1)
2063    elif node.nodeType == 2:
2064        print ("\n"+"|   "*lev + "%s %s %.1f:") % (var.name, ["<=", ">"][i], node.cut),
2065        _c45_printTree0(node.branch[i], classvar, lev+1)
2066    else:
2067        inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
2068        inset = [var.values[j[0]] for j in inset]
2069        if len(inset)==1:
2070            print ("\n"+"|   "*lev + "%s = %s:") % (var.name, inset[0]),
2071        else:
2072            print ("\n"+"|   "*lev + "%s in {%s}:") % (var.name, ", ".join(inset)),
2073        _c45_printTree0(node.branch[i], classvar, lev+1)
2074       
2075       
def _c45_printTree0(node, classvar, lev):
    # Recursively print a C4.5 (sub)tree in the format of the C4.5 program
    var = node.tested
2078    if node.nodeType == 0:
2079        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
2080    else:
2081        for i, branch in enumerate(node.branch):
2082            if not branch.nodeType:
2083                _c45_showBranch(node, classvar, lev, i)
2084        for i, branch in enumerate(node.branch):
2085            if branch.nodeType:
2086                _c45_showBranch(node, classvar, lev, i)
2087
2088def printTreeC45(tree):
2089    """
2090    Prints the tree given as an argument in the same form as Ross Quinlan's
2091    C4.5 program.
2092
2093    ::
2094
2095        import Orange
2096
2097        data = Orange.data.Table("voting")
2098        c45 = Orange.classification.tree.C45Learner(data)
2099        Orange.classification.tree.printTreeC45(c45)
2100
2101    will print out
2102
2103    ::
2104
2105        physician-fee-freeze = n: democrat (253.4)
2106        physician-fee-freeze = y:
2107        |   synfuels-corporation-cutback = n: republican (145.7)
2108        |   synfuels-corporation-cutback = y:
2109        |   |   mx-missile = y: democrat (6.0)
2110        |   |   mx-missile = n:
2111        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6)
2112        |   |   |   adoption-of-the-budget-resolution = y:
2113        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0)
2114        |   |   |   |   anti-satellite-test-ban = y: republican (2.2)
2115
2116
    If you run the original C4.5 (that is, the standalone C4.5 program; Orange does use the original C4.5 code) on the same data, it will print out
2118
2119    ::
2120
2121        physician-fee-freeze = n: democrat (253.4/5.9)
2122        physician-fee-freeze = y:
2123        |   synfuels-corporation-cutback = n: republican (145.7/6.2)
2124        |   synfuels-corporation-cutback = y:
2125        |   |   mx-missile = y: democrat (6.0/2.4)
2126        |   |   mx-missile = n:
2127        |   |   |   adoption-of-the-budget-resolution = n: republican (22.6/5.2)
2128        |   |   |   adoption-of-the-budget-resolution = y:
2129        |   |   |   |   anti-satellite-test-ban = n: democrat (5.0/1.2)
2130        |   |   |   |   anti-satellite-test-ban = y: republican (2.2/1.0)
2131
    which is endearingly similar, except that C4.5 tested the tree on
    the learning data and has also printed out the number of errors
    in each node - something which :obj:`printTreeC45` obviously can't do
    (nor is there any need that it should).
2136
2137    """
2138    _c45_printTree0(tree.tree, tree.classVar, 0)
2139
2140
2141class TreeLearner(Orange.core.Learner):
2142    """
2143    Assembles the generic classification or regression tree learner
2144    (from Orange's objects for induction of decision trees).
2145    :class:`TreeLearner` is essentially a wrapper
2146    around :class:`TreeLearnerBase`, provided for easier use of the latter.
2147    It sets a number of parameters used in induction that
2148    can also be set after the creation of the object, most often through
    the object's attributes. If upon initialization
    :class:`TreeLearner` is given a set of examples, then an instance
    of :class:`TreeClassifier` is returned instead.

    The values of attributes can also be set in the constructor.
2154
2155    .. attribute:: nodeLearner
2156
2157        Induces a classifier from examples belonging to a node. The
2158        same learner is used for internal nodes and for leaves. The
2159        default :obj:`nodeLearner` is :obj:`Orange.classification.majority.MajorityLearner`.
2160
2161    **Split construction**
2162
2163    .. attribute:: split
2164       
2165        Defines a function that will be used in place of
2166        :obj:`SplitConstructor`.
2167        Useful when prototyping new tree induction
2168        algorithms. When this parameter is defined, other parameters that
        affect the growing of the tree are ignored. These
        include :obj:`binarization`, :obj:`measure`,
        :obj:`worstAcceptable` and :obj:`minSubset`. (Default:
        :class:`SplitConstructor_Combined`
2173        with separate constructors for discrete and continuous attributes.
2174        Discrete attributes are used as they are, while
2175        continuous attributes are binarized.
2176        Gain ratio is used to select attributes.
2177        A minimum of two examples in a leaf is required for
2178        discrete and five examples in a leaf for continuous attributes.)
2179
2180    .. attribute:: binarization
2181
        If 1, :class:`SplitConstructor_ExhaustiveBinary` is used.
        If 2, :class:`SplitConstructor_OneAgainstOthers` is used. If
        0, binarization is not used (:class:`SplitConstructor_Feature`
        is used instead). Default: 0.
2186
2187    .. attribute:: measure
2188   
        Measure for scoring the attributes when deciding which of them
        will be used for splitting the example set in the node.
        Can be either a measure object or one of the strings
        "infoGain" (:class:`Orange.feature.scoring.InfoGain`),
        "gainRatio" (:class:`Orange.feature.scoring.GainRatio`),
        "gini" (:class:`Orange.feature.scoring.Gini`),
        "relief" (:class:`Orange.feature.scoring.Relief`),
        "retis" (:class:`Orange.feature.scoring.MSE`). Default: "gainRatio".
2197
2198    .. attribute:: reliefM, reliefK
2199
        Set `m` and `k` to the given values if the :obj:`measure` is relief.
2201
2202    **Pruning**
2203
2204    .. attribute:: worstAcceptable
2205
2206        Used in pre-pruning, sets the lowest required attribute
2207        score. If the score of the best attribute is below this margin, the
2208        tree at that node is not grown further (default: 0).
2209
2210        So, to allow splitting only when gainRatio (the default measure)
2211        is greater than 0.6, one should run the learner like this:
2212        :samp:`l = Orange.classification.tree.TreeLearner(data, worstAcceptable=0.6)`
2213
2214    .. attribute:: minSubset
2215
2216        Minimal number of examples in non-null leaves (default: 0).
2217
2218    .. attribute:: minExamples
2219
        Data subsets with fewer than :obj:`minExamples`
        examples are not split any further; that is, all leaves in the tree
        will contain at least that many examples (default: 0).
2223
2224    .. attribute:: maxDepth
2225
        Gives the maximal tree depth; 0 means that only the root is generated.
2227        The default is 100.
2228
2229    .. attribute:: maxMajority
2230
        Induction stops when the proportion of majority class in the
        node exceeds the value set by this parameter (default: 1.0). E.g.
        to stop the induction as soon as the majority class reaches 70%,
        you should say
        :samp:`tree2 = Orange.classification.tree.TreeLearner(data, maxMajority=0.7)`

        This is an example of a tree on the iris data set, built with
        the above argument - the numbers show the majority class
        proportion at each node. You can find more details in the
        script tree2.py, which induces and prints this tree.
2242
2243        ::
2244
2245            root: 0.333
2246            |    petal width<0.800: 1.000
2247            |    petal width>=0.800: 0.500
2248            |    |    petal width<1.750: 0.907
2249            |    |    petal width>=1.750: 0.978
2250   
2251    .. attribute:: stop
2252
2253        Used for passing a function which is used in place of
2254        :class:`StopCriteria`. Useful when prototyping new
        tree induction algorithms. See the documentation on
        :class:`StopCriteria` for more info on this function.
        When used, the parameters :obj:`maxMajority` and :obj:`minExamples`
        will not be considered (default: None).
        The default stopping criterion stops induction when all examples
        in a node belong to the same class.
2260
2261    .. attribute:: mForPruning
2262
2263        If non-zero, invokes an error-based bottom-up post-pruning,
2264        where m-estimate is used to estimate class probabilities
2265        (default: 0).
2266
2267    .. attribute:: sameMajorityPruning
2268
        If true, invokes a bottom-up post-pruning that removes
        subtrees whose leaves all classify to the same class
        (default: False).
2272
2273    **Record keeping**
2274
2275    .. attribute:: storeDistributions, storeContingencies, storeExamples, storeNodeClassifier
2276
        Determines whether to store class distributions, contingencies and
        examples in :class:`Node`, and whether the :obj:`nodeClassifier`
        should be built for internal nodes. By default everything except
        :obj:`storeExamples` is enabled. You won't save any memory by not
        storing distributions while storing contingencies, since the
        distribution actually points to the same object that is stored in
        :obj:`contingency.classes`. (Default: True, except for
        storeExamples, which defaults to False.)
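
    For illustration, a learner with binary splits, the Gini index and
    m-error-based post-pruning can be put together from the attributes
    documented above (a sketch; :samp:`data` is any example table)::

        tree = Orange.classification.tree.TreeLearner(data,
            binarization=1, measure="gini", mForPruning=2)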
2285   
2286    """
2287    def __new__(cls, examples = None, weightID = 0, **argkw):
2288        self = Orange.core.Learner.__new__(cls, **argkw)
2289        if examples:
2290            self.__init__(**argkw)
2291            return self.__call__(examples, weightID)
2292        else:
2293            return self
2294     
2295    def __init__(self, **kw):
2296        self.learner = None
2297        self.__dict__.update(kw)
2298     
2299    def __setattr__(self, name, value):
2300        if name in ["split", "binarization", "measure", "worstAcceptable", "minSubset",
2301              "stop", "maxMajority", "minExamples", "nodeLearner", "maxDepth", "reliefM", "reliefK"]:
2302            self.learner = None
2303        self.__dict__[name] = value
2304
2305    def __call__(self, examples, weight=0):
2306        """
2307        Return a classifier from the given examples.
2308        """
2309        if not self.learner:
2310            self.learner = self.instance()
2311        if not hasattr(self, "split") and not hasattr(self, "measure"):
2312            if examples.domain.classVar.varType == Orange.data.Type.Discrete:
2313                measure = Orange.feature.scoring.GainRatio()
2314            else:
2315                measure = Orange.feature.scoring.MSE()
2316            self.learner.split.continuousSplitConstructor.measure = measure
2317            self.learner.split.discreteSplitConstructor.measure = measure
2318           
2319        tree = self.learner(examples, weight)
2320        if getattr(self, "sameMajorityPruning", 0):
2321            tree = Orange.classification.tree.Pruner_SameMajority(tree)
2322        if getattr(self, "mForPruning", 0):
2323            tree = Orange.classification.tree.Pruner_m(tree, m = self.mForPruning)
2324        return tree
2325
2326    def instance(self):
2327        """
2328        Return the constructed learner - an object of :class:`TreeLearnerBase`.
2329        """
2330        learner = Orange.classification.tree.TreeLearnerBase()
2331
2332        hasSplit = hasattr(self, "split")
2333        if hasSplit:
2334            learner.split = self.split
2335        else:
2336            learner.split = Orange.classification.tree.SplitConstructor_Combined()
2337            learner.split.continuousSplitConstructor = Orange.classification.tree.SplitConstructor_Threshold()
2338            binarization = getattr(self, "binarization", 0)
2339            if binarization == 1:
2340                learner.split.discreteSplitConstructor = Orange.classification.tree.SplitConstructor_ExhaustiveBinary()
2341            elif binarization == 2:
2342                learner.split.discreteSplitConstructor = Orange.classification.tree.SplitConstructor_OneAgainstOthers()
2343            else:
2344                learner.split.discreteSplitConstructor = Orange.classification.tree.SplitConstructor_Feature()
2345
2346            measures = {"infoGain": Orange.feature.scoring.InfoGain,
2347                "gainRatio": Orange.feature.scoring.GainRatio,
2348                "gini": Orange.feature.scoring.Gini,
2349                "relief": Orange.feature.scoring.Relief,
2350                "retis": Orange.feature.scoring.MSE
2351                }
2352
2353            measure = getattr(self, "measure", None)
2354            if type(measure) == str:
2355                measure = measures[measure]()
2356            if not hasSplit and not measure:
2357                measure = Orange.feature.scoring.GainRatio()
2358
2359            measureIsRelief = type(measure) == Orange.feature.scoring.Relief
2360            relM = getattr(self, "reliefM", None)
2361            if relM and measureIsRelief:
2362                measure.m = relM
2363           
2364            relK = getattr(self, "reliefK", None)
2365            if relK and measureIsRelief:
2366                measure.k = relK
2367
2368            learner.split.continuousSplitConstructor.measure = measure
2369            learner.split.discreteSplitConstructor.measure = measure
2370
2371            wa = getattr(self, "worstAcceptable", 0)
2372            if wa:
2373                learner.split.continuousSplitConstructor.worstAcceptable = wa
2374                learner.split.discreteSplitConstructor.worstAcceptable = wa
2375
2376            ms = getattr(self, "minSubset", 0)
2377            if ms:
2378                learner.split.continuousSplitConstructor.minSubset = ms
2379                learner.split.discreteSplitConstructor.minSubset = ms
2380
2381        if hasattr(self, "stop"):
2382            learner.stop = self.stop
2383        else:
2384            learner.stop = Orange.classification.tree.StopCriteria_common()
2385            mm = getattr(self, "maxMajority", 1.0)
2386            if mm < 1.0:
2387                learner.stop.maxMajority = self.maxMajority
2388            me = getattr(self, "minExamples", 0)
2389            if me:
2390                learner.stop.minExamples = self.minExamples
2391
2392        for a in ["storeDistributions", "storeContingencies", "storeExamples", "storeNodeClassifier", "nodeLearner", "maxDepth"]:
2393            if hasattr(self, a):
2394                setattr(learner, a, getattr(self, a))
2395
2396        return learner
2397
2398#counting
2399
2400def __countNodes(node):
2401    count = 0
2402    if node:
2403        count += 1
2404        if node.branches:
2405            for node in node.branches:
2406                count += __countNodes(node)
2407    return count
2408
2409def countNodes(tree):
2410    """
2411    Return the number of nodes of tree.
2412
2413    :param tree: The tree for which to count the nodes.
2414    :type tree: :class:`TreeClassifier`
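
    A usage sketch::

        iris = Orange.data.Table("iris")
        tree = Orange.classification.tree.TreeLearner(iris)
        print Orange.classification.tree.countNodes(tree)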
2415    """
2416    return __countNodes(isinstance(tree, Orange.classification.tree.TreeClassifier) and tree.tree or tree)
2417
2418
2419def __countLeaves(node):
2420    count = 0
2421    if node:
2422        if node.branches: # internal node
2423            for node in node.branches:
2424                count += __countLeaves(node)
2425        else:
2426            count += 1
2427    return count
2428
2429def countLeaves(tree):
2430    """
2431    Return the number of leaves in the tree.
2432
2433    :param tree: The tree for which to count the leaves.
2434    :type tree: :class:`TreeClassifier`
2435    """
2436    return __countLeaves(isinstance(tree, Orange.classification.tree.TreeClassifier) and tree.tree or tree)
2437
2438
2439# the following is for the output
2440
2441
2442fs = r"(?P<m100>\^?)(?P<fs>(\d*\.?\d*)?)"
2443""" Defines the multiplier by 100 (:samp:`^`) and the format
2444for the number of decimals (e.g. :samp:`5.3`). The corresponding
2445groups are named :samp:`m100` and :samp:`fs`. """
2446
2447by = r"(?P<by>(b(P|A)))?"
2448""" Defines bP or bA or nothing; the result is in groups by. """
2449
2450bysub = r"((?P<bysub>b|s)(?P<by>P|A))?"
2451opc = r"(?P<op>=|<|>|(<=)|(>=)|(!=))(?P<num>\d*\.?\d+)"
2452opd = r'(?P<op>=|(!=))"(?P<cls>[^"]*)"'
2453intrvl = r'((\((?P<intp>\d+)%?\))|(\(0?\.(?P<intv>\d+)\))|)'
2454fromto = r"(?P<out>!?)(?P<lowin>\(|\[)(?P<lower>\d*\.?\d+)\s*,\s*(?P<upper>\d*\.?\d+)(?P<upin>\]|\))"
2455re_V = re.compile("%V")
2456re_N = re.compile("%"+fs+"N"+by)
2457re_M = re.compile("%"+fs+"M"+by)
2458re_m = re.compile("%"+fs+"m"+by)
2459re_Ccont = re.compile("%"+fs+"C"+by+opc)
2460re_Cdisc = re.compile("%"+fs+"C"+by+opd)
2461re_ccont = re.compile("%"+fs+"c"+by+opc)
2462re_cdisc = re.compile("%"+fs+"c"+by+opd)
2463re_Cconti = re.compile("%"+fs+"C"+by+fromto)
2464re_cconti = re.compile("%"+fs+"c"+by+fromto)
2465re_D = re.compile("%"+fs+"D"+by)
2466re_d = re.compile("%"+fs+"d"+by)
2467re_AE = re.compile("%"+fs+"(?P<AorE>A|E)"+bysub)
2468re_I = re.compile("%"+fs+"I"+intrvl)
2469
2470def insertStr(s, mo, sub):
2471    """ Replace the part of s which is covered by mo
2472    with the string sub. """
2473    return s[:mo.start()] + sub + s[mo.end():]
2474
2475
2476def insertDot(s, mo):
2477    """ Replace the part of s which is covered by mo
2478    with a dot.  You should use this when the
2479    function cannot compute the desired quantity; it is called, for instance,
2480    when it needs to divide by something in the parent, but the parent
2481    doesn't exist.
2482    """
2483    return s[:mo.start()] + "." + s[mo.end():]
2484
2485def insertNum(s, mo, n):
2486    """ Replace the part of s matched by mo with the number n,
2487    formatted as specified by the user, that is, it multiplies
2488    it by 100, if needed, and prints with the right number of
2489    places and decimals. It does so by checking the mo
2490    for a group named m100 (representing the :samp:`^` in the format string)
2491    and a group named fs representing the part giving the number o
2492    f decimals (e.g. :samp:`5.3`).
2493    """
2494    grps = mo.groupdict()
2495    m100 = grps.get("m100", None)
2496    if m100:
2497        n *= 100
2498    fs = grps.get("fs") or (m100 and ".0" or "5.3")
2499    return s[:mo.start()] + ("%%%sf" % fs % n) + s[mo.end():]
2500
2501def byWhom(by, parent, tree):
2502    """ If by equals bp, it returns parent, else it returns
2503    :samp:`tree.tree`. This is used to find what to divide the quantity
2504    with, when division is required.
2505    """
2506    if by=="bP":
2507        return parent
2508    else:
2509        return tree.tree
2510
2511def replaceV(strg, mo, node, parent, tree):
2512    return insertStr(strg, mo, str(node.nodeClassifier.defaultValue))
2513
2514def replaceN(strg, mo, node, parent, tree):
2515    by = mo.group("by")
2516    N = node.distribution.abs
2517    if by:
2518        whom = byWhom(by, parent, tree)
2519        if whom and whom.distribution:
2520            if whom.distribution.abs > 1e-30:
2521                N /= whom.distribution.abs
2522        else:
2523            return insertDot(strg, mo)
2524    return insertNum(strg, mo, N)
2525       
2526
2527def replaceM(strg, mo, node, parent, tree):
2528    by = mo.group("by")
2529    maj = int(node.nodeClassifier.defaultValue)
2530    N = node.distribution[maj]
2531    if by:
2532        whom = byWhom(by, parent, tree)
2533        if whom and whom.distribution:
2534            if whom.distribution[maj] > 1e-30:
2535                N /= whom.distribution[maj]
2536        else:
2537            return insertDot(strg, mo)
2538    return insertNum(strg, mo, N)
2539       
2540
2541def replacem(strg, mo, node, parent, tree):
2542    by = mo.group("by")
2543    maj = int(node.nodeClassifier.defaultValue)
2544    if node.distribution.abs > 1e-30:
2545        N = node.distribution[maj] / node.distribution.abs
        if by:
            whom = byWhom(by, parent, tree)
            if whom and whom.distribution:
2548                byN = whom.distribution[maj] / whom.distribution.abs
2549                if byN > 1e-30:
2550                    N /= byN
2551            else:
2552                return insertDot(strg, mo)
2553    else:
2554        N = 0.
2555    return insertNum(strg, mo, N)
2556
2557
def replaceCdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    by, op, cls = mo.group("by", "op", "cls")
    N = node.distribution[cls]
    if op == "!=":
        N = node.distribution.abs - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls]
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def replacecdisc(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    op, by, cls = mo.group("op", "by", "cls")
    N = node.distribution[cls]
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
        if op == "!=":
            N = 1 - N
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            if whom.distribution[cls] > 1e-30:
                N /= whom.distribution[cls] / whom.distribution.abs
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


__opdict = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
            ">=": operator.ge, "=": operator.eq, "!=": operator.ne}
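
# Maps the textual "op" group of a match to a comparison callable,
# e.g. __opdict["<="](3.0, 5.0) returns True.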

def replaceCcont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceccont(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by, op, num = mo.group("by", "op", "num")
    op = __opdict[op]
    num = float(num)
    N = sum([x[1] for x in node.distribution.items() if op(x[0], num)], 0.)
    if node.distribution.abs > 1e-30:
        N /= node.distribution.abs
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in whom.distribution.items() if op(x[0], num)], 0.)
            if byN > 1e-30:
                N /= byN / whom.distribution.abs  # abs >= byN, so byN > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, N)


def extractInterval(mo, dist):
    out, lowin, lower, upper, upin = mo.group("out", "lowin", "lower", "upper", "upin")
    lower, upper = float(lower), float(upper)
    if out:
        lop = lowin == "(" and operator.le or operator.lt
        # the upper bound read operator.ge in both branches; for a closed
        # bracket only values strictly above the bound lie outside
        hop = upin == ")" and operator.ge or operator.gt
        return filter(lambda x: lop(x[0], lower) or hop(x[0], upper), dist.items())
    else:
        lop = lowin == "(" and operator.gt or operator.ge
        hop = upin == ")" and operator.lt or operator.le
        return filter(lambda x: lop(x[0], lower) and hop(x[0], upper), dist.items())


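# A small illustration of extractInterval, not from the original module; the
# pattern below is a stand-in providing the same named groups as the real
# interval regexps defined above:
#
#   >>> import re
#   >>> pat = re.compile(r"(?P<out>not )?(?P<lowin>[([])(?P<lower>\d+), ?"
#   ...                  r"(?P<upper>\d+)(?P<upin>[)\]])")
#   >>> extractInterval(pat.match("(2, 5]"), {1.0: 2., 3.0: 5., 6.0: 1.})
#   [(3.0, 5.0)]
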
def replaceCconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    by = mo.group("by")
    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replacecconti(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    N = sum([x[1] for x in extractInterval(mo, node.distribution)])
    ab = node.distribution.abs
    if ab > 1e-30:
        N /= ab

    by = mo.group("by")
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:
            byN = sum([x[1] for x in extractInterval(mo, whom.distribution)])
            if byN > 1e-30:
                N /= byN / whom.distribution.abs
        else:
            return insertDot(strg, mo)

    return insertNum(strg, mo, N)


def replaceD(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:  # checked only 'whom' before, unlike the other replacers
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N*mul) for N in dist]) + "]")


def replaced(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Discrete:
        return insertDot(strg, mo)

    fs, by, m100 = mo.group("fs", "by", "m100")
    dist = list(node.distribution)
    ab = node.distribution.abs
    if ab > 1e-30:
        dist = [d/ab for d in dist]
    if by:
        whom = byWhom(by, parent, tree)
        if whom and whom.distribution:  # checked only 'whom' before, unlike the other replacers
            for i, d in enumerate(whom.distribution):
                if d > 1e-30:
                    dist[i] /= d/whom.distribution.abs  # abs >= d, so d > 1e-30 implies abs > 1e-30
        else:
            return insertDot(strg, mo)
    mul = m100 and 100 or 1
    fs = fs or (m100 and ".0" or "5.3")
    return insertStr(strg, mo, "[" + ", ".join(["%%%sf" % fs % (N*mul) for N in dist]) + "]")


def replaceAE(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    AorE, bysub, by = mo.group("AorE", "bysub", "by")

    if AorE == "A":
        A = node.distribution.average()
    else:
        A = node.distribution.error()
    if by:
        whom = byWhom("b"+by, parent, tree)
        if whom and whom.distribution:  # checked only 'whom' before, unlike the other replacers
            if AorE == "A":
                avg = whom.distribution.average()
            else:
                avg = whom.distribution.error()
            if bysub == "b":
                if avg > 1e-30:
                    A /= avg
            else:
                A -= avg
        else:
            return insertDot(strg, mo)
    return insertNum(strg, mo, A)


Z = {0.75: 1.15, 0.80: 1.28, 0.85: 1.44, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}
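
# These are the usual two-sided standard-normal quantiles: for a confidence
# level c, the interval printed below is average +/- Z[c] * error,
# e.g. 1.96 for 95% confidence.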

def replaceI(strg, mo, node, parent, tree):
    if tree.classVar.varType != Orange.data.Type.Continuous:
        return insertDot(strg, mo)

    fs = mo.group("fs") or "5.3"
    intrvl = float(mo.group("intp") or mo.group("intv") or "95") / 100.
    mul = mo.group("m100") and 100 or 1

    if not Z.has_key(intrvl):
        # the original format string had a stray '%' that broke the raise
        raise SystemError, "Cannot compute %5.3f%% confidence intervals" % (intrvl*100)

    av = node.distribution.average()
    il = node.distribution.error() * Z[intrvl]
    return insertStr(strg, mo, "[%%%sf-%%%sf]" % (fs, fs) % ((av-il)*mul, (av+il)*mul))

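# For instance, with an average of 10.0 and an error of 1.0, a plain %I
# (95% confidence by default) substitutes "[8.040-11.960]", since
# 10.0 +/- 1.96 * 1.0 gives the interval bounds.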

# This class is more a collection of functions, merged into a class so that
# they do not need to pass around too many arguments. It is constructed,
# used and discarded; it is not meant to store any information.
class __TreeDumper:
    defaultStringFormats = [(re_V, replaceV), (re_N, replaceN),
                            (re_M, replaceM), (re_m, replacem),
                            (re_Cdisc, replaceCdisc), (re_cdisc, replacecdisc),
                            (re_Ccont, replaceCcont), (re_ccont, replaceccont),
                            (re_Cconti, replaceCconti), (re_cconti, replacecconti),
                            (re_D, replaceD), (re_d, replaced),
                            (re_AE, replaceAE), (re_I, replaceI)]

    def __init__(self, leafStr, nodeStr, stringFormats, minExamples,
                 maxDepth, simpleFirst, tree, **kw):
        self.stringFormats = stringFormats
        self.minExamples = minExamples
        self.maxDepth = maxDepth
        self.simpleFirst = simpleFirst
        self.tree = tree
        self.__dict__.update(kw)

        if leafStr:
            self.leafStr = leafStr
        else:
            if tree.classVar.varType == Orange.data.Type.Discrete:
                self.leafStr = "%V (%^.2m%)"
            else:
                self.leafStr = "%V"

        if nodeStr == ".":
            self.nodeStr = self.leafStr
        else:
            self.nodeStr = nodeStr

    def formatString(self, strg, node, parent):
        # the format may also be given as a callback instead of a string
        if hasattr(strg, "__call__"):
            return strg(node, parent, self.tree)

        if not node:
            return "<null node>"

        for rgx, replacer in self.stringFormats:
            if not node.distribution:
                strg = rgx.sub(".", strg)
            else:
                strt = 0
                while True:
                    mo = rgx.search(strg, strt)
                    if not mo:
                        break
                    strg = replacer(strg, mo, node, parent, self.tree)
                    # resume just past the match start, so a substitution
                    # that itself matches cannot loop forever
                    strt = mo.start() + 1

        return strg

    def showBranch(self, node, parent, lev, i):
        bdes = node.branchDescriptions[i]
        bdes = node.branchSelector.classVar.name + (bdes[0] not in "<=>" and "=" or "") + bdes
        if node.branches[i]:
            nodedes = self.nodeStr and ": " + self.formatString(self.nodeStr, node.branches[i], node) or ""
        else:
            nodedes = "<null node>"
        return "|    "*lev + bdes + nodedes

    def dumpTree0(self, node, parent, lev):
        if node.branches:
            if node.distribution.abs < self.minExamples or lev > self.maxDepth:
                return "|    "*lev + ". . .\n"

            res = ""
            if self.leafStr and self.nodeStr and self.leafStr != self.nodeStr:
                leafsep = "\n" + ("|    "*lev) + "    "
            else:
                leafsep = ""
            if self.simpleFirst:
                for i, branch in enumerate(node.branches):
                    if not branch or not branch.branches:
                        if self.leafStr == self.nodeStr:
                            res += "%s\n" % self.showBranch(node, parent, lev, i)
                        else:
                            res += "%s: %s\n" % (self.showBranch(node, parent, lev, i),
                                                 leafsep + self.formatString(self.leafStr, branch, node))
            for i, branch in enumerate(node.branches):
                if branch and branch.branches:
                    res += "%s\n%s" % (self.showBranch(node, parent, lev, i),
                                       self.dumpTree0(branch, node, lev+1))
                elif not self.simpleFirst:
                    if self.leafStr == self.nodeStr:
                        res += "%s\n" % self.showBranch(node, parent, lev, i)
                    else:
                        res += "%s: %s\n" % (self.showBranch(node, parent, lev, i),
                                             leafsep + self.formatString(self.leafStr, branch, node))
            return res
        else:
            return self.formatString(self.leafStr, node, parent)

    def dumpTree(self):
        if self.nodeStr:
            lev, res = 1, "root: %s\n" % self.formatString(self.nodeStr, self.tree.tree, None)
            self.maxDepth += 1
        else:
            lev, res = 0, ""
        return res + self.dumpTree0(self.tree.tree, None, lev)

    def dotTree0(self, node, parent, internalName):
        if node.branches:
            if node.distribution.abs < self.minExamples or len(internalName)-1 > self.maxDepth:
                self.fle.write('%s [ shape="plaintext" label="..." ]\n' % _quoteName(internalName))
                return

            label = node.branchSelector.classVar.name
            if self.nodeStr:
                label += "\\n" + self.formatString(self.nodeStr, node, parent)
            self.fle.write('%s [ shape=%s label="%s"]\n' % (_quoteName(internalName), self.nodeShape, label))

            for i, branch in enumerate(node.branches):
                if branch:
                    internalBranchName = internalName + chr(i+65)
                    self.fle.write('%s -> %s [ label="%s" ]\n' % (_quoteName(internalName), _quoteName(internalBranchName), node.branchDescriptions[i]))
                    self.dotTree0(branch, node, internalBranchName)

        else:
            # leaves must be written through _quoteName as well, or their
            # identifiers would not match the edges written above
            self.fle.write('%s [ shape=%s label="%s"]\n' % (_quoteName(internalName), self.leafShape, self.formatString(self.leafStr, node, parent)))

    def dotTree(self, internalName="n"):
        self.fle.write("digraph G {\n")
        self.dotTree0(self.tree.tree, None, internalName)
        self.fle.write("}\n")

def _quoteName(x):
    # base64 makes arbitrary node names safe to use as quoted dot identifiers
    return '"%s"' % (base64.b64encode(x))

def dumpTree(tree, leafStr="", nodeStr="", **argkw):
    """
    Return a string representation of a tree.

    :arg tree: The tree to dump to string.
    :type tree: :class:`TreeClassifier`
    :arg leafStr: The format string for printing the tree leaves. If
      left empty, "%V (%^.2m%)" will be used for classification trees
      and "%V" for regression trees.
    :type leafStr: string
    :arg nodeStr: The format string for printing out the internal nodes.
      If left empty (as it is by default), no data is printed out for
      internal nodes. If set to :samp:`"."`, the same string is
      used as for leaves.
    :type nodeStr: string
    :arg maxDepth: If set, it limits the depth to which the tree is
      printed out.
    :type maxDepth: integer
    :arg minExamples: If set, the subtrees with fewer than the given
      number of examples are not printed.
    :type minExamples: integer
    :arg simpleFirst: If True (default), the branches with a single
      node are printed before the branches with larger subtrees.
      If False, the branches are printed in order of appearance.
    :type simpleFirst: boolean
    :arg userFormats: A list of regular expressions and callback
      functions through which the user can print out other specific
      information in the nodes.
    """
    return __TreeDumper(leafStr, nodeStr,
        argkw.get("userFormats", []) + __TreeDumper.defaultStringFormats,
        argkw.get("minExamples", 0), argkw.get("maxDepth", 1e10),
        argkw.get("simpleFirst", True), tree).dumpTree()
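
# A brief usage sketch (hypothetical, not part of the original module),
# assuming one of the datasets bundled with Orange:
#
#   import Orange
#   data = Orange.data.Table("voting")
#   tree = Orange.classification.tree.TreeLearner(data)
#   print dumpTree(tree, leafStr="%V (%^.2m%)", nodeStr=".")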


def printTree(*a, **aa):
    """
    Print out the tree (call :func:`dumpTree` with the same
    arguments and print out the result).
    """
    print dumpTree(*a, **aa)

printTxt = printTree
""" An alias for :func:`printTree`. Left for compatibility. """


def printDot(tree, fileName, leafStr="", nodeStr="", leafShape="plaintext", nodeShape="plaintext", **argkw):
    """ Print the tree to a file in the format used by
    `GraphViz <http://www.research.att.com/sw/tools/graphviz>`_.
    Uses the same parameters as :func:`printTxt` defined above,
    plus two parameters that define the shapes used for internal
    nodes and leaves of the tree:

    :param leafShape: Shape of the outline around leaves of the tree.
        If "plaintext", no outline is used (default: "plaintext").
    :type leafShape: string
    :param nodeShape: Shape of the outline around internal nodes
        of the tree. If "plaintext", no outline is used (default: "plaintext").
    :type nodeShape: string

    Check `Polygon-based Nodes <http://www.graphviz.org/doc/info/shapes.html>`_
    for various outlines supported by GraphViz.
    """
    fle = type(fileName) == str and file(fileName, "wt") or fileName

    __TreeDumper(leafStr, nodeStr,
        argkw.get("userFormats", []) + __TreeDumper.defaultStringFormats,
        argkw.get("minExamples", 0), argkw.get("maxDepth", 1e10),
        argkw.get("simpleFirst", True), tree,
        leafShape=leafShape, nodeShape=nodeShape, fle=fle).dotTree()

dotTree = printDot
""" An alias for :func:`printDot`. Left for compatibility. """
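
# A brief usage sketch (hypothetical): write the tree to a .dot file, then
# render it with the GraphViz command-line tool:
#
#   printDot(tree, "tree.dot", nodeShape="ellipse", leafShape="box")
#   # afterwards, in a shell:  dot -Tpng tree.dot -o tree.png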