Ticket #1051 (closed wish: fixed)
Weight in vectors, meta attributes attached to examples
Reported by: janez | Owned by: janez
Description (last modified by janez)
In writing the documentation, I discovered not only bugs related to exampleIndices, but also an inconsistency in its use: do all views have it or can it be empty? I sometimes assumed the former and sometimes the latter. The former requires maintaining the vector even when it is not needed (we need it only in views that do not contain all examples from the referenced table, and only for meta values). The latter requires checking whether it is empty at every access to meta attributes, when copying examples and so forth.
ExampleIndices seem like a major source of future bugs and maintenance problems, and they blow up the code needed to manipulate examples.
To get rid of them, we would need to reimplement meta attributes. At the same time, it would make sense to separate the concept of a weight and meta attribute. This is also inspired by Anze's suggestions that example tables should have some "default" weights.
Weight is currently a meta attribute referenced by its id. This is annoying since it requires passing the id as an additional argument to practically everything. Besides, meta values are inefficient (see below).
The two advantages of the current implementation are that
1. an example can have multiple weights (probably used only in classification and regression trees)
2. they can be moved to ordinary attributes and back (has anybody ever used this?!)
Weights in Orange 3.0 could be stored as vector<double> TExampleTable::weights - each element of this vector would correspond to an example. If this vector is empty, examples have no weights; if it is not empty, its size would be guaranteed to match the number of examples to avoid having to check this (there would be no default weight of 1 for examples with higher indices). This would still require manipulating weights in parallel with pointers to examples, but it does not seem like as much of a headache as exampleIndices.
If examples are weighted, any method that can consider weights will consider them - without having to be told so.
Instead of multiple weights, we would use multiple tables that reference the old one and modify the weights. Classification trees already make reference tables, so the only known use of multiple weights is practically already covered. When constructing a reference table, the weights would be copied but not updated - the weights in the original and the reference data can change independently.
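A minimal sketch of the two points above, under heavy simplification. Table, weightedMean and makeReference are hypothetical names used only for illustration, not existing Orange code, and a single double stands in for each example. It shows a method that uses the weights whenever they are present, and a reference table that shares the examples but owns a copy of the weights.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for TExampleTable: one numeric value per example.
struct Table {
    std::shared_ptr<std::vector<double>> values; // examples, shared with references
    std::vector<double> weights;                  // empty, or one weight per example
};

// A weight-aware method: it uses the weights whenever the table has them,
// without the caller having to pass a weight id.
double weightedMean(const Table &t) {
    assert(t.weights.empty() || t.weights.size() == t.values->size());
    double sum = 0.0, total = 0.0;
    for (std::size_t i = 0; i < t.values->size(); ++i) {
        const double w = t.weights.empty() ? 1.0 : t.weights[i];
        sum += w * (*t.values)[i];
        total += w;
    }
    return total > 0.0 ? sum / total : 0.0;
}

// A reference table shares the examples but gets its own copy of the weights,
// so the two sets of weights can change independently afterwards.
Table makeReference(const Table &original) {
    return Table{original.values, original.weights};
}
```

Changing the weights of the table returned by makeReference leaves the original untouched, which is what a tree learner would rely on when it reweights examples in a subtree.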
This would change the way the weights are accessed: instead of TExample::getmeta/setmeta that we used in Orange <3, we will have getweight/setweight without arguments. If examples have no weights, getweight returns 1.
We would also have a method TExampleTable::setweights for assigning weights to all examples. The argument would be any iterable, so we could also use it to assign a numpy vector.
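The accessors could look roughly like the sketch below. It is an illustration under assumptions, not the proposed implementation: WeightedTable and hasWeights are hypothetical names, the accessors live on the table and take an example index (the ticket puts argument-free getweight/setweight on TExample), and filling the vector with 1 on the first setweight is one possible way to keep the "empty or full size" guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-in for TExampleTable; only weight handling is shown.
class WeightedTable {
public:
    explicit WeightedTable(std::size_t nExamples) : nExamples_(nExamples) {}

    // Unweighted tables report a weight of 1 for every example.
    double getweight(std::size_t i) const {
        checkIndex(i);
        return weights_.empty() ? 1.0 : weights_[i];
    }

    // The first assignment materializes the vector at full size, filled with 1,
    // so the vector is always either empty or has one weight per example.
    void setweight(std::size_t i, double w) {
        checkIndex(i);
        if (weights_.empty())
            weights_.assign(nExamples_, 1.0);
        weights_[i] = w;
    }

    // Accepts any iterable range of numbers (vector, list, C array, ...).
    template <class Iterable>
    void setweights(const Iterable &ws) {
        std::vector<double> tmp(std::begin(ws), std::end(ws));
        if (tmp.size() != nExamples_)
            throw std::invalid_argument("number of weights must match number of examples");
        weights_.swap(tmp);
    }

    bool hasWeights() const { return !weights_.empty(); }

private:
    void checkIndex(std::size_t i) const {
        if (i >= nExamples_)
            throw std::out_of_range("example index out of range");
    }

    std::size_t nExamples_;
    std::vector<double> weights_;
};
```

Rejecting a sequence of the wrong length in setweights keeps the size guarantee without any checks at read time.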
Another option would seem to be to allow the table to dedicate a special column to weights to avoid maintaining a separate vector of weights. This wouldn't work since it would be impossible to add weights after the table is constructed and, more importantly, to have references with different weights.
To get rid of exampleIndices (and also metaRoots and metaPools) each example would have a column in data that would contain the index into metaPool.
There would be only one, global meta pool. It would no longer be a vector but just a chunk of memory that is resized when it runs out of space (in what respect is this better than vector<TMetaValue>???). That is, we will have TMetaValue *metaPool_. The value from the meta column in data is used as an index, metaPool_[i].
This would also simplify the implementation of TMetaValue: we can define TMetaValue::operator new so that it allocates the memory in the pool. TMetaValue::operator delete would also remove the remainder of the list; removing a single value would need a separate function since it requires a pointer/reference to the pointer.
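A rough sketch of such a pool, with several assumptions that are not in the ticket: MetaValue is a hypothetical layout with a double standing in for the stored value, values of one example are chained through a next index, index 0 means "no meta values", and plain add/releaseList methods are used in place of operator new/delete to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical layout; a double stands in for the real stored value.
struct MetaValue {
    int id;            // meta attribute id
    double value;      // stand-in for the stored value
    std::size_t next;  // index of the next meta value of the same example, 0 = end
};

// One pool: a plain chunk of memory that is reallocated when it is full.
// Slot 0 is reserved so that index 0 can mean "no meta values" (assumption).
class MetaPool {
public:
    MetaPool() : pool_(nullptr), size_(1), capacity_(0), freeHead_(0) { grow(16); }
    ~MetaPool() { std::free(pool_); }
    MetaPool(const MetaPool &) = delete;
    MetaPool &operator=(const MetaPool &) = delete;

    // Prepends a value to the list starting at `head` and returns the new head,
    // which the caller stores back into the example's meta column.
    std::size_t add(std::size_t head, int id, double value) {
        std::size_t slot;
        if (freeHead_ != 0) {              // reuse a released slot
            slot = freeHead_;
            freeHead_ = pool_[slot].next;
        } else {
            if (size_ == capacity_) grow(capacity_ * 2);
            slot = size_++;
        }
        pool_[slot] = MetaValue{id, value, head};
        return slot;
    }

    // Walks one example's list; the returned pointer is valid only until the pool grows.
    const MetaValue *find(std::size_t head, int id) const {
        for (; head != 0; head = pool_[head].next)
            if (pool_[head].id == id) return &pool_[head];
        return nullptr;
    }

    // Releases the whole list of one example by moving its slots to the free list.
    void releaseList(std::size_t head) {
        while (head != 0) {
            const std::size_t next = pool_[head].next;
            pool_[head].next = freeHead_;
            freeHead_ = head;
            head = next;
        }
    }

private:
    void grow(std::size_t newCapacity) {
        void *p = std::realloc(pool_, newCapacity * sizeof(MetaValue));
        if (!p) throw std::bad_alloc();
        pool_ = static_cast<MetaValue *>(p);
        capacity_ = newCapacity;
    }

    MetaValue *pool_;
    std::size_t size_, capacity_, freeHead_;
};
```

Because examples store indices rather than pointers, the chunk can move during reallocation without touching the data columns; pointers into the pool must not be kept across an add.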
If TExampleTable references data in numpy, it will be allowed to have meta values only if it contains a dedicated column for indices into the meta pool. The index of this column would have to be passed to the constructor and would be initialized to 0. Any subsequent changes would crash Orange.
Numpy arrays will seldom have meta attributes anyway, so this is not a major complication.
Either the domain conversion should copy the weight or this should be done when examples are taken to/from the table. TExample probably needs a field double weight.