Ticket #1012 (closed bug: fixed)

Opened 2 years ago

Last modified 2 years ago

Update TDomain to support multi-target prediction

Reported by: matija
Owned by: matija
Milestone: 2.6
Component: library
Severity: major
Keywords:
Cc: matija, lanz, lan, anze
Blocking:
Blocked By:

Description

Orange needs to support multi-label and multi-class prediction. Currently, it is hacked together using attributes on attributes, which makes it error-prone to use.

A reasonable .tab format for multi-target prediction would be to have the 'class' keyword repeated on multiple columns in the third line.

When loading multi-target datasets (with multiple 'class' keywords), the Domain should have:

  • class_var = None
  • class_vars = [cvar1, cvar2, ...]
  • attributes should not contain cvar1, cvar2, ...

class_vars should be a list in any case, even when there is only one class variable and also when there are none (then class_vars = []).
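
The invariants listed above can be sketched in plain Python. This is only an illustration of the proposal, not Orange's implementation: `SketchDomain` and its constructor arguments are hypothetical names, and variables are represented by plain strings.

```python
class SketchDomain(object):
    """Hypothetical stand-in for a domain; not Orange's Domain class."""

    def __init__(self, attributes, class_var=None, class_vars=None):
        # class_vars is always a list: empty when there is no class at all,
        # and a one-element list when only a single class variable is given.
        if class_vars is None:
            class_vars = [] if class_var is None else [class_var]
        self.class_var = class_var
        self.class_vars = list(class_vars)
        # Class variables must never appear among the attributes.
        targets = set(self.class_vars)
        if class_var is not None:
            targets.add(class_var)
        self.attributes = [a for a in attributes if a not in targets]


# Multi-target domain: no single class_var, two class variables.
multi = SketchDomain(["x1", "x2", "y1", "y2"], class_vars=["y1", "y2"])
# Single-class domain: class_vars still is a list.
single = SketchDomain(["x1", "y"], class_var="y")
# Classless domain: class_vars is an empty list rather than None.
plain = SketchDomain(["x1", "x2"])
```

Here `multi.class_var` stays `None` while `multi.attributes` contains only `x1` and `x2`, which mirrors the three bullet points of the proposal.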

It should be possible to dynamically change the domain (i.e. reuse the same in-memory Table data with a domain that has different class_var and class_vars). Here are two use-cases:

  • When doing "binary relevance" multi-label classification, one wants to construct one (single-label) classifier per label. I would like to be able to do that without copying the dataset in the memory - I'd just set the Domain's class_var to one of the label variables and run a learner on it.
  • Often, multi-target datasets have no distinction between "independent" and "dependent" (class) variables (ask Lan Z. about it); the analyst decides which variables will be used and which predicted. It would be useful if one could construct a new domain on the same table, switching some variables between ("independent") attributes and ("dependent") class_vars.
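
A minimal sketch of the first use case, binary relevance, in plain Python rather than Orange. It assumes the rows are stored once and each per-label learner only reads a different target column; `train_majority` and `binary_relevance` are hypothetical helpers, and the majority-vote "learner" is just a placeholder for a real one.

```python
from collections import Counter


def train_majority(rows, target_index):
    """Placeholder learner: always predicts the most common target value."""
    counts = Counter(row[target_index] for row in rows)
    majority = counts.most_common(1)[0][0]
    return lambda row: majority


def binary_relevance(rows, label_indices):
    """Train one single-label model per label on the same list of rows."""
    # The rows are shared, not copied; only the target index changes.
    return dict((i, train_majority(rows, i)) for i in label_indices)


# Two features followed by two binary labels (columns 2 and 3).
rows = [
    (0.1, 0.7, 1, 0),
    (0.4, 0.2, 1, 1),
    (0.9, 0.5, 0, 0),
]
models = binary_relevance(rows, [2, 3])
predictions = dict((i, model(rows[0])) for i, model in models.items())
```

The point of the sketch is that choosing a different target is a matter of bookkeeping, which is what setting the Domain's class_var on an existing table would achieve.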

I'd like to merge Multi-Label Classification into trunk ASAP, and Multi-Class prediction is in active development, so the sooner this gets done, the less refactoring will be needed afterwards.

Change History

comment:1 Changed 2 years ago by lanz

The Instance class will probably have to be adapted as well.

A similar solution to the one above would be that get_class() returns None and a new method get_classes() is introduced which returns a list of classes. The corresponding set_classes() would take as input a list of values for the classes.
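
The accessor semantics suggested here could look roughly as follows. This is a sketch under assumptions: `SketchInstance` is a hypothetical class, not Orange's Instance, and class values are plain Python objects.

```python
class SketchInstance(object):
    """Hypothetical instance with several class values; not Orange's Instance."""

    def __init__(self, n_classes):
        self._classes = [None] * n_classes

    def get_class(self):
        # With multiple targets there is no single class value to return.
        return None

    def get_classes(self):
        # Return a copy so callers cannot modify the stored values directly.
        return list(self._classes)

    def set_classes(self, values):
        # Exactly one value per class variable is expected.
        if len(values) != len(self._classes):
            raise ValueError("expected %d class values" % len(self._classes))
        self._classes = list(values)


inst = SketchInstance(2)
inst.set_classes(["yes", "no"])
```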

comment:2 follow-up: ↓ 3 Changed 2 years ago by janez

Done - mostly.

It would be great if somebody could seriously run regression tests on this. Normal data sets should not feel the change, but I could have messed something up.

The Domain's attribute is 'classes', not 'class_vars' - for safety.

In a tab-delimited file, the multiclass attributes are marked with 'multiclass' in the third row, not with 'class'. This is to avoid accidentally giving multiple classes. A dataset can also have both a normal class and multiclasses.
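
As an illustration of the third-row flags, the sketch below groups column names by role. It is not Orange's loader: `split_roles` and the example header are hypothetical, and any flag other than 'class' or 'multiclass' is simply treated as an ordinary attribute.

```python
def split_roles(header_lines):
    """Group column names by the flag found in the third header row."""
    names = header_lines[0].split("\t")
    flags = header_lines[2].split("\t")
    roles = {"attributes": [], "class_var": None, "class_vars": []}
    for name, flag in zip(names, flags):
        flag = flag.strip()
        if flag == "class":
            roles["class_var"] = name
        elif flag == "multiclass":
            roles["class_vars"].append(name)
        else:
            # Simplification: every other flag counts as a plain attribute.
            roles["attributes"].append(name)
    return roles


# Hypothetical header: names, types, then flags in the third row.
header = [
    "tempo\tloudness\tgenre\thappy\tsad",
    "c\tc\td\td\td",
    "\t\tclass\tmulticlass\tmulticlass",
]
roles = split_roles(header)
```

In this example one column carries the normal class and two carry multiclasses, which is the mixed case mentioned above.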

Example has methods get_classes and set_classes; besides, the new classes can be accessed by indexing using the name or Variable (but not int).

It is not possible to change the class in place. Orange <3 does not support it (I don't think we have anything like this anywhere). I can however add a "set_as_class" method that would set one of the extra classes as the ordinary class, but it would be quite ugly. Do we really really need this?

It is also not possible to construct a multiclass domain in a script, we can only read it. Do we need it - like urgently?

comment:3 in reply to: ↑ 2 ; follow-ups: ↓ 4 ↓ 5 Changed 2 years ago by matija

Replying to janez:

It would be great if somebody could seriously run regression tests on this. Normal data sets should not feel the change, but I could have messed something up.

I ran the orange25 and orange regression tests and none of the failures seem to be related to those changes. (There's a *** glibc detected *** python: double free or corruption (fasttop): 0x000000000102b180 *** in assoc-agrawal2.py, but it also happens to Miha, who has yesterday's Orange.)

The Domain's attribute is 'classes', not 'class_vars' - for safety.

You mean to avoid confusion? Seems pretty unintuitive (and thus error-prone) to me, but if you insist, I won't press you. :)

In a tab-delimited file, the multiclass attributes are marked with 'multiclass' in the third row, not with 'class'. This is to avoid accidentally giving multiple classes. A dataset can also have both a normal class and multiclasses.

OK.

Example has methods get_classes and set_classes; besides, the new classes can be accessed by indexing using the name or Variable (but not int).

It is not possible to change the class in place. Orange <3 does not support it (I don't think we have anything like this anywhere). I can however add a "set_as_class" method that would set one of the extra classes as the ordinary class, but it would be quite ugly. Do we really really need this?

Is it possible to construct a new domain for existing data (without copying it in-memory), and ignoring some of the variables (i.e., using one of the original classes as a class in the new domain and ignore other classes)? If yes, then we don't. If not ... it'd be great. :)

It is also not possible to construct a multiclass domain in a script, we can only read it. Do we need it - like urgently?

I think Lan Ž. would be quite thankful if he could dynamically choose which variables are attributes and which are (multi)classes. That probably involves creating a new domain over existing data, if that's at all possible, so I guess we need this.

comment:4 in reply to: ↑ 3 ; follow-up: ↓ 6 Changed 2 years ago by matija

Replying to janez:

It would be great if somebody could seriously run regression tests on this. Normal data sets should not feel the change, but I could have messed something up.

I ran the orange25 and orange regression tests and none of the failures seem to be related to those changes. (There's a *** glibc detected *** python: double free or corruption (fasttop): 0x000000000102b180 *** in assoc-agrawal2.py, but it also happens to Miha, who has yesterday's Orange.)

Miha *did* accidentally update Orange before testing this out; Lan doesn't get this crash with an older Orange. Janez, can you check?

comment:5 in reply to: ↑ 3 ; follow-up: ↓ 7 Changed 2 years ago by janez

Replying to matija:

The Domain's attribute is 'classes', not 'class_vars' - for safety.

You mean to avoid confusion? Seems pretty unintuitive (and thus error-prone) to me, but if you insist, I won't press you. :)

You're right, I changed it to class_vars.

Is it possible to construct a new domain for existing data (without copying it in-memory), and ignoring some of the variables (i.e., using one of the original classes as a class in the new domain and ignore other classes)? If yes, then we don't. If not ... it'd be great. :)

You can now change the table in place. This will not be documented and will probably change in Orange 3.0 since it is much easier to implement it there. For now, it goes like this: ExampleTable can have an ordinary class and multiclasses (this will be the same in 3.0). Method pickClass(var), where var is either a Variable or a string with a name, swaps the current ordinary class and the chosen multiclass. If there is no current class, the chosen multiclass simply becomes the class. pickClass(None) puts the current class into the list of multi classes.

Now, since all this happens in place you can only do it if the table is not referenced by any other table and so forth. Orange 2.5 has no control over such references, so this cannot be made safe. In Orange 3.0 it won't be a problem.
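
The pickClass behaviour described above can be modelled in a few lines of plain Python. This is a sketch of the bookkeeping only, not of the C++ implementation: `SketchTable` and `pick_class` are hypothetical names, variables are plain strings, and appending the former class to the end of the list is an assumption, since the position is not specified here.

```python
class SketchTable(object):
    """Hypothetical model of the class bookkeeping only; holds no data."""

    def __init__(self, class_var=None, class_vars=()):
        self.class_var = class_var
        self.class_vars = list(class_vars)

    def pick_class(self, var):
        if var is None:
            # Move the current class, if any, into the multiclass list.
            if self.class_var is not None:
                self.class_vars.append(self.class_var)
                self.class_var = None
            return
        index = self.class_vars.index(var)  # ValueError if var is unknown
        if self.class_var is None:
            # No ordinary class yet: the chosen multiclass becomes the class.
            self.class_var = self.class_vars.pop(index)
        else:
            # Swap the ordinary class with the chosen multiclass.
            self.class_vars[index], self.class_var = self.class_var, var


table = SketchTable(class_vars=["happy", "sad"])
table.pick_class("sad")      # "sad" becomes the ordinary class
table.pick_class("happy")    # "happy" and "sad" swap roles
after_swap = (table.class_var, list(table.class_vars))
table.pick_class(None)       # the class goes back among the multiclasses
```

Because the model mutates its own state, every other holder of the same object sees the change, which is the reference problem mentioned above.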

Finally, you can construct a Domain with multiclasses by adding a keyword argument class_vars=list-of-variables when calling a constructor.

Converting examples between domains that contain multi classes does not work yet. Pickling also fails, I guess. Please confirm that the above is OK and I'll implement it.

comment:6 in reply to: ↑ 4 Changed 2 years ago by janez

I ran the orange25 and orange regression tests and none of the failures seem to be related to those changes. (There's a *** glibc detected *** python: double free or corruption (fasttop): 0x000000000102b180 *** in assoc-agrawal2.py, but it also happens to Miha, who has yesterday's Orange.)

Miha *did* accidentally update Orange before testing this out; Lan doesn't get this crash with an older Orange. Janez, can you check?

Works for me. I simplified a few things before I tried it, I may have fixed the bug.

Miha, try again, please (together with all other regression tests since I have changed the Domain's constructor...)

comment:7 in reply to: ↑ 5 ; follow-up: ↓ 8 Changed 2 years ago by matija

Replying to janez:

pickClass(None) puts the current class into the list of multi classes.

I thought a variable can be both a class and a multiclass at once. Seems more intuitive to me, but doesn't really matter in practice, I guess.

Now, since all this happens in place you can only do it if the table is not referenced by any other table and so forth. Orange 2.5 has no control over such references, so this cannot be made safe. In Orange 3.0 it won't be a problem.

So I cannot construct a kNN classifier on the same table, I still have to copy it? Well, that's OK; in 3.0 we'll change it so it won't do the copying ...

Finally, you can construct a Domain with multiclasses by adding a keyword argument class_vars=list-of-variables when calling a constructor.

Great.

Converting examples between domains that contain multi classes does not work yet. Pickling also fails, I guess.

Okay for the time being.

...

Works for me. I simplified a few things before I tried it, I may have fixed the bug. Miha, try again, please /.../

I'd like to test it first, but I cannot compile the current SVN version with G++:

$ g++ -fPIC -fpermissive -w -DLINUX -DORANGE_EXPORTS -O3 -ggdb -Iliblinear  -c table.cpp -o obj/table.o
table.cpp: In member function ‘virtual void TExampleTable::pickClass(PVariable)’:
table.cpp:720:70: error: cannot pass objects of non-trivially-copyable type ‘std::string {aka struct std::basic_string<char>}’ through ‘...’

comment:8 in reply to: ↑ 7 ; follow-up: ↓ 9 Changed 2 years ago by janez

I thought a variable can be both a class and a multiclass at once. Seems more intuitive to me, but doesn't really matter in practice, I guess.

Can't be done in place in Orange 2.5.

So I cannot construct a kNN classifier on the same table, I still have to copy it? Well, that's OK; in 3.0 we'll change it so it won't do the copying ...

You're right in principle, except that kNN would not be a problem since it makes an internal copy of the table.

$ g++ -fPIC -fpermissive -w -DLINUX -DORANGE_EXPORTS -O3 -ggdb -Iliblinear  -c table.cpp -o obj/table.o
table.cpp: In member function ‘virtual void TExampleTable::pickClass(PVariable)’:
table.cpp:720:70: error: cannot pass objects of non-trivially-copyable type ‘std::string {aka struct std::basic_string<char>}’ through ‘...’

Fixed, I guess. Update and try again.

comment:9 in reply to: ↑ 8 Changed 2 years ago by matija

Replying to janez:

$ g++ -fPIC -fpermissive -w -DLINUX -DORANGE_EXPORTS -O3 -ggdb -Iliblinear  -c table.cpp -o obj/table.o
table.cpp: In member function ‘virtual void TExampleTable::pickClass(PVariable)’:
table.cpp:720:70: error: cannot pass objects of non-trivially-copyable type ‘std::string {aka struct std::basic_string<char>}’ through ‘...’

Fixed, I guess. Update and try again.

Seems to compile now, yes. However, the heap corruption still occurs with assoc-agrawal2.py; I might take a shot at debugging it in a few days ...

comment:10 Changed 2 years ago by janez

Added pickling and conversion of domains. Features, classes, multiple classes and so on can be moved around by constructing new domains and converting, as usual.

I guess this is it. Please check the heap corruption on Linux and fix it (sorry, I cannot debug this) or open a new ticket if this is unrelated.

comment:11 Changed 2 years ago by matija

  • Status changed from new to closed
  • Resolution set to fixed

Seems like the memory corruption also happens with rev12602 (just before your commits for this ticket), so the problem is unrelated to this task; I'll open a separate ticket tomorrow, CCing you.

What you said about the state of multi-target prediction support sounds good to me, so I'm closing this ticket. In case I find anything missing or wrong, I'll reopen it.

Thanks, Janez!

comment:12 Changed 2 years ago by lanz

  • Status changed from closed to reopened
  • Type changed from task to bug
  • Resolution fixed deleted

Thank you Janez for the new multiclass support.

I am starting to use it and have found the following bugs:

(EDIT: In comment 5 you warned that this did not work yet, but in comment 10 you said that it should now... I hope I have not misunderstood something. I updated Orange to [12614] for testing.)

1.) Making a new instance does not copy the values of multiclasses correctly.

>>> data = Orange.data.Table('emotions')
>>> print data[0].get_classes()
[<orange.Value 'amazed-suprised'='0'>, ...
>>> print Orange.data.Instance(data.domain, data[0]).get_classes()
[<orange.Value 'amazed-suprised'='.'>, ...

2.) Pickling does not work.

>>> s = pickle.dumps(data)
>>> data2 = pickle.loads(s)
Segmentation fault

3.) Maybe this is not a bug, but printing a multiclass instance only prints the attributes and not the classes. I don't know if it is intentional (if there is a reason for this, then OK), but printing the classes at the end would be more consistent with the standard (single class) behaviour.

>>> print data[0]

comment:13 Changed 2 years ago by janez

I added printing and fixed copying. As for pickling, I forgot to commit the latest changes. Please try now and close the ticket if it's OK.

comment:14 Changed 2 years ago by lanz

  • Status changed from reopened to closed
  • Resolution set to fixed

Great, everything seems to be fine now.

comment:15 Changed 2 years ago by lanz

In [12622]:

Removed a workaround for a bug that was fixed (#1012).

comment:16 Changed 2 years ago by matija

I've added support for writing multi-class datasets to tab-delimited files, see rev 12647. I hope I didn't break anything. :)

comment:17 Changed 2 years ago by lanz

  • Status changed from closed to reopened
  • Type changed from bug to task
  • Resolution fixed deleted
  • Milestone changed from future to current

One more thing to be checked (or implemented) related to this ticket - method to_numpy() should work with multi-target data sets.

Also, multi-target data set specifics should be added to the documentation.

comment:18 Changed 2 years ago by janez

  • Status changed from reopened to closed
  • Resolution set to fixed

comment:19 Changed 2 years ago by lanz

  • Status changed from closed to reopened
  • Type changed from task to bug
  • Resolution fixed deleted

Method to_numpy() does not work for multi-target data sets.

To reproduce:

>>> import Orange
>>> data = Orange.data.Table('emotions')
>>> data.to_numpy()
XXX lineno: 1, opcode: 0
Segmentation fault

comment:20 Changed 2 years ago by janez

Can't reproduce. I tried on Windows (Python 2.6) and Ubuntu 11.04, Python 2.7.2. Has somebody fixed this and forgotten to close the ticket?

comment:21 Changed 2 years ago by matija

  • Owner changed from janez to matija
  • Status changed from reopened to assigned

I can still reproduce this on Ubuntu 11.04 (with Orange updated and recompiled a few hours ago).

matija@pi:~$ python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import Orange
>>> emotions = Orange.data.Table('emotions')
>>> emotions.to_numpy()
XXX lineno: 1, opcode: 0
Segmentation fault

I'll see if I can debug it tomorrow ...

comment:22 Changed 2 years ago by matija

Addendum:

>>> emotions.to_numpy("a")
(array([[ 0.034741  ,  0.089665  ,  0.091225  , ...,  0.24545699,
         0.105065  ,  0.40539899],
       [ 0.081374  ,  0.27274701,  0.085733  , ...,  0.34354699,
         0.276366  ,  0.71092403],
       [ 0.110545  ,  0.27356699,  0.08441   , ...,  0.188693  ,
         0.045941  ,  0.45737201],
       ..., 
       [ 0.035169  ,  0.065403  ,  0.075227  , ...,  0.184313  ,
         0.247136  ,  0.47699299],
       [ 0.054276  ,  0.238158  ,  0.095935  , ...,  0.547126  ,
         0.183494  ,  1.25582004],
       [ 0.073194  ,  0.140733  ,  0.080545  , ...,  0.087328  ,
         0.23681501,  0.45170099]]),)
>>> emotions.to_numpy("c")
Segmentation fault

(And I've tried it with a classless dataset and it works, so the multi-target dataset really is the issue.)

comment:23 Changed 2 years ago by Matija Polajnar <matija.polajnar@…>

  • Status changed from assigned to closed
  • Resolution set to fixed

In [b5d5ca20c46696452244c284a74200ef18e1cb83/orange]:

Bugfix: on some systems and compilers, to_numpy() crashed on multi-target datasets. Hopefully finally closes #1012.
