Ticket #1329 (closed bug: invalid)

Opened 7 months ago

Last modified 7 months ago

C4.5 reader mistakes continuous attributes as discrete

Reported by: bricklemacho
Owned by:
Milestone:
Component: library
Severity: minor
Keywords: C4.5, Discretization
Cc:
Blocking:
Blocked By:

Description (last modified by bricklemacho)

When importing data from C4.5 files, the order of the data instances influences the automatic discretization process.

I have attached a minimal set of files that allows the error to be reproduced. The two files contain the same data: in one the data instances are ordered by class, in the other their order is random. Here are the steps I follow using visual Orange:

  1. Start Orange
  2. Select File Widget
  3. Open file widget, select C4.5 data file: "byclass.data"
  4. Select Parallel Coordinates plot
  5. Connect File Widget to Parallel Coordinates Widget
  6. Open Parallel Coordinates and notice the class "banding"
  7. Select "Select Attributes" widget
  8. Connect File widget to Select Attributes
  9. Observe attributes are "discrete"
  10. Exit/Close Orange

Repeat steps 1-10, but this time open "random.data" and observe the rainbow and attributes as discrete. (See note 2 below)

If you need more information, let me know.

Things to note:

  1. The attributes are marked as continuous in .names file
  2. The Orange-Canvas seems to remember/cache information. In duplicating this bug I could not open both files in the same schema and see different results. This actually made it a little harder to diagnose what was happening, as I would make changes but not see the true effect. Sometimes it even remembered the last file opened, so I had to open/close twice (perhaps this is another bug).
  3. The "ordering" somehow encodes the class. In my real data file the number of data instances per class was uneven and the discretization process still banded by class. Obviously this affects classifiers later in the workflow schema.
  4. I don't know the default method of discretization, so the bug described here may just be a side effect of the default method (if so, perhaps a warning could be added to the documentation; if it is already there, sorry, I didn't see it).
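
Regarding note 3, here is a minimal sketch of one way the file order could end up encoding the class. It rests on an assumption that is not confirmed anywhere in this ticket: that an attribute read as discrete has its values numbered in order of first appearance in the file. The rows and the helper name codes_per_class are hypothetical and are not taken from the attached files.

```python
def codes_per_class(rows):
    """Number each distinct attribute value in order of first appearance
    (assumed behaviour, see above) and collect the codes seen per class."""
    codes = {}
    bands = {}
    for value, cls in rows:
        code = codes.setdefault(value, len(codes))
        bands.setdefault(cls, []).append(code)
    return bands


# Hypothetical rows: (attribute value as text, class label).
by_class = [("5.1", "a"), ("4.9", "a"), ("6.4", "b"), ("6.9", "b"), ("7.1", "c")]
shuffled = [by_class[i] for i in (3, 0, 4, 1, 2)]

print(codes_per_class(by_class))   # {'a': [0, 1], 'b': [2, 3], 'c': [4]}
print(codes_per_class(shuffled))   # {'b': [0, 4], 'a': [1, 3], 'c': [2]}
```

Under that assumption, a file sorted by class gives each class its own contiguous range of codes, which would look like banding, while a shuffled file interleaves the codes across classes.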

Attachments

byclass.names (128 bytes) - added by bricklemacho 7 months ago.
byclass.data (6.2 KB) - added by bricklemacho 7 months ago.
random.names (128 bytes) - added by bricklemacho 7 months ago.
random.data (6.2 KB) - added by bricklemacho 7 months ago.

Change History

Changed 7 months ago by bricklemacho

comment:1 Changed 7 months ago by bricklemacho

  • Description modified (diff)

comment:2 Changed 7 months ago by bricklemacho

  • Description modified (diff)
  • Summary changed from Incorrect discretization of data instances during import of data from C4.5 files. to C4.5 reader mistakes continuous attributes as discrete

comment:3 Changed 7 months ago by ales

  • Status changed from new to closed
  • Resolution set to invalid

Note the incorrect spelling of "continious" (should be "continuous") in the .names file.
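
A misspelling like this can be caught before loading the data. The following is a minimal sketch, not part of Orange: it assumes the usual C4.5 .names layout (class values first, then one "name: type." entry per attribute, with "|" starting a comment), and the function name suspicious_attribute_types, the 0.8 similarity cutoff, and the sample text are all hypothetical.

```python
import difflib

# Keywords this sketch recognises as attribute types.
KEYWORDS = ("continuous", "ignore")


def suspicious_attribute_types(names_text):
    """Return (attribute, declared type, suggestion) for each attribute whose
    single-word type looks like a misspelled keyword."""
    entries = []
    for raw in names_text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop comments and blank lines
        if line:
            entries.append(line)

    findings = []
    # The first entry lists the class values; the rest declare attributes.
    for line in entries[1:]:
        if ":" not in line:
            continue
        name, spec = line.split(":", 1)
        spec = spec.strip().rstrip(".").strip()
        # Value lists contain commas; exact keywords and "discrete N" are fine.
        if "," in spec or spec in KEYWORDS or spec.startswith("discrete"):
            continue
        close = difflib.get_close_matches(spec.lower(), KEYWORDS, n=1, cutoff=0.8)
        if close:
            findings.append((name.strip(), spec, close[0]))
    return findings


# Hypothetical .names content, not the attached file.
sample = """\
| example only
a, b, c.

x1: continious.
x2: continuous.
x3: red, green.
"""

print(suspicious_attribute_types(sample))  # [('x1', 'continious', 'continuous')]
```

A reader that does not recognise the keyword may fall back to treating the word as the attribute's only discrete value, so flagging near-misses up front is cheaper than diagnosing the symptoms in the canvas.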
