Ticket #1329 (closed bug: invalid)
C4.5 reader mistakes continuous attributes as discrete
Reported by: bricklemacho | Owned by:
Description (last modified by bricklemacho)
When importing data from C4.5 files, the order of the data instances influences the automatic discretization process.
I have attached a minimal set of files that allows the error to be reproduced. The two files contain the same data: in one the instances are ordered by class, in the other their order is random. Here are the steps I follow using visual Orange:
1. Start Orange
2. Select the File widget
3. Open the File widget and select the C4.5 data file: "bycalss.data"
4. Select the Parallel Coordinates widget
5. Connect the File widget to the Parallel Coordinates widget
6. Open Parallel Coordinates and notice the class "banding"
7. Select the "Select Attributes" widget
8. Connect the File widget to Select Attributes
9. Observe that the attributes are "discrete"
10. Exit/close Orange
Repeat steps 1-10, but this time open "random.data" and observe the rainbow pattern and the attributes shown as discrete (see note 2 below).
If you need more information, let me know.
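To confirm outside Orange that the two attached files really hold the same instances and differ only in row order, a quick check along these lines could be used. This is only a sketch: the helper name `same_instances` is made up for illustration, and it assumes one instance per line with identical formatting in both files.

```python
from collections import Counter


def same_instances(path_a, path_b):
    """Return True if both files contain the same data lines, ignoring order."""

    def load(path):
        with open(path) as handle:
            # Strip surrounding whitespace and skip blank lines so that
            # trailing newlines do not affect the comparison.
            return Counter(line.strip() for line in handle if line.strip())

    # Counter equality compares the lines as a multiset, so duplicated
    # instances must occur the same number of times in both files.
    return load(path_a) == load(path_b)
```

Calling it with the paths of the by-class file and the random-order file should return True if only the ordering differs.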
Things to note:
1. The attributes are marked as continuous in the .names file.
2. The Orange-Canvas seems to remember/cache information. In duplicating this bug I could not open both files in the same schema and see different results. This actually made it a little harder to diagnose what was happening, as I would make changes but not see their true effect. Sometimes it even remembered the last file opened, so I had to open/close twice (perhaps this is another bug).
3. The "ordering" somehow encodes the class. In my real data file the number of data instances per class was uneven and the discretization process still banded by class. Obviously this affects classifiers later in the workflow schema.
4. I don't know the default method of discretization, so the bug described here may just be a side effect of the default method (if so, perhaps the documentation should carry a warning; if it already does, sorry, I didn't see it).
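The first note, that the attributes are declared continuous in the .names file, can be double-checked independently of Orange with a few lines of Python. The sketch below is not Orange's reader. It assumes the usual C4.5 layout (a first entry listing the class values, then one `name: specification.` entry per line, with `|` starting a comment), and the sample attribute names are invented for illustration.

```python
def declared_continuous(names_text):
    """Map each attribute in a C4.5 .names file to True if declared continuous."""
    entries = []
    for raw in names_text.splitlines():
        # Drop anything after '|', which starts a comment in C4.5 files.
        line = raw.split("|", 1)[0].strip()
        if line:
            entries.append(line.rstrip("."))

    result = {}
    # The first entry lists the class values, so attributes start at index 1.
    for entry in entries[1:]:
        name, _, spec = entry.partition(":")
        result[name.strip()] = spec.strip() == "continuous"
    return result


# Hypothetical .names content, not the file attached to this ticket.
sample = """\
| made-up example
a, b.

width: continuous.
height: continuous.
colour: red, green.
"""

print(declared_continuous(sample))
```

Any attribute that maps to True here but shows up as discrete in Select Attributes would point at the import step rather than at the .names file.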
- Description modified
- Summary changed from "Incorrect discretization of data instances during import of data from C4.5 files." to "C4.5 reader mistakes continuous attributes as discrete"