Orange Forum • View topic - Duplicate discretize on import using Discretize Widget.

Duplicate discretize on import using Discretize Widget.

Postby bricklemacho » Sat Sep 28, 2013 9:17

I have the following two workflows using the visual interface.

In one, the feature is imported as a discrete variable, connected to the PCA widget, then to Test Learners, and the classifier performs well.

In the other, the feature is imported as continuous, connected to the Discretize widget, then to Test Learners, and the classifier performs poorly. I have tried the Entropy-MDL, Equal Frequency, and Equal Width methods; none of them brings the classifier up to the performance of the first workflow.
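To make the difference between two of the methods named above concrete, here is a minimal sketch of equal-width and equal-frequency binning in plain Python. It is not Orange's implementation; the function names and the sample values are made up for illustration, and Entropy-MDL is not covered because it also needs the class labels.

```python
import bisect


def equal_width_cuts(values, n_bins):
    """Cut points that split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_cuts(values, n_bins):
    """Cut points that put roughly the same number of values into each bin."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, n_bins):
        idx = i * n // n_bins
        # Place the cut halfway between the two neighbouring values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def discretize(values, cuts):
    """Map each value to the index of the bin it falls into."""
    return [bisect.bisect_right(cuts, v) for v in values]


# Hypothetical feature with one outlier.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 90]
width_cuts = equal_width_cuts(values, 3)
freq_cuts = equal_frequency_cuts(values, 3)
width_bins = discretize(values, width_cuts)
freq_bins = discretize(values, freq_cuts)
```

With an outlier like this, equal width puts almost every value into the first bin, while equal frequency spreads the values evenly, so the choice of method can change what the learner sees quite a lot.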

Is it possible to duplicate the discretization process used on import via the Discretize Widget?

As the improvement is significant, I need to explain the process in a paper I am writing, and preferably duplicate it (ideally via the visual interface) for my supervisors. I have read the Fayyad & Irani paper referenced in your documentation.

Michael.
--

Re: Duplicate discretize on import using Discretize Widget.

Postby bricklemacho » Sun Sep 29, 2013 12:35

Any pointers to where in the source I can find the "automatic" discretization? I have looked at TC45ExampleGenerator::readExample() and TC45ExampleGenerator::readDomain() in c45inter.cpp but can't see where, if at all, discretisation is taking place.

The discretisation of the feature only seems to happen when I use the C4.5 file format. The reason I think it is automatic is that the attribute is marked as a continuous variable in the c45.names file, but when I look at it in the "Select Attributes" widget it appears as a discrete variable. So I am getting more confused.

If no automatic discretisation is taking place, can anyone give me some insight into what is happening?

Michael.
--

Quick follow-up confirming that automatic discretization is occurring. After importing from C4.5 format and saving in the native Orange .tab format, the attribute appears to be discrete, with the cut-offs(?) in the header line(s) of the .tab file. I will keep looking at and tracing the source code, but my C++ is not strong.

Re: Duplicate discretize on import using Discretize Widget.

Postby Ales » Mon Sep 30, 2013 14:52

There should not be any 'discretization' when loading the data. If the c4.5 reader mistakes a continuous attribute for a discrete one, then this is a bug (for instance, saving the iris dataset to c4.5 format and loading it back stores and loads the continuous attributes correctly).

Can you please post a minimal example of a ".data" and ".names" file that reproduces the error?

Re: Duplicate discretize on import using Discretize Widget.

Postby bricklemacho » Mon Sep 30, 2013 15:33

I have logged a bug, see ticket #1329. The bug report includes minimal ".data" and ".names" files. When I logged the bug I thought it was to do with the discretization process, so the title is probably incorrect.

When I read the "Discretization" page in the Orange documentation, I saw that another file format can discretize data, which is why I assumed it was happening automatically here, even though the attribute was not obviously discrete.

Re: Duplicate discretize on import using Discretize Widget.

Postby bricklemacho » Tue Oct 01, 2013 2:21

My error. I had a typo in the .names file. Thanks for your time.
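For anyone who hits the same symptom: the thread does not say what the typo was, but the sketch below shows one way a misspelling in a ".names" file could turn a continuous attribute into a discrete one. It is a simplified parser for the usual C4.5 layout (class values on the first line, then "name: continuous." or "name: v1, v2."), not Orange's reader. The function names, the sample file, and the "continous" misspelling are all hypothetical.

```python
import difflib


def parse_names(text):
    """Parse a simplified C4.5 .names file into (classes, attributes)."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # '|' starts a comment
        if line:
            entries.append(line.rstrip("."))  # entries end with a period
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = {}
    for entry in entries[1:]:
        name, _, spec = entry.partition(":")
        spec = spec.strip()
        if spec == "continuous":
            attributes[name.strip()] = "continuous"
        else:
            # Anything else is read as a list of discrete values.
            attributes[name.strip()] = [v.strip() for v in spec.split(",")]
    return classes, attributes


def suspicious_attributes(attributes):
    """Names of discrete attributes whose only value looks like 'continuous'."""
    flagged = []
    for name, spec in attributes.items():
        if isinstance(spec, list) and len(spec) == 1:
            if difflib.get_close_matches(spec[0], ["continuous"]):
                flagged.append(name)
    return flagged


# Hypothetical .names content with a misspelled keyword on 'width'.
names_text = """yes, no.
width: continous.
colour: red, green, blue.
height: continuous.
"""
classes, attributes = parse_names(names_text)
flagged = suspicious_attributes(attributes)
```

Here `width` is read as a discrete attribute with the single value "continous", which matches the symptom described earlier in the thread; a check like this before loading would flag it.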
