Ticket #924 (new bug)

Opened 3 years ago

Last modified 3 years ago

Different data sets can have the same checksum

Reported by: Noughmad
Owned by: janez
Milestone: Future
Component: library
Severity: minor
Keywords:
Cc:
Blocking:
Blocked By:

Description

The documentation data sets 'zoo.tab' and 'breast-cancer-wisconsin-cont.tab' have the same checksum, even though they contain completely different data.

In visualization widgets (for example OWScatterplot) and in the 'preprocess.scaling' module, many checks test whether the new data differs from the old and perform an update only in that case. However, all of those checks compare the checksums instead of the actual data sets. So when switching from 'zoo' to 'breast-cancer-wisconsin-cont', the plot data is not updated.

The solution would be either to ensure that checksums are unique, or to replace all tests like 'old_data.checksum() == new_data.checksum()' with 'old_data.checksum() == new_data.checksum() and old_data == new_data'. It would probably also be possible to override the equality operator of ExampleTable so that it first compares the checksums and then compares the data only if the checksums are the same.
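A minimal sketch of the "checksum first, then data" comparison, assuming a hypothetical stand-in class (TableSketch is not ExampleTable, and its checksum is not Orange's implementation). A subclass with a constant checksum simulates the collision described above:

```python
import zlib


class TableSketch:
    """Hypothetical stand-in for ExampleTable; not the Orange API."""

    def __init__(self, rows):
        self.rows = [tuple(row) for row in rows]

    def checksum(self):
        # Assumption for this sketch: CRC32 over a textual rendering of the rows.
        return zlib.crc32(repr(self.rows).encode("utf-8")) & 0xFFFFFFFF

    def __eq__(self, other):
        if not isinstance(other, TableSketch):
            return NotImplemented
        # Cheap test first: different checksums prove the data differ.
        if self.checksum() != other.checksum():
            return False
        # Equal checksums prove nothing, so compare the actual data.
        return self.rows == other.rows


class CollidingTable(TableSketch):
    """Simulates a broken checksum that is the same for every table."""

    def checksum(self):
        return 0


old_data = CollidingTable([("aardvark", 1), ("bass", 0)])
new_data = CollidingTable([(5.0, 1.0), (3.0, 2.0)])

# The checksum-only test wrongly reports "unchanged" ...
checksum_only = old_data.checksum() == new_data.checksum()
# ... while the combined test correctly reports a change.
combined = old_data.checksum() == new_data.checksum() and old_data == new_data
```

With this ordering the full comparison runs only when the checksums agree, so the common case of different data stays cheap.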

Change History

comment:1 Changed 3 years ago by janez

The checksum is actually a CRC32 code, so the chance that two different data sets have the same code is one in 2^32 (unless we have a bug).

d1 = orange.ExampleTable("c:\\python26\\lib\\site-packages\\orange\\doc\\datasets\\breast-cancer-wisconsin-cont")

d2 = orange.ExampleTable("c:\\python26\\lib\\site-packages\\orange\\doc\\datasets\\zoo")

d1.checksum()

1600938553

d2.checksum()

2126861903

Do you get any "non-random" checksums, like -1?
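To make the CRC32 estimate in this comment concrete, here is a small sketch using Python's standard zlib module. Whether Orange computes its checksum this way is an assumption; the helper name crc32_unsigned is hypothetical. The mask matters because zlib.crc32 could return negative values on Python 2, which is one way a checksum such as -1 might show up:

```python
import zlib


def crc32_unsigned(data):
    """Return the CRC32 of a byte string as an unsigned 32-bit integer."""
    # Masking keeps the result in the range 0 .. 2**32 - 1 on every platform.
    return zlib.crc32(data) & 0xFFFFFFFF


# A CRC32 has 2**32 possible values, so two unrelated inputs collide
# with a probability of about one in 2**32.
possible_values = 2 ** 32
collision_probability = 1.0 / possible_values

# "123456789" is the standard CRC32 check input; its CRC32 is 0xCBF43926.
check_value = crc32_unsigned(b"123456789")
```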

comment:2 Changed 3 years ago by Noughmad

orange.ExampleTable("/home/miha/Orange/Orange/trunk/orange/doc/datasets/breast-cancer-wisconsin-cont").checksum()

1227817956

orange.ExampleTable("/home/miha/Orange/Orange/trunk/orange/doc/datasets/zoo").checksum()

1227817956

orange.ExampleTable("/home/miha/Orange/Orange/trunk/orange/doc/datasets/adult").checksum()

1227817956

orange.ExampleTable("/home/miha/Orange/Orange/trunk/orange/doc/datasets/adult_sample").checksum()

1227817534

orange.ExampleTable("/home/miha/Orange/Orange/trunk/orange/doc/datasets/breast-cancer-wisconsin").checksum()

1224059428

For some reason all checksums share at least the first three digits, and the first three checksums are exactly the same. The results might be different because I have a 64-bit OS.
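The duplicates can be picked out mechanically. The following sketch groups the checksums reported in this comment by value; the data set names and numbers are copied from the output above, and the helper group_by_checksum is hypothetical:

```python
from collections import defaultdict

# Checksums as reported in this comment.
reported = {
    "breast-cancer-wisconsin-cont": 1227817956,
    "zoo": 1227817956,
    "adult": 1227817956,
    "adult_sample": 1227817534,
    "breast-cancer-wisconsin": 1224059428,
}


def group_by_checksum(checksums):
    """Map each checksum shared by two or more data sets to their sorted names."""
    groups = defaultdict(list)
    for name, value in checksums.items():
        groups[value].append(name)
    return {
        value: sorted(names)
        for value, names in groups.items()
        if len(names) > 1
    }


collisions = group_by_checksum(reported)
```

Three of the five data sets land in the same group, which is far more than a one-in-2^32 collision rate would explain.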
