Ticket #1146 (closed bug: fixed)

Opened 2 years ago

Last modified 2 years ago

Instance hashes and comparisons

Reported by: lanz Owned by: janez
Milestone: 2.5 Component: library
Severity: major Keywords:
Cc: matija Blocking:
Blocked By:

Description

The documentations says that data instances compute hashes and can be used for keys in dictionaries etc.

First of all, the hashes are not very good:

>>> import Orange
>>> brown = Orange.data.Table('brown-selected')
>>> len(brown)
186
>>> len(set(hash(ins) for ins in brown))
41

Secondly, instances with different domains can not be in the same dict, because there are a lot of collisions:

>>> len(set(hash(ins) for ins in brown).intersection(hash(ins) for ins in iris))
30

And instances can not be compared:

>>> brown[0] == iris[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
orange.KernelException: 'orange.Example': examples are from different domains

If it is possible to get a quick fix, could at least brown[0] == iris[0] return False instead of throwing an exception?

Change History

comment:1 Changed 2 years ago by janez

The latter, brown[0] == iris[0] is trivial to change, but would anybody intentionally compare two examples from different domains? I guess that such comparisons would always appear by accident, so raising an exception seems more reasonable. (Python 3.0 raises an exception if objects of different types are compared; this seems analogous).

For the hash: Orange still uses CRC32 which is very inappropriate for hashing. Which hash do you suggest? It must be fast (so we can compute it on large tables) and I believe that it must be able to swallow single bytes, too. Perhaps  http://en.wikipedia.org/wiki/Jenkins_hash_function? I'd prefer something simple – CityHash, for instance, seems quite complicated.

comment:2 Changed 2 years ago by lanz

I agree the comparisons are debatable since you could argue that you could (should?) first check if the domains are different manually. Every time, everywhere. I think it is just more practical to use == and see if they are equal or not. The analogy does not seem quite right since the types are the same (Instance). When would one want to do such comparisons? I can't think of many examples, but I have a very practical one right now - cached custom kernel for svm. I need a dict with instance pairs for keys, and for that keys need to be comparable. It is possible to add lots of checks, but I would rather not :)

For the hash, I did not want to suggest we need some cryptographic 512+ bit beast. CRC32 should be fine, I just think that something is wrong... Shouldn't it produce 232 different values, how can 186 examples with 79 distinct floats each produce only 41 different checksums? What are the checksums computed on (I would think they should be on the list of values)? Looks like a bug to me.

comment:3 Changed 2 years ago by janez

I implemented comparison as you suggested. Analogy with Python 3 is still partially true - the idea behind their decision is that you don't want to compare objects that cannot be compared: I still have to raise exception for <, >, <=, >=. Get the new version from mercurial and don't forget to run pyxtract before building.

CRC32 is known to be a bad hash function; its purpose is to catch communication errors. Jenkins hash is a good hash, I guess (Wiki doesn't list is as cryptographic has function). But you're right that this is fishy. There are two possibilities. First, the data may have some regularity that reduces the number of different bits (for Iris, the data consists of floats between 2 and 6...). Second, there may be a bug, in particular on Linux. We've had a bug report with two tables having the same hash, which I was unable to reproduce on Windows. This is weird since hashes on Windows and Linux should be the same, I guess.

You have this problem on Linux, right? I don't debug on Linux; it'd be great if you can help me out. There's not a lot of code behind checksums and it's pretty simple.

comment:4 Changed 2 years ago by janez

I removed some unnecessary casts from int to unsigned long when computing the CRC. This is something that may be platform (or even compiler) dependent. Could you update and check again?

I checked on Windows - I get a perfect hash on brown-selected with old and with new code.

comment:5 Changed 2 years ago by lanz

So it is a platform-specific thing... There is an improvement with the update - instead of 41 I now get 70 unique hashes for brown-selected (out of 186).

comment:6 Changed 2 years ago by matija

  • Cc matija added

Turns out the problem is that crc and crc_table are unsigned long, which has 32 bits on 32-bit systems and 64 bits on 64-bit systems. By changing those to unsigned int, it works on 64-bit systems (with 64-bit Python and compiler) too.

Given that it depends so much on the size of the variable, I'd rather use int32_t, which is guaranteed to be 32-bit on all POSIX platforms, but  it seems that not-so-old versions of Visual C++ don't have that type. Janez, do you agree I push the unsigned int version?

comment:7 Changed 2 years ago by janez

Sure.

comment:8 Changed 2 years ago by Matija Polajnar <matija.polajnar@…>

  • Status changed from new to closed
  • Resolution set to fixed

In [54f9c123f14589e7177235edc96f1493d63886a9/orange]:

Urgent bugfix: Refactor CRC variables from unsigned long to unsigned int to make it work properly on 64-bit machines (with 64-bit Python and libraries). Fixes #1146.

comment:9 Changed 2 years ago by lanz

Thank you both, it works properly now.

And Janez, you've now almost convinced me that being strict with comparisons and throwing exceptions would not be that bad... However I will enjoy the simpler code that the current approach allows :)

Note: See TracTickets for help on using tickets.