Ticket #1146 (closed bug: fixed)
Instance hashes and comparisons
| Reported by: | lanz | Owned by: | janez |
|---|---|---|---|
| Milestone: | 2.5 | Component: | library |
| Severity: | major | Keywords: | |
| Cc: | matija | Blocking: | |
| Blocked By: |
Description
The documentations says that data instances compute hashes and can be used for keys in dictionaries etc.
First of all, the hashes are not very good:
>>> import Orange
>>> brown = Orange.data.Table('brown-selected')
>>> len(brown)
186
>>> len(set(hash(ins) for ins in brown))
41
Secondly, instances with different domains can not be in the same dict, because there are a lot of collisions:
>>> len(set(hash(ins) for ins in brown).intersection(hash(ins) for ins in iris)) 30
And instances can not be compared:
>>> brown[0] == iris[0] Traceback (most recent call last): File "<stdin>", line 1, in <module> orange.KernelException: 'orange.Example': examples are from different domains
If it is possible to get a quick fix, could at least brown[0] == iris[0] return False instead of throwing an exception?
Change History
comment:2 Changed 15 months ago by lanz
I agree the comparisons are debatable since you could argue that you could (should?) first check if the domains are different manually. Every time, everywhere. I think it is just more practical to use == and see if they are equal or not. The analogy does not seem quite right since the types are the same (Instance). When would one want to do such comparisons? I can't think of many examples, but I have a very practical one right now - cached custom kernel for svm. I need a dict with instance pairs for keys, and for that keys need to be comparable. It is possible to add lots of checks, but I would rather not :)
For the hash, I did not want to suggest we need some cryptographic 512+ bit beast. CRC32 should be fine, I just think that something is wrong... Shouldn't it produce 232 different values, how can 186 examples with 79 distinct floats each produce only 41 different checksums? What are the checksums computed on (I would think they should be on the list of values)? Looks like a bug to me.
comment:3 Changed 15 months ago by janez
I implemented comparison as you suggested. Analogy with Python 3 is still partially true - the idea behind their decision is that you don't want to compare objects that cannot be compared: I still have to raise exception for <, >, <=, >=. Get the new version from mercurial and don't forget to run pyxtract before building.
CRC32 is known to be a bad hash function; its purpose is to catch communication errors. Jenkins hash is a good hash, I guess (Wiki doesn't list is as cryptographic has function). But you're right that this is fishy. There are two possibilities. First, the data may have some regularity that reduces the number of different bits (for Iris, the data consists of floats between 2 and 6...). Second, there may be a bug, in particular on Linux. We've had a bug report with two tables having the same hash, which I was unable to reproduce on Windows. This is weird since hashes on Windows and Linux should be the same, I guess.
You have this problem on Linux, right? I don't debug on Linux; it'd be great if you can help me out. There's not a lot of code behind checksums and it's pretty simple.
comment:4 Changed 15 months ago by janez
I removed some unnecessary casts from int to unsigned long when computing the CRC. This is something that may be platform (or even compiler) dependent. Could you update and check again?
I checked on Windows - I get a perfect hash on brown-selected with old and with new code.
comment:5 Changed 15 months ago by lanz
So it is a platform-specific thing... There is an improvement with the update - instead of 41 I now get 70 unique hashes for brown-selected (out of 186).
comment:6 Changed 15 months ago by matija
- Cc matija added
Turns out the problem is that crc and crc_table are unsigned long, which has 32 bits on 32-bit systems and 64 bits on 64-bit systems. By changing those to unsigned int, it works on 64-bit systems (with 64-bit Python and compiler) too.
Given that it depends so much on the size of the variable, I'd rather use int32_t, which is guaranteed to be 32-bit on all POSIX platforms, but it seems that not-so-old versions of Visual C++ don't have that type. Janez, do you agree I push the unsigned int version?

The latter, brown[0] == iris[0] is trivial to change, but would anybody intentionally compare two examples from different domains? I guess that such comparisons would always appear by accident, so raising an exception seems more reasonable. (Python 3.0 raises an exception if objects of different types are compared; this seems analogous).
For the hash: Orange still uses CRC32 which is very inappropriate for hashing. Which hash do you suggest? It must be fast (so we can compute it on large tables) and I believe that it must be able to swallow single bytes, too. Perhaps http://en.wikipedia.org/wiki/Jenkins_hash_function? I'd prefer something simple – CityHash, for instance, seems quite complicated.