Hi, I'm the author of MurmurHash (let me know if verification is needed) and helped review the design of CityHash (I work at Google now). CityHash has higher throughput and is more complex, MurmurHash has lower latency for very short keys and is relatively simple. Both are more than adequate for any sort of hashing needs you might have and should be virtually identical to a random oracle for any keysets you might throw at them. I'm happy to answer questions if you have any.
http://www.reddit.com/r/programming/comments/ozodk/cityhash_...
I think it's fascinating that they're able to determine the number of bytes their code processes during a single cycle. Does anyone know where to start looking to learn more about how to do that? It reminds me of a comment I think I read on here a month or two ago about the number of pixels that could be processed per cycle with an algorithm implemented on a 500MHz FPGA.
Edit: Thanks for the answers!
For a microbenchmark like this, I find it's usually better to call it in a loop 1,000,000 times, and compute the total time. That's often a "best case" scenario, where e.g. the cpu doesn't need to decode the instructions every iteration because they fit in appropriate cache. But it avoids the overhead of the timestamp counter.
The question is how representative this measurement really is. Performance of a hash function depends heavily on memory caches. This looks like a test with many iteration where all data are as close to the processor as possible - hardly a real-life scenario.
That is, if the data isn't "as close to the processor" then won't another hash algorithm in this class be affected the same way? The only thing is that it would deemphasize the performance differences, but it wouldn't change that one is faster than another, would it?
The C++ code is quite readable but I'd say that it contains too many magic numbers. Code is no substitute for a clear description given by the authors of the algorithm, its tradeoffs and design choices.
#define CHUNK(multiplier, z) \
{ \
uint64 old_a = a; \
a = Rotate(b, 41 ^ z) * multiplier + Fetch64(s); \
b = Rotate(c, 27 ^ z) * multiplier + Fetch64(s + 8); \
c = Rotate(d, 41 ^ z) * multiplier + Fetch64(s + 16); \
... etc
I wonder: isn't this the kind of thing that's better handled by creating a function, and maybe explicitly telling the compiler to inline it? I mean, you can do this (from here: http://gcc.gnu.org/onlinedocs/gcc/Inline.html ): inline void foo (const char) __attribute__((always_inline));
or you could simply let your optimizer take care of it. It may be faster to not inline it in some circumstances, but you've gone out of your way to prevent it from being able to make that decision, and to make your code harder to read and more prone to problems due to macro expansion.Why? Is it habit? Or is there a good reason? (currently. historical reasons are not good reasons to continue doing things.)
They also tend to be things that are not designed in code: they are designed in math and then translated to code as the last step, preferably in a way that is able to remain static forever (they should not need maintenance by an engineer, even/especially to update to support random new C compiler builds).
(Some of my friends entered the recent SHA3 NIST hash competition and lasted long enough to have random other mathematicians scrutinize, improve, and finally defeat their math. http://en.wikipedia.org/wiki/Spectral_Hash)
In other words, I don't care if the hash doesn't avalanche perfectly (half changed for one bit of input change; a requirement for cryptographic hashing so changes are "obvious") as long as the collision rate is ~SHA256. Or, if I should care, explanations would be wonderful.
Crypto hashes are made so that collisions are as rare as possible.
Sometimes, you need something very, very fast, with as little collisions are possible (aka it "never" happen, like crypto hashes) yet those are not hashes of passwords and the like, so if it would happen it wouldn't be a security issue.
For that reason, this question is in fact interesting. Not that I have the answer to that :-)
The most important thing to take away from this is "These functions aren’t suitable for cryptography". The requirements for a cryptographically strong hash function are much stronger than one just used for hash-table performance. That's not to say that you shouldn't use it, rather be careful that there's no scope creep.
Neither FNV, Jenkins, Murmur nor this new City Hash are supposed to be used for cryptography.
Besides, the way one generally requires hash functions in cryptography, performance is the least of your worries (unless you're trying to crack), so there's no reason whatsoever not to go with a proper cryptographically strong hash such as SHA256.
That said, Jenkins has worse avalanche behaviour than FNV--though I'm sure with a few tweaks that could be fixed, non-crypto hash functions aren't exactly rocket science (check out where FNV's constants came from)--but as it is, Jenkins is more of an academic example.
Spooky (also developed by Bob Jenkins) seems better. Oddly enough Spooky claims 3 bytes per cycle, but the author cannot confirm CityHash's 6 bytes per cycle ( http://burtleburtle.net/bob/hash/spooky.html ). Though whatever it is, both are obviously blazingly fast, picking one over the other because of performance reasons should only be done once you've established that the speed of the hashing function is indeed a bottleneck, and you can compare the performance on the specific hardware it's supposed to run on.
I don't mean symmetric in the cryptographic sense. I mean as in "ABBA" mapping to the same hash value as "YZZY". I can't remember the specific case, but I think it was with 8 byte keys, and it might be like the above or it might be more like "ABAB" and "YZYZ". Either way, it caused some pretty catastrophic failures in one case due to the unusually high probability of such keys existing in our dataset.
I've done tests with pretty huge dictionaries on both Jenkins & FNV, and despite FNV's enviable characteristics, empirically Jenkins did much better at avoiding collisions. Sometimes the stuff that looks good on paper has unfortunate characteristics with real world data sets (which typically aren't that random).
http://www.reddit.com/r/programming/comments/ozodk/cityhash_...
CityHash: new hash function by Google (faster than Murmur)
Content fetched 0 seconds ago