Library to generate hashes from Clojure data.

Valuehash

A Clojure library for computing higher-bit hashes of arbitrary Clojure data structures that respect Clojure's value semantics. That is, if two objects are clojure.core/=, they have the same hash value. To my knowledge, no other Clojure data hashing library makes this guarantee.

The protocol is extensible to arbitrary data types and can work with any hash function.

Although the library uses byte streams as an intermediate format, it does not tag types, or perform any optimization or compaction of the byte stream. Therefore it should not be used as a serialization library. Use Fressian, Transit or something similar instead.

Usage

To get an MD5 hash of any Clojure object, do this:

(valuehash.api/md5 {:hello "world"})
=> #object["[B" 0x30cb9804 "[B@30cb9804"]

To get the hexadecimal string version, do this:

(valuehash.api/md5-str {:hello "world"})
=> "d3d7ccf8b8c217f3b52dc08929eabab8"
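
As a rough illustration of how the byte-array result relates to its string form, the sketch below renders raw digest bytes as lowercase hexadecimal. The helper name bytes->hex is hypothetical and is not part of valuehash; the library's own conversion may be implemented differently.

```clojure
(import '(java.security MessageDigest))

(defn bytes->hex
  "Hypothetical helper: renders each byte as two lowercase hex digits."
  [bs]
  (apply str (map #(format "%02x" %) bs)))

;; Negative bytes are rendered as their unsigned two-digit form.
(bytes->hex (byte-array [(byte 4) (byte -1)]))
;; => "04ff"

;; An MD5 digest is 16 bytes, so its hex string has 32 characters.
;; This hashes an empty byte array directly, not a Clojure value.
(bytes->hex (.digest (MessageDigest/getInstance "MD5") (byte-array 0)))
;; => "d41d8cd98f00b204e9800998ecf8427e"
```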

Also provided are sha-1, sha-1-str, sha-256 and sha-256-str, which do what they say on the tin.

Custom hash functions

If you want a hash function other than md5, sha-1 or sha-256, you can obtain a digest function for any algorithm supported by java.security.MessageDigest in your JVM.

Obtain the digest function using the messagedigest-fn function, then pass it and the object to be hashed to digest.

To obtain a hexadecimal string of the result, call the hex-str function on it. In the examples below, h is an alias for the valuehash.api namespace.

(h/hex-str (h/digest (h/messagedigest-fn "MD2") {:hello "World"}))
=> "81c9637d9fcb071a486eeb0c76dce1f6"
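
To make the idea of a digest function concrete, here is a minimal sketch of how a function from an algorithm name to a stream-consuming digest could be written against java.security.MessageDigest. The name stream-digest-fn is hypothetical, and this is not necessarily how messagedigest-fn is implemented.

```clojure
(import '(java.security MessageDigest)
        '(java.io ByteArrayInputStream InputStream))

(defn stream-digest-fn
  "Hypothetical helper: returns a function that consumes an InputStream
  and returns the digest bytes for the named algorithm."
  [^String algorithm]
  (fn [^InputStream is]
    (let [md  (MessageDigest/getInstance algorithm) ; fresh instance per call
          buf (byte-array 4096)]
      ;; Feed the stream to the digest in chunks until end of stream (-1).
      (loop []
        (let [n (.read is buf)]
          (when (pos? n)
            (.update md buf 0 n)
            (recur))))
      (.digest md))))

(def sha-512 (stream-digest-fn "SHA-512"))

;; SHA-512 always produces 64 bytes, whatever the input length.
(count (sha-512 (ByteArrayInputStream. (.getBytes "hello" "UTF-8"))))
;; => 64
```

Creating a new MessageDigest on every call keeps the returned function safe to reuse, since MessageDigest instances are stateful.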

Even more custom hash functions

If nothing in java.security.MessageDigest meets your needs, you can supply your own digest function to valuehash.api/digest. This may be any function which takes a java.io.InputStream and returns a byte array.

For example, the following defines and uses a valid but terrible hash function:

(defn lazyhash [is]
  ;; chosen by a fair dice roll, guaranteed to be a random oracle
  (byte-array [(byte 4)]))

(h/digest lazyhash {:hello "world"})
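
For a slightly less terrible illustration, the sketch below shows a digest function that actually reads its input, using the JDK's CRC-32 checksum. It follows the contract described above (java.io.InputStream in, byte array out). The name crc32-digest is hypothetical, and CRC-32 is not cryptographically secure.

```clojure
(import '(java.util.zip CRC32)
        '(java.nio ByteBuffer)
        '(java.io ByteArrayInputStream InputStream))

(defn crc32-digest
  "Hypothetical digest function: reads the whole stream and returns its
  CRC-32 checksum as a 4-byte big-endian array."
  [^InputStream is]
  (let [crc (CRC32.)
        buf (byte-array 4096)]
    ;; Update the checksum chunk by chunk until end of stream (-1).
    (loop []
      (let [n (.read is buf)]
        (when (pos? n)
          (.update crc buf 0 n)
          (recur))))
    ;; getValue returns an unsigned 32-bit value in a long; keep the low 32 bits.
    (.. (ByteBuffer/allocate 4)
        (putInt (unchecked-int (.getValue crc)))
        (array))))

;; Standard CRC-32 check value for "123456789" is 0xCBF43926.
(vec (crc32-digest (ByteArrayInputStream. (.getBytes "123456789" "US-ASCII"))))
;; => [-53 -12 57 38]

;; With valuehash on the classpath, it would be passed like lazyhash:
;; (h/digest crc32-digest {:hello "world"})
```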

Semantics

The library does not combine hashes of individual elements: it converts the entire input to binary data and hashes that. As such, it is suitable for cryptographic applications when used with an appropriate hash function.

The binary data supplied to the hash function matches Clojure's equality semantics. That is, objects that are semantically clojure.core/= will have the same binary representation.

This means:

  • All list types are encoded the same
  • All set types are encoded the same
  • All map types are encoded the same
  • All integer numbers are encoded the same
  • All floating-point numbers are encoded the same
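
The equalities behind this list can be checked at a REPL with plain clojure.core, no library required. The values below are illustrative only:

```clojure
;; Sequential collections compare by contents, not by concrete type.
(= [1 2 3] '(1 2 3))            ;; => true

;; Hash-based and sorted sets and maps with the same contents are equal.
(= #{1 2} (sorted-set 1 2))     ;; => true
(= {:a 1} (sorted-map :a 1))    ;; => true

;; Integer types of different widths are equal when their values match.
(= 1 1N (int 1) (short 1))      ;; => true

;; Likewise for floating-point types.
(= 1.0 (float 1.0))             ;; => true

;; Integers and floating-point numbers are different categories.
(= 1 1.0)                       ;; => false
```

Because each true pair is clojure.core/=, the guarantee stated above implies that both members of the pair hash to the same value.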

The system does take some steps to rule out common types of "collisions", where two unequal objects have the same binary representation (and therefore the same hash). It injects "separator" bytes in collections, so that (for example) the binary representation of ["ab" "c"] is not equal to ["a" "bc"].
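
The collision that separators prevent can be reproduced with two toy encoders. Both functions are hypothetical and only illustrate the idea; the separator bytes and layout that valuehash actually uses may differ.

```clojure
(defn naive-bytes
  "Hypothetical encoder without separators: concatenates UTF-8 bytes."
  [strs]
  (vec (mapcat #(.getBytes ^String % "UTF-8") strs)))

(defn separated-bytes
  "Hypothetical encoder that writes a 0 byte after each element."
  [strs]
  (vec (mapcat #(concat (.getBytes ^String % "UTF-8") [(byte 0)]) strs)))

;; Without separators, both vectors flatten to the bytes of "abc".
(= (naive-bytes ["ab" "c"]) (naive-bytes ["a" "bc"]))
;; => true

;; With separators, the element boundaries survive in the byte stream.
(= (separated-bytes ["ab" "c"]) (separated-bytes ["a" "bc"]))
;; => false
```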

Supported Types

By default, Clojure's native types are supported: as a rule of thumb, if it can be printed to EDN by the default printer, it can be hashed with no fuss.

If you want to extend the system to hash arbitrary values, you can extend the valuehash.impl/CanonicalByteArray protocol to any object of your choosing.
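
The general shape of such an extension is sketched below with a stand-in protocol, so that the example runs without the library. The protocol CanonicalBytes and its method canonical-bytes are hypothetical; check the valuehash.impl namespace for the real method name and signature of CanonicalByteArray before adapting this.

```clojure
(defprotocol CanonicalBytes
  "Stand-in for valuehash.impl/CanonicalByteArray (method name is hypothetical)."
  (canonical-bytes [this] "Returns a byte array that is identical for equal values."))

;; java.net.URI has no default EDN printer, so it serves as an example of a
;; type that would need an explicit extension.
(extend-protocol CanonicalBytes
  java.net.URI
  (canonical-bytes [u]
    ;; Equal URIs built from the same string yield the same UTF-8 bytes.
    (.getBytes (str u) "UTF-8")))

(vec (canonical-bytes (java.net.URI. "a:b")))
;; => [97 58 98]
```

Whatever encoding is chosen, it has to agree with the type's equality: two instances that are clojure.core/= must produce the same bytes.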

Performance

On my MacBook Pro, this library can determine the MD5 hash of small objects (vectors of 0-10 elements) at a rate of about 22,000 hashed objects per second.

Larger, more complex nested objects slow down significantly, to a rate of around 2,600 per second for objects generated by (clojure.test.check.generators/sample-seq gen/any-printable 100).

To run your own benchmarks, check out the valuehash.bench namespace in the test directory.

The current implementation is known to be somewhat naive, as it is single-threaded and performs lots of redundant array copying. If you have ideas for how to make this faster, please see the valuehash.impl namespace and re-implement or replace CanonicalByteArray, then submit a pull request with your alternative implementation in a separate namespace, with comparative benchmarks attached.

License

Copyright © 2016 Luke VanderHart

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.