finish implementation
parent
82b1d46f7d
commit
2d18c4f43c
104
README.md
|
@ -1,16 +1,111 @@
|
|||
# identihash
|
||||
# Valuehash
|
||||
|
||||
A Clojure library that provides higher-bit hashes of arbitrary
|
||||
Clojure data structures, which respect Clojure's identity semantics. That is, if
|
||||
Clojure data structures, which respect Clojure's value semantics. That is, if
|
||||
two objects are `clojure.core/=`, they will have the same hash value. To my
|
||||
knowledge, no other Clojure data hashing libraries make this guarantee.
|
||||
|
||||
The protocol is extensible to arbitrary data types and can work with any hash
|
||||
function that can take a byte array.
|
||||
function.
|
||||
|
||||
Although the library uses byte streams as an intermediate format, it does not
|
||||
tag types, or perform any optimization or compaction of the byte stream. Therefore it should _not_ be used as a serialization library. Use
|
||||
[Fressian](https://github.com/clojure/data.fressian),
|
||||
[Transit](https://github.com/cognitect/transit-clj) or something similar
|
||||
instead.
|
||||
|
||||
## Usage
|
||||
|
||||
TODO
|
||||
To get an MD5 hash of any Clojure object, do this:
|
||||
|
||||
```clojure
|
||||
(valuehash.api/md5 {:hello "world"})
|
||||
=> #object["[B" 0x30cb9804 "[B@30cb9804"]
|
||||
```
|
||||
|
||||
To get the hexadecimal string version, do this:
|
||||
|
||||
```clojure
|
||||
(valuehash.api/md5-str {:hello "world"})
|
||||
=> "d3d7ccf8b8c217f3b52dc08929eabab8"
|
||||
```
|
||||
|
||||
Also provided are `sha-1`, `sha-1-str`, `sha-256` and `sha-256-str`, which do
|
||||
what they say on the tin.
|
||||
|
||||
### Custom hash functions
|
||||
|
||||
If you want a hash function other than md5, sha-1 or sha-256, you can obtain a
|
||||
digest function for any algorithm supported by `java.security.MessageDigest` in
|
||||
your JVM.
|
||||
|
||||
Obtain the digest function using the `messagedigest-fn` function, then pass it
|
||||
and the object to be hashed to `digest`.
|
||||
|
||||
If you wish to obtain a hexadecimal string of the result, call the `hex-str` function on the result.
|
||||
|
||||
```clojure
|
||||
(h/hex-str (h/digest (h/messagedigest-fn "MD2") {:hello "World"}))
|
||||
=> "81c9637d9fcb071a486eeb0c76dce1f6"
|
||||
```
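
For readers who want to see the shape of a digest function without pulling in the library, here is a minimal standalone sketch. It assumes only the JDK's `java.security.MessageDigest`; the names `stream-digest-fn` and `hex` are hypothetical stand-ins for this illustration, not the library's own functions.

```clojure
(import '[java.security MessageDigest]
        '[java.io InputStream ByteArrayInputStream])

(defn stream-digest-fn
  "Hypothetical helper: return a function that reads an InputStream to the
  end and returns the digest for `algorithm` as a byte array."
  [^String algorithm]
  (fn [^InputStream is]
    (let [md  (MessageDigest/getInstance algorithm)
          buf (byte-array 4096)]
      (loop []
        (let [n (.read is buf)]
          ;; .read returns -1 at end of stream
          (when (pos? n)
            (.update md buf 0 n)
            (recur))))
      (.digest md))))

(defn hex
  "Hypothetical helper: lowercase hexadecimal string for a byte array."
  [ba]
  (apply str (map #(format "%02x" %) ba)))

(hex ((stream-digest-fn "MD5")
      (ByteArrayInputStream. (.getBytes "abc" "UTF-8"))))
;; => "900150983cd24fb0d6963f7d28e17f72"
```

Any algorithm name accepted by `MessageDigest/getInstance` on your JVM (for example `"SHA-512"`) can be substituted for `"MD5"`.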
|
||||
|
||||
### Even more custom hash functions
|
||||
|
||||
If nothing in `java.security.MessageDigest` meets your needs, you can supply
|
||||
your own digest function to `valuehash.api/digest`. This may be any function
|
||||
which takes a `java.io.InputStream` and returns a byte array.
|
||||
|
||||
For example, the following defines and uses a valid but terrible hash function:
|
||||
|
||||
```clojure
|
||||
(defn lazyhash [is]
|
||||
;; chosen by a fair dice roll, guaranteed to be a random oracle
|
||||
(byte-array [(byte 4)]))
|
||||
|
||||
(h/digest lazyhash {:hello "world"})
|
||||
```
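
A slightly more realistic digest function might actually read the stream. The sketch below uses the JDK's `java.util.zip.CRC32` (a checksum, not a cryptographic hash) purely as an illustration; `crc32-digest` and `hex` are hypothetical names, and the block runs on its own without the library.

```clojure
(import '[java.io InputStream ByteArrayInputStream]
        '[java.util.zip CRC32])

(defn crc32-digest
  "Hypothetical digest function: consume the InputStream and return the
  CRC-32 checksum as a 4-byte, big-endian byte array."
  [^InputStream is]
  (let [crc (CRC32.)
        buf (byte-array 1024)]
    (loop []
      (let [n (.read is buf)]
        (when (pos? n)
          (.update crc buf 0 n)
          (recur))))
    (let [v (.getValue crc)]
      ;; keep the low 32 bits, most significant byte first
      (byte-array (map #(unchecked-byte (bit-shift-right v (* 8 %)))
                       [3 2 1 0])))))

(defn hex [ba] (apply str (map #(format "%02x" %) ba)))

(hex (crc32-digest (ByteArrayInputStream. (.getBytes "123456789" "UTF-8"))))
;; => "cbf43926"
```

Because it takes an `InputStream` and returns a byte array, a function of this shape should be usable as the first argument to `valuehash.api/digest`.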
|
||||
## Semantics
|
||||
|
||||
This does not combine hashes: it converts the entire input data to binary data,
|
||||
and hashes that. As such, it is suitable for cryptographic applications when
|
||||
used with an appropriate hash function.
|
||||
|
||||
The binary data supplied to the hash function matches Clojure's equality
|
||||
semantics. That is, objects that are semantically `clojure.core/=` will have the
|
||||
same binary representation.
|
||||
|
||||
This means:
|
||||
|
||||
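- All sequential collections (lists, vectors, seqs and other `java.util.List` implementations) with equal elements in the same order are encoded the same
- All sets with equal elements are encoded the same, regardless of concrete type or iteration order
- Integer types (`Byte`, `Integer`, `Long`) holding the same value are encoded the same
- Floating-point types (`Float`, `Double`) holding the same value are encoded the same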
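
These cases follow how `clojure.core/=` itself treats values of different concrete types. The expressions below are plain Clojure (no library required) and show the equalities the encoding is meant to preserve:

```clojure
;; Sequential collections compare by element, regardless of concrete type
(= [1 2 3] '(1 2 3) (map identity [1 2 3]) (java.util.ArrayList. [1 2 3]))
;; => true

;; Sets compare by membership, regardless of concrete type
(= #{1 2} (java.util.HashSet. [1 2]))
;; => true

;; Integer types holding the same value are equal
(= (byte 1) (int 1) 1)
;; => true

;; Floating-point types holding the same value are equal
(= (float 0.5) 0.5)
;; => true

;; Integers and floating-point numbers are never `=`
(= 1 1.0)
;; => false
```

Wherever `=` returns true above, the corresponding hashes (for example via `valuehash.api/md5-str`) are expected to match as well.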
|
||||
|
||||
The system does take some steps to rule out common types of "collisions", where two unequal objects have the same binary representation (and therefore the same hash). It injects "separator" bytes in collections, so that (for example) the binary representation of `["ab" "c"]` is not equal to `["a" "bc"]`.
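
The effect of separator bytes can be sketched in isolation. The helper below is hypothetical (`encode-strings` is not part of the library) and only handles strings, but it uses the same idea of writing a separator before the first element and between elements:

```clojure
(defn encode-strings
  "Hypothetical helper: concatenate the UTF-8 bytes of each string. When
  `sep` is non-nil, write it before the first element and between elements."
  [sep strs]
  (let [chunks (map #(vec (.getBytes ^String % "UTF-8")) strs)]
    (vec (apply concat (if sep
                         (cons [sep] (interpose [sep] chunks))
                         chunks)))))

;; Without a separator, the two vectors collide
(encode-strings nil ["ab" "c"])        ;; => [97 98 99]
(encode-strings nil ["a" "bc"])        ;; => [97 98 99]

;; With a separator, they differ
(encode-strings (byte 42) ["ab" "c"])  ;; => [42 97 98 42 99]
(encode-strings (byte 42) ["a" "bc"])  ;; => [42 97 42 98 99]
```

In this sketch a separator only rules out some collisions: an element that itself contains the separator byte (here `*`, byte 42) can still produce the same output as a differently split input, e.g. `["a*b"]` and `["a" "b"]`.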
|
||||
|
||||
## Supported Types
|
||||
|
||||
By default, Clojure's native types are supported: as a rule of thumb, if it can be printed to EDN by the default printer, it can be hashed with no fuss.
|
||||
|
||||
If you want to extend the system to hash arbitrary values, you can extend the `valuehash.impl/CanonicalByteArray` protocol to any object of your choosing.
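
As a rough sketch of what such an extension could look like, the example below teaches the protocol about `java.time.LocalDate`. To keep the block runnable on its own, it defines a local stand-in protocol with the same shape as `valuehash.impl/CanonicalByteArray`; in real use you would extend the library's protocol instead. The choice of `LocalDate` and the epoch-day encoding are assumptions made for illustration.

```clojure
;; Stand-in for valuehash.impl/CanonicalByteArray (assumption: same shape)
(defprotocol CanonicalByteArray
  (to-byte-array [this] "Convert an object to a canonical byte array"))

(extend-protocol CanonicalByteArray
  java.time.LocalDate
  (to-byte-array [this]
    ;; Equal dates have the same epoch day, so they get the same bytes
    (.toByteArray (biginteger (.toEpochDay ^java.time.LocalDate this)))))

(= (seq (to-byte-array (java.time.LocalDate/of 2016 1 1)))
   (seq (to-byte-array (java.time.LocalDate/parse "2016-01-01"))))
;; => true
```

Whatever encoding you choose, the important property is that any two values that are `clojure.core/=` produce identical byte arrays.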
|
||||
|
||||
## Performance
|
||||
|
||||
On my MacBook Pro, this library can determine the MD5 hash of small vectors (0-10
elements) at a rate of about 22,000 hashed objects per second.
|
||||
|
||||
Larger, more complex nested objects slow down significantly, to a rate of around
|
||||
2,600 per second for objects generated by
|
||||
`(clojure.test.check.generators/sample-seq gen/any-printable 100)`
|
||||
|
||||
To run your own benchmarks, check out the `valuehash.bench` namespace in the
|
||||
`test` directory.
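
The benchmark namespace derives its "hashed objects per second" figure by dividing the number of objects in a batch by the mean time taken to hash the whole batch. A minimal sketch of that arithmetic, with a hypothetical helper name and made-up input numbers:

```clojure
(defn hashes-per-second
  "Hypothetical helper: objects hashed per second, given the batch size and
  the mean time (in seconds) to hash one whole batch."
  [batch-size mean-seconds-per-batch]
  (Math/round (double (/ batch-size mean-seconds-per-batch))))

;; e.g. a batch of 1000 objects that takes 0.25 s on average
(hashes-per-second 1000 0.25)
;; => 4000
```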
|
||||
|
||||
The current implementation is known to be somewhat naive, as it is single
|
||||
threaded and performs lots of redundant array copying. If you have ideas for
|
||||
how to make this faster, please see the `valuehash.impl` namespace and
|
||||
re-implement/replace `CanonicalByteArray`, then submit a pull request with your
|
||||
alternative impl in a separate namespace, with comparative benchmarks attached.
|
||||
|
||||
## License
|
||||
|
||||
|
@ -18,4 +113,3 @@ Copyright © 2016 Luke VanderHart
|
|||
|
||||
Distributed under the Eclipse Public License either version 1.0 or (at
|
||||
your option) any later version.
|
||||
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
(defproject arachne-framework/identihash "0.1.0-SNAPSHOT"
|
||||
:description "Identity based hashing for Clojure data"
|
||||
(defproject arachne-framework/valuehash "0.1.0-SNAPSHOT"
|
||||
:description "Value based hashing for Clojure data"
|
||||
:license {:name "Eclipse Public License"
|
||||
:url "http://www.eclipse.org/legal/epl-v10.html"}
|
||||
:dependencies [[org.clojure/clojure "1.9.0-alpha14"]
|
||||
[org.clojure/test.check "0.9.0" :scope "test"]])
|
||||
[org.clojure/test.check "0.9.0" :scope "test"]
|
||||
[criterium "0.4.4" :scope "test"]])
|
||||
|
|
|
@ -0,0 +1,67 @@
|
|||
(ns valuehash.api
|
||||
(:require [valuehash.impl :as impl]
|
||||
[valuehash.specs])
|
||||
(:import [java.security MessageDigest]
|
||||
[java.io InputStream ByteArrayInputStream]))
|
||||
|
||||
(defn digest
|
||||
"Given a digest function and an arbitrary Clojure object, return a byte array
|
||||
representing the digest of the object.
|
||||
|
||||
The digest function must take an InputStream as its argument, and return a
|
||||
byte array."
|
||||
^bytes [digest-fn obj]
|
||||
(digest-fn (ByteArrayInputStream. (impl/to-byte-array obj))))
|
||||
|
||||
(defn- consume
|
||||
"Fully consume the specified input stream, using the supplied MessageDigest
|
||||
object."
|
||||
[^MessageDigest digest ^InputStream is]
|
||||
(let [buf (byte-array 64)]
|
||||
(loop []
|
||||
(let [read (.read is buf)]
|
||||
(when (<= 0 read)
|
||||
(.update digest buf 0 read)
|
||||
(recur))))
|
||||
(.digest digest)))
|
||||
|
||||
(defn messagedigest-fn
|
||||
"Return a digest function using java.security.MessageDigest, using the specified algorithm"
|
||||
[algorithm]
|
||||
(fn [is]
|
||||
(consume (MessageDigest/getInstance algorithm) is)))
|
||||
|
||||
(defn hex-str
|
||||
"Return the hexadecimal string representation of a byte array"
|
||||
[ba]
|
||||
(apply str (map #(format "%02x" %) ba)))
|
||||
|
||||
(defn md5
|
||||
"Return the MD5 digest of an arbitrary Clojure data structure"
|
||||
[obj]
|
||||
(digest (messagedigest-fn "MD5") obj))
|
||||
|
||||
(defn md5-str
|
||||
"Return the MD5 digest of an arbitrary Clojure data structure, as a string"
|
||||
[obj]
|
||||
(hex-str (md5 obj)))
|
||||
|
||||
(defn sha-1
|
||||
"Return the SHA-1 digest of an arbitrary Clojure data structure"
|
||||
[obj]
|
||||
(digest (messagedigest-fn "SHA-1") obj))
|
||||
|
||||
(defn sha-1-str
|
||||
"Return the SHA-1 digest of an arbitrary Clojure data structure, as a string"
|
||||
[obj]
|
||||
(hex-str (sha-1 obj)))
|
||||
|
||||
(defn sha-256
|
||||
"Return the SHA-256 digest of an arbitrary Clojure data structure"
|
||||
[obj]
|
||||
(digest (messagedigest-fn "SHA-256") obj))
|
||||
|
||||
(defn sha-256-str
"Return the SHA-256 digest of an arbitrary Clojure data structure, as a string"
|
||||
[obj]
|
||||
(hex-str (sha-256 obj)))
|
|
@ -0,0 +1,111 @@
|
|||
(ns valuehash.impl
|
||||
"Simple implementation based on plain byte arrays"
|
||||
(:import [java.util UUID Date]))
|
||||
|
||||
(defprotocol CanonicalByteArray
|
||||
"An object that can be converted to a canonical byte array, with value
|
||||
semantics intact (that is, two objects that are clojure.core/= will always
|
||||
have the same binary representation)"
|
||||
(to-byte-array [this] "Convert an object to a canonical byte array"))
|
||||
|
||||
(defn- ba-comparator
|
||||
"Comparator function for byte arrays"
|
||||
^long [^bytes a ^bytes b]
|
||||
(let [alen (alength a)
|
||||
blen (alength b)]
|
||||
(if (not= alen blen)
|
||||
(- alen blen)
|
||||
; compare backward, since lots of symbols/keywords have a common prefix
|
||||
(loop [i (dec alen)]
|
||||
(if (< i 0)
|
||||
0
|
||||
(let [c (- (aget a i) (aget b i))]
|
||||
(if (= 0 c)
|
||||
(recur (dec i))
|
||||
c)))))))
|
||||
|
||||
(defn long->bytes
|
||||
"Convert a long value to a byte array"
|
||||
[val]
|
||||
(.toByteArray (biginteger val)))
|
||||
|
||||
(defn- join-byte-arrays
|
||||
"Copy multiple byte arrays to a single output byte array in the order they
|
||||
are given."
|
||||
[arrays]
|
||||
(let [dest (byte-array (reduce + (map #(alength ^bytes %) arrays)))]
|
||||
(loop [offset 0
|
||||
[^bytes src & more] arrays]
|
||||
(when src
|
||||
(let [srclen (alength src)]
|
||||
(System/arraycopy src 0 dest offset srclen)
|
||||
(recur (+ offset srclen) more))))
|
||||
dest))
|
||||
|
||||
;; Primitive values
|
||||
(extend-protocol CanonicalByteArray
|
||||
nil
|
||||
(to-byte-array [_] (byte-array 1 (byte 0)))
|
||||
String
|
||||
(to-byte-array [this] (.getBytes ^String this "UTF-8")) ; fixed charset so bytes do not depend on the platform default
|
||||
clojure.lang.Keyword
|
||||
(to-byte-array [this] (.getBytes (str this) "UTF-8"))
|
||||
clojure.lang.Symbol
|
||||
(to-byte-array [this] (.getBytes (str this) "UTF-8"))
|
||||
Byte
|
||||
(to-byte-array [this] (long->bytes this))
|
||||
Integer
|
||||
(to-byte-array [this] (long->bytes this))
|
||||
Long
|
||||
(to-byte-array [this] (long->bytes this))
|
||||
Double
|
||||
(to-byte-array [this] (long->bytes (Double/doubleToLongBits this)))
|
||||
Float
|
||||
(to-byte-array [this] (long->bytes (Double/doubleToLongBits this)))
|
||||
clojure.lang.Ratio
|
||||
(to-byte-array [this] (long->bytes (Double/doubleToLongBits (double this))))
|
||||
Boolean
|
||||
(to-byte-array [this] (byte-array 1 (if this (byte 1) (byte 0))))
|
||||
Character
|
||||
(to-byte-array [this] (.getBytes (str this) "UTF-8"))
|
||||
UUID
|
||||
(to-byte-array [this]
|
||||
(join-byte-arrays [(long->bytes (.getMostSignificantBits ^UUID this))
|
||||
(long->bytes (.getLeastSignificantBits ^UUID this))]))
|
||||
Date
|
||||
(to-byte-array [this]
|
||||
(long->bytes (.getTime this))))
|
||||
|
||||
(def list-sep (byte-array 1 (byte 42)))
|
||||
(def set-sep (byte-array 1 (byte 21)))
|
||||
(def map-sep (byte-array 1 (byte 19)))
|
||||
|
||||
(defn- map-entry->byte-array
|
||||
[map-entry]
|
||||
(join-byte-arrays [(to-byte-array (.getKey map-entry))
|
||||
(to-byte-array (.getValue map-entry))]))
|
||||
|
||||
;; Collections
|
||||
(extend-protocol CanonicalByteArray
|
||||
java.util.List
|
||||
(to-byte-array [this]
|
||||
(->> this
|
||||
(map to-byte-array)
|
||||
(interpose list-sep)
|
||||
(cons list-sep)
|
||||
(join-byte-arrays)))
|
||||
java.util.Set
|
||||
(to-byte-array [this]
|
||||
(->> this
|
||||
(map to-byte-array)
|
||||
(sort ba-comparator)
|
||||
(interpose set-sep)
|
||||
(cons set-sep)
|
||||
(join-byte-arrays)))
|
||||
java.util.Map
|
||||
(to-byte-array [this]
|
||||
(->> this
|
||||
(map map-entry->byte-array)
|
||||
(sort ba-comparator)
|
||||
(cons map-sep)
|
||||
(join-byte-arrays))))
|
|
@ -0,0 +1,25 @@
|
|||
(ns valuehash.specs
|
||||
(:require [clojure.spec :as s]))
|
||||
|
||||
(def byte-array-class (class (byte-array 0)))
|
||||
|
||||
(defn byte-array? [obj] (instance? byte-array-class obj))
|
||||
(defn input-stream? [obj] (instance? java.io.InputStream obj))
|
||||
|
||||
(s/def ::digest-fn
|
||||
(s/fspec
|
||||
:args (s/cat :input-stream input-stream?)
|
||||
:ret byte-array?))
|
||||
|
||||
(s/fdef valuehash.api/digest
|
||||
:args (s/cat :digest-fn ::digest-fn, :obj any?)
|
||||
:ret byte-array?)
|
||||
|
||||
(s/fdef valuehash.api/messagedigest-fn
|
||||
:args (s/cat :algorithm string?)
|
||||
:ret ::digest-fn)
|
||||
|
||||
(s/fdef valuehash.api/hex-str
|
||||
:args (s/cat :byte-array byte-array?)
|
||||
:ret string?)
|
||||
|
|
@ -0,0 +1,67 @@
|
|||
(ns valuehash.api-test
|
||||
(:require [clojure.test.check.generators :as gen]
|
||||
[clojure.test.check.properties :as prop]
|
||||
[clojure.test.check.clojure-test :refer [defspec]]
|
||||
[valuehash.api :as api]))
|
||||
|
||||
(defprotocol Perturbable
"A value that can be converted to a value of a different type, but still be equal"
|
||||
(perturb [obj] "Convert an object to a different but equal object"))
|
||||
|
||||
(defn select
|
||||
"Deterministically select one of the options (based on the hash of the key)"
|
||||
[key & options]
|
||||
(nth options (mod (hash key) (count options))))
|
||||
|
||||
(extend-protocol Perturbable
|
||||
|
||||
Object
|
||||
(perturb [obj] obj)
|
||||
|
||||
Long
|
||||
(perturb [l]
|
||||
(select l
|
||||
(if (< Byte/MIN_VALUE l Byte/MAX_VALUE) (byte l) l)
|
||||
(if (< Integer/MIN_VALUE l Integer/MAX_VALUE) (int l) l)))
|
||||
|
||||
Double
|
||||
(perturb [d]
|
||||
(if (= d (unchecked-float d))
|
||||
(unchecked-float d)
|
||||
d))
|
||||
|
||||
java.util.Map
|
||||
(perturb [obj]
|
||||
(let [keyvals (interleave (reverse (keys obj))
|
||||
(reverse (map perturb (vals obj))))]
|
||||
(select obj
|
||||
(apply array-map keyvals)
|
||||
(apply hash-map keyvals)
|
||||
(java.util.HashMap. (apply array-map keyvals)))))
|
||||
|
||||
java.util.List
|
||||
(perturb [obj]
|
||||
(let [l (map perturb obj)]
|
||||
(select obj
|
||||
(lazy-seq l)
|
||||
(apply vector l)
|
||||
(apply list l)
|
||||
(java.util.ArrayList. l)
|
||||
(java.util.LinkedList. l))))
|
||||
|
||||
java.util.Set
|
||||
(perturb [obj]
|
||||
(let [s (reverse (map perturb obj))]
|
||||
(select obj
|
||||
(apply hash-set s)
|
||||
(java.util.HashSet. s)
|
||||
(java.util.LinkedHashSet. s)))))
|
||||
|
||||
|
||||
(defspec value-semantics-hold 150
|
||||
(prop/for-all [o gen/any-printable]
|
||||
(let [p (perturb o)]
|
||||
;; all three comparisons must hold; without `and` only the last one is checked
(and (= (api/md5-str o) (api/md5-str p))
     (= (api/sha-1-str o) (api/sha-1-str p))
     (= (api/sha-256-str o) (api/sha-256-str p))))))
|
|
@ -0,0 +1,57 @@
|
|||
(ns valuehash.bench
|
||||
(:require [valuehash.api :as api]
|
||||
[criterium.core :as c]
|
||||
[clojure.test.check.generators :as gen]
|
||||
[clojure.test.check.random :as random]
|
||||
[clojure.test.check.rose-tree :as rose]))
|
||||
|
||||
(defn- sample-seq
|
||||
"Return a sequence of realized values from `generator`.
|
||||
|
||||
Copy of the built in `sample-seq`, but lets you pass in the seed so benchmark
|
||||
runs can be deterministic across machines."
|
||||
[generator seed max-size]
|
||||
(let [r (random/make-random seed)
|
||||
size-seq (gen/make-size-range-seq max-size)]
|
||||
(map #(rose/root (gen/call-gen generator %1 %2))
|
||||
(gen/lazy-random-states r)
|
||||
size-seq)))
|
||||
|
||||
(defn- do-bench
|
||||
"Benchmark the specified digest function, using the specified seq of sample data"
|
||||
[do-digest data]
|
||||
(let [data (doall data)]
|
||||
(c/with-progress-reporting
|
||||
(let [results (c/benchmark
|
||||
(doseq [obj data]
|
||||
(do-digest obj))
|
||||
{})
|
||||
hps (/ (count data) (first (:mean results)))]
|
||||
(c/report-result results)
|
||||
(println "\nThis translates to about" (Math/round hps) "hashed objects per second")))))
|
||||
|
||||
(defn bench-small-vectors
|
||||
[]
|
||||
(do-bench api/md5 (take 1000 (sample-seq (gen/vector gen/simple-type-printable) 42 10))))
|
||||
|
||||
(defn bench-small-maps
|
||||
[]
|
||||
(do-bench api/md5 (take 1000 (sample-seq (gen/map
|
||||
gen/simple-type-printable
|
||||
gen/simple-type-printable)
|
||||
42 10))))
|
||||
|
||||
(defn bench-complex
|
||||
[]
|
||||
(do-bench api/md5 (take 1000 (sample-seq gen/any-printable 42 100))))
|
||||
|
||||
|
||||
(comment
|
||||
|
||||
(bench-small-vectors)
|
||||
|
||||
(bench-small-maps)
|
||||
|
||||
(bench-complex)
|
||||
|
||||
)
|