Welcome to the documentation for Sketch!¶
Getting Started¶
Sketch has a range of sketch data structures implemented. All of them are available in C++, but only a subset of functionality has been exposed to Python. This documentation is primarily for the Python interface.
Installation (Python)¶
One-line installation:
git clone --recursive https://github.com/dnbaker/sketch && cd sketch/python && python3 setup.py build_ext -j4 && python3 setup.py install
At this point, you will simply import sketch from python:
import sketch_ds
HLL = sketch_ds.hll.hll
h = HLL(10)
for i in range(10000): h.addh(i)
print("Estimated cardinality: %f" % h.report())
Installation (C++)¶
You don’t. Use sketch as a header-only library, but clone recursively.
Features¶
- Sketch structure bindings:
Bloom Filter
HyperLogLog
B-bit minhash
Set Sketch
Distance calculation functions
- Miscellaneous
fastmod/fastdiv for integer reductions
ngram hashing
fast hamming space distance calculations
Modules¶
- There are separate modules for each sketch structure for which there are bindings.
sketch_ds.hll, providing HyperLogLog and comparison, and serialization functions
sketch_ds.bf, providing Bloom Filters and comparison, and serialization functions
sketch_ds.bbmh, providing b-bit minhash implementation + comparison, and serialization functions
sketch_ds.setsketch, providing set sketch + comparison, and serialization functions
For each of these, the module provides construction - either taking parameters or a path to a file. Each of these can be written to and read from a file with .write() and a constructor.
They can be compared with each other with member functions, or you can calculate comparison matrices via sketch_ds.util.jaccard_matrix, sketch_ds.util.containment_matrix, sketch_ds.util.union_size_matrix, sketch_ds.util.intersection_matrix, all of which are in the util module.
Additionally, there are utilities for pairwise distance calculation in the util module.
Additional utilities: sketch_ds.util¶
- fastdiv/fastmod:
Python bindings for fastdiv/fastmod; See https://arxiv.org/abs/1902.01961
fastdiv_ and fastmod_ are in-place modifications, while the un-suffixed returns a new array
- count_eq
Compute # of equal registers between two 1-d numpy arrays.
- ccount_eq
Compute row-pair-wise equal register counts between two 2-d numpy arrays.
- pcount_eq
Compute row-wise upper triangular distance matrix for equal register counts for 1 2-d numpy array.
- shsisz
Computes intersection size between two sorted hash sets.
- hash
hashes strings
- hash_ngrams
takes a list of strings, and then computes
usage
def hash\_ngrams(toks, n=3, seed=0):
:param toks: list of strings
:param n: n- for n-grams, default = 3
:param seed: Set seed for hashing; default = 0
:returns: np.ndarray, with dtype = np.uint64
Computing optimal a and b¶
For lossy compression via quantization, _optimal_ab_ computes the parameter values for best using hash space.