Welcome to the documentation for Sketch!¶
Sketch has a range of sketch data structures implemented. All of them are available in C++, but only a subset of functionality has been exposed to Python. This documentation is primarily for the Python interface.
git clone --recursive https://github.com/dnbaker/sketch && cd sketch/python && python3 setup.py build_ext -j4 && python3 setup.py install
At this point, you will simply import sketch from python:
import sketch HLL = sketch.hll.hll h = HLL(10) for i in range(10000): h.addh(i) print("Estimated cardinality: %f" % h.report())
You don’t. Use sketch as a header-only library, but clone recursively.
- Sketch structure bindings:
Distance calculation functions
fastmod/fastdiv for integer reductions
fast hamming space distance calculations
- There are separate modules for each sketch structure for which there are bindings.
sketch.hll, providing HyperLogLog and comparison, and serialization functions
sketch.bf, providing Bloom Filters and comparison, and serialization functions
sketch.bbmh, providing b-bit minhash implementation + comparison, and serialization functions
sketch.setsketch, providing set sketch + comparison, and serialization functions
For each of these, the module provides construction - either taking parameters or a path to a file. Each of these can be written to and read from a file with .write() and a constructor.
They can be compared with each other with member functions, or you can calculate comparison matrices via sketch.util.jaccard_matrix, sketch.util.containment_matrix, sketch.util.union_size_matrix, sketch.util.intersection_matrix, all of which are in the util module.
Additionally, there are utilities for pairwise distance calculation in the util module.
Additional utilities: sketch.util¶
Python bindings for fastdiv/fastmod; See https://arxiv.org/abs/1902.01961
fastdiv_ and fastmod_ are in-place modifications, while the un-suffixed returns a new array
Compute # of equal registers between two 1-d numpy arrays.
Compute row-pair-wise equal register counts between two 2-d numpy arrays.
Compute row-wise upper triangular distance matrix for equal register counts for 1 2-d numpy array.
Computes intersection size between two sorted hash sets.
takes a list of strings, and then computes
def hash\_ngrams(toks, n=3, seed=0): :param toks: list of strings :param n: n- for n-grams, default = 3 :param seed: Set seed for hashing; default = 0 :returns: np.ndarray, with dtype = np.uint64
Computing optimal a and b¶
For lossy compression via quantization, _optimal_ab_ computes the parameter values for best using hash space.
1 2SetSketchParams = namedtuple("SetSketchParams", 'a, b') 3 4def optimal_ab(maxv, minv, *, q): 5 ''' 6 Calculate a and b for maxv and minv, such that the maxv is mapped to 7 0 and minv's value is mapped to q. 8 :param maxv: float value which is the maximum to be quantized 9 :param minv: float value which is the minimum to be quantized 10 :param q: float or integral value for the ceiling; required. 11 :return: namedtuple SetSketchParams, consisting of (a, b); 12 access through ssp.a, ssp, or tuple access 13 ''' 14 15 if maxv < minv: