Welcome to the documentation for Sketch!

Getting Started

Sketch has a range of sketch data structures implemented. All of them are available in C++, but only a subset of functionality has been exposed to Python. This documentation is primarily for the Python interface.

Installation (Python)

One-line installation:

git clone --recursive https://github.com/dnbaker/sketch && cd sketch/python && python3 setup.py build_ext -j4 && python3 setup.py install

At this point, you will simply import sketch from python:

import sketch
HLL = sketch.hll.hll
h = HLL(10)
for i in range(10000): h.addh(i)
print("Estimated cardinality: %f" % h.report())

Installation (C++)

You don’t. Use sketch as a header-only library, but clone recursively.


  1. Sketch structure bindings:
    1. Bloom Filter

    2. HyperLogLog

    3. B-bit minhash

    4. Set Sketch

  2. Distance calculation functions

  3. Miscellaneous
    1. fastmod/fastdiv for integer reductions

    2. ngram hashing

    3. fast hamming space distance calculations


There are separate modules for each sketch structure for which there are bindings.
  • sketch.hll, providing HyperLogLog and comparison, and serialization functions

  • sketch.bf, providing Bloom Filters and comparison, and serialization functions

  • sketch.bbmh, providing b-bit minhash implementation + comparison, and serialization functions

  • sketch.setsketch, providing set sketch + comparison, and serialization functions

For each of these, the module provides construction - either taking parameters or a path to a file. Each of these can be written to and read from a file with .write() and a constructor.

They can be compared with each other with member functions, or you can calculate comparison matrices via sketch.util.jaccard_matrix, sketch.util.containment_matrix, sketch.util.union_size_matrix, sketch.util.intersection_matrix, all of which are in the util module.

Additionally, there are utilities for pairwise distance calculation in the util module.

Additional utilities: sketch.util

  • fastdiv/fastmod:
    • Python bindings for fastdiv/fastmod; See https://arxiv.org/abs/1902.01961

    • fastdiv_ and fastmod_ are in-place modifications, while the un-suffixed returns a new array

  • count_eq
    • Compute # of equal registers between two 1-d numpy arrays.

  • ccount_eq
    • Compute row-pair-wise equal register counts between two 2-d numpy arrays.

  • pcount_eq
    • Compute row-wise upper triangular distance matrix for equal register counts for 1 2-d numpy array.

  • shsisz
    • Computes intersection size between two sorted hash sets.

  • hash
    • hashes strings

  • hash_ngrams
    • takes a list of strings, and then computes


def hash\_ngrams(toks, n=3, seed=0):

:param toks: list of strings
:param n:    n- for n-grams, default = 3
:param seed: Set seed for hashing; default = 0
:returns: np.ndarray, with dtype = np.uint64

Computing optimal a and b

For lossy compression via quantization, _optimal_ab_ computes the parameter values for best using hash space.

 2SetSketchParams = namedtuple("SetSketchParams", 'a, b')
 4def optimal_ab(maxv, minv, *, q):
 5    '''
 6        Calculate a and b for maxv and minv, such that the maxv is mapped to
 7        0 and minv's value is mapped to q.
 8        :param maxv: float value which is the maximum to be quantized
 9        :param minv: float value which is the minimum to be quantized
10        :param q:    float or integral value for the ceiling; required.
11        :return: namedtuple SetSketchParams, consisting of (a, b);
12                 access through ssp.a, ssp[0], or tuple access
13    '''
15    if maxv < minv:

Indices and tables