# Welcome to the documentation for Sketch!¶

# Getting Started¶

Sketch has a range of sketch data structures implemented. All of them are available in C++, but only a subset of functionality has been exposed to Python. This documentation is primarily for the Python interface.

## Installation (Python)¶

One-line installation:

```
git clone --recursive https://github.com/dnbaker/sketch && cd sketch/python && python3 setup.py build_ext -j4 && python3 setup.py install
```

At this point, you will simply import sketch from python:

```
import sketch
HLL = sketch.hll.hll
h = HLL(10)
for i in range(10000): h.addh(i)
print("Estimated cardinality: %f" % h.report())
```

## Installation (C++)¶

You don’t. Use sketch as a header-only library, but clone recursively.

## Features¶

- Sketch structure bindings:
Bloom Filter

HyperLogLog

B-bit minhash

Set Sketch

Distance calculation functions

- Miscellaneous
fastmod/fastdiv for integer reductions

ngram hashing

fast hamming space distance calculations

# Modules¶

- There are separate modules for each sketch structure for which there are bindings.
sketch.hll, providing HyperLogLog and comparison, and serialization functions

sketch.bf, providing Bloom Filters and comparison, and serialization functions

sketch.bbmh, providing b-bit minhash implementation + comparison, and serialization functions

sketch.setsketch, providing set sketch + comparison, and serialization functions

For each of these, the module provides construction - either taking parameters or a path to a file. Each of these can be written to and read from a file with .write() and a constructor.

They can be compared with each other with member functions, or you can calculate comparison matrices via
**sketch.util.jaccard_matrix**, **sketch.util.containment_matrix**, **sketch.util.union_size_matrix**, **sketch.util.intersection_matrix**, all of which are in the util module.

Additionally, there are utilities for pairwise distance calculation in the util module.

# Additional utilities: sketch.util¶

- fastdiv/fastmod:
Python bindings for fastdiv/fastmod; See https://arxiv.org/abs/1902.01961

fastdiv_ and fastmod_ are in-place modifications, while the un-suffixed returns a new array

- count_eq
Compute # of equal registers between two 1-d numpy arrays.

- ccount_eq
Compute row-pair-wise equal register counts between two 2-d numpy arrays.

- pcount_eq
Compute row-wise upper triangular distance matrix for equal register counts for 1 2-d numpy array.

- shsisz
Computes intersection size between two sorted hash sets.

- hash
hashes strings

- hash_ngrams
takes a list of strings, and then computes

usage

```
def hash\_ngrams(toks, n=3, seed=0):
:param toks: list of strings
:param n: n- for n-grams, default = 3
:param seed: Set seed for hashing; default = 0
:returns: np.ndarray, with dtype = np.uint64
```

## Computing optimal a and b¶

For lossy compression via quantization, _optimal_ab_ computes the parameter values for best using hash space.

```
1
2SetSketchParams = namedtuple("SetSketchParams", 'a, b')
3
4def optimal_ab(maxv, minv, *, q):
5 '''
6 Calculate a and b for maxv and minv, such that the maxv is mapped to
7 0 and minv's value is mapped to q.
8 :param maxv: float value which is the maximum to be quantized
9 :param minv: float value which is the minimum to be quantized
10 :param q: float or integral value for the ceiling; required.
11 :return: namedtuple SetSketchParams, consisting of (a, b);
12 access through ssp.a, ssp[0], or tuple access
13 '''
14
15 if maxv < minv:
```