.. Sketch documentation master file, created by
sphinx-quickstart on Thu Apr 1 14:51:46 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to the documentation for Sketch!
==================================
.. toctree::
:maxdepth: 2
:caption: Contents::
Getting Started
============
Sketch has a range of sketch data structures implemented. All of them are available in C++, but only a subset of functionality has been exposed to Python.
This documentation is primarily for the Python interface.
Installation (Python)
---------
One-line installation::
git clone --recursive https://github.com/dnbaker/sketch && cd sketch/python && python3 setup.py build_ext -j4 && python3 setup.py install
At this point, you will simply import sketch from python::
import sketch
HLL = sketch.hll.hll
h = HLL(10)
for i in range(10000): h.addh(i)
print("Estimated cardinality: %f" % h.report())
Installation (C++)
-------------
You don't. Use sketch as a header-only library, but clone recursively.
Features
------
1. Sketch structure bindings:
1. Bloom Filter
2. HyperLogLog
3. B-bit minhash
4. Set Sketch
2. Distance calculation functions
3. Miscellaneous
1. fastmod/fastdiv for integer reductions
2. ngram hashing
3. fast hamming space distance calculations
Modules
========
There are separate modules for each sketch structure for which there are bindings.
* sketch.hll, providing HyperLogLog and comparison, and serialization functions
* sketch.bf, providing Bloom Filters and comparison, and serialization functions
* sketch.bbmh, providing b-bit minhash implementation + comparison, and serialization functions
* sketch.setsketch, providing set sketch + comparison, and serialization functions
For each of these, the module provides construction - either taking parameters or a path to a file.
Each of these can be written to and read from a file with .write() and a constructor.
They can be compared with each other with member functions, or you can calculate comparison matrices via
**sketch.util.jaccard\_matrix**, **sketch.util.containment\_matrix**, **sketch.util.union\_size\_matrix**, **sketch.util.intersection_matrix**, all of which are in the util module.
Additionally, there are utilities for pairwise distance calculation in the `util` module.
Additional utilities: sketch.util
=====================
* fastdiv/fastmod:
* Python bindings for fastdiv/fastmod; See https://arxiv.org/abs/1902.01961
* fastdiv\_ and fastmod\_ are in-place modifications, while the un-suffixed returns a new array
* count\_eq
* Compute # of equal registers between two 1-d numpy arrays.
* ccount\_eq
* Compute row-pair-wise equal register counts between two 2-d numpy arrays.
* pcount\_eq
* Compute row-wise upper triangular distance matrix for equal register counts for 1 2-d numpy array.
* shs\isz
* Computes intersection size between two sorted hash sets.
* hash
* hashes strings
* hash\_ngrams
* takes a list of strings, and then computes
usage ::
def hash\_ngrams(toks, n=3, seed=0):
:param toks: list of strings
:param n: n- for n-grams, default = 3
:param seed: Set seed for hashing; default = 0
:returns: np.ndarray, with dtype = np.uint64
Computing optimal a and b
------
For lossy compression via quantization, _optimal\_ab_ computes the parameter values for best using hash space.
.. literalinclude:: ../../python/sketch/__init__.py
:language: python
:linenos:
:lines: 11-25
Indices and tables
------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`