This repo contains a high-performance implementation of BanditPAM from the paper "BanditPAM: Almost Linear-Time k-Medoids Clustering". The code can be called directly from Python, R, or C++.
If you use this software, please cite:
Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony. "BanditPAM: Almost Linear Time *k*-medoids Clustering via Multi-Armed Bandits" Advances in Neural Information Processing Systems (NeurIPS) 2020.
To get started, run `python -m pip install banditpam` (Python) or `install.packages("banditpam")` (R) and jump to the examples. If you have any difficulties, please see the platform-specific guides and file a GitHub issue if you have additional trouble.
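The Python package can be installed either from PyPI (recommended) with `python -m pip install banditpam`, or from the source code by running `python -m pip install .` from the repository root.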
Please see here.
Documentation for BanditPAM can be found on Read the Docs.
Please note that it is NOT necessary to build the C++ executable from source to use the Python code above. However, if you would like to use the C++ executable directly, follow the instructions below.
We highly recommend building using Docker. You can download and install Docker by following the instructions at the Docker install page. Once Docker is installed and the Docker daemon is running, run the following commands:
which will start a Docker instance with the necessary dependencies. Then:
This will create an executable named `BanditPAM` in `BanditPAM/build/src`.
Building this repository requires four external dependencies:
If installing these requirements from source, one can generally use the following procedure to install each requirement from the library's root folder (with `armadillo` used as an example here):
Note that `CARMA` has different installation instructions; see its instructions.
Further installation information for macOS, Linux, and Windows is available in the [docs folder](docs). Ensure all the requirements above are installed and then run:
This will create an executable named `BanditPAM` in `BanditPAM/build/src`.
Once the executable has been built, it can be invoked with:
- `-f` is mandatory and specifies the path to the dataset
- `-k` is mandatory and specifies the number of clusters with which to fit the data

For example, if you ran `./env_setup.sh` and downloaded the MNIST dataset, you could run:
The expected output in the command line will be:
One of the advantages of $k$-medoids is that it works with arbitrary distance metrics; in fact, your "metric" need not even be a real metric – it can be negative, asymmetric, and/or not satisfy the triangle inequality or homogeneity. Any pairwise dissimilarity function works with $k$-medoids.
This also allows for clustering of "exotic" objects like trees, graphs, natural language, and more – settings where running $k$-means wouldn't even make sense. We talk about one such setting in the full paper.
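To make the point about "exotic" objects concrete, here is a minimal sketch that clusters strings under edit distance. It finds the medoids by exhaustive search over all subsets, which is only feasible for tiny inputs; it is not the BanditPAM algorithm and does not use this package's API. The helper names (`edit_distance`, `k_medoids_loss`, `exhaustive_k_medoids`) are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def k_medoids_loss(points, medoids, dissimilarity):
    """Sum, over all points, of the dissimilarity to the closest medoid."""
    return sum(min(dissimilarity(p, m) for m in medoids) for p in points)


def exhaustive_k_medoids(points, k, dissimilarity):
    """Try every size-k subset of the points and keep the one with lowest loss."""
    return min(
        combinations(points, k),
        key=lambda medoids: k_medoids_loss(points, medoids, dissimilarity),
    )


words = ["cat", "cart", "care", "dog", "dot", "dig"]
medoids = exhaustive_k_medoids(words, 2, edit_distance)
print(medoids, k_medoids_loss(words, medoids, edit_distance))
# prints: ('cart', 'dog') 4
```

Nothing in the sketch requires `dissimilarity` to be symmetric or to satisfy the triangle inequality; any function that returns a comparable number for a pair of objects can be passed in. Taking a mean of strings, as $k$-means would need to, is not defined, whereas a medoid is always one of the input objects.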
The package currently supports a number of distance metrics, including all $L_p$ losses and cosine distance.
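For reference, the textbook definitions of these dissimilarities can be sketched in a few lines of plain Python. This illustrates the mathematics only; it assumes the standard definitions and is not taken from the package's C++ code, whose exact conventions (for example, how a metric is selected by name) are not shown here. The function names are hypothetical.

```python
import math


def lp_distance(x, y, p=2.0):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p); p = inf gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


x, y = [0.0, 0.0], [3.0, 4.0]
print(lp_distance(x, y, p=1))                   # Manhattan: 7.0
print(lp_distance(x, y, p=2))                   # Euclidean: 5.0
print(lp_distance(x, y, p=math.inf))            # Chebyshev: 4.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 1.0
```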
If you're willing to write a little C++, you only need to add a few lines to kmedoids_algorithm.cpp and kmedoids_algorithm.hpp to implement your distance metric / pairwise dissimilarity!
Then, be sure to re-install the repository with `python -m pip install .` (note the trailing `.`).
The maintainers of this repository are working on permitting arbitrary dissimilarity metrics that users write in Python, as well; see #4.
To run the full suite of tests, run in the root directory:
Alternatively, from the main repo folder, run `python tests/test_smaller.py` for a smaller set of tests, or `python tests/test_larger.py` for a set of longer, more intensive tests.
Mo Tiwari wrote the original Python implementation of BanditPAM and many features of the C++ implementation. Mo now maintains the C++ implementation.
James Mayclin developed the initial C++ implementation of BanditPAM.
The original BanditPAM paper was published by Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, and Ilan Shomorony.
We would like to thank Jerry Quinn, David Durst, Geet Sethi, and Max Horton for helpful guidance regarding the C++ implementation.