k-means Clustering

See also https://en.wikipedia.org/wiki/K-means_clustering

k-means clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means
clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster. This results in a partitioning of the data space
into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are
efficient heuristic algorithms that are commonly employed and converge quickly
to a local optimum. These are usually similar to the expectation-maximization
algorithm for mixtures of Gaussian distributions in that both take an
iterative refinement approach. Additionally, both use cluster centers to model
the data; however, k-means clustering tends to find clusters of comparable
spatial extent, while the expectation-maximization mechanism allows clusters
to have different shapes.

Demonstration of the standard algorithm (steps 1--4 are illustrated in the
code sketch below, after the hierarchical clustering notes):

1. k initial "means" (in this case k = 3) are randomly generated within the
   data domain.
2. k clusters are created by associating every observation with the nearest
   mean. The partitions here represent the Voronoi diagram generated by the
   means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.

As it is a heuristic algorithm, there is no guarantee that it will converge to
the global optimum, and the result may depend on the initial clusters. As the
algorithm is usually very fast, it is common to run it multiple times with
different starting conditions. However, in the worst case, k-means can be very
slow to converge: in particular, it has been shown that there exist certain
point sets, even in 2 dimensions, on which k-means takes exponential time,
that is 2^Omega(n), to converge. These point sets do not seem to arise in
practice: this is corroborated by the fact that the smoothed running time of
k-means is polynomial.

-------------------------------------------------------------------------------
Hierarchical Clustering

See also https://en.wikipedia.org/wiki/Hierarchical_clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a
hierarchy of clusters. Strategies for hierarchical clustering generally fall
into two types:

o Agglomerative: a "bottom up" approach in which each observation starts in
  its own cluster, and pairs of clusters are merged as one moves up the
  hierarchy.
o Divisive: a "top down" approach in which all observations start in one
  cluster, and splits are performed recursively as one moves down the
  hierarchy.

In general, the merges and splits are determined in a greedy manner. The
results of hierarchical clustering are usually presented in a dendrogram.

In the general case, the complexity of agglomerative clustering is
O(n^2 log(n)), while divisive clustering with an exhaustive search is O(2^n),
which is even worse. However, for some special cases, optimal efficient
agglomerative methods (of complexity O(n^2)) are known: SLINK for
single-linkage and CLINK for complete-linkage clustering.
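
As an illustration, here is a minimal Python sketch of both methods, assuming
NumPy and SciPy are available. The kmeans function implements steps 1--4 of
the standard algorithm above (initializing the means at randomly chosen
observations, the Forgy variant, rather than uniformly over the domain), and
the last two lines run agglomerative single-linkage clustering with
scipy.cluster.hierarchy (method='single' optimizes the same criterion as
SLINK, though not necessarily via the SLINK algorithm itself):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)

    # Toy data: three Gaussian blobs in 2D.
    data = np.vstack([rng.normal(c, 0.3, size=(50, 2))
                      for c in ((0.0, 0.0), (3.0, 0.0), (0.0, 3.0))])

    def kmeans(data, k, n_iter=100):
        # Step 1: k initial "means"; here they are drawn from the
        # observations themselves (Forgy variant).
        means = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: associate every observation with the nearest mean;
            # this partition is the Voronoi diagram generated by the means.
            dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: the centroid of each cluster becomes the new mean
            # (empty clusters are not handled in this toy sketch).
            new_means = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
            # Step 4: repeat steps 2 and 3 until the means stop moving.
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, labels

    means, labels = kmeans(data, k=3)

    # Agglomerative ("bottom up") hierarchical clustering for comparison:
    # single linkage, cut into 3 flat clusters.
    Z = linkage(data, method='single')
    hier_labels = fcluster(Z, t=3, criterion='maxclust')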
-------------------------------------------------------------------------------
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, ``A
Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise'', in _Proceedings of 2nd International Conference on Knowledge
Discovery and Data Mining (KDD-96)_, edited by Evangelos Simoudis, Jiawei Han
and Usama M. Fayyad, AAAI Press, 1996, 226--231, ISBN: 1-57735-004-9
(CiteSeerX: 10.1.1.71.1980).

Development of the DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) algorithm (also see the Wikipedia page on DBSCAN for clearer
pseudocode describing the workings of the algorithm). Basically, clusters
consist of core (interior) points and border (boundary) points. The point q is
"directly density-reachable" from the point p if q is in the epsilon
neighborhood of p (for a given epsilon) and this neighborhood contains a
minimal number of points (MinPts, also specified). This relationship is
symmetric between core points, but not if one point is a core point and the
other lies in the border region. q is "density-reachable" from p if there is a
series of points p_i connecting them (p = p_1, q = p_n) such that p_(i+1) is
directly density-reachable from p_i. Again, this relationship is not
necessarily symmetric. p and q are "density-connected" (a symmetric
relationship) if there is a point o such that p and q are both
density-reachable from o. A "cluster" is then a set of points that are
mutually density-connected. If a new point is density-connected to any part of
an existing cluster, it is part of the cluster as well. Noise points are any
points that do not belong to a cluster.

epsilon and MinPts may be specified on a cluster-by-cluster basis, but
typically they are global parameters determined for the least dense cluster in
the database. MinPts = 4 was used for 2D data (the Wikipedia page suggests
MinPts >= spatial dimension + 1). To determine epsilon, the sorted k-distance
graph is generated with k = MinPts: it tabulates the distance from each point
to its kth nearest neighbor, sorted from largest to smallest distance. Good
values of epsilon occur where this graph makes a strong bend. The run time of
the algorithm is slightly higher than linear in the number of points, and it
seems to detect arbitrarily shaped clusters quite well compared to earlier
methods.

There are at least 4 MATLAB implementations (adjudged by speed and stability
under coordinate reordering):

o M. Daszykowski, B. Walczak and D. L. Massart, ``Looking for natural patterns
  in data. Part 1: Density-based approach'', _Chemometrics and Intelligent
  Laboratory Systems_, Volume 56, Issue 2, May 2001, 83--92. [fast and stable
  under coordinate reordering]
o Peter Kovesi, Centre for Exploration Targeting, The University of Western
  Australia, 2013. [slow and NOT stable under coordinate reordering]
o Carolyn Pehlke, SpatioTemporal Modeling Center, University of New Mexico,
  2013. [somewhat slow and NOT stable under coordinate reordering]
o Thanh N. Tran, Klaudia Drab and Michal Daszykowski, ``Revised DBSCAN
  algorithm to cluster data with dense adjacent clusters'', _Chemometrics and
  Intelligent Laboratory Systems_, Volume 120, January 2013, 92--96
  (DOI: 10.1016/j.chemolab.2012.11.006). [supposed to be stable under
  coordinate reordering, but in practice, this does not seem to be the case]
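
A minimal Python sketch of the epsilon-selection recipe and of running DBSCAN
itself, assuming scikit-learn is available; its eps and min_samples parameters
play the roles of epsilon and MinPts (note that scikit-learn counts the point
itself as part of its neighborhood), and the toy data set is hypothetical:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)

    # Two dense blobs plus sparse uniform background "noise".
    points = np.vstack([rng.normal((0.0, 0.0), 0.2, size=(100, 2)),
                        rng.normal((2.0, 2.0), 0.2, size=(100, 2)),
                        rng.uniform(-1.0, 3.0, size=(40, 2))])

    min_pts = 4  # MinPts; the value used above for 2D data

    # Sorted k-distance graph with k = MinPts: distance from each point to
    # its kth nearest neighbor, sorted from largest to smallest.  A good
    # epsilon sits at the "strong bend" (knee) of this curve.
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(points)
    dists, _ = nn.kneighbors(points)            # column 0 is the point itself
    k_dist = np.sort(dists[:, min_pts])[::-1]   # inspect this for the knee

    eps = 0.15  # chosen here by eye; in practice read it off k_dist

    # Label -1 marks noise points; other labels index the clusters found.
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points)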
-------------------------------------------------------------------------------
Getis based Clustering (developed by Carolyn Pehlke)

Getis statistic:

   G_i(d) = sum(w_ij(d) x_j, j != i) / sum(x_j, j != i)

where x_j is the intensity of the jth point (n points total) and w_ij(d) is a
symmetric weight matrix that is a function of the distance d separating points
i and j.

J. K. Ord and Arthur Getis, ``Local Spatial Autocorrelation Statistics:
Distributional Issues and an Application'', _Geographical Analysis_, Volume
27, Number 4, October 1995, 286--306.

Getis based clustering analysis is done as follows:

o The Getis statistic produces an intensity heat map yielding seed points.
o The local length scale of clustered structures is estimated via the first
  critical point in G vs r plots for each seed point.
o r_crit acts as epsilon for DBSCAN style clustering.
o Good for densely structured data.

In detail, Ripley's L-function is applied to the whole ROI, producing a
maximum cutoff distance d, which is a rough measure of the prevailing cluster
sizes. Next, a series of binary weight matrices are computed for r = 0 to d;
the entries of these matrices are 1 for pairs of points separated by a
distance <= r, and 0 otherwise. The ROI is divided up into a grid of
subregions, each region containing one or more pixels. The super-resolution
localizations are then converted into a histogram image, that is, each
subregion counts the number of localizations contained within it. The
histogram image is binarized, that is, if the number of localizations in a
subregion is >= 1, the value is set to 1. The Getis G statistic is then
computed for each subregion and each r, producing a 3D array of G values. Note
that G_i(r) is the weighted number of particles (in this case, the number of
particles excluding particle i that are within a distance r of i) divided by
the total number of particles. The maximum value of G over r for each
subregion in the 3D array is determined, and then used to construct a new 2D
matrix of G values (this is a projection of the maximum values of the
statistic as a function of r for each subregion back into 2D form). These
values are proportional to the degree to which localizations are clustered
about each subregion. Local maxima of the 2D G-matrix are then located; they
will be used as seed points. Next, for each local maximum, G values are
calculated over a range of distances to produce a G versus distance curve. The
first critical point of this curve (where its first derivative is zero),
r_crit, defines a critical radius indicating the approximate domain size.
Finally, for each seed point, pixels within r_crit are grouped into a cluster,
and overlapping clusters originating from different seed points are merged.

Michelle S. Itano, Matthew S. Graus, Carolyn Pehlke, Michael J. Wester, Ping
Liu, Keith A. Lidke, Nancy L. Thompson, Ken Jacobson and Aaron K. Neumann,
``Super-resolution imaging of C-type lectin spatial rearrangement within the
dendritic cell plasma membrane at fungal microbe contact sites'', _Frontiers
in Physics, section Membrane Physiology and Membrane Biophysics_, Volume 2,
Number 46, August 2014, 1--17 (DOI: 10.3389/fphy.2014.00046).
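
A minimal Python sketch of the G_i(d) statistic itself, under the
binary-weight convention used above (w_ij(d) = 1 when points i and j are
within distance d of each other, 0 otherwise); the full seed-point pipeline
(histogram image, maximum projection, r_crit estimation) is not reproduced
here:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def getis_G(points, x, d):
        # Binary symmetric weights: w_ij(d) = 1 when points i and j are
        # within distance d of each other (and i != j), 0 otherwise.
        D = squareform(pdist(points))
        w = ((D <= d) & ~np.eye(len(points), dtype=bool)).astype(float)
        num = w @ x           # sum(w_ij(d) x_j, j != i) for every i
        den = x.sum() - x     # sum(x_j, j != i) for every i
        return num / den

    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 10.0, size=(200, 2))
    x = np.ones(len(pts))           # unit intensities: G_i(d) reduces to the
    G = getis_G(pts, x, d=1.0)      # fraction of the other points within d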
-------------------------------------------------------------------------------
Voronoi based Clustering

A Voronoi diagram or tessellation is a partitioning of the plane into regions,
each containing one seed point, such that every point on a boundary edge
segment is equidistant from the nearest seed points.

A Voronoi based method for segmenting super-resolution data is described. A
Voronoi diagram of image observations is produced, where polygon P^0_i
corresponds to seed (observation) s_i and has area A^0_i. First rank neighbors
are those polygons P^1_ij sharing edges with P^0_i, which when combined
together with P^0_i form the rank 1 polygon P^1_i. The rank 1 area A^1_i is
then the sum of the areas A^1_ij plus A^0_i. From these, densities delta^k (k
is the rank) can be computed. At this point, object segmentation (clustering)
is initiated. For example, if delta is the average density of the data
distribution, then clustering proceeds by merging all 1st rank polygons that
satisfy delta^1_i > alpha delta, where alpha > 0 (alpha = 2 is a typical
choice). An extension to this idea is to also include a shortest distance
parameter Delta such that clustering only occurs if data points are also
closer than Delta. In addition, by choosing multiple alphas, different levels
of clustering can be considered. Finally, these ideas can be applied to
combining multiple observations into single localizations by examining
multiple movie frames and searching within a distance tolerance omega.
However, doing so assumes a photoactivation model for fluorescent proteins,
which is not always appropriate; for example, dSTORM probes do not
photobleach, but rather switch on and off.

Florian Levet, Eric Hosy, Adel Kechkar, Corey Butler, Anne Beghin, Daniel
Choquet and Jean-Baptiste Sibarita, ``SR-Tesseler: a method to segment and
quantify localization-based super-resolution microscopy data'', _Nature
Methods_, Volume 12, Number 11, 2015, 1065--1071 (DOI: 10.1038/NMETH.3579).

-------------------------------------------------------------------------------
Hopkins' Statistic

The Hopkins' statistic (H) tests for complete spatial randomness of a probe
pattern by comparing nearest neighbor distances from random points and from
randomly chosen probes. If the number of probes in the set S (an image) is n,
choose m << n random sampling locations s_j and randomly chosen probes p_j,
then compute

   U = sum(d^2(s_j, S), j = 1 .. m)
   W = sum(d^2(p_j, S), j = 1 .. m)

where

   d(p_j, S) = min{ d(p_j, p_k) for all p_k in S, p_k != p_j }

and d(p_j, p_k) = || p_j - p_k || is the distance between p_j and p_k (for the
randomly chosen probes, the minimum excludes the probe itself). The Hopkins'
statistic is defined as

   H = U / (U + W)

and will lie in the interval [0, 1]. For good results, H should be computed
multiple times for a single image.

   H = 1/2 for completely random probes
   H = 1   for completely clustered probes

-------------------------------------------------------------------------------
Ripley's Statistics

Ripley's K analysis tests for clustering and co-clustering by comparing the
average number of probes in a disk of radius r about each of the probes with
the average density of probes over the region considered.

   K(r) = A/n^2 sum(sum(delta_ij, j = 1 .. n, j != i), i = 1 .. n)

where A is the area of the domain, n is the number of points, and delta_ij = 1
if the distance d(i, j) < r, otherwise 0. This counts the number of points
encircled by concentric rings centered on each point, normalized by the
average density of the region. This can be linearized as

   L(r) = sqrt(K(r) / pi)

so that

   L(r) < r   points are less clustered than random
   L(r) = r   points are clustered as in a random distribution
   L(r) > r   points are more clustered than random

In a bivariate analysis (where two different sets of probes in the same region
are considered), the L function acts as above except that now the cluster size
refers to clusters of co-mingled points from the two sets.

Jun Zhang, Karin Leiderman, Janet R. Pfeiffer, Bridget S. Wilson, Janet M.
Oliver and Stanly L. Steinberg, ``Characterizing the Topography of Membrane
Receptors and Signaling Molecules from Spatial Patterns Obtained using
Nanometer-scale Electron-dense Probes and Electron Microscopy'', _Micron_,
Volume 37, Issue 1, January 2006, 14--34 (DOI: 10.1016/j.micron.2005.03.014).

Dendrograms (method for estimating cluster epsilon):

Flor A. Espinoza, Janet M. Oliver, Bridget S. Wilson and Stanly L. Steinberg,
``Using Hierarchical Clustering and Dendrograms to Quantify the Clustering of
Membrane Proteins'', _Bulletin of Mathematical Biology_, Volume 74, Issue 1,
January 2012, 190--211 (PMID: 21751075, PMCID: PMC3429354).
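
Minimal Python sketches of the Hopkins' and Ripley's statistics above,
assuming the points lie in a rectangular region of known area; edge
corrections, which careful Ripley analyses usually apply near the region
boundary, are omitted:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    def hopkins(points, m, bounds):
        # H = U / (U + W) on a rectangle bounds = (xmin, xmax, ymin, ymax).
        xmin, xmax, ymin, ymax = bounds
        tree = cKDTree(points)
        # U: squared nearest-neighbor distances from m random locations.
        s = np.column_stack([rng.uniform(xmin, xmax, m),
                             rng.uniform(ymin, ymax, m)])
        U = (tree.query(s, k=1)[0] ** 2).sum()
        # W: squared nearest-neighbor distances from m randomly chosen
        # probes; k=2 because each probe's nearest neighbor in S is itself.
        p = points[rng.choice(len(points), size=m, replace=False)]
        W = (tree.query(p, k=2)[0][:, 1] ** 2).sum()
        return U / (U + W)

    def ripley_L(points, r, area):
        # K(r) = A/n^2 * #{ordered pairs i != j with d(i, j) < r}; pdist
        # yields each unordered pair once, hence the factor of 2.
        n = len(points)
        K = area / n**2 * 2.0 * (pdist(points) < r).sum()
        return np.sqrt(K / np.pi)   # linearized statistic L(r)

    pts = rng.uniform(0.0, 10.0, size=(500, 2))    # random, 10 x 10 region
    H = hopkins(pts, m=50, bounds=(0, 10, 0, 10))  # expect H near 1/2
    L = ripley_L(pts, r=1.0, area=100.0)           # expect L(r) near r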
-------------------------------------------------------------------------------
H-SET (Hierarchical Single Emitter hypothesis Test)

H-SET is a top-down hierarchical clustering algorithm (implemented in MATLAB)
that collapses clusters of observations of blinking fluorophores into a single
estimate of the true location of each fluorophore using a log-likelihood
hypothesis test.

For a cluster of observations to be collapsed into a single fluorophore
position, we make a hypothesis test with the null hypothesis that all
observations come from the same fluorophore. The null hypothesis is not
rejected if the p-value is larger than a specified level of significance. The
p-value is calculated using the log-likelihood ratio statistic, R, which is
calculated as

   R = -2 log(L / L0)

where L is the likelihood of the N observations given a fluorophore located at
the maximum likelihood position and given the observed localization
uncertainties, and L0 is the likelihood of the N positions assuming there are
N independent fluorophores. In M dimensions, R is chi-squared distributed with
M (N - 1) degrees of freedom.

The maximum likelihood estimate of the collapsed position is the
variance-weighted mean of the observed positions:

   x_bar = sum(x_i / sigma'_i^2, i = 1 .. N) / sum(1 / sigma'_i^2, i = 1 .. N)

where

   sigma'_i^2 = sigma_i^2 + sigma_reg^2

Here, x_i and sigma_i are the observed positions of the fluorophores and the
uncertainties in the observed positions, respectively, while sigma_reg is the
registration error and sigma'_i is the modified uncertainty including the
effects of drift correction. The uncertainty in the position of the collapsed
fluorophore is then calculated from the variance of the weighted mean:

   sigma_bar^2 = 1 / sum(1 / sigma'_i^2, i = 1 .. N)

The algorithm proceeds in two passes. The first makes use of the MATLAB
standard hierarchical clustering algorithm (linkage) to create a binary tree
of relationships connecting observations together in clusters of varying
sizes. Each node of the tree corresponds to a cluster of one or more
observations, that is, each node splits into two branches which end at either
a single observation or a node representing a smaller cluster of observations.
The algorithm starts at the top node, which collects together all the
observations (the tree grows downward), and checks whether this cluster of
observations could be due to a point source according to the level of
significance (LoS) assigned by the user (the default is LoS = 0.01). If yes
(P > LoS), the cluster is collapsed into a single localization and the
algorithm terminates at this node. If no (P <= LoS), the two branches are
traversed in turn, making the same check on the nodes at their ends. Anytime
the test succeeds, all the observations represented by that node are collapsed
into a single localization and the algorithm does not descend further. The
final output is the set of localizations discovered, which may be collapsed
clusters of observations or isolated observations that were not collected into
any larger cluster by the algorithm and so were deemed singlet localizations.
We chose to traverse the linkage tree top-down as (1) the significance test
works better for larger clusters, and (2) bottom-up traversal could produce
different results depending on which points were collapsed first, that is, on
the order of collapsing.

The second pass of the algorithm uses a user-specified clustering algorithm
such as hierarchical, DBSCAN, Getis based, etc., to cluster together the
localizations produced by the first pass. It then checks each cluster it finds
in turn with the above significance test and collapses any cluster that passes
(that is, any cluster that could be due to a point source localization). Most
of the collapsing is typically done by the first pass; the second pass
sometimes finds a few more small clusters to collapse.

Jia Lin, Michael J. Wester, Matthew S. Graus, Keith A. Lidke and Aaron K.
Neumann, ``Nanoscopic cell wall architecture of an immunogenic ligand in
_Candida albicans_ during antifungal drug treatment'', _Molecular Biology of
the Cell_, Volume 27, Number 6, March 15, 2016, 1002--1014 (DOI:
10.1091/mbc.E15-06-0355, PMID: 26792838).
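
A minimal Python sketch of the single-emitter test applied at one node,
assuming independent Gaussian localization errors so that R reduces to the
weighted sum of squared residuals about the weighted mean; the function and
variable names here are illustrative, not those of the MATLAB implementation:

    import numpy as np
    from scipy.stats import chi2

    def collapse_test(x, sigma, sigma_reg, level_of_significance=0.01):
        # x: (N, M) observed positions; sigma: (N,) position uncertainties.
        N, M = x.shape
        var = sigma**2 + sigma_reg**2       # sigma'_i^2
        w = 1.0 / var                       # weights 1 / sigma'_i^2
        x_bar = (w[:, None] * x).sum(axis=0) / w.sum()  # weighted mean
        sigma_bar = np.sqrt(1.0 / w.sum())  # uncertainty of collapsed position
        # For independent Gaussian errors, R = -2 log(L / L0) reduces to the
        # weighted sum of squared residuals about the weighted mean.
        R = (w * ((x - x_bar) ** 2).sum(axis=1)).sum()
        p = chi2.sf(R, df=M * (N - 1))      # p-value, chi^2 with M(N-1) dof
        return p > level_of_significance, x_bar, sigma_bar, p

    rng = np.random.default_rng(0)
    sigma = np.full(10, 0.02)               # e.g., 20 nm (units of microns)
    obs = np.array([5.0, 5.0]) + rng.normal(0.0, 0.02, size=(10, 2))
    collapse, x_bar, sigma_bar, p = collapse_test(obs, sigma, sigma_reg=0.01)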
-------------------------------------------------------------------------------
Pair Auto- and Cross-Correlation

Sarah L. Veatch, Benjamin B. Machta, Sarah A. Shelby, Ethan N. Chiang, David
A. Holowka and Barbara A. Baird, ``Correlation Functions Quantify
Super-Resolution Images and Estimate Apparent Clustering Due to
Over-Counting'', _PLoS ONE_, Volume 7, Issue 2, February 2012, 1--13.

Derivation of molecular pair auto-correlation and cross-correlation functions,
g(r) and c(r), that report the increased probability of finding a second
localized signal a distance r away from a given localized signal in
super-resolution images:

   c(r) = Re[ FFT^(-1)( FFT(I_1) * conjugate(FFT(I_2)) ) / ( rho_1 rho_2 N(r) ) ]
   g(r) = FFT^(-1)( |FFT(I)|^2 ) / ( rho^2 N(r) )

where I_1, I_2, I are images; rho_1, rho_2, rho are average molecular surface
densities; and N(r) is a normalization that accounts for the finite size of
the acquired image (the fact that there are fewer possible pairs separated by
large distances due to the finite image size):

   N(r) = FFT^(-1)( |FFT(W)|^2 )

with W a binary image with value 1 inside the measurement area. All images are
padded with zeros in both directions out to a distance larger than the range
of the desired correlation function (maximally the size of the original image)
to avoid artifacts due to the periodic nature of FFT functions. This
computational method of tabulating pair auto- and cross-correlations is
mathematically identical to brute force averaging methods.

Note

   g(r) = < rho(R) rho(R - r) > / rho^2

where the average is over all positions R in the image. In this definition,
g(r) = 1 represents a random distribution. Often, it can be assumed that g(r)
is symmetric under rotations, so it is averaged over angles. At r = 0, g(r)
contains the term (1 / rho) delta(r), since rho(r) = sum_i delta(r - r_i) is
the density of (point mass) molecules as a function of r.
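
A minimal Python sketch of the auto-correlation formula above; the
cross-correlation c(r) differs only in using FFT(I_1) * conjugate(FFT(I_2))
and rho_1 rho_2. The padding factor and the Poisson toy image are illustrative
choices:

    import numpy as np

    def pair_autocorrelation(I, W):
        # g = FFT^-1(|FFT(I)|^2) / (rho^2 N(r)), N(r) = FFT^-1(|FFT(W)|^2).
        # I: 2D histogram image of localizations; W: binary mask (1 inside
        # the measurement area).  Zero-padding to twice the image size
        # avoids the wrap-around artifacts of the periodic FFT.
        ny, nx = I.shape
        shape = (2 * ny, 2 * nx)
        rho = I.sum() / W.sum()              # average density (per pixel)
        G = np.fft.fftshift(
            np.fft.ifft2(np.abs(np.fft.fft2(I, s=shape)) ** 2).real)
        N = np.fft.fftshift(
            np.fft.ifft2(np.abs(np.fft.fft2(W, s=shape)) ** 2).real)
        with np.errstate(divide='ignore', invalid='ignore'):
            g = G / (rho**2 * N)
        return np.where(N > 0.5, g, 0.0)     # zero where no pairs exist

    rng = np.random.default_rng(0)
    img = rng.poisson(0.05, size=(256, 256)).astype(float)  # random points
    mask = np.ones_like(img)
    g2d = pair_autocorrelation(img, mask)    # expect ~1 away from the center
    # An angular average of g2d about its center yields the usual g(r) curve.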
Fitting the g(r) or c(r) curve produces estimates of cluster and localization
sizes and densities.

Prabuddha Sengupta, Tijana Jovanovic-Talisman, Dunja Skoko, Malte Renz, Sarah
L. Veatch and Jennifer Lippincott-Schwartz, ``Probing protein heterogeneity in
the plasma membrane using PALM and pair correlation analysis'', _Nature
Methods_, Volume 8, Number 11, November 2011, 969--975.

Derivation of a mathematical approach for correlating all peaks in a PALM
(photoactivated localization microscopy) image:

   g(r)^peaks = (g(r)^centroid + g(r)^protein) * g(r)^PSF

where * represents convolution, and

   g(r)^PSF       correlation function of the effective point spread function
                  of uncertainty in localization
   g(r)^centroid  protein correlation function at r = 0; its convolution with
                  g(r)^PSF quantifies the correlation arising from multiple
                  appearances of the same molecule and is defined as
                  g(r)^stoch
   g(r)^protein   protein correlation function at r > 0; its convolution with
                  g(r)^PSF represents the correlation function of the relative
                  spatial distribution of the proteins

   g(r)^PSF = 1 / (4 pi sigma_s^2) exp(-r^2 / (4 sigma_s^2))

   => g(r)^stoch = 1 / (4 pi sigma_s^2 rho^average) exp(-r^2 / (4 sigma_s^2))

where rho^average is the average density of the protein and sigma_s is the
average uncertainty in single molecule localization.

If the proteins are randomly distributed, g(r)^protein (r > 0) ~ 1. If the
proteins are clustered,

   g(r)^protein = A exp(-r / xi) + 1

where A is a measure of the protein density in the domain, and the correlation
length xi gives an estimate of the radius of the domain. Hence,

   g(r)^peaks = 1 / (4 pi sigma_s^2 rho^average) exp(-r^2 / (4 sigma_s^2))
                + (A exp(-r / xi) + 1) * g(r)^PSF

and so the average number of proteins per cluster is

   N^cluster = 1 + rho^average int((g(r)^protein - 1) 2 pi r dr, r = 0 .. inf)
             ~ 2 A pi xi^2 rho^average

and psi^cluster, the increased density of proteins in clusters relative to the
overall average density of proteins across the whole image, is

   psi^cluster = rho^cluster / rho^average ~ 2 A

Autocorrelation functions, g(r), are computed as in the Veatch paper.
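
As a short worked example of the last two formulas, with hypothetical fitted
values (illustrative only, not taken from the paper):

    import numpy as np

    # Hypothetical fit results for g(r)^protein = A exp(-r / xi) + 1:
    A = 4.0                # protein density contrast in the domain
    xi = 30.0              # correlation length ~ domain radius (nm)
    rho_average = 2.0e-4   # average protein density (per nm^2)

    # N^cluster = 1 + rho^average int((g^protein - 1) 2 pi r dr, 0 .. inf);
    # the integral of A exp(-r / xi) 2 pi r over 0 .. inf is 2 pi A xi^2.
    N_cluster = 1.0 + 2.0 * np.pi * A * xi**2 * rho_average

    # psi^cluster = rho^cluster / rho^average ~ 2 A
    psi_cluster = 2.0 * A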