k-means Clustering

See also https://en.wikipedia.org/wiki/K-means_clustering

k-means clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means
clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster. This results in a partitioning of the data space
into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are
efficient heuristic algorithms that are commonly employed and converge quickly
to a local optimum. These are usually similar to the expectation-maximization
algorithm for mixtures of Gaussian distributions in that both take an
iterative refinement approach. Additionally, both use cluster centers to model
the data; however, k-means clustering tends to find clusters of comparable
spatial extent, while the expectation-maximization mechanism allows clusters
to have different shapes.

Demonstration of the standard algorithm (steps 1--4 are illustrated in the
code sketch below, after the hierarchical clustering notes):

1. k initial "means" (in this case k = 3) are randomly generated within the
   data domain.
2. k clusters are created by associating every observation with the nearest
   mean. The partitions here represent the Voronoi diagram generated by the
   means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.

As it is a heuristic algorithm, there is no guarantee that it will converge to
the global optimum, and the result may depend on the initial clusters. As the
algorithm is usually very fast, it is common to run it multiple times with
different starting conditions. However, in the worst case, k-means can be very
slow to converge: in particular, it has been shown that there exist certain
point sets, even in 2 dimensions, on which k-means takes exponential time,
that is 2^Omega(n), to converge. These point sets do not seem to arise in
practice: this is corroborated by the fact that the smoothed running time of
k-means is polynomial.

-------------------------------------------------------------------------------
Hierarchical Clustering

See also https://en.wikipedia.org/wiki/Hierarchical_clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a
hierarchy of clusters. Strategies for hierarchical clustering generally fall
into two types:

o Agglomerative: a "bottom up" approach in which each observation starts in
  its own cluster, and pairs of clusters are merged as one moves up the
  hierarchy.
o Divisive: a "top down" approach in which all observations start in one
  cluster, and splits are performed recursively as one moves down the
  hierarchy.

In general, the merges and splits are determined in a greedy manner. The
results of hierarchical clustering are usually presented in a dendrogram.

In the general case, the complexity of agglomerative clustering is
O(n^2 log(n)), while divisive clustering with an exhaustive search is O(2^n),
which is even worse. However, for some special cases, optimal efficient
agglomerative methods (of complexity O(n^2)) are known: SLINK for
single-linkage and CLINK for complete-linkage clustering.
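
As an illustration, here is a minimal Python sketch of both methods, assuming
NumPy and SciPy are available. The kmeans function implements steps 1--4 of
the standard algorithm above (initializing the means at randomly chosen
observations, the Forgy variant, rather than uniformly over the domain), and
the last two lines run agglomerative single-linkage clustering with
scipy.cluster.hierarchy (method='single' optimizes the same criterion as
SLINK, though not necessarily via the SLINK algorithm itself):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)

    # Toy data: three Gaussian blobs in 2D.
    data = np.vstack([rng.normal(c, 0.3, size=(50, 2))
                      for c in ((0.0, 0.0), (3.0, 0.0), (0.0, 3.0))])

    def kmeans(data, k, n_iter=100):
        # Step 1: k initial "means"; here they are drawn from the
        # observations themselves (Forgy variant).
        means = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: associate every observation with the nearest mean;
            # this partition is the Voronoi diagram generated by the means.
            dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: the centroid of each cluster becomes the new mean
            # (empty clusters are not handled in this toy sketch).
            new_means = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
            # Step 4: repeat steps 2 and 3 until the means stop moving.
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, labels

    means, labels = kmeans(data, k=3)

    # Agglomerative ("bottom up") hierarchical clustering for comparison:
    # single linkage, cut into 3 flat clusters.
    Z = linkage(data, method='single')
    hier_labels = fcluster(Z, t=3, criterion='maxclust')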
-------------------------------------------------------------------------------
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, ``A
Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise'', in _Proceedings of 2nd International Conference on Knowledge
Discovery and Data Mining (KDD-96)_, edited by Evangelos Simoudis, Jiawei Han
and Usama M. Fayyad, AAAI Press, 1996, 226--231, ISBN: 1-57735-004-9
(CiteSeerX: 10.1.1.71.1980).

Development of the DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) algorithm (also see the Wikipedia page on DBSCAN for clearer
pseudocode describing the workings of the algorithm). Basically, clusters
consist of core (interior) points and border (boundary) points. The point q is
"directly density-reachable" from the point p if q is in the epsilon
neighborhood of p (for a given epsilon) and this neighborhood contains a
minimal number of points (MinPts, also specified). This relationship is
symmetric between core points, but not if one point is a core point and the
other lies in the border region. q is "density-reachable" from p if there is a
series of points p_i connecting them (p = p_1, q = p_n) such that p_(i+1) is
directly density-reachable from p_i. Again, this relationship is not
necessarily symmetric. p and q are "density-connected" (a symmetric
relationship) if there is a point o such that p and q are both
density-reachable from o. A "cluster" is then a set of points that are
mutually density-connected. If a new point is density-connected to any part of
an existing cluster, it is part of the cluster as well. Noise points are any
points that do not belong to a cluster.

epsilon and MinPts may be specified on a cluster-by-cluster basis, but
typically they are global parameters determined for the least dense cluster in
the database. MinPts = 4 was used for 2D data (the Wikipedia page suggests
MinPts >= spatial dimension + 1). To determine epsilon, the sorted k-distance
graph is generated with k = MinPts: it tabulates the distance from each point
to its kth nearest neighbor, sorted from largest to smallest distance. Good
values of epsilon occur where this graph makes a strong bend. The run time of
the algorithm is slightly higher than linear in the number of points, and it
seems to detect arbitrarily shaped clusters quite well compared to earlier
methods.

There are at least 4 MATLAB implementations (adjudged by speed and stability
under coordinate reordering):

o M. Daszykowski, B. Walczak and D. L. Massart, ``Looking for natural patterns
  in data. Part 1: Density-based approach'', _Chemometrics and Intelligent
  Laboratory Systems_, Volume 56, Issue 2, May 2001, 83--92. [fast and stable
  under coordinate reordering]
o Peter Kovesi, Centre for Exploration Targeting, The University of Western
  Australia, 2013. [slow and NOT stable under coordinate reordering]
o Carolyn Pehlke, SpatioTemporal Modeling Center, University of New Mexico,
  2013. [somewhat slow and NOT stable under coordinate reordering]
o Thanh N. Tran, Klaudia Drab and Michal Daszykowski, ``Revised DBSCAN
  algorithm to cluster data with dense adjacent clusters'', _Chemometrics and
  Intelligent Laboratory Systems_, Volume 120, January 2013, 92--96
  (DOI: 10.1016/j.chemolab.2012.11.006). [supposed to be stable under
  coordinate reordering, but in practice, this does not seem to be the case]
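
A minimal Python sketch of the epsilon-selection recipe and of running DBSCAN
itself, assuming scikit-learn is available; its eps and min_samples parameters
play the roles of epsilon and MinPts (note that scikit-learn counts the point
itself as part of its neighborhood), and the toy data set is hypothetical:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)

    # Two dense blobs plus sparse uniform background "noise".
    points = np.vstack([rng.normal((0.0, 0.0), 0.2, size=(100, 2)),
                        rng.normal((2.0, 2.0), 0.2, size=(100, 2)),
                        rng.uniform(-1.0, 3.0, size=(40, 2))])

    min_pts = 4  # MinPts; the value used above for 2D data

    # Sorted k-distance graph with k = MinPts: distance from each point to
    # its kth nearest neighbor, sorted from largest to smallest.  A good
    # epsilon sits at the "strong bend" (knee) of this curve.
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(points)
    dists, _ = nn.kneighbors(points)            # column 0 is the point itself
    k_dist = np.sort(dists[:, min_pts])[::-1]   # inspect this for the knee

    eps = 0.15  # chosen here by eye; in practice read it off k_dist

    # Label -1 marks noise points; other labels index the clusters found.
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points)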
-------------------------------------------------------------------------------
Getis based Clustering (developed by Carolyn Pehlke)

Getis statistic:

   G_i(d) = sum(w_ij(d) x_j, j != i) / sum(x_j, j != i)

where x_j is the intensity of the jth point (n points total) and w_ij(d) is a
symmetric weight matrix that is a function of the distance d separating points
i and j.

J. K. Ord and Arthur Getis, ``Local Spatial Autocorrelation Statistics:
Distributional Issues and an Application'', _Geographical Analysis_, Volume
27, Number 4, October 1995, 286--306.

Getis based clustering analysis is done as follows:

o The Getis statistic produces an intensity heat map yielding seed points.
o The local length scale of clustered structures is estimated via the first
  critical point in G vs r plots for each seed point.
o r_crit acts as epsilon for DBSCAN style clustering.
o Good for densely structured data.

In detail, Ripley's L-function is applied to the whole ROI, producing a
maximum cutoff distance d, which is a rough measure of the prevailing cluster
sizes. Next, a series of binary weight matrices are computed for r = 0 to d;
the entries of these matrices are 1 for pairs of points separated by a
distance <= r, and 0 otherwise. The ROI is divided up into a grid of
subregions, each region containing one or more pixels. The super-resolution
localizations are then converted into a histogram image, that is, each
subregion counts the number of localizations contained within it. The
histogram image is binarized, that is, if the number of localizations in a
subregion is >= 1, the value is set to 1. The Getis G statistic is then
computed for each subregion and each r, producing a 3D array of G values. Note
that G_i(r) is the weighted number of particles (in this case, the number of
particles excluding particle i that are within a distance r of i) divided by
the total number of particles. The maximum value of G over r for each
subregion in the 3D array is determined, and then used to construct a new 2D
matrix of G values (this is a projection of the maximum values of the
statistic as a function of r for each subregion back into 2D form). These
values are proportional to the degree to which localizations are clustered
about each subregion. Local maxima of the 2D G-matrix are then located; they
will be used as seed points. Next, for each local maximum, G values are
calculated over a range of distances to produce a G versus distance curve. The
first critical point of this curve (where its first derivative is zero),
r_crit, defines a critical radius indicating the approximate domain size.
Finally, for each seed point, pixels within r_crit are grouped into a cluster,
and overlapping clusters originating from different seed points are merged.

Michelle S. Itano, Matthew S. Graus, Carolyn Pehlke, Michael J. Wester, Ping
Liu, Keith A. Lidke, Nancy L. Thompson, Ken Jacobson and Aaron K. Neumann,
``Super-resolution imaging of C-type lectin spatial rearrangement within the
dendritic cell plasma membrane at fungal microbe contact sites'', _Frontiers
in Physics, section Membrane Physiology and Membrane Biophysics_, Volume 2,
Number 46, August 2014, 1--17 (DOI: 10.3389/fphy.2014.00046).
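
A minimal Python sketch of the G_i(d) statistic itself, under the
binary-weight convention used above (w_ij(d) = 1 when points i and j are
within distance d of each other, 0 otherwise); the full seed-point pipeline
(histogram image, maximum projection, r_crit estimation) is not reproduced
here:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def getis_G(points, x, d):
        # Binary symmetric weights: w_ij(d) = 1 when points i and j are
        # within distance d of each other (and i != j), 0 otherwise.
        D = squareform(pdist(points))
        w = ((D <= d) & ~np.eye(len(points), dtype=bool)).astype(float)
        num = w @ x           # sum(w_ij(d) x_j, j != i) for every i
        den = x.sum() - x     # sum(x_j, j != i) for every i
        return num / den

    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 10.0, size=(200, 2))
    x = np.ones(len(pts))           # unit intensities: G_i(d) reduces to the
    G = getis_G(pts, x, d=1.0)      # fraction of the other points within d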
-------------------------------------------------------------------------------
Voronoi based Clustering

A Voronoi diagram or tessellation is a partitioning of the plane into regions,
each containing one seed point, such that every point on a boundary edge
segment is equidistant from the nearest seed points.

A Voronoi based method for segmenting super-resolution data is described. A
Voronoi diagram of image observations is produced, where polygon P^0_i
corresponds to seed (observation) s_i and has area A^0_i. First rank neighbors
are those polygons P^1_ij sharing edges with P^0_i, which when combined
together with P^0_i form the rank 1 polygon P^1_i. The rank 1 area A^1_i is
then the sum of the areas A^1_ij plus A^0_i. From these, densities delta^k (k
is the rank) can be computed. At this point, object segmentation (clustering)
is initiated. For example, if delta is the average density of the data
distribution, then clustering proceeds by merging all 1st rank polygons that
satisfy delta^1_i > alpha delta, where alpha > 0 (alpha = 2 is a typical
choice). An extension to this idea is to also include a shortest distance
parameter Delta such that clustering only occurs if data points are also
closer than Delta. In addition, by choosing multiple alphas, different levels
of clustering can be considered. Finally, these ideas can be applied to
combining multiple observations into single localizations by examining
multiple movie frames and searching within a distance tolerance omega.
However, doing so assumes a photoactivation model for fluorescent proteins,
which is not always appropriate; for example, dSTORM probes do not
photobleach, but rather switch on and off.

Florian Levet, Eric Hosy, Adel Kechkar, Corey Butler, Anne Beghin, Daniel
Choquet and Jean-Baptiste Sibarita, ``SR-Tesseler: a method to segment and
quantify localization-based super-resolution microscopy data'', _Nature
Methods_, Volume 12, Number 11, 2015, 1065--1071 (DOI: 10.1038/NMETH.3579).

-------------------------------------------------------------------------------
Hopkins' Statistic

The Hopkins' statistic (H) tests for complete spatial randomness of a probe
pattern by comparing nearest neighbor distances from random points and from
randomly chosen probes. If the number of probes in the set S (an image) is n,
choose m << n random sampling locations s_j and randomly chosen probes p_j,
then compute

   U = sum(d^2(s_j, S), j = 1 .. m)
   W = sum(d^2(p_j, S), j = 1 .. m)

where

   d(p_j, S) = min{ d(p_j, p_k) for all p_k in S, p_k != p_j }

and d(p_j, p_k) = || p_j - p_k || is the distance between p_j and p_k (for the
randomly chosen probes, the minimum excludes the probe itself). The Hopkins'
statistic is defined as

   H = U / (U + W)

and will lie in the interval [0, 1]. For good results, H should be computed
multiple times for a single image.

   H = 1/2 for completely random probes
   H = 1   for completely clustered probes

-------------------------------------------------------------------------------
Ripley's Statistics

Ripley's K analysis tests for clustering and co-clustering by comparing the
average number of probes in a disk of radius r about each of the probes with
the average density of probes over the region considered.

   K(r) = A/n^2 sum(sum(delta_ij, j = 1 .. n, j != i), i = 1 .. n)

where A is the area of the domain, n is the number of points, and delta_ij = 1
if the distance d(i, j) < r, otherwise 0. This counts the number of points
encircled by concentric rings centered on each point, normalized by the
average density of the region. This can be linearized as

   L(r) = sqrt(K(r) / pi)

so that

   L(r) < r   points are less clustered than random
   L(r) = r   points are clustered as in a random distribution
   L(r) > r   points are more clustered than random

In a bivariate analysis (where two different sets of probes in the same region
are considered), the L function acts as above except that now the cluster size
refers to clusters of co-mingled points from the two sets.

Jun Zhang, Karin Leiderman, Janet R. Pfeiffer, Bridget S. Wilson, Janet M.
Oliver and Stanly L. Steinberg, ``Characterizing the Topography of Membrane
Receptors and Signaling Molecules from Spatial Patterns Obtained using
Nanometer-scale Electron-dense Probes and Electron Microscopy'', _Micron_,
Volume 37, Issue 1, January 2006, 14--34 (DOI: 10.1016/j.micron.2005.03.014).

Dendrograms (method for estimating cluster epsilon):

Flor A. Espinoza, Janet M. Oliver, Bridget S. Wilson and Stanly L. Steinberg,
``Using Hierarchical Clustering and Dendrograms to Quantify the Clustering of
Membrane Proteins'', _Bulletin of Mathematical Biology_, Volume 74, Issue 1,
January 2012, 190--211 (PMID: 21751075, PMCID: PMC3429354).
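
Minimal Python sketches of the Hopkins' and Ripley's statistics above,
assuming the points lie in a rectangular region of known area; edge
corrections, which careful Ripley analyses usually apply near the region
boundary, are omitted:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    def hopkins(points, m, bounds):
        # H = U / (U + W) on a rectangle bounds = (xmin, xmax, ymin, ymax).
        xmin, xmax, ymin, ymax = bounds
        tree = cKDTree(points)
        # U: squared nearest-neighbor distances from m random locations.
        s = np.column_stack([rng.uniform(xmin, xmax, m),
                             rng.uniform(ymin, ymax, m)])
        U = (tree.query(s, k=1)[0] ** 2).sum()
        # W: squared nearest-neighbor distances from m randomly chosen
        # probes; k=2 because each probe's nearest neighbor in S is itself.
        p = points[rng.choice(len(points), size=m, replace=False)]
        W = (tree.query(p, k=2)[0][:, 1] ** 2).sum()
        return U / (U + W)

    def ripley_L(points, r, area):
        # K(r) = A/n^2 * #{ordered pairs i != j with d(i, j) < r}; pdist
        # yields each unordered pair once, hence the factor of 2.
        n = len(points)
        K = area / n**2 * 2.0 * (pdist(points) < r).sum()
        return np.sqrt(K / np.pi)   # linearized statistic L(r)

    pts = rng.uniform(0.0, 10.0, size=(500, 2))    # random, 10 x 10 region
    H = hopkins(pts, m=50, bounds=(0, 10, 0, 10))  # expect H near 1/2
    L = ripley_L(pts, r=1.0, area=100.0)           # expect L(r) near r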
-------------------------------------------------------------------------------
H-SET (Hierarchical Single Emitter hypothesis Test)

H-SET is a top-down hierarchical clustering algorithm (implemented in MATLAB)
that collapses clusters of observations of blinking fluorophores into a single
estimate of the true location of each fluorophore using a log-likelihood
hypothesis test.

For a cluster of observations to be collapsed into a single fluorophore
position, we make a hypothesis test with the null hypothesis that all
observations come from the same fluorophore. The null hypothesis is not
rejected if the p-value is larger than a specified level of significance. The
p-value is calculated using the log-likelihood ratio statistic, R, which is
calculated as

   R = -2 log(L / L0)

where L is the likelihood of the N observations given a fluorophore located at
the maximum likelihood position and given the observed localization
uncertainties, and L0 is the likelihood of the N positions assuming there are
N independent fluorophores. In M dimensions, R is chi-squared distributed with
M (N - 1) degrees of freedom.

The maximum likelihood estimate of the collapsed position is the
variance-weighted mean of the observed positions:

   x_bar = sum(x_i / sigma'_i^2, i = 1 .. N) / sum(1 / sigma'_i^2, i = 1 .. N)

where

   sigma'_i^2 = sigma_i^2 + sigma_reg^2

Here, x_i and sigma_i are the observed positions of the fluorophores and the
uncertainties in the observed positions, respectively, while sigma_reg is the
registration error and sigma'_i is the modified uncertainty including the
effects of drift correction. The uncertainty in the position of the collapsed
fluorophore is then calculated from the variance of the weighted mean:

   sigma_bar^2 = 1 / sum(1 / sigma'_i^2, i = 1 .. N)

The algorithm proceeds in two passes. The first makes use of the MATLAB
standard hierarchical clustering algorithm (linkage) to create a binary tree
of relationships connecting observations together in clusters of varying
sizes. Each node of the tree corresponds to a cluster of one or more
observations, that is, each node splits into two branches which end at either
a single observation or a node representing a smaller cluster of observations.
The algorithm starts at the top node, which collects together all the
observations (the tree grows downward), and checks whether this cluster of
observations could be due to a point source according to the level of
significance (LoS) assigned by the user (the default is LoS = 0.01). If yes
(P > LoS), the cluster is collapsed into a single localization and the
algorithm terminates at this node. If no (P <= LoS), the two branches are
traversed in turn, making the same check on the nodes at their ends. Anytime
the test succeeds, all the observations represented by that node are collapsed
into a single localization and the algorithm does not descend further. The
final output is the set of localizations discovered, which may be collapsed
clusters of observations or isolated observations that were not collected into
any larger cluster by the algorithm and so were deemed singlet localizations.
We chose to traverse the linkage tree top-down as (1) the significance test
works better for larger clusters, and (2) bottom-up traversal could produce
different results depending on which points were collapsed first, that is, on
the order of collapsing.

The second pass of the algorithm uses a user-specified clustering algorithm
such as hierarchical, DBSCAN, Getis based, etc., to cluster together the
localizations produced by the first pass. It then checks each cluster it finds
in turn with the above significance test and collapses any cluster that passes
(that is, any cluster that could be due to a point source localization). Most
of the collapsing is typically done by the first pass; the second pass
sometimes finds a few more small clusters to collapse.

Jia Lin, Michael J. Wester, Matthew S. Graus, Keith A. Lidke and Aaron K.
Neumann, ``Nanoscopic cell wall architecture of an immunogenic ligand in
_Candida albicans_ during antifungal drug treatment'', _Molecular Biology of
the Cell_, Volume 27, Number 6, March 15, 2016, 1002--1014 (DOI:
10.1091/mbc.E15-06-0355, PMID: 26792838).
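
A minimal Python sketch of the single-emitter test applied at one node,
assuming independent Gaussian localization errors so that R reduces to the
weighted sum of squared residuals about the weighted mean; the function and
variable names here are illustrative, not those of the MATLAB implementation:

    import numpy as np
    from scipy.stats import chi2

    def collapse_test(x, sigma, sigma_reg, level_of_significance=0.01):
        # x: (N, M) observed positions; sigma: (N,) position uncertainties.
        N, M = x.shape
        var = sigma**2 + sigma_reg**2       # sigma'_i^2
        w = 1.0 / var                       # weights 1 / sigma'_i^2
        x_bar = (w[:, None] * x).sum(axis=0) / w.sum()  # weighted mean
        sigma_bar = np.sqrt(1.0 / w.sum())  # uncertainty of collapsed position
        # For independent Gaussian errors, R = -2 log(L / L0) reduces to the
        # weighted sum of squared residuals about the weighted mean.
        R = (w * ((x - x_bar) ** 2).sum(axis=1)).sum()
        p = chi2.sf(R, df=M * (N - 1))      # p-value, chi^2 with M(N-1) dof
        return p > level_of_significance, x_bar, sigma_bar, p

    rng = np.random.default_rng(0)
    sigma = np.full(10, 0.02)               # e.g., 20 nm (units of microns)
    obs = np.array([5.0, 5.0]) + rng.normal(0.0, 0.02, size=(10, 2))
    collapse, x_bar, sigma_bar, p = collapse_test(obs, sigma, sigma_reg=0.01)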
-------------------------------------------------------------------------------
Pair Auto- and Cross-Correlation

Sarah L. Veatch, Benjamin B. Machta, Sarah A. Shelby, Ethan N. Chiang, David
A. Holowka and Barbara A. Baird, ``Correlation Functions Quantify
Super-Resolution Images and Estimate Apparent Clustering Due to
Over-Counting'', _PLoS ONE_, Volume 7, Issue 2, February 2012, 1--13.

Derivation of molecular pair auto-correlation and cross-correlation functions,
g(r) and c(r), that report the increased probability of finding a second
localized signal a distance r away from a given localized signal in
super-resolution images:

   c(r) = Re[ FFT^(-1)( FFT(I_1) * conjugate(FFT(I_2)) ) / ( rho_1 rho_2 N(r) ) ]
   g(r) = FFT^(-1)( |FFT(I)|^2 ) / ( rho^2 N(r) )

where I_1, I_2, I are images; rho_1, rho_2, rho are average molecular surface
densities; and N(r) is a normalization that accounts for the finite size of
the acquired image (the fact that there are fewer possible pairs separated by
large distances due to the finite image size):

   N(r) = FFT^(-1)( |FFT(W)|^2 )

with W a binary image with value 1 inside the measurement area. All images are
padded with zeros in both directions out to a distance larger than the range
of the desired correlation function (maximally the size of the original image)
to avoid artifacts due to the periodic nature of FFT functions. This
computational method of tabulating pair auto- and cross-correlations is
mathematically identical to brute force averaging methods.

Note

   g(r) = < rho(R) rho(R - r) > / rho^2

where the average is over all positions R in the image. In this definition,
g(r) = 1 represents a random distribution. Often, it can be assumed that g(r)
is symmetric under rotations, so it is averaged over angles. At r = 0, g(r)
contains the term (1 / rho) delta(r), since rho(r) = sum_i delta(r - r_i) is
the density of (point mass) molecules as a function of r.
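
A minimal Python sketch of the auto-correlation formula above; the
cross-correlation c(r) differs only in using FFT(I_1) * conjugate(FFT(I_2))
and rho_1 rho_2. The padding factor and the Poisson toy image are illustrative
choices:

    import numpy as np

    def pair_autocorrelation(I, W):
        # g = FFT^-1(|FFT(I)|^2) / (rho^2 N(r)), N(r) = FFT^-1(|FFT(W)|^2).
        # I: 2D histogram image of localizations; W: binary mask (1 inside
        # the measurement area).  Zero-padding to twice the image size
        # avoids the wrap-around artifacts of the periodic FFT.
        ny, nx = I.shape
        shape = (2 * ny, 2 * nx)
        rho = I.sum() / W.sum()              # average density (per pixel)
        G = np.fft.fftshift(
            np.fft.ifft2(np.abs(np.fft.fft2(I, s=shape)) ** 2).real)
        N = np.fft.fftshift(
            np.fft.ifft2(np.abs(np.fft.fft2(W, s=shape)) ** 2).real)
        with np.errstate(divide='ignore', invalid='ignore'):
            g = G / (rho**2 * N)
        return np.where(N > 0.5, g, 0.0)     # zero where no pairs exist

    rng = np.random.default_rng(0)
    img = rng.poisson(0.05, size=(256, 256)).astype(float)  # random points
    mask = np.ones_like(img)
    g2d = pair_autocorrelation(img, mask)    # expect ~1 away from the center
    # An angular average of g2d about its center yields the usual g(r) curve.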
Fitting the g(r) or c(r) curve produces estimates of cluster and localization
sizes and densities.

Prabuddha Sengupta, Tijana Jovanovic-Talisman, Dunja Skoko, Malte Renz, Sarah
L. Veatch and Jennifer Lippincott-Schwartz, ``Probing protein heterogeneity in
the plasma membrane using PALM and pair correlation analysis'', _Nature
Methods_, Volume 8, Number 11, November 2011, 969--975.

Derivation of a mathematical approach for correlating all peaks in a PALM
(photoactivated localization microscopy) image:

   g(r)^peaks = (g(r)^centroid + g(r)^protein) * g(r)^PSF

where * represents convolution, and

   g(r)^PSF       correlation function of the effective point spread function
                  of uncertainty in localization
   g(r)^centroid  protein correlation function at r = 0; its convolution with
                  g(r)^PSF quantifies the correlation arising from multiple
                  appearances of the same molecule and is defined as
                  g(r)^stoch
   g(r)^protein   protein correlation function at r > 0; its convolution with
                  g(r)^PSF represents the correlation function of the relative
                  spatial distribution of the proteins

   g(r)^PSF = 1 / (4 pi sigma_s^2) exp(-r^2 / (4 sigma_s^2))

   => g(r)^stoch = 1 / (4 pi sigma_s^2 rho^average) exp(-r^2 / (4 sigma_s^2))

where rho^average is the average density of the protein and sigma_s is the
average uncertainty in single molecule localization.

If the proteins are randomly distributed, g(r)^protein (r > 0) ~ 1. If the
proteins are clustered,

   g(r)^protein = A exp(-r / xi) + 1

where A is a measure of the protein density in the domain, and the correlation
length xi gives an estimate of the radius of the domain. Hence,

   g(r)^peaks = 1 / (4 pi sigma_s^2 rho^average) exp(-r^2 / (4 sigma_s^2))
                + (A exp(-r / xi) + 1) * g(r)^PSF

and so the average number of proteins per cluster is

   N^cluster = 1 + rho^average int((g(r)^protein - 1) 2 pi r dr, r = 0 .. inf)
             ~ 2 A pi xi^2 rho^average

and psi^cluster, the increased density of proteins in clusters relative to the
overall average density of proteins across the whole image, is

   psi^cluster = rho^cluster / rho^average ~ 2 A

Autocorrelation functions, g(r), are computed as in the Veatch paper.
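
As a short worked example of the last two formulas, with hypothetical fitted
values (illustrative only, not taken from the paper):

    import numpy as np

    # Hypothetical fit results for g(r)^protein = A exp(-r / xi) + 1:
    A = 4.0                # protein density contrast in the domain
    xi = 30.0              # correlation length ~ domain radius (nm)
    rho_average = 2.0e-4   # average protein density (per nm^2)

    # N^cluster = 1 + rho^average int((g^protein - 1) 2 pi r dr, 0 .. inf);
    # the integral of A exp(-r / xi) 2 pi r over 0 .. inf is 2 pi A xi^2.
    N_cluster = 1.0 + 2.0 * np.pi * A * xi**2 * rho_average

    # psi^cluster = rho^cluster / rho^average ~ 2 A
    psi_cluster = 2.0 * A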