Integral Curve Clustering and Simplification for Flow Visualization: A Comparative Evaluation

Unsupervised clustering techniques have been widely applied to flow simulation data to alleviate clutter and occlusion in the resulting visualization. However, there is an absence of systematic guidelines for users to evaluate (both quantitatively and visually) the appropriate clustering technique and similarity measures for streamline and pathline curves. In this work, we provide an overview of a number of prevailing curve clustering techniques. We then perform a comprehensive experimental study to qualitatively and quantitatively compare these clustering techniques coupled with popular similarity measures used in the flow visualization literature. Based on our experimental results, we derive empirical guidelines for selecting the appropriate clustering technique and similarity measure given the requirements of the visualization task. We believe our work will inform the task of generating meaningful reduced representations for large-scale flow data and inspire the continuous investigation of a more refined guidance on clustering technique selection.


INTRODUCTION
Vector fields are commonly used in various engineering and scientific applications to help experts understand and assess dynamical systems. Among many visualization techniques, integral curve-based approaches are widely employed to visually depict the behavior of complex 3D vector fields (especially, 3D flows) due to their intuitive representation of the flow characteristics via their geometric shapes. One common goal for integral curve-based visualization techniques is to choose a proper set of curves to visually draw the attention of users to important flow regions (with interesting flow patterns) based on their distribution and geometric characteristics (e.g., highly rotating). However, choosing such a set of integral curves is challenging due to the complex nature of the flow behavior that often results in clutter, which is further worsened by 3D occlusion. There are a number of existing techniques to address this challenge, such as evenly-spaced curve placement [1], topology-based integral curve placement [2], [3], information entropy-based integral curve placement and selection [4], [5], [6], and viewpoint-dependent integral curve filtering [7], [8], [9]. This work focuses on the integral curve clustering techniques [10], [11], [12], [13], [14], [15], as it is a general framework that has been proven effective in achieving information reduction by grouping (usually raw input) data with similar characteristics into larger and coarser representations (i.e., clusters and/or segments).
In past years, unsupervised clustering techniques borrowed from the data mining and analytics literature have been applied to flow simulation data to alleviate clutter and occlusion and simplify the overall visual representation while highlighting meaningful patterns of the flow. Among all clustering methods listed by Xu and Tian [16], k-means, agglomerative hierarchical clustering (AHC) with various linkages (complete, single, average and Ward's), spectral clustering (SC), affinity propagation (AP) and DBSCAN are the most popular and dominant techniques for integral curve clustering and abstraction. Combined with either customized or well-accepted similarity measures, a specific clustering framework chosen from the aforementioned clustering techniques is able to facilitate an abstraction of specific vector fields while capturing important flow features of interest [10], [11], [13], [17], [18], [19]. Despite many promising clustering results, existing methods still suffer from the following:
- Well-documented guidance is still lacking, among countless clustering techniques, to inform the selection of the appropriate combination of similarity measure and clustering method to achieve the desired results.
- There is no comprehensive study that evaluates different combinations of clustering techniques and similarity measures based on criteria such as the desired features to capture and the computational complexity.
- How can a user quantitatively and visually evaluate different combinations of clustering algorithms with similarity measures? Whether a good quantitative evaluation score for a specific clustering combination indicates a good visualization of the clustering results for the input streamlines/pathlines remains an open question.
Even though previous work attempts to quantitatively analyze clustering techniques for blood flow data visualization [10], [11], its analysis of clustering techniques and similarity measures is still incomplete and insufficient to address the aforementioned challenges. To derive guidelines for selecting an informed clustering technique and similarity measure for flow simulation data, we start with a comprehensive experimental study, in which we compare the performance (both qualitatively and quantitatively) of the most widely used combinations of existing clustering techniques and similarity measures in the flow visualization literature on a selected set of streamline and pathline data sets. From this comprehensive evaluation, we derive empirical guidelines in the form of summary metrics for different combinations of clustering algorithms and similarity measures with a number of flow visualization tasks in practice. We believe this is a valuable contribution to the flow visualization community and will inspire similar follow-up work to refine these guidelines. We release our complete implementation of the clustering algorithms and similarity measures via GitHub for reproducibility by others.
The rest of the paper is organized as follows: Section 2 briefly reviews different clustering techniques and the advantages and disadvantages of each. Section 3 describes our experimental set-up for the evaluation of clustering algorithms with the selected flow data sets. Section 4 contains detailed quantitative and visual comparisons of different clustering techniques with various similarity measures, which leads to empirical guidelines for the selection of an informed clustering technique for flow simulation data. Finally, conclusions and future work are discussed in Section 5.

OVERVIEW OF CLUSTERING TECHNIQUES
In this section, we provide a concise description of the clustering methods prevalent in flow visualization (see Table 1 for an overview), and discuss their primary advantages and disadvantages based on performance analysis and well-recognized properties [40].

k-Center Clustering
According to Xu and Tian [16] and Xu and Wunsch [41], k-center clustering methods belong to an exclusive, complete and well-separated clustering category without considering noise or outliers. k-center clustering has been applied with MCP [10], [11], end-point distance and sub-bundle properties for illustrative visualization [18].
Depending on whether the center of a cluster is the centroid or the medoid, k-center clustering can be divided into k-means and k-medoids [42], respectively, and the former appears frequently as the baseline for comparison [10], [11], [18] in integral curve clustering. From an optimization perspective, k-means minimizes the sum of squared euclidean distances, while k-medoids minimizes the Manhattan norm instead. k-means has some well-studied strengths and shortcomings [43].
Pros
+ It benefits from linear complexity in both time, $O(knt)$ (k-medoids has $O(k(n-k)^2)$ complexity), and memory w.r.t. the number of input curves, and hence can potentially handle large-scale data.
+ It is easy to implement in parallel.
+ Its objective function can be easily modified or adjusted for more controllable design.
+ It is versatile in clustering candidates not only based on coordinates, but also on arbitrary vectors, e.g., self-defined shape property vectors [18].
Cons
- The optimization can get trapped in a local minimum and hence can lead to sub-optimal clustering.
- It works best for elliptic and convex cluster shapes (k-means favors spherical clusters while k-medoids is less strict), is sensitive to outliers and noise, and often fails in handling non-globular clusters and clusters with widely different sizes [10], [11].
- A preset cluster number k is not easily chosen for optimal clustering, and may have to be determined by fairly complicated methods, e.g., joint probability change [44] and the elbow method [45].
- Random initialization can affect the clustering result and hence k-means lacks stability. This can be improved by k-means++ [46], bisection k-means [47] or simply by running k-means several times and choosing the best result [10], [11].
Remark. k-means stands out in computational efficiency and performance even though it suffers from limitations of cluster shape and size, and can still provide a first overview of a data set if combined with a specifically designed similarity measure.
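The restart strategy mentioned above can be illustrated with a minimal sketch of Lloyd-style k-means. It assumes each curve has already been resampled and flattened into a fixed-length coordinate vector; the function names `sq_dist` and `kmeans` and the `restarts` parameter are hypothetical and not taken from the released implementation.

```python
import random

def sq_dist(u, v):
    """Squared euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, max_iter=50, restarts=5, seed=0):
    """Lloyd-style k-means; runs several random restarts and keeps the lowest-cost result."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(restarts):
        # random initialization: pick k distinct input vectors as centers
        centers = [list(v) for v in rng.sample(vectors, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                          for v in vectors]
            if new_labels == labels:      # assignments stable: converged
                break
            labels = new_labels
            for c in range(k):
                members = [v for v, l in zip(vectors, labels) if l == c]
                if members:               # keep the old center if a cluster is empty
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(sq_dist(v, centers[l]) for v, l in zip(vectors, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost
```

Keeping the run with the lowest within-cluster cost is one simple way to reduce the sensitivity to initialization; k-means++ seeding would replace the `rng.sample` line.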

Hierarchical Clustering
Hierarchical clustering generally falls into two types, agglomerative (bottom-up) and divisive (top-down) [48], and agglomerative clustering is often more practical because divisive methods have $O(2^n)$ complexity. Depending on the linkage criterion, agglomerative hierarchical clustering (AHC) can feature single-linkage, complete-linkage, average-linkage and Ward's method, and has an overall time complexity of $O(n^3)$, which can be reduced to $O(n^2 \log n)$ with a heap.
AHC is widely used in flow visualization. It was first applied with point-wise distance metrics, e.g., mean-of-closest-point (MCP), for DTI fiber bundle clustering. Single-linkage AHC with MCP previously achieved the best clustering effect [17], [20], [21], [22], [23], while average-linkage with MCP was better recognized in blood flow clustering [10], [11]. Average-linkage AHC with dimensionality-reduced euclidean distance by principal component analysis (PCA) was also applied to the clustering challenge for vector field ensembles [15]. Besides, AHC can be coupled with many customized similarity measures that describe either the spatial distance or shape similarity of integral curves. Examples include average-linkage AHC with a weighted end-curve-distance [12], average-linkage AHC with a weighted form of signature-based similarity and mean distance [13], single-linkage AHC with mean-of-thresholded-closest-distance for fiber bundle clustering [24], Ward's-variance AHC with segment matching cost distance [27], penalized-linkage AHC with a DTW-based histogram similarity measure [28], average-linkage AHC with the string matching cost measure [14], self-defined AHC with a graph-based similarity measure [29], and average-linkage AHC with a specific spatio-temporal similarity measure for adjacent blood flow pattern classification [25], [26].
Besides the conventional AHC, a fast and single-scan hierarchical clustering algorithm, BIRCH, was also proposed based on a B+ tree [49]; it can handle larger-scale data with higher efficiency than conventional AHC for a given set of resources (e.g., memory). It has a complexity of $O(n)$, but favors spherical cluster shapes because the clustering algorithm was derived from variance computation [16]. We have not found BIRCH applied in flow visualization, possibly because irregular cluster shapes cannot be detected by BIRCH.
Pros
+ AHC is able to handle clusters of different sizes and arbitrary shapes and can often show hierarchical structures using a dendrogram.
+ An AHC result is usually stable, and can generate any number of clusters once the hierarchical merging tree is built.
+ If the number of clusters is not known in advance, users can interactively browse the cluster hierarchy with only local updates to the result, which are easier to track than global changes as in, e.g., spectral clustering.
Cons
- AHC is expensive in terms of computational and storage requirements, with a time complexity of $O(n^3)$ (or $O(n^2 \log n)$ [16]) and a space complexity of $O(n^2)$. This severely restricts the application of AHC to large-scale data sets.
- AHC lacks a global objective function and hence has limited controllability and predictability. Recent work has proposed an objective function for AHC [50], but it is still difficult to use for practical problems.
Remark. The biggest advantage of AHC is its ability to detect clusters of arbitrary shape and size, although its practical application is restricted by its computational complexity. However, because the input to the clustering algorithms in the existing flow visualization applications is usually not too large, AHC is still regarded as the most prominent and the state-of-the-art clustering algorithm in the flow visualization literature.
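To make the role of the linkage criterion concrete, the following is a deliberately unoptimized sketch of bottom-up merging on a precomputed pairwise distance matrix (for example, MCP distances between curves). The function name `ahc` and its arguments are hypothetical; a practical implementation would cache inter-cluster distances or use a heap rather than recomputing every linkage at each merge.

```python
def ahc(dist, k, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]   # start with one cluster per curve

    def link(a, b):
        pair = [dist[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pair)
        if linkage == "complete":
            return max(pair)
        return sum(pair) / len(pair)             # average linkage

    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        _, a, b = min((link(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] += clusters[b]               # merge b into a (a < b)
        del clusters[b]

    labels = [0] * len(dist)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels
```

Stopping the loop at different values of `k` corresponds to cutting the dendrogram at different heights, which is why any number of clusters can be produced from a single merging sequence.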

Density-based Clustering
In density-based clustering [51], clusters are defined as areas of higher density than the remainder of the data set, and objects in sparse areas are often considered as noise or border points. The most popular density-based clustering method is DBSCAN [52]. It has been applied in visualization with spatial distance, e.g., weighted sum of local distance [53] and segment endpoint distance [19], and with euclidean distance in a low-dimensional space (obtained using the feature descriptors for streamlines and stream surfaces through an auto-encoder [30]). DBSCAN is also combined with the averaged centerline distance (ACD) in aortic vortex flow clustering [31] and with a modified coherence distance encoding FTLE values for blood flow exploration [32]. DBSCAN requires two parameters, $\varepsilon$ (radius) and minPts (minimum number of points). It is a query-based algorithm that classifies candidates as core, border or noise based on their neighboring connected points. The runtime complexity of DBSCAN is $O(n^2)$ for a naive implementation and can be reduced to $O(n \log n)$; it requires $O(n^2)$ memory for an implementation that stores the distance matrix and $O(n \log n)$ for an implementation that does not.
Besides DBSCAN, another popular density-based clustering method is OPTICS [54], which is similar to DBSCAN but addresses one of its limitations, namely that DBSCAN fails to detect meaningful clusters of varying density. Compared to DBSCAN, OPTICS replaces $\varepsilon$ with a maximum value that affects performance, and minPts specifies the minimum neighbor size to find; hence it lowers the difficulty of parameterization.
Pros
+ It can detect noise and is robust to outliers.
+ It can find arbitrarily shaped clusters and does not require an a priori preset cluster number.
Cons
- Parameter setting is very difficult for a data set without a priori knowledge, and the final result is quite sensitive to parameter settings.
- Parallelism of density-based clustering is challenging due to the data dependency of connectivity expansion. Recent parallel implementations on distributed systems have been proposed [55], [56], [57], but they are yet to be applied to the clustering problems in visualization.
Remark. The most promising aspect of density-based clustering is that it is robust to outliers and noise, and able to detect arbitrarily shaped clusters with lower cost than AHC, while parameter setting might require prior knowledge of the data.
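The core/border/noise classification described above can be sketched as follows for a precomputed distance matrix. This is a minimal illustration rather than the evaluated implementation; the function name `dbscan` is hypothetical, and it assumes that a point counts as its own neighbor when comparing against `min_pts`.

```python
def dbscan(dist, eps, min_pts):
    """DBSCAN on a precomputed distance matrix; returns cluster labels, with -1 for noise."""
    n = len(dist)
    # eps-neighborhoods (each point is included in its own neighborhood)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n                  # None marks an unvisited point
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1               # provisional noise; may become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])   # j is also a core point: keep expanding
    return labels
```

The number of clusters is an output of the expansion rather than an input, which is the property highlighted in the Pros above.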

Spectral Clustering
Spectral clustering (SC) [58] is a classical clustering method that utilizes the spectrum of the data similarity matrix to perform dimension reduction before clustering in a lower-dimensional space. From the dimension-reduction point of view, SC benefits from the same intuition as PCA-based clustering [15]. The difference is that PCA [15] focuses on the reduction of a coordinate matrix, while SC focuses on the reduction of a distance matrix. We refer interested readers to the survey by von Luxburg [59] for a detailed description of SC in both theory and practice. SC has been applied to integral curve clustering, e.g., with an average distance between pairs of nearest points in white matter fiber tract clustering [33], with a user-specified spatial distance in medical image analysis [34], and with MCP [10], [11], Hausdorff distance [35] and minimal closest point distance [19], respectively. It has been reported that SC with MCP achieves better results in blood flow clustering than k-means and AHC [10], [11]. In terms of implementation, the techniques presented in [33], [35] use the k-way normalized cut of Shi and Malik [58] as post-processing after eigen-decomposition, [19] uses k-means as in Ng et al. [60], while [10], [11], [19] use eigen-rotation minimization [61].
Complexity analysis for SC can be non-trivial because it is not yet known whether the sparsity of the normalized graph Laplacian $L$ impacts the result. For time complexity, in addition to the distance matrix computation with $O(n^2)$ complexity, the complexity of the eigen-decomposition for a non-sparse matrix is $O(n^{\omega})$ ($2 < \omega < 2.376$) [62] in the optimal scenario but $O(n^3)$ in general. Memory complexity is around $O(n^2)$ because several $n \times n$ matrices need to be stored.
Pros
+ SC provides more intrinsic partitioning/segmentation, and can achieve meaningful projection into a lower-dimensional space [35].
+ SC with eigen-rotation can automatically find the optimal number of clusters within a user-given range [10], [11], [19].
+ SC can detect clusters with arbitrary shapes.
+ SC has objective functions, and is versatile for modification.
Cons
- Runtime and memory complexity restricts SC from being applied to large-scale data sets.
- SC tends to form clusters of even size and might fail for clusters with varying sizes.
Remark. Spectral clustering gradually gains attention via its application to blood flow visualization, and may provide more natural and meaningful cluster extraction at the cost of higher runtime and memory usage.
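As a compact summary of the pipeline described in this section, the following sketch writes out one common formulation of SC on a pairwise curve distance $d(s_i, s_j)$. The Gaussian affinity with a single global bandwidth $\sigma$ and the symbols $A$, $D$ and $U$ are assumptions made for illustration; the works cited above use their own affinity scaling and post-processing.

```latex
\begin{align*}
A_{ij} &= \exp\!\left(-\frac{d(s_i, s_j)^2}{2\sigma^2}\right), \qquad A_{ii} = 0, \\
D_{ii} &= \sum_{j} A_{ij}, \\
L &= I - D^{-1/2} A D^{-1/2}, \\
U &= [u_1, \dots, u_k] \quad \text{(eigenvectors of $L$ for the $k$ smallest eigenvalues)}.
\end{align*}
```

Each curve $s_i$ is then represented by the row-normalized $i$-th row of $U$ and clustered in this $k$-dimensional space, e.g., with k-means as in Ng et al. [60].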

Affinity Propagation
Affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points [63]. It finds "exemplars" for each cluster, and it does not require the number of clusters as user input. AP has been used in the flow visualization literature mostly with shape similarity measures of streamlines and pattern-based distance metrics, e.g., bag-of-features from feature vector computation of streamline signatures [36], [37] and the adapted Procrustes distance on re-sampled points [38], [39]. The initial values of $S(i, i)$, which are the diagonal entries of the similarity matrix $S$, are vital in controlling the clusters the algorithm produces. According to Frey and Dueck [63], $S(i, i)$ is typically initialized to the median similarity of all pairs of inputs.
Pros
+ AP is simple and easy to implement in parallel.
+ AP is insensitive to outliers and noise.
+ AP does not require user input of cluster numbers and it automatically detects a suitable number of clusters based on the given parameters.
Cons
- The runtime complexity is $O(n^2 \log n)$ and memory usage is $O(n^2)$ [16], which may restrict its application to large data.
- The clustering result is sensitive to the parameters involved in the AP algorithm, e.g., the diagonal entry initialization and the coefficient used in updating the responsibility and availability matrices.
Remark. AP has even more expensive computational cost (i.e., time complexity) than AHC, and the final result is highly dependent on initialization. Right now AP is only combined with very specifically-designed similarity measures, and sometimes the clustering result is very difficult to interpret.
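For reference, the message-passing updates of Frey and Dueck [63] can be written as follows. Here $S(i,k)$ is the similarity between curves $i$ and $k$, $R$ and $A$ denote the responsibility and availability matrices, and $\lambda \in [0, 1)$ is a damping coefficient; the symbols $R$, $A$, $\lambda$ and $\widehat{M}$ (the undamped update) are notation chosen here for illustration.

```latex
\begin{align*}
R(i,k) &\leftarrow S(i,k) - \max_{k' \neq k} \bigl[ A(i,k') + S(i,k') \bigr], \\
A(i,k) &\leftarrow \min\Bigl( 0,\; R(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0, R(i',k)\bigr) \Bigr), \quad i \neq k, \\
A(k,k) &\leftarrow \sum_{i' \neq k} \max\bigl(0, R(i',k)\bigr), \\
M^{(t+1)} &= \lambda\, M^{(t)} + (1-\lambda)\, \widehat{M}^{(t+1)}, \quad M \in \{R, A\}.
\end{align*}
```

After convergence, curve $i$ is assigned to the exemplar $k$ that maximizes $A(i,k) + R(i,k)$. A larger diagonal preference $S(i,i)$ yields more exemplars, which is why the diagonal initialization and the damping coefficient have such a strong influence on the result.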

EXPERIMENTAL SETUP
In this section, we describe our setup for the experimental study and evaluation of the aforementioned clustering algorithms using a number of well-known flow simulation data sets.
Short-length streamlines are preserved in case they convey important flow information. We perform pathline tracing for three unsteady data sets, i.e., 3D flow behind a square cylinder [65], a tube simulation [66] and blood flow simulation from Berg et al. [67] and Janiga et al. [68].

Clustering Algorithm Implementation and Parameter Selection
Given the flow simulation data sets in Section 3.1, we implement the following clustering algorithms mentioned in Section 2. All the clustering algorithms are run on a PC and a cluster. The PC has an Intel Xeon (R) CPU running at 2.40 GHz, 32 GB main memory, and an nVidia Quadro K4200 graphics card with 4 GB graphics memory. The cluster has 8 nodes and each node uses an Intel Xeon(R) CPU E5-2620 v4 @ 2.10 GHz and 64 GB main memory.
1) k-means: The maximal number of iterations is set to 50, and k-means++ [46] is provided as an option to the user for better centroid initialization.
2) k-medoids: We use Eq. (1) of Weiszfeld's algorithm [69] for iterative medoid computation because it matches the representative extraction of closest and furthest to centroid (i.e., medoids in k-medoids) in Section 3.6.
3) PCA clustering: Our implementation features two differences from Ferstl et al. [15]. First, we use k-means instead of AHC in the dimension-reduced space because we find that PCA-k-means works better than PCA-AHC in most tested data sets, as shown in Fig. 1. Second, we extract the centroid instead of the median to make the representative consistent with other clustering results in Section 3.6.
4) AHC: We build a bottom-up dendrogram, i.e., the hierarchical merging tree, until the preset number of clusters is obtained. Since only average and single linkage of AHC are utilized in related work, we only evaluate the results of these two linkages and leave the other two as user options. We adopt the idea from Walter et al. [70] to achieve a faster version of AHC by merging necessary tree nodes beforehand.
5) BIRCH: BIRCH (Section 2.2) is also implemented because of its faster computation of hierarchical clustering. However, BIRCH requires a distance value as user input, and hierarchically merges any two objects within this range. The value selection is intrinsically challenging and it is difficult to achieve cluster numbers similar to those of other hierarchical clustering methods. To address this challenge, we use a binary-search algorithm to adaptively adjust the distance threshold to obtain roughly similar numbers of clusters until a given maximal number of iterations is reached.
6) DBSCAN: Parameter setting is challenging and requires prior knowledge of the data sets for DBSCAN. To alleviate this, we provide two methods of setting $\varepsilon$ (radius). One is to set $\varepsilon$ to the average of the minPts-th smallest distance over all candidates, and the other relies on user input; the former is adopted for assessing the DBSCAN clustering due to its simplicity.
7) OPTICS: Besides the parameters $\varepsilon$ and minPts, OPTICS is also parameter-sensitive in finding clusters because the final clusters are determined by valleys of the 2D reachability plot. We use the naive idea of detecting valleys by steepness, as in Ankerst et al. [54], to determine the resulting number of clusters.
8) SC: Similar to Oeltze et al. [10], [11], [19] and Rossl and Theisel [35], we implement both k-means and eigen-rotation minimization for SC, and set the dimension of the reduced eigenvectors (k) to the preset cluster number. We use 5 percent as the scaling factor, as suggested by Oeltze et al. [10], [11].
9) AP: The maximal number of iterations is set to 40, and we use implicit iteration to update the availability and responsibility matrices to ensure numerical stability. Similar to Tao et al. [38], [39], we use the minimum similarity value as the preference initialization and two-level clustering to obtain a reduced number of clusters.
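The automatic radius setting for DBSCAN described above can be sketched as follows, assuming a precomputed pairwise distance matrix. The function name `estimate_eps` is hypothetical, and whether a curve's zero distance to itself is counted among the "smallest distances" is an assumption; this sketch excludes it.

```python
def estimate_eps(dist, min_pts):
    """Average, over all curves, of the distance to the min_pts-th nearest other curve."""
    kth = []
    for i, row in enumerate(dist):
        # sorted distances from curve i to every other curve (self-distance excluded)
        others = sorted(d for j, d in enumerate(row) if j != i)
        kth.append(others[min_pts - 1])
    return sum(kth) / len(kth)
```

Because the estimate is an average, curves in sparse regions pull the radius upward, so the value is best treated as a starting point rather than a tuned parameter.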

Similarity Measures
It is not possible or technically practical to consider every similarity measure proposed for integral curve comparison, especially those specifically designed for pattern search, which requires complicated pre-processing or segmentation. Therefore, we choose well-known and widely-accepted measures for our clustering study. Additionally, since our clustering analysis for both streamline and pathline data sets is based on geometric properties, we also consider one similarity measure specified only for pathlines ($d_T$ in Meuschke et al. [25], [26]).
1) Euclidean distance $d_E(\cdot, \cdot)$, which computes the pairwise euclidean distance of two curves.
2) Fraction norm $d_F(\cdot, \cdot)$, taken from Aggarwal et al. [71], which addresses the curse of dimensionality in high-dimensional spaces. We set $p = 0.5$.
3) Geometric similarity measure $d_G(\cdot, \cdot)$, introduced by Shi and Chen [72] based on the intuition that two curves are considered similar if their pairs of piecewise line segments are parallel.
4) Accumulated rotation difference $d_R(\cdot, \cdot)$, which measures the difference of the summation of discrete curvature along two curves [73].
5) Mean-of-closest-point (MCP) $d_M(\cdot, \cdot)$, which is considered a state-of-the-art distance metric in [10], [11], [17], [20], [21], [22], [23].
6) Hausdorff distance $d_H(\cdot, \cdot)$ from Rossl and Theisel [35], which is topologically meaningful and forms a metric space.
7) Signature-based measure $d_S(\cdot, \cdot)$, proposed by McLoughlin et al. [13], which uses a combination of both closest-point distance and the $\chi^2$ test of streamline signatures (discrete curvature). We choose a fixed number of signature bins for each streamline/pathline since our sampling strategy makes all streamlines/pathlines equal in size, and set $\alpha = 0.5$.
8) Adapted Procrustes distance $d_P(\cdot, \cdot)$, used in Tao et al. [38], [39], which is defined as the euclidean distance after Procrustes superimposition [74]. We set the local shape size $r$ to 7 as in Tao et al. [38], [39] and compute the average Procrustes distance over all pairs of local shapes among integral curve pairs.
9) Time-series MCP $d_T(\cdot, \cdot)$, introduced in Meuschke et al. [25], [26], which is MCP considering time interval overlapping and mismatching. This similarity measure is specifically suited for temporal pathline similarity computation. Note that since $d_T$ is introduced specifically for pathlines, we only apply it to our pathline data sets.
Fig. 1. PCA-k-means shows better and more consistent labels in clustering tested streamline data sets than PCA-AHC [15], e.g., in reduced Bernard. Note that PCA-based clustering resembles direct clustering based on euclidean distance, and streamlines should be separated into left and right bundles if the preset number of clusters is 2 for this data. PCA-k-means can achieve more consistent labeling than PCA-AHC, so in our experimental study we prefer PCA-k-means over PCA-AHC.
In all, $d_E$, $d_F$, $d_M$, $d_H$ and $d_T$ are spatial similarity measures that characterize the spatial proximity of integral lines, while $d_G$, $d_R$ and $d_P$ are shape-based measures that characterize shape similarity, and $d_S$ belongs to both groups. Note that we provide a simplified implementation of $d_S$ and $d_P$ using our consistent sampling strategy, and we opt not to discuss the parameters of the distance calculation since our focus is the clustering analysis and evaluation. More details on the formulas and a discussion of these similarity measures can be found in the supplementary materials, which are available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TVCG.2019.2940935.
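Three of the spatial measures listed above are simple enough to sketch directly. The code below assumes curves are given as lists of 3D points and uses the commonly cited symmetric form of MCP (the mean of the two directed mean closest-point distances); the function names are hypothetical, and the exact formulas used in the study are those in the supplementary materials.

```python
import math

def _closest(p, curve):
    """Distance from point p to the nearest sample on curve."""
    return min(math.dist(p, q) for q in curve)

def mcp(a, b):
    """Mean-of-closest-point distance: mean of the two directed mean closest-point distances."""
    ab = sum(_closest(p, b) for p in a) / len(a)
    ba = sum(_closest(q, a) for q in b) / len(b)
    return 0.5 * (ab + ba)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the largest closest-point distance in either direction."""
    ab = max(_closest(p, b) for p in a)
    ba = max(_closest(q, a) for q in b)
    return max(ab, ba)

def fraction_norm(a, b, p=0.5):
    """(sum |x - y|^p)^(1/p) over all coordinates of two equally sampled curves."""
    total = sum(abs(x - y) ** p for u, v in zip(a, b) for x, y in zip(u, v))
    return total ** (1.0 / p)
```

MCP and the Hausdorff distance work on curves with different sample counts, whereas the fraction norm (like the euclidean distance) compares samples index by index and therefore relies on a consistent sampling of both curves.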
Note that although most of the above similarity measures were designed for streamlines (except $d_T$), they can be applied to pathline comparison as well, even though pathlines may intersect. This is because both streamlines and pathlines are 3D curves with similar geometric characteristics, except that when comparing pathlines, points on different pathlines are aligned based on their respective time stamps.

Sampling Strategy
Since the centroid computation of each cluster in Section 3.6 requires all curves to have the same number of samples, we need to re-sample the input curves as a pre-processing step. Three common equal-size sampling strategies are considered.
- Directly padding the whole array by repeating the last vertex coordinates [15]. This strategy is easy to implement but can be problematic for curves with too few vertices.
- Evenly-spaced re-sampling given the total number of samples [28]. This strategy requires scanning each streamline at least twice and often distorts the original streamlines morphologically (e.g., tortuous streamlines might become flat) even with very dense sampling.
- Sampling based on signatures (e.g., curvature, torsion or tortuosity) [13], [38], [39]. This strategy is computationally expensive and the segmentation is often sensitive to parameters and user-input thresholds.
Our PCA clustering utilizes direct repeating as in Ferstl et al. [15]. For pathlines, we repeat their starting and/or end points such that all pathlines have the same number of points, including the same starting and terminating times and well-aligned intermediate points in time, which guarantees that the cluster centroid of pathlines is temporally meaningful. For streamlines, to achieve both efficiency and preservation of the geometric shape of curves as much as possible, we use a sampling method that preserves the original vertices while adding samples to the existing line segments. The total number of samples for each curve is set to the maximal vertex number of all input curves to preserve the initial geometry as much as possible, at the cost of additional storage. Given the maximal vertex count $M$, for an integral line with $n$ vertices, we uniformly embed $\left[\frac{M-n}{n-1}\right]$ new samples on each line segment between vertex $i$ and $i+1$ ($1 \le i < n$), starting from the beginning of the integral line, until the maximal vertex count ($M$) is reached.
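The vertex-preserving streamline sampling can be sketched as below. The rounding of the per-segment sample count is not recoverable from the text, so this sketch assumes a ceiling and stops adding samples once the budget of $M - n$ new points is used up; the function name `resample` is hypothetical.

```python
import math

def resample(curve, M):
    """Keep all original vertices and insert evenly spaced samples on segments until M points exist."""
    n = len(curve)
    if n < 2 or M <= n:
        return list(curve)
    per_segment = math.ceil((M - n) / (n - 1))   # assumption: bracket read as a ceiling
    budget = M - n                               # number of new samples still to place
    out = []
    for a, b in zip(curve, curve[1:]):
        out.append(a)                            # original vertex is always preserved
        extra = min(per_segment, budget)
        for s in range(1, extra + 1):
            t = s / (extra + 1)                  # uniform parameter along the segment
            out.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        budget -= extra
    out.append(curve[-1])
    return out
```

Because new points lie on existing segments, the resampled polyline traces exactly the same geometry as the input, which is the property that distinguishes this strategy from evenly-spaced re-sampling.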

A Priori Cluster Numbers
As discussed in Section 2, some clustering algorithms (i.e., k-means, k-medoids, AHC, PCA, SC+k-means) rely on a specific number of clusters as input, and users may have no prior knowledge of the simulation. Most work in flow visualization that requires the cluster number as input regards it as part of the interactive exploration, e.g., [15], [18], [20], [22]. Optimal cluster numbers can be detected under a given k (maximal cluster number) by SC with eigen-rotation minimization as discussed in [10], [11], [19] and by SC with k-way cut in [35], which provides a practical way to determine the number of clusters if it is needed. The L-method is applied to detect the optimal cluster number (less than 20) for blood flow pattern classification [25], [26]. In our paper, we implement two methods to detect an optimal number of clusters for a given streamline/pathline data set. The first one is SC with eigen-rotation minimization [10], [11], [19]. We set k (maximal cluster number) to a fixed number (i.e., 100, as a compromise between efficiency and accuracy, because the value of 20 used in [10], [11], [19] may not be enough), such that eigen-rotation minimization can find the optimal cluster number within this range. The second is the L-method [75], which can find the optimal k for hierarchical clustering by iteratively refining the knee of the "cluster-number versus merged-distance" graph. SC eigen-rotation minimization is claimed to be better than the L-method for blood flow data in [10], [11].
We consider the optimal k obtained using both the SC with eigen-rotation and the L-method for our quantitative study. Our experiments show that neither method is better than the other (see discussion in Section 2.1 in the supplementary document, available online). For the flow abstraction obtained using the clustering results, we choose an appropriate cluster number for each data set that is not dependent on either method. This is because too few clusters may fail to capture important features in streamlines/pathlines, and both L-method and SC can sometimes generate very small cluster numbers.

Select Cluster Representatives
Selecting the representative curves for a reduced representation of the original data is also critical after clustering. A naive method is to choose the centroid (or average) streamline [13], [14] as the representative for each cluster, which is often not adopted because the centroid streamline is artificially generated by averaging the streamlines within the same cluster, thus cannot reveal the authentic flow characteristics. In general, streamlines closest to centroids [18] or streamlines furthest away from centroids [76] are selected because they can often depict certain flow patterns. Besides, when streamlines are projected into lower dimensional space and further clustered, representative streamlines could be chosen as the actual median of clusters in the lower dimensional space [15] or closest to centroid in the streamline-embedding space [35]. Density-based representation is also applied [10], [11], [19] based on the information contained in voxels that streamlines pass through. An attribute-based representative strategy is recommended if the clustering is guided by an attribute of integral curves [10], [11]. Representative streamlines can also be chosen by iteratively removing the most similar streamlines until characteristic candidates remain [13], or derived from skeleton of line predicate-based streamline bundles [77]. For trajectory clusters, average coordinates w.r.t. average direction vector are collected as representatives [53]. Recently a new representative selection approach based on functional decomposition is proposed and is able to reduce clutter and to find important patterns [78].
Since the focus of this work is on the quality of different combinations of clustering techniques and similarity measures, we intentionally choose a naive strategy to avoid the discussion on how different similarity measures affect the definition of "most-representative". In particular, we extract the curve closest to or furthest from the centroid (the furthest curves are called boundary lines by Yu et al. [17]) as the representative of each cluster. Additionally, we choose stream tubes as the representative visualization, as suggested by Oeltze et al. [10], [11], because the streamtapes introduced by Chen et al. [18] cannot handle densely distributed vortex rings.
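The closest/furthest-to-centroid strategy described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes each curve has been resampled to the same number of points and flattened into a coordinate list, and it assumes plain euclidean distance to the point-wise mean curve. The function names are hypothetical.

```python
import math

def centroid(curves):
    """Point-wise mean of equally sampled curves (each a flat list of coordinates)."""
    n = len(curves)
    return [sum(c[k] for c in curves) / n for k in range(len(curves[0]))]

def euclidean(a, b):
    """Euclidean distance between two flat coordinate lists of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_representatives(curves, labels):
    """Return {cluster label: (index of closest curve, index of furthest curve)}."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    reps = {}
    for lab, members in clusters.items():
        center = centroid([curves[i] for i in members])
        # Pair each distance with the curve index so min/max return the index.
        dists = [(euclidean(curves[i], center), i) for i in members]
        reps[lab] = (min(dists)[1], max(dists)[1])
    return reps

# Toy data: five "curves" with two coordinates each, in two clusters.
curves = [[0, 0], [1, 0], [4, 0], [10, 10], [12, 10]]
labels = [0, 0, 0, 1, 1]
reps = select_representatives(curves, labels)
```

Because the representatives are actual input curves rather than the averaged centroid, they remain valid integral curves of the flow.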

Clustering Evaluation
So far, most integral curve clustering techniques for flow data rely on visual inspection and comparison to assess the clustering quality, which is rather subjective. Although many quantitative metrics have been introduced to evaluate clustering quality [41], [80], clustering analysis (especially quantitative analysis) is, to the best of our knowledge, not widely used in flow visualization. A weighted normalized adjusted Rand index (WNAR), an external quality measure w.r.t. ground truth, was introduced in flow visualization by Moberts et al. [21] for evaluating and validating clustering results. Normalized information distance (NID), borrowed from Vinh et al. [81], was also used by Yu et al. [17] to compare bottom-up clustering techniques before and after top-down balancing, and it demonstrates an improvement over the adjusted Rand index. Silhouette width, connectivity, Hubert's G statistic and stability have been used for comparing k-means, AHC and spectral clustering (SC) in [10], [11].
Due to the lack of ground-truth or pre-identified labels for comparison, external evaluation measures are not applicable in our clustering evaluation. We apply the following clustering evaluation measurements in our experiments partially borrowed from blood flow visualization analysis work [10], [11].
Silhouette width: According to Oeltze et al. [10], [11], silhouette width is a non-linear combined measure of cluster cohesion and separation; a higher value indicates higher cohesion within clusters and better separation among clusters.
Hubert's G statistic: Taken from [10], [11], with the implementation by Marghescu [82]. Higher correlation values indicate better quality of clustering results.
Davies-Bouldin index [83]: The smallest DB value is considered the best, since algorithms that produce clusters with low intra-cluster distances and high inter-cluster distances will have a low DB index.
Normalized validity measurement: The validity measurement was first proposed by Yousri et al. [79] and is defined in Eq. (2), where DD c measures the distance homogeneity and S c measures the density separateness for a cluster c; a smaller F indicates better clustering, as claimed by Yousri et al. [79]. This density-based validity measurement can evaluate clustering results with arbitrary cluster shapes and densities, as in Fig. 2a, and its correctness and effectiveness are shown in Fig. 2b. Note that only after normalization can the validity measure be applied to quantitatively evaluate and compare the clustering results across various similarity measures.
Fig. 2. The validity measurement [79] is a better clustering evaluation metric for arbitrary cluster shapes, and is normalized as in Eq. (3) for comparing different similarity measures. The three-ring data set in (a) indicates that validity measurement is a better evaluation metric than silhouette, DB index and G statistics for non-convex shapes of clusters in point cloud data sets using euclidean distance (better values marked bold). In (b) we amplify the coordinates of the points by 10 times while maintaining the same clustering, and validity measurement F in Eq. (2) [79] indicates that the clustering result before amplification is better than after, while the normalized validity measurement F N in Eq. (3) indicates that both clustering results are quantitatively the same.
Even though Yousri et al. [79] indicate that validity measurement is a more robust and effective evaluation than conventional metrics (e.g., silhouette, G statistics, DB index, all of which only work for convex cluster shapes) for clusters of arbitrary shapes (including clusters with convex shape) and density, we still follow the evaluation work of [10], [11] and consider all these evaluation metrics in our experimental study. The essential reason is that streamlines/pathlines often involve customized similarity measures (some of which are not even rigorous norms or metrics), and point-based knowledge under euclidean distance cannot be simply and directly extended to flow visualization. This might explain why silhouette, etc., are still used as a gold standard to quantitatively compare different flow clustering techniques, as in [10], [11].
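To make the cohesion/separation reading of the silhouette width concrete, the sketch below computes the standard mean silhouette from a precomputed pairwise distance matrix, which is how a customized curve similarity measure would enter the computation. It is a minimal illustration under the textbook definition s(i) = (b - a) / max(a, b), with the usual convention that members of singleton clusters contribute 0; it is not the implementation used in [10], [11], and the function name is hypothetical.

```python
def silhouette_width(dist, labels):
    """Mean silhouette over all items, given a symmetric distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: singleton members contribute s(i) = 0
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to any other cluster (separation)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

# Toy example: four items on a line at positions 0, 1, 10, 11.
pos = [0, 1, 10, 11]
dist = [[abs(p - q) for q in pos] for p in pos]
good = silhouette_width(dist, [0, 0, 1, 1])   # compact, well-separated clusters
bad = silhouette_width(dist, [0, 1, 0, 1])    # interleaved clusters
```

The compact labeling scores close to 1, while the interleaved labeling scores below 0, which matches the "higher is better" reading given above.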

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we report our experimental results and analysis. The average computation time of clustering algorithms with similarity measures for streamlines/pathlines is listed in Table 3 of the supplementary material, available online, and the time includes distance matrix computation because all the evaluation metrics in Section 3.7 depend on pair-wise similarity values between integral curves. From the table we see that two density-based clustering algorithms, i.e., DBSCAN and OPTICS, have the smallest computational overhead, while SC with eigen-rotation and AP take the most computational time. Meanwhile for similarity measures, d E and d R cost the least while d M and d H cost the most.
We also compare the clustering results both quantitatively (Section 4.1) and qualitatively (Section 4.2). From these results, we summarize a set of general guidelines for the selection of optimal clustering technique and similarity measures for specific tasks (Section 4.3).

Comprehensive Quantitative Analysis
We run the clustering algorithms of Section 3.2 with the similarity measures listed in Section 3.3 on 8 streamline/pathline data sets (see Section 3.1). As discussed in Section 3.5, for clustering algorithms that require the cluster number as input, we adopt the optimal cluster numbers given by both SC with eigen-rotation and the L-method, and average the quality values obtained with the two cluster numbers for each evaluation metric. The detailed tables of evaluation metrics for each data set can be found in the supplementary document, available online. Afterwards, we calculate the average evaluation values separately over all streamline data sets and over all pathline data sets, and create two evaluation metric tables (Table 13 for the average of streamlines and Table 14 for pathlines). Due to space limits we place all evaluation tables in the supplementary material, available online.
We use a ranking-based visualization technique, similar to the one applied to hex-mesh quality visualization in [84], for the two evaluation metric tables. First, we map all four evaluation matrices to the range [0.1,1.0] so that 0.1 always denotes the worst value and 1.0 the best, e.g., the largest silhouette value and the smallest validity value are mapped to 1.0. We start from 0.1 rather than 0 because evaluation metrics for some clustering combinations do not have values. To avoid mapping both those non-existing values and the worst value to zero, we map only the non-existing values to zero and the worst value to 0.1. In particular, we use dynamic mapping for the DB index and normalized validity measures since their values vary significantly. Fig. 3 shows the averaged evaluation/quality values for the individual combinations of clustering techniques and similarity measures organized w.r.t. the four evaluation metrics in the form of four matrices for streamlines. These four matrices summarize the evaluation measures obtained from the six streamline data sets. To help us better compare across different combinations, we re-order the rows and columns of the four matrices. Specifically, for each row, we compute the average value that characterizes the overall quality of a clustering technique w.r.t. all similarity measures. We then sort the rows based on their respective values in descending order from top to bottom so that the top row corresponds to the clustering technique that has the best average value. Similarly, we compute the average value for each column to characterize the corresponding similarity measure, based on which we sort the columns from left to right so that the similarity measure that works the best with all clustering techniques is in the left-most column. Note that PCA clustering is only performed based on d E (euclidean distance) in the lower-dimensional space, so all the other similarity measures cannot be combined with PCA.
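The mapping and re-ordering steps described above can be sketched as follows. This is an illustrative reading only: the text does not specify the "dynamic mapping" used for the DB index and validity values, so a plain linear mapping is assumed here; missing combinations are represented by `None`; and missing entries are assumed to count as 0 in the row and column averages. Function and variable names are hypothetical.

```python
def rescale_scores(matrix, higher_is_better):
    """Map existing values linearly to [0.1, 1.0] (1.0 = best); None becomes 0.0."""
    values = [v for row in matrix for v in row if v is not None]
    lo, hi = min(values), max(values)
    span = hi - lo
    out = []
    for row in matrix:
        new_row = []
        for v in row:
            if v is None:
                new_row.append(0.0)          # combination without a value
            elif span == 0:
                new_row.append(1.0)          # all existing values are equal
            else:
                t = (v - lo) / span
                if not higher_is_better:     # e.g., DB index, validity
                    t = 1.0 - t
                new_row.append(0.1 + 0.9 * t)
        out.append(new_row)
    return out

def sort_by_average(scores, row_names, col_names):
    """Order rows and columns by descending average score (best first)."""
    n_rows, n_cols = len(scores), len(col_names)
    row_avg = [sum(r) / n_cols for r in scores]
    col_avg = [sum(scores[i][j] for i in range(n_rows)) / n_rows
               for j in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda i: -row_avg[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_avg[j])
    ordered = [[scores[i][j] for j in col_order] for i in row_order]
    return ([row_names[i] for i in row_order],
            [col_names[j] for j in col_order],
            ordered)

# Made-up silhouette-like values for two techniques (rows) and two measures (columns).
raw = [[0.2, 0.8], [None, 0.5]]
scaled = rescale_scores(raw, higher_is_better=True)
rows, cols, ordered = sort_by_average(scaled, ["A", "B"], ["d1", "d2"])
```

Reserving 0 for missing entries keeps "no value" visually distinct from "worst value" in the resulting matrix.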

Results of Streamline and Pathline Clustering
Since clustering techniques and similarity measures are ranked across four evaluation metrics, we calculate the average ranking score of each clustering technique and each similarity measure for streamlines and pathlines, and sort the clustering techniques and similarity measures by their average ranking scores. Next, we discuss the ranking results for clustering techniques and similarity measures in detail.
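One plausible way to compute such an average ranking score is sketched below: each item is ranked within every metric (rank 1 = best), and the ranks are averaged across metrics, so a smaller score is better. The tie-breaking rule (stable order of input) and the input values are assumptions for illustration and do not come from the experiments; the function name is hypothetical.

```python
def average_ranks(scores_by_metric):
    """scores_by_metric: {metric: {name: score, higher is better}} -> {name: mean rank}."""
    ranks = {}
    for scores in scores_by_metric.values():
        # Best score first; sorted() is stable, so ties keep their input order.
        ordered = sorted(scores, key=lambda name: -scores[name])
        for rank, name in enumerate(ordered, start=1):
            ranks.setdefault(name, []).append(rank)
    return {name: sum(r) / len(r) for name, r in ranks.items()}

# Made-up rescaled scores for three techniques under two metrics.
example = {
    "silhouette": {"PCA": 0.9, "AHC-average": 0.8, "k-means": 0.7},
    "DB index": {"PCA": 0.6, "AHC-average": 0.9, "k-means": 0.5},
}
avg = average_ranks(example)
```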

Discussion of Clustering Techniques
We note that for both streamlines and pathlines, the top three clustering techniques by ranking-based visualization are PCA, AHC-average and k-means in descending order. As indicated in Section 2, these three often generate spherical and convex shapes of clusters, and convex shapes of clusters usually have better evaluation values in silhouette, G statistics and DB index. Therefore, the average ranking score of evaluation metrics highlight strong preference towards those clustering techniques that can generate convex shapes of clusters.
PCA ranks best because the algorithm in [15] for determining the number of principal components (PCs), i.e., a variance threshold (e.g., 0.99) preset for the first r PCs, often results in fewer than four dimensions. Thus, with either k-means or AHC-average as postprocessing, the lower-dimensional points can form convex or spherical clusters that lead to favorable scores for silhouette, G statistics and DB index. Besides, compared to many complicated high-dimensional similarity measures, euclidean distance in a lower-dimensional space more easily achieves superior performance, similar to its behavior on point cloud data sets.
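The variance-threshold rule for choosing the number of PCs can be illustrated with a short sketch. It assumes the eigenvalues of the covariance matrix (the per-component variances) are already available and keeps the smallest r whose cumulative share of the total variance reaches the threshold; this is the common reading of such a rule and is not taken verbatim from [15]. The function name and the sample eigenvalues are hypothetical.

```python
def num_components(eigenvalues, threshold=0.99):
    """Smallest r such that the r largest eigenvalues explain >= threshold of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for r, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return r
    return len(ordered)

# Made-up spectrum in which three components carry 99.5 percent of the variance.
r = num_components([50, 30, 19.5, 0.4, 0.1], threshold=0.99)
```

When the spectrum decays quickly, as in this toy example, r stays small, which is consistent with the low-dimensional embeddings discussed above.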
If we omit PCA for being unable to work with other similarity measures and consider other clustering algorithms, we find that AHC-average (ranking order is 3.25) ranks the best while being compatible with all similarity measures introduced in Section 3.3. From this perspective we can provide a quantitative reasoning why AHC-average is the most dominant clustering algorithm with various customized similarity measures/metrics in flow visualization as indicated in Section 2.2.
In addition, k-means also exhibits relatively good clustering evaluation scores. Compared to PCA and AHC-average, it has lower computational overhead (see performance Table 3 (streamlines) and Table 4 (pathlines) in the supplementary document, available online) and a lower memory requirement due to its linearity in both time and memory. This shows the great potential of k-means in handling large flow data sets generated from simulation and experiments, as k-means can provide a generally reasonable overview of the flow behavior (suggested by the clustering evaluation metrics) at a much lower cost. However, we should also be aware that the iterative procedure for refining the centers in k-means is strictly derived from the euclidean-based similarity measure (i.e., the minimization of distance variation); hence k-means coupled with customized, non-euclidean similarity measures (e.g., with d M in [10], [11] and d G in [72]) has no theoretical foundation despite achieving good clustering results in flow visualization.

Discussion of Similarity Measures
For streamlines and pathlines, both d R and d E are surprisingly good, especially for the G statistics and silhouette, likely because these metrics favor euclidean distance based measures. d R is a naive attribute-based similarity measure as described by Oeltze et al. [10], [11], which maps integral curves from the original space into a one-dimensional space by a mapping $\mathbb{R}^m \to \mathbb{R}^+$. After this mapping, line clustering is performed on points in $\mathbb{R}^+$. Due to the unconditional convexity of $\mathbb{R}^+$, evaluation metrics like silhouette, G and validity measurement tend to exhibit good values. However, the exception is the DB index, $DB = \frac{1}{q}\sum_{i=1}^{q}\max_{j \neq i}\frac{s_i + s_j}{d(c_i, c_j)}$, in which the centers of the clusters ($c_i$, $c_j$) are still computed in the original space, $\mathbb{R}^m$, instead of $\mathbb{R}^+$. After computation, centroid lines are often flattened such that $d(c_i, c_j)$ is often smaller, which generates a quite large DB index value as shown in Figs. 3c and 4c. d E also ranks high, which naturally results from the fact that all the existing clustering techniques and evaluation metrics are established based on the euclidean distance. We notice a significant difference in the d E ranking score between streamlines and pathlines, i.e., d E ranks lower for streamlines (3.75) than for pathlines (2.5). This means d E usually exhibits the best evaluation values for pathline clustering with all clustering techniques, while with streamlines this is not the case. The difference is caused by the fact that the dimension of pathlines is usually low (e.g., 303 for cylinder pathlines) compared to that of streamlines (e.g., usually larger than 1,800), as a lower-dimensional euclidean space, i.e., d E , often tends to have a better DB index, with lower intra-cluster distances and higher inter-cluster distances.
PCA, which has similarly low dimensions as discussed in Section 4.1.2, also tends to generate good G statistics and DB index values; together, these observations imply that d E is preferable in low-dimensional rather than high-dimensional space.
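The sensitivity of the DB index to the distance between cluster centers can be seen in a small sketch of the standard Davies-Bouldin computation. It assumes points given as coordinate lists, euclidean distance, and scatter s_i defined as the mean distance of the members to their center; it is an illustration of the textbook formula rather than the implementation used in the experiments, and the function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Standard DB index: mean over clusters of the worst (s_i + s_j) / d(c_i, c_j)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    names = list(clusters)
    if len(names) < 2:
        raise ValueError("DB index needs at least two clusters")
    centers, scatter = {}, {}
    for lab in names:
        members = clusters[lab]
        dim = len(members[0])
        center = [sum(p[k] for p in members) / len(members) for k in range(dim)]
        centers[lab] = center
        scatter[lab] = sum(euclid(p, center) for p in members) / len(members)
    total = 0.0
    for i in names:
        total += max((scatter[i] + scatter[j]) / euclid(centers[i], centers[j])
                     for j in names if j != i)
    return total / len(names)

# Same within-cluster scatter, but centers far apart versus close together.
far = davies_bouldin([[0], [2], [10], [12]], [0, 0, 1, 1])
near = davies_bouldin([[0], [2], [3], [5]], [0, 0, 1, 1])
```

With identical scatter, moving the centers closer raises the index, which mirrors the effect of flattened centroid lines described above.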
However, d E and d R are unfortunately not popularly used in the flow visualization community due to the fact that they fail to capture either the spatial proximity or feature related information in the flow. We can consider the similarity measures specifically serving the flow visualization community.
Streamlines. d M (4.0) is ranked third (better than d H , d S , d P and d G ). Compared to d R which loses the spatial proximity information of streamlines and d E which is rarely applied in streamline clustering, d M encodes more spatial proximity information, therefore, it is quantitatively regarded as the state-of-the-art measure for computing streamline similarity [11] and even for seeding curve searching [85].
Pathlines. d T (4.0) is ranked fourth (better than d H , d M , d G , d S and d P ); it is a revised MCP similarity measure for pathlines that accounts for time overlapping and mismatching. Note that d T has better values in clustering evaluation than the conventional d M ; we thus quantitatively demonstrate the advantage of d T over d M for pathline clustering, which further validates the work by Meuschke et al. [25], [26].

Visual Inspection
We now discuss the visual comparison of various results obtained using different clustering combinations, which will help determine whether the aforementioned quality metrics are effective or not in indicating the visual quality of the clustering results.
As discussed in Section 3.5, different optimal numbers of clusters obtained using either the eigen-rotation or the L-method may fail to provide a fair visual comparison, especially for flow abstraction. This is because different initial numbers of clusters may lead to varying numbers of representative curves for flow abstraction, and more representative curves generally lead to a more informative representation. Hence in Section 4.2.2 we use a constant number of clusters as input for all clustering algorithms that require the number of clusters as input. Specifically, we choose the largest number (not larger than 50) from all tested numbers of different similarity measures (see Table 2 in the supplementary document, available online) for each data set, e.g., 30 for Hurricane streamlines. The exception is the blood flow, for which we choose four because four cluster representatives are already clear enough to provide a complete visual comparison (see Fig. 12). Meanwhile, for those clustering algorithms that do not require a prescribed cluster number, e.g., DBSCAN and OPTICS, our default parameter setting fortunately generates appropriate numbers of clusters for all our testing data sets, which facilitates the subsequent visual comparisons.
From the visual inspection for different visualization tasks, we found that in general, clustering combinations indicated by quantitative quality metrics do not necessarily provide informative visualization, especially for visual abstraction of streamlines and pathlines. In the following, we provide a detailed discussion of our findings.
Before providing the detailed discussion, we wish to point out several general observations from the visual inspection of clustering results.
The BIRCH clustering technique always generates thousands of clusters when coupled with d G and d R which belong to shape-based similarity measures. This implies that BIRCH is better compatible with spatial proximity based similarity measures.
AP (Affinity Propagation) clustering (the two-level variant suggested by Tao et al. [38], [39]) sometimes results in either hundreds of single-curve clusters or only one cluster for some similarity measures, although it is better than single-level AP, which often generates more than 1000 clusters. This is severely problematic and impractical for visual inspection of the clustering results, since too many clusters cause occlusion and too few provide no useful information. Besides, the computational and memory cost for AP is often higher than that of other methods (note that SC-eigen can reduce computational time by using smaller k). Although AP can combine well with the adapted Procrustes distance [38], [39] and bag-of-features [36], [37], we argue that a deeper investigation should be undertaken to make AP a well-accepted clustering strategy for flow visualization, especially with various customized similarity measures that are specific to a given pattern.

Segmentation
Clustering techniques combined with similarity measures are often used to segment all the streamlines/pathlines into various groups in which candidates inside each group share similar characteristics or properties. Segmentation of integral curves can provide users with more precise and insightful understanding of the flow domain with less redundancy, hence is first used for visual inspection of our clustering results.
From the results to be discussed below, we found that for streamline/pathline segmentation, clustering combinations by best individual quality metrics cannot provide well-separated segmentation compared to those obtained using spatial similarity measures (d M , d H ). In contrast, those clustering techniques by average ranking scores (PCA and AHC-average) can. This is because in laminar flow spatially close streamlines/pathlines tend to exhibit similar geometric characteristics.
Streamline Segmentation. For streamlines, we can see from Fig. 3 that AHC-single with d P has the best silhouette, DB index values and validity measurement on average, while AHC-single with d R has the best G statistics. Therefore, we will visualize the corresponding segmentation results w.r.t. these four clustering combinations for comparison. In addition, PCA and AHC-average are claimed to be on average the top two clustering algorithms based on their ranking scores in Section 4.1.3 for streamline data sets. It is also known that clustering with spatial similarity measures, e.g., d E , d F , d M , d H and d S , can be used to achieve streamline segmentation. Therefore, we will also choose the segmentation results obtained using PCA and AHC-average with d M , d H and d S for comparison. Fig. 5 shows the set of the segmentation results of the crayfish data set selected using the above observation. From this comparison, we see that the results in the bottom row of Fig. 5 provide more reasonable and coherent spatial segmentation for streamlines, even though the top row results have relatively high evaluation metrics. If we correlate the clustering combinations in bottom row with Table 16 of the supplementary document, available online, we observe that the evaluation values of AHC-average with d M , d H and d S are not very different (i.e., the values are in the same order of magnitude) from AHC-single with d P in silhouette, G statistics and DB index, but with more than two-orders of magnitude difference (e.g., 1.2e-4 compared to 1.2e-6) in validity.
In addition, we observe that AHC-single causes a chaining effect, generating many clusters that each contain only a single curve, as pointed out previously in [10], [11], [15] (see Figs. 5a, 5b, 5c, and 5d), while more than 95 percent of streamlines are assigned to fewer than two clusters (we call them dominant clusters). Therefore, the evaluation metrics in Section 3.7 can only be computed on the few clusters containing more than two curves, and the evaluation values on these dominant clusters tend to be quantitatively good despite being biased.
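A quick diagnostic for this chaining behavior is to summarize the cluster sizes of a labeling. The sketch below counts singleton clusters and reports the share of curves falling into the largest clusters; the function name, the `top` parameter, and the sample labeling are hypothetical and serve only to illustrate the check.

```python
from collections import Counter

def cluster_size_summary(labels, top=1):
    """Count clusters and singletons, and the share of items in the `top` largest clusters."""
    sizes = Counter(labels)
    largest = sorted(sizes.values(), reverse=True)[:top]
    return {
        "clusters": len(sizes),
        "singletons": sum(1 for s in sizes.values() if s == 1),
        "dominant_share": sum(largest) / len(labels),
    }

# Made-up labeling: 96 curves in one cluster and four single-curve clusters.
labels = [0] * 96 + [1, 2, 3, 4]
summary = cluster_size_summary(labels, top=1)
```

A high dominant share together with many singletons indicates that the evaluation metrics are effectively computed on only a few clusters.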
Pathline Segmentation. For pathline data sets, we also compare their segmentation results generated with different clustering combinations, according to best individual evaluation metrics (Figs. 6a, 6b, 6c, and 6d), best average ranking scores in Section 4.1.2 (Figs. 6e, 6f, 6g, and 6h), and by manual selection of best visual segmentation (i.e., AHC-average with d M (i) and d H (j)), respectively. In this case, the silhouette favors AHC-average with d R (a), G statistics for AHC-average with d H (b), DB index for DBSCAN with d G (c) and validity for DBSCAN with d R (d). We find that segmentation results selected based on average ranking scores (PCA (e), AHC-average with d E (f), d F (g) and d T (h)) achieve similar quality compared to those obtained using spatial segmentation (e.g., the state-of-the-art AHC-average with d M (i) and d H (j)), while segmentation by best individual evaluation metrics cannot.
Fig. 6. Segmentation results for the combinations with the best individual evaluation metrics in Fig. 4 (see (a-d)), best average ranking-score clustering with highest ranking-score similarity measures (see (e-g)), and visually well-segmented combinations (see (h-j)), respectively.
Limitations of Density-based Clustering. We also discover an intrinsic drawback of density-based clustering segmentation, especially for streamline data sets. That is, the density-based clustering algorithms inappropriately treat important or geometrically interesting integral curves as outliers, especially with geometry-based similarity measures, as illustrated in Fig. 7. In particular, combining DBSCAN with geometry-based similarity measures (i.e., d G , d R , d S , d P ) is even worse than with spatial measures (e.g., d M , d H ). Nonetheless, in either case, some important streamlines (e.g., with strong swirling/rotation configuration) are completely omitted or lost in the output. This can be explained by the fact that DBSCAN considers them as outliers using the respective similarity measures. They are outliers because those important or geometrically interesting streamlines typically have rather different shapes from the majority of the other streamlines under geometry-based similarity measures; thus, their distance to the majority of the streamlines is very large (possibly much larger than the threshold for the determination of outliers). This in turn causes them to be considered as outliers due to an insufficient number of neighbors. We have observed similar behavior in the OPTICS results.
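The outlier mechanism described above follows directly from the DBSCAN definitions and can be illustrated with a small sketch operating on a precomputed distance matrix. An item is a core item if at least minPts items (itself included) lie within the distance threshold, and it is noise if it is neither a core item nor within the threshold of one. The code only identifies noise and does not form clusters; the names and the toy distances are hypothetical.

```python
def dbscan_noise(dist, eps, min_pts):
    """Indices DBSCAN would label as noise, given a symmetric distance matrix."""
    n = len(dist)
    # Neighborhoods include the item itself (dist[i][i] == 0), as in the usual definition.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    core = [len(neighbors[i]) >= min_pts for i in range(n)]
    noise = []
    for i in range(n):
        if core[i]:
            continue
        # A non-core item is a border item if some core item is within eps.
        if not any(core[j] for j in neighbors[i]):
            noise.append(i)
    return noise

# Toy 1D "shape descriptor": four similar curves and one with a very different shape.
values = [0, 1, 2, 3, 50]
dist = [[abs(a - b) for b in values] for a in values]
noise = dbscan_noise(dist, eps=1.5, min_pts=3)
```

The single item with a distinctive descriptor has no neighbors within the threshold and is therefore discarded as noise, even though it may be the most interesting curve.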
Unfortunately, this drawback of the density-based clustering cannot be easily overcome by tuning the parameters, minPts and ε. One reason is that trial-and-error parameter tuning becomes impractical when the clustering process (especially for streamlines) has a high overhead (see the performance of DBSCAN and OPTICS in Table 3 of the supplementary document, available online). This important observation indicates that density-based clustering may not be able to generate the desired segmentation results, which is in contrast to the judgment based on the evaluation metrics, in which the validity metric specifically favors DBSCAN (see Section 4.1.1). We wish to point out that this observation is consistent with recent work [30], where DBSCAN is applied with dimension-reduced euclidean distance for feature descriptors of streamlines and stream-surfaces. In contrast to their work, which focuses on clustering in a low-dimensional space and on spatial measures (d M and d H ) with interactive parameter tuning, we emphasize the drawback of DBSCAN with geometry-based similarity measures (d G , d R , d S and d P ) in the original (high-dimensional) space.
However, from Figs. 3d and 4d we observe that (normalized) validity measurement is biased towards DBSCAN, which contradicts the aforementioned visual observation for DBSCAN. The reason why DBSCAN clustering tends to have the best validity is due to the essential similarity of DBSCAN clustering with validity computation, in which they both use density-based concepts. Given a distance threshold defined in Section 3.2, only pairs whose distances are less than the threshold are considered members of the same cluster by DBSCAN. Therefore, it leads the resulting clusters to exhibit distance homogeneity ($h(\cdot,\cdot)$ in Eq. (2)) and density separateness ($g(\cdot,\cdot)$ in Eq. (2)) as small as possible computed from the MST (minimum spanning tree) algorithm. Thus, validity computed for DBSCAN clustering is on average smaller than other clustering algorithms. In this case, quantitative analysis contradicts visual inspection.

Flow Abstraction
After obtaining the segmentation of streamlines/pathlines, reduced representation of the original data can be achieved. The typical strategy is to choose one or more representative curves from each cluster/segment for visualization, as discussed in Section 3.6. In this section, we will compare the reduced representations of the tested data sets obtained using different clustering combinations to qualitatively evaluate the clustering quality. The selected representative curves are either the closest or the furthest curve to the centroid of a cluster (Section 3.6).
Similarly, we organize our discussion for streamline and pathline results separately. For the streamline results, we use the 3D flow behind a cylinder, crayfish, and plume simulation as examples, as they all contain vortical (or rotational/swirling) flows that are interesting to experts. For these simulations, we wish the selected representatives based on the clustering results to exhibit as much rotation as possible. To quantify that, we use a metric that measures the total amount of rotation (or directional change) along the individual representative curves [73]. Specifically, we use the average of the total rotation of the individual representative curves, denoted by F. Intuitively, the larger the value of F, the more rotation the selected curves exhibit. Note that F is not the only determinant for judging the quality of the visual abstraction of streamlines. In addition to F, we need to consider which abstraction covers the domain as much as possible if the F values of several abstractions are similarly high.
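As an illustration of a total-rotation measure of this kind, the sketch below sums the angles between consecutive segment directions of a polyline and averages the result over a set of representative curves. This discrete formulation is an assumption made for illustration; the exact definition in [73] may differ (e.g., in how the curve is sampled or weighted). The function names are hypothetical.

```python
import math

def total_rotation(points):
    """Sum of turning angles (radians) between consecutive segments of a polyline."""
    total = 0.0
    prev = None
    for p, q in zip(points, points[1:]):
        seg = [b - a for a, b in zip(p, q)]
        length = math.sqrt(sum(x * x for x in seg))
        if length == 0.0:
            continue  # skip duplicated sample points
        unit = [x / length for x in seg]
        if prev is not None:
            # Clamp to [-1, 1] to guard acos against rounding error.
            dot = max(-1.0, min(1.0, sum(u * v for u, v in zip(prev, unit))))
            total += math.acos(dot)
        prev = unit
    return total

def average_total_rotation(curves):
    """Average total rotation over a set of representative curves."""
    return sum(total_rotation(c) for c in curves) / len(curves)

straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
corner = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
loop = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0)]
avg = average_total_rotation([straight, corner])
```

A straight curve contributes zero, while a swirling curve accumulates rotation with every turn, so a set of representatives that captures vortices yields a larger average.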
For pathlines, we find that, from a geometric perspective, physically interesting pathlines should also exhibit high rotation, which means they are spiralling or circulating in the spatio-temporal domain. Hence, we also recommend judging which pathline abstraction captures more such important features based on the aforementioned F as well as spatial coverage, e.g., vortex regions and two subregions highlighted in the blue-dashed area in Fig. 12j.
In general, considering visual abstraction with aforementioned objectives, clustering combinations identified by the best individual quality metrics fail to present desired abstraction, while those by average ranking scores may sometimes, but not always, lead to desired abstractions.
Streamline Abstraction. Figs. 8a, 8b, 8c, and 8d show the reduced representations of the crayfish data that are generated with the clustering combinations identified by individual ((a) and (b)) and average quality metrics ((c) and (d)) with the best quality, respectively. As a comparison, a representation that visually reveals more vortical behaviors (or vortices) in this data is shown in (e). In these visualizations, the green curves are the representative streamlines, while the yellow transparent curves are the original input streamlines. We see that clustering combinations with the top individual quality metric values and top average ranking scores (except PCA) can select streamlines with the desired swirling behavior as well as the representation in Fig. 8e. Among the four representations suggested by the quality metrics, AHC-single with d P seems to perform the best both visually and quantitatively (i.e., having a large F value). This is in fact due to the chaining effect that tends to separate the streamlines with swirling configurations into individual clusters.
We also compare d M and d H to d P and d G when used with AHC-average clustering since the latter two both perform well in the average ranking score of evaluation and the visual inspection in streamline segmentation for the crayfish data (see Fig. 5). We find that in general d P and d G demonstrate a better ability to preserve the swirling streamlines in a reduced representation of the crayfish streamlines. Fig. 10 shows the reduced representations for the plume (top row) and the 3D flow behind cylinder (bottom row) data sets. For each row, we again show the representation suggested by the quality metrics and visual inspection, respectively. From this comparison, we observe that AHC-average with d M (see Figs. 10b and 10g) and d H (see Fig. 10d) and PCA (see Fig. 10c) are often effective in conveying the overview and overall structures of the flow, while AHC-average/single with d G , d R , d S , and d P , and k-means with d G tend to highlight and characterize the vortex details. Meanwhile, d P is not stable in feature highlighting. d P suffers from a chaining effect not only in AHC-single but also in other clustering algorithms (e.g., AHC-average), which results in a preference for large bundles of boundary streamlines (see red dashed part in Figs. 10a and 10f), instead of internal swirling streamlines. Sometimes it is able to extract vortex rings that no other measures can (see red dashed area in Fig. 11).
Fig. 8. (a) is obtained with the clustering having the best silhouette and DB index values, (b) has the best G statistics value, and (c) and (d) are the abstractions obtained with the clustering having the top average ranking scores. As a comparison, the visually most ideal result is shown in (e). Original streamlines are shown in yellow with lower opacity, while green streamlines are representatives with higher opacity. We find that abstractions indicated by evaluation metrics (except PCA (c)) can effectively select representative streamlines with strong swirling behavior for crayfish streamlines.
Fig. 9. AHC-average with d G and d P can provide better abstraction preserving swirling streamlines than d M and d S for crayfish streamlines, even though the latter two exhibit good segmentation results in Fig. 5. We also select one abstraction from k-means with d G in (e).
From the three reduced streamline representations, we conclude that clustering algorithms with d G , d R and d S are more robust and stable for feature highlighting in flow abstraction than d P . k-means with d G [72] and d R are able to capture streamlines with large curvature variation with similar quality to AHC-average with d S [13], while having the lowest computation cost (see Table 3 in the supplementary document, available online). Individual quality metrics (e.g., AHC-single with d P ) cannot provide a visually desired abstraction, while an average ranking score of quality metrics can partially (e.g., AHC-average with d R ) achieve this objective.
Combining Multiple Abstractions. In most cases, abstraction with one single similarity measure can only capture a subset of the features reflected by the streamlines, and it can be helpful to combine multiple abstractions to form a more complete visual representation of the important features. For example, in Fig. 11 AHC-average with d P can capture a sequence of nearly closed streamlines (highlighted in the dashed red rectangle) while failing to highlight a number of other swirling streamlines. In contrast, AHC-average with d S can detect more vortical features than d P but still misses the bottom-left vortex rings. Combining the representations obtained from the two clustering results leads to a more complete visualization of the flow.
This observation also holds for pathline abstraction. For example, in the blood flow abstraction results (see Fig. 12), AHC-average with d H (h) combined with PCA (i) can generate the required abstraction that preserves each highlighted feature, similar to AHC-average with d M (k). However, we do not elaborate on the benefit of combinations for pathline abstraction, since a single abstraction (such as (k)) already generates good results. This is in contrast to streamline abstraction, where no single abstraction can capture the complete and complex features.
Pathline Abstraction. Similarly, we compare the reduced representations suggested by the quality metrics. Guided by our experience that density-based clustering tends to classify streamlines/pathlines with distinct swirling behavior as outliers (see Fig. 7 and Section 4.2.1), we only consider results obtained using AHC-average with d R (by best silhouette in Fig. 4a) and AHC-average with d H (by best G statistics in Fig. 4b). In addition, the abstractions suggested by the average ranking score (i.e., AHC-average with d T and PCA) are also selected. Further, we select one or two reduced representations that can preserve pathlines of important features, e.g., pathlines around the vortices behind the cylinder in the cylinder pathlines (see blue-dashed area in Fig. 12e), and vortex regions and two subregions in the blood flow (see blue-dashed areas in Fig. 12j), as an additional visual comparison for the aforementioned abstractions from quantitative analysis.
From Fig. 12 we can see that abstractions suggested by individual evaluation metrics (i.e., AHC-average with d H in Figs. 12b and 12h) and average ranking scores (i.e., PCA in Figs. 12c and 12i) lose the aforementioned pathline features to a greater or lesser extent. However, in general AHC-average with shape-based similarity measures, i.e., d S (see Fig. 15 in the supplementary document, available online), d R and d P, and k-means with d G and d R are able to capture these important pathlines in the abstraction for the given pathline data sets. Specifically, for cylinder pathlines, AHC-average with d P (Fig. 12e) and k-means with d R (Fig. 12f) provide better focus on the vortical features than AHC-average with d R (Fig. 12a). For blood flow, AHC-average with d R (Fig. 12g) and d M (Fig. 12k) generate the best abstraction results by preserving not only the outlier vortices but also the vortex cores (see blue-dashed area in Fig. 12h) in the center region of the blood flow.
We conclude that AHC-average with shape-based similarity measures (d R, d S and d P) and k-means with d R can generate better representations of pathlines than those indicated by the best individual or averaged ranking scores of the evaluation metrics. Note that k-means with d G does not perform well for cylinder pathline abstraction (see Fig. 15 in the supplementary document, available online).
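The reduced representations compared in this section keep one representative curve per cluster. As a minimal sketch of a distance-based choice, the snippet below selects the medoid of each cluster, i.e., the member with the smallest summed dissimilarity to its cluster mates. Whether this matches the exact selection rule used for our figures is an assumption. The function name `cluster_medoids`, the convention that negative labels denote outliers, and the toy data are hypothetical.

```python
from collections import defaultdict


def cluster_medoids(dist, labels):
    """Pick one representative (medoid) per cluster.

    dist   : square list of lists, dist[i][j] = dissimilarity between curves i and j
    labels : labels[i] = cluster id of curve i; negative ids are treated as
             outliers and skipped (an assumed convention)
    """
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab >= 0:
            members[lab].append(idx)

    medoids = {}
    for lab, idxs in members.items():
        # The medoid minimizes the summed dissimilarity to the other members.
        # On ties, the member with the lowest index is kept.
        medoids[lab] = min(idxs, key=lambda i: sum(dist[i][j] for j in idxs))
    return medoids


# Toy data: each "curve" is reduced to a single scalar so distances are easy to check.
positions = [0.0, 1.0, 3.0, 10.0, 11.0, 50.0]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = [0, 0, 0, 1, 1, -1]  # the last curve is marked as an outlier

medoids = cluster_medoids(dist, labels)
```

Because the function only needs a pairwise dissimilarity matrix, the same sketch can be paired with any of the similarity measures discussed above.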

Empirical Guidelines
Fig. 10. Visual abstraction of streamlines from solar plume (top row) and cylinder flow (bottom row) by clustering algorithms (@ marks better abstraction). Original streamlines are shown in yellow with lower opacity, while green streamlines are representatives with higher opacity. We find that PCA and clustering algorithms with d M and d H characterize the overview of the flow, while clustering with d P, d S and d G tends to highlight and preserve swirling streamlines. Additionally, d G and d S are more stable than d P because d P is sometimes trapped in erratic boundaries instead of vortex rings due to a chaining effect; see the outlying boundary streamlines in the dashed rectangles of (a)(f).
Fig. 11. Abstraction combination of AHC-average with d P (green) and d S (red) for cylinder streamlines. Each of them may capture only part of the features inside the domain. For example, d P extracts lower-left vortices (highlighted in the dashed red rectangle) that cannot be captured with d S. Combining them creates a much better and more thorough abstract visualization.
By comparing the quantitative analysis and visual inspection of the clustering quality for streamlines/pathlines, we see that they need not agree with each other. This in part explains why most of the clustering approaches for geometric-based flow visualization do not rely on the well-established quantitative metrics to identify the most effective clustering technique and distance measures for their specific flow data. On the other hand, purely relying on visual inspection to determine the clustering result quality can be subjective and affected by many visualization factors (e.g., the rendering of the lines, view points, lighting, etc., as described in [86], [87], [88]). Nonetheless, based on the above assessment of the clustering results, we can offer the following guidance for streamline/pathline clustering in flow visualization.
4) AP clustering (including two-level) often generates too many or too few clusters at a higher computational cost. BIRCH is not compatible with shape-based similarity measures.
5) For the task of segmenting streamlines/pathlines, clustering algorithms with top average ranking scores (e.g., PCA and AHC-average with d M for streamlines, PCA and AHC-average with d T and d E for pathlines) can provide a better spatial segmentation than those obtained using the clustering combinations with the highest individual evaluation metrics. Besides, visual inspection also suggests that AHC-average with d M, d H, d S and d T can generate the most ideal segmentation of the streamline/pathline data sets. PCA produces good spatial segmentation for both streamlines and pathlines, and also exhibits good evaluation scores on average. Generally, a segmentation is best achieved by clustering algorithms with spatial similarity measures.
6) To generate a reduced representation for sets of densely placed streamlines, clustering algorithms (especially AHC-average) with the similarity measures d R, d G and d S are more robust than with d P in highlighting vortex structure and swirling streamlines, and k-means with d G and d R can also produce a desired reduced representation but at a much lower cost. For pathline abstraction, AHC-average with shape-based similarity measures (d R, d S and d P) and k-means with d R are preferred for preserving important features of pathlines. Unfortunately, these visual inspection results of abstraction often do not match the results obtained via quantitative evaluation.
7) d S [13] in general works well in both segmentation and reduced representation for streamlines because d S considers both spatial and shape similarity of streamlines. This indicates that similarity measures that linearly combine spatial and signature measures can be potentially helpful to flow visualization.
8) Compared to setting the input number of clusters for some clustering algorithms (i.e., k-means, k-medoids, AHC, SC k-means, and PCA), setting the parameter values (for DBSCAN and OPTICS), the distance threshold (for BIRCH) or the preference initialization (for AP) is usually more difficult and requires prior knowledge of the data sets. Hence in general, the clustering algorithms that only need the number of clusters to be set are more popular (especially AHC-average) for processing data sets without prior knowledge. This observation is similar to the discussion of AHC versus DBSCAN by Meuschke et al. [25], [26].
9) Low-dimensional Euclidean distance (d E) demonstrates good overall evaluation results in clustering analysis (see Section 4.1.1). Besides, d E in a lower-dimensional space offers the advantages of numerical efficiency, theoretical stability and predictable cluster shapes. These important advantages have led to burgeoning work in flow visualization that performs clustering after applying dimension-reduction techniques to streamlines/pathlines, e.g., PCA in the streamline variability plot [15] and t-SNE for feature descriptors after auto-encoder learning [30]. We believe that, after proper dimension reduction, d E combined with a carefully chosen clustering algorithm will form a trend in clustering frameworks for flow visualization.
10) Attribute-based streamline distances (e.g., d R in our experiment, linear and angular entropy in [18], and streamline attributes in [10], [11]) usually exhibit good evaluation scores for clustering analysis due to their ability to map high-dimensional streamlines to lower-dimensional Euclidean spaces. Accordingly, the design of similarity measures that aim to emphasize given flow characteristics can focus on the relevant attributes of integral curves, which can be beneficial for performing clustering in a low-dimensional Euclidean space.
11) Sometimes combining the reduced representations generated with two or more appropriate clustering techniques may capture more complete sets of features in streamlines, as illustrated in Fig. 11.
Finally, choosing appropriate clustering algorithms and similarity measures is a complex problem, which needs to consider not only the aforementioned visual/quantitative guidance, but also other aspects of the data, e.g., the size of the data sets and the required computation time, whether the focus is on spatial or shape difference, or customized similarity measures for specific purposes. In most cases, similarity measures designed in the flow visualization literature are not rigorous mathematical metrics; therefore, it is not possible to theoretically derive the properties of the metric space, like its topology discussed in [35] or convex analysis in [60], nor can the shapes of clusters be predicted so that a proper clustering can be determined and applied. There exists an inextricable disparity between theory and application for clustering in flow visualization, and the qualitative (visual) and quantitative (analytic) assessments of clustering results are often not compatible, which still requires in-depth investigation.
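The remark that many similarity measures are not rigorous metrics can be checked empirically on a pairwise dissimilarity matrix. The following is a minimal sketch that lists violations of the triangle inequality. The toy dissimilarity (the minimum closest-point distance between sampled curves) is chosen only because it visibly breaks the inequality; it is not claimed to be one of the measures evaluated in this study, and all names and data are hypothetical.

```python
from itertools import permutations


def min_closest_point_distance(a, b):
    """Smallest distance between any sample of curve a and any sample of curve b.

    Curves are lists of 1D sample positions to keep the example easy to verify.
    """
    return min(abs(p - q) for p in a for q in b)


def triangle_violations(dist, tol=1e-12):
    """Return index triples (i, j, k) with dist[i][k] > dist[i][j] + dist[j][k]."""
    n = len(dist)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if dist[i][k] > dist[i][j] + dist[j][k] + tol
    ]


# Curve 1 touches both curve 0 and curve 2, but curves 0 and 2 are far apart.
curves = [[0.0, 1.0], [1.0, 5.0], [5.0, 6.0]]
dist = [[min_closest_point_distance(a, b) for b in curves] for a in curves]

violations = triangle_violations(dist)
```

Here curves 0 and 2 are each at distance 0 from curve 1 yet at distance 4 from each other, so the check reports the triples (0, 1, 2) and (2, 1, 0). A similar configuration of touching curves is what allows single-linkage clustering to chain distant curves into one cluster.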

Comparison with Blood Flow Analysis Work
To the best of our knowledge, the only existing work on clustering analysis (visually and quantitatively) for flow visualization is the visualization of blood flow [10], [11]. Although this work shares some similarity with that work, e.g., both use silhouette and G statistics for evaluation, and both apply tube-based rendering for visualization, there are important differences between the two.
We evaluate a much larger set of clustering algorithms and similarity measures, while the blood flow analysis work only considers d M with k-means, SC with eigen-rotation and four types of AHC. We experiment on both streamline and pathline data sets with different flow characteristics, rather than on blood flow data sets alone. As a result, our conclusions are more general. We adopt a novel normalized validity measurement which proves to be more effective and general in evaluating clustering quality for point-based data sets than silhouette and G statistics (both of which only work for clusters of convex shape), and further investigate the drawback of only using validity as a quantitative reference in flow visualization. We employ detailed quantitative analysis and visual comparisons based on either individual evaluation metrics or their average ranking score, and thoroughly discuss the disparity between quantitative and visual preferences for different clustering combinations. Our quantitative conclusion for AHC-average in streamlines (see Section 4.1) is consistent with the blood flow work, where AHC-average works well according to both silhouette and G statistics. However, in the blood flow visualization, SC with eigen-rotation was claimed to exhibit good evaluation based on silhouette and G statistics (see Fig. 11 in the supplementary document, available online, where SC-eigen with d M exhibits the best three evaluation scores), while in our evaluation it is actually one of the worst of all clustering algorithms. We argue that the disparity in the ranking of SC-eigen w.r.t. silhouette and G statistics arises because we adopt an average evaluation over different similarity measures rather than just d M, and because quantitative conclusions from blood flow simulation cannot simply be extended to other streamline data sets. Based on our experimental results, the conclusion made by [10], [11] that SC (SC-eigen in our paper) is the best clustering technique only applies to the blood flow data set.
We believe this is due to the simpler and clearer flow features of the blood flow (see blood flow abstraction in Section 4.2.2) when compared to other physically sophisticated flows, e.g., crayfish and plume, with more hidden vortices. Besides, we found no further applications of SC to flow data sets other than blood flow after the streamline embedding work [35] (TVCG 2012). This is likely because SC (especially SC-eigen) has a high overhead in determining the optimal number of clusters for simulated flow data with a large value of k. In contrast to blood flow, whose k is not larger than 20 [10], [11], we usually do not have a priori knowledge of k for general simulated flow data; hence, we have to set k to a large number, which results in a longer computation time. Therefore, SC is often impractical for interactive exploration of large-scale flow data sets, while AHC is preferred instead (see details in Section 9 of the supplementary document, available online). Note that Han et al. [30] recently compared different clustering algorithms for feature descriptors of streamlines and stream surfaces through different dimension-reduction techniques. However, their work is purely qualitative and focuses on feature descriptors after auto-encoder learning, while our work is both quantitative and qualitative and operates in the original integral line space.
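Since both studies rely on the silhouette coefficient, it is worth noting that it can be evaluated directly from a pairwise dissimilarity matrix, so it can be paired with any similarity measure. The snippet below is a minimal sketch of the standard definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity of curve i to its own cluster and b(i) is the smallest mean dissimilarity to any other cluster. The function name and the toy data are hypothetical, and this is not the evaluation code used in either study.

```python
def silhouette_score(dist, labels):
    """Mean silhouette coefficient from a precomputed dissimilarity matrix.

    dist   : square list of lists, dist[i][j] = dissimilarity between curves i and j
    labels : labels[i] = cluster id of curve i
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Toy data: two well-separated groups of scalar "curves".
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = [0, 0, 1, 1]

score = silhouette_score(dist, labels)
```

For these two compact, well-separated groups the score is close to 1 (about 0.90), the regime in which the silhouette is most informative.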

CONCLUSION AND FUTURE WORK
In this paper we perform a comprehensive evaluation of popular clustering techniques coupled with a number of popular similarity measures. We use both quantitative analysis and visual inspection to assess the clustering results, and derive empirical guidance (see Section 4.3) for selecting an appropriate clustering and similarity measure combination. This is the first work, to the best of our knowledge, that attempts a comprehensive experimental study on a large number of clustering techniques and similarity measures for integral curve clustering. We believe the outcome of this study will benefit the practice of flow data visualization in achieving a reduced representation for streamline/pathline data sets.
Limitations and Future Work. There are a number of limitations of our current study that we wish to improve in future work.
First, we use a distance-based representative approach for visual inspection of clustering results, and it may be biased toward specific patterns, e.g., vortical structures in the streamlines. In the future we want to investigate several existing representative approaches to find the optimal method for characterizing the clusters of integral curves. In addition, there are more features or patterns conveyed in the streamlines (e.g., separation lines), and we wish to further explore them via clustering techniques.
Second, the similarity measures (except for d T) are generally designed for streamlines, and in the future we would like to investigate similarity measures designed specifically for pathlines to derive more accurate guidance on pathline clustering. Also, the visual reference for judging pathline abstraction might not be accurate and precise, and we would like to extend our experimental work to more complex and meaningful pathline data sets.
Third, clustering methods based on statistical analysis and information are not frequently applied in flow visualization, and they are mostly reported as inferior in comparisons. For example, PCA+GMM (Gaussian mixture model) is compared to PCA+AHC, and the latter is shown to be better for streamline variability clustering [15]. We would like to extend these statistics-based methods to integral curve clustering in the future, either in representative selection under functional decomposition [78], or using model-based clustering for turbulent combustion particle data [89]. Owing to the solid theory and well-constructed objectives of statistics-based methods, we believe they have considerable potential and versatile applications in conveying effective abstractions of complicated flow data sets.