
Cluster Sequences via Dissimilarity Matrix based on String Distances
Source: R/clusters.R, R/print.R
Performs clustering on sequence data using specified dissimilarity measures
and clustering methods. The sequences are first converted to strings
and compared using the stringdist package.
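The string conversion and pairwise comparison described above can be sketched in a few lines. This is only an illustration, not the package's internal code: it assumes every state is a single character, and it uses base R's utils::adist() (generalized Levenshtein distance) as a stand-in for the stringdist package with the "lv" measure. The names seq_strings, d_mat, and d are chosen here for illustration.

```r
# Toy data: rows are sequences, columns are time points
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# Collapse each row into one string, e.g. "ABC"
# (assumption: each state is a single character)
seq_strings <- apply(as.matrix(data), 1, paste, collapse = "")

# Pairwise Levenshtein distances with base R
# (stand-in for the stringdist package, measure "lv")
d_mat <- utils::adist(seq_strings)

# Convert to a 'dist' object, the form expected by hclust() and pam()
d <- as.dist(d_mat)
```

Rows 2 and 6 of the toy data are identical, so their distance is zero, while "ABC" and "BAC" are two edits apart under plain Levenshtein distance.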
Arguments
- data
  A data.frame or a matrix where the rows are sequences and the columns are time points.
- k
  An integer giving the number of clusters.
- dissimilarity
  A character string specifying the dissimilarity measure. The available options are: "osa", "lv", "dl", "hamming", "qgram", "cosine", "jaccard", and "jw". See stringdist::stringdist-metrics for more information on these measures.
- method
  A character string specifying the clustering method. The available methods are "pam", "ward.D", "ward.D2", "complete", "average", "single", "mcquitty", "median", and "centroid". See cluster::pam() and stats::hclust() for more information on these methods.
- na_syms
  A character vector of symbols or factor levels to convert to explicit missing values.
- weighted
  A logical value indicating whether the dissimilarity measure should be weighted (the default is FALSE for no weighting). If TRUE, earlier observations of the sequences receive a greater weight in the distance calculation, with the weights decaying exponentially over time. Currently only supported for the Hamming distance.
- lambda
  A numeric value defining the strength of the decay when weighted = TRUE. The default is 1.0.
- ...
  Additional arguments passed to stringdist::stringdist().
- x
  A tna_clustering object.
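To make the weighted and lambda arguments more concrete, here is a minimal sketch of a Hamming distance with exponentially decaying position weights. The exact weighting used by the package is not stated in this page, so the formula below (weight exp(-lambda * (t - 1)) for position t) is an assumption, and weighted_hamming() is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper, not part of the package.
# Assumption: position t (1-based) gets weight exp(-lambda * (t - 1)),
# so earlier mismatches count more than later ones.
weighted_hamming <- function(a, b, lambda = 1) {
  stopifnot(length(a) == length(b))
  w <- exp(-lambda * (seq_along(a) - 1))
  sum(w * (a != b))
}

x <- c("A", "B", "C")
y <- c("B", "B", "C")  # differs from x at the first time point
z <- c("A", "B", "A")  # differs from x at the last time point

d_early <- weighted_hamming(x, y)              # mismatch weighted by exp(0)
d_late  <- weighted_hamming(x, z)              # mismatch weighted by exp(-2)
d_plain <- weighted_hamming(x, z, lambda = 0)  # lambda = 0 gives plain Hamming
```

Under this assumed scheme, a larger lambda makes the distance depend more strongly on the first few time points.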
Value
A tna_clustering object, which is a list containing:
- data: The original data.
- k: The number of clusters.
- assignments: An integer vector of cluster assignments.
- silhouette: Silhouette score measuring clustering quality.
- sizes: An integer vector of cluster sizes.
- method: The clustering method used.
- distance: The distance matrix.
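As a rough illustration of how components such as assignments, sizes, and silhouette can be derived from a distance matrix, the sketch below uses cluster::pam(), stats::hclust(), and cluster::silhouette() directly. It is not the package's implementation: the distance matrix comes from base R's utils::adist() rather than stringdist, and the variable names are chosen for illustration.

```r
library(cluster)  # pam() and silhouette()

# Sequences already collapsed to strings (assumed single-character states)
seq_strings <- c("ABC", "BAC", "ABA", "CAB", "ACB", "BAC")
d <- as.dist(utils::adist(seq_strings))
k <- 2

# Partitioning around medoids, applied directly to the dissimilarities
pam_fit <- pam(d, k = k, diss = TRUE)
assignments <- as.integer(pam_fit$clustering)

# Hierarchical alternative: build the tree, then cut it into k groups
hc_assignments <- cutree(hclust(d, method = "average"), k = k)

# Cluster sizes and the average silhouette width of the PAM solution
sizes <- table(assignments)
sil <- silhouette(assignments, d)
avg_sil <- mean(sil[, "sil_width"])
```

The average silhouette width lies between -1 and 1, with higher values indicating better separated clusters.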
Examples
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)
# PAM clustering with optimal string alignment (default)
result <- cluster_sequences(data, k = 2)
print(result)
#> Clustering method: pam
#> Number of clusters: 2
#> Silhouette score: 0.4345238
#> Cluster sizes:
#> 1 2
#> 3 3