Performs clustering of sequence data using a specified dissimilarity measure
and clustering method. Each row of the data is first converted to a string,
and the strings are then compared using the dissimilarity measures available
in the stringdist package.
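The row-to-string conversion described above can be sketched in base R. This is a minimal illustration and not the package's implementation: it assumes each row is collapsed into one string by concatenating its states, and it swaps in utils::adist() (Levenshtein distance, comparable to the "lv" option) for stringdist so that it runs without extra packages.

```r
# Toy wide-format data: one row per sequence, one column per time point
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# Assumption: each row is collapsed into a single string of states
strings <- do.call(paste0, data)

# Pairwise Levenshtein distances (base R stand-in for stringdist's "lv")
d <- utils::adist(strings)

# Hierarchical clustering on the distance matrix, cut into k = 2 groups
fit <- stats::hclust(stats::as.dist(d), method = "average")
assignments <- stats::cutree(fit, k = 2)
```

Identical rows (here rows 2 and 6) have distance zero and therefore always end up in the same cluster.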
Arguments
- data
  A data.frame or a matrix in wide format.
- k
  An integer giving the number of clusters.
- dissimilarity
  A character string specifying the dissimilarity measure. The available options are: "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", and "jw". See stringdist::stringdist-metrics for more information on these measures.
- method
  A character string specifying the clustering method. The available methods are "pam", "ward.D", "ward.D2", "complete", "average", "single", "mcquitty", "median", and "centroid". See cluster::pam() and stats::hclust() for more information on these methods.
- na_syms
  A character vector of symbols or factor levels to convert to explicit missing values.
- weighted
  A logical value indicating whether the dissimilarity measure should be weighted (the default is FALSE for no weighting). If TRUE, earlier observations of the sequences receive a greater weight in the distance calculation with an exponential decay. Currently only supported for the Hamming distance.
- lambda
  A numeric value defining the strength of the decay when weighted = TRUE. The default is 1.0.
- ...
  Additional arguments passed to stringdist::stringdist().
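To make the weighted and lambda arguments more concrete, here is a rough sketch of an exponentially decayed Hamming distance. The weight formula exp(-lambda * (i - 1)) for position i and the helper name weighted_hamming() are assumptions made for illustration; the documentation only states that earlier observations receive greater weight with an exponential decay, so the package may use a different parameterization.

```r
# Hypothetical helper: Hamming distance with exponentially decaying weights.
# Assumption: position i gets weight exp(-lambda * (i - 1)), so the first
# position has weight 1 and later positions count progressively less.
weighted_hamming <- function(x, y, lambda = 1) {
  stopifnot(length(x) == length(y))
  w <- exp(-lambda * (seq_along(x) - 1))
  sum(w * (x != y))
}

a <- c("A", "B", "C")

# A mismatch at the first position costs more than one at the last position
early <- weighted_hamming(a, c("B", "B", "C"), lambda = 1)
late  <- weighted_hamming(a, c("A", "B", "D"), lambda = 1)

# With lambda = 0 all weights equal 1 and the plain Hamming distance results
plain <- weighted_hamming(a, c("B", "A", "C"), lambda = 0)
```

Under this assumed formula, larger values of lambda concentrate the distance on the first few time points.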
Value
A tna_clustering object which is a list containing:
- data: The original data.
- k: The number of clusters.
- assignments: An integer vector of cluster assignments.
- silhouette: Silhouette score measuring clustering quality.
- sizes: An integer vector of cluster sizes.
- method: The clustering method used.
- distance: The distance matrix.
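As a rough guide to what the assignments, sizes, and silhouette components represent, the sketch below derives comparable quantities directly with the cluster package. It assumes that the silhouette score is the average silhouette width, which the documentation does not confirm, and it again uses utils::adist() as a stand-in for a stringdist measure.

```r
library(cluster)

# Sequences already collapsed to strings (same toy data as in the Examples)
strings <- c("ABC", "BAC", "ABA", "CAB", "ACB", "BAC")

# Distance object from pairwise Levenshtein distances
d <- stats::as.dist(utils::adist(strings))

# Partitioning around medoids on the precomputed dissimilarities
fit <- pam(d, k = 2, diss = TRUE)

# Integer vector of cluster assignments and the resulting cluster sizes
assignments <- fit$clustering
sizes <- as.integer(table(assignments))

# Assumption: clustering quality summarised as the average silhouette width
sil <- silhouette(assignments, d)
avg_sil <- mean(sil[, "sil_width"])
```

Average silhouette widths close to 1 indicate well-separated clusters, while values near 0 or below suggest overlapping or poorly assigned sequences.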
Examples
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)
# PAM clustering with optimal string alignment (default)
result <- cluster_sequences(data, k = 2)
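A few further calls, using only the arguments and return components documented on this page, might look as follows. The calls are guarded so that the snippet also runs when the package providing cluster_sequences() is not attached; the specific argument combinations are illustrative choices and not recommendations.

```r
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# Only run when cluster_sequences() is available in the session
if (exists("cluster_sequences", mode = "function")) {
  # Hierarchical clustering (Ward) on Hamming distances
  res_ward <- cluster_sequences(
    data, k = 2, dissimilarity = "hamming", method = "ward.D2"
  )

  # Weighted Hamming distance with a weaker decay than the default of 1.0
  res_weighted <- cluster_sequences(
    data, k = 2, dissimilarity = "hamming", weighted = TRUE, lambda = 0.5
  )

  # Components documented in the Value section
  print(res_ward$assignments)
  print(res_ward$sizes)
}
```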
