
Cluster Sequences via Dissimilarity Matrix based on String Distances
Source: R/clusters.R, R/print.R
cluster_sequences.Rd
Performs clustering on sequence data using specified dissimilarity measures
and clustering methods. The sequences are first converted to strings
and compared using the stringdist package.
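The general idea can be sketched outside the function as follows. This is a minimal illustration, not the function's actual implementation: it assumes single-character states, collapses each row into a string with paste(), and calls stringdist::stringdistmatrix(), cluster::pam(), and stats::hclust() directly. The variable names (seqs, d, fit, hc, groups) are made up for the example.

```r
library(stringdist)
library(cluster)

data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# Assumption: each state is a single character, so a row can be
# collapsed into one string such as "ABC"
seqs <- apply(data, 1, paste, collapse = "")

# Pairwise optimal string alignment distances, returned as a "dist" object
d <- stringdist::stringdistmatrix(seqs, method = "osa")

# Partitioning around medoids on the dissimilarity matrix
fit <- cluster::pam(d, k = 2, diss = TRUE)
fit$clustering

# Hierarchical alternative: build the tree, then cut it into k groups
hc <- stats::hclust(d, method = "complete")
groups <- stats::cutree(hc, k = 2)
groups
```

The same dissimilarity matrix feeds both clustering families, which is why the dissimilarity measure and the clustering method can be chosen independently.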
Arguments
- data
  A data.frame or a matrix where the rows are sequences and the columns are time points.
- k
  An integer giving the number of clusters.
- dissimilarity
  A character string specifying the dissimilarity measure. The available options are: "osa", "lv", "dl", "hamming", "qgram", "cosine", "jaccard", and "jw". See stringdist::stringdist-metrics for more information on these measures.
- method
  A character string specifying the clustering method. The available methods are "pam", "ward.D", "ward.D2", "complete", "average", "single", "mcquitty", "median", and "centroid". See cluster::pam() and stats::hclust() for more information on these methods.
- na_syms
  A character vector of symbols or factor levels to convert to explicit missing values.
- weighted
  A logical value indicating whether the dissimilarity measure should be weighted (the default is FALSE for no weighting). If TRUE, earlier observations of the sequences receive a greater weight in the distance calculation, with the weights decaying exponentially over time. Currently only supported for the Hamming distance.
- lambda
  A numeric value defining the strength of the decay when weighted = TRUE. The default is 1.0.
- ...
  Additional arguments passed to stringdist::stringdist().
- x
  A tna_clustering object.
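To make the effect of weighted and lambda concrete, here is a small base R sketch of a Hamming distance with exponentially decaying position weights. The helper weighted_hamming() is hypothetical, and the weight exp(-lambda * (t - 1)) for position t is an assumption made for illustration; the weighting used inside the package may be defined differently.

```r
# Hypothetical helper: Hamming distance with exponentially decaying weights.
# Assumption: position t receives weight exp(-lambda * (t - 1)), so the
# first position has weight 1 and later positions count progressively less.
weighted_hamming <- function(a, b, lambda = 1.0) {
  stopifnot(length(a) == length(b))
  w <- exp(-lambda * (seq_along(a) - 1))
  sum(w * (a != b))
}

x <- c("A", "B", "C")
y <- c("B", "B", "C")  # differs from x at the first position
z <- c("A", "B", "A")  # differs from x at the last position

weighted_hamming(x, y)  # mismatch at position 1, weight exp(0) = 1
weighted_hamming(x, z)  # mismatch at position 3, weight exp(-2)
```

Under this assumed scheme, a larger lambda makes late mismatches matter less, while lambda = 0 gives every position a weight of 1 and recovers the ordinary Hamming distance.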
Value
A tna_clustering object, which is a list containing:
- data: The original data.
- k: The number of clusters.
- assignments: An integer vector of cluster assignments.
- silhouette: Silhouette score measuring clustering quality.
- sizes: An integer vector of cluster sizes.
- method: The clustering method used.
- distance: The distance matrix.
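Since the return value is a list, its elements can be read with the usual $ operator. The sketch below assumes that cluster_sequences() is provided by the tna package and uses only the arguments and element names documented on this page; the particular choice of "hamming" with "ward.D2" is just an example.

```r
library(tna)  # assumption: cluster_sequences() is exported by the tna package

data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)

# Hierarchical clustering (Ward) on Hamming distances
result <- cluster_sequences(
  data,
  k = 2,
  dissimilarity = "hamming",
  method = "ward.D2"
)

result$assignments  # cluster label for each row of data
result$sizes        # number of sequences in each cluster
result$silhouette   # overall clustering quality

# Cross-check: tabulating the assignments should match the cluster sizes
table(result$assignments)
```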
Examples
data <- data.frame(
  T1 = c("A", "B", "A", "C", "A", "B"),
  T2 = c("B", "A", "B", "A", "C", "A"),
  T3 = c("C", "C", "A", "B", "B", "C")
)
# PAM clustering with optimal string alignment (default)
result <- cluster_sequences(data, k = 2)
print(result)
#> Clustering method: pam
#> Number of clusters: 2
#> Silhouette score: 0.4345238
#> Cluster sizes:
#> 1 2
#> 3 3