1 Introduction
This tutorial covers data-driven clustering of temporal sequences using the tna package. It is a companion to the main TNA tutorial and the group comparison tutorial. We assume familiarity with the basics of building and analyzing a TNA model.
When a meaningful grouping variable is available (e.g., achievement level, experimental condition), the group tutorial shows how to split data by that variable and compare group-specific models. But in many research settings, no such variable exists — or the researcher suspects that the true structure of behavioral differences does not align with any available categorical variable. In these cases, data-driven clustering discovers naturally occurring subgroups directly from the sequence data, without imposing predefined categories.
This tutorial demonstrates how to:
- Cluster sequences based on structural dissimilarity.
- Build group-specific TNA models from the discovered clusters.
- Compare and visualize cluster-specific networks.
- Validate cluster differences with permutation testing.
- Assess within-cluster edge reliability with bootstrapping.
- Choose an appropriate number of clusters.
1.1 Installation
The tna package is the only package required. It provides all the functions needed for data preparation, model building, visualization, clustering, permutation testing, and bootstrapping.
Install from CRAN:
install.packages("tna")
Or install the development version from GitHub:
# install.packages("remotes")
remotes::install_github("sonsoleslp/tna")
1.2 Data Preparation
We use the built-in group_regulation_long dataset, which contains coded collaborative regulation behaviors from student groups. The prepare_data() function converts this long-format event log into the structure required for TNA.
# Load the package and the built-in collaborative regulation dataset
library(tna)
data("group_regulation_long")
# Convert the long-format event log into TNA data
prepared_data <- prepare_data(
  group_regulation_long,
  action = "Action", # behavioral states (network nodes)
  actor = "Actor",   # participant IDs (one sequence per actor)
  time = "Time"      # timestamps (for ordering and session splitting)
)
# Build the aggregate TNA model (all sequences combined)
model <- tna(prepared_data)
prepare_data() Arguments
| Argument | Description |
|---|---|
| action | (Required) Column containing the events or states to model. These become the network nodes. |
| actor | Column identifying who performed the action. Creates one sequence per actor. |
| time | Column with timestamps. Sorts events and splits sequences at temporal gaps (default: 15 minutes). |
| order | Numeric column for event ordering when timestamps are unavailable. |
| time_threshold | Gap duration in seconds that starts a new session (default: 900). |
Any columns not specified as action, actor, time, or order are automatically preserved as metadata.
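The session-splitting rule described for time and time_threshold can be sketched in base R. This is only an illustration of the idea, not the package's implementation; the toy events data frame and the Session column are hypothetical names introduced here.

```r
# Toy event log for one actor (hypothetical data, not from the tna package)
events <- data.frame(
  Actor  = c("a1", "a1", "a1", "a1", "a1"),
  Action = c("plan", "monitor", "adapt", "discuss", "consensus"),
  Time   = as.POSIXct(
    c("2024-01-01 10:00:00", "2024-01-01 10:05:00", "2024-01-01 10:12:00",
      "2024-01-01 11:00:00", "2024-01-01 11:03:00"),
    tz = "UTC"
  )
)

time_threshold <- 900  # seconds; mirrors the documented default of 15 minutes

# Sort by actor and time, then compute the gap to the previous event per actor
events <- events[order(events$Actor, events$Time), ]
gap <- ave(as.numeric(events$Time), events$Actor,
           FUN = function(t) c(0, diff(t)))

# A gap above the threshold opens a new session for that actor
events$Session <- ave(as.integer(gap > time_threshold), events$Actor,
                      FUN = cumsum) + 1L

# One sequence of actions per actor and session
sessions <- split(events$Action, list(events$Actor, events$Session))
print(sessions)
```

In this toy log the 48-minute pause between the third and fourth event exceeds the threshold, so the actor contributes two sequences instead of one.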
This tutorial uses long-format event data, but tna() accepts several other input formats:
| Input Format | Function | Description |
|---|---|---|
| Long event log | prepare_data() then tna() | Timestamped events with actors |
| Wide data frame | tna(df) | Rows = sequences, columns = time points |
| Pre-computed matrix | tna(mat) | Square weight matrix with named rows and columns |
| TraMineR sequence | tna(seqobj) | Object from TraMineR::seqdef() |
| One-hot binary data | import_onehot() | Co-occurrence model from binary feature data |
2 Why Cluster?
A single TNA model computed from all sequences describes the average transition dynamics across all individuals. But averages can mask substantial heterogeneity. If one subgroup of learners follows a strategic regulatory cycle (plan → monitor → adapt) while another subgroup is stuck in social loops (discuss → consensus → discuss), the aggregate model shows a blend of both patterns that accurately represents neither.
Clustering addresses this by partitioning sequences into subgroups with distinct transition structures, allowing each cluster to be modeled separately. The result is a set of group-specific TNA models — one per cluster — that capture the actual behavioral patterns present in the data rather than an uninformative average.
This is exploratory analysis: we let the data reveal structure rather than imposing predefined categories. The discovered clusters may correspond to meaningful behavioral profiles — different learning strategies, different levels of engagement, different regulatory styles — that would be invisible in an aggregate analysis.
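The masking effect of averaging can be made concrete with a small base-R sketch. The sequences, the state names, and the transition_matrix() helper below are hypothetical and written for this illustration only; they are not part of the tna package.

```r
# Helper (hypothetical): row-normalized first-order transition matrix
transition_matrix <- function(seqs, states) {
  counts <- matrix(0, length(states), length(states),
                   dimnames = list(states, states))
  for (s in seqs) {
    from <- s[-length(s)]
    to   <- s[-1]
    for (i in seq_along(from)) {
      counts[from[i], to[i]] <- counts[from[i], to[i]] + 1
    }
  }
  totals <- rowSums(counts)
  totals[totals == 0] <- 1  # avoid division by zero for unused states
  counts / totals
}

states <- c("plan", "monitor", "discuss")

# Subgroup 1 always moves from planning to monitoring
strategic <- replicate(3, c("plan", "monitor", "plan", "monitor", "plan"),
                       simplify = FALSE)
# Subgroup 2 always moves from planning to discussion
social <- replicate(3, c("plan", "discuss", "plan", "discuss", "plan"),
                    simplify = FALSE)

m_strategic <- transition_matrix(strategic, states)
m_social    <- transition_matrix(social, states)
m_pooled    <- transition_matrix(c(strategic, social), states)

m_strategic["plan", ]  # all probability mass on "monitor"
m_social["plan", ]     # all probability mass on "discuss"
m_pooled["plan", ]     # split evenly between "monitor" and "discuss"
```

The pooled model suggests that planning is followed by monitoring half of the time, a pattern that no individual sequence in either subgroup actually shows.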
3 Running cluster_sequences()
The cluster_sequences() function takes a sequence data frame, computes pairwise dissimilarities between sequences, and partitions them into k groups:
# Partition sequences into 2 clusters based on pairwise dissimilarity
clustering <- cluster_sequences(prepared_data$sequence_data, k = 2)
print(clustering)
Clustering method: pam
Number of clusters: 2
Silhouette score: 0
Cluster sizes:
1 2
1010 990
The procedure has two steps:
- Compute pairwise dissimilarity between all sequences using a string distance metric. Each pair of sequences receives a dissimilarity score reflecting how structurally different they are.
- Cluster the dissimilarity matrix using the chosen algorithm and assign each sequence to one of k groups.
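The two steps can be sketched outside the package on a handful of toy sequences. This is a minimal illustration that assumes the cluster package (a recommended package shipped with standard R installations) is available; the toy data are hypothetical, and the code is not claimed to match what cluster_sequences() does internally.

```r
library(cluster)  # assumption: provides pam() for Partitioning Around Medoids

# Toy equal-length sequences (hypothetical), one row per sequence
seqs <- rbind(
  c("plan", "monitor", "adapt", "plan"),
  c("plan", "monitor", "adapt", "monitor"),
  c("discuss", "consensus", "discuss", "consensus"),
  c("discuss", "consensus", "discuss", "plan")
)

# Step 1: pairwise Hamming dissimilarity (number of differing positions)
n <- nrow(seqs)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(seqs[i, ] != seqs[j, ])
  }
}

# Step 2: partition the dissimilarity matrix into k groups with PAM
fit <- pam(as.dist(d), k = 2, diss = TRUE)
membership <- unname(fit$clustering)
print(membership)
```

Sequences 1 and 2 differ in a single position, as do sequences 3 and 4, while every cross-pair differs in at least three positions, so the two pairs end up in separate clusters.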
| Argument | Description | Default |
|---|---|---|
| data | Sequence data frame (e.g., prepared_data$sequence_data) | (none) |
| k | Number of clusters | (none) |
| dissimilarity | Distance metric: "hamming", "lcs", "cosine", "jaccard", "osa", etc. | "hamming" |
| method | Clustering algorithm: "pam", "ward.D2", "complete", "average", etc. | "pam" |
Dissimilarity metrics:
- "hamming": counts the number of positions where two sequences differ. Fast and interpretable; requires equal-length sequences.
- "lcs": longest common subsequence. Handles variable-length sequences and is sensitive to ordering.
- "osa": optimal string alignment. Allows insertions, deletions, and substitutions; flexible but slower.
- "cosine" / "jaccard": set-based metrics that compare state compositions regardless of order.
Clustering algorithms:
- "pam": Partitioning Around Medoids. Robust to outliers; identifies a representative sequence (medoid) for each cluster.
- "ward.D2": Ward’s hierarchical clustering. Minimizes within-cluster variance; produces compact clusters.
- "complete" / "average": other hierarchical linkage methods for different cluster shape assumptions.
4 Building Group TNA from Clusters
Once clusters are identified, we pass the clustering result directly to group_tna(), which builds a separate TNA model for each cluster. We also rename the clusters to give them more descriptive labels.
# Build one TNA model per cluster and assign descriptive names
gtna_clust <- group_tna(clustering)
gtna_clust <- rename_groups(gtna_clust, c("Pattern A", "Pattern B"))
Plotting the cluster-specific models side by side reveals how the discovered subgroups differ in their transition structures:
plot(gtna_clust, cut = 0.1, minimum = 0.05)