1 Introduction
Transition Network Analysis (TNA) has emerged as a methodologically rigorous framework for modeling sequential processes as weighted directed networks, combining the temporal resolution of stochastic process mining with the structural analytic capacity of graph theory (Saqr, López-Pernas, Törmänen, et al., 2025). Since its introduction at LAK 2025, the framework has expanded rapidly: it now supports multiple model types—first-order Markov models, frequency-based transition models, co-occurrence models, and attention-based models—each suited to different data characteristics and research questions. What distinguishes TNA from earlier approaches to sequential and process data is that every level of analysis is subject to confirmatory testing. Edge significance is established through bootstrapping, centrality stability is quantified through case-dropping with correlation stability coefficients, and group differences are evaluated through permutation testing that produces edge-level p-values, effect sizes, and corrections for multiple comparisons. The comparison framework itself operates at four levels of granularity—edge, summary, centrality, and global topology—providing researchers with the means to specify where two networks differ, how much they differ, and whether those differences are statistically reliable. The framework is implemented in the tna R package (Tikka, López-Pernas, & Saqr, 2025), with companion interfaces in Jamovi and a Shiny web application. Empirical applications have already appeared in studies of regulatory interaction in collaborative learning, emotional dynamics during group work, and student–AI interaction processes. This combination of replicability safeguards at each analytic step, a formal hypothesis-testing apparatus for edges, centralities, and group comparisons, and a growing base of applied work positions TNA as an inferential method capable of supporting the confirmatory claims that theory development in the learning sciences requires.
This tutorial demonstrates the complete TNA workflow using the tna R package (version 1.1.0), starting from raw long-format event logs — the kind of data typically exported from log files, coded interaction data, or learning management systems. We use the built-in group_regulation_long dataset, which contains coded collaborative regulation behaviors from student groups.
For group comparisons and permutation testing, see the companion tutorial: TNA Group Analysis. For data-driven clustering of sequences into latent subgroups, see the TNA Clustering tutorial.
2 Installation
The tna package is the only package required for this tutorial. It provides all the functions needed for data preparation, model building, visualization, pruning, centrality analysis, community detection, bootstrapping, permutation testing, and sequence analysis — no additional dependencies need to be loaded.
Install the stable release from CRAN:

install.packages("tna")
Alternatively, install the development version from GitHub to access the latest features:
# install.packages("remotes") # if not already installed
remotes::install_github("sonsoleslp/tna")

Once installed, load the package:
library("tna")3 Getting Started with Long-Format Data
The tna package includes prepare_data(), the function that converts long-format event data into the structure required for TNA. It handles many tedious issues under the hood, making the workflow straightforward and less error prone: it takes care of ordering and session detection, and — most importantly — it preserves metadata columns (like achievement group, gender, or course) so you can use them later for group comparisons without manual data wrangling.
For this tutorial we will use the dataset built into TNA, but you can use any event dataset whose actions, states, or behaviors are stored in chronological order. Let’s have a look at the data.
# Load the built-in dataset of coded collaborative regulation behaviors
data("group_regulation_long")
group_regulation_long

The group_regulation_long dataset has 27,533 rows and 6 columns. Each row is a single event: an action performed by an actor at a specific point in time, along with metadata columns like Achiever (High/Low achievement group).
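Before preparing the data, a quick base-R check of the column names, types, and group sizes can prevent surprises (nothing TNA-specific here):

# Inspect column names and types
str(group_regulation_long)
# Tabulate the achievement groups carried in the metadata
table(group_regulation_long$Achiever)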
4 Understanding prepare_data()
The prepare_data() function is the bridge between your raw event data and TNA:
# Convert long-format event log into sequences for TNA
prepared_data <- prepare_data(
group_regulation_long,
action = "Action", # column with behavioral states (become network nodes)
actor = "Actor", # column with participant IDs (one sequence per actor)
time = "Time" # column with timestamps (for ordering and session splitting)
)

4.1 Arguments
Each argument controls a different aspect of how the raw data is converted into sequences. The action argument defines the behaviors or events that serve as nodes in your network model. Without additional arguments, all rows are treated as one long continuous sequence. Including actor identifies individual participants so that modeling happens on a per-person basis; this is essential for ensuring that transitions only occur between events from the same individual. The time argument uses timestamps to sort events chronologically, so shuffled data frame rows are handled correctly.
action — what happened
The only argument the function technically requires. This is the name of the column that contains the events, states, or behaviors you want to model — things like “Plan”, “Monitor”, “Discuss”. These become the nodes in your network.
# Minimal call --- works, but probably not what you want
prepared <- prepare_data(df, action = "Action")

If you call prepare_data() with just action, the function reads every row in the data frame from top to bottom and treats the whole thing as one long sequence. Row 1 transitions to row 2, row 2 transitions to row 3, and so on, all the way down. That means:
- Every person’s events get chained together as if they were one continuous stream. The last event of student A transitions directly into the first event of student B, which makes no sense — those two events have nothing to do with each other.
- The row order in your data frame is the sequence order. If your data is not sorted properly, the transitions will be wrong.
- There is no session splitting. If a student did something on Monday and something else on Thursday, those two events are treated as consecutive steps in the same sequence.
For a classroom observation where one researcher coded one continuous stream of events in real time, this minimal call might actually be fine. But for most research data — where you have multiple participants, timestamps, or natural breaks between sessions — you need the other arguments.
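To make the difference concrete, here is a sketch comparing the minimal call with an actor-aware call on the built-in dataset, inspected with the print() method described in Section 4.2 (note that the pooled version produces a single, extremely wide sequence):

# Minimal call: all 27,533 events chained into one long sequence
seq_pooled <- prepare_data(group_regulation_long, action = "Action")
# Actor-aware call: one sequence per participant
seq_by_actor <- prepare_data(group_regulation_long, action = "Action", actor = "Actor")

# Compare the wide-format sequence data: one very wide row vs. many rows
print(seq_pooled, data = "sequence")
print(seq_by_actor, data = "sequence")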
actor — who did it
The name of the column identifying who performed the action: a student ID, a user ID, a group ID, whatever defines a unit of analysis. When you include actor, the function creates one sequence per actor instead of mashing everyone together.
prepared <- prepare_data(df, action = "Action", actor = "Actor")

This is the single most important argument after action. Without it, you get one sequence for the whole dataset. With it, you get one sequence per person (or per group, or per whatever your actor column represents). Almost every analysis needs this.
The function sorts events within each actor by their row order. So if your data is already sorted chronologically within each actor, this is enough. If it is not sorted, you need time or order to fix that.
time — when it happened
The name of the column containing timestamps. This does two things:
- Sorts events in the right order. Within each actor, events get sorted by their timestamp, so it does not matter if your data frame rows are shuffled.
- Splits sequences at gaps. If two consecutive events from the same actor are more than 15 minutes apart (the default threshold), the function treats them as belonging to different sequences. A student who works from 9:00–9:30, takes a break, and comes back at 10:15 gets two separate sequences instead of one. This matters because a transition from the last event before a break to the first event after a break is not a real transition — there is a gap in between where the process stopped.
prepared <- prepare_data(df, action = "Action", actor = "Actor", time = "Time")

The 15-minute default works well for many learning analytics datasets, but your data may need something different. A chat conversation might need a 2-minute threshold. A longitudinal study with weekly sessions might need a 1-day threshold. You can change it with time_threshold:
# 10-minute gap starts a new sequence
prepared <- prepare_data(
df, action = "Action", actor = "Actor",
time = "Time", time_threshold = 10 * 60
)
# 1-hour gap starts a new sequence
prepared <- prepare_data(
df, action = "Action", actor = "Actor",
time = "Time", time_threshold = 60 * 60
)

The value is in seconds, so multiply minutes by 60 or hours by 3600.
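Following the same pattern, the 1-day threshold suggested above for studies with weekly sessions would be:

# 1-day gap starts a new sequence (e.g., weekly study sessions)
prepared <- prepare_data(
  df, action = "Action", actor = "Actor",
  time = "Time", time_threshold = 24 * 60 * 60
)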
order — what came first
Sometimes your data does not have timestamps, but it does have a column that tells you the order of events — a step number, a turn counter, a line number. The order argument tells the function to sort events within each actor by that column.
prepared <- prepare_data(df, action = "Action", actor = "Actor", order = "step")

If both time and order are provided, data is sorted by time first with ties broken by order. This is useful when multiple events share the same timestamp and you need a secondary sort criterion.
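As a sketch of that tie-breaking setup (assuming a hypothetical turn-counter column named "Turn"):

# Sort by timestamp within each actor; break ties with the turn counter
# ("Turn" is a hypothetical column name; substitute your own)
prepared <- prepare_data(
  df, action = "Action", actor = "Actor",
  time = "Time", order = "Turn"
)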
Any columns not specified as action, actor, time, or order are automatically preserved as metadata. The Achiever column (High/Low) is preserved and can be used later with group_tna() for group comparisons.
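Because Achiever survives as metadata, a grouped model can be built later with no extra wrangling. A minimal sketch (group comparisons themselves are covered in the companion Group Analysis tutorial):

# One TNA model per achievement group, using the preserved metadata column
grouped_model <- group_tna(prepared_data, group = "Achiever")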
4.2 Inspecting the Prepared Data
The output of prepare_data() is a list with three parts (sequence data, metadata, and the original long-format data), and it is the input that TNA uses to build models for analysis.
- The first is the sequences in wide format — a data frame where each row is one sequence and each column is a position, so a sequence of length 7 sits in the first 7 columns and the rest are NA.
- The second is the metadata — a data frame with one row per sequence containing every column from your original data that you didn’t assign to action, actor, time, or order, such as achievement level or experimental condition. This is how group_tna() knows which sequences belong to which group when you write group_tna(prepared_data, group = "Achiever").
- The third is the original long-format data, sorted and tagged with sequence IDs, which functions like plot_sequences() and compare_sequences() go back to when they need event-level detail.

The three parts share the same indexing: row 1 of the wide sequences, row 1 of the metadata, and all long-format rows tagged as sequence 1 refer to the same actor or session. You can inspect each part with print(prepared_data, data = "sequence"), print(prepared_data, data = "meta"), and print(prepared_data, data = "long"). The point is that you run prepare_data() once and pass the result to everything else — tna(), group_tna(), cluster_sequences(), plot_sequences() — without reshaping anything again.
# View the wide-format sequence data (rows = sequences, columns = positions)
print(prepared_data, data = "sequence")# View the preserved metadata (e.g., Achiever group) for each sequence
print(prepared_data, data = "meta")