TNA Data Preparation: A Comprehensive Guide to prepare_data()

Companion Tutorial for Transition Network Analysis

tutorial
R
Authors

Mohammed Saqr (University of Eastern Finland)

Sonsoles López-Pernas (University of Eastern Finland)

Published

February 6, 2026

1 Introduction

Before building a TNA model, your raw event data needs to be reshaped into sequences. The prepare_data() function does this in one call. For most data, the call is three arguments and you are done.

This tutorial has two parts:

  1. Quick Start — the three-argument call that covers 90% of use cases. Start here.
  2. Detailed Guide — session splitting, time parsing, output anatomy, edge cases. Only needed when the defaults don’t fit your data.


1.1 Installation

install.packages("tna")

Or the development version:

# install.packages("remotes")
remotes::install_github("sonsoleslp/tna")

2 Quick Start

2.1 Load your data

Your data should be in long format: one row per event, with columns for what happened, who did it, and when.
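For example, a minimal long-format event log (toy data; your column names may differ) looks like this:

# One row per event: who (Actor), what (Action), when (Time)
events <- data.frame(
  Actor  = c("A1", "A1", "A1", "A2", "A2"),
  Action = c("plan", "discuss", "monitor", "plan", "adapt"),
  Time   = as.POSIXct(c(
    "2024-01-01 09:00:00", "2024-01-01 09:02:30", "2024-01-01 09:05:10",
    "2024-01-01 09:01:00", "2024-01-01 09:04:45"
  ))
)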

We use the built-in dataset as an example:

# Load the package and its built-in dataset of coded collaborative regulation behaviors
library(tna)
data("group_regulation_long")
group_regulation_long

The columns that matter for prepare_data():

  • Action — what happened (the behavioral state). These become network nodes.
  • Actor — who did it (participant ID). One sequence per actor.
  • Time — when it happened (timestamp). Used for sorting and session splitting.

Everything else (Achiever, Group, Course) is kept automatically as metadata.

2.2 Prepare the data

# Three arguments: action, actor, time. That's it.
pd <- prepare_data(
  group_regulation_long,
  action = "Action",
  actor = "Actor",
  time = "Time"
)

That’s the whole call. Events are sorted by time within each actor, split into sessions when gaps exceed 15 minutes, and pivoted into wide format. Metadata columns are preserved automatically.

2.3 Build a model

# Build a TNA model from the prepared data
model <- tna(pd)
plot(model)

2.4 Group comparisons

Any column you did not assign to action, actor, or time is preserved as metadata. You can use it for group comparisons without any extra work:

# The Achiever column was preserved automatically
group_models <- group_tna(pd, group = "Achiever")
plot(group_models)

2.5 Check the statistics

After preparing, inspect the statistics to see how many sequences were created, how many unique actors there are, how long the sequences are, and the time range:

pd$statistics
$total_sessions
[1] 2000

$total_actions
[1] 27533

$max_sequence_length
[1] 26

$unique_users
[1] 2000

$sessions_per_user
# A tibble: 2,000 × 2
   Actor n_sessions
   <int>      <int>
 1     1          1
 2     2          1
 3     3          1
 4     4          1
 5     5          1
 6     6          1
 7     7          1
 8     8          1
 9     9          1
10    10          1
# ℹ 1,990 more rows

$actions_per_session
# A tibble: 2,000 × 2
   .session_id   n_actions
   <chr>             <int>
 1 1010 session1        26
 2 1015 session1        26
 3 1030 session1        26
 4 1092 session1        26
 5 1106 session1        26
 6 1107 session1        26
 7 1153 session1        26
 8 1184 session1        26
 9 1209 session1        26
10 1267 session1        26
# ℹ 1,990 more rows

$time_range
[1] "2025-01-01 10:01:16 EET" "2025-01-01 15:03:20 EET"

If these numbers look reasonable, you are ready to go.

That covers the typical workflow. If the defaults work for your data, you can stop here and move on to the main tutorial.

3 Detailed Guide (Not Usually Needed)

The rest of this tutorial covers situations where the defaults don’t fit: custom session thresholds, unusual timestamp formats, the full output structure, and troubleshooting. Read this when you run into a specific issue.

3.1 The Full Function Signature

prepare_data(
  data,
  action,
  actor = NULL,
  time = NULL,
  order = NULL,
  time_threshold = 900,
  custom_format = NULL,
  is_unix_time = FALSE,
  unix_time_unit = "seconds",
  unused_fn = NULL
)
Argument Default What It Does
data (required) Raw event data in long format
action (required) Column with events/states (become network nodes)
actor NULL Column with participant IDs (one sequence per actor)
time NULL Column with timestamps (sorting + session splitting)
order NULL Column for tiebreaking same-timestamp events
time_threshold 900 Gap in seconds that starts a new session (default: 15 min)
custom_format NULL strptime format for unusual timestamps
is_unix_time FALSE Force Unix timestamp interpretation
unix_time_unit "seconds" Unit for Unix timestamps
unused_fn NULL Aggregation function for metadata during pivot

Only data and action are required. But without actor, the entire dataset becomes one sequence — you lose the ability to do permutation testing, bootstrapping, or group comparisons. Always include actor unless your data genuinely has a single observation stream.
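As a quick sanity check (a sketch using the built-in data), compare the statistics with and without actor:

# Omitting actor: the whole dataset is treated as a single observation stream
pd_single <- prepare_data(group_regulation_long, action = "Action", time = "Time")
pd_single$statistics$total_sessions  # far fewer sequences than pd$statistics$total_sessions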

3.2 Session Splitting with time_threshold

When time is provided, prepare_data() splits each actor’s events into sessions. If two consecutive events are more than time_threshold seconds apart, a new session starts. The default is 900 seconds (15 minutes).

How it works for an actor with events at 9:00, 9:03, 9:07, 10:30, 10:32 with a 15-minute threshold:

  • 9:00 → 9:03 (3 min gap) — same session
  • 9:03 → 9:07 (4 min gap) — same session
  • 9:07 → 10:30 (83 min gap) — new session
  • 10:30 → 10:32 (2 min gap) — same session

Result: two sequences — (9:00, 9:03, 9:07) and (10:30, 10:32).
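A tiny reproducible sketch of this example (toy data, one actor, default 15-minute threshold):

# One actor with events at 9:00, 9:03, 9:07, 10:30, 10:32
toy <- data.frame(
  Actor  = "A1",
  Action = c("plan", "discuss", "monitor", "plan", "adapt"),
  Time   = as.POSIXct(c(
    "2024-01-01 09:00:00", "2024-01-01 09:03:00", "2024-01-01 09:07:00",
    "2024-01-01 10:30:00", "2024-01-01 10:32:00"
  ))
)
pd_toy <- prepare_data(toy, action = "Action", actor = "Actor", time = "Time")
pd_toy$statistics$total_sessions  # 2: the 83-minute gap starts a new session
pd_toy$sequence_data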

Changing the threshold affects how many sequences you get:

# 5-minute gaps start a new session → more, shorter sessions
pd_5min <- prepare_data(
  group_regulation_long,
  action = "Action", actor = "Actor", time = "Time",
  time_threshold = 300
)
# 1-hour gaps start a new session → fewer, longer sessions
pd_1hr <- prepare_data(
  group_regulation_long,
  action = "Action", actor = "Actor", time = "Time",
  time_threshold = 3600
)
# Compare
threshold_comparison <- data.frame(
  Threshold = c("300s (5 min)", "900s (15 min, default)", "3600s (1 hour)"),
  Sessions = c(
    pd_5min$statistics$total_sessions,
    pd$statistics$total_sessions,
    pd_1hr$statistics$total_sessions
  )
)
threshold_comparison

Typical thresholds by data type:

  • Chat or messaging data: 2–5 min. Conversations have rapid exchanges.
  • LMS logs: 10–30 min. Students pause to read or think. The 15-minute default works well.
  • Collaborative coding: 15–60 min. Longer focused work sessions.
  • Diary studies: hours or days. Each entry is a separate session.

If unsure, try a few values and check $statistics. Sessions should be long enough to contain meaningful transitions (not just 1–2 events) but short enough that unrelated events aren’t chained together.
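One way to ground that choice is to look at the gaps between consecutive events within each actor in the raw data (a sketch using dplyr; assumes Time is already a date-time column, as in the built-in data):

library(dplyr)

# Distribution of within-actor gaps, in seconds
group_regulation_long |>
  arrange(Actor, Time) |>
  group_by(Actor) |>
  mutate(gap_sec = as.numeric(difftime(Time, lag(Time), units = "secs"))) |>
  ungroup() |>
  summarize(
    median_gap = median(gap_sec, na.rm = TRUE),
    p90_gap    = quantile(gap_sec, 0.90, na.rm = TRUE),
    max_gap    = max(gap_sec, na.rm = TRUE)
  )

A threshold well above the typical within-session gap but below the gaps that separate genuine sessions is usually a good starting point.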

3.3 The order Argument

Some logging systems record multiple events with the exact same timestamp. The order argument provides a tiebreaker — a numeric column (step number, line number) that determines which event comes first among same-timestamp events.

# Use step_number to break ties when events share the same timestamp
prepared <- prepare_data(
  my_data,
  action = "Action", actor = "Actor", time = "Time",
  order = "step_number"
)

When both time and order are given, events are sorted by time first, ties broken by order. You can also use order without time (sorts by that column alone, no session splitting), but this is rarely needed — data is usually already in the right row order.

3.4 The Output Object

prepare_data() returns a tna_data object with five components:

Component What It Contains When Present
$long_data Original data + .standardized_time, .session_nr, .session_id, .sequence Always
$sequence_data Wide format: rows = sequences, columns = time positions Always
$meta_data .session_id + all columns not assigned to action/actor/time/order Always
$time_data Wide timestamps aligned with $sequence_data Only with time
$statistics Session counts, user counts, actions per session, time range Always

3.4.1 Sequence data

Wide format: each row is one sequence, each column is a time position. Shorter sequences are padded with NA.

# First 5 rows, first 10 columns
pd$sequence_data[1:5, 1:10]
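Because of the NA padding, each sequence's length is simply the number of non-missing cells in its row (a sketch; assumes the wide columns are the action positions shown above):

# Sequence lengths and their distribution
seq_lengths <- rowSums(!is.na(pd$sequence_data))
table(seq_lengths)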

3.4.2 Metadata

One row per sequence. Contains .session_id and every column not used as action/actor/time/order. This is how group_tna(pd, group = "Achiever") knows which sequences belong to which group.

pd$meta_data

3.4.3 Long data

The original events, sorted and annotated with session information:

pd$long_data

3.4.4 Time data

Wide-format timestamps aligned with $sequence_data. Each cell is the timestamp of the corresponding event. NULL when time is not provided.

pd$time_data[1:5, 1:8]

You can inspect components without $ notation:

# Print sequence data
print(pd, data = "sequence")
# A tibble: 2,000 × 26
   Action_T1 Action_T2 Action_T3 Action_T4 Action_T5  Action_T6  Action_T7
   <chr>     <chr>     <chr>     <chr>     <chr>      <chr>      <chr>    
 1 cohesion  consensus discuss   synthesis adapt      consensus  plan     
 2 emotion   cohesion  discuss   synthesis <NA>       <NA>       <NA>     
 3 plan      consensus plan      <NA>      <NA>       <NA>       <NA>     
 4 discuss   discuss   consensus plan      cohesion   consensus  discuss  
 5 cohesion  consensus plan      plan      monitor    plan       consensus
 6 discuss   adapt     cohesion  consensus discuss    emotion    cohesion 
 7 discuss   emotion   cohesion  consensus coregulate coregulate plan     
 8 cohesion  plan      consensus plan      consensus  discuss    discuss  
 9 emotion   cohesion  emotion   plan      monitor    discuss    emotion  
10 emotion   cohesion  consensus plan      plan       plan       plan     
# ℹ 1,990 more rows
# ℹ 19 more variables: Action_T8 <chr>, Action_T9 <chr>, Action_T10 <chr>,
#   Action_T11 <chr>, Action_T12 <chr>, Action_T13 <chr>, Action_T14 <chr>,
#   Action_T15 <chr>, Action_T16 <chr>, Action_T17 <chr>, Action_T18 <chr>,
#   Action_T19 <chr>, Action_T20 <chr>, Action_T21 <chr>, Action_T22 <chr>,
#   Action_T23 <chr>, Action_T24 <chr>, Action_T25 <chr>, Action_T26 <chr>
# Print metadata
print(pd, data = "meta")
# A tibble: 2,000 × 7
   .session_id   Actor Achiever Group Course Time                .session_nr
   <chr>         <int> <chr>    <dbl> <chr>  <dttm>                    <int>
 1 1 session1        1 High         1 A      2025-01-01 10:27:07           1
 2 10 session1      10 High         1 A      2025-01-01 10:23:45           1
 3 100 session1    100 High        10 A      2025-01-01 12:11:50           1
 4 1000 session1  1000 High       100 B      2025-01-01 11:12:00           1
 5 1001 session1  1001 Low        101 B      2025-01-01 11:18:40           1
 6 1002 session1  1002 Low        101 B      2025-01-01 11:18:53           1
 7 1003 session1  1003 Low        101 B      2025-01-01 11:18:05           1
 8 1004 session1  1004 Low        101 B      2025-01-01 11:22:26           1
 9 1005 session1  1005 Low        101 B      2025-01-01 11:22:31           1
10 1006 session1  1006 Low        101 B      2025-01-01 11:15:23           1
# ℹ 1,990 more rows

3.5 Time Parsing

The time parser auto-detects 52 timestamp formats. You usually don’t need to do anything — just pass the column name.

Date + time (YYYY-MM-DD)

Format Example
%Y-%m-%d %H:%M:%S 2023-01-09 18:44:00
%Y-%m-%d %H:%M 2023-01-09 18:44
%Y/%m/%d %H:%M:%S 2023/01/09 18:44:00
%Y/%m/%d %H:%M 2023/01/09 18:44
%Y.%m.%d %H:%M:%S 2023.01.09 18:44:00
%Y.%m.%d %H:%M 2023.01.09 18:44

ISO 8601 (T separator)

Format Example
%Y-%m-%dT%H:%M:%S 2023-01-09T18:44:00
%Y-%m-%dT%H:%M 2023-01-09T18:44
%Y-%m-%dT%H:%M:%OS 2023-01-09T18:44:00.123

With timezone offset

Format Example
%Y-%m-%d %H:%M:%S%z 2023-01-09 18:44:00+0100
%Y-%m-%d %H:%M%z 2023-01-09 18:44+0100
%Y-%m-%d %H:%M:%S %z 2023-01-09 18:44:00 +0100
%Y-%m-%d %H:%M %z 2023-01-09 18:44 +0100

Compact (no separators)

Format Example
%Y%m%d%H%M%S 20230109184400
%Y%m%d%H%M 202301091844

European (DD-MM-YYYY)

Format Example
%d-%m-%Y %H:%M:%S 09-01-2023 18:44:00
%d-%m-%Y %H:%M 09-01-2023 18:44
%d/%m/%Y %H:%M:%S 09/01/2023 18:44:00
%d/%m/%Y %H:%M 09/01/2023 18:44
%d.%m.%Y %H:%M:%S 09.01.2023 18:44:00
%d.%m.%Y %H:%M 09.01.2023 18:44
%d-%m-%YT%H:%M:%S 09-01-2023T18:44:00
%d-%m-%YT%H:%M 09-01-2023T18:44

US (MM-DD-YYYY)

Format Example
%m-%d-%Y %H:%M:%S 01-09-2023 18:44:00
%m-%d-%Y %H:%M 01-09-2023 18:44
%m/%d/%Y %H:%M:%S 01/09/2023 18:44:00
%m/%d/%Y %H:%M 01/09/2023 18:44
%m.%d.%Y %H:%M:%S 01.09.2023 18:44:00
%m.%d.%Y %H:%M 01.09.2023 18:44
%m-%d-%YT%H:%M:%S 01-09-2023T18:44:00
%m-%d-%YT%H:%M 01-09-2023T18:44

With month names

Format Example
%d %b %Y %H:%M:%S 09 Jan 2023 18:44:00
%d %b %Y %H:%M 09 Jan 2023 18:44
%d %B %Y %H:%M:%S 09 January 2023 18:44:00
%d %B %Y %H:%M 09 January 2023 18:44
%b %d %Y %H:%M:%S Jan 09 2023 18:44:00
%b %d %Y %H:%M Jan 09 2023 18:44
%B %d %Y %H:%M:%S January 09 2023 18:44:00
%B %d %Y %H:%M January 09 2023 18:44

Date only

Format Example
%Y-%m-%d 2023-01-09
%Y/%m/%d 2023/01/09
%Y.%m.%d 2023.01.09
%d-%m-%Y 09-01-2023
%d/%m/%Y 09/01/2023
%d.%m.%Y 09.01.2023
%m-%d-%Y 01-09-2023
%m/%d/%Y 01/09/2023
%m.%d.%Y 01.09.2023
%d %b %Y 09 Jan 2023
%d %B %Y 09 January 2023
%b %d %Y Jan 09 2023
%B %d %Y January 09 2023

Unix timestamps (numeric seconds, milliseconds, or microseconds since epoch) are also detected automatically.

For unusual formats not covered by auto-detection:

# Custom format: "15-Mar-2024_14h30m"
prepared <- prepare_data(
  data, action = "Action", actor = "Actor",
  time = "Time", custom_format = "%d-%b-%Y_%Hh%Mm"
)

To force Unix-timestamp interpretation explicitly (e.g., if auto-detection misreads a numeric column), set is_unix_time and the unit:

# Time column contains Unix timestamps in milliseconds
prepared <- prepare_data(
  data, action = "Action", actor = "Actor",
  time = "Time", is_unix_time = TRUE, unix_time_unit = "milliseconds"
)

A date like 03/04/2024 could be March 4 (US) or April 3 (European). The parser tries the US format first, so if your data uses European dates and the day values never exceed 12 (there is nothing to tell the two formats apart automatically), force the interpretation with custom_format:

# Force European date interpretation
prepared <- prepare_data(
  data, action = "Action", actor = "Actor",
  time = "Time", custom_format = "%d/%m/%Y %H:%M:%S"
)

3.6 The unused_fn Argument

During the pivot from long to wide, metadata columns are collapsed from multiple rows per actor to one row per sequence. By default, the first value is taken. This works when metadata is constant within a session (e.g., achievement level doesn’t change between events).

If metadata varies within a session (e.g., a running score):

# Take the last value per session
prepared <- prepare_data(
  data, action = "Action", actor = "Actor", time = "Time",
  unused_fn = dplyr::last
)

For most use cases (grouping variables, demographics), the default is correct.

3.7 Troubleshooting

One giant sequence: You forgot actor. Without it, the entire dataset becomes one sequence.

Too many / too few sessions: Adjust time_threshold. Check $statistics to see if the session count looks right.

Wrong event order: Without time, events are read in row order. If your data isn’t sorted, provide time.

Time parsing errors: Use custom_format to specify the format explicitly.

Very short sequences: Sequences of length 1 contribute zero transitions. Increase time_threshold to merge micro-sessions, or filter them out.

NA in the action column: Remove or impute before calling prepare_data().
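The last two issues can be handled on the raw data before the call (a sketch with hypothetical cut-offs, using the column names from this tutorial):

# Drop events with a missing action code
clean <- group_regulation_long[!is.na(group_regulation_long$Action), ]

# Drop actors with fewer than 2 events (a single event contributes no transition)
counts <- table(clean$Actor)
clean  <- clean[clean$Actor %in% names(counts)[counts >= 2], ]

pd_clean <- prepare_data(clean, action = "Action", actor = "Actor", time = "Time")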

When something looks wrong, check the annotated long data:

# Check a specific actor's events
subset(pd$long_data, Actor == "some_actor_id")

# Look at session boundaries
library(dplyr)
pd$long_data |>
  group_by(.session_id) |>
  summarize(n_events = n(), start = min(.standardized_time),
            end = max(.standardized_time))

4 Quick Reference

4.1 Decision Guide

flowchart LR
  A["Do you have multiple participants?"]
  A -->|Yes| B["use actor"]
  A -->|No| C["omit actor (single observation stream)"]

flowchart LR
  D["Do you have timestamps?"]
  D -->|Yes| E["use time (sorting + session splitting)"]
  D -->|No| F["make sure rows are already in the right order"]

  E --> G["Are the default 15-min sessions appropriate?"]
  G -->|Yes| H["done!"]
  G -->|No| I["set time_threshold (in seconds)"]

  E --> J["Can multiple events share the same timestamp?"]
  J -->|Yes| K["also use order"]
  J -->|No| L["time alone is fine"]

flowchart LR
  M["Do you need group comparisons later?"]
  M -->|Yes| N["keep grouping variable as a column (preserved automatically)"]
  M -->|No| O["no extra steps"]

References

  • Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025). Transition Network Analysis: A Novel Framework for Modeling, Visualizing, and Identifying the Temporal Patterns of Learners and Learning Processes. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK ’25) (pp. 351–361). ACM. https://doi.org/10.1145/3706468.3706513
  • Tikka, S., López-Pernas, S., & Saqr, M. (2025). tna: An R Package for Transition Network Analysis. Applied Psychological Measurement. https://doi.org/10.1177/01466216251348840
  • Package website: https://sonsoles.me/tna/

Citation

BibTeX citation:
@misc{saqr2026,
  author = {Saqr, Mohammed and López-Pernas, Sonsoles},
  title = {TNA {Data} {Preparation:} {A} {Comprehensive} {Guide} to
    `Prepare\_data()`},
  date = {2026-02-06},
  url = {https://sonsoleslp.github.io/posts/tna-data/},
  langid = {en}
}
For attribution, please cite this work as:
Saqr, Mohammed, and Sonsoles López-Pernas. 2026. “TNA Data Preparation: A Comprehensive Guide to `Prepare_data()`.” https://sonsoleslp.github.io/posts/tna-data/.