data("group_regulation_long", package = "tna")
prepared <- prepare_data(
group_regulation_long,
action = "Action", actor = "Actor", time = "Time"
)

1 Exploring Sequences
Learning unfolds as an ordered sequence of states—engagement levels across weeks, problem-solving actions within a session, regulatory behaviors during collaboration. Sequence analysis preserves this temporal ordering rather than collapsing it into a single summary. The approach originated in molecular biology for comparing DNA sequences and was adapted for the social sciences by Abbott (1995). In education, it has been used to study course-taking trajectories, self-regulated learning strategies, and collaborative regulation dynamics (Saqr et al., 2024a).
From a complex dynamic systems perspective, learners exhibit feedback loops, attractor states, and phase transitions (Saqr et al., 2025a). A student’s trajectory is not a random walk: disengagement breeds more disengagement, current states shape the ones that follow, and sudden shifts from one stable regime to another are common. Sequence-level analysis captures these dynamics at the individual level, complementing the aggregate view provided by transition models.
This tutorial is part of a series of tutorials on the dynamics of learning and learners using Transition Network Analysis with the tna and codyna R packages:
- Transition Network Analysis with R — building, visualizing, and interpreting TNA models; centrality, communities, bootstrapping.
- TNA Group Analysis — analysis and comparison of groups.
- TNA Clustering — discovering and analyzing clusters of sequences.
- TNA Model Comparison — edge-level, summary, centrality, and network-level comparison; permutation tests.
- Sequence Patterns, Outcomes, and Indices (this tutorial) — pattern discovery, outcome modeling, and structural indices with codyna.
Package website: https://sonsoles.me/tna/
Sequence analysis is a family of methods for studying ordered categorical data. In education, the “sequence” is typically a student’s trajectory through a series of states measured at successive time points.
Three levels of analysis are possible:
- Sequence visualization: plotting individual trajectories and state distributions to see the raw data before modeling.
- Sequence indices: computing per-sequence summary measures (entropy, stability, complexity) that characterize how each sequence unfolds.
- Pattern discovery: identifying recurring sub-sequences (n-grams, gapped patterns) that tell us what specific pathways students follow.
This tutorial covers all three. For a comprehensive introduction to sequence analysis in education, see Saqr et al. (2024a). For transition-based approaches, see López-Pernas et al. (2024a) on Markov models and Saqr et al. (2025b) on Transition Network Analysis.
This tutorial works with three datasets, each illustrating different scenarios of sequence analysis. Before analyzing any dataset, we visualize it—distribution plots and frequency plots establish the context that makes pattern and index results interpretable.
| Dataset | Source | Sequences | States | Time points | Used for |
|---|---|---|---|---|---|
| group_regulation | tna package | 2,000 | 9 (collaborative regulation) | up to 26 | N-gram examples with filtering |
| codyna_data | Codyna.RDS | 5,000 | 10 (math exercise actions) | up to 10 | Pattern discovery, outcome modeling |
| engagement | codyna package | 1,000 | 3 (Active, Average, Disengaged) | 25 | Sequence indices |
1.1 Visualizing the regulation data
The first step with any dataset is to visualize it. We begin with the group_regulation dataset—2,000 collaborative regulation sequences with 9 states. TNA provides tools for preparing and visualizing event data. We will use TNA here to prepare the data, but we will focus on the sequences, not the network analysis.
Having prepared the data, we can explore it with sequence analysis and visualize the sequences. TNA comes with several plotting methods, including sequence plots. Here we will use distribution plots. A state distribution plot aggregates individual trajectories into proportions at each time point:
plot_sequences(prepared, type = "distribution", scale = "proportion")

TNA also includes tools for plotting state frequencies, so you don’t have to compute them manually. A frequency plot shows overall state prevalence—the marginal baseline for interpreting lift later:
model_reg <- tna(prepared)
plot_frequencies(model_reg)

1.2 Visualizing the problem-solving data
This math problem-solving dataset contains 5,000 math exercise sequences with 10 states (Correct, Wrong, Clue, Guide, Instruct, Question, Quit, Right, Skip, Try), recorded while students solved math problems. The Clue, Guide, Instruct, and Question codes represent AI support offered to help students solve the questions; Right and Wrong record the outcomes of these exercises. Each row is one problem attempt with up to 10 steps.
codyna_long <- data.frame(
id = rep(seq_len(nrow(codyna_data)), each = ncol(codyna_data)),
time = rep(seq_len(ncol(codyna_data)), nrow(codyna_data)),
action = as.vector(t(as.matrix(codyna_data)))
)
codyna_long <- codyna_long[!is.na(codyna_long$action), ]
prepared_codyna <- prepare_data(codyna_long, action = "action", actor = "id", time = "time")

plot_sequences(prepared_codyna, sort_by = "action_T2")

plot_sequences(prepared_codyna, type = "distribution", scale = "proportion")

model_codyna <- tna(prepared_codyna)
plot_frequencies(model_codyna)

The index plot reveals that most sequences are short and begin with Wrong, which captures students’ wrong attempts and the feedback they are offered.
2 Pattern Discovery
TNA builds a transition matrix from all sequences, revealing which pairwise transitions are most probable. Pattern discovery complements this by examining each sequence individually, identifying the specific multi-step sub-sequences that recur across students. Longer pathways like Wrong→Quit→Skip→Instruct→Wrong extend the picture to five-step chains. Patterns can also be linked to outcomes, showing not just which pathways exist but which predict success or failure, a perspective and functionality that no other package provides.
2.1 N-grams
TNA already captures pairwise transitions (length 2), so the value of n-grams begins at length 3 and above. We start with the regulation data. A TNA model would show consensus→plan and plan→consensus as the two strongest edges. Do these chain into sustained multi-step pathways within the same sequences?
data("group_regulation")
reg_ngrams <- discover_patterns(group_regulation, type = "ngram", len = 3:5)
reg_ngrams

Top n-grams (lengths 3–5) from collaborative regulation sequences
- Frequency: total occurrences across all sequences (one sequence can contribute multiple instances).
- Count: number of sequences containing the pattern at least once.
- Support: count / total sequences—the proportion containing the pattern. Use this to compare across datasets of different sizes.
- Lift: observed support / expected support under independence. Above 1 = over-represented; below 1 = under-represented (Agrawal et al., 1993).
- Proportion: pattern’s share of total frequency at its length.
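These quantities are easy to reproduce by hand. The sketch below works on toy sequences with an illustrative helper (count_in_seq is not part of tna or codyna) and uses a simplified lift based on bigram occurrence proportions under independent marginals; the package’s exact computation may differ in detail:

```r
# Toy data: four short sequences of regulation states
seqs <- list(
  c("consensus", "plan", "plan"),
  c("plan", "consensus", "plan"),
  c("emotion", "cohesion", "consensus"),
  c("consensus", "plan", "emotion")
)

# Count occurrences of a consecutive pattern within one sequence
count_in_seq <- function(s, pattern) {
  k <- length(pattern)
  if (length(s) < k) return(0L)
  hits <- vapply(seq_len(length(s) - k + 1),
                 function(i) all(s[i:(i + k - 1)] == pattern), logical(1))
  sum(hits)
}

pattern <- c("consensus", "plan")
counts  <- vapply(seqs, count_in_seq, integer(1), pattern = pattern)

frequency <- sum(counts)          # total occurrences across all sequences
count     <- sum(counts > 0)      # sequences containing the pattern at least once
support   <- count / length(seqs) # proportion of sequences containing it

# Simplified lift: observed bigram proportion relative to what independent
# marginal state probabilities would predict
all_states <- unlist(seqs)
p          <- table(all_states) / length(all_states)
n_bigrams  <- sum(lengths(seqs) - 1)
observed   <- frequency / n_bigrams
expected   <- prod(p[pattern])
lift       <- observed / expected
```

With these toy sequences, consensus→plan appears in three of four sequences (support = 0.75) and is over-represented relative to chance (lift > 1).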
consensus→plan→plan (support = 0.35) appears in over a third of all sequences; plan→plan→plan (support = 0.23) in nearly a quarter. The consensus-then-planning loop is a genuine multi-step pathway—groups who reach agreement build extended planning episodes. The two strong TNA edges consensus→plan and plan→plan combine into coherent within-sequence trajectories.
What follows planning? The start parameter isolates pathways originating from a given state:
reg_plan <- discover_patterns(group_regulation, type = "ngram", len = 3:4, start = "plan")
reg_plan

Planning leads three ways: sustained planning (plan→plan→plan, support = 0.23), cycling back to consensus (plan→consensus→plan, support = 0.24), and emotional reactions (plan→plan→emotion, support = 0.14). The first two sustain the task; the third signals that extended planning sometimes triggers affect.
What follows consensus?
reg_cons <- discover_patterns(group_regulation, type = "ngram", len = 3:4, start = "consensus")
reg_cons

N-grams starting with consensus
Consensus chains into planning: the top three trigrams all route through plan. consensus→plan→plan (support = 0.35) dominates—when groups agree, they commit to extended planning.
Where does emotion lie within the sequence of interactions? Emotion is a low-frequency state in the TNA network, but pattern discovery reveals its role as a connector and how it mediates other regulatory behaviors.
reg_emotion <- discover_patterns(group_regulation, type = "ngram", len = 3:4, contain = "emotion")
reg_emotion

N-grams involving emotion
emotion→cohesion→consensus (support = 0.18) is the dominant pathway. Emotional expression leads to social bonding and then group agreement—a three-step recovery arc invisible in the aggregate network, where emotion has weak edges to many states. Emotion feeds back into the consensus→plan cycle.
Now we apply the same approach to the math problem-solving data. With defaults, discover_patterns() extracts n-grams of length 2 through 5:
ngrams <- discover_patterns(codyna_data)
ngrams

Top 10 n-grams (default: lengths 2–5)
plot(ngrams, n = 10)

Trigrams (length 3) reveal multi-step pathways that bigrams cannot:
trigrams <- discover_patterns(codyna_data, type = "ngram", len = 3)
trigrams

Top 10 trigrams (length 3)
Wrong→Quit→Skip has lift 4.82—nearly 5 times more frequent than expected. The bigrams Wrong→Quit and Quit→Skip each appear separately, but only the trigram reveals them as a single giving-up sequence. Extracting lengths 3 through 5 shows how the pathway extends:
ngrams_range <- discover_patterns(codyna_data, type = "ngram", len = 3:5)
ngrams_range

Top 10 n-grams at lengths 3–5
At length 4, Wrong→Quit→Skip→Instruct (support = 0.19); at length 5, it adds →Wrong—a complete failure loop. Since most sequences are 5 steps long, length-5 n-grams capture entire trajectories:
ngrams_5 <- discover_patterns(codyna_data, type = "ngram", len = 5)
ngrams_5

Top 10 five-step patterns (full sequences)
Wrong→Quit→Skip→Instruct→Wrong (support = 0.17) is the single most common complete trajectory—17% of all students follow this path.
2.2 Gapped patterns
N-grams require consecutive states. Gapped patterns allow wildcards (*) between anchoring states, capturing regularities that persist regardless of intervening steps. The same start, end, and contain filters from the n-gram examples work here:
gapped <- discover_patterns(codyna_data, type = "gapped")
gapped

Top 10 gapped patterns (defaults)
Wrong→***→Wrong (support = 0.51): half of all sequences contain a return to Wrong after three intervening states. Wrong→***→Right (support = 0.34): about a third eventually reach success through three intermediate steps.
Each * matches exactly one wildcard state. Wrong→*→Right matches any three-step sub-sequence from Wrong to Right. The gap parameter controls the range: gap = 1 produces single-wildcard patterns; gap = 1:3 produces 1, 2, or 3 wildcards. Defaults cover a range appropriate to the sequence length.
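One way to picture the wildcard semantics is to match gapped patterns with regular expressions over collapsed state strings. This is an illustrative sketch, not codyna’s implementation; matches_gapped is a hypothetical helper:

```r
# Illustrative only: match a gapped pattern Wrong->*->Right, where each *
# is exactly one intervening state, by collapsing sequences to strings.
seqs <- list(
  c("Wrong", "Clue", "Right"),
  c("Wrong", "Quit", "Skip", "Right"),
  c("Try", "Wrong", "Guide", "Right")
)

matches_gapped <- function(s, start, end, gap = 1) {
  txt <- paste(s, collapse = "-")
  # one wildcard slot per gap unit: any single state name between the anchors
  pat <- paste0(start, paste(rep("-[A-Za-z]+", gap), collapse = ""), "-", end)
  grepl(pat, txt)
}

res <- vapply(seqs, matches_gapped, logical(1),
              start = "Wrong", end = "Right", gap = 1)
# sequences 1 and 3 contain Wrong-*-Right; sequence 2 would need gap = 2
```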
2.3 Repeated patterns
Repeated patterns detect consecutive runs of the same state—instructional loops, hint-seeking chains, or persistent errors:
repeated <- discover_patterns(codyna_data, type = "repeated", len = 2:4)
repeated

Repeated patterns (same-state runs)
Instruct→Instruct dominates (support = 0.16)—16% of sequences contain at least two consecutive Instruct steps. The triple Instruct→Instruct→Instruct still appears in 5% of sequences. In Section 4 we will see how self_loop_tendency captures this as a per-sequence index.
TNA’s diagonal entries give the aggregate self-loop probability for each state. Repeated patterns identify which sequences contain runs of a given length and how common they are. A high aggregate self-loop with low repeated-pattern support means self-loops are spread thinly across many sequences; high support means they concentrate in specific sequences as genuine loops.
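The distinction can be made concrete with toy data: two small datasets with the same aggregate self-loop probability but different run concentration. A base-R sketch (the helper names here are illustrative, not package functions):

```r
# Toy contrast: equal aggregate self-loop probability, different concentration
p_self_loop <- function(s) {
  mean(head(s, -1) == tail(s, -1))  # proportion of consecutive pairs unchanged
}
has_run <- function(s, state, len) {
  r <- rle(s)                        # run-length encoding of the sequence
  any(r$lengths[r$values == state] >= len)
}

concentrated <- list(
  c("Instruct", "Instruct", "Instruct", "Try"),  # one long Instruct run
  c("Try", "Clue", "Guide", "Question")          # no self-loops at all
)
spread <- list(
  c("Instruct", "Instruct", "Try", "Clue"),      # one short run each
  c("Guide", "Instruct", "Instruct", "Try")
)

agg_conc      <- mean(vapply(concentrated, p_self_loop, numeric(1)))
agg_spread    <- mean(vapply(spread, p_self_loop, numeric(1)))
support_conc  <- mean(vapply(concentrated, has_run, logical(1),
                             state = "Instruct", len = 3))
support_spread <- mean(vapply(spread, has_run, logical(1),
                              state = "Instruct", len = 3))
```

Both datasets have the same aggregate self-loop probability (1/3), but only the concentrated one yields nonzero support for a length-3 Instruct run—exactly the diagnostic described above.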
2.4 Targeted search
When theory predicts a specific pathway, test whether it exists. The wildcard * matches any single state:
targeted <- discover_patterns(codyna_data, pattern = "Wrong->*->Correct")
targeted

Recovery pathways: Wrong through one step to Correct
Two recovery routes: Wrong→Clue→Correct (support = 0.15) and Wrong→Correct→Correct (support = 0.07). The Clue route has lift near 1.0—common but not over-represented.
end_right <- discover_patterns(codyna_data, end = "Right", type = "ngram", len = 2)
end_right

Bigrams ending in Right (problem solved)
Instruct→Right is the most common final bigram, but its lift (0.75) is below 1—reflecting Instruct’s high base rate, not a specific affinity.
2.5 Which sequences lead to different outcomes?
Do specific sequences of actions predict whether a student ultimately succeeds or fails? The outcome parameter links each pattern to a binary outcome—here, whether the last observed state is Right (solved) or Wrong (unsolved). For each pattern, discover_patterns() counts its prevalence in each outcome group and runs a chi-squared test of association:
ngrams_outcome <- discover_patterns(
codyna_data,
outcome = "last_obs",
type = "ngram", len = 2:3
)
ngrams_outcome

Top 10 outcome-differentiated n-grams

plot(ngrams_outcome, n = 10)

- count_<group>: sequences in each outcome group containing the pattern.
- chisq: chi-squared statistic testing independence of pattern presence and outcome.
- p_value: p-value from the chi-squared test.
- effect_size (Cramér’s V): strength of association, bounded in [0, 1]. Above 0.10 = small; above 0.30 = medium (Cohen, 1988).
The chi-squared test treats each pattern independently; correlated patterns are not adjusted for. The regression models in Section 3 handle this.
Skip→Instruct (p < 0.001) appears in 1016 Wrong-ending sequences and 0 Right-ending sequences—every student who skips and then receives instruction ultimately fails. The next section quantifies these effects with regression.
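Both statistics can be reproduced in base R from a 2×2 presence-by-outcome table. The counts below are invented for illustration, not taken from the tutorial’s data:

```r
# Toy 2x2 table: pattern presence (rows) by outcome (columns)
tab <- matrix(c(120,  30,   # pattern present: 120 Wrong, 30 Right
                380, 470),  # pattern absent:  380 Wrong, 470 Right
              nrow = 2, byrow = TRUE,
              dimnames = list(pattern = c("present", "absent"),
                              outcome = c("Wrong", "Right")))

res <- chisq.test(tab, correct = FALSE)  # chi-squared test of independence

# Cramer's V = sqrt(chi^2 / (n * (min(rows, cols) - 1)));
# for a 2x2 table this reduces to sqrt(chi^2 / n)
n         <- sum(tab)
cramers_v <- sqrt(unname(res$statistic) / (n * (min(dim(tab)) - 1)))
```

Here the pattern is strongly associated with the Wrong outcome (p far below 0.001) with a small-to-medium effect (V ≈ 0.25).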
3 Predicting Outcomes from Patterns
Chi-squared tests (Section 2.5) identify which patterns differ between groups but cannot rank them by effect size or adjust for confounding. analyze_outcome() selects the top patterns, encodes them as predictors, and fits a logistic regression.
Three steps:
- Pattern discovery: calls discover_patterns() with the specified type, len, etc.
- Selection: ranks patterns by the priority criterion ("chisq", "lift", or "support") and picks the top n.
- Model fitting: encodes each pattern as binary (0/1, when freq = FALSE) or as within-sequence count (when freq = TRUE), then fits glm() (or lme4::glmer() when mixed = TRUE).
The returned object is a standard glm or glmerMod, so summary(), coef(), AIC(), predict(), and confint() all work.
3.1 Binary predictors: presence or absence of a pattern. If students receive a clue, does it help them answer correctly?
Each selected pattern becomes a 0/1 predictor:
model_binary <- analyze_outcome(
codyna_data,
outcome = "last_obs",
reference = "Wrong",
n = 5,
freq = FALSE,
priority = "chisq",
type = "ngram",
len = 1:2,
mixed = FALSE
)
summary(model_binary)
Call:
glm(formula = f, family = binomial, data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.00564 0.04514 0.12 0.90058
Quit -5.03463 0.41466 -12.14 < 0.0000000000000002 ***
Instruct -0.03506 0.08240 -0.43 0.67046
Clue_to_Clue 0.40243 0.09233 4.36 0.000013 ***
Question_to_Guide -0.51225 0.13511 -3.79 0.00015 ***
Instruct_to_Instruct 0.02576 0.10314 0.25 0.80281
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6771.5 on 4999 degrees of freedom
Residual deviance: 5642.7 on 4994 degrees of freedom
AIC: 5655
Number of Fisher Scoring iterations: 7
analyze_outcome() looked at all unigrams and bigrams, ranked them by chi-squared, picked the top 5, and fit a logistic regression using them as 0/1 predictors. reference = "Wrong" means positive coefficients increase the odds of Right (success). Converting to odds ratios:
odds_ratios <- exp(coef(model_binary))
data.frame(
Pattern = names(odds_ratios),
`Log-Odds` = round(coef(model_binary), 3),
`Odds Ratio` = round(odds_ratios, 3),
row.names = NULL, check.names = FALSE
)

Coefficients as log-odds and odds ratios
- Odds ratio = exp(coefficient). OR = 3.0 means “3× the odds of success”; OR = 0.5 means “half the odds.” OR = 1.0 = no effect.
- Confidence intervals: exp(confint(model)) on the odds ratio scale. If the CI includes 1.0, the effect is not significant.
- AIC: lower = better fit, penalized for complexity. Compare models with different predictor sets.
- Quit (OR = 0.01): essentially zero odds of success—quitting is the primary failure pathway, consistent with the chi-squared results in Section 2.5.
- Clue_to_Clue (OR = 1.5): about 1.5× the odds of succeeding. Repeated hint-seeking signals engagement, matching the repeated-pattern findings in Section 2.3.
- Question_to_Guide (OR = 0.6): reduces the odds—students who ask a question and then receive guidance are less likely to solve the problem.
- Instruct and Instruct_to_Instruct: not significant (p > 0.05). Common but not outcome-differentiating once the other predictors are in the model.
3.2 Frequency-based predictors: counting occurrences, not just presence. Are more clues useful?
Setting freq = TRUE uses within-sequence pattern counts instead of 0/1:
model_freq <- analyze_outcome(
codyna_data,
outcome = "last_obs",
reference = "Wrong",
n = 10,
freq = TRUE,
priority = "chisq",
type = "ngram",
len = 1:2,
mixed = FALSE
)
summary(model_freq)
Call:
glm(formula = f, family = binomial, data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2109 0.0730 -2.89 0.0038 **
Quit -4.6488 0.4190 -11.09 < 0.0000000000000002 ***
Clue -0.0317 0.0672 -0.47 0.6377
Correct 0.0999 0.0827 1.21 0.2275
Instruct -0.2151 0.0904 -2.38 0.0173 *
Question_to_Instruct 0.5310 0.1148 4.63 0.0000037146 ***
Correct_to_Correct 0.2755 0.1455 1.89 0.0582 .
Clue_to_Clue 0.5759 0.1147 5.02 0.0000005160 ***
Guide_to_Guide 0.5571 0.0951 5.86 0.0000000047 ***
Question_to_Guide -0.2534 0.1453 -1.74 0.0811 .
Instruct_to_Instruct 0.3264 0.1292 2.53 0.0115 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6771.5 on 4999 degrees of freedom
Residual deviance: 5581.7 on 4989 degrees of freedom
AIC: 5604
Number of Fisher Scoring iterations: 7
AIC = 5603.7 vs. 5654.7 for the binary model—a 51-point improvement. Repetition count adds predictive value. Guide→Guide (0.56) is positive and significant—each additional consecutive guidance step raises the odds, consistent with the repeated-pattern findings in Section 2.3.
3.3 Gapped pattern predictors
model_gapped <- analyze_outcome(
codyna_data,
outcome = "last_obs",
reference = "Wrong",
n = 5,
freq = FALSE,
priority = "chisq",
type = "gapped",
gap = 1,
len = 2,
mixed = FALSE
)
summary(model_gapped)
Call:
glm(formula = f, family = binomial, data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1768 0.0388 -4.56 0.0000052086 ***
Quit_to_._to_Instruct -4.8847 0.4113 -11.88 < 0.0000000000000002 ***
Clue_to_._to_Correct 0.5469 0.1188 4.60 0.0000041943 ***
Correct_to_._to_Clue 0.4285 0.0850 5.04 0.0000004597 ***
Clarify_to_._to_Instruct 0.9209 0.1534 6.00 0.0000000019 ***
Guide_to_._to_Guide 1.2755 0.2110 6.05 0.0000000015 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6771.5 on 4999 degrees of freedom
Residual deviance: 5580.7 on 4994 degrees of freedom
AIC: 5593
Number of Fisher Scoring iterations: 7
AIC = 5592.7, the lowest of the three. Clue→*→Correct captures the delayed hint effect; Guide→*→Guide captures returning to guidance after a detour.
3.4 Priority selection
The priority parameter controls which patterns enter the regression:
model_lift <- analyze_outcome(
codyna_data,
outcome = "last_obs", reference = "Wrong",
n = 5, freq = FALSE, priority = "lift",
type = "ngram", len = 1:2, mixed = FALSE
)
model_support <- analyze_outcome(
codyna_data,
outcome = "last_obs", reference = "Wrong",
n = 5, freq = FALSE, priority = "support",
type = "ngram", len = 1:2, mixed = FALSE
)| Priority | Selects for | AIC | Use when |
|---|---|---|---|
"chisq" |
Maximum group differentiation | 5654.7 | Outcome prediction is the goal |
"lift" |
Over-representation relative to marginals | 6473 | Seeking structurally surprising patterns |
"support" |
Most common patterns overall | 6316.3 | Describing typical behavior |
3.5 Mixed-effects models
When students solve multiple problems or are nested within classrooms, observations are not independent. mixed = TRUE adds a random intercept per group. This is useful whenever you have multiple sessions from the same student, multiple sequences per learner, or similar nested designs. codyna fits a mixed-effects model that takes this into account.
model_mixed <- analyze_outcome(
raw_data, cols = T1:T10,
group = student_id,
outcome = "last_obs",
reference = "Wrong",
n = 5,
freq = FALSE,
priority = "chisq",
type = "ngram",
len = 1:2,
mixed = TRUE
)
summary(model_mixed)

Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: .outcome ~ Quit + Instruct + Clue_to_Clue + Question_to_Guide +
Instruct_to_Instruct + (1 | student_id)
Data: df
AIC BIC logLik -2*log(L) df.resid
5653 5699 -2820 5639 4993
Scaled residuals:
Min 1Q Median 3Q Max
-1.279 -0.964 -0.077 0.961 12.346
Random effects:
Groups Name Variance Std.Dev.
student_id (Intercept) 0.151 0.389
Number of obs: 5000, groups: student_id, 3937
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.00615 0.04705 0.13 0.8961
Quit -5.09947 0.41717 -12.22 < 0.0000000000000002 ***
Instruct -0.03721 0.08553 -0.44 0.6635
Clue_to_Clue 0.41435 0.09596 4.32 0.000016 ***
Question_to_Guide -0.52008 0.14000 -3.71 0.0002 ***
Instruct_to_Instruct 0.03076 0.10712 0.29 0.7740
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Quit Instrc Cl_t_C Qst__G
Quit -0.012
Instruct -0.486 -0.134
Clue_to_Clu -0.414 0.012 0.136
Questn_t_Gd -0.251 0.026 0.017 0.112
Instrct_t_I -0.024 0.096 -0.590 0.023 0.070
- Random intercept variance (0.15): captures student-level differences in baseline success probability. A standard deviation of 0.39 means students differ substantially in their baseline ability.
- Fixed effects: same interpretation as standard logistic regression, but adjusted for nesting.
cat("AIC (fixed, no nesting):", round(AIC(model_binary), 1), "\n")
AIC (fixed, no nesting): 5655
cat("AIC (mixed, student nesting):", round(AIC(model_mixed), 1), "\n")
AIC (mixed, student nesting): 5653
Lower AIC confirms that nesting matters. Use mixed = TRUE whenever data has a natural grouping structure.
Standard logistic regression assumes independence. In educational data, this is often violated: the same student solves multiple problems, or students are nested within classrooms. Ignoring nesting inflates the effective sample size, producing artificially small standard errors.
A mixed-effects model (via lme4::glmer()) adds a random intercept per group—capturing group-level baselines so that the fixed effects represent within-group pattern effects. Use it when:
- The same participant appears in multiple sequences
- Students are nested within classrooms, schools, or conditions
- The random intercept variance is non-trivial
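The inflation argument can be illustrated with a small base-R simulation (no lme4 required, and not based on the tutorial’s data): give each student a random intercept, then compare the observed variance of per-student success counts with what independence would predict.

```r
set.seed(1)
n_students  <- 200
per_student <- 10
student     <- rep(seq_len(n_students), each = per_student)

# Each student gets a baseline (random intercept), so attempts by the same
# student are correlated rather than independent
u <- rnorm(n_students, sd = 1)
y <- rbinom(n_students * per_student, size = 1, prob = plogis(u[student]))

# Under independence, per-student success counts would be Binomial(10, p_hat);
# clustering inflates their variance (overdispersion)
obs_var <- var(tapply(y, student, sum))
iid_var <- per_student * mean(y) * (1 - mean(y))
c(observed = obs_var, independence = iid_var)
```

The observed variance of student-level counts substantially exceeds the independence benchmark, which is exactly the extra variability a random intercept absorbs.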
4 Sequence Indices
Sequence indices in codyna are designed for educational contexts and build on complex dynamic system theory. Patterns identify which pathways students follow. Indices characterize how each trajectory unfolds—its diversity, stability, dynamism, and complexity—without reference to any specific pattern. We switch to the built-in engagement dataset (1,000 students, 3 states, 25 time points) for richer distributions. As with every dataset in this tutorial, we visualize first.
data("engagement")

The distribution plot reveals a slight cohort-level drift toward disengagement over the 25 weeks. These visual patterns will connect directly to the stability, initial conditions, and emergence indices below.

plot_sequences(engagement, type = "distribution", scale = "proportion")

indices <- sequence_indices(engagement, favorable = "Active")
numeric_indices <- indices[, sapply(indices, is.numeric)]
summary_df <- data.frame(
Index = names(numeric_indices),
Min = round(sapply(numeric_indices, min, na.rm = TRUE), 3),
Median = round(sapply(numeric_indices, median, na.rm = TRUE), 3),
Mean = round(sapply(numeric_indices, mean, na.rm = TRUE), 3),
Max = round(sapply(numeric_indices, max, na.rm = TRUE), 3),
row.names = NULL
)
summary_df

Summary statistics for all numeric sequence indices
favorable = "Active" designates Active as the target for directional indices like integrative_potential (convergence toward the favorable state) and emergent_state_proportion.
4.1 Index families
The 24 indices group into 10 families:
| Family | Indices | Question |
|---|---|---|
| Coverage | valid_n, valid_proportion | How complete is the sequence? |
| Diversity | unique_states, longitudinal_entropy, simpson_diversity | How spread is time across states? |
| Stability | self_loop_tendency, mean_spell_duration, max_spell_duration | How persistent are state episodes? |
| Dynamism | transition_rate, transition_complexity | How frequent and diverse are transitions? |
| Initial conditions | initial_state_persistence, initial_state_proportion, initial_state_influence_decay | How influential is the starting state? |
| Cyclicity | cyclic_feedback_strength | Return patterns? |
| Dominance | dominant_state, dominant_proportion, dominant_max_spell | Which state dominates? |
| First/Last | first_state, last_state | Boundary conditions |
| Emergence | emergent_state, emergent_state_persistence, emergent_state_proportion | Late-appearing dominant state? |
| Integrative | integrative_potential, complexity_index | Convergence and complexity |
Coverage: valid_n = non-missing time points; valid_proportion = fraction of complete observations.
Diversity: unique_states = distinct states visited. longitudinal_entropy = -\sum p_i \log p_i; maximized when all states equally visited. simpson_diversity = 1 - \sum p_i^2; gives less weight to rare states.
Stability: self_loop_tendency = proportion of consecutive pairs where state does not change. mean_spell_duration = average run length. max_spell_duration = longest run.
Dynamism: transition_rate = 1 − self_loop_tendency. transition_complexity = entropy of the transition distribution; distinguishes toggling between two states (low) from cycling through many (high).
Initial conditions: initial_state_persistence = consecutive time points in first state. initial_state_proportion = fraction of sequence in first state. initial_state_influence_decay = exponential decay rate of first state’s autocorrelation.
Cyclicity: cyclic_feedback_strength = tendency to revisit previously visited states.
Dominance: dominant_state = most prevalent state. dominant_proportion = its share. dominant_max_spell = its longest run.
Emergence: emergent_state = state dominating the second half. When different from the first-half dominant, a phase transition occurred. emergent_state_persistence and emergent_state_proportion quantify its hold.
Integrative: integrative_potential = convergence toward the favorable state over time. complexity_index = composite of entropy and transition complexity.
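Several of these indices can be computed by hand for a single toy sequence. A base-R sketch following the definitions above (codyna’s exact implementations may differ in edge cases such as missing values):

```r
# One toy engagement trajectory over 8 time points
s <- c("Active", "Active", "Average", "Average", "Average",
       "Disengaged", "Active", "Active")

p <- table(s) / length(s)                      # state proportions
longitudinal_entropy <- -sum(p * log(p))       # -sum p_i log p_i
simpson_diversity    <- 1 - sum(p^2)           # 1 - sum p_i^2

runs <- rle(s)                                 # spells (runs of one state)
self_loop_tendency  <- mean(head(s, -1) == tail(s, -1))
transition_rate     <- 1 - self_loop_tendency
mean_spell_duration <- mean(runs$lengths)
max_spell_duration  <- max(runs$lengths)
dominant_state      <- names(which.max(table(s)))
```

For this sequence, four of seven consecutive pairs are unchanged (self_loop_tendency ≈ 0.57), the four spells average two time points, and Active dominates with half of the observations.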
5 Decision Workflow
- Visualize — plot_sequences() and plot_frequencies() for raw data exploration.
- Discover patterns — discover_patterns() with n-gram, gapped, and repeated types; use start, end, and contain to focus.
- Compare across outcomes — add outcome = ... to identify group-differentiating patterns.
- Model outcomes — analyze_outcome() for logistic regression; mixed = TRUE for nested data.
- Compute indices — sequence_indices() for per-sequence structural summaries.
- Interpret together — a coefficient is more trustworthy when the pattern is both statistically significant (chi-squared) and structurally meaningful (lift > 1, high support), and the index profile is consistent.
References
LA Methods Chapters
Sequence Analysis and Temporal Methods
- Saqr, M., López-Pernas, S., Helske, S., Durand, M., Murphy, K., Studer, M., & Ritschard, G. (2024). Sequence analysis in education: Principles, technique, and tutorial with R. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 267–298). Springer. https://lamethods.org/book1/chapters/ch10-sequence-analysis/ch10-seq.html
- Helske, J., Helske, S., Saqr, M., López-Pernas, S., & Murphy, K. (2024). A modern approach to transition analysis and process mining with Markov models in education. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 331–362). Springer. https://lamethods.org/book1/chapters/ch12-markov/ch12-markov.html
- López-Pernas, S., Saqr, M., Helske, S., & Murphy, K. (2024). Multi-channel sequence analysis in educational research: An introduction and tutorial with R. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 363–400). Springer. https://lamethods.org/book1/chapters/ch13-multichannel/ch13-multi.html
Transition Network Analysis
- Saqr, M., López-Pernas, S., & Tikka, S. (2025). Mapping relational dynamics with transition network analysis: A primer and tutorial. In M. Saqr & S. López-Pernas (Eds.), Advanced learning analytics methods. Springer. https://lamethods.org/book2/chapters/ch15-tna/ch15-tna.html
- Saqr, M., López-Pernas, S., & Tikka, S. (2025). Capturing the breadth and dynamics of the temporal processes with frequency transition network analysis: A primer and tutorial. In M. Saqr & S. López-Pernas (Eds.), Advanced learning analytics methods. Springer. https://lamethods.org/book2/chapters/ch16-ftna/ch16-ftna.html
- López-Pernas, S., Tikka, S., & Saqr, M. (2025). Mining patterns and clusters with transition network analysis: A heterogeneity approach. In M. Saqr & S. López-Pernas (Eds.), Advanced learning analytics methods. Springer. https://lamethods.org/book2/chapters/ch17-tna-clusters/ch17-tna-clusters.html
Complex Dynamic Systems
- Saqr, M., Dever, D., López-Pernas, S., Gernigon, C., Marchand, G., & Kaplan, A. (2025). Complex dynamic systems in education: Beyond the static, the linear and the causal reductionism. In M. Saqr & S. López-Pernas (Eds.), Advanced learning analytics methods. Springer. https://lamethods.org/book2/chapters/ch12-cds/ch12-cds.html
- Saqr, M., Schreuder, M. J., & López-Pernas, S. (2024). Why educational research needs a complex system revolution that embraces individual differences, heterogeneity, and uncertainty. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R. Springer. https://lamethods.org/book1/chapters/ch22-conclusion/ch22-conclusion.html
Package References
- Tikka, S., López-Pernas, S., & Saqr, M. (2025). tna: An R package for Transition Network Analysis. Applied Psychological Measurement. https://doi.org/10.1177/01466216251348840
- Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025d). Transition Network Analysis: A novel framework for modeling, visualizing, and identifying the temporal patterns of learners and learning processes. In LAK ’25 (pp. 351–361). ACM. https://doi.org/10.1145/3706468.3706513
Methodological References
- Abbott, A. (1995). Sequence analysis: New methods for old ideas. Annual Review of Sociology, 21(1), 93–113. https://doi.org/10.1146/annurev.so.21.080195.000521
- Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In SIGMOD ’93 (pp. 207–216). ACM. https://doi.org/10.1145/170036.170072
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum.
Citation
@misc{saqr2026,
author = {Saqr, Mohammed and López-Pernas, Sonsoles},
title = {Sequence {Patterns,} {Outcomes,} and {Indices} with `Codyna`},
date = {2026-02-14},
url = {https://sonsoleslp.github.io/posts/codyna-seq-tutorial/},
langid = {en}
}