Sequence Patterns, Outcomes, and Indices with codyna

A tutorial on the dynamics of behavior with codyna

Categories: tutorial, R

Authors: Mohammed Saqr and Sonsoles López-Pernas (University of Eastern Finland)

Published: February 14, 2026

1 Exploring Sequences

Learning unfolds as an ordered sequence of states—engagement levels across weeks, problem-solving actions within a session, regulatory behaviors during collaboration. Sequence analysis preserves this temporal ordering rather than collapsing it into a single summary. The approach originated in molecular biology for comparing DNA sequences and was adapted for the social sciences by Abbott (1995). In education, it has been used to study course-taking trajectories, self-regulated learning strategies, and collaborative regulation dynamics (Saqr et al., 2024a).

From a complex dynamic systems perspective, learners exhibit feedback loops, attractor states, and phase transitions (Saqr et al., 2025a). A student’s trajectory is not a random walk: disengagement breeds more disengagement, current states shape the next ones, and sudden shifts from one stable regime to another are common. Sequence-level analysis captures these dynamics at the individual level, complementing the aggregate view provided by transition models.

Tip: Companion tutorials

This tutorial is part of a series of tutorials on the dynamics of learning and learners using Transition Network Analysis with the tna and codyna R packages:

  1. Transition Network Analysis with R — building, visualizing, and interpreting TNA models; centrality, communities, bootstrapping.
  2. TNA Group Analysis — analysis and comparison of groups.
  3. TNA Clustering — discovering and analyzing clusters of sequences.
  4. TNA Model Comparison — edge-level, summary, centrality, and network-level comparison; permutation tests.
  5. Sequence Patterns, Outcomes, and Indices (this tutorial) — pattern discovery, outcome modeling, and structural indices with codyna.

Package website: https://sonsoles.me/tna/

Sequence analysis is a family of methods for studying ordered categorical data. In education, the “sequence” is typically a student’s trajectory through a series of states measured at successive time points.

Three levels of analysis are possible:

  1. Sequence visualization: plotting individual trajectories and state distributions to see the raw data before modeling.
  2. Sequence indices: computing per-sequence summary measures (entropy, stability, complexity) that characterize how each sequence unfolds.
  3. Pattern discovery: identifying recurring sub-sequences (n-grams, gapped patterns) that tell us what specific pathways students follow.

This tutorial covers all three. For a comprehensive introduction to sequence analysis in education, see Saqr et al. (2024a). For transition-based approaches, see López-Pernas et al. (2024a) on Markov models and Saqr et al. (2025b) on Transition Network Analysis.

This tutorial works with three datasets, each illustrating different scenarios of sequence analysis. Before analyzing any dataset, we visualize it—distribution plots and frequency plots establish the context that makes pattern and index results interpretable.

Table 1: Datasets used in this tutorial

  Dataset          | Source         | Sequences | States                          | Time points | Used for
  group_regulation | tna package    | 2,000     | 9 (collaborative regulation)    | up to 26    | N-gram examples with filtering
  codyna_data      | Codyna.RDS     | 5,000     | 10 (math exercise actions)      | up to 10    | Pattern discovery, outcome modeling
  engagement       | codyna package | 1,000     | 3 (Active, Average, Disengaged) | 25          | Sequence indices

1.1 Visualizing the regulation data

The first step with any dataset is to visualize it. We begin with the group_regulation dataset—2,000 collaborative regulation sequences with 9 states. TNA provides tools for preparing and visualizing event data; we use them here to prepare the data, but the focus is on the sequences rather than the network analysis.

data("group_regulation_long", package = "tna")
prepared <- prepare_data(
  group_regulation_long,
  action = "Action", actor = "Actor", time = "Time"
)

Having prepared the data, we can explore it with sequence analysis and visualize the sequences. TNA includes several plotting methods, among them sequence plots. Here we use distribution plots. A state distribution plot aggregates individual trajectories into proportions at each time point:

plot_sequences(prepared, type = "distribution", scale = "proportion")
Figure 1: State proportions over time. The roughly flat distribution indicates a stationary process—no strong temporal trend.

TNA also includes tools for plotting state frequencies, so you don’t have to compute them manually. A frequency plot shows overall state prevalence—the marginal baseline for interpreting lift later:

model_reg <- tna(prepared)
plot_frequencies(model_reg)
Figure 2: State frequencies. Consensus and plan dominate; synthesis and adapt are rare.

1.2 Visualizing the problem-solving data

The math problem-solving dataset contains 5,000 math exercise sequences with 10 states (Correct, Wrong, Clue, Guide, Instruct, Question, Quit, Right, Skip, Try), recorded while students solved math problems. Clue, Guide, Instruct, and Question are AI-support actions intended to help students solve the questions; Right and Wrong record the outcome of these exercises.

Each row is one problem attempt with up to 10 steps.

# Load the problem-solving data (source file per Table 1)
codyna_data <- readRDS("Codyna.RDS")

# Reshape the wide data (one row per attempt, columns = time steps)
# into long format: one row per (attempt, time, action)
codyna_long <- data.frame(
  id = rep(seq_len(nrow(codyna_data)), each = ncol(codyna_data)),
  time = rep(seq_len(ncol(codyna_data)), nrow(codyna_data)),
  action = as.vector(t(as.matrix(codyna_data)))
)
codyna_long <- codyna_long[!is.na(codyna_long$action), ]  # drop padding after short attempts
prepared_codyna <- prepare_data(codyna_long, action = "action", actor = "id", time = "time")
plot_sequences(prepared_codyna, sort_by = "action_T2")  # index plot, sorted by the state at T2
Figure 3: Sequence index plot: each row is one of 5,000 problem-solving sequences. Most are short (5 steps), and Wrong (red) and Instruct (blue) dominate the early positions.
plot_sequences(prepared_codyna, type = "distribution", scale = "proportion")
Figure 4: State proportions over time. Wrong peaks at T1 and declines; Right appears only at the end—the terminal success state.
model_codyna <- tna(prepared_codyna)
plot_frequencies(model_codyna)
Figure 5: State frequencies across all time points. Instruct and Wrong dominate; Right and Correct are less frequent.

The index plot reveals that most sequences are short and begin with Wrong, which captures students’ wrong attempts and the feedback offered in response.

2 Pattern Discovery

TNA builds a transition matrix from all sequences, revealing which pairwise transitions are most probable. Pattern discovery complements this by examining each sequence individually—identifying the specific multi-step sub-sequences that recur across students. Longer pathways like Wrong→Quit→Skip→Instruct→Wrong extend the picture to five-step chains. Patterns can also be linked to outcomes—not just which pathways exist, but which predict success or failure—a perspective and functionality that no other package provides.

2.1 N-grams

TNA already captures pairwise transitions (length 2), so the value of n-grams begins at length 3 and above. We start with the regulation data. A TNA model would show consensus→plan and plan→consensus as the two strongest edges. Do these chain into sustained multi-step pathways within the same sequences?

data("group_regulation")
reg_ngrams <- discover_patterns(group_regulation, type = "ngram", len = 3:5)
reg_ngrams 

Top n-grams (lengths 3–5) from collaborative regulation sequences

  • Frequency: total occurrences across all sequences (one sequence can contribute multiple instances).
  • Count: number of sequences containing the pattern at least once.
  • Support: count / total sequences—the proportion containing the pattern. Use this to compare across datasets of different sizes.
  • Lift: observed support / expected support under independence. Above 1 = over-represented; below 1 = under-represented (Agrawal et al., 1993). A worked example follows this list.
  • Proportion: pattern’s share of total frequency at its length.
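
To make support and lift concrete, here is a hand computation for a hypothetical pattern. The counts and marginal prevalences are illustrative rather than taken from the data, and the expected-support formula shown (a product of marginal state probabilities) is one common formulation; codyna’s exact computation may differ.

n_seq <- 2000                # total number of sequences
count <- 700                 # sequences containing consensus->plan->plan at least once
support <- count / n_seq     # 0.35: proportion of sequences with the pattern

# Expected support under independence: product of marginal state probabilities
p_consensus <- 0.30          # hypothetical marginal prevalence of consensus
p_plan <- 0.25               # hypothetical marginal prevalence of plan
expected <- p_consensus * p_plan^2
lift <- support / expected   # above 1: over-represented relative to chance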

consensus→plan→plan (support = 0.35) appears in over a third of all sequences; plan→plan→plan (support = 0.23) in nearly a quarter. The consensus-then-planning loop is a genuine multi-step pathway—groups who reach agreement build extended planning episodes. The two strong TNA edges consensus→plan and plan→plan combine into coherent within-sequence trajectories.

What follows planning? The start parameter isolates pathways originating from a given state:

reg_plan <- discover_patterns(group_regulation, type = "ngram", len = 3:4, start = "plan")
reg_plan

Planning leads three ways: sustained planning (plan→plan→plan, support = 0.23), cycling back to consensus (plan→consensus→plan, support = 0.24), and emotional reactions (plan→plan→emotion, support = 0.14). The first two sustain the task; the third signals that extended planning sometimes triggers affect.

What follows consensus?

reg_cons <- discover_patterns(group_regulation, type = "ngram", len = 3:4, start = "consensus")
reg_cons

N-grams starting with consensus

Consensus chains into planning: the top three trigrams all route through plan. consensus→plan→plan (support = 0.35) dominates—when groups agree, they commit to extended planning.

Where does emotion lie within the sequence of interactions? Emotion is a low-frequency state in the TNA network, but pattern discovery reveals its role as a connector—how it mediates other regulatory behaviors. The contain filter extracts patterns that include it:

reg_emotion <- discover_patterns(group_regulation, type = "ngram", len = 3:4, contain = "emotion")
reg_emotion

N-grams involving emotion

emotion→cohesion→consensus (support = 0.18) is the dominant pathway. Emotional expression leads to social bonding and then group agreement—a three-step recovery arc invisible in the aggregate network, where emotion has weak edges to many states. Emotion feeds back into the consensus→plan cycle.

Now we apply the same approach to the math problem-solving data. With defaults, discover_patterns() extracts n-grams of length 2 through 5:

ngrams <- discover_patterns(codyna_data)
ngrams

Top 10 n-grams (default: lengths 2–5)

plot(ngrams, n = 10)

Top 10 n-grams by proportion.

Trigrams (length 3) reveal multi-step pathways that bigrams cannot:

trigrams <- discover_patterns(codyna_data, type = "ngram", len = 3)
trigrams

Top 10 trigrams (length 3)

Wrong→Quit→Skip has lift 4.82—nearly 5 times more frequent than expected. The bigrams Wrong→Quit and Quit→Skip each appear separately, but only the trigram reveals them as a single giving-up sequence. Extracting lengths 3 through 5 shows how the pathway extends:

ngrams_range <- discover_patterns(codyna_data, type = "ngram", len = 3:5)
ngrams_range

Top 10 n-grams at lengths 3–5

At length 4, Wrong→Quit→Skip→Instruct (support = 0.19); at length 5, it adds →Wrong—a complete failure loop. Since most sequences are 5 steps long, length-5 n-grams capture entire trajectories:

ngrams_5 <- discover_patterns(codyna_data, type = "ngram", len = 5)
ngrams_5

Top 10 five-step patterns (full sequences)

Wrong→Quit→Skip→Instruct→Wrong (support = 0.17) is the single most common complete trajectory—17% of all sequences follow this path.

2.2 Gapped patterns

N-grams require consecutive states. Gapped patterns allow wildcards (*) between anchoring states, capturing regularities that persist regardless of intervening steps. The same start, end, and contain filters from the n-gram examples work here:

gapped <- discover_patterns(codyna_data, type = "gapped")
gapped

Top 10 gapped patterns (defaults)

Wrong→***→Wrong (support = 0.51): half of all sequences contain a return to Wrong after three intervening states. Wrong→***→Right (support = 0.34): about a third eventually reach success through three intermediate steps.

Each * matches exactly one wildcard state. Wrong→*→Right matches any three-step sub-sequence from Wrong to Right. The gap parameter controls the range: gap = 1 produces single-wildcard patterns; gap = 1:3 produces 1, 2, or 3 wildcards. Defaults cover a range appropriate to the sequence length.
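
As a quick illustration of the gap argument described above (values as documented in the text, applied to the same dataset):

# Exactly one wildcard between anchoring states (e.g., Wrong -> * -> Right)
gapped_1 <- discover_patterns(codyna_data, type = "gapped", gap = 1)

# Allow one, two, or three intervening states
gapped_13 <- discover_patterns(codyna_data, type = "gapped", gap = 1:3)
gapped_13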

2.3 Repeated patterns

Repeated patterns detect consecutive runs of the same state—instructional loops, hint-seeking chains, or persistent errors:

repeated <- discover_patterns(codyna_data, type = "repeated", len = 2:4)
repeated

Repeated patterns (same-state runs)

Instruct→Instruct dominates (support = 0.16)—16% of sequences contain at least two consecutive Instruct steps. The triple Instruct→Instruct→Instruct still appears in 5% of sequences. In Section 4 we will see how self_loop_tendency captures this as a per-sequence index.

TNA’s diagonal entries give the aggregate self-loop probability for each state. Repeated patterns identify which sequences contain runs of a given length and how common they are. A high aggregate self-loop with low repeated-pattern support means self-loops are spread thinly across many sequences; high support means they concentrate in specific sequences as genuine loops.
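
One way to see this side by side is to place the model’s diagonal next to the repeated-pattern support above. A minimal sketch, assuming the tna model object stores its transition matrix in $weights (check the tna documentation for the exact accessor):

# Aggregate self-loop probabilities: the diagonal of the transition matrix
aggregate_self_loops <- diag(model_codyna$weights)
sort(round(aggregate_self_loops, 2), decreasing = TRUE)
# Compare against the per-sequence support values from
# discover_patterns(..., type = "repeated")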

2.4 Which sequences lead to different outcomes?

Do specific sequences of actions predict whether a student ultimately succeeds or fails? The outcome parameter links each pattern to a binary outcome—here, whether the last observed state is Right (solved) or Wrong (unsolved). For each pattern, discover_patterns() counts its prevalence in each outcome group and runs a chi-squared test of association:

ngrams_outcome <- discover_patterns(
  codyna_data,
  outcome = "last_obs",
  type = "ngram", len = 2:3
)
ngrams_outcome

Top 10 outcome-differentiated n-grams

plot(ngrams_outcome, n = 10)

Outcome-differentiated n-grams. Patterns with large count imbalances are candidate predictors for Section 3.
  • count_<group>: sequences in each outcome group containing the pattern.
  • chisq: chi-squared statistic testing independence of pattern presence and outcome.
  • p_value: p-value from the chi-squared test.
  • effect_size (Cramér’s V): strength of association, bounded in [0, 1]. Above 0.10 = small; above 0.30 = medium (Cohen, 1988).

The chi-squared test treats each pattern independently. Correlated patterns are not adjusted for—the regression models in Section 3 handle this.

Skip→Instruct (p < 0.001) appears in 1016 Wrong-ending sequences and 0 Right-ending sequences—every student who skips and then receives instruction ultimately fails. The next section quantifies these effects with regression.

3 Predicting Outcomes from Patterns

Chi-squared tests (Section 2.4) identify which patterns differ between groups but cannot rank them by effect size or adjust for confounding. analyze_outcome() selects the top patterns, encodes them as predictors, and fits a logistic regression.

Three steps:

  1. Pattern discovery: calls discover_patterns() with the specified type, len, etc.
  2. Selection: ranks patterns by the priority criterion ("chisq", "lift", or "support") and picks the top n.
  3. Model fitting: encodes each pattern as binary (0/1, when freq = FALSE) or as within-sequence count (when freq = TRUE), then fits glm() (or lme4::glmer() when mixed = TRUE).

The returned object is a standard glm or glmerMod, so summary(), coef(), AIC(), predict(), and confint() all work.
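
A brief illustration of these generics, using model_binary as fitted in Section 3.1 below:

coef(model_binary)                   # log-odds coefficients
exp(coef(model_binary))              # odds ratios
exp(confint(model_binary))           # odds ratios with profile-likelihood CIs
AIC(model_binary)                    # penalized fit; comparable across models
head(predict(model_binary, type = "response"))  # fitted success probabilities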

3.1 Binary predictors: presence or absence of a pattern. If students receive a clue, does it help them get the answer correct?

Each selected pattern becomes a 0/1 predictor:

model_binary <- analyze_outcome(
  codyna_data,
  outcome = "last_obs",
  reference = "Wrong",
  n = 5,
  freq = FALSE,
  priority = "chisq",
  type = "ngram",
  len = 1:2,
  mixed = FALSE
)
summary(model_binary)

Call:
glm(formula = f, family = binomial, data = df)

Coefficients:
                     Estimate Std. Error z value             Pr(>|z|)    
(Intercept)           0.00564    0.04514    0.12              0.90058    
Quit                 -5.03463    0.41466  -12.14 < 0.0000000000000002 ***
Instruct             -0.03506    0.08240   -0.43              0.67046    
Clue_to_Clue          0.40243    0.09233    4.36             0.000013 ***
Question_to_Guide    -0.51225    0.13511   -3.79              0.00015 ***
Instruct_to_Instruct  0.02576    0.10314    0.25              0.80281    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6771.5  on 4999  degrees of freedom
Residual deviance: 5642.7  on 4994  degrees of freedom
AIC: 5655

Number of Fisher Scoring iterations: 7

analyze_outcome() looked at all unigrams and bigrams, ranked them by chi-squared, picked the top 5, and fit a logistic regression using them as 0/1 predictors. reference = "Wrong" means positive coefficients increase the odds of Right (success). Converting to odds ratios:

odds_ratios <- exp(coef(model_binary))

data.frame(
  Pattern = names(odds_ratios),
  `Log-Odds` = round(coef(model_binary), 3),
  `Odds Ratio` = round(odds_ratios, 3),
  row.names = NULL, check.names = FALSE
) 

Coefficients as log-odds and odds ratios

  • Odds ratio = exp(coefficient). OR = 3.0 means “3× the odds of success”; OR = 0.5 means “half the odds.” OR = 1.0 = no effect.
  • Confidence intervals: exp(confint(model)) on the odds ratio scale. If the CI includes 1.0, the effect is not significant.
  • AIC: lower = better fit, penalized for complexity. Compare models with different predictor sets.
  • Quit (OR = 0.01): essentially zero odds of success—quitting is the primary failure pathway, consistent with the chi-squared results in Section 2.4.
  • Clue_to_Clue (OR = 1.5): about 1.5× the odds of succeeding. Repeated hint-seeking signals engagement, matching the repeated-pattern findings in Section 2.3.
  • Question_to_Guide (OR = 0.6): reduces the odds—students who ask a question and then receive guidance are less likely to solve the problem.
  • Instruct and Instruct_to_Instruct: not significant (p > 0.05). Common but not outcome-differentiating once the other predictors are in the model.

3.2 Frequency-based predictors: counting how often a pattern occurs, not just whether it is present. Are more clues useful?

Setting freq = TRUE uses within-sequence pattern counts instead of 0/1:

model_freq <- analyze_outcome(
  codyna_data,
  outcome = "last_obs",
  reference = "Wrong",
  n = 10,
  freq = TRUE,
  priority = "chisq",
  type = "ngram",
  len = 1:2,
  mixed = FALSE
)
summary(model_freq)

Call:
glm(formula = f, family = binomial, data = df)

Coefficients:
                     Estimate Std. Error z value             Pr(>|z|)    
(Intercept)           -0.2109     0.0730   -2.89               0.0038 ** 
Quit                  -4.6488     0.4190  -11.09 < 0.0000000000000002 ***
Clue                  -0.0317     0.0672   -0.47               0.6377    
Correct                0.0999     0.0827    1.21               0.2275    
Instruct              -0.2151     0.0904   -2.38               0.0173 *  
Question_to_Instruct   0.5310     0.1148    4.63         0.0000037146 ***
Correct_to_Correct     0.2755     0.1455    1.89               0.0582 .  
Clue_to_Clue           0.5759     0.1147    5.02         0.0000005160 ***
Guide_to_Guide         0.5571     0.0951    5.86         0.0000000047 ***
Question_to_Guide     -0.2534     0.1453   -1.74               0.0811 .  
Instruct_to_Instruct   0.3264     0.1292    2.53               0.0115 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6771.5  on 4999  degrees of freedom
Residual deviance: 5581.7  on 4989  degrees of freedom
AIC: 5604

Number of Fisher Scoring iterations: 7

AIC = 5603.7 vs. 5654.7 for the binary model—a 51-point improvement. Repetition counts add predictive value. Guide→Guide (estimate = 0.56) is positive and significant—each additional consecutive guidance step raises the odds of success, consistent with the repeated-pattern findings in Section 2.3.
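
The comparison can be made directly: AIC() accepts several fitted models at once and returns a small table.

AIC(model_binary, model_freq)  # lower AIC indicates the better-fitting model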

3.3 Gapped pattern predictors

model_gapped <- analyze_outcome(
  codyna_data,
  outcome = "last_obs",
  reference = "Wrong",
  n = 5,
  freq = FALSE,
  priority = "chisq",
  type = "gapped",
  gap = 1,
  len = 2,
  mixed = FALSE
)
summary(model_gapped)

Call:
glm(formula = f, family = binomial, data = df)

Coefficients:
                         Estimate Std. Error z value             Pr(>|z|)    
(Intercept)               -0.1768     0.0388   -4.56         0.0000052086 ***
Quit_to_._to_Instruct     -4.8847     0.4113  -11.88 < 0.0000000000000002 ***
Clue_to_._to_Correct       0.5469     0.1188    4.60         0.0000041943 ***
Correct_to_._to_Clue       0.4285     0.0850    5.04         0.0000004597 ***
Clarify_to_._to_Instruct   0.9209     0.1534    6.00         0.0000000019 ***
Guide_to_._to_Guide        1.2755     0.2110    6.05         0.0000000015 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6771.5  on 4999  degrees of freedom
Residual deviance: 5580.7  on 4994  degrees of freedom
AIC: 5593

Number of Fisher Scoring iterations: 7

AIC = 5592.7, the lowest of the three. Clue→*→Correct captures the delayed hint effect; Guide→*→Guide captures returning to guidance after a detour.

3.4 Priority selection

The priority parameter controls which patterns enter the regression:

model_lift <- analyze_outcome(
  codyna_data,
  outcome = "last_obs", reference = "Wrong",
  n = 5, freq = FALSE, priority = "lift",
  type = "ngram", len = 1:2, mixed = FALSE
)
model_support <- analyze_outcome(
  codyna_data,
  outcome = "last_obs", reference = "Wrong",
  n = 5, freq = FALSE, priority = "support",
  type = "ngram", len = 1:2, mixed = FALSE
)
Table 2: Priority selection guide

  Priority  | Selects for                               | AIC    | Use when
  "chisq"   | Maximum group differentiation             | 5654.7 | Outcome prediction is the goal
  "lift"    | Over-representation relative to marginals | 6473   | Seeking structurally surprising patterns
  "support" | Most common patterns overall              | 6316.3 | Describing typical behavior

3.5 Mixed-effects models

When students solve multiple problems or are nested within classrooms, observations are not independent. Setting mixed = TRUE adds a random intercept per group. This is useful whenever you have multiple sessions from the same student, multiple sequences per participant, and so on; codyna accounts for this nesting with a mixed-effects model.

model_mixed <- analyze_outcome(
  raw_data,             # wide-format data: sequence columns T1-T10 plus student_id
  cols = T1:T10,        # columns holding the sequence states
  group = student_id,   # grouping variable for the random intercept
  outcome = "last_obs",
  reference = "Wrong",
  n = 5,
  freq = FALSE,
  priority = "chisq",
  type = "ngram",
  len = 1:2,
  mixed = TRUE
)
summary(model_mixed)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: .outcome ~ Quit + Instruct + Clue_to_Clue + Question_to_Guide +  
    Instruct_to_Instruct + (1 | student_id)
   Data: df

      AIC       BIC    logLik -2*log(L)  df.resid 
     5653      5699     -2820      5639      4993 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-1.279 -0.964 -0.077  0.961 12.346 

Random effects:
 Groups     Name        Variance Std.Dev.
 student_id (Intercept) 0.151    0.389   
Number of obs: 5000, groups:  student_id, 3937

Fixed effects:
                     Estimate Std. Error z value             Pr(>|z|)    
(Intercept)           0.00615    0.04705    0.13               0.8961    
Quit                 -5.09947    0.41717  -12.22 < 0.0000000000000002 ***
Instruct             -0.03721    0.08553   -0.44               0.6635    
Clue_to_Clue          0.41435    0.09596    4.32             0.000016 ***
Question_to_Guide    -0.52008    0.14000   -3.71               0.0002 ***
Instruct_to_Instruct  0.03076    0.10712    0.29               0.7740    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) Quit   Instrc Cl_t_C Qst__G
Quit        -0.012                            
Instruct    -0.486 -0.134                     
Clue_to_Clu -0.414  0.012  0.136              
Questn_t_Gd -0.251  0.026  0.017  0.112       
Instrct_t_I -0.024  0.096 -0.590  0.023  0.070
  • Random intercept variance (0.15): captures student-level differences in baseline success probability. A standard deviation of 0.39 means students differ substantially in their baseline ability.
  • Fixed effects: same interpretation as standard logistic regression, but adjusted for nesting.
cat("AIC (fixed, no nesting):", round(AIC(model_binary), 1), "\n")
AIC (fixed, no nesting): 5655 
cat("AIC (mixed, student nesting):", round(AIC(model_mixed), 1), "\n")
AIC (mixed, student nesting): 5653 

Lower AIC confirms that nesting matters. Use mixed = TRUE whenever data has a natural grouping structure.

Standard logistic regression assumes independence. In educational data, this is often violated: the same student solves multiple problems, or students are nested within classrooms. Ignoring nesting inflates the effective sample size, producing artificially small standard errors.

A mixed-effects model (via lme4::glmer()) adds a random intercept per group—capturing group-level baselines so that the fixed effects represent within-group pattern effects. Use it when any of the following hold (a quick check follows the list):

  • The same participant appears in multiple sequences
  • Students are nested within classrooms, schools, or conditions
  • The random intercept variance is non-trivial
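
The last condition can be checked directly on the fitted model:

# Random-effect variance components; a non-trivial student-level variance
# supports keeping the random intercept
lme4::VarCorr(model_mixed)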

4 Sequence Indices

Sequence indices in codyna are designed for educational contexts and build on complex dynamic systems theory. Patterns identify which pathways students follow. Indices characterize how each trajectory unfolds—its diversity, stability, dynamism, and complexity—without reference to any specific pattern. We switch to the built-in engagement dataset (1,000 students, 3 states, 25 time points), whose longer sequences yield richer index distributions. As with every dataset in this tutorial, we visualize first.

data("engagement")

The distribution plot reveals a slight cohort-level drift toward disengagement over the 25 weeks. These visual patterns will connect directly to the stability, initial conditions, and emergence indices below.

plot_sequences(engagement, type = "distribution", scale = "proportion")
Figure 6: State distribution over 25 weeks. Active declines slightly over time; Disengaged increases—a gradual drift toward disengagement across the cohort.
indices <- sequence_indices(engagement, favorable = "Active")
numeric_indices <- indices[, sapply(indices, is.numeric)]
summary_df <- data.frame(
  Index = names(numeric_indices),
  Min = round(sapply(numeric_indices, min, na.rm = TRUE), 3),
  Median = round(sapply(numeric_indices, median, na.rm = TRUE), 3),
  Mean = round(sapply(numeric_indices, mean, na.rm = TRUE), 3),
  Max = round(sapply(numeric_indices, max, na.rm = TRUE), 3),
  row.names = NULL
)
summary_df

Summary statistics for all numeric sequence indices

favorable = "Active" designates Active as the target for directional indices like integrative_potential (convergence toward the favorable state) and emergent_state_proportion.

4.1 Index families

The 24 indices group into 10 families:

Table 3: Index families and their research questions

  Family             | Indices                                                                             | Question
  Coverage           | valid_n, valid_proportion                                                           | How complete is the sequence?
  Diversity          | unique_states, longitudinal_entropy, simpson_diversity                             | How spread is time across states?
  Stability          | self_loop_tendency, mean_spell_duration, max_spell_duration                        | How persistent are state episodes?
  Dynamism           | transition_rate, transition_complexity                                             | How frequent and diverse are transitions?
  Initial conditions | initial_state_persistence, initial_state_proportion, initial_state_influence_decay | How influential is the starting state?
  Cyclicity          | cyclic_feedback_strength                                                           | Are there return patterns?
  Dominance          | dominant_state, dominant_proportion, dominant_max_spell                            | Which state dominates?
  First/Last         | first_state, last_state                                                            | What are the boundary conditions?
  Emergence          | emergent_state, emergent_state_persistence, emergent_state_proportion              | Does a late-appearing state dominate?
  Integrative        | integrative_potential, complexity_index                                            | How strong are convergence and complexity?

Coverage: valid_n = non-missing time points; valid_proportion = fraction of complete observations.

Diversity: unique_states = distinct states visited. longitudinal_entropy = -\sum p_i \log p_i; maximized when all states equally visited. simpson_diversity = 1 - \sum p_i^2; gives less weight to rare states.

Stability: self_loop_tendency = proportion of consecutive pairs where state does not change. mean_spell_duration = average run length. max_spell_duration = longest run.

Dynamism: transition_rate = 1 − self_loop_tendency. transition_complexity = entropy of the transition distribution; distinguishes toggling between two states (low) from cycling through many (high).

Initial conditions: initial_state_persistence = consecutive time points in first state. initial_state_proportion = fraction of sequence in first state. initial_state_influence_decay = exponential decay rate of first state’s autocorrelation.

Cyclicity: cyclic_feedback_strength = tendency to revisit previously visited states.

Dominance: dominant_state = most prevalent state. dominant_proportion = its share. dominant_max_spell = its longest run.

Emergence: emergent_state = state dominating the second half. When different from the first-half dominant, a phase transition occurred. emergent_state_persistence and emergent_state_proportion quantify its hold.

Integrative: integrative_potential = convergence toward the favorable state over time. complexity_index = composite of entropy and transition complexity.
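
To make the definitions concrete, here is a hand computation of a few of the simpler indices for a toy sequence. This is a sketch of the formulas above; codyna’s implementations may differ in details such as the logarithm base or the handling of missing values.

s <- c("Active", "Active", "Average", "Active", "Disengaged", "Disengaged")

# longitudinal_entropy: -sum(p_i * log(p_i)) over the state proportions
p <- as.numeric(table(s)) / length(s)
entropy <- -sum(p * log(p))

# self_loop_tendency: proportion of consecutive pairs with no state change
self_loop <- mean(head(s, -1) == tail(s, -1))   # 2 of 5 pairs repeat: 0.4

# transition_rate is its complement
transition_rate <- 1 - self_loop

# max_spell_duration: longest run of the same state
max_spell <- max(rle(s)$lengths)

c(entropy = entropy, self_loop = self_loop,
  transition_rate = transition_rate, max_spell = max_spell)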

5 Decision Workflow

  1. Visualize — plot_sequences() and plot_frequencies() for raw data exploration.
  2. Discover patterns — discover_patterns() with n-gram, gapped, and repeated types; use start, end, contain to focus.
  3. Compare across outcomes — add outcome = ... to identify group-differentiating patterns.
  4. Model outcomes — analyze_outcome() for logistic regression; mixed = TRUE for nested data.
  5. Compute indices — sequence_indices() for per-sequence structural summaries.
  6. Interpret together — a coefficient is more trustworthy when the pattern is both statistically significant (chi-squared) and structurally meaningful (lift > 1, high support), and the index profile is consistent. The sketch below condenses steps 1–5 into code.
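
A compact sketch of the workflow, reusing objects and arguments from earlier sections:

# 1. Visualize
plot_sequences(prepared_codyna, type = "distribution", scale = "proportion")
plot_frequencies(model_codyna)

# 2. Discover patterns
pats <- discover_patterns(codyna_data, type = "ngram", len = 3:5)

# 3. Compare across outcomes
pats_out <- discover_patterns(codyna_data, outcome = "last_obs",
                              type = "ngram", len = 2:3)

# 4. Model outcomes
fit <- analyze_outcome(codyna_data, outcome = "last_obs", reference = "Wrong",
                       n = 5, priority = "chisq", type = "ngram", len = 1:2)

# 5. Compute indices (engagement data; favorable state = Active)
idx <- sequence_indices(engagement, favorable = "Active")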

References

LA Methods Chapters

Sequence Analysis and Temporal Methods

  • Saqr, M., López-Pernas, S., Helske, S., Durand, M., Murphy, K., Studer, M., & Ritschard, G. (2024). Sequence analysis in education: Principles, technique, and tutorial with R. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 267–298). Springer. https://lamethods.org/book1/chapters/ch10-sequence-analysis/ch10-seq.html
  • Helske, J., Helske, S., Saqr, M., López-Pernas, S., & Murphy, K. (2024). A modern approach to transition analysis and process mining with Markov models in education. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 331–362). Springer. https://lamethods.org/book1/chapters/ch12-markov/ch12-markov.html
  • López-Pernas, S., Saqr, M., Helske, S., & Murphy, K. (2024). Multi-channel sequence analysis in educational research: An introduction and tutorial with R. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R (pp. 363–400). Springer. https://lamethods.org/book1/chapters/ch13-multichannel/ch13-multi.html

Complex Dynamic Systems

  • Saqr, M., Dever, D., López-Pernas, S., Gernigon, C., Marchand, G., & Kaplan, A. (2025). Complex dynamic systems in education: Beyond the static, the linear and the causal reductionism. In M. Saqr & S. López-Pernas (Eds.), Advanced learning analytics methods. Springer. https://lamethods.org/book2/chapters/ch12-cds/ch12-cds.html
  • Saqr, M., Schreuder, M. J., & López-Pernas, S. (2024). Why educational research needs a complex system revolution that embraces individual differences, heterogeneity, and uncertainty. In M. Saqr & S. López-Pernas (Eds.), Learning analytics methods and tutorials: A practical guide using R. Springer. https://lamethods.org/book1/chapters/ch22-conclusion/ch22-conclusion.html

Package References

  • Tikka, S., López-Pernas, S., & Saqr, M. (2025). tna: An R package for Transition Network Analysis. Applied Psychological Measurement. https://doi.org/10.1177/01466216251348840
  • Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025d). Transition Network Analysis: A novel framework for modeling, visualizing, and identifying the temporal patterns of learners and learning processes. In LAK ’25 (pp. 351–361). ACM. https://doi.org/10.1145/3706468.3706513

Methodological References

  • Abbott, A. (1995). Sequence analysis: New methods for old ideas. Annual Review of Sociology, 21(1), 93–113. https://doi.org/10.1146/annurev.so.21.080195.000521
  • Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In SIGMOD ’93 (pp. 207–216). ACM. https://doi.org/10.1145/170036.170072
  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum.

Citation

BibTeX citation:
@misc{saqr2026,
  author = {Saqr, Mohammed and López-Pernas, Sonsoles},
  title = {Sequence {Patterns,} {Outcomes,} and {Indices} with `Codyna`},
  date = {2026-02-14},
  url = {https://sonsoleslp.github.io/posts/codyna-seq-tutorial/},
  langid = {en}
}
For attribution, please cite this work as:
Saqr, Mohammed, and Sonsoles López-Pernas. 2026. “Sequence Patterns, Outcomes, and Indices with `Codyna`.” https://sonsoleslp.github.io/posts/codyna-seq-tutorial/.