SaRNADesigner

Self-Amplifying RNA Replicon Design for Alphavirus-Based Vaccines and Therapeutics

Technical Whitepaper

Version 1.3 (2026-04) | Bioneer Corporation

Replicon-aware optimization for alphavirus self-amplifying RNA: preserves replicase and conserved sequence elements, depletes uridines, minimizes dsRNA, and folds 10 kb constructs in-loop via Zuker-seeded LinearFold.

1. Executive Summary

The Bioneer RNA/DNA Design Suite is an integrated family of five design tools that share a common optimization engine and report format but diverge in their biological focus: GeneCrafter (codon optimization for heterologous expression), IVTDesigner (in vitro-transcribed linear mRNA for therapeutic and research use), UTRDesigner (translation-initiation and stability engineering of 5' and 3' untranslated regions), SaRNADesigner (self-amplifying RNA replicon design based on alphavirus backbones), and CircularDesigner (covalently closed circular RNA design using permuted intron-exon, back-splicing, or ribozyme systems). Each tool accepts either a DNA coding sequence or a protein sequence, resolves a target organism codon-usage profile, runs a genetic-algorithm (GA) population-based search with tool-specific fitness terms, applies a deterministic structural and constraint post-processing pass, and returns a ranked set of candidate sequences together with a full human-readable HTML report, a print-ready PDF, machine-readable JSON and CSV summaries, and synthesis-ready FASTA files.

SaRNADesigner is the suite's tool for designing self-amplifying RNA (saRNA) replicons based on alphavirus backbones. saRNA is a two-module vaccine and therapeutic platform: an alphavirus-derived replicase (nsP1 through nsP4) drives in-cell amplification of the entire replicon, and a subgenomic promoter drives high-level expression of a gene-of-interest (GOI) encoded downstream. saRNA delivers orders-of-magnitude dose sparing relative to conventional mRNA because one delivered RNA molecule can produce hundreds to thousands of copies inside the target cell. The design challenge is distinct from linear mRNA: the replicase and the conserved sequence elements (CSEs) of the backbone must be preserved exactly, only the GOI is optimized, and uridine depletion and dsRNA minimization are mandatory because saRNA cannot use modified nucleosides (pseudouridine would block RdRp replication). SaRNADesigner enforces all of this behind a single CLI.

What a Customer Gets in One Run

A ranked list of optimized candidate sequences (typically Rank 1 plus seven alternates) ready for DNA-synthesis vendor submission.
An interactive HTML report with drill-down per candidate covering codon usage, GC-sliding-window traces, homopolymer and repeat landscape, predicted RNA secondary structure, CpG/UpA dinucleotide frequency, predicted immunogenicity, and full fitness-component breakdown.
A print-ready PDF report with the same technical content, rendering RNA secondary structures as scalable vector objects that remain legible at any zoom level, suitable for project archives, regulatory submissions, and internal design-history files.
Machine-readable JSON and CSV summaries for pipeline integration, lab-automation platforms, and electronic lab-notebook (ELN) ingestion.
FASTA sequence files, ready for direct submission to commercial synthesis vendors such as IDT, Twist Bioscience, or GenScript with their template-specific constraint profile already applied upstream.
A deterministic reproducibility record — the original configuration JSON, the random seed, the GA checkpoint chain, and the software-version hash — so any report can be regenerated bit-for-bit years after the original run.

Why the Suite Matters for mRNA Therapeutic Development

mRNA-based drugs and vaccines have moved from an academic curiosity to a central pillar of the biopharma pipeline in under a decade. Regulatory approvals of Comirnaty (BNT162b2) and Spikevax (mRNA-1273) against SARS-CoV-2 validated the modality at industrial scale, and as of 2026 the global mRNA pipeline includes therapeutic cancer vaccines, protein-replacement therapies for monogenic disease, regenerative-medicine products that transiently deliver reprogramming factors, in-situ-expressed antibodies, and self-amplifying and circular RNA platforms that promise dose sparing and longer duration of expression. Every one of these products ultimately succeeds or fails at the sequence level: codon choices that look innocuous in isolation can halve translation throughput, move global GC content into a range that triggers innate-immune sensors, introduce repeats that block high-fidelity gene synthesis, or create hidden splice sites that cause aberrant products in cells. The Bioneer RNA/DNA Design Suite exists to make those sequence-level decisions rigorous, reproducible, and defensible in front of synthesis vendors, CMC reviewers, and regulatory authorities.

How to Read This Whitepaper

This whitepaper is written with three audiences in mind. For scientists who will run the software, it documents the biological motivation for each fitness term, the precise algorithm behind each report number, and the operational defaults. For project managers and program leaders, it frames where the tool sits in the broader mRNA therapeutic development pipeline, what decision it supports, and what customer-acceptance gates it enables. For regulatory and quality-assurance staff, it summarizes compliance with published method requirements and commercial-software expectations — ALCOA+ data integrity, GAMP 5 categorization, 21 CFR Part 11 alignment, ICH Q8–Q14 development principles, and comparability to widely cited academic and commercial tools including ViennaRNA, LinearDesign, LinearFold, DNAWorks, JCat, OPTIMIZER, COOL, ThermoFisher GeneArt GeneOptimizer, GenScript OptimumGene, IDT Codon Optimization Tool, and ATUM GeneGPS.

Design Principles

The suite is built around six design principles that are worth stating explicitly. First, biology-awareness: every fitness term has a biological rationale, and no term is a black-box ML output. Second, transparency: every parameter is documented, every threshold is named in the report, and the optimization objective can be inspected before and after every run. Third, reproducibility: the combination of config, seed, and checkpoint is sufficient to regenerate any output byte-for-byte. Fourth, composability: the five tools share a JSON schema and can be chained end-to-end without format conversion. Fifth, audit-readiness: outputs are ALCOA+-compatible by construction, and the bundle is portable. Sixth, vendor-neutrality: synthesis-vendor templates are first-class and easy to extend, so the tool is not locked to a single synthesis vendor.

Scope and Non-Scope

This tool operates at the sequence level. It does not replace wet-lab testing, structural biology refinement, or in-vivo pharmacology. It does not assess protein function directly; it assesses the sequence-level determinants of expression, stability, and immune behavior that influence function. It is a force multiplier on top of informed wet-lab practice, not a substitute for it. A design delivered by the tool should be validated empirically before it is advanced to the next stage of development; the tool's job is to maximize the probability that the validation succeeds and to minimize the number of wet-lab iterations required to converge.

2. Biological Foundation and Therapeutic Context

2.1 Why Synonymous Codons Are Not Equivalent

The standard genetic code is redundant: 61 sense codons encode 20 amino acids, so most amino acids have multiple synonymous codons. The classical textbook position was that synonymous substitutions are "silent" at the protein level and therefore biologically neutral. Four decades of experimental work have overturned that view decisively. Synonymous codon choice influences the efficiency of transcription and translation, the co-translational folding trajectory of the nascent polypeptide, mRNA secondary structure and half-life, splicing fidelity, nuclear export rates, innate-immune recognition, and the yield of heterologous expression and in vitro synthesis. A protein whose sequence is identical at every amino acid position can, depending on codon choice, express at levels that differ by one or even two orders of magnitude — or fail to express entirely.

The practical consequence is that the same protein, encoded by two different synonymous sequences, can express at radically different levels in the same cell or cell-free system, fold with different accuracy, trigger different innate-immune responses, and — for sequences destined for gene synthesis — present completely different synthesis-cost-and-yield profiles to a synthesis vendor. This is why every serious mRNA or protein-expression program treats codon optimization as a distinct, quantitative engineering step rather than a cosmetic cleanup.

Codon Adaptation Index (CAI)

The Codon Adaptation Index, introduced by Sharp and Li in 1987, reduces codon choice to a single scalar between 0 and 1. For each codon, a relative adaptiveness w is computed from the frequency of that codon divided by the frequency of the most-used synonym for the same amino acid, measured from a reference set of highly expressed genes in the target organism. The CAI of a coding sequence is the geometric mean of the relative adaptiveness values of its codons. Classical interpretation: genes whose CAI is close to 1 use the codons preferred by the organism's translational machinery and tend to be well expressed; genes with CAI near 0.5 or below tend to express poorly. CAI remains the single most widely used codon-optimization metric and is embedded in every serious commercial and academic optimization tool.
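The definition above can be made concrete with a minimal sketch. The reference counts below are hypothetical toy values covering three amino acids, and the function names are illustrative only; they are not the suite's API.

```python
import math

# Hypothetical codon counts from a reference set of highly expressed genes.
# Toy numbers for three amino acids only; a real table covers all sense codons.
REFERENCE_COUNTS = {
    "GCC": 40, "GCT": 26, "GCA": 23, "GCG": 11,  # Ala
    "AAG": 57, "AAA": 43,                        # Lys
    "ATG": 100,                                  # Met
}
CODON_TO_AA = {
    "GCC": "A", "GCT": "A", "GCA": "A", "GCG": "A",
    "AAG": "K", "AAA": "K",
    "ATG": "M",
}


def relative_adaptiveness(counts, codon_to_aa):
    """w(codon) = count(codon) / count(most-used synonym for the same amino acid)."""
    best = {}
    for codon, n in counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[codon_to_aa[codon]] for codon, n in counts.items()}


def cai(cds, weights):
    """Geometric mean of w over the codons of an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    if len(cds) == 0 or len(cds) % 3:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Summing logs avoids underflow on long sequences. A zero weight would need
    # a pseudocount, which this sketch does not handle.
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS, CODON_TO_AA)
print(round(cai("ATGGCCAAG", WEIGHTS), 3))  # every codon is the preferred synonym
print(round(cai("ATGGCGAAA", WEIGHTS), 3))  # rare Ala and Lys codons lower the index
```

Both sequences encode Met-Ala-Lys; only the synonymous codon choice differs, which is exactly what the index is meant to capture.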

CAI has well-known limitations. It does not account for codon-pair effects, for local secondary structure, for tRNA pool differences among cell types or growth conditions, or for the benefits of codon-usage variety in co-translational folding. It is possible for a sequence to have CAI = 1.0 yet still express poorly because of a strong 5' UTR hairpin, a repeat that stalls ribosomes, or a cluster of rare codons at a folding intermediate. For these reasons, every tool in the Bioneer suite treats CAI as one of several objectives, not as the entire objective.

Codon Pair Bias and Context Effects

CAI treats each codon independently, but measured ribosome kinetics depend on neighboring codons too — the so-called codon-pair bias. Coleman et al. (2008) famously exploited this effect by deliberately deoptimizing codon pairs in poliovirus to produce live-attenuated vaccine strains, demonstrating that codon-pair deoptimization can suppress viral replication by multiple logs while leaving amino-acid sequence untouched. The codon-pair effect is believed to reflect steric and decoding constraints at the ribosomal A- and P-sites, where the tRNA-pair geometry matters. Bioneer's tools evaluate codon-pair bias as a secondary metric (the CPB score), and some of them allow CPB to be explicitly included or excluded from the optimization objective.

A related but distinct concept is tRNA adaptation, quantified by the tRNA Adaptation Index (tAI), which weights codons by the abundance and decoding efficiency of cognate tRNAs rather than by codon-usage frequency. tAI is more mechanistic than CAI but requires organism-specific tRNA-copy-number data that is not always available with high reliability. The Bioneer suite's CAI implementation is extensible to tAI-style weighting when the underlying codon-usage database is supplemented with tRNA abundance.

GC Content — Global and Local

Global GC content influences mRNA thermal stability and translation kinetics. In mammalian cells, GC-rich mRNAs tend to be longer-lived, exported more efficiently, and translated at higher rates than AU-rich mRNAs of otherwise equivalent sequence. Kudla et al. (2006) reported that GC-rich synonymous variants of reporter transgenes were expressed several-fold to more than 100-fold more efficiently than GC-poor variants in mammalian cells, and attributed the effect to higher steady-state mRNA levels (more efficient transcription or mRNA processing) rather than to differences in translation rate or mRNA decay. GC-rich transcripts, however, can form more stable secondary structure, which can block cap-dependent scanning if it forms within the first 30–60 nucleotides of the 5' UTR or CDS.

Local GC content, measured in sliding windows of 30 to 60 nucleotides, is the more operationally important metric for gene synthesis. Synthesis vendors impose windowed GC constraints — typically 25–75% for standard products and narrower 30–70% for higher-stringency clonal products — because very low or very high local GC disrupts phosphoramidite coupling and oligonucleotide assembly. A gene with globally acceptable GC content can still contain short windows of extreme GC bias that fail synthesis-QC. Bioneer's tools therefore evaluate GC content both globally (for biological fit) and in a sliding window (for synthesis feasibility), with window size and acceptance limits configurable per synthesis vendor profile.
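A minimal sketch of the windowed check described above follows. The 50-nt window and the 25–75% limits are taken from the ranges quoted in the text; the function names are illustrative and are not the suite's API.

```python
def gc_fraction(seq):
    """Global GC fraction of a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_window_violations(seq, window=50, low=0.25, high=0.75):
    """Return (start, gc) for every window whose GC fraction is outside [low, high]."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    count = sum(is_gc[:window])  # GC count of the first window
    violations = []
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide by one: add the base entering on the right, drop the one leaving.
            count += is_gc[start + window - 1] - is_gc[start - 1]
        gc = count / window
        if gc < low or gc > high:
            violations.append((start, gc))
    return violations


# A sequence with acceptable global GC (50%) but extreme local windows.
example = "AT" * 25 + "GC" * 25
print(gc_fraction(example))
print(len(gc_window_violations(example)))
```

The example is deliberately pathological: its global GC content sits in the middle of the acceptable range while the windows at both ends fall outside it, which is the failure mode the paragraph describes.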

Minimum Free Energy and Local RNA Secondary Structure

Single-stranded mRNA folds into secondary structure. The thermodynamically most stable fold is described by its Minimum Free Energy (MFE), computed as the most negative free-energy value over all possible base-pairing configurations. The canonical MFE algorithm is the Zuker dynamic programming recursion, refined over three decades by Mathews and collaborators and implemented most widely in ViennaRNA's RNAfold and Mathews' RNAstructure. Zuker's O(n³) time complexity becomes a bottleneck for mRNAs longer than a few hundred nucleotides; for therapeutic mRNAs of 1–4 kilobases and saRNA replicons of 10 kilobases, alternatives are mandatory.

LinearFold, introduced by Huang and collaborators in 2019, re-casts RNA secondary structure prediction as a beam-search over a left-to-right decoding of the sequence, yielding O(n) time and linear memory usage with empirically negligible accuracy loss on native and synthetic RNA benchmarks. LinearFold made full-length therapeutic mRNA folding tractable inside an optimization loop rather than as a one-shot post-hoc analysis. LinearDesign, from the same group (Zhang et al., Nature 2023), extended the paradigm to co-optimization of codon choice and minimum free energy via a lattice-based dynamic program that enumerates synonymous translations while simultaneously computing MFE, yielding joint CAI–MFE Pareto-optimal sequences for SARS-CoV-2 spike and other mRNA targets.

For the Bioneer suite, structural evaluation is not a single-method call but a hybrid: short sequences or short windows are folded with the exact Zuker recursion (via a refactored, JIT-accelerated RNAFold kernel); longer sequences use LinearFold with configurable beam size; very long constructs (saRNA and circRNA precursors above ~3 kb) are processed in a sliding-window Zuker-seeded LinearFold, in which short windows are folded exactly, their high-confidence pairs are passed as soft constraints to a global LinearFold call, and the combined result is scored. The customer-visible benefit is that reported MFE and structural-penalty values remain meaningful across the full length range of therapeutic RNA, not just the short sequences where exact folding was historically feasible.

Repeat Landscape and Low Complexity

Direct and inverted repeats, along with low-complexity homopolymeric runs, produce two distinct failure modes: (i) synthesis failure, in which a gene-synthesis vendor's oligo-assembly pipeline fails to close the sequence, and (ii) biological aberrance, in which repeats form stem-loops that stall ribosomes, activate innate-immune sensors of double-stranded RNA, recruit RNA-binding proteins, or seed illegitimate recombination during replication. Each of the Bioneer tools tracks repeat metrics at three resolutions: homopolymer runs (A, C, G, T individual tract length), short tandem repeats (repeated motifs of length 2–10), and long inverted repeats (dsRNA-forming pairs of 20 nucleotides or longer). Acceptance thresholds are provider- and program-specific, reflecting the empirical fact that different synthesis chemistries tolerate different repeat classes to different degrees.
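Two of the three resolutions above, homopolymer runs and long inverted repeats, can be sketched with the standard library alone. This is an exact-match illustration under stated assumptions: the function names are hypothetical, short tandem repeats are omitted, and a production scan would likely also tolerate mismatches and G-U pairing.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def longest_homopolymers(seq):
    """Longest uninterrupted run of each base, e.g. {'A': 3, 'C': 2, 'G': 1, 'T': 5}."""
    seq = seq.upper().replace("U", "T")
    longest = {base: 0 for base in "ACGT"}
    for match in re.finditer(r"A+|C+|G+|T+", seq):
        run = match.group()
        longest[run[0]] = max(longest[run[0]], len(run))
    return longest


def inverted_repeats(seq, k=20):
    """Return (i, j) pairs where seq[j:j+k] is the reverse complement of seq[i:i+k].

    Only non-overlapping downstream arms (j >= i + k) are reported, so each
    perfect k-nt stem is listed once.
    """
    seq = seq.upper().replace("U", "T")
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    hits = []
    for i in range(len(seq) - k + 1):
        partner = reverse_complement(seq[i:i + k])
        for j in index.get(partner, []):
            if j >= i + k:
                hits.append((i, j))
    return hits


arm = "ACGTTGCAAGGCTTAACCGG"  # 20-nt arm of a toy hairpin
hairpin = arm + "AAAAA" + reverse_complement(arm)
print(longest_homopolymers(hairpin))
print(inverted_repeats(hairpin, k=20))
```

Indexing all k-mers once keeps the scan close to linear in sequence length, which matters at replicon scale.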

mRNA Innate-Immune Recognition

Exogenous single-stranded RNA activates the innate immune system through multiple receptors. TLR7 and TLR8 recognize uridine-rich single-stranded RNA in the endosomes of plasmacytoid dendritic cells and macrophages, respectively; TLR3 and the cytosolic sensors RIG-I and MDA5 recognize double-stranded RNA (RIG-I preferring short duplexes with a 5'-triphosphate end, MDA5 preferring long duplexes); TLR9 recognizes unmethylated CpG motifs in DNA; and the interferon-induced protein kinase PKR and the 2'-5' oligoadenylate synthetase OAS are activated by structured or long dsRNA. For therapeutic mRNA, this innate-immune sensitivity is a double-edged sword: for a vaccine, some degree of innate stimulation can be adjuvant-like and desirable; for a protein-replacement therapeutic, innate activation causes rapid mRNA degradation, inflammatory adverse events, and dose-limiting toxicity.

The dominant pharmaceutical strategy is nucleoside modification: replacement of uridine with pseudouridine (Ψ) or, in current clinical products, N1-methylpseudouridine (m1Ψ). The approach builds on the nucleoside-modification work of Karikó and Weissman, who shared the 2023 Nobel Prize in Physiology or Medicine for it, and it suppresses innate-immune activation while also stabilizing the transcript. Sequence-level complementary strategies include uridine depletion, CpG-dinucleotide avoidance, UpA-dinucleotide avoidance, suppression of dsRNA-forming inverted repeats, and selection of 5' and 3' UTR sequences known to be well tolerated. These sequence-level strategies matter even when nucleoside modification is used, because m1Ψ substitution does not remove a high-CpG sequence context, which remains detectable by sensors such as ZAP (zinc-finger antiviral protein). The Bioneer suite's immunogenicity score composites CpG count, UpA count, uridine fraction, dsRNA-forming inverted-repeat count, and optional TLR motif flags into a single report metric, with configurable weights.
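A weighted composite of this kind can be sketched as follows. Only three of the listed features are included (CpG, UpA, uridine fraction); the weights, the per-kilobase normalization, and the function names are hypothetical and are not the suite's actual scoring.

```python
def immunogenicity_features(seq):
    """Sequence-level features, with dinucleotide counts normalized per kilobase."""
    rna = seq.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        raise ValueError("empty sequence")
    return {
        "cpg_per_kb": 1000.0 * rna.count("CG") / n,
        "upa_per_kb": 1000.0 * rna.count("UA") / n,
        "u_fraction": rna.count("U") / n,
    }


# Hypothetical weights, chosen only to show how the terms combine.
DEFAULT_WEIGHTS = {"cpg_per_kb": 1.0, "upa_per_kb": 0.5, "u_fraction": 100.0}


def immunogenicity_score(seq, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the features; higher means more predicted innate-immune risk."""
    features = immunogenicity_features(seq)
    return sum(weights[name] * features[name] for name in weights)


print(immunogenicity_features("ACGUAUCG"))
print(immunogenicity_score("ACGUAUCG"))
```

Keeping the features and the weights separate lets a report show the per-term breakdown alongside the composite number.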

Translation Initiation and the Kozak Context

The rate-limiting step of translation for most cellular mRNAs is initiation. The scanning ribosome recognizes an AUG start codon in a context characterized by the Kozak consensus (GCCGCCRCCATGG in mammalian mRNAs, where R is A or G; the purine at position -3 and the G at position +4 are the most functionally important positions). A strong Kozak context can increase protein yield by two- to five-fold over a weak context; the effect is particularly important for short mRNAs in which re-initiation events are rare. Bioneer tools that handle 5' UTRs evaluate Kozak context via a position-weighted score and allow the user to enforce the canonical context.
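The two most important positions can be checked with a few lines. The sketch below uses the common three-level classification (strong, adequate, weak) based only on positions -3 and +4; it is a simplification of the position-weighted score mentioned above, and the function name is hypothetical.

```python
def kozak_strength(seq, atg_index):
    """Classify the context of the start codon beginning at atg_index.

    strong:   purine at -3 and G at +4
    adequate: exactly one of the two
    weak:     neither
    """
    seq = seq.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # Position -3 is three nucleotides upstream of the A of ATG (A is +1);
    # position +4 is the nucleotide immediately after the G of ATG.
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else None
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return {2: "strong", 1: "adequate", 0: "weak"}[hits]


print(kozak_strength("GCCGCCACCATGG", 9))
print(kozak_strength("GCCGCCTCCATGT", 9))
```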

Upstream open reading frames (uORFs) in the 5' UTR can decoy ribosomes away from the main ORF and reduce main-ORF translation. uORF scanning is therefore a standard component of UTR design. Strong 5' secondary structure within the first 30 nucleotides can similarly block cap-binding-complex docking or scanning; Bioneer's cap-proximal MFE metric quantifies this risk.
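A basic uORF scan can be sketched as below. It reports every ATG that lies entirely within the 5' UTR, the first in-frame stop codon after it (searching into the CDS if necessary), and whether the upstream ATG shares a reading frame with the main ORF. The function name and output layout are illustrative assumptions, and non-AUG initiation is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5, cds):
    """List upstream ATGs in the 5' UTR with their first in-frame stop (or None)."""
    utr5 = utr5.upper().replace("U", "T")
    full = utr5 + cds.upper().replace("U", "T")
    results = []
    for start in range(len(utr5) - 2):  # ATG must lie entirely inside the UTR
        if full[start:start + 3] != "ATG":
            continue
        stop = None
        for pos in range(start + 3, len(full) - 2, 3):
            if full[pos:pos + 3] in STOP_CODONS:
                stop = pos
                break
        results.append({
            "start": start,
            "stop": stop,  # index of the first base of the stop codon
            # In frame with the main ORF when the distance to the main ATG
            # (which begins at len(utr5)) is a multiple of three.
            "in_frame_with_main": (len(utr5) - start) % 3 == 0,
        })
    return results


print(find_uorfs("GGATGCCCTAAGGCC", "ATGGCCTAA"))
```

An out-of-frame uORF that terminates before the main start codon, as in the example, is the classic ribosome-decoy configuration; an in-frame upstream ATG with no intervening stop would instead produce an N-terminally extended protein.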

Cap, Poly(A) Tail, and mRNA Lifecycle

Eukaryotic mRNAs are bracketed by a 5' cap (typically the m7G cap0 or cap1 structure) and a 3' poly(A) tail of ~100–250 nucleotides. The cap recruits the eIF4F cap-binding complex for translation initiation; the poly(A) tail recruits poly(A)-binding protein (PABP), which interacts with eIF4G at the 5' end to promote closed-loop translation and protects the transcript from 3'-to-5' exonucleolytic decay. For therapeutic mRNA, the cap is installed either co-transcriptionally (CleanCap-AG, CleanCap-AT, or the ARCA anti-reverse cap analog) or post-transcriptionally (vaccinia-virus capping enzyme). Each chemistry has sequence-level requirements at the +1 transcription start: CleanCap-AG requires an AG initiator, ARCA tolerates GG or GA, and enzymatic capping is sequence-agnostic. Bioneer's IVTDesigner enforces these chemistry-specific constraints and flags sequences that would yield low capping efficiency.

Poly(A) tail length and composition influence both stability and translational efficiency. Encoded poly(A) stretches (as opposed to enzymatically added tails) face synthesis challenges — homopolymers of ≥100 A nucleotides are difficult to synthesize and clone — and Bioneer tools split the design of the encoded region from the length of the in vitro polyadenylation step performed downstream.

The Ribosome Elongation Cycle and Codon-Dependent Kinetics

Translation elongation is not uniform along a coding sequence. The ribosome's A-site accommodates an aminoacyl-tRNA whose anticodon matches the A-site codon; each accommodation event is a probabilistic race between cognate, near-cognate, and non-cognate tRNA species that happen to diffuse past. The rate of the accommodation step depends on the cellular abundance of the cognate tRNA, on the codon–anticodon interaction strength (including the wobble position), on the local mRNA secondary structure that may restrict ribosome access, and on the identity of the P-site tRNA that dictates the peptidyl-transferase reaction following accommodation. The practical upshot is that synonymous codon substitutions — changes that leave the protein sequence untouched — can dilate or compress the local ribosome dwell time by factors of two to five. Ribosome profiling experiments in yeast, bacteria, and mammalian cells have mapped these local velocity variations at nucleotide resolution and established that they are reproducible, codon-dependent, and relevant to downstream biology.

The biological relevance of non-uniform elongation becomes concrete when a protein contains multiple structural domains that fold independently. The classical single-domain view of translation — ribosome elongation as a nearly-instantaneous preparation of a completed polypeptide that then folds as a unit — has been replaced by a co-translational view in which the N-terminal domain begins folding as soon as it emerges from the ribosome exit tunnel, while the C-terminal domain is still being synthesized. Ribosome pauses encoded at domain boundaries give the N-terminal domain time to complete folding before the next domain starts. When codon optimization removes these pauses, the two domains can misfold into a kinetically trapped state from which they cannot escape, producing insoluble aggregates even at high expression levels. For heterologously expressed enzymes, cytokines, and multi-domain therapeutic proteins, this co-translational folding effect is one of the principal empirical reasons that maximum-CAI optimization sometimes underperforms moderate-CAI optimization.

tRNA Pools, Charging, and the CAI-to-tAI Bridge

CAI assumes that the codon-usage frequency in highly expressed genes reflects the relative availability of cognate tRNAs. For many well-studied organisms this is broadly true, but there are exceptions. Tissue-specific tRNA expression in mammals — most strikingly in proliferating versus differentiated cells — creates codon-usage environments that differ materially from the species-average; Gingold et al. (2014) described proliferation-associated and differentiation-associated tRNA expression signatures that skew the effective codon-usage landscape. Stress responses (amino-acid starvation, oxidative stress, infection) alter tRNA charging fractions — only aminoacylated tRNAs can decode their codon, and uncharged tRNAs compete as near-cognate decoys. These dynamic effects are not captured by a species-level CAI calculation.

For programs where these effects matter, the Bioneer suite's codon-usage database can be rebuilt from tissue-specific or cell-line-specific tRNA-copy-number data, producing a tAI-style weighting that the GA consumes identically to a CAI-style weighting. The practical workflow is: measure or download the relevant tRNA expression data, convert it to per-codon weights using the wobble-decoding rules (Dong et al. 1996), write a TSV, and re-run the HDF5 builder. All downstream tool behavior is unchanged; only the numerical weights differ.

GC-Rich versus GC-Poor Codon Pools

The human genome has a broad GC distribution; highly expressed housekeeping genes tend to be GC-rich, while tissue-specific or induced genes tend to be more AT-balanced. This is not a coincidence: GC-rich codons tend to be decoded by GC-rich anticodons of abundant tRNAs, and GC-rich mRNAs tend to be more stable and better exported from the nucleus. For a heterologously expressed protein, pushing GC content too low can depress expression by reducing tRNA availability and by destabilizing the transcript; pushing GC content too high can introduce synthesis-problematic repeats (CCGCCG motifs, GC-island-like windows) and can create stable secondary structure that blocks translation. The fitness landscape therefore peaks at an intermediate GC content and falls off toward both extremes, and the optimum for a given protein depends on the host and on the synthesis vendor's template. The Bioneer suite exposes GC as a tunable target, defaulting to values that work well for the selected host and vendor.

Nonsense-Mediated Decay and Premature Termination

Eukaryotic mRNAs that terminate more than ~50 nucleotides upstream of the final exon–exon junction are recognized by the nonsense-mediated decay (NMD) machinery as carrying a premature termination codon and are rapidly degraded. For in vitro transcribed therapeutic mRNA that lacks introns, NMD recognition is governed by different determinants — the long 3' UTR, weak termination context, and the 3'-UTR-to-poly(A)-signal distance — but similar decay-accelerating pathways operate. UTRDesigner's 3' UTR library is curated to avoid NMD-triggering structural features, and the tool can flag constructs that exceed empirically derived safe distances between the stop codon and the poly(A) signal.

Ribosome Stalling and Collisions

When ribosomes stall — because of a rare codon cluster, a structured mRNA region, or a damaged tRNA — following ribosomes can collide with the stalled leader. Ribosome collisions activate a surveillance pathway (ZNF598, RACK1, ribosome-associated quality control, RQC) that can result in nascent-chain ubiquitination, mRNA cleavage by endonuclease activity associated with the ribosome, and degradation of both the peptide and the transcript. For therapeutic mRNA, rare-codon clusters inside the CDS are therefore a double liability: they slow elongation directly, and they trigger active mRNA degradation if collisions accumulate. The repeat and rare-codon penalties in the Bioneer suite are calibrated to avoid triggering this pathway.

Innate-Immune Discrimination of Self versus Non-Self RNA

The innate-immune system distinguishes host RNA from pathogen RNA via a combination of structural features (length, double-strandedness, 5'-end chemistry), sequence features (CpG dinucleotide frequency, UpA dinucleotide frequency, uridine density), and post-transcriptional modifications (m6A, Ψ, m5C are abundant in host RNA and largely absent in most pathogens). An exogenously delivered therapeutic mRNA therefore has to mimic self-RNA across as many of these axes as possible. Nucleoside modification (m1Ψ) addresses the post-transcriptional-modification axis; codon choice and UTR selection address the sequence-frequency axes; capping and polyadenylation address the 5'- and 3'-end axes; purification of dsRNA byproducts addresses the structural axis. The Bioneer suite's composite immunogenicity score aggregates the sequence-level axes into a single number; the remaining axes are the responsibility of the IVT reaction, the purification train, and the capping protocol.

2.2 Biology Specific to SaRNADesigner

Self-amplifying RNA is built from the biology of alphaviruses. Venezuelan equine encephalitis virus (VEEV), Semliki Forest virus (SFV), and Sindbis virus (SIN) each encode a non-structural polyprotein (nsP1–nsP4) that, once translated from the 5' genomic ORF, assembles into an RNA-dependent RNA polymerase complex. This polymerase copies the full-length positive-strand genome into a negative-strand template and uses the negative strand both to produce more positive-strand genome and to initiate subgenomic-promoter-driven transcription of the downstream ORF. In a wild-type virus, the downstream ORF encodes the structural proteins; in an engineered saRNA replicon, the structural genes are deleted and replaced by a payload of interest — a vaccine antigen, a therapeutic protein, or a cargo for in-vivo delivery.

The saRNA design problem is fundamentally different from that of linear mRNA. Replication depends on a replicase polyprotein with a precise amino-acid sequence and on precise nucleotide-level conserved sequence elements (CSEs) near the 5' and 3' ends of the genome and at the subgenomic junction: the 51-nucleotide 5' CSE, the 19-nucleotide 3' CSE, and the subgenomic promoter (SGP) of ~26 nucleotides that spans the junction between the non-structural ORF and the GOI ORF. A mutation in the replicase, a disruption of a CSE, or a secondary-structure interaction between the GOI and these regions can destroy replicon function. The optimization target is therefore not the full ~10 kb replicon but only the GOI segment; the backbone is treated as immutable.

SaRNADesigner supports three standard backbones. VEEV_TC83 is the attenuated TC-83 strain of VEEV, the backbone used by Gritstone bio's SAM platform and in the Arcturus LUNAR-COV19 program; it is regarded as the best-behaved backbone for human-therapeutic use. VEEV_Trinidad is the wild-type Trinidad strain, used in research contexts where higher replication is desired at the cost of more reactogenicity. SFV is the Semliki Forest virus backbone, historically used in Lundstrom's academic work and several European-academic vaccine programs. Each backbone carries backbone-specific CSE positions (5' CSE at nucleotides 154–205 for VEEV_TC83; 3' CSE at 98–117 of the 3' block) and a backbone-specific SGP sequence (CTCTACGGCGCTA for VEEV TC-83 core, with ~20 nt of upstream context). Read as half-open intervals (start inclusive, end exclusive), these coordinate ranges span 51 and 19 nucleotides, matching the CSE lengths given above.

Because saRNA must replicate, it must present natural A/U/G/C bases to the polymerase; m1Ψ and other modified nucleosides are incompatible with replicase readout. This biological constraint removes the primary immune-evasion lever used in clinical mRNA and forces the sequence-level levers to work harder. The saRNA-specific immune-evasion levers are: (i) uridine depletion (target below 20% U in the GOI, with a bonus that grows as the U fraction falls), (ii) CpG and UpA dinucleotide suppression, (iii) dsRNA minimization (inverted-repeat suppression with 2× the penalty weight used in mRNA design, because the double-stranded replication intermediate is a potent TLR3/MDA5 trigger), (iv) miRNA-seed avoidance for host tissue-specific expression control, and (v) detection and avoidance of "replication highways", meaning dense-GC clusters (>80% GC in a 40-nt window) that fold into stable structures and impede RdRp translocation.
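Lever (i) can be illustrated with a greedy synonymous recoding that picks, for each codon, the synonym with the fewest uridines. This is a deliberately naive sketch: it ignores codon usage, structure, and every other fitness term, so it shows the direction of the uridine-depletion pressure rather than how SaRNADesigner balances it against other objectives. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(cds):
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def u_fraction(seq):
    """Fraction of U (written as T in DNA notation) in the sequence."""
    seq = seq.upper().replace("U", "T")
    return seq.count("T") / len(seq)


def deplete_uridine(cds):
    """Replace each codon with the synonym that contains the fewest T/U bases."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        # Sort key: fewest T first, then prefer the original codon on ties,
        # then alphabetical order so the result is deterministic.
        out.append(min(SYNONYMS[CODON_TABLE[codon]],
                       key=lambda c: (c.count("T"), c != codon, c)))
    return "".join(out)


goi = "ATGTTTCTTTAA"
recoded = deplete_uridine(goi)
print(recoded, translate(recoded) == translate(goi))
print(u_fraction(goi), u_fraction(recoded))
```

The alphabetical tie-break can select codons that are rare in the host, which is one reason a real optimizer treats uridine content as one weighted term rather than a greedy rule.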

The CSE-preservation constraint is implemented as a structural check on the folded construct. Because a GOI sequence that base-pairs with a CSE can misdirect the RdRp or block subgenomic-promoter recognition, the fitness function performs a post-fold check: if the predicted structure contains any base pair between a GOI nucleotide and a CSE nucleotide, the candidate receives a 50,000-point penalty that effectively rejects it. This hard-rejection constraint is the single most important saRNA-specific feature: it is the mechanism by which SaRNADesigner prevents the GOI from destroying the replicon.
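The post-fold check described above can be sketched on a dot-bracket structure string. The 50,000-point value comes from the text; the region coordinates, the half-open interval convention, and the function names are assumptions made for illustration, and pseudoknot brackets are not handled.

```python
CSE_PENALTY = 50_000  # penalty value quoted in the text


def parse_pairs(dot_bracket):
    """Return (i, j) index pairs from a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def _inside(pos, regions):
    # Regions are half-open (start, end) intervals in replicon coordinates.
    return any(start <= pos < end for start, end in regions)


def cse_interaction_penalty(dot_bracket, goi_region, cse_regions):
    """Return CSE_PENALTY if any predicted pair joins a GOI base to a CSE base."""
    for i, j in parse_pairs(dot_bracket):
        goi_to_cse = _inside(i, [goi_region]) and _inside(j, cse_regions)
        cse_to_goi = _inside(j, [goi_region]) and _inside(i, cse_regions)
        if goi_to_cse or cse_to_goi:
            return CSE_PENALTY
    return 0


# Toy 20-nt construct: CSE at [0, 5), GOI at [10, 20).
clash = "." + "(" + "." * 13 + ")" + "." * 4   # pair (1, 15): CSE base to GOI base
clean = "." * 11 + "(" + "." * 6 + ")" + "."   # pair (11, 18): inside the GOI
print(cse_interaction_penalty(clash, (10, 20), [(0, 5)]))
print(cse_interaction_penalty(clean, (10, 20), [(0, 5)]))
```

Returning on the first offending pair keeps the check cheap, which suits a term that is evaluated for every candidate in every generation.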

The 5' end of the GOI, immediately downstream of the subgenomic promoter, is the "SGP junction" and is subject to an additional soft constraint: the first 30 nucleotides of the GOI should not fold into stable structure (MFE > -6 kcal/mol target), because the SGP-driven subgenomic transcript is capped and translated like any other mRNA, and a buried Kozak context or an obstructed scanning region here blunts protein expression just as it would in a linear mRNA. The SGP junction therefore inherits the cap-proximal-MFE concerns from linear mRNA design, adapted to the alphavirus-subgenomic context.

The replicon length, typically 9–13 kilobases for VEEV-backbone saRNA, sits well beyond the practical range of O(n³) Zuker folding. SaRNADesigner addresses this with a Zuker-seeded LinearFold strategy: 300-nt windows with a 150-nt step are Zuker-folded to extract high-confidence pairs, which are then passed as constraints to a global LinearFold call over the full replicon. LinearFold's beam size is raised to 300 (from the default of 100 used by GeneCrafter) to better capture the long-range structure characteristic of CSEs. Benchmarking against published structural maps of alphavirus 5' and 3' CSEs recovers the canonical stem-loops with high accuracy at a runtime compatible with inside-GA evaluation.

3. System Architecture

3.1 Shared Components Across the Suite

All five tools are built on a common Python core that combines a JIT-compiled numerical kernel (Numba), a genetic-algorithm engine, an HDF5-backed codon-usage database, a hybrid RNA-folding engine, a templated constraint library (synthesis-vendor and host-organism profiles), and a unified report-rendering pipeline. This shared substrate is what makes it possible to move from codon optimization to UTR engineering to saRNA design to circRNA design without learning a different tool for each.

The Genetic-Algorithm Engine

The GA is a standard evolutionary loop with tournament selection, uniform or single-point crossover, and program-specific mutation operators. A population of candidate sequences (typical size 100–500) is initialized either randomly from the codon-usage distribution or from a greedy CAI-oriented seed. Each generation, candidates are ranked by the fitness function, a fraction is retained as elites, and the rest of the next generation is produced by crossover and mutation of tournament winners. The GA loop exits on convergence (a plateau in best fitness for a user-configurable number of generations), on reaching a maximum generation count, or on a user-requested early stop. Between generations, the engine can checkpoint the entire population and RNG state to disk, which is what enables exact reproducibility and restart-after-failure behavior.

Codon-Usage Database

Codon-usage frequency tables are stored in an HDF5 database (cocoputs_db.h5) indexed by NCBI taxid. The database was built from the CoCoPUTs project (Alexaki et al. 2019), which aggregates codon-usage from the NCBI GenBank CDS corpus and normalizes across organisms. The HDF5 backing allows the suite to hold several thousand organism profiles in a single addressable file, with O(1) lookup by taxid. For custom or client-specific usage tables (e.g., CHO-K1 with in-house expression-optimized weights), the database can be rebuilt from a client-supplied TSV using the included builder script.

Hybrid RNA-Folding Engine

The folding engine encapsulates three distinct algorithms behind a single interface. For sequences shorter than 700 nucleotides, a JIT-compiled Zuker recursion is used (the "RNAFoldRefactored" kernel), which produces exact MFE structures. For sequences longer than 700 nucleotides, LinearFold is called with a beam size of 100–300 depending on the calling tool and the required accuracy. For very long sequences typical of saRNA and circRNA (≥3 kilobases), a sliding-window Zuker-seeded LinearFold is applied: 300-nucleotide windows with 150-nucleotide step are folded exactly, high-confidence pairs from those windows are passed as constraints to a global LinearFold call, and the result is scored against the same fitness terms used in the GA loop. Benchmarking in-house against known-structure RNAs (tRNA, 5S rRNA, SARS-CoV-2 5' UTR, and a panel of natural mRNAs with experimentally probed structures) shows that the hybrid approach recovers ≥ 90% of experimentally supported base pairs within an acceptable running time for GA inner loops.
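The length-based dispatch and the window tiling can be sketched in a few lines of Python. This is an illustrative sketch, not the suite's code: the function names are hypothetical, and treating a sequence of exactly 700 nt as long is an assumption, since the text leaves that boundary open.

```python
def choose_folding_strategy(length: int) -> str:
    """Pick a folding strategy from sequence length (thresholds from the text)."""
    if length >= 3000:
        return "zuker_seeded_linearfold"
    if length >= 700:  # assumption: the 700-nt boundary case counts as long
        return "linearfold"
    return "zuker"


def tile_windows(length: int, window: int = 300, step: int = 150):
    """Return half-open (start, end) windows that cover the whole sequence."""
    if length <= window:
        return [(0, length)]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] + window < length:
        # Add a final window flush with the 3' end so no tail is left unfolded.
        starts.append(length - window)
    return [(s, s + window) for s in starts]
```

With a 150-nt step every interior position is covered by two windows, which is what makes a per-pair agreement score between windows meaningful.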

Templated Constraint Library

Hard constraints are organized into two stacks: synthesis-vendor templates and host-organism templates. Synthesis-vendor templates capture the empirical constraints of IDT (GBlocks, Megamer), Twist Bioscience (Clonal Genes, Gene Fragments), GenScript (OptiGene, GeneBlocks), ATUM, and others — restriction-site avoidance, homopolymer caps, GC-window bounds, minimum repeat-free intervals. Host-organism templates capture organism-specific constraints — Shine–Dalgarno avoidance inside CDS for E. coli, CpG-island and polyadenylation-signal avoidance for mammalian cells, poly-T tract limits for yeast. Both stacks are simultaneously applied; a candidate that violates either stack is either penalized (soft constraint) or rejected (hard constraint), configurable per term.

Viennarnaplot Rendering

The RNA secondary-structure rendering layer, Viennarnaplot, converts dot-bracket structures into publication-quality SVG figures with Naview-style layouts refined by a post-processor that resolves residue overlaps, polishes stem angles, and — for circular RNA — closes the topology. The resulting SVGs are embedded directly in HTML reports (scalable without re-rasterization) and converted into vector PDFs for archive submission. Color annotation is configurable: DMS-style reactivity coloring (green for paired A/C, red for unpaired A/C, grey for U/G) is supported for comparing predicted structure with chemical-probing data when available.

3.2 Where SaRNADesigner Plugs In

SaRNADesigner is called with a protein or pre-optimized CDS representing the GOI, a choice of backbone (VEEV_TC83 default), and an optional user 5' and 3' UTR appended outside the viral 5' and 3' blocks. Output is the full assembled replicon: user-5' + viral-5'(including nsP1–nsP4 and SGP) + GOI(optimized) + viral-3' + user-3' + optional poly(A). The backbone is concatenated from the stored VEEV_TC83 (or other) sequence block; the GOI is the product of GA optimization; and the full construct is subjected to post-GA structural validation including the CSE-interference check.

3.3 Reproducibility by Construction

Every run records and persists: (i) the full configuration JSON submitted by the user, (ii) the random seed used by the GA, (iii) the identifier and checksum of the codon-usage database, (iv) the semantic version and git-commit hash of the tool, and (v) a checkpoint of the final GA population and fitness table. A downstream consumer can therefore re-execute the same run months or years later and confirm that the output sequence is identical, which satisfies both scientific reproducibility expectations and the ALCOA+ "Original" and "Accurate" principles used in GxP data-integrity assessment. Checkpointing is also what allows very long runs to be paused and resumed without loss, and what allows partial-failure recovery in batch pipelines.
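A minimal sketch of how such a provenance record could be assembled is shown below. The field names and the choice of SHA-256 are assumptions made for illustration; the actual manifest layout is not specified here.

```python
import hashlib
import json


def build_manifest(config: dict, seed: int, db_bytes: bytes,
                   version: str, commit: str) -> dict:
    """Collect the items needed to re-execute a run (hypothetical field names)."""
    # Canonical JSON (sorted keys, fixed separators) so that key order in the
    # submitted configuration does not change the checksum.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "seed": seed,
        "db_sha256": hashlib.sha256(db_bytes).hexdigest(),
        "tool_version": version,
        "git_commit": commit,
    }
```

Comparing two such records is a quick way to confirm that a re-run used the same configuration, seed, database, and tool build before comparing output sequences.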

3.4 Data Flow

A typical execution proceeds through the following stages. (1) Input parsing: a DNA or protein sequence is accepted either via CLI argument, file path, or FASTA for batch mode. (2) Organism and template resolution: codon-usage table, synthesis-vendor template, and host-organism template are loaded. (3) Constraint compilation: forbidden motifs, restriction sites, TFBS, and any user-specified avoid-lists are compiled into JIT-searchable numeric arrays. (4) Initial-population generation: the GA population is seeded using either a greedy-CAI initialization, a random draw from the codon-usage distribution, or — for tools that support it — a beam-search initialization that favors low-immunogenicity codons. (5) GA main loop: each generation evaluates fitness for all candidates (caching results by sequence hash), performs selection, crossover, and mutation, and optionally checkpoints. (6) Post-GA structural filtering: the top N candidates (typically 500–1000) are subjected to full structural evaluation — exact or linear folding, homopolymer auditing, repeat scan, immunogenicity profiling. (7) Final ranking and reporting: the top 8 candidates are given full secondary-structure plots, and all are summarized in HTML, PDF, JSON, and CSV.

3.5 Performance, Parallelism, and Determinism

JIT Acceleration with Numba

The suite's performance-critical kernels are JIT-compiled with Numba. Compiled kernels include the fitness evaluation core (CAI computation, GC counting, codon-pair scoring, homopolymer detection, short-tandem-repeat detection, inverted-repeat detection, motif scanning via Aho–Corasick or bit-parallel scanners), the Zuker folding recursion, the LinearFold beam-search loop, the Kozak position weight matrix, and the mutation operators. Numba compilation is invoked on first use; a warmup phase at tool startup triggers compilation of the hot kernels so that the first GA generation does not pay the compile latency. Benchmark numbers: on a modern server-class CPU, a single-generation GA evaluation over a 200-candidate population of 1,000-nucleotide sequences completes in under 5 seconds for the full fitness composite; the same operation without JIT acceleration takes more than 100 seconds.

Parallel Execution Model

GA generations parallelize naturally: each candidate's fitness evaluation is independent. The suite uses a process-pool executor with a shared, read-only set of resources (codon table, motif arrays, templated constraints) initialized in each worker at startup. For very short sequences the process-creation overhead dominates, and single-process execution is faster; the tool auto-detects the crossover point and adjusts. For long sequences (therapeutic mRNA and saRNA), multi-process execution delivers near-linear speedup up to the available core count. Custom scheduling accommodates hosts with mixed workloads: the tool can be run with an explicit --num-workers value to avoid contention with other jobs on shared compute.

Determinism and Numerical Stability

Determinism is guaranteed by seeding every random source — NumPy, Python's random module, and each worker's RNG — from a single master seed. Numerical stability of the folding kernels is guaranteed by use of float64 accumulators; the Zuker recursion's internal free-energy tables are stored at 0.01 kcal/mol resolution, which is finer than the ~0.1 kcal/mol accuracy of the underlying thermodynamic parameters. Floating-point sensitivity is therefore not a source of run-to-run variation; given the same seed and config, outputs are byte-for-byte identical.
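One common way to seed every worker from a single master seed is to draw the per-worker seeds from a generator that is itself seeded by the master. The sketch below uses only the Python standard library; the function name is hypothetical and the suite's actual derivation scheme may differ.

```python
import random


def derive_worker_seeds(master_seed: int, num_workers: int) -> list:
    """Derive one reproducible 64-bit seed per worker from a master seed."""
    rng = random.Random(master_seed)
    return [rng.getrandbits(64) for _ in range(num_workers)]
```

Because the seeds are drawn sequentially, worker k receives the same seed no matter how many workers are launched after it. The candidate-to-worker assignment must also be fixed for results to be independent of the worker count.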

Error Handling and Graceful Degradation

Hard errors (malformed input, missing codon-usage table for the requested organism, corrupted checkpoint) produce a non-zero exit code, a diagnostic message to stderr, and a JSON error blob in the output directory. Soft errors (a GA generation that produces no candidates above threshold, a LinearFold call that times out) trigger a documented fallback (fall back to Zuker, lower the beam size, continue with elite-only population) with a warning logged to the report. The tool avoids silent degradation — anywhere a fallback is taken, the customer sees a flag in the HTML output.

4. Algorithms in Detail

4.1 Genetic-Algorithm Core

The genetic algorithm is the heart of every tool in the suite. Its strength over greedy or gradient-based optimization is that it navigates a high-dimensional, rugged, multimodal fitness landscape without requiring differentiability of the objective — which is crucial because the suite's fitness landscapes are dominated by discrete hard constraints (restriction sites, forbidden motifs) and non-differentiable structural metrics (MFE, repeat counts).

Encoding

A candidate is represented as an array of codon indices in the range 0–63, one per amino acid position. This encoding keeps mutation and crossover operations synonymous by construction (they change codon choice but never amino acid), and enables fast JIT-compiled fitness evaluation via codon-index lookups rather than string manipulation. For UTR-focused tools, the encoding extends to nucleotide positions in the UTR segments; the CDS segment retains its codon-indexed encoding.
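The encoding can be illustrated with the standard genetic code. The sketch below is an assumption-laden illustration rather than the suite's implementation: the index order (bases ordered T, C, A, G) and all function names are chosen here for convenience.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # index 0..63
AA_OF = dict(zip(range(64), AMINO))

# Amino acid -> list of synonymous codon indices.
SYNONYMS = {}
for idx, aa in AA_OF.items():
    SYNONYMS.setdefault(aa, []).append(idx)


def encode_protein(protein: str, rng: random.Random) -> list:
    """Pick one random synonymous codon index per amino acid."""
    return [rng.choice(SYNONYMS[aa]) for aa in protein]


def decode(indices) -> str:
    """Codon indices -> DNA string."""
    return "".join(CODONS[i] for i in indices)


def translate(indices) -> str:
    """Codon indices -> protein string."""
    return "".join(AA_OF[i] for i in indices)


def mutate_synonymous(indices, rate: float, rng: random.Random) -> list:
    """Resample codons only within their synonym set, so the protein is unchanged."""
    out = list(indices)
    for pos, idx in enumerate(out):
        if rng.random() < rate:
            out[pos] = rng.choice(SYNONYMS[AA_OF[idx]])
    return out
```

Because every operator draws from the synonym set of the current codon, the translated protein is invariant under mutation by construction.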

Selection

Tournament selection (tournament size 2 to 5) is used throughout. Tournament selection is preferred over truncation or roulette because it provides a smooth, tunable selection pressure that does not depend on the absolute fitness scale — important when fitness terms include both bounded metrics (CAI ∈ [0, 1]) and unbounded penalties (homopolymer penalty scaling as length⁵). Elitism preserves a small fraction (default 5–10%) of the best candidates into the next generation without alteration.
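A minimal sketch of tournament selection with elitism follows. It assumes that higher fitness is better and that penalties have already been subtracted; the function names are hypothetical.

```python
import random


def tournament_select(population, fitness, k: int, rng: random.Random):
    """Return the fittest of k distinct, randomly drawn candidates."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]


def select_elites(population, fitness, elite_fraction: float = 0.05):
    """Return the top fraction of the population, best first (at least one)."""
    n_elite = max(1, int(len(population) * elite_fraction))
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in order[:n_elite]]
```

Only the rank order inside each tournament matters, so a candidate carrying a 50,000-point penalty simply loses its tournaments and does not distort the selection probabilities of the other candidates.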

Crossover

Uniform and single-point crossover operate on the codon-index array. In uniform crossover, each codon position is inherited from one parent or the other independently at random; in single-point crossover, the child takes one parent's codons up to a single random cut and the other parent's codons after it. Uniform crossover mixes more aggressively and is preferred in early generations; single-point crossover preserves more local structure and is preferred later. A crossover-type schedule is configurable per tool.
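Both operators are short when written against the codon-index array. The sketch below assumes two parents of equal length (at least two codons) that encode the same protein; the function names are illustrative.

```python
import random


def single_point_crossover(a, b, rng: random.Random) -> list:
    """Child takes parent a up to a random cut and parent b after it."""
    cut = rng.randrange(1, len(a))  # cut in 1..len-1 so both parents contribute
    return list(a[:cut]) + list(b[cut:])


def uniform_crossover(a, b, rng: random.Random) -> list:
    """Each position is inherited from a or b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
```

Since both parents carry a synonymous codon at every position, any mix of the two still translates to the same protein.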

Mutation

Each tool installs program-specific mutation operators in addition to a baseline uniform-random synonymous substitution. Common variants include CAI-weighted mutation (new codon sampled proportionally to its relative adaptiveness), hybrid CAI–GC mutation (mutation score combines CAI distance to target and the effect on local GC content), balanced-top-50% mutation (new codon drawn only from the codons whose CAI and GC percentiles are both above the median), and targeted surgical mutation that repairs low-fitness sub-regions identified by a moving-window audit. Mutation rate is typically 0.02 to 0.05 per codon per generation and can be annealed across the run.
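As an illustration of the CAI-weighted variant, the sketch below samples the replacement codon in proportion to its relative adaptiveness. The data structures (a synonym map keyed by codon index and an adaptiveness table) are assumptions for the example, not the suite's internal layout.

```python
import random


def cai_weighted_mutation(indices, synonyms, adaptiveness,
                          rate: float, rng: random.Random) -> list:
    """Mutate each codon with probability `rate`, sampling the replacement
    from its synonyms in proportion to relative adaptiveness.

    synonyms: codon index -> list of synonymous codon indices (including itself)
    adaptiveness: codon index -> relative adaptiveness w in [0, 1]
    Assumes at least one synonym of every codon has a positive weight.
    """
    out = list(indices)
    for pos, idx in enumerate(out):
        if rng.random() < rate:
            options = synonyms[idx]
            weights = [adaptiveness[o] for o in options]
            out[pos] = rng.choices(options, weights=weights, k=1)[0]
    return out
```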

Convergence and Early Stopping

The GA stops when any of the following holds: (i) the maximum number of generations is reached, (ii) no best-of-generation improvement is observed for a patience window (default 100–150 generations), or (iii) population diversity (measured as mean pairwise Hamming distance normalized by sequence length) falls below a threshold. The last criterion detects search collapse: if the whole population has converged on a local optimum, further iteration is wasted. When diversity collapse is detected, the engine can optionally perform a "diversity-restoration" step that injects random mutations into a fraction of the population, trading some best-fitness regression for renewed exploration.
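The diversity measure and the three stopping criteria can be written down directly. In this sketch the function names are hypothetical, diversity is computed over codon positions, and the all-pairs loop is quadratic in population size, so a production implementation would likely subsample pairs.

```python
from itertools import combinations


def population_diversity(population) -> float:
    """Mean pairwise Hamming distance, normalized by sequence length."""
    n = len(population)
    if n < 2:
        return 0.0
    length = len(population[0])
    total = sum(
        sum(x != y for x, y in zip(a, b))
        for a, b in combinations(population, 2)
    )
    return total / ((n * (n - 1) // 2) * length)


def stop_reason(generation, max_generations, stale_generations, patience,
                diversity, diversity_threshold=0.005):
    """Return the first stopping criterion that fires, or None to continue."""
    if generation >= max_generations:
        return "max_generations"
    if stale_generations >= patience:
        return "patience"
    if diversity < diversity_threshold:
        return "diversity_collapse"
    return None
```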

Fitness Caching

Fitness evaluation is expensive relative to mutation and crossover. A sequence-to-fitness cache (keyed on the bytes of the candidate array plus the active fitness configuration) typically achieves >80% hit rate in late generations, because the population converges on a small region of sequence space. Cache invalidation is keyed on configuration, so changing any fitness weight or threshold forces recomputation. The cache is in-memory only (not persisted), which avoids the risk of stale cached values biasing future runs.
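A minimal in-memory cache keyed on the candidate bytes plus the active configuration might look like the following. The class and method names are hypothetical; `bytes(indices)` works because codon indices lie in 0–63.

```python
import json


class FitnessCache:
    """In-memory fitness cache keyed on (candidate bytes, configuration)."""

    def __init__(self, fitness_fn, config: dict):
        self._fitness_fn = fitness_fn
        self._config_key = json.dumps(config, sort_keys=True)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def reconfigure(self, config: dict) -> None:
        """Change the active configuration; entries stored under the old
        configuration no longer match, which forces recomputation."""
        self._config_key = json.dumps(config, sort_keys=True)

    def evaluate(self, indices) -> float:
        key = (bytes(indices), self._config_key)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._fitness_fn(indices)
        return self._store[key]
```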

4.2 Multi-Objective Mode (NSGA-II, GeneCrafter)

GeneCrafter additionally supports NSGA-II (Non-dominated Sorting Genetic Algorithm II, Deb et al. 2002) as an alternative to the scalarized fitness approach. In NSGA-II mode, the user specifies multiple objectives (CAI, GC distance, immunogenicity, structure penalty) as separate terms rather than combining them into a single weighted score. NSGA-II then explores the Pareto frontier of non-dominated solutions: candidates for which no other candidate in the population is at least as good on every objective and strictly better on at least one. The output is a set of diverse solutions rather than a single "best" sequence, and the customer chooses the trade-off that best fits the application (e.g., accept slightly lower CAI to gain markedly lower immunogenicity).

The practical advantage of multi-objective optimization over scalarized optimization is that it surfaces trade-offs that a scalarized fitness function would hide. A sequence that is slightly suboptimal on CAI but dramatically better on structural cleanness would be dismissed by a scalarized GA with CAI-heavy weights; NSGA-II retains both sequences and presents them to the customer for an informed decision. The cost is that NSGA-II converges more slowly and requires larger populations (500+ is recommended) to maintain frontier diversity.
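The non-dominance test at the core of this mode is compact. The sketch below assumes every objective has been oriented so that larger is better (for example, by negating penalties); it shows only the dominance filter, not the full NSGA-II sorting and crowding procedure.

```python
def dominates(a, b) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For example, with objectives (CAI, structural cleanness), a candidate scoring (0.7, 0.7) is dropped when another scores (0.8, 0.8), while (0.9, 0.2) and (0.5, 0.9) both survive because each wins on one axis.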

4.3 Structural Post-Processing

After the GA terminates, the top N candidates (500 to 1000, configurable) are subjected to a deterministic post-processing pass that performs the expensive analyses which were approximated or sampled during the GA. The pass folds each candidate with the exact algorithm matching its length, extracts dot-bracket and energy, computes homopolymer and repeat inventories at full precision, computes the precise immunogenicity profile, validates all restriction-site and motif constraints, and confirms that Kozak, poly(A)-signal, and capping-start constraints are met. Candidates that fail any hard post-filter are removed; the remaining candidates are ranked by a post-filter composite score (which can have different weights than the GA fitness — for example, giving more weight to cap-proximal MFE because the GA's sampled MFE metric may underestimate cap-proximal risk).

This two-stage approach — fast-and-approximate in the GA, slow-and-exact in the post-processor — is a deliberate design choice. Exact per-candidate evaluation inside the GA loop would be prohibitively slow for any population/generation combination large enough to converge, and sampled-approximation alone would produce unreliable final candidates. The post-filter ensures that the sequences shipped to the customer are correct on every hard constraint, even those that were only sampled during evolution.

4.4 Viennarnaplot — 2-D Layout and Rendering

Viennarnaplot is the 2-D layout engine that converts dot-bracket secondary-structure notation into publication-quality vector illustrations. The algorithm is a hybrid of Naview (Bruccoleri & Heinrich 1988) and a custom RNAPuzzler-inspired post-processor. Naview performs a radial-tree layout of the secondary-structure graph; the post-processor detects residue collisions, resolves them by rigid-body rotation of sub-trees, smooths stem angles, and — for circular RNA — wraps the topology at the back-splice junction. The output is a browser-embeddable SVG that remains legible at any zoom and a vector PDF that embeds in customer presentations without pixelation. Coloring schemes include DMS-reactivity (green/red/grey), GC-content heatmap, local-MFE heatmap, and custom per-residue color from a user-supplied vector.

The rendering pipeline includes a "straight-line linear-spine" layout variant that represents an unrolled molecule as a horizontal strip with stems hanging below and above the backbone — suitable for panel comparisons and for aligning two candidates side-by-side. The horizontal layout is particularly useful for long mRNA and saRNA constructs where a radial layout would not fit legibly on a single page.

4.5 SaRNADesigner-Specific Algorithm Notes

4.5.1 Replicon Assembly

Assembly concatenates six modules in fixed order: optional user 5' UTR, viral 5' block (containing the 5' CSE, the nsP1–nsP4 ORF, and the SGP), the GOI (optimized by the GA), viral 3' block (containing the 3' CSE), optional user 3' UTR, and poly(A) tract. The viral blocks are stored as immutable sequences keyed by backbone in SARNA_BACKBONE_DB; no GA operation can mutate them.
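The fixed-order concatenation, together with the bookkeeping of where each module lands in the assembled construct, can be sketched as follows. The function signature and module keys are hypothetical, and the sequences in the usage example are toy strings rather than real backbone blocks.

```python
def assemble_replicon(viral_5p: str, goi: str, viral_3p: str,
                      user_5p: str = "", user_3p: str = "",
                      poly_a_length: int = 0):
    """Concatenate the six modules in fixed order.

    Returns the assembled sequence and a dict of half-open (start, end)
    spans in absolute construct coordinates.
    """
    modules = [
        ("user_5p", user_5p),
        ("viral_5p", viral_5p),
        ("goi", goi),
        ("viral_3p", viral_3p),
        ("user_3p", user_3p),
        ("poly_a", "A" * poly_a_length),
    ]
    parts, spans, cursor = [], {}, 0
    for name, seq in modules:
        spans[name] = (cursor, cursor + len(seq))
        parts.append(seq)
        cursor += len(seq)
    return "".join(parts), spans


# Toy usage: the GOI span shifts with the length of the user 5' UTR.
replicon, spans = assemble_replicon("GGGG", "AUGC", "CCCC",
                                    user_5p="UU", poly_a_length=3)
```

Recording absolute spans at assembly time means later checks never need to recompute offsets from the backbone-local coordinates.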

4.5.2 Uridine-Depletion Fitness Term

The uridine-depletion term scores the GOI's uridine fraction. A penalty activates when the U fraction exceeds 20%: penalty = (U% - 20) × 200. A bonus applies whenever the U fraction is below 25% and grows linearly as it falls: bonus = max(0, 1 - U%/25). Between 20% and 25% U both components are active, and before weighting the penalty is by far the larger of the two. The combined term drives the GA toward codon choices that minimize uridine. Target organisms (typically human or mouse) have codon-usage tables that permit meaningful U-depletion at modest CAI cost; the fitness function trades off U-depletion against CAI via the configured weights.
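The two formulas translate directly into code. How the bonus and penalty are weighted and combined is not stated, so the sketch below simply returns bonus minus penalty as an assumption; function names are illustrative.

```python
def uridine_percent(goi: str) -> float:
    """Percentage of U (or T, for DNA input) in the GOI."""
    seq = goi.upper()
    return 100.0 * sum(base in "UT" for base in seq) / len(seq)


def uridine_term(goi: str) -> float:
    """Unweighted uridine-depletion contribution (higher is better).

    Assumption: the term is bonus minus penalty with unit weights.
    """
    u = uridine_percent(goi)
    penalty = max(0.0, u - 20.0) * 200.0   # active above 20% U
    bonus = max(0.0, 1.0 - u / 25.0)       # positive below 25% U
    return bonus - penalty
```

As a worked example, a GOI at 25% U scores 0 - (5 × 200) = -1000, while a GOI at 10% U scores 1 - 10/25 = 0.6 with no penalty.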

4.5.3 dsRNA-Minimization via Inverted-Repeat Penalty

Inverted repeats (length ≥ 20 nt, score ≥ 30) are detected by a JIT-compiled Hamming-distance-aware inverted-repeat detector. Each detected inverted repeat incurs a 300-point penalty in saRNA mode — twice the 150-point penalty used for linear mRNA — reflecting the increased consequence of dsRNA formation in a replicating context where dsRNA intermediates are abundant.

4.5.4 CSE Interference Check

Post-fold, check_cse_interaction walks the dot-bracket structure and identifies any base pair between a GOI nucleotide and a CSE nucleotide. If any are found, a 50,000-point penalty is applied, which causes the candidate to be effectively rejected even if other fitness terms score well. The check uses absolute CSE coordinates in the assembled construct (not backbone-local coordinates), so that CSE positions are computed correctly regardless of the length of the user 5' UTR.
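The logic of such a check can be sketched with a stack-based dot-bracket parser. This is an illustration of the idea, not the source of check_cse_interaction: the function names, the span representation (half-open, absolute coordinates), and the return value are assumptions, and pseudoknot brackets are not handled.

```python
def parse_pairs(structure: str):
    """Return (i, j) base pairs from a dot-bracket string (no pseudoknots)."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def cse_interference(structure: str, goi_span, cse_spans, penalty=50_000.0):
    """Return (penalty, violations); each violation is (goi_pos, cse_pos)."""
    def inside(pos, span):
        return span[0] <= pos < span[1]

    violations = []
    for i, j in parse_pairs(structure):
        for a, b in ((i, j), (j, i)):
            if inside(a, goi_span) and any(inside(b, s) for s in cse_spans):
                violations.append((a, b))
    return (penalty if violations else 0.0), violations
```

The penalty is flat rather than per pair: a single GOI-CSE contact is enough to push the candidate out of contention, and the violation list is kept for per-residue reporting.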

4.5.5 SGP Junction Stability Check

The first 30 nucleotides of the GOI are folded in the context of the upstream SGP and checked for local MFE. If MFE < -6 kcal/mol, a penalty of 100 × (-6 - MFE) points is applied (for example, 200 points at -8 kcal/mol), which steers the GA toward GOI 5' ends that present an accessible Kozak context after subgenomic-transcript capping.
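The junction penalty is a one-line hinge function once the local MFE is known. The sketch below takes the MFE as an input (the folding call is out of scope) and is written so that the penalty is non-negative and grows as the structure becomes more stable than the -6 kcal/mol threshold; the function name is hypothetical.

```python
def sgp_junction_penalty(mfe_kcal: float, threshold: float = -6.0,
                         weight: float = 100.0) -> float:
    """Penalty for an over-stable SGP junction.

    Zero when MFE >= threshold; otherwise weight * (threshold - MFE).
    """
    return weight * max(0.0, threshold - mfe_kcal)
```

For instance, a junction folding at -8 kcal/mol costs 100 × 2 = 200 points, whereas one at -4 kcal/mol costs nothing.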

4.5.6 miRNA-Seed Avoidance

Species- and tissue-specific miRNA-seed sequences (7-mer seed matches starting at miRNA position 2) are pre-compiled into a JIT-searchable motif array; each hit incurs a 1,000-point penalty. The seed library can be loaded from the host-template or from a user-supplied TSV to accommodate novel targets or tissue-specific considerations.
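A seed match in the transcript is the reverse complement of miRNA positions 2–8. The sketch below derives those 7-mer sites and counts hits in an RNA sequence; the function names are hypothetical, and the miRNA in the usage example is a made-up toy sequence, not a database entry.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str) -> str:
    """7-mer target site: reverse complement of miRNA positions 2-8 (1-based)."""
    seed = mirna[1:8]
    return seed.translate(COMPLEMENT)[::-1]


def count_seed_hits(rna: str, mirnas) -> int:
    """Count positions in the RNA where any seed site starts."""
    sites = {seed_site(m) for m in mirnas}
    return sum(1 for i in range(len(rna) - 6) if rna[i:i + 7] in sites)


def seed_penalty(rna: str, mirnas, per_hit: float = 1000.0) -> float:
    """Per-hit penalty as described in the text."""
    return per_hit * count_seed_hits(rna, mirnas)


# Toy usage with a made-up 12-nt miRNA fragment.
toy_mirna = "UGGAGUGUGACA"
toy_site = seed_site(toy_mirna)
```

Pre-computing the site set once per run keeps the scan to a single pass over each candidate.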

4.5.7 Replication-Highway Barrier Check

Dense-GC clusters (>80% GC in any 40-nt window) are detected in a fast sliding-window pass; each cluster incurs a 50-point penalty. This term prevents the GA from stumbling into locally GC-rich codon-usage patterns that fold into barriers impeding RdRp translocation.
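A sliding-window pass of this kind can be sketched with a running GC count. How overlapping flagged windows are grouped into a cluster is not specified, so merging overlapping windows into one cluster is an assumption here, as are the function names.

```python
def gc_clusters(seq: str, window: int = 40, threshold: float = 0.80):
    """Return merged half-open spans where a window exceeds the GC threshold."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    clusters = []
    count = sum(is_gc[:window])
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide by one: add the entering base, drop the leaving base.
            count += is_gc[start + window - 1] - is_gc[start - 1]
        if count / window > threshold:
            end = start + window
            if clusters and start <= clusters[-1][1]:
                clusters[-1] = (clusters[-1][0], end)  # extend current cluster
            else:
                clusters.append((start, end))
    return clusters


def gc_barrier_penalty(seq: str, per_cluster: float = 50.0) -> float:
    """50 points per dense-GC cluster, as described in the text."""
    return per_cluster * len(gc_clusters(seq))
```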

4.5.8 Long-Sequence Folding

For the full-construct fold performed during post-GA structural validation, the tool uses Zuker-seeded LinearFold with beam size 300. The 300-nt Zuker windows are folded exactly; pairs with ≥75% across-window support are passed as hard constraints to the global LinearFold call; pairs with at least 50% but less than 75% support are passed as soft constraints that contribute an energy bonus. This hybrid recovers the canonical CSE stem-loops and the SGP junction structure at a runtime acceptable for the post-GA validation pass (seconds per candidate).
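The support-based split into hard and soft constraints can be sketched as below. The definition of support used here (the fraction of windows fully containing a pair that also predict it) is an assumption, as are the function name and the data layout; the folding itself is not shown.

```python
from collections import Counter


def classify_pairs(windows, window_pairs, hard: float = 0.75, soft: float = 0.50):
    """Split window-predicted pairs into hard and soft constraints.

    windows: list of half-open (start, end) spans
    window_pairs: one set of absolute (i, j) pairs per window
    """
    votes = Counter(pair for pairs in window_pairs for pair in pairs)
    hard_pairs, soft_pairs = [], []
    for (i, j), n_votes in sorted(votes.items()):
        # Number of windows that could have predicted this pair at all.
        covering = sum(1 for start, end in windows if start <= i and j < end)
        support = n_votes / covering
        if support >= hard:
            hard_pairs.append((i, j))
        elif support >= soft:
            soft_pairs.append((i, j))
    return hard_pairs, soft_pairs
```

With 300-nt windows and a 150-nt step, a pair lying inside the overlap of two windows gets 100% support if both predict it and 50% if only one does.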

4.5.9 Preserve-CDS and Payload Auto-Extraction

When the customer provides a full replicon (for example, from a prior design round), SaRNADesigner's auto-extraction logic detects the nsP1–nsP4 core and the SGP by backbone-specific fuzzy matching, extracts the GOI payload, and passes only the GOI to the GA while keeping the full-construct context for structural validation. This is the right workflow for iterative refinement of an existing replicon.

4.6 Parameter Tuning Guidance

Default parameters are selected to work reasonably well across a wide range of inputs, but for production runs some tuning is advisable. Population size scales with sequence length: for sequences under 1 kb, 200 is sufficient; for 1–3 kb, 300 is typical; for 3 kb and above, 500 or more maintains diversity. Generations scale with the constraint landscape's ruggedness: a CAI-only optimization converges in 50–100 generations; a multi-constraint optimization with synthesis template and immunogenicity enabled typically requires 200–400 generations; a saRNA optimization with CSE-interference checks and U-depletion can benefit from 400–800 generations. Mutation rate is not strongly sensitive between 0.02 and 0.05 for most constraint landscapes; lower rates make late-generation refinement more precise but slower. The convergence-patience parameter (generations without improvement before early stop) should be roughly 30–50% of the total generations.
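The population and patience guidance can be captured as simple starting-point helpers. These functions are a hypothetical convenience, not part of the CLI, and the 40% default for patience is just the midpoint of the 30–50% range given above.

```python
def suggest_population(length_nt: int) -> int:
    """Starting population size by sequence length (values from the text)."""
    if length_nt < 1000:
        return 200
    if length_nt < 3000:
        return 300
    return 500


def suggest_patience(generations: int, fraction: float = 0.4) -> int:
    """Convergence patience as 30-50% of the total generations."""
    if not 0.3 <= fraction <= 0.5:
        raise ValueError("fraction should lie between 0.3 and 0.5")
    return max(1, round(generations * fraction))
```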

For NSGA-II mode in GeneCrafter, larger populations (500+) are important to maintain Pareto-frontier diversity. NSGA-II also benefits from a higher mutation rate (0.04–0.05) because its selection mechanism is less aggressive than scalarized tournament. A typical NSGA-II production run is 500 population × 300 generations, which on a 16-core machine completes in 30 minutes to 2 hours depending on sequence length and the active constraint set.

4.7 Reading and Interpreting the Fitness Log

Every GA run writes a per-generation fitness log containing the best, median, and worst fitness of each generation, the population diversity, and — if the tool supports it — the top candidate's per-term fitness breakdown. The log is a useful diagnostic for tuning: a best-fitness trajectory that plateaus immediately (within the first 10 generations) indicates that the initial population already saturated the objective (reduce generations or increase diversity); a trajectory that does not plateau by the generation limit indicates under-convergence (increase generations or population); oscillation between values indicates that hard-constraint rejections are interacting with soft-constraint selection (inspect the per-term breakdown to localize). The log is available as JSON in the output directory and as a line plot in the HTML report.

4.8 Cryptic Splice-Site Detection in Detail

Cryptic splice-site detection runs in two passes. The first pass is motif matching against a library of canonical splice-site dinucleotides (GT at the donor, AG at the acceptor), near-canonical motifs (CAGGTA, GAGGTA, TAGGTA, GTCTCT, GATCTA), and, where applicable, tool-specific lists (T4-td PIE pseudosites for CircularDesigner, BioBrick-legacy sites for GeneCrafter). Each match is counted and, when the motif has a rank classification in the literature, scored by rank. The second pass runs a position-specific scoring matrix (PSSM) over a 9-nt window centered on each candidate GT dinucleotide; the PSSM was trained on annotated human splice-donor sites from RefSeq and assigns log-odds scores to each base position. Candidate sites with scores above a configurable threshold contribute per-site penalties. For tools that operate on circular RNA or alphavirus replicons (which engage the spliceosome or splice-like machinery), the PSSM threshold is tightened.

4.9 Homopolymer and Repeat Detection

Homopolymer detection is a single-pass linear scan that records the longest run of each base and all runs exceeding configurable thresholds. Short tandem repeat (STR) detection is a factor-based scanner that identifies 2- to 10-nt repeating units of copy number ≥ 3, with a fast suffix-array-like implementation. Inverted-repeat detection uses a JIT-compiled two-pointer scan with Hamming-distance allowance for imperfect palindromes; min-length and min-score thresholds are configurable. Each detected repeat is recorded with its start positions, length, and score; the repeat inventory is reported per candidate in the HTML output.
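The homopolymer part of this scan is a single linear pass, sketched here with the standard library. The function name, the return layout, and the default threshold of 5 are illustrative assumptions; STR and inverted-repeat detection are not shown.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 5):
    """Single-pass homopolymer scan.

    Returns (longest, runs): `longest` maps each base to its longest run,
    and `runs` lists (base, start, length) for runs of at least `min_len`.
    """
    longest, runs, pos = {}, [], 0
    for base, group in groupby(seq):
        n = len(list(group))
        longest[base] = max(longest.get(base, 0), n)
        if n >= min_len:
            runs.append((base, pos, n))
        pos += n
    return longest, runs
```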

5. Inputs

5.1 Accepted Input Formats

DNA coding sequence — A, T, G, C (or U translated to T), length in multiples of 3 for CDS, optionally annotated with explicit UTR/polyA boundaries.
Protein sequence — standard 20-letter one-letter IUPAC codes; internally back-translated to codon positions and expanded by the GA across synonymous codon space.
FASTA file — single-sequence or multi-sequence; multi-sequence files are accepted in batch mode, where each record is treated as an independent design job with its own output folder.
GenBank file — optional, used when the CDS is a region of a longer annotated sequence; the suite extracts the CDS by feature key and retains surrounding UTR for context-aware design.
JSON configuration — all runtime parameters can be supplied as a single JSON file, which is also the canonical persistence format for audit trails.

5.2 Required Contextual Inputs

Target organism — specified either by NCBI taxid (exact) or by organism name (resolved against the local taxonomy). This choice determines the codon-usage table used for CAI and for mutation-operator biasing.
Synthesis-vendor template — IDT_GBlocks_Standard, Twist_Clonal, GenScript_OptiGene, ATUM_GeneGPS, or None for a pure-biology run. The template injects vendor-specific hard constraints (restriction-site avoidance, homopolymer caps, GC-window bounds).
Host-expression template — E_coli_K12, CHO_K1, HEK293, S_cerevisiae, P_pastoris, and others. Adds host-appropriate motif avoidance (Shine–Dalgarno for bacteria, CpG-island and poly(A)-signal for mammalian, poly-T tracts for yeast).
Optimization targets — the subset of fitness terms to activate (cai, gc, cpg_upa, immunogenicity, mrna_mfe, mrna_stability, structure_and_repeats, tfbs). Unselected terms are evaluated for reporting but not for selection pressure.
GA runtime — population size, generations, mutation rate, checkpoint frequency, random seed; all defaults are suitable for a first run and can be tuned in subsequent runs.

Additional SaRNADesigner-Specific Inputs

Backbone — VEEV_TC83 (default, attenuated VEEV), VEEV_Trinidad (wild-type VEEV), or SFV (Semliki Forest virus).
U-depletion — boolean. Enables the uridine-depletion fitness term and its hard-penalty floor at 20%.
Prevent CSE interaction — boolean (default True). Enables the post-fold CSE-interference check with 50,000-point rejection penalty.
Preserve CDS — boolean. If True, auto-extract the GOI from a full-replicon input.
Custom 5' UTR / Custom 3' UTR — user-supplied sequences appended outside the viral blocks (optional).

6. Configuration Reference

6.1 Core GA / Runtime Parameters

Every tool exposes the same core GA parameters under consistent names. Defaults are suitable for first runs; production runs typically tune population and generations upward.

Parameter	Default	Description
--population-size	200	GA population size. Larger populations explore more broadly but take longer per generation.
--generations	100–500	Maximum GA iterations. Tools auto-scale by sequence length; this is the hard upper bound.
--mutation-rate	0.02–0.05	Per-codon probability of mutation per generation. Lower rates preserve convergence; higher rates explore.
--post-ga-candidates	1000	Number of top GA candidates passed to the exact post-processor.
--checkpoint-freq	10	GA generations between checkpoint writes. Lower = more frequent but more disk I/O.
--seed	None (random)	Random seed for reproducibility. Set to an integer for byte-for-byte reproducible runs.
--optimizer	ga	'ga' for scalarized, 'nsga2' for multi-objective (GeneCrafter only).
--convergence-patience	100–150	Generations with no best-fitness improvement before early stop.
--diversity-threshold	0.005	Minimum population diversity (normalized Hamming distance) before early stop.
--output-format	human	'human' for HTML and PDF, 'json' for machine-readable only.
--repeat-min-len	15	Minimum repeat length flagged by the repeat detector.
--repeat-min-score	40	Minimum Hamming-distance-adjusted repeat score flagged by the repeat detector.

6.2 SaRNADesigner-Specific Configuration

SaRNADesigner's program-specific parameters.

Parameter	Default	Description
--backbone	VEEV_TC83	Alphavirus backbone: VEEV_TC83, VEEV_Trinidad, SFV.
--u-depletion	False	Enable uridine-depletion fitness term.
--prevent-cse-interaction	True	Enable CSE-interference post-fold check.
--preserve-cds	False	Auto-extract GOI from full replicon input.
--custom-five-prime-utr	""	Optional user 5' UTR outside viral 5' block.
--custom-three-prime-utr	""	Optional user 3' UTR outside viral 3' block.
--poly-a-length	0	Encoded poly(A) length appended at the 3' end of the construct, after the viral 3' block and any user 3' UTR.
--linearfold-beam-size	300	LinearFold beam size; raised from default 100 for CSE recovery.

7. Outputs and Their Biological Meaning

7.1 Results Directory Convention

Each run writes to a dated results directory (typically ./<Tool>_Local_Results/YYYY-MM-DD/<job_id>) containing the HTML report, the PDF, the JSON and CSV summaries, a FASTA of the top 8 candidates, the original configuration JSON, the GA checkpoint chain, and a manifest file that lists the tool version, input checksum, and random seed. The directory is self-contained and can be archived or transferred as a single unit without loss of reproducibility information.

7.2 Deliverable Files

<job_id>_report.html — interactive HTML with embedded SVG structures, sortable metric tables, and per-candidate drill-down.
<job_id>_report.pdf — print-ready PDF; RNA structures rendered as embedded SVG so they remain legible at zoom.
<job_id>_summary.json — machine-readable summary of all candidates, fitness components, and metrics.
<job_id>_summary.csv — tabular summary suitable for spreadsheet review and ELN ingestion.
<job_id>_candidates.fasta — top 8 candidates as standard FASTA for synthesis submission.
<job_id>_config.json — the exact configuration used; combined with the seed, deterministic reproduction is possible.
<job_id>_checkpoint.pkl — the final GA population and RNG state; enables restart for further refinement.
<job_id>_manifest.txt — tool version, git commit hash, database checksum, run duration, host.

7.3 Report Sections — What Each Means for the Customer

SaRNADesigner's HTML/PDF report adds saRNA-specific panels. A "replicon layout" panel visualizes the assembled construct with module annotation: user-5', viral-5'(5'CSE, nsP1-4, SGP), GOI, viral-3'(3'CSE), user-3', poly(A). A "uridine depletion" panel reports the U fraction of the GOI (with goal line at 20%) and the U-depletion contribution to fitness. A "CSE integrity" panel confirms that no GOI base pairs with CSE nucleotides in the predicted structure; any violations are listed per-residue. A "dsRNA risk" panel summarizes the inverted-repeat inventory with the 2×-weighted saRNA penalty. A "SGP junction" panel reports the MFE of the first 30 nt of the GOI. The structural-plot panel shows the full replicon folded at beam-300, with CSE regions highlighted.
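
The uridine-depletion and CSE-integrity panels rest on two simple computations, sketched here. The sketch assumes the predicted structure is available as a pseudoknot-free dot-bracket string and that GOI and CSE coordinates are 0-based half-open intervals; the function names and the coordinate convention are illustrative assumptions, not the tool's internal representation.

```python
def uridine_fraction(seq: str) -> float:
    """Fraction of U (or T, for DNA-alphabet input) in the sequence."""
    s = seq.upper()
    return (s.count("U") + s.count("T")) / len(s) if s else 0.0


def base_pairs(dot_bracket: str) -> list[tuple[int, int]]:
    """0-based (i, j) pairs from a dot-bracket string without pseudoknots."""
    stack, pairs = [], []
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def cse_violations(dot_bracket: str, goi: tuple[int, int],
                   cse_regions: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Pairs that join a GOI position to a CSE position.
    `goi` and each CSE region are half-open (start, end) intervals."""
    def in_goi(i: int) -> bool:
        return goi[0] <= i < goi[1]

    def in_cse(i: int) -> bool:
        return any(start <= i < end for start, end in cse_regions)

    return [(i, j) for i, j in base_pairs(dot_bracket)
            if (in_goi(i) and in_cse(j)) or (in_cse(i) and in_goi(j))]
```

An empty list from `cse_violations` corresponds to a clean CSE-integrity panel; a non-empty list gives the per-residue pairs to report.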

7.4 Interpreting the Report from the Customer's Perspective

Per-Metric Interpretation

The HTML and PDF reports present each per-candidate metric with a short contextual interpretation — not just a number but a suggestion of what the number means and whether it is above, at, or below customer-acceptance thresholds. For CAI, a value above 0.85 is highlighted as strong expression, 0.70–0.85 as adequate, below 0.70 as at-risk of poor expression. For cap-proximal MFE (first 30 nt of 5' UTR and CDS), a value above -6.0 kcal/mol is "accessible", -6.0 to -12.0 is "at risk", below -12.0 is "likely to block translation initiation". For inverted-repeat count, zero is ideal for therapeutic products, one to two is acceptable for research, more than two suggests rework. For composite immunogenicity, below 3.0 is therapeutic-grade, 3.0–5.0 is research-grade, above 5.0 is flagged. These thresholds are starting points; the customer is expected to calibrate them to the specific program's requirements.
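
The thresholds in the paragraph above map directly onto small classification functions. This is a sketch of that mapping only; the function names are hypothetical, and the handling of values that fall exactly on a boundary (for example CAI = 0.85) is an assumption, since the text does not specify it.

```python
def interpret_cai(cai: float) -> str:
    if cai > 0.85:
        return "strong expression"
    if cai >= 0.70:
        return "adequate"
    return "at risk of poor expression"


def interpret_cap_mfe(mfe_kcal_per_mol: float) -> str:
    # MFE of the first 30 nt; less negative means a more accessible 5' end.
    if mfe_kcal_per_mol > -6.0:
        return "accessible"
    if mfe_kcal_per_mol >= -12.0:
        return "at risk"
    return "likely to block translation initiation"


def interpret_inverted_repeats(count: int) -> str:
    if count == 0:
        return "ideal for therapeutic products"
    if count <= 2:
        return "acceptable for research"
    return "suggests rework"


def interpret_immunogenicity(score: float) -> str:
    if score < 3.0:
        return "therapeutic-grade"
    if score <= 5.0:
        return "research-grade"
    return "flagged"
```

Keeping the cut-offs in one place like this makes it straightforward to recalibrate them per program, as the text recommends.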

Decision-Support Narrative

Above the per-metric table, the report carries a brief decision-support narrative generated at run time. Typical narratives: "Candidate 1 meets all hard constraints, exceeds CAI and GC targets, and has a composite immunogenicity of 2.1 — recommended for synthesis." Or: "Candidate 3 has the best CAI (0.92) but contains two inverted repeats at length 25 and 22; consider re-running with higher repeat penalty, or verify empirically." The narratives are meant for a non-specialist reader — a program manager reviewing designs without a deep RNA-structure background — and are not prescriptive; they indicate what the data suggest and leave the decision to the reviewer.

Candidate Diversity Surface

The report's Pareto-frontier panel (GeneCrafter NSGA-II) or top-8 panel (other tools) exposes the diversity of the top candidates: not just the best by the scalarized score but several that trade off differently. This is a deliberate safeguard against the over-optimization failure mode in which a single top candidate turns out, on wet-lab testing, to underperform an alternate that scored slightly lower in silico but performs better at the bench. Inspecting the top-8 panel, and optionally commissioning two or three of the candidates for head-to-head wet-lab comparison, is the empirically grounded best practice for de-risking a therapeutic design.

8. Quality-Metric Interpretation Guide

8.1 A Suggested Customer Acceptance Gate (baseline)

For SaRNADesigner-produced replicons, a suggested acceptance gate is:

GOI uridine fraction ≤ 20% when --u-depletion is enabled.
Zero CSE-interference base pairs in the predicted structure.
SGP junction MFE ≥ -6 kcal/mol (first 30 nt of GOI).
Inverted-repeat count ≤ 2, with no inverted repeat above length 30 or score 50.
CpG and UpA dinucleotides suppressed (composite immunogenicity ≤ 3.0 for clinical use).
No dense-GC cluster (>80% GC in any 40-nt window) in the GOI.
GOI CAI ≥ 0.80 for the target organism.
No miRNA-seed match for the designated host tissue.
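
The gate above can be applied programmatically to a candidate's metrics. The sketch below is one way to do so; the dictionary keys are hypothetical and do not reflect the actual field names in the JSON summary, and the dense-GC scan treats the 40-nt criterion as a sliding window.

```python
def has_dense_gc_cluster(seq: str, window: int = 40, limit: float = 0.80) -> bool:
    """True if any full window has a GC fraction strictly above `limit`."""
    s = seq.upper()
    if len(s) < window:
        return False
    gc = sum(c in "GC" for c in s[:window])
    if gc / window > limit:
        return True
    for i in range(window, len(s)):
        # Slide the window one position: add the new base, drop the old one.
        gc += (s[i] in "GC") - (s[i - window] in "GC")
        if gc / window > limit:
            return True
    return False


def sarna_gate(m: dict) -> list[str]:
    """Return labels of failed criteria; an empty list means the gate passes."""
    checks = [
        ("uridine", (not m["u_depletion_enabled"]) or m["goi_u_fraction"] <= 0.20),
        ("cse", m["cse_interference_pairs"] == 0),
        ("sgp_mfe", m["sgp_junction_mfe"] >= -6.0),
        ("inverted_repeats", m["inverted_repeat_count"] <= 2
         and m["max_ir_length"] <= 30 and m["max_ir_score"] <= 50),
        ("immunogenicity", m["immunogenicity"] <= 3.0),
        ("dense_gc", not has_dense_gc_cluster(m["goi_sequence"])),
        ("cai", m["goi_cai"] >= 0.80),
        ("mirna", m["mirna_seed_matches"] == 0),
    ]
    return [label for label, ok in checks if not ok]
```

Returning the failed labels rather than a single boolean keeps the result useful for the rework decision: the reviewer sees which criterion to address.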

9. Use Cases and Worked Example

9.1 Canonical Example Command

A representative SaRNADesigner invocation for a VEEV-TC83 saRNA vaccine targeting a human antigen, with uridine depletion and CSE protection enabled:

SaRNADesigner.py --protein antigen.fasta --organism 9606 --backbone VEEV_TC83 --u-depletion True --prevent-cse-interaction True --poly-a-length 100 --linearfold-beam-size 300 --population-size 300 --generations 400 --seed 11 --output-file results/job04
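
For pipeline use, the same invocation can be assembled from a mapping so that the options are archived alongside the run. This sketch only builds and prints the argument list using the flags shown above; it does not execute the tool, and `build_command` is a hypothetical helper name.

```python
import shlex


def build_command(options: dict, script: str = "SaRNADesigner.py") -> list[str]:
    """Turn {flag: value} into an argument list, preserving insertion order."""
    argv = [script]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


EXAMPLE_OPTIONS = {
    "protein": "antigen.fasta",
    "organism": 9606,
    "backbone": "VEEV_TC83",
    "u-depletion": True,
    "prevent-cse-interaction": True,
    "poly-a-length": 100,
    "linearfold-beam-size": 300,
    "population-size": 300,
    "generations": 400,
    "seed": 11,
    "output-file": "results/job04",
}

if __name__ == "__main__":
    # Prints a shell-quoted command line equivalent to the example above.
    print(shlex.join(build_command(EXAMPLE_OPTIONS)))
```

The resulting list can be passed to `subprocess.run` without a shell, which avoids quoting problems when file paths contain spaces.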

9.2 Recommended Decision Workflow

1. Select the backbone. For human therapeutic and vaccine use, VEEV_TC83 is the standard choice due to its attenuation profile and its clinical precedent.

2. Enable uridine depletion; set a population size of 300 or higher because the constraint landscape is more rugged than a linear-mRNA design.

3. Inspect the CSE-integrity panel; if any interference is reported, extend generations and re-run.

4. Review the SGP-junction MFE; if it falls below -6 kcal/mol, adjust the 5' end of the GOI (typically by tweaking the first two codons) or accept that subgenomic-transcript translation will be moderately blunted.

5. For clinical products, submit the full replicon FASTA to the synthesis vendor and archive the config, seed, and structural report in the design-history file.

10. Industry Comparison

The codon-optimization and mRNA-design software landscape has expanded rapidly over the past decade, driven by the mRNA-therapeutics industry's need for in-silico sequence engineering that integrates synthesis feasibility, expression optimization, structural awareness, and innate-immunity awareness into a single workflow. This section positions the Bioneer suite against the most widely used academic and commercial alternatives.

Academic and Open-Source Tools

Academic tools in wide use include ViennaRNA (Lorenz et al. 2011, the standard RNA thermodynamics package providing RNAfold, RNAcofold, RNAinverse, and RNAeval), LinearFold and LinearDesign (Huang et al. 2019; Zhang et al. 2023, Nature — linear-time MFE and joint CAI/MFE optimization), RNAstructure (Reuter & Mathews 2010 — rigorous thermodynamic modelling with experimental-probing integration), Mfold (Zuker 2003, the historical reference), LocARNA (multiple-sequence structure alignment), RNAshapes (Voß et al. 2006 — abstract-shape analysis), JCat (Grote et al. 2005, codon optimization against a user-supplied reference set), OPTIMIZER (Puigbò et al. 2007, codon optimization with batch CSV output), COOL (Chin et al. 2014, multi-objective with CAI/CPB/GC), DNAWorks (Hoover & Lubkowski 2002, one of the earliest widely-used tools, oriented toward oligo-assembly feasibility), and CAIcal (Puigbò et al. 2008, CAI reporting).

Each of these tools solves a narrow problem well but collectively they do not constitute a therapeutic-grade mRNA design workflow. ViennaRNA and RNAstructure produce rigorous structures but do no codon optimization. JCat, OPTIMIZER, and COOL optimize codons but do not integrate structure-aware objectives, synthesis-vendor templates, Kozak context, capping chemistry, or immunogenicity metrics. LinearDesign integrates structure and codon choice but does not support UTR design, saRNA, or circRNA and does not produce a publishable report. DNAWorks focuses on oligo-assembly feasibility and is largely decoupled from biological objectives.

The Bioneer suite's integration of all of these capabilities behind a single CLI and report — with exact reproducibility, synthesis-vendor and host-expression templates built in, and a coherent extension from linear CDS to UTRs to saRNA to circRNA — is the central design decision that differentiates it from stacking multiple academic tools.

Commercial Tools

Commercial competitors include ThermoFisher GeneArt GeneOptimizer (the closed-source proprietary optimizer behind ThermoFisher's synthesis service), GenScript OptimumGene (bundled with GenScript synthesis), IDT Codon Optimization Tool (bundled with IDT gBlocks), Twist Bioscience's Codon Optimizer (bundled with Twist clonal gene synthesis), ATUM GeneGPS (formerly DNA2.0's GeneDesigner, sold as a stand-alone plus bundled with ATUM synthesis services), and Benchling's built-in codon optimizer. Specialized mRNA-therapeutics platforms are increasingly being offered by synthesis-plus-design CROs (Eurofins, Bioneer's own GMP-mRNA service, TriLink, ReNAgade, CureVac's in-house platform) and by pure-software vendors (BioLogic, ML-assisted mRNA design tools emerging from the deep-learning literature).

Commercial tools are typically tightly coupled to a single synthesis vendor, which is convenient when you are committed to that vendor but disadvantageous when you need to dual-source or to benchmark. Most commercial tools are closed-source: the customer cannot inspect the optimization objective, the constraint library, or the underlying codon table; this opacity is a material compliance risk for GxP-regulated drug development, where algorithm inspection and auditability are expected under FDA GMP and EMA guidelines. Commercial tools rarely expose a reproducible seed or checkpoint, and rarely produce a complete-with-provenance output bundle.

The Bioneer suite is vendor-neutral at the synthesis-template layer — IDT, Twist, and GenScript templates are first-class, and additional vendors can be added via config — and every optimization parameter is documented, inspectable, and reproducible. This makes the suite suitable as a primary design tool in a vendor-agnostic mRNA pipeline, not as an adjunct to a specific vendor's service.

10.1 Feature Matrix

Capability	Bioneer Suite	ViennaRNA + JCat	LinearDesign	GeneArt	OptimumGene	IDT Tool	ATUM GeneGPS
Codon optimization (CAI)	Yes, target/max/min	Yes (JCat)	Yes (CAI+MFE)	Yes (closed)	Yes (closed)	Yes	Yes
Structure-aware objective (MFE)	Yes (hybrid Zuker/LinearFold)	Post-hoc only	Yes (joint)	Undocumented	Undocumented	No	Yes
Windowed synthesis constraints	Yes (per-vendor template)	No	No	Built-in vendor	Built-in vendor	Built-in vendor	Built-in vendor
Vendor-agnostic	Yes (IDT, Twist, GenScript, ATUM, more)	Yes	Yes	Tied to ThermoFisher	Tied to GenScript	Tied to IDT	Tied to ATUM
UTR library and design	Yes (UTRDesigner)	No	No	Partial	Partial	No	Partial
saRNA replicon support	Yes (SaRNADesigner)	No	No	No	No	No	No
circRNA design	Yes (CircularDesigner)	No	No	No	No	No	No
Capping chemistry constraints	Yes (ARCA, CleanCap-AG, CleanCap-AT, enzymatic)	No	No	No	Partial	No	Partial
Multi-objective (Pareto)	Yes (NSGA-II, GeneCrafter)	No	Partial	No	No	No	No
Reproducible (seed + checkpoint + config)	Yes (full)	Partial	Partial	No	No	No	No
Open algorithms and parameters	Yes (all documented)	Yes	Yes	Closed	Closed	Closed	Closed
HTML + PDF + JSON + CSV report	Yes	No	No	PDF only	PDF only	PDF only	PDF only
ALCOA+ audit-ready output bundle	Yes	No	No	Partial	Partial	No	Partial
Innate-immunity (CpG, UpA, U-depletion)	Yes (composite score)	No	No	Undocumented	Undocumented	No	Partial
Cryptic splice-site scanning	Yes (donor/acceptor PSSM)	No	No	Undocumented	Undocumented	No	No
Numba JIT acceleration	Yes (fitness + folding)	N/A	Native C++	N/A	N/A	N/A	N/A
Batch/pipeline integration (FASTA in, JSON out)	Yes	Partial	Partial	Service API	Service API	Service API	Service API

10.2 Program-Specific Observations — SaRNADesigner

SaRNADesigner is the only publicly documented tool in its category. Proprietary saRNA platforms at Gritstone bio (SAM), Arcturus Therapeutics (STARR), and CureVac (sa-mRNA), as well as those of several academic groups (Lundstrom, Geall), run on internal software that is not available to external customers. ViennaRNA and LinearDesign can fold saRNA-sized sequences but do not model replicase preservation, CSE interference, or saRNA-specific uridine depletion. SaRNADesigner is therefore the first publicly documented software to offer saRNA-design capability outside an in-house biotech platform.

10.3 What SaRNADesigner Uniquely Offers

SaRNADesigner uniquely provides:

(i) three validated alphavirus backbones (VEEV TC-83, VEEV Trinidad, SFV) with their CSE and SGP coordinates pre-encoded;
(ii) a CSE-interference check applied as a hard rejection based on predicted base-pairing;
(iii) a uridine-depletion fitness term with a configurable 20% target;
(iv) 2× penalty weighting on inverted repeats to reflect saRNA's heightened dsRNA sensitivity;
(v) a tuned LinearFold beam size (300) for accurate long-construct folding;
(vi) an SGP junction stability check;
(vii) payload auto-extraction for iterative refinement of existing replicons;
(viii) a full audit-ready output bundle parallel to the other tools in the suite.
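
Item (iv), the 2× inverted-repeat weighting, can be illustrated with a toy detector. The source does not describe the detector's algorithm or scoring, so everything here is an assumption: the sketch counts exact reverse-complement arm pairs of a fixed length separated by a minimum loop, and multiplies the count by the weight. Overlapping seed hits from one long stem are counted separately rather than merged.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp(rna: str) -> str:
    """Reverse complement in the RNA alphabet (T is read as U)."""
    return rna.upper().replace("T", "U").translate(_COMPLEMENT)[::-1]


def inverted_repeats(seq: str, arm_len: int = 15, min_loop: int = 3) -> list[tuple[int, int]]:
    """(i, j) start positions where seq[j:j+arm_len] is the reverse complement
    of seq[i:i+arm_len] and the arms are separated by at least `min_loop` nt."""
    s = seq.upper().replace("T", "U")
    index: dict[str, list[int]] = {}
    for i in range(len(s) - arm_len + 1):
        index.setdefault(s[i:i + arm_len], []).append(i)
    hits = []
    for i in range(len(s) - arm_len + 1):
        for j in index.get(revcomp(s[i:i + arm_len]), []):
            if j >= i + arm_len + min_loop:
                hits.append((i, j))
    return hits


def dsrna_penalty(seq: str, weight: float = 2.0, arm_len: int = 15, min_loop: int = 3) -> float:
    """Hypothetical penalty: weight (2x for saRNA) times the number of seed hits."""
    return weight * len(inverted_repeats(seq, arm_len, min_loop))
```

A real scorer would also tolerate mismatches and weight longer stems more heavily; this version shows only how the multiplier enters the fitness term.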

10.4 Deeper Benchmark Context

Depth Comparison with Key Academic Tools

A deeper comparison with key academic tools clarifies where the Bioneer suite is equivalent, superior, or differentiated. Against ViennaRNA — the de facto RNA-thermodynamics standard — the suite uses the same underlying Turner free-energy parameters and reproduces RNAfold's MFE results bit-for-bit on test cases. The difference is that the suite embeds folding inside a GA loop with synthesis and expression constraints, whereas ViennaRNA is a thermodynamics-only toolkit. Against LinearFold, the suite reuses the same algorithmic idea (5'-to-3' beam search) but retains the option to switch to exact Zuker for short sequences, and — critically — can pass Zuker-extracted seeds as constraints to LinearFold for accuracy on long sequences. Against LinearDesign, the suite does not implement the lattice-DP joint optimization but achieves comparable outcomes through GA search with CAI and MFE as co-objectives, while adding the synthesis-template, UTR-library, and circRNA/saRNA capabilities that LinearDesign does not provide.

Against JCat, the suite covers JCat's core use case (CAI optimization against a reference set) and adds: structure-aware optimization, windowed-GC constraints, synthesis-vendor templates, immunogenicity, NSGA-II multi-objective, UTR design, saRNA, and circRNA. JCat is single-objective, single-use-case, and does not fold the optimized output. Against OPTIMIZER and COOL, similar remarks apply: both are academic codon-optimization tools with limited or no integration of structure, synthesis, or therapeutic-grade metrics. Against DNAWorks, the suite's synthesis-vendor-template system is functionally broader and covers the same constraints DNAWorks addresses (GC, repeats, homopolymers) while additionally covering codon choice and biology.

Depth Comparison with Commercial Tools

Against ThermoFisher GeneArt's GeneOptimizer, the suite provides the same core codon-optimization capability, plus transparency (GeneOptimizer is closed-source, so its optimization objective cannot be audited). Against GenScript OptimumGene, similar transparency and vendor-agnostic arguments apply. Against IDT's Codon Optimization Tool, the suite provides a significantly broader feature set (IDT's tool is primarily a vanilla CAI optimizer with IDT-specific synthesis constraints). Against ATUM GeneGPS (formerly DNA2.0's GeneDesigner), the suite's output bundle is more audit-friendly and the UTR and saRNA/circRNA modules are unique to the Bioneer suite.

Benchmark Case Study (Qualitative)

On a representative therapeutic-grade vaccine antigen (SARS-CoV-2 spike full-length, 3,822 nt), the suite's output across organisms (human, mouse, rabbit, rhesus) demonstrates: CAI achieved above 0.87 in all cases; global GC within 2 percentage points of the 55% target; windowed GC inside the IDT gBlocks template bounds everywhere; zero restriction sites for the configured enzymes; composite immunogenicity below 4.0 for all cases; zero internal T7 promoter or poly-T ≥ 7; no inverted repeat at length ≥ 25. Comparable sequences produced by single-objective academic tools achieved CAI above 0.90 on average but with windowed GC excursions, 1–3 inverted repeats per sequence, and occasional restriction-site hits. This indicates that single-objective CAI maximization routinely produces sequences that would fail synthesis-vendor QC, whereas the suite's multi-constraint optimization delivers sequences that pass first-submission QC consistently.

Workflow-Integration Comparison

An often-overlooked differentiator is workflow integration. Commercial tools are typically web-service-based and require uploading the input sequence to a vendor-controlled server; for therapeutic programs under an IND, this data-egress can be a compliance hurdle. The Bioneer suite runs entirely on client infrastructure, which means that proprietary sequences never leave the client's environment. The suite also produces outputs (JSON, CSV, FASTA, HTML) that integrate natively with common laboratory-information systems (Benchling, Geneious, LabVantage, Sapio), with common pipeline tools (Snakemake, Nextflow, CWL), and with regulatory-document-management systems. The ALCOA+-compatible output bundle reduces the friction of retrofitting compliance onto an already-developed sequence.

11. Compliance with Published Requirements

This section addresses compliance of the Bioneer RNA/DNA Design Suite against three categories of stated requirements: (a) published methodological requirements in peer-reviewed mRNA-therapeutics and computational-biology literature; (b) functional expectations of mainstream commercial codon-optimization and mRNA-design software; (c) regulatory-grade software expectations under FDA, EMA, and ICH guidance for computational tools in drug development.

11.1 Peer-Reviewed Literature Requirements

Reference / Requirement	Bioneer Coverage	Notes
Sharp & Li 1987 — CAI as normalized codon-usage metric	Full	CAI computed against organism-specific reference set; target/max/min modes.
Coleman et al. 2008 — Codon-pair bias	Full	CPB score computed and reportable; configurable in objective.
Kudla et al. 2006 — GC and mRNA stability	Full	Global and windowed GC optimized toward configurable target.
Zuker 1989; Mathews 2004 — MFE structure prediction	Full	Refactored Zuker recursion, JIT-compiled, used for sub-700 nt sequences.
Huang et al. 2019 — LinearFold O(n) folding	Full	Integrated with beam-size 100–300 for long sequences.
Zhang et al. 2023 — LinearDesign joint CAI+MFE	Partial	Joint CAI+MFE optimization achieved via GA with combined fitness rather than lattice DP; operationally equivalent for therapeutic lengths.
Karikó & Weissman 2005 — nucleoside modification (Ψ; m1Ψ in later work)	Complementary	Sequence-level strategies complement but do not replace m1Ψ; tool outputs compatible with m1Ψ or unmodified transcripts.
Pardi et al. 2018 — mRNA vaccine sequence-design requirements	Full	CAI, MFE, poly(A), cap-compatibility, immunogenicity all addressed.
Wesselhoeft et al. 2018 — Group-I PIE circRNA design	Full	CircularDesigner supports T4 td PIE, Anabaena, Group-II, and Tornado ribozyme.
Vogel et al. 2018; Lundstrom 2019 — saRNA replicon design	Full	SaRNADesigner supports VEEV TC-83, VEEV Trinidad, SFV backbones; CSE preservation enforced.
Presnyak et al. 2015 — codon optimality and mRNA half-life	Full	Codon-usage weights correlate with mRNA stability in the CAI/CPB composite.
Leppek et al. 2022 — structure-guided mRNA optimization	Full	Structure-aware fitness terms and structure-reported metrics.
WHO 2022, FDA 2022, EMA 2023 — mRNA vaccine guidelines (sequence considerations)	Full	All stated sequence-level considerations are addressed.
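
The CAI row in the table above refers to the Sharp & Li definition: the geometric mean of each codon's relative adaptiveness, where a codon's weight is its frequency divided by that of the most frequent synonymous codon. The sketch below implements that definition; the function names are hypothetical, and any codon counts passed in are the caller's own reference data (the suite's codon database is not reproduced here).

```python
import math


def relative_adaptiveness(counts_by_aa: dict[str, dict[str, float]]) -> dict[str, float]:
    """w(codon) = count / max count among synonymous codons of the same amino acid."""
    weights = {}
    for codon_counts in counts_by_aa.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / top if top > 0 else 0.0
    return weights


def cai(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of w over the codons of `cds` that have a positive weight.
    Codons absent from `weights` (e.g. single-codon amino acids left out by the
    caller) are skipped."""
    s = cds.upper().replace("U", "T")
    codons = [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0.0) > 0.0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

Working in log space avoids underflow on long coding sequences, where a direct product of thousands of weights below 1 would round to zero.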

11.2 Commercial Software Functional Expectations

Functional Requirement	Bioneer Coverage	Notes
Accept DNA and protein inputs	Yes	FASTA, GenBank, raw string; batch mode for multiple sequences.
Organism selection with up-to-date codon tables	Yes	CoCoPUTs-backed HDF5 database; user-refreshable.
Vendor-specific synthesis template	Yes	IDT, Twist, GenScript, ATUM; extendable by config.
Restriction-site avoidance	Yes	User-configurable list plus vendor defaults.
Forbidden-motif avoidance	Yes	User-configurable list plus template defaults.
GC-window constraint	Yes	Configurable window size and bounds per vendor.
Homopolymer caps	Yes	Per-base and per-vendor.
Repeat and inverted-repeat auditing	Yes	Min length and min score configurable.
Secondary-structure prediction	Yes	Hybrid Zuker/LinearFold; full-length therapeutic RNA supported.
Visual structure output (SVG, PDF)	Yes	ViennaRNA RNAplot SVG; PDF archive.
Ranked multi-candidate output	Yes	Top 8 by default; configurable.
CLI for pipeline integration	Yes	JSON config, FASTA I/O, exit codes.
Reproducible runs (seed, checkpoint)	Yes	Full checkpoint + config + seed bundle.
Human-readable report	Yes	HTML + PDF with biology-explained metrics.
Machine-readable export	Yes	JSON + CSV.
Batch/high-throughput mode	Yes	FASTA-in, per-record output directory.
Licensing/software distribution	Internal	Deployed on client infrastructure; no data egress.

11.3 Regulatory Software Requirements

Computational tools that inform drug-product design are subject to a tiered set of expectations under GxP and aligned guidance. The Bioneer suite is designed to meet Category-3 (non-configured products used for intended purpose) and Category-4 (configured products) expectations under GAMP 5, with user-facing configuration that can be version-controlled and audited. The following table maps compliance against the principal regulatory frameworks.

Framework / Requirement	Bioneer Coverage	Notes
ALCOA+ — Attributable	Yes	Run manifest records operator, host, tool version, timestamp.
ALCOA+ — Legible	Yes	HTML, PDF, JSON, CSV outputs; plain-text config.
ALCOA+ — Contemporaneous	Yes	Timestamps on every checkpoint and every report section.
ALCOA+ — Original	Yes	Original config, original checkpoint, original report are all preserved.
ALCOA+ — Accurate	Yes	Reproducibility from seed + config verified in QC harness.
ALCOA+ — Complete	Yes	All intermediate results available; no silent pruning.
ALCOA+ — Consistent	Yes	Report field set is fixed per tool version.
ALCOA+ — Enduring	Yes	Plain-text and open-vector outputs; no proprietary binary.
ALCOA+ — Available	Yes	Self-contained results directory; portable.
21 CFR Part 11 — Electronic records	Aligned	Output records are attributable and tamper-evident when written to controlled storage; e-signature layer is the responsibility of the enclosing QMS.
GAMP 5 — Software categorization	Category 3/4	Standard product with configurable parameters; no custom code per user.
GAMP 5 — Risk-based validation	Supported	Functional test suite included; IQ/OQ/PQ templates deliverable on request.
ICH Q8 — Quality by Design	Supported	Design-space inputs (CAI, GC, MFE, immunogenicity) are explicit and tunable; critical quality attributes reportable.
ICH Q9 — Quality Risk Management	Supported	Fitness term weights are risk-based; rejection thresholds are documented.
ICH Q10 — Pharmaceutical Quality System	Supported	Deterministic outputs enable integration with CAPA, deviation, change control.
ICH Q11 — Development of drug substances	Supported	Design-history traceability via config + checkpoint.
ICH Q14 — Analytical Procedure Development	Supported	Report metrics mappable to analytical specifications (CAI, MFE, immunogenicity, repeat inventory).
FDA 2022 mRNA-vaccine sequence considerations	Full	Covered by tool output metrics.
EMA 2023 mRNA guideline — sequence-level CMC	Full	Covered by tool output metrics plus design-history package.

12. mRNA Drug / Vaccine Development Perspective

12.1 Where This Tool Sits in the Workflow

A realistic mRNA-therapeutic development pipeline proceeds from antigen or payload definition (the protein to be expressed), to in-silico design of the coding and untranslated regions, to template-DNA synthesis and cloning, to in vitro transcription, to capping and polyadenylation, to purification and formulation (typically lipid-nanoparticle encapsulation), to in vitro potency and release testing, to in vivo pharmacology, and eventually into regulatory filings, clinical trial material, and commercial manufacture. The Bioneer suite addresses the second stage — in-silico sequence design — and is positioned specifically to deliver a sequence that is simultaneously: biologically well-behaved (CAI, structure, immunogenicity), synthesis-ready (vendor-template constraints, homopolymer caps, repeat audits), reproducible (seed, checkpoint, config), and audit-defensible (ALCOA+ outputs, full design history). The sequence leaving the Bioneer suite is the primary input to the synthesis vendor and the anchor of the design-history file that accompanies the drug product through its regulatory lifecycle.

Upstream of the Bioneer suite sit antigen-discovery tools (bioinformatics prediction of protein targets), epitope scoring and immunogenicity prediction platforms, and structural-biology refinement. Downstream sit the synthesis-and-amplification workflow, the IVT reaction, the capping and tailing steps, the purification train (dsRNA removal by HPLC or cellulose-based chromatography, oligo-dT affinity capture, tangential-flow filtration), the LNP formulation and characterization, the analytical-release panel (capping efficiency by LC-MS, poly(A) length by Bioanalyzer or fragment analyzer, integrity by agarose or capillary electrophoresis, residual dsRNA by ELISA or J2-antibody dot blot, residual template DNA by qPCR, endotoxin), and the in-vitro and in-vivo potency assays. Several of the sequence-level metrics produced by the Bioneer suite map directly to analytical-release tests, which makes the suite's output a natural bridge between design and CMC.

12.2 Therapeutic-Grade Acceptance Gates

A suggested acceptance gate for therapeutic-grade mRNA design output is: CAI ≥ 0.85 for the target organism; global GC between 50% and 62%; windowed GC (50-nt window) between 30% and 70% everywhere; no homopolymer tracts exceeding the vendor's template cap (typically A ≤ 14, C ≤ 14, and, for IVT products, G ≤ 5 and T ≤ 6, the latter because longer T tracts act as T7 termination signals); no unintended restriction sites or forbidden motifs; no cryptic splice donor or acceptor sites above the PSSM threshold (when relevant); composite immunogenicity score below the tool-specific cap (typically 5.0); no inverted repeats above length 20 and score 30; a total poly(A) tail of 100–150 nt, counting the encoded and enzymatically added portions together; and, for products using CleanCap-AG or CleanCap-AT, a +1 transcription start that matches the required AG or AT dinucleotide. These gates are not universal for every indication; a vaccine targeting a protein with a hard co-translational folding requirement may require tighter local-MFE control, while a protein-replacement therapy may tolerate more structural variation. The gates are the starting point for an informed discussion between the design team and CMC/clinical colleagues.

12.3 SaRNADesigner in the mRNA Development Workflow

Self-amplifying RNA is emerging as the preferred format for dose-sparing vaccines and for applications requiring sustained expression from a single dose. The published and clinical-pipeline evidence indicates that saRNA can deliver protective immunity at ~0.1× to 0.01× the dose of conventional mRNA vaccines, with extended duration of antigen exposure. This dose-sparing advantage is especially valuable for pandemic preparedness (larger populations covered per manufacturing campaign), for pediatric vaccination (lower reactogenicity at lower doses), and for resource-limited-setting deployment (lower cost per dose).

The primary challenge of saRNA in the clinic is reactogenicity driven by the absence of nucleoside modification; even with aggressive sequence-level immune evasion, saRNA tends to be more reactogenic than m1Ψ-modified mRNA of equivalent antigen load. SaRNADesigner's uridine depletion, dsRNA minimization, CpG/UpA suppression, and CSE-structural protection together represent the full toolkit available to the sequence designer; pairing these with optimized LNP formulation and, where appropriate, co-delivered immunomodulators is the broader programmatic approach.

For cancer-vaccine applications, saRNA's ability to express multiple neoantigens from a single delivered molecule, at high levels, and for multiple days, is mechanistically attractive. SaRNADesigner supports multi-antigen constructs by concatenating multiple neoantigens inside the GOI region (separated by flexible linkers or self-cleaving 2A peptides); the tool optimizes the combined GOI while respecting all the replicon-level constraints described above. For personalized cancer vaccines with rapid turn-around requirements, SaRNADesigner's deterministic and reproducible output bundle is particularly valuable for regulatory review — each patient-specific design is traceable to a config, a seed, and a checkpoint.

For infectious-disease vaccines and biodefense applications, the backbone choice is consequential. VEEV_TC83 has an established attenuation profile and prior clinical-trial experience via Arcturus' LUNAR-COV19 and Gritstone's CORAL programs; VEEV_Trinidad and SFV are generally reserved for research or for programs willing to accept the reactogenicity/replication trade-off.

12.4 Integrating With Nucleoside Modification

Nucleoside modification, most prominently m1Ψ substitution, is the dominant pharmaceutical strategy for suppressing innate-immune activation and extending transcript half-life in clinical mRNA. Sequence-level design and nucleoside modification are complementary, not substitutes. Even m1Ψ-modified transcripts retain sequence-dependent recognition by ZAP (via CpG dinucleotides) and by MDA5 (via dsRNA inverted repeats); sequence-level CpG depletion and inverted-repeat suppression therefore provide additional headroom even when nucleoside modification is used. Conversely, for platforms that cannot use m1Ψ (most notably self-amplifying RNA, which requires natural bases for RdRp replication), sequence-level immunogenicity reduction is the only available lever and must be aggressive. Circular RNA sits between the two: lacking a free 5' end, it cannot carry a conventional cap and relies on IRES-driven translation, and its emerging literature indicates that unmodified circRNA can be well tolerated when its IRES context and junction structure are well chosen.

The Bioneer suite's immunogenicity composite is calibrated so that the scores remain interpretable across modified and unmodified contexts. For modified mRNA, the composite remains a useful residual-risk metric; for unmodified saRNA, the composite drives the fitness gradient; for circRNA, it controls the dsRNA-formation risk at the back-splice junction and within inverted repeats.

12.5 Manufacturing, Formulation, and Clinical-Grade Context

LNP Formulation Considerations

Lipid-nanoparticle formulation is the dominant delivery modality for clinical mRNA. Commercially-used LNP formulations (Pfizer/BioNTech ALC-0315, Moderna SM-102, Arcturus LUNAR, Genevant CL1) are ionizable-lipid-based systems that encapsulate mRNA via electrostatic and hydrophobic interactions during a solvent-exchange process. The mRNA sequence influences LNP quality indirectly through length (longer mRNA = different packing), net charge (minor but measurable), and secondary-structure presentation (structured mRNA packs differently than single-stranded). Very long sequences (saRNA replicons) require LNP formulations tuned for larger payload and may show different encapsulation-efficiency profiles. Sequence-level design decisions that the Bioneer suite makes (GC content, structural penalty weighting, repeat suppression) do not directly control LNP quality but contribute indirectly by producing sequences that behave predictably in the formulation step.

A practical consideration is that residual dsRNA, a common IVT byproduct, interacts strongly with ionizable cationic lipids and is difficult to remove after encapsulation. Suppressing dsRNA at the sequence level (inverted-repeat minimization in the Bioneer suite's fitness composite) reduces the burden on downstream dsRNA-removal and purification steps (RNase III digestion, cellulose-based dsRNA removal, HPLC, oligo-dT affinity) and improves the drug product's margin on the residual-dsRNA analytical-release test (typically <1 ng dsRNA per μg mRNA for clinical material).
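As a rough illustration of what inverted-repeat screening involves, the sketch below counts pairs of reverse-complementary k-mers that could fold back on each other to form a stem. The function names, the 8-nt stem length, and the 3-nt minimum loop are assumptions chosen for the example; they are not the suite's actual fitness term or thresholds.

```python
from collections import defaultdict

# Watson-Crick complement table for RNA (G-U wobble pairs are ignored here).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return rna.translate(_COMPLEMENT)[::-1]


def count_inverted_repeats(seq: str, stem: int = 8, min_loop: int = 3) -> int:
    """Count k-mer pairs (i, j), i upstream of j, that are reverse complements.

    `stem` and `min_loop` are illustrative defaults, not values from the suite.
    DNA input is accepted; T is read as U.
    """
    rna = seq.upper().replace("T", "U")

    # Index every k-mer by its start positions.
    positions = defaultdict(list)
    for i in range(len(rna) - stem + 1):
        positions[rna[i:i + stem]].append(i)

    count = 0
    for i in range(len(rna) - stem + 1):
        partner = reverse_complement(rna[i:i + stem])
        for j in positions.get(partner, ()):
            # The downstream arm must start after the upstream arm plus a loop.
            if j >= i + stem + min_loop:
                count += 1
    return count
```

A perfect-match k-mer count of this kind is only a proxy: it ignores wobble pairs, bulges, and thermodynamic stability, so it complements folding-based structure checks without replacing them.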

IVT Reaction Optimization Context

The IVT reaction is a central manufacturing step for all non-circular mRNA modalities. T7 RNA polymerase transcribes a linearized DNA template in the presence of rNTPs (or modified rNTPs for m1Ψ chemistry), magnesium, and capping components (if co-transcriptional capping is used). Yield depends on template quality (cleavage completeness, contamination), rNTP stoichiometry, reaction time and temperature, and, importantly, on sequence features that favor T7 processivity. Poly-T runs, poly-G runs, and internal T7 promoter mimics are empirically associated with lower yield; IVTDesigner's hard constraints on these features are specifically intended to remove this source of variability. Capping efficiency similarly depends on the initiating dinucleotide at +1 matching the capping chemistry; IVTDesigner's cap-analog-aware 5' UTR selection addresses this.
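The sequence features named above can be screened with a few lines of code. The following is a minimal sketch under stated assumptions: the run-length threshold of 6 and the mismatch tolerance of 2 are illustrative values, the function names are hypothetical, and the 17-nt string is the commonly cited T7 promoter consensus core rather than a definition taken from IVTDesigner.

```python
import re

# Commonly cited T7 promoter consensus core (-17 to -1), DNA notation.
T7_CORE = "TAATACGACTCACTATA"


def _normalize(seq: str) -> str:
    """Upper-case the sequence and express it in DNA-template notation."""
    return seq.upper().replace("U", "T")


def homopolymer_runs(seq: str, base: str, min_len: int = 6):
    """Return (start, length) for every run of `base` at least `min_len` long."""
    pattern = re.compile(f"{base}{{{min_len},}}")
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(_normalize(seq))]


def t7_promoter_mimics(seq: str, max_mismatch: int = 2):
    """Return (start, mismatches) for windows within `max_mismatch` of T7_CORE."""
    dna = _normalize(seq)
    hits = []
    for i in range(len(dna) - len(T7_CORE) + 1):
        window = dna[i:i + len(T7_CORE)]
        mismatches = sum(a != b for a, b in zip(window, T7_CORE))
        if mismatches <= max_mismatch:
            hits.append((i, mismatches))
    return hits
```

Treating any hit as a hard rejection mirrors the hard-constraint behavior described above; a softer pipeline could instead feed the hit counts into a penalty term.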

Analytical release of clinical mRNA includes tests for: mRNA integrity (agarose gel or capillary electrophoresis), capping efficiency (LC-MS of cap analog post-digestion, or immunocapture), poly(A) length and distribution (fragment analyzer), residual dsRNA (J2-antibody dot blot or ELISA, clinical spec typically <1 ng/μg), residual template DNA (qPCR), endotoxin, and sterility. Several of these tests have direct sequence-level antecedents: mRNA integrity depends on the absence of repeats and structures that could cause IVT pausing; capping efficiency depends on the +1 start; residual dsRNA depends on inverted-repeat count. The Bioneer suite's sequence-level metrics are therefore not just design parameters but leading indicators of the analytical-release profile of the manufactured drug.

Clinical-Grade Acceptance Criteria

A suggested set of clinical-grade acceptance criteria (for discussion with CMC and regulatory colleagues) includes, beyond the tool-specific sequence gates already listed: capping efficiency ≥ 95%, poly(A) tail length 100–150 nt with low dispersion (<10% CV), residual dsRNA ≤ 1 ng/μg mRNA, residual template DNA ≤ 10 pg/μg mRNA, endotoxin ≤ 0.5 EU/μg, and integrity ≥ 80% full-length. Sequence-level decisions that contribute to these criteria include: IVT-safe sequence features (all Bioneer tools), cap-analog-matched 5' start (IVTDesigner), inverted-repeat suppression (all tools), and sequence length within the capacity of the LNP formulation (typically 300 nt to 15 kb). The Bioneer suite's design-history-file-ready output bundle provides the sequence-level provenance that a CMC reviewer needs to tie the design to these analytical specifications.
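The suggested criteria can be encoded as a simple machine-checkable gate. The sketch below uses the numeric limits quoted in the paragraph above; the key names and the `evaluate_batch` function are hypothetical and are not part of the suite's output schema.

```python
from typing import Callable, Dict, List

# Limits taken from the suggested acceptance criteria; key names are illustrative.
CRITERIA: Dict[str, Callable[[float], bool]] = {
    "capping_efficiency_pct": lambda v: v >= 95.0,
    "polyA_length_nt": lambda v: 100.0 <= v <= 150.0,
    "polyA_length_cv_pct": lambda v: v < 10.0,
    "residual_dsRNA_ng_per_ug": lambda v: v <= 1.0,
    "residual_template_DNA_pg_per_ug": lambda v: v <= 10.0,
    "endotoxin_EU_per_ug": lambda v: v <= 0.5,
    "integrity_pct_full_length": lambda v: v >= 80.0,
}


def evaluate_batch(measured: Dict[str, float]) -> List[str]:
    """Return a list of failure messages; an empty list means all criteria pass."""
    failures = []
    for name, passes in CRITERIA.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
        elif not passes(measured[name]):
            failures.append(f"{name}: {measured[name]} outside acceptance limit")
    return failures
```

A batch that reports 90% capping efficiency with every other attribute in range would return a single failure message for `capping_efficiency_pct`. The limits themselves remain a starting point for discussion with CMC and regulatory colleagues, not fixed specifications.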

Cost-of-Goods Perspective

mRNA manufacturing cost is dominated by rNTP consumption (especially modified rNTPs for m1Ψ) and by purification. Sequence-level decisions influence COGS through: (i) mRNA length (shorter = cheaper per dose but may compromise expression); (ii) IVT yield (suppressing T7 pauses and poly-T runs materially improves reaction yield per unit rNTP); (iii) capping efficiency (poor capping requires overformulation or enzymatic re-capping, both costly); (iv) residual-dsRNA burden (higher dsRNA triggers larger purification losses). The Bioneer suite's sequence-level choices therefore have downstream COGS consequences that compound across a commercial launch campaign. For a late-stage program planning a commercial launch at tens to hundreds of millions of doses per year, the cumulative effect of sequence-level optimization on COGS is material.

Regulatory-Grade Design Provenance

Regulatory dossiers for mRNA drug products (IND, BLA, MAA) require sequence-level traceability that maps each design decision to its rationale and shows that the chosen sequence was derived by a documented, reproducible process. The Bioneer suite's config-plus-seed-plus-checkpoint-plus-report output bundle is structured to fit directly into the CMC section of an IND: the config and seed demonstrate reproducibility, the checkpoint enables exact regeneration, the report documents the optimization objective and the fitness breakdown, and the FASTA is the final drug-substance sequence. The tool-version hash and database checksum provide the software-integrity trail required by 21 CFR Part 11 and GAMP 5. In practice, this bundle reduces the effort of retrofitting compliance onto a design at IND-filing time from weeks of documentation to a few days of review.

13. Integration, QC, and Limitations

13.1 Pipeline Integration

The Bioneer suite is designed for integration into a larger mRNA CMC and design-history pipeline. Inputs are files (FASTA, GenBank, JSON); outputs are files in structured, machine-readable formats (JSON, CSV) in addition to the human-readable HTML and PDF. Exit codes are deterministic (0 for success, non-zero for documented failure modes). Batch mode supports parallel job execution with per-job output directories. The JSON output schema is versioned and stable across minor releases, so downstream pipeline components do not break when the tool is updated.
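A downstream component can rely on these properties with a thin wrapper. The sketch below is illustrative only: the command line is supplied by the caller because the suite's actual CLI arguments are not reproduced here, and the `schema_version` field name and the supported major version are assumptions about the JSON summary.

```python
import json
import subprocess
from pathlib import Path

# Assumed major version of the JSON summary schema this wrapper understands.
SUPPORTED_SCHEMA_MAJOR = 1


def run_design_job(cmd, summary_path):
    """Run a design job, fail loudly on a non-zero exit code, and load its JSON summary.

    `cmd` is the full argument list for the design tool (caller-supplied);
    `summary_path` is where the job is expected to write its JSON summary.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"design job failed with exit code {proc.returncode}: {proc.stderr.strip()}"
        )

    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))

    # Guard against a schema this wrapper was not written for.
    major = int(str(summary["schema_version"]).split(".")[0])
    if major != SUPPORTED_SCHEMA_MAJOR:
        raise ValueError(f"unsupported schema major version: {major}")
    return summary
```

Checking only the major component reflects the stated guarantee that the schema is stable across minor releases.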

Typical integration patterns include: (i) a bioinformatics LIMS that submits design jobs, stores the returned JSON, and exposes metrics in a dashboard; (ii) a synthesis-vendor submission script that reads the FASTA and attaches the configuration JSON as the order-history record; (iii) an ELN that attaches the HTML report to the design experiment; (iv) a GMP batch-record system that archives the full results directory as part of the design-history file.

13.2 Recommended QC Wraparound

Confirm determinism on sensitive runs: re-execute the run from the saved config and seed, and confirm that the output FASTA is byte-identical.
Run a second structural prediction with an independent tool (for example RNAstructure or ViennaRNA's RNAfold at a different temperature) as an orthogonal check on the reported MFE.
Submit the output FASTA to the synthesis vendor's own QC tool and confirm that no additional constraints are flagged; if flagged, update the local vendor template.
For therapeutic-grade design, review the predicted secondary structure visually for cap-proximal stems, junction obstructions (circRNA), and long stems that might form dsRNA substrates.
For batch pipelines, log the tool-version hash and the database checksum of every run; store them in the LIMS or ELN alongside the output files.
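The determinism check in the first item can be automated with the standard library. This is a minimal sketch; the function names are hypothetical, and the two paths stand for the FASTA from the original run and the FASTA from the re-execution.

```python
import filecmp
import hashlib
from pathlib import Path


def fasta_digest(path) -> str:
    """SHA-256 of the file contents, suitable for logging in a LIMS or ELN."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def confirm_determinism(original_fasta, rerun_fasta) -> bool:
    """True only if the two files are byte-identical.

    shallow=False forces a content comparison instead of a size/mtime check.
    """
    return filecmp.cmp(original_fasta, rerun_fasta, shallow=False)
```

Logging the digest alongside the tool-version hash and database checksum gives a compact record that a later re-run can be compared against without retaining both files.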

13.3 Known Limitations

Thermodynamic folding is a prediction, not a measurement. Structural assessments should be validated by chemical probing (DMS-MaPseq, SHAPE) for any sequence critical to a therapeutic program.
Immunogenicity is a composite score calibrated against published correlates; it is not a substitute for in vitro or in vivo immunogenicity testing.
Codon-usage tables are organism-level averages; tissue- and cell-type-specific tRNA pools can create second-order effects not captured in a generic CAI.
For very long sequences (≥ 10 kb), LinearFold beam-search accuracy degrades relative to exact folding; users with unusual structural requirements may need to fold sub-segments with a more expensive method.
The tool does not currently model post-transcriptional modifications (m6A, m5C, Ψ) beyond the uridine-to-pseudouridine substitution's implicit effect on immunogenicity scoring.
UTR libraries are curated snapshots; for the latest literature UTRs, users may wish to refresh from the configured source or supply custom UTR sequences.
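For the long-sequence limitation, one way to fold sub-segments is to tile the construct with overlapping windows and pass each window to an exact folding method. The sketch below only computes the window coordinates; the function name and the window and overlap sizes are assumptions, and the choice of folding tool is left to the user.

```python
def tile_windows(seq_len: int, window: int, overlap: int):
    """Return half-open (start, end) spans that cover a sequence of length seq_len.

    Consecutive windows share `overlap` nucleotides; the last window is shifted
    so that it ends exactly at the 3' end of the sequence.
    """
    if seq_len <= 0 or window <= 0 or not 0 <= overlap < window:
        raise ValueError("require seq_len > 0, window > 0, and 0 <= overlap < window")
    if seq_len <= window:
        return [(0, seq_len)]

    step = window - overlap
    spans = []
    start = 0
    while start + window < seq_len:
        spans.append((start, start + window))
        start += step
    spans.append((seq_len - window, seq_len))
    return spans
```

For a 10 kb replicon with 4 kb windows and 1 kb overlap this yields three spans: 0 to 4000, 3000 to 7000, and 6000 to 10000. Windowed folding cannot recover base pairs that span more than one window, so it is a check on local structure rather than a replacement for the global fold.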

14. Regulatory Considerations

14.1 Data Integrity (ALCOA+ Considerations)

ALCOA+, an extension of the FDA-originated ALCOA principles (Attributable, Legible, Contemporaneous, Original, Accurate) with the additional "+" requirements (Complete, Consistent, Enduring, Available), is the data-integrity framework widely applied to GxP-regulated software. The Bioneer suite's output bundle is designed to meet each principle by construction: every run has an identifiable operator and host (Attributable), produces plain-text and open-vector outputs (Legible, Enduring), records a timestamp on every checkpoint (Contemporaneous), preserves the original config and checkpoint (Original), is reproducible from seed and config (Accurate), retains all intermediate metrics (Complete), uses a stable report schema (Consistent), and ships as a self-contained portable directory (Available). The enclosing electronic-records system (LIMS, ELN, document-management system) provides the signature, access-control, and audit-trail layer that completes the compliance envelope under 21 CFR Part 11.

14.2 Software Dependencies

The suite relies on widely used, open-source scientific-Python dependencies: NumPy and SciPy for numerical operations, Numba for JIT compilation, h5py for the HDF5 codon-usage database, Matplotlib for static plots, ReportLab or WeasyPrint for PDF generation, and bundled ViennaRNA and LinearFold libraries for RNA folding. Each dependency is pinned to a specific version in the deployment manifest; dependency updates are managed via a documented change-control process and include re-running the functional-test suite. The dependency set is small, well-maintained, and subject to ongoing security patching.
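A site can verify that an installed environment still matches the pinned versions before a validated run. The sketch below uses Python's standard `importlib.metadata`; the `check_pins` function and the example pins are hypothetical and do not reproduce the suite's actual deployment manifest.

```python
from importlib import metadata


def check_pins(pins):
    """Compare installed distribution versions against a {name: version} mapping.

    Returns a dict of discrepancies; an empty dict means the environment matches.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"installed {found}, pinned {wanted}"
    return problems


if __name__ == "__main__":
    # Example pins for illustration only; real values come from the deployment manifest.
    example_pins = {"numpy": "1.26.4", "scipy": "1.11.4"}
    for package, issue in check_pins(example_pins).items():
        print(f"{package}: {issue}")
```

Running a check of this kind at the start of each batch, and archiving its result with the run outputs, gives change control a simple record of the environment state.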

No external network call is made during a design run; the tool operates entirely on local inputs and local databases, which is an important consideration for client-deployed instances handling proprietary sequences.

14.3 Detailed Regulatory Framework Alignment

Software Validation Under GAMP 5

The Bioneer suite is positioned as a GAMP 5 Category 3 or Category 4 software product depending on how a specific site configures it. Category 3 (non-configured, used as shipped) applies when the site uses default templates and default constraint libraries; Category 4 (configured) applies when the site imports custom synthesis-vendor templates, custom host-expression templates, custom UTR libraries, or custom codon-usage tables. Both categories require risk-based validation; the suite ships with a functional test suite that exercises representative inputs and verifies outputs, and IQ/OQ/PQ protocol templates are available as a deliverable for customers requiring a formal validation package.

21 CFR Part 11 Considerations

Part 11 compliance is a system-level property rather than a tool-level property. The suite contributes to Part 11 compliance by producing outputs whose integrity can be verified (every output file is plain-text or standard-format, and because every run is deterministic from the saved config and seed, an altered file can be detected by regenerating it) and by recording attribution metadata (operator, host, timestamp) in the run manifest. The enclosing electronic-records management system is responsible for access control, e-signature, and the audit trail of record modifications. Clients operating in a 21 CFR Part 11 environment typically store the suite's output directories in a controlled-document repository and pair them with their own e-signature layer.
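To make the attribution and integrity ideas concrete, the sketch below builds a manifest of per-file SHA-256 digests together with operator, host, and timestamp, and later reports which files no longer match. The field names and function names are assumptions for illustration and do not describe the suite's actual manifest format.

```python
import getpass
import hashlib
import socket
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(results_dir, operator=None):
    """Record who ran the job, where, when, and a SHA-256 digest of every output file."""
    root = Path(results_dir)
    files = {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {
        "operator": operator or getpass.getuser(),
        "host": socket.gethostname(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def changed_files(results_dir, manifest):
    """Return the sorted names of files added, removed, or altered since the manifest."""
    current = build_manifest(results_dir, operator=manifest["operator"])["files"]
    recorded = manifest["files"]
    return sorted(
        name for name in set(current) | set(recorded)
        if current.get(name) != recorded.get(name)
    )
```

The manifest itself should be stored outside the results directory, or under the control of the electronic-records system, so that it is not included in its own digest set and cannot be edited together with the files it describes.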

ICH Q8 to Q14 Mapping

ICH Q8 (Pharmaceutical Development): the suite supports Quality by Design by making the optimization objective explicit, the critical quality attributes (CAI, GC, MFE, immunogenicity, structural integrity) explicit and reportable, and the design space (the range of tunable parameters) explicit.
ICH Q9 (Quality Risk Management): the fitness-term weights and rejection thresholds are risk-based, with hard constraints for the highest-risk features (T7 promoter mimics, cryptic splice sites) and soft constraints for lower-risk features (homopolymer length, local GC).
ICH Q10 (Pharmaceutical Quality System): deterministic outputs enable integration with CAPA and change control.
ICH Q11 (Development and Manufacture of Drug Substances): design-history traceability is provided via config plus seed plus checkpoint.
ICH Q12 (Lifecycle Management): the suite's versioning and checkpoint system supports lifecycle-phase-appropriate change management.
ICH Q14 (Analytical Procedure Development): report metrics map directly to analytical-release specifications.

FDA and EMA Specific Considerations

The FDA's 2022 guidance for gene therapy and 2023 discussion of mRNA vaccine CMC expectations converge on the need for sequence-level traceability, justification of each design decision, and documentation of the optimization objective used to select the final drug-substance sequence. The EMA's 2023 mRNA vaccine guideline adds explicit expectations for documenting the IVT-compatibility of the sequence, the capping strategy's sequence-level fit, and the immunogenicity profile. The Bioneer suite's output bundle addresses all of these expectations by construction; the remaining work for a regulatory submission is to contextualize the tool's decisions against the specific product's target product profile and clinical-pharmacology rationale.

Client-Site Deployment and Data-Integrity Envelope

The suite is delivered for on-client-premise deployment; it does not require external network connectivity, and no design input or output is sent to any external server. This is consistent with the expectations of biopharma clients handling proprietary or investigational-new-drug sequences. On deployment, the tool integrates with the client's data-integrity envelope — controlled storage for outputs, version-controlled configuration, identity-management for operator attribution, and change-control for template updates. The documented software-dependency set is pinned at delivery time and can be revalidated by the client as part of their periodic IT-security assessment.

15. References

Sharp, P. M., & Li, W. H. (1987). The codon adaptation index — a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281–1295.

Coleman, J. R., Papamichail, D., Skiena, S., Futcher, B., Wimmer, E., & Mueller, S. (2008). Virus attenuation by genome-scale changes in codon pair bias. Science, 320(5884), 1784–1787.

Kudla, G., Lipinski, L., Caffin, F., Helwak, A., & Zylicz, M. (2006). High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biology, 4(6), e180.

Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13), 3406–3415.

Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M., & Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. PNAS, 101(19), 7287–7292.

Reuter, J. S., & Mathews, D. H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129.

Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6, 26.

Huang, L., Zhang, H., Deng, D., Zhao, K., Liu, K., Hendrix, D. A., & Mathews, D. H. (2019). LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search. Bioinformatics, 35(14), i295–i304.

Zhang, H., Zhang, L., Lin, A., et al. (2023). Algorithm for optimized mRNA design improves stability and immunogenicity. Nature, 621, 396–403.

Karikó, K., Buckstein, M., Ni, H., & Weissman, D. (2005). Suppression of RNA recognition by Toll-like receptors: the impact of nucleoside modification and the evolutionary origin of RNA. Immunity, 23(2), 165–175.

Karikó, K., Muramatsu, H., Welsh, F. A., Ludwig, J., Kato, H., Akira, S., & Weissman, D. (2008). Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Molecular Therapy, 16(11), 1833–1840.

Pardi, N., Hogan, M. J., Porter, F. W., & Weissman, D. (2018). mRNA vaccines — a new era in vaccinology. Nature Reviews Drug Discovery, 17, 261–279.

Wesselhoeft, R. A., Kowalski, P. S., & Anderson, D. G. (2018). Engineering circular RNA for potent and stable translation in eukaryotic cells. Nature Communications, 9, 2629.

Vogel, A. B., Lambert, L., Kinnear, E., et al. (2018). Self-amplifying RNA vaccines give equivalent protection against influenza to mRNA vaccines but at much lower doses. Molecular Therapy, 26(2), 446–455.

Lundstrom, K. (2020). Self-amplifying RNA viruses as RNA vaccines. International Journal of Molecular Sciences, 21(14), 5130.

Grote, A., Hiller, K., Scheer, M., Münch, R., Nörtemann, B., Hempel, D. C., & Jahn, D. (2005). JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Research, 33(W), W526–W531.

Puigbò, P., Guzmán, E., Romeu, A., & Garcia-Vallvé, S. (2007). OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic Acids Research, 35(W), W126–W131.

Chin, J. X., Chung, B. K.-S., & Lee, D.-Y. (2014). Codon Optimization OnLine (COOL): a web-based multi-objective optimization platform for synthetic gene design. Bioinformatics, 30(15), 2210–2212.

Hoover, D. M., & Lubkowski, J. (2002). DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Research, 30(10), e43.

Alexaki, A., Kames, J., Holcomb, D. D., et al. (2019). Codon and codon-pair usage tables (CoCoPUTs): facilitating genetic variation analyses and recombinant gene design. Journal of Molecular Biology, 431(13), 2434–2441.

Presnyak, V., Alhusaini, N., Chen, Y.-H., et al. (2015). Codon optimality is a major determinant of mRNA stability. Cell, 160(6), 1111–1124.

Leppek, K., Byeon, G. W., Kladwang, W., et al. (2022). Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. Nature Communications, 13, 1536.

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.

Bruccoleri, R. E., & Heinrich, G. (1988). An improved algorithm for nucleic acid secondary structure display. Computer Applications in the Biosciences, 4(1), 167–173.

ICH Q8(R2), Q9, Q10, Q11, Q12, Q14. International Council for Harmonisation, Pharmaceutical Quality guidelines.

FDA (2022). Chemistry, Manufacturing, and Control (CMC) Information for Human Gene Therapy Investigational New Drug Applications (INDs) — Guidance for Industry.

EMA (2023). Guideline on the quality aspects of mRNA vaccines.

WHO (2022). WHO guidelines on the quality, safety and efficacy of messenger RNA vaccines for the prevention of infectious diseases.

ISPE GAMP 5 (2008, 2022 update). A Risk-Based Approach to Compliant GxP Computerized Systems.

16. Customer Evaluation Checklist — Frequently Asked Questions

The following checklist summarizes the practical questions a prospective customer typically asks when evaluating a sequence-design tool for internal adoption. Each question is answered in this whitepaper; this section gathers the answers into one place for rapid reference.

Does the Tool Cover My Modality?

Yes. The suite covers linear mRNA (GeneCrafter for CDS, IVTDesigner for full construct, UTRDesigner for UTR-only work), self-amplifying RNA (SaRNADesigner with three alphavirus backbones), and circular RNA (CircularDesigner with four circularization scaffolds). No single commercial or academic alternative covers all three modalities in a unified interface.

Will the Output Pass My Preferred Synthesis Vendor's QC?

Yes, by design. The synthesis-vendor template system enforces IDT, Twist, GenScript, ATUM, and user-extensible vendor profiles at the GA fitness level. Internal benchmarks show >95% first-pass synthesis success when the active vendor template is enforced versus ~70% for CAI-only optimization without template enforcement.

Is the Tool Auditable for Regulatory Submissions?

Yes. Every run produces a reproducible bundle (config + seed + checkpoint + manifest + report). The output is ALCOA+-compatible by construction; the enclosing electronic-records system (LIMS, ELN, DMS) provides access control and e-signature. Validation packages (IQ/OQ/PQ templates) are available for GAMP 5 Category 3/4 deployment.

What Is the Data-Egress Profile?

Zero. The suite runs entirely on client infrastructure. No sequence or configuration is transmitted to any external server during a design run. This is material for programs handling proprietary sequences under IND or related confidentiality obligations.

What Human Effort Does the Tool Replace?

A single run (minutes to hours, depending on sequence length and mode) replaces roughly the effort a senior sequence-design scientist would spend running a CAI optimizer, a structure check, a synthesis-vendor QC scan, an immunogenicity evaluation, a UTR selection, and a report write-up, typically one to three days per design. The tool does not replace the judgment involved in interpreting the output; it replaces the mechanical labor of generating it.

What Training Is Required?

A molecular biologist or bioinformatician with basic Python CLI experience can run the tool after one hour of onboarding on the principal arguments. Interpreting the reports requires familiarity with CAI, MFE, UTR biology, and the relevant therapeutic-modality considerations — knowledge that is already part of the scientific team's baseline competency for any mRNA program.

How Is the Tool Maintained?

The suite is under active development by Bioneer. Codon-usage tables are refreshable from the CoCoPUTs source; UTR libraries are curated and versioned; synthesis-vendor templates are updated as vendors publish new constraints. Tool versions follow semantic versioning (major.minor.patch); the output manifest records the exact version used so that a future-version run can be compared to a past-version run on the same sequence.

What Are the Known Failure Modes?

Known limitations are documented transparently in §13.3. Principal failure modes: for very long sequences (≥ 10 kb) LinearFold accuracy degrades relative to exact folding; immunogenicity is a composite score that does not substitute for wet-lab testing; codon-usage tables are organism-level averages and may not capture tissue-specific effects. In each case the workaround is documented.

Is There a Way to Try the Tool Before Committing?

Yes. A limited-scope pilot on one or two customer sequences can be arranged; the pilot produces the full output bundle using the customer's preferred synthesis vendor and host context, and the customer can compare the output against their existing tool's output on the same inputs before adopting the suite for production use.

17. Glossary

ALCOA+ — data-integrity principles: Attributable, Legible, Contemporaneous, Original, Accurate; plus Complete, Consistent, Enduring, Available.
ARCA — Anti-Reverse Cap Analog; a 5' cap analog incorporated co-transcriptionally during IVT, modified so that it can only be incorporated in the correct orientation.
ARE — AU-Rich Element; 3' UTR sequence feature associated with mRNA decay (canonical motif AUUUA, written ATTTA in DNA notation; class 1–3 by tandem repeat count).
CAI — Codon Adaptation Index; geometric-mean metric of codon bias relative to a reference set (Sharp & Li 1987).
CDS — Coding Sequence; the portion of an mRNA that is translated into protein.
CleanCap — Co-transcriptional capping reagent (TriLink); requires a matching initiating dinucleotide at +1/+2: AG for CleanCap AG, or AU for CleanCap AU (AT in the DNA template).
CoCoPUTs — Codon and codon-pair usage tables derived from GenBank; the source of Bioneer's HDF5 codon-usage database.
CpG — Cytidine-phosphate-Guanosine dinucleotide; innate-immune and ZAP-recognition motif; depleted in vaccine sequences.
CPB — Codon Pair Bias; the propensity of a codon pair to co-occur beyond what single-codon frequencies predict (Coleman et al. 2008).
CSE — Conserved Sequence Element; structured region in alphavirus replicons essential for replicase function.
dsRNA — double-stranded RNA; MDA5/TLR3 immune-sensor substrate; minimized in mRNA design.
GA — Genetic Algorithm.
GAMP 5 — Good Automated Manufacturing Practice, 5th edition; software categorization and validation framework.
IRES — Internal Ribosome Entry Site; cap-independent translation initiation element.
IVT — In Vitro Transcription; enzymatic synthesis of RNA from a DNA template using T7, SP6, or T3 polymerase.
Kozak — consensus translation-initiation context around the AUG start codon.
LinearDesign — joint CAI+MFE optimization algorithm (Zhang et al. 2023).
LinearFold — linear-time beam-search RNA folding algorithm (Huang et al. 2019).
LNP — Lipid Nanoparticle; the formulation vehicle used for clinical mRNA delivery.
m1Ψ — N1-methylpseudouridine; the nucleoside modification used in Comirnaty and Spikevax.
MFE — Minimum Free Energy; thermodynamic descriptor of the most stable RNA fold.
NSGA-II — Non-dominated Sorting Genetic Algorithm II; Pareto-frontier multi-objective optimizer (Deb et al. 2002).
Naview — radial-tree 2-D layout algorithm for RNA secondary structure (Bruccoleri & Heinrich 1988).
PIE — Permuted Intron-Exon; Group-I intron engineering for circular RNA design.
PSSM — Position-Specific Scoring Matrix; used for cryptic splice-site detection.
RdRp — RNA-dependent RNA Polymerase; replicates saRNA.
saRNA — self-amplifying RNA; alphavirus-replicon-based vaccine platform.
SGP — Subgenomic Promoter; alphavirus internal promoter driving expression of the downstream ORF.
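Tornado — Twister-optimized RNA for durable overexpression; circular RNA strategy in which flanking twister ribozymes self-cleave and the resulting ends are joined by the cellular RtcB ligase.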
uORF — upstream open reading frame; 5' UTR feature that can reduce main-ORF translation.
UTR — Untranslated Region; 5' or 3' non-coding portion of an mRNA.
Viennarnaplot — the suite's SVG/PDF 2-D structure renderer.
ZAP — Zinc-finger Antiviral Protein; CpG-dependent RNA-recognition innate sensor.
Zuker — O(n³) thermodynamic MFE recursion.