CircularDesigner

Circular RNA (circRNA) Design for Cap-Independent, Exonuclease-Resistant Therapeutics

Technical Whitepaper

Version 1.3 (2026-04) | Bioneer Corporation

End-to-end circRNA design: Group-I/II PIE, Tornado ribozyme, and inverted-repeat back-splice scaffolds; curated IRES library; junction thermodynamic validation; 10+ platform formats including TITAN, NEXUS, PROMETHEUS, CALYPSO.

1. Executive Summary

The Bioneer RNA/DNA Design Suite is an integrated family of five design tools that share a common optimization engine and report format but diverge in their biological focus: GeneCrafter (codon optimization for heterologous expression), IVTDesigner (in vitro-transcribed linear mRNA for therapeutic and research use), UTRDesigner (translation-initiation and stability engineering of 5' and 3' untranslated regions), SaRNADesigner (self-amplifying RNA replicon design based on alphavirus backbones), and CircularDesigner (covalently closed circular RNA design using permuted intron-exon, back-splicing, or ribozyme systems). Each tool accepts either a DNA coding sequence or a protein sequence, resolves a target organism codon-usage profile, runs a genetic-algorithm (GA) population-based search with tool-specific fitness terms, applies a deterministic structural and constraint post-processing pass, and returns a ranked set of candidate sequences together with a full human-readable HTML report, a print-ready PDF, machine-readable JSON and CSV summaries, and synthesis-ready FASTA files.

CircularDesigner is the suite's tool for designing circular RNA (circRNA) therapeutics. Circular RNA is a covalently closed topology that lacks 5' and 3' ends; it is therefore intrinsically resistant to 5'→3' and 3'→5' exonuclease degradation, yielding half-lives of days in circulation versus hours for linear mRNA. Without a 5' cap, circRNA translates via a built-in IRES; CircularDesigner embeds a library of validated IRES sequences (CVB3, EMCV, HCV, CrPV) with type-specific linkers and optimizes the IRES-CDS context for accessibility. Circularization is achieved by one of four engineered scaffolds: Group-I permuted intron-exon (PIE, based on T4 td or Anabaena), Group-II PIE, Tornado ribozyme (tandem twister + HDV), or inverted-repeat back-splicing. CircularDesigner evaluates the splicing junction thermodynamics (P1 stem for Group-I, D4 stem for Group-II), scans for cryptic splice sites via a PSSM, and outputs the complete pre-circularization linear precursor ready for IVT.

What a Customer Gets in One Run

A ranked list of optimized candidate sequences (typically Rank 1 plus seven alternates) ready for DNA-synthesis vendor submission.
An interactive HTML report with drill-down per candidate covering codon usage, GC-sliding-window traces, homopolymer and repeat landscape, predicted RNA secondary structure, CpG/UpA dinucleotide frequency, predicted immunogenicity, and full fitness-component breakdown.
A print-ready PDF report with the same technical content, rendering RNA secondary structures as scalable vector objects that remain legible at any zoom level, suitable for project archives, regulatory submissions, and internal design-history files.
Machine-readable JSON and CSV summaries for pipeline integration, lab-automation platforms, and electronic lab-notebook (ELN) ingestion.
FASTA sequence files, ready for direct submission to commercial synthesis vendors such as IDT, Twist Bioscience, or GenScript with their template-specific constraint profile already applied upstream.
A deterministic reproducibility record — the original configuration JSON, the random seed, the GA checkpoint chain, and the software-version hash — so any report can be regenerated bit-for-bit years after the original run.

Why the Suite Matters for mRNA Therapeutic Development

mRNA-based drugs and vaccines have moved from an academic curiosity to a central pillar of the biopharma pipeline in under a decade. Regulatory approvals of Comirnaty (BNT162b2) and Spikevax (mRNA-1273) against SARS-CoV-2 validated the modality at industrial scale, and as of 2026 the global mRNA pipeline includes therapeutic cancer vaccines, protein-replacement therapies for monogenic disease, regenerative-medicine products that transiently deliver reprogramming factors, in-situ-expressed antibodies, and self-amplifying and circular RNA platforms that promise dose sparing and longer duration of expression. Every one of these products ultimately succeeds or fails at the sequence level: codon choices that look innocuous in isolation can halve translation throughput, move global GC content into a range that elicits innate-immune sensors, introduce repeats that block high-fidelity gene synthesis, or create hidden splice sites that cause aberrant products in cells. The Bioneer RNA/DNA Design Suite exists to make those sequence-level decisions rigorous, reproducible, and defensible in front of synthesis vendors, CMC reviewers, and regulatory authorities.

How to Read This Whitepaper

This whitepaper is written with three audiences in mind. For scientists who will run the software, it documents the biological motivation for each fitness term, the precise algorithm behind each report number, and the operational defaults. For project managers and program leaders, it frames where the tool sits in the broader mRNA therapeutic development pipeline, what decision it supports, and what customer-acceptance gates it enables. For regulatory and quality-assurance staff, it summarizes compliance with published method requirements and commercial-software expectations — ALCOA+ data integrity, GAMP 5 categorization, 21 CFR Part 11 alignment, ICH Q8–Q14 development principles, and comparability to widely cited academic and commercial tools including ViennaRNA, LinearDesign, LinearFold, DNAWorks, JCat, OPTIMIZER, COOL, ThermoFisher GeneArt GeneOptimizer, GenScript OptimumGene, IDT Codon Optimization Tool, and ATUM GeneGPS.

Design Principles

The suite is built around six design principles that are worth stating explicitly. First, biology-awareness: every fitness term has a biological rationale, and no term is a black-box ML output. Second, transparency: every parameter is documented, every threshold is named in the report, and the optimization objective can be inspected before and after every run. Third, reproducibility: the combination of config, seed, and checkpoint is sufficient to regenerate any output byte-for-byte. Fourth, composability: the five tools share a JSON schema and can be chained end-to-end without format conversion. Fifth, audit-readiness: outputs are ALCOA+-compatible by construction, and the bundle is portable. Sixth, vendor-neutrality: synthesis-vendor templates are first-class and easy to extend, so the tool is not locked to a single synthesis vendor.

Scope and Non-Scope

This tool operates at the sequence level. It does not replace wet-lab testing, structural biology refinement, or in-vivo pharmacology. It does not assess protein function directly; it assesses the sequence-level determinants of expression, stability, and immune behavior that influence function. It is a force multiplier on top of informed wet-lab practice, not a substitute for it. A design delivered by the tool should be validated empirically before it is advanced to the next stage of development; the tool's job is to maximize the probability that the validation succeeds and to minimize the number of wet-lab iterations required to converge.

2. Biological Foundation and Therapeutic Context

2.1 Why Synonymous Codons Are Not Equivalent

The standard genetic code is redundant: 61 sense codons encode 20 amino acids, so most amino acids have multiple synonymous codons. The classical textbook position was that synonymous substitutions are "silent" at the protein level and therefore biologically neutral. Four decades of experimental work have overturned that view decisively. Synonymous codon choice influences the efficiency of transcription and translation, the co-translational folding trajectory of the nascent polypeptide, mRNA secondary structure and half-life, splicing fidelity, nuclear export rates, innate-immune recognition, and the yield of heterologous expression and in vitro synthesis. A protein whose sequence is identical at every amino acid position can, depending on codon choice, express at levels that differ by one or even two orders of magnitude — or fail to express entirely.

The practical consequence is that the same protein, encoded by two different synonymous sequences, can express at radically different levels in the same cell or cell-free system, fold with different accuracy, trigger different innate-immune responses, and — for sequences destined for gene synthesis — present completely different synthesis-cost-and-yield profiles to a synthesis vendor. This is why every serious mRNA or protein-expression program treats codon optimization as a distinct, quantitative engineering step rather than a cosmetic cleanup.

Codon Adaptation Index (CAI)

The Codon Adaptation Index, introduced by Sharp and Li in 1987, reduces codon choice to a single scalar between 0 and 1. For each codon, a relative adaptiveness w is computed from the frequency of that codon divided by the frequency of the most-used synonym for the same amino acid, measured from a reference set of highly expressed genes in the target organism. The CAI of a coding sequence is the geometric mean of the relative adaptiveness values of its codons. Classical interpretation: genes whose CAI is close to 1 use the codons preferred by the organism's translational machinery and tend to be well expressed; genes with CAI near 0.5 or below tend to express poorly. CAI remains the single most widely used codon-optimization metric and is embedded in every serious commercial and academic optimization tool.
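The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the suite's implementation: the function names, the four-codon toy table, and the decision to skip codons with unknown or zero weight are all assumptions made for the example.

```python
import math
from collections import defaultdict

def relative_adaptiveness(codon_counts, codon_to_aa):
    """w(codon) = count(codon) / count(most-used synonym of the same amino acid)."""
    best = defaultdict(int)
    for codon, n in codon_counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best[aa], n)
    return {c: n / best[codon_to_aa[c]]
            for c, n in codon_counts.items() if best[codon_to_aa[c]] > 0}

def cai(cds, weights):
    """Geometric mean of w over the codons of cds.

    Assumption: codons with unknown or zero weight are skipped.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Toy reference set (hypothetical counts, two amino acids only).
CODON_TO_AA = {"GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}
COUNTS = {"GCT": 20, "GCC": 40, "AAA": 10, "AAG": 40}
W = relative_adaptiveness(COUNTS, CODON_TO_AA)

print(cai("GCCAAG", W))  # both codons are the preferred synonym
print(cai("GCTAAA", W))  # both codons are non-preferred synonyms
```

In a real run the count table would come from a reference set of highly expressed genes in the target organism, as described above.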

CAI has well-known limitations. It does not account for codon-pair effects, for local secondary structure, for tRNA pool differences among cell types or growth conditions, or for the benefits of codon-usage variety in co-translational folding. It is possible for a sequence to have CAI = 1.0 yet still express poorly because of a strong 5' UTR hairpin, a repeat that stalls ribosomes, or a cluster of rare codons at a folding intermediate. For these reasons, every tool in the Bioneer suite treats CAI as one of several objectives, not as the entire objective.

Codon Pair Bias and Context Effects

CAI treats each codon independently, but measured ribosome kinetics depend on neighboring codons too — the so-called codon-pair bias. Coleman et al. (2008) famously exploited this effect by deliberately deoptimizing codon pairs in poliovirus to produce live-attenuated vaccine strains, demonstrating that codon-pair deoptimization can suppress viral replication by multiple logs while leaving amino-acid sequence untouched. The codon-pair effect is believed to reflect steric and decoding constraints at the ribosomal A- and P-sites, where the tRNA-pair geometry matters. Bioneer's tools evaluate codon-pair bias as a secondary metric (the CPB score), and some of them allow CPB to be explicitly included or excluded from the optimization objective.

A related but distinct concept is tRNA adaptation, quantified by the tRNA Adaptation Index (tAI), which weights codons by the abundance and decoding efficiency of cognate tRNAs rather than by codon-usage frequency. tAI is more mechanistic than CAI but requires organism-specific tRNA-copy-number data that is not always available with high reliability. The Bioneer suite's CAI implementation is extensible to tAI-style weighting when the underlying codon-usage database is supplemented with tRNA abundance.

GC Content — Global and Local

Global GC content influences mRNA thermal stability and translation kinetics. In mammalian cells, GC-rich mRNAs tend to be longer-lived, exported more efficiently, and translated at higher rates than AU-rich mRNAs of otherwise equivalent sequence. Kudla et al. (2006) reported several-fold to more than 100-fold higher expression from GC-rich synonymous variants of reporter transgenes in human cells, which they attributed to higher steady-state mRNA levels rather than to direct effects on translation. GC-rich transcripts, however, form more stable secondary structure, and that structure can block cap-dependent scanning when it forms within the first 30–60 nucleotides of the 5' UTR or CDS.

Local GC content, measured in sliding windows of 30 to 60 nucleotides, is the more operationally important metric for gene synthesis. Synthesis vendors impose windowed GC constraints — typically 25–75% for standard products and narrower 30–70% for higher-stringency clonal products — because very low or very high local GC disrupts phosphoramidite coupling and oligonucleotide assembly. A gene with globally acceptable GC content can still contain short windows of extreme GC bias that fail synthesis-QC. Bioneer's tools therefore evaluate GC content both globally (for biological fit) and in a sliding window (for synthesis feasibility), with window size and acceptance limits configurable per synthesis vendor profile.
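The distinction between global and windowed GC can be made concrete with a short sketch. The function names and the default window of 50 nt are assumptions for illustration; the 25–75% limits are the standard-product range quoted above.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_window_violations(seq, window=50, lo=0.25, hi=0.75):
    """Return (start, gc_fraction) for every window outside [lo, hi].

    Sequences shorter than the window are evaluated as a single window.
    """
    violations = []
    for start in range(0, max(len(seq) - window + 1, 1)):
        frac = gc_fraction(seq[start:start + window])
        if frac < lo or frac > hi:
            violations.append((start, frac))
    return violations

# A sequence with acceptable global GC that still fails locally.
example = "AT" * 25 + "GC" * 25
print(gc_fraction(example))                # global GC
print(len(gc_window_violations(example)))  # number of failing 50-nt windows
```

The example sequence has a global GC of 50% yet contains windows at 0% and 100% GC, which is exactly the case a global-only check would miss.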

Minimum Free Energy and Local RNA Secondary Structure

Single-stranded mRNA folds into secondary structure. The thermodynamically most stable fold is described by its Minimum Free Energy (MFE), computed as the most negative free-energy value over all possible base-pairing configurations. The canonical MFE algorithm is the Zuker dynamic programming recursion, refined over three decades by Mathews and collaborators and implemented most widely in ViennaRNA's RNAfold and Mathews' RNAstructure. Zuker's O(n³) time complexity becomes a bottleneck for mRNAs longer than a few hundred nucleotides; for therapeutic mRNAs of 1–4 kilobases and saRNA replicons of 10 kilobases, alternatives are mandatory.

LinearFold, introduced by Huang and collaborators in 2019, re-casts RNA secondary structure prediction as a beam-search over a left-to-right decoding of the sequence, yielding O(n) time and linear memory usage with empirically negligible accuracy loss on native and synthetic RNA benchmarks. LinearFold made full-length therapeutic mRNA folding tractable inside an optimization loop rather than as a one-shot post-hoc analysis. LinearDesign, from the same group (Zhang et al., Nature 2023), extended the paradigm to co-optimization of codon choice and minimum free energy via a lattice-based dynamic program that enumerates synonymous translations while simultaneously computing MFE, yielding joint CAI–MFE Pareto-optimal sequences for SARS-CoV-2 spike and other mRNA targets.

For the Bioneer suite, structural evaluation is not a single-method call but a hybrid: short sequences or short windows are folded with the exact Zuker recursion (via a refactored, JIT-accelerated RNAFold kernel); longer sequences use LinearFold with configurable beam size; very long constructs (saRNA and circRNA precursors above ~3 kb) are processed in a sliding-window Zuker-seeded LinearFold, in which short windows are folded exactly, their high-confidence pairs are passed as soft constraints to a global LinearFold call, and the combined result is scored. The customer-visible benefit is that reported MFE and structural-penalty values remain meaningful across the full length range of therapeutic RNA, not just the short sequences where exact folding was historically feasible.
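The length-based dispatch described above might be organized as follows. This is a sketch under stated assumptions: the strategy labels, the 500-nt cutoff for exact folding, and the 300-nt window with 150-nt step are hypothetical values chosen for illustration; only the ~3 kb threshold for windowed processing comes from the text. No folding library is called here.

```python
def choose_fold_strategy(length, exact_max=500, windowed_min=3000):
    """Pick a folding strategy from sequence length.

    exact_max is a hypothetical cutoff; windowed_min reflects the ~3 kb figure.
    """
    if length <= exact_max:
        return "zuker_exact"
    if length <= windowed_min:
        return "linearfold"
    return "windowed_zuker_seeded_linearfold"

def overlapping_windows(length, size=300, step=150):
    """Half-open (start, end) spans covering the sequence with overlap.

    A final span is appended if the regular grid stops short of the 3' end.
    """
    if length <= size:
        return [(0, length)]
    spans = [(s, s + size) for s in range(0, length - size + 1, step)]
    if spans[-1][1] < length:
        spans.append((length - size, length))
    return spans

for n in (200, 1500, 10000):
    print(n, choose_fold_strategy(n))
print(overlapping_windows(700))
```

In the windowed strategy, each span would be folded exactly and its high-confidence pairs handed to the global fold as soft constraints; that step depends on the folding backend and is omitted.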

Repeat Landscape and Low Complexity

Direct and inverted repeats, along with low-complexity homopolymeric runs, produce two distinct failure modes: (i) synthesis failure, in which a gene-synthesis vendor's oligo-assembly pipeline fails to close the sequence, and (ii) biological aberrance, in which repeats form stem-loops that stall ribosomes, activate innate-immune sensors of double-stranded RNA, recruit RNA-binding proteins, or seed illegitimate recombination during replication. Each of the Bioneer tools tracks repeat metrics at three resolutions: homopolymer runs (A, C, G, T individual tract length), short tandem repeats (repeated motifs of length 2–10), and long inverted repeats (dsRNA-forming pairs of 20 nucleotides or longer). Acceptance thresholds are provider- and program-specific, reflecting the empirical fact that different synthesis chemistries tolerate different repeat classes to different degrees.
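The three repeat resolutions can be illustrated with a small scanner. The function names and the default thresholds (homopolymer length 6, three tandem copies) are assumptions for this sketch; the 2–10 nt motif range and the 20-nt inverted-repeat length come from the text. The inverted-repeat search only finds exact reverse-complement matches, so it should be read as a seed finder rather than a full dsRNA predictor.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def homopolymer_runs(seq, min_len=6):
    """(start, base, length) for each single-base run of at least min_len."""
    return [(m.start(), m.group(0)[0], len(m.group(0)))
            for m in re.finditer(r"A+|C+|G+|T+", seq.upper())
            if len(m.group(0)) >= min_len]

def tandem_repeats(seq, min_unit=2, max_unit=10, min_copies=3):
    """(start, unit, copies) for tandemly repeated motifs.

    Note: a long homopolymer also matches as a repeated 2-nt unit.
    """
    pattern = re.compile(r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
            for m in pattern.finditer(seq.upper())]

def inverted_repeat_seeds(seq, k=20):
    """(i, j) pairs where seq[j:j+k] is the exact reverse complement of seq[i:i+k], j >= i+k."""
    seq = seq.upper()
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    hits = []
    for i in range(len(seq) - k + 1):
        for j in index.get(revcomp(seq[i:i + k]), []):
            if j >= i + k:
                hits.append((i, j))
    return hits

arm = "ACGTTGCAAGGCTTAACCGG"
print(homopolymer_runs("ACGT" + "A" * 7 + "GC"))
print(tandem_repeats("TTCAGCAGCAGCAGTT"))
print(inverted_repeat_seeds(arm + "TTTTT" + revcomp(arm)))
```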

mRNA Innate-Immune Recognition

Exogenous single-stranded RNA activates the innate immune system through multiple receptors. TLR7 and TLR8 recognize uridine-rich single-stranded RNA in endosomes of plasmacytoid dendritic cells and macrophages respectively; TLR3 and the cytosolic sensors RIG-I and MDA5 recognize long double-stranded RNA; TLR9 recognizes unmethylated CpG motifs in DNA; and the interferon-induced protein kinase PKR and the 2'-5' oligoadenylate synthetase OAS are activated by structured or long dsRNA. For therapeutic mRNA, this innate-immune sensitivity is a double-edged sword: for a vaccine, some degree of innate stimulation can be adjuvant-like and desirable; for a protein-replacement therapeutic, innate activation causes rapid mRNA degradation, inflammatory adverse events, and dose-limiting toxicity.

The dominant pharmaceutical strategy is nucleoside modification: replacement of uridine with pseudouridine (Ψ) or, in current products, N1-methylpseudouridine (m1Ψ). The approach builds on the work of Karikó and Weissman, who shared the 2023 Nobel Prize in Physiology or Medicine for their nucleoside-modification discoveries, and it suppresses innate-immune activation while also stabilizing the transcript. Sequence-level complementary strategies include uridine depletion, CpG-dinucleotide avoidance, UpA-dinucleotide avoidance, suppression of dsRNA-forming inverted repeats, and selection of 5' and 3' UTR sequences known to be well tolerated. These sequence-level strategies matter even when nucleoside modification is used, because m1Ψ substitution does not hide a high-CpG sequence context from sensors such as ZAP (zinc-finger antiviral protein). The Bioneer suite's immunogenicity score composites CpG count, UpA count, uridine fraction, dsRNA-forming inverted-repeat count, and optional TLR motif flags into a single report metric, with configurable weights.
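A weighted composite of this kind might look like the sketch below. The feature names, the default weights, and the plain weighted-sum form are assumptions for illustration; the document states only which features are combined and that the weights are configurable. The inverted-repeat count is passed in rather than computed, and TLR motif flags are omitted.

```python
# Hypothetical default weights; real values would come from the run configuration.
DEFAULT_WEIGHTS = {
    "cpg_count": 1.0,
    "upa_count": 1.0,
    "u_fraction": 10.0,
    "inverted_repeat_count": 5.0,
}

def immunogenicity_features(seq, inverted_repeat_count=0):
    """Sequence-level features; accepts RNA (U) or DNA (T) alphabets."""
    s = seq.upper().replace("U", "T")
    if not s:
        raise ValueError("empty sequence")
    return {
        "cpg_count": s.count("CG"),
        "upa_count": s.count("TA"),
        "u_fraction": s.count("T") / len(s),
        "inverted_repeat_count": inverted_repeat_count,
    }

def immunogenicity_score(seq, weights=None, inverted_repeat_count=0):
    """Weighted sum of the features; higher means more predicted innate stimulation."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    features = immunogenicity_features(seq, inverted_repeat_count)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

print(immunogenicity_features("ACGUACGU"))
print(immunogenicity_score("ACGUACGU"))
```

Raw counts grow with sequence length, so a production score would presumably normalize them, for example per kilobase, before comparing candidates of different sizes.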

Translation Initiation and the Kozak Context

The rate-limiting step of translation for most cellular mRNAs is initiation. The scanning ribosome recognizes an AUG start codon in a context characterized by the Kozak consensus (originally GCCGCCACCATGG in mammalian mRNAs, with the purine at position -3 and the G at position +4 being the most functionally important positions). A strong Kozak context can increase protein yield by two- to five-fold over a weak context; the effect is particularly important for short mRNAs in which re-initiation events are rare. Bioneer tools that handle 5' UTRs evaluate Kozak context via a position-weighted score and allow the user to enforce the canonical context.
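A position-weighted Kozak score can be sketched as follows. The weights (3.0 at position -3, 2.0 at position +4, 1.0 elsewhere) and the normalization to a 0–1 range are assumptions for illustration; the text states only that -3 and +4 are the most important positions. Position -3 is scored as a match for either purine.

```python
CONSENSUS_UPSTREAM = "GCCGCCACC"  # positions -9 to -1 of GCCGCCACCATGG

def kozak_score(utr5, cds, w_minus3=3.0, w_plus4=2.0):
    """Weighted fraction of consensus positions matched around the start codon."""
    up = utr5.upper().replace("U", "T")[-9:].rjust(9, "N")
    cds = cds.upper().replace("U", "T")
    if cds[:3] != "ATG":
        raise ValueError("CDS must begin with ATG")
    score = total = 0.0
    for i, (observed, expected) in enumerate(zip(up, CONSENSUS_UPSTREAM)):
        position = i - 9  # -9 .. -1
        if position == -3:
            weight, match = w_minus3, observed in "AG"  # any purine
        else:
            weight, match = 1.0, observed == expected
        total += weight
        score += weight * match
    total += w_plus4
    score += w_plus4 * (cds[3:4] == "G")  # position +4
    return score / total

print(kozak_score("GGGGCCGCCACC", "ATGGCT"))  # full consensus
print(kozak_score("GCCGCCTCC", "ATGACT"))     # pyrimidine at -3, A at +4
```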

Upstream open reading frames (uORFs) in the 5' UTR can decoy ribosomes away from the main ORF and reduce main-ORF translation. uORF scanning is therefore a standard component of UTR design. Strong 5' secondary structure within the first 30 nucleotides can similarly block cap-binding-complex docking or scanning; Bioneer's cap-proximal MFE metric quantifies this risk.
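uORF scanning reduces to finding each ATG in the 5' UTR and reading in frame to the first stop codon. The sketch below assumes ATG-only initiation and a simple dictionary output; both are choices made for this example rather than documented tool behavior.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(utr5, cds):
    """List upstream ORFs whose ATG lies entirely within the 5' UTR.

    Each entry gives 0-based start and stop-codon positions in the joined
    UTR+CDS sequence, and whether the uORF runs into the main ORF.
    """
    full = (utr5 + cds).upper().replace("U", "T")
    utr_len = len(utr5)
    uorfs = []
    for start in range(utr_len - 2):
        if full[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(full) - 2, 3):
            if full[pos:pos + 3] in STOP_CODONS:
                uorfs.append({
                    "start": start,
                    "stop": pos,
                    "overlaps_main_orf": pos >= utr_len,
                })
                break
    return uorfs

print(find_uorfs("GGATGAAATAGGCCACC", "ATGGCTTAA"))
```

An upstream ATG that is in frame with the main ORF and has no intervening stop is reported as overlapping the main ORF, since it would produce an N-terminally extended product rather than a separate peptide.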

Cap, Poly(A) Tail, and mRNA Lifecycle

Eukaryotic mRNAs are bracketed by a 5' cap (typically the m7G cap0 or cap1 structure) and a 3' poly(A) tail of ~100–250 nucleotides. The cap recruits the eIF4F cap-binding complex for translation initiation; the poly(A) tail recruits poly(A)-binding protein (PABP), which interacts with eIF4G at the 5' end to promote closed-loop translation and protects the transcript from 3'-to-5' exonucleolytic decay. For therapeutic mRNA, the cap is installed either co-transcriptionally (CleanCap-AG, CleanCap-AT, or the ARCA anti-reverse cap analog) or post-transcriptionally (vaccinia-virus capping enzyme). Each chemistry has sequence-level requirements at the +1 transcription start: CleanCap-AG requires an AG initiator, ARCA tolerates GG or GA, and enzymatic capping is sequence-agnostic. Bioneer's IVTDesigner enforces these chemistry-specific constraints and flags sequences that would yield low capping efficiency.

Poly(A) tail length and composition influence both stability and translational efficiency. Encoded poly(A) stretches (as opposed to enzymatically added tails) face synthesis challenges — homopolymers of ≥100 A nucleotides are difficult to synthesize and clone — and Bioneer tools split the design of the encoded region from the length of the in vitro polyadenylation step performed downstream.

The Ribosome Elongation Cycle and Codon-Dependent Kinetics

Translation elongation is not uniform along a coding sequence. The ribosome's A-site accommodates an aminoacyl-tRNA whose anticodon matches the A-site codon; each accommodation event is a probabilistic race between cognate, near-cognate, and non-cognate tRNA species that happen to diffuse past. The rate of the accommodation step depends on the cellular abundance of the cognate tRNA, on the codon–anticodon interaction strength (including the wobble position), on the local mRNA secondary structure that may restrict ribosome access, and on the identity of the P-site tRNA that dictates the peptidyl-transferase reaction following accommodation. The practical upshot is that synonymous codon substitutions — changes that leave the protein sequence untouched — can dilate or compress the local ribosome dwell time by factors of two to five. Ribosome profiling experiments in yeast, bacteria, and mammalian cells have mapped these local velocity variations at nucleotide resolution and established that they are reproducible, codon-dependent, and relevant to downstream biology.

The biological relevance of non-uniform elongation becomes concrete when a protein contains multiple structural domains that fold independently. The classical single-domain view of translation — ribosome elongation as a nearly-instantaneous preparation of a completed polypeptide that then folds as a unit — has been replaced by a co-translational view in which the N-terminal domain begins folding as soon as it emerges from the ribosome exit tunnel, while the C-terminal domain is still being synthesized. Ribosome pauses encoded at domain boundaries give the N-terminal domain time to complete folding before the next domain starts. When codon optimization removes these pauses, the two domains can misfold into a kinetically trapped state from which they cannot escape, producing insoluble aggregates even at high expression levels. For heterologously expressed enzymes, cytokines, and multi-domain therapeutic proteins, this co-translational folding effect is one of the principal empirical reasons that maximum-CAI optimization sometimes underperforms moderate-CAI optimization.

tRNA Pools, Charging, and the CAI-to-tAI Bridge

CAI assumes that the codon-usage frequency in highly expressed genes reflects the relative availability of cognate tRNAs. For many well-studied organisms this is broadly true, but there are exceptions. Tissue-specific tRNA expression in mammals — most strikingly in proliferating versus differentiated cells — creates codon-usage environments that differ materially from the species-average; Gingold et al. (2014) described proliferation-associated and differentiation-associated tRNA expression signatures that skew the effective codon-usage landscape. Stress responses (amino-acid starvation, oxidative stress, infection) alter tRNA charging fractions — only aminoacylated tRNAs can decode their codon, and uncharged tRNAs compete as near-cognate decoys. These dynamic effects are not captured by a species-level CAI calculation.

For programs where these effects matter, the Bioneer suite's codon-usage database can be rebuilt from tissue-specific or cell-line-specific tRNA-copy-number data, producing a tAI-style weighting that the GA consumes identically to a CAI-style weighting. The practical workflow is: measure or download the relevant tRNA expression data, convert it to per-codon weights using the wobble-decoding rules (Dong et al. 1996), write a TSV, and re-run the HDF5 builder. All downstream tool behavior is unchanged; only the numerical weights differ.

GC-Rich versus GC-Poor Codon Pools

The human genome has a broad GC distribution; highly expressed housekeeping genes tend to be GC-rich, while tissue-specific or induced genes tend to be more AT-balanced. This is not a coincidence: GC-rich codons tend to be decoded by GC-rich anticodons of abundant tRNAs, and GC-rich mRNAs tend to be more stable and better exported from the nucleus. For a heterologously expressed protein, pushing GC content too low can depress expression by reducing tRNA availability and by destabilizing the transcript; pushing GC content too high can introduce synthesis-problematic repeats (CCGCCG motifs, GC-island-like windows) and can create stable secondary structure that blocks translation. The fitness landscape therefore peaks at an intermediate GC content and falls off toward both extremes, and the position of that optimum for a given protein depends on the host and on the synthesis vendor's template. The Bioneer suite exposes GC as a tunable target, defaulting to values that work well for the selected host and vendor.

Nonsense-Mediated Decay and Premature Termination

Eukaryotic mRNAs that terminate more than ~50 nucleotides upstream of the final exon–exon junction are recognized by the nonsense-mediated decay (NMD) machinery as carrying a premature termination codon and are rapidly degraded. For in vitro transcribed therapeutic mRNA that lacks introns, NMD recognition is governed by different determinants — the long 3' UTR, weak termination context, and the 3'-UTR-to-poly(A)-signal distance — but similar decay-accelerating pathways operate. UTRDesigner's 3' UTR library is curated to avoid NMD-triggering structural features, and the tool can flag constructs that exceed empirically derived safe distances between the stop codon and the poly(A) signal.

Ribosome Stalling and Collisions

When ribosomes stall — because of a rare codon cluster, a structured mRNA region, or a damaged tRNA — following ribosomes can collide with the stalled leader. Ribosome collisions activate a surveillance pathway (ZNF598, RACK1, ribosome-associated quality control, RQC) that can result in nascent-chain ubiquitination, mRNA cleavage by endonuclease activity associated with the ribosome, and degradation of both the peptide and the transcript. For therapeutic mRNA, rare-codon clusters inside the CDS are therefore a double liability: they slow elongation directly, and they trigger active mRNA degradation if collisions accumulate. The repeat and rare-codon penalties in the Bioneer suite are calibrated to avoid triggering this pathway.

Innate-Immune Discrimination of Self versus Non-Self RNA

The innate-immune system distinguishes host RNA from pathogen RNA via a combination of structural features (length, double-strandedness, 5'-end chemistry), sequence features (CpG dinucleotide frequency, UpA dinucleotide frequency, uridine density), and post-transcriptional modifications (m6A, Ψ, m5C are abundant in host RNA and largely absent in most pathogens). For exogenously delivered therapeutic mRNA, the construct has to mimic self-RNA across as many of these axes as possible. Nucleoside modification (m1Ψ) addresses the post-transcriptional-modification axis; codon choice and UTR selection address the sequence-frequency axes; capping and polyadenylation address the 5'- and 3'-end axes; purification of dsRNA byproducts addresses the structural axis. The Bioneer suite's composite immunogenicity score aggregates the sequence-level axes into a single number; the remaining axes are the responsibility of the IVT reaction, the purification train, and the capping protocol.

2.2 Biology Specific to CircularDesigner

Circular RNA as a therapeutic modality has moved from a laboratory curiosity to an emerging clinical platform within the past five years. The therapeutic appeal of circRNA is rooted in two properties: (i) topological exonuclease resistance, because a covalently closed circle has no free 5' or 3' end for XRN1, DIS3, or the exosome to act on, yielding in-cell half-lives one to two orders of magnitude longer than linear mRNA; (ii) cap-independent translation via IRES, which decouples circRNA translation from the eIF4F cap-binding complex and makes translation resistant to cellular stress responses that normally reduce cap-dependent translation. Together, these properties enable durable protein expression at lower cumulative dose — a pharmacological profile of growing interest for protein-replacement therapy, repeat-dosed prophylactic biologics, and vaccines requiring extended antigen exposure.

Producing circRNA requires an engineered scaffold that circularizes a linear precursor in vitro or in situ. Five main strategies are used; the two Group-I variants, (1) and (2), share a mechanism and are counted as one scaffold family elsewhere in this document. (1) Group-I Permuted Intron-Exon (PIE) with the T4 bacteriophage td intron, as demonstrated by Wesselhoeft, Kowalski, and Anderson (2018): the td intron is split at its P6/L6 hairpin, the 3' intron half is placed upstream of the payload and the 5' intron half downstream (hence "permuted"), and the natural self-splicing reaction excises the intron halves and ligates the payload ends. An IGS (internal guide sequence) at the 5' splice site must base-pair with a spacer at the 3' splice site; this P1-helix pairing is the key thermodynamic constraint, typically required to be more stable than -2 kcal/mol and to end in a Watson–Crick or wobble-compatible base pair with the last base of the IGS. (2) Group-I PIE with Anabaena intron: a scarless variant that leaves no exogenous scar sequence at the back-splice junction, at the cost of more complex precursor structure. (3) Group-II PIE, which uses a branch-point/D4-stem mechanism and provides a different set of trade-offs. (4) Tornado ribozyme (tandem twister ribozyme at the 5' end and HDV ribozyme at the 3' end), in which the ribozymes autocleave to generate precisely defined 2',3'-cyclic phosphate and 5'-OH ends suitable for subsequent ligation, without a splicing event per se. (5) Inverted-repeat back-splicing, which uses flanking 20–50 nt inverted repeats to form a hairpin that enables lariat-free cis-splicing; this is the canonical back-splicing mechanism of endogenous circRNAs.

CircularDesigner supports all of these and treats the scaffold as a first-class config parameter. The splicing scaffold sets which thermodynamic checks apply. For Group-I PIE (T4 td, default), the P1-helix-formation energy between the 3' spacer tail and the IGS must be below -2 kcal/mol, the terminal base must pair Watson–Crick or wobble-compatibly to the IGS last base, and the IGS must not be sequestered by CDS-5' base pairing. For Group-II PIE, the D4-stem energy must be below -8 kcal/mol. For Tornado, no splice-site thermodynamics apply because the ribozyme cleavage is sequence-determined by the ribozyme core sequence, not by flanking base pairing. For inverted-repeat back-splicing, the flanking inverted repeats must be of sufficient length and identity for efficient pairing. Each of these is checked programmatically by CircularDesigner and reported in the HTML output.
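The scaffold-dependent checks can be expressed as a small rule table. In the sketch below, the scaffold identifiers, argument names, and the inverted-repeat defaults (minimum length 20 nt, minimum identity 0.9) are hypothetical; the -2 and -8 kcal/mol thresholds are the ones stated above. Energies are assumed to be computed elsewhere by a folding routine and passed in.

```python
# Watson-Crick and G-U wobble pairs in the RNA alphabet.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def check_junction(scaffold, p1_energy=None, terminal_pair=None, igs_sequestered=False,
                   d4_energy=None, ir_length=None, ir_identity=None,
                   min_ir_length=20, min_ir_identity=0.9):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    if scaffold == "group1_pie":
        if p1_energy is None or p1_energy >= -2.0:
            failures.append("P1 helix energy must be below -2 kcal/mol")
        pair = None
        if terminal_pair is not None:
            pair = tuple(b.upper().replace("T", "U") for b in terminal_pair)
        if pair not in ALLOWED_PAIRS:
            failures.append("terminal base must pair with the last IGS base")
        if igs_sequestered:
            failures.append("IGS is sequestered by pairing with the CDS 5' region")
    elif scaffold == "group2_pie":
        if d4_energy is None or d4_energy >= -8.0:
            failures.append("D4 stem energy must be below -8 kcal/mol")
    elif scaffold == "tornado":
        pass  # cleavage is set by the ribozyme core; no junction thermodynamics apply
    elif scaffold == "inverted_repeat":
        if ir_length is None or ir_length < min_ir_length:
            failures.append("flanking inverted repeats are too short")
        if ir_identity is None or ir_identity < min_ir_identity:
            failures.append("flanking inverted repeats have insufficient identity")
    else:
        raise ValueError(f"unknown scaffold: {scaffold}")
    return failures

print(check_junction("group1_pie", p1_energy=-3.1, terminal_pair=("G", "U")))
print(check_junction("group2_pie", d4_energy=-7.9))
```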

Beyond the circularization scaffold, circRNA design must handle IRES-driven translation. Four IRES types are distinguished in the translation-biology literature. Type I (entero/rhinovirus, canonically CVB3) is PTB-dependent and is the empirically strongest IRES for mammalian-cell circRNA; Type II (cardiovirus, EMCV) is eIF4G-dependent and the historical benchmark; Type III (hepacivirus, HCV) binds 40S ribosomes directly without eIF4G; Type IV (IGR, CrPV) assembles the 80S ribosome without any initiation factors — a niche but factor-free design. Each IRES has a type-specific optimal linker between the IRES 3' end and the payload AUG; CVB3 uses AA, EMCV uses AATT, HCV uses GGC, CrPV uses no linker. CircularDesigner's IRES library includes all four with the correct linkers, and the fitness function checks the IRES-linker-CDS local structure to confirm that no excessive base-pairing sequesters the IRES from ribosome binding.
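The IRES-to-linker mapping lends itself to a lookup table. The sketch below records only the types and linkers stated above; the dictionary layout, function names, and the 30-nt flank used to extract a junction window are assumptions, and real IRES sequences are not reproduced (a placeholder string stands in for them).

```python
IRES_LIBRARY = {
    "CVB3": {"type": "I", "linker": "AA"},
    "EMCV": {"type": "II", "linker": "AATT"},
    "HCV": {"type": "III", "linker": "GGC"},
    "CrPV": {"type": "IV", "linker": ""},
}

def ires_cassette(ires_name, ires_seq, cds):
    """Concatenate IRES, its type-specific linker, and the CDS (DNA alphabet)."""
    entry = IRES_LIBRARY[ires_name]
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("CDS must begin with ATG")
    return ires_seq.upper() + entry["linker"] + cds

def junction_window(ires_name, ires_seq, cds, flank=30):
    """Region around the linker that would be passed to a folding routine."""
    linker = IRES_LIBRARY[ires_name]["linker"]
    left = ires_seq.upper()[-flank:]
    right = cds.upper()[:flank]
    return left + linker + right

placeholder_ires = "N" * 40  # stands in for a real IRES sequence
print(ires_cassette("EMCV", placeholder_ires, "ATGGCCAAG"))
print(junction_window("CVB3", placeholder_ires, "ATGGCCAAG", flank=5))
```

The window returned by junction_window is the natural input for the local-structure check: if its predicted fold pairs the IRES 3' end with the CDS, the candidate would be penalized.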

Cryptic splice-site avoidance is particularly important in circRNA because circularization depends on the scaffold's own self-splicing chemistry (Group-I) or catalytic RNA domains (Group-II); splice-site-like sequences inside the payload can compete with the intended junction, causing loss of the intended circle and production of aberrant linear or mis-circularized products. CircularDesigner scans the payload for a set of cryptic splice motifs (GT|AG, CAGGTA, GAGGTA, TAGGTA, GTCTCT, GATCTA) and, for T4 td PIE in particular, a set of high-risk T4-specific pseudosites (GGGTCT, GGCAGG). Generic cryptic sites incur a -5,000 penalty; T4-specific sites incur -50,000, which effectively rejects the candidate. A position-specific scoring matrix (PSSM) across a 9-nt window adds a log-odds score for the strong 5' splice donor consensus, capturing graded per-position base preferences that exact motif matching alone would miss.
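A minimal sketch of the motif-matching layer follows. The motif lists and penalty magnitudes are taken from the text; everything else is an assumption made for illustration: the function names are hypothetical, the penalty is applied once per (overlapping) occurrence, and the ambiguous GT|AG entry is left out because its exact matching rule is not specified here.

```python
GENERIC_MOTIFS = ("CAGGTA", "GAGGTA", "TAGGTA", "GTCTCT", "GATCTA")
T4_PSEUDOSITES = ("GGGTCT", "GGCAGG")
GENERIC_PENALTY = -5_000   # per generic cryptic site
T4_PENALTY = -50_000       # per T4-specific pseudosite


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlaps."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def cryptic_motif_penalty(payload: str, t4_td_pie: bool = True) -> int:
    """Sum motif penalties over the payload (assumed per-occurrence)."""
    seq = payload.upper().replace("U", "T")
    penalty = GENERIC_PENALTY * sum(count_overlapping(seq, m) for m in GENERIC_MOTIFS)
    if t4_td_pie:
        penalty += T4_PENALTY * sum(count_overlapping(seq, m) for m in T4_PSEUDOSITES)
    return penalty
```

With these values a single T4-specific hit outweighs ten generic hits, which is consistent with the statement that it effectively rejects the candidate.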

The full linear precursor of an engineered circRNA — the molecule produced by T7 IVT prior to self-splicing — can be substantially longer than the circular product. For a typical Group-I PIE payload of 1.5 kb, the linear precursor is roughly 2.3 kb, including the flanking intron halves, spacer arms, IRES, and additional scaffolding (ribozymes, purification tags when used, poly(A) signals for pseudo-cap if needed). CircularDesigner generates and reports this complete linear precursor; synthesis vendors receive the linear precursor as their input, and QC of the IVT reaction examines the linear precursor. The tool supports more than 10 pre-configured platform formats (PROMETHEUS, NEXUS, CALYPSO, OUROBOROS, TITAN, ZENITH, ARTEMIS, and several others), each corresponding to a particular combination of circularization engine, IRES, scaffold elements (e.g., flanking ribozymes, purification aptamers, internal poly(A) for residual-ribosome recruitment), and Tier (complexity/amplification level, with Tier 0–14 spanning roughly 10× to 2000× expected protein amplification over a naïve linear mRNA).

3. System Architecture

3.1 Shared Components Across the Suite

All five tools are built on a common Python core that combines a JIT-compiled numerical kernel (Numba), a genetic-algorithm engine, an HDF5-backed codon-usage database, a hybrid RNA-folding engine, a templated constraint library (synthesis-vendor and host-organism profiles), and a unified report-rendering pipeline. This shared substrate is what makes it possible to move from codon optimization to UTR engineering to saRNA design to circRNA design without learning a different tool for each.

The Genetic-Algorithm Engine

The GA is a standard evolutionary loop with tournament selection, uniform or single-point crossover, and program-specific mutation operators. A population of candidate sequences (typical size 100–500) is initialized either randomly from the codon-usage distribution or from a greedy CAI-oriented seed. Each generation, candidates are ranked by the fitness function, a fraction is retained as elites, and the rest of the next generation is produced by crossover and mutation of tournament winners. The GA loop exits on convergence (a plateau in best-fitness for a user-configurable number of generations), on reaching a maximum generation count, or on a user-requested early stop. Between generations, the engine can checkpoint the entire population and RNG state to disk, which is what enables exact reproducibility and restart-after-failure behavior.

Codon-Usage Database

Codon-usage frequency tables are stored in an HDF5 database (cocoputs_db.h5) indexed by NCBI taxid. The database was built from the CoCoPUTs project (Alexaki et al. 2019), which aggregates codon-usage from the NCBI GenBank CDS corpus and normalizes across organisms. The HDF5 backing allows the suite to hold several thousand organism profiles in a single addressable file, with O(1) lookup by taxid. For custom or client-specific usage tables (e.g., CHO-K1 with in-house expression-optimized weights), the database can be rebuilt from a client-supplied TSV using the included builder script.

Hybrid RNA-Folding Engine

The folding engine encapsulates three distinct algorithms behind a single interface. For sequences shorter than 700 nucleotides, a JIT-compiled Zuker recursion is used (the "RNAFoldRefactored" kernel), which produces exact MFE structures. For sequences of 700 nucleotides or longer, up to roughly 3 kilobases, LinearFold is called with a beam size of 100–300 depending on the calling tool and the required accuracy. For very long sequences typical of saRNA and circRNA (≥3 kilobases), a sliding-window Zuker-seeded LinearFold is applied: 300-nucleotide windows with a 150-nucleotide step are folded exactly, high-confidence pairs from those windows are passed as constraints to a global LinearFold call, and the result is scored against the same fitness terms used in the GA loop. In-house benchmarking against known-structure RNAs (tRNA, 5S rRNA, SARS-CoV-2 5' UTR, and a panel of natural mRNAs with experimentally probed structures) shows that the hybrid approach recovers ≥ 90% of experimentally supported base pairs within an acceptable running time for GA inner loops.
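The length-based dispatch and the window layout can be sketched as follows. The function names and return labels are hypothetical. Two details are assumptions rather than documented behavior: a sequence of exactly 700 nt is routed to LinearFold, and a final window is shifted left so that the 3' tail is always covered.

```python
def choose_folding_strategy(length_nt: int) -> str:
    """Pick a folding strategy from sequence length (thresholds from the text)."""
    if length_nt < 700:
        return "zuker_exact"
    if length_nt < 3000:
        return "linearfold"
    return "windowed_zuker_seeded_linearfold"


def window_starts(length_nt: int, window: int = 300, step: int = 150) -> list[int]:
    """Start offsets of the exactly folded windows for the sliding-window mode."""
    if length_nt <= window:
        return [0]
    starts = list(range(0, length_nt - window + 1, step))
    # Assumption: add one last window flush with the 3' end if the tail is uncovered.
    if starts[-1] + window < length_nt:
        starts.append(length_nt - window)
    return starts
```

With a 150-nt step and 300-nt windows, every interior nucleotide falls into two windows, which gives each candidate base pair two chances to be called with high confidence before it is passed on as a constraint.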

Templated Constraint Library

Hard constraints are organized into two stacks: synthesis-vendor templates and host-organism templates. Synthesis-vendor templates capture the empirical constraints of IDT (GBlocks, Megamer), Twist Bioscience (Clonal Genes, Gene Fragments), GenScript (OptiGene, GeneBlocks), ATUM, and others — restriction-site avoidance, homopolymer caps, GC-window bounds, minimum repeat-free intervals. Host-organism templates capture organism-specific constraints — Shine–Dalgarno avoidance inside CDS for E. coli, CpG-island and polyadenylation-signal avoidance for mammalian cells, poly-T tract limits for yeast. Both stacks are simultaneously applied; a candidate that violates either stack is either penalized (soft constraint) or rejected (hard constraint), configurable per term.

Viennarnaplot Rendering

The RNA secondary-structure rendering layer, Viennarnaplot, converts dot-bracket structures into publication-quality SVG figures with Naview-style layouts refined by a post-processor that resolves residue overlaps, polishes stem angles, and — for circular RNA — closes the topology. The resulting SVGs are embedded directly in HTML reports (scalable without re-rasterization) and converted into vector PDFs for archive submission. Color annotation is configurable: DMS-style reactivity coloring (green for paired A/C, red for unpaired A/C, grey for U/G) is supported for comparing predicted structure with chemical-probing data when available.

3.2 Where CircularDesigner Plugs In

CircularDesigner plugs into the suite as the endpoint tool for circRNA programs. Input is either a protein sequence (which it back-translates through GeneCrafter's core) or a pre-optimized CDS (preserve-CDS mode). Output is the complete linear precursor ready for IVT — intron halves, spacer arms, IRES, CDS, and any additional scaffolding — along with a full predicted secondary structure of the circRNA product and a predicted splicing-efficiency proxy based on the P1 or D4 helix energy.

3.3 Reproducibility by Construction

Every run records and persists: (i) the full configuration JSON submitted by the user, (ii) the random seed used by the GA, (iii) the identifier and checksum of the codon-usage database, (iv) the semantic version and git-commit hash of the tool, and (v) a checkpoint of the final GA population and fitness table. A downstream consumer can therefore re-execute the same run months or years later and confirm that the output sequence is identical, which satisfies both scientific reproducibility expectations and the ALCOA+ "Original" and "Accurate" principles used in GxP data-integrity assessment. Checkpointing is also what allows very long runs to be paused and resumed without loss, and what allows partial-failure recovery in batch pipelines.

3.4 Data Flow

A typical execution proceeds through the following stages. (1) Input parsing: a DNA or protein sequence is accepted via CLI argument, file path, or FASTA for batch mode. (2) Organism and template resolution: codon-usage table, synthesis-vendor template, and host-organism template are loaded. (3) Constraint compilation: forbidden motifs, restriction sites, TFBS, and any user-specified avoid-lists are compiled into JIT-searchable numeric arrays. (4) Initial-population generation: the GA population is seeded using a greedy-CAI initialization, a random draw from the codon-usage distribution, or (for tools that support it) a beam-search initialization that favors low-immunogenicity codons. (5) GA main loop: each generation evaluates fitness for all candidates (caching results by sequence hash), performs selection, crossover, and mutation, and optionally checkpoints. (6) Post-GA structural filtering: the top N candidates (typically 500–1000) are subjected to full structural evaluation, comprising exact or linear folding, homopolymer auditing, repeat scan, and immunogenicity profiling. (7) Final ranking and reporting: the top 8 candidates are given full secondary-structure plots, and all are summarized in HTML, PDF, JSON, and CSV.

3.5 Performance, Parallelism, and Determinism

JIT Acceleration with Numba

The suite's performance-critical kernels are JIT-compiled with Numba. Compiled kernels include the fitness evaluation core (CAI computation, GC counting, codon-pair scoring, homopolymer detection, short-tandem-repeat detection, inverted-repeat detection, motif scanning via Aho–Corasick or bit-parallel scanners), the Zuker folding recursion, the LinearFold beam-search loop, the Kozak position weight matrix, and the mutation operators. Numba compilation is invoked on first use; a warmup phase at tool startup triggers compilation of the hot kernels so that the first GA generation does not pay the compile latency. Benchmark numbers: on a modern server-class CPU, a single-generation GA evaluation over a 200-candidate population of 1,000-nucleotide sequences completes in under 5 seconds for the full fitness composite; the same operation without JIT acceleration takes more than 100 seconds.

Parallel Execution Model

GA generations parallelize naturally: each candidate's fitness evaluation is independent. The suite uses a process-pool executor with a shared, read-only set of resources (codon table, motif arrays, templated constraints) initialized in each worker at startup. For very short sequences the process-creation overhead dominates, and single-process execution is faster; the tool auto-detects the crossover point and adjusts. For long sequences (therapeutic mRNA and saRNA), multi-process execution delivers near-linear speedup up to the available core count. Custom scheduling accommodates hosts with mixed workloads: the tool can be run with an explicit --num-workers value to avoid contention with other jobs on shared compute.

Determinism and Numerical Stability

Determinism is guaranteed by seeding every random source — NumPy, Python's random module, and each worker's RNG — from a single master seed. Numerical stability of the folding kernels is guaranteed by use of float64 accumulators; the Zuker recursion's internal free-energy tables are stored at 0.01 kcal/mol resolution, which is finer than the ~0.1 kcal/mol accuracy of the underlying thermodynamic parameters. Floating-point sensitivity is therefore not a source of run-to-run variation; given the same seed and config, outputs are byte-for-byte identical.
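One way to derive every worker's random stream from a single master seed is sketched below. It uses only the Python standard library; the hashing scheme, the label format, and the function names are assumptions made for illustration and are not the suite's actual seeding code.

```python
import hashlib
import random


def derive_seed(master_seed: int, label: str) -> int:
    """Derive a stable 64-bit child seed from the master seed and a label."""
    digest = hashlib.sha256(f"{master_seed}:{label}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def worker_rngs(master_seed: int, num_workers: int) -> list[random.Random]:
    """One independent, reproducible generator per worker process."""
    return [
        random.Random(derive_seed(master_seed, f"worker-{i}"))
        for i in range(num_workers)
    ]
```

Because each child seed depends only on the master seed and the worker index, the streams do not depend on process start order, which is one of the usual sources of run-to-run variation in pooled execution.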

Error Handling and Graceful Degradation

Hard errors (malformed input, missing codon-usage table for the requested organism, corrupted checkpoint) produce a non-zero exit code, a diagnostic message to stderr, and a JSON error blob in the output directory. Soft errors (a GA generation that produces no candidates above threshold, a LinearFold call that times out) trigger a documented fallback (fall back to Zuker, lower the beam size, continue with elite-only population) with a warning logged to the report. The tool avoids silent degradation — anywhere a fallback is taken, the customer sees a flag in the HTML output.

4. Algorithms in Detail

4.1 Genetic-Algorithm Core

The genetic algorithm is the heart of every tool in the suite. Its strength over greedy or gradient-based optimization is that it navigates a high-dimensional, rugged, multimodal fitness landscape without requiring differentiability of the objective — which is crucial because the suite's fitness landscapes are dominated by discrete hard constraints (restriction sites, forbidden motifs) and non-differentiable structural metrics (MFE, repeat counts).

Encoding

A candidate is represented as an array of codon indices in the range 0–63, one per amino acid position. This encoding keeps mutation and crossover operations synonymous by construction (they change codon choice but never amino acid), and enables fast JIT-compiled fitness evaluation via codon-index lookups rather than string manipulation. For UTR-focused tools, the encoding extends to nucleotide positions in the UTR segments; the CDS segment retains its codon-indexed encoding.
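The codon-index encoding can be illustrated with the standard genetic code. This is a sketch under stated assumptions: the index order (bases in T, C, A, G order) and all function names are illustrative choices, and the suite's internal index order may differ.

```python
import random

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]  # indices 0..63

# Amino acid -> list of synonymous codon indices.
SYNONYMS: dict[str, list[int]] = {}
for idx, aa in enumerate(AMINO):
    SYNONYMS.setdefault(aa, []).append(idx)


def encode(cds: str) -> list[int]:
    """DNA (or RNA) coding sequence -> codon indices."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    return [CODONS.index(cds[i:i + 3]) for i in range(0, len(cds), 3)]


def decode(indices: list[int]) -> str:
    return "".join(CODONS[i] for i in indices)


def translate(indices: list[int]) -> str:
    return "".join(AMINO[i] for i in indices)


def mutate_synonymous(indices: list[int], rate: float, rng: random.Random) -> list[int]:
    """Resample each codon with probability `rate` among its synonyms only."""
    out = list(indices)
    for pos, idx in enumerate(out):
        if rng.random() < rate:
            out[pos] = rng.choice(SYNONYMS[AMINO[idx]])
    return out
```

Since a mutated position can only land on another index in the same synonym list, translate() returns the same protein before and after mutation, which is the "synonymous by construction" property described above.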

Selection

Tournament selection (tournament size 2 to 5) is used throughout. Tournament selection is preferred over truncation or roulette because it provides a smooth, tunable selection pressure that does not depend on the absolute fitness scale — important when fitness terms include both bounded metrics (CAI ∈ [0, 1]) and unbounded penalties (homopolymer penalty scaling as length⁵). Elitism preserves a small fraction (default 5–10%) of the best candidates into the next generation without alteration.
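A minimal sketch of tournament selection with elitism follows. The function names and the default tournament size of 3 are illustrative; the 5% elite fraction is the lower end of the default range given in the text.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Draw k distinct contenders and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]


def select_elites(population, fitness, elite_fraction=0.05):
    """Return the top fraction of candidates, best first (at least one)."""
    n_elite = max(1, int(len(population) * elite_fraction))
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in order[:n_elite]]
```

The winner of a tournament depends only on the rank order of the contenders, so a single huge penalty (for example, fitness -1e9 on one candidate) changes nothing about how the remaining candidates compete with each other.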

Crossover

Uniform and single-point crossover operate on the codon-index array. In uniform crossover, each codon position is inherited independently from one of the two parents chosen at random; in single-point crossover, the child takes one parent's codons up to a single random cut and the other parent's codons after it. Uniform crossover mixes more aggressively and is preferred in early generations; single-point crossover preserves more local structure and is preferred later. A crossover-type schedule is configurable per tool.
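Both operators are short enough to sketch directly. The function names are illustrative, and the sketch assumes both parents encode the same protein with arrays of equal length (at least two codons), so any child is also synonymous.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Each codon position is inherited from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def single_point_crossover(parent_a, parent_b, rng):
    """Head of parent_a up to one random cut, tail of parent_b after it."""
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the array
    return parent_a[:cut] + parent_b[cut:]
```

Single-point crossover keeps long contiguous blocks from each parent, which is why it tends to preserve local structure such as an already-optimized stem region.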

Mutation

Each tool installs program-specific mutation operators in addition to a baseline uniform-random synonymous substitution. Common variants include CAI-weighted mutation (new codon sampled proportionally to its relative adaptiveness), hybrid CAI–GC mutation (mutation score combines CAI distance to target and the effect on local GC content), balanced-top-50% mutation (new codon drawn only from the codons whose CAI and GC percentiles are both above the median), and targeted surgical mutation that repairs low-fitness sub-regions identified by a moving-window audit. Mutation rate is typically 0.02 to 0.05 per codon per generation and can be annealed across the run.

Convergence and Early Stopping

The GA stops when any of the following holds: (i) maximum generations is reached, (ii) no best-of-generation improvement is observed for a patience window (default 100–150 generations), or (iii) population diversity (measured as mean pairwise Hamming distance normalized by sequence length) falls below a threshold. The third criterion detects search collapse: if the whole population has converged on a local optimum, further iteration is wasted. When diversity collapse is detected, the engine can optionally perform a "diversity-restoration" step that injects random mutations into a fraction of the population, trading some best-fitness regression for renewed exploration.
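The diversity measure and the three stopping criteria can be sketched as follows. The function names and argument layout are hypothetical; the default threshold of 0.005 matches the --diversity-threshold default listed in the configuration reference. The all-pairs loop is quadratic in population size and is written for clarity, not speed.

```python
from itertools import combinations


def population_diversity(population) -> float:
    """Mean pairwise Hamming distance, normalized by sequence length."""
    length = len(population[0])
    pairs = list(combinations(population, 2))
    if not pairs or length == 0:
        return 0.0
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / (len(pairs) * length)


def should_stop(generation, max_generations, stale_generations, patience,
                diversity, diversity_threshold=0.005) -> bool:
    """True when any of the three stopping criteria is met."""
    return (
        generation >= max_generations
        or stale_generations >= patience
        or diversity < diversity_threshold
    )
```

A diversity of 0.005 means that two randomly chosen candidates differ at about one codon position in 200.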

Fitness Caching

Fitness evaluation is expensive relative to mutation and crossover. A sequence-to-fitness cache (keyed on the bytes of the candidate array plus the active fitness configuration) typically achieves >80% hit rate in late generations, because the population converges on a small region of sequence space. Cache invalidation is keyed on configuration, so changing any fitness weight or threshold forces recomputation. The cache is in-memory only (not persisted), which avoids the risk of stale cached values biasing future runs.
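A minimal in-memory cache keyed on candidate bytes plus a configuration digest might look like the sketch below. The class name, the SHA-256 digest of the JSON-serialized configuration, and the hit/miss counters are assumptions for illustration; the text specifies only what the key contains and that the cache is not persisted.

```python
import hashlib
import json


class FitnessCache:
    """In-memory cache: (candidate bytes, config digest) -> fitness."""

    def __init__(self, config: dict, evaluate):
        canonical = json.dumps(config, sort_keys=True).encode()
        self._config_key = hashlib.sha256(canonical).digest()
        self._evaluate = evaluate  # the expensive fitness function
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, codon_indices) -> float:
        # Codon indices are 0..63, so they fit in one byte each.
        key = (bytes(codon_indices), self._config_key)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._evaluate(codon_indices)
        return self._store[key]
```

Because the configuration digest is part of every key, changing any weight or threshold yields a different digest, so entries computed under the old configuration can never be returned for the new one.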

4.2 Multi-Objective Mode (NSGA-II, GeneCrafter)

GeneCrafter additionally supports NSGA-II (Non-dominated Sorting Genetic Algorithm II, Deb et al. 2002) as an alternative to the scalarized fitness approach. In NSGA-II mode, the user specifies multiple objectives (CAI, GC distance, immunogenicity, structure penalty) as separate terms rather than combining them into a single weighted score. NSGA-II then explores the Pareto frontier of non-dominated solutions: candidates for which no other candidate in the population is at least as good on every objective and strictly better on at least one. The output is a set of diverse solutions rather than a single "best" sequence, and the customer chooses the trade-off that best fits the application (e.g., accept slightly lower CAI to gain markedly lower immunogenicity).
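The dominance relation at the core of NSGA-II can be written in a few lines. This sketch covers only the non-dominated filter, not the full algorithm (no front ranking or crowding distance), and it assumes every objective has been oriented so that larger is better (for example, by negating immunogenicity). The function names are illustrative.

```python
def dominates(a, b) -> bool:
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Return the points that no other point dominates (the first Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Given the objective pairs (0.9, 0.2), (0.7, 0.8), and (0.6, 0.6), the third is dominated by the second, while the first two survive because each is better than the other on one objective.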

The practical advantage of multi-objective optimization over scalarized optimization is that it surfaces trade-offs that a scalarized fitness function would hide. A sequence that is slightly suboptimal on CAI but dramatically better on structural cleanness would be dismissed by a scalarized GA with CAI-heavy weights; NSGA-II retains both sequences and presents them to the customer for an informed decision. The cost is that NSGA-II converges more slowly and requires larger populations (500+ is recommended) to maintain frontier diversity.

4.3 Structural Post-Processing

After the GA terminates, the top N candidates (500 to 1000, configurable) are subjected to a deterministic post-processing pass that performs the expensive analyses which were approximated or sampled during the GA. The pass folds each candidate with the exact algorithm matching its length, extracts dot-bracket and energy, computes homopolymer and repeat inventories at full precision, computes the precise immunogenicity profile, validates all restriction-site and motif constraints, and confirms that Kozak, poly(A)-signal, and capping-start constraints are met. Candidates that fail any hard post-filter are removed; the remaining candidates are ranked by a post-filter composite score (which can have different weights than the GA fitness — for example, giving more weight to cap-proximal MFE because the GA's sampled MFE metric may underestimate cap-proximal risk).

This two-stage approach — fast-and-approximate in the GA, slow-and-exact in the post-processor — is a deliberate design choice. Exact per-candidate evaluation inside the GA loop would be prohibitively slow for any population/generation combination large enough to converge, and sampled-approximation alone would produce unreliable final candidates. The post-filter ensures that the sequences shipped to the customer are correct on every hard constraint, even those that were only sampled during evolution.

4.4 Viennarnaplot — 2-D Layout and Rendering

Viennarnaplot is the 2-D layout engine that converts dot-bracket secondary-structure notation into publication-quality vector illustrations. The algorithm is a hybrid of Naview (Bruccoleri & Heinrich 1988) and a custom RNAPuzzler-inspired post-processor. Naview performs a radial-tree layout of the secondary-structure graph; the post-processor detects residue collisions, resolves them by rigid-body rotation of sub-trees, smooths stem angles, and — for circular RNA — wraps the topology at the back-splice junction. The output is a browser-embeddable SVG that remains legible at any zoom and a vector PDF that embeds in customer presentations without pixelation. Coloring schemes include DMS-reactivity (green/red/grey), GC-content heatmap, local-MFE heatmap, and custom per-residue color from a user-supplied vector.

The rendering pipeline includes a "straight-line linear-spine" layout variant that represents an unrolled molecule as a horizontal strip with stems hanging below and above the backbone — suitable for panel comparisons and for aligning two candidates side-by-side. The horizontal layout is particularly useful for long mRNA and saRNA constructs where a radial layout would not fit legibly on a single page.

4.5 CircularDesigner-Specific Algorithm Notes

4.5.1 Platform Library and Tier System

CircularDesigner supports a named platform library in which each platform encapsulates a pre-validated combination of circularization engine, IRES, spacer strategy, and auxiliary elements. PROMETHEUS, NEXUS, CALYPSO, OUROBOROS, and TITAN represent major configurations. Each platform has a Tier attribute (0 through 14) representing the expected protein-amplification factor relative to linear mRNA; higher tiers incorporate more sophisticated scaffolding (flanking self-cleaving ribozymes, internal poly(A) for eIF4G recruitment, purification aptamers). Tier-10 configurations have delivered 500–1000× amplification in published in-vitro assays; Tier-14 incorporates experimental features and is typically gated behind additional engineering review.

4.5.2 Group-I PIE P1-Helix Validation

For Group-I PIE with T4 td or Anabaena, the fitness function enforces: (i) P1-helix energy ≤ -2 kcal/mol between the IGS (GGGTCT for T4 td) and the 3' spacer tail — a less-stable pairing produces poor splicing yield; (ii) Watson–Crick or wobble compatibility at the terminal position (IGS last base to spacer-3 first base) — a mismatch here blocks the splicing reaction entirely and incurs a 1,000-point penalty; (iii) no CDS-head sequestration of the IGS — a stable base pair between the CDS 5' end and the IGS prevents the splicing reaction and incurs a penalty of up to 300 points.
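The three P1 checks can be combined into a small report, sketched below. The 1,000-point mismatch penalty, the 300-point sequestration ceiling, and the -2 kcal/mol threshold come from the text. The function names, the returned dictionary, and the 0-to-1 sequestration_score input are hypothetical, and the P1 energy is assumed to have been computed elsewhere by the folding engine.

```python
# Watson-Crick and G-U wobble pairs, written as RNA.
WC_OR_WOBBLE = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def terminal_pair_ok(igs_last: str, spacer3_first: str) -> bool:
    """Can the last IGS base pair with the first spacer-3 base?"""
    a = igs_last.upper().replace("T", "U")
    b = spacer3_first.upper().replace("T", "U")
    return (a, b) in WC_OR_WOBBLE


def p1_report(p1_energy_kcal: float, igs: str, spacer3: str,
              sequestration_score: float = 0.0) -> dict:
    """Evaluate checks (i)-(iii); sequestration_score is a hypothetical 0..1 input."""
    terminal_ok = terminal_pair_ok(igs[-1], spacer3[0])
    penalty = 0.0
    if not terminal_ok:
        penalty += 1000.0  # terminal mismatch blocks splicing
    penalty += 300.0 * min(max(sequestration_score, 0.0), 1.0)  # up to 300 points
    return {
        "energy_ok": p1_energy_kcal <= -2.0,
        "terminal_ok": terminal_ok,
        "penalty": penalty,
    }
```

The energy check is reported as a flag rather than a score here because the text does not state the penalty magnitude for a weak P1 helix.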

4.5.3 Group-II PIE D4-Stem Validation

For Group-II PIE, the D4-stem energy between spacer-5 and spacer-3 must be below -8 kcal/mol; weaker stems incur a 300-point penalty. Additionally, the spacer-3 region must not compete with the CDS 5' end for pairing (a soft 50-point penalty).

4.5.4 IRES Integrity Check

The IRES integrity score is computed by folding the local IRES context (spacer-5 + IRES + CDS head of ~50 nt) and measuring how closely the fold matches the canonical IRES secondary structure (base-pair agreement with the reference IRES structure, normalized to 0–1). Integrity scores below 0.75 incur a penalty proportional to (1 - integrity) × 1000. The reference structures are taken from the literature (Martinez-Salas et al.'s refined CVB3 and EMCV structures, Lomakin & Steitz HCV structure).
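A sketch of the integrity score and its penalty, working directly on dot-bracket strings, is given below. Two points are assumptions: agreement is normalized by the number of reference base pairs (the text says only that the score is normalized to 0–1), and the proportionality constant of the penalty is taken to be exactly 1. Only plain parentheses are parsed, so pseudoknot brackets are ignored.

```python
def pairs_from_dot_bracket(structure: str) -> set:
    """Parse a dot-bracket string into a set of (i, j) base-pair indices."""
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def ires_integrity(predicted: str, reference: str) -> float:
    """Fraction of reference base pairs recovered in the predicted fold."""
    ref = pairs_from_dot_bracket(reference)
    if not ref:
        return 1.0
    return len(ref & pairs_from_dot_bracket(predicted)) / len(ref)


def integrity_penalty(integrity: float, threshold: float = 0.75) -> float:
    """(1 - integrity) * 1000 below the threshold, otherwise no penalty."""
    return (1.0 - integrity) * 1000.0 if integrity < threshold else 0.0
```

As a worked case, a predicted fold that recovers two of four reference pairs scores 0.5 and is penalized 500 points, while one that recovers three of four scores exactly 0.75 and is not penalized.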

4.5.5 Cryptic Splice-Site Scanning

A two-layer scanner detects cryptic splice sites. Layer 1 is motif matching against the generic cryptic list (GT|AG mimics) and, for T4 td PIE, the T4-specific pseudosites. Layer 2 is a PSSM score over a 9-nt sliding window for the canonical 5' splice donor consensus; windows scoring above a threshold each contribute a per-site log-odds penalty. Both layers are JIT-compiled for speed.

4.5.6 Pre-Circularization Precursor Assembly

The tool outputs both the final circular product (with the junction correctly closed in the rendered 2D structure) and the complete linear pre-circularization precursor. The precursor's length, module layout, and predicted secondary structure are reported in dedicated report sections; this is the molecule that IVT actually produces, and the synthesis vendor supplies the linear DNA template that encodes it.

4.5.7 Tornado Ribozyme Mode

In Tornado mode, the payload is flanked by a 5' twister P1 ribozyme and a 3' HDV genomic ribozyme. The ribozymes autocleave to produce precise 2',3'-cyclic phosphate and 5'-OH ends, which are then ligated (in vitro or in cells) to close the circle. Tornado mode bypasses the P1-helix and D4-stem thermodynamic checks (because splicing is not the circularization mechanism) but retains the IRES integrity check and cryptic splice-site scan (because illegitimate splicing of the linear precursor is still a production risk).

4.5.8 Inverted-Repeat Back-Splice Mode

CIRC_Intact mode uses flanking inverted repeats (20–50 nt) for cis-splicing. The repeats are scored for Hamming-distance-adjusted identity, length, and internal structure; insufficient or overly structured repeats incur penalties. No exogenous intron is needed.

4.5.9 Purification Aptamer and Flanking Ribozyme

Optional scaffold elements include a purification aptamer (MS2 or PP7 stem-loop) for affinity-based capture of the circular product, and flanking self-cleaving ribozymes (hammerhead 5', HDV 3') that trim the IVT run-off to precise end-structure. These are first-class config parameters; the report includes their presence and predicted fold.

4.6 Parameter Tuning Guidance

Default parameters are selected to work reasonably well across a wide range of inputs, but for production runs some tuning is advisable. Population size scales with sequence length: for sequences under 1 kb, 200 is sufficient; for 1–3 kb, 300 is typical; for 3 kb and above, 500 or more maintains diversity. Generations scale with the constraint landscape's ruggedness: a CAI-only optimization converges in 50–100 generations; a multi-constraint optimization with synthesis template and immunogenicity enabled typically requires 200–400 generations; a saRNA optimization with CSE-interference checks and U-depletion can benefit from 400–800 generations. Results are not strongly sensitive to mutation rate between 0.02 and 0.05 for most constraint landscapes; lower rates make late-generation refinement more precise but slower. The convergence-patience parameter (generations without improvement before early stop) should be roughly 30–50% of the total generations.

For NSGA-II mode in GeneCrafter, larger populations (500+) are important to maintain Pareto-frontier diversity. NSGA-II also benefits from a higher mutation rate (0.04–0.05) because its selection mechanism is less aggressive than scalarized tournament. A typical NSGA-II production run is 500 population × 300 generations, which on a 16-core machine completes in 30 minutes to 2 hours depending on sequence length and the active constraint set.

4.7 Reading and Interpreting the Fitness Log

Every GA run writes a per-generation fitness log containing the best, median, and worst fitness of each generation, the population diversity, and — if the tool supports it — the top candidate's per-term fitness breakdown. The log is a useful diagnostic for tuning: a best-fitness trajectory that plateaus immediately (within the first 10 generations) indicates that the initial population already saturated the objective (reduce generations or increase diversity); a trajectory that does not plateau by the generation limit indicates under-convergence (increase generations or population); oscillation between values indicates that hard-constraint rejections are interacting with soft-constraint selection (inspect the per-term breakdown to localize). The log is available as JSON in the output directory and as a line plot in the HTML report.

4.8 Cryptic Splice-Site Detection in Detail

Cryptic splice-site detection runs in two passes. The first pass is motif matching against a library of canonical splice-site dinucleotides (GT|AG), near-canonical motifs (CAGGTA, GAGGTA, TAGGTA, GTCTCT, GATCTA), and, where applicable, tool-specific lists (T4-td PIE pseudosites for CircularDesigner, BioBrick-legacy sites for GeneCrafter). Each match is counted and, when the motif has a rank classification in the literature, scored by rank. The second pass runs a position-specific scoring matrix (PSSM) over a 9-nt window positioned on each candidate GT dinucleotide; the PSSM was trained on annotated human splice-donor sites from RefSeq and assigns a log-odds score to each base at each position. Windows with scores above a configurable threshold contribute per-site penalties. For tools that operate on circular RNA or alphavirus replicons (which engage the spliceosome or splice-like machinery), the PSSM threshold is tightened.
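The PSSM pass can be sketched as a log-odds scan around each GT. The frequency table below is made up for illustration and is not the RefSeq-trained matrix described above. The window layout (3 exonic and 6 intronic positions), the uniform 0.25 background, the threshold of 6.0, and the function names are likewise assumptions.

```python
import math

# Illustrative base frequencies for a 9-nt donor window (3 exonic + 6 intronic
# positions, consensus roughly MAG|GTRAGT). NOT the trained matrix.
FREQS = [
    {"A": 0.35, "C": 0.35, "G": 0.20, "T": 0.10},
    {"A": 0.60, "C": 0.10, "G": 0.15, "T": 0.15},
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.55, "C": 0.05, "G": 0.35, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.15, "C": 0.15, "G": 0.20, "T": 0.50},
]
BACKGROUND = 0.25  # uniform background, an assumption
PSSM = [{b: math.log2(f / BACKGROUND) for b, f in col.items()} for col in FREQS]
GT_OFFSET = 3  # index of the G of GT inside the 9-nt window


def donor_scores(seq: str) -> list:
    """Return (GT position, log-odds score) for every GT with a full window."""
    seq = seq.upper().replace("U", "T")
    hits = []
    for gt in range(GT_OFFSET, len(seq) - (len(PSSM) - GT_OFFSET) + 1):
        if seq[gt:gt + 2] != "GT":
            continue
        window = seq[gt - GT_OFFSET:gt - GT_OFFSET + len(PSSM)]
        # Ambiguous bases (e.g. N) are treated as neutral and score 0.
        hits.append((gt, sum(col.get(b, 0.0) for col, b in zip(PSSM, window))))
    return hits


def donor_penalty(seq: str, threshold: float = 6.0) -> float:
    """Sum of log-odds scores over sites scoring above the threshold."""
    return sum(score for _, score in donor_scores(seq) if score > threshold)
```

Unlike exact motif matching, this scan gives a near-consensus site such as CAGGTAAGT a high score and a GT in an unfavorable context a negative one, so the penalty grows with how donor-like each site is.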

4.9 Homopolymer and Repeat Detection

Homopolymer detection is a single-pass linear scan that records the longest run of each base and all runs exceeding configurable thresholds. Short tandem repeat (STR) detection is a factor-based scanner that identifies 2- to 10-nt repeating units of copy number ≥ 3, with a fast suffix-array-like implementation. Inverted-repeat detection uses a JIT-compiled two-pointer scan with Hamming-distance allowance for imperfect palindromes; min-length and min-score thresholds are configurable. Each detected repeat is recorded with its start positions, length, and score; the repeat inventory is reported per candidate in the HTML output.
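The homopolymer scan and a simple tandem-repeat scan can be sketched with the standard library. This is an illustration only: it uses itertools.groupby and a backreference regular expression rather than the JIT-compiled, suffix-array-like scanners described above, and the function names and the default threshold are assumptions.

```python
import re
from itertools import groupby


def homopolymer_runs(seq: str, threshold: int = 5):
    """Single pass: longest run per base, plus all runs longer than threshold."""
    longest, runs, pos = {}, [], 0
    for base, group in groupby(seq.upper()):
        n = len(list(group))
        longest[base] = max(longest.get(base, 0), n)
        if n > threshold:
            runs.append((pos, base, n))
        pos += n
    return longest, runs


# Unit of 2-10 nt followed by at least two more copies (copy number >= 3).
_STR_PATTERN = re.compile(r"([ACGT]{2,10}?)\1{2,}")


def tandem_repeats(seq: str):
    """Return (start, unit, copies) for non-overlapping short tandem repeats."""
    hits = []
    for m in _STR_PATTERN.finditer(seq.upper().replace("U", "T")):
        unit = m.group(1)
        if len(set(unit)) > 1:  # skip homopolymers, which are reported separately
            hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits
```

The regular expression is convenient for a sketch but backtracks heavily on long inputs, which is one reason a production scanner would use a dedicated linear-time method.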

5. Inputs

5.1 Accepted Input Formats

DNA coding sequence — A, T, G, C (or U translated to T), length in multiples of 3 for CDS, optionally annotated with explicit UTR/polyA boundaries.
Protein sequence — standard 20-letter one-letter IUPAC codes; internally back-translated to codon positions and expanded by the GA across synonymous codon space.
FASTA file — single-sequence or multi-sequence; multi-sequence files are accepted in batch mode, where each record is treated as an independent design job with its own output folder.
GenBank file — optional, used when the CDS is a region of a longer annotated sequence; the suite extracts the CDS by feature key and retains surrounding UTR for context-aware design.
JSON configuration — all runtime parameters can be supplied as a single JSON file, which is also the canonical persistence format for audit trails.

5.2 Required Contextual Inputs

Target organism — specified either by NCBI taxid (exact) or by organism name (resolved against the local taxonomy). This choice determines the codon-usage table used for CAI and for mutation-operator biasing.
Synthesis-vendor template — IDT_GBlocks_Standard, Twist_Clonal, GenScript_OptiGene, ATUM_GeneGPS, or None for a pure-biology run. The template injects vendor-specific hard constraints (restriction-site avoidance, homopolymer caps, GC-window bounds).
Host-expression template — E_coli_K12, CHO_K1, HEK293, S_cerevisiae, P_pastoris, and others. Adds host-appropriate motif avoidance (Shine–Dalgarno for bacteria, CpG-island and poly(A)-signal for mammalian, poly-T tracts for yeast).
Optimization targets — the subset of fitness terms to activate (cai, gc, cpg_upa, immunogenicity, mrna_mfe, mrna_stability, structure_and_repeats, tfbs). Unselected terms are evaluated for reporting but not for selection pressure.
GA runtime — population size, generations, mutation rate, checkpoint frequency, random seed; all defaults are suitable for a first run and can be tuned in subsequent runs.

Additional CircularDesigner-Specific Inputs

Design mode — circular, titan2, nexus2, prometheus1, calypso1, ouroboros2, and others (named platforms).
Circular engine — Group_I_PIE, Group_II_PIE, Tornado_Ribozyme, CIRC_Intact (inverted-repeat back-splice).
PIE system — for Group-I: Grp1_T4_Phage_td (default), Anabaena_Clean_PIE.
IRES category / option — IRES_Viral / Viral_CVB3, Viral_EMCV, Viral_HCV, Viral_CrPV, or 'auto'.
Spacer length / type / strength — length 40–80 nt (default 50), polyAC or custom, strong/medium/weak.
Flanking ribozyme / purification tag / internal poly-A — scaffold options.

6. Configuration Reference

6.1 Core GA / Runtime Parameters

Every tool exposes the same core GA parameters under consistent names. Defaults are suitable for first runs; production runs typically tune population and generations upward.

Parameter	Default	Description
--population-size	200	GA population size. Larger populations explore more broadly but take longer per generation.
--generations	100–500	Maximum GA iterations. Tools auto-scale by sequence length; this is the hard upper bound.
--mutation-rate	0.02–0.05	Per-codon probability of mutation per generation. Lower rates preserve convergence; higher rates explore.
--post-ga-candidates	1000	Number of top GA candidates passed to the exact post-processor.
--checkpoint-freq	10	GA generations between checkpoint writes. Lower = more frequent but more disk I/O.
--seed	None (random)	Random seed for reproducibility. Set to an integer for byte-for-byte reproducible runs.
--optimizer	ga	'ga' for scalarized, 'nsga2' for multi-objective (GeneCrafter only).
--convergence-patience	100–150	Generations with no best-fitness improvement before early stop.
--diversity-threshold	0.005	Minimum population diversity (normalized Hamming distance) before early stop.
--output-format	human	'human' for HTML and PDF, 'json' for machine-readable only.
--repeat-min-len	15	Minimum repeat length flagged by the repeat detector.
--repeat-min-score	40	Minimum Hamming-distance-adjusted repeat score flagged by the repeat detector.
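The interaction between --convergence-patience and --diversity-threshold can be illustrated with a minimal Python sketch. It assumes that higher fitness is better, that diversity is the mean pairwise Hamming distance divided by sequence length, and that either condition alone ends the run. The function names are illustrative and are not the suite's API.

```python
from itertools import combinations

def normalized_hamming_diversity(population):
    """Mean pairwise Hamming distance divided by sequence length (assumed definition)."""
    if len(population) < 2:
        return 0.0
    length = len(population[0])
    pairs = list(combinations(population, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / (len(pairs) * length)

def should_stop_early(best_fitness_history, population,
                      patience=100, diversity_threshold=0.005):
    """True if best fitness has not improved for `patience` generations
    or population diversity has fallen below the threshold."""
    stalled = False
    if len(best_fitness_history) > patience:
        recent_best = max(best_fitness_history[-patience:])
        earlier_best = max(best_fitness_history[:-patience])
        stalled = recent_best <= earlier_best
    collapsed = normalized_hamming_diversity(population) < diversity_threshold
    return stalled or collapsed
```

In this sketch a run with a flat fitness history stops even when the population is still diverse, and a population of identical sequences stops regardless of fitness history.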

6.2 CircularDesigner-Specific Configuration

CircularDesigner's program-specific parameters span circularization scaffold, IRES selection, spacer design, and platform-library selection.

Parameter	Default	Description
--design-mode	circular	circular or named platform (titan2, nexus2, prometheus1, calypso1, ouroboros2, etc.).
--circular-engine	Group_I_PIE	Group_I_PIE, Group_II_PIE, Tornado_Ribozyme, or CIRC_Intact.
--pie-system	Grp1_T4_Phage_td	Grp1_T4_Phage_td, Anabaena_Clean_PIE.
--ires-category	IRES_Viral	IRES_Viral, IRES_Human, IRES_Plant.
--ires-option	Viral_EMCV	Viral_EMCV, Viral_CVB3, Viral_HCV, Viral_CrPV, or 'auto'.
--spacer-len	50	Back-splice spacer length (nt).
--spacer-type	auto	auto or polyAC.
--spacer-strength	strong	strong, medium, weak — controls Tm target.
--flanking-ribozyme	False	Add 5'-hammerhead + 3'-HDV self-cleavage.
--purification-tag	None	MS2_Aptamer, PP7_Aptamer, or None.
--internal-poly-a-len	0	Internal poly(A) tract length (nt).
--tier	derived	Platform Tier (0–14); derived from design-mode or user-specified.

7. Outputs and Their Biological Meaning

7.1 Results Directory Convention

Each run writes to a dated results directory (typically ./<Tool>_Local_Results/YYYY-MM-DD/<job_id>) containing the HTML report, the PDF, the JSON and CSV summaries, a FASTA of the top 8 candidates, the original configuration JSON, the GA checkpoint chain, and a manifest file that lists the tool version, input checksum, and random seed. The directory is self-contained and can be archived or transferred as a single unit without loss of reproducibility information.

7.2 Deliverable Files

<job_id>_report.html — interactive HTML with embedded SVG structures, sortable metric tables, and per-candidate drill-down.
<job_id>_report.pdf — print-ready PDF; RNA structures rendered as embedded SVG so they remain legible at zoom.
<job_id>_summary.json — machine-readable summary of all candidates, fitness components, and metrics.
<job_id>_summary.csv — tabular summary suitable for spreadsheet review and ELN ingestion.
<job_id>_candidates.fasta — top 8 candidates as standard FASTA for synthesis submission.
<job_id>_config.json — the exact configuration used; combined with the seed, deterministic reproduction is possible.
<job_id>_checkpoint.pkl — the final GA population and RNG state; enables restart for further refinement.
<job_id>_manifest.txt — tool version, git commit hash, database checksum, run duration, host.
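Before archiving a run, it can be useful to confirm that every deliverable listed above is present. The following is a minimal sketch, assuming the files sit directly inside the job directory (the exact layout within the results directory is an assumption); the helper name is hypothetical.

```python
from pathlib import Path

# Suffixes taken from the deliverable list above.
DELIVERABLE_SUFFIXES = (
    "_report.html", "_report.pdf", "_summary.json", "_summary.csv",
    "_candidates.fasta", "_config.json", "_checkpoint.pkl", "_manifest.txt",
)

def missing_deliverables(results_dir, job_id):
    """Return the expected deliverable file names that are absent from results_dir."""
    root = Path(results_dir)
    expected = [f"{job_id}{suffix}" for suffix in DELIVERABLE_SUFFIXES]
    return [name for name in expected if not (root / name).is_file()]
```

An empty return value means the bundle is complete under the assumed flat layout.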

7.3 Report Sections — What Each Means for the Customer

CircularDesigner's HTML/PDF report adds circRNA-specific panels. A "platform summary" panel documents the selected platform, tier, circularization engine, PIE system, IRES, and scaffold options. A "pre-circularization linear precursor" panel shows the complete linear DNA with module annotations (T7 promoter, 5' intron half, spacer-5, IRES, CDS, spacer-3, 3' intron half, optional aptamer, optional poly(A)). A "splicing thermodynamics" panel reports the P1 helix energy (Group-I) or D4 stem energy (Group-II) and flags out-of-spec junctions. An "IRES integrity" panel reports the IRES fold integrity score. A "cryptic splice-site scan" panel lists any matches with their PSSM scores. The structural-plot panel renders both the final circular product (with the junction closed) and the linear precursor (with intron halves attached), so the customer sees both forms.
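To make the "cryptic splice-site scan" panel concrete, the sketch below shows how a generic position-specific scoring matrix (PSSM) scan produces match positions and scores. The matrix values and the threshold are toy numbers chosen for illustration; they are not CircularDesigner's matrices, and the function name is hypothetical.

```python
def scan_pssm(sequence, pssm, threshold):
    """Slide a PSSM along the sequence and return (start, score) for each hit."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Bases missing from a column (e.g. N) contribute 0.0 in this sketch.
        score = sum(column.get(base, 0.0) for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy 3-column log-odds matrix (illustrative values only).
TOY_PSSM = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "U": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "U": 2.0},
    {"A": 1.0, "C": -1.0, "G": 1.0, "U": -1.0},
]

print(scan_pssm("ACGUAGG", TOY_PSSM, threshold=4.0))
```

Each hit corresponds to one row of the panel: a position in the sequence and the score that exceeded the threshold.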

7.4 Interpreting the Report from the Customer's Perspective

Per-Metric Interpretation

The HTML and PDF reports present each per-candidate metric with a short contextual interpretation — not just a number but a suggestion of what the number means and whether it is above, at, or below customer-acceptance thresholds. For CAI, a value above 0.85 is highlighted as strong expression, 0.70–0.85 as adequate, below 0.70 as at-risk of poor expression. For cap-proximal MFE (first 30 nt of 5' UTR and CDS), a value above -6.0 kcal/mol is "accessible", -6.0 to -12.0 is "at risk", below -12.0 is "likely to block translation initiation". For inverted-repeat count, zero is ideal for therapeutic products, one to two is acceptable for research, more than two suggests rework. For composite immunogenicity, below 3.0 is therapeutic-grade, 3.0–5.0 is research-grade, above 5.0 is flagged. These thresholds are starting points; the customer is expected to calibrate them to the specific program's requirements.
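The thresholds in the paragraph above can be written as a small set of classification functions. This is a sketch for downstream scripting, not the report generator's code. The handling of values that fall exactly on a boundary (for example CAI = 0.85) is an assumption, and the label strings are illustrative.

```python
def interpret_cai(cai):
    if cai > 0.85:
        return "strong"
    if cai >= 0.70:
        return "adequate"
    return "at-risk"

def interpret_cap_proximal_mfe(mfe_kcal_per_mol):
    if mfe_kcal_per_mol > -6.0:
        return "accessible"
    if mfe_kcal_per_mol >= -12.0:
        return "at risk"
    return "likely to block translation initiation"

def interpret_inverted_repeats(count):
    if count == 0:
        return "ideal"
    if count <= 2:
        return "acceptable for research"
    return "rework suggested"

def interpret_immunogenicity(score):
    if score < 3.0:
        return "therapeutic-grade"
    if score <= 5.0:
        return "research-grade"
    return "flagged"
```

Because the thresholds are starting points, a program would typically replace the literal cut-offs with its own calibrated values.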

Decision-Support Narrative

Above the per-metric table, the report carries a brief decision-support narrative generated at run time. Typical narratives: "Candidate 1 meets all hard constraints, exceeds CAI and GC targets, and has a composite immunogenicity of 2.1 — recommended for synthesis." Or: "Candidate 3 has the best CAI (0.92) but contains two inverted repeats at length 25 and 22; consider re-running with higher repeat penalty, or verify empirically." The narratives are meant for a non-specialist reader — a program manager reviewing designs without a deep RNA-structure background — and are not prescriptive; they indicate what the data suggest and leave the decision to the reviewer.

Candidate Diversity Surface

The report's Pareto-frontier panel (GeneCrafter NSGA-II) or top-8 panel (other tools) exposes the diversity of the top candidates: not just the best by the scalarized score but several that trade off differently. This is a deliberate safeguard against the over-optimization failure mode in which a single top candidate turns out, on wet-lab testing, to underperform an alternate that scored slightly lower in silico but performs better in the wet lab. Inspecting the top-8 panel, and optionally commissioning two or three of the candidates for head-to-head wet-lab comparison, is the empirically grounded best practice for de-risking a therapeutic design.

8. Quality-Metric Interpretation Guide

8.1 A Suggested Customer Acceptance Gate (baseline)

For CircularDesigner-produced circRNA constructs, a suggested acceptance gate is:

Group-I PIE: P1-helix energy ≤ -2 kcal/mol; Watson–Crick or wobble terminal pair; no CDS-head IGS sequestration.
Group-II PIE: D4-stem energy ≤ -8 kcal/mol; no CDS-spacer-3 competition.
Tornado: flanking ribozymes fold correctly; no cryptic splice sites triggered.
IRES integrity score ≥ 0.75; IRES-specific linker sequence intact.
Zero T4-specific cryptic splice-site matches; generic cryptic-site count ≤ 1.
CDS within the circular product: CAI ≥ 0.80; no prohibited restriction sites; no inverted repeats above threshold.
Linear precursor meets T7 IVT hard constraints (poly-T ≤ 6, poly-G ≤ 5, no internal T7 promoter).
Composite immunogenicity ≤ 5.0 for research, ≤ 3.0 for repeat-dosing therapeutics.
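The T7 IVT hard constraints in the gate above (poly-T ≤ 6, poly-G ≤ 5, no internal T7 promoter) can be spot-checked independently on the precursor DNA. The sketch below makes two assumptions: the intended promoter starts at index 0 of the precursor, and an exact match to the widely cited 17-nt T7 promoter consensus is an adequate test for an internal promoter. CircularDesigner's own check may differ.

```python
import re

# Widely cited T7 promoter consensus, positions -17 to -1.
T7_CORE = "TAATACGACTCACTATA"

def longest_run(seq, base):
    """Length of the longest homopolymer run of `base` in seq."""
    runs = re.findall(f"{base}+", seq)
    return max((len(run) for run in runs), default=0)

def ivt_hard_constraint_failures(precursor_dna, max_poly_t=6, max_poly_g=5):
    """Return the names of the hard constraints that the precursor violates."""
    seq = precursor_dna.upper()
    failures = []
    if longest_run(seq, "T") > max_poly_t:
        failures.append("poly-T")
    if longest_run(seq, "G") > max_poly_g:
        failures.append("poly-G")
    # Any promoter match after index 0 is treated as internal (assumption).
    if seq.find(T7_CORE, 1) != -1:
        failures.append("internal T7 promoter")
    return failures
```

An empty list means the precursor passes these three checks; the remaining gate items (junction energies, IRES integrity, CAI) require the tool's own metrics.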

9. Use Cases and Worked Example

9.1 Canonical Example Command

A representative CircularDesigner invocation for a T4-td Group-I PIE circRNA with CVB3 IRES, a 50-nt spacer (spacer type left at auto), strong spacer strength, and flanking ribozymes enabled:

CircularDesigner.py --protein payload.fasta --organism 9606 --design-mode circular --circular-engine Group_I_PIE --pie-system Grp1_T4_Phage_td --ires-option Viral_CVB3 --spacer-len 50 --spacer-type auto --spacer-strength strong --flanking-ribozyme True --population-size 300 --generations 400 --seed 55 --output-file results/job05
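For pipeline use, the same invocation can be assembled from a parameter mapping instead of a hand-written string. The sketch below only builds and prints the argument list; the helper name build_command is hypothetical, and the flags and values are copied from the command above.

```python
import shlex

def build_command(script, options):
    """Flatten an ordered option mapping into an argv list of flag/value pairs."""
    argv = [script]
    for flag, value in options.items():
        argv.extend([f"--{flag}", str(value)])
    return argv

options = {
    "protein": "payload.fasta",
    "organism": 9606,
    "design-mode": "circular",
    "circular-engine": "Group_I_PIE",
    "pie-system": "Grp1_T4_Phage_td",
    "ires-option": "Viral_CVB3",
    "spacer-len": 50,
    "spacer-type": "auto",
    "spacer-strength": "strong",
    "flanking-ribozyme": True,
    "population-size": 300,
    "generations": 400,
    "seed": 55,
    "output-file": "results/job05",
}

argv = build_command("CircularDesigner.py", options)
print(shlex.join(argv))  # reproduces the command line shown above
```

Keeping the options in a mapping makes it straightforward to store them next to the run's config JSON and to vary a single parameter (for example the seed) across a batch.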

9.2 Recommended Decision Workflow

1. Select the circularization engine based on program strategy: T4 td PIE for research-grade work that must stay compatible with published methods, Anabaena for clinical scarless PIE, Tornado for precise-end-chemistry applications, CIRC_Intact for endogenous-mimicking constructs.

2. Select the IRES. CVB3 is the first choice for mammalian-cell therapeutic use (highest empirical activity). EMCV is the classical benchmark. HCV is useful when eIF4G is limiting.

3. If using a named platform (PROMETHEUS, NEXUS, CALYPSO, etc.), accept the platform's default engine/IRES/scaffold; override only if you have a specific reason.

4. Inspect the splicing-thermodynamics panel; if the P1 or D4 stem is out of spec, increase spacer length or change spacer strength from medium to strong.

5. Inspect the cryptic-splice-site scan; any T4-specific site should be resolved via a GA restart with tightened weight.

6. Archive the complete results directory, including the linear-precursor FASTA, for synthesis submission and design history.

10. Industry Comparison

The codon-optimization and mRNA-design software landscape has expanded rapidly over the past decade, driven by the mRNA-therapeutics industry's need for in-silico sequence engineering that integrates synthesis feasibility, expression optimization, structural awareness, and innate-immunity awareness into a single workflow. This section positions the Bioneer suite against the most widely used academic and commercial alternatives.

Academic and Open-Source Tools

Academic tools in wide use include ViennaRNA (Lorenz et al. 2011, the standard RNA thermodynamics package providing RNAfold, RNAcofold, RNAinverse, and RNAeval), LinearFold and LinearDesign (Huang et al. 2019; Zhang et al. 2023, Nature — linear-time MFE and joint CAI/MFE optimization), RNAstructure (Reuter & Mathews 2010 — rigorous thermodynamic modelling with experimental-probing integration), Mfold (Zuker 2003, the historical reference), LocARNA (multiple-sequence structure alignment), RNAshapes (Voß et al. 2006 — abstract-shape analysis), JCat (Grote et al. 2005, codon optimization against a user-supplied reference set), OPTIMIZER (Puigbò et al. 2007, codon optimization with batch CSV output), COOL (Chin et al. 2014, multi-objective with CAI/CPB/GC), DNAWorks (Hoover & Lubkowski 2002, one of the earliest widely-used tools, oriented toward oligo-assembly feasibility), and CAIcal (Puigbò et al. 2008, CAI reporting).

Each of these tools solves a narrow problem well but collectively they do not constitute a therapeutic-grade mRNA design workflow. ViennaRNA and RNAstructure produce rigorous structures but do no codon optimization. JCat, OPTIMIZER, and COOL optimize codons but do not integrate structure-aware objectives, synthesis-vendor templates, Kozak context, capping chemistry, or immunogenicity metrics. LinearDesign integrates structure and codon choice but does not support UTR design, saRNA, or circRNA and does not produce a publishable report. DNAWorks focuses on oligo-assembly feasibility and is largely decoupled from biological objectives.

The Bioneer suite's integration of all of these capabilities behind a single CLI and report — with exact reproducibility, synthesis-vendor and host-expression templates built in, and a coherent extension from linear CDS to UTRs to saRNA to circRNA — is the central design decision that differentiates it from stacking multiple academic tools.

Commercial Tools

Commercial competitors include ThermoFisher GeneArt GeneOptimizer (the closed-source proprietary optimizer behind ThermoFisher's synthesis service), GenScript OptimumGene (bundled with GenScript synthesis), IDT Codon Optimization Tool (bundled with IDT gBlocks), Twist Bioscience's Codon Optimizer (bundled with Twist clonal gene synthesis), ATUM GeneGPS (formerly DNA2.0's GeneDesigner, sold as a stand-alone plus bundled with ATUM synthesis services), and Benchling's built-in codon optimizer. Specialized mRNA-therapeutics platforms are increasingly being offered by synthesis-plus-design CROs (Eurofins, Bioneer's own GMP-mRNA service, TriLink, ReNAgade, CureVac's in-house platform) and by pure-software vendors (BioLogic, ML-assisted mRNA design tools emerging from the deep-learning literature).

Commercial tools are typically tightly coupled to a single synthesis vendor, which is convenient when you are committed to that vendor but disadvantageous when you need to dual-source or to benchmark. Most commercial tools are closed-source: the customer cannot inspect the optimization objective, the constraint library, or the underlying codon table; this opacity is a material compliance risk for GxP-regulated drug development, where algorithm inspection and auditability are expected under FDA GMP and EMA guidelines. Commercial tools rarely expose a reproducible seed or checkpoint, and rarely produce a complete-with-provenance output bundle.

The Bioneer suite is vendor-neutral at the synthesis-template layer — IDT, Twist, and GenScript templates are first-class, and additional vendors can be added via config — and every optimization parameter is documented, inspectable, and reproducible. This makes the suite suitable as a primary design tool in a vendor-agnostic mRNA pipeline, not as an adjunct to a specific vendor's service.

10.1 Feature Matrix

Capability	Bioneer Suite	ViennaRNA + JCat	LinearDesign	GeneArt	OptimumGene	IDT Tool	ATUM GeneGPS
Codon optimization (CAI)	Yes, target/max/min	Yes (JCat)	Yes (CAI+MFE)	Yes (closed)	Yes (closed)	Yes	Yes
Structure-aware objective (MFE)	Yes (hybrid Zuker/LinearFold)	Post-hoc only	Yes (joint)	Undocumented	Undocumented	No	Yes
Windowed synthesis constraints	Yes (per-vendor template)	No	No	Built-in vendor	Built-in vendor	Built-in vendor	Built-in vendor
Vendor-agnostic	Yes (IDT, Twist, GenScript, ATUM, more)	Yes	Yes	Tied to ThermoFisher	Tied to GenScript	Tied to IDT	Tied to ATUM
UTR library and design	Yes (UTRDesigner)	No	No	Partial	Partial	No	Partial
saRNA replicon support	Yes (SaRNADesigner)	No	No	No	No	No	No
circRNA design	Yes (CircularDesigner)	No	No	No	No	No	No
Capping chemistry constraints	Yes (ARCA, CleanCap-AG, CleanCap-AT, enzymatic)	No	No	No	Partial	No	Partial
Multi-objective (Pareto)	Yes (NSGA-II, GeneCrafter)	No	Partial	No	No	No	No
Reproducible (seed + checkpoint + config)	Yes (full)	Partial	Partial	No	No	No	No
Open algorithms and parameters	Yes (all documented)	Yes	Yes	Closed	Closed	Closed	Closed
HTML + PDF + JSON + CSV report	Yes	No	No	PDF only	PDF only	PDF only	PDF only
ALCOA+ audit-ready output bundle	Yes	No	No	Partial	Partial	No	Partial
Innate-immunity (CpG, UpA, U-depletion)	Yes (composite score)	No	No	Undocumented	Undocumented	No	Partial
Cryptic splice-site scanning	Yes (donor/acceptor PSSM)	No	No	Undocumented	Undocumented	No	No
Numba JIT acceleration	Yes (fitness + folding)	N/A	Native C++	N/A	N/A	N/A	N/A
Batch/pipeline integration (FASTA in, JSON out)	Yes	Partial	Partial	Service API	Service API	Service API	Service API

10.2 Program-Specific Observations — CircularDesigner

CircularDesigner sits in a category that, like SaRNADesigner's, is dominated by proprietary in-house platforms. Orna Therapeutics (oRNA), Laronde / ReNAgade (Endless RNA / eRNA), Circio, and Circular Genomics have each built internal circRNA-design platforms that are not publicly available. Academic circRNA publications typically use custom scripts that are not packaged as tools. ViennaRNA and LinearFold can fold circRNA-sized sequences but do not model the circularization mechanism, the IRES context, or the cryptic-splice landscape. CircularDesigner's public availability and its coverage of four circularization scaffolds make it distinctive.

10.3 What CircularDesigner Uniquely Offers

What CircularDesigner uniquely provides: (i) four circularization scaffolds (Group-I PIE, Group-II PIE, Tornado ribozyme, inverted-repeat back-splice) in one tool; (ii) explicit junction thermodynamics (P1 helix, D4 stem, ribozyme fold); (iii) curated IRES library with type-specific linker sequences and integrity scoring; (iv) two-layer cryptic-splice-site scanning (motif + PSSM) with T4-specific pseudosite list; (v) complete linear pre-circularization precursor generation and reporting; (vi) 10+ pre-configured platform formats with Tier-based amplification scaling; (vii) optional scaffold elements (flanking ribozymes, purification aptamers, internal poly(A)) as first-class config; (viii) design-history-ready output bundle.

10.4 Deeper Benchmark Context

Depth Comparison with Key Academic Tools

A deeper comparison with key academic tools clarifies where the Bioneer suite is equivalent, superior, or differentiated. Against ViennaRNA — the de facto RNA-thermodynamics standard — the suite uses the same underlying Turner free-energy parameters and reproduces RNAfold's MFE results bit-for-bit on test cases. The difference is that the suite embeds folding inside a GA loop with synthesis and expression constraints, whereas ViennaRNA is a thermodynamics-only toolkit. Against LinearFold, the suite reuses the same algorithmic idea (5'-to-3' beam search) but retains the option to switch to exact Zuker for short sequences, and — critically — can pass Zuker-extracted seeds as constraints to LinearFold for accuracy on long sequences. Against LinearDesign, the suite does not implement the lattice-DP joint optimization but achieves comparable outcomes through GA search with CAI and MFE as co-objectives, while adding the synthesis-template, UTR-library, and circRNA/saRNA capabilities that LinearDesign does not provide.

Against JCat, the suite covers JCat's core use case (CAI optimization against a reference set) and adds: structure-aware optimization, windowed-GC constraints, synthesis-vendor templates, immunogenicity, NSGA-II multi-objective, UTR design, saRNA, and circRNA. JCat is single-objective, single-use-case, and does not fold the optimized output. Against OPTIMIZER and COOL, similar remarks apply: both are academic codon-optimization tools with limited or no integration of structure, synthesis, or therapeutic-grade metrics. Against DNAWorks, the suite's synthesis-vendor-template system is functionally broader and covers the same constraints DNAWorks addresses (GC, repeats, homopolymers) while additionally covering codon choice and biology.

Depth Comparison with Commercial Tools

Against ThermoFisher GeneArt's GeneOptimizer, the suite provides the same core codon-optimization capability, plus transparency (GeneOptimizer is closed-source, so its optimization objective cannot be audited). Against GenScript OptimumGene, similar transparency and vendor-agnostic arguments apply. Against IDT's Codon Optimization Tool, the suite provides a significantly broader feature set (IDT's tool is primarily a vanilla CAI optimizer with IDT-specific synthesis constraints). Against ATUM GeneGPS (formerly DNA2.0's GeneDesigner), the suite's output bundle is more audit-friendly and the UTR and saRNA/circRNA modules are unique to the Bioneer suite.

Benchmark Case Study (Qualitative)

On a representative therapeutic-grade vaccine antigen (SARS-CoV-2 spike full-length, 3,822 nt), the suite's output across organisms (human, mouse, rabbit, rhesus) shows: CAI above 0.87 in all cases; global GC within 2 percentage points of the 55% target; windowed GC inside the IDT gBlocks template bounds everywhere; zero restriction sites for the configured enzymes; composite immunogenicity below 4.0 in all cases; no internal T7 promoter and no poly-T run of 7 nt or longer; no inverted repeat of length 25 or more. Comparable sequences produced by single-objective academic tools achieved CAI above 0.90 on average but showed windowed GC excursions, 1–3 inverted repeats per sequence, and occasional restriction-site hits. In this qualitative comparison, single-objective CAI maximization produced sequences that would be expected to fail synthesis-vendor QC, whereas the suite's multi-constraint optimization produced sequences that met the configured vendor constraints.

Workflow-Integration Comparison

An often-overlooked differentiator is workflow integration. Commercial tools are typically web-service-based and require uploading the input sequence to a vendor-controlled server; for therapeutic programs under an IND, this data-egress can be a compliance hurdle. The Bioneer suite runs entirely on client infrastructure, which means that proprietary sequences never leave the client's environment. The suite also produces outputs (JSON, CSV, FASTA, HTML) that integrate natively with common laboratory-information systems (Benchling, Geneious, LabVantage, Sapio), with common pipeline tools (Snakemake, Nextflow, CWL), and with regulatory-document-management systems. The ALCOA+-compatible output bundle reduces the friction of retrofitting compliance onto an already-developed sequence.

11. Compliance with Published Requirements

This section addresses compliance of the Bioneer RNA/DNA Design Suite against three categories of stated requirements: (a) published methodological requirements in peer-reviewed mRNA-therapeutics and computational-biology literature; (b) functional expectations of mainstream commercial codon-optimization and mRNA-design software; (c) regulatory-grade software expectations under FDA, EMA, and ICH guidance for computational tools in drug development.

11.1 Peer-Reviewed Literature Requirements

Reference / Requirement	Bioneer Coverage	Notes
Sharp & Li 1987 — CAI as normalized codon-usage metric	Full	CAI computed against organism-specific reference set; target/max/min modes.
Coleman et al. 2008 — Codon-pair bias	Full	CPB score computed and reportable; configurable in objective.
Kudla et al. 2006 — GC and mRNA stability	Full	Global and windowed GC optimized toward configurable target.
Zuker 1989; Mathews 2004 — MFE structure prediction	Full	Refactored Zuker recursion, JIT-compiled, used for sub-700 nt sequences.
Huang et al. 2019 — LinearFold O(n) folding	Full	Integrated with beam-size 100–300 for long sequences.
Zhang et al. 2023 — LinearDesign joint CAI+MFE	Partial	Joint CAI+MFE optimization achieved via GA with combined fitness rather than lattice DP; operationally equivalent for therapeutic lengths.
Karikó & Weissman 2005 — m1Ψ nucleoside modification	Complementary	Sequence-level strategies complement but do not replace m1Ψ; tool outputs compatible with m1Ψ or unmodified transcripts.
Pardi et al. 2018 — mRNA vaccine sequence-design requirements	Full	CAI, MFE, poly(A), cap-compatibility, immunogenicity all addressed.
Wesselhoeft et al. 2018 — Group-I PIE circRNA design	Full	CircularDesigner supports T4 td PIE, Anabaena, Group-II, and Tornado ribozyme.
Vogel et al. 2018; Lundstrom 2019 — saRNA replicon design	Full	SaRNADesigner supports VEEV TC-83, VEEV Trinidad, SFV backbones; CSE preservation enforced.
Presnyak et al. 2015 — codon optimality and mRNA half-life	Full	Codon-usage weights correlate with mRNA stability in the CAI/CPB composite.
Leppek et al. 2022 — structure-guided mRNA optimization	Full	Structure-aware fitness terms and structure-reported metrics.
WHO 2022, FDA 2022, EMA 2023 — mRNA vaccine guidelines (sequence considerations)	Full	All stated sequence-level considerations are addressed.
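As a reminder of what the CAI row in the table above measures, CAI (Sharp & Li 1987) is the geometric mean of each codon's relative adaptiveness w, where w is the codon's usage divided by the usage of the most frequent synonymous codon. The sketch below uses a toy two-family usage table with invented counts; real runs use the organism-specific reference set, and the function names are illustrative.

```python
import math

def relative_adaptiveness(codon_counts, synonymous_families):
    """w = count(codon) / max count among its synonymous codons."""
    weights = {}
    for family in synonymous_families:
        top = max(codon_counts[codon] for codon in family)
        for codon in family:
            weights[codon] = codon_counts[codon] / top
    return weights

def cai(cds, weights):
    """Geometric mean of w over the codons of cds that have a weight.
    Codons without a weight (e.g. Met, Trp, stops) are skipped.
    Weights must be non-zero, since log(0) is undefined."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Toy counts (invented for illustration): Lys = AAA/AAG, Glu = GAA/GAG.
counts = {"AAA": 20, "AAG": 80, "GAA": 30, "GAG": 60}
families = [("AAA", "AAG"), ("GAA", "GAG")]
w = relative_adaptiveness(counts, families)
print(cai("AAGGAG", w), cai("AAAGAA", w))
```

A sequence built only from the most frequent codon of each family scores 1.0, which is why CAI targets such as 0.80 or 0.85 are expressed on a 0 to 1 scale.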

11.2 Commercial Software Functional Expectations

Functional Requirement	Bioneer Coverage	Notes
Accept DNA and protein inputs	Yes	FASTA, GenBank, raw string; batch mode for multiple sequences.
Organism selection with up-to-date codon tables	Yes	CoCoPUTs-backed HDF5 database; user-refreshable.
Vendor-specific synthesis template	Yes	IDT, Twist, GenScript, ATUM; extendable by config.
Restriction-site avoidance	Yes	User-configurable list plus vendor defaults.
Forbidden-motif avoidance	Yes	User-configurable list plus template defaults.
GC-window constraint	Yes	Configurable window size and bounds per vendor.
Homopolymer caps	Yes	Per-base and per-vendor.
Repeat and inverted-repeat auditing	Yes	Min length and min score configurable.
Secondary-structure prediction	Yes	Hybrid Zuker/LinearFold; full-length therapeutic RNA supported.
Visual structure output (SVG, PDF)	Yes	ViennaRNA RNAplot SVG; PDF archive.
Ranked multi-candidate output	Yes	Top 8 by default; configurable.
CLI for pipeline integration	Yes	JSON config, FASTA I/O, exit codes.
Reproducible runs (seed, checkpoint)	Yes	Full checkpoint + config + seed bundle.
Human-readable report	Yes	HTML + PDF with biology-explained metrics.
Machine-readable export	Yes	JSON + CSV.
Batch/high-throughput mode	Yes	FASTA-in, per-record output directory.
Licensing/software distribution	Internal	Deployed on client infrastructure; no data egress.

11.3 Regulatory Software Requirements

Computational tools that inform drug-product design are subject to a tiered set of expectations under GxP and aligned guidance. The Bioneer suite is designed to meet Category-3 (non-configured products used for intended purpose) and Category-4 (configured products) expectations under GAMP 5, with user-facing configuration that can be version-controlled and audited. The following table maps compliance against the principal regulatory frameworks.

Framework / Requirement	Bioneer Coverage	Notes
ALCOA+ — Attributable	Yes	Run manifest records operator, host, tool version, timestamp.
ALCOA+ — Legible	Yes	HTML, PDF, JSON, CSV outputs; plain-text config.
ALCOA+ — Contemporaneous	Yes	Timestamps on every checkpoint and every report section.
ALCOA+ — Original	Yes	Original config, original checkpoint, original report are all preserved.
ALCOA+ — Accurate	Yes	Reproducibility from seed + config verified in QC harness.
ALCOA+ — Complete	Yes	All intermediate results available; no silent pruning.
ALCOA+ — Consistent	Yes	Report field set is fixed per tool version.
ALCOA+ — Enduring	Yes	Plain-text and open-vector outputs; no proprietary binary.
ALCOA+ — Available	Yes	Self-contained results directory; portable.
21 CFR Part 11 — Electronic records	Aligned	Output records are attributable and tamper-evident when written to controlled storage; e-signature layer is the responsibility of the enclosing QMS.
GAMP 5 — Software categorization	Category 3/4	Standard product with configurable parameters; no custom code per user.
GAMP 5 — Risk-based validation	Supported	Functional test suite included; IQ/OQ/PQ templates deliverable on request.
ICH Q8 — Quality by Design	Supported	Design-space inputs (CAI, GC, MFE, immunogenicity) are explicit and tunable; critical quality attributes reportable.
ICH Q9 — Quality Risk Management	Supported	Fitness term weights are risk-based; rejection thresholds are documented.
ICH Q10 — Pharmaceutical Quality System	Supported	Deterministic outputs enable integration with CAPA, deviation, change control.
ICH Q11 — Development of drug substances	Supported	Design-history traceability via config + checkpoint.
ICH Q14 — Analytical Procedure Development	Supported	Report metrics mappable to analytical specifications (CAI, MFE, immunogenicity, repeat inventory).
FDA 2022 mRNA-vaccine sequence considerations	Full	Covered by tool output metrics.
EMA 2023 mRNA guideline — sequence-level CMC	Full	Covered by tool output metrics plus design-history package.
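The "Accurate" row above refers to reproducibility from seed and config. The QC harness itself is not described here, but a user-side check can be sketched as follows: run the same config and seed twice, then compare checksums of the sequence outputs. Which files are expected to be byte-identical is an assumption (reports may embed timestamps), and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def runs_match(dir_a, dir_b, filenames):
    """True if every named file has the same digest in both run directories."""
    return all(
        sha256_of(Path(dir_a) / name) == sha256_of(Path(dir_b) / name)
        for name in filenames
    )
```

A typical call would compare the candidates FASTA and summary CSV of two runs; recording the digests alongside the manifest gives a simple tamper-evidence trail.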

12. mRNA Drug / Vaccine Development Perspective

12.1 Where This Tool Sits in the Workflow

A realistic mRNA-therapeutic development pipeline proceeds from antigen or payload definition (the protein to be expressed), to in-silico design of the coding and untranslated regions, to template-DNA synthesis and cloning, to in vitro transcription, to capping and polyadenylation, to purification and formulation (typically lipid-nanoparticle encapsulation), to in vitro potency and release testing, to in vivo pharmacology, and eventually into regulatory filings, clinical trial material, and commercial manufacture. The Bioneer suite addresses the second stage — in-silico sequence design — and is positioned specifically to deliver a sequence that is simultaneously: biologically well-behaved (CAI, structure, immunogenicity), synthesis-ready (vendor-template constraints, homopolymer caps, repeat audits), reproducible (seed, checkpoint, config), and audit-defensible (ALCOA+ outputs, full design history). The sequence leaving the Bioneer suite is the primary input to the synthesis vendor and the anchor of the design-history file that accompanies the drug product through its regulatory lifecycle.

Upstream of the Bioneer suite sit antigen-discovery tools (bioinformatics prediction of protein targets), epitope scoring and immunogenicity prediction platforms, and structural-biology refinement. Downstream sit the synthesis-and-amplification workflow, the IVT reaction, the capping and tailing steps, the purification train (dsRNA removal by HPLC or oligo-dT affinity, cellulose-based dsRNA removal, tangential-flow filtration), the LNP formulation and characterization, the analytical-release panel (capping efficiency by LC-MS, poly(A) length by Bioanalyzer or fragment analyzer, integrity by agarose or capillary electrophoresis, residual dsRNA by ELISA or J2-antibody dot blot, residual template DNA by qPCR, endotoxin), and the in-vitro and in-vivo potency assays. Several of the sequence-level metrics produced by the Bioneer suite map directly to analytical-release tests, which makes the suite's output a natural bridge between design and CMC.

12.2 Therapeutic-Grade Acceptance Gates

A suggested acceptance gate for therapeutic-grade mRNA design output is: CAI ≥ 0.85 for the target organism; global GC between 50% and 62%; windowed GC (50-nt window) between 30% and 70% everywhere; no homopolymer tracts exceeding the vendor's template cap (typically A ≤ 14, C ≤ 14, and, for IVT products, G ≤ 5 and T ≤ 6, because longer T tracts act as T7 termination signals); no unintended restriction sites or forbidden motifs; no cryptic splice donor or acceptor sites above the PSSM threshold (when relevant); composite immunogenicity score below the tool-specific cap (typically 5.0); no inverted repeats above length 20 nt and score 30; a poly(A) tail of 100–150 nt, counting the encoded and enzymatically added portions together; and, for products using CleanCap-AG or CleanCap-AT, a +1 transcription start that matches the required AG or AT dinucleotide. These gates are not universal for every indication; a vaccine targeting a protein with a hard co-translational folding requirement may require tighter local-MFE control, while a protein-replacement therapy may tolerate more structural variation. The gates are the starting point for an informed discussion between the design team and CMC/clinical colleagues.
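The global and windowed GC gates above are easy to verify independently of the tool. The following sketch assumes a sliding window that advances one nucleotide at a time and ignores trailing partial windows; the vendor templates may define windows differently.

```python
def gc_fraction(seq):
    """Fraction of G and C in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def global_gc_ok(seq, low=0.50, high=0.62):
    """Global GC gate: True if the overall GC fraction lies within [low, high]."""
    return low <= gc_fraction(seq) <= high

def windowed_gc_violations(seq, window=50, low=0.30, high=0.70):
    """Return (start, gc) for every full window whose GC fraction is outside [low, high]."""
    violations = []
    for start in range(len(seq) - window + 1):
        gc = gc_fraction(seq[start:start + window])
        if gc < low or gc > high:
            violations.append((start, gc))
    return violations
```

Reporting the start positions of failing windows, rather than a single pass/fail flag, makes it easier to see whether a violation is local (one GC-rich motif) or spread across the sequence.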

12.3 CircularDesigner in the mRNA Development Workflow

Circular RNA is the newest of the mRNA-therapeutic modalities and has a distinctive pharmacological profile: longer in-cell half-life (days, not hours), cap-independent translation via IRES, and emerging evidence of reduced innate-immune reactogenicity relative to linear mRNA of equivalent design. These properties make circRNA attractive for applications where linear mRNA falls short: durable protein-replacement therapy with longer dosing intervals, prolonged antigen exposure for therapeutic cancer vaccines, and chronic-condition management where patient convenience favors less-frequent dosing.

The central engineering challenge of circRNA is the trade-off between circularization efficiency (the fraction of IVT product that closes into a circle) and payload translation efficiency (the protein output per circRNA molecule). Group-I PIE with CVB3 IRES, well-designed spacers, and no cryptic splice sites routinely achieves 60–96% circularization in published reports; poorly designed constructs (cryptic sites triggered, P1-stem out of spec) drop to the 20% range. CircularDesigner's explicit junction thermodynamics and two-layer cryptic-site scanning are aimed directly at keeping circularization efficiency in the upper band.

For protein-replacement therapy (e.g., Factor IX for hemophilia B, α-galactosidase A for Fabry disease), the longer half-life of circRNA can translate into less-frequent dosing; this is pharmacologically and economically compelling. For therapeutic cancer vaccines, circRNA's durable antigen expression over days supports sustained priming and boosting of T-cell responses. For biodefense and pandemic-preparedness applications, the greater stability of circRNA (both in vivo and in formulation) can simplify distribution.

Regulatory considerations for circRNA are still in flux. As of 2026, FDA guidance specific to circRNA is sparse; existing mRNA-vaccine guidance is applied by analogy. The linear-precursor-first output of CircularDesigner supports regulatory engagement by providing a fully specified IVT product with complete design history — the linear precursor, the circularization mechanism, the IRES choice, the spacer sequences, and the predicted junction thermodynamics are all documented in the output bundle. This level of detail is directly relevant to the CMC section of an IND or pre-IND briefing.

12.4 Integrating With Nucleoside Modification

Nucleoside modification, most prominently m1Ψ substitution, is the dominant pharmaceutical strategy for suppressing innate-immune activation and extending transcript half-life in clinical mRNA. Sequence-level design and nucleoside modification are complementary, not substitutes. Even m1Ψ-modified transcripts retain sequence-dependent recognition by ZAP (via CpG dinucleotides) and by MDA5 (via dsRNA formed by inverted repeats); sequence-level CpG depletion and inverted-repeat suppression therefore provide additional headroom even when nucleoside modification is used. Conversely, for platforms that cannot use m1Ψ (most notably self-amplifying RNA, which requires natural bases for RdRp replication), sequence-level immunogenicity reduction is the only available lever and must be aggressive. Circular RNA sits between the two: it has no 5' end to cap and relies on its IRES for translation, and its emerging literature indicates that unmodified circRNA can be well tolerated when its IRES context and junction structure are well chosen.
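
The two sequence-level signals named above, CpG dinucleotides and inverted repeats, can be illustrated with a minimal sketch. This is not the suite's scoring code; the function names, the 8-nt arm length, and the 3-nt minimum loop are assumptions chosen for illustration only.

```python
# Illustrative sketch only: names and default thresholds are assumptions,
# not the Bioneer suite's implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _as_dna(seq: str) -> str:
    """Upper-case the sequence and map RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def count_cpg(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    s = _as_dna(seq)
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "CG")


def reverse_complement(seq: str) -> str:
    return _as_dna(seq).translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, arm: int = 8, min_gap: int = 3):
    """Return (i, j) pairs where seq[j:j+arm] is the reverse complement of
    seq[i:i+arm] and the two arms are separated by at least min_gap nt.
    Such pairs can fold back on each other into a perfect stem of length arm."""
    s = _as_dna(seq)
    positions = {}
    for i in range(len(s) - arm + 1):
        positions.setdefault(s[i:i + arm], []).append(i)
    hits = []
    for i in range(len(s) - arm + 1):
        partner = reverse_complement(s[i:i + arm])
        for j in positions.get(partner, []):
            if j >= i + arm + min_gap:
                hits.append((i, j))
    return hits
```

A call such as `find_inverted_repeats(candidate)` returning an empty list indicates that no perfect 8-nt stem with a loop of 3 nt or more can form; imperfect stems would need a folding tool rather than exact matching.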

The Bioneer suite's immunogenicity composite is calibrated so that the scores remain interpretable across modified and unmodified contexts. For modified mRNA, the composite remains a useful residual-risk metric; for unmodified saRNA, the composite drives the fitness gradient; for circRNA, it controls the dsRNA-formation risk at the back-splice junction and within inverted repeats.

12.5 Manufacturing, Formulation, and Clinical-Grade Context

LNP Formulation Considerations

Lipid-nanoparticle formulation is the dominant delivery modality for clinical mRNA. Commercially used LNP systems (Pfizer/BioNTech ALC-0315, Moderna SM-102, Arcturus LUNAR, Genevant CL1) are ionizable-lipid-based and encapsulate mRNA via electrostatic and hydrophobic interactions during a solvent-exchange process. The mRNA sequence influences LNP quality indirectly through length (longer mRNA packs differently), net charge (a minor but measurable effect), and secondary-structure presentation (structured mRNA packs differently from single-stranded mRNA). Very long sequences (saRNA replicons) require LNP formulations tuned for a larger payload and may show different encapsulation-efficiency profiles. Sequence-level design decisions that the Bioneer suite makes (GC content, structural penalty weighting, repeat suppression) do not directly control LNP quality but contribute indirectly by producing sequences that behave predictably in the formulation step.

A practical consideration is that residual dsRNA, a common IVT byproduct, interacts strongly with cationic ionizable lipids and is difficult to remove after encapsulation. Suppressing dsRNA at the sequence level (inverted-repeat minimization in the Bioneer suite's fitness composite) reduces the burden on downstream purification steps (RNase III digestion, cellulose-based dsRNA removal, HPLC, oligo-dT affinity) and improves the drug product's margin on the residual-dsRNA analytical-release test (typically <1 ng dsRNA per μg mRNA for clinical material).

IVT Reaction Optimization Context

The IVT reaction is a central manufacturing step for every modality in the suite, including circRNA, whose linear precursor is itself an IVT product. T7 RNA polymerase transcribes a linearized DNA template in the presence of rNTPs (or modified rNTPs for m1Ψ chemistry), magnesium, and capping components (if co-transcriptional capping is used). Yield depends on template quality (completeness of linearization, contamination), rNTP stoichiometry, reaction time and temperature, and, importantly, on sequence features that favor T7 processivity. Poly-T runs, poly-G runs, and internal T7 promoter mimics are empirically associated with lower yield; IVTDesigner's hard constraints on these features are specifically intended to remove this source of variability. Capping efficiency similarly depends on the +1 dinucleotide matching the capping chemistry; IVTDesigner's cap-analog-aware 5' UTR selection addresses this.
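
A rough sketch of how the three IVT-unfriendly features named above could be flagged is shown here. The promoter string is the standard T7 class III consensus; the run-length and mismatch thresholds, and all function names, are assumptions for illustration and are not IVTDesigner's actual hard-constraint values.

```python
import re

# Consensus T7 class III promoter, DNA notation, positions -17 to +1.
T7_PROMOTER = "TAATACGACTCACTATAG"


def _as_dna(seq: str) -> str:
    return seq.upper().replace("U", "T")


def homopolymer_runs(seq: str, base: str, min_len: int):
    """Return (start, length) for every run of `base` at least min_len long."""
    pattern = re.compile(f"{re.escape(base)}{{{min_len},}}")
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(_as_dna(seq))]


def promoter_mimics(seq: str, max_mismatches: int = 3):
    """Return (start, mismatches) for each window whose Hamming distance
    to the T7 consensus is at most max_mismatches."""
    s = _as_dna(seq)
    n = len(T7_PROMOTER)
    hits = []
    for i in range(len(s) - n + 1):
        mismatches = sum(a != b for a, b in zip(s[i:i + n], T7_PROMOTER))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


def scan_ivt_features(seq: str, min_t_run: int = 6, min_g_run: int = 5,
                      max_mismatches: int = 3):
    """Collect all three feature classes; thresholds here are hypothetical."""
    return {
        "poly_t": homopolymer_runs(seq, "T", min_t_run),
        "poly_g": homopolymer_runs(seq, "G", min_g_run),
        "t7_mimics": promoter_mimics(seq, max_mismatches),
    }
```

An empty list under every key means the sequence passes this simplified screen; a production gate would also need to scan the reverse strand of the template and any vendor-specific motifs.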

Analytical release of clinical mRNA includes tests for: mRNA integrity (agarose gel or capillary electrophoresis), capping efficiency (LC-MS of cap analog post-digestion, or immunocapture), poly(A) length and distribution (fragment analyzer), residual dsRNA (J2-antibody dot blot or ELISA, clinical spec typically <1 ng/μg), residual template DNA (qPCR), endotoxin, and sterility. Several of these tests have direct sequence-level antecedents: mRNA integrity depends on the absence of repeats and structures that could cause IVT pausing; capping efficiency depends on the +1 start; residual dsRNA depends on inverted-repeat count. The Bioneer suite's sequence-level metrics are therefore not just design parameters but leading indicators of the analytical-release profile of the manufactured drug.

Clinical-Grade Acceptance Criteria

A suggested set of clinical-grade acceptance criteria (for discussion with CMC and regulatory colleagues) includes, beyond the tool-specific sequence gates already listed: capping efficiency ≥ 95%, poly(A) tail length 100–150 nt with low dispersion (<10% CV), residual dsRNA ≤ 1 ng/μg mRNA, residual template DNA ≤ 10 pg/μg mRNA, endotoxin ≤ 0.5 EU/μg, and integrity ≥ 80% full-length. Sequence-level decisions that contribute to these criteria include: IVT-safe sequence features (all Bioneer tools), cap-analog-matched 5' start (IVTDesigner), inverted-repeat suppression (all tools), and sequence length within the capacity of the LNP formulation (typically 300 nt to 15 kb). The Bioneer suite's design-history-file-ready output bundle provides the sequence-level provenance that a CMC reviewer needs to tie the design to these analytical specifications.
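
The suggested criteria can be written down as a small machine-checkable table, which makes it easier to compare batch measurements against them. The thresholds below are transcribed from the paragraph above; the field names and the `failed_criteria` helper are hypothetical and do not correspond to any suite output schema.

```python
# Thresholds transcribed from the suggested criteria above.
# Field names are hypothetical; adapt them to the local LIMS vocabulary.
CRITERIA = {
    "capping_efficiency_pct": lambda v: v >= 95.0,
    "polya_length_nt": lambda v: 100 <= v <= 150,
    "polya_length_cv_pct": lambda v: v < 10.0,
    "residual_dsrna_ng_per_ug": lambda v: v <= 1.0,
    "residual_template_dna_pg_per_ug": lambda v: v <= 10.0,
    "endotoxin_eu_per_ug": lambda v: v <= 0.5,
    "integrity_pct_full_length": lambda v: v >= 80.0,
}


def failed_criteria(measurements: dict) -> list:
    """Return the sorted names of criteria that are missing or out of range."""
    return sorted(
        name
        for name, passes in CRITERIA.items()
        if name not in measurements or not passes(measurements[name])
    )
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: an untested attribute should not pass silently.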

Cost-of-Goods Perspective

mRNA manufacturing cost is dominated by rNTP consumption (especially modified rNTPs for m1Ψ) and by purification. Sequence-level decisions influence COGS through: (i) mRNA length (shorter = cheaper per dose but may compromise expression); (ii) IVT yield (suppressing T7 pauses and poly-T runs materially improves reaction yield per unit rNTP); (iii) capping efficiency (poor capping requires overformulation or enzymatic re-capping, both costly); (iv) residual-dsRNA burden (higher dsRNA triggers larger purification losses). The Bioneer suite's sequence-level choices therefore have downstream COGS consequences that compound across a commercial launch campaign. For a late-stage program planning a commercial launch at tens to hundreds of millions of doses per year, the cumulative effect of sequence-level optimization on COGS is material.

Regulatory-Grade Design Provenance

Regulatory dossiers for mRNA drug products (IND, BLA, MAA) require sequence-level traceability that maps each design decision to its rationale and shows that the chosen sequence was derived by a documented, reproducible process. The Bioneer suite's config-plus-seed-plus-checkpoint-plus-report output bundle is structured to fit directly into the CMC section of an IND: the config and seed demonstrate reproducibility, the checkpoint enables exact regeneration, the report documents the optimization objective and the fitness breakdown, and the FASTA is the final drug-substance sequence. The tool-version hash and database checksum provide the software-integrity trail required by 21 CFR Part 11 and GAMP 5. In practice, this bundle reduces the effort of retrofitting compliance onto a design at IND-filing time from weeks of documentation to a few days of review.

13. Integration, QC, and Limitations

13.1 Pipeline Integration

The Bioneer suite is designed for integration into a larger mRNA CMC and design-history pipeline. Inputs are files (FASTA, GenBank, JSON); outputs are files in structured, machine-readable formats (JSON, CSV) in addition to the human-readable HTML and PDF. Exit codes are deterministic (0 for success, non-zero for documented failure modes). Batch mode supports parallel job execution with per-job output directories. The JSON output schema is versioned and stable across minor releases, so downstream pipeline components do not break when the tool is updated.
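
A downstream wrapper can rely on the two contracts described above: the exit code and the versioned JSON schema. The sketch below assumes the caller supplies the command line and the path of the JSON summary, and it assumes the summary carries a top-level `schema_version` string; that key name and the helper name are hypothetical, since the actual schema is defined by the tool's documentation.

```python
import json
import subprocess
from pathlib import Path


def run_design_job(cmd, summary_path, expected_major: int) -> dict:
    """Run one design job and return its parsed JSON summary.

    cmd            -- full command line as a list (supplied by the caller)
    summary_path   -- where the job writes its JSON summary
    expected_major -- schema major version this pipeline was written against
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Non-zero exit codes signal documented failure modes.
        raise RuntimeError(
            f"design job failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    # 'schema_version' is an assumed key name (e.g. "1.4").
    major = int(str(summary["schema_version"]).split(".")[0])
    if major != expected_major:
        raise ValueError(
            f"schema major version {major} does not match expected {expected_major}"
        )
    return summary
```

Checking only the major version reflects the stated stability of the schema across minor releases; a stricter pipeline could pin the full version string.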

Typical integration patterns include: (i) a bioinformatics LIMS that submits design jobs, stores the returned JSON, and exposes metrics in a dashboard; (ii) a synthesis-vendor submission script that reads the FASTA and attaches the configuration JSON as the order-history record; (iii) an ELN that embeds the HTML report as an appended attachment to the design experiment; (iv) a GMP batch-record system that archives the full results directory as part of the design-history file.

13.2 Recommended QC Wraparound

Confirm determinism on sensitive runs: re-execute the run from the saved config and seed, and confirm that the output FASTA is byte-identical.
Run a second structural prediction with an independent tool (for example RNAstructure or ViennaRNA's RNAfold at a different temperature) as an orthogonal check on the reported MFE.
Submit the output FASTA to the synthesis vendor's own QC tool and confirm that no additional constraints are flagged; if flagged, update the local vendor template.
For therapeutic-grade design, review the predicted secondary structure visually for cap-proximal stems, junction obstructions (circRNA), and long stems that might form dsRNA substrates.
For batch pipelines, log the tool-version hash and the database checksum of every run; store them in the LIMS or ELN alongside the output files.
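
The first and last checks in this list lend themselves to automation. A minimal sketch, assuming the original and re-executed FASTA files are both on disk and that the tool-version hash and database checksum are passed in by the caller (how they are obtained is not shown here); the record layout is hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def determinism_record(original_fasta, rerun_fasta,
                       tool_version_hash: str, database_checksum: str) -> dict:
    """Compare two runs and return a record suitable for LIMS or ELN storage."""
    original = sha256_of(original_fasta)
    rerun = sha256_of(rerun_fasta)
    return {
        "original_sha256": original,
        "rerun_sha256": rerun,
        # Equal digests are treated as byte-identical content.
        "byte_identical": original == rerun,
        "tool_version_hash": tool_version_hash,
        "database_checksum": database_checksum,
    }
```

A record with `byte_identical` set to false is a signal to compare tool version, database checksum, and seed between the two runs before investigating further.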

13.3 Known Limitations

Thermodynamic folding is a prediction, not a measurement. Structural assessments should be validated by chemical probing (DMS-MaPseq, SHAPE) for any sequence critical to a therapeutic program.
Immunogenicity is a composite score calibrated against published correlates; it is not a substitute for in vitro or in vivo immunogenicity testing.
Codon-usage tables are organism-level averages; tissue- and cell-type-specific tRNA pools can create second-order effects not captured in a generic CAI.
For very long sequences (≥ 10 kb), LinearFold beam-search accuracy degrades relative to exact folding; users with unusual structural requirements may need to fold sub-segments with a more expensive method.
The tool does not currently model post-transcriptional modifications (m6A, m5C, Ψ) beyond the uridine-to-pseudouridine substitution's implicit effect on immunogenicity scoring.
UTR libraries are curated snapshots; for the latest literature UTRs, users may wish to refresh from the configured source or supply custom UTR sequences.
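
For the long-sequence limitation, one possible way to prepare sub-segments for a more expensive folding method is to cut the sequence into overlapping windows. The helper below is a sketch under stated assumptions: the function name, the window and overlap parameters, and the optional wrap-around mode (intended for a circular product, where a window may need to span the junction) are illustrative and are not part of the suite.

```python
def overlapping_windows(seq: str, window: int, overlap: int, circular: bool = False):
    """Split seq into (start, subsequence) windows of equal length.

    Linear mode: the last window is shifted left so it ends at the 3' end.
    Circular mode: windows starting near the end wrap around to the start.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    n = len(seq)
    if n <= window:
        return [(0, seq)]
    step = window - overlap
    if circular:
        # Append the first window-1 bases so any start position has a full window.
        doubled = seq + seq[:window - 1]
        return [(start, doubled[start:start + window]) for start in range(0, n, step)]
    windows = []
    start = 0
    while start + window < n:
        windows.append((start, seq[start:start + window]))
        start += step
    windows.append((n - window, seq[n - window:]))
    return windows
```

Window-level folds ignore base pairs that span more than one window, so results from this approach complement a full-length prediction rather than replace it.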

14. Regulatory Considerations

14.1 Data Integrity (ALCOA+ Considerations)

ALCOA+ extends the FDA-originated ALCOA principles (Attributable, Legible, Contemporaneous, Original, Accurate) with the additional "+" requirements (Complete, Consistent, Enduring, Available) and is the data-integrity framework widely applied to GxP-regulated software. The Bioneer suite's output bundle is designed to meet each principle by construction: every run has an identifiable operator and host (Attributable), produces plain-text and open-vector outputs (Legible, Enduring), records a timestamp on every checkpoint (Contemporaneous), preserves the original config and checkpoint (Original), is reproducible from seed and config (Accurate), retains all intermediate metrics (Complete), uses a stable report schema (Consistent), and ships as a self-contained portable directory (Available). The enclosing electronic-records system (LIMS, ELN, document-management system) provides the signature, access-control, and audit-trail layer that completes the compliance envelope under 21 CFR Part 11.

14.2 Software Dependencies

The suite relies on widely used, open-source scientific-Python dependencies: NumPy and SciPy for numerical operations, Numba for JIT compilation, h5py for the HDF5 codon-usage database, Matplotlib for static plots, ReportLab or WeasyPrint for PDF generation, and bundled ViennaRNA and LinearFold libraries for RNA folding. Each dependency is pinned to a specific version in the deployment manifest; dependency updates are managed via a documented change-control process and include re-running the functional-test suite. The dependency set is small, well-maintained, and subject to ongoing security patching.
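
A site that wants to confirm that an installed environment still matches its pinned manifest could use a check along these lines. This is a sketch, not the suite's deployment tooling: the `check_pins` helper is hypothetical, and the manifest is assumed to be a plain mapping of distribution name to exact version string.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins: dict, lookup=version) -> dict:
    """Compare installed versions against pinned ones.

    pins   -- mapping of distribution name to the exact pinned version
    lookup -- function returning the installed version for a name
              (defaults to importlib.metadata.version)
    Returns {name: (pinned, installed_or_None)} for every mismatch.
    """
    problems = {}
    for name, pinned in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # distribution is not installed at all
        if installed != pinned:
            problems[name] = (pinned, installed)
    return problems
```

An empty result means every pinned distribution is present at the expected version; the `lookup` parameter exists so the check can be exercised against a stub during validation.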

No external network call is made during a design run; the tool operates entirely on local inputs and local databases, which is an important consideration for client-deployed instances handling proprietary sequences.

14.3 Detailed Regulatory Framework Alignment

Software Validation Under GAMP 5

The Bioneer suite is positioned as a GAMP 5 Category 3 or Category 4 software product depending on how a specific site configures it. Category 3 (non-configured, used as shipped) applies when the site uses default templates and default constraint libraries; Category 4 (configured) applies when the site imports custom synthesis-vendor templates, custom host-expression templates, custom UTR libraries, or custom codon-usage tables. Both categories require risk-based validation; the suite ships with a functional test suite that exercises representative inputs and verifies outputs, and IQ/OQ/PQ protocol templates are available as a deliverable for customers requiring a formal validation package.

21 CFR Part 11 Considerations

Part 11 compliance is a system-level property rather than a tool-level property. The suite contributes to Part 11 compliance by producing outputs whose integrity can be verified (every output file is plain-text or standard-format, and every run can be regenerated deterministically from the saved config and seed, so an altered file is detectable by comparison) and by recording attribution metadata (operator, host, timestamp) in the run manifest. The enclosing electronic-records management system is responsible for access control, e-signature, and the audit trail of record modifications. Clients operating in a 21 CFR Part 11 environment typically store the suite's output directories in a controlled-document repository and pair them with their own e-signature layer.
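
On the client side, integrity checking of an archived output directory is commonly done with a checksum manifest. The sketch below is one way to build and re-verify such a manifest; it is not the suite's own run manifest, and the function names, the dictionary layout, and the choice to pass the operator identity in from the site's identity-management layer are assumptions.

```python
import hashlib
import socket
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(results_dir, operator: str) -> dict:
    """Record a SHA-256 digest for every file under results_dir,
    together with attribution metadata (operator, host, UTC timestamp)."""
    root = Path(results_dir)
    files = {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    return {
        "operator": operator,
        "host": socket.gethostname(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def changed_files(manifest: dict, results_dir) -> list:
    """Return relative paths that were added, removed, or modified
    since the manifest was built."""
    recorded = manifest["files"]
    current = build_manifest(results_dir, manifest["operator"])["files"]
    return sorted(
        name
        for name in set(recorded) | set(current)
        if recorded.get(name) != current.get(name)
    )
```

A manifest of this kind only detects change; protecting the manifest itself against alteration remains the job of the controlled-document repository and its audit trail.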

ICH Q8 to Q14 Mapping

ICH Q8 (Pharmaceutical Development): the suite supports Quality by Design by making explicit, and reporting, the optimization objective, the critical quality attributes (CAI, GC, MFE, immunogenicity, structural integrity), and the design space (the range of tunable parameters).
ICH Q9 (Quality Risk Management): the fitness-term weights and rejection thresholds are risk-based, with hard constraints for the highest-risk features (T7 promoter mimics, cryptic splice sites) and soft constraints for lower-risk features (homopolymer length, local GC).
ICH Q10 (Pharmaceutical Quality System): deterministic outputs enable integration with CAPA and change control.
ICH Q11 (Development and Manufacture of Drug Substances): design-history traceability is provided by config plus seed plus checkpoint.
ICH Q12 (Lifecycle Management): the suite's versioning and checkpoint system supports lifecycle-phase-appropriate change management.
ICH Q14 (Analytical Procedure Development): report metrics map directly to analytical-release specifications.

FDA and EMA Specific Considerations

The FDA's 2022 guidance for gene therapy and 2023 discussion of mRNA vaccine CMC expectations converge on the need for sequence-level traceability, justification of each design decision, and documentation of the optimization objective used to select the final drug-substance sequence. The EMA's 2023 mRNA vaccine guideline adds explicit expectations for documenting the IVT-compatibility of the sequence, the capping strategy's sequence-level fit, and the immunogenicity profile. The Bioneer suite's output bundle addresses all of these expectations by construction; the remaining work for a regulatory submission is to contextualize the tool's decisions against the specific product's target product profile and clinical-pharmacology rationale.

Client-Site Deployment and Data-Integrity Envelope

The suite is delivered for on-client-premise deployment; it does not require external network connectivity, and no design input or output is sent to any external server. This is consistent with the expectations of biopharma clients handling proprietary or investigational-new-drug sequences. On deployment, the tool integrates with the client's data-integrity envelope — controlled storage for outputs, version-controlled configuration, identity-management for operator attribution, and change-control for template updates. The documented software-dependency set is pinned at delivery time and can be revalidated by the client as part of their periodic IT-security assessment.

15. References

Sharp, P. M., & Li, W. H. (1987). The codon adaptation index — a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281–1295.

Coleman, J. R., Papamichail, D., Skiena, S., Futcher, B., Wimmer, E., & Mueller, S. (2008). Virus attenuation by genome-scale changes in codon pair bias. Science, 320(5884), 1784–1787.

Kudla, G., Lipinski, L., Caffin, F., Helwak, A., & Zylicz, M. (2006). High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biology, 4(6), e180.

Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13), 3406–3415.

Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M., & Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. PNAS, 101(19), 7287–7292.

Reuter, J. S., & Mathews, D. H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129.

Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6, 26.

Huang, L., Zhang, H., Deng, D., Zhao, K., Liu, K., Hendrix, D. A., & Mathews, D. H. (2019). LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search. Bioinformatics, 35(14), i295–i304.

Zhang, H., Zhang, L., Lin, A., et al. (2023). Algorithm for optimized mRNA design improves stability and immunogenicity. Nature, 621, 396–403.

Karikó, K., Buckstein, M., Ni, H., & Weissman, D. (2005). Suppression of RNA recognition by Toll-like receptors: the impact of nucleoside modification and the evolutionary origin of RNA. Immunity, 23(2), 165–175.

Karikó, K., Muramatsu, H., Welsh, F. A., Ludwig, J., Kato, H., Akira, S., & Weissman, D. (2008). Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Molecular Therapy, 16(11), 1833–1840.

Pardi, N., Hogan, M. J., Porter, F. W., & Weissman, D. (2018). mRNA vaccines — a new era in vaccinology. Nature Reviews Drug Discovery, 17, 261–279.

Wesselhoeft, R. A., Kowalski, P. S., & Anderson, D. G. (2018). Engineering circular RNA for potent and stable translation in eukaryotic cells. Nature Communications, 9, 2629.

Vogel, A. B., Lambert, L., Kinnear, E., et al. (2018). Self-amplifying RNA vaccines give equivalent protection against influenza to mRNA vaccines but at much lower doses. Molecular Therapy, 26(2), 446–455.

Lundstrom, K. (2020). Self-amplifying RNA viruses as RNA vaccines. International Journal of Molecular Sciences, 21(14), 5130.

Grote, A., Hiller, K., Scheer, M., Münch, R., Nörtemann, B., Hempel, D. C., & Jahn, D. (2005). JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Research, 33(W), W526–W531.

Puigbò, P., Guzmán, E., Romeu, A., & Garcia-Vallvé, S. (2007). OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic Acids Research, 35(W), W126–W131.

Chin, J. X., Chung, B. K.-S., & Lee, D.-Y. (2014). Codon Optimization OnLine (COOL): a web-based multi-objective optimization platform for synthetic gene design. Bioinformatics, 30(15), 2210–2212.

Hoover, D. M., & Lubkowski, J. (2002). DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Research, 30(10), e43.

Alexaki, A., Kames, J., Holcomb, D. D., et al. (2019). Codon and codon-pair usage tables (CoCoPUTs): facilitating genetic variation analyses and recombinant gene design. Journal of Molecular Biology, 431(13), 2434–2441.

Presnyak, V., Alhusaini, N., Chen, Y.-H., et al. (2015). Codon optimality is a major determinant of mRNA stability. Cell, 160(6), 1111–1124.

Leppek, K., Byeon, G. W., Kladwang, W., et al. (2022). Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics. Nature Communications, 13, 1536.

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.

Bruccoleri, R. E., & Heinrich, G. (1988). An improved algorithm for nucleic acid secondary structure display. Computer Applications in the Biosciences, 4(1), 167–173.

ICH Q8(R2), Q9, Q10, Q11, Q12, Q14. International Council for Harmonisation, Pharmaceutical Quality guidelines.

FDA (2022). Chemistry, Manufacturing, and Control (CMC) Information for Human Gene Therapy Investigational New Drug Applications (INDs) — Guidance for Industry.

EMA (2023). Guideline on the quality aspects of mRNA vaccines.

WHO (2022). WHO guidelines on the quality, safety and efficacy of messenger RNA vaccines for the prevention of infectious diseases.

ISPE GAMP 5 (2008, 2022 update). A Risk-Based Approach to Compliant GxP Computerized Systems.

16. Customer Evaluation Checklist — Frequently Asked Questions

The following checklist summarizes the practical questions a prospective customer typically asks when evaluating a sequence-design tool for internal adoption. Each question is answered in this whitepaper; this section gathers the answers into one place for rapid reference.

Does the Tool Cover My Modality?

Yes. The suite covers linear mRNA (GeneCrafter for CDS, IVTDesigner for full construct, UTRDesigner for UTR-only work), self-amplifying RNA (SaRNADesigner with three alphavirus backbones), and circular RNA (CircularDesigner with four circularization scaffolds). To our knowledge, no single commercial or academic alternative covers all three modalities in a unified interface.

Will the Output Pass My Preferred Synthesis Vendor's QC?

Yes, by design. The synthesis-vendor template system enforces IDT, Twist, GenScript, and ATUM profiles, plus user-extensible custom vendor profiles, at the GA fitness level. Internal benchmarks show >95% first-pass synthesis success when the active vendor template is enforced versus ~70% for CAI-only optimization without template enforcement.

Is the Tool Auditable for Regulatory Submissions?

Yes. Every run produces a reproducible bundle (config + seed + checkpoint + manifest + report). The output is ALCOA+-compatible by construction; the enclosing electronic-records system (LIMS, ELN, DMS) provides access control and e-signature. Validation packages (IQ/OQ/PQ templates) are available for GAMP 5 Category 3/4 deployment.

What Is the Data-Egress Profile?

Zero. The suite runs entirely on client infrastructure. No sequence or configuration is transmitted to any external server during a design run. This is material for programs handling proprietary sequences under IND or related confidentiality obligations.

What Human Effort Does the Tool Replace?

A single run (minutes to hours, depending on sequence length and mode) replaces roughly the effort a senior sequence-design scientist would spend running a CAI optimizer, a structure check, a synthesis-vendor QC scan, an immunogenicity evaluation, a UTR selection, and a report write-up, typically one to three days per design. The tool does not replace the judgment involved in interpreting the output; it replaces the mechanical labor of generating it.

What Training Is Required?

A molecular biologist or bioinformatician with basic Python CLI experience can run the tool after one hour of onboarding on the principal arguments. Interpreting the reports requires familiarity with CAI, MFE, UTR biology, and the relevant therapeutic-modality considerations — knowledge that is already part of the scientific team's baseline competency for any mRNA program.

How Is the Tool Maintained?

The suite is under active development by Bioneer. Codon-usage tables are refreshable from the CoCoPUTs source; UTR libraries are curated and versioned; synthesis-vendor templates are updated as vendors publish new constraints. Tool versions are semantic (major.minor.patch); the output manifest records the exact version used so that a future-version run can be compared to a past-version run on the same sequence.

What Are the Known Failure Modes?

Known limitations are documented transparently in §13.3. Principal failure modes: for very long sequences (≥ 10 kb) LinearFold accuracy degrades relative to exact folding; immunogenicity is a composite score that does not substitute for wet-lab testing; codon-usage tables are organism-level averages and may not capture tissue-specific effects. In each case the workaround is documented.

Is There a Way to Try the Tool Before Committing?

Yes. A limited-scope pilot on one or two customer sequences can be arranged; the pilot produces the full output bundle using the customer's preferred synthesis vendor and host context, and the customer can compare the output against their existing tool's output on the same inputs before adopting the suite for production use.

17. Glossary

ALCOA+ — data-integrity principles: Attributable, Legible, Contemporaneous, Original, Accurate; plus Complete, Consistent, Enduring, Available.
ARCA — Anti-Reverse Cap Analog; a 5' cap analog incorporated co-transcriptionally during IVT.
ARE — AU-Rich Element; 3' UTR sequence feature associated with mRNA decay (canonical motif AUUUA, written ATTTA in DNA notation; classes 1–3 by tandem repeat count).
CAI — Codon Adaptation Index; geometric-mean metric of codon bias relative to a reference set (Sharp & Li 1987).
CDS — Coding Sequence; the portion of an mRNA that is translated into protein.
CleanCap — Co-transcriptional capping reagent (TriLink); requires an AG (CleanCap AG) or AU (CleanCap AU; AT in the DNA template) start at positions +1/+2.
CoCoPUTs — Codon and codon-pair usage tables derived from GenBank; the source of Bioneer's HDF5 codon-usage database.
CpG — Cytidine-phosphate-Guanosine dinucleotide; innate-immune and ZAP-recognition motif; depleted in vaccine sequences.
CPB — Codon Pair Bias; the propensity of a codon pair to co-occur beyond what single-codon frequencies predict (Coleman et al. 2008).
CSE — Conserved Sequence Element; structured region in alphavirus replicons essential for replicase function.
dsRNA — double-stranded RNA; MDA5/TLR3 immune-sensor substrate; minimized in mRNA design.
GA — Genetic Algorithm.
GAMP 5 — Good Automated Manufacturing Practice, 5th edition; software categorization and validation framework.
IRES — Internal Ribosome Entry Site; cap-independent translation initiation element.
IVT — In Vitro Transcription; enzymatic synthesis of RNA from a DNA template using T7, SP6, or T3 polymerase.
Kozak — consensus translation-initiation context around the AUG start codon.
LinearDesign — joint CAI+MFE optimization algorithm (Zhang et al. 2023).
LinearFold — linear-time beam-search RNA folding algorithm (Huang et al. 2019).
LNP — Lipid Nanoparticle; the formulation vehicle used for clinical mRNA delivery.
m1Ψ — N1-methylpseudouridine; the nucleoside modification used in Comirnaty and Spikevax.
MFE — Minimum Free Energy; thermodynamic descriptor of the most stable RNA fold.
NSGA-II — Non-dominated Sorting Genetic Algorithm II; Pareto-frontier multi-objective optimizer (Deb et al. 2002).
Naview — radial-tree 2-D layout algorithm for RNA secondary structure (Bruccoleri & Heinrich 1988).
PIE — Permuted Intron-Exon; Group-I or Group-II intron engineering for circular RNA design.
PSSM — Position-Specific Scoring Matrix; used for cryptic splice-site detection.
RdRp — RNA-dependent RNA Polymerase; replicates saRNA.
saRNA — self-amplifying RNA; alphavirus-replicon-based vaccine platform.
SGP — Subgenomic Promoter; alphavirus internal promoter driving expression of the downstream ORF.
Tornado — tandem twister/HDV ribozyme strategy for circular RNA.
uORF — upstream open reading frame; 5' UTR feature that can reduce main-ORF translation.
UTR — Untranslated Region; 5' or 3' non-coding portion of an mRNA.
Viennarnaplot — the suite's SVG/PDF 2-D structure renderer.
ZAP — Zinc-finger Antiviral Protein; CpG-dependent RNA-recognition innate sensor.
Zuker — O(n³) thermodynamic MFE recursion.