A Tour Through VaxPress
This guide will show you how to optimize a wild-type mRNA sequence using VaxPress. As an example, we’ll focus on the Hemagglutinin (HA) protein from the Influenza A virus.
Step 1. Downloading a Sequence
To begin, use the following command to download the complete CDS sequence of the Influenza A virus’s HA protein from a GenBank page.
# Download a sequence from GenBank
ID="FJ981613.1"
wget -O Influenza_HA.fa "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&rettype=fasta"
Alternatively, you may also begin with a protein sequence. For example, the following command will download the protein sequence of the Influenza A virus’s HA protein from UniProt.
# Download a protein sequence from UniProt
ID="C3W5X2"
wget -O Influenza_HA_protein.fa "https://www.uniprot.org/uniprotkb/${ID}.fasta"
Step 2. Evaluating the Given Sequence
Before starting optimization, let’s evaluate the sequence. By setting
the --iterations option to 0, VaxPress will generate a
report presenting the properties and optimality measures of the
provided sequence, without making any changes to the original
sequence.
# Evaluate the initial sequence
vaxpress -i Influenza_HA.fa -o eval_results --iterations 0
The output directory, eval_results, will contain several data
files, including report.html. Take a look at the sections on
Sequence Optimality Metrics and Predicted Secondary Structure
in the report. Since no optimization took place, the metrics of
the Initial and Optimized columns in the Sequence Optimality
Metrics section will be identical.
It’s important to note that if the input is provided as a protein sequence, the codons in the initial sequence will be chosen randomly. As such, the evaluation results may not hold significant information for the optimality of given protein sequence.
Step 3. Running VaxPress Optimization
Let’s proceed with the VaxPress optimization. With the --lineardesign
option, VaxPress utilizes the sequence optimized by LinearDesign as
a starting point. LinearDesign produces a sequence with a virtually
minimum free energy secondary structure among all possible codon
combinations for the protein sequence. Subsequently, VaxPress performs even
further optimization based on all the other parameters. This includes
factors like start codon accessibility, GC content, tandem repeats,
and predicted in-cell and in-solution stability among others, along
with the already optimized secondary structure. See the
Using LinearDesign for Optimization Initialization for more details.
# Run VaxPress optimization
vaxpress -i Influenza_HA.fa -o vaxpress_results --processes 36 \
--lineardesign 1.0 --lineardesign-dir path/to/LinearDesign \
--conservative-start 10 --initial-mutation-rate 0.01 \
--iterations 2000 --folding-engine linearfold
Using the --processes option is recommended to fully utilize
multiple CPU cores. Also, employing the --folding-engine
linearfold can reduce the running time by a minimum of 50%, without
significant deterioration in optimization quality.
The final optimized sequence can be reviewed in the report.html
found in the output directory. This document includes informative
sections such as Sequence Optimality Metrics and Predicted
Secondary Structure. The Optimization Process plots also help
diagnose and improve the optimization process. If necessary,
optimization parameters can be modified based on these plots following
the guide in this documentation.