Abstract
Self-assembling peptide nanostructures have been shown to be of great importance in nature and have presented many promising applications in medicine as drug delivery vehicles (DDVs), biosensors, and antivirals. Because they are very promising candidates for the growing field of bottom-up manufacture of functional nanomaterials, previous work has screened all possible amino acid combinations for di- and tripeptides in search of such materials. However, the enormous complexity and variety of linear combinations of the twenty amino acids makes exhaustive simulation of all combinations beyond a chain length of 3 or 4 unfeasible. We therefore employ an active machine learning method which improves itself and its dataset upon each iteration while searching for the best candidate peptides.
Figure 3. Coarse-grained molecular dynamics simulation of the KVDYY pentapeptide over 50 ns (water omitted). Visualised using the OVITO scientific visualization and analysis software[8].
Introduction
Many peptides have been shown to self-assemble in water into a vast array of different structures, including hydrogels, micelles, nanovesicles, nanotubes, and nanofibers.
A favourable property of many unprotected peptide nanomaterials is their inherent biocompatibility, and recently there has been a drive in large-scale efforts to identify peptides of interest in order to leverage their useful properties.
Previous work has surveyed the entire search space of di- and tripeptides[2,3] (400 and 8,000 molecular structures, respectively) via coarse-grained molecular dynamics (CGMD).
The logical next step, a survey of all tetrapeptides, would comprise 160,000 unique simulations: an achievable but costly endeavour. The same exhaustive approach would prove impossible for pentapeptides and beyond (Figure 1).
Figure 1. Relative size of different peptide datasets, accurately scaled by volume.
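The dataset sizes quoted above follow from counting linear sequences of the twenty amino acids, i.e. 20 raised to the chain length. A minimal sketch of that arithmetic (the function name `search_space_size` is introduced here for illustration only):

```python
# Number of unique linear peptides of a given chain length built from
# the 20 canonical amino acids.
N_AMINO_ACIDS = 20


def search_space_size(chain_length: int) -> int:
    """Return the number of distinct sequences of the given length."""
    return N_AMINO_ACIDS ** chain_length


for name, length in [("di", 2), ("tri", 3), ("tetra", 4), ("penta", 5), ("hexa", 6)]:
    print(f"{name}peptides: {search_space_size(length):,}")
# dipeptides: 400
# tripeptides: 8,000
# tetrapeptides: 160,000
# pentapeptides: 3,200,000
# hexapeptides: 64,000,000
```

Each additional residue multiplies the number of required simulations by 20, which is why exhaustive screening stops being practical shortly after tetrapeptides.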
$$ \mathrm{AP} = \frac{\mathrm{SASA}_{\mathrm{initial}}}{\mathrm{SASA}_{\mathrm{final}}} $$
Equation 1. Aggregation propensity is calculated as a ratio of initial and final (50 ns) solvent accessible surface area.
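As a small worked example of Equation 1, the ratio can be computed directly once the two surface areas are known. The SASA values below are invented for illustration and are not taken from any simulation; the function name is likewise an assumption.

```python
def aggregation_propensity(sasa_initial: float, sasa_final: float) -> float:
    """Ratio of initial to final solvent accessible surface area (Equation 1)."""
    if sasa_final <= 0:
        raise ValueError("final SASA must be positive")
    return sasa_initial / sasa_final


# Invented SASA values (any consistent unit), for illustration only.
ap = aggregation_propensity(sasa_initial=300.0, sasa_final=150.0)
print(ap)  # 2.0
```

By construction, a score above 1 means the solvent accessible surface area has decreased over the simulation, which is the signature of aggregation that the score is meant to capture.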
$$ \log P = \sum_{i=1}^{N} \Delta G_{\mathrm{water-oct},\, i} $$
Equation 2. Method for calculating log P of peptides proposed by White et al.[5]
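Equation 2 is a plain sum of per-residue contributions, so it can be sketched as a lookup over the sequence. The sketch below assumes the sum runs over the residues of the peptide; the numbers in `TOY_SCALE` are placeholders chosen for easy arithmetic and are not the published Wimley-White values[5], which would need to be supplied in their place.

```python
def log_p(sequence: str, scale: dict) -> float:
    """Sum the per-residue water-to-octanol contributions (Equation 2)."""
    return sum(scale[residue] for residue in sequence)


# PLACEHOLDER values, NOT the published scale; replace with real data.
TOY_SCALE = {"A": 0.5, "W": -2.0, "K": 3.0}

print(log_p("AWK", TOY_SCALE))  # 1.5
```

Because the descriptor needs only the single-letter sequence and a fixed table, it is cheap to evaluate across very large peptide datasets.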
Methodology
Our aim is to survey peptides of chain length 4 to 6, with the intention that this method could be further scaled to peptides of chain length 7 to 8. We assume that aggregation is a precondition for ordered self-assembly and that it can be measured via coarse-grained molecular dynamics (CGMD) at low computational expense. We use the aggregation propensity (AP) score (Equation 1) from 50 ns simulations to quantify the aggregation of peptides in water.
Figure 2. Flow chart showing the active learning loop as well as the preparation of the initial training dataset. Our stopping criterion was based on the number of peptides generated.
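The general shape of such a loop can be sketched as follows. This is an illustrative toy under stated assumptions, not the workflow of Figure 2: `toy_oracle` is a hypothetical stand-in for a CGMD simulation, the surrogate model is a 1-nearest-neighbour lookup rather than the regression algorithms used in this work, the descriptors are simple residue counts, and candidates are chosen greedily by predicted AP. Only the stopping rule (a cap on the number of peptides generated) is taken from the text.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurise(seq):
    """Toy descriptor: residue counts (stand-in for real parameters)."""
    return [seq.count(a) for a in AMINO_ACIDS]


def toy_oracle(seq):
    """HYPOTHETICAL stand-in for a CGMD simulation returning an AP score."""
    return 1.0 + 0.3 * sum(seq.count(a) for a in "FWY")


def predict(seq, labelled):
    """1-nearest-neighbour surrogate: AP of the closest labelled peptide."""
    x = featurise(seq)

    def dist(other):
        return sum((a - b) ** 2 for a, b in zip(x, featurise(other)))

    return labelled[min(labelled, key=dist)]


def active_learning(pool, initial, batch_size, max_peptides):
    labelled = {s: toy_oracle(s) for s in initial}
    while len(labelled) < max_peptides:
        unlabelled = [s for s in pool if s not in labelled]
        if not unlabelled:
            break
        # Rank unlabelled candidates by predicted AP and take the best batch.
        unlabelled.sort(key=lambda s: predict(s, labelled), reverse=True)
        batch = unlabelled[: min(batch_size, max_peptides - len(labelled))]
        # "Simulate" the batch and add the results to the training data.
        for s in batch:
            labelled[s] = toy_oracle(s)
    return labelled


random.seed(0)
pool = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=2)]
initial = random.sample(pool, 10)
result = active_learning(pool, initial, batch_size=5, max_peptides=30)
best = max(result, key=result.get)
print(len(result), best, result[best])
```

The key property is that each iteration both enlarges the training data and refocuses the search, so the expensive simulation step is spent on candidates the current model rates highly.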
Parameters
Our main regression algorithm uses 47 parameters calculated using the Mordred[4] software package. However, these parameters are too expensive (in time to generate and space to store) to apply to very large datasets. This prompted us to create our own reduced set of parameters that can be generated with only the peptide single-letter code as input. We called this programme Judred; the parameters it produces are listed below.
| Name | Description |
| --- | --- |
| SP2 | Number of SP2 carbon atoms |
| NH2 | Number of NH2/NH3 groups on the side chain(s) |
| MW | Molecular weight |
| S | Number of sulphur atoms |
| log P | Wimley-White log P (Equation 2) [5] |
| Z | Charge |
| MaxASA | Maximum solvent accessible surface area [6] |
| Bulkiness | Sum of amino acid bulkiness [7] |
| OH | Number of OH groups (excluding backbone) |
| RotRatio | Ratio of SP2 to SP3 carbon atoms |
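To illustrate how descriptors of this kind can be derived from the single-letter code alone, the sketch below computes three of the simpler ones by counting residues. It is not the Judred implementation. The residue assignments are assumptions made here: sulphur in Cys and Met; charge as +1 for Lys and Arg and -1 for Asp and Glu, with His treated as neutral and termini ignored; side-chain hydroxyls in Ser, Thr and Tyr only.

```python
def count_residues(seq: str, residues: str) -> int:
    """Count how many residues of `seq` belong to the set `residues`."""
    return sum(seq.count(r) for r in residues)


def toy_descriptors(seq: str) -> dict:
    """Sequence-only descriptors under the assumptions stated above."""
    return {
        "S": count_residues(seq, "CM"),  # sulphur atoms
        "Z": count_residues(seq, "KR") - count_residues(seq, "DE"),  # net charge
        "OH": count_residues(seq, "STY"),  # side-chain hydroxyl groups
    }


print(toy_descriptors("KVDYY"))  # {'S': 0, 'Z': 0, 'OH': 2}
```

Since every value is a lookup or a count, generating the full parameter set scales linearly with the number of peptides and needs no 3D structure.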
Initial dataset
The initial training set was constructed by taking a representative sample of the dataset: a set of peptides whose (normalized) values for each Judred parameter fall at specific intervals (Figure 4). These span the range of values from lowest to highest, with a total of 30 equally spaced values for the parameters SPRatio, NH2, MW, Log P WW, MaxASA, RotRatio & Bulkiness, and 6 values for S & OH in an effort to reduce data imbalance, as these are not applicable to most peptides (total = 252).
Figure 4. Bars showing relative positions of the representative sample peptides taken at equal intervals of each Judred parameter.
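One way to realise this kind of equal-interval sampling for a single parameter is to place evenly spaced targets between the minimum and maximum and pick the peptide closest to each. The sketch below is an assumption about the mechanics rather than the procedure actually used; in particular, dropping duplicate picks is a choice made here. The peptide labels and values are invented.

```python
def representative_sample(values: dict, n_points: int) -> list:
    """For each of n_points equally spaced targets between the minimum and
    maximum of `values`, pick the key whose value is closest to the target.
    Requires n_points >= 2. Duplicate picks are dropped (an assumption)."""
    lo, hi = min(values.values()), max(values.values())
    chosen = []
    for i in range(n_points):
        target = lo + (hi - lo) * i / (n_points - 1)
        nearest = min(values, key=lambda k: abs(values[k] - target))
        if nearest not in chosen:
            chosen.append(nearest)
    return chosen


# Invented normalized parameter values for six placeholder peptides.
toy_values = {"p1": 0.0, "p2": 0.1, "p3": 0.45, "p4": 0.5, "p5": 0.9, "p6": 1.0}
print(representative_sample(toy_values, 3))  # ['p1', 'p4', 'p6']
```

Repeating this for each parameter and pooling the picks gives a training set that covers the extremes of every descriptor, which random sampling from a skewed distribution tends to miss.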
General model performance
We found that our representative sampling method produced more accurate predictions than random sampling for both Judred and Mordred models, with the exception of hexapeptides using Mordred parameters.
Judred (random forest)
| Chain length | Training data | R² | RMSE |
| --- | --- | --- | --- |
| Tetra | Representative | 0.79 | 0.156 |
| | Random | 0.70 | 0.185 |
| Penta | Representative | 0.70 | 0.138 |
| | Random | 0.68 | 0.141 |
| Hexa | Representative | 0.77 | 0.098 |
| | Random | 0.58 | 0.131 |
Mordred (linear SVR)
| Chain length | Training data | R² | RMSE |
| --- | --- | --- | --- |
| Tetra | Representative | 0.85 | 0.134 |
| | Random | 0.82 | 0.146 |
| Penta | Representative | 0.77 | 0.119 |
| | Random | 0.75 | 0.126 |
| Hexa | Representative | 0.72 | 0.107 |
| | Random | 0.74 | 0.104 |
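For reference, the two metrics in the tables above can be computed as shown below. This sketch assumes R² denotes the coefficient of determination (the text does not say whether it is this or a squared correlation coefficient); the sample values are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented AP scores: a model that always predicts the mean scores R² = 0.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
print(r_squared(observed, predicted), rmse(observed, predicted))
```

A higher R² and a lower RMSE both indicate better agreement between predicted and simulated AP scores, which is the sense in which the two training sets are compared.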
Results
The dataset may be further pre-processed by removing peptides outside defined ranges of Judred parameters. For example, prior to training machine learning models we removed insoluble peptides, keeping only those with log P < 0, so that our search was restricted to water-soluble aggregating peptides. This process can be performed for any Judred parameter or combination of parameters: for example, we may want to search only among peptides with a positive charge, or with a charge of -2, or among peptides that have a charge of +2 and a molecular weight of < 100 amu.
We apply this to test different datasets for soluble peptides, using the following restrictions:
- Tetrapeptides, log P < 0
- Pentapeptides, log P < 0
- Pentapeptides, log P < -4
- Pentapeptides, log P < 0 & no aromatic residues
- Hexapeptides, unrestricted
- Hexapeptides, log P < 0
- Hexapeptides, log P < -6
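Restrictions of the kind listed above amount to keeping only the rows whose parameters fall inside given bounds. A minimal sketch, assuming the dataset is held as a mapping from peptide to parameter values; the function name, the dictionary layout, the placeholder peptide labels and all numbers are invented for illustration.

```python
import math


def filter_peptides(dataset: dict, bounds: dict) -> dict:
    """Keep peptides whose parameters lie strictly between (low, high) bounds.

    dataset: {peptide: {parameter_name: value}}
    bounds:  {parameter_name: (low, high)}, both bounds exclusive
    """
    return {
        pep: params
        for pep, params in dataset.items()
        if all(low < params[name] < high for name, (low, high) in bounds.items())
    }


# Invented parameter values for three placeholder peptides.
dataset = {
    "pep_a": {"logP": 2.0, "Z": 0},
    "pep_b": {"logP": -1.5, "Z": 0},
    "pep_c": {"logP": -8.0, "Z": 2},
}
below_zero = filter_peptides(dataset, {"logP": (-math.inf, 0.0)})
below_minus_4 = filter_peptides(dataset, {"logP": (-math.inf, -4.0)})
print(sorted(below_zero), sorted(below_minus_4))
# ['pep_b', 'pep_c'] ['pep_c']
```

Several bounds can be passed at once, so a combined restriction on charge and log P is a single call with two entries in `bounds`.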
Figure 5. Moving graph showing the AP score of each peptide iteratively found by the active learning process. The structure formed by the top pentapeptide found (KVDYY) has been visualised in Figure 3.
[1] Deng, L.; Wang, Y. Multiscale Computational Prediction of β-Sheet Peptide Self-Assembly Morphology. Mol. Simul. 2020, 0 (0), 1–11.
[2] P. Frederix, R. V. Ulijn, N. T. Hunt and T. Tuttle, J. Phys. Chem. Lett., 2011, 2, 2380–2384.
[3] P. Frederix, G. G. Scott, Y. M. Abul-Haija, D. Kalafatovic, C. G. Pappas, N. Javid, N. T. Hunt, R. V. Ulijn and T. Tuttle, Nat. Chem., 2015, 7, 30–37.
[4] H. Moriwaki, Y. S. Tian, N. Kawashita and T. Takagi, J. Cheminform., 2018, 10, 1–14.
[5] White, S. H.; Wimley, W. C. Hydrophobic Interactions of Peptides with Membrane Interfaces. Biochim. Biophys. Acta - Rev. Biomembr. 1998, 1376 (3), 339–352.
[6] Tien, M. Z.; Meyer, A. G.; Sydykova, D. K.; Spielman, S. J.; Wilke, C. O. Maximum Allowed Solvent Accessibilites of Residues in Proteins. PLoS One 2013, 8 (11), 1–16.
[7] Zimmerman, J. M.; Eliezer, N.; Simha, R. The Characterization of Amino Acid Sequences in Proteins by Statistical Methods. J. Theor. Biol. 1968, 21 (2), 170–201.
[8] A. Stukowski, Model. Simul. Mater. Sci. Eng., 2010, 18, 015012.