Beyond tripeptides

navigating the massive search space for self-assembling peptides

Alexander van Teijlingen


Self-assembling peptide nanostructures have been shown to be of great importance in nature and have presented many promising applications in medicine as drug delivery vehicles (DDVs), biosensors, and antivirals. They are very promising candidates for the growing field of bottom-up manufacture of functional nanomaterials. Previous work has screened all possible amino acid combinations for di- and tripeptides in search of such materials; however, the enormous complexity and variety of linear combinations of the twenty amino acids makes exhaustive simulation of all combinations beyond a chain length of 3 or 4 unfeasible. We employ an active machine learning method which improves itself and its dataset upon each iteration while searching for the best candidate peptides.
Figure 3. Coarse-grained molecular dynamics simulation of KVDYY pentapeptide over 50 ns (water omitted). Visualised using the OVITO scientific visualization and analysis software[8].


Many peptides have been shown to exhibit the tendency to self-assemble in water into a vast array of different structures, including hydrogels, micelles, nanovesicles, nanotubes and nanofibers. A favourable property of many unprotected peptide nanomaterials is their inherent biocompatibility, and recently there has been a drive in large-scale efforts to identify peptides of interest to leverage their useful properties.

Previous work has surveyed the entire search space of di- and tripeptides[2,3], comprising 400 and 8000 molecular structures respectively, via coarse-grained molecular dynamics (CGMD). The logical next step, a survey of all tetrapeptides, would comprise 160,000 unique simulations: an achievable but costly endeavour, and one that would prove impossible for pentapeptides and beyond (Figure 1).
Figure 1. Relative size of different peptide datasets, accurately scaled by volume.
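The growth of the search space follows from the twenty amino acids: a linear peptide of chain length n has 20^n possible sequences. A minimal Python sketch (the function name is ours, chosen for illustration) reproduces the dataset sizes quoted above:

```python
def search_space_size(chain_length: int, alphabet_size: int = 20) -> int:
    """Number of distinct linear peptides of a given chain length."""
    return alphabet_size ** chain_length


# Di- to hexapeptides
for n in range(2, 7):
    print(f"chain length {n}: {search_space_size(n):,} peptides")
```

This gives 400, 8,000 and 160,000 sequences for di-, tri- and tetrapeptides, rising to 3.2 million pentapeptides and 64 million hexapeptides.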
$$ \text{AP} = \frac{\text{SASA}_{\text{initial}}}{\text{SASA}_{\text{final}}} $$
Equation 1. Aggregation propensity is calculated as a ratio of initial and final (50 ns) solvent accessible surface area.
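Equation 1 can be written as a one-line function. The sketch below assumes the two solvent accessible surface areas have already been measured from the first and last frames of a simulation; the function name and the numbers in the example are illustrative placeholders, not simulation results.

```python
def aggregation_propensity(sasa_initial: float, sasa_final: float) -> float:
    """Ratio of initial to final solvent accessible surface area (Equation 1)."""
    if sasa_initial <= 0 or sasa_final <= 0:
        raise ValueError("SASA values must be positive")
    return sasa_initial / sasa_final


# Placeholder areas in arbitrary units: half of the surface is buried.
print(aggregation_propensity(sasa_initial=300.0, sasa_final=150.0))  # 2.0
```

A value close to 1 means the exposed surface barely changed, while larger values mean more surface was buried during the simulation.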
$$ \log P = \sum_{i=1}^{N} \Delta G_{\text{water-oct},\, i} $$
Equation 2. Method for calculating log P of peptides proposed by White et al.[5]
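Equation 2 is a sum of per-residue transfer free energies, which is simple to sketch in Python. The free energies in the example are placeholders chosen for easy arithmetic and are NOT the published Wimley-White values; a real calculation would take them from ref. [5]. The function name is also ours.

```python
from typing import Mapping

# Placeholder values for illustration ONLY (not the published scale).
PLACEHOLDER_DG = {"K": 2.0, "V": -0.5, "D": 3.0, "Y": -1.0}


def log_p(sequence: str, dg_water_oct: Mapping[str, float]) -> float:
    """Sum the per-residue water-octanol free energies (Equation 2)."""
    # A residue missing from the scale raises KeyError rather than being
    # silently ignored.
    return sum(dg_water_oct[residue] for residue in sequence)


# 2.0 - 0.5 + 3.0 - 1.0 - 1.0 = 2.5 with the placeholder scale above
print(log_p("KVDYY", PLACEHOLDER_DG))  # 2.5
```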


Our aim is to survey peptides of chain length 4 to 6, with the intention that this method could be further scaled to peptides of chain length 7 to 8. We assume aggregation to be the precondition for ordered self-assembly, and that it can be measured via coarse-grained molecular dynamics (CGMD) at low computational expense. We use the aggregation propensity (AP) score (Equation 1) from 50 ns simulations to quantify the aggregation of peptides in water.
Figure 2. Flow chart showing the active learning loop as well as preparation of the initial training dataset. Our stopping criterion was based on the number of peptides generated.
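The loop in Figure 2 can be sketched in a few lines of Python. This is a toy under stated assumptions, not our pipeline: `toy_oracle` stands in for a CGMD simulation, `fit_residue_means` stands in for the regression model, the pool uses a reduced five-letter alphabet, and choosing the top-predicted peptides in each iteration is an assumed selection rule. All names are hypothetical.

```python
import itertools
import random
from typing import Callable, Dict, List

Predictor = Callable[[str], float]


def toy_oracle(peptide: str) -> float:
    """Stand-in for a CGMD run returning an AP score (toy rule only)."""
    return 1.0 + sum(peptide.count(r) for r in "FWY") / len(peptide)


def fit_residue_means(labelled: Dict[str, float]) -> Predictor:
    """Toy regressor: predict the mean label seen for each residue."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for peptide, score in labelled.items():
        for residue in peptide:
            totals[residue] = totals.get(residue, 0.0) + score
            counts[residue] = counts.get(residue, 0) + 1
    overall = sum(labelled.values()) / len(labelled)

    def predict(peptide: str) -> float:
        # Residues never seen in training fall back to the overall mean.
        per_residue = [totals[r] / counts[r] if r in counts else overall
                       for r in peptide]
        return sum(per_residue) / len(per_residue)

    return predict


def active_learning(pool: List[str], initial: List[str],
                    oracle: Callable[[str], float],
                    fit: Callable[[Dict[str, float]], Predictor],
                    batch_size: int, max_labelled: int) -> Dict[str, float]:
    """Train, predict, label the top candidates, and repeat."""
    labelled = {p: oracle(p) for p in initial}
    while len(labelled) < max_labelled:  # stop on number of peptides
        predict = fit(labelled)
        candidates = sorted((p for p in pool if p not in labelled),
                            key=predict, reverse=True)
        if not candidates:
            break
        take = min(batch_size, max_labelled - len(labelled))
        for peptide in candidates[:take]:
            labelled[peptide] = oracle(peptide)  # "simulate" and add to data
    return labelled


pool = ["".join(p) for p in itertools.product("AFKDY", repeat=3)]
random.seed(0)
initial = random.sample(pool, 10)
result = active_learning(pool, initial, toy_oracle, fit_residue_means,
                         batch_size=5, max_labelled=40)
best = max(result, key=result.get)
print(best, result[best])
```

Each pass retrains on everything labelled so far, so both the model and its dataset improve as the search proceeds.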


Our main regression algorithm uses 47 parameters calculated using the Mordred [4] software package. However, these parameters are too expensive (in time to generate and space to store) to apply to very large datasets. This prompted us to create our own reduced set of parameters that can be generated with only the peptide single-letter code as input. We called this programme Judred, and the parameters it produces are listed below.
Name        Description
SP2         Number of SP2 carbon atoms
NH2         Number of NH2/NH3 groups on the side chain(s)
MW          Molecular weight
S           Number of sulphur atoms
log P       Wimley-White log P (Equation 2) [5]
Z           Charge
MaxASA      Maximum solvent accessible surface [6]
Bulkiness   Sum of amino acid bulkiness [7]
OH          Number of OH groups (excluding backbone)
RotRatio    Ratio of SP2 to SP3 carbon atoms
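To illustrate how parameters of this kind can be generated from the single-letter code alone, the sketch below computes three of them (S, OH and Z) by table lookup. It is not the Judred implementation: the function name is hypothetical, and the charge table assumes neutral pH, neutral histidine, and terminal charges that cancel.

```python
from typing import Dict

SULPHUR = {"C": 1, "M": 1}            # cysteine, methionine
HYDROXYL = {"S": 1, "T": 1, "Y": 1}   # serine, threonine, tyrosine
# Assumption: side-chain charges at neutral pH, histidine treated as neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def simple_descriptors(sequence: str) -> Dict[str, int]:
    """Count-based descriptors from a one-letter peptide sequence."""
    return {
        "S": sum(SULPHUR.get(r, 0) for r in sequence),
        "OH": sum(HYDROXYL.get(r, 0) for r in sequence),
        "Z": sum(CHARGE.get(r, 0) for r in sequence),
    }


print(simple_descriptors("KVDYY"))  # {'S': 0, 'OH': 2, 'Z': 0}
```

Because every value is a sum over residues, the cost scales with chain length only, which is what makes such descriptors practical for very large datasets.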

Initial dataset

The initial training set was constructed by taking a representative sample of the dataset: a set of peptides that had values for each (normalized) Judred parameter at specific intervals (Figure 4). These covered the range of values from lowest to highest, with a total of 30 equally spaced values for the parameters SPRatio, NH2, MW, Log P WW, MaxASA, RotRatio & Bulkiness, and 6 values for S & OH in an effort to reduce data imbalance, as these are not applicable to most peptides (total = 252).
Figure 4. Bars showing relative positions of the representative sample peptides taken at equal intervals of each Judred parameter.
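The sampling idea can be sketched for a single parameter: normalise the values to the range 0 to 1, place equally spaced targets across that range, and keep the peptide nearest to each target. The function name, the removal of duplicate picks, and the example values are assumptions made for illustration; repeating the procedure for each parameter and merging the picks would give the full sample.

```python
from typing import Dict, List


def representative_sample(values: Dict[str, float], n_points: int) -> List[str]:
    """Pick the peptide nearest to each of n_points equally spaced targets."""
    if n_points < 2:
        raise ValueError("need at least two target points")
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    if span == 0:
        raise ValueError("parameter is constant; nothing to spread over")
    normalised = {p: (v - lowest) / span for p, v in values.items()}

    chosen: List[str] = []
    for k in range(n_points):
        target = k / (n_points - 1)  # 0, ..., 1 in equal steps
        nearest = min(normalised, key=lambda p: abs(normalised[p] - target))
        if nearest not in chosen:  # assumption: duplicate picks are dropped
            chosen.append(nearest)
    return chosen


# Placeholder parameter values for six hypothetical peptides.
example = {"p1": 0.0, "p2": 1.0, "p3": 2.0, "p4": 5.0, "p5": 9.0, "p6": 10.0}
print(representative_sample(example, n_points=3))  # ['p1', 'p4', 'p6']
```

Unlike random sampling, this guarantees that the extremes of each parameter appear in the training set even when most peptides cluster around a few values.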

General model performance

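We found that our sampling method produced more accurate predictions than random sampling for both Judred and Mordred parameters, with the exception of the Mordred hexapeptide model, where random sampling gave a marginally lower RMSE (0.104 vs 0.107).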

Judred (random forest)
        Training data             RMSE
Tetra   Representative    0.79    0.156
        Random            0.70    0.185
Penta   Representative    0.70    0.138
        Random            0.68    0.141
Hexa    Representative    0.77    0.098
        Random            0.58    0.131

Mordred (linear SVR)
        Training data             RMSE
Tetra   Representative    0.85    0.134
        Random            0.82    0.146
Penta   Representative    0.77    0.119
        Random            0.75    0.126
Hexa    Representative    0.72    0.107
        Random            0.74    0.104
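For reference, the RMSE reported above is the root of the mean squared difference between predicted and simulated AP scores. A minimal Python version, with a function name and example numbers of our own choosing, is:

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Placeholder scores: a single error of 2 spread over three predictions.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Lower values indicate predictions that lie closer to the simulated scores.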


The dataset may be further pre-processed by removing peptides outside defined ranges of Judred parameters. For example, prior to training machine learning models we removed insoluble peptides (keeping only those with log P < 0) so that our search was restricted to water-soluble aggregating peptides. This process can be performed for any Judred parameter or combination of parameters: for example, we may want to search only among peptides with a positive charge, only those with a charge of -2, or those that have a charge of +2 and a molecular weight of < 100 amu.

We apply this to test different datasets for soluble peptides, using the following restrictions:

  •   Tetrapeptides, log P < 0
  •   Pentapeptides, log P < 0
  •   Pentapeptides, log P < -4
  •   Pentapeptides, log P < 0 & no aromatic residues
  •   Hexapeptides, unrestricted
  •   Hexapeptides, log P < 0
  •   Hexapeptides, log P < -6
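Restrictions of this kind amount to filtering the dataset with one condition per parameter. The sketch below is illustrative only: the `restrict` helper, the record layout, and the peptide names and values are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def restrict(dataset: List[Record],
             **conditions: Callable[[Any], bool]) -> List[Record]:
    """Keep the records that satisfy every per-parameter condition."""
    return [row for row in dataset
            if all(test(row[name]) for name, test in conditions.items())]


# Placeholder records; the values are not computed descriptors.
dataset = [
    {"seq": "pep1", "logP": -1.5, "Z": 0},
    {"seq": "pep2", "logP": 0.8, "Z": 2},
    {"seq": "pep3", "logP": -4.2, "Z": 2},
]

soluble = restrict(dataset, logP=lambda v: v < 0)
soluble_charged = restrict(dataset, logP=lambda v: v < 0, Z=lambda v: v == 2)
print([row["seq"] for row in soluble])          # ['pep1', 'pep3']
print([row["seq"] for row in soluble_charged])  # ['pep3']
```

Passing each condition as a function keeps strict bounds such as log P < 0 and exact matches such as a charge of +2 in the same interface.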
Figure 5. Moving graph showing the AP score of each peptide iteratively found by the active learning process. The structure formed by the top pentapeptide found (KVDYY) has been visualised in Figure 3.
[1] Deng, L.; Wang, Y. Multiscale Computational Prediction of β-Sheet Peptide Self-Assembly Morphology. Mol. Simul. 2020, 0 (0), 1–11.
[2] P. Frederix, R. V. Ulijn, N. T. Hunt and T. Tuttle, J. Phys. Chem. Lett., 2011, 2, 2380–2384.
[3] P. Frederix, G. G. Scott, Y. M. Abul-Haija, D. Kalafatovic, C. G. Pappas, N. Javid, N. T. Hunt, R. V. Ulijn and T. Tuttle, Nat. Chem., 2015, 7, 30–37.
[4] H. Moriwaki, Y. S. Tian, N. Kawashita and T. Takagi, J. Cheminform., 2018, 10, 1–14.
[5] White, S. H.; Wimley, W. C. Hydrophobic Interactions of Peptides with Membrane Interfaces. Biochim. Biophys. Acta - Rev. Biomembr. 1998, 1376 (3), 339–352.
[6] Tien, M. Z.; Meyer, A. G.; Sydykova, D. K.; Spielman, S. J.; Wilke, C. O. Maximum Allowed Solvent Accessibilites of Residues in Proteins. PLoS One 2013, 8 (11), 1–16.
[7] Zimmerman, J. M.; Eliezer, N.; Simha, R. The Characterization of Amino Acid Sequences in Proteins by Statistical Methods. J. Theor. Biol. 1968, 21 (2), 170–201.
[8] A. Stukowski, Model. Simul. Mater. Sci. Eng., 2010, 18, 015012.