# Beyond tripeptides

## Navigating the massive search space for self-assembling peptides

### Abstract

Self-assembling peptide nanostructures are of great importance in nature and have many promising applications in medicine, including biosensors and antivirals. Because they are promising candidates for the growing field of bottom-up manufacture of functional nanomaterials, previous work has screened all possible amino acid combinations for di- and tripeptides in search of such materials. However, the enormous complexity and variety of linear combinations of the twenty amino acids makes exhaustive simulation of all combinations beyond a chain length of 3 or 4 unfeasible. We therefore employ an active machine learning method that improves itself and its dataset with each iteration while searching for the best candidate peptides.

## Introduction

Many peptides have been shown to self-assemble in water into a vast array of structures, including hydrogels, micelles, nanovesicles, nanotubes and nanofibers. A favourable property of many unprotected peptide nanomaterials is their inherent biocompatibility, and recently there has been a drive towards large-scale efforts to identify peptides of interest and leverage their useful properties.

Previous work has surveyed the entire search space of di- and tripeptides [2,3], comprising 400 and 8,000 molecular structures respectively. The logical next step, a survey of all tetrapeptides, would comprise 160,000 unique simulations: an achievable but costly endeavour that would prove impossible for pentapeptides and beyond (Figure 1).
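
The growth of the search space follows directly from the twenty amino acids available at each position. As a minimal sketch (the function name `search_space_size` is ours, introduced only for illustration), the sizes quoted above can be reproduced as follows:

```python
# Number of amino acids available at each position of the chain.
N_AMINO_ACIDS = 20


def search_space_size(chain_length: int) -> int:
    """Number of distinct linear peptides of the given chain length."""
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    return N_AMINO_ACIDS ** chain_length


# Print the search space size for chain lengths 2 to 8.
for n in range(2, 9):
    print(f"chain length {n}: {search_space_size(n):,} peptides")
```

Each additional residue multiplies the number of candidates by twenty, which is why exhaustive simulation quickly becomes impractical.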

$$AP = \frac{SASA_{initial}}{SASA_{final}} \tag{1}$$

$$\log P = \sum_{i=1}^{N} \Delta G_{water-oct, i} \tag{2}$$

## Methodology

Our aim is to survey peptides of chain length 4 to 6, with the intention that this method could be further scaled to peptides of chain length 7 to 8. We assume that aggregation is the precondition to ordered self-assembly and that it can be measured at low computational expense. We use the AP score (Equation 1) from 50 ns simulations to quantify the aggregation of peptides in water.
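
As an illustration of Equation 1, the sketch below computes the AP score from two solvent accessible surface area (SASA) values. The function name and the numerical inputs are hypothetical placeholders, not values from our simulations; obtaining the SASA values from a trajectory is outside the scope of this example.

```python
def aggregation_score(sasa_initial: float, sasa_final: float) -> float:
    """AP score (Equation 1): ratio of initial to final SASA."""
    if sasa_final <= 0:
        raise ValueError("sasa_final must be positive")
    return sasa_initial / sasa_final


# Hypothetical SASA values in arbitrary but matching units.
ap = aggregation_score(sasa_initial=500.0, sasa_final=250.0)
print(f"AP = {ap:.2f}")
```

By construction, a score above 1 means that the solvent accessible surface decreased over the simulation, which is the signature of aggregation.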

### Parameters

Our main regression algorithm uses 47 parameters calculated with the Mordred [4] software package. However, these parameters are too expensive, in both time to generate and space to store, to apply to very large datasets. This prompted us to create our own reduced set of parameters that can be generated with only the peptide single-letter code as input. We called this programme Judred; the parameters it produces are listed below.

| Name | Description |
|---|---|
| SP2 | Number of SP2 carbon atoms |
| NH2 | Number of NH2/NH3 groups on the side chain(s) |
| MW | Molecular weight |
| S | Number of sulphur atoms |
| log P | Wimley-White log P (Equation 2) [5] |
| Z | Charge |
| MaxASA | Maximum solvent accessible surface [6] |
| Bulkiness | Sum of amino acid bulkiness [7] |
| OH | Number of OH groups (excluding backbone) |
| RotRatio | Ratio of SP2 to SP3 carbon atoms |
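
To show how parameters of this kind can be derived from the single-letter code alone, the sketch below computes four of them (S, OH, Z and MW) with simple per-residue lookup tables. This is not the Judred programme itself: the function name `simple_descriptors` is hypothetical, the charge assignment assumes neutral pH with histidine treated as neutral and the termini ignored, and the molecular weight uses standard average residue masses plus one water molecule.

```python
# Per-residue lookup tables keyed by single-letter amino acid code.
SULPHUR = {"C": 1, "M": 1}             # cysteine and methionine
HYDROXYL = {"S": 1, "T": 1, "Y": 1}    # side-chain OH groups
# Assumption: neutral pH, histidine neutral, termini ignored.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.015


def simple_descriptors(sequence: str) -> dict:
    """Compute S, OH, Z and MW for a peptide given as single-letter codes."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if not sequence or unknown:
        raise ValueError(f"invalid sequence: {sequence!r}")
    return {
        "S": sum(SULPHUR.get(r, 0) for r in sequence),
        "OH": sum(HYDROXYL.get(r, 0) for r in sequence),
        "Z": sum(CHARGE.get(r, 0) for r in sequence),
        # Linear peptide: sum of residue masses plus one water.
        "MW": sum(RESIDUE_MASS[r] for r in sequence) + WATER_MASS,
    }


print(simple_descriptors("KRGY"))
```

Because every value is a sum over residues, the cost per peptide is linear in chain length and no 3D structure needs to be generated or stored.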

### Initial dataset

The initial training set was constructed by taking a representative sample of the dataset: a set of peptides whose (normalized) values for each Judred parameter fall at specific intervals (Figure 4). These cover the range of values from lowest to highest, with a total of 30 equally spaced values for the parameters SPRatio, NH2, MW, Log P WW, MaxASA, RotRatio and Bulkiness, and 6 values for S and OH in an effort to reduce data imbalance, as these are not applicable to most peptides (total = 252).
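
One way to realise such a representative sample is sketched below. For each parameter, the values are min-max normalized, a set of equally spaced targets between 0 and 1 is generated, and the peptide closest to each target is selected. The function names, the rule of never selecting the same peptide twice, and the toy input are all assumptions made for illustration; they are not a description of the exact procedure used here.

```python
def min_max_normalise(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def representative_sample(table, n_points):
    """Select peptide indices covering each parameter at equal intervals.

    table    -- dict: parameter name -> list of values, one per peptide
    n_points -- dict: parameter name -> number of equally spaced targets
    """
    selected = set()
    for name, values in table.items():
        k = n_points[name]
        if k < 2:
            raise ValueError("need at least two targets per parameter")
        norm = min_max_normalise(values)
        targets = [i / (k - 1) for i in range(k)]
        for target in targets:
            # Assumption: a peptide is used at most once in the sample.
            candidates = [i for i in range(len(norm)) if i not in selected]
            if not candidates:
                break
            best = min(candidates, key=lambda i: abs(norm[i] - target))
            selected.add(best)
    return sorted(selected)


# Toy example: five peptides described by a single parameter.
toy = {"MW": [100.0, 200.0, 300.0, 400.0, 500.0]}
print(representative_sample(toy, {"MW": 3}))
```

Sampling at fixed intervals in parameter space, rather than uniformly over peptides, gives rare values (for example peptides containing sulphur) a guaranteed place in the training set.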

### General model performance

We found that our sampling method produced more accurate predictions than random sampling for both Judred and Mordred predictions.

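Judred (random forest)

| Peptides | Training data | | RMSE |
|---|---|---|---|
| Tetra | Representative | 0.79 | 0.156 |
| | Random | 0.70 | 0.185 |
| Penta | Representative | 0.70 | 0.138 |
| | Random | 0.68 | 0.141 |
| Hexa | Representative | 0.77 | 0.098 |
| | Random | 0.58 | 0.131 |

Mordred (linear SVR)

| Peptides | Training data | | RMSE |
|---|---|---|---|
| Tetra | Representative | 0.85 | 0.134 |
| | Random | 0.82 | 0.146 |
| Penta | Representative | 0.77 | 0.119 |
| | Random | 0.75 | 0.126 |
| Hexa | Representative | 0.72 | 0.107 |
| | Random | 0.74 | 0.104 |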
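
For reference, the root-mean-square error (RMSE) reported above follows the standard definition. The sketch below is a minimal implementation; the function name and the example numbers are placeholders for illustration and are not taken from our models.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder predicted and simulated AP scores.
print(rmse([1.9, 1.2, 2.4], [2.0, 1.0, 2.5]))
```

A lower RMSE means that the predicted scores lie closer to the simulated ones.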

### Results

The dataset may be further pre-processed by removing peptides outside defined ranges of Judred parameters. For example, prior to training machine learning models we removed insoluble peptides, keeping only those with log P < 0, so that our search was restricted to water-soluble aggregating peptides. This process can be performed for any Judred parameter or combination of parameters: for example, we may want to search only among peptides with a positive charge, or with a charge of -2 only, or among peptides that have a charge of +2 and a molecular weight of < 100 amu.
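
Such restrictions amount to filtering a table of parameters with one or more conditions. The sketch below shows the idea on a handful of placeholder records; the keys `logP` and `Z`, the function name `filter_peptides`, and all numerical values are hypothetical and chosen only to illustrate the mechanism.

```python
# Placeholder records: identifiers and values are made up for illustration.
peptides = [
    {"id": "pep1", "logP": -1.5, "Z": 0},
    {"id": "pep2", "logP": 0.8, "Z": 2},
    {"id": "pep3", "logP": -4.2, "Z": 2},
    {"id": "pep4", "logP": -0.3, "Z": -2},
]


def filter_peptides(records, *conditions):
    """Keep only the records that satisfy every condition."""
    return [r for r in records if all(cond(r) for cond in conditions)]


# Restrict the search to peptides with log P < 0.
soluble = filter_peptides(peptides, lambda r: r["logP"] < 0)

# Combine conditions: log P < 0 and a charge of +2.
soluble_cationic = filter_peptides(
    peptides,
    lambda r: r["logP"] < 0,
    lambda r: r["Z"] == 2,
)

print([r["id"] for r in soluble])
print([r["id"] for r in soluble_cationic])
```

Expressing each restriction as an independent condition makes it straightforward to combine parameters without changing the filtering code.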

We apply this to test different datasets for soluble peptides, using the following restrictions:

- Tetrapeptides log P < 0
- Pentapeptides log P < 0
- Pentapeptides log P < -4
- Pentapeptides log P < 0 & no aromatic residues
- Hexapeptides unrestricted
- Hexapeptides log P < 0
- Hexapeptides log P < -6

## References

[1] Deng, L.; Wang, Y. Multiscale Computational Prediction of β-Sheet Peptide Self-Assembly Morphology. Mol. Simul. 2020, 0 (0), 1–11.
[2] P. Frederix, R. V. Ulijn, N. T. Hunt and T. Tuttle, J. Phys. Chem. Lett., 2011, 2, 2380–2384.
[3] P. Frederix, G. G. Scott, Y. M. Abul-Haija, D. Kalafatovic, C. G. Pappas, N. Javid, N. T. Hunt, R. V. Ulijn and T. Tuttle, Nat. Chem., 2015, 7, 30–37.
[4] H. Moriwaki, Y. S. Tian, N. Kawashita and T. Takagi, J. Cheminform., 2018, 10, 1–14.
[5] White, S. H.; Wimley, W. C. Hydrophobic Interactions of Peptides with Membrane Interfaces. Biochim. Biophys. Acta - Rev. Biomembr. 1998, 1376 (3), 339–352.
[6] Tien, M. Z.; Meyer, A. G.; Sydykova, D. K.; Spielman, S. J.; Wilke, C. O. Maximum Allowed Solvent Accessibilites of Residues in Proteins. PLoS One 2013, 8 (11), 1–16.
[7] Zimmerman, J. M.; Eliezer, N.; Simha, R. The Characterization of Amino Acid Sequences in Proteins by Statistical Methods. J. Theor. Biol. 1968, 21 (2), 170–201.
[8] A. Stukowski, Model. Simul. Mater. Sci. Eng., 2010, 18, 015012.