innov’SAR Platform

PEACCEL has developed the “innov’SAR platform”, which comprises four modules:

  1. innov’SAR core: For the optimization of polypeptides (peptides, proteins, enzymes, antibodies, VHHs) and metabolic or signaling pathways.
  2. RAS module: For evaluating drug combinations (fixed-dose combinations, FDCs) in various diseases.
  3. automlSAR: Tests more than 135 algorithms in parallel, is agnostic with respect to the data set, and can be combined with the innov’SAR core and RAS modules.
  4. GraphMut: A graph-based tool for visualizing the relations between the mutations of protein sequence variants and variations in activity.

 

Innov’SAR core

Innov’SAR core is the main tool of the innov’SAR platform. Developed by PEACCEL and written in the R language, it performs statistical modeling of protein sequence-activity relationships.

The application is coded with the shiny package.

Our proprietary Sequence-Activity Relationship methodology, called innov’SAR core, identifies high-fitness mutants from smart mutant libraries. It relies on the physico-chemical properties of the amino acids, digital signal processing, and regression techniques.

The novelty of innov’SAR core is that it applies the Fast Fourier Transform (FFT) to the numerically encoded protein sequences of a library of protein/enzyme variants with known activities, converting them into a set of protein spectra.

 

To sum up the basic characteristics of the procedure: only an initial dataset containing the primary sequences of enzyme variants and their respective biological properties is required. The approach differs from other ML approaches in the following ways:

  i. Thanks to the Fourier transform, the non-linear aspects inside the protein sequence are captured.
  ii. The FFT allows new mutations to be introduced, including at positions not previously explored.
  iii. A single round allows the identification of high-performing mutants.
  iv. It avoids the need for the excessively large datasets customary in other ML or deep learning approaches.
  v. It needs neither alignment-based amino acid descriptors nor protein sequences of equal length.
  vi. It requires neither large computational resources nor long computational times.

Applying an FFT to a digitally encoded protein sequence is not the same as simply encoding it in another way. This mathematical treatment takes into account the order of the residues in the protein sequence and all the interactions between positions within it, and thus makes it possible to better identify epistatic phenomena.
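The encoding step described above can be sketched as follows. This is a minimal illustration in Python, not PEACCEL's R implementation. It assumes the Kyte-Doolittle hydropathy scale as the physico-chemical index (the text does not say which indices are used), replaces the FFT with a naive discrete Fourier transform that returns the same values, and introduces a hypothetical `n_points` parameter for zero-padding.

```python
import cmath

# Kyte-Doolittle hydropathy values: one example physico-chemical scale.
# Assumption: the source does not name the indices innov'SAR core uses.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Map each residue to its numerical property value."""
    return [scale[residue] for residue in sequence]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (an FFT yields the same values)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def protein_spectrum(sequence, n_points=None):
    """Return the amplitude spectrum of a numerically encoded sequence.

    n_points (hypothetical parameter) zero-pads the signal so that
    sequences of different lengths yield spectra of the same length.
    """
    signal = encode(sequence)
    if n_points is not None:
        signal = signal + [0.0] * (n_points - len(signal))
    coefficients = dft(signal)
    # The input is real, so the second half of the spectrum mirrors the first.
    return [abs(c) for c in coefficients[: len(coefficients) // 2 + 1]]


# Toy sequences, invented for illustration.
wild_type = protein_spectrum("MKTAYIAK", n_points=16)
mutant = protein_spectrum("MKTGYIAK", n_points=16)  # single substitution A4G
```

Because a substitution is a localized change in the encoded signal, it modifies the Fourier coefficients at every frequency rather than a single feature. The resulting spectra, paired with the measured activities, would then serve as inputs to a regression model.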

The innov’SAR core approach is interpolative and extrapolative, and it predicts outside the box, a combination not found in other state-of-the-art (SOTA) Machine Learning (ML) or Deep Learning (DL) approaches. A comparison of innov’SAR core with 14 other methods shows that it outperforms these SOTA ML and DL methods in terms of hit rate (81%).

Figure S2: https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cbic.202000612
Relationship between the hit rate normalized to the log10 number of functionally characterized mutants used for training and the size of the search region explored: comparison of 15 studies. Turquoise blue square: the CNN model proposed by Xu et al. (2020), assuming the maximum non-normalized hit rate of 1 because the paper does not give enough data to compute the exact hit rate. Purple diamond: the hit rate reported for the attention-based neural network model proposed by Wu et al. (2020) is used for comparison.

 

AutomlSAR

AutomlSAR is a tool developed by PEACCEL and written in the R language for the statistical modeling of protein sequence-activity relationships.

It models protein sequence-activity relationships with deterministic regression algorithms (PLSR, principal component regression, Lasso regression, Ridge regression…), stochastic regression algorithms (ANOVA, i.e. analysis of variance, models; linear, generalized linear, and nonlinear mixed models…), or black-box algorithms such as models based on artificial neural networks. Alternatively, a regression model can be built for each index directly, and the best model for each index selected to form an ensemble of models. The ensemble can then be used, for example, to average the predictions that these models make for the hold-out sequences, to feed the predicted values of each model into a new model that makes new predictions, or more generally to apply other ensemble-modeling approaches such as stacking, bagging, or boosting. Whatever the approach chosen, the difficulty remains the final choice of one model among all the models tested. This choice will certainly be based on performance metrics, but when these indicators are very close, the choice becomes difficult; the experience and expertise of the researcher then make it possible to select the final model.
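The simplest ensemble strategy mentioned above, averaging the hold-out predictions of several models and ranking the models by a performance metric, can be sketched as follows. This is an illustration in Python rather than the tool's R code; the prediction values are invented, the model names are only labels, and RMSE is an assumed choice of metric.

```python
from math import sqrt
from statistics import mean


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted activities."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def rank_models(y_true, predictions):
    """Return model names sorted from lowest to highest hold-out RMSE."""
    return sorted(predictions, key=lambda name: rmse(y_true, predictions[name]))


def mean_ensemble(predictions):
    """Average the predictions of all models, one value per hold-out sequence."""
    return [mean(column) for column in zip(*predictions.values())]


# Invented hold-out activities and per-model predictions (illustration only).
measured = [1.0, 2.0, 3.0]
predictions = {
    "pls": [1.1, 2.1, 2.9],
    "rf": [0.8, 1.8, 3.2],
    "knn": [2.0, 2.0, 2.0],
}

ranking = rank_models(measured, predictions)
ensemble = mean_ensemble(predictions)
ensemble_error = rmse(measured, ensemble)
```

Stacking would replace the plain average with a second model trained on the per-model predictions. When the errors of the ranked models are nearly identical, the ranking alone does not settle the choice, which is where the researcher's judgment comes in.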

In the online demo version only 5 algorithms (pls, svmLinear, svmPoly, rf, knn) are available.

automlSAR tests more than 130 algorithms in parallel, is agnostic with respect to the data set, and can be combined with the innov’SAR core and RAS modules.

AutomlFLS, i.e. Automl for Life Sciences, can be applied to diverse kinds of biological data, from metabolic flux pathways to infrared spectra.

 

GraphMut

GraphMut generates a graph based on the mutation relations between the protein sequences of a dataset.

This tool lets you see at a glance the complexity of the relationships between the different sequences. This approach provides information on the mutations to be avoided and those to be favoured.
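One way to picture such a graph is sketched below in Python. It is not GraphMut's implementation. It assumes that variants are nodes, that an edge links two variants differing by exactly one substitution, and that each edge carries the mutation and the activity change. The sequences, the activities, and the equal-length requirement are simplifications made for the example.

```python
from itertools import combinations


def mutations_between(seq_a, seq_b):
    """List substitutions from seq_a to seq_b, e.g. 'A4G' (1-based position)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        f"{a}{position}{b}"
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


def mutation_graph(variants):
    """Build edges (source, target, mutation, activity change).

    variants maps a name to a (sequence, activity) pair. Two variants are
    linked when they differ by exactly one substitution (an assumption).
    """
    edges = []
    for (name_a, (seq_a, act_a)), (name_b, (seq_b, act_b)) in combinations(
        variants.items(), 2
    ):
        differences = mutations_between(seq_a, seq_b)
        if len(differences) == 1:
            edges.append((name_a, name_b, differences[0], act_b - act_a))
    return edges


# Invented toy data set (illustration only).
variants = {
    "WT": ("MKTA", 1.0),
    "V1": ("MKTG", 1.5),
    "V2": ("MRTG", 0.7),
}
edges = mutation_graph(variants)
```

In this toy data set the edge from WT to V1 carries A4G with a gain in activity, while the edge from V1 to V2 carries K2R with a loss. Reading such edges side by side is what suggests which mutations to favour and which to avoid.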

 

In case you are interested in accessing our online demo platform, please send us an email at: contact@peaccel.com