Real Value Prediction of Solvent Accessibility using Neural Network

Ref: Shandar Ahmad, M. Michael Gromiha and Akinori Sarai,
        Proteins 50 (2003) 629-635

We have developed a neural network model to predict the real value of
solvent accessibility from sequence information.
Detailed results of predictions for training, test and validation data sets
have been provided through links on this page. For online predictions using this method, please visit

Three lists of proteins have been created from each set of proteins provided by
the references given below. Rotating the training, test and validation roles among
these lists gives six data sets for each of these referenced groups of chains.
They have been labelled set1, set2, set3, ... set6. Each of these data sets has
two directories called "data" and "preds". The "data" directory contains all the input
information used for training and prediction. In each "data" directory, there are files
called "train.dat", "test.dat" and "val.dat". "train.dat" has the list of proteins
used for training the network, "test.dat" has been used for determining the
stopping point for training, and the "val.dat" proteins have been kept out of
the training process for cross-validation after training.

It may be noted that
the data files for the Manesh-215 set have two digits after the decimal point,
whereas the other data sets have only integer ASA values
in the data directories. This is because DSSP has been used
for calculating the ASA values of all the other data sets, and DSSP returns an integer value
of ASA in A^2. For the Manesh-215 data sets, ASA values were calculated using
another standard program called ASC (see ref. below), and ASC returns
ASA values up to the second decimal place (in A^2). The effect of these
decimal places is, however, insignificant compared to the variation in
prediction accuracy values.

The "preds" directory contains results
of prediction for the corresponding set for all training, test and validation
proteins. All these prediction files have four columns: the first column is the
residue name, the second column is the desired value, the third column is the
predicted value and the last column is the absolute error in prediction.
Values have been normalised to unity. To get the percent relative accessibility,
multiply these values by 100. To obtain the total solvent accessibility
in A^2, multiply the normalised value by the ASA of the extended state of Ala-X-Ala
for residue type X, as described in the following reference:
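
As an illustration of the four-column prediction format and the two unit
conversions described above, the following Python sketch may be useful. It is
not part of the distributed data or of the original method. It assumes that the
columns are separated by whitespace and that the files have no header line.
The names parse_predictions, to_percent and to_absolute are hypothetical, and
the extended-state ASA values must be supplied by the user from the reference;
the number used in the usage note is a placeholder, not a published value.

```python
from typing import Dict, List, NamedTuple


class Prediction(NamedTuple):
    residue: str      # column 1: residue name
    desired: float    # column 2: desired (observed) value, normalised to unity
    predicted: float  # column 3: predicted value, normalised to unity
    abs_error: float  # column 4: absolute error in prediction


def parse_predictions(text: str) -> List[Prediction]:
    """Parse prediction lines, assuming four whitespace-separated columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        rows.append(Prediction(fields[0], *(float(v) for v in fields[1:])))
    return rows


def to_percent(relative: float) -> float:
    """Normalised accessibility -> percent relative accessibility."""
    return relative * 100.0


def to_absolute(relative: float, residue: str,
                extended_asa: Dict[str, float]) -> float:
    """Normalised accessibility -> ASA in A^2.

    extended_asa maps a residue type X to the ASA of the extended
    Ala-X-Ala state; it is supplied by the caller, not defined here.
    """
    return relative * extended_asa[residue]
```

For example, a line such as "G 0.50 0.40 0.10" would give a predicted percent
relative accessibility of to_percent(0.40), and, with a placeholder table
{"G": 80.0}, to_absolute(0.50, "G", {"G": 80.0}) would give the desired value
in A^2.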

The data containing the above information can be downloaded as a single tar file now.
Please click here to do so.
For comments and suggestions, please contact: Shandar Ahmad