Data Mining. 1.

Glycine Content Estimation from Activity Coefficients Measurement

 

 

Lorentz JÄNTSCHIa, Mircea V. DIUDEAb

 

a Technical University of Cluj-Napoca, RO, http://lori.academicdirect.ro

b "Babeş-Bolyai" University of Cluj-Napoca, RO, http://j.academicdirect.ro/~diudea

 

 

                        Abstract

    The paper presents a method to estimate the glycine content from salt aqueous solutions based on standard adding of salt repeated activity coefficients measurement. An original program for data mining is presented. The program execution shows that the glycine content can be predicted from mean ionic activity coefficients in the presence of glycine at room chamber temperature.

    The original program was build up on PHP technology and can run via HTTP service at the address:

http://academicdirect.ro/virtual_library/molecular_topology/data_mining/

 

                        Keywords

Data mining, PHP programming, QSPR/QSAR studies

 

           

 

                        Introduction

 

            The software market provides a varied offer for data mining and analysis. Beginning with office software like Microsoft Excel [1] and ending with professional data mining software like StatSoft Statistica [2] a large list of choices and modalities to mining data are available. Even if the software pool is sometimes endless, it does not offer a complete answer on data dependency. In a previous paper [3], it was discussed a program capable to consider all possible dependencies in a data set and to generate a complete list of data predictors. The program was adapted now to store results into a database; by querying the database is now possible to obtain the best predictor of selected variable.

Note that the actual program does not replace the old one in tasks. First, located at the address

http://vl.academicdirect.ro/applied_statistics/linear_regression/multiple/v1.5/

it makes all recursively combinations to find dependencies. The present one considers only a user selection subset of the data set, the results being used for prediction, validation and comparison.

 

 

                        Program Interface and Architecture

 

            The program has two interfaces: the first one (figure 1) is for query statistics and allows user to enounce a SQL query for `statistics` table interrogation.

 

Fig. 1. Query statistics menu of data_mining program

            First, the user selects his fields and the order (by clauses) of subtotal statistics, by pressing the Select button; the query is thus made.

            The second menu interface is available from Admin link and leads to (figure 2):

 

Fig. 2. Operations menu of the data mining program

 

            Note that the access to the operations menu is password restricted, by data security reasons. First step of data analysis is data submitting to server, via append/create option (figure 3):

 

Fig. 3. Data uploading (data_mining program)

            The data must be in table format, with columns, rows and captions included. The submitted data are stored into a MySQL table with the same names as gives by the user.

After uploading, the table will appear in table’s lists. The user can browse or drop the table and, most important, can make statistics on it.

First step in statistics making is to select a subset of data (figure 4).

 

Fig. 4. Data selecting (data_mining program)

 

            Based on user selection, the program displays selected values (figure 5) and allows user to select the dependent and independent variables (X for independent, Y for dependent ones). Further, Rnd check box enables, for the selected variable(s), a test of significance or validation, by permuting values into a column.

Fig. 5. Data assigning (data_mining program)

 

In the same window (figure 6), after the data assignment, the user must select the required threshold for the correlation coefficient r. If the user wants to convert the data to a Banach Space [4], he submits them to the partial least squares coefficients determination procedure [5].

 

Fig. 6. Limits and conversion features (data_mining program)

 

            All statistics results are written into a `statistics` table with the structure given below.

 

c_cols

date

table

n_rows

n_cols

r_value

c_rows

k_rows

k_cols

equation

rownames

banach

id

 

In the above, `c_cols` column stores the name of permuted values column (if any) or "Y" value otherwise and allows to select a specific subset of correlation data, according to the permuted column (Rnd option). The `date` column stores the date in format YY:MM:DD:HH:mm:SS (Y-year, M-month, D-day, H-hour, m-minute, s-second, e.g., 03:10:01:18:13:49). The `table` column stores the name of original table, from which a subset was extracted. The `n_rows` and `n_cols` store the number of data rows and columns (excepting the subtotal records, for which the fields is empty). The `r_value` column stores the correlation coefficient founds. The `c_rows` contain the name of starting row in rows permutation (if any), empty (for subtotals) or null (for non-permuted data). The `k_rows` contain an ordered list of all considered rows, stored by their record numbers. The `k_cols` is the place for selected columns. The `equation` field contains the regression equation as a string (in our case, a value is given in eq. 1. The `rownames` field stores an ordered list of names for selected rows. Finally, the `banach` field can store only two values (0 or 1), depending on user choice, about the converting data to a Banach Space.

 

 

                        Results and discussion

 

            Using the data_mining program it is easy to discover the dependency between measured or calculated parameters. A set of experimental data [6] (Table 1) was considered for testing the program. The collected data represent the ratio of the mean ionic activity coefficients of NaBr, in the presence of glycine, at different NaBr and glycine molalities, at T = 298.15 K. The NaBrX (X=1, 3, 5, 7, 10) columns contain the activity coeffcients of NaBr solved in ratios of 0.1, 0.3, 0.5, 0.7 and 1.0 mol∙kg-1. For glycine, only molality (in I.S. units) is provided.

 

Table 1. ratios of the mean ionic activity coefficients of NaBr in the presence of glycine [6]

miac

Glycine

NaBr1

NaBr3

NaBr5

NaBr7

NaBr10

 

miac

Glycine

NaBr1

NaBr3

NaBr5

NaBr7

NaBr10

m1

0.1

0.992

0.96

0.98

0.97

0.967

 

m10

1

0.897

0.887

0.916

0.912

0.918

m2

0.2

0.981

0.951

0.972

0.963

0.961

 

m11

1.2

0.88

0.874

0.904

0.901

0.908

m3

0.3

0.97

0.941

0.963

0.955

0.956

 

m12

1.4

0.866

0.862

0.892

0.892

0.899

m4

0.4

0.958

0.933

0.957

0.949

0.95

 

m13

1.6

0.852

0.851

0.881

0.883

0.89

m5

0.5

0.947

0.925

0.949

0.942

0.944

 

m14

1.8

0.838

0.841

0.873

0.873

0.881

m6

0.6

0.937

0.916

0.942

0.936

0.94

 

m15

2

0.826

0.831

0.864

0.865

0.873

m7

0.7

0.927

0.909

0.936

0.93

0.933

 

m16

2.2

0.814

0.822

0.856

0.856

0.866

m8

0.8

0.916

0.902

0.929

0.924

0.928

 

m17

2.4

0.805

0.813

0.848

0.849

0.859

m9

0.9

0.906

0.894

0.923

0.918

0.923

 

 

 

 

 

 

 

 

 

            The query of `statistics` table on all `miac` selected values shows that the glycine concentration is strongly dependent on NaBr ionic activity (if both exists in solution). The equation that proves this truth is:

YGlycine=21.448XNaBr1-0.232XNaBr3-22.943XNaBr5-27.755XNaBr7+7.884XNaBr10+20.808*1

             m = 17; r = 0.998                                                                                            (1)

 

            The program allows the calculation of monovariate regression (i.e., a single independent variable, e.g., NaBr1):

 

YGlycine=-11.977*XNaBr1+11.853*1

            m = 17; r = 0.992                                                                                             (2)

 

            Supposing that we want to validate the assumption of NaBr1 dependency, all that we have to do is to include a validation test on NaBr1 variable in our program. This task can be done on two ways (figure 7).

 

                         or                    

Fig. 7. Validation test running (Rnd checkbox selection)

 

            The execution of the program in both cases will produce a set of correlations (figure 8, see also figure 7 for correspondences):

 

YGlycine=-11.9774*XNaBr1+11.8528*1
r=0.9921

YGlycine=-7.4288*XNaBr1+7.7558*1
r=0.6153 XNaBr1->m2

YGlycine=-3.5341*XNaBr1+4.2479*1
r=0.2927 XNaBr1->m3

YGlycine=-0.2988*XNaBr1+1.3338*1
r=0.0247 XNaBr1->m4

YGlycine=2.2304*XNaBr1-0.9442*1
r=0.1847 XNaBr1->m5

YGlycine=4.0916*XNaBr1-2.6206*1
r=0.3389 XNaBr1->m6

YGlycine=5.3249*XNaBr1-3.7314*1
r=0.441 XNaBr1->m7

YGlycine=5.9267*XNaBr1-4.2735*1
r=0.4909 XNaBr1->m8

YGlycine=5.8571*XNaBr1-4.2108*1
r=0.4851 XNaBr1->m9

 

YGlycine=-11.9774*XNaBr1+11.8528*1
r=0.9921

YGlycine=-8.5594*XNaBr1+8.7742*1
r=0.709 YGlycine->m2

YGlycine=-5.3571*XNaBr1+5.8899*1
r=0.4437 YGlycine->m3

YGlycine=-2.4992*XNaBr1+3.3157*1
r=0.207 YGlycine->m4

YGlycine=0.0126*XNaBr1+1.0534*1
r=0.001 YGlycine->m5

YGlycine=2.0877*XNaBr1-0.8157*1
r=0.1729 YGlycine->m6

YGlycine=3.7193*XNaBr1-2.2853*1
r=0.308 YGlycine->m7

YGlycine=4.9022*XNaBr1-3.3507*1
r=0.406 YGlycine->m8

YGlycine=5.5023*XNaBr1-3.8912*1
r=0.4557 YGlycine->m9

 

Fig. 8. Output of validation tests


 

YGlycine=5.5023*XNaBr1-3.8912*1
r=0.4557 XNaBr1->m10

YGlycine=4.9022*XNaBr1-3.3507*1
r=0.406 XNaBr1->m11

YGlycine=3.7193*XNaBr1-2.2853*1
r=0.308 XNaBr1->m12

YGlycine=2.0877*XNaBr1-0.8157*1
r=0.1729 XNaBr1->m13

YGlycine=0.0126*XNaBr1+1.0534*1
r=0.001 XNaBr1->m14

YGlycine=-2.4992*XNaBr1+3.3157*1
r=0.207 XNaBr1->m15

YGlycine=-5.3571*XNaBr1+5.8899*1
r=0.4437 XNaBr1->m16

YGlycine=-8.5594*XNaBr1+8.7742*1
r=0.709 XNaBr1->m17

YGlycine=-11.9774*XNaBr1+11.8528*1
r=0.9921 XNaBr1->m1

 

YGlycine=5.8571*XNaBr1-4.2108*1
r=0.4851 YGlycine->m10

YGlycine=5.9267*XNaBr1-4.2735*1
r=0.4909 YGlycine->m11

YGlycine=5.3249*XNaBr1-3.7314*1
r=0.441 YGlycine->m12

YGlycine=4.0916*XNaBr1-2.6206*1
r=0.3389 YGlycine->m13

YGlycine=2.2304*XNaBr1-0.9442*1
r=0.1847 YGlycine->m14

YGlycine=-0.2988*XNaBr1+1.3338*1
r=0.0247 YGlycine->m15

YGlycine=-3.5341*XNaBr1+4.2479*1
r=0.2927 YGlycine->m16

YGlycine=-7.4288*XNaBr1+7.7558*1
r=0.6153 YGlycine->m17

YGlycine=-11.9774*XNaBr1+11.8528*1
r=0.9921 YGlycine->m1

 

Fig. 8. Output of validation tests (continuing)

 

            After the regression validation procedure running, the program put automatically the regression results into the database. A query can be applied now to the database (see also fig 1):

 

Fig. 9. Query statistics for validation analysis

            The query from figure 9 will produce a full analysis of database contents, detailed in figure 10 (a, b and c):

 

c_cols

date

r_value

c_rows

k_cols

id

XNaBr1

03:12:02:09:12:08

0.001

m14

Glycine.NaBr1

79

XNaBr1

03:12:02:09:12:08

0.0247

m4

Glycine.NaBr1

69

XNaBr1

03:12:02:09:12:08

0.1729

m13

Glycine.NaBr1

78

XNaBr1

03:12:02:09:12:08

0.1847

m5

Glycine.NaBr1

70

XNaBr1

03:12:02:09:12:08

0.207

m15

Glycine.NaBr1

80

XNaBr1

03:12:02:09:12:08

0.2927

m3

Glycine.NaBr1

68

XNaBr1

03:12:02:09:12:08

0.308

m12

Glycine.NaBr1

77

XNaBr1

03:12:02:09:12:08

0.3389

m6

Glycine.NaBr1

71

XNaBr1

03:12:02:09:12:08

0.406

m11

Glycine.NaBr1

76

XNaBr1

03:12:02:09:12:08

0.441

m7

Glycine.NaBr1

72

XNaBr1

03:12:02:09:12:08

0.4437

m16

Glycine.NaBr1

81

XNaBr1

03:12:02:09:12:08

0.4557

m10

Glycine.NaBr1

75

XNaBr1

03:12:02:09:12:08

0.4851

m9

Glycine.NaBr1

74

XNaBr1

03:12:02:09:12:08

0.4909

m8

Glycine.NaBr1

73

XNaBr1

03:12:02:09:12:08

0.6153

m2

Glycine.NaBr1

67

XNaBr1

03:12:02:09:12:08

0.709

m17

Glycine.NaBr1

82

XNaBr1

03:12:02:09:12:08

0.9921

m1

Glycine.NaBr1

83

XNaBr1

-

0.3864

-

-

T(17)

 

Fig. 10. (a) Validation statistics for X randomizing (see fig. 7)

 

Y

03:12:02:09:10:51

0.9921

0

Glycine.NaBr1

48

Y

03:12:02:09:12:08

0.9921

0

Glycine.NaBr1

66

Y

-

0.9921

-

-

T(2)

 

Fig. 10. (b) Validation statistics without randomizing (continuing from figure 10a)


YGlycine

03:12:02:09:10:51

0.001

m5

Glycine.NaBr1

52

YGlycine

03:12:02:09:10:51

0.0247

m15

Glycine.NaBr1

62

YGlycine

03:12:02:09:10:51

0.1729

m6

Glycine.NaBr1

53

YGlycine

03:12:02:09:10:51

0.1847

m14

Glycine.NaBr1

61

YGlycine

03:12:02:09:10:51

0.207

m4

Glycine.NaBr1

51

YGlycine

03:12:02:09:10:51

0.2927

m16

Glycine.NaBr1

63

YGlycine

03:12:02:09:10:51

0.308

m7

Glycine.NaBr1

54

YGlycine

03:12:02:09:10:51

0.3389

m13

Glycine.NaBr1

60

YGlycine

03:12:02:09:10:51

0.406

m8

Glycine.NaBr1

55

YGlycine

03:12:02:09:10:51

0.441

m12

Glycine.NaBr1

59

YGlycine

03:12:02:09:10:51

0.4437

m3

Glycine.NaBr1

50

YGlycine

03:12:02:09:10:51

0.4557

m9

Glycine.NaBr1

56

YGlycine

03:12:02:09:10:51

0.4851

m10

Glycine.NaBr1

57

YGlycine

03:12:02:09:10:51

0.4909

m11

Glycine.NaBr1

58

YGlycine

03:12:02:09:10:51

0.6153

m17

Glycine.NaBr1

64

YGlycine

03:12:02:09:10:51

0.709

m2

Glycine.NaBr1

49

YGlycine

03:12:02:09:10:51

0.9921

m1

Glycine.NaBr1

65

YGlycine

-

0.3864

-

-

T(17)

 

Fig. 10 (c) Validation statistics for Y randomizing (see fig. 7, continuing from figure 10b)

 

            Starting from ionic activity determinations in an unknown glycine concentration solution, a simple regression analysis provides a QSPR (quantitative structure- property relationship) model (eqs 1 or 2). the glycine concentration was determined with two decimals precision in NaBr+Glycine solutions.

The linear dependency between ionic activity and glycine concentration, given by eq 2, can be extended to solutions with unknown concentration of a significant component (say, aminoacid). The experimental part for ionic determination, according to [6], was performed on a Jenway ion analyzer, Model 3045.

                        Conclusions

 

            The program allows an efficient management of data submitted to a correlation study, with re-viewing and differently selecting subsets.

            The multiple subsets selection from inputted data let us to establish more than one QSPR/QSAR relations between data sets, depending on theoretical and/or experimental requests. The friendly interface of data analysis shorts significant the analysis time.

The results storage and data conversion features show the versatility and efficiency of the proposed data mining procedure.

 

 

                        References

 

The program:

http://vl.academicdirect.ro/molecular_topology/data_mining/



[1]. http://www.microsoft.com

[2]. www.statsoftinc.com

[3]. JÄNTSCHI Lorentz, Free Software Development. 1. Fitting Statistical Regressions, Leonardo Journal of Sciences, 1, 31-52, 2002.

[4]. http://www.math.okstate.edu/~alspach/banach/

[5]. http://www.bioss.ac.uk/smart/unix/mplsgxe/slides/frames.htm

[6]. Khavaninzadeh A., Modarress H., Taghikhani V., Khoshkbarchi M. K., Measurement of activity coeffcients of amino acids in aqueous electrolyte solutions: experimental data for the systems (H2O+ NaBr + glycine) and (H2O+ NaBr + L-valine) at T = 298.15 K, Journal of Chemical Thermodynamics, 35, 1553–1565, 2003.