I have an SDF file with thousands of molecules and several text files of IDs grouped by certain characteristics. Right now, I have a script that loads a CSV database with the features of the molecules and generates the ID text files by classifying the molecules on these features. I want to use these text files to filter the SDF file and write new SDF files that contain only the corresponding molecules. I also want to do this in MATLAB.
For example, here are some molecules in the original SDF file:
NCGC00178831-03
Marvin 07111412562D
34 37 0 0 0 0 999 V2000
4.8814 -2.7443 0.0000 Cl 0 5 0 0 0 0 0 0 0 0 0 0
2.8647 -2.4751 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8647 -1.6501 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
3.5808 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2970 -1.6501 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0017 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7179 -1.6501 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.0017 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2970 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5808 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8647 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1485 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1485 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4324 -1.6501 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7162 -1.2318 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.6501 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.7162 -0.4068 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4324 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8761 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.5923 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3084 -3.5407 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0132 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7293 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.0132 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3084 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5923 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8761 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1599 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1599 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4438 -3.5407 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7276 -3.9590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0115 -3.5407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.7276 -4.7840 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4438 -5.1908 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
3 13 1 0 0 0 0
4 5 1 0 0 0 0
4 10 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
6 8 1 0 0 0 0
8 9 2 0 0 0 0
9 10 1 0 0 0 0
10 11 2 0 0 0 0
11 12 1 0 0 0 0
12 13 2 0 0 0 0
12 18 1 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 16 1 0 0 0 0
15 17 1 0 0 0 0
17 18 2 0 0 0 0
19 20 2 0 0 0 0
19 29 1 0 0 0 0
20 21 1 0 0 0 0
20 26 1 0 0 0 0
21 22 2 0 0 0 0
22 23 1 0 0 0 0
22 24 1 0 0 0 0
24 25 2 0 0 0 0
25 26 1 0 0 0 0
26 27 2 0 0 0 0
27 28 1 0 0 0 0
28 29 2 0 0 0 0
28 34 1 0 0 0 0
29 30 1 0 0 0 0
30 31 2 0 0 0 0
31 32 1 0 0 0 0
31 33 1 0 0 0 0
33 34 2 0 0 0 0
M CHG 2 1 -1 3 1
M END
> <Formula>
C27H25ClN6
> <FW>
468.9806 (35.4535+224.2805+209.2465)
> <DSSTox_CID>
25848
> <SR-HSE>
0
$$$$
NCGC00166114-03
Marvin 07111412562D
31 32 0 0 0 0 999 V2000
4.9884 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -3.7038 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -4.1178 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -3.7038 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4157 -4.1178 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
5.7021 -2.8760 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.9884 -4.9385 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -5.3524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -4.9385 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -4.1178 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5612 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2748 -0.8279 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.8403 -0.8279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1267 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1267 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8403 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4202 -2.4764 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
1.4202 -0.8279 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
2.8403 0.0000 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -2.4764 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4229 -2.0696 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4229 -1.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 -0.8279 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7021 0.0000 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
7.1366 -0.8279 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.1366 -2.4764 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0
7.0866 -4.1963 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
0.0000 -0.7708 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 15 1 0 0 0 0
1 26 2 0 0 0 0
2 3 2 0 0 0 0
2 23 1 0 0 0 0
3 4 1 0 0 0 0
3 13 1 0 0 0 0
4 5 2 0 0 0 0
4 12 1 0 0 0 0
5 6 1 0 0 0 0
5 9 1 0 0 0 0
6 7 1 0 0 0 0
6 8 2 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 2 0 0 0 0
13 14 2 0 0 0 0
13 19 1 0 0 0 0
14 15 1 0 0 0 0
14 16 1 0 0 0 0
16 17 2 0 0 0 0
16 22 1 0 0 0 0
17 18 1 0 0 0 0
17 21 1 0 0 0 0
18 19 2 0 0 0 0
18 20 1 0 0 0 0
23 24 2 0 0 0 0
24 25 1 0 0 0 0
24 29 1 0 0 0 0
25 26 1 0 0 0 0
25 28 2 0 0 0 0
26 27 1 0 0 0 0
M CHG 4 7 -1 21 -1 30 1 31 1
M END
> <Formula>
C20H6Br4Na2O5
> <FW>
691.8542 (645.8757+22.9892+22.9892)
> <DSSTox_CID>
5234
> <SR-HSE>
0
$$$$
NCGC00263563-01
Marvin 07111412562D
71 76 0 0 1 0 999 V2000
2.1953 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.6803 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
2.9701 -5.4074 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.5858 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
5.1008 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
2.1953 -4.1484 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
11.8157 -5.6335 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.1239 -5.8755 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
11.0893 -5.1008 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
3.6803 -4.1484 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
10.2015 -5.1008 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
12.5905 -5.1653 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
14.9633 -5.8755 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
4.3905 -5.4074 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
5.8755 -5.4074 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9701 -3.6803 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
11.4606 -4.3905 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
13.6558 -5.1653 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
9.5559 -5.5043 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.2476 -5.5043 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1008 -4.1484 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
1.4850 -5.4074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.8157 -2.4858 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9578 -4.9878 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
6.5858 -4.1484 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
12.3483 -4.3905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.8157 -1.6626 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.8755 -3.6803 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
13.3008 -1.6626 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
12.5905 -1.2429 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
13.3008 -2.4858 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
8.8457 -4.9878 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
11.4606 -3.1961 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.1239 -4.5035 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
0.7748 -4.9878 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.4314 -5.2137 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
14.9633 -4.5035 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.9756 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -5.4074 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
7.6673 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1953 -5.7464 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.8764 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.0877 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7748 -4.1484 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.5437 -6.4567 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.6803 -3.3736 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9701 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8755 -2.9055 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
14.0110 -1.2429 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -0.4197 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.4850 -3.6803 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
15.5444 -6.4082 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
10.5566 -4.3905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3905 -6.1177 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5035 -3.7933 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.1838 -4.2776 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
14.0110 -2.9055 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.6558 -3.7449 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
16.1416 -5.2137 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2130 -2.9701 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1953 -2.3729 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
14.7858 -1.6626 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
13.3008 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
11.0893 -5.8755 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
12.5905 -5.9885 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
8.8941 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.6803 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.1008 -5.7464 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
13.6558 -5.9885 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.4681 -6.7634 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
1 3 1 0 0 0 0
1 6 1 0 0 0 0
1 22 1 6 0 0 0
1 42 1 1 0 0 0
2 3 1 0 0 0 0
2 14 1 0 0 0 0
2 68 1 1 0 0 0
2 10 1 0 0 0 0
4 15 1 0 0 0 0
4 20 1 1 0 0 0
4 43 1 0 0 0 0
4 25 1 0 0 0 0
5 14 1 0 0 0 0
5 15 1 0 0 0 0
5 21 1 0 0 0 0
5 69 1 1 0 0 0
6 16 1 0 0 0 0
6 52 1 1 0 0 0
7 9 1 0 0 0 0
7 12 1 0 0 0 0
8 18 1 0 0 0 0
8 13 1 0 0 0 0
9 11 1 0 0 0 0
9 17 1 0 0 0 0
9 65 1 6 0 0 0
10 16 1 0 0 0 0
10 47 1 1 0 0 0
11 19 1 0 0 0 0
11 54 1 6 0 0 0
11 39 1 0 0 0 0
12 18 1 0 0 0 0
12 66 1 1 0 0 0
12 27 1 0 0 0 0
13 46 1 1 0 0 0
13 53 1 6 0 0 0
13 37 1 0 0 0 0
14 55 1 1 0 0 0
16 48 1 6 0 0 0
17 27 1 0 0 0 0
17 34 1 1 0 0 0
18 35 1 0 0 0 0
18 70 1 1 0 0 0
19 33 1 0 0 0 0
20 24 1 0 0 0 0
21 29 1 0 0 0 0
21 56 1 6 0 0 0
22 36 1 0 0 0 0
23 34 1 0 0 0 0
23 26 1 0 0 0 0
23 28 1 0 0 0 0
24 33 1 0 0 0 0
24 57 1 6 0 0 0
24 41 1 0 0 0 0
25 29 1 0 0 0 0
26 32 1 0 0 0 0
28 31 1 0 0 0 0
29 49 1 1 0 0 0
30 31 1 0 0 0 0
30 50 1 1 0 0 0
30 32 1 0 0 0 0
31 51 1 6 0 0 0
32 58 1 6 0 0 0
33 44 1 0 0 0 0
33 67 1 6 0 0 0
35 38 1 0 0 0 0
35 59 1 1 0 0 0
36 40 1 0 0 0 0
36 45 2 0 0 0 0
37 38 1 0 0 0 0
37 60 1 1 0 0 0
39 44 1 0 0 0 0
41 43 1 0 0 0 0
47 61 1 0 0 0 0
48 62 1 0 0 0 0
50 63 1 0 0 0 0
51 64 1 0 0 0 0
M CHG 2 40 -1 71 1
M END
> <Formula>
C47H83NO17
> <FW>
934.1584 (916.1205+18.0379)
> <DSSTox_CID>
28909
> <SR-HSE>
0
$$$$
And here are some IDs from the text file:
NCGC00015959-03
NCGC00168261-01
NCGC00257010-01
NCGC00254654-01
NCGC00254471-01
The generated SDF file should start like this:
NCGC00015959-03
Marvin 07111412562D
25 30 0 0 0 0 999 V2000
3.4098 -1.3130 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
4.8329 -1.3130 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4098 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8329 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5547 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9799 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2718 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2718 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.1248 -3.3548 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9799 -3.7741 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5547 -2.5436 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2765 -1.3130 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7128 -0.0894 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.4881 -2.2755 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.4881 -3.6160 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.8746 -0.7562 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.5378 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.9423 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4098 -3.7741 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.2765 -2.1380 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6948 -0.8937 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 3 1 0 0 0 0
1 7 2 0 0 0 0
1 25 1 0 0 0 0
2 7 1 0 0 0 0
2 6 2 0 0 0 0
2 8 1 0 0 0 0
3 4 2 0 0 0 0
3 5 1 0 0 0 0
4 13 1 0 0 0 0
4 6 1 0 0 0 0
5 9 1 0 0 0 0
5 10 2 0 0 0 0
6 15 1 0 0 0 0
8 16 2 0 0 0 0
8 17 1 0 0 0 0
9 11 2 0 0 0 0
10 14 1 0 0 0 0
10 23 1 0 0 0 0
11 18 1 0 0 0 0
11 12 1 0 0 0 0
12 14 2 0 0 0 0
12 19 1 0 0 0 0
13 23 2 0 0 0 0
15 24 2 0 0 0 0
16 20 1 0 0 0 0
16 24 1 0 0 0 0
17 21 1 0 0 0 0
18 22 1 0 0 0 0
19 22 1 0 0 0 0
20 21 1 0 0 0 0
M CHG 1 1 1
M END
> <Formula>
C20H14NO4
> <FW>
332.3289
> <DSSTox_CID>
25204
> <NR-AR>
0
> <NR-ER-LBD>
1
> <NR-AhR>
1
$$$$
NCGC00168261-01
Marvin 07111412562D
23 25 0 0 0 0 999 V2000
2.1236 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.1236 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -3.7235 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.0662 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.4205 -1.2412 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -3.7235 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5656 -2.4895 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5656 -3.3074 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8554 -1.2412 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -0.8251 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.2412 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0430 -2.8984 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 -4.1324 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2902 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7174 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0292 -3.3145 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.4569 -3.3360 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7538 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.1743 -3.7378 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 2 0 0 0 0
1 7 1 0 0 0 0
2 5 2 0 0 0 0
2 9 1 0 0 0 0
3 4 1 0 0 0 0
3 10 1 0 0 0 0
4 6 1 0 0 0 0
5 8 1 0 0 0 0
5 6 1 0 0 0 0
6 16 1 0 0 0 0
6 17 1 0 0 0 0
7 11 2 0 0 0 0
7 13 1 0 0 0 0
8 15 2 0 0 0 0
9 14 2 0 0 0 0
10 12 2 0 0 0 0
11 12 1 0 0 0 0
12 18 1 0 0 0 0
14 15 1 0 0 0 0
14 19 1 0 0 0 0
18 20 1 0 0 0 0
20 22 1 0 0 0 0
21 22 1 0 0 0 0
21 23 1 0 0 0 0
M END
> <Formula>
C21H26O2
> <FW>
310.4299
> <DSSTox_CID>
28922
> <NR-AR>
0
> <NR-AhR>
1
> <SR-MMP>
1
$$$$
NCGC00257010-01
Marvin 07111412562D
35 37 0 0 0 0 999 V2000
2.0286 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -7.8578 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -0.7019 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8589 -3.5779 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.6092 -2.8589 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.6092 -4.2799 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.2784 -4.2799 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -7.1217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -1.4381 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.3681 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5024 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5024 -4.9989 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0915 -4.2799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -3.5779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -4.9989 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7704 -4.2799 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7704 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.7294 -1.1385 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
6.2829 -0.2996 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.7294 -7.4213 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.4384 -8.5597 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
6.2829 -8.2601 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.4384 0.0000 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -2.1485 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0019 -6.4112 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -1.4381 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -7.1217 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -5.7008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.7607 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -5.7008 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.5825 -2.8589 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -6.4112 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.3412 -2.1485 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -2.9103 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0086 -4.2542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 4 2 0 0 0 0
1 5 1 0 0 0 0
1 6 1 0 0 0 0
2 8 1 0 0 0 0
2 20 1 0 0 0 0
2 21 1 0 0 0 0
2 22 1 0 0 0 0
3 9 1 0 0 0 0
3 18 1 0 0 0 0
3 19 1 0 0 0 0
3 23 1 0 0 0 0
4 7 1 0 0 0 0
5 17 1 0 0 0 0
6 16 1 0 0 0 0
7 13 2 0 0 0 0
8 27 1 0 0 0 0
8 25 2 0 0 0 0
9 26 2 0 0 0 0
9 24 1 0 0 0 0
10 16 1 0 0 0 0
10 34 1 0 0 0 0
10 35 1 0 0 0 0
10 17 1 0 0 0 0
11 13 1 0 0 0 0
11 14 2 0 0 0 0
12 13 1 0 0 0 0
12 15 2 0 0 0 0
14 29 1 0 0 0 0
15 28 1 0 0 0 0
24 31 2 0 0 0 0
25 30 1 0 0 0 0
26 33 1 0 0 0 0
27 32 2 0 0 0 0
28 30 2 0 0 0 0
28 32 1 0 0 0 0
29 31 1 0 0 0 0
29 33 2 0 0 0 0
M END
> <Formula>
C25H24F6N4
> <FW>
494.4753
> <DSSTox_CID>
3868
> <NR-AR>
0
> <NR-ER>
1
> <NR-AhR>
1
$$$$
I've seen this post: "Extract molecules in order from SDF file according to IDs given in another file", which offers a Unix solution to this problem. I used that workaround on the command line:
awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1]=$0;next}$1 in a' ids.txt RS="$" molecules.sdf > molecules_by_ids.sdf
This gave me most of what I wanted. However, even with this command I cannot extract 100% of the molecules from the SDF file. For example, 981 molecules are positive for one of the features and the text file contains 981 IDs, but the command writes only 950 molecules to the new SDF file.
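The post does not show why 31 molecules go missing, so the following is only a sketch under assumptions. Two things could make the one-liner drop records: with RS="$", any "$" character inside a molecule's data ends the record early, and if ids.txt has Windows line endings, each stored ID carries a trailing carriage return and never matches. The variant below reads both files line by line, treats a line that is exactly "$$$$" as the record terminator, strips carriage returns, and reports IDs it never found on stderr. The function name extract_sdf_by_id is hypothetical, and molecules are written in SDF order, as with the original command.

```shell
# Hypothetical helper: copy every SDF record whose title line matches an ID.
# Usage: extract_sdf_by_id ids.txt molecules.sdf > molecules_by_ids.sdf
extract_sdf_by_id() {
  awk '
    BEGIN { at_title = 1 }                      # first SDF line is a title line

    # First file: collect the wanted IDs, ignoring blank lines and CR endings.
    NR == FNR { sub(/\r$/, ""); if ($1 != "") want[$1] = 1; next }

    # Second file: normalise line endings before any comparison.
    { sub(/\r$/, "") }

    # The title line of a record decides whether the whole record is kept.
    at_title {
      id = $1
      keep = (id in want)
      if (keep) found[id] = 1
      at_title = 0
    }

    keep { print }

    # A line that is exactly "$$$$" closes the record; the next line is a title.
    $0 == "$$$$" { at_title = 1 }

    # Report wanted IDs that never appeared as a title line in the SDF file.
    END {
      for (id in want)
        if (!(id in found)) print "not found: " id > "/dev/stderr"
    }
  ' "$1" "$2"
}
```

If the "not found" list is non-empty, those IDs are absent from the SDF file as written (or differ in spelling), which would point to the input files rather than the extraction step. Counting the lines that equal "$$$$" in the output gives the number of molecules actually written.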
What I really want is a MATLAB solution that does not miss any of the molecules in the generated file. I appreciate any efforts to make a solution. Thanks!
A workaround I found in MATLAB is the following function, where "id" is the name of the ID TXT file, "sdfs" is the SDF database, and "sdf_name" is the name of the new SDF file with the molecules extracted by ID:
The problem with this solution is that it is VERY slow. If someone knows a faster way to do this please let me know! For now, I will be using this.