# | Experiment | Accuracy | Correctness | Notes
1 | Baseline | 72.12 | 75.71 | Iteration 48
2 | SVD_0 | 71.55 | 73.75 | Reduction to 50 dimensions
3 | Baseline + SVD_0 | 72.11 | 75.88 | Using 50 dim SVD
4 | SVD_1 | 71.63 | 73.75 | 50 dimensions
5 | SVD_2 | 71.62 | 73.69 | 40 dimensions
6 | SVD_3 | 71.56 | 73.73 | 50 dimensions
7 | SVD_1_Kmeans_sub-frames | 71.65 | 73.80 | 50 dims, iteration 44
8 | SVD_1_Kmeans_sub-frames_sub-phones | 71.66 | 73.82 | 50 dims, iteration 29
9 | Kmeans all, preSVD | 71.52 | 73.53 | 7 clusters, 50 dims, iteration 27
10 | Kmeans all, postSVD | 71.34 | 73.30 | 9 clusters, 50 dims, iteration 46
11 | Baseline + Kmeans all | 72.36 | 76.11 | 105 features (exp 1) + 336 features (exp 9)
12 | SVD_0 + Kmeans all | 72.07 | 74.31 | 145 features (exp 2) + 336 features (exp 9)
13 | Posteriors + Linear + LinKLT | 70.31 | 72.15 | 315 features reduced to SVD50, 145 cos
14 | FeatureSpace to CRF | 71.18 | 73.00 | 50 features, 50 cosines
15 | FeatureSpace to MLP to CRF | 72.99 | 75.23 | SVD50, MLP48, 48 feats to CRF
16 | FeatureSpace-MLP + Baseline | 73.46 | 77.32 | 153 features
17 | Baseline + FeatureSpace-MLP + FeatureSpace | 73.46 | 77.37 | 203 features
18 | FeatureSpace, Scaled | 72.05 | 74.56 | 50 features
18.A | FeatureSpace, Scaled + MLP48 (scaled) | 73.39 | 76.39 | 98 features
18.B | FeatureSpace, Scaled + Baseline | 71.76 | 76.21 | 155 features
18.C | FeatureSpace, Scaled + Baseline + MLP48 | 73.19 | 77.75 | 203 features
19 | FeatureSpace, Scaled, to MLP to CRF | 72.99 | 75.26 | 48 features
20 | FeatureSpace, Full SVD to 50 dims, to CRF | 6.91 | 6.91 | 50 features
20.A | Fspace, Full SVD to 300 dims, to CRF | crashed on svd, blue | ? | 300 features
20.B | Fspace, Full SVD to 50 dims, scaled by 100, to CRF | 71.03 | 72.88 | 50 features
21 | FeatureSpace, Phon.Feat. MLP to CRF | 71.93 | 75.20 | 44 features
22 | FeatureSpace, Full SVD, to MLP(48) to CRF | 72.82 | 74.95 | 48 features
23 | FspaceMLP48 + FspaceMLP44 to CRF | 73.24 | 77.16 | 92 features
24 | FspaceMLP48+FspaceMLP44+Baseline | 72.91 | 78.20 | 197 features
25 | Fspace, triphone SVD | 70.61 | 72.32 | 50 features
25.A | Fspace, triphone SVD, 100 dims | 71.18 | 72.99 | 100 features
25.B | Fspace, triphone SVD, 100 dims, scaled; stopped training at 22 | 71.51 | 76.02 | 100 features
26 | MLP48+MLP44+Fspace50 | 73.29 | 77.15 | 142 features
27 | Fspace, triphone SVD to 100, + baseline | 72.07 | 75.62 | 205 features
27.B | Fspace, triphone SVD to 100, Scaled + baseline; stopped training at 14 | 69.77 | 77.17 | 205 features
28 | 4space - 44pos,61pos,44lin,61lin | 3.32 | 9.65 | 200 features
29 | Fspace + baseline | 72.15 | 75.77 | 155 features
30 | Fspace from triphone SVD to 100, to MLP48 | 72.66 | 75.88 | 100 -> 48 features
30.A | Triphone SVD to 100 to MLP48 + Baseline | 71.76 | 77.38 | 153 features
30.B | Triphone SVD to 100 to MLP48 + Baseline + Triphone Fspace, Scaled; stopped training at 25; wrong number of states also | 70.46 | 78.68 | 203 features
31 | Fspace from triphone SVD to 100, to MLP44 | 70.68 | 74.92 | 100 -> 44 features
32 | Posteriors to KLT to CRF | 70.89 | 76.34 | 50 features, trained only to it 39
32.a | Avg Posteriors to KLT to CRF | 69.98 | 76.39 | 50 features, trained only to it 39
33 | Posteriors to KLT to MLP to CRF | 72.77 | 75.19 | 48 features
33.a | Avg Posteriors to KLT to MLP to CRF | 73.12 | 75.48 | 48 features

 

Explanation of Experiments:

 

1. Baseline: The MLP activation of each of 105 phonetic and sub-phonetic features for each frame of data. 105 state feature functions as input to the CRF. Labels are 48 phones, reduced to 39 for testing.

 

2. SVD_0: Calculate the average activation of each of the 105 features over each of 145 phone states (automatically aligned to 3 states per phone, one for silence). The 105x145 matrix undergoes SVD. Reduce to various sizes (50 is best). Calculate the cosine of each frame of data to each new "phone state" column in the P matrix. 145 state feature functions (cosines per frame) given to CRF to train. (Further tuning showed that 51 dimensions is ever-so-slightly better than 50 dimensions.)
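A minimal NumPy sketch of the SVD_0 steps, for illustration only. The function names are made up here, and the notes do not say exactly how frames are mapped into the reduced space; this sketch assumes both the frames and the phone-state averages are projected with the first k left singular vectors before the cosine is taken.

```python
import numpy as np

def state_means(frames, states, n_states):
    """Mean activation of every feature over each phone state.

    frames: (n_frames, n_features) MLP activations
    states: (n_frames,) integer state label per frame (every state must occur)
    returns: (n_features, n_states), e.g. 105 x 145 in experiment 2
    """
    means = np.zeros((frames.shape[1], n_states))
    for s in range(n_states):
        means[:, s] = frames[states == s].mean(axis=0)
    return means

def svd_cosine_features(frames, means, k):
    """Cosine of each frame to each phone-state vector in the rank-k space.

    Assumption: frames and state means are both projected by U_k.
    """
    U, S, Vt = np.linalg.svd(means, full_matrices=False)
    Uk = U[:, :k]                      # (n_features, k)
    state_vecs = Uk.T @ means          # (k, n_states); equals diag(S_k) @ Vt_k
    frame_vecs = frames @ Uk           # (n_frames, k)
    dots = frame_vecs @ state_vecs     # (n_frames, n_states)
    norms = (np.linalg.norm(frame_vecs, axis=1, keepdims=True)
             * np.linalg.norm(state_vecs, axis=0, keepdims=True))
    return dots / np.maximum(norms, 1e-12)   # guard against zero-length vectors
```

With 105 features, 145 states and k=50, the result has one row per frame and 145 columns, matching the 145 state feature functions given to the CRF.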

 

3. Baseline + SVD_0: For each frame, concatenate the Baseline 105 MLP outputs and the SVD_0 145 cosines. Train the CRF on 250 state feature functions.

 

4. SVD_1: Look at the cosines generated by SVD_0. For each set of frames corresponding to a single phone, mark the frames for which the highest cosine is less than some threshold. Set the threshold so that, overall, 25% of frames are marked. For each phone, calculate the centroid of the MLP activations for the frames that are marked. These centroids are the correct size to be appended to the original 105x145 matrix. Recalculate SVD on the augmented matrix, reduce the matrix to 50 dimensions, and recalculate the cosine of each frame to each of the (now) 193 phone states. Give 193 state feature functions for the CRF to train on.
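The marking and augmentation step might look roughly like the sketch below. The function name and the use of a quantile to pick the threshold are assumptions; the notes only say the threshold is set so that 25% of frames are marked.

```python
import numpy as np

def append_low_cosine_centroids(frames, cosines, phones, means, frac=0.25):
    """Append one centroid per phone, computed from poorly matched frames.

    frames:  (n_frames, n_features) MLP activations
    cosines: (n_frames, n_states) cosines from the previous SVD round
    phones:  (n_frames,) phone label per frame
    means:   (n_features, n_states) matrix that went into the previous SVD
    """
    best = cosines.max(axis=1)              # highest cosine per frame
    threshold = np.quantile(best, frac)     # assumed way of hitting ~25%
    marked = best <= threshold
    centroids = [frames[marked & (phones == p)].mean(axis=0)
                 for p in np.unique(phones)
                 if np.any(marked & (phones == p))]   # skip phones with no marked frames
    augmented = np.column_stack([means] + centroids)
    return augmented, marked
```

With 48 phones each contributing one centroid, a 105x145 input grows to 105x193, which is then passed through SVD and the cosine step again.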

 

5. SVD_2: Do exactly as above, starting with the 193 cosines: find a new threshold to hold out 25% of frames, calculate new centroids to append to the SVD matrix, etc. 241 state feature functions.

 

6. SVD_3: And again. 289 state feature functions.

 

7. SVD_1_Kmeans_sub-frames: Instead of just finding the average of the sub-threshold frames, use k-means clustering to determine the new features. That is, extract the frames that are below a threshold, calculate k-means (k=2) on the MLP activations of those frames, find the centroids of those two clusters, and add them to the original SVD matrix. ReSVD, ReReduce, ReCosine, ReTrain on 241 state feature functions.

 

8. SVD_1_Kmeans_sub-frames_sub-phones: As in experiment 7, but only calculate clusters and centroids for the sub-threshold frames of those phones whose recognition accuracy score was below the accuracy score of the whole data set (8 phones). Append 16 new centroids to the original SVD, ReSVD, ReReduce, ReCosine, ReTrain on 161 state feature functions.

 

9. Kmeans all, pre-SVD: Start with the 105-feature frames (the MLP activations). Group the frames by phone label. For each group, run the k-means calculation over the frames for k=2 through k=10. The centroids of the resulting clusters form the matrix that then undergoes SVD and reduction to 50 dimensions. The number of state feature functions depends on how many clusters there were: there are (48*clusters) cosines/state feature functions. The best result was for 7 clusters.
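As a rough illustration of the per-phone clustering, here is a plain Lloyd's k-means in NumPy. The notes do not say which k-means implementation or initialisation was used, so the random-sample initialisation and the helper names are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns the (k, n_features) centroids."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):            # an empty cluster keeps its old centroid
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids

def phone_centroid_matrix(frames, phones, k):
    """Stack k centroids per phone as columns: (n_features, n_phones * k)."""
    cols = [kmeans(frames[phones == p], k).T for p in np.unique(phones)]
    return np.hstack(cols)
```

For 48 phones and k=7 this gives a 105x336 matrix, which is where the 336 features of the best run come from.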

 

10. Kmeans all, post-SVD: Start with the 145 state feature functions (cosines) resulting from SVD_0 (experiment 2). Group the frames by phone label, and run the k-means calculation for k=2 through k=10. Centroids become the SVD matrix, cosines of frames to SVD columns become the new state feature functions. Again, the number of state feature functions varies by how many clusters were made per phone. Best results were with 9 clusters.

 

11. Baseline + Kmeans all: Append the 105 features of original MLP data to the 336 features resulting from the best run of experiment 9. Train the CRF (no extra SVD required here, just mashing feature functions together). Results are higher than the baseline alone, but not significantly so.

 

12. SVD_0 + Kmeans all: Append the 145 features resulting from experiment 2 to the 336 features of experiment 9.

 

13. Posteriors+Linear+LinKLT: Concatenate the pfiles containing the output of the 105 MLP classifiers as softmax posteriors, linear outputs, and linear transformed outputs. This is a pfile with 315 features. Calculate the average value of each of the features over each of 145 phone states. This is the original SVD matrix. Perform SVD and reduce to various sizes. Use the frames of the pfile to get cosines to the 145 new phone vectors. Train on 145 cosine features, test on devtest.

 

14. FeatureSpace to CRF: Start with the 105x145 matrix (MLP features by phone states). Run the SVD and reduce to 50 features. Multiply the left (feature) matrix by the inverse diagonal matrix. Convert each frame of the original MLP data into this feature space, so now each frame has 50 features instead of 105. Train the CRF on this data. Tested dimensionalities 47-53; best was 53 features: 71.25/73.05 at iteration 47.
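The projection into the feature space can be sketched as follows, assuming "multiply the left matrix by the inverse diagonal matrix" means forming U_k S_k^{-1} and right-multiplying each frame by it. The function names are illustrative only.

```python
import numpy as np

def fspace_projection(means, k):
    """Return the (n_features, k) matrix U_k @ inv(diag(S_k)).

    means: (n_features, n_states) averaged activations, e.g. 105 x 145
    """
    U, S, Vt = np.linalg.svd(means, full_matrices=False)
    return U[:, :k] / S[:k]            # divides column j of U_k by S[j]

def to_fspace(frames, projection):
    """Map (n_frames, n_features) frames to (n_frames, k) fspace features."""
    return frames @ projection
```

One property of this choice: projecting the phone-state averages themselves returns the first k right singular vectors, so the phone states come out orthonormal in the new space.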

 

15. FeatureSpace to MLP to CRF: Start as above. Train an MLP with the 50 features as input, 48 labels as output. Calculate the posteriors of all training and testing data for those 48 labels. Train the CRF on those 48 features.

 16. Baseline + FSpace-MLP: Append the 48 Features resulting from the MLP to the 105 original features in the baseline experiment. Retrain CRF.

 17. Baseline + FSpace-MLP + Fspace: To experiment 16, append the features from experiment 14 (50dims).

 18. Fspace, scaled: Normalize the variation among the features derived from the SVD fspace. Train CRF and decode.
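A small sketch of the scaling step, assuming "normalize the variation" means per-feature scaling to mean 0 and standard deviation 1 (as norm2vars.pl is described as doing in experiment 28). The helper names are hypothetical, and fitting the statistics on the training set only is also an assumption.

```python
import numpy as np

def fit_scaler(train):
    """Per-feature mean and standard deviation from the training frames."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std > 0, std, 1.0)   # leave constant features unscaled
    return mean, std

def apply_scaler(frames, mean, std):
    """Scale frames with statistics fitted elsewhere (e.g. on training data)."""
    return (frames - mean) / std
```

The same `mean` and `std` would be applied to both the training and the test frames before CRF training and decoding.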

 19. Fspace, scaled to MLP: Scale the F-space output, then retrain the phone MLP, then train CRF and decode.

 20. Fspace, full svd: Instead of averaging the 105 features over the phones, take the SVD of the entire space - all MLP derived frames. Again, reduce the resulting matrices to 50 dimensions. The left matrix is 50x105, so do the fspace (non-cosine) experiments with this new SVD matrix. Train CRF and decode.

 21. Fspace, phonological feature MLPs: Using the 105x145 matrix for SVD, derive the fspace version of the data. Retrain each of the 8 phon. feature class MLPs, combine for 44 features, train CRF and decode.

 22. FSpace, Full SVD, to MLP to CRF: Using the features generated by reducing the SVD on the entire data matrix, train the 48-phone MLP. Use the output to train a CRF.

 23. FspaceMLP48 + FspaceMLP44 to CRF: Combine the 48 phone features from experiment 15 with the 44 phonological features from experiment 21, retrain CRF and decode.

 24. FspaceMLP48 + FspaceMLP44 + Baseline: Combine the 105 posteriors of the baseline with the features in experiment 23. Retrain CRF and decode.

 25. Fspace, triphone SVD: For each triphone exhibited in the training set, if it shows up in more than 100 frames, get the average activation for each feature over all frames labeled with that triphone. This gives a 2598x105 matrix. Transpose, do SVD, use Fspace projection of 105 posteriors as input to CRF.
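Building the triphone average matrix could look like the sketch below. It assumes each frame already carries a triphone label (here a string); how those labels were derived is not stated in the notes, and the function name is made up.

```python
import numpy as np

def triphone_mean_matrix(frames, triphones, min_frames=100):
    """Average activations per triphone seen in more than min_frames frames.

    frames:    (n_frames, n_features) MLP activations
    triphones: (n_frames,) 1-D array of triphone labels, one per frame
    returns:   kept labels and a (n_kept_triphones, n_features) matrix
    """
    labels, inverse, counts = np.unique(
        triphones, return_inverse=True, return_counts=True)
    keep = np.flatnonzero(counts > min_frames)
    # Assumes at least one triphone passes the count filter.
    rows = np.stack([frames[inverse == i].mean(axis=0) for i in keep])
    return labels[keep], rows
```

On the training set described above this would yield the 2598x105 matrix, which is then transposed before the SVD.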

 26. MLP48+MLP44+Fspace50: Take the features that result from training the phone-MLP (48) and the phone-feature-MLP (44) on the fspace features. Append them to the 50 fspace features. Train the CRF and decode.

 27. Triphone Fspace to 100 + Baseline: Take the features derived from reducing the triphone fspace to 100 dimensions. Append to these features the 105 (non-SVD) baseline features.

 28. 4space - 44pos,61pos,44lin,61lin: Use the PARAFAC2 algorithm. Start with 4 matrices representing the average activations from posterior or linear outputs over 145 phone states. Each type of activation goes in its own matrix, one slab of a 3D matrix. Use PARAFAC2 to reduce the 145 to 50 (this is the opposite of how the other experiments worked, but makes sense in this context (?)). Each slab is reduced separately, but in a way contingent upon the other slabs. Multiply each slab by H, which is like S. Project the frames of the original data onto the appropriate slab to get 50 features per frame. Append the projections to get 200 features per frame. Use norm2vars.pl to scale the features to mean 0, std 1. Train a CRF and decode.

 29. Fspace+baseline: Reduce the 105x145 SVD to 105x50. Project the frames into the 50dim Fspace (exp 14). Append to these features the 105 baseline features. Train CRF and decode.

 32. Posteriors to KLT to CRF. Do a KL transform over the training data. Warp the training and test pfiles by the klt stats. Train the CRF and decode.

 32a. Avg Posteriors to KLT to CRF. Since we took SVD over the averages, not the whole, do KLT over the averages, then warp the train and test data, then train the CRF.

 33. Posteriors to KLT to MLP to CRF. Do a KL transform over the training data. Warp the training and test pfiles by the klt stats. Train a 48-output MLP over the KLT data. Train the CRF on the MLP outputs.

 33a. Avg Posteriors to KLT to MLP to CRF. Do a KL transform over the averages of the training data. Warp the training and test pfiles by the klt stats. Train a 48-output MLP over the KLT data. Train the CRF on the MLP outputs.
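For the KLT experiments (32 through 33a), a minimal sketch of the transform, treating the KL transform as an eigendecomposition of the covariance matrix. The function names are hypothetical, and whether the original tools also normalised the variances is not stated, so none is applied here.

```python
import numpy as np

def fit_klt(train, n_components=None):
    """Estimate KLT statistics: mean and covariance eigenvectors.

    train: (n_samples, n_features); eigenvectors are returned as columns,
    ordered from largest to smallest eigenvalue.
    """
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:                # e.g. keep 50 dimensions
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    return mean, eigvecs, eigvals

def apply_klt(frames, mean, eigvecs):
    """Warp frames with previously fitted KLT statistics."""
    return (frames - mean) @ eigvecs
```

For experiments 32 and 33 the statistics would be fitted on all training frames; for 32a and 33a they would be fitted on the per-state averages (one row per phone state) and then applied to the training and test frames.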