islamger.blogg.se - Urdu speech to text software

This work is focused on pattern recognition of extracted speech features to classify isolated Urdu words. Pattern recognition finds patterns in data to perform certain tasks, while the processes that learn the patterns present in data can be referred to as machine learning. All machine learning algorithms can be represented as a general model with tunable parameters, which are learned during the training phase from the available data.

This paper proposes an end-to-end neural network model for Urdu ASR, regularized with dropout, ensemble averaging and Maxout units. Dropout and ensembles are averaging techniques over multiple neural network models, while Maxout units are neural network units that adapt their activation functions. Due to limited labeled data, Semi-Supervised Learning (SSL) techniques are also incorporated to improve model generalization. Speech features are transformed into a lower-dimensional manifold using an unsupervised dimensionality-reduction technique called Locally Linear Embedding (LLE). The transformed data, along with the higher-dimensional features, is used to train neural networks. The proposed model also utilizes label propagation-based self-training of initially trained models and achieves a Word Error Rate (WER) 4% lower than the benchmark reported on the same Urdu corpus using an HMM. The decrease in WER after incorporating SSL is more significant with an increased validation data size.

Automatic Speech Recognition (ASR) can be a vital component in artificially intelligent interactive systems. After more than 50 years of research, ASR is still not a completely solved problem. ASR involves three main tasks: (a) extraction of useful features from speech for recognition of language phonemes or words, (b) classification of the extracted features into words, and (c) probabilistic modeling of the predicted words based on language grammar and dictionary. The most widely used features for ASR are the Mel Frequency Cepstral Coefficients (MFCC). MFCC are found by taking the Discrete Cosine Transform (DCT) of the logarithm of the energies in a filter bank, spaced on the Mel scale, applied to the Power Spectral Density (PSD) of speech. Lower MFCC coefficients correspond to slowly varying components of the spectrum and are used for speaker-independent speech recognition because they represent the vocal tract response that shapes phonemes.
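The MFCC recipe described above (PSD, Mel filter bank, logarithm, DCT) can be sketched for a single speech frame as follows. This is a minimal illustration, not the pipeline used in the paper: the Hamming window, the 26 filters, the 13 retained coefficients, the common 2595/700 Mel formula, and all function names are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(hz):
    # A common Mel-scale formula (assumed here; variants exist).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the Mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        # Triangle: the smaller of the two slopes, clipped at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x, n_out):
    """Unnormalized DCT-II of a 1-D array, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: PSD -> Mel filter bank -> log -> DCT."""
    n_fft = frame.size
    windowed = frame * np.hamming(n_fft)
    psd = np.abs(np.fft.rfft(windowed)) ** 2 / n_fft
    energies = mel_filter_bank(n_filters, n_fft, sample_rate) @ psd
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    return dct_ii(log_energies, n_coeffs)

# Usage: a 25 ms frame of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 440 * t)
coeffs = mfcc_frame(frame, sample_rate)
print(coeffs.shape)  # (13,)
```

In a full recognizer the same computation would be repeated over overlapping frames of each utterance, and the lower coefficients kept as the feature vector passed to the classifier.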

Automatic Speech Recognition (ASR) has achieved its best results for English, with end-to-end neural-network-based supervised models. These supervised models need huge amounts of labeled speech data to generalize well, which can be quite a challenge to obtain for low-resource languages like Urdu. Most models proposed for Urdu ASR are based on Hidden Markov Models (HMMs).