Speech recognition is the process of using a computer to turn spoken words into text or commands.
Examples:
A neural network is a computer model inspired by the way the human brain processes information.
A neural network learns patterns from examples. Instead of a programmer writing every rule by hand, the neural network studies data and adjusts itself so it can make better predictions.
For speech recognition, the neural network learns patterns between:
Human speech is complex. People speak with different:
Older speech recognition systems used many hand-written rules. Neural networks are useful because they can learn patterns from large amounts of speech data.
A speech recognition system usually follows a pipeline.
Turn on the lights.
Computers cannot directly understand sound waves. They need numbers.
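As a minimal sketch of how sound becomes numbers, the snippet below samples a pure tone at regular time steps. The 440 Hz tone, the rate of 16,000 samples per second, and the function name `sample_tone` are illustrative assumptions, not values from this guide.

```python
import math

def sample_tone(freq_hz, sample_rate, duration_s):
    """Return amplitude samples of a sine tone, one number per time step."""
    count = round(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]

# Assumed values: a 440 Hz tone sampled 16,000 times per second for 10 ms.
samples = sample_tone(440.0, 16000, 0.01)
print(len(samples))                        # how many numbers describe 10 ms of sound
print([round(s, 3) for s in samples[:4]])  # the first few amplitude values
```

Each number is the height of the sound wave at one instant. A real recording would come from a microphone rather than a formula, but the result is the same kind of list.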
When speech is recorded:
A feature is a useful measurement taken from the audio. Rather than working only with the raw samples, many systems convert speech into features that are easier for the model to learn from.
| Feature | Meaning |
|---|---|
| Frequency | How often the sound wave repeats. |
| Pitch | How high or low the voice sounds. |
| Loudness | How strong the sound is. |
| Duration | How long a sound lasts. |
| Spectrogram | Picture-like view of frequencies over time. |
| Mel spectrogram | Spectrogram adjusted to match human hearing. |
| MFCCs | Mel-frequency cepstral coefficients: compact features often used for speech. |
A spectrogram is a visual representation of sound.
It shows:
A spectrogram lets a computer “see” sound patterns.
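A rough sketch of how a spectrogram can be computed: cut the audio into short frames and measure how strong each frequency is in every frame. This version uses a plain discrete Fourier transform with no windowing, which is slow and simplified. The function names, frame size, and test tone are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Strength of each frequency bin in one frame (plain DFT, no window)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_size, hop):
    """List of frames over time; each frame is a list of frequency strengths."""
    starts = range(0, len(samples) - frame_size + 1, hop)
    return [dft_magnitudes(samples[s:s + frame_size]) for s in starts]

# Assumed test signal: a 1000 Hz tone sampled at 8000 samples per second.
sample_rate = 8000
frame_size = 64
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(256)]

spec = spectrogram(tone, frame_size=frame_size, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
print(len(spec), "frames,", len(spec[0]), "frequency bins per frame")
print("strongest frequency in first frame:", peak_hz, "Hz")
```

The rows of `spec` run along time and the columns run along frequency, which is why the result can be drawn as a picture. Practical systems typically use a fast Fourier transform and a window function instead of this direct calculation.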
For example:
A mel spectrogram is a special type of spectrogram designed to better match human hearing.
Humans do not hear all frequency differences equally. We are more sensitive to some frequency ranges than others.
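One commonly used formula for converting hertz to the mel scale is sketched below. Several variants of the mel scale exist, so treat the constants 2595 and 700 as one convention rather than the only definition. The function name `hz_to_mel` is an assumption.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to mels using one common formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 100 Hz step covers fewer mels at high frequencies than at low ones.
low_step = hz_to_mel(1100) - hz_to_mel(1000)
high_step = hz_to_mel(5100) - hz_to_mel(5000)
print(round(low_step, 1), round(high_step, 1))
```

Because the high-frequency step is smaller in mels, a mel spectrogram spends more of its detail on the lower frequencies, where human hearing separates pitches more finely.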
A phoneme is the smallest sound unit that can change the meaning of a word.
Examples:
- bat and pat differ by the phonemes /b/ and /p/.
- sip and zip differ by the phonemes /s/ and /z/.

Speech recognition systems may first try to recognize phonemes before building words.
/k/ /a/ /t/
Modern speech recognition systems often use tokens.
A token can be:
playing might be split into tokens like:
- play
- ing

Tokens help systems handle many words, including words they may not have seen often.
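A simplified sketch of splitting a word into tokens, using greedy longest-match against a small vocabulary. The vocabulary here is made up for illustration. Real systems usually learn their vocabulary from data with methods such as byte-pair encoding, which this sketch does not implement.

```python
def tokenize(word, vocab):
    """Split a word into the longest known pieces, left to right.

    Falls back to single characters when no longer piece is in the vocabulary.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is known or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary of word parts.
vocab = {"play", "ing", "er", "ed", "s"}
print(tokenize("playing", vocab))
print(tokenize("players", vocab))
```

Even if the system has never seen "players" as a whole word, it can still represent it with pieces it already knows.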
A simple neural network has layers.
| Layer | Job |
|---|---|
| Input layer | Receives data, such as audio features. |
| Hidden layers | Find patterns in the data. |
| Output layer | Makes a prediction. |
In speech recognition:
A neural network contains artificial neurons. Each neuron receives numbers, performs a calculation, and passes a result to the next layer.
Connections between neurons have weights. Weights control how strongly one neuron affects another.
During training, the network adjusts its weights to improve its predictions.
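The calculation inside one artificial neuron can be sketched as a weighted sum followed by an activation function. The sigmoid activation and the example numbers are assumptions chosen for illustration, and many networks use other activations.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed to the range 0..1."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical audio features and weights.
features = [0.2, 0.7, 0.1]
hidden = layer(features, [[0.5, -1.0, 2.0], [1.5, 0.3, -0.4]], [0.0, 0.1])
print([round(h, 3) for h in hidden])
```

A larger weight makes its input count for more in the sum, which is what "how strongly one neuron affects another" means in practice.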
Training means teaching a neural network using examples.
For speech recognition, training data usually includes:
| Audio | Correct Transcript |
|---|---|
| Recording of a person saying “open the door” | open the door |
The model makes a prediction, compares it to the correct answer, and updates its weights. This happens many times with thousands or millions of examples.
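The predict, compare, update cycle can be sketched with the smallest possible model: one weight that should learn to double its input. This toy problem, the learning rate, and the function name `train` are assumptions for illustration. A speech model repeats the same cycle with far more weights.

```python
def train(examples, learning_rate=0.1, epochs=50):
    """Learn a single weight w so that w * x matches the target y."""
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            prediction = w * x               # 1. make a prediction
            error = prediction - y           # 2. compare with the correct answer
            w -= learning_rate * error * x   # 3. nudge the weight to shrink the error
    return w

# Toy data where the correct answer is always twice the input.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(examples)
print(round(w, 4))
```

Each pass over the examples moves the weight a little closer to the value that makes the predictions match the answers.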
A loss function measures how wrong the neural network’s prediction is.
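One widely used loss for choosing between options is cross-entropy, sketched below. This guide does not say which loss a given system uses, so treat the choice as an assumption. The probabilities are invented for illustration.

```python
import math

def cross_entropy(probabilities, correct_label):
    """Loss is small when the correct label gets a high probability."""
    return -math.log(probabilities[correct_label])

# Hypothetical model outputs for the last word of a command.
good_guess = {"light": 0.9, "right": 0.1}
bad_guess = {"light": 0.1, "right": 0.9}

print(round(cross_entropy(good_guess, "light"), 3))
print(round(cross_entropy(bad_guess, "light"), 3))
```

The second loss is much larger because the model put most of its confidence on the wrong word.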
- Correct transcript: turn on the light
- Prediction: turn on the right

Backpropagation is the process neural networks use to learn from mistakes.
The basic idea:
Different neural network designs can be used for speech recognition.
A feedforward neural network passes data in one direction from input to output. It is simple, but it has no memory of earlier inputs, so it may not be ideal for long speech, where meaning depends on time order.
A convolutional neural network, or CNN, is good at finding patterns in grid-like data.
CNNs can analyze spectrograms because spectrograms are similar to images.
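The core operation of a CNN can be sketched as sliding a small grid of weights, called a kernel, across the spectrogram. The tiny spectrogram and the kernel below are made up for illustration. In a real CNN the kernel values are learned during training.

```python
def convolve2d(grid, kernel):
    """Slide the kernel over the grid and record the weighted sum at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for r in range(len(grid) - k_rows + 1):
        row = []
        for c in range(len(grid[0]) - k_cols + 1):
            total = sum(
                grid[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output

# Rows are frequency bands, columns are time steps; the sound starts at step 2.
tiny_spectrogram = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
onset_kernel = [[-1, 1]]  # responds where energy rises from one step to the next
print(convolve2d(tiny_spectrogram, onset_kernel))
```

The output is largest exactly where the sound begins, showing how one kernel can act as a detector for one kind of pattern.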
CNNs can detect patterns such as:
A recurrent neural network, or RNN, is designed for sequences.
Speech is a sequence because sounds happen over time. RNNs can use earlier sounds to help understand later sounds.
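A minimal sketch of how an RNN carries information forward: at every time step it combines the new input with a hidden state left over from the previous step. The single-number state and the weight values are simplifying assumptions. Real RNNs use vectors and learned weights.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, bias):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + bias)

def run_rnn(sequence, w_x=0.5, w_h=0.8, bias=0.0):
    """Process a sequence one step at a time, returning every hidden state."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, bias)
        states.append(h)
    return states

# Both sequences end with the same input, but their histories differ.
after_sound = run_rnn([1.0, 0.0])
after_silence = run_rnn([0.0, 0.0])
print(round(after_sound[-1], 3), round(after_silence[-1], 3))
```

The final input is 0.0 in both cases, yet the final states differ, because the first sequence still "remembers" the earlier sound.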
LSTM and GRU networks are improved types of RNNs. They are designed to remember important information for longer periods.
This is useful in speech because words and sentence meaning depend on context.
Transformers are modern neural networks that use attention.
Attention helps the model focus on important parts of the audio or text. Transformers are widely used in modern speech recognition systems because they can process all positions of a sequence in parallel and relate distant parts of it directly.
Attention is a technique that helps a neural network decide which parts of the input are most important.
Attention helps the system connect audio parts to the correct words.
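Attention can be sketched as scoring every input position against a query, turning the scores into weights that sum to 1, and then blending the inputs by those weights. This is a small version of scaled dot-product attention. The vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, context

# Hypothetical example: three audio frames, and a query that resembles the first.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
weights, context = attend(query, keys, values)
print([round(w, 3) for w in weights])
```

The first frame receives the largest weight because its key points in the same direction as the query, so it contributes most to the blended result.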
Speech recognition often uses two important ideas.
The acoustic model connects audio patterns to speech sounds or tokens.
The language model predicts likely word sequences.
Decoding is the process of choosing the final text output from possible predictions.
The model may consider many possible word sequences and choose the most likely one.
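One way to picture decoding is to add up, for each candidate sentence, a score from the acoustic model and a score from the language model, then keep the best total. The candidate sentences, the log-probability values, and the `lm_weight` parameter below are all invented for illustration. Real decoders search over far more candidates.

```python
def decode(candidates, lm_weight=1.0):
    """Pick the candidate with the best combined acoustic and language score."""
    return max(
        candidates,
        key=lambda c: c["acoustic_logp"] + lm_weight * c["lm_logp"],
    )

# Hypothetical scores: the two phrases sound alike, so the acoustic scores are
# close, but the language model finds one word sequence far more likely.
candidates = [
    {"text": "I want some I scream", "acoustic_logp": -4.0, "lm_logp": -12.0},
    {"text": "I want some ice cream", "acoustic_logp": -4.1, "lm_logp": -3.0},
]
best = decode(candidates)
print(best["text"])
```

Setting `lm_weight` to 0 would ignore the language model, and the slightly better acoustic score would win instead.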
- I scream
- ice cream

Word error rate, or WER, is a common way to measure speech recognition accuracy.
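WER measures how many words were:
- substituted (replaced by a wrong word)
- deleted (left out)
- inserted (added when nothing was said)

WER = (S + D + I) / N

Where:
- S is the number of substituted words
- D is the number of deleted words
- I is the number of inserted words
- N is the number of words in the correct transcript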
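Word error rate can be computed as a word-level edit distance between the correct transcript and the prediction, divided by the number of words in the correct transcript. The sketch below takes that approach. The function name is an assumption, and it does not normalize case or punctuation the way a production tool might.

```python
def word_error_rate(reference, hypothesis):
    """Fewest word substitutions, deletions, and insertions, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

print(word_error_rate("turn on the light", "turn on the right"))  # 0.25
```

One wrong word out of four gives a WER of 0.25, or 25%. Because insertions count as errors, WER can exceed 100% when the prediction is much longer than the correct transcript.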
- Correct transcript: turn on the light
- Prediction: turn on the right
- Error: right instead of light.

Neural networks can be powerful, but speech recognition is still difficult.
Challenges include:
Speech recognition systems may work better for some groups of people than others.
Possible causes:
Speech data can contain personal information.
A voice recording may reveal:
Important safety practices:
Neural networks for speech recognition are used in many technologies.
| Application | Example |
|---|---|
| Voice assistants | Siri, Alexa, Google Assistant |
| Dictation | Voice typing documents or messages |
| Captioning | Automatic captions for videos |
| Accessibility | Helping people who cannot type easily |
| Customer service | Automated phone systems |
| Translation | Spoken language translation |
| Cars | Voice-controlled navigation |
| Education | Reading and pronunciation tools |
| Games | Voice commands and interactive characters |
Voice Command Classifier
Build a small system that recognizes a few spoken commands.
Short audio clips recorded from a microphone.
One command label, such as:
- start
- stop
- pause
- resume
- help
- unknown

Recordings of different people saying each command.
Possible features:
The system correctly identifies the command at least 90% of the time on test examples.
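Checking the 90% goal can be sketched as comparing the predicted labels with the correct labels on held-out test clips. The label lists below are invented for illustration. In a real project they would come from running the trained classifier on recordings it has not seen.

```python
def accuracy(predicted, expected):
    """Fraction of test examples where the predicted label matches the correct one."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical results on ten test clips; one "pause" was heard as "resume".
expected = ["start", "stop", "pause", "resume", "help",
            "unknown", "start", "stop", "pause", "help"]
predicted = ["start", "stop", "pause", "resume", "help",
             "unknown", "start", "stop", "resume", "help"]

score = accuracy(predicted, expected)
meets_goal = score >= 0.90
print(f"accuracy: {score:.0%}, meets goal: {meets_goal}")
```

Ten clips is far too few for a trustworthy estimate. A larger test set with speakers who were not in the training data gives a more honest picture.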
| Term | Meaning |
|---|---|
| Speech recognition | Turning spoken words into text or commands. |
| ASR | Automatic speech recognition. |
| Neural network | Computer model that learns patterns from data. |
| Training | Teaching a model using examples. |
| Feature | Useful measurement from data. |
| Spectrogram | Visual display of frequencies over time. |
| Mel spectrogram | Spectrogram adjusted to match human hearing. |
| MFCC | Mel-frequency cepstral coefficient: compact speech feature used in audio analysis. |
| Phoneme | Smallest sound unit that can change word meaning. |
| Token | Unit used by a model, such as a letter, word, or word part. |
| Neuron | Small calculation unit in a neural network. |
| Weight | Value that controls connection strength between neurons. |
| Loss function | Measures how wrong a model is. |
| Backpropagation | Method for adjusting weights based on error. |
| CNN | Neural network good at finding patterns in image-like data. |
| RNN | Neural network designed for sequence data. |
| LSTM | Type of RNN that remembers longer patterns. |
| GRU | Simpler type of RNN that remembers sequence patterns. |
| Transformer | Modern neural network using attention. |
| Attention | Method that helps a model focus on important information. |
| Acoustic model | Model that connects audio patterns to speech sounds. |
| Language model | Model that predicts likely word sequences. |
| Decoding | Choosing the final text from possible predictions. |
| Word error rate | Measure of speech recognition mistakes. |