Neural Networks for Speech Recognition

Class Notes for High School Students

1. What Is Speech Recognition?

Speech recognition is the process of using a computer to turn spoken words into text or commands.

Examples:

Speech recognition is sometimes called automatic speech recognition, or ASR.

2. What Is a Neural Network?

A neural network is a computer model inspired by the way the human brain processes information.

A neural network learns patterns from examples. Instead of a programmer writing every rule by hand, the neural network studies data and adjusts itself so it can make better predictions.

For speech recognition, the neural network learns patterns between audio and the matching text.

3. Why Use Neural Networks for Speech Recognition?

Human speech is complex. People speak with different pitches, speeds, and accents.

Older speech recognition systems relied heavily on hand-written rules and hand-designed models. Neural networks are useful because they can learn these patterns directly from large amounts of speech data.

Example:
Different people may say “hello” with different pitch, speed, and accent, but the system can still learn that the word is “hello.”

4. Speech Recognition Pipeline

A speech recognition system usually follows a pipeline.

  1. Capture audio using a microphone.
  2. Digitize audio by converting sound waves into numbers.
  3. Preprocess audio by reducing noise or adjusting volume.
  4. Extract features from the sound.
  5. Use a neural network to predict sounds, letters, words, or tokens.
  6. Use a language model to choose likely word sequences.
  7. Output text to the user.

Example:
Input speech: “Turn on the lights.”
Output text: Turn on the lights.

5. Turning Sound Into Numbers

Computers cannot directly understand sound waves. They need numbers.

When speech is recorded:

  1. A microphone captures the sound.
  2. The sound is converted into digital samples.
  3. The samples become a long list of numbers.
  4. The computer analyzes the number patterns.

A spoken sentence may become thousands or millions of numbers depending on the length and sampling rate.
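
The steps above can be sketched in a few lines of Python. This is only an illustration: instead of a real microphone, it creates a pure tone, and the function name `record_tone` and the sampling rate of 16,000 samples per second are assumptions chosen for the example.

```python
import math

def record_tone(duration_s, sample_rate, frequency_hz=440.0):
    """Stand-in for a microphone: return the samples of a pure tone."""
    n_samples = int(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

# Assumed values: a 2-second sound sampled 16,000 times per second.
samples = record_tone(duration_s=2.0, sample_rate=16000)

print(len(samples))   # number of samples = duration * sampling rate
print(samples[:3])    # the first few numbers the computer would analyze
```

Doubling either the length of the sound or the sampling rate doubles how many numbers the computer has to process.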

6. Features Used in Speech Recognition

A feature is a useful measurement taken from the audio. Instead of analyzing raw sound only, many systems convert speech into features that are easier for the model to understand.

Feature           Meaning
Frequency         How often the sound wave repeats.
Pitch             How high or low the voice sounds.
Loudness          How strong the sound is.
Duration          How long a sound lasts.
Spectrogram       Picture-like view of frequencies over time.
Mel spectrogram   Spectrogram adjusted to match human hearing.
MFCCs             Compact features often used for speech.

7. Spectrograms

A spectrogram is a visual representation of sound.

It shows which frequencies are present in a sound and how they change over time.

A spectrogram lets a computer “see” sound patterns.

For example:

Neural networks can analyze spectrograms similarly to how image recognition systems analyze pictures.
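
Here is a minimal sketch of how a spectrogram could be computed. It cuts the audio into short frames and measures how strong each frequency is inside each frame. The function names, the frame size of 64 samples, and the 1,000 Hz test tone are assumptions for this example. Real systems use a much faster method (the fast Fourier transform) and overlapping frames.

```python
import cmath
import math

def frame_magnitudes(frame):
    """Measure the strength of each frequency in one short frame (simple DFT)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes

def spectrogram(samples, frame_size):
    """Split the samples into frames and analyze each one: rows are time steps."""
    return [
        frame_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]

# Assumed test signal: a 1,000 Hz tone sampled 8,000 times per second.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]

spec = spectrogram(tone, frame_size=64)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])

print(len(spec))                        # number of time steps (frames)
print(loudest_bin * sample_rate / 64)   # strongest frequency in the first frame, in Hz
```

Each row of `spec` is one moment in time and each column is one frequency, which is why a spectrogram can be treated like an image.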

8. Mel Spectrograms

A mel spectrogram is a special type of spectrogram designed to better match human hearing.

Humans do not hear all frequency differences equally. We are more sensitive to some frequency ranges than others.

A mel spectrogram changes the frequency scale so it is closer to how humans hear sound. This often helps neural networks learn speech patterns more effectively.
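
The change of scale can be shown with a short sketch. The formula below is one commonly used way to convert hertz to mels. These notes do not say which formula a particular system uses, so treat it as an assumption.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in hertz to mels (one commonly used formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

# The same 100 Hz step, once at low frequencies and once at high frequencies.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(5100) - hz_to_mel(5000)

print(round(low_gap, 1), round(high_gap, 1))
```

The low-frequency step is much larger on the mel scale than the high-frequency step, which matches the idea that human hearing notices small changes in low sounds more easily than in high sounds.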

9. Phonemes

A phoneme is the smallest sound unit that can change the meaning of a word.

Examples:

Speech recognition systems may first try to recognize phonemes before building words.

Example:
Spoken word: “cat”
Possible phonemes: /k/ /æ/ /t/

10. Tokens

Modern speech recognition systems often use tokens.

A token can be a single letter, a whole word, or a part of a word.

Example:
The word playing might be split into tokens like:

Tokens help systems handle many words, including words they may not have seen often.
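
A toy sketch of splitting words into tokens is shown below. The vocabulary is made up for this example, and real systems learn their token vocabulary from data. The sketch always takes the longest piece it knows and falls back to single letters for anything unfamiliar.

```python
def split_into_tokens(word, vocabulary):
    """Greedily split a word into the longest known pieces."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shorter ones.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # A single letter is always accepted as a last resort.
            if piece in vocabulary or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens

# Hypothetical vocabulary of word parts.
vocabulary = {"play", "walk", "ing", "ed"}

print(split_into_tokens("playing", vocabulary))
print(split_into_tokens("walked", vocabulary))
print(split_into_tokens("zip", vocabulary))
```

Because unknown words can still be broken into smaller pieces, the system never has to give up on a word completely.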

11. Basic Neural Network Structure

A simple neural network has layers.

Layer           Job
Input layer     Receives data, such as audio features.
Hidden layers   Find patterns in the data.
Output layer    Makes a prediction.

In speech recognition:

12. Neurons, Weights, and Training

A neural network contains artificial neurons. Each neuron receives numbers, performs a calculation, and passes a result to the next layer.

Connections between neurons have weights. Weights control how strongly one neuron affects another.

During training, the network adjusts its weights to improve its predictions.

Example:
At first, the model may hear “cat” and guess “cap.” After many examples and corrections, it learns to better distinguish the /t/ sound from the /p/ sound.
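
The calculation inside one artificial neuron can be sketched as follows. The input values, weights, and the choice of the sigmoid function are assumptions for this example. Real networks contain many neurons and often use other functions.

```python
import math

def neuron(inputs, weights, bias):
    """Multiply each input by its weight, add them up, and squash the result."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid: result is between 0 and 1

# Hypothetical audio features and two different sets of weights.
features = [0.8, 0.3]

weak_connection = neuron(features, weights=[0.1, 0.1], bias=0.0)
strong_connection = neuron(features, weights=[2.0, 2.0], bias=0.0)

print(weak_connection, strong_connection)
```

The inputs are the same in both cases. Only the weights changed, and a larger weight lets the inputs push the output further, which is what training adjusts.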

13. Training a Neural Network

Training means teaching a neural network using examples.

For speech recognition, training data usually includes:

Audio                                            Correct Transcript
Recording of a person saying “open the door”     open the door

The model makes a prediction, compares it to the correct answer, and updates its weights. This happens many times with thousands or millions of examples.

14. Loss Function

A loss function measures how wrong the neural network’s prediction is.

Training tries to reduce the loss over time.

Example:
Correct text: turn on the light
Model prediction: turn on the right

The model made a mistake, so the loss function gives feedback that helps the model improve.
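
One common loss function is called cross-entropy. The sketch below assumes the model gives a probability to each possible last word. The probabilities are invented for this example.

```python
import math

def cross_entropy(probabilities, correct_word):
    """Loss is small when the model gives the correct word a high probability."""
    return -math.log(probabilities[correct_word])

# Hypothetical model outputs for the last word of "turn on the ...".
early_guess = {"light": 0.3, "right": 0.7}   # the model prefers the wrong word
later_guess = {"light": 0.9, "right": 0.1}   # after more training

early_loss = cross_entropy(early_guess, "light")
later_loss = cross_entropy(later_guess, "light")

print(early_loss, later_loss)
```

The loss is higher when the model prefers “right” and lower once it prefers “light”, so reducing the loss pushes the model toward the correct word.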

15. Backpropagation

Backpropagation is the process neural networks use to learn from mistakes.

The basic idea:

  1. The model makes a prediction.
  2. The prediction is compared to the correct answer.
  3. The loss is calculated.
  4. The error is sent backward through the network.
  5. The weights are adjusted.
  6. The model tries again.

Backpropagation is one of the key reasons neural networks can learn complex patterns.
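
The six steps can be sketched with the smallest possible model: one weight. Everything here (the input, the target, the learning rate, and the squared-error loss) is an assumption chosen to keep the example short. With only one weight, sending the error backward is a single calculation. In a real network the same idea is repeated layer by layer.

```python
# Toy model: prediction = weight * x. We want the prediction to equal the target.
x = 2.0
target = 6.0          # so the ideal weight is 3.0
weight = 0.0          # starting guess
learning_rate = 0.1   # how big each adjustment is

for step in range(20):
    prediction = weight * x                # 1. make a prediction
    error = prediction - target            # 2. compare to the correct answer
    loss = error ** 2                      # 3. calculate the loss
    gradient = 2 * error * x               # 4. send the error backward
    weight -= learning_rate * gradient     # 5. adjust the weight
    # 6. the loop repeats, so the model tries again

print(weight)
```

After enough repetitions the weight settles very close to 3.0, the value that makes the prediction match the target.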

16. Types of Neural Networks Used in Speech Recognition

Different neural network designs can be used for speech recognition.

A. Feedforward Neural Networks

A feedforward neural network passes data in one direction from input to output. It is simple, but it has no built-in memory of earlier sounds, so it may not be ideal for long speech, where time order matters.

B. Convolutional Neural Networks

A convolutional neural network, or CNN, is good at finding patterns in grid-like data.

CNNs can analyze spectrograms because spectrograms are similar to images.

CNNs can detect patterns such as:

C. Recurrent Neural Networks

A recurrent neural network, or RNN, is designed for sequences.

Speech is a sequence because sounds happen over time. RNNs can use earlier sounds to help understand later sounds.

Example:
The sound at the end of a word may depend on the sounds before it.
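
A toy sketch of this idea is shown below. The network keeps one number as its memory (often called the hidden state) and updates it at every time step. The two weights are invented for this example and are not trained.

```python
import math

def run_rnn(sequence, input_weight=1.0, memory_weight=0.5):
    """Process a sequence one step at a time, carrying a memory forward."""
    memory = 0.0
    states = []
    for value in sequence:
        # The new memory depends on the current input AND the old memory.
        memory = math.tanh(input_weight * value + memory_weight * memory)
        states.append(memory)
    return states

# Both sequences end with the same sound (1.0) but start differently.
after_silence = run_rnn([0.0, 1.0])
after_sound = run_rnn([1.0, 1.0])

print(after_silence[-1], after_sound[-1])
```

The final input is identical in both sequences, yet the final result differs because the network remembers what came before.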

D. LSTM and GRU Networks

LSTM and GRU networks are improved types of RNNs. They are designed to remember important information for longer periods.

This is useful in speech because words and sentence meaning depend on context.

E. Transformer Networks

Transformers are modern neural networks that use attention.

Attention helps the model focus on important parts of the audio or text. Transformers are widely used in modern speech recognition systems because attention lets them connect parts of a sequence that are far apart, and they can be trained efficiently on large amounts of data.

17. Attention

Attention is a technique that helps a neural network decide which parts of the input are most important.

Example:
For the spoken sentence: “Please turn off the living room lights.”

The model may focus strongly on:

Attention helps the system connect audio parts to the correct words.
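
Attention weights can be sketched with a function called softmax, which turns a list of scores into weights that add up to 1. The scores below are invented for this example, and each word stands in for the chunk of audio where it was spoken. In a real model the scores are calculated by the network.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that add up to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["please", "turn", "off", "the", "living", "room", "lights"]
scores = [0.1, 2.0, 2.2, 0.0, 1.5, 1.5, 2.5]   # hypothetical importance scores

weights = softmax(scores)

for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.2f}")
```

Parts of the input with higher scores receive larger weights, so they have more influence on the output.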

18. Acoustic Model and Language Model

Speech recognition often uses two important ideas.

Acoustic Model

The acoustic model connects audio patterns to speech sounds or tokens.

It answers: “What sounds are likely in this audio?”

Language Model

The language model predicts likely word sequences.

It answers: “What words make sense together?”

Example:
Audio may sound like:

A language model helps choose the phrase that makes more sense in context.

19. Decoding

Decoding is the process of choosing the final text output from possible predictions.

The model may consider many possible word sequences and choose the most likely one.

Example:
Possible outputs:

The system uses audio clues and language patterns to choose the best answer.
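
A minimal sketch of decoding is shown below. It assumes the system has two candidate sentences, each with a score from the acoustic model and a score from the language model. All the scores are invented for this example, and higher (closer to zero) means more likely.

```python
# Hypothetical candidates with made-up scores.
candidates = {
    "turn on the light": {"acoustic": -1.2, "language": -2.0},
    "turn on the right": {"acoustic": -1.0, "language": -6.0},
}

def total_score(scores, language_weight=1.0):
    """Combine the audio clue and the language clue into one score."""
    return scores["acoustic"] + language_weight * scores["language"]

best = max(candidates, key=lambda text: total_score(candidates[text]))

# For comparison: ignore the language model completely.
audio_only = max(
    candidates,
    key=lambda text: total_score(candidates[text], language_weight=0.0),
)

print(best)
print(audio_only)
```

In this made-up case the audio alone slightly prefers “right”, but adding the language score changes the final choice to “light”.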

20. Word Error Rate

Word error rate, or WER, is a common way to measure speech recognition accuracy.

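WER counts three kinds of word mistakes: substitutions (a wrong word replaces the correct one), deletions (a word is left out), and insertions (an extra word is added).

WER = (S + D + I) ÷ N

Where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the correct transcript.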

Example:
Correct: turn on the light
Prediction: turn on the right

There is 1 substitution: right instead of light.
There are 4 words in the correct transcript.

WER = (1 + 0 + 0) ÷ 4 = 0.25 = 25%

Lower WER means better performance.
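
The calculation can be sketched in Python. The function below finds the smallest number of word substitutions, deletions, and insertions needed to turn the correct transcript into the prediction, then divides by the number of words in the correct transcript. The function name is an assumption for this example, and it expects a correct transcript with at least one word.

```python
def word_error_rate(correct, prediction):
    """WER = fewest word edits / number of words in the correct transcript."""
    ref = correct.split()
    hyp = prediction.split()

    # dist[i][j] = fewest edits to turn the first i correct words
    # into the first j predicted words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn on the light", "turn on the right"))  # 0.25
```

A prediction that matches the correct transcript exactly gives a WER of 0.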

21. Common Challenges for Neural Speech Recognition

Neural networks can be powerful, but speech recognition is still difficult.

Challenges include:

22. Bias and Fairness

Speech recognition systems may work better for some groups of people than others.

Possible causes:

A fair system should be tested with many types of voices and speaking styles.

23. Privacy and Safety

Speech data can contain personal information.

A voice recording may reveal:

Important safety practices:

24. Real-World Applications

Neural networks for speech recognition are used in many technologies.

Application        Example
Voice assistants   Siri, Alexa, Google Assistant
Dictation          Voice typing documents or messages
Captioning         Automatic captions for videos
Accessibility      Helping people who cannot type easily
Customer service   Automated phone systems
Translation        Spoken language translation
Cars               Voice-controlled navigation
Education          Reading and pronunciation tools
Games              Voice commands and interactive characters

25. Student Example Project

Project Title

Voice Command Classifier

Goal

Build a small system that recognizes a few spoken commands.

Possible Commands

Input

Short audio clips recorded from a microphone.

Output

One command label, such as:

Data Needed

Recordings of different people saying each command.

Features

Possible features:

Success Measure

The system correctly identifies the command at least 90% of the time on test examples.
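
Checking the success measure can be sketched as follows. The command labels and the predictions are invented for this example. In a real project they would come from test recordings that the system did not see during training.

```python
def accuracy(predicted_labels, correct_labels):
    """Fraction of test examples where the predicted command is correct."""
    matches = sum(
        1 for p, c in zip(predicted_labels, correct_labels) if p == c
    )
    return matches / len(correct_labels)

# Hypothetical test results for ten recordings.
correct_labels   = ["on", "off", "on", "off", "on", "off", "on", "off", "on", "off"]
predicted_labels = ["on", "off", "on", "off", "on", "off", "on", "off", "on", "on"]

score = accuracy(predicted_labels, correct_labels)
meets_goal = score >= 0.90

print(score, meets_goal)
```

A test set this small is only for illustration. More test recordings from more speakers give a more trustworthy result.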

Challenges

26. Key Vocabulary

Term                 Meaning
Speech recognition   Turning spoken words into text or commands.
ASR                  Automatic speech recognition.
Neural network       Computer model that learns patterns from data.
Training             Teaching a model using examples.
Feature              Useful measurement from data.
Spectrogram          Visual display of frequencies over time.
Mel spectrogram      Spectrogram adjusted to match human hearing.
MFCC                 Compact speech feature used in audio analysis.
Phoneme              Smallest sound unit that can change word meaning.
Token                Unit used by a model, such as a letter, word, or word part.
Neuron               Small calculation unit in a neural network.
Weight               Value that controls connection strength between neurons.
Loss function        Measures how wrong a model is.
Backpropagation      Method for adjusting weights based on error.
CNN                  Neural network good at finding patterns in image-like data.
RNN                  Neural network designed for sequence data.
LSTM                 Type of RNN that remembers longer patterns.
GRU                  Type of RNN, simpler than an LSTM, that also remembers longer patterns.
Transformer          Modern neural network using attention.
Attention            Method that helps a model focus on important information.
Acoustic model       Model that connects audio patterns to speech sounds.
Language model       Model that predicts likely word sequences.
Decoding             Choosing the final text from possible predictions.
Word error rate      Measure of speech recognition mistakes.

27. Main Ideas to Remember