Unit Title: Understanding Multimodal AI: Seeing, Hearing, and Creating
Level: Introductory–Intermediate
Duration: ~115–130 minutes (self-paced or split into 2 sessions)
🎯 Learning Objectives
By the end of this week, you should be able to:
- Explain the basic principles of computer vision, speech recognition, and AI-generated media.
- Describe how AI tools “see” images, “hear” audio, and generate media outputs.
- Compare use cases and ethical questions across visual and auditory AI.
- Distinguish between recognition-based AI and generative AI in these fields.
🧭 Lesson Flow
Segment | Duration | Format |
---|---|---|
1. What Is Multimodal AI? | 15 min | Overview + Visual Diagram |
2. Image Recognition and Generation | 25 min | Concept + Examples |
3. Voice AI and Audio Processing | 25 min | Concept + Examples |
4. Video Synthesis and Deepfakes | 20 min | Explanation + Ethics |
5. Exercises and Knowledge Check | 30–45 min | Interactive + Output-based |
🧑‍🏫 1. What Is Multimodal AI?
📖 Teaching Script:
Most of what you’ve learned so far relates to text, but AI today increasingly processes images, audio, and even video. This is called multimodal AI: systems that understand or generate more than one kind of data.
This week, you’ll explore how AI “sees,” “hears,” and “creates” beyond just words.
🖼️ Simple Diagram:
[Input]
Text  → Large Language Models (ChatGPT)
Image → Computer Vision (Google Lens) / Image Generation (DALL·E, Midjourney)
Audio → Speech Recognition (Whisper, Siri) / Speech Synthesis (ElevenLabs)
Video → Video Synthesis (Sora) / Deepfake tools
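💻 Optional Code Peek:
Multimodal models are usually reached through an API that accepts several content types in a single request. Below is a minimal sketch using OpenAI’s Python SDK; the model name ("gpt-4o") and the image URL are placeholders, and API details change over time, so treat this as an illustration rather than a recipe.

```python
# A minimal sketch: sending text plus an image to a multimodal model
# via OpenAI's Python SDK (pip install openai).
# Assumptions: OPENAI_API_KEY is set, "gpt-4o" is still an available
# multimodal model name, and the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this photo."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```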
🧩 2. Image Recognition and Generation
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Image Classification | AI identifies the content of a picture | Google Photos sorting “dogs” |
Object Detection | AI finds and labels parts of an image | Self-driving car recognising pedestrians |
Image Generation | AI creates new pictures based on text prompts | DALL·E producing a surrealist painting |
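💻 Optional Code Peek:
Recognition models return a ranked list of labels with confidence scores rather than a single answer. A minimal sketch, assuming the Hugging Face transformers library (plus torch and pillow) is installed and "dog.jpg" is any photo on disk:

```python
# Image classification with a pretrained model via the Hugging Face
# `transformers` pipeline (pip install transformers torch pillow).
# "dog.jpg" is a placeholder for any local image file.
from transformers import pipeline

classifier = pipeline("image-classification")  # downloads a default pretrained model
for prediction in classifier("dog.jpg"):       # top labels with confidence scores
    print(f'{prediction["label"]}: {prediction["score"]:.2f}')
```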
🧪 Three Real-World Examples:
- Google Lens – Identifies landmarks, plants, and objects from photos.
- Face ID on iPhones – Uses neural networks to recognise your face.
- DALL·E / Midjourney – Text-to-image AI that generates original art.
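💻 Optional Code Peek:
Text-to-image tools can also be driven from code. A hedged sketch using OpenAI’s images API; "dall-e-3" is the model name at the time of writing and may change:

```python
# Text-to-image generation through OpenAI's images API (pip install openai).
# Assumption: "dall-e-3" is still an available model name.
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt="A surrealist painting of a lighthouse inside a teacup",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # a temporary URL to the generated image
```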
✏️ Quick Activity:
Choose one example above. Write:
- What the AI does
- How it helps users
- One risk or limitation
🎤 3. Voice AI and Audio Processing
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Speech-to-Text | AI converts spoken words to text | Otter.ai transcribing meetings |
Voice Recognition | AI identifies who is speaking | Smart home devices recognising different users |
Text-to-Speech (TTS) | AI converts text into realistic voice | Audiobooks generated with synthetic voices |
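💻 Optional Code Peek:
Speech-to-text is a one-call operation with the open-source openai-whisper package. A minimal sketch, assuming ffmpeg is installed and "meeting.mp3" is an audio file on disk:

```python
# Speech-to-text with the open-source `openai-whisper` package
# (pip install openai-whisper; also requires ffmpeg on your PATH).
# "meeting.mp3" is a placeholder audio file.
import whisper

model = whisper.load_model("base")        # a small multilingual model
result = model.transcribe("meeting.mp3")  # detects the language, then transcribes
print(result["text"])                     # the transcript as plain text
```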
🧪 Three Real-World Examples:
- Whisper by OpenAI – Recognises and transcribes speech in roughly 100 languages.
- ElevenLabs – Produces humanlike voices from text input.
- Amazon Alexa – Recognises voice commands and responds.
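💻 Optional Code Peek:
You can try text-to-speech offline with the pyttsx3 library. It uses the voices bundled with your operating system, so the output is far simpler than commercial tools like ElevenLabs, but it shows the TTS idea in three lines:

```python
# Offline text-to-speech with `pyttsx3` (pip install pyttsx3).
# Voice quality depends on the voices installed on your OS.
import pyttsx3

engine = pyttsx3.init()
engine.say("Text-to-speech turns written words into spoken audio.")
engine.runAndWait()  # blocks until the speech has finished playing
```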
🎧 Thought Prompt:
What are some benefits of voice AI in education, accessibility, or productivity?
What are the dangers of synthetic voices in misinformation or impersonation?
🎬 4. Video Synthesis and Deepfakes
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Video Synthesis | AI generates moving images from text or frames | Sora creating video from a text prompt |
Deepfakes | AI manipulates faces/voices to create realistic but false videos | A fake video of a celebrity speaking |
Motion Capture AI | Tracks human movement to animate 3D models | Used in games and film animation |
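💻 Optional Code Peek:
A video is just a sequence of frames played quickly, and that is exactly the format generative video models output. The sketch below is not an AI model; it builds a two-second clip frame by frame to make the idea concrete (assumes numpy, imageio, and imageio-ffmpeg are installed):

```python
# Illustrates the core idea behind video synthesis: a clip is a sequence
# of frames produced one at a time (pip install numpy imageio imageio-ffmpeg).
import numpy as np
import imageio.v2 as imageio

frames = []
for t in range(48):                                # 2 seconds at 24 fps
    frame = np.zeros((64, 64, 3), dtype=np.uint8)  # a black 64x64 RGB frame
    frame[:, : (t * 64) // 48] = (255, 64, 64)     # a red bar sweeping left to right
    frames.append(frame)

imageio.mimsave("sweep.mp4", frames, fps=24)       # stitch the frames into a clip
```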
🧪 Three Real-World Examples:
- Sora by OpenAI – Converts text prompts into short video clips (up to about a minute in OpenAI’s demos).
- Deep Nostalgia by MyHeritage – Animates still photos (e.g., making faces in old photos smile).
- Synthesia – Creates avatars that speak your text with lip-sync.
⚖️ Ethics Reflection:
Write a short answer to each:
- When is video AI helpful? (education, film, gaming)
- When is it dangerous? (election interference, fake news)
- Should synthetic video content be labelled? Why or why not?
🧪 5. Exercises & Knowledge Check
✅ Exercise 1: Modality Matching
Match each tool to the type of data it handles:
Tool | Data Type |
---|---|
DALL·E | ? |
Otter.ai | ? |
Sora | ? |
Midjourney | ? |
Whisper | ? |
✅ Exercise 2: Compare & Contrast
Fill in the table:
Task | Text AI | Image AI | Audio AI |
---|---|---|---|
Summarising content | ? | ? | ? |
Responding to prompts | ? | ? | ? |
Creating something new | ? | ? | ? |
(Write brief examples for each.)
✅ Exercise 3: “What If” Scenario
Imagine you’re building an AI that teaches children geography.
- What modalities would you use?
- What risks would you need to manage?
- How could voice and image generation help improve learning?
🧠 Knowledge Check (10 Questions)
1. What is multimodal AI?
2. Name two examples of image recognition tools.
3. What does speech-to-text mean?
4. What is the difference between face recognition and voice recognition?
5. What is video synthesis?
6. What is a deepfake?
7. How can image AI help accessibility?
8. What risks does synthetic voice present?
9. Why is transparency important in generated media?
10. Give one real-world use case for each of: image AI, voice AI, and video AI.
📝 Wrap-Up Assignment (Optional)
Title: “AI That Sees and Speaks”
Write 300–400 words reflecting on:
- The most surprising thing you learned
- An example of visual or voice AI in your life
- How you might use this type of AI ethically in your work or field
📦 End-of-Week Deliverables
- ✅ Completed modality comparison table
- ✅ Tool matching and use case reflections
- ✅ Journal or blog-style entry on visual/audio AI
- ✅ Knowledge check complete