Module 1 – Week 3: Image, Voice, and Video AI


Unit Title: Understanding Multimodal AI: Seeing, Hearing, and Creating
Level: Introductory–Intermediate
Duration: ~90–120 minutes (self-paced or split into 2 sessions)


🎯 Learning Objectives

By the end of this week, you should be able to:

  • Understand the basic principles of computer vision, speech recognition, and AI-generated media.
  • Describe how AI tools “see” images, “hear” audio, and generate media outputs.
  • Compare use cases and ethical questions across visual and auditory AI.
  • Distinguish between recognition-based AI and generative AI in these fields.

🧭 Lesson Flow

| Segment | Duration | Format |
| --- | --- | --- |
| 1. What Is Multimodal AI? | 15 min | Overview + Visual Diagram |
| 2. Image Recognition and Generation | 25 min | Concept + Examples |
| 3. Voice AI and Audio Processing | 25 min | Concept + Examples |
| 4. Video Synthesis and Deepfakes | 20 min | Explanation + Ethics |
| 5. Exercises and Knowledge Check | 30–45 min | Interactive + Output-based |

🧑‍🏫 1. What Is Multimodal AI?

📖 Teaching Script:

Most of what you’ve learned so far relates to text, but AI today increasingly processes images, audio, and even video. This is called multimodal AI: systems that can understand or generate more than one kind of data.

This week, you’ll explore how AI “sees,” “hears,” and “creates” beyond just words.


🖼️ Simple Diagram:

[Input]
 Text → Large Language Models (ChatGPT)
 Image → Computer Vision / Image Generation (Google Lens, DALL·E, Midjourney)
 Audio → Speech Recognition / Synthesis (Whisper, Siri)
 Video → Video Generation (Sora, deepfakes)

🧩 2. Image Recognition and Generation

🧠 Key Concepts:

| Concept | Description | Example |
| --- | --- | --- |
| Image Classification | AI identifies the content of a picture | Google Photos sorting “dogs” |
| Object Detection | AI finds and labels parts of an image | Self-driving car recognising pedestrians |
| Image Generation | AI creates new pictures from text prompts | DALL·E producing a surrealist painting |
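The classification idea above can be sketched with a deliberately simplified nearest-centroid classifier. Real tools such as Google Photos use deep neural networks trained on millions of images; the tiny “images” and labels below are invented purely to illustrate the principle of comparing a new image to learned patterns:

```python
# Toy image classification: summarise each class by the average of its
# training images, then label a new image by the closest average.
# Real systems use deep neural networks, not raw pixel averaging.

def centroid(images):
    """Average the pixel values of a list of equally sized images."""
    n = len(images)
    return [sum(img[i] for img in images) / n for i in range(len(images[0]))]

def classify(image, centroids):
    """Return the label whose centroid is closest (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(image, centroids[label]))

# Tiny 2x2 "images" flattened to 4 brightness values (0-255).
training = {
    "dark":   [[10, 20, 15, 5], [30, 25, 10, 20]],
    "bright": [[240, 250, 230, 245], [220, 235, 250, 240]],
}
centroids = {label: centroid(imgs) for label, imgs in training.items()}

print(classify([15, 10, 25, 20], centroids))  # prints "dark"
```

The same compare-to-learned-patterns idea scales up: modern classifiers learn far richer features than average brightness, but the final step is still choosing the best-matching label.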

🧪 Three Real-World Examples:

  1. Google Lens – Identifies landmarks, plants, and objects from photos.
  2. Face ID on iPhones – Uses neural networks to recognise your face.
  3. DALL·E / Midjourney – Text-to-image AI that generates original art.

✏️ Quick Activity:

Choose one example above. Write:

  • What the AI does
  • How it helps users
  • One risk or limitation

🎤 3. Voice AI and Audio Processing

🧠 Key Concepts:

| Concept | Description | Example |
| --- | --- | --- |
| Speech-to-Text | AI converts spoken words to text | Otter.ai transcribing meetings |
| Voice Recognition | AI identifies who is speaking | Smart home devices recognising different users |
| Text-to-Speech (TTS) | AI converts text into a realistic voice | Audiobooks generated with synthetic voices |

🧪 Three Real-World Examples:

  1. Whisper by OpenAI – Transcribes speech in dozens of languages.
  2. ElevenLabs – Produces humanlike voices from text input.
  3. Amazon Alexa – Recognises voice commands and responds.
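Once a tool like Alexa has converted speech to text, it still has to map the words to an action. Here is a toy sketch of that matching step using simple keyword overlap; real assistants use trained language models, and the intents and keywords below are invented for illustration:

```python
# Toy intent matching: after speech-to-text, map the transcript to a
# command by counting keyword overlap. Invented intents for illustration.

INTENTS = {
    "play_music": {"play", "music", "song"},
    "set_timer":  {"set", "timer", "minutes"},
    "weather":    {"weather", "rain", "forecast"},
}

def match_intent(transcript):
    """Pick the intent whose keywords overlap the transcript most, or None."""
    words = set(transcript.lower().split())
    scores = {name: len(words & kws) for name, kws in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(match_intent("Alexa, play my favourite song"))  # prints "play_music"
```

This separation of stages (recognise the audio, then interpret the text) is a common design in voice assistants, which is why transcription errors in the first stage can derail the second.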

🎧 Thought Prompt:

What are some benefits of voice AI in education, accessibility, or productivity?
What are the dangers of synthetic voices in misinformation or impersonation?


🎬 4. Video Synthesis and Deepfakes

🧠 Key Concepts:

| Concept | Description | Example |
| --- | --- | --- |
| Video Synthesis | AI generates moving images from text or frames | Sora creating video from a text prompt |
| Deepfakes | AI manipulates faces/voices to create realistic but false videos | A fake video of a celebrity speaking |
| Motion Capture AI | Tracks human movement to animate 3D models | Used in games and film animation |
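One simple form of video synthesis is frame interpolation: creating an in-between frame that was never captured. The pixel-blending sketch below only illustrates the idea; production models such as Sora are generative neural networks, not averaging, and the 2×2 greyscale frames here are invented:

```python
# Toy frame interpolation: synthesise an in-between video frame by
# blending two real frames pixel by pixel.

def interpolate(frame_a, frame_b, t=0.5):
    """Blend two equally sized greyscale frames; t=0 gives frame_a, t=1 gives frame_b."""
    return [
        [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]

frame1 = [[0, 0], [100, 100]]   # two tiny 2x2 greyscale frames
frame2 = [[100, 100], [0, 0]]
print(interpolate(frame1, frame2))  # prints [[50.0, 50.0], [50.0, 50.0]]
```

The key takeaway for this unit is that every frame in a synthesised video is generated rather than recorded, which is exactly what makes deepfakes possible and makes labelling important.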

🧪 Three Real-World Examples:

  1. Sora by OpenAI – Converts text prompts into short, realistic video clips.
  2. Deep Nostalgia (MyHeritage) – Animates still images (e.g., making old photos smile).
  3. Synthesia – Creates avatars that speak your text with lip-sync.

⚖️ Ethics Reflection:

Write a short answer to each:

  • When is video AI helpful? (education, film, gaming)
  • When is it dangerous? (election interference, fake news)
  • Should synthetic video content be labelled? Why?

🧪 5. Exercises & Knowledge Check

✅ Exercise 1: Modality Matching

Match each tool to the type of data it handles:

| Tool | Data Type |
| --- | --- |
| DALL·E | ? |
| Otter.ai | ? |
| Sora | ? |
| Midjourney | ? |
| Whisper | ? |

✅ Exercise 2: Compare & Contrast

Fill in the table:

| Task | Text AI | Image AI | Audio AI |
| --- | --- | --- | --- |
| Summarising content | ? | ? | ? |
| Responding to prompts | ? | ? | ? |
| Creating something new | ? | ? | ? |

(Write brief examples for each.)


✅ Exercise 3: “What If” Scenario

Imagine you’re building an AI that teaches children geography.

  • What modalities would you use?
  • What risks would you need to manage?
  • How could voice and image generation help improve learning?

🧠 Knowledge Check (10 Questions)

  1. What is multimodal AI?
  2. Name 2 examples of image recognition tools.
  3. What does speech-to-text mean?
  4. What is the difference between face recognition and voice recognition?
  5. What is video synthesis?
  6. What is a deepfake?
  7. How can image AI help accessibility?
  8. What risks does synthetic voice present?
  9. Why is transparency important in generated media?
  10. Give one real-world use case for each of: image AI, voice AI, and video AI.

📝 Wrap-Up Assignment (Optional)

Title: “AI That Sees and Speaks”

Write 300–400 words reflecting on:

  • The most surprising thing you learned
  • An example of visual or voice AI in your life
  • How you might use this type of AI ethically in your work or field

📦 End-of-Week Deliverables

  • ✅ Completed modality comparison table
  • ✅ Tool matching and use case reflections
  • ✅ Journal or blog-style entry on visual/audio AI
  • ✅ Knowledge check complete