Unit Title: Understanding Multimodal AI: Seeing, Hearing, and Creating
Level: Introductory–Intermediate
Duration: ~115–130 minutes (self-paced or split into 2 sessions)
🎯 Learning Objectives
By the end of this week, you should be able to:
- Explain the basic principles of computer vision, speech recognition, and AI-generated media.
- Describe how AI tools “see” images, “hear” audio, and generate media outputs.
- Compare use cases and ethical questions across visual and auditory AI.
- Distinguish between recognition-based AI and generative AI in these fields.
🧭 Lesson Flow
Segment | Duration | Format |
---|---|---|
1. What Is Multimodal AI? | 15 min | Overview + Visual Diagram |
2. Image Recognition and Generation | 25 min | Concept + Examples |
3. Voice AI and Audio Processing | 25 min | Concept + Examples |
4. Video Synthesis and Deepfakes | 20 min | Explanation + Ethics |
5. Exercises and Knowledge Check | 30–45 min | Interactive + Output-based |
🧑‍🏫 1. What Is Multimodal AI?
📖 Teaching Script:
Most of what you’ve learned so far relates to text, but AI today increasingly processes images, audio, and even video. This is called multimodal AI: systems that understand or generate more than one kind of data.
This week, you’ll explore how AI “sees,” “hears,” and “creates” beyond just words.
🖼️ Simple Diagram:
[Input]
Text  → Large Language Models (ChatGPT)
Image → Computer Vision (Google Lens) / Image Generation (DALL·E, Midjourney)
Audio → Speech Recognition (Whisper, Siri) / Speech Synthesis (ElevenLabs)
Video → Video Synthesis (Sora) / Deepfake tools
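💻 Optional Code Peek:
Multimodal models are usually reached through an API that accepts several content types in a single request. Below is a minimal sketch using OpenAI’s Python SDK; the model name ("gpt-4o") and the image URL are placeholders, and API details change over time, so treat this as an illustration rather than a recipe.

```python
# A minimal sketch: sending text plus an image to a multimodal model
# via OpenAI's Python SDK (pip install openai).
# Assumptions: OPENAI_API_KEY is set, "gpt-4o" is still an available
# multimodal model name, and the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this photo."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```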
🧩 2. Image Recognition and Generation
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Image Classification | AI identifies the content of a picture | Google Photos sorting “dogs” |
Object Detection | AI finds and labels parts of an image | Self-driving car recognising pedestrians |
Image Generation | AI creates new pictures based on text prompts | DALL·E producing a surrealist painting |
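💻 Optional Code Peek:
Recognition models return a ranked list of labels with confidence scores rather than a single answer. A minimal sketch, assuming the Hugging Face transformers library (plus torch and pillow) is installed and "dog.jpg" is any photo on disk:

```python
# Image classification with a pretrained model via the Hugging Face
# `transformers` pipeline (pip install transformers torch pillow).
# "dog.jpg" is a placeholder for any local image file.
from transformers import pipeline

classifier = pipeline("image-classification")  # downloads a default pretrained model
for prediction in classifier("dog.jpg"):       # top labels with confidence scores
    print(f'{prediction["label"]}: {prediction["score"]:.2f}')
```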
🧪 Three Real-World Examples:
- Google Lens – Identifies landmarks, plants, and objects from photos.
- Face ID on iPhones – Uses neural networks to recognise your face.
- DALL·E / Midjourney – Text-to-image AI that generates original art.
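💻 Optional Code Peek:
Text-to-image tools can also be driven from code. A hedged sketch using OpenAI’s images API; "dall-e-3" is the model name at the time of writing and may change:

```python
# Text-to-image generation through OpenAI's images API (pip install openai).
# Assumption: "dall-e-3" is still an available model name.
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt="A surrealist painting of a lighthouse inside a teacup",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # a temporary URL to the generated image
```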
✏️ Quick Activity:
Choose one example above. Write:
- What the AI does
- How it helps users
- One risk or limitation
🎤 3. Voice AI and Audio Processing
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Speech-to-Text | AI converts spoken words to text | Otter.ai transcribing meetings |
Voice Recognition | AI identifies who is speaking | Smart home devices recognising different users |
Text-to-Speech (TTS) | AI converts text into realistic voice | Audiobooks generated with synthetic voices |
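💻 Optional Code Peek:
Speech-to-text is a one-call operation with the open-source openai-whisper package. A minimal sketch, assuming ffmpeg is installed and "meeting.mp3" is an audio file on disk:

```python
# Speech-to-text with the open-source `openai-whisper` package
# (pip install openai-whisper; also requires ffmpeg on your PATH).
# "meeting.mp3" is a placeholder audio file.
import whisper

model = whisper.load_model("base")        # a small multilingual model
result = model.transcribe("meeting.mp3")  # detects the language, then transcribes
print(result["text"])                     # the transcript as plain text
```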
🧪 Three Real-World Examples:
- Whisper by OpenAI – Recognises and transcribes speech in roughly 100 languages.
- ElevenLabs – Produces humanlike voices from text input.
- Amazon Alexa – Recognises voice commands and responds.
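💻 Optional Code Peek:
You can try text-to-speech offline with the pyttsx3 library. It uses the voices bundled with your operating system, so the output is far simpler than commercial tools like ElevenLabs, but it shows the TTS idea in three lines:

```python
# Offline text-to-speech with `pyttsx3` (pip install pyttsx3).
# Voice quality depends on the voices installed on your OS.
import pyttsx3

engine = pyttsx3.init()
engine.say("Text-to-speech turns written words into spoken audio.")
engine.runAndWait()  # blocks until the speech has finished playing
```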
🎧 Thought Prompt:
What are some benefits of voice AI in education, accessibility, or productivity?
What are the dangers of synthetic voices in misinformation or impersonation?
🎬 4. Video Synthesis and Deepfakes
🧠 Key Concepts:
Concept | Description | Example |
---|---|---|
Video Synthesis | AI generates moving images from text or frames | Sora creating video from a text prompt |
Deepfakes | AI manipulates faces/voices to create realistic but false videos | A fake video of a celebrity speaking |
Motion Capture AI | Tracks human movement to animate 3D models | Used in games and film animation |
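💻 Optional Code Peek:
A video is just a sequence of frames played quickly, and that is exactly the format generative video models output. The sketch below is not an AI model; it builds a two-second clip frame by frame to make the idea concrete (assumes numpy, imageio, and imageio-ffmpeg are installed):

```python
# Illustrates the core idea behind video synthesis: a clip is a sequence
# of frames produced one at a time (pip install numpy imageio imageio-ffmpeg).
import numpy as np
import imageio.v2 as imageio

frames = []
for t in range(48):                                # 2 seconds at 24 fps
    frame = np.zeros((64, 64, 3), dtype=np.uint8)  # a black 64x64 RGB frame
    frame[:, : (t * 64) // 48] = (255, 64, 64)     # a red bar sweeping left to right
    frames.append(frame)

imageio.mimsave("sweep.mp4", frames, fps=24)       # stitch the frames into a clip
```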
🧪 Three Real-World Examples:
- Sora by OpenAI – Converts text prompts into short video clips (up to about a minute in OpenAI’s demos).
- Deep Nostalgia by MyHeritage – Animates still photos (e.g., making faces in old photos smile).
- Synthesia – Creates avatars that speak your text with lip-sync.
⚖️ Ethics Reflection:
Write a short answer to each:
- When is video AI helpful? (education, film, gaming)
- When is it dangerous? (election interference, fake news)
- Should synthetic video content be labelled? Why or why not?
🧪 5. Exercises & Knowledge Check
✅ Exercise 1: Modality Matching
Match each tool to the type of data it handles:
Tool | Data Type |
---|---|
DALL·E | ? |
Otter.ai | ? |
Sora | ? |
Midjourney | ? |
Whisper | ? |
✅ Exercise 2: Compare & Contrast
Fill in the table:
Task | Text AI | Image AI | Audio AI |
---|---|---|---|
Summarising content | ? | ? | ? |
Responding to prompts | ? | ? | ? |
Creating something new | ? | ? | ? |
(Write brief examples for each.)
✅ Exercise 3: “What If” Scenario
Imagine you’re building an AI that teaches children geography.
- What modalities would you use?
- What risks would you need to manage?
- How could voice and image generation help improve learning?
🧠 Knowledge Check (10 Questions)
1. What is multimodal AI?
2. Name two examples of image recognition tools.
3. What does speech-to-text mean?
4. What is the difference between face recognition and voice recognition?
5. What is video synthesis?
6. What is a deepfake?
7. How can image AI help accessibility?
8. What risks does synthetic voice present?
9. Why is transparency important in generated media?
10. Give one real-world use case for each of: image AI, voice AI, and video AI.
📝 Wrap-Up Assignment (Optional)
Title: “AI That Sees and Speaks”
Write 300–400 words reflecting on:
- The most surprising thing you learned
- An example of visual or voice AI in your life
- How you might use this type of AI ethically in your work or field
📦 End-of-Week Deliverables
- ✅ Completed modality comparison table
- ✅ Tool matching and use case reflections
- ✅ Journal or blog-style entry on visual/audio AI
- ✅ Knowledge check complete