MMAI 2026

Modeling Multimodal AI 2026
Homework, final project, and random thoughts

Bio

Hi, I'm Winston Qian. I'm a student taking 6.S985: Modeling Multimodal AI in Spring 2026. This portfolio showcases my homework and final project explorations in vision-language models, multimodal alignment, and brain-to-vision decoding.

Tags: multimodal AI · coursework · experiments · computer vision · LLMs / VLMs

Final Project — Vision vs. Protocol Effects in EEG2Video

Task-Aware Temporal Attention + Temporal EEG–Video Alignment on SEED-DV

We investigate whether current EEG-to-video models capture genuine dynamic visual perception or merely exploit experimental sequence artifacts. By introducing task-aware temporal attention and a chunk-level contrastive alignment objective, we establish an interpretable, artifact-resistant baseline for brain-to-vision decoding.


Midterm Update: A Living Lab Notebook

While building a robust EEG-to-video generative pipeline, we chose to rigorously audit our semantic classification baselines before scaling up. Here is how our methodology has evolved so far:

  • 1. Sequence Shortcut Audit: We hypothesized that multi-clip protocols might let models 'cheat' by decoding subject anticipation over the course of a session. Adopting a strict within-subject evaluation paradigm and stratifying Top-1 decoding accuracy by clip index (1 through 5), we found that performance does not increase monotonically: the profiles are flat for both DE (4.40%, 4.29%, 4.48%, 4.38%, 4.10%) and PSD (3.84%, 4.39%, 4.29%, 4.13%, 4.31%) features. Our baseline is genuinely decoding stimulus perception, not protocol artifacts.
  • 2. Preprocessing Data Leakage: We audited standard pipelines for train/test normalization leakage. Sealing the leak yielded a 13x reduction in cross-fold accuracy variance for PSD features, but actually increased fold-to-fold instability for DE features. Strict leakage-free normalization therefore affects spectral (PSD) and entropy-based (DE) features in very different ways.
  • 3. Architectural Trade-offs: We prototyped a custom O(T^2) temporal attention model on raw 200 Hz signals (T=400), but full temporal self-attention ran out of memory (OOM). This directly motivates our proposed Chunk-Level Temporal EEG–Video Contrastive Alignment: we are pivoting to latency-aware contrastive alignment against frozen VideoMAE features, tracking within-clip temporal dynamics rather than static semantic priors.
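The sequence-shortcut audit in step 1 boils down to a simple check: stratify Top-1 accuracy by clip index and test whether it rises over the session. A minimal sketch (function names are ours; the accuracy values are the DE/PSD profiles reported above):

```python
import numpy as np

def accuracy_by_clip_index(preds, labels, clip_idx, n_clips=5):
    """Top-1 accuracy stratified by each trial's clip position (1..n_clips)."""
    return np.array([
        (preds[clip_idx == k] == labels[clip_idx == k]).mean()
        for k in range(1, n_clips + 1)
    ])

def increases_monotonically(acc):
    """A sequence shortcut would show accuracy rising with clip index."""
    return bool(np.all(np.diff(acc) > 0))

# Per-clip-index Top-1 accuracies reported in the audit above
de_acc  = [0.0440, 0.0429, 0.0448, 0.0438, 0.0410]
psd_acc = [0.0384, 0.0439, 0.0429, 0.0413, 0.0431]
# Neither profile increases monotonically -> no evidence of an
# anticipation shortcut in this baseline.
```

A monotonicity check is a coarse first pass; a rank-correlation test of accuracy against clip index would be a natural follow-up.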
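The leakage fix in step 2 follows the standard pattern: estimate normalization statistics on the training split only, then apply them unchanged to the test split. A minimal sketch (feature shapes are illustrative, not the actual pipeline dimensions):

```python
import numpy as np

def fit_zscore(train_feats):
    """Estimate per-feature normalization statistics on the training split ONLY."""
    mu = train_feats.mean(axis=0)
    sd = train_feats.std(axis=0) + 1e-8   # guard against zero-variance features
    return mu, sd

def apply_zscore(feats, mu, sd):
    """Apply frozen training statistics to any split (train or test)."""
    return (feats - mu) / sd

# Leakage-free pattern: fit on the train fold, apply to both folds.
rng = np.random.default_rng(0)
train = rng.normal(2.0, 3.0, (100, 62))   # e.g. trials x EEG feature channels
test  = rng.normal(2.0, 3.0, (20, 62))
mu, sd = fit_zscore(train)
train_n, test_n = apply_zscore(train, mu, sd), apply_zscore(test, mu, sd)
```

Fitting the statistics on the pooled train+test data is the leak being sealed here: test-set distribution information would otherwise bleed into the normalization.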
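The proposed chunk-level contrastive objective in step 3 can be sketched as a symmetric InfoNCE loss over matched (EEG chunk, video chunk) embedding pairs. This is a loose NumPy sketch under our assumptions (the real pipeline would use learned EEG chunk embeddings and frozen VideoMAE features; `chunk_info_nce` and the temperature value are ours):

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def chunk_info_nce(eeg_z, vid_z, tau=0.07):
    """Symmetric InfoNCE: row i of eeg_z and vid_z is a matched
    (EEG chunk, video chunk) pair; all other rows act as negatives."""
    eeg_z = eeg_z / np.linalg.norm(eeg_z, axis=1, keepdims=True)
    vid_z = vid_z / np.linalg.norm(vid_z, axis=1, keepdims=True)
    logits = eeg_z @ vid_z.T / tau            # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_e2v = -log_softmax(logits, axis=1)[diag, diag].mean()  # EEG -> video
    loss_v2e = -log_softmax(logits, axis=0)[diag, diag].mean()  # video -> EEG
    return 0.5 * (loss_e2v + loss_v2e)
```

Because the pairs are chunk-level rather than clip-level, the objective rewards tracking within-clip temporal dynamics instead of a single static semantic label per clip.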


About This Site

Repository

Source code lives at github.com/winstonqian/mmai.

Built from the Academic Project Page Template and adapted into a course portfolio / homework hub.

License

This site's content is licensed under Creative Commons Attribution-ShareAlike 4.0 International.

Site Manifest

@misc{qian_mmai_2026,
  title        = {MMAI 2026},
  author       = {Winston Qian},
  howpublished = {\url{https://winstonqian.github.io/mmai/}},
  note         = {Modeling Multimodal AI 2026 course site: homework, final project, and notes},
  year         = {2026}
}