MMAI 2026

Modeling Multimodal AI 2026
Homework, final project, and random thoughts

Bio

Hi, I'm Winston Qian. I'm a student taking 6.S985: Modeling Multimodal AI in Spring 2026. This portfolio showcases my homework and final project explorations in vision-language models, multimodal alignment, and brain-to-vision decoding.

Tags: multimodal AI · coursework · experiments · computer vision · LLMs / VLMs

Final Project — Vision vs. Protocol Effects in EEG2Video

Task-Aware Temporal Attention + Temporal EEG–Video Alignment on SEED-DV

We investigate whether current EEG-to-video models capture genuine dynamic visual perception or merely exploit experimental sequence artifacts. By introducing task-aware temporal attention and a chunk-level contrastive alignment objective, we establish an interpretable, artifact-resistant baseline for brain-to-vision decoding.


Midterm Update: A Living Lab Notebook

While building a robust EEG-to-video generative pipeline, we chose to rigorously audit our semantic classification baselines before scaling up. Here is how our methodology has evolved so far:

  • 1. Sequence Shortcut Audit: We hypothesized that multi-clip protocols might let models 'cheat' by decoding subject anticipation over the course of a session. Adopting a strict within-subject evaluation paradigm and stratifying Top-1 decoding accuracy by clip index (1 through 5), we found that performance does not increase monotonically: the profiles are flat for both DE (4.40%, 4.29%, 4.48%, 4.38%, 4.10%) and PSD (3.84%, 4.39%, 4.29%, 4.13%, 4.31%) features. Our baseline is genuinely decoding stimulus perception, not protocol artifacts.
  • 2. Preprocessing Data Leakage: We audited standard pipelines for train/test normalization leakage. Sealing the leak yielded a 13x reduction in cross-fold accuracy variance for PSD features, but actually increased fold-to-fold instability for DE features. Strict leakage-free normalization therefore affects spectral (PSD) and entropy-based (DE) features in very different ways.
  • 3. Architectural Trade-offs: We prototyped a custom O(T^2) temporal attention model on raw 200 Hz signals (T=400), but full temporal self-attention ran out of memory (OOM). This directly motivates our proposed Chunk-Level Temporal EEG–Video Contrastive Alignment: we are pivoting to latency-aware contrastive alignment against frozen VideoMAE features, tracking within-clip temporal dynamics rather than static semantic priors.
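The sequence-shortcut audit in step 1 boils down to a simple check: stratify Top-1 accuracy by clip index and test whether it rises over the session. A minimal sketch (function names are ours; the accuracy values are the DE/PSD profiles reported above):

```python
import numpy as np

def accuracy_by_clip_index(preds, labels, clip_idx, n_clips=5):
    """Top-1 accuracy stratified by each trial's clip position (1..n_clips)."""
    return np.array([
        (preds[clip_idx == k] == labels[clip_idx == k]).mean()
        for k in range(1, n_clips + 1)
    ])

def increases_monotonically(acc):
    """A sequence shortcut would show accuracy rising with clip index."""
    return bool(np.all(np.diff(acc) > 0))

# Per-clip-index Top-1 accuracies reported in the audit above
de_acc  = [0.0440, 0.0429, 0.0448, 0.0438, 0.0410]
psd_acc = [0.0384, 0.0439, 0.0429, 0.0413, 0.0431]
# Neither profile increases monotonically -> no evidence of an
# anticipation shortcut in this baseline.
```

A monotonicity check is a coarse first pass; a rank-correlation test of accuracy against clip index would be a natural follow-up.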
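The leakage fix in step 2 follows the standard pattern: estimate normalization statistics on the training split only, then apply them unchanged to the test split. A minimal sketch (feature shapes are illustrative, not the actual pipeline dimensions):

```python
import numpy as np

def fit_zscore(train_feats):
    """Estimate per-feature normalization statistics on the training split ONLY."""
    mu = train_feats.mean(axis=0)
    sd = train_feats.std(axis=0) + 1e-8   # guard against zero-variance features
    return mu, sd

def apply_zscore(feats, mu, sd):
    """Apply frozen training statistics to any split (train or test)."""
    return (feats - mu) / sd

# Leakage-free pattern: fit on the train fold, apply to both folds.
rng = np.random.default_rng(0)
train = rng.normal(2.0, 3.0, (100, 62))   # e.g. trials x EEG feature channels
test  = rng.normal(2.0, 3.0, (20, 62))
mu, sd = fit_zscore(train)
train_n, test_n = apply_zscore(train, mu, sd), apply_zscore(test, mu, sd)
```

Fitting the statistics on the pooled train+test data is the leak being sealed here: test-set distribution information would otherwise bleed into the normalization.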
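The proposed chunk-level contrastive objective in step 3 can be sketched as a symmetric InfoNCE loss over matched (EEG chunk, video chunk) embedding pairs. This is a loose NumPy sketch under our assumptions (the real pipeline would use learned EEG chunk embeddings and frozen VideoMAE features; `chunk_info_nce` and the temperature value are ours):

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def chunk_info_nce(eeg_z, vid_z, tau=0.07):
    """Symmetric InfoNCE: row i of eeg_z and vid_z is a matched
    (EEG chunk, video chunk) pair; all other rows act as negatives."""
    eeg_z = eeg_z / np.linalg.norm(eeg_z, axis=1, keepdims=True)
    vid_z = vid_z / np.linalg.norm(vid_z, axis=1, keepdims=True)
    logits = eeg_z @ vid_z.T / tau            # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_e2v = -log_softmax(logits, axis=1)[diag, diag].mean()  # EEG -> video
    loss_v2e = -log_softmax(logits, axis=0)[diag, diag].mean()  # video -> EEG
    return 0.5 * (loss_e2v + loss_v2e)
```

Because the pairs are chunk-level rather than clip-level, the objective rewards tracking within-clip temporal dynamics instead of a single static semantic label per clip.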


About This Site

Repository

Source code lives at github.com/winstonqian/mmai.

Built from the Academic Project Page Template and adapted into a course portfolio / homework hub.

License

This site's content is licensed under Creative Commons Attribution-ShareAlike 4.0 International.

Site Manifest

@misc{qian_mmai_2026,
  title        = {MMAI 2026},
  author       = {Winston Qian},
  howpublished = {\url{https://winstonqian.github.io/mmai/}},
  note         = {Modeling Multimodal AI 2026 course site: homework, final project, and notes},
  year         = {2026}
}