[x] Masked Autoencoders As Spatiotemporal Learners
[ ] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
[ ] Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
[ ] Flamingo: a Visual Language Model for Few-Shot Learning
[ ] Visual Instruction Tuning
[ ] InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[x] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[ ] BEVT: BERT Pretraining of Video Transformers (in progress)
https://arxiv.org/pdf/2405.14767v2
https://openreview.net/pdf?id=PdaPky8MUn