Tx, Yann LeCun.
• JEPA / H-JEPA: avoids predicting every single pixel (too expensive) and rather predicts in latent space. H-JEPA adds hierarchy — short term details vs long term planning ie. how humans actually learn.
• I-JEPA: built for very efficient vision models. Masks image patches and predicts the semantics and in doing so bypasses heavy compute of traditional autoencoders.
• MC-JEPA & V-JEPA: both of these are built for videos. MC-JEPA separates content (what an object is) vs motion (how it moves). V-JEPA masks video features with no text labels making it perfect of action tracking at scale.
• Audio-JEPA: filters out background noise by treating sounds like visuals.
• Point-JEPA & 3D-JEPA: used primarily in AVs. Uses LiDAR point clouds & volumetric grids.
• ACT-JEPA: filters out real world noise to learn manipulation tasks efficiently via imitation learning.




