Vision-language models, image generation, video AI, and cross-modal research — from DALL-E to GPT-4V.