OpenAI CLIP
OpenAI’s Contrastive Language-Image Pre-training (CLIP) neural network is a multimodal model that maps text and images into a shared embedding space. It leverages natural language supervision (from text data) to guide its learning of visual concepts (from image data), enabling zero-shot transfer to new tasks. Given an image, CLIP can score a set of candidate text captions and pick the one that best describes it; given a text query, it can rank images by how well they match. Note that CLIP scores and retrieves rather than generates: it selects the best caption from supplied candidates instead of writing one from scratch.
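To make this concrete, here is a minimal sketch of zero-shot classification using the reference implementation from github.com/openai/CLIP (the `clip` Python package); the image path and candidate captions below are placeholder assumptions for illustration, not from the original text.

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model and its matching image preprocessor.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions (assumptions for illustration).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a dog", "a photo of a cat", "a diagram"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Embed both modalities into the shared space and compare them;
    # the model returns similarity logits for each image-text pair.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# The highest-probability caption is CLIP's best match for the image.
print(dict(zip(captions, probs[0])))
```

Because the candidate labels are supplied as free-form text at inference time, the same model can handle new label sets without retraining, which is what the zero-shot transfer claim refers to.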
