“The highest education is that which does not merely give us information but makes our life in harmony with all existence” – Tagore
Andrew Ng gave an interview to EE Times in early September 2023, where he talked about the next AI revolution coming to images and about a future with LVMs (Large Vision Models).
Large Language Models (LLMs) are now well known thanks to ChatGPT’s phenomenal success. Trained with deep learning techniques on enormous volumes of data, they learn to spot patterns, make predictions, and produce high-quality output. Their signature capability is generating natural language that closely mimics human writing: they can produce coherent, persuasive passages on a wide range of subjects, which makes them useful for applications like language translation, content generation, and chatbots. Similarly, LVMs can recognize and classify images with impressive precision. They can identify objects, scenes, and even emotions depicted in photographs, and produce detailed descriptions of what they see. These capabilities have many real-world applications across artificial intelligence, computer vision, and natural language processing, and they have the potential to fundamentally change how we use technology and handle data.
The applications of this are enormous. Imagine a computer looking at human tissue and counting the exact number of cancer cells. LVMs, combined with the ever-evolving LLMs, could count, classify, and predict the stage and rate of progression.
And like LLMs, LVMs can be adapted through a technique called Visual Prompting: the user prompts the model to produce the desired output by marking an image with a pattern that the model has been trained to recognize and respond to in a certain way.
Here is an example:
On the left: the visual prompt, marking the empty space with a long stroke and a cell with a dot.
On the right: the system has detected the cells in the petri dish, ignoring the empty space.
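To make the idea concrete, here is a toy sketch of how a visual prompt’s sparse labels might be propagated to the rest of an image. This is not LandingAI’s actual algorithm: a nearest-neighbour comparison over raw RGB values stands in for the learned image features a real LVM would use, and the synthetic image, coordinates, and class ids are invented purely for illustration.

```python
# Toy sketch of the visual-prompting idea (not LandingAI's actual method):
# the user labels a handful of pixels (a long stroke over empty space, a dot
# on a cell) and the sparse labels are propagated to every other pixel.
# A nearest-neighbour match on RGB values stands in for learned features.
import numpy as np

def propagate_visual_prompt(image, prompt_coords, prompt_labels):
    """image: HxWx3 float array; prompt_coords: list of (row, col);
    prompt_labels: matching class ids (e.g. 0 = empty space, 1 = cell)."""
    h, w, _ = image.shape
    prompt_pixels = np.array([image[r, c] for r, c in prompt_coords])  # N x 3
    labels = np.array(prompt_labels)

    # Compare every pixel with every prompted pixel and copy the closest label
    flat = image.reshape(-1, 3)                                        # (H*W) x 3
    dists = np.linalg.norm(flat[:, None, :] - prompt_pixels[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)].reshape(h, w)                  # per-pixel class map

# Usage on a synthetic 64x64 "petri dish": a dark blob (cell) on a light background
rng = np.random.default_rng(0)
img = np.full((64, 64, 3), 0.9) + rng.normal(0, 0.02, (64, 64, 3))
img[20:28, 30:38] = 0.2                          # the "cell"
mask = propagate_visual_prompt(img, [(5, 5), (24, 34)], [0, 1])
print("pixels classified as cell:", int(mask.sum()))
```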
CLIP: Developed by OpenAI, CLIP (Contrastive Language–Image Pretraining) is a vision-language model that’s trained to understand images in conjunction with natural language.
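As a hedged illustration, the sketch below shows the typical zero-shot classification pattern for CLIP using the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image file and candidate labels are placeholders, not part of the original example.

```python
# Minimal sketch: zero-shot image classification with CLIP via Hugging Face
# transformers (assumed installed). The image path and labels are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("petri_dish.jpg").convert("RGB")   # hypothetical local image
labels = ["a photo of cells in a petri dish", "a photo of an empty petri dish"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```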
Google’s Vision Transformer: Also called ViT, this is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. The largest version to date, ViT-22B, scales this architecture to 22 billion parameters. You can read about the latest update here.
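For a concrete feel of that patch-based pipeline, here is a minimal sketch using the publicly released google/vit-base-patch16-224 checkpoint via Hugging Face transformers (ViT-22B itself is not publicly available); the image file is a placeholder.

```python
# Minimal sketch: image classification with a public ViT checkpoint via
# Hugging Face transformers (assumed installed). The image path is a placeholder.
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("sample.jpg").convert("RGB")   # hypothetical local image

# The processor resizes the image; the model splits it into 16x16 patches,
# embeds them, adds position embeddings, and runs a Transformer encoder.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```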
LandingAI: Their flagship product, LandingLens™, is designed to make computer vision accessible to everyone. It provides an intuitive platform for creating a custom computer vision project in minutes. You can upload images directly into LandingLens, label objects in your images, train your model, evaluate its performance, and deploy it to the cloud or edge devices.
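Below is a hedged sketch of what calling a deployed LandingLens model from Python might look like. The landingai package and Predictor class follow LandingAI’s published SDK to the best of my recollection, so treat the exact names and signatures as assumptions; the endpoint ID, API key, and image file are placeholders.

```python
# Hedged sketch: querying a deployed LandingLens model. The import path and
# Predictor signature are assumptions based on LandingAI's published Python SDK;
# check the current documentation before relying on them.
from PIL import Image
from landingai.predict import Predictor

predictor = Predictor(
    endpoint_id="YOUR_ENDPOINT_ID",   # placeholder: from your LandingLens deployment
    api_key="YOUR_API_KEY",           # placeholder: from your LandingAI account
)

image = Image.open("petri_dish.jpg")  # hypothetical local image
predictions = predictor.predict(image)

# Each prediction typically carries a label and a confidence score
for pred in predictions:
    print(pred)
```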
Large Vision Models (LVMs) can be beneficial in several ways. Here are a few examples:
To realize some of the benefits above, LVMs will need to be used in conjunction with LLMs.
LVMs are far from perfect: they still have issues with hallucinations, labeling errors, biases, and privacy concerns. But the current offerings will continue to evolve and find their place in the future of marketing.