Gemini, the multimodal artificial intelligence model created by Google, can move freely between different types of information such as text, audio, video, images and code.
In this article we will walk through a couple of examples with Gemini Pro Vision.
Installation and Configuration of Google Cloud CLI
- You need a Google Cloud account; if you don’t have one, create it on the official website.
- Set up the gcloud CLI locally by following the Cloud SDK documentation. Google Cloud CLI requires Python 3.8 to 3.12.
- In the last step of the installation you will be asked to log in: enter “y” and then, when prompted, enter the number corresponding to the project you want to use. Write down the project ID, which we will later insert into the Python code.
- If you are not logged in, try again with the gcloud auth application-default login command.
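Before running the installer, a quick pre-flight check can confirm the two prerequisites mentioned above: a Python interpreter in the 3.8 to 3.12 range and a gcloud executable on the PATH. The following is a minimal sketch using only the standard library; the function names are illustrative and not part of any Google tool.

```python
import shutil
import sys

# Version range taken from the requirement stated above (Python 3.8 to 3.12).
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


def gcloud_on_path():
    """Return the path of the gcloud executable, or None if it is not found."""
    return shutil.which("gcloud")


def preflight_report():
    """Build a short, human-readable summary of both checks."""
    python_ok = is_supported_python()
    gcloud_path = gcloud_on_path()
    lines = [
        f"Python {sys.version_info.major}.{sys.version_info.minor}: "
        + ("supported" if python_ok else "outside the 3.8 to 3.12 range"),
        "gcloud CLI: " + (gcloud_path or "not found on PATH"),
    ]
    return "\n".join(lines)


print(preflight_report())
```

If the report shows that gcloud is missing, complete the installation steps above and run the check again.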
Vertex AI SDK for Python
- Open your favorite editor (in our case VS Code), create a Python virtual environment, and then run the following command from the terminal: pip install "google-cloud-aiplatform>=1.38" (the quotes prevent the shell from interpreting >= as an output redirection).
- Download the two files from our repository https://github.com/Impesud/generative-AI/tree/main/gemini and open them with your editor.
- Where the code says “Project ID”, enter the ID of your chosen project. Remember that you must first enable the Vertex AI API for that project on Google Cloud.
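For orientation, the project setup inside such a script can be sketched as follows. This is not the repository's code, only an assumption of what the initialisation typically looks like: the placeholder value, the us-central1 region and the use of the GOOGLE_CLOUD_PROJECT environment variable are all illustrative choices. The vertexai.init call comes from the google-cloud-aiplatform package and is imported lazily, so the helper that resolves the project ID can be tried without the SDK.

```python
import os

PLACEHOLDER = "your-project-id"  # hypothetical placeholder, replace with your own ID


def resolve_project_id(explicit=None, env=None):
    """Pick the project ID from an explicit value or from the environment.

    Raises ValueError when no usable ID is available, so the script fails
    early instead of sending requests without a valid project.
    """
    env = os.environ if env is None else env
    project_id = (explicit or env.get("GOOGLE_CLOUD_PROJECT", "")).strip()
    if not project_id or project_id == PLACEHOLDER:
        raise ValueError("Set your Google Cloud project ID before running.")
    return project_id


def init_vertex(project_id, location="us-central1"):
    """Initialise the Vertex AI SDK (the default region is an assumption)."""
    import vertexai  # provided by google-cloud-aiplatform

    vertexai.init(project=project_id, location=location)
```

A script would then call init_vertex(resolve_project_id("my-project")) once, before creating the model, where "my-project" stands for the ID you wrote down during the gcloud setup.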
Send Multimodal Prompt Requests
- We will send two multimodal prompt requests with images to the Gemini Pro Vision (gemini-pro-vision) model. The Gemini Pro Vision model supports prompts that include text, code, images, and video, and can output text and code.
- The code in gemini_image_from_uri.py sends an image by URI, while gemini_image_from_url.py sends an image by URL.
- Run each file from your terminal (for example: python gemini_image_from_uri.py).
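To make the difference between the two scripts concrete, here is a minimal sketch of both approaches. It is not the repository's code: the function names, the extension-to-MIME mapping and the assumption that the URI is a Cloud Storage gs:// URI are illustrative. The classes are imported from vertexai.preview.generative_models, the module path that shipped with SDK version 1.38. In the first case the model reads the image directly from the URI; in the second the script downloads the bytes from an HTTP URL and sends them inline.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse

MODEL_NAME = "gemini-pro-vision"

# Assumed mapping; only two common image types are listed here.
MIME_BY_EXTENSION = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}


def guess_image_mime(location):
    """Derive the MIME type from the file extension of a URI or URL."""
    extension = posixpath.splitext(urlparse(location).path)[1].lower()
    if extension not in MIME_BY_EXTENSION:
        raise ValueError(f"Unsupported image extension: {extension!r}")
    return MIME_BY_EXTENSION[extension]


def describe_image_from_uri(gcs_uri, prompt):
    """Send an image referenced by a gs:// URI together with a text prompt."""
    from vertexai.preview.generative_models import GenerativeModel, Part

    image_part = Part.from_uri(gcs_uri, mime_type=guess_image_mime(gcs_uri))
    return GenerativeModel(MODEL_NAME).generate_content([image_part, prompt])


def describe_image_from_url(http_url, prompt):
    """Download an image over HTTP(S) and send its bytes inline with a prompt."""
    from vertexai.preview.generative_models import GenerativeModel, Part

    with urllib.request.urlopen(http_url) as response:
        image_bytes = response.read()
    image_part = Part.from_data(image_bytes, mime_type=guess_image_mime(http_url))
    return GenerativeModel(MODEL_NAME).generate_content([image_part, prompt])
```

Both functions assume that vertexai.init has already been called with your project ID. The URI variant avoids downloading the image locally, while the URL variant works with any publicly reachable image.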
- The first file will give you the answer:
role: "model"
parts {
text: "The image shows a table with a white surface. On the table are two cups of coffee, a bowl of blueberries, and five scones with blueberries on top. There are also some pink flowers on the table. The table is covered in a white paper with purple and blue stains."
}
The second file, instead, will return:
role: "model"
parts {
text: "The Colosseum is an oval amphitheater in the center of the city of Rome, Italy. Built of concrete and stone, it is the largest ancient amphitheater ever built and is still the largest standing theater in the world today. The Colosseum could hold , it is estimated, between 50,000 and 80,000 spectators, having an average audience of some 65,000; it was used for gladiatorial contests and public spectacles such as mock sea battles, animal hunts, executions, re-enactments of famous battles, and dramas based on Classical mythology. The building"
}
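The answers above are the printed form of a structured message with a role and a list of parts. If you only need the generated text, a small helper can join the text of all parts. The sketch below is an assumption about how one might post-process such a message; it relies only on the role, parts and text fields visible in the output and uses stand-in objects instead of a live API response.

```python
from types import SimpleNamespace


def extract_text(content):
    """Concatenate the text of every part in a model message."""
    return "".join(getattr(part, "text", "") or "" for part in content.parts)


# Stand-in object mimicking the shape of the printed answer (not a real API response).
sample = SimpleNamespace(
    role="model",
    parts=[
        SimpleNamespace(text="The image shows a table "),
        SimpleNamespace(text="with a white surface."),
    ],
)

print(extract_text(sample))  # prints: The image shows a table with a white surface.
```

With the SDK, the same message is typically reachable as response.candidates[0].content, and response.text offers a shortcut when there is a single candidate.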
Once you are done with Gemini, I recommend disabling the APIs you enabled, so that you do not incur extra costs.
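This can be done from the Cloud console or from the command line. As a sketch, the helper below builds the gcloud services disable command for the Vertex AI API (aiplatform.googleapis.com) and only executes it when explicitly asked to. The function names and the dry_run flag are illustrative assumptions.

```python
import subprocess

VERTEX_AI_SERVICE = "aiplatform.googleapis.com"


def build_disable_command(project_id, service=VERTEX_AI_SERVICE):
    """Return the gcloud command that disables one API for one project."""
    if not project_id:
        raise ValueError("A project ID is required.")
    return ["gcloud", "services", "disable", service, "--project", project_id]


def disable_service(project_id, dry_run=True):
    """Print the command and, unless dry_run is set, run it through gcloud."""
    command = build_disable_command(project_id)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Calling disable_service with your project ID only prints the command by default; pass dry_run=False once you are sure no other workload in the project still depends on the API.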
Don’t miss the next articles dedicated to Generative AI!