Ever wonder about the technology that makes it possible for a machine to tell the difference between two objects in one photo?

Object detection is a breakthrough technology with many benefits: It scales to large volumes of image data, flags anomalies and other unusual behavior, and automates the time-consuming work of reviewing photo feeds manually.

Google’s AutoML Vision enables you to train machine learning models that classify your images and detect objects in them according to your own defined labels, making AI accessible to every business.

What Is Google AutoML Vision?

AutoML Vision is Google’s service for training custom machine learning models that classify images according to labels you define. It’s useful when you have many images you need to analyze intelligently, such as photos from security feeds.

Say you want to know if an image contains a person. The system will return Yes if a person is present or No if not. This type of analysis is called simple classification: the model analyzes an image and returns a Yes/No label.

Classification is a straightforward use case. However, there might be a more complex case where you need to figure out how many people are in a particular image. That goes beyond classification: you need to locate people in different parts of the image while also distinguishing one person from another. This is where object detection comes into the picture.

Object Detection

Object detection identifies objects in an image and distinguishes one object from another based on attributes such as position and size.

Object detection splits an image into grid cells or proposal regions (regions with a higher probability of containing an object of interest) and then runs classification on each region. If a region is classified as containing a “human,” that region is marked with a bounding box, as shown in the example below.

[Image: a group of people sitting on chairs]
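The region-then-classify idea described above can be sketched in a few lines of Python. This is a toy illustration, not how AutoML works internally: `grid_regions`, `detect`, and `toy_classifier` are hypothetical names, and the classifier is a stub that pretends a person occupies the top-left cell. Real detectors also refine box coordinates and merge overlapping boxes.

```python
from typing import Callable, List, Tuple

# (x_min, y_min, x_max, y_max), each value normalized to the range 0..1
Box = Tuple[float, float, float, float]


def grid_regions(rows: int, cols: int) -> List[Box]:
    """Split the image (treated as a unit square) into rows x cols cells."""
    regions = []
    for r in range(rows):
        for c in range(cols):
            regions.append((c / cols, r / rows, (c + 1) / cols, (r + 1) / rows))
    return regions


def detect(
    regions: List[Box],
    classify: Callable[[Box], Tuple[str, float]],
    threshold: float = 0.5,
) -> List[Tuple[str, float, Box]]:
    """Classify every region and keep confident, non-background hits as boxes."""
    detections = []
    for box in regions:
        label, score = classify(box)
        if label != "background" and score >= threshold:
            detections.append((label, score, box))
    return detections


def toy_classifier(box: Box) -> Tuple[str, float]:
    """Hypothetical stand-in for a trained classifier."""
    x_min, y_min, _, _ = box
    if x_min == 0.0 and y_min == 0.0:
        return "human", 0.9
    return "background", 0.8


regions = grid_regions(rows=2, cols=2)
detections = detect(regions, toy_classifier)
print(detections)
```

Swapping `toy_classifier` for a real model is the only change needed to reuse the loop; the grid size and the 0.5 threshold are arbitrary choices for the sketch.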

Google AutoML Vision Tutorial

Training an AutoML Vision model to scan images can be done in a few steps, starting with data preparation.  

Prepare the Data

To train an AutoML object detection model, you’ll need your images in a Google Cloud Storage bucket. You’ll also need a schema file in the same bucket, ideally in the same folder. This schema file is a CSV file containing the following data columns in the same order.

As indicated in blue in the chart below, you can choose to omit any of the following:

  • Top right X coordinate 
  • Top right Y coordinate
  • Bottom left X coordinate
  • Bottom left Y coordinate

These coordinates describe a rectangle, and the top left and bottom right corners alone are sufficient to define it.
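As a rough illustration, the snippet below builds one schema row with only the top left and bottom right corners filled in. The column order used here (data split, image URI, label, then the four corners starting at the top left and going clockwise) is an assumption based on Google's documented CSV format, so verify it against the chart for your project. The bucket path, label, and helper names are hypothetical.

```python
import csv
import io


def schema_row(split, image_uri, label, x_min, y_min, x_max, y_max):
    """Build one row; the top right and bottom left corners are left empty."""
    return [
        split, image_uri, label,
        x_min, y_min,   # top left corner
        "", "",         # top right corner (omitted)
        x_max, y_max,   # bottom right corner
        "", "",         # bottom left corner (omitted)
    ]


def write_schema(rows):
    """Serialize rows to CSV text, ready to be saved and uploaded to the bucket."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    # Hypothetical bucket path and label, with normalized coordinates.
    schema_row("TRAIN", "gs://my-bucket/images/img001.jpg", "person", 0.1, 0.2, 0.4, 0.9),
]
csv_text = write_schema(rows)
print(csv_text)
```

The omitted corners still keep their commas, so every row has the same number of fields.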

Train, Test, and Validation Table

(Note: The X and Y coordinates are normalized: each value is scaled relative to the image dimensions and typically falls between zero and one.)
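If your annotations are in pixels, a small helper can convert them to scaled values. This is a minimal sketch that assumes the origin is the top left corner of the image; `normalize_box` is a hypothetical name, not part of any Google library.

```python
def normalize_box(left, top, right, bottom, image_width, image_height):
    """Convert a pixel-space box to coordinates scaled between zero and one."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")

    def clamp(value):
        # Keep values inside the 0..1 range even if a box spills past the edge.
        return min(1.0, max(0.0, value))

    return (
        clamp(left / image_width),
        clamp(top / image_height),
        clamp(right / image_width),
        clamp(bottom / image_height),
    )


# A 256x192 pixel box inside a 640x480 image.
box = normalize_box(64, 48, 320, 240, image_width=640, image_height=480)
print(box)
```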

Once the images are in the bucket and the CSV file is provided, data preparation is complete.


Above, you can see the CSV schema file and the images in the same bucket. This is how AutoML expects the data. 

Next, we’ll load this bucket into AutoML. AutoML loads the images and uses the provided schema to draw the bounding boxes on them.

Create a New Dataset

After opening Vision and AutoML, navigate to the Dataset section to create a new dataset. Enter a unique name for the dataset and choose Object Detection as the model type. 

The screen below will appear. Choose Select a CSV File on Cloud Storage.


From here, choose the Google Cloud Storage bucket containing the images and select the CSV schema file. AutoML’s backend will configure the rest.

(Note: Alternatively, you can load images from your computer instead of using a CSV file. This requires using Google’s annotation tool to go through each uploaded image and draw the bounding boxes. That approach isn’t covered in this article, but it’s pretty straightforward.)



The images will then be loaded into the AutoML dashboard.

[Image: a collage of people]

Start Training Your AutoML Model 

Once the images are uploaded, you can inspect them and modify the bounding boxes as needed. 

Click Train to start training the AutoML model.


Then, click Train New Model. 

(Note: You won’t see the metric above when training a model for the first time.)


The option above allows you to optimize for cloud-hosted online predictions or edge devices (edge compute models are optimized to work offline and on mobile). You can also optimize the model for higher accuracy or faster predictions, but this is project-specific.


Finally, you can specify the training budget in node hours, which caps how long training can run.

Once training is complete, AutoML gives you two ways to consume the model; pick whichever suits your use case. The first is a deployed instance that lets you upload images and get predictions quickly. Predictions are returned as JSON with fields such as the object’s label, the bounding box coordinates, and a confidence score.

The second is a containerized version of the model, which you can run locally or on any compatible device.
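For the deployed option, the JSON predictions can be turned into a simple people count, which ties back to the "how many people are in this image" use case. The sketch below parses a hand-written sample response; the field names (`payload`, `displayName`, `imageObjectDetection`, `normalizedVertices`) are assumptions modeled on the AutoML response format, so check them against a real response from your own model before relying on them.

```python
import json

# Hypothetical sample response; field names and values are assumptions.
SAMPLE_RESPONSE = """
{
  "payload": [
    {"displayName": "person",
     "imageObjectDetection": {
       "score": 0.93,
       "boundingBox": {"normalizedVertices": [{"x": 0.1, "y": 0.2}, {"x": 0.4, "y": 0.9}]}}},
    {"displayName": "person",
     "imageObjectDetection": {
       "score": 0.81,
       "boundingBox": {"normalizedVertices": [{"x": 0.55, "y": 0.25}, {"x": 0.85, "y": 0.95}]}}},
    {"displayName": "person",
     "imageObjectDetection": {
       "score": 0.32,
       "boundingBox": {"normalizedVertices": [{"x": 0.7, "y": 0.1}, {"x": 0.75, "y": 0.2}]}}}
  ]
}
"""


def parse_detections(response_text, min_score=0.5):
    """Return label, score, and box for each detection at or above min_score."""
    detections = []
    for item in json.loads(response_text).get("payload", []):
        detection = item["imageObjectDetection"]
        if detection["score"] < min_score:
            continue  # drop low-confidence predictions
        top_left, bottom_right = detection["boundingBox"]["normalizedVertices"]
        detections.append({
            "label": item["displayName"],
            "score": detection["score"],
            "box": (top_left["x"], top_left["y"], bottom_right["x"], bottom_right["y"]),
        })
    return detections


detections = parse_detections(SAMPLE_RESPONSE)
people = [d for d in detections if d["label"] == "person"]
print(f"People detected: {len(people)}")
```

The 0.5 confidence cutoff is an arbitrary choice for the sketch; the right threshold depends on how your project trades off missed detections against false positives.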

Learn more about AutoML and schedule a time for a consultation with 66degrees’ cloud experts today.