The overall plan
- Frontend: Next.js and React
- Backend: Python and FastAPI
My plan for extracting nutritional information was a three-step pipeline:
- First, check if the uploaded image even contains food.
- If it does, figure out what the food is.
- Finally, look up the nutritional information for the identified food.
The Food Detection Model
The main AI model that identifies the food (a VLM) is powerful, but it's also slow. I didn't want to waste time and processing power analyzing pictures of my dog or my desk. My solution was to build a small, fast "gatekeeper" model. Its only job is to answer one question: "Is this a picture of food?"
Here’s how I built it:
1. The Model Architecture
I chose MobileNetV2, a lightweight and efficient CNN pre-trained on the ImageNet dataset. This way, instead of training a model from scratch, we adapt a model that already understands general image features. I used a custom classification head, consisting of a GlobalAveragePooling2D layer followed by a single Dense layer with a sigmoid activation function. This outputs a single value between 0 and 1, representing the probability that the image contains food.
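Assuming the detector was built with Keras (the post mentions Keras's ImageDataGenerator later), the architecture described above might look roughly like this; the 224×224 input size is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained MobileNetV2 backbone without its ImageNet classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # frozen during the first training phase

# Custom head: pooled features -> a single sigmoid "is this food?" probability.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
```

The single sigmoid unit is what makes this a binary classifier: anything above a chosen threshold (e.g. 0.5) is treated as food.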
2. The Data
The training dataset was composed of:
- Positive Class (Food): 5,000 images sourced from the Food-101 dataset, which contains a diverse set of food items.
- Negative Class (Not Food): 5,000 images of non-food subjects, such as landscapes, people, and everyday objects, obtained from the ImageNet-V2 dataset.
To make the model more robust, data augmentation was applied during training. I used ImageDataGenerator from Keras to apply random rotations, shifts, shears, zooms, and flips to the training images in the hope of improving generalization.
3. The Training Process
The training was conducted in two main phases:
- Phase 1: Training the Classifier Head: Initially, all layers of the base MobileNetV2 model were frozen and only the new classification head was trained, allowing the head to learn how to interpret the features extracted by the base model. This phase ran for 5 epochs with a relatively high learning rate of 1e-3.
- Phase 2: Fine-Tuning: After the head was trained, some of the top layers of the base model were unfrozen, and the entire model was trained for 5 more epochs with a very low learning rate of 1e-5. This fine-tuning step lets the model slightly adjust its learned features to be more specific to our dataset.
The model was trained using the Adam optimizer and binary_crossentropy as the loss function.
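The two-phase schedule can be sketched end to end as below. To keep the sketch self-contained and runnable, it uses a small input size, random weights, a tiny random stand-in dataset, and unfreezes the last 30 layers; in the real run the base starts from the ImageNet weights, trains on the Food-101/ImageNet-V2 data, and each phase lasts 5 epochs:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None)  # weights="imagenet" in the real run
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])

# Tiny random stand-in dataset so the sketch runs end to end.
x = np.random.rand(8, 96, 96, 3).astype("float32")
y = np.random.randint(0, 2, size=(8, 1)).astype("float32")

# Phase 1: base frozen, train only the head at a higher learning rate.
base.trainable = False
model.compile(optimizer=Adam(1e-3), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=1, verbose=0)  # 5 epochs in the real run

# Phase 2: unfreeze the top of the base and fine-tune at a very low rate.
base.trainable = True
for layer in base.layers[:-30]:  # keep the lower layers frozen
    layer.trainable = False
model.compile(optimizer=Adam(1e-5), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=1, verbose=0)
```

Recompiling between phases matters: changing `trainable` flags only takes effect at the next `compile`, and it also lets you drop the learning rate for the fine-tuning pass.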
The result is binary_food_detector.h5, a tiny model (just a few megabytes) that runs in milliseconds on a CPU and has an accuracy of over 95% on the test set. This simple gatekeeper determines when the image is passed to the more computationally expensive agent.
Orchestrating the AI with a LangChain Agent
My original pipeline was a simple, linear flow: check for food, identify the food, and get nutrition data. However, I needed a more dynamic way to handle the VLM's output and the subsequent nutrition lookup.
The first challenge was getting structured data from the VLM. It was great at describing images, but its creativity was a problem.
My first attempt at a prompt was simple:
"What food is in this image?"
The VLM responded with:
"It looks like a delicious breakfast plate with two sunny-side-up eggs, a few strips of crispy bacon, and a side of toast."
While accurate, this was difficult to parse. After a few iterations, I landed on a much more effective prompt:
"You are a food identification expert. Analyze the image and identify the distinct food items. Respond ONLY with a comma-separated list. Do not add any introductory text or explanations. For example: 'eggs, bacon, toast'."
The VLM now returned exactly what I needed:
"eggs, bacon, toast"
This clean, predictable output could be easily split into a list and passed to the next step.
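That parsing step is a one-liner; a small helper (hypothetical name) with light normalization makes it robust to stray whitespace and casing:

```python
def parse_food_list(vlm_output: str) -> list[str]:
    """Split the VLM's comma-separated reply into clean item names."""
    return [item.strip().lower() for item in vlm_output.split(",") if item.strip()]

parse_food_list("Eggs, bacon,  toast")  # → ['eggs', 'bacon', 'toast']
```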
With a reliable list of food items, the next challenge was looking up each one and aggregating the results. This is where a LangChain agent came in. Instead of a rigid, linear pipeline, the agent could use a set of tools and a reasoning engine to decide what to do. Here’s how it works:
- The Goal: The agent is given a primary goal, defined in a detailed system prompt, instructing it to find all food items in the VLM's output, use the Nutrition_Analyzer tool for each one, and then calculate the total nutritional values.
- The Tools: The agent has access to a single tool: the Nutrition_Analyzer. This tool calls the USDA FoodData Central API and returns the nutritional summary for one food item at a time.
The Reasoning Loop (ReAct): The agent uses a framework called ReAct (Reasoning and Acting). When given a task like analyzing "eggs, bacon, toast", it goes through a thought process:
- Thought: "I need to find the nutritional information for eggs, bacon, and toast, and then sum it up. I have a tool for finding nutritional information. I will start with 'eggs'."
- Action: Call Nutrition_Analyzer with the input "eggs".
- Observation: Get the nutritional data for eggs.
- Thought: "Okay, I have the data for eggs. Now I need to do the same for 'bacon'."
- Action: Call Nutrition_Analyzer with the input "bacon".
- Observation: Get the nutritional data for bacon.
- Thought: "Great. Now for the toast."
- Action: Call Nutrition_Analyzer with the input "toast".
- Observation: Get the nutritional data for toast.
- Thought: "I have all the data. Now I will calculate the totals and format the final answer as requested."
- Final Answer: Present the aggregated nutritional summary.
This thought-action-observation loop allows the agent to execute a plan to get nutritional data and adapt based on the results it gets back from its tools.
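The loop above can be sketched in plain Python. In the real app a LangChain ReAct agent drives the reasoning and the tool queries the USDA FoodData Central API; here both are replaced with stand-ins (a simple loop and a hypothetical fake lookup table with made-up numbers) so the sketch runs without an LLM or network access:

```python
def nutrition_analyzer(food_item: str) -> dict:
    """Hypothetical stand-in for the USDA FoodData Central lookup (illustrative values)."""
    fake_db = {
        "eggs":  {"calories": 155, "protein_g": 13},
        "bacon": {"calories": 541, "protein_g": 37},
        "toast": {"calories": 265, "protein_g": 9},
    }
    return fake_db[food_item]

def analyze_meal(items: list[str]) -> dict:
    totals = {"calories": 0, "protein_g": 0}
    for item in items:                            # Thought: look each item up in turn
        observation = nutrition_analyzer(item)    # Action + Observation
        for key, value in observation.items():
            totals[key] += value
    return totals                                 # Final Answer: aggregated summary

analyze_meal(["eggs", "bacon", "toast"])  # → {'calories': 961, 'protein_g': 59}
```

The agent version behaves the same way, but the LLM decides at each step which tool to call and when it has enough observations to answer, which is what lets it adapt when a lookup fails or an item is unrecognized.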
Building this was a challenging but rewarding process that taught me a lot about integrating machine learning models. You can find the code for the project on my GitHub.