In this section, I will be covering RCNN.

RCNN

Region-based Convolutional Neural Network

What is RCNN?

RCNN is a two-stage object detection framework.

The idea: rather than exhaustively scanning every position and scale in the picture, it first proposes where objects might be, and then looks inside those regions to figure out what they are.

Main Steps of RCNN

Region Proposal (Selective Search)
- The model uses an external algorithm called selective search to generate about 2,000 region proposals (chunks of the image that may contain objects).
- Selective search:
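  - Starts by splitting the image into many small regions of similar-looking pixels.
  - Gradually merges the most similar neighboring regions (by color, texture, size, and shape) into bigger chunks.
  - Keeps the bounding box of every region produced along the way, from small to large, as a “proposal”.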
  - Analogy: Like searching for keys on a messy table by grouping together clumps of items that look similar, rather than sifting through every single crumb.
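
The merge-and-keep idea above can be sketched in a few lines. This is a toy illustration, not the real selective search algorithm: it assumes the initial regions are already given as `(box, mean_color, pixel_count)` tuples, it compares regions by mean color only, and it merges the globally most similar pair rather than only neighboring regions. The names `toy_hierarchical_grouping`, `merge_boxes`, and `color_distance` are hypothetical.

```python
from itertools import combinations

def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def color_distance(c1, c2):
    """Toy similarity: sum of absolute differences between two mean colors."""
    return sum(abs(p - q) for p, q in zip(c1, c2))

def toy_hierarchical_grouping(regions):
    """regions: list of (box, mean_color, pixel_count).

    Returns every box seen during merging, small to large, as proposals.
    """
    regions = list(regions)
    proposals = [box for box, _, _ in regions]
    while len(regions) > 1:
        # Assumption: pick the most similar pair overall (color only).
        i, j = min(
            combinations(range(len(regions)), 2),
            key=lambda ij: color_distance(regions[ij[0]][1], regions[ij[1]][1]),
        )
        (box_i, col_i, n_i), (box_j, col_j, n_j) = regions[i], regions[j]
        n = n_i + n_j
        # Mean color of the merged region, weighted by pixel count.
        merged_color = tuple((ci * n_i + cj * n_j) / n for ci, cj in zip(col_i, col_j))
        merged = (merge_boxes(box_i, box_j), merged_color, n)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])
    return proposals
```

Starting from N initial regions, this toy version returns 2N - 1 boxes: the N originals plus one box per merge.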
Feature Extraction (CNN)
- Each proposed region is cropped/resized to the input size expected by the CNN (e.g., 224×224), often padded to ensure some surrounding context.
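- Each region is then run through the CNN separately, which converts the image patch into a feature vector (a summary of what the patch looks like). With about 2,000 proposals per image, that means about 2,000 CNN passes, which is why RCNN is slow.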
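
As a rough sketch of the crop-and-resize step, the function below expands a proposal box by a few pixels of context, crops it, and warps it to a fixed square size. It is only an illustration under stated assumptions: the image is a plain nested list of rows (to stay dependency-free), box coordinates are integers, the resize is nearest-neighbor, and the `context` padding is measured in original-image pixels. The function name and its defaults are hypothetical.

```python
def crop_and_warp(image, box, out_size=224, context=16):
    """Crop box (x1, y1, x2, y2) plus context, then warp to out_size x out_size.

    image is a list of rows; each row is a list of pixels.
    The aspect ratio is not preserved (the patch is stretched to a square).
    """
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    # Expand the box for surrounding context, clamped to the image bounds.
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(width, x2 + context), min(height, y2 + context)
    patch_h, patch_w = y2 - y1, x2 - x1
    # Nearest-neighbor sampling: map each output pixel back to a source pixel.
    return [
        [image[y1 + r * patch_h // out_size][x1 + c * patch_w // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]
```

In practice an image library would handle the resizing with proper interpolation; the point here is only that every proposal, whatever its shape, ends up as the same fixed-size input.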
Classification
- The feature vector is passed to a classifier that decides which object class, if any, the region contains (the original RCNN uses one linear SVM per class).
Bounding Box Regression
- A regression model tweaks the box’s coordinates for a tighter fit.
- Why must proposals start close? This step can only make small corrections. If the initial region is way off, regression can’t fix it.
  
  Analogy: If you place a sticker almost perfectly on a label spot, you can nudge it into place. If you miss completely, a nudge isn’t enough.
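
To make the "small corrections" concrete, here is a minimal sketch of how predicted offsets are applied to a proposal, assuming the commonly used parameterization: the center shifts by a fraction of the box's width and height, and the width and height are rescaled through an exponential. The function name `apply_deltas` and the `(x1, y1, x2, y2)` box layout are assumptions for illustration.

```python
import math

def apply_deltas(box, deltas):
    """Refine a proposal box (x1, y1, x2, y2) with offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Center moves by a fraction of the proposal's own size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Width and height are scaled; exp keeps them positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )
```

Because every offset is relative to the proposal itself, zero offsets leave the box unchanged, and the regressor only learns adjustments for proposals that already roughly cover the object.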

Summary Table: RCNN Components

Step	What It Does	Key Point
Region Proposal	Select likely areas with selective search	External, hand-designed, about 2,000 regions
Feature Extraction	CNN extracts regional features	Slow: one CNN pass per cropped/warped region
Classification	Assigns each region a class (or background)	Outputs the class label and a confidence score
Bounding Box Regression	Tweaks box for tightness	Small corrections only, must start close

NMS (Non-Maximum Suppression)

This technique removes duplicate detections, i.e. boxes that overlap heavily because they cover the same object.

Analogy: If you circled a ball on a photo three times, NMS picks the best circle and removes the extras.

How it works:

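- Keeps the box with the highest score.
- Removes the remaining boxes that overlap it too much (IoU, the intersection area divided by the union area, above a threshold).
- Repeats with the highest-scoring box that is left, until no boxes remain.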
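
A minimal greedy NMS sketch in plain Python follows. It assumes boxes are `(x1, y1, x2, y2)` tuples with one score each, and the default threshold of 0.5 is an illustrative choice, not a value taken from these notes. Detection libraries ship faster vectorized versions; this one only shows the logic.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Candidate indices sorted by descending score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

In a multi-class detector this is typically run separately for each class, so that a box for one class does not suppress an overlapping box for another.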