Zero-Shot CLIP Classification for Vector Features¶
This notebook demonstrates how to classify vector polygon features using zero-shot CLIP inference with the geoai.clip_classify module. Given a set of polygons and a raster image, CLIP extracts an image chip for each polygon and matches it against user-provided category labels — no training data required.
We use a Chesapeake Bay Watershed NAIP scene (eastern Maryland, 2018) together with its paired land-cover ground truth raster to validate zero-shot predictions.
Tip: For best results with aerial/satellite imagery, use a remote-sensing CLIP model such as flax-community/clip-rsicd-v2, which was fine-tuned on remote sensing image captions.
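Before encoding, each candidate label is wrapped in a text prompt for CLIP's text encoder. A minimal sketch of that step, using the default `label_prefix` listed under Key Parameters below (this illustrates the idea, not geoai's exact internals):

```python
# Sketch of CLIP prompt construction (not geoai's exact internals):
# each candidate label is prefixed before being passed to the text encoder.
label_prefix = "a satellite image of "  # default label_prefix
labels = ["forest", "agricultural field"]
prompts = [label_prefix + label for label in labels]
print(prompts)
# ['a satellite image of forest', 'a satellite image of agricultural field']
```

A remote-sensing-flavoured prefix like this helps steer general CLIP models toward overhead imagery.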
Install packages¶
# %pip install geoai-py
Import libraries¶
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import rasterio
import rasterio.enums
from rasterio.plot import show
from shapely.geometry import box
from geoai import clip_classify_vector, CLIPVectorClassifier, download_file
Download sample data¶
We use a USGS NAIP aerial image and its paired Chesapeake Bay Watershed land-cover raster, both hosted on Hugging Face.
| File | Description |
|---|---|
| `m_3807511_ne_18_060_20181104.tif` | 4-band (RGBN) NAIP image, 0.6 m GSD, eastern Maryland, 2018 |
| `m_3807511_ne_18_060_20181104_landcover.tif` | 13-class Chesapeake land-cover ground truth (same extent / CRS) |
raster_url = (
"https://huggingface.co/datasets/giswqs/geospatial/resolve/main/"
"m_3807511_ne_18_060_20181104.tif"
)
landcover_url = (
"https://huggingface.co/datasets/giswqs/geospatial/resolve/main/"
"m_3807511_ne_18_060_20181104_landcover.tif"
)
raster_path = download_file(raster_url)
landcover_path = download_file(landcover_url)
Preview the data¶
with rasterio.open(raster_path) as src:
    print(f"Raster CRS : {src.crs}")
    print(f"Size       : {src.width} x {src.height} px")
    print(f"Resolution : {src.res[0]:.2f} m")
    bounds = src.bounds
    width_km = (bounds.right - bounds.left) / 1000
    height_km = (bounds.top - bounds.bottom) / 1000
    print(f"Extent     : {width_km:.2f} km x {height_km:.2f} km")
    print(f"Bands      : {src.count}")
from rasterio.transform import from_bounds

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
with rasterio.open(raster_path) as src:
    # Downsample 10x for a quick display backdrop
    scale = 10
    data = src.read(
        out_shape=(src.count, src.height // scale, src.width // scale),
        resampling=rasterio.enums.Resampling.average,
    )
    disp_transform = from_bounds(*src.bounds, data.shape[2], data.shape[1])
show(data, transform=disp_transform, ax=ax)
ax.set_title("Chesapeake Bay – NAIP 2018 (eastern Maryland)", fontsize=14)
ax.axis("off")
plt.tight_layout()
plt.show()
Land-cover classification with a grid¶
We tile the raster with a regular 300 m × 300 m grid. This rural Maryland tile is dominated by two land-cover classes, forest and agricultural fields, so those are the candidate labels we give CLIP to choose between.
We use flax-community/clip-rsicd-v2, a CLIP model fine-tuned on remote sensing image captions.
with rasterio.open(raster_path) as src:
    left, bottom, right, top = src.bounds
    raster_crs = src.crs

cell_size = 300  # metres

# Generate only full cells that fit inside the raster extent
grid_polys = []
for x in np.arange(left, right - cell_size, cell_size):
    for y in np.arange(bottom, top - cell_size, cell_size):
        grid_polys.append(box(x, y, x + cell_size, y + cell_size))

grid_gdf = gpd.GeoDataFrame(geometry=grid_polys, crs=raster_crs)
print(f"Grid polygons: {len(grid_gdf)}")
labels = [
"forest",
"agricultural field",
]
result_grid = clip_classify_vector(
vector_data=grid_gdf,
raster_path=raster_path,
labels=labels,
model_name="flax-community/clip-rsicd-v2",
)
Explore results¶
The returned GeoDataFrame has two new columns:
- `clip_label` — the top-1 predicted category
- `clip_confidence` — the softmax confidence score (0 to 1)
result_grid[["geometry", "clip_label", "clip_confidence"]].head(10)
print("Label distribution:")
print(result_grid["clip_label"].value_counts())
print(f"\nMean confidence: {result_grid['clip_confidence'].mean():.3f}")
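The `clip_confidence` values behave like probabilities because CLIP applies a softmax over scaled image–text similarities. A toy numpy sketch of that scoring step (the similarity values and the ~100 logit scale are illustrative assumptions, not outputs of this notebook):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: scores are positive and sum to 1
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy cosine similarities between one image chip and the two label prompts
similarities = np.array([0.31, 0.24])

# CLIP multiplies similarities by a learned logit scale before the softmax,
# which sharpens the distribution considerably
scores = softmax(similarities * 100.0)
print(scores)  # the top-1 score is what ends up in clip_confidence
```

This is why confidences near 1.0 are common even for modest similarity gaps: the logit scale amplifies small differences.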
Visualize classified grid¶
classified_grid = result_grid.dropna(subset=["clip_label"])
color_map = {"forest": "#228B22", "agricultural field": "#DAA520"}
from rasterio.transform import from_bounds

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
with rasterio.open(raster_path) as src:
    scale = 10
    data = src.read(
        out_shape=(src.count, src.height // scale, src.width // scale),
        resampling=rasterio.enums.Resampling.average,
    )
    disp_transform = from_bounds(*src.bounds, data.shape[2], data.shape[1])
show(data, transform=disp_transform, ax=ax)
classified_grid.plot(
    ax=ax,
    alpha=0.5,
    edgecolor="black",
    linewidth=0.2,
    color=[color_map[l] for l in classified_grid["clip_label"]],
)
patches = [mpatches.Patch(color=c, label=l) for l, c in color_map.items()]
ax.legend(handles=patches, loc="upper left", fontsize=10)
ax.set_title("Land-Cover Classification – RS-CLIP zero-shot (300 m grid)", fontsize=14)
ax.axis("off")
plt.tight_layout()
plt.show()
Confidence scores¶
from rasterio.transform import from_bounds

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
with rasterio.open(raster_path) as src:
    scale = 10
    data = src.read(
        out_shape=(src.count, src.height // scale, src.width // scale),
        resampling=rasterio.enums.Resampling.average,
    )
    disp_transform = from_bounds(*src.bounds, data.shape[2], data.shape[1])
show(data, transform=disp_transform, ax=ax)
classified_grid.plot(
    column="clip_confidence",
    ax=ax,
    alpha=0.6,
    edgecolor="black",
    linewidth=0.2,
    cmap="viridis",
    legend=True,
    legend_kwds={"shrink": 0.5, "label": "Confidence"},
)
ax.set_title("CLIP Confidence Scores", fontsize=14)
ax.axis("off")
plt.tight_layout()
plt.show()
Compare with Chesapeake land-cover ground truth¶
The paired land-cover raster encodes 13 Chesapeake Bay Watershed classes. We extract the dominant class within each 300 m grid cell and map it to our two CLIP labels, then measure agreement.
# Chesapeake 13-class → our 2 CLIP labels (with only two candidate labels,
# every class is mapped to whichever of the two it resembles more)
CHESAPEAKE_TO_LABEL = {
    1: "agricultural field",  # Water
    2: "agricultural field",  # Emergent Wetlands
    3: "forest",  # Tree Canopy
    4: "forest",  # Shrub/Scrub
    5: "agricultural field",  # Low Vegetation (grass / crops)
    6: "agricultural field",  # Barren
    7: "agricultural field",  # Structures
    8: "agricultural field",  # Impervious (other)
    9: "agricultural field",  # Impervious Roads
    10: "forest",  # Tree Canopy over Structures
    11: "forest",  # Tree Canopy over Impervious
    12: "forest",  # Tree Canopy over Roads
}
from rasterio.windows import from_bounds as win_from_bounds

with rasterio.open(landcover_path) as lc_src:
    lc_data = lc_src.read(1)
    lc_transform = lc_src.transform
    lc_crs = lc_src.crs

unique_vals = sorted(int(v) for v in set(lc_data.flatten()) if v > 0)
print("Land-cover class values found:", unique_vals)

# Reproject the grid to the land-cover CRS, then take the dominant (modal)
# ground-truth class inside each cell
grid_lc = classified_grid.to_crs(lc_crs)
dominant_gt = []
for geom in grid_lc.geometry:
    win = win_from_bounds(*geom.bounds, lc_transform).round_offsets().round_lengths()
    r0 = max(0, int(win.row_off))
    c0 = max(0, int(win.col_off))
    r1 = min(lc_data.shape[0], r0 + int(win.height))
    c1 = min(lc_data.shape[1], c0 + int(win.width))
    chip = lc_data[r0:r1, c0:c1]
    valid_px = chip[chip > 0]
    if len(valid_px) > 0:
        dom_class = int(np.bincount(valid_px.astype(int)).argmax())
        dominant_gt.append(CHESAPEAKE_TO_LABEL.get(dom_class, "unknown"))
    else:
        dominant_gt.append("unknown")

classified_grid = classified_grid.copy()
classified_grid["gt_label"] = dominant_gt
print(classified_grid[["clip_label", "gt_label"]].head(10))
valid = classified_grid[classified_grid["gt_label"] != "unknown"].copy()
overall_acc = (valid["clip_label"] == valid["gt_label"]).mean()
print(f"Overall agreement (CLIP vs ground truth): {overall_acc:.1%}\n")

print("Per-class breakdown:")
for lbl in labels:
    subset = valid[valid["gt_label"] == lbl]
    if len(subset) == 0:
        continue
    correct = (subset["clip_label"] == lbl).sum()
    print(f"  {lbl:25s}: {correct}/{len(subset)} = {correct / len(subset):.0%}")
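For a fuller picture than per-class accuracy, a confusion matrix shows exactly where CLIP and the ground truth disagree. A sketch with `pandas.crosstab`, using toy series in place of the real `gt_label` and `clip_label` columns:

```python
import pandas as pd

# Toy stand-ins for valid["gt_label"] and valid["clip_label"]
gt = pd.Series(["forest", "forest", "agricultural field", "forest"])
pred = pd.Series(["forest", "agricultural field", "agricultural field", "forest"])

# Rows: ground truth, columns: predictions; each cell counts grid polygons
cm = pd.crosstab(gt, pred, rownames=["ground truth"], colnames=["predicted"])
print(cm)
```

On the real data, replace the toy series with `valid["gt_label"]` and `valid["clip_label"]`.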
from rasterio.transform import from_bounds

fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Read the downsampled backdrop once and reuse it on both panels
with rasterio.open(raster_path) as src:
    scale = 10
    data = src.read(
        out_shape=(src.count, src.height // scale, src.width // scale),
        resampling=rasterio.enums.Resampling.average,
    )
    disp_transform = from_bounds(*src.bounds, data.shape[2], data.shape[1])

for ax, col, title in [
    (axes[0], "clip_label", "RS-CLIP zero-shot prediction"),
    (axes[1], "gt_label", "Chesapeake ground truth"),
]:
    show(data, transform=disp_transform, ax=ax)
    valid.plot(
        ax=ax,
        alpha=0.5,
        edgecolor="black",
        linewidth=0.2,
        color=[color_map[l] for l in valid[col]],
    )
    patches = [mpatches.Patch(color=c, label=l) for l, c in color_map.items()]
    ax.legend(handles=patches, loc="upper left", fontsize=10)
    ax.set_title(title, fontsize=13)
    ax.axis("off")

plt.suptitle(f"Overall Agreement: {overall_acc:.1%}", fontsize=15, y=1.02)
plt.tight_layout()
plt.show()
Top-k predictions with the class API¶
For more control, use CLIPVectorClassifier directly. Setting top_k=2 returns the top 2 predictions per polygon.
classifier = CLIPVectorClassifier(model_name="flax-community/clip-rsicd-v2")
result_topk = classifier.classify(
vector_data=grid_gdf,
raster_path=raster_path,
labels=labels,
top_k=2,
batch_size=32,
)
classified = result_topk.dropna(subset=["clip_label"])
for i in range(min(5, len(classified))):
    row = classified.iloc[i]
    print(f"Polygon {classified.index[i]}:")
    for label, score in zip(row["clip_top_k_labels"], row["clip_top_k_scores"]):
        print(f"  {label:30s} {score:.3f}")
    print()
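The `clip_top_k_labels` / `clip_top_k_scores` pairing is a ranking of the per-label scores, highest first. A sketch of that selection with hypothetical scores (the three labels and values are purely illustrative):

```python
import numpy as np

# Hypothetical per-label softmax scores for one polygon
candidate_labels = ["forest", "agricultural field", "water"]
scores = np.array([0.72, 0.21, 0.07])

# Indices of the top 2 scores, highest first
order = np.argsort(scores)[::-1][:2]
top_k_labels = [candidate_labels[i] for i in order]
top_k_scores = [float(scores[i]) for i in order]
print(top_k_labels, top_k_scores)
# ['forest', 'agricultural field'] [0.72, 0.21]
```

Inspecting the gap between the top two scores is a quick way to flag ambiguous polygons for manual review.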
Export results¶
Save the classified GeoDataFrame to GeoJSON, GeoParquet, or GeoPackage.
result_grid.to_file("classified_landcover.geojson", driver="GeoJSON")
print("Results saved to classified_landcover.geojson")
Summary¶
This notebook demonstrated:
- **Zero-shot classification**: classifying land-cover polygons with no training data, just candidate labels
- **Remote-sensing CLIP model**: using `flax-community/clip-rsicd-v2` for aerial imagery
- **Grid-based land-cover mapping**: 300 m patches classified as forest vs. agricultural field
- **Ground-truth comparison**: ~89% agreement with the Chesapeake Bay Watershed 13-class labels
- **Top-k predictions**: ranked predictions with confidence scores
- **Export**: saving annotated results to standard geospatial formats
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `labels` | Candidate category names | (required) |
| `label_prefix` | Text prefix for CLIP encoding | `"a satellite image of "` |
| `model_name` | Hugging Face CLIP model ID | `"openai/clip-vit-base-patch32"` |
| `top_k` | Number of top predictions per polygon | `1` |
| `batch_size` | Images per inference batch | `16` |
| `min_chip_size` | Minimum chip dimension in pixels | `10` |
| `output_path` | Path to save results (`.geojson`, `.parquet`, `.gpkg`) | `None` |
Model Recommendations¶
| Model | Best for | Notes |
|---|---|---|
| `flax-community/clip-rsicd-v2` | Aerial / satellite imagery | Fine-tuned on RS image captions |
| `openai/clip-vit-base-patch32` | General-purpose images | Default CLIP; best with ground-level photos |
| `openai/clip-vit-large-patch14` | Higher capacity | Larger model; slower but more expressive |