Fine-Tune DINOv3 for Semantic Segmentation¶
This notebook demonstrates how to fine-tune a DINOv3 Vision Transformer for semantic segmentation of geospatial imagery using the geoai.dinov3_finetune module.
Key Features¶
- DPT Decoder: Dense Prediction Transformer head for multi-scale feature fusion
- Frozen Backbone: Keep the pretrained DINOv3 weights frozen for efficient training
- Optional LoRA: Low-Rank Adaptation for lightweight backbone tuning
- Sliding-Window Inference: Segment large GeoTIFF rasters with overlap-based voting
- Lightning Integration: Built-in checkpointing, early stopping, and logging
Install packages¶
# %pip install geoai-py lightning
Import libraries¶
import geoai
Download Sample Data¶
We use NAIP aerial imagery and building footprint labels hosted on Hugging Face.
train_raster_url = (
"https://huggingface.co/datasets/giswqs/geospatial/resolve/main/naip_rgb_train.tif"
)
train_vector_url = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/naip_train_buildings.geojson"
test_raster_url = (
"https://huggingface.co/datasets/giswqs/geospatial/resolve/main/naip_test.tif"
)
train_raster_path = geoai.download_file(train_raster_url)
train_vector_path = geoai.download_file(train_vector_url)
test_raster_path = geoai.download_file(test_raster_url)
Visualize Sample Data¶
geoai.view_vector_interactive(train_vector_path, tiles=train_raster_path)
Create Training Chips¶
Generate image tiles and corresponding segmentation masks from the raster and vector data.
out_folder = "dinov3_buildings"
tiles = geoai.export_geotiff_tiles(
in_raster=train_raster_path,
out_folder=out_folder,
in_class_data=train_vector_path,
tile_size=512,
stride=256,
buffer_radius=0,
)
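With `tile_size=512` and `stride=256`, overlapping tiles are cut from the raster on a regular grid. As a rough sanity check, the number of full tiles along one axis follows a simple formula; note this is a sketch and geoai's exporter may additionally emit partial edge tiles, so treat the counts as illustrative. The 2048 × 2048 raster size below is hypothetical.

```python
def tiles_along_axis(extent: int, tile_size: int, stride: int) -> int:
    """Number of full tiles along one axis (edge handling may differ in geoai)."""
    if extent < tile_size:
        return 0
    return (extent - tile_size) // stride + 1

# Hypothetical 2048 x 2048 raster with the settings used above
nx = tiles_along_axis(2048, tile_size=512, stride=256)
ny = tiles_along_axis(2048, tile_size=512, stride=256)
print(nx * ny)  # 49 full tiles (7 x 7 grid)
```

A stride of half the tile size gives 50% overlap between neighboring chips, which increases the number of training samples and reduces edge artifacts.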
Prepare the Dataset¶
Create a DINOv3SegmentationDataset from the exported image/mask tile pairs.
import glob
image_paths = sorted(glob.glob(f"{out_folder}/images/*.tif"))
mask_paths = sorted(glob.glob(f"{out_folder}/labels/*.tif"))
print(f"Found {len(image_paths)} image tiles and {len(mask_paths)} mask tiles")
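Because the dataset pairs images and masks by list position, it is worth verifying that the two sorted lists actually align. The sketch below assumes the exporter writes matching basenames for each image/label pair (an assumption; the `tile_000` names are hypothetical).

```python
from pathlib import Path

def check_pairs(image_paths, mask_paths):
    """Return True if image/mask lists align one-to-one by file stem."""
    if len(image_paths) != len(mask_paths):
        return False
    return all(
        Path(img).stem == Path(msk).stem
        for img, msk in zip(image_paths, mask_paths)
    )

# Hypothetical tile names; geoai may use a different naming scheme
imgs = ["dinov3_buildings/images/tile_000.tif", "dinov3_buildings/images/tile_001.tif"]
msks = ["dinov3_buildings/labels/tile_000.tif", "dinov3_buildings/labels/tile_001.tif"]
print(check_pairs(imgs, msks))  # True
```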
from geoai.dinov3_finetune import DINOv3SegmentationDataset
# Use 80% for training and 20% for validation
split = int(0.8 * len(image_paths))
train_dataset = DINOv3SegmentationDataset(
image_paths=image_paths[:split],
mask_paths=mask_paths[:split],
patch_size=16,
target_size=512,
)
val_dataset = DINOv3SegmentationDataset(
image_paths=image_paths[split:],
mask_paths=mask_paths[split:],
patch_size=16,
target_size=512,
)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
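The `patch_size=16` and `target_size=512` settings interact: a ViT splits each input into non-overlapping patches, so the target size should be a multiple of the patch size. A quick sketch of the resulting token grid (standard ViT patching arithmetic, assumed to apply here):

```python
patch_size = 16
target_size = 512

# A ViT tokenizes the image into (target_size / patch_size)^2 patch tokens
assert target_size % patch_size == 0, "target_size must be a multiple of patch_size"
grid = target_size // patch_size
num_patches = grid * grid
print(grid, num_patches)  # 32 1024
```

Each 512 × 512 chip therefore becomes a 32 × 32 grid of patch tokens, which the DPT decoder upsamples back to full resolution.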
Train with Frozen Backbone¶
Train a DPT segmentation decoder on top of a frozen DINOv3 backbone. Only the decoder weights are updated, making training fast and memory-efficient.
model = geoai.train_dinov3_segmentation(
train_dataset=train_dataset,
val_dataset=val_dataset,
model_name="dinov3_vitl16",
num_classes=2,
output_dir=f"{out_folder}/dinov3_frozen",
batch_size=4,
num_epochs=20,
learning_rate=1e-4,
freeze_backbone=True,
use_lora=False,
)
Train with LoRA Adaptation¶
Optionally, apply Low-Rank Adaptation (LoRA) to the backbone attention layers. This adds a small number of trainable parameters to adapt the backbone while keeping most weights frozen.
# model_lora = geoai.train_dinov3_segmentation(
# train_dataset=train_dataset,
# val_dataset=val_dataset,
# model_name="dinov3_vitl16",
# num_classes=2,
# output_dir=f"{out_folder}/dinov3_lora",
# batch_size=4,
# num_epochs=20,
# learning_rate=1e-4,
# freeze_backbone=True,
# use_lora=True,
# lora_rank=4,
# )
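To see why LoRA is lightweight, recall the standard LoRA decomposition: a frozen weight matrix W (d_out × d_in) is augmented as W + (α/r)·BA, where B is d_out × r and A is r × d_in, so only r·(d_out + d_in) parameters train instead of d_out·d_in. A pure-Python sketch (the width of 1024 is illustrative of a ViT-L projection, not read from the geoai source):

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

d = 1024            # illustrative ViT-L attention projection width
full = d * d        # fully fine-tuning one projection matrix
lora = lora_trainable_params(d, d, rank=4)
print(full, lora, f"{lora / full:.2%}")  # 1048576 8192 0.78%
```

With `lora_rank=4`, each adapted projection trains under 1% of the parameters a full fine-tune would touch.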
Run Inference on a GeoTIFF¶
Use the trained model to segment a full GeoTIFF raster with sliding-window inference. Overlapping windows are fused using softmax probability voting.
output_mask = "naip_test_dinov3_prediction.tif"
checkpoint = f"{out_folder}/dinov3_frozen/models/last.ckpt"
geoai.dinov3_segment_geotiff(
input_path=test_raster_path,
output_path=output_mask,
checkpoint_path=checkpoint,
model_name="dinov3_vitl16",
num_classes=2,
window_size=512,
overlap=256,
batch_size=4,
)
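The overlap-based voting can be illustrated in one dimension: each window contributes per-pixel class probabilities, overlapping contributions are averaged, and the final label is the argmax. This is a minimal sketch of the fusion idea, not geoai's actual implementation (which operates on 2D rasters and may weight windows differently).

```python
def fuse_windows(length, windows, num_classes):
    """Average per-pixel class probabilities from overlapping windows, then argmax.

    `windows` is a list of (start, probs) where probs[i][c] is the probability
    of class c at pixel start + i.
    """
    sums = [[0.0] * num_classes for _ in range(length)]
    counts = [0] * length
    for start, probs in windows:
        for i, p in enumerate(probs):
            for c in range(num_classes):
                sums[start + i][c] += p[c]
            counts[start + i] += 1
    return [
        max(range(num_classes), key=lambda c: sums[i][c] / counts[i])
        for i in range(length)
    ]

# Two windows of length 4 with stride 2 (50% overlap), two classes
w0 = (0, [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]])
w1 = (2, [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.1, 0.9]])
print(fuse_windows(6, [w0, w1], num_classes=2))  # [0, 0, 0, 1, 1, 1]
```

Averaging over overlaps smooths out disagreements near window edges, where a single window's predictions are least reliable.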
Vectorize and Visualize Results¶
Convert the segmentation mask to vector polygons and display them on the raster.
output_vector = "naip_test_dinov3_prediction.geojson"
gdf = geoai.orthogonalize(output_mask, output_vector, epsilon=2)
gdf_props = geoai.add_geometric_properties(gdf, area_unit="m2", length_unit="m")
geoai.view_raster(output_mask, nodata=0, basemap=test_raster_path, backend="ipyleaflet")
gdf_filtered = gdf_props[gdf_props["area_m2"] > 50]
geoai.view_vector_interactive(gdf_filtered, column="area_m2", tiles=test_raster_path)
geoai.create_split_map(
left_layer=gdf_filtered,
right_layer=test_raster_path,
left_args={"style": {"color": "red", "fillOpacity": 0.2}},
basemap=test_raster_path,
)
Summary¶
This notebook demonstrated:
- Data preparation: Creating image/mask tile pairs from raster and vector data
- Frozen backbone training: Training only the DPT decoder for fast convergence
- LoRA adaptation: Optionally adding low-rank adaptation to the backbone
- Sliding-window inference: Segmenting large GeoTIFFs with overlap voting
- Post-processing: Vectorization, filtering, and visualization
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `model_name` | DINOv3 hub model identifier | `"dinov3_vitl16"` |
| `num_classes` | Number of segmentation classes | `2` |
| `freeze_backbone` | Keep backbone weights frozen | `True` |
| `use_lora` | Apply LoRA to attention layers | `False` |
| `lora_rank` | Rank of LoRA decomposition | `4` |
| `decoder_features` | DPT decoder hidden dimension | `256` |
| `window_size` | Sliding window size for inference | `512` |
| `overlap` | Overlap between inference windows | `256` |
Next Steps¶
- Try different DINOv3 backbone sizes (`dinov3_vits16`, `dinov3_vitb16`, `dinov3_vitl16`)
- Experiment with LoRA rank and alpha for backbone adaptation
- Use class weights for imbalanced datasets
- Apply to multi-class segmentation tasks (land cover, crop mapping)