Creating Training Data for Deep Learning¶
This notebook demonstrates how to create training data (image and mask tiles) from georeferenced imagery and vector annotations using the improved export_geotiff_tiles_batch function.
The function now supports three different input modes:
- Single vector file covering all images - Most efficient for large annotation files
- Multiple vector files matched by filename - Good for paired datasets
- Multiple vector files matched by sorted order - Good for sequential datasets
Install package¶
To use the geoai-py package, ensure it is installed in your environment. Uncomment the command below if needed.
# %pip install geoai-py
Setup¶
Import the required functions and check the sample data structure.
import os
import geoai
Download Sample Data¶
url = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/naip_rgb_train_tiles.zip"
download_dir = geoai.download_file(url)
Explore Sample Data¶
The sample data contains:
- images/: Two NAIP RGB image tiles
- masks1/: Single GeoJSON file with all building annotations
- masks2/: Separate GeoJSON files for each image tile
# List available data
data_dir = os.path.join(download_dir, "data")
print("Images:")
for f in sorted(os.listdir(f"{data_dir}/images")):
    print(f" - {f}")
print("\nMasks (single file):")
for f in sorted(os.listdir(f"{data_dir}/masks1")):
    print(f" - {f}")
print("\nMasks (multiple files):")
for f in sorted(os.listdir(f"{data_dir}/masks2")):
    print(f" - {f}")
Visualize Sample Image and Annotations¶
Let's look at one of the images and its corresponding building annotations.
# Load and display first image
image_path = f"{data_dir}/images/naip_rgb_train_tile1.tif"
mask_path = f"{data_dir}/masks2/naip_rgb_train_tile1.geojson"
fig, axes, info = geoai.display_image_with_vector(image_path, mask_path)
print(f"Number of buildings: {info['num_features']}")
Method 1: Single Vector File Covering All Images¶
This is the most efficient method when you have one large annotation file covering multiple image tiles. The function automatically:
- Loads the vector file once
- Spatially filters features for each image based on bounds
- Generates tiles only where features exist (when skip_empty_tiles=True)
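The spatial filtering step can be illustrated with a minimal, pure-Python sketch. It is not the library's implementation: the bounding-box tuples, the `filter_features_by_bounds` helper, and the sample coordinates are all hypothetical, and a real workflow would compare actual geometries in a shared CRS rather than plain boxes.

```python
from typing import Dict, List, Tuple

# Hypothetical box representation: (minx, miny, maxx, maxy)
BBox = Tuple[float, float, float, float]


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """Return True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_features_by_bounds(features: List[Dict], bounds: BBox) -> List[Dict]:
    """Keep only the features whose bounding box intersects the image bounds."""
    return [f for f in features if bboxes_intersect(f["bbox"], bounds)]


# Made-up data: two images side by side and three building footprints.
image_bounds = {
    "tile1.tif": (0.0, 0.0, 100.0, 100.0),
    "tile2.tif": (100.0, 0.0, 200.0, 100.0),
}
features = [
    {"id": 1, "bbox": (10.0, 10.0, 20.0, 20.0)},
    {"id": 2, "bbox": (150.0, 40.0, 160.0, 50.0)},
    {"id": 3, "bbox": (95.0, 5.0, 105.0, 15.0)},  # straddles both images
]

# The feature list is built once and filtered per image.
features_per_image = {
    name: [f["id"] for f in filter_features_by_bounds(features, bounds)]
    for name, bounds in image_bounds.items()
}
print(features_per_image)
```

A feature that straddles an image boundary, like feature 3 here, is assigned to every image it touches.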
# Use single mask file for all images
stats = geoai.export_geotiff_tiles_batch(
    images_folder=f"{data_dir}/images",
    masks_file=f"{data_dir}/masks1/naip_train_buildings.geojson",
    output_folder="output/method1_single_mask",
    tile_size=256,
    stride=256,  # No overlap
    class_value_field="class",
    skip_empty_tiles=True,  # Skip tiles with no buildings
    max_tiles=20,  # Limit for demo purposes
    quiet=False,
)
print(f"\n{'='*60}")
print("Results:")
print(f" Images processed: {stats['processed_pairs']}")
print(f" Total tiles generated: {stats['total_tiles']}")
print(f" Tiles with features: {stats['tiles_with_features']}")
# Guard against division by zero when no tiles were generated
print(
    f" Feature percentage: {stats['tiles_with_features'] / max(stats['total_tiles'], 1) * 100:.1f}%"
)
Method 2: Multiple Vector Files Matched by Sorted Order¶
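This method sorts the image and mask filenames alphabetically and pairs them by position: the 1st image with the 1st mask, the 2nd with the 2nd, and so on. The pairing is only correct if both folders sort into corresponding order.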
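As a rough illustration of positional pairing, the sketch below sorts two lists of names and zips them together. The `pair_by_sorted_order` helper and the file names are hypothetical, and raising an error on a count mismatch is an assumption of this sketch, not documented library behavior.

```python
def pair_by_sorted_order(image_names, mask_names):
    """Pair the i-th sorted image with the i-th sorted mask."""
    images = sorted(image_names)
    masks = sorted(mask_names)
    if len(images) != len(masks):
        # Assumption for this sketch: refuse to pair lists of unequal length.
        raise ValueError(
            f"Expected equal counts, got {len(images)} images and {len(masks)} masks"
        )
    return list(zip(images, masks))


# Hypothetical names that do not share a base name but sort consistently.
pairs = pair_by_sorted_order(
    ["scene_b.tif", "scene_a.tif"],
    ["labels_02.geojson", "labels_01.geojson"],
)
print(pairs)

# Pitfall: plain string sorting puts "tile10" before "tile2".
unpadded = sorted(["tile2.tif", "tile10.tif"])
print(unpadded)
```

Because sorting is lexicographic, zero-padded numbering (tile02, tile10) keeps images and masks aligned, while unpadded numbers can silently pair the wrong files.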
# Use multiple mask files matched by sorted order
stats = geoai.export_geotiff_tiles_batch(
    images_folder=f"{data_dir}/images",
    masks_folder=f"{data_dir}/masks2",
    output_folder="output/method2_sorted_order",
    tile_size=256,
    stride=256,
    class_value_field="class",
    skip_empty_tiles=True,
    match_by_name=False,  # Match by sorted order
    max_tiles=20,
)
print(f"\n{'='*60}")
print("Results:")
print(f" Images processed: {stats['processed_pairs']}")
print(f" Total tiles generated: {stats['total_tiles']}")
print(f" Tiles with features: {stats['tiles_with_features']}")
Method 3: Multiple Vector Files Matched by Filename¶
This method pairs images and masks by matching their base filenames (e.g., image1.tif → image1.geojson).
Note: This requires images and masks to have matching base names. The sample images/ and masks2/ folders follow this convention (for example, naip_rgb_train_tile1.tif and naip_rgb_train_tile1.geojson), so they can be used directly.
# Use multiple mask files matched by base filename
stats = geoai.export_geotiff_tiles_batch(
    images_folder=f"{data_dir}/images",
    masks_folder=f"{data_dir}/masks2",
    output_folder="output/method3_filename_match",
    tile_size=256,
    stride=256,
    class_value_field="class",
    skip_empty_tiles=True,
    match_by_name=True,  # Match by filename
)
print("Method 3 requires matching base filenames between images and masks.")
print("Example: 'image001.tif' pairs with 'image001.geojson'")
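The base-name matching used by Method 3 can be sketched with pathlib. This only illustrates the idea: the `pair_by_basename` helper and the file names are hypothetical, and how the library itself treats unmatched files is not covered here.

```python
from pathlib import Path


def pair_by_basename(image_names, mask_names):
    """Pair files whose names match once the extension is removed.

    Returns (pairs, unmatched_images).
    """
    masks_by_stem = {Path(m).stem: m for m in mask_names}
    pairs, unmatched = [], []
    for image in sorted(image_names):
        mask = masks_by_stem.get(Path(image).stem)
        if mask is None:
            unmatched.append(image)
        else:
            pairs.append((image, mask))
    return pairs, unmatched


# Hypothetical listing in which one image has no annotation file.
pairs, unmatched = pair_by_basename(
    ["image001.tif", "image002.tif", "image003.tif"],
    ["image002.geojson", "image001.geojson"],
)
print(pairs)
print(unmatched)
```

Unlike positional pairing, this approach does not depend on sort order and makes a missing mask visible instead of shifting every later pair.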
Visualize Generated Tiles¶
Let's look at some of the generated training tiles.
output_dir = "output/method1_single_mask"
fig = geoai.display_training_tiles(output_dir, num_tiles=6)
Advanced Usage: Custom Parameters¶
The function supports many parameters for customization:
# Advanced example with custom parameters
stats = geoai.export_geotiff_tiles_batch(
    images_folder=f"{data_dir}/images",
    masks_file=f"{data_dir}/masks1/naip_train_buildings.geojson",
    output_folder="output/advanced_example",
    tile_size=512,  # Larger tiles
    stride=256,  # 50% overlap for better coverage
    class_value_field="class",  # Field containing class labels
    buffer_radius=0.5,  # Add 0.5m buffer around buildings
    skip_empty_tiles=True,  # Skip tiles with no features
    all_touched=True,  # Include pixels touching features
    max_tiles=10,  # Limit number of tiles per image
    quiet=False,  # Show progress
)
print(f"\nGenerated {stats['total_tiles']} tiles with 50% overlap")
print("Output structure:")
print(" - output/advanced_example/images/ (image tiles)")
print(" - output/advanced_example/masks/ (mask tiles)")
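The relationship between tile_size, stride, and overlap in the example above can be checked with a little arithmetic. The sketch below assumes a plain sliding window that drops any tile extending past the image edge; the library may handle edges differently, and the 1024 × 768 image size is made up for illustration.

```python
def overlap_fraction(tile_size: int, stride: int) -> float:
    """Fraction of each tile shared with its neighbor along one axis."""
    return max(0.0, 1.0 - stride / tile_size)


def tile_origins(length: int, tile_size: int, stride: int) -> list:
    """Start offsets of full tiles along one axis (partial edge tiles dropped)."""
    if length < tile_size:
        return []
    return list(range(0, length - tile_size + 1, stride))


# Hypothetical image dimensions in pixels.
width, height = 1024, 768
tile_size, stride = 512, 256

overlap = overlap_fraction(tile_size, stride)
cols = tile_origins(width, tile_size, stride)
rows = tile_origins(height, tile_size, stride)
n_tiles = len(cols) * len(rows)

print(f"Overlap: {overlap:.0%}")
print(f"Column offsets: {cols}, row offsets: {rows}")
print(f"Candidate tiles before skipping empty ones: {n_tiles}")
```

Halving the stride relative to the tile size roughly quadruples the number of candidate tiles on a large image, which is worth keeping in mind together with max_tiles.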
Train a Segmentation Model¶
geoai.train_segmentation_model(
    images_dir="output/method3_filename_match/images",
    labels_dir="output/method3_filename_match/masks",
    output_dir="output/unet_models",
    architecture="unet",
    encoder_name="resnet34",
    encoder_weights="imagenet",
    num_channels=3,
    num_classes=2,  # background and building
    batch_size=8,
    num_epochs=20,
    learning_rate=0.001,
    val_split=0.2,
    verbose=True,
)
geoai.plot_performance_metrics(
    history_path="output/unet_models/training_history.pth",
    figsize=(15, 5),
    verbose=True,
)
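To make the effect of val_split=0.2 concrete, here is a minimal sketch of a seeded train/validation split over a list of tile names. It is an assumption about how such a split might work, not the library's actual procedure; the `split_train_val` helper, the seed, and the tile names are hypothetical.

```python
import random


def split_train_val(items, val_split=0.2, seed=42):
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = sorted(items)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_split))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical tile names standing in for the exported image tiles.
tiles = [f"tile_{i:06d}.tif" for i in range(20)]
train, val = split_train_val(tiles, val_split=0.2)
print(f"{len(train)} training tiles, {len(val)} validation tiles")
```

With 20 tiles and a 0.2 split, 16 tiles are used for training and 4 are held out, so small demo runs (for example with max_tiles set) leave very few validation samples.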
Summary¶
The improved export_geotiff_tiles_batch function provides flexible options for creating training data:
| Method | Use Case | Parameter |
|---|---|---|
| Single vector file | One annotation file covering all images | masks_file="path/to/file.geojson" |
| Multiple files (by name) | Paired files with matching names | masks_folder="path/to/masks", match_by_name=True |
| Multiple files (by order) | Paired files in sorted order | masks_folder="path/to/masks", match_by_name=False |
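The table can also be expressed as a small helper that returns the mask-related keyword arguments for each mode. The `mask_arguments` helper and its mode names are hypothetical conveniences for illustration; only the masks_file, masks_folder, and match_by_name parameters come from the examples in this notebook.

```python
def mask_arguments(mode: str, path: str) -> dict:
    """Return the mask-related keyword arguments for one input mode."""
    if mode == "single_file":
        return {"masks_file": path}
    if mode == "by_name":
        return {"masks_folder": path, "match_by_name": True}
    if mode == "by_order":
        return {"masks_folder": path, "match_by_name": False}
    raise ValueError(f"Unknown mode: {mode!r}")


kwargs_single = mask_arguments("single_file", "path/to/file.geojson")
kwargs_by_name = mask_arguments("by_name", "path/to/masks")
kwargs_by_order = mask_arguments("by_order", "path/to/masks")
print(kwargs_single)
print(kwargs_by_name)
print(kwargs_by_order)
```

The resulting dictionary could then be unpacked into the call, for example geoai.export_geotiff_tiles_batch(images_folder=..., output_folder=..., **kwargs_by_name).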
Key Features:
- Supports both raster and vector masks
- Automatic CRS reprojection
- Spatial filtering for single mask files
- Configurable tile size and stride (which together determine the overlap)
- Optional empty tile filtering
- Buffer support for vector annotations
- Detailed statistics reporting