Moondream Vision Language Model - Sliding Window for Large Images¶
This notebook demonstrates how to use the sliding window methods to process large images with the Moondream vision language model. The sliding window approach divides a large image into smaller overlapping tiles, processes each tile separately, and combines the per-tile results.
Why Sliding Window?¶
- Better Performance on Large Images: Moondream VLM processes smaller image tiles more effectively than very large images
- Memory Efficiency: Reduces memory requirements by processing one tile at a time
- Better Detail Recognition: Smaller tiles allow the model to focus on finer details
- Overlap Handling: Overlapping tiles prevent missing objects at tile boundaries
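To make the tiling idea concrete, here is a minimal sketch of how overlapping windows could be laid out over an image. The function name `sliding_windows` and the edge handling (adding a final window flush with the image border) are assumptions for illustration; geoai's internal tiling may differ.

```python
def sliding_windows(width, height, window_size=512, overlap=64):
    """Return (x, y, w, h) pixel windows covering a width x height image.

    Hypothetical helper for illustration, not part of geoai.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    stride = window_size - overlap  # distance between consecutive window origins

    def starts(length):
        # The whole axis fits in a single window.
        if length <= window_size:
            return [0]
        offsets = list(range(0, length - window_size + 1, stride))
        # Add a final window flush with the edge so no pixels are skipped.
        if offsets[-1] + window_size < length:
            offsets.append(length - window_size)
        return offsets

    return [
        (x, y, min(window_size, width - x), min(window_size, height - y))
        for y in starts(height)
        for x in starts(width)
    ]


tiles = sliding_windows(1200, 800, window_size=512, overlap=64)
print(len(tiles))   # 6
print(tiles[:3])    # [(0, 0, 512, 512), (448, 0, 512, 512), (688, 0, 512, 512)]
```

With a 512-pixel window and 64 pixels of overlap, consecutive windows start 448 pixels apart, so an object cut by one tile edge has a chance of falling fully inside a neighboring tile.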
Available Sliding Window Methods¶
- detect_sliding_window(): Object detection with bounding boxes
- point_sliding_window(): Point detection for object locations
- query_sliding_window(): Visual question answering
- caption_sliding_window(): Image captioning
Install packages¶
Uncomment the following line to install the required packages.
# %pip install -U geoai-py
Import libraries¶
import leafmap
from geoai import MoondreamGeo
import geoai
Download sample data¶
We'll use a GeoTIFF image of a parking lot that is large enough to benefit from sliding window processing.
# Download a sample large image
url = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/parking_lot.tif"
image_path = geoai.download_file(url)
image_path
Visualize the image¶
Let's first visualize the sample image on an interactive map.
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
m
Initialize the Moondream processor¶
Load the Moondream2 model. The first time you run this, the model is downloaded from Hugging Face (~3.7GB).
processor = MoondreamGeo(
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
    device="cuda",  # Use "cpu" if you don't have a GPU
)
1. Object Detection with Sliding Window¶
Detect objects in large images using the sliding window approach. The method automatically:
- Divides the image into overlapping tiles
- Detects objects in each tile
- Applies Non-Maximum Suppression (NMS) to merge overlapping detections
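The NMS step can be sketched in plain Python as follows. This is a generic greedy NMS, not geoai's actual implementation: the box format `(x_min, y_min, x_max, y_max)` and the per-box `scores` used for ranking are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    `scores` is a hypothetical ranking value; any consistent ordering works.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# The first two boxes describe the same object seen from two overlapping tiles.
boxes = [(100, 100, 200, 200), (105, 102, 205, 198), (400, 400, 450, 450)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
print(nms(boxes, scores, iou_threshold=0.9))  # [0, 1, 2]
```

The two near-identical boxes have an IoU of about 0.87, so a threshold of 0.5 drops the duplicate while a threshold of 0.9 keeps both.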
Key Parameters:¶
- window_size: Size of each tile (default: 512)
- overlap: Overlap between tiles (default: 64)
- iou_threshold: IoU threshold for NMS (default: 0.5)
# Detect cars using sliding window
result = processor.detect_sliding_window(
    image_path,
    "car",
    window_size=512,
    overlap=64,
    iou_threshold=0.5,
    output_path="cars_sliding_window.geojson",
)
print(f"Detected {len(result['objects'])} cars")
Visualize Detection Results¶
# View the GeoDataFrame
if "gdf" in result:
    display(result["gdf"].head())
# Visualize on map
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in result:
    m.add_gdf(
        result["gdf"],
        layer_name="Detected Cars",
        style={"color": "red", "fillOpacity": 0.3},
    )
m
Detect Buildings¶
# Detect buildings using sliding window
buildings = processor.detect_sliding_window(
    image_path,
    "building",
    window_size=512,
    overlap=64,
    output_path="buildings_sliding_window.geojson",
)
print(f"Detected {len(buildings['objects'])} buildings")
2. Point Detection with Sliding Window¶
Find specific object locations as points across large images.
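Points found inside a tile have to be mapped back to the full image before they can be georeferenced. The sketch below assumes each point is returned with `x` and `y` normalized to the tile (range 0 to 1) and that the raster's affine transform is given as six coefficients `(a, b, c, d, e, f)` in the rasterio convention; both helper functions are hypothetical and are not part of geoai.

```python
def tile_point_to_image(point, window):
    """Map a tile-normalized point to full-image pixel coordinates.

    `window` is (x_offset, y_offset, width, height) of the tile in pixels.
    """
    x_off, y_off, w, h = window
    return (x_off + point["x"] * w, y_off + point["y"] * h)


def pixel_to_geo(col, row, transform):
    """Apply an affine transform (a, b, c, d, e, f) to a pixel position."""
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)


window = (448, 0, 512, 512)                       # second tile in the first row
col, row = tile_point_to_image({"x": 0.5, "y": 0.25}, window)
print(col, row)                                   # 704.0 128.0

# Example transform: 0.5 map units per pixel, origin at (1000, 2000), north-up.
transform = (0.5, 0.0, 1000.0, 0.0, -0.5, 2000.0)
print(pixel_to_geo(col, row, transform))          # (1352.0, 1936.0)
```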
# Find tree locations using sliding window
trees = processor.point_sliding_window(
    image_path,
    "tree",
    window_size=512,
    overlap=64,
    output_path="trees_sliding_window.geojson",
)
print(f"Found {len(trees['points'])} tree locations")
# Visualize tree locations
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in trees:
    m.add_gdf(trees["gdf"], layer_name="Trees", style={"color": "green", "radius": 3})
m
3. Visual Question Answering with Sliding Window¶
Ask a question about a large image. Each tile is queried separately, and the tile answers are combined according to combine_strategy ("concatenate" or "summarize").
# Query with concatenation
result = processor.query_sliding_window(
    "What types of vehicles are visible?",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="concatenate",
)
print("Combined Answer:")
print(result["answer"])
# Query with summarization (requires additional model call)
result = processor.query_sliding_window(
    "Describe the land use and features in this area.",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="summarize",
)
print("Summary:")
print(result["answer"])
# View individual tile answers
print("\nIndividual Tile Answers:")
for tile in result["tile_answers"][:3]:  # Show first 3 tiles
    print(f"Tile {tile['tile_id']}: {tile['answer']}")
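As a rough illustration of what a concatenation-style combine step might do, the sketch below joins per-tile answers into one string. It reuses the `tile_id` and `answer` keys shown above; the function `concatenate_answers` and the removal of repeated answers are assumptions for illustration, not a description of how geoai combines results.

```python
def concatenate_answers(tile_answers):
    """Join per-tile answers, skipping empty and repeated ones (hypothetical)."""
    seen = set()
    parts = []
    for tile in tile_answers:
        answer = tile["answer"].strip()
        key = answer.lower()  # treat answers differing only in case as repeats
        if answer and key not in seen:
            seen.add(key)
            parts.append(f"Tile {tile['tile_id']}: {answer}")
    return "\n".join(parts)


# Made-up tile answers for demonstration only.
sample_answers = [
    {"tile_id": 0, "answer": "Cars and a delivery van."},
    {"tile_id": 1, "answer": "cars and a delivery van."},
    {"tile_id": 2, "answer": "  "},
    {"tile_id": 3, "answer": "Two buses."},
]
combined = concatenate_answers(sample_answers)
print(combined)
```

A summarize-style strategy would instead pass the collected tile answers back to the model in one more call, which is why it costs extra time.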
4. Image Captioning with Sliding Window¶
Generate comprehensive captions for large images by captioning tiles and combining them.
# Generate caption with concatenation
result = processor.caption_sliding_window(
    image_path,
    window_size=512,
    overlap=64,
    length="normal",
    combine_strategy="concatenate",
)
print("Combined Caption:")
print(result["caption"])
# Generate caption with summarization for better coherence
result = processor.caption_sliding_window(
    image_path, window_size=512, overlap=64, length="long", combine_strategy="summarize"
)
print("Summarized Caption:")
print(result["caption"])
Using Convenience Functions¶
You can also use the convenience functions for one-off processing without creating a processor instance.
from geoai import moondream_detect_sliding_window
# Quick detection
result = moondream_detect_sliding_window(
    image_path,
    "parking space",
    window_size=512,
    overlap=64,
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
)
print(f"Detected {len(result['objects'])} parking spaces")
Performance Tips¶
Window Size:
- Smaller windows (256-512): Better for small objects, more tiles to process
- Larger windows (512-1024): Faster processing, may miss small objects
Overlap:
- Larger overlap (64-128): Better for objects at tile boundaries, slower
- Smaller overlap (32-64): Faster, may miss objects at boundaries
IoU Threshold (for detection):
- Higher (0.6-0.8): Keeps more detections, may have duplicates
- Lower (0.3-0.5): More aggressive merging, may lose some objects
Combine Strategy:
- concatenate: Faster, preserves all information
- summarize: Better quality, requires extra model call
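The cost side of these trade-offs can be estimated before running the model, since each tile needs at least one model call. The sketch below counts tiles for a few settings; the 4096 x 4096 image size is a made-up example, and the formula assumes a stride of `window_size - overlap` with one extra window at the edge when needed, which may not match geoai's tiling exactly.

```python
import math


def tiles_per_axis(length, window_size, overlap):
    """Number of windows needed to cover `length` pixels along one axis."""
    if length <= window_size:
        return 1
    stride = window_size - overlap
    return math.ceil((length - window_size) / stride) + 1


def tile_count(width, height, window_size, overlap):
    """Total number of tiles for a width x height image."""
    return tiles_per_axis(width, window_size, overlap) * tiles_per_axis(
        height, window_size, overlap
    )


# Hypothetical 4096 x 4096 image with three window/overlap settings.
for window_size, overlap in [(256, 32), (512, 64), (1024, 128)]:
    n = tile_count(4096, 4096, window_size, overlap)
    print(f"window_size={window_size}, overlap={overlap}: {n} tiles")
# window_size=256, overlap=32: 361 tiles
# window_size=512, overlap=64: 81 tiles
# window_size=1024, overlap=128: 25 tiles
```

Halving the window size roughly quadruples the number of tiles, so small windows are best reserved for cases where small objects would otherwise be missed.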
Compare: Regular vs Sliding Window¶
Let's compare regular detection with sliding window detection.
# Regular detection (without sliding window)
regular_result = processor.detect(image_path, "car")
print(f"Regular detection: {len(regular_result['objects'])} cars")
# Sliding window detection
sliding_result = processor.detect_sliding_window(
    image_path, "car", window_size=512, overlap=64
)
print(f"Sliding window detection: {len(sliding_result['objects'])} cars")
print(
    f"\nDifference: {len(sliding_result['objects']) - len(regular_result['objects'])} more detections"
)
Summary¶
This notebook demonstrated the sliding window methods for Moondream VLM:
- Object Detection: Process large images in tiles with NMS for merging
- Point Detection: Find object locations across large images
- Query: Answer questions about large images by querying tiles
- Caption: Generate comprehensive captions by combining tile descriptions
The sliding window approach is particularly useful for:
- Very large satellite/aerial imagery
- High-resolution images where details matter
- Scenes with many small objects
- Memory-constrained environments
Next Steps¶
- Try different window sizes and overlaps for your use case
- Experiment with both combine strategies for queries and captions
- Use georeferenced outputs with GIS tools
- Combine with other geoai tools for complete workflows