Moondream Vision Language Model - Sliding Window for Large Images

This notebook demonstrates how to use the sliding window methods to process large images with the Moondream vision language model (VLM). The sliding window approach divides a large image into smaller overlapping tiles and processes each tile separately.

Why Sliding Window?

Better Performance on Large Images: Moondream VLM processes smaller image tiles more effectively than very large images
Memory Efficiency: Reduces memory requirements by processing one tile at a time
Better Detail Recognition: Smaller tiles allow the model to focus on finer details
Overlap Handling: Overlapping tiles prevent missing objects at tile boundaries
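
To make the tiling concrete, here is a minimal sketch of how overlapping tiles could be laid out over an image. The helper names `tile_origins` and `tile_windows` are hypothetical and are not part of geoai; the library's actual tiling (for example, how it treats the last row or column) may differ. The defaults mirror the `window_size=512` and `overlap=64` values used later in this notebook.

```python
def tile_origins(length, window_size=512, overlap=64):
    """Return start offsets of tiles covering `length` pixels along one axis."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    if length <= window_size:
        return [0]
    stride = window_size - overlap
    starts = list(range(0, length - window_size + 1, stride))
    # Add a final tile flush with the edge so no pixels are left uncovered.
    if starts[-1] + window_size < length:
        starts.append(length - window_size)
    return starts


def tile_windows(width, height, window_size=512, overlap=64):
    """Return (x, y, w, h) pixel windows covering a width x height image."""
    return [
        (x, y, min(window_size, width - x), min(window_size, height - y))
        for y in tile_origins(height, window_size, overlap)
        for x in tile_origins(width, window_size, overlap)
    ]
```

With these defaults the stride between tiles is 448 pixels, so a 1000-pixel-wide image gets tiles starting at columns 0, 448, and 488. Any object that straddles one tile's edge falls fully inside a neighbouring tile as long as it is smaller than the overlap.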

Available Sliding Window Methods

detect_sliding_window() - Object detection with bounding boxes
point_sliding_window() - Point detection for object locations
query_sliding_window() - Visual question answering
caption_sliding_window() - Image captioning

Install packages

Uncomment the following line to install the required packages.

# %pip install -U geoai-py

Import libraries

import leafmap
from geoai import MoondreamGeo
import geoai

Download sample data

We'll use a sample GeoTIFF of a parking lot for demonstration, an image large enough to benefit from sliding window processing.

# Download a sample large image
url = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/parking_lot.tif"
image_path = geoai.download_file(url)
image_path

Visualize the image

Let's first visualize the sample image on an interactive map.

m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
m

Initialize the Moondream processor

Load the Moondream2 model. The first time you run this cell, the model weights (about 3.7 GB) are downloaded from Hugging Face.

processor = MoondreamGeo(
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
    device="cuda",  # Use "cpu" if you don't have a GPU
)

1. Object Detection with Sliding Window

Detect objects in large images using the sliding window approach. The method automatically:

Divides the image into overlapping tiles
Detects objects in each tile
Applies Non-Maximum Suppression (NMS) to merge overlapping detections

Key Parameters:

window_size: Size of each tile in pixels (default: 512)
overlap: Overlap between adjacent tiles in pixels (default: 64)
iou_threshold: IoU threshold above which overlapping detections are merged by NMS (default: 0.5)
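
The role of `iou_threshold` is easier to see in code. The following is a minimal sketch of intersection over union (IoU) and greedy NMS on boxes already expressed in full-image pixel coordinates. It is not geoai's implementation: the `scores` argument is an assumption made for illustration, and this notebook does not state how the library ranks boxes when merging.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score (hypothetical
    ranking) and keep a box only if it does not overlap an already kept box
    by more than `iou_threshold`. Returns the indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes (IoU is about 0.68) and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
print(nms(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]
```

The same car seen in two overlapping tiles typically produces two boxes like the first pair above. At a threshold of 0.5 they collapse into one detection, while at 0.7 both survive.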

# Detect cars using sliding window
result = processor.detect_sliding_window(
    image_path,
    "car",
    window_size=512,
    overlap=64,
    iou_threshold=0.5,
    output_path="cars_sliding_window.geojson",
)

print(f"Detected {len(result['objects'])} cars")

Visualize Detection Results

# View the GeoDataFrame
if "gdf" in result:
    display(result["gdf"].head())

# Visualize on map
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in result:
    m.add_gdf(
        result["gdf"],
        layer_name="Detected Cars",
        style={"color": "red", "fillOpacity": 0.3},
    )
m

Detect Buildings

# Detect buildings using sliding window
buildings = processor.detect_sliding_window(
    image_path,
    "building",
    window_size=512,
    overlap=64,
    output_path="buildings_sliding_window.geojson",
)

print(f"Detected {len(buildings['objects'])} buildings")

2. Point Detection with Sliding Window

Find specific object locations as points across large images.
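
Points found inside a tile have to be shifted back into the coordinate system of the full image before they can be combined. The sketch below shows that step under the assumption that the model reports points as normalized (0 to 1) coordinates relative to the tile. The function name and argument layout are hypothetical, and geoai's internal conversion is not shown in this notebook.

```python
def tile_point_to_image(point, tile_origin, tile_size):
    """Map a tile-local normalized point to full-image pixel coordinates.

    point: (x, y) in [0, 1] relative to the tile (assumed convention).
    tile_origin: (x_off, y_off) pixel position of the tile's top-left corner.
    tile_size: (width, height) of the tile in pixels.
    """
    x_off, y_off = tile_origin
    width, height = tile_size
    return (x_off + point[0] * width, y_off + point[1] * height)


# A point at the horizontal centre, one quarter down, of the tile at x=448.
print(tile_point_to_image((0.5, 0.25), (448, 0), (512, 512)))  # (704.0, 128.0)
```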

# Find tree locations using sliding window
trees = processor.point_sliding_window(
    image_path,
    "tree",
    window_size=512,
    overlap=64,
    output_path="trees_sliding_window.geojson",
)

print(f"Found {len(trees['points'])} tree locations")

# Visualize tree locations
m = leafmap.Map()
m.add_raster(image_path, layer_name="Satellite Image")
if "gdf" in trees:
    m.add_gdf(trees["gdf"], layer_name="Trees", style={"color": "green", "radius": 3})
m

3. Visual Question Answering with Sliding Window

Query large images by processing them in tiles and combining answers.

Combine Strategies:

concatenate: Simply join all tile answers (faster)
summarize: Use the model to create a coherent summary (better quality)
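
The difference between the two strategies can be sketched as follows. This is an illustration only, not geoai's code: `combine_answers` and its `summarizer` argument are hypothetical names, and the summarizer stands in for whatever extra model call the library makes.

```python
def combine_answers(tile_answers, strategy="concatenate", summarizer=None):
    """Merge per-tile answers into a single string."""
    # Drop empty answers so blank tiles do not pollute the result.
    texts = [a.strip() for a in tile_answers if a and a.strip()]
    if strategy == "concatenate":
        # No extra model call: join the tile answers as they are.
        return "\n".join(texts)
    if strategy == "summarize":
        if summarizer is None:
            raise ValueError("summarize requires a summarizer callable")
        # One extra call that condenses all tile answers into one text.
        return summarizer(texts)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Concatenation costs nothing beyond the per-tile queries but can repeat the same observation many times. Summarization trades one more model call for a shorter, more coherent answer.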

# Query with concatenation
result = processor.query_sliding_window(
    "What types of vehicles are visible?",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="concatenate",
)

print("Combined Answer:")
print(result["answer"])

# Query with summarization (requires additional model call)
result = processor.query_sliding_window(
    "Describe the land use and features in this area.",
    image_path,
    window_size=512,
    overlap=64,
    combine_strategy="summarize",
)

print("Summary:")
print(result["answer"])

# View individual tile answers
print("\nIndividual Tile Answers:")
for tile in result["tile_answers"][:3]:  # Show first 3 tiles
    print(f"Tile {tile['tile_id']}: {tile['answer']}")

4. Image Captioning with Sliding Window

Generate comprehensive captions for large images by captioning tiles and combining them.

# Generate caption with concatenation
result = processor.caption_sliding_window(
    image_path,
    window_size=512,
    overlap=64,
    length="normal",
    combine_strategy="concatenate",
)

print("Combined Caption:")
print(result["caption"])

# Generate caption with summarization for better coherence
result = processor.caption_sliding_window(
    image_path, window_size=512, overlap=64, length="long", combine_strategy="summarize"
)

print("Summarized Caption:")
print(result["caption"])

Using Convenience Functions

You can also use the convenience functions for one-off processing without creating a processor instance.

from geoai import moondream_detect_sliding_window

# Quick detection
result = moondream_detect_sliding_window(
    image_path,
    "parking space",
    window_size=512,
    overlap=64,
    model_name="vikhyatk/moondream2",
    revision="2025-06-21",
)

print(f"Detected {len(result['objects'])} parking spaces")

Performance Tips

Window Size:
- Smaller windows (256-512): Better for small objects, more tiles to process
- Larger windows (512-1024): Faster processing, may miss small objects
Overlap:
- Larger overlap (64-128): Better for objects at tile boundaries, slower
- Smaller overlap (32-64): Faster, may miss objects at boundaries
IoU Threshold (for detection):
- Higher (0.6-0.8): Keeps more detections, may have duplicates
- Lower (0.3-0.5): More aggressive merging, may lose some objects
Combine Strategy:
- concatenate: Faster, preserves all information
- summarize: Better quality, requires extra model call
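
A quick way to weigh these trade-offs is to estimate how many tiles, and therefore how many model calls, a given setting produces. The sketch below assumes tiles advance by `window_size - overlap` pixels with a final tile aligned to the image edge. `count_tiles` is a hypothetical helper, and the 4096 x 4096 image size is an illustrative assumption rather than the size of the sample image.

```python
import math


def count_tiles(width, height, window_size, overlap):
    """Estimate the number of tiles for a width x height image."""
    stride = window_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window_size")

    def along(length):
        # One tile if the image fits in a window, otherwise enough strides
        # to reach the far edge plus the first tile.
        if length <= window_size:
            return 1
        return math.ceil((length - window_size) / stride) + 1

    return along(width) * along(height)


for window_size in (256, 512, 1024):
    tiles = count_tiles(4096, 4096, window_size, overlap=64)
    print(f"window_size={window_size}: {tiles} tiles")
```

For this hypothetical image, the three settings give 441, 81, and 25 tiles respectively, so halving the window size multiplies the number of model calls by roughly five.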

Compare: Regular vs Sliding Window

Let's compare regular detection with sliding window detection.

# Regular detection (without sliding window)
regular_result = processor.detect(image_path, "car")
print(f"Regular detection: {len(regular_result['objects'])} cars")

# Sliding window detection
sliding_result = processor.detect_sliding_window(
    image_path, "car", window_size=512, overlap=64
)
print(f"Sliding window detection: {len(sliding_result['objects'])} cars")

print(
    f"\nDifference (sliding window minus regular): {len(sliding_result['objects']) - len(regular_result['objects'])} detections"
)

Summary

This notebook demonstrated the sliding window methods for the Moondream VLM:

Object Detection: Process large images in tiles with NMS for merging
Point Detection: Find object locations across large images
Query: Answer questions about large images by querying tiles
Caption: Generate comprehensive captions by combining tile descriptions

The sliding window approach is particularly useful for:

Very large satellite/aerial imagery
High-resolution images where details matter
Scenes with many small objects
Memory-constrained environments

Next Steps

Try different window sizes and overlaps for your use case
Experiment with both combine strategies for queries and captions
Use georeferenced outputs with GIS tools
Combine with other geoai tools for complete workflows
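
As a starting point for working with the georeferenced outputs, the GeoJSON files written via `output_path` can be inspected with the Python standard library before loading them into a GIS tool. The sketch below assumes the file is a standard GeoJSON FeatureCollection. `summarize_geojson` is a hypothetical helper, and the file name is the one passed to `output_path` in the car detection example.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_geojson(path):
    """Count features per geometry type in a GeoJSON FeatureCollection."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    features = data.get("features", [])
    # Features with a null geometry are valid GeoJSON, so skip them here.
    return Counter(f["geometry"]["type"] for f in features if f.get("geometry"))


if __name__ == "__main__":
    path = Path("cars_sliding_window.geojson")  # written by the detection cell
    if path.exists():
        print(summarize_geojson(path))
```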