Catalog Search Agent - Find Datasets with Natural Language¶
This notebook demonstrates how to use the Catalog Search Agent to find datasets in data catalogs using natural language queries.
The Catalog Agent can:
- Search through data catalogs (TSV, CSV, JSON formats)
- Search by geographic region (location names or bounding boxes)
- Understand natural language queries
- Filter by keywords, dataset type, and provider
- Return structured results with metadata
- Work with Earth Engine Data Catalog and custom catalogs
Installation¶
Uncomment the following line to install geoai if needed.
# %pip install "geoai[agents]"
Import libraries¶
import json
from geoai.agents import (
    CatalogAgent,
    CatalogTools,
    create_ollama_model,
    create_openai_model,
    create_anthropic_model,
)
Load Earth Engine Data Catalog¶
The Earth Engine Data Catalog is available in JSON format from the opengeos/Earth-Engine-Catalog repository.
Important: Use the JSON format (not TSV) to enable spatial search capabilities with bounding box information.
catalog_url = "https://raw.githubusercontent.com/opengeos/Earth-Engine-Catalog/refs/heads/master/gee_catalog.json"
Create a model¶
You can create a model with the following functions:
- create_ollama_model: Create a model using Ollama. You will need to install Ollama separately and pull the model you want to use, such as llama3.1.
- create_openai_model: Create a model using OpenAI. You will need an OpenAI API key. Set it in the OPENAI_API_KEY environment variable.
- create_anthropic_model: Create a model using Anthropic. You will need an Anthropic API key. Set it in the ANTHROPIC_API_KEY environment variable.
model = create_ollama_model(model="llama3.1")
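Since the OpenAI and Anthropic backends depend on environment variables, it can help to check which keys are set before choosing a backend. The following is a minimal sketch using only the standard library. `pick_backend` is a hypothetical helper, not part of geoai, and the precedence order (OpenAI, then Anthropic, then local Ollama) is an arbitrary choice for illustration.

```python
import os


def pick_backend(env=None):
    """Return the name of the first backend whose API key is set.

    Hypothetical helper: falls back to "ollama", which needs no API key.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return "ollama"


# Inspect the current environment
print(pick_backend())
```

The returned name can then be used to decide which of the three model-creation functions listed above to call.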
Use CatalogTools Directly (Fast)¶
For faster searches without LLM overhead, you can use CatalogTools directly:
tools = CatalogTools(catalog_url=catalog_url)
Get catalog statistics¶
stats = json.loads(tools.get_catalog_stats())
print(f"Total datasets: {stats['total_datasets']}")
print("\nDataset types:")
for dtype, count in stats.get("dataset_types", {}).items():
    print(f"  {dtype}: {count}")
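To make the shape of these statistics concrete, here is a rough sketch of how a summary with the same two keys (`total_datasets` and `dataset_types`) could be computed from a list of records. The records and the `summarize` function are hypothetical, and this is not necessarily how `get_catalog_stats` works internally.

```python
from collections import Counter

# Hypothetical records; real catalog entries carry many more fields.
records = [
    {"id": "demo/dem", "type": "image"},
    {"id": "demo/landcover", "type": "image_collection"},
    {"id": "demo/ndvi", "type": "image_collection"},
    {"id": "demo/roads", "type": "table"},
]


def summarize(records):
    """Count records overall and per dataset type."""
    counts = Counter(rec.get("type", "unknown") for rec in records)
    return {"total_datasets": len(records), "dataset_types": dict(counts)}


stats = summarize(records)
print(f"Total datasets: {stats['total_datasets']}")
for dtype, count in sorted(stats["dataset_types"].items()):
    print(f"  {dtype}: {count}")
```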
Search for datasets by keyword¶
result = json.loads(tools.search_datasets(keywords="landcover", max_results=5))
print(f"Found {result['dataset_count']} datasets\n")
for ds in result["datasets"]:
    print(f"ID: {ds['id']}")
    print(f"Title: {ds['title']}")
    print(f"Provider: {ds.get('provider', 'N/A')}")
    print("-" * 80)
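As an illustration of what a keyword search over catalog records might involve, the sketch below does case-insensitive substring matching across a few text fields and caps the result count. The records, the field choice, and the `keyword_search` function are assumptions made for this example. The actual matching logic in `CatalogTools` may differ, and here `dataset_count` simply counts the records returned.

```python
# Hypothetical records for illustration; not actual catalog entries.
records = [
    {"id": "demo/landcover_2020", "title": "Global Landcover 2020", "provider": "Example Org"},
    {"id": "demo/dem_30m", "title": "Elevation Model 30 m", "provider": "Example Org"},
    {"id": "demo/landcover_us", "title": "US Landcover", "provider": "Another Org"},
]


def keyword_search(records, keywords, max_results=10):
    """Return records whose id, title, or provider contain every keyword."""
    terms = keywords.lower().split()
    hits = []
    for rec in records:
        haystack = " ".join(
            str(rec.get(field, "")) for field in ("id", "title", "provider")
        ).lower()
        if all(term in haystack for term in terms):
            hits.append(rec)
    hits = hits[:max_results]  # cap the number of returned records
    return {"dataset_count": len(hits), "datasets": hits}


result = keyword_search(records, "landcover", max_results=5)
for ds in result["datasets"]:
    print(ds["id"], "|", ds["title"])
```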
Search by geographic region (NEW!)¶
Find datasets covering a specific location or bounding box:
# Search by location name
result = json.loads(
    tools.search_by_region(location="California", keywords="elevation", max_results=5)
)
print(f"Found {result['dataset_count']} datasets covering California\n")
for ds in result["datasets"]:
    print(f"ID: {ds['id']}")
    print(f"Title: {ds['title']}")
    print(f"Bbox: {ds.get('bbox', 'N/A')}")
    print("-" * 80)

# Search by bounding box coordinates
# San Francisco Bay Area: [west, south, east, north]
result = json.loads(
    tools.search_by_region(
        bbox=[-122.5, 37.5, -122.0, 38.0], keywords="landcover", max_results=3
    )
)
print(f"Found {result['dataset_count']} datasets\n")
for ds in result["datasets"]:
    print(f"ID: {ds['id']}")
    print(f"Title: {ds['title']}")
    print("-" * 80)
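A spatial search of this kind generally comes down to testing whether a dataset's bounding box overlaps the query box. The sketch below shows a standard axis-aligned overlap test for boxes given as `[west, south, east, north]`. It is an assumption about the general technique, not a description of how `search_by_region` is implemented. The `bboxes_intersect` function and the example extents are hypothetical, and boxes that cross the antimeridian are not handled.

```python
def bboxes_intersect(a, b):
    """Return True if two [west, south, east, north] boxes overlap.

    Boxes that only touch along an edge count as overlapping.
    Does not handle boxes that cross the antimeridian.
    """
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    overlap_lon = a_west <= b_east and b_west <= a_east
    overlap_lat = a_south <= b_north and b_south <= a_north
    return overlap_lon and overlap_lat


bay_area = [-122.5, 37.5, -122.0, 38.0]
global_extent = [-180.0, -90.0, 180.0, 90.0]  # hypothetical global dataset
europe_extent = [-10.0, 35.0, 30.0, 60.0]     # hypothetical regional dataset

print(bboxes_intersect(bay_area, global_extent))
print(bboxes_intersect(bay_area, europe_extent))
```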
Geocode location names¶
# Convert location name to bounding box
location_info = json.loads(tools.geocode_location("New York City"))
print(f"Name: {location_info['name']}")
print(f"Bbox: {location_info['bbox']}")
print(f"Center: {location_info['center']}")
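The relationship between a bounding box and its center can be sketched with plain arithmetic. The example assumes the `[west, south, east, north]` ordering used in the bounding box search above. `validate_bbox` and `bbox_center` are hypothetical helpers, and the coordinate order of the `center` value returned by `geocode_location` is not documented here, so the `(lon, lat)` tuple below is only an illustrative convention.

```python
def validate_bbox(bbox):
    """Raise ValueError unless bbox is a plausible [west, south, east, north]."""
    if len(bbox) != 4:
        raise ValueError("bbox must have four values")
    west, south, east, north = bbox
    if not (-180 <= west <= east <= 180):
        raise ValueError("longitudes must satisfy -180 <= west <= east <= 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return bbox


def bbox_center(bbox):
    """Return the midpoint of the box as (lon, lat)."""
    west, south, east, north = validate_bbox(bbox)
    return ((west + east) / 2, (south + north) / 2)


print(bbox_center([-122.5, 37.5, -122.0, 38.0]))
```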
Search with filters¶
result = json.loads(
    tools.search_datasets(
        keywords="elevation", dataset_type="image", provider="NASA", max_results=5
    )
)
print(f"Found {result['dataset_count']} datasets\n")
for ds in result["datasets"]:
    print(f"ID: {ds['id']}")
    print(f"Title: {ds['title']}")
    print(f"Type: {ds.get('type', 'N/A')}")
    print("-" * 80)
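When several filters are passed together, a natural reading is that a dataset must satisfy all of them, with any omitted filter ignored. The sketch below illustrates that AND-style combination on hypothetical records. The `matches` function and its matching rules (substring match for keywords and provider, exact match for type) are assumptions for illustration rather than the documented behavior of `search_datasets`.

```python
# Hypothetical records for illustration; not actual catalog entries.
records = [
    {"id": "demo/dem", "title": "Global Elevation", "type": "image", "provider": "NASA"},
    {"id": "demo/dem_series", "title": "Elevation Time Series", "type": "image_collection", "provider": "NASA"},
    {"id": "demo/dem_eu", "title": "European Elevation", "type": "image", "provider": "Example Agency"},
]


def matches(rec, keywords=None, dataset_type=None, provider=None):
    """Return True if rec passes every filter that was supplied."""
    if keywords and keywords.lower() not in rec.get("title", "").lower():
        return False
    if dataset_type and rec.get("type") != dataset_type:
        return False
    if provider and provider.lower() not in rec.get("provider", "").lower():
        return False
    return True


filtered = [
    rec for rec in records
    if matches(rec, keywords="elevation", dataset_type="image", provider="NASA")
]
print([rec["id"] for rec in filtered])
```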
List available providers¶
providers = json.loads(tools.list_providers())
print(f"Total providers: {providers['count']}\n")
print("Sample providers:")
for p in providers["providers"][:10]:
    print(f"  - {p}")
List dataset types¶
types = json.loads(tools.list_dataset_types())
print(f"Available dataset types: {types['types']}")
Use CatalogAgent with Natural Language (LLM)¶
For natural language queries, create a CatalogAgent that uses an LLM to understand and execute searches:
agent = CatalogAgent(model=model, catalog_url=catalog_url)
Ask natural language questions¶
response = agent.ask("Find datasets about landcover from NASA")
print(response)

response = agent.ask("Show me elevation data")
print(response)
Spatial search with natural language (NEW!)¶
response = agent.ask("Find landcover datasets covering California")
print(response)

response = agent.ask("Show me elevation data for San Francisco")
print(response)

response = agent.ask("Find land cover datasets from NASA that cover New York City")
print(response)

response = agent.ask("What types of datasets are available?")
print(response)

response = agent.ask("Find image collections about forests")
print(response)

response = agent.ask("Find landcover datasets from 2022 onwards")
print(response)
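How the agent turns a phrase such as "from 2022 onwards" into a concrete filter is not shown in this notebook. One plausible interpretation is to keep datasets whose temporal coverage extends into or beyond the requested year. The sketch below assumes each record carries ISO-formatted `start_date` and `end_date` fields. Both the field names and the `active_since` helper are assumptions for illustration.

```python
from datetime import date

# Hypothetical records; field names are assumed, not taken from the catalog.
records = [
    {"id": "demo/lc_2015", "start_date": "2015-01-01", "end_date": "2015-12-31"},
    {"id": "demo/lc_annual", "start_date": "2017-01-01", "end_date": "2023-12-31"},
    {"id": "demo/lc_2022", "start_date": "2022-01-01", "end_date": "2022-12-31"},
]


def active_since(records, year):
    """Keep records whose coverage ends on or after January 1 of `year`."""
    cutoff = date(year, 1, 1)
    return [
        rec for rec in records
        if date.fromisoformat(rec["end_date"]) >= cutoff
    ]


recent = active_since(records, 2022)
print([rec["id"] for rec in recent])
```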
Get structured results programmatically¶
datasets = agent.search_datasets(keywords="sentinel", max_results=5)
for ds in datasets:
    print(f"ID: {ds['id']}")
    print(f"Title: {ds['title']}")
    print(f"Provider: {ds.get('provider', 'N/A')}")
    if ds.get("snippet"):
        print(f"Code: {ds['snippet']}")
    print("-" * 80)
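Because the structured results are a list of dictionaries, they are straightforward to export for later use. The following sketch writes such a list to CSV text with the standard library. The `to_csv` helper and the sample records are hypothetical, and the column set (`id`, `title`, `provider`) simply mirrors the keys printed above.

```python
import csv
import io


def to_csv(datasets, fields=("id", "title", "provider")):
    """Serialize a list of dataset dicts to CSV text, filling gaps with N/A."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for ds in datasets:
        writer.writerow({field: ds.get(field, "N/A") for field in fields})
    return buffer.getvalue()


# Hypothetical results standing in for the output of a search
sample = [
    {"id": "demo/s2", "title": "Sentinel-2 Demo", "provider": "Example Org"},
    {"id": "demo/s1", "title": "Sentinel-1 Demo"},
]
print(to_csv(sample))
```

The resulting text can be written to a file or loaded into a spreadsheet.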