Skip to content

Commit a13d4e3

Browse files
authored
Merge pull request #2 from jairomelo/dev/0.2.2
Dev/0.2.2
2 parents 420824b + ebe2f5f commit a13d4e3

7 files changed

Lines changed: 217 additions & 24 deletions

File tree

.github/workflows/ci.yml

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
name: CI
2+
3+
on:
4+
push:
5+
branches: [main]
6+
pull_request:
7+
branches: [main]
8+
9+
jobs:
10+
test:
11+
runs-on: ubuntu-latest
12+
13+
steps:
14+
- uses: actions/checkout@v4
15+
- name: Set up Python
16+
uses: actions/setup-python@v5
17+
with:
18+
python-version: "3.11"
19+
- name: Install dependencies
20+
run: |
21+
python -m pip install --upgrade pip
22+
pip install -e .[dev]
23+
- name: Run tests
24+
env:
25+
GEONAMES_USERNAME: ${{ secrets.GEONAMES_USERNAME }}
26+
run: |
27+
pytest -v --maxfail=3

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,3 +10,5 @@ build/
1010
dist/
1111
*.egg-info/
1212
*.sh
13+
14+
tests/data/

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [v0.2.2] - 2025-07-14
9+
10+
### Added
11+
- GitHub Actions CI workflow for automated testing and validation
12+
- Enhanced development dependencies in pyproject.toml for better development experience
13+
14+
### Changed
15+
- **PERFORMANCE**: Batch resolver now processes only unique values, significantly reducing API requests and improving performance
16+
- Updated README with CI status badge
17+
18+
### Tests
19+
- Enhanced batch resolver tests with country_code and place_type columns for improved resolution testing
20+
- Added comprehensive test coverage for unique value processing functionality
21+
22+
---
23+
824
## [v0.2.1] - 2025-07-10
925

1026
### Added

README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,13 @@
11
![PyPI - Version](https://img.shields.io/pypi/v/georesolver)
22
![Python Versions](https://img.shields.io/pypi/pyversions/georesolver)
3+
![CI](https://github.com/jairomelo/GeoResolver/actions/workflows/ci.yml/badge.svg)
34
![License](https://img.shields.io/pypi/l/georesolver)
45
![Downloads](https://static.pepy.tech/badge/georesolver)
56
[![Documentation](https://img.shields.io/badge/docs-online-blue)](https://jairomelo.com/Georesolver/)
67
[![Issues](https://img.shields.io/github/issues/jairomelo/Georesolver)](https://github.com/jairomelo/Georesolver/issues)
78

89

9-
# Georesolver
10+
# GeoResolver
1011

1112
GeoResolver is a lightweight Python library for resolving place names into geographic coordinates and related metadata using multiple gazetteer services, including [GeoNames](https://www.geonames.org/), [WHG](https://whgazetteer.org/), [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page), and [TGN](https://www.getty.edu/research/tools/vocabularies/tgn/).
1213

pyproject.toml

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
44

55
[project]
66
name = "georesolver"
7-
version = "0.2.1"
7+
version = "0.2.2"
88
description = "Multi-source place name to coordinates resolver using TGN, WHG, GeoNames, and Wikidata"
99
authors = [
1010
{name="Jairo Antonio Melo Florez", email="jairoantoniomelo@gmail.com"}
@@ -54,4 +54,12 @@ Issues = "https://github.com/jairomelo/Georesolver/issues"
5454
Documentation = "https://jairomelo.com/Georesolver/"
5555

5656
[tool.setuptools.package-data]
57-
"georesolver" = ["data/mappings/places_map.json"]
57+
"georesolver" = ["data/mappings/places_map.json"]
58+
59+
[project.optional-dependencies]
60+
dev = [
61+
"pytest>=8.0.0",
62+
"pytest-cov",
63+
"mypy",
64+
"ruff"
65+
]

src/georesolver/resolver.py

Lines changed: 118 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -917,9 +917,12 @@ def resolve(self,
917917

918918
place_name = place_name.strip()
919919

920-
if pycountry.countries.get(alpha_2=country_code) is None and country_code is not None:
921-
self.logger.warning(f"Invalid country code: {country_code}\nLook at the correct ISO 3166-1 alpha-2 country codes at https://www.iso.org/iso-3166-country-codes.html")
922-
country_code = None
920+
try:
921+
if pycountry.countries.get(alpha_2=country_code) is None and country_code is not None:
922+
self.logger.warning(f"Invalid country code: {country_code}\nLook at the correct ISO 3166-1 alpha-2 country codes at https://www.iso.org/iso-3166-country-codes.html")
923+
country_code = None
924+
except Exception as e:
925+
self.logger.info(f"Error occurred while validating country code: {e}")
923926

924927
if self.flexible_threshold and len(place_name) < 5:
925928
self.logger.warning(
@@ -972,12 +975,16 @@ def resolve_batch(
972975
) -> Union[pd.DataFrame, List[dict]]:
973976
"""
974977
Resolve coordinates for a batch of places from a DataFrame.
978+
979+
This method optimizes API calls by processing only unique combinations of
980+
place_name, country_code, and place_type, then mapping results back to the original DataFrame.
975981
976982
Args:
977983
df (pd.DataFrame): Input DataFrame with place names and optional country/type columns.
978984
place_column (str): Column name for place names.
979985
country_column (str): Column name for country codes (optional).
980986
place_type_column (str): Column name for place types (optional).
987+
use_default_filter (bool): If True, apply a default filter as fallback.
981988
return_df (bool): If True, return a DataFrame with separate columns for each attribute. Otherwise, return a list of dictionaries.
982989
show_progress (bool): If True, show a progress bar during processing.
983990
@@ -988,11 +995,6 @@ def resolve_batch(
988995
pd.DataFrame: A DataFrame with resolved coordinates and metadata.
989996
List[dict]: A list of dictionaries with resolved coordinates and metadata if return_df is False.
990997
"""
991-
#TODO:
992-
# - Gently handle NaN and empty strings in place_column
993-
# - Process data in chunks of 100 rows
994-
# - Only process records with valid place names (non-empty strings)
995-
# - Sort Series
996998

997999
if not isinstance(df, pd.DataFrame):
9981000
raise ValueError("Input must be a pandas DataFrame")
@@ -1006,27 +1008,123 @@ def resolve_batch(
10061008
if place_type_column and place_type_column not in df.columns:
10071009
raise ValueError(f"Column '{place_type_column}' not found in DataFrame")
10081010

1011+
# Create a copy of the input DataFrame to avoid modifying the original
1012+
df_copy = df.copy()
1013+
1014+
# Handle NaN and empty values in place_column
1015+
df_copy[place_column] = df_copy[place_column].fillna("").astype(str)
1016+
1017+
# Filter out rows with empty place names
1018+
valid_mask = df_copy[place_column].str.strip() != ""
1019+
df_valid = df_copy[valid_mask].copy()
1020+
1021+
if df_valid.empty:
1022+
self.logger.warning("No valid place names found in the DataFrame")
1023+
if return_df:
1024+
# Return empty results DataFrame with proper structure
1025+
empty_results = pd.DataFrame({
1026+
"place": None, "standardize_label": None, "language": None,
1027+
"latitude": None, "longitude": None, "source": None,
1028+
"id": None, "uri": None, "country_code": None,
1029+
"part_of": None, "part_of_uri": None, "confidence": None,
1030+
"threshold": None, "match_type": None
1031+
}, index=df.index)
1032+
return empty_results
1033+
else:
1034+
# Return list of None values, properly typed for the Union return type
1035+
return [None] * len(df) # type: ignore
1036+
1037+
# Create unique combinations for processing
1038+
lookup_columns = [place_column]
1039+
if country_column:
1040+
df_valid[country_column] = df_valid[country_column].fillna("")
1041+
lookup_columns.append(country_column)
1042+
if place_type_column:
1043+
df_valid[place_type_column] = df_valid[place_type_column].fillna("")
1044+
lookup_columns.append(place_type_column)
1045+
1046+
# Get unique combinations
1047+
unique_combinations = df_valid[lookup_columns].drop_duplicates().reset_index(drop=True)
1048+
1049+
# Log optimization info
1050+
original_count = len(df_valid)
1051+
unique_count = len(unique_combinations)
1052+
reduction_pct = ((original_count - unique_count) / original_count * 100) if original_count > 0 else 0
1053+
self.logger.info(f"Processing {unique_count} unique combinations instead of {original_count} rows "
1054+
f"({reduction_pct:.1f}% reduction in API calls)")
1055+
1056+
# Process unique combinations
10091057
if show_progress:
1010-
df_iter = tqdm(df.iterrows(), total=len(df))
1058+
unique_iter = tqdm(unique_combinations.iterrows(),
1059+
total=len(unique_combinations),
1060+
desc="Resolving unique places")
10111061
else:
1012-
df_iter = df.iterrows()
1062+
unique_iter = unique_combinations.iterrows()
10131063

1014-
results = []
1015-
for _, row in df_iter:
1016-
place_name = row.get(place_column, "")
1017-
country_code = row.get(country_column) if country_column else None
1018-
place_type = row.get(place_type_column) if place_type_column else None
1019-
1020-
coords = self.resolve(
1064+
# Store results for unique combinations
1065+
unique_results = {}
1066+
1067+
for _, row in unique_iter:
1068+
place_name = row[place_column].strip()
1069+
country_code = row.get(country_column, None) if country_column else None
1070+
place_type = row.get(place_type_column, None) if place_type_column else None
1071+
1072+
# Convert empty strings to None for consistency
1073+
country_code = country_code if country_code and country_code.strip() else None
1074+
place_type = place_type if place_type and place_type.strip() else None
1075+
1076+
# Create a key for the combination
1077+
key = (place_name, country_code or "", place_type or "")
1078+
1079+
result = self.resolve(
10211080
place_name=place_name,
10221081
country_code=country_code,
10231082
place_type=place_type,
10241083
use_default_filter=use_default_filter
10251084
)
1026-
1027-
results.append(coords)
1085+
1086+
unique_results[key] = result
1087+
1088+
# Map results back to original DataFrame
1089+
results = []
1090+
for idx in df.index:
1091+
if idx in df_valid.index:
1092+
row = df_valid.loc[idx]
1093+
place_name = row[place_column].strip()
1094+
country_code = row.get(country_column, None) if country_column else None
1095+
place_type = row.get(place_type_column, None) if place_type_column else None
1096+
1097+
# Convert empty strings to None for key matching
1098+
country_code = country_code if country_code and country_code.strip() else None
1099+
place_type = place_type if place_type and place_type.strip() else None
1100+
1101+
key = (place_name, country_code or "", place_type or "")
1102+
result = unique_results.get(key)
1103+
else:
1104+
# For rows with invalid place names, return None
1105+
result = None
1106+
1107+
results.append(result)
10281108

10291109
if return_df:
1030-
return pd.DataFrame(results, columns=["place", "standardize_label", "language", "latitude", "longitude", "source", "place_id", "place_uri", "country_code", "part_of", "part_of_uri", "confidence", "threshold", "match_type"], index=df.index)
1110+
# Fill None results with a default structure before creating DataFrame
1111+
default_result = {
1112+
"place": None, "standardize_label": None, "language": None,
1113+
"latitude": None, "longitude": None, "source": None,
1114+
"id": None, "uri": None, "country_code": None,
1115+
"part_of": None, "part_of_uri": None, "confidence": None,
1116+
"threshold": None, "match_type": None
1117+
}
1118+
1119+
# Expand dictionary results into separate columns
1120+
expanded_results = []
1121+
for result in results:
1122+
if result is None:
1123+
expanded_results.append(default_result)
1124+
else:
1125+
expanded_results.append(result)
1126+
1127+
results_df = pd.DataFrame(expanded_results, index=df.index)
1128+
return results_df
10311129
else:
10321130
return results

tests/test_batch.py

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -113,4 +113,45 @@ def test_batch_resolver_list():
113113
assert 'latitude' in result, "Result should contain latitude"
114114
assert 'longitude' in result, "Result should contain longitude"
115115
assert isinstance(result['latitude'], (int, float)), "Latitude should be numeric"
116-
assert isinstance(result['longitude'], (int, float)), "Longitude should be numeric"
116+
assert isinstance(result['longitude'], (int, float)), "Longitude should be numeric"
117+
118+
119+
""" def test_batch_real_df(csv_path="tests/data/bautismos_cleaned.csv"):
120+
df = pd.read_csv(csv_path)
121+
122+
df["country_code"] = "PE"
123+
df["place_type"] = "city"
124+
125+
resolver = PlaceResolver([GeoNamesQuery(), WHGQuery()],
126+
verbose=True, lang="es")
127+
results_df = resolver.resolve_batch(df,
128+
place_column="Descriptor Geográfico 2",
129+
country_column="country_code",
130+
place_type_column="place_type",
131+
show_progress=True)
132+
print(f"\n=== Real DataFrame Results ===")
133+
print("Results DataFrame:")
134+
print(results_df.head())
135+
136+
assert isinstance(results_df, pd.DataFrame), "Results should be a DataFrame"
137+
assert 'latitude' in results_df.columns, "Results DataFrame should contain latitude"
138+
assert 'longitude' in results_df.columns, "Results DataFrame should contain longitude"
139+
140+
# Check that we have at least some successful results
141+
successful_results = results_df.dropna(subset=['latitude', 'longitude'])
142+
assert len(successful_results) > 0, "Should have at least some successful coordinate resolutions"
143+
144+
# Print some statistics about the results
145+
total_places = len(results_df)
146+
resolved_places = len(successful_results)
147+
resolution_rate = (resolved_places / total_places) * 100
148+
print(f"Resolution statistics: {resolved_places}/{total_places} places resolved ({resolution_rate:.1f}%)")
149+
150+
# Show some examples of successful and failed resolutions
151+
print("\nSuccessful resolutions:")
152+
print(successful_results[['place', 'country_code', 'standardize_label', 'latitude', 'longitude', 'source']].head())
153+
154+
failed_places = results_df[results_df['latitude'].isnull()]
155+
if len(failed_places) > 0:
156+
print(f"\nFailed to resolve {len(failed_places)} places:")
157+
print(failed_places['place'].unique()[:10]) # Show first 10 unresolved places """

0 commit comments

Comments
 (0)