This document describes the DataProfiler integration module, which validates input data before profiling so that invalid input is rejected with a clear error instead of failing inside the DataProfiler constructor.
The personal_finance.data_profiler module provides:
- Data Validation: Comprehensive validation of data before passing to DataProfiler
- Sensitive Data Detection: Financial-specific PII and sensitive data detection
- Data Quality Analysis: Analysis of data quality issues and patterns
- Safe Integration: Graceful handling when DataProfiler is not available
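As a rough illustration of the "Safe Integration" point, the sketch below shows one common way to treat an optional dependency in Python: check whether the package can be imported and choose a fallback mode when it cannot. The names profiler_installed and describe_backend are hypothetical and are not part of personal_finance.data_profiler; the module's actual mechanism may differ.

```python
import importlib.util


def profiler_installed(package: str = "dataprofiler") -> bool:
    """Return True if the named top-level package can be imported (hypothetical helper)."""
    return importlib.util.find_spec(package) is not None


def describe_backend(package: str = "dataprofiler") -> str:
    """Pick an analysis mode based on whether the optional package is present."""
    if profiler_installed(package):
        return "full profiling"
    return "limited analysis"


print(describe_backend())
```

Checking once and branching on the result keeps the rest of the code free of scattered try/except ImportError blocks.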
from personal_finance.data_profiler import validate_profile_data, ProfileDataError
import pandas as pd
# Validate DataFrame before profiling
financial_df = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002'],
    'amount': [100.50, 250.75],
    'account_id': ['ACC001', 'ACC002']
})

try:
    validate_profile_data(financial_df)
    print("✓ Data is valid for DataProfiler")
except ProfileDataError as e:
    print(f"✗ Validation failed: {e}")

from personal_finance.data_profiler import DataProfilerService
# Initialize service with sensitive data detection
service = DataProfilerService(enable_sensitive_data_detection=True)
# Analyze financial data
analysis = service.analyze_financial_data(financial_df)
print("Financial patterns:", analysis['financial_patterns'])
print("Data quality:", analysis['data_quality'])
print("Sensitive data:", analysis['sensitive_data_detected'])

The validation system supports all DataProfiler-compatible formats:
# pandas DataFrame
df = pd.DataFrame({
    'symbol': ['AAPL', 'GOOGL'],
    'price': [175.50, 2950.25],
    'volume': [1000000, 800000]
})
validate_profile_data(df)  # ✓ Valid

# pandas Series
series = pd.Series([100, 200, 300], name='amounts')
validate_profile_data(series)  # ✓ Valid

# NumPy array
import numpy as np
price_array = np.array([[100, 200], [300, 400]])
validate_profile_data(price_array)  # ✓ Valid

# List of dictionaries
portfolio_records = [
    {'symbol': 'AAPL', 'quantity': 100, 'price': 150.25},
    {'symbol': 'GOOGL', 'quantity': 50, 'price': 2800.75}
]
validate_profile_data(portfolio_records)  # ✓ Valid

# Column-oriented dictionary
price_data = {
    'dates': ['2024-01-01', '2024-01-02'],
    'prices': [100.0, 102.5],
    'volumes': [1000000, 1200000]
}
validate_profile_data(price_data)  # ✓ Valid

# File paths
validate_profile_data('/path/to/financial_data.csv')  # ✓ Valid
validate_profile_data('portfolio.xlsx')  # ✓ Valid

DataFrames:
- Must not be empty
- Must have at least one column
- Column names must be strings, integers, or floats
- Column names must not be empty strings
- Warns for extremely large DataFrames (>1M rows or >1000 columns)

Series:
- Must not be empty
- Series name (if provided) must be a string, integer, or float

NumPy arrays:
- Must not be empty
- Maximum 2 dimensions (DataProfiler limitation)
- Warns for object arrays with multiple dimensions

Lists:
- Must not be empty
- For lists of dictionaries: the schema must be consistent across all records
- For simple lists: mixed types are allowed, with warnings for excessive diversity

Dictionaries:
- Must not be empty
- For column-oriented data: all arrays/lists must have the same length
- For record data: validated as a single record

File paths:
- Must not be empty or whitespace-only
- Warns for unsupported file extensions
- Supported extensions: .csv, .json, .parquet, .xlsx, .txt
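The list and dictionary rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's implementation: check_records and check_columns are hypothetical names, and ProfileDataError is redefined locally as a stand-in so the snippet runs on its own.

```python
from typing import Any, Dict, List, Sequence


class ProfileDataError(ValueError):
    """Local stand-in for the module's exception so this sketch is self-contained."""


def check_records(records: List[Dict[str, Any]]) -> None:
    """Require a non-empty list of dicts that all share the same keys."""
    if not records:
        raise ProfileDataError("List cannot be empty")
    expected = set(records[0])
    for index, record in enumerate(records[1:], start=1):
        if set(record) != expected:
            raise ProfileDataError(
                f"Inconsistent schema at record {index}: "
                f"expected {sorted(expected)}, got {sorted(record)}"
            )


def check_columns(columns: Dict[str, Sequence[Any]]) -> None:
    """Require a non-empty dict whose column sequences all have the same length."""
    if not columns:
        raise ProfileDataError("Dictionary cannot be empty")
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ProfileDataError(f"Columns have different lengths: {lengths}")


# Both calls pass silently because the inputs satisfy the rules.
check_records([{'symbol': 'AAPL', 'quantity': 100}, {'symbol': 'GOOGL', 'quantity': 50}])
check_columns({'dates': ['2024-01-01', '2024-01-02'], 'prices': [100.0, 102.5]})
```

Comparing key sets rather than key order means records that list the same fields in a different order still count as consistent; whether the real validator treats order the same way is not stated in this document.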
try:
    validate_profile_data(invalid_data)
except ProfileDataError as e:
    print(f"Validation failed: {e}")
    # Handle the error appropriately

# Empty data
validate_profile_data(pd.DataFrame())  # Raises: DataFrame cannot be empty

# None data
validate_profile_data(None)  # Raises: profile_data cannot be None

# Inconsistent records
validate_profile_data([
    {'name': 'John', 'age': 30},
    {'name': 'Jane', 'salary': 50000}  # Different keys
])  # Raises: Inconsistent schema

# Unsupported type
validate_profile_data(set([1, 2, 3]))  # Raises: Unsupported data type

The DataProfilerService provides financial-specific analysis:
analysis = service.analyze_financial_data(financial_df)
patterns = analysis['financial_patterns']
# - potential_currency_columns: Columns that look like currency/amounts
# - potential_date_columns: Columns that look like dates
# - potential_amount_columns: Numeric columns for financial amounts
# - suspicious_patterns: Unusual patterns that may indicate data issues

quality = analysis['data_quality']
# - missing_data_ratio: Proportion of missing values
# - duplicate_rows: Number of duplicate rows
# - empty_columns: Columns with all null values
# - constant_columns: Columns with only one unique value
# - outlier_candidates: Columns with potential outliers

sensitive_data = analysis['sensitive_data_detected']
# Detects:
# - potential_account_number: Numeric strings that look like account numbers
# - potential_ssn: Patterns matching Social Security Numbers
# Each finding includes a confidence level and recommendations

The validate_and_prepare_data function optimizes data for DataProfiler:
from personal_finance.data_profiler import validate_and_prepare_data
# Converts list of dicts to DataFrame for better performance
records = [{'id': 1, 'amount': 100}, {'id': 2, 'amount': 200}]
optimized_data = validate_and_prepare_data(records)
# Returns: pandas DataFrame
# Converts column-oriented dict to DataFrame
columns = {'ids': [1, 2], 'amounts': [100, 200]}
optimized_data = validate_and_prepare_data(columns)
# Returns: pandas DataFrame

service = DataProfilerService()
if service.is_available():
    # DataProfiler is installed and available
    profile = service.create_profile(financial_data)
    if profile:
        print("Profile summary:", profile['summary'])
        print("Column profiles:", profile['column_profiles'])
        print("Data types:", profile['data_types'])
        print("Null analysis:", profile['null_analysis'])
        print("Sensitive data:", profile['sensitive_data'])
else:
    print("DataProfiler not available - install with: pip install dataprofiler")

# Disable sensitive data detection for performance
service = DataProfilerService(enable_sensitive_data_detection=False)

# Custom DataProfiler options
profile = service.create_profile(
    data,
    samples_per_update=1000,  # Process in batches
    min_true_samples=10  # Minimum samples for statistics
)

Add to your requirements:
dataprofiler>=0.11.0 # Data profiling and sensitive data detection
Install with pip:
pip install dataprofiler

# Good
try:
    validate_profile_data(data)
    profile = service.create_profile(data)
except ProfileDataError as e:
    logger.error(f"Data validation failed: {e}")
    return None

# Bad
profile = service.create_profile(data)  # May fail with unclear errors

# Good - converts to optimal format
prepared_data = validate_and_prepare_data(list_of_records)
profile = service.create_profile(prepared_data)
# Acceptable - but may be less efficient
profile = service.create_profile(list_of_records)

service = DataProfilerService()
if not service.is_available():
    logger.warning("DataProfiler not available, using limited analysis")
    # Fall back to basic analysis
    analysis = service.analyze_financial_data(data)
    # Still provides financial patterns and data quality analysis

# For financial applications
service = DataProfilerService(enable_sensitive_data_detection=True)
# Check findings
sensitive_findings = analysis['sensitive_data_detected']
for finding in sensitive_findings:
    if finding['pattern_type'] == 'potential_ssn':
        logger.warning(f"SSN detected in column: {finding['column']}")

quality = analysis['data_quality']
if quality['missing_data_ratio'] > 0.1:
    logger.warning(f"High missing data ratio: {quality['missing_data_ratio']:.1%}")
if quality['outlier_candidates']:
    logger.info(f"Outliers detected in: {[o['column'] for o in quality['outlier_candidates']]}")

- DataFrames with >1M rows or >1000 columns generate warnings
- Consider sampling large datasets before profiling
- Use the samples_per_update parameter for batch processing
- Validation creates minimal memory overhead
- Data preparation may create DataFrame copies
- DataProfiler itself can be memory-intensive for large datasets
- Consider caching profile results for static datasets
- Validation is fast and doesn't need caching
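One way to act on the caching suggestion is to key stored results on a fingerprint of the data, so an unchanged dataset is profiled only once. The sketch below assumes JSON-serializable records; fingerprint, cached_profile, and the in-memory _profile_cache dictionary are hypothetical names introduced for illustration, and the build argument stands in for whatever produces the profile (for example service.create_profile).

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Simple in-process cache; a real application might persist this elsewhere.
_profile_cache: Dict[str, Any] = {}


def fingerprint(records: List[Record]) -> str:
    """Hash the records' JSON form so identical data maps to the same key."""
    payload = json.dumps(records, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_profile(records: List[Record], build: Callable[[List[Record]], Any]) -> Any:
    """Return the cached result for unchanged data, calling build only on a miss."""
    key = fingerprint(records)
    if key not in _profile_cache:
        _profile_cache[key] = build(records)
    return _profile_cache[key]
```

Hashing the full content costs time proportional to the dataset size, which is usually small next to profiling itself but is worth measuring for very large inputs.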
- ImportError: No module named 'dataprofiler'
  - Install the package: pip install dataprofiler
- ProfileDataError: DataFrame cannot be empty
  - Check that your data actually contains records
  - Verify data loading/filtering logic
- ProfileDataError: Inconsistent schema
  - Ensure all records in the list have the same keys
  - Check for typos in field names
- Memory errors with large datasets
  - Sample your data before profiling
  - Use DataProfiler's sampling options
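For the memory-error case, sampling before profiling can be as simple as the sketch below. It works on a list of records using only the standard library; sample_records is a hypothetical helper, not a function provided by the module.

```python
import random
from typing import Any, Dict, List, Optional


def sample_records(
    records: List[Dict[str, Any]],
    max_rows: int,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Return at most max_rows records chosen uniformly at random."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    if len(records) <= max_rows:
        return list(records)
    rng = random.Random(seed)  # Local generator keeps the sample reproducible
    return rng.sample(records, max_rows)
```

For a pandas DataFrame, the built-in DataFrame.sample(n=..., random_state=...) serves the same purpose. Note that a random sample can miss rare values, so quality metrics computed on it are estimates.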
Enable detailed logging:
import logging
logging.getLogger('personal_finance.data_profiler').setLevel(logging.DEBUG)

This provides detailed information about:
- Validation steps and decisions
- Data conversion processes
- DataProfiler integration status
- Performance warnings
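Setting the level on a logger only decides which records that logger passes on; the records are displayed only if a handler is attached somewhere up the logger hierarchy. If the application has not configured logging yet, a slightly fuller setup might look like the sketch below. The format string and the INFO level for the root logger are illustrative choices, not requirements of the module.

```python
import logging

# Attach a console handler to the root logger. Without any handler,
# DEBUG records from the module logger would not be displayed.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Raise verbosity only for the profiler integration's logger.
profiler_logger = logging.getLogger("personal_finance.data_profiler")
profiler_logger.setLevel(logging.DEBUG)
```

Keeping the root logger at INFO while lowering only this logger to DEBUG limits the extra output to the profiler integration.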
See examples/data_profiler_usage.py for comprehensive usage examples including:
- Basic validation scenarios
- Financial data analysis workflows
- Error handling patterns
- Real-world integration examples