Streaming Excel Export Architecture
Overview
The refactored StreamingExcelExporter uses dependency injection to leverage existing Excel export services while adding streaming capabilities for handling massive files with minimal memory usage.
Key Components
1. StreamingExcelExporter (analyze/services/excel_streamer.py)
- Uses dependency injection via the `@inject` decorator
- Leverages existing services:
  - `SheetNameSanitizerProtocol` - for Excel-safe sheet names
  - `TitleFormatterProtocol` - for formatting titles
  - `WorkbookEngineProtocol` - for creating Excel writers
  - `RichOutputProtocol` - for console output
- Implements a context manager for proper resource cleanup
- Buffers writes for efficiency while maintaining low memory usage
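The buffering and cleanup behaviour described above can be sketched as follows. This is a minimal illustration, not the actual `StreamingExcelExporter`: the `RowSink` protocol, `BufferedExporter`, and `ListSink` are hypothetical names standing in for the injected workbook engine and the exporter.

```python
from typing import List, Protocol, Sequence


class RowSink(Protocol):
    """Hypothetical stand-in for what an injected workbook engine exposes."""

    def write_rows(self, rows: Sequence[Sequence[object]]) -> None: ...
    def close(self) -> None: ...


class BufferedExporter:
    """Collects rows and hands them to the sink in fixed-size batches."""

    def __init__(self, sink: RowSink, buffer_size: int = 1000) -> None:
        if buffer_size < 1:
            raise ValueError("buffer_size must be at least 1")
        self._sink = sink
        self._buffer_size = buffer_size
        self._buffer: List[Sequence[object]] = []
        self.rows_written = 0

    def __enter__(self) -> "BufferedExporter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Flush the remainder on a clean exit; always release the sink.
        try:
            if exc_type is None:
                self.flush()
        finally:
            self._sink.close()

    def add_row(self, row: Sequence[object]) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._sink.write_rows(self._buffer)
            self.rows_written += len(self._buffer)
            self._buffer = []  # new list, so memory is bounded by buffer_size


class ListSink:
    """In-memory sink used only to demonstrate the batching."""

    def __init__(self) -> None:
        self.batches: List[int] = []
        self.closed = False

    def write_rows(self, rows: Sequence[Sequence[object]]) -> None:
        self.batches.append(len(rows))

    def close(self) -> None:
        self.closed = True


sink = ListSink()
with BufferedExporter(sink, buffer_size=3) as exporter:
    for i in range(7):
        exporter.add_row((i, f"region-{i}"))

print(sink.batches)  # [3, 3, 1]
```

Because the sink is passed in, a unit test can substitute an in-memory fake like `ListSink` and assert on batch sizes without touching the filesystem.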
2. StreamingWorkbookEngine
- Custom engine that specifically uses `xlsxwriter` with `constant_memory` mode
- Provides streaming capabilities not available in the standard `openpyxl` engine
- Memory usage: O(1) regardless of data size
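For reference, this is roughly what writing through `xlsxwriter` in `constant_memory` mode looks like. It is a sketch of the underlying library usage, not the `StreamingWorkbookEngine` itself; the function name `write_rows_constant_memory` is hypothetical.

```python
from typing import Iterable, Sequence


def write_rows_constant_memory(path: str, sheet_name: str,
                               rows: Iterable[Sequence[object]]) -> int:
    """Stream rows into one worksheet; returns the number of rows written."""
    # Imported lazily so this module still loads where xlsxwriter is absent.
    import xlsxwriter

    # In constant_memory mode xlsxwriter flushes a row to disk once a later
    # row is started, so rows must be written strictly top to bottom.
    workbook = xlsxwriter.Workbook(path, {"constant_memory": True})
    try:
        worksheet = workbook.add_worksheet(sheet_name)
        count = 0
        for row_index, row in enumerate(rows):
            worksheet.write_row(row_index, 0, row)
            count += 1
        return count
    finally:
        # close() finalizes the .xlsx file; without it the file is unusable.
        workbook.close()
```

Passing a generator as `rows` keeps the caller side lazy as well, so neither the producer nor the writer holds the full dataset. The trade-off of this mode is that a row cannot be revisited once a later row has been written.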
3. Generator-based Analyzer (analyze/services/entropy/analyzer_streaming.py)
- `analyze_file_generator()` method yields results as computed
- Filters regions at creation time - never creates objects for low-risk regions
- Memory usage: O(1) for any file size
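The filter-at-creation idea can be sketched with a plain generator. This is an illustration under assumptions, not the real `analyze_file_generator()`: the `Region` dataclass, the fixed `chunk_size`, and the use of a raw Shannon-entropy cut-off (`min_entropy`) in place of the project's risk levels are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import BinaryIO, Iterator


@dataclass(frozen=True)
class Region:
    offset: int
    length: int
    entropy: float  # Shannon entropy in bits per byte, 0.0 to 8.0


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def analyze_stream(stream: BinaryIO, chunk_size: int = 4096,
                   min_entropy: float = 7.0) -> Iterator[Region]:
    """Yield only the chunks whose entropy reaches min_entropy."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        entropy = shannon_entropy(chunk)
        # Filter before constructing anything: chunks below the cut-off
        # never become Region objects.
        if entropy >= min_entropy:
            yield Region(offset, len(chunk), entropy)
        offset += len(chunk)
```

Only one chunk and at most one `Region` are alive per iteration, which is what keeps memory independent of file size as long as the consumer does not collect the results into a list.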
Integration Points
Dependency Injection Container
# In containers/application.py or containers/core.py
streaming_excel_exporter = providers.Factory(
    StreamingExcelExporter,
    sheet_name_sanitizer=excel_export.sheet_name_sanitizer,
    title_formatter=excel_export.title_formatter,
    workbook_engine=streaming_workbook_engine,
    rich_output=rich_output,
)
CLI Command Integration
@inject
def entropy(
    excel_streaming: bool,
    risk_threshold: str,
    # ... injected services
):
    # Note: output_path, risk_level, files and analyzer are defined
    # outside this excerpt.
    if excel_streaming:
        with StreamingExcelExporter(output_path, risk_level) as exporter:
            for file in files:
                exporter.process_file_streaming(file, analyzer)
Memory Usage Comparison
| Component | Current Implementation | Streaming Implementation |
|---|---|---|
| Region Storage | All regions in memory | Only high-risk regions yielded |
| Excel Writing | Build DataFrame first | Direct cell writes |
| File Processing | Complete file in memory | Chunked streaming |
| 1GB File RAM Usage | ~12.5 GB | ~10 MB |
Benefits of Refactored Architecture
- Dependency Injection: Reuses existing, tested services
- Separation of Concerns: Streaming logic separate from Excel formatting
- Testability: Each component can be mocked/tested independently
- Extensibility: Easy to add new output formats or filters
- Configuration: Risk thresholds and options configurable via DI
Configuration Options
# In configuration files
analysis:
  streaming:
    enabled: true
    risk_threshold: "medium_high"
    exclude_data_samples: true
    buffer_size: 1000  # Rows to buffer before writing
  excel:
    engine: "xlsxwriter"  # Required for streaming
    constant_memory: true
    max_rows_warning: 900000
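One way to consume these options is to validate them into a typed object at startup. The sketch below assumes the file has already been parsed into a mapping and that `excel` sits alongside `streaming` under `analysis`; `StreamingSettings` and `load_streaming_settings` are hypothetical names, and the checks are examples rather than the project's actual rules.

```python
from dataclasses import dataclass
from typing import Any, Mapping

EXCEL_MAX_ROWS = 1_048_576  # per-sheet row limit of the .xlsx format


@dataclass(frozen=True)
class StreamingSettings:
    enabled: bool
    risk_threshold: str
    exclude_data_samples: bool
    buffer_size: int
    engine: str
    constant_memory: bool
    max_rows_warning: int


def load_streaming_settings(config: Mapping[str, Any]) -> StreamingSettings:
    """Validate an already-parsed config mapping and return typed settings."""
    analysis = config["analysis"]
    streaming, excel = analysis["streaming"], analysis["excel"]
    settings = StreamingSettings(
        enabled=bool(streaming["enabled"]),
        risk_threshold=str(streaming["risk_threshold"]),
        exclude_data_samples=bool(streaming["exclude_data_samples"]),
        buffer_size=int(streaming["buffer_size"]),
        engine=str(excel["engine"]),
        constant_memory=bool(excel["constant_memory"]),
        max_rows_warning=int(excel["max_rows_warning"]),
    )
    if settings.buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    # The config comment marks xlsxwriter as required for streaming.
    if settings.enabled and settings.engine != "xlsxwriter":
        raise ValueError("streaming requires the xlsxwriter engine")
    if not 0 < settings.max_rows_warning <= EXCEL_MAX_ROWS:
        raise ValueError("max_rows_warning must be within the Excel row limit")
    return settings
```

Failing fast here means a misconfigured engine is reported before any file is opened, rather than partway through a long export.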
Usage Examples
Basic Streaming
With Risk Filtering
Large Dataset Warning
# Automatically warns if estimated regions > 500k with low threshold
ssf_tools analyze entropy /massive/file \
    --excel-streaming \
    --risk-threshold=low
# Warning: Low risk threshold with 6,250,000 estimated regions...
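The warning condition can be sketched as a small pre-flight check. How the tool actually estimates the region count is not documented here, so this assumes one region per fixed `step` of bytes; `estimate_regions`, `large_dataset_warning`, and the `step` parameter are hypothetical. As a sanity check on the arithmetic, a 160-byte step over a 10^9-byte file gives the 6,250,000 shown in the sample output, though the real step size may differ.

```python
from typing import Optional

REGION_WARNING_LIMIT = 500_000  # "estimated regions > 500k" from the comment above


def estimate_regions(file_size: int, step: int) -> int:
    """Number of regions if one region starts every `step` bytes (assumed model)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return -(-file_size // step)  # ceiling division


def large_dataset_warning(file_size: int, step: int,
                          risk_threshold: str) -> Optional[str]:
    """Return a warning message, or None when no warning applies."""
    if risk_threshold != "low":
        return None
    regions = estimate_regions(file_size, step)
    if regions <= REGION_WARNING_LIMIT:
        return None
    return f"Low risk threshold with {regions:,} estimated regions"


print(large_dataset_warning(1_000_000_000, 160, "low"))
# Low risk threshold with 6,250,000 estimated regions
```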
Implementation Checklist
- [x] Create `StreamingExcelExporter` with dependency injection
- [x] Implement `StreamingWorkbookEngine` for xlsxwriter support
- [x] Add `analyze_file_generator()` method to analyzer
- [x] Create container configuration for DI
- [x] Document CLI integration approach
- [ ] Add to main `analyze.py` command
- [ ] Update container wiring in `application.py`
- [ ] Add unit tests for streaming components
- [ ] Add integration tests with large files
- [ ] Update user documentation
Testing Strategy
- Unit Tests
  - Mock all injected services
  - Test buffer flushing logic
  - Test Excel row limit handling
- Integration Tests
  - Test with files of various sizes
  - Verify memory usage stays constant
  - Validate Excel output format
- Performance Tests
  - Benchmark vs current implementation
  - Measure memory usage over time
  - Test with 1GB+, 5GB+, 10GB+ files
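The "memory usage stays constant" check can be approximated with the standard library's `tracemalloc`. This is a sketch of the measuring technique only: `fake_results` is a hypothetical stand-in for the streaming analyzer, and a real test would drive the actual exporter against files of different sizes.

```python
import tracemalloc
from typing import Callable, Iterator, Tuple


def fake_results(count: int) -> Iterator[Tuple[int, int, float]]:
    """Hypothetical stand-in for the streaming analyzer: yields `count` rows."""
    for i in range(count):
        yield (i, i * 160, 7.5)


def peak_bytes(consume: Callable[[], object]) -> int:
    """Peak Python allocation, in bytes, traced while `consume` runs."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
        return peak
    finally:
        tracemalloc.stop()


def consume_streaming(count: int) -> int:
    # Each row is dropped as soon as it has been counted.
    return sum(1 for _ in fake_results(count))


small = peak_bytes(lambda: consume_streaming(1_000))
large = peak_bytes(lambda: consume_streaming(100_000))
materialized = peak_bytes(lambda: list(fake_results(100_000)))

print(f"streaming 1k rows:     {small} bytes")
print(f"streaming 100k rows:   {large} bytes")
print(f"materialized 100k rows: {materialized} bytes")
```

The streaming peaks should be close to each other regardless of row count, while the materialized run grows with the input. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native libraries needs a process-level measurement instead.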
Migration Path
- Phase 1: Add streaming as opt-in flag (`--excel-streaming`)
- Phase 2: Make streaming default for files > 100MB
- Phase 3: Deprecate non-streaming Excel export
- Phase 4: Remove legacy code
Backward Compatibility
- Non-streaming mode remains available via the standard `--excel-export` flag
- Output format is identical between streaming and non-streaming modes
- Configuration files are backward compatible
- No breaking changes to existing APIs