SSF Tools - Analysis Components Architecture

Overview

The Analysis Components architecture provides the core infrastructure for file analysis operations within SSF Tools. This system implements entropy analysis, credential detection, and content-aware threshold management through a protocol-based design with dependency injection.

The architecture supports multiple analysis types including Shannon entropy calculation, detect-secrets integration for credential scanning, and content-aware file classification. All components follow SOLID principles and integrate seamlessly with the SSF Tools container system.
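
As a concrete illustration of the first of these, Shannon entropy over a byte buffer can be computed as follows. This is a minimal stdlib sketch, not the SSF Tools implementation:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for empty or constant input, 8.0 max."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the byte values actually observed
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Repetitive content (a single repeated byte) scores 0.0, while uniformly distributed bytes approach the 8.0 ceiling typical of encrypted or compressed data.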

Architectural Principles

Design Goals

  • Protocol-First Design: All services implement well-defined protocols for testability and modularity
  • Content-Aware Analysis: File-type-specific thresholds and analysis strategies reduce false positives
  • Streaming Processing: Memory-efficient analysis of large files through chunk-based processing
  • Dependency Injection: All components receive dependencies through the container system
  • Research-Based Thresholds: Threshold values derived from peer-reviewed academic research
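
The streaming-processing goal can be sketched as a sliding window over fixed-size reads. The constants below mirror the defaults documented for the CLI (`--file-block-size 65536`, `--analysis-block-size 64`, `--step-size 16`); the function itself is illustrative, not the project's code:

```python
from pathlib import Path
from typing import Iterator

# Defaults documented for the entropy CLI options
FILE_BLOCK_SIZE = 65536      # file I/O read size
ANALYSIS_BLOCK_SIZE = 64     # entropy window size
STEP_SIZE = 16               # sliding-window step

def iter_analysis_windows(path: Path) -> Iterator[tuple[int, bytes]]:
    """Yield (offset, window) pairs without loading the whole file into memory."""
    carry = b""   # bytes spanning a read boundary
    base = 0      # absolute offset of the start of `buf`
    with path.open("rb") as fh:
        while chunk := fh.read(FILE_BLOCK_SIZE):
            buf = carry + chunk
            pos = 0
            while pos + ANALYSIS_BLOCK_SIZE <= len(buf):
                yield base + pos, buf[pos : pos + ANALYSIS_BLOCK_SIZE]
                pos += STEP_SIZE
            carry = buf[pos:]
            base += pos
    # Trailing bytes shorter than a full window are skipped in this sketch.
```

Only `FILE_BLOCK_SIZE + ANALYSIS_BLOCK_SIZE` bytes are resident at any time, which is what allows arbitrarily large inputs.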

Key Benefits

  • Modular Architecture: Easy to extend with new analysis types or detection methods
  • Memory Efficiency: Streaming analysis processes files of any size with bounded memory usage
  • High Accuracy: Content-aware thresholds significantly reduce false positive rates
  • Container Integration: Seamless integration with SSF Tools dependency injection system
  • CLI Ready: Direct integration with command-line interface through injected services

Architecture Overview

graph TD
    %% CLI Entry Points
    CLI[CLI Commands<br/>analyze entropy, analyze credentials]

    %% Core Analysis Services
    subgraph ANALYZE [📁 Analysis Services]
        EA[EntropyAnalyzer<br/>Shannon entropy calculation]
        DS[DetectSecretsService<br/>Credential detection]
        TS[ThresholdService<br/>Content-aware thresholds]
        FC[FileClassifier<br/>File type detection]
    end

    %% Model Layer
    subgraph MODELS [📁 Analysis Models]
        AM[AnalysisModels<br/>Input/Result types]
        CAT[ContentAwareThresholds<br/>File-specific thresholds]
        TYPES[AnalysisTypes<br/>Enums and constants]
    end

    %% Service Dependencies
    CLI --> ANALYZE
    ANALYZE --> MODELS

    %% Core Service Integration
    subgraph CORE [🔌 Core Services]
        FPS[FileProcessingService<br/>File I/O operations]
        ROS[RichOutputService<br/>User interface]
        TS_SVC[TimestampService<br/>Analysis timestamps]
        HTTP[HTTPClientService<br/>External API calls]
    end
    ANALYZE --> CORE

    %% External Integrations
    subgraph EXTERNAL [🔌 External Integrations]
        DETECT[detect-secrets<br/>Credential scanning]
        MAGIC[python-magic<br/>MIME detection]
        PYGMENTS[pygments<br/>Language detection]
    end
    ANALYZE --> EXTERNAL

    %% Output Systems
    subgraph OUTPUT [💾 Output Systems]
        EXCEL[Excel Export<br/>Streaming XLSX output]
        JSON[JSON Export<br/>Structured results]
        YAML[YAML Export<br/>Configuration format]
    end
    ANALYZE --> OUTPUT

    %% Styling
    classDef coreService fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef protocol fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef component fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef integration fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    classDef cli fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef storage fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef group fill:#f8f9fa,stroke:#6c757d,stroke-width:3px,stroke-dasharray: 5 5

    class EA,DS,TS,FC coreService
    class AM,CAT,TYPES component
    class DETECT,MAGIC,PYGMENTS integration
    class CLI cli
    class EXCEL,JSON,YAML storage
    class ANALYZE,MODELS,CORE,EXTERNAL,OUTPUT group

Analysis Service Protocols

The analysis architecture is built around well-defined protocols that enable testing, modularity, and dependency injection. For detailed protocol specifications, see the Architecture and Design Patterns document.

Key Protocol Overview

The architecture defines the following protocols for consistent service interfaces:

  • EntropyAnalyzerProtocol: Primary protocol for Shannon entropy analysis with content-aware thresholds
  • CredentialDetectionProtocol: Protocol for credential detection using detect-secrets integration
  • ThresholdProviderProtocol: Protocol for content-aware threshold lookup by file type
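
The exact signatures live in the Architecture and Design Patterns document; as an illustrative sketch of the protocol-first approach (method names and parameters here are assumptions), an entropy-analysis protocol might be defined with `typing.Protocol` so any structurally matching class, including a test double, conforms:

```python
from pathlib import Path
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class EntropyAnalyzerProtocol(Protocol):
    """Structural interface: any class with a matching analyze_file() conforms."""

    def analyze_file(self, path: Path, *, block_size: int = 64) -> Iterable[float]:
        """Yield per-block entropy values for the given file."""
        ...


class FakeAnalyzer:
    """Test double satisfying the protocol without touching disk or the container."""

    def analyze_file(self, path: Path, *, block_size: int = 64) -> list[float]:
        return [0.0]
```

Because conformance is structural rather than nominal, services can be swapped in the container or stubbed in tests without inheritance.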

Configuration Models

AnalysisConfiguration

Main configuration model for analysis operations.

Configuration Model

The configuration model defines the structure and validation rules for analysis settings. For the complete configuration protocol and validation details, see the Architecture and Design Patterns document.

Global Configuration Integration

The analysis system integrates with SSF Tools global configuration.

# In ssf-tools-config.yaml
global:
  analysis_block_size: 8192              # Default block size for analysis
  max_concurrent_files: 10               # Maximum files to process concurrently

analyze:
  entropy:
    content_aware_enabled: true          # Enable content-aware thresholds
    streaming_enabled: true              # Enable streaming analysis
    max_regions_per_file: 1000          # Maximum entropy regions per file
  credentials:
    detect_secrets_baseline: null        # Path to detect-secrets baseline
    exclude_patterns: []                 # Patterns to exclude from scanning
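
SSF Tools validates configuration with Pydantic models (see the configuration section above); as a dependency-free sketch of the layering idea only — defaults, then file values, then CLI overrides, with later layers winning — and with illustrative key names:

```python
from typing import Any

# Defaults mirroring the global settings shown in the YAML above
DEFAULTS: dict[str, Any] = {
    "analysis_block_size": 8192,
    "max_concurrent_files": 10,
}

def merge_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge configuration layers; later layers win, None means 'not provided'."""
    merged: dict[str, Any] = {}
    for layer in layers:
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

file_cfg = {"analysis_block_size": 4096}          # from ssf-tools-config.yaml
cli_overrides = {"max_concurrent_files": None}    # flag not given on the CLI
config = merge_config(DEFAULTS, file_cfg, cli_overrides)
```

This is the same precedence the CLI applies when it passes `command_overrides` into the configuration service.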

Service Implementation

This section outlines the concrete implementations of the analysis services. For protocol definitions and interface specifications, see the Architecture and Design Patterns document.

EntropyAnalyzer

Primary implementation of entropy analysis with streaming support. See the architecture document for the complete implementation details and protocol definition.

DetectSecretsCredentialService

Implementation of credential detection using detect-secrets. See the architecture document for the complete implementation details and subprocess integration patterns.

ContentAwareThresholdManager

Service for managing content-aware entropy thresholds by file type. See the architecture document for the complete implementation details and protocol definition.
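
To illustrate what content-aware threshold management does, the sketch below maps a raw entropy value onto a risk level using the Python-source cutoffs from the ContentAwareThresholds defaults shown later in this document. The mapping logic is one plausible reading of those fields, not the manager's actual implementation, and it omits the CRITICAL level exposed by the CLI:

```python
from enum import Enum


class EntropyLevel(Enum):
    VERY_LOW = "very_low"
    LOW = "low"
    MEDIUM = "medium"
    MEDIUM_HIGH = "medium_high"
    HIGH = "high"


# Cutoffs for Python source files, taken from the model defaults
PYTHON_THRESHOLDS = {
    "very_low_threshold": 4.0,
    "low_threshold": 5.0,
    "medium_high_threshold": 6.8,
    "high_threshold": 7.2,
}

def classify(entropy: float, t: dict[str, float]) -> EntropyLevel:
    """Map a raw entropy value to a risk level using content-aware cutoffs."""
    if entropy < t["very_low_threshold"]:
        return EntropyLevel.VERY_LOW
    if entropy < t["low_threshold"]:
        return EntropyLevel.LOW
    if entropy < t["medium_high_threshold"]:
        return EntropyLevel.MEDIUM
    if entropy < t["high_threshold"]:
        return EntropyLevel.MEDIUM_HIGH
    return EntropyLevel.HIGH
```

Using per-file-type cutoffs is what keeps, say, minified JavaScript (naturally high entropy) from being flagged at the same threshold as plain text.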

Container Integration

AnalysisContainer Registration

Services are registered in the dependency injection container for proper lifecycle management.

kp_ssf_tools.containers.analysis.AnalysisContainer

Bases: DeclarativeContainer

Container for analysis services (entropy, wordlists, crypto detection).

Source code in src\kp_ssf_tools\containers\analysis.py
class AnalysisContainer(containers.DeclarativeContainer):
    """Container for analysis services (entropy, wordlists, crypto detection)."""

    # Core services (injected from main container)
    core: providers.DependenciesContainer = providers.DependenciesContainer()

    threshold_manager: providers.Singleton[ThresholdProviderProtocol] = (
        providers.Singleton(
            ContentAwareThresholdManager,
        )
    )

    # Entropy analyzer with core services injected
    entropy_analyzer: providers.Factory[EntropyAnalyzer] = providers.Factory(
        "kp_ssf_tools.analyze.services.entropy.analyzer.EntropyAnalyzer",
        rich_output=core.rich_output,
        timestamp_service=core.timestamp,
        file_validator=core.file_validator,
        mime_detector=core.mime_detector,
        file_processing=core.file_processing,
        threshold_manager=threshold_manager,
    )

    # Detect-secrets based credential service
    detect_secrets_credential_service: providers.Singleton[
        DetectSecretsCredentialService
    ] = providers.Singleton(
        DetectSecretsCredentialService,
        rich_output=core.rich_output,
        timestamp_service=core.timestamp,
        file_discovery=core.file_discoverer,
        file_processing=core.file_processing,
    )

    # Active credential detection service
    active_credential_service = detect_secrets_credential_service  # type: ignore[assignment]

    # For backward compatibility during transition
    analyzer = entropy_analyzer

CLI Integration

Analyze Command Group

CLI commands that interact with analysis services through dependency injection.

kp_ssf_tools.cli.commands.analyze

Entropy analysis CLI commands with streaming Excel export.

Classes

AnalysisSummary

Bases: NamedTuple

Summary data for analysis results.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class AnalysisSummary(NamedTuple):
    """Summary data for analysis results."""

    total_files_analyzed: int
    total_files: int
    total_high_risk_regions: int
    total_time: float

ExportContext

Bases: NamedTuple

Context for Excel export operations.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ExportContext(NamedTuple):
    """Context for Excel export operations."""

    excel_service: ExcelExportService
    timestamp_service: TimestampService
    rich_output: RichOutputService

ProcessingConfig

Bases: NamedTuple

Configuration for file processing.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ProcessingConfig(NamedTuple):
    """Configuration for file processing."""

    file_block_size: int | None
    analysis_block_size: int | None
    step_size: int | None
    include_samples: bool

ProcessingContext

Bases: NamedTuple

Context for file processing with analyzer and configs.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ProcessingContext(NamedTuple):
    """Context for file processing with analyzer and configs."""

    analyzer: EntropyAnalyzer
    entropy_config: AnalysisConfiguration
    global_config: GlobalConfiguration
    rich_output: RichOutputService

Functions

analyze_group()

Security analysis commands for PCI SSF 2.3 compliance.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@click.group(name="analyze")
def analyze_group() -> None:
    """Security analysis commands for PCI SSF 2.3 compliance."""

credentials(target, *, recursive, file_extensions, context_lines, scan_binary, max_binary_size, credential_service=Provide[ApplicationContainer.analysis.active_credential_service], rich_output=Provide[ApplicationContainer.core.rich_output], excel_service=Provide[ApplicationContainer.core.excel_export_service], timestamp_service=Provide[ApplicationContainer.core.timestamp], global_config_service=Provide[ApplicationContainer.core.global_config_service], analysis_config_service=Provide[ApplicationContainer.core.entropy_config_service])

Detect credentials in files for PCI SSF 2.3 compliance.

Analyzes files for embedded credentials including usernames, passwords, API keys, and other sensitive information. Uses wordlists from SecLists and regex patterns to identify potential security issues.

Results are automatically exported to Excel with per-file worksheets and a summary sheet. Output filename: analyze-credentials-<timestamp>.xlsx

Arguments:

TARGET                  Path to file or directory to analyze

Examples:

# Basic credential detection
ssf_tools analyze credentials sample.py

# Analyze specific file types only
ssf_tools analyze credentials data/ --file-extensions .py --file-extensions .js

# Include more context around matches
ssf_tools analyze credentials config/ --context-lines 5

# Skip binary files to speed up analysis
ssf_tools analyze credentials project/ --no-scan-binary

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@analyze_group.command("credentials")
@click.argument("target", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--recursive/--no-recursive",
    default=True,
    help="Search directories recursively for files to analyze",
)
@click.option(
    "--file-extensions",
    multiple=True,
    help="File extensions to include (e.g., .py .js .txt). If not specified, all text files are analyzed",
)
@click.option(
    "--context-lines",
    type=int,
    default=3,
    help="Number of context lines to show around matches",
)
@click.option(
    "--scan-binary/--no-scan-binary",
    default=True,
    help="Whether to scan binary files for embedded credentials",
)
@click.option(
    "--max-binary-size",
    type=int,
    default=10,
    help="Maximum size in MB for binary files to scan",
)
@inject
def credentials(  # noqa: PLR0913
    target: Path,
    *,
    recursive: bool,
    file_extensions: tuple[str, ...],
    context_lines: int,
    scan_binary: bool,
    max_binary_size: int,
    credential_service: CredentialDetectionProtocol = Provide[
        ApplicationContainer.analysis.active_credential_service
    ],
    rich_output: RichOutputService = Provide[ApplicationContainer.core.rich_output],
    excel_service: ExcelExportService = Provide[
        ApplicationContainer.core.excel_export_service
    ],
    timestamp_service: TimestampService = Provide[ApplicationContainer.core.timestamp],
    global_config_service: ConfigurationService[GlobalConfiguration] = Provide[
        ApplicationContainer.core.global_config_service
    ],
    analysis_config_service: ConfigurationService[AnalysisConfiguration] = Provide[
        ApplicationContainer.core.entropy_config_service
    ],
) -> None:
    """
    Detect credentials in files for PCI SSF 2.3 compliance.

    Analyzes files for embedded credentials including usernames, passwords,
    API keys, and other sensitive information. Uses wordlists from SecLists
    and regex patterns to identify potential security issues.

    Results are automatically exported to Excel with per-file worksheets
    and a summary sheet. Output filename: analyze-credentials-<timestamp>.xlsx

    **Arguments:**
    ```
    TARGET                  Path to file or directory to analyze
    ```

    **Examples:**
    ```
    # Basic credential detection
    ssf_tools analyze credentials sample.py

    # Analyze specific file types only
    ssf_tools analyze credentials data/ --file-extensions .py --file-extensions .js

    # Include more context around matches
    ssf_tools analyze credentials config/ --context-lines 5

    # Skip binary files to speed up analysis
    ssf_tools analyze credentials project/ --no-scan-binary
    ```
    """
    # Load configurations
    global_config = global_config_service.load_config()
    analysis_config = analysis_config_service.load_config()

    # Set verbose mode if requested
    if global_config.output.verbose:
        rich_output.debug("Verbose mode enabled")

    # Convert file extensions to list
    extensions_list = list(file_extensions) if file_extensions else None

    try:
        # Convert the Pydantic configuration to the dict format expected by the service
        config_dict = {
            "credentials": {
                "enabled": analysis_config.credentials.enabled,
                "cache_duration_hours": analysis_config.credentials.cache_duration_hours,
                "auto_download": analysis_config.credentials.auto_download,
                "wordlist_sources": analysis_config.credentials.wordlist_sources,
            },
        }

        # Perform credential analysis
        scan_options = CredentialScanOptions(
            recursive=recursive,
            file_extensions=tuple(extensions_list) if extensions_list else (),
            context_lines=context_lines,
            scan_binary_files=scan_binary,
            max_binary_size_mb=max_binary_size,
        )

        result = credential_service.analyze_files(
            target_paths=[target],
            config=config_dict,
            options=scan_options,
        )

        # Export to Excel if results found
        if result and result.patterns:
            export_context = ExportContext(
                excel_service=excel_service,
                timestamp_service=timestamp_service,
                rich_output=rich_output,
            )
            _export_credentials_to_excel(result, export_context)

        # Display results in console
        credential_display_limit = 10
        if result and result.patterns:
            rich_output.warning(f"Found {len(result.patterns)} potential credentials")
            for pattern in result.patterns[:credential_display_limit]:  # Show first N
                rich_output.info(
                    f"  {pattern.pattern_type}: {pattern.value[:50]}... "
                    f"(line {pattern.line_start})",
                )
            if len(result.patterns) > credential_display_limit:
                rich_output.info(
                    f"  ... and {len(result.patterns) - credential_display_limit} more",
                )
            rich_output.info("Complete results have been exported to Excel")
        else:
            rich_output.success("No credentials detected in analyzed files")

    except Exception as e:
        rich_output.error(f"Credential analysis failed: {e}")
        if global_config.output.verbose:
            import traceback

            rich_output.error(traceback.format_exc())
        raise

entropy(target, risk_threshold, file_block_size, analysis_block_size, step_size, ignore_pattern, *, no_recurse, include_samples, analyzer=Provide[ApplicationContainer.entropy.analyzer], rich_output=Provide[ApplicationContainer.core.rich_output], file_discovery=Provide[ApplicationContainer.core.file_discoverer], global_config_service=Provide[ApplicationContainer.core.global_config_service], entropy_config_service=Provide[ApplicationContainer.core.entropy_config_service], timestamp_service=Provide[ApplicationContainer.core.timestamp])

Analyze entropy of files for PCI SSF 2.3 compliance.

Performs Shannon entropy analysis using content-aware thresholds to detect potentially suspicious patterns in files. Results are streamed directly to Excel with minimal memory usage.

Arguments:

TARGET                  Path to file or directory to analyze

Examples:

# Basic file analysis
ssf_tools analyze entropy sample.bin

# Analyze with higher risk threshold (fewer results)
ssf_tools analyze entropy sample.bin --risk-threshold high

# Analyze with custom block size
ssf_tools analyze entropy sample.bin --analysis-block-size 128

# Override file type detection
ssf_tools analyze entropy app.exe --force-file-type windows_pe

# Analyze directory non-recursively
ssf_tools analyze entropy data/ --no-recurse

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@analyze_group.command("entropy")
@click.argument("target", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--ignore-pattern",
    multiple=True,
    help="Glob pattern(s) to ignore when searching for files (e.g. --ignore-pattern='__pycache__' --ignore-pattern='*.egg-info')",
)
@click.option(
    "--risk-threshold",
    type=click.Choice(
        ["very_low", "low", "medium", "medium_high", "high", "critical"],
        case_sensitive=False,
    ),
    default="medium_high",
    help="Minimum risk level for regions to include in analysis (default: medium_high)",
)
@click.option(
    "--file-block-size",
    type=int,
    help="File I/O block size in bytes (default: 65536)",
)
@click.option(
    "--analysis-block-size",
    type=int,
    help="Analysis block size in bytes (default: 64)",
)
@click.option(
    "--step-size",
    type=int,
    help="Step size for sliding window analysis (default: 16)",
)
@click.option(
    "--no-recurse",
    is_flag=True,
    help="Disable recursive directory analysis (analyze current directory only)",
)
@click.option(
    "--include-samples",
    is_flag=True,
    help="Include data samples in region analysis (increases file size)",
)
@inject
def entropy(  # noqa: PLR0913
    target: Path,
    risk_threshold: str,
    file_block_size: int | None,
    analysis_block_size: int | None,
    step_size: int | None,
    ignore_pattern: tuple[str, ...],
    *,
    no_recurse: bool,
    include_samples: bool,
    analyzer: EntropyAnalyzer = Provide[ApplicationContainer.entropy.analyzer],
    rich_output: RichOutputService = Provide[ApplicationContainer.core.rich_output],
    file_discovery: FileDiscoveryService = Provide[
        ApplicationContainer.core.file_discoverer
    ],
    global_config_service: ConfigurationService[GlobalConfiguration] = Provide[
        ApplicationContainer.core.global_config_service
    ],
    entropy_config_service: ConfigurationService[AnalysisConfiguration] = Provide[
        ApplicationContainer.core.entropy_config_service
    ],
    timestamp_service: TimestampService = Provide[ApplicationContainer.core.timestamp],
) -> None:
    """
    Analyze entropy of files for PCI SSF 2.3 compliance.

    Performs Shannon entropy analysis using content-aware thresholds to detect
    potentially suspicious patterns in files. Results are streamed directly to
    Excel with minimal memory usage.

    **Arguments:**
    ```
    TARGET                  Path to file or directory to analyze
    ```

    **Examples:**
    ```
    # Basic file analysis
    ssf_tools analyze entropy sample.bin

    # Analyze with higher risk threshold (fewer results)
    ssf_tools analyze entropy sample.bin --risk-threshold high

    # Analyze with custom block size
    ssf_tools analyze entropy sample.bin --analysis-block-size 128

    # Override file type detection
    ssf_tools analyze entropy app.exe --force-file-type windows_pe

    # Analyze directory non-recursively
    ssf_tools analyze entropy data/ --no-recurse
    ```
    """
    try:
        # Build configuration overrides and load configs
        cli_overrides = _build_cli_overrides(
            file_block_size,
            analysis_block_size,
            step_size,
        )
        global_config = global_config_service.load_config()
        entropy_config = entropy_config_service.load_config(
            command_overrides=cli_overrides,
        )

        if global_config.output.verbose:
            rich_output.debug("Verbose mode enabled")

        # Setup analysis parameters
        risk_level = EntropyLevel[risk_threshold.upper()]
        timestamp = timestamp_service.format_filename_now()
        output_path = Path(f"entropy-analysis-{timestamp}.xlsx")

        # Discover and filter files to analyze
        files_to_analyze = _discover_and_filter_files(
            target,
            file_discovery,
            no_recurse=no_recurse,
            ignore_pattern=ignore_pattern,
            rich_output=rich_output,
        )
        if not files_to_analyze:
            return

        # Check Excel limits and warn if necessary
        step_size_val = step_size or entropy_config.analysis.step_size
        if _check_excel_limits(
            files_to_analyze,
            risk_level,
            step_size_val,
            rich_output,
        ):
            return

        # Process files
        rich_output.info(
            f"Starting entropy analysis with {risk_level.value} risk threshold",
        )
        rich_output.info(f"Output will be saved to: {output_path}")

        processing_config = ProcessingConfig(
            file_block_size=file_block_size,
            analysis_block_size=analysis_block_size,
            step_size=step_size,
            include_samples=include_samples,
        )

        context = ProcessingContext(
            analyzer=analyzer,
            entropy_config=entropy_config,
            global_config=global_config,
            rich_output=rich_output,
        )

        total_files_analyzed, total_high_risk_regions, total_time = _process_files(
            files_to_analyze,
            output_path,
            risk_level,
            context,
            processing_config,
        )

        # Report final summary
        summary = AnalysisSummary(
            total_files_analyzed=total_files_analyzed,
            total_files=len(files_to_analyze),
            total_high_risk_regions=total_high_risk_regions,
            total_time=total_time,
        )
        _report_summary(output_path, summary, rich_output)

    except Exception as e:
        rich_output.error(f"Analysis failed: {e}")
        if "global_config" in locals() and global_config.output.verbose:
            import traceback

            rich_output.error(traceback.format_exc())
        raise

Service Integration Patterns

File Processing Integration

Analysis services integrate with the core file processing system for efficient I/O operations through direct CLI orchestration.

The current implementation uses a streaming-based approach where CLI commands directly coordinate analysis workflows. See the Architecture and Design Patterns document for the complete implementation details of the file processing workflow.

This pattern provides:

  • Memory Efficiency: Streaming analysis with direct Excel export
  • Error Isolation: Per-file error handling without stopping the entire analysis
  • Progress Reporting: Real-time feedback on analysis progress
  • Configuration Flexibility: CLI overrides for analysis parameters
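
The error-isolation behavior can be sketched as a per-file try/except loop; the helper below is illustrative (function and variable names are hypothetical), not the project's orchestration code:

```python
from pathlib import Path
from typing import Callable


def analyze_all(
    paths: list[Path],
    analyze_one: Callable[[Path], None],
) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Run analyze_one() per file; collect failures instead of aborting the batch."""
    succeeded: list[Path] = []
    failed: list[tuple[Path, str]] = []
    for path in paths:
        try:
            analyze_one(path)
        except Exception as exc:  # isolate errors to the offending file
            failed.append((path, str(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed
```

A single unreadable or malformed file thus produces one failure record while the rest of the batch completes and is exported normally.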

Core Models

Analysis Models Overview

The analysis system uses well-defined Pydantic models for inputs and results. For detailed model specifications, see the Architecture and Design Patterns document.

Key model categories include:

  • Analysis Results: Structured outputs for entropy and credential detection operations
  • Pattern Models: Detection result containers for identified patterns and regions
  • Configuration Models: Input validation and type safety for analysis operations

Content-Aware Threshold Models

File-type-specific threshold definitions for accurate entropy classification.

kp_ssf_tools.analyze.models.content_aware

Content-Aware Thresholds module.

Classes

ContentAwareThresholds

Bases: SSFToolsBaseModel

File type-specific entropy thresholds loaded from configuration.

Source code in src\kp_ssf_tools\analyze\models\content_aware.py
class ContentAwareThresholds(SSFToolsBaseModel):
    """File type-specific entropy thresholds loaded from configuration."""

    file_type: FileType
    expected_entropy: tuple[float, float]  # (mean, std_dev) for normal content
    very_low_threshold: float  # Below this = VERY_LOW
    low_threshold: float  # Below this = LOW
    medium_threshold: float  # Normal range center
    medium_high_threshold: float  # Above this = MEDIUM_HIGH
    high_threshold: float  # Above this = HIGH

    @classmethod
    def get_default_values(cls) -> dict[FileType, dict[str, object]]:
        """
        Default threshold values for configuration file generation.

        These values are derived from extensive academic research documented
        in docs/file-entropy-research.md, including:
        - Lyda & Hamrock (2007) IEEE foundational paper
        - Davies et al. (2022) NapierOne dataset (500,000+ files)
        - Practical Security Analytics (500,000 PE file analysis)
        - Multiple peer-reviewed studies with statistical validation

        Returns a dict suitable for YAML configuration file generation.
        """
        return {
            # Top 20 Programming Languages (2025 Rankings)
            FileType.PYTHON: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Highly repetitive code
                "low_threshold": 5.0,  # Simple scripts, lots of comments
                "medium_threshold": 6.0,  # Typical Python code
                "medium_high_threshold": 6.8,  # Complex logic, minified
                "high_threshold": 7.2,  # Obfuscated/packed code | > Likely suspicious
            },
            FileType.JAVASCRIPT: {
                "expected_entropy": [
                    5.4,
                    0.8,
                ],  # Mean=5.4, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple scripts
                "low_threshold": 4.9,  # Basic JS with comments
                "medium_threshold": 5.9,  # Typical JavaScript
                "medium_high_threshold": 6.8,  # Minified/complex
                "high_threshold": 7.2,  # Obfuscated code | > Likely suspicious
            },
            FileType.JAVA: {
                "expected_entropy": [
                    5.6,
                    0.7,
                ],  # Mean=5.6, StdDev=0.7
                "very_low_threshold": 4.0,  # Verbose Java patterns
                "low_threshold": 5.0,  # Simple classes
                "medium_threshold": 6.0,  # Typical Java code
                "medium_high_threshold": 6.8,  # Complex enterprise code
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.CPP: {
                "expected_entropy": [
                    5.8,
                    0.9,
                ],  # Mean=5.8, StdDev=0.9
                "very_low_threshold": 4.0,  # Header files
                "low_threshold": 5.0,  # Simple implementations
                "medium_threshold": 6.2,  # Typical C++ code
                "medium_high_threshold": 7.0,  # Template-heavy code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.C: {
                "expected_entropy": [
                    5.7,
                    0.9,
                ],  # Mean=5.7, StdDev=0.9
                "very_low_threshold": 4.0,  # Header files
                "low_threshold": 5.0,  # Simple C code
                "medium_threshold": 6.1,  # Typical C programs
                "medium_high_threshold": 6.9,  # Complex system code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.CSHARP: {
                "expected_entropy": [
                    5.6,
                    0.8,
                ],  # Mean=5.6, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple classes
                "low_threshold": 5.0,  # Basic C# code
                "medium_threshold": 6.0,  # Typical C# applications
                "medium_high_threshold": 6.8,  # Complex .NET code
                "high_threshold": 7.2,  # IL bytecode/obfuscated | > Likely suspicious
            },
            FileType.TYPESCRIPT: {
                "expected_entropy": [
                    5.4,
                    0.8,
                ],  # Mean=5.4, StdDev=0.8
                "very_low_threshold": 4.0,  # Type definitions
                "low_threshold": 4.9,  # Simple TypeScript
                "medium_threshold": 5.9,  # Typical TS code
                "medium_high_threshold": 6.8,  # Complex/transpiled
                "high_threshold": 7.2,  # Obfuscated output | > Likely suspicious
            },
            FileType.PHP: {
                "expected_entropy": [
                    5.3,
                    0.8,
                ],  # Mean=5.3, StdDev=0.8
                "very_low_threshold": 4.0,  # HTML mixed PHP
                "low_threshold": 4.8,  # Simple PHP scripts
                "medium_threshold": 5.8,  # Typical PHP code
                "medium_high_threshold": 6.7,  # Complex frameworks
                "high_threshold": 7.1,  # Obfuscated PHP | > Likely suspicious
            },
            FileType.GO: {
                "expected_entropy": [
                    5.5,
                    0.7,
                ],  # Mean=5.5, StdDev=0.7
                "very_low_threshold": 4.0,  # Simple Go code
                "low_threshold": 5.0,  # Basic programs
                "medium_threshold": 6.0,  # Typical Go code
                "medium_high_threshold": 6.8,  # Complex concurrent code
                "high_threshold": 7.2,  # Compiled binary data | > Likely suspicious
            },
            FileType.SQL: {
                "expected_entropy": [
                    5.2,
                    0.9,
                ],  # Mean=5.2, StdDev=0.9
                "very_low_threshold": 3.8,  # Simple queries
                "low_threshold": 4.7,  # Basic SQL statements
                "medium_threshold": 5.7,  # Complex queries
                "medium_high_threshold": 6.6,  # Stored procedures
                "high_threshold": 7.0,  # Obfuscated SQL | > Likely suspicious
            },
            FileType.RUST: {
                "expected_entropy": [
                    5.7,
                    0.8,
                ],  # Mean=5.7, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Rust code
                "low_threshold": 5.0,  # Basic implementations
                "medium_threshold": 6.1,  # Typical Rust code
                "medium_high_threshold": 6.9,  # Complex unsafe code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.SWIFT: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Swift code
                "low_threshold": 5.0,  # Basic iOS code
                "medium_threshold": 6.0,  # Typical Swift apps
                "medium_high_threshold": 6.8,  # Complex frameworks
                "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.KOTLIN: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Kotlin code
                "low_threshold": 5.0,  # Basic Android code
                "medium_threshold": 6.0,  # Typical Kotlin apps
                "medium_high_threshold": 6.8,  # Complex coroutines
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.RUBY: {
                "expected_entropy": [
                    5.3,
                    0.8,
                ],  # Mean=5.3, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Ruby scripts
                "low_threshold": 4.8,  # Basic Rails code
                "medium_threshold": 5.8,  # Typical Ruby code
                "medium_high_threshold": 6.7,  # Complex metaprogramming
                "high_threshold": 7.1,  # Obfuscated Ruby | > Likely suspicious
            },
            FileType.R: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple R scripts
                "low_threshold": 4.8,  # Basic statistics
                "medium_threshold": 5.9,  # Typical R analysis
                "medium_high_threshold": 6.8,  # Complex models
                "high_threshold": 7.1,  # Compiled R code | > Likely suspicious
            },
            FileType.VISUAL_BASIC: {
                "expected_entropy": [
                    5.2,
                    0.8,
                ],  # Mean=5.2, StdDev=0.8
                "very_low_threshold": 3.9,  # Simple VB code
                "low_threshold": 4.7,  # Basic VB.NET
                "medium_threshold": 5.7,  # Typical VB apps
                "medium_high_threshold": 6.6,  # Complex forms
                "high_threshold": 7.0,  # Obfuscated VB | > Likely suspicious
            },
            FileType.SCALA: {
                "expected_entropy": [
                    5.6,
                    0.8,
                ],  # Mean=5.6, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Scala code
                "low_threshold": 5.0,  # Basic functional code
                "medium_threshold": 6.0,  # Typical Scala apps
                "medium_high_threshold": 6.8,  # Complex Spark code
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.MATLAB: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple scripts
                "low_threshold": 4.8,  # Basic computations
                "medium_threshold": 5.9,  # Typical MATLAB code
                "medium_high_threshold": 6.8,  # Complex algorithms
                "high_threshold": 7.1,  # Compiled MEX files | > Likely suspicious
            },
            FileType.PERL: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple Perl scripts
                "low_threshold": 4.8,  # Basic regex code
                "medium_threshold": 5.9,  # Typical Perl code
                "medium_high_threshold": 6.8,  # Complex one-liners
                "high_threshold": 7.1,  # Obfuscated Perl | > Likely suspicious
            },
            FileType.DART: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Dart code
                "low_threshold": 5.0,  # Basic Flutter widgets
                "medium_threshold": 6.0,  # Typical Dart apps
                "medium_high_threshold": 6.8,  # Complex async code
                "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
            },
            # Documentation Files
            FileType.DOCUMENTATION: {
                "expected_entropy": [
                    4.8,
                    0.65,
                ],  # Mean=4.8, StdDev=0.65 (combined plain/markdown)
                "very_low_threshold": 3.55,  # Highly repetitive text
                "low_threshold": 4.25,  # Simple documentation
                "medium_threshold": 5.1,  # Typical documentation
                "medium_high_threshold": 5.65,  # Technical docs with code
                "high_threshold": 6.15,  # Mixed content | Anomalous for docs
            },
            # Binary Executables
            FileType.WINDOWS_PE: {
                "expected_entropy": [
                    6.0,
                    1.2,
                ],  # Mean=6.0, StdDev=1.2
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.2,  # Code sections
                "medium_threshold": 6.5,  # Typical PE files
                "medium_high_threshold": 7.0,  # Complex binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            FileType.MACOS_MACHO: {
                "expected_entropy": [
                    5.9,
                    1.2,
                ],  # Mean=5.9, StdDev=1.2
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.2,  # Code sections
                "medium_threshold": 6.4,  # Typical MachO files
                "medium_high_threshold": 6.9,  # Universal binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            FileType.LINUX_ELF: {
                "expected_entropy": [
                    5.8,
                    1.1,
                ],  # Mean=5.8, StdDev=1.1
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.1,  # Code sections
                "medium_threshold": 6.3,  # Typical ELF files
                "medium_high_threshold": 6.8,  # Complex binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            # Encrypted/Suspicious Content
            FileType.ENCRYPTED: {
                "expected_entropy": [
                    7.99,
                    0.01,
                ],  # Mean=7.99, StdDev=0.01 (AES validated)
                "very_low_threshold": 7.8,  # Weak/broken encryption
                "low_threshold": 7.85,  # Poor encryption
                "medium_threshold": 7.9,  # Possible encryption
                "medium_high_threshold": 7.95,  # Likely encrypted
                "high_threshold": 7.98,  # Strong encryption | > Max entropy
            },
            FileType.BASE64_ENCODED: {
                "expected_entropy": [
                    6.0,
                    0.3,
                ],  # Mean=6.0, StdDev=0.3
                "very_low_threshold": 5.2,  # Partial encoding
                "low_threshold": 5.5,  # Simple base64
                "medium_threshold": 6.0,  # Typical base64
                "medium_high_threshold": 6.3,  # Complex encoded data
                "high_threshold": 6.5,  # Encrypted then encoded | > Suspicious encoding
            },
            FileType.HEX_ENCODED: {
                "expected_entropy": [
                    4.0,
                    0.2,
                ],  # Mean=4.0, StdDev=0.2
                "very_low_threshold": 3.5,  # Partial hex
                "low_threshold": 3.7,  # Simple hex strings
                "medium_threshold": 4.0,  # Typical hex encoding
                "medium_high_threshold": 4.2,  # Complex hex data
                "high_threshold": 4.4,  # Anomalous hex | > Suspicious pattern
            },
            # Unknown file types - Conservative thresholds
            FileType.UNKNOWN: {
                "expected_entropy": [
                    5.5,
                    1.5,
                ],  # Mean=5.5, StdDev=1.5 (conservative mixed content)
                "very_low_threshold": 3.0,  # Likely text/structured
                "low_threshold": 4.5,  # Probable code/data
                "medium_threshold": 6.0,  # Typical binary content
                "medium_high_threshold": 7.0,  # Complex binary/media
                "high_threshold": 7.2,  # Boundary suspicious | > Conservative threshold
            },
        }

    @classmethod
    def get_default_models(cls) -> dict[FileType, ContentAwareThresholds]:
        """
        Get pre-built Pydantic model instances for all file types.

        Returns validated ContentAwareThresholds models instead of raw dicts.
        Use this method for runtime threshold management to avoid dict-to-model conversion.

        """
        models = {}
        for file_type, data in cls.get_default_values().items():
            # Cast values from object to proper types
            expected_entropy = data["expected_entropy"]
            if isinstance(expected_entropy, list | tuple):
                float_values = [float(x) for x in expected_entropy]
                # Ensure we have exactly 2 values for tuple[float, float]
                expected_tuple_length = 2
                if len(float_values) >= expected_tuple_length:
                    entropy_tuple = (float_values[0], float_values[1])
                else:
                    entropy_tuple = (0.0, 8.0)  # fallback
            else:
                entropy_tuple = (0.0, 8.0)  # fallback

            models[file_type] = cls(
                file_type=file_type,
                expected_entropy=entropy_tuple,
                very_low_threshold=cast("float", data["very_low_threshold"]),
                low_threshold=cast("float", data["low_threshold"]),
                medium_threshold=cast("float", data["medium_threshold"]),
                medium_high_threshold=cast("float", data["medium_high_threshold"]),
                high_threshold=cast("float", data["high_threshold"]),
            )
        return models

    @classmethod
    def for_file_type(cls, file_type: FileType) -> ContentAwareThresholds:
        """
        Factory method to create a threshold model for a specific file type.

        Args:
            file_type: The file type to get thresholds for

        Returns:
            ContentAwareThresholds model instance with validated data

        Note:
            Unsupported file types fall back to conservative default thresholds.

        """
        defaults = cls.get_default_values()
        if file_type not in defaults:
            # Return sensible defaults for unknown file types
            return cls(
                file_type=file_type,
                expected_entropy=(5.5, 1.0),
                very_low_threshold=4.0,
                low_threshold=5.0,
                medium_threshold=6.0,
                medium_high_threshold=6.8,
                high_threshold=7.2,
            )

        data = defaults[file_type]
        # Cast values from object to proper types
        expected_entropy = data["expected_entropy"]
        if isinstance(expected_entropy, list | tuple):
            float_values = [float(x) for x in expected_entropy]
            # Ensure we have exactly 2 values for tuple[float, float]
            expected_tuple_length = 2
            if len(float_values) >= expected_tuple_length:
                entropy_tuple = (float_values[0], float_values[1])
            else:
                entropy_tuple = (0.0, 8.0)  # fallback
        else:
            entropy_tuple = (0.0, 8.0)  # fallback

        return cls(
            file_type=file_type,
            expected_entropy=entropy_tuple,
            very_low_threshold=cast("float", data["very_low_threshold"]),
            low_threshold=cast("float", data["low_threshold"]),
            medium_threshold=cast("float", data["medium_threshold"]),
            medium_high_threshold=cast("float", data["medium_high_threshold"]),
            high_threshold=cast("float", data["high_threshold"]),
        )
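These per-type boundaries are applied by measuring a file's Shannon entropy and finding the band it falls in. A self-contained sketch of that comparison (`shannon_entropy`, `classify`, and `PYTHON_BANDS` are illustrative names, not the library's API; the Python-source values are copied from the defaults table above):

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0-8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


# Band upper bounds for Python source, copied from the defaults table above.
PYTHON_BANDS = [
    ("very_low", 4.0),
    ("low", 5.0),
    ("medium", 6.0),
    ("medium_high", 6.8),
    ("high", 7.2),
]


def classify(entropy: float, bands: list[tuple[str, float]]) -> str:
    """Return the first band whose upper bound the measured entropy stays below."""
    for name, upper in bands:
        if entropy < upper:
            return name
    return "suspicious"  # above high_threshold: likely obfuscated/packed


print(classify(shannon_entropy(b"def add(a, b):\n    return a + b\n"), PYTHON_BANDS))  # very_low
print(classify(shannon_entropy(os.urandom(4096)), PYTHON_BANDS))  # suspicious
```

A short repetitive snippet lands well below the very-low boundary, while random bytes exceed the 7.2 high threshold, matching the "likely suspicious" annotations in the table.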
Functions
for_file_type(file_type) classmethod

Factory method to create a threshold model for a specific file type.

Parameters:

    file_type (FileType, required): The file type to get thresholds for

Returns:

    ContentAwareThresholds: A model instance with validated threshold data

Note:

    Unsupported file types fall back to conservative default thresholds; no KeyError is raised.

get_default_models() classmethod

Get pre-built Pydantic model instances for all file types.

Returns validated ContentAwareThresholds models instead of raw dicts. Use this method for runtime threshold management to avoid dict-to-model conversion.

get_default_values() classmethod

Default threshold values for configuration file generation.

These values are derived from extensive academic research documented in docs/file-entropy-research.md, including:

- Lyda & Hamrock (2007) IEEE foundational paper
- Davies et al. (2022) NapierOne dataset (500,000+ files)
- Practical Security Analytics (500,000 PE file analysis)
- Multiple peer-reviewed studies with statistical validation

Returns a dict suitable for YAML configuration file generation.
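`get_default_values` returns plain dicts precisely so they can be serialized into a configuration file. A minimal sketch of YAML-style emission using only the standard library (the `to_yaml` helper and flat key layout are illustrative; the real exporter may differ):

```python
def to_yaml(defaults: dict[str, dict[str, object]]) -> str:
    """Emit a minimal YAML document from the nested defaults mapping."""
    lines: list[str] = []
    for file_type, fields in defaults.items():
        lines.append(f"{file_type}:")
        for key, value in fields.items():
            if isinstance(value, list):  # expected_entropy is [mean, std dev]
                rendered = "[" + ", ".join(str(v) for v in value) + "]"
            else:
                rendered = str(value)
            lines.append(f"  {key}: {rendered}")
    return "\n".join(lines) + "\n"


# One entry copied from the defaults table above.
sample = {
    "python": {
        "expected_entropy": [5.5, 0.8],
        "very_low_threshold": 4.0,
        "high_threshold": 7.2,
    },
}
print(to_yaml(sample))
```

In practice a YAML library would handle quoting and nesting, but the flat mapping-of-mappings shape shown here is all the defaults dict requires.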

Source code in src\kp_ssf_tools\analyze\models\content_aware.py
@classmethod
def get_default_values(cls) -> dict[FileType, dict[str, object]]:
    """
    Default threshold values for configuration file generation.

    These values are derived from extensive academic research documented
    in docs/file-entropy-research.md, including:
    - Lyda & Hamrock (2007) IEEE foundational paper
    - Davies et al. (2022) NapierOne dataset (500,000+ files)
    - Practical Security Analytics (500,000 PE file analysis)
    - Multiple peer-reviewed studies with statistical validation

    Returns a dict suitable for YAML configuration file generation.
    """
    return {
        # Top 20 Programming Languages (2025 Rankings)
        FileType.PYTHON: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Highly repetitive code
            "low_threshold": 5.0,  # Simple scripts, lots of comments
            "medium_threshold": 6.0,  # Typical Python code
            "medium_high_threshold": 6.8,  # Complex logic, minified
            "high_threshold": 7.2,  # Obfuscated/packed code | > Likely suspicious
        },
        FileType.JAVASCRIPT: {
            "expected_entropy": [
                5.4,
                0.8,
            ],  # Mean=5.4, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple scripts
            "low_threshold": 4.9,  # Basic JS with comments
            "medium_threshold": 5.9,  # Typical JavaScript
            "medium_high_threshold": 6.8,  # Minified/complex
            "high_threshold": 7.2,  # Obfuscated code | > Likely suspicious
        },
        FileType.JAVA: {
            "expected_entropy": [
                5.6,
                0.7,
            ],  # Mean=5.6, StdDev=0.7
            "very_low_threshold": 4.0,  # Verbose Java patterns
            "low_threshold": 5.0,  # Simple classes
            "medium_threshold": 6.0,  # Typical Java code
            "medium_high_threshold": 6.8,  # Complex enterprise code
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.CPP: {
            "expected_entropy": [
                5.8,
                0.9,
            ],  # Mean=5.8, StdDev=0.9
            "very_low_threshold": 4.0,  # Header files
            "low_threshold": 5.0,  # Simple implementations
            "medium_threshold": 6.2,  # Typical C++ code
            "medium_high_threshold": 7.0,  # Template-heavy code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.C: {
            "expected_entropy": [
                5.7,
                0.9,
            ],  # Mean=5.7, StdDev=0.9
            "very_low_threshold": 4.0,  # Header files
            "low_threshold": 5.0,  # Simple C code
            "medium_threshold": 6.1,  # Typical C programs
            "medium_high_threshold": 6.9,  # Complex system code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.CSHARP: {
            "expected_entropy": [
                5.6,
                0.8,
            ],  # Mean=5.6, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple classes
            "low_threshold": 5.0,  # Basic C# code
            "medium_threshold": 6.0,  # Typical C# applications
            "medium_high_threshold": 6.8,  # Complex .NET code
            "high_threshold": 7.2,  # IL bytecode/obfuscated | > Likely suspicious
        },
        FileType.TYPESCRIPT: {
            "expected_entropy": [
                5.4,
                0.8,
            ],  # Mean=5.4, StdDev=0.8
            "very_low_threshold": 4.0,  # Type definitions
            "low_threshold": 4.9,  # Simple TypeScript
            "medium_threshold": 5.9,  # Typical TS code
            "medium_high_threshold": 6.8,  # Complex/transpiled
            "high_threshold": 7.2,  # Obfuscated output | > Likely suspicious
        },
        FileType.PHP: {
            "expected_entropy": [
                5.3,
                0.8,
            ],  # Mean=5.3, StdDev=0.8
            "very_low_threshold": 4.0,  # HTML mixed PHP
            "low_threshold": 4.8,  # Simple PHP scripts
            "medium_threshold": 5.8,  # Typical PHP code
            "medium_high_threshold": 6.7,  # Complex frameworks
            "high_threshold": 7.1,  # Obfuscated PHP | > Likely suspicious
        },
        FileType.GO: {
            "expected_entropy": [
                5.5,
                0.7,
            ],  # Mean=5.5, StdDev=0.7
            "very_low_threshold": 4.0,  # Simple Go code
            "low_threshold": 5.0,  # Basic programs
            "medium_threshold": 6.0,  # Typical Go code
            "medium_high_threshold": 6.8,  # Complex concurrent code
            "high_threshold": 7.2,  # Compiled binary data | > Likely suspicious
        },
        FileType.SQL: {
            "expected_entropy": [
                5.2,
                0.9,
            ],  # Mean=5.2, StdDev=0.9
            "very_low_threshold": 3.8,  # Simple queries
            "low_threshold": 4.7,  # Basic SQL statements
            "medium_threshold": 5.7,  # Complex queries
            "medium_high_threshold": 6.6,  # Stored procedures
            "high_threshold": 7.0,  # Obfuscated SQL | > Likely suspicious
        },
        FileType.RUST: {
            "expected_entropy": [
                5.7,
                0.8,
            ],  # Mean=5.7, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Rust code
            "low_threshold": 5.0,  # Basic implementations
            "medium_threshold": 6.1,  # Typical Rust code
            "medium_high_threshold": 6.9,  # Complex unsafe code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.SWIFT: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Swift code
            "low_threshold": 5.0,  # Basic iOS code
            "medium_threshold": 6.0,  # Typical Swift apps
            "medium_high_threshold": 6.8,  # Complex frameworks
            "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.KOTLIN: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Kotlin code
            "low_threshold": 5.0,  # Basic Android code
            "medium_threshold": 6.0,  # Typical Kotlin apps
            "medium_high_threshold": 6.8,  # Complex coroutines
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.RUBY: {
            "expected_entropy": [
                5.3,
                0.8,
            ],  # Mean=5.3, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Ruby scripts
            "low_threshold": 4.8,  # Basic Rails code
            "medium_threshold": 5.8,  # Typical Ruby code
            "medium_high_threshold": 6.7,  # Complex metaprogramming
            "high_threshold": 7.1,  # Obfuscated Ruby | > Likely suspicious
        },
        FileType.R: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple R scripts
            "low_threshold": 4.8,  # Basic statistics
            "medium_threshold": 5.9,  # Typical R analysis
            "medium_high_threshold": 6.8,  # Complex models
            "high_threshold": 7.1,  # Compiled R code | > Likely suspicious
        },
        FileType.VISUAL_BASIC: {
            "expected_entropy": [
                5.2,
                0.8,
            ],  # Mean=5.2, StdDev=0.8
            "very_low_threshold": 3.9,  # Simple VB code
            "low_threshold": 4.7,  # Basic VB.NET
            "medium_threshold": 5.7,  # Typical VB apps
            "medium_high_threshold": 6.6,  # Complex forms
            "high_threshold": 7.0,  # Obfuscated VB | > Likely suspicious
        },
        FileType.SCALA: {
            "expected_entropy": [
                5.6,
                0.8,
            ],  # Mean=5.6, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Scala code
            "low_threshold": 5.0,  # Basic functional code
            "medium_threshold": 6.0,  # Typical Scala apps
            "medium_high_threshold": 6.8,  # Complex Spark code
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.MATLAB: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple scripts
            "low_threshold": 4.8,  # Basic computations
            "medium_threshold": 5.9,  # Typical MATLAB code
            "medium_high_threshold": 6.8,  # Complex algorithms
            "high_threshold": 7.1,  # Compiled MEX files | > Likely suspicious
        },
        FileType.PERL: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple Perl scripts
            "low_threshold": 4.8,  # Basic regex code
            "medium_threshold": 5.9,  # Typical Perl code
            "medium_high_threshold": 6.8,  # Complex one-liners
            "high_threshold": 7.1,  # Obfuscated Perl | > Likely suspicious
        },
        FileType.DART: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Dart code
            "low_threshold": 5.0,  # Basic Flutter widgets
            "medium_threshold": 6.0,  # Typical Dart apps
            "medium_high_threshold": 6.8,  # Complex async code
            "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
        },
        # Documentation Files
        FileType.DOCUMENTATION: {
            "expected_entropy": [
                4.8,
                0.65,
            ],  # Mean=4.8, StdDev=0.65 (combined plain/markdown)
            "very_low_threshold": 3.55,  # Highly repetitive text
            "low_threshold": 4.25,  # Simple documentation
            "medium_threshold": 5.1,  # Typical documentation
            "medium_high_threshold": 5.65,  # Technical docs with code
            "high_threshold": 6.15,  # Mixed content | Anomalous for docs
        },
        # Binary Executables
        FileType.WINDOWS_PE: {
            "expected_entropy": [
                6.0,
                1.2,
            ],  # Mean=6.0, StdDev=1.2
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.2,  # Code sections
            "medium_threshold": 6.5,  # Typical PE files
            "medium_high_threshold": 7.0,  # Complex binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        FileType.MACOS_MACHO: {
            "expected_entropy": [
                5.9,
                1.2,
            ],  # Mean=5.9, StdDev=1.2
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.2,  # Code sections
            "medium_threshold": 6.4,  # Typical MachO files
            "medium_high_threshold": 6.9,  # Universal binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        FileType.LINUX_ELF: {
            "expected_entropy": [
                5.8,
                1.1,
            ],  # Mean=5.8, StdDev=1.1
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.1,  # Code sections
            "medium_threshold": 6.3,  # Typical ELF files
            "medium_high_threshold": 6.8,  # Complex binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        # Encrypted/Suspicious Content
        FileType.ENCRYPTED: {
            "expected_entropy": [
                7.99,
                0.01,
            ],  # Mean=7.99, StdDev=0.01 (AES validated)
            "very_low_threshold": 7.8,  # Weak/broken encryption
            "low_threshold": 7.85,  # Poor encryption
            "medium_threshold": 7.9,  # Possible encryption
            "medium_high_threshold": 7.95,  # Likely encrypted
            "high_threshold": 7.98,  # Strong encryption | > Max entropy
        },
        FileType.BASE64_ENCODED: {
            "expected_entropy": [
                6.0,
                0.3,
            ],  # Mean=6.0, StdDev=0.3
            "very_low_threshold": 5.2,  # Partial encoding
            "low_threshold": 5.5,  # Simple base64
            "medium_threshold": 6.0,  # Typical base64
            "medium_high_threshold": 6.3,  # Complex encoded data
            "high_threshold": 6.5,  # Encrypted then encoded | > Suspicious encoding
        },
        FileType.HEX_ENCODED: {
            "expected_entropy": [
                4.0,
                0.2,
            ],  # Mean=4.0, StdDev=0.2
            "very_low_threshold": 3.5,  # Partial hex
            "low_threshold": 3.7,  # Simple hex strings
            "medium_threshold": 4.0,  # Typical hex encoding
            "medium_high_threshold": 4.2,  # Complex hex data
            "high_threshold": 4.4,  # Anomalous hex | > Suspicious pattern
        },
        # Unknown file types - Conservative thresholds
        FileType.UNKNOWN: {
            "expected_entropy": [
                5.5,
                1.5,
            ],  # Mean=5.5, StdDev=1.5 (conservative mixed content)
            "very_low_threshold": 3.0,  # Likely text/structured
            "low_threshold": 4.5,  # Probable code/data
            "medium_threshold": 6.0,  # Typical binary content
            "medium_high_threshold": 7.0,  # Complex binary/media
            "high_threshold": 7.2,  # Boundary suspicious | > Conservative threshold
        },
    }
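The threshold bands above can be applied with a simple lookup. A minimal sketch (the `shannon_entropy` and `classify` helpers and the exact band-ordering logic are illustrative, not the ThresholdService API; the numbers are the FileType.UNKNOWN values from the table above):

```python
import math
from collections import Counter

# Conservative thresholds for FileType.UNKNOWN, copied from the table above
UNKNOWN_THRESHOLDS = {
    "very_low_threshold": 3.0,
    "low_threshold": 4.5,
    "medium_threshold": 6.0,
    "medium_high_threshold": 7.0,
    "high_threshold": 7.2,
}


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify(entropy: float, thresholds: dict[str, float]) -> str:
    """Map an entropy value onto the threshold bands (illustrative ordering)."""
    if entropy < thresholds["very_low_threshold"]:
        return "very_low"
    if entropy < thresholds["low_threshold"]:
        return "low"
    if entropy < thresholds["medium_threshold"]:
        return "medium"
    if entropy < thresholds["medium_high_threshold"]:
        return "medium_high"
    if entropy < thresholds["high_threshold"]:
        return "high"
    return "critical"


# All-identical bytes have zero entropy; uniformly distributed bytes reach 8.0
assert classify(shannon_entropy(b"a" * 1024), UNKNOWN_THRESHOLDS) == "very_low"
assert classify(shannon_entropy(bytes(range(256)) * 16), UNKNOWN_THRESHOLDS) == "critical"
```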

Analysis Type Definitions

Enumerations and type definitions used throughout the analysis system.

kp_ssf_tools.analyze.models.types

Types for entropy analysis models.

Classes

ComplianceStatus

Bases: StrEnum

PCI SSF compliance status levels.

Source code in src\kp_ssf_tools\analyze\models\types.py
class ComplianceStatus(StrEnum):
    """PCI SSF compliance status levels."""

    COMPLIANT = "compliant"
    NON_COMPLIANT = "non_compliant"
    REQUIRES_REVIEW = "requires_review"
    INSUFFICIENT_DATA = "insufficient_data"

CredentialRiskLevel

Bases: StrEnum

Risk levels for detected credentials.

Source code in src\kp_ssf_tools\analyze\models\types.py
class CredentialRiskLevel(StrEnum):
    """Risk levels for detected credentials."""

    CRITICAL = "critical"  # Confirmed high-value credentials
    HIGH = "high"  # Likely credentials with high impact
    MEDIUM = "medium"  # Potential credentials requiring review
    LOW = "low"  # Suspicious patterns, low confidence
    INFO = "info"  # Informational findings

CryptoStructureType

Bases: StrEnum

Types of cryptographic structures and credential patterns.

Source code in src\kp_ssf_tools\analyze\models\types.py
class CryptoStructureType(StrEnum):
    """Types of cryptographic structures and credential patterns."""

    # Cryptographic structures
    AES_SBOX = "aes_sbox"
    DES_SBOX = "des_sbox"
    ROUND_CONSTANTS = "round_constants"
    GALOIS_FIELD = "galois_field"
    BASE64_DATA = "base64_data"
    HEX_ENCODED = "hex_encoded"
    PEM_STRUCTURE = "pem_structure"
    CRYPTO_FUNCTION = "crypto_function"
    HARDCODED_KEY = "hardcoded_key"

    # Credential patterns (PCI SSF 2.3.b requirement)
    COMMON_USERNAME = "common_username"
    COMMON_PASSWORD = "common_password"  # noqa: S105
    DEFAULT_CREDENTIAL = "default_credential"
    API_KEY_PATTERN = "api_key_pattern"
    DATABASE_CONNECTION = "database_connection"
    CREDENTIAL_PATTERN = "credential_pattern"

EntropyLevel

Bases: StrEnum

Entropy classification levels (thresholds are file-type adaptive).

Source code in src\kp_ssf_tools\analyze\models\types.py
class EntropyLevel(StrEnum):
    """Entropy classification levels (thresholds are file-type adaptive)."""

    CRITICAL = "critical"  # Significantly above expected for file type
    HIGH = "high"  # Above normal range for file type
    MEDIUM_HIGH = "medium_high"  # Slightly elevated for file type
    MEDIUM = "medium"  # Normal range for file type
    LOW = "low"  # Below normal range for file type
    VERY_LOW = "very_low"  # Significantly below expected for file type

    @property
    def order(self) -> int:
        """
        Get the order of the entropy level.

        Higher levels have a higher order (e.g., CRITICAL > HIGH)
        """
        levels = list(type(self))
        return len(levels) - levels.index(self) - 1
Attributes
order property

Get the order of the entropy level.

Higher levels have a higher order (e.g., CRITICAL > HIGH)

FileType

Bases: StrEnum

File types for entropy analysis using Pygments lexer names where applicable.

Source code in src\kp_ssf_tools\analyze\models\types.py
class FileType(StrEnum):
    """File types for entropy analysis using Pygments lexer names where applicable."""

    # Programming languages (using Pygments lexer names)
    C = "C"
    CPP = "C++"
    CSHARP = "C#"
    DART = "Dart"
    GO = "Go"
    JAVA = "Java"
    JAVASCRIPT = "JavaScript"
    KOTLIN = "Kotlin"
    MATLAB = "MATLAB"
    PERL = "Perl"
    PHP = "PHP"
    PYTHON = "Python"
    R = "R"
    RUBY = "Ruby"
    RUST = "Rust"
    SCALA = "Scala"
    SQL = "SQL"
    SWIFT = "Swift"
    TYPESCRIPT = "TypeScript"
    VISUAL_BASIC = "Visual Basic"

    # Special content types
    DOCUMENTATION = "documentation"  # Plain text files, documentation
    BASE64_ENCODED = "base64_encoded"  # Custom
    ENCRYPTED = "encrypted"  # Custom
    HEX_ENCODED = "hex_encoded"  # Custom

    # Binary executables
    LINUX_ELF = "linux_elf"  # Custom
    MACOS_MACHO = "macos_macho"  # Custom
    WINDOWS_PE = "windows_pe"  # Custom

    # Fallback
    UNKNOWN = "unknown"

    @classmethod
    def from_pygments_lexer(cls, lexer_name: str) -> FileType:
        """
        Map Pygments lexer names to FileType enums.

        Handles multiple lexer names for the same language (e.g., "Python" vs "Python 3").

        Args:
            lexer_name: Name from Pygments lexer.name

        Returns:
            Corresponding FileType enum, defaults to UNKNOWN for unrecognized lexers

        """
        # Direct matches (most common case)
        for file_type in cls:
            if file_type.value == lexer_name:
                return file_type

        # Handle aliases and special cases
        lexer_aliases = {
            "Python 3": cls.PYTHON,
            "JavaScript+Lasso": cls.JAVASCRIPT,
            "VB.net": cls.VISUAL_BASIC,
            "Markdown": cls.DOCUMENTATION,
            "reStructuredText": cls.DOCUMENTATION,
            "Text only": cls.DOCUMENTATION,
        }

        return lexer_aliases.get(lexer_name, cls.UNKNOWN)
Functions
from_pygments_lexer(lexer_name) classmethod

Map Pygments lexer names to FileType enums.

Handles multiple lexer names for the same language (e.g., "Python" vs "Python 3").

Parameters:

  • lexer_name (str, required): Name from Pygments lexer.name

Returns:

  • FileType: Corresponding FileType enum, defaults to UNKNOWN for unrecognized lexers


Usage Examples

Basic Entropy Analysis

Command-line examples of entropy analysis operations.

# Basic entropy analysis
ssf_tools analyze entropy /path/to/file.bin

# Entropy analysis with custom block size
ssf_tools analyze entropy /path/to/directory --block-size 16384

# Output to specific file
ssf_tools analyze entropy /path/to/files --output analysis_results.xlsx

Credential Detection

Examples of credential scanning operations.

# Basic credential scan
ssf_tools analyze credentials /path/to/source/code

Service Usage in Code

Programmatic usage of analysis services within other commands.

from pathlib import Path

from dependency_injector.wiring import Provide, inject

from kp_ssf_tools.analyze.models.types import EntropyLevel, FileType


# ApplicationContainer is provided by the application's container module
@inject
def custom_analysis(
    target: Path,
    entropy_analyzer=Provide[ApplicationContainer.analysis.entropy_analyzer],
    threshold_service=Provide[ApplicationContainer.analysis.threshold_service],
):
    """Custom analysis combining multiple services."""

    # Get content-aware thresholds
    file_type = FileType.PYTHON
    thresholds = threshold_service.get_thresholds_for_file_type(file_type)

    # Perform entropy analysis, collecting regions above normal levels
    suspicious_regions = []
    for region in entropy_analyzer.analyze_file_generator(target, file_type):
        level = threshold_service.classify_entropy_level(region.entropy, file_type)
        if level in (EntropyLevel.HIGH, EntropyLevel.MEDIUM_HIGH):
            print(f"Suspicious region: {region.offset}-{region.offset + region.size}")
            suspicious_regions.append(region)

    return suspicious_regions

Performance Considerations

Memory Efficiency

The analysis system uses streaming techniques to handle large files efficiently.

  • Input File Streaming Analysis: Files are processed in chunks to maintain constant memory usage
  • Results File Streaming: Results are exported to Excel as they are received using xlsxwriter's constant_memory option. Memory consumption is set to 10MB regardless of input file or result set size
  • Generator Patterns: Results are yielded incrementally to avoid loading entire result sets
  • Resource Management: File handles and streams are properly managed through context managers
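The chunked approach can be sketched as a generator that never holds more than one block in memory at a time (the block size and the `(offset, size, entropy)` tuple shape are illustrative, not the EntropyAnalyzer API):

```python
import math
from collections import Counter
from pathlib import Path
from typing import Iterator


def entropy_regions(path: Path, block_size: int = 8192) -> Iterator[tuple[int, int, float]]:
    """Yield (offset, size, entropy) per block; memory use stays constant."""
    offset = 0
    with open(path, "rb") as f:
        # Read one block at a time; the walrus expression stops at EOF
        while chunk := f.read(block_size):
            counts = Counter(chunk)
            total = len(chunk)
            entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
            yield (offset, total, entropy)
            offset += total
```

Because blocks are yielded as soon as they are read, downstream consumers (such as a streaming Excel exporter) can process results without buffering the whole file.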

Processing Optimization

Multiple optimization strategies improve analysis performance.

  • Content-Aware Processing: Different file types use optimized analysis strategies
  • Parallel Processing: Multiple files can be analyzed concurrently when memory allows
  • Early Termination: Analysis can stop early when maximum regions are reached
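Concurrent multi-file analysis can be sketched with a bounded worker pool; here `analyze_one` is a hypothetical stand-in for a real per-file analysis, and the worker count caps memory use:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyze_one(path: Path) -> tuple[Path, int]:
    """Stand-in for a per-file analysis; returns the file size here."""
    return path, path.stat().st_size


def analyze_many(paths: list[Path], max_workers: int = 4) -> list[tuple[Path, int]]:
    """Analyze files concurrently; bound max_workers to respect memory limits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with paths
        return list(pool.map(analyze_one, paths))
```

A thread pool suits I/O-bound file reads; a process pool would be the analogous choice if per-file analysis were CPU-bound.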

Testing Patterns

Mock Analysis Services

Create mock services for testing analysis workflows.

from unittest.mock import Mock

import pytest


@pytest.fixture
def mock_entropy_analyzer():
    """Mock entropy analyzer for testing."""
    mock_analyzer = Mock(spec=EntropyAnalyzerProtocol)
    mock_analyzer.analyze_file_generator.return_value = iter([
        EntropyRegion(offset=0, size=1024, entropy=7.8, level=EntropyLevel.HIGH)
    ])
    return mock_analyzer

def test_entropy_analysis_with_mock(mock_entropy_analyzer):
    """Test entropy analysis with mocked analyzer."""
    result = perform_analysis_with_analyzer(mock_entropy_analyzer)
    assert len(result.high_entropy_regions) == 1

Integration Testing

Test analysis services with real dependencies and temporary files.

import os
import tempfile
from pathlib import Path


def test_entropy_analyzer_integration():
    """Test entropy analyzer with real dependencies."""
    with tempfile.TemporaryDirectory() as temp_dir:
        # Create test file with known entropy characteristics
        test_file = Path(temp_dir) / "test.bin"
        with open(test_file, "wb") as f:
            f.write(os.urandom(8192))  # High entropy data

        # Test with real analyzer
        analyzer = create_real_analyzer()
        regions = list(analyzer.analyze_file_generator(test_file, FileType.UNKNOWN))

        assert len(regions) > 0
        assert all(region.entropy > 7.0 for region in regions)

The Analysis Components architecture provides a robust, extensible foundation for file analysis operations within SSF Tools, supporting both entropy analysis and credential detection through a unified, protocol-based design that integrates seamlessly with the broader SSF Tools ecosystem.