SSF Tools - Analysis Components Architecture

Overview

The Analysis Components architecture provides the core infrastructure for file analysis operations within SSF Tools. This system implements entropy analysis, credential detection, and content-aware threshold management through a protocol-based design with dependency injection.

The architecture supports multiple analysis types including Shannon entropy calculation, detect-secrets integration for credential scanning, and content-aware file classification. All components follow SOLID principles and integrate seamlessly with the SSF Tools container system.
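
As a concrete illustration of the first of these, Shannon entropy over a byte buffer can be computed as follows. This is a minimal stdlib sketch, not the SSF Tools implementation:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for empty or constant input, 8.0 max."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the byte values actually observed
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Repetitive content (a single repeated byte) scores 0.0, while uniformly distributed bytes approach the 8.0 ceiling typical of encrypted or compressed data.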

Architectural Principles

Design Goals

  • Protocol-First Design: All services implement well-defined protocols for testability and modularity
  • Content-Aware Analysis: File-type-specific thresholds and analysis strategies reduce false positives
  • Streaming Processing: Memory-efficient analysis of large files through chunk-based processing
  • Dependency Injection: All components receive dependencies through the container system
  • Research-Based Thresholds: Threshold values derived from peer-reviewed academic research
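
The streaming-processing goal can be sketched as a sliding window over fixed-size reads. The constants below mirror the defaults documented for the CLI (`--file-block-size 65536`, `--analysis-block-size 64`, `--step-size 16`); the function itself is illustrative, not the project's code:

```python
from pathlib import Path
from typing import Iterator

# Defaults documented for the entropy CLI options
FILE_BLOCK_SIZE = 65536      # file I/O read size
ANALYSIS_BLOCK_SIZE = 64     # entropy window size
STEP_SIZE = 16               # sliding-window step

def iter_analysis_windows(path: Path) -> Iterator[tuple[int, bytes]]:
    """Yield (offset, window) pairs without loading the whole file into memory."""
    carry = b""   # bytes spanning a read boundary
    base = 0      # absolute offset of the start of `buf`
    with path.open("rb") as fh:
        while chunk := fh.read(FILE_BLOCK_SIZE):
            buf = carry + chunk
            pos = 0
            while pos + ANALYSIS_BLOCK_SIZE <= len(buf):
                yield base + pos, buf[pos : pos + ANALYSIS_BLOCK_SIZE]
                pos += STEP_SIZE
            carry = buf[pos:]
            base += pos
    # Trailing bytes shorter than a full window are skipped in this sketch.
```

Only `FILE_BLOCK_SIZE + ANALYSIS_BLOCK_SIZE` bytes are resident at any time, which is what allows arbitrarily large inputs.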

Key Benefits

  • Modular Architecture: Easy to extend with new analysis types or detection methods
  • Memory Efficiency: Streaming analysis processes files of any size with bounded memory usage
  • High Accuracy: Content-aware thresholds significantly reduce false positive rates
  • Container Integration: Seamless integration with SSF Tools dependency injection system
  • CLI Ready: Direct integration with command-line interface through injected services

Architecture Overview

graph TD
    %% CLI Entry Points
    CLI[CLI Commands<br/>analyze entropy, analyze credentials]

    %% Core Analysis Services
    subgraph ANALYZE [📁 Analysis Services]
        EA[EntropyAnalyzer<br/>Shannon entropy calculation]
        DS[DetectSecretsService<br/>Credential detection]
        TS[ThresholdService<br/>Content-aware thresholds]
        FC[FileClassifier<br/>File type detection]
    end

    %% Model Layer
    subgraph MODELS [📁 Analysis Models]
        AM[AnalysisModels<br/>Input/Result types]
        CAT[ContentAwareThresholds<br/>File-specific thresholds]
        TYPES[AnalysisTypes<br/>Enums and constants]
    end

    %% Service Dependencies
    CLI --> ANALYZE
    ANALYZE --> MODELS

    %% Core Service Integration
    subgraph CORE [🔌 Core Services]
        FPS[FileProcessingService<br/>File I/O operations]
        ROS[RichOutputService<br/>User interface]
        TS_SVC[TimestampService<br/>Analysis timestamps]
        HTTP[HTTPClientService<br/>External API calls]
    end
    ANALYZE --> CORE

    %% External Integrations
    subgraph EXTERNAL [🔌 External Integrations]
        DETECT[detect-secrets<br/>Credential scanning]
        MAGIC[python-magic<br/>MIME detection]
        PYGMENTS[pygments<br/>Language detection]
    end
    ANALYZE --> EXTERNAL

    %% Output Systems
    subgraph OUTPUT [💾 Output Systems]
        EXCEL[Excel Export<br/>Streaming XLSX output]
        JSON[JSON Export<br/>Structured results]
        YAML[YAML Export<br/>Configuration format]
    end
    ANALYZE --> OUTPUT

    %% Styling
    classDef coreService fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef protocol fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef component fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef integration fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    classDef cli fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    classDef storage fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef group fill:#f8f9fa,stroke:#6c757d,stroke-width:3px,stroke-dasharray: 5 5

    class EA,DS,TS,FC coreService
    class AM,CAT,TYPES component
    class DETECT,MAGIC,PYGMENTS integration
    class CLI cli
    class EXCEL,JSON,YAML storage
    class ANALYZE,MODELS,CORE,EXTERNAL,OUTPUT group

Analysis Service Protocols

The analysis architecture is built around well-defined protocols that enable testing, modularity, and dependency injection. For detailed protocol specifications, see the Architecture and Design Patterns document.

Key Protocol Overview

The architecture defines the following protocols for consistent service interfaces:

  • EntropyAnalyzerProtocol: Primary protocol for Shannon entropy analysis with content-aware thresholds
  • CredentialDetectionProtocol: Protocol for credential detection using detect-secrets integration
  • ThresholdProviderProtocol: Protocol for content-aware threshold lookup by file type
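
The exact signatures live in the Architecture and Design Patterns document; as an illustrative sketch of the protocol-first approach (method names and parameters here are assumptions), an entropy-analysis protocol might be defined with `typing.Protocol` so any structurally matching class, including a test double, conforms:

```python
from pathlib import Path
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class EntropyAnalyzerProtocol(Protocol):
    """Structural interface: any class with a matching analyze_file() conforms."""

    def analyze_file(self, path: Path, *, block_size: int = 64) -> Iterable[float]:
        """Yield per-block entropy values for the given file."""
        ...


class FakeAnalyzer:
    """Test double satisfying the protocol without touching disk or the container."""

    def analyze_file(self, path: Path, *, block_size: int = 64) -> list[float]:
        return [0.0]
```

Because conformance is structural rather than nominal, services can be swapped in the container or stubbed in tests without inheritance.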

Configuration Models

AnalysisConfiguration

Main configuration model for analysis operations.

Configuration Model

The configuration model defines the structure and validation rules for analysis settings. For the complete configuration protocol and validation details, see the Architecture and Design Patterns document.

Global Configuration Integration

The analysis system integrates with SSF Tools global configuration.

# In ssf-tools-config.yaml
global:
  analysis_block_size: 8192              # Default block size for analysis
  max_concurrent_files: 10               # Maximum files to process concurrently

analyze:
  entropy:
    content_aware_enabled: true          # Enable content-aware thresholds
    streaming_enabled: true              # Enable streaming analysis
    max_regions_per_file: 1000          # Maximum entropy regions per file
  credentials:
    detect_secrets_baseline: null        # Path to detect-secrets baseline
    exclude_patterns: []                 # Patterns to exclude from scanning
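
SSF Tools validates configuration with Pydantic models (see the configuration section above); as a dependency-free sketch of the layering idea only — defaults, then file values, then CLI overrides, with later layers winning — and with illustrative key names:

```python
from typing import Any

# Defaults mirroring the global settings shown in the YAML above
DEFAULTS: dict[str, Any] = {
    "analysis_block_size": 8192,
    "max_concurrent_files": 10,
}

def merge_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge configuration layers; later layers win, None means 'not provided'."""
    merged: dict[str, Any] = {}
    for layer in layers:
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

file_cfg = {"analysis_block_size": 4096}          # from ssf-tools-config.yaml
cli_overrides = {"max_concurrent_files": None}    # flag not given on the CLI
config = merge_config(DEFAULTS, file_cfg, cli_overrides)
```

This is the same precedence the CLI applies when it passes `command_overrides` into the configuration service.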

Service Implementation

This section outlines the concrete implementations of the analysis services. For protocol definitions and interface specifications, see the Architecture and Design Patterns document.

EntropyAnalyzer

Primary implementation of entropy analysis with streaming support. See the architecture document for the complete implementation details and protocol definition.

DetectSecretsCredentialService

Implementation of credential detection using detect-secrets. See the architecture document for the complete implementation details and subprocess integration patterns.

ContentAwareThresholdManager

Service for managing content-aware entropy thresholds by file type. See the architecture document for the complete implementation details and protocol definition.
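
To illustrate what content-aware threshold management does, the sketch below maps a raw entropy value onto a risk level using the Python-source cutoffs from the ContentAwareThresholds defaults shown later in this document. The mapping logic is one plausible reading of those fields, not the manager's actual implementation, and it omits the CRITICAL level exposed by the CLI:

```python
from enum import Enum


class EntropyLevel(Enum):
    VERY_LOW = "very_low"
    LOW = "low"
    MEDIUM = "medium"
    MEDIUM_HIGH = "medium_high"
    HIGH = "high"


# Cutoffs for Python source files, taken from the model defaults
PYTHON_THRESHOLDS = {
    "very_low_threshold": 4.0,
    "low_threshold": 5.0,
    "medium_high_threshold": 6.8,
    "high_threshold": 7.2,
}

def classify(entropy: float, t: dict[str, float]) -> EntropyLevel:
    """Map a raw entropy value to a risk level using content-aware cutoffs."""
    if entropy < t["very_low_threshold"]:
        return EntropyLevel.VERY_LOW
    if entropy < t["low_threshold"]:
        return EntropyLevel.LOW
    if entropy < t["medium_high_threshold"]:
        return EntropyLevel.MEDIUM
    if entropy < t["high_threshold"]:
        return EntropyLevel.MEDIUM_HIGH
    return EntropyLevel.HIGH
```

Using per-file-type cutoffs is what keeps, say, minified JavaScript (naturally high entropy) from being flagged at the same threshold as plain text.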

Container Integration

AnalysisContainer Registration

Services are registered in the dependency injection container for proper lifecycle management.

kp_ssf_tools.containers.analysis.AnalysisContainer

Bases: DeclarativeContainer

Container for analysis services (entropy, wordlists, crypto detection).

Source code in src\kp_ssf_tools\containers\analysis.py
class AnalysisContainer(containers.DeclarativeContainer):
    """Container for analysis services (entropy, wordlists, crypto detection)."""

    # Core services (injected from main container)
    core: providers.DependenciesContainer = providers.DependenciesContainer()

    threshold_manager: providers.Singleton[ThresholdProviderProtocol] = (
        providers.Singleton(
            ContentAwareThresholdManager,
        )
    )

    # Entropy analyzer with core services injected
    entropy_analyzer: providers.Factory[EntropyAnalyzer] = providers.Factory(
        "kp_ssf_tools.analyze.services.entropy.analyzer.EntropyAnalyzer",
        rich_output=core.rich_output,
        timestamp_service=core.timestamp,
        file_validator=core.file_validator,
        mime_detector=core.mime_detector,
        file_processing=core.file_processing,
        threshold_manager=threshold_manager,
    )

    # Detect-secrets based credential service
    detect_secrets_credential_service: providers.Singleton[
        DetectSecretsCredentialService
    ] = providers.Singleton(
        DetectSecretsCredentialService,
        rich_output=core.rich_output,
        timestamp_service=core.timestamp,
        file_discovery=core.file_discoverer,
        file_processing=core.file_processing,
    )

    # Active credential detection service
    active_credential_service = detect_secrets_credential_service  # type: ignore[assignment]

    # For backward compatibility during transition
    analyzer = entropy_analyzer

CLI Integration

Analyze Command Group

CLI commands that interact with analysis services through dependency injection.

kp_ssf_tools.cli.commands.analyze

Entropy analysis CLI commands with streaming Excel export.

Classes

AnalysisSummary

Bases: NamedTuple

Summary data for analysis results.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class AnalysisSummary(NamedTuple):
    """Summary data for analysis results."""

    total_files_analyzed: int
    total_files: int
    total_high_risk_regions: int
    total_time: float

ExportContext

Bases: NamedTuple

Context for Excel export operations.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ExportContext(NamedTuple):
    """Context for Excel export operations."""

    excel_service: ExcelExportService
    timestamp_service: TimestampService
    rich_output: RichOutputService

ProcessingConfig

Bases: NamedTuple

Configuration for file processing.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ProcessingConfig(NamedTuple):
    """Configuration for file processing."""

    file_block_size: int | None
    analysis_block_size: int | None
    step_size: int | None
    include_samples: bool

ProcessingContext

Bases: NamedTuple

Context for file processing with analyzer and configs.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
class ProcessingContext(NamedTuple):
    """Context for file processing with analyzer and configs."""

    analyzer: EntropyAnalyzer
    entropy_config: AnalysisConfiguration
    global_config: GlobalConfiguration
    rich_output: RichOutputService

Functions

analyze_group()

Security analysis commands for PCI SSF 2.3 compliance.

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@click.group(name="analyze")
def analyze_group() -> None:
    """Security analysis commands for PCI SSF 2.3 compliance."""

credentials(target, *, recursive, file_extensions, context_lines, scan_binary, max_binary_size, credential_service=Provide[ApplicationContainer.analysis.active_credential_service], rich_output=Provide[ApplicationContainer.core.rich_output], excel_service=Provide[ApplicationContainer.core.excel_export_service], timestamp_service=Provide[ApplicationContainer.core.timestamp], global_config_service=Provide[ApplicationContainer.core.global_config_service], analysis_config_service=Provide[ApplicationContainer.core.entropy_config_service])

Detect credentials in files for PCI SSF 2.3 compliance.

Analyzes files for embedded credentials including usernames, passwords, API keys, and other sensitive information. Uses wordlists from SecLists and regex patterns to identify potential security issues.

Results are automatically exported to Excel with per-file worksheets and a summary sheet. Output filename: analyze-credentials-<timestamp>.xlsx

Arguments:

TARGET                  Path to file or directory to analyze

Examples:

# Basic credential detection
ssf_tools analyze credentials sample.py

# Analyze specific file types only
ssf_tools analyze credentials data/ --file-extensions .py --file-extensions .js

# Include more context around matches
ssf_tools analyze credentials config/ --context-lines 5

# Skip binary files to speed up analysis
ssf_tools analyze credentials project/ --no-scan-binary

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@analyze_group.command("credentials")
@click.argument("target", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--recursive/--no-recursive",
    default=True,
    help="Search directories recursively for files to analyze",
)
@click.option(
    "--file-extensions",
    multiple=True,
    help="File extensions to include (e.g., .py .js .txt). If not specified, all text files are analyzed",
)
@click.option(
    "--context-lines",
    type=int,
    default=3,
    help="Number of context lines to show around matches",
)
@click.option(
    "--scan-binary/--no-scan-binary",
    default=True,
    help="Whether to scan binary files for embedded credentials",
)
@click.option(
    "--max-binary-size",
    type=int,
    default=10,
    help="Maximum size in MB for binary files to scan",
)
@inject
def credentials(  # noqa: PLR0913
    target: Path,
    *,
    recursive: bool,
    file_extensions: tuple[str, ...],
    context_lines: int,
    scan_binary: bool,
    max_binary_size: int,
    credential_service: CredentialDetectionProtocol = Provide[
        ApplicationContainer.analysis.active_credential_service
    ],
    rich_output: RichOutputService = Provide[ApplicationContainer.core.rich_output],
    excel_service: ExcelExportService = Provide[
        ApplicationContainer.core.excel_export_service
    ],
    timestamp_service: TimestampService = Provide[ApplicationContainer.core.timestamp],
    global_config_service: ConfigurationService[GlobalConfiguration] = Provide[
        ApplicationContainer.core.global_config_service
    ],
    analysis_config_service: ConfigurationService[AnalysisConfiguration] = Provide[
        ApplicationContainer.core.entropy_config_service
    ],
) -> None:
    """
    Detect credentials in files for PCI SSF 2.3 compliance.

    Analyzes files for embedded credentials including usernames, passwords,
    API keys, and other sensitive information. Uses wordlists from SecLists
    and regex patterns to identify potential security issues.

    Results are automatically exported to Excel with per-file worksheets
    and a summary sheet. Output filename: analyze-credentials-<timestamp>.xlsx

    **Arguments:**
    ```
    TARGET                  Path to file or directory to analyze
    ```

    **Examples:**
    ```
    # Basic credential detection
    ssf_tools analyze credentials sample.py

    # Analyze specific file types only
    ssf_tools analyze credentials data/ --file-extensions .py --file-extensions .js

    # Include more context around matches
    ssf_tools analyze credentials config/ --context-lines 5

    # Skip binary files to speed up analysis
    ssf_tools analyze credentials project/ --no-scan-binary
    ```
    """
    # Load configurations
    global_config = global_config_service.load_config()
    analysis_config = analysis_config_service.load_config()

    # Set verbose mode if requested
    if global_config.output.verbose:
        rich_output.debug("Verbose mode enabled")

    # Convert file extensions to list
    extensions_list = list(file_extensions) if file_extensions else None

    try:
        # Convert the Pydantic configuration to the dict format expected by the service
        config_dict = {
            "credentials": {
                "enabled": analysis_config.credentials.enabled,
                "cache_duration_hours": analysis_config.credentials.cache_duration_hours,
                "auto_download": analysis_config.credentials.auto_download,
                "wordlist_sources": analysis_config.credentials.wordlist_sources,
            },
        }

        # Perform credential analysis
        scan_options = CredentialScanOptions(
            recursive=recursive,
            file_extensions=tuple(extensions_list) if extensions_list else (),
            context_lines=context_lines,
            scan_binary_files=scan_binary,
            max_binary_size_mb=max_binary_size,
        )

        result = credential_service.analyze_files(
            target_paths=[target],
            config=config_dict,
            options=scan_options,
        )

        # Export to Excel if results found
        if result and result.patterns:
            export_context = ExportContext(
                excel_service=excel_service,
                timestamp_service=timestamp_service,
                rich_output=rich_output,
            )
            _export_credentials_to_excel(result, export_context)

        # Display results in console
        credential_display_limit = 10
        if result and result.patterns:
            rich_output.warning(f"Found {len(result.patterns)} potential credentials")
            for pattern in result.patterns[:credential_display_limit]:  # Show first N
                rich_output.info(
                    f"  {pattern.pattern_type}: {pattern.value[:50]}... "
                    f"(line {pattern.line_start})",
                )
            if len(result.patterns) > credential_display_limit:
                rich_output.info(
                    f"  ... and {len(result.patterns) - credential_display_limit} more",
                )
            rich_output.info("Complete results have been exported to Excel")
        else:
            rich_output.success("No credentials detected in analyzed files")

    except Exception as e:
        rich_output.error(f"Credential analysis failed: {e}")
        if global_config.output.verbose:
            import traceback

            rich_output.error(traceback.format_exc())
        raise

entropy(target, risk_threshold, file_block_size, analysis_block_size, step_size, ignore_pattern, *, no_recurse, include_samples, analyzer=Provide[ApplicationContainer.entropy.analyzer], rich_output=Provide[ApplicationContainer.core.rich_output], file_discovery=Provide[ApplicationContainer.core.file_discoverer], global_config_service=Provide[ApplicationContainer.core.global_config_service], entropy_config_service=Provide[ApplicationContainer.core.entropy_config_service], timestamp_service=Provide[ApplicationContainer.core.timestamp])

Analyze entropy of files for PCI SSF 2.3 compliance.

Performs Shannon entropy analysis using content-aware thresholds to detect potentially suspicious patterns in files. Results are streamed directly to Excel with minimal memory usage.

Arguments:

TARGET                  Path to file or directory to analyze

Examples:

# Basic file analysis
ssf_tools analyze entropy sample.bin

# Analyze with higher risk threshold (fewer results)
ssf_tools analyze entropy sample.bin --risk-threshold high

# Analyze with custom block size
ssf_tools analyze entropy sample.bin --analysis-block-size 128

# Override file type detection
ssf_tools analyze entropy app.exe --force-file-type windows_pe

# Analyze directory non-recursively
ssf_tools analyze entropy data/ --no-recurse

Source code in src\kp_ssf_tools\cli\commands\analyze.py
@analyze_group.command("entropy")
@click.argument("target", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--ignore-pattern",
    multiple=True,
    help="Glob pattern(s) to ignore when searching for files (e.g. --ignore-pattern='__pycache__' --ignore-pattern='*.egg-info')",
)
@click.option(
    "--risk-threshold",
    type=click.Choice(
        ["very_low", "low", "medium", "medium_high", "high", "critical"],
        case_sensitive=False,
    ),
    default="medium_high",
    help="Minimum risk level for regions to include in analysis (default: medium_high)",
)
@click.option(
    "--file-block-size",
    type=int,
    help="File I/O block size in bytes (default: 65536)",
)
@click.option(
    "--analysis-block-size",
    type=int,
    help="Analysis block size in bytes (default: 64)",
)
@click.option(
    "--step-size",
    type=int,
    help="Step size for sliding window analysis (default: 16)",
)
@click.option(
    "--no-recurse",
    is_flag=True,
    help="Disable recursive directory analysis (analyze current directory only)",
)
@click.option(
    "--include-samples",
    is_flag=True,
    help="Include data samples in region analysis (increases file size)",
)
@inject
def entropy(  # noqa: PLR0913
    target: Path,
    risk_threshold: str,
    file_block_size: int | None,
    analysis_block_size: int | None,
    step_size: int | None,
    ignore_pattern: tuple[str, ...],
    *,
    no_recurse: bool,
    include_samples: bool,
    analyzer: EntropyAnalyzer = Provide[ApplicationContainer.entropy.analyzer],
    rich_output: RichOutputService = Provide[ApplicationContainer.core.rich_output],
    file_discovery: FileDiscoveryService = Provide[
        ApplicationContainer.core.file_discoverer
    ],
    global_config_service: ConfigurationService[GlobalConfiguration] = Provide[
        ApplicationContainer.core.global_config_service
    ],
    entropy_config_service: ConfigurationService[AnalysisConfiguration] = Provide[
        ApplicationContainer.core.entropy_config_service
    ],
    timestamp_service: TimestampService = Provide[ApplicationContainer.core.timestamp],
) -> None:
    """
    Analyze entropy of files for PCI SSF 2.3 compliance.

    Performs Shannon entropy analysis using content-aware thresholds to detect
    potentially suspicious patterns in files. Results are streamed directly to
    Excel with minimal memory usage.

    **Arguments:**
    ```
    TARGET                  Path to file or directory to analyze
    ```

    **Examples:**
    ```
    # Basic file analysis
    ssf_tools analyze entropy sample.bin

    # Analyze with higher risk threshold (fewer results)
    ssf_tools analyze entropy sample.bin --risk-threshold high

    # Analyze with custom block size
    ssf_tools analyze entropy sample.bin --analysis-block-size 128

    # Override file type detection
    ssf_tools analyze entropy app.exe --force-file-type windows_pe

    # Analyze directory non-recursively
    ssf_tools analyze entropy data/ --no-recurse
    ```
    """
    try:
        # Build configuration overrides and load configs
        cli_overrides = _build_cli_overrides(
            file_block_size,
            analysis_block_size,
            step_size,
        )
        global_config = global_config_service.load_config()
        entropy_config = entropy_config_service.load_config(
            command_overrides=cli_overrides,
        )

        if global_config.output.verbose:
            rich_output.debug("Verbose mode enabled")

        # Setup analysis parameters
        risk_level = EntropyLevel[risk_threshold.upper()]
        timestamp = timestamp_service.format_filename_now()
        output_path = Path(f"entropy-analysis-{timestamp}.xlsx")

        # Discover and filter files to analyze
        files_to_analyze = _discover_and_filter_files(
            target,
            file_discovery,
            no_recurse=no_recurse,
            ignore_pattern=ignore_pattern,
            rich_output=rich_output,
        )
        if not files_to_analyze:
            return

        # Check Excel limits and warn if necessary
        step_size_val = step_size or entropy_config.analysis.step_size
        if _check_excel_limits(
            files_to_analyze,
            risk_level,
            step_size_val,
            rich_output,
        ):
            return

        # Process files
        rich_output.info(
            f"Starting entropy analysis with {risk_level.value} risk threshold",
        )
        rich_output.info(f"Output will be saved to: {output_path}")

        processing_config = ProcessingConfig(
            file_block_size=file_block_size,
            analysis_block_size=analysis_block_size,
            step_size=step_size,
            include_samples=include_samples,
        )

        context = ProcessingContext(
            analyzer=analyzer,
            entropy_config=entropy_config,
            global_config=global_config,
            rich_output=rich_output,
        )

        total_files_analyzed, total_high_risk_regions, total_time = _process_files(
            files_to_analyze,
            output_path,
            risk_level,
            context,
            processing_config,
        )

        # Report final summary
        summary = AnalysisSummary(
            total_files_analyzed=total_files_analyzed,
            total_files=len(files_to_analyze),
            total_high_risk_regions=total_high_risk_regions,
            total_time=total_time,
        )
        _report_summary(output_path, summary, rich_output)

    except Exception as e:
        rich_output.error(f"Analysis failed: {e}")
        if "global_config" in locals() and global_config.output.verbose:
            import traceback

            rich_output.error(traceback.format_exc())
        raise

Service Integration Patterns

File Processing Integration

Analysis services integrate with the core file processing system for efficient I/O operations through direct CLI orchestration.

The current implementation uses a streaming-based approach where CLI commands directly coordinate analysis workflows. See the Architecture and Design Patterns document for the complete implementation details of the file processing workflow.

This pattern provides:

  • Memory Efficiency: Streaming analysis with direct Excel export
  • Error Isolation: Per-file error handling without stopping the entire analysis
  • Progress Reporting: Real-time feedback on analysis progress
  • Configuration Flexibility: CLI overrides for analysis parameters
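
The error-isolation behavior can be sketched as a per-file try/except loop; the helper below is illustrative (function and variable names are hypothetical), not the project's orchestration code:

```python
from pathlib import Path
from typing import Callable


def analyze_all(
    paths: list[Path],
    analyze_one: Callable[[Path], None],
) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Run analyze_one() per file; collect failures instead of aborting the batch."""
    succeeded: list[Path] = []
    failed: list[tuple[Path, str]] = []
    for path in paths:
        try:
            analyze_one(path)
        except Exception as exc:  # isolate errors to the offending file
            failed.append((path, str(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed
```

A single unreadable or malformed file thus produces one failure record while the rest of the batch completes and is exported normally.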

Core Models

Analysis Models Overview

The analysis system uses well-defined Pydantic models for inputs and results. For detailed model specifications, see the Architecture and Design Patterns document.

Key model categories include:

  • Analysis Results: Structured outputs for entropy and credential detection operations
  • Pattern Models: Detection result containers for identified patterns and regions
  • Configuration Models: Input validation and type safety for analysis operations

Content-Aware Threshold Models

File-type-specific threshold definitions for accurate entropy classification.

kp_ssf_tools.analyze.models.content_aware

Content-Aware Thresholds module.

Classes

ContentAwareThresholds

Bases: SSFToolsBaseModel

File type-specific entropy thresholds loaded from configuration.

Source code in src\kp_ssf_tools\analyze\models\content_aware.py
class ContentAwareThresholds(SSFToolsBaseModel):
    """File type-specific entropy thresholds loaded from configuration."""

    file_type: FileType
    expected_entropy: tuple[float, float]  # (mean, std_dev) for normal content
    very_low_threshold: float  # Below this = VERY_LOW
    low_threshold: float  # Below this = LOW
    medium_threshold: float  # Normal range center
    medium_high_threshold: float  # Above this = MEDIUM_HIGH
    high_threshold: float  # Above this = HIGH

    @classmethod
    def get_default_values(cls) -> dict[FileType, dict[str, object]]:
        """
        Default threshold values for configuration file generation.

        These values are derived from extensive academic research documented
        in docs/file-entropy-research.md, including:
        - Lyda & Hamrock (2007) IEEE foundational paper
        - Davies et al. (2022) NapierOne dataset (500,000+ files)
        - Practical Security Analytics (500,000 PE file analysis)
        - Multiple peer-reviewed studies with statistical validation

        Returns a dict suitable for YAML configuration file generation.
        """
        return {
            # Top 20 Programming Languages (2025 Rankings)
            FileType.PYTHON: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Highly repetitive code
                "low_threshold": 5.0,  # Simple scripts, lots of comments
                "medium_threshold": 6.0,  # Typical Python code
                "medium_high_threshold": 6.8,  # Complex logic, minified
                "high_threshold": 7.2,  # Obfuscated/packed code | > Likely suspicious
            },
            FileType.JAVASCRIPT: {
                "expected_entropy": [
                    5.4,
                    0.8,
                ],  # Mean=5.4, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple scripts
                "low_threshold": 4.9,  # Basic JS with comments
                "medium_threshold": 5.9,  # Typical JavaScript
                "medium_high_threshold": 6.8,  # Minified/complex
                "high_threshold": 7.2,  # Obfuscated code | > Likely suspicious
            },
            FileType.JAVA: {
                "expected_entropy": [
                    5.6,
                    0.7,
                ],  # Mean=5.6, StdDev=0.7
                "very_low_threshold": 4.0,  # Verbose Java patterns
                "low_threshold": 5.0,  # Simple classes
                "medium_threshold": 6.0,  # Typical Java code
                "medium_high_threshold": 6.8,  # Complex enterprise code
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.CPP: {
                "expected_entropy": [
                    5.8,
                    0.9,
                ],  # Mean=5.8, StdDev=0.9
                "very_low_threshold": 4.0,  # Header files
                "low_threshold": 5.0,  # Simple implementations
                "medium_threshold": 6.2,  # Typical C++ code
                "medium_high_threshold": 7.0,  # Template-heavy code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.C: {
                "expected_entropy": [
                    5.7,
                    0.9,
                ],  # Mean=5.7, StdDev=0.9
                "very_low_threshold": 4.0,  # Header files
                "low_threshold": 5.0,  # Simple C code
                "medium_threshold": 6.1,  # Typical C programs
                "medium_high_threshold": 6.9,  # Complex system code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.CSHARP: {
                "expected_entropy": [
                    5.6,
                    0.8,
                ],  # Mean=5.6, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple classes
                "low_threshold": 5.0,  # Basic C# code
                "medium_threshold": 6.0,  # Typical C# applications
                "medium_high_threshold": 6.8,  # Complex .NET code
                "high_threshold": 7.2,  # IL bytecode/obfuscated | > Likely suspicious
            },
            FileType.TYPESCRIPT: {
                "expected_entropy": [
                    5.4,
                    0.8,
                ],  # Mean=5.4, StdDev=0.8
                "very_low_threshold": 4.0,  # Type definitions
                "low_threshold": 4.9,  # Simple TypeScript
                "medium_threshold": 5.9,  # Typical TS code
                "medium_high_threshold": 6.8,  # Complex/transpiled
                "high_threshold": 7.2,  # Obfuscated output | > Likely suspicious
            },
            FileType.PHP: {
                "expected_entropy": [
                    5.3,
                    0.8,
                ],  # Mean=5.3, StdDev=0.8
                "very_low_threshold": 4.0,  # HTML mixed PHP
                "low_threshold": 4.8,  # Simple PHP scripts
                "medium_threshold": 5.8,  # Typical PHP code
                "medium_high_threshold": 6.7,  # Complex frameworks
                "high_threshold": 7.1,  # Obfuscated PHP | > Likely suspicious
            },
            FileType.GO: {
                "expected_entropy": [
                    5.5,
                    0.7,
                ],  # Mean=5.5, StdDev=0.7
                "very_low_threshold": 4.0,  # Simple Go code
                "low_threshold": 5.0,  # Basic programs
                "medium_threshold": 6.0,  # Typical Go code
                "medium_high_threshold": 6.8,  # Complex concurrent code
                "high_threshold": 7.2,  # Compiled binary data | > Likely suspicious
            },
            FileType.SQL: {
                "expected_entropy": [
                    5.2,
                    0.9,
                ],  # Mean=5.2, StdDev=0.9
                "very_low_threshold": 3.8,  # Simple queries
                "low_threshold": 4.7,  # Basic SQL statements
                "medium_threshold": 5.7,  # Complex queries
                "medium_high_threshold": 6.6,  # Stored procedures
                "high_threshold": 7.0,  # Obfuscated SQL | > Likely suspicious
            },
            FileType.RUST: {
                "expected_entropy": [
                    5.7,
                    0.8,
                ],  # Mean=5.7, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Rust code
                "low_threshold": 5.0,  # Basic implementations
                "medium_threshold": 6.1,  # Typical Rust code
                "medium_high_threshold": 6.9,  # Complex unsafe code
                "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.SWIFT: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Swift code
                "low_threshold": 5.0,  # Basic iOS code
                "medium_threshold": 6.0,  # Typical Swift apps
                "medium_high_threshold": 6.8,  # Complex frameworks
                "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
            },
            FileType.KOTLIN: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Kotlin code
                "low_threshold": 5.0,  # Basic Android code
                "medium_threshold": 6.0,  # Typical Kotlin apps
                "medium_high_threshold": 6.8,  # Complex coroutines
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.RUBY: {
                "expected_entropy": [
                    5.3,
                    0.8,
                ],  # Mean=5.3, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Ruby scripts
                "low_threshold": 4.8,  # Basic Rails code
                "medium_threshold": 5.8,  # Typical Ruby code
                "medium_high_threshold": 6.7,  # Complex metaprogramming
                "high_threshold": 7.1,  # Obfuscated Ruby | > Likely suspicious
            },
            FileType.R: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple R scripts
                "low_threshold": 4.8,  # Basic statistics
                "medium_threshold": 5.9,  # Typical R analysis
                "medium_high_threshold": 6.8,  # Complex models
                "high_threshold": 7.1,  # Compiled R code | > Likely suspicious
            },
            FileType.VISUAL_BASIC: {
                "expected_entropy": [
                    5.2,
                    0.8,
                ],  # Mean=5.2, StdDev=0.8
                "very_low_threshold": 3.9,  # Simple VB code
                "low_threshold": 4.7,  # Basic VB.NET
                "medium_threshold": 5.7,  # Typical VB apps
                "medium_high_threshold": 6.6,  # Complex forms
                "high_threshold": 7.0,  # Obfuscated VB | > Likely suspicious
            },
            FileType.SCALA: {
                "expected_entropy": [
                    5.6,
                    0.8,
                ],  # Mean=5.6, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Scala code
                "low_threshold": 5.0,  # Basic functional code
                "medium_threshold": 6.0,  # Typical Scala apps
                "medium_high_threshold": 6.8,  # Complex Spark code
                "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
            },
            FileType.MATLAB: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple scripts
                "low_threshold": 4.8,  # Basic computations
                "medium_threshold": 5.9,  # Typical MATLAB code
                "medium_high_threshold": 6.8,  # Complex algorithms
                "high_threshold": 7.1,  # Compiled MEX files | > Likely suspicious
            },
            FileType.PERL: {
                "expected_entropy": [
                    5.4,
                    0.9,
                ],  # Mean=5.4, StdDev=0.9
                "very_low_threshold": 3.9,  # Simple Perl scripts
                "low_threshold": 4.8,  # Basic regex code
                "medium_threshold": 5.9,  # Typical Perl code
                "medium_high_threshold": 6.8,  # Complex one-liners
                "high_threshold": 7.1,  # Obfuscated Perl | > Likely suspicious
            },
            FileType.DART: {
                "expected_entropy": [
                    5.5,
                    0.8,
                ],  # Mean=5.5, StdDev=0.8
                "very_low_threshold": 4.0,  # Simple Dart code
                "low_threshold": 5.0,  # Basic Flutter widgets
                "medium_threshold": 6.0,  # Typical Dart apps
                "medium_high_threshold": 6.8,  # Complex async code
                "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
            },
            # Documentation Files
            FileType.DOCUMENTATION: {
                "expected_entropy": [
                    4.8,
                    0.65,
                ],  # Mean=4.8, StdDev=0.65 (combined plain/markdown)
                "very_low_threshold": 3.55,  # Highly repetitive text
                "low_threshold": 4.25,  # Simple documentation
                "medium_threshold": 5.1,  # Typical documentation
                "medium_high_threshold": 5.65,  # Technical docs with code
                "high_threshold": 6.15,  # Mixed content | Anomalous for docs
            },
            # Binary Executables
            FileType.WINDOWS_PE: {
                "expected_entropy": [
                    6.0,
                    1.2,
                ],  # Mean=6.0, StdDev=1.2
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.2,  # Code sections
                "medium_threshold": 6.5,  # Typical PE files
                "medium_high_threshold": 7.0,  # Complex binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            FileType.MACOS_MACHO: {
                "expected_entropy": [
                    5.9,
                    1.2,
                ],  # Mean=5.9, StdDev=1.2
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.2,  # Code sections
                "medium_threshold": 6.4,  # Typical MachO files
                "medium_high_threshold": 6.9,  # Universal binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            FileType.LINUX_ELF: {
                "expected_entropy": [
                    5.8,
                    1.1,
                ],  # Mean=5.8, StdDev=1.1
                "very_low_threshold": 4.5,  # Text sections
                "low_threshold": 5.1,  # Code sections
                "medium_threshold": 6.3,  # Typical ELF files
                "medium_high_threshold": 6.8,  # Complex binaries
                "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
            },
            # Encrypted/Suspicious Content
            FileType.ENCRYPTED: {
                "expected_entropy": [
                    7.99,
                    0.01,
                ],  # Mean=7.99, StdDev=0.01 (AES validated)
                "very_low_threshold": 7.8,  # Weak/broken encryption
                "low_threshold": 7.85,  # Poor encryption
                "medium_threshold": 7.9,  # Possible encryption
                "medium_high_threshold": 7.95,  # Likely encrypted
                "high_threshold": 7.98,  # Strong encryption | > Max entropy
            },
            FileType.BASE64_ENCODED: {
                "expected_entropy": [
                    6.0,
                    0.3,
                ],  # Mean=6.0, StdDev=0.3
                "very_low_threshold": 5.2,  # Partial encoding
                "low_threshold": 5.5,  # Simple base64
                "medium_threshold": 6.0,  # Typical base64
                "medium_high_threshold": 6.3,  # Complex encoded data
                "high_threshold": 6.5,  # Encrypted then encoded | > Suspicious encoding
            },
            FileType.HEX_ENCODED: {
                "expected_entropy": [
                    4.0,
                    0.2,
                ],  # Mean=4.0, StdDev=0.2
                "very_low_threshold": 3.5,  # Partial hex
                "low_threshold": 3.7,  # Simple hex strings
                "medium_threshold": 4.0,  # Typical hex encoding
                "medium_high_threshold": 4.2,  # Complex hex data
                "high_threshold": 4.4,  # Anomalous hex | > Suspicious pattern
            },
            # Unknown file types - Conservative thresholds
            FileType.UNKNOWN: {
                "expected_entropy": [
                    5.5,
                    1.5,
                ],  # Mean=5.5, StdDev=1.5 (conservative mixed content)
                "very_low_threshold": 3.0,  # Likely text/structured
                "low_threshold": 4.5,  # Probable code/data
                "medium_threshold": 6.0,  # Typical binary content
                "medium_high_threshold": 7.0,  # Complex binary/media
                "high_threshold": 7.2,  # Boundary suspicious | > Conservative threshold
            },
        }

    @classmethod
    def get_default_models(cls) -> dict[FileType, ContentAwareThresholds]:
        """
        Get pre-built Pydantic model instances for all file types.

        Returns validated ContentAwareThresholds models instead of raw dicts.
        Use this method for runtime threshold management to avoid dict-to-model conversion.

        """
        models = {}
        for file_type, data in cls.get_default_values().items():
            # Cast values from object to proper types
            expected_entropy = data["expected_entropy"]
            if isinstance(expected_entropy, list | tuple):
                float_values = [float(x) for x in expected_entropy]
                # Ensure we have exactly 2 values for tuple[float, float]
                expected_tuple_length = 2
                if len(float_values) >= expected_tuple_length:
                    entropy_tuple = (float_values[0], float_values[1])
                else:
                    entropy_tuple = (0.0, 8.0)  # fallback
            else:
                entropy_tuple = (0.0, 8.0)  # fallback

            models[file_type] = cls(
                file_type=file_type,
                expected_entropy=entropy_tuple,
                very_low_threshold=cast("float", data["very_low_threshold"]),
                low_threshold=cast("float", data["low_threshold"]),
                medium_threshold=cast("float", data["medium_threshold"]),
                medium_high_threshold=cast("float", data["medium_high_threshold"]),
                high_threshold=cast("float", data["high_threshold"]),
            )
        return models

    @classmethod
    def for_file_type(cls, file_type: FileType) -> ContentAwareThresholds:
        """
        Factory method to create a threshold model for a specific file type.

        Args:
            file_type: The file type to get thresholds for

        Returns:
            ContentAwareThresholds model instance with validated data

        Note:
            Unsupported file types fall back to conservative default thresholds.

        """
        defaults = cls.get_default_values()
        if file_type not in defaults:
            # Return sensible defaults for unknown file types
            return cls(
                file_type=file_type,
                expected_entropy=(5.5, 1.0),
                very_low_threshold=4.0,
                low_threshold=5.0,
                medium_threshold=6.0,
                medium_high_threshold=6.8,
                high_threshold=7.2,
            )

        data = defaults[file_type]
        # Cast values from object to proper types
        expected_entropy = data["expected_entropy"]
        if isinstance(expected_entropy, list | tuple):
            float_values = [float(x) for x in expected_entropy]
            # Ensure we have exactly 2 values for tuple[float, float]
            expected_tuple_length = 2
            if len(float_values) >= expected_tuple_length:
                entropy_tuple = (float_values[0], float_values[1])
            else:
                entropy_tuple = (0.0, 8.0)  # fallback
        else:
            entropy_tuple = (0.0, 8.0)  # fallback

        return cls(
            file_type=file_type,
            expected_entropy=entropy_tuple,
            very_low_threshold=cast("float", data["very_low_threshold"]),
            low_threshold=cast("float", data["low_threshold"]),
            medium_threshold=cast("float", data["medium_threshold"]),
            medium_high_threshold=cast("float", data["medium_high_threshold"]),
            high_threshold=cast("float", data["high_threshold"]),
        )
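These per-type boundaries are applied by measuring a file's Shannon entropy and finding the band it falls in. A self-contained sketch of that comparison (`shannon_entropy`, `classify`, and `PYTHON_BANDS` are illustrative names, not the library's API; the Python-source values are copied from the defaults table above):

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0-8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


# Band upper bounds for Python source, copied from the defaults table above.
PYTHON_BANDS = [
    ("very_low", 4.0),
    ("low", 5.0),
    ("medium", 6.0),
    ("medium_high", 6.8),
    ("high", 7.2),
]


def classify(entropy: float, bands: list[tuple[str, float]]) -> str:
    """Return the first band whose upper bound the measured entropy stays below."""
    for name, upper in bands:
        if entropy < upper:
            return name
    return "suspicious"  # above high_threshold: likely obfuscated/packed


print(classify(shannon_entropy(b"def add(a, b):\n    return a + b\n"), PYTHON_BANDS))  # very_low
print(classify(shannon_entropy(os.urandom(4096)), PYTHON_BANDS))  # suspicious
```

A short repetitive snippet lands well below the very-low boundary, while random bytes exceed the 7.2 high threshold, matching the "likely suspicious" annotations in the table.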
Functions
for_file_type(file_type) classmethod

Factory method to create a threshold model for a specific file type.

Parameters:

    file_type (FileType, required): The file type to get thresholds for

Returns:

    ContentAwareThresholds: A model instance with validated threshold data

Note:

    Unsupported file types fall back to conservative default thresholds; no KeyError is raised.

get_default_models() classmethod

Get pre-built Pydantic model instances for all file types.

Returns validated ContentAwareThresholds models instead of raw dicts. Use this method for runtime threshold management to avoid dict-to-model conversion.

get_default_values() classmethod

Default threshold values for configuration file generation.

These values are derived from extensive academic research documented in docs/file-entropy-research.md, including:

- Lyda & Hamrock (2007) IEEE foundational paper
- Davies et al. (2022) NapierOne dataset (500,000+ files)
- Practical Security Analytics (500,000 PE file analysis)
- Multiple peer-reviewed studies with statistical validation

Returns a dict suitable for YAML configuration file generation.
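`get_default_values` returns plain dicts precisely so they can be serialized into a configuration file. A minimal sketch of YAML-style emission using only the standard library (the `to_yaml` helper and flat key layout are illustrative; the real exporter may differ):

```python
def to_yaml(defaults: dict[str, dict[str, object]]) -> str:
    """Emit a minimal YAML document from the nested defaults mapping."""
    lines: list[str] = []
    for file_type, fields in defaults.items():
        lines.append(f"{file_type}:")
        for key, value in fields.items():
            if isinstance(value, list):  # expected_entropy is [mean, std dev]
                rendered = "[" + ", ".join(str(v) for v in value) + "]"
            else:
                rendered = str(value)
            lines.append(f"  {key}: {rendered}")
    return "\n".join(lines) + "\n"


# One entry copied from the defaults table above.
sample = {
    "python": {
        "expected_entropy": [5.5, 0.8],
        "very_low_threshold": 4.0,
        "high_threshold": 7.2,
    },
}
print(to_yaml(sample))
```

In practice a YAML library would handle quoting and nesting, but the flat mapping-of-mappings shape shown here is all the defaults dict requires.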

Source code in src\kp_ssf_tools\analyze\models\content_aware.py
@classmethod
def get_default_values(cls) -> dict[FileType, dict[str, object]]:
    """
    Default threshold values for configuration file generation.

    These values are derived from extensive academic research documented
    in docs/file-entropy-research.md, including:
    - Lyda & Hamrock (2007) IEEE foundational paper
    - Davies et al. (2022) NapierOne dataset (500,000+ files)
    - Practical Security Analytics (500,000 PE file analysis)
    - Multiple peer-reviewed studies with statistical validation

    Returns a dict suitable for YAML configuration file generation.
    """
    return {
        # Top 20 Programming Languages (2025 Rankings)
        FileType.PYTHON: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Highly repetitive code
            "low_threshold": 5.0,  # Simple scripts, lots of comments
            "medium_threshold": 6.0,  # Typical Python code
            "medium_high_threshold": 6.8,  # Complex logic, minified
            "high_threshold": 7.2,  # Obfuscated/packed code | > Likely suspicious
        },
        FileType.JAVASCRIPT: {
            "expected_entropy": [
                5.4,
                0.8,
            ],  # Mean=5.4, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple scripts
            "low_threshold": 4.9,  # Basic JS with comments
            "medium_threshold": 5.9,  # Typical JavaScript
            "medium_high_threshold": 6.8,  # Minified/complex
            "high_threshold": 7.2,  # Obfuscated code | > Likely suspicious
        },
        FileType.JAVA: {
            "expected_entropy": [
                5.6,
                0.7,
            ],  # Mean=5.6, StdDev=0.7
            "very_low_threshold": 4.0,  # Verbose Java patterns
            "low_threshold": 5.0,  # Simple classes
            "medium_threshold": 6.0,  # Typical Java code
            "medium_high_threshold": 6.8,  # Complex enterprise code
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.CPP: {
            "expected_entropy": [
                5.8,
                0.9,
            ],  # Mean=5.8, StdDev=0.9
            "very_low_threshold": 4.0,  # Header files
            "low_threshold": 5.0,  # Simple implementations
            "medium_threshold": 6.2,  # Typical C++ code
            "medium_high_threshold": 7.0,  # Template-heavy code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.C: {
            "expected_entropy": [
                5.7,
                0.9,
            ],  # Mean=5.7, StdDev=0.9
            "very_low_threshold": 4.0,  # Header files
            "low_threshold": 5.0,  # Simple C code
            "medium_threshold": 6.1,  # Typical C programs
            "medium_high_threshold": 6.9,  # Complex system code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.CSHARP: {
            "expected_entropy": [
                5.6,
                0.8,
            ],  # Mean=5.6, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple classes
            "low_threshold": 5.0,  # Basic C# code
            "medium_threshold": 6.0,  # Typical C# applications
            "medium_high_threshold": 6.8,  # Complex .NET code
            "high_threshold": 7.2,  # IL bytecode/obfuscated | > Likely suspicious
        },
        FileType.TYPESCRIPT: {
            "expected_entropy": [
                5.4,
                0.8,
            ],  # Mean=5.4, StdDev=0.8
            "very_low_threshold": 4.0,  # Type definitions
            "low_threshold": 4.9,  # Simple TypeScript
            "medium_threshold": 5.9,  # Typical TS code
            "medium_high_threshold": 6.8,  # Complex/transpiled
            "high_threshold": 7.2,  # Obfuscated output | > Likely suspicious
        },
        FileType.PHP: {
            "expected_entropy": [
                5.3,
                0.8,
            ],  # Mean=5.3, StdDev=0.8
            "very_low_threshold": 4.0,  # HTML mixed PHP
            "low_threshold": 4.8,  # Simple PHP scripts
            "medium_threshold": 5.8,  # Typical PHP code
            "medium_high_threshold": 6.7,  # Complex frameworks
            "high_threshold": 7.1,  # Obfuscated PHP | > Likely suspicious
        },
        FileType.GO: {
            "expected_entropy": [
                5.5,
                0.7,
            ],  # Mean=5.5, StdDev=0.7
            "very_low_threshold": 4.0,  # Simple Go code
            "low_threshold": 5.0,  # Basic programs
            "medium_threshold": 6.0,  # Typical Go code
            "medium_high_threshold": 6.8,  # Complex concurrent code
            "high_threshold": 7.2,  # Compiled binary data | > Likely suspicious
        },
        FileType.SQL: {
            "expected_entropy": [
                5.2,
                0.9,
            ],  # Mean=5.2, StdDev=0.9
            "very_low_threshold": 3.8,  # Simple queries
            "low_threshold": 4.7,  # Basic SQL statements
            "medium_threshold": 5.7,  # Complex queries
            "medium_high_threshold": 6.6,  # Stored procedures
            "high_threshold": 7.0,  # Obfuscated SQL | > Likely suspicious
        },
        FileType.RUST: {
            "expected_entropy": [
                5.7,
                0.8,
            ],  # Mean=5.7, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Rust code
            "low_threshold": 5.0,  # Basic implementations
            "medium_threshold": 6.1,  # Typical Rust code
            "medium_high_threshold": 6.9,  # Complex unsafe code
            "high_threshold": 7.3,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.SWIFT: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Swift code
            "low_threshold": 5.0,  # Basic iOS code
            "medium_threshold": 6.0,  # Typical Swift apps
            "medium_high_threshold": 6.8,  # Complex frameworks
            "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
        },
        FileType.KOTLIN: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Kotlin code
            "low_threshold": 5.0,  # Basic Android code
            "medium_threshold": 6.0,  # Typical Kotlin apps
            "medium_high_threshold": 6.8,  # Complex coroutines
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.RUBY: {
            "expected_entropy": [
                5.3,
                0.8,
            ],  # Mean=5.3, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Ruby scripts
            "low_threshold": 4.8,  # Basic Rails code
            "medium_threshold": 5.8,  # Typical Ruby code
            "medium_high_threshold": 6.7,  # Complex metaprogramming
            "high_threshold": 7.1,  # Obfuscated Ruby | > Likely suspicious
        },
        FileType.R: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple R scripts
            "low_threshold": 4.8,  # Basic statistics
            "medium_threshold": 5.9,  # Typical R analysis
            "medium_high_threshold": 6.8,  # Complex models
            "high_threshold": 7.1,  # Compiled R code | > Likely suspicious
        },
        FileType.VISUAL_BASIC: {
            "expected_entropy": [
                5.2,
                0.8,
            ],  # Mean=5.2, StdDev=0.8
            "very_low_threshold": 3.9,  # Simple VB code
            "low_threshold": 4.7,  # Basic VB.NET
            "medium_threshold": 5.7,  # Typical VB apps
            "medium_high_threshold": 6.6,  # Complex forms
            "high_threshold": 7.0,  # Obfuscated VB | > Likely suspicious
        },
        FileType.SCALA: {
            "expected_entropy": [
                5.6,
                0.8,
            ],  # Mean=5.6, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Scala code
            "low_threshold": 5.0,  # Basic functional code
            "medium_threshold": 6.0,  # Typical Scala apps
            "medium_high_threshold": 6.8,  # Complex Spark code
            "high_threshold": 7.2,  # Bytecode/obfuscated | > Likely suspicious
        },
        FileType.MATLAB: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple scripts
            "low_threshold": 4.8,  # Basic computations
            "medium_threshold": 5.9,  # Typical MATLAB code
            "medium_high_threshold": 6.8,  # Complex algorithms
            "high_threshold": 7.1,  # Compiled MEX files | > Likely suspicious
        },
        FileType.PERL: {
            "expected_entropy": [
                5.4,
                0.9,
            ],  # Mean=5.4, StdDev=0.9
            "very_low_threshold": 3.9,  # Simple Perl scripts
            "low_threshold": 4.8,  # Basic regex code
            "medium_threshold": 5.9,  # Typical Perl code
            "medium_high_threshold": 6.8,  # Complex one-liners
            "high_threshold": 7.1,  # Obfuscated Perl | > Likely suspicious
        },
        FileType.DART: {
            "expected_entropy": [
                5.5,
                0.8,
            ],  # Mean=5.5, StdDev=0.8
            "very_low_threshold": 4.0,  # Simple Dart code
            "low_threshold": 5.0,  # Basic Flutter widgets
            "medium_threshold": 6.0,  # Typical Dart apps
            "medium_high_threshold": 6.8,  # Complex async code
            "high_threshold": 7.2,  # Compiled/obfuscated | > Likely suspicious
        },
        # Documentation Files
        FileType.DOCUMENTATION: {
            "expected_entropy": [
                4.8,
                0.65,
            ],  # Mean=4.8, StdDev=0.65 (combined plain/markdown)
            "very_low_threshold": 3.55,  # Highly repetitive text
            "low_threshold": 4.25,  # Simple documentation
            "medium_threshold": 5.1,  # Typical documentation
            "medium_high_threshold": 5.65,  # Technical docs with code
            "high_threshold": 6.15,  # Mixed content | Anomalous for docs
        },
        # Binary Executables
        FileType.WINDOWS_PE: {
            "expected_entropy": [
                6.0,
                1.2,
            ],  # Mean=6.0, StdDev=1.2
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.2,  # Code sections
            "medium_threshold": 6.5,  # Typical PE files
            "medium_high_threshold": 7.0,  # Complex binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        FileType.MACOS_MACHO: {
            "expected_entropy": [
                5.9,
                1.2,
            ],  # Mean=5.9, StdDev=1.2
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.2,  # Code sections
            "medium_threshold": 6.4,  # Typical MachO files
            "medium_high_threshold": 6.9,  # Universal binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        FileType.LINUX_ELF: {
            "expected_entropy": [
                5.8,
                1.1,
            ],  # Mean=5.8, StdDev=1.1
            "very_low_threshold": 4.5,  # Text sections
            "low_threshold": 5.1,  # Code sections
            "medium_threshold": 6.3,  # Typical ELF files
            "medium_high_threshold": 6.8,  # Complex binaries
            "high_threshold": 7.2,  # Packed/compressed | > Likely suspicious
        },
        # Encrypted/Suspicious Content
        FileType.ENCRYPTED: {
            "expected_entropy": [
                7.99,
                0.01,
            ],  # Mean=7.99, StdDev=0.01 (AES validated)
            "very_low_threshold": 7.8,  # Weak/broken encryption
            "low_threshold": 7.85,  # Poor encryption
            "medium_threshold": 7.9,  # Possible encryption
            "medium_high_threshold": 7.95,  # Likely encrypted
            "high_threshold": 7.98,  # Strong encryption | > Max entropy
        },
        FileType.BASE64_ENCODED: {
            "expected_entropy": [
                6.0,
                0.3,
            ],  # Mean=6.0, StdDev=0.3
            "very_low_threshold": 5.2,  # Partial encoding
            "low_threshold": 5.5,  # Simple base64
            "medium_threshold": 6.0,  # Typical base64
            "medium_high_threshold": 6.3,  # Complex encoded data
            "high_threshold": 6.5,  # Encrypted then encoded | > Suspicious encoding
        },
        FileType.HEX_ENCODED: {
            "expected_entropy": [
                4.0,
                0.2,
            ],  # Mean=4.0, StdDev=0.2
            "very_low_threshold": 3.5,  # Partial hex
            "low_threshold": 3.7,  # Simple hex strings
            "medium_threshold": 4.0,  # Typical hex encoding
            "medium_high_threshold": 4.2,  # Complex hex data
            "high_threshold": 4.4,  # Anomalous hex | > Suspicious pattern
        },
        # Unknown file types - Conservative thresholds
        FileType.UNKNOWN: {
            "expected_entropy": [
                5.5,
                1.5,
            ],  # Mean=5.5, StdDev=1.5 (conservative mixed content)
            "very_low_threshold": 3.0,  # Likely text/structured
            "low_threshold": 4.5,  # Probable code/data
            "medium_threshold": 6.0,  # Typical binary content
            "medium_high_threshold": 7.0,  # Complex binary/media
            "high_threshold": 7.2,  # Boundary suspicious | > Conservative threshold
        },
    }
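The threshold bands above can be applied with a simple lookup. A minimal sketch (the `shannon_entropy` and `classify` helpers and the exact band-ordering logic are illustrative, not the ThresholdService API; the numbers are the FileType.UNKNOWN values from the table above):

```python
import math
from collections import Counter

# Conservative thresholds for FileType.UNKNOWN, copied from the table above
UNKNOWN_THRESHOLDS = {
    "very_low_threshold": 3.0,
    "low_threshold": 4.5,
    "medium_threshold": 6.0,
    "medium_high_threshold": 7.0,
    "high_threshold": 7.2,
}


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify(entropy: float, thresholds: dict[str, float]) -> str:
    """Map an entropy value onto the threshold bands (illustrative ordering)."""
    if entropy < thresholds["very_low_threshold"]:
        return "very_low"
    if entropy < thresholds["low_threshold"]:
        return "low"
    if entropy < thresholds["medium_threshold"]:
        return "medium"
    if entropy < thresholds["medium_high_threshold"]:
        return "medium_high"
    if entropy < thresholds["high_threshold"]:
        return "high"
    return "critical"


# All-identical bytes have zero entropy; uniformly distributed bytes reach 8.0
assert classify(shannon_entropy(b"a" * 1024), UNKNOWN_THRESHOLDS) == "very_low"
assert classify(shannon_entropy(bytes(range(256)) * 16), UNKNOWN_THRESHOLDS) == "critical"
```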

Analysis Type Definitions

Enumerations and type definitions used throughout the analysis system.

kp_ssf_tools.analyze.models.types

Types for entropy analysis models.

Classes

ComplianceStatus

Bases: StrEnum

PCI SSF compliance status levels.

Source code in src\kp_ssf_tools\analyze\models\types.py
class ComplianceStatus(StrEnum):
    """PCI SSF compliance status levels."""

    COMPLIANT = "compliant"
    NON_COMPLIANT = "non_compliant"
    REQUIRES_REVIEW = "requires_review"
    INSUFFICIENT_DATA = "insufficient_data"

CredentialRiskLevel

Bases: StrEnum

Risk levels for detected credentials.

Source code in src\kp_ssf_tools\analyze\models\types.py
class CredentialRiskLevel(StrEnum):
    """Risk levels for detected credentials."""

    CRITICAL = "critical"  # Confirmed high-value credentials
    HIGH = "high"  # Likely credentials with high impact
    MEDIUM = "medium"  # Potential credentials requiring review
    LOW = "low"  # Suspicious patterns, low confidence
    INFO = "info"  # Informational findings

CryptoStructureType

Bases: StrEnum

Types of cryptographic structures and credential patterns.

Source code in src\kp_ssf_tools\analyze\models\types.py
class CryptoStructureType(StrEnum):
    """Types of cryptographic structures and credential patterns."""

    # Cryptographic structures
    AES_SBOX = "aes_sbox"
    DES_SBOX = "des_sbox"
    ROUND_CONSTANTS = "round_constants"
    GALOIS_FIELD = "galois_field"
    BASE64_DATA = "base64_data"
    HEX_ENCODED = "hex_encoded"
    PEM_STRUCTURE = "pem_structure"
    CRYPTO_FUNCTION = "crypto_function"
    HARDCODED_KEY = "hardcoded_key"

    # Credential patterns (PCI SSF 2.3.b requirement)
    COMMON_USERNAME = "common_username"
    COMMON_PASSWORD = "common_password"  # noqa: S105
    DEFAULT_CREDENTIAL = "default_credential"
    API_KEY_PATTERN = "api_key_pattern"
    DATABASE_CONNECTION = "database_connection"
    CREDENTIAL_PATTERN = "credential_pattern"

EntropyLevel

Bases: StrEnum

Entropy classification levels (thresholds are file-type adaptive).

Source code in src\kp_ssf_tools\analyze\models\types.py
class EntropyLevel(StrEnum):
    """Entropy classification levels (thresholds are file-type adaptive)."""

    CRITICAL = "critical"  # Significantly above expected for file type
    HIGH = "high"  # Above normal range for file type
    MEDIUM_HIGH = "medium_high"  # Slightly elevated for file type
    MEDIUM = "medium"  # Normal range for file type
    LOW = "low"  # Below normal range for file type
    VERY_LOW = "very_low"  # Significantly below expected for file type

    @property
    def order(self) -> int:
        """
        Get the order of the entropy level.

        Higher levels have a higher order (e.g., CRITICAL > HIGH)
        """
        levels = list(type(self))
        return len(levels) - levels.index(self) - 1
Attributes
order property

Get the order of the entropy level.

Higher levels have a higher order (e.g., CRITICAL > HIGH)

FileType

Bases: StrEnum

File types for entropy analysis using Pygments lexer names where applicable.

Source code in src\kp_ssf_tools\analyze\models\types.py
class FileType(StrEnum):
    """File types for entropy analysis using Pygments lexer names where applicable."""

    # Programming languages (using Pygments lexer names)
    C = "C"
    CPP = "C++"
    CSHARP = "C#"
    DART = "Dart"
    GO = "Go"
    JAVA = "Java"
    JAVASCRIPT = "JavaScript"
    KOTLIN = "Kotlin"
    MATLAB = "MATLAB"
    PERL = "Perl"
    PHP = "PHP"
    PYTHON = "Python"
    R = "R"
    RUBY = "Ruby"
    RUST = "Rust"
    SCALA = "Scala"
    SQL = "SQL"
    SWIFT = "Swift"
    TYPESCRIPT = "TypeScript"
    VISUAL_BASIC = "Visual Basic"

    # Special content types
    DOCUMENTATION = "documentation"  # Plain text files, documentation
    BASE64_ENCODED = "base64_encoded"  # Custom
    ENCRYPTED = "encrypted"  # Custom
    HEX_ENCODED = "hex_encoded"  # Custom

    # Binary executables
    LINUX_ELF = "linux_elf"  # Custom
    MACOS_MACHO = "macos_macho"  # Custom
    WINDOWS_PE = "windows_pe"  # Custom

    # Fallback
    UNKNOWN = "unknown"

    @classmethod
    def from_pygments_lexer(cls, lexer_name: str) -> FileType:
        """
        Map Pygments lexer names to FileType enums.

        Handles multiple lexer names for the same language (e.g., "Python" vs "Python 3").

        Args:
            lexer_name: Name from Pygments lexer.name

        Returns:
            Corresponding FileType enum, defaults to UNKNOWN for unrecognized lexers

        """
        # Direct matches (most common case)
        for file_type in cls:
            if file_type.value == lexer_name:
                return file_type

        # Handle aliases and special cases
        lexer_aliases = {
            "Python 3": cls.PYTHON,
            "JavaScript+Lasso": cls.JAVASCRIPT,
            "VB.net": cls.VISUAL_BASIC,
            "Markdown": cls.DOCUMENTATION,
            "reStructuredText": cls.DOCUMENTATION,
            "Text only": cls.DOCUMENTATION,
        }

        return lexer_aliases.get(lexer_name, cls.UNKNOWN)
Functions
from_pygments_lexer(lexer_name) classmethod

Map Pygments lexer names to FileType enums.

Handles multiple lexer names for the same language (e.g., "Python" vs "Python 3").

Parameters:

  • lexer_name (str, required): Name from Pygments lexer.name

Returns:

  • FileType: Corresponding FileType enum, defaults to UNKNOWN for unrecognized lexers


Usage Examples

Basic Entropy Analysis

Command-line examples of entropy analysis operations.

# Basic entropy analysis
ssf_tools analyze entropy /path/to/file.bin

# Entropy analysis with custom block size
ssf_tools analyze entropy /path/to/directory --block-size 16384

# Output to specific file
ssf_tools analyze entropy /path/to/files --output analysis_results.xlsx

Credential Detection

Examples of credential scanning operations.

# Basic credential scan
ssf_tools analyze credentials /path/to/source/code

Service Usage in Code

Programmatic usage of analysis services within other commands.

from pathlib import Path

from dependency_injector.wiring import Provide, inject

from kp_ssf_tools.analyze.models.types import EntropyLevel, FileType


# ApplicationContainer is provided by the application's container module
@inject
def custom_analysis(
    target: Path,
    entropy_analyzer=Provide[ApplicationContainer.analysis.entropy_analyzer],
    threshold_service=Provide[ApplicationContainer.analysis.threshold_service],
):
    """Custom analysis combining multiple services."""

    # Get content-aware thresholds
    file_type = FileType.PYTHON
    thresholds = threshold_service.get_thresholds_for_file_type(file_type)

    # Perform entropy analysis, collecting regions above normal levels
    suspicious_regions = []
    for region in entropy_analyzer.analyze_file_generator(target, file_type):
        level = threshold_service.classify_entropy_level(region.entropy, file_type)
        if level in (EntropyLevel.HIGH, EntropyLevel.MEDIUM_HIGH):
            print(f"Suspicious region: {region.offset}-{region.offset + region.size}")
            suspicious_regions.append(region)

    return suspicious_regions

Performance Considerations

Memory Efficiency

The analysis system uses streaming techniques to handle large files efficiently.

  • Input File Streaming Analysis: Files are processed in chunks to maintain constant memory usage
  • Results File Streaming: Results are exported to Excel as they are received using xlsxwriter's constant_memory option. Memory consumption is set to 10MB regardless of input file or result set size
  • Generator Patterns: Results are yielded incrementally to avoid loading entire result sets
  • Resource Management: File handles and streams are properly managed through context managers
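The chunked approach can be sketched as a generator that never holds more than one block in memory at a time (the block size and the `(offset, size, entropy)` tuple shape are illustrative, not the EntropyAnalyzer API):

```python
import math
from collections import Counter
from pathlib import Path
from typing import Iterator


def entropy_regions(path: Path, block_size: int = 8192) -> Iterator[tuple[int, int, float]]:
    """Yield (offset, size, entropy) per block; memory use stays constant."""
    offset = 0
    with open(path, "rb") as f:
        # Read one block at a time; the walrus expression stops at EOF
        while chunk := f.read(block_size):
            counts = Counter(chunk)
            total = len(chunk)
            entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
            yield (offset, total, entropy)
            offset += total
```

Because blocks are yielded as soon as they are read, downstream consumers (such as a streaming Excel exporter) can process results without buffering the whole file.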

Processing Optimization

Multiple optimization strategies improve analysis performance.

  • Content-Aware Processing: Different file types use optimized analysis strategies
  • Parallel Processing: Multiple files can be analyzed concurrently when memory allows
  • Early Termination: Analysis can stop early when maximum regions are reached
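Concurrent multi-file analysis can be sketched with a bounded worker pool; here `analyze_one` is a hypothetical stand-in for a real per-file analysis, and the worker count caps memory use:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyze_one(path: Path) -> tuple[Path, int]:
    """Stand-in for a per-file analysis; returns the file size here."""
    return path, path.stat().st_size


def analyze_many(paths: list[Path], max_workers: int = 4) -> list[tuple[Path, int]]:
    """Analyze files concurrently; bound max_workers to respect memory limits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with paths
        return list(pool.map(analyze_one, paths))
```

A thread pool suits I/O-bound file reads; a process pool would be the analogous choice if per-file analysis were CPU-bound.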

Testing Patterns

Mock Analysis Services

Create mock services for testing analysis workflows.

from unittest.mock import Mock

import pytest


@pytest.fixture
def mock_entropy_analyzer():
    """Mock entropy analyzer for testing."""
    mock_analyzer = Mock(spec=EntropyAnalyzerProtocol)
    mock_analyzer.analyze_file_generator.return_value = iter([
        EntropyRegion(offset=0, size=1024, entropy=7.8, level=EntropyLevel.HIGH)
    ])
    return mock_analyzer

def test_entropy_analysis_with_mock(mock_entropy_analyzer):
    """Test entropy analysis with mocked analyzer."""
    result = perform_analysis_with_analyzer(mock_entropy_analyzer)
    assert len(result.high_entropy_regions) == 1

Integration Testing

Test analysis services with real dependencies and temporary files.

import os
import tempfile
from pathlib import Path


def test_entropy_analyzer_integration():
    """Test entropy analyzer with real dependencies."""
    with tempfile.TemporaryDirectory() as temp_dir:
        # Create test file with known entropy characteristics
        test_file = Path(temp_dir) / "test.bin"
        with open(test_file, "wb") as f:
            f.write(os.urandom(8192))  # High entropy data

        # Test with real analyzer
        analyzer = create_real_analyzer()
        regions = list(analyzer.analyze_file_generator(test_file, FileType.UNKNOWN))

        assert len(regions) > 0
        assert all(region.entropy > 7.0 for region in regions)

The Analysis Components architecture provides a robust, extensible foundation for file analysis operations within SSF Tools, supporting both entropy analysis and credential detection through a unified, protocol-based design that integrates seamlessly with the broader SSF Tools ecosystem.