11

Documentation for 11 from the Json remedy repository.

You’re absolutely right - this is like reverse engineering without the test suite that reveals the true edge cases. The Python codebase is battle-tested against real-world malformed JSON that we can’t see. Here’s a detailed investigation plan focusing on the sophisticated systems and hardcoded processes that need complete reverse engineering:

Investigation Plan: Reverse Engineering Python json_repair

Phase 1: Core Architecture Understanding

1.1 State Machine Analysis (`json_parser.py`)

What to investigate:

The ContextValues enum and how context transitions work
The relationship between context.current and parsing decisions
How missing_quotes flag affects parsing behavior throughout the entire pipeline

Hardcoded processes to reverse engineer:

# This is critical - the context switching logic
def parse_object(self):
    # How does it decide between treating something as a key vs value?
    # What are the exact conditions for context.set(ContextValues.OBJECT_KEY)?
    ...

Key questions:

When exactly does it switch from OBJECT_KEY to OBJECT_VALUE context?
How does the context stack interact with nested objects/arrays?
What triggers the missing_quotes = True state and how does it propagate?
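
One way to frame these questions is as a small context stack. The sketch below is an assumption about the shape of the mechanism, not a copy of the Python source: `ContextValues`, `context.set()` and `context.current` are names taken from this plan, while the `ContextStack` class, its `reset()` method and the push/pop behavior are illustrative guesses to be confirmed.

```python
from enum import Enum, auto


class ContextValues(Enum):
    OBJECT_KEY = auto()
    OBJECT_VALUE = auto()
    ARRAY = auto()


class ContextStack:
    """Hypothetical stand-in for the parser's context object."""

    def __init__(self):
        self._stack = []

    @property
    def current(self):
        # Innermost context, or None when parsing at the top level.
        return self._stack[-1] if self._stack else None

    def set(self, value):
        # Entering a key, a value, or an array pushes a new context.
        self._stack.append(value)

    def reset(self):
        # Leaving it pops back to whatever enclosed it (assumed behavior).
        if self._stack:
            self._stack.pop()


ctx = ContextStack()
ctx.set(ContextValues.ARRAY)        # inside [ ...
ctx.set(ContextValues.OBJECT_KEY)   # inside [ { "key ...
inner = ctx.current
ctx.reset()                         # key finished
outer = ctx.current
```

If the real context behaves like this, nested objects and arrays only ever consult the top of the stack, which is the first thing to verify.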

1.2 String Parsing State Machine (`parse_string()`)

This is the most complex part - 200+ lines of nested conditionals

Critical hardcoded behaviors to map:

# Line ~400-600: The core string parsing logic
def parse_string(self) -> str | bool | None:
    missing_quotes = False
    doubled_quotes = False
    unmatched_delimiter = False
    # ... complex state tracking

Sophisticated systems to understand:

Quote delimiter selection logic - how it chooses between ", ', “, and ”
Missing quote detection - the heuristics for when to assume quotes are missing
Doubled quote handling - when "" should become " vs when it’s intentional
String termination conditions - the complex logic for when to end a string
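
A minimal sketch of delimiter selection, under stated assumptions: the opening quote decides which closing quote to look for, and a string with no recognised opening quote is flagged as missing its quotes. The `pick_delimiters` helper and the `CLOSING_FOR` mapping (including the curly-quote pair) are hypothetical and need checking against `parse_string()`.

```python
# Assumed pairing of opening and closing delimiters (to be verified).
CLOSING_FOR = {'"': '"', "'": "'", "“": "”"}


def pick_delimiters(text, idx=0):
    """Return (left, right, missing_quotes) for the string starting at idx."""
    ch = text[idx] if idx < len(text) else ""
    if ch in CLOSING_FOR:
        return ch, CLOSING_FOR[ch], False
    # No recognised opening quote: fall back to plain double quotes and
    # flag that the opening quote was missing.
    return '"', '"', True
```

For example, `pick_delimiters("key: 1")` reports missing quotes, while `pick_delimiters("“hi”")` expects a closing curly quote.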

Phase 2: Critical Heuristics Reverse Engineering

2.1 The “Unmatched Delimiter” Logic

Location: parse_string() around lines 450-500

What to reverse engineer:

# This is probably the most battle-tested part
unmatched_delimiter = not unmatched_delimiter
# When does this toggle? What does it mean?
# How does it affect subsequent parsing?

Investigation needed:

Map all conditions that flip unmatched_delimiter
Understand how it interacts with doubled_quotes state
Document the exact sequence of state changes
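
Because `unmatched_delimiter` appears to be a local variable, it cannot be wrapped in a property; one option is to watch it with `sys.settrace`. The sketch below records every change to a named local while a function runs. `watch_local` and the `demo` function are hypothetical stand-ins: `demo` only imitates a toggle and is not the real parsing logic.

```python
import sys


def watch_local(func, var_name, *args, **kwargs):
    """Call func and record every change to one of its local variables.

    Returns (result, flips). Each flip is (line_offset, old, new), where
    line_offset is counted from the function's `def` line.
    """
    unset = object()
    state = {"last": unset, "prev_line": None}
    flips = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event in ("line", "return"):
            value = frame.f_locals.get(var_name, unset)
            last = state["last"]
            if last is not unset and value is not unset and value != last:
                # A "line" event fires before the line runs, so the change
                # was made by the previously reported line.
                flips.append((state["prev_line"] - code.co_firstlineno, last, value))
            state["last"] = value
            state["prev_line"] = frame.f_lineno
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)
    return result, flips


def demo(text):
    unmatched = False
    for ch in text:
        if ch == '"':
            unmatched = not unmatched
    return unmatched


result, flips = watch_local(demo, "unmatched", 'a"b"')
```

Pointing the same helper at the real `parse_string` (with the flag's name as `var_name`) should give the line of each flip, which can then be matched to the surrounding condition by hand. Recursive calls would share one record, so the output needs care there.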

2.2 Missing Quote Detection Heuristics

The most sophisticated system - this handles malformed strings

Key patterns to extract:

# Around line 480-520: Complex lookahead logic
if self.context.current == ContextValues.OBJECT_KEY:
    # Check if this is followed by a colon
    i = self.skip_to_character(character=":", idx=1)
    # ... complex decision tree

Hardcoded processes:

Colon detection for object keys - exact distance and whitespace handling
Comma vs closing brace priority - which terminator takes precedence
Nested quote handling - how it handles quotes within “strings”
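
To probe the colon-detection question in isolation, a small lookahead helper is enough. This is a sketch under assumptions: `followed_by_colon` and its `max_lookahead` parameter are hypothetical, and whether the Python code bounds its lookahead at all is exactly what needs to be measured.

```python
def followed_by_colon(text, idx, max_lookahead=None):
    """Return True if the next non-whitespace character at or after idx is ':'.

    max_lookahead, when given, limits how many characters are inspected.
    """
    end = len(text) if max_lookahead is None else min(len(text), idx + max_lookahead)
    while idx < end and text[idx].isspace():
        idx += 1
    return idx < end and text[idx] == ":"
```

Running it with different `max_lookahead` values against inputs where the Python output changes would reveal the effective distance, if there is one.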

2.3 The Whitespace and Terminator Matrix

Location: Throughout parse_string() but especially the termination logic

Critical matrix to reverse engineer:

Context      | Next Char | Action
-------------|-----------|--------
OBJECT_KEY   | :         | End string, expect value
OBJECT_KEY   | ,         | End string, expect next key  
OBJECT_KEY   | }         | End string, end object
OBJECT_VALUE | ,         | End string, expect next key
OBJECT_VALUE | }         | End string, end object
ARRAY        | ,         | End string, expect next item
ARRAY        | ]         | End string, end array
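
The matrix above can be kept as data so that each row becomes a checkable expectation. The sketch mirrors only the seven rows listed; the `TERMINATORS` table and `terminator_action` helper are hypothetical names, and any (context, character) pair not in the table is treated as "keep reading the string", which is an assumption.

```python
from enum import Enum, auto


class ContextValues(Enum):
    OBJECT_KEY = auto()
    OBJECT_VALUE = auto()
    ARRAY = auto()


# (context, next character) -> what the parser is expected to do next.
TERMINATORS = {
    (ContextValues.OBJECT_KEY, ":"): "expect value",
    (ContextValues.OBJECT_KEY, ","): "expect next key",
    (ContextValues.OBJECT_KEY, "}"): "end object",
    (ContextValues.OBJECT_VALUE, ","): "expect next key",
    (ContextValues.OBJECT_VALUE, "}"): "end object",
    (ContextValues.ARRAY, ","): "expect next item",
    (ContextValues.ARRAY, "]"): "end array",
}


def terminator_action(context, next_char):
    """Return the follow-up action, or None if the string should continue."""
    return TERMINATORS.get((context, next_char))
```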

Phase 3: Object Parsing Complexity (`parse_object()`)

3.1 The Key-Value State Machine

Sophisticated system: How it handles malformed key-value pairs

Critical logic to map:

# Around line 200-250: Key parsing with rollback
rollback_index = self.index
key = self.parse_string()
# ... complex validation and rollback logic

Hardcoded behaviors:

Rollback conditions - when does it backtrack and retry?
Empty key handling - how it deals with {"": value}
Duplicate key detection - the array merging logic
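
The rollback pattern itself is simple to model, which helps separate "how rollback works" from "when it fires". In this sketch the `Cursor` class, `try_parse`, and the `accept` callback are all hypothetical; only the idea of saving `rollback_index` before parsing comes from the excerpt above.

```python
class Cursor:
    """Hypothetical minimal parser state: the text and a read position."""

    def __init__(self, text):
        self.text = text
        self.index = 0


def try_parse(cursor, parse, accept):
    """Run parse(cursor); rewind the index if accept(result) is false."""
    rollback_index = cursor.index
    result = parse(cursor)
    if accept(result):
        return result
    cursor.index = rollback_index  # undo whatever parse() consumed
    return None


def read_word(cursor):
    # Toy parser: consume a run of alphanumeric characters.
    start = cursor.index
    while cursor.index < len(cursor.text) and cursor.text[cursor.index].isalnum():
        cursor.index += 1
    return cursor.text[start:cursor.index]


cursor = Cursor("abc: 1")
rejected = try_parse(cursor, read_word, accept=str.isdigit)
index_after_reject = cursor.index
accepted = try_parse(cursor, read_word, accept=bool)
```

The open question for the Python code is what plays the role of `accept`: which validation failures cause the rewind.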

3.2 Array Merging Logic

This is particularly sophisticated:

# Lines ~220-240: Array detection and merging
if isinstance(prev_value, list):
    new_array = self.parse_array()
    prev_value.extend(new_array)

Investigation needed:

Under what exact conditions does it merge arrays?
How does it detect that something should be an array continuation?
What’s the precedence between array merging vs new key-value pairs?
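
One reading of the excerpt above is that a stray array following a list-valued key is appended to that list. The sketch models only that reading. `merge_trailing_array` and the choice of "most recently inserted key" are assumptions to be checked against actual Python output.

```python
def merge_trailing_array(obj, new_array):
    """Extend the most recently inserted value if it is a list.

    Returns True when the merge happened, False when the caller should
    treat new_array as something else (for example, a new value).
    """
    if not obj:
        return False
    last_key = next(reversed(obj))  # dicts preserve insertion order
    prev_value = obj[last_key]
    if isinstance(prev_value, list):
        prev_value.extend(new_array)
        return True
    return False
```

Feeding both implementations inputs such as a list value followed by a second bare array would confirm or refute this reading.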

Phase 4: Edge Case Handling Patterns

4.1 The “Something Fishy” Detection

Location: Multiple places in parse_string()

Hardcoded patterns:

# This appears several times - what makes something "fishy"?
# something fishy is going on here
if next_c == rstring_delimiter:
    doubled_quotes = True

Reverse engineer:

All conditions that trigger “fishy” detection
How “fishy” state affects subsequent parsing
The relationship between “fishy” and error recovery

4.2 Comment and Special Character Handling

Location: parse_comment() and embedded in string parsing

Systems to understand:

# How it handles comments embedded in JSON
termination_characters = ["\n", "\r"]
# Complex comment vs string content detection

4.3 Unicode and Escape Sequence Handling

Location: Throughout string parsing

Hardcoded processes:

escape_seqs = {"t": "\t", "n": "\n", "r": "\r", "b": "\b"}
num_chars = 4 if char == "u" else 2
next_chars = self.json_str[self.index + 1 : self.index + 1 + num_chars]
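
A self-contained sketch of this escape handling, useful as a reference point when comparing outputs. The `escape_seqs` table is taken from the excerpt; `decode_escape` is hypothetical, the two-character case is assumed to be `\x`, and passing malformed escapes through unchanged is a placeholder for whatever the Python code really does.

```python
ESCAPE_SEQS = {"t": "\t", "n": "\n", "r": "\r", "b": "\b"}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escape(text, idx):
    """Decode the escape whose backslash sits at text[idx].

    Returns (decoded_text, next_index). Malformed \\u / \\x escapes are
    kept verbatim here (assumed behavior, to be verified).
    """
    char = text[idx + 1] if idx + 1 < len(text) else ""
    if char in ESCAPE_SEQS:
        return ESCAPE_SEQS[char], idx + 2
    if char in ("u", "x"):
        num_chars = 4 if char == "u" else 2
        digits = text[idx + 2 : idx + 2 + num_chars]
        if len(digits) == num_chars and all(c in HEX_DIGITS for c in digits):
            return chr(int(digits, 16)), idx + 2 + num_chars
    # Unknown or malformed escape: pass the backslash through unchanged.
    return "\\", idx + 1
```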

Phase 5: The Most Critical Investigation Points

5.1 The `stream_stable` Parameter Effects

This affects multiple parsing decisions:

if self.stream_stable:
    string_acc = string_acc[:-1]  # Remove trailing backslash
else:
    string_acc = string_acc.rstrip()  # Remove trailing whitespace

Map all locations where stream_stable changes behavior

5.2 The Whitespace and Boundary Detection

Critical for robust parsing:

def skip_whitespaces_at(self, idx: int = 0, move_main_index=True) -> int:
    # This is called everywhere - understand its exact behavior
    ...

In json_parser.py - the foundation of everything:

def get_char_at(self, count: int = 0) -> str | Literal[False]:
    # Why does this return False instead of None?
    # How does False propagate through the system?
    ...
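
To see how a `False` sentinel flows through calling code, a standalone imitation is enough. This is a sketch, not the library's method: the free function takes `text` and `index` explicitly, and the `IndexError`-based bounds check is an assumption about the implementation.

```python
from typing import Literal, Union


def get_char_at(text: str, index: int, count: int = 0) -> Union[str, Literal[False]]:
    """Return the character `count` positions past index, or False past the end.

    Note: a negative index + count would wrap around, as with any Python index.
    """
    try:
        return text[index + count]
    except IndexError:
        return False


def count_leading_spaces(text, index=0):
    n = 0
    char = get_char_at(text, index, n)
    # False is falsy, so the loop stops at end of input without a None check,
    # and comparisons such as `char == '"'` are simply False.
    while char and char.isspace():
        n += 1
        char = get_char_at(text, index, n)
    return n
```

An Elixir port has no falsy-string equivalent, so every `while char and ...` site in the Python code is a place where end-of-input handling has to be made explicit.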

Phase 6: Data-Driven Reverse Engineering Strategy

6.1 Create Synthetic Test Cases

Since we don’t have the real test suite:

Generate edge cases systematically:
- All combinations of missing quotes × context states
- Nested structures with malformed terminators
- Unicode edge cases with malformed escapes
Compare Python vs Elixir outputs:
- Feed identical malformed JSON to both
- Document where they differ
- Reverse engineer the Python logic that handles the differences
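
Systematic generation can start from valid JSON and apply one named corruption at a time, so every failing case is traceable to a single cause. The mutation names and helpers below are hypothetical, and the four mutations are only a starting set; the seed makes runs repeatable.

```python
import json
import random

MUTATIONS = ("drop_quote", "drop_closer", "swap_quote", "trailing_comma")


def mutate(valid_json, kind, rng):
    """Apply one named corruption to a valid JSON string."""
    if kind == "drop_quote":
        positions = [i for i, c in enumerate(valid_json) if c == '"']
        if not positions:
            return valid_json
        i = rng.choice(positions)
        return valid_json[:i] + valid_json[i + 1 :]
    if kind == "drop_closer":
        return valid_json.rstrip("}] \n")
    if kind == "swap_quote":
        return valid_json.replace('"', "'")
    if kind == "trailing_comma":
        return valid_json.replace("}", ",}").replace("]", ",]")
    raise ValueError(f"unknown mutation: {kind}")


def generate_cases(samples, seed=0):
    """Yield (kind, original, mutated) for every sample x mutation pair."""
    rng = random.Random(seed)
    for sample in samples:
        text = json.dumps(sample)
        for kind in MUTATIONS:
            yield kind, text, mutate(text, kind, rng)


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


samples = [{"a": [1, "x"]}, ["k", {"b": None}]]
cases = list(generate_cases(samples))
```

Each mutated string can then be fed to both implementations, with the mutation kind kept alongside the result for grouping.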

6.2 Instrumentation Strategy

Add logging to Python codebase:

# Add at every decision point:
print(f"DEBUG: context={self.context.current}, char={char}, missing_quotes={missing_quotes}")
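
The print call above works for a quick look; for longer sessions a `logging`-based helper can be switched off without removing the call sites. This is a sketch: the logger name and both helper functions are hypothetical and would be added to a local copy of the Python code, not something the library provides.

```python
import logging

logger = logging.getLogger("repair_trace")  # hypothetical logger name


def format_decision(label, **state):
    """Render one decision point as 'label: key=value, ...' with sorted keys."""
    details = ", ".join(f"{name}={value!r}" for name, value in sorted(state.items()))
    return f"{label}: {details}"


def trace_decision(label, **state):
    """Log a decision point at DEBUG level; cheap when tracing is disabled."""
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(format_decision(label, **state))


# Example call site, mirroring the fields in the print statement above.
trace_decision("parse_string", context="OBJECT_KEY", char='"', missing_quotes=False)
```

Calling `logging.basicConfig(level=logging.DEBUG)` before a run turns the output on. Sorted keys keep lines stable, so traces from two runs can be compared with a plain diff.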

6.3 State Transition Mapping

Create comprehensive state diagrams:

Map every context.set() call and its conditions
Document every flag flip (missing_quotes, doubled_quotes, etc.)
Chart the interaction between all boolean states
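
The diagrams can be generated from recorded runs instead of drawn by hand. A minimal sketch, assuming each instrumented `context.set()` call or flag flip reports its new state and a short reason; `TransitionLog` and the `TOP_LEVEL` initial state are hypothetical.

```python
from collections import Counter


class TransitionLog:
    """Counts (from_state, to_state, reason) triples seen during a run."""

    def __init__(self, initial="TOP_LEVEL"):
        self.current = initial
        self.counts = Counter()

    def record(self, new_state, reason):
        self.counts[(self.current, new_state, reason)] += 1
        self.current = new_state

    def to_dot(self):
        """Render the observed transitions as a Graphviz digraph.

        Reasons are inserted as-is, so they should not contain double quotes.
        """
        lines = ["digraph transitions {"]
        for (src, dst, reason), n in sorted(self.counts.items()):
            lines.append(f'  "{src}" -> "{dst}" [label="{reason} x{n}"];')
        lines.append("}")
        return "\n".join(lines)


log = TransitionLog()
log.record("OBJECT_KEY", "saw {")
log.record("OBJECT_VALUE", "saw :")
log.record("OBJECT_KEY", "saw ,")
log.record("OBJECT_VALUE", "saw :")
dot = log.to_dot()
```

Edges with low counts across a large corpus are good candidates for the rarely exercised edge cases this plan is trying to surface.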

Phase 7: The Hardest Parts (Priority Investigation)

7.1 The String Termination Decision Tree

This is where most of the battle-testing is encoded

The exact precedence of different terminators
How context affects termination decisions
The lookahead distance for each decision

7.2 The Quote Matching Algorithm

Sophisticated quote pairing logic

How it decides which quotes are pairs
The unmatched_delimiter toggle mechanism
Recovery strategies when quote matching fails

7.3 The Error Recovery Hierarchy

When parsing fails, how does it recover?

The rollback mechanisms
The “ignore and continue” vs “backtrack and retry” decisions
How it chooses between multiple possible repairs

Investigation Tools and Methodology

Differential Testing: Create hundreds of malformed JSON samples and compare outputs
State Logging: Instrument every state change in the Python code
Edge Case Mining: Use property-based testing to find divergent behaviors
Performance Profiling: Understand which code paths are most exercised
Code Coverage Analysis: See which branches are rarely taken (likely edge cases)
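
The differential-testing step can be sketched as a small harness that treats exceptions as outcomes, so crashes and wrong answers show up in the same report. `differential_test` is hypothetical, and the two functions used here are stand-ins: in practice `reference` would wrap the Python library and `candidate` the Elixir port, neither of which is shown.

```python
import json


def differential_test(samples, reference, candidate):
    """Run both repair functions over samples and collect disagreements.

    reference and candidate each take a string and return a Python value;
    an exception is treated as an outcome of its own.
    """

    def run(func, text):
        try:
            return ("ok", func(text))
        except Exception as exc:  # any failure is data for the comparison
            return ("error", type(exc).__name__)

    diffs = []
    for text in samples:
        expected, actual = run(reference, text), run(candidate, text)
        if expected != actual:
            diffs.append({"input": text, "reference": expected, "candidate": actual})
    return diffs


# Stand-ins for illustration only: a strict parser versus one that
# tolerates single quotes.
def strict(text):
    return json.loads(text)


def tolerant(text):
    return json.loads(text.replace("'", '"'))


diffs = differential_test(['{"a": 1}', "{'a': 1}", '{"a": 1'], strict, tolerant)
```

Here only the single-quoted input is reported, since the other two inputs produce the same outcome on both sides.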

The key insight is that without the test suite, we need to approach the codebase like archaeologists - every conditional branch likely represents a real-world edge case that someone encountered. The nested if/else trees are essentially a knowledge base of “here’s what goes wrong with JSON in the real world.”