You’re absolutely right - this is like reverse engineering without the test suite that reveals the true edge cases. The Python codebase is battle-tested against real-world malformed JSON that we can’t see. Here’s a detailed investigation plan focusing on the sophisticated systems and hardcoded processes that need complete reverse engineering:
Investigation Plan: Reverse Engineering Python json_repair
Phase 1: Core Architecture Understanding
1.1 State Machine Analysis (json_parser.py
)
What to investigate:
- The
ContextValues
enum and how context transitions work - The relationship between
context.current
and parsing decisions - How
missing_quotes
flag affects parsing behavior throughout the entire pipeline
Hardcoded processes to reverse engineer:
# This is critical - the context switching logic
def parse_object(self):
# How does it decide between treating something as a key vs value?
# What are the exact conditions for context.set(ContextValues.OBJECT_KEY)?
Key questions:
- When exactly does it switch from
OBJECT_KEY
toOBJECT_VALUE
context? - How does the context stack interact with nested objects/arrays?
- What triggers the
missing_quotes = True
state and how does it propagate?
1.2 String Parsing State Machine (parse_string()
)
This is the most complex part - 200+ lines of nested conditionals
Critical hardcoded behaviors to map:
# Line ~400-600: The core string parsing logic
def parse_string(self) -> str | bool | None:
missing_quotes = False
doubled_quotes = False
unmatched_delimiter = False
# ... complex state tracking
Sophisticated systems to understand:
- Quote delimiter selection logic - how it chooses between
"
,'
,"
,"
- Missing quote detection - the heuristics for when to assume quotes are missing
- Doubled quote handling - when
""
should become"
vs when it’s intentional - String termination conditions - the complex logic for when to end a string
Phase 2: Critical Heuristics Reverse Engineering
2.1 The “Unmatched Delimiter” Logic
Location: parse_string()
around lines 450-500
What to reverse engineer:
# This is probably the most battle-tested part
unmatched_delimiter = not unmatched_delimiter
# When does this toggle? What does it mean?
# How does it affect subsequent parsing?
Investigation needed:
- Map all conditions that flip
unmatched_delimiter
- Understand how it interacts with
doubled_quotes
state - Document the exact sequence of state changes
2.2 Missing Quote Detection Heuristics
The most sophisticated system - this handles malformed strings
Key patterns to extract:
# Around line 480-520: Complex lookahead logic
if self.context.current == ContextValues.OBJECT_KEY:
# Check if this is followed by a colon
i = self.skip_to_character(character=":", idx=1)
# ... complex decision tree
Hardcoded processes:
- Colon detection for object keys - exact distance and whitespace handling
- Comma vs closing brace priority - which terminator takes precedence
- Nested quote handling - how it handles quotes within “strings”
2.3 The Whitespace and Terminator Matrix
Location: Throughout parse_string()
but especially the termination logic
Critical matrix to reverse engineer:
Context | Next Char | Action
-------------|-----------|--------
OBJECT_KEY | : | End string, expect value
OBJECT_KEY | , | End string, expect next key
OBJECT_KEY | } | End string, end object
OBJECT_VALUE | , | End string, expect next key
OBJECT_VALUE | } | End string, end object
ARRAY | , | End string, expect next item
ARRAY | ] | End string, end array
Phase 3: Object Parsing Complexity (parse_object()
)
3.1 The Key-Value State Machine
Sophisticated system: How it handles malformed key-value pairs
Critical logic to map:
# Around line 200-250: Key parsing with rollback
rollback_index = self.index
key = self.parse_string()
# ... complex validation and rollback logic
Hardcoded behaviors:
- Rollback conditions - when does it backtrack and retry?
- Empty key handling - how it deals with
{"": value}
- Duplicate key detection - the array merging logic
3.2 Array Merging Logic
This is particularly sophisticated:
# Lines ~220-240: Array detection and merging
if isinstance(prev_value, list):
new_array = self.parse_array()
prev_value.extend(new_array)
Investigation needed:
- Under what exact conditions does it merge arrays?
- How does it detect that something should be an array continuation?
- What’s the precedence between array merging vs new key-value pairs?
Phase 4: Edge Case Handling Patterns
4.1 The “Something Fishy” Detection
Location: Multiple places in parse_string()
Hardcoded patterns:
# This appears several times - what makes something "fishy"?
# something fishy is going on here
if next_c == rstring_delimiter:
doubled_quotes = True
Reverse engineer:
- All conditions that trigger “fishy” detection
- How “fishy” state affects subsequent parsing
- The relationship between “fishy” and error recovery
4.2 Comment and Special Character Handling
Location: parse_comment()
and embedded in string parsing
Systems to understand:
# How it handles comments embedded in JSON
termination_characters = ["\n", "\r"]
# Complex comment vs string content detection
4.3 Unicode and Escape Sequence Handling
Location: Throughout string parsing
Hardcoded processes:
escape_seqs = {"t": "\t", "n": "\n", "r": "\r", "b": "\b"}
if char == "u" else 2
next_chars = self.json_str[self.index + 1 : self.index + 1 + num_chars]
Phase 5: The Most Critical Investigation Points
5.1 The stream_stable
Parameter Effects
This affects multiple parsing decisions:
if self.stream_stable:
string_acc = string_acc[:-1] # Remove trailing backslash
else:
string_acc = string_acc.rstrip() # Remove trailing whitespace
Map all locations where stream_stable
changes behavior
5.2 The Whitespace and Boundary Detection
Critical for robust parsing:
def skip_whitespaces_at(self, idx: int = 0, move_main_index=True) -> int:
# This is called everywhere - understand its exact behavior
5.3 The Character Navigation System
In json_parser.py
- the foundation of everything:
def get_char_at(self, count: int = 0) -> str | Literal[False]:
# Why does this return False instead of None?
# How does False propagate through the system?
Phase 6: Data-Driven Reverse Engineering Strategy
6.1 Create Synthetic Test Cases
Since we don’t have the real test suite:
Generate edge cases systematically:
- All combinations of missing quotes × context states
- Nested structures with malformed terminators
- Unicode edge cases with malformed escapes
Compare Python vs Elixir outputs:
- Feed identical malformed JSON to both
- Document where they differ
- Reverse engineer the Python logic that handles the differences
6.2 Instrumentation Strategy
Add logging to Python codebase:
# Add at every decision point:
print(f"DEBUG: context={self.context.current}, char={char}, missing_quotes={missing_quotes}")
6.3 State Transition Mapping
Create comprehensive state diagrams:
- Map every
context.set()
call and its conditions - Document every flag flip (
missing_quotes
,doubled_quotes
, etc.) - Chart the interaction between all boolean states
Phase 7: The Hardest Parts (Priority Investigation)
7.1 The String Termination Decision Tree
This is where most of the battle-testing is encoded
- The exact precedence of different terminators
- How context affects termination decisions
- The lookahead distance for each decision
7.2 The Quote Matching Algorithm
Sophisticated quote pairing logic
- How it decides which quotes are pairs
- The
unmatched_delimiter
toggle mechanism - Recovery strategies when quote matching fails
7.3 The Error Recovery Hierarchy
When parsing fails, how does it recover?
- The rollback mechanisms
- The “ignore and continue” vs “backtrack and retry” decisions
- How it chooses between multiple possible repairs
Investigation Tools and Methodology
- Differential Testing: Create hundreds of malformed JSON samples and compare outputs
- State Logging: Instrument every state change in the Python code
- Edge Case Mining: Use property-based testing to find divergent behaviors
- Performance Profiling: Understand which code paths are most exercised
- Code Coverage Analysis: See which branches are rarely taken (likely edge cases)
The key insight is that without the test suite, we need to become archaeological about the codebase - every conditional branch represents a real-world edge case that someone encountered. The nested if/else
trees are essentially a knowledge base of “here’s what goes wrong with JSON in the real world.”