
11

Documentation for 11 from the Json remedy repository.

You’re absolutely right - this is like reverse engineering without the test suite that reveals the true edge cases. The Python codebase is battle-tested against real-world malformed JSON that we can’t see. Here’s a detailed investigation plan focusing on the sophisticated systems and hardcoded processes that need complete reverse engineering:

Investigation Plan: Reverse Engineering Python json_repair

Phase 1: Core Architecture Understanding

1.1 State Machine Analysis (json_parser.py)

What to investigate:

  • The ContextValues enum and how context transitions work
  • The relationship between context.current and parsing decisions
  • How missing_quotes flag affects parsing behavior throughout the entire pipeline

Hardcoded processes to reverse engineer:

# This is critical - the context switching logic
def parse_object(self):
    # How does it decide between treating something as a key vs value?
    # What are the exact conditions for context.set(ContextValues.OBJECT_KEY)?
    ...

Key questions:

  • When exactly does it switch from OBJECT_KEY to OBJECT_VALUE context?
  • How does the context stack interact with nested objects/arrays?
  • What triggers the missing_quotes = True state and how does it propagate?
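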
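
A cheap way to start on these questions from the outside is black-box probing: feed minimal pairs to the reference implementation and watch where the key/value decision lands. A sketch (the probe inputs are guesses at interesting cases, not taken from the real test suite):

# Black-box probes for key-vs-value decisions.
# Uses only the public json_repair API; inputs are illustrative.
from json_repair import repair_json

PROBES = [
    '{key: "value"}',        # unquoted key: when is OBJECT_KEY assumed?
    '{"key" "value"}',       # missing colon: does the parser insert one?
    '{"key": value}',        # unquoted value: OBJECT_VALUE handling
    '{"a": {"b": [1, 2}}',   # nested structures with a bad terminator
]

for probe in PROBES:
    print(f"{probe!r:32} -> {repair_json(probe)!r}")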

1.2 String Parsing State Machine (parse_string())

This is the most complex part - 200+ lines of nested conditionals

Critical hardcoded behaviors to map:

# Line ~400-600: The core string parsing logic
def parse_string(self) -> str | bool | None:
    missing_quotes = False
    doubled_quotes = False
    unmatched_delimiter = False
    # ... complex state tracking

Sophisticated systems to understand:

  1. Quote delimiter selection logic - how it chooses between ", ', “, and ”
  2. Missing quote detection - the heuristics for when to assume quotes are missing
  3. Doubled quote handling - when "" should become " vs when it’s intentional
  4. String termination conditions - the complex logic for when to end a string
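
A quick external check on item 1: render the same payload with each delimiter the parser is believed to accept (the curly pair comes from reading the source, so verify against the pinned version):

# Quote-delimiter probes - same payload, different delimiters.
from json_repair import repair_json

DELIMITERS = ['"', "'", '\u201c\u201d']  # straight double, single, curly pair
for d in DELIMITERS:
    left, right = d[0], d[-1]
    src = "{%skey%s: %svalue%s}" % (left, right, left, right)
    print(f"{src!r:34} -> {repair_json(src)!r}")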

Phase 2: Critical Heuristics Reverse Engineering

2.1 The “Unmatched Delimiter” Logic

Location: parse_string() around lines 450-500

What to reverse engineer:

# This is probably the most battle-tested part
unmatched_delimiter = not unmatched_delimiter
# When does this toggle? What does it mean?
# How does it affect subsequent parsing?

Investigation needed:

  • Map all conditions that flip unmatched_delimiter
  • Understand how it interacts with doubled_quotes state
  • Document the exact sequence of state changes
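
One plausible reading of the flag, worth falsifying against the real code, is that it tracks the parity of unescaped delimiters seen so far. A minimal model of that reading (not the actual implementation):

# A toy model of the unmatched_delimiter toggle: True whenever an
# odd number of unescaped delimiters has been seen so far.
def delimiter_parity(s: str, delimiter: str = '"') -> bool:
    unmatched = False
    escaped = False
    for ch in s:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == delimiter:
            unmatched = not unmatched  # same toggle shape as the source
    return unmatched

assert delimiter_parity('"abc"') is False     # balanced quotes
assert delimiter_parity('"abc') is True       # one quote left open
assert delimiter_parity('"a\\"bc"') is False  # escaped quote ignored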

2.2 Missing Quote Detection Heuristics

The most sophisticated system - this handles malformed strings

Key patterns to extract:

# Around line 480-520: Complex lookahead logic
if self.context.current == ContextValues.OBJECT_KEY:
    # Check if this is followed by a colon
    i = self.skip_to_character(character=":", idx=1)
    # ... complex decision tree

Hardcoded processes:

  1. Colon detection for object keys - exact distance and whitespace handling
  2. Comma vs closing brace priority - which terminator takes precedence
  3. Nested quote handling - how it handles quotes within “strings”
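
As a starting hypothesis for item 1, the lookahead can be modeled as "scan forward from the key start and let the first structural character decide." A deliberately naive model to falsify against the real decision tree:

# A simplified model of the colon lookahead. The real parse_string()
# also weighs whitespace, distance, and nesting depth.
def key_end_guess(s: str, start: int) -> tuple[int, str]:
    for i in range(start, len(s)):
        if s[i] == ":":
            return i, "end key, expect value"
        if s[i] in ",}":
            return i, "end key, no value followed"
    return len(s), "ran off the end of input"

print(key_end_guess('{key: 1}', 1))  # colon wins -> key/value split
print(key_end_guess('{key}', 1))     # closing brace -> degenerate key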

2.3 The Whitespace and Terminator Matrix

Location: Throughout parse_string() but especially the termination logic

Critical matrix to reverse engineer:

Context      | Next Char | Action
-------------|-----------|--------
OBJECT_KEY   | :         | End string, expect value
OBJECT_KEY   | ,         | End string, expect next key  
OBJECT_KEY   | }         | End string, end object
OBJECT_VALUE | ,         | End string, expect next key
OBJECT_VALUE | }         | End string, end object
ARRAY        | ,         | End string, expect next item
ARRAY        | ]         | End string, end array
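
Encoding the matrix as data makes it usable as a test oracle once the real behavior is confirmed:

# The termination matrix above as a lookup table. Entries mirror the
# table verbatim; extend it as more (context, char) pairs are mapped.
TERMINATION_MATRIX = {
    ("OBJECT_KEY", ":"): "End string, expect value",
    ("OBJECT_KEY", ","): "End string, expect next key",
    ("OBJECT_KEY", "}"): "End string, end object",
    ("OBJECT_VALUE", ","): "End string, expect next key",
    ("OBJECT_VALUE", "}"): "End string, end object",
    ("ARRAY", ","): "End string, expect next item",
    ("ARRAY", "]"): "End string, end array",
}

def expected_action(context: str, next_char: str) -> str | None:
    return TERMINATION_MATRIX.get((context, next_char))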

Phase 3: Object Parsing Complexity (parse_object())

3.1 The Key-Value State Machine

Sophisticated system: How it handles malformed key-value pairs

Critical logic to map:

# Around line 200-250: Key parsing with rollback
rollback_index = self.index
key = self.parse_string()
# ... complex validation and rollback logic

Hardcoded behaviors:

  • Rollback conditions - when does it backtrack and retry?
  • Empty key handling - how it deals with {"": value}
  • Duplicate key detection - the array merging logic
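
Black-box probes that should exercise these branches (the inputs are educated guesses; confirm with instrumentation which path each one actually takes):

# Probes aimed at the rollback and empty-key paths.
from json_repair import repair_json

for probe in ['{"": 1}',            # empty key
              '{"a" 1, "b": 2}',    # missing colon: rollback or insert?
              '{"a": 1 "b": 2}']:   # missing comma between pairs
    print(f"{probe!r:24} -> {repair_json(probe)!r}")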

3.2 Array Merging Logic

This is particularly sophisticated:

# Lines ~220-240: Array detection and merging
if isinstance(prev_value, list):
    new_array = self.parse_array()
    prev_value.extend(new_array)

Investigation needed:

  • Under what exact conditions does it merge arrays?
  • How does it detect that something should be an array continuation?
  • What’s the precedence between array merging vs new key-value pairs?
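
Candidate inputs for answering these questions by differential testing (purely exploratory; run them through both implementations and diff the results):

# Inputs that might hit the extend() path above: an array value
# followed by more array-like content.
from json_repair import repair_json

for probe in ['{"a": [1, 2] [3, 4]}',   # second array after the first
              '{"a": [1, 2], [3, 4]}',  # comma-separated stray array
              '{"a": [1, 2}']:          # array closed by the wrong brace
    print(f"{probe!r:26} -> {repair_json(probe)!r}")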

Phase 4: Edge Case Handling Patterns

4.1 The “Something Fishy” Detection

Location: Multiple places in parse_string()

Hardcoded patterns:

# This appears several times - what makes something "fishy"?
# something fishy is going on here
if next_c == rstring_delimiter:
    doubled_quotes = True

Reverse engineer:

  • All conditions that trigger “fishy” detection
  • How “fishy” state affects subsequent parsing
  • The relationship between “fishy” and error recovery
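
Doubled-quote probes to anchor this investigation (expected outputs are unknown up front; record what both implementations do):

# Inputs clustered around the doubled_quotes branch.
from json_repair import repair_json

for probe in ['{"key": ""value""}',   # doubled quotes around a value
              '{"key": ""}',          # legitimately empty string
              '{"key": "a""b"}']:     # doubled quote mid-string
    print(f"{probe!r:26} -> {repair_json(probe)!r}")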

4.2 Comment and Special Character Handling

Location: parse_comment() and embedded in string parsing

Systems to understand:

# How it handles comments embedded in JSON
termination_characters = ["\n", "\r"]
# Complex comment vs string content detection
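
Probes for the comment-vs-string ambiguity (illustrative inputs, chosen to sit on the boundary):

# Comments in positions where they could be confused with content.
from json_repair import repair_json

for probe in ['{"a": 1} // trailing comment',
              '{"a": /* inline */ 1}',
              '{"a": "// not a comment"}']:
    print(f"{probe!r:34} -> {repair_json(probe)!r}")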

4.3 Unicode and Escape Sequence Handling

Location: Throughout string parsing

Hardcoded processes:

escape_seqs = {"t": "\t", "n": "\n", "r": "\r", "b": "\b"}
num_chars = 4 if char == "u" else 2  # \uXXXX needs 4 hex digits, \xXX needs 2
next_chars = self.json_str[self.index + 1 : self.index + 1 + num_chars]
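
Escape-sequence probes, valid and deliberately truncated (illustrative; the repaired outputs are exactly what needs documenting):

# Well-formed, truncated, and non-\u escapes.
from json_repair import repair_json

for probe in ['{"a": "\\u0041"}',   # well-formed unicode escape
              '{"a": "\\u00"}',     # truncated escape: repaired how?
              '{"a": "\\x41"}']:    # \x escape: decoded or kept literal?
    print(f"{probe!r:24} -> {repair_json(probe)!r}")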

Phase 5: The Most Critical Investigation Points

5.1 The stream_stable Parameter Effects

This affects multiple parsing decisions:

if self.stream_stable:
    string_acc = string_acc[:-1]  # Remove trailing backslash
else:
    string_acc = string_acc.rstrip()  # Remove trailing whitespace

Map all locations where stream_stable changes behavior
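
A direct way to map them is a side-by-side run with the flag on and off, assuming the public repair_json exposes stream_stable (recent json_repair releases do; verify against the pinned version first):

# The same truncated stream with stream_stable off and on.
from json_repair import repair_json

TRUNCATED = '{"message": "hello wo'   # a stream cut mid-string
print(repair_json(TRUNCATED, stream_stable=False))
print(repair_json(TRUNCATED, stream_stable=True))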

5.2 The Whitespace and Boundary Detection

Critical for robust parsing:

def skip_whitespaces_at(self, idx: int = 0, move_main_index=True) -> int:
    # This is called everywhere - understand its exact behavior
    ...

5.3 The Character Navigation System

In json_parser.py - the foundation of everything:

from typing import Literal

def get_char_at(self, count: int = 0) -> str | Literal[False]:
    # Why does this return False instead of None?
    # How does False propagate through the system?
    ...
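
A toy stand-in helps experiment with the sentinel choice: both False and None are falsy and compare unequal to every character without raising, so the places where the real code actually distinguishes them (identity checks, boolean arithmetic) are what to hunt for:

# A minimal stand-in for get_char_at to play with the False sentinel.
def get_char_at(s: str, count: int = 0):
    try:
        return s[count]
    except IndexError:
        return False

assert get_char_at("ab", 5) is False
assert (get_char_at("ab", 5) == "a") is False  # comparison never raises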

Phase 6: Data-Driven Reverse Engineering Strategy

6.1 Create Synthetic Test Cases

Since we don’t have the real test suite:

  1. Generate edge cases systematically:

    • All combinations of missing quotes × context states
    • Nested structures with malformed terminators
    • Unicode edge cases with malformed escapes
  2. Compare Python vs Elixir outputs:

    • Feed identical malformed JSON to both
    • Document where they differ
    • Reverse engineer the Python logic that handles the differences
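
A small generator covers the first axis combinatorially (the templates are invented here; grow the set as divergences appear):

# Missing-quotes x structure templates, expanded combinatorially.
from itertools import product

TEMPLATES = ['{%(q)skey%(q)s: %(q)svalue%(q)s}',
             '[%(q)sitem%(q)s]',
             '{%(q)sa%(q)s: [%(q)sb%(q)s]}']
QUOTES = ['"', "'", '']   # straight, single, missing entirely

def generate_cases():
    for template, q in product(TEMPLATES, QUOTES):
        yield template % {'q': q}

for case in generate_cases():
    print(case)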

6.2 Instrumentation Strategy

Add logging to Python codebase:

# Add at every decision point:
print(f"DEBUG: context={self.context.current}, char={char}, missing_quotes={missing_quotes}")

6.3 State Transition Mapping

Create comprehensive state diagrams:

  • Map every context.set() call and its conditions
  • Document every flag flip (missing_quotes, doubled_quotes, etc.)
  • Chart the interaction between all boolean states

Phase 7: The Hardest Parts (Priority Investigation)

7.1 The String Termination Decision Tree

This is where most of the battle-testing is encoded

  • The exact precedence of different terminators
  • How context affects termination decisions
  • The lookahead distance for each decision

7.2 The Quote Matching Algorithm

Sophisticated quote pairing logic

  • How it decides which quotes are pairs
  • The unmatched_delimiter toggle mechanism
  • Recovery strategies when quote matching fails

7.3 The Error Recovery Hierarchy

When parsing fails, how does it recover?

  • The rollback mechanisms
  • The “ignore and continue” vs “backtrack and retry” decisions
  • How it chooses between multiple possible repairs

Investigation Tools and Methodology

  1. Differential Testing: Create hundreds of malformed JSON samples and compare outputs
  2. State Logging: Instrument every state change in the Python code
  3. Edge Case Mining: Use property-based testing to find divergent behaviors
  4. Performance Profiling: Understand which code paths are most exercised
  5. Code Coverage Analysis: See which branches are rarely taken (likely edge cases)
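
For item 3, a property-based harness can do the mining automatically. A sketch with Hypothesis (a failing example here is not a bug report - it is exactly a divergence candidate to study):

# Edge-case mining: truncate valid JSON at random and require that the
# repaired output parse back as JSON. Requires `pip install hypothesis`.
import json
from hypothesis import given, strategies as st
from json_repair import repair_json

json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=10,
)

@given(value=json_values, cut=st.integers(min_value=0, max_value=20))
def test_truncated_json_repairs_to_valid_json(value, cut):
    s = json.dumps(value)
    cut = min(cut, len(s) - 1)       # keep at least one character
    broken = s[:-cut] if cut else s
    json.loads(repair_json(broken))  # must not raise

test_truncated_json_repairs_to_valid_json()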

The key insight is that without the test suite, we have to approach the codebase archaeologically - every conditional branch represents a real-world edge case that someone actually hit. The nested if/else trees are essentially a knowledge base of “here’s what goes wrong with JSON in the real world.”