
CRITIQUE

Documentation for CRITIQUE from the JsonRemedy repository.

Is Binary Pattern Matching Actually The Right Approach for JSON Repair?

This is an excellent question that gets to the heart of the architectural decisions. Let me analyze this critically:

The Binary Pattern Matching Reality Check

What Binary Pattern Matching IS Good For

Binary pattern matching excels when you have:

  • Known, fixed structures: <<"HTTP/1.1 ", status::binary-size(3), " ", rest::binary>> (see the sketch after this list)
  • Well-defined protocols: Network packets, binary formats, etc.
  • Performance-critical parsing: When you know exactly what you’re looking for
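
For concreteness, here is a minimal sketch (hypothetical module and function names) of the kind of fixed-structure parsing where binary patterns shine:

defmodule FixedStructure do
  # The status line of an HTTP/1.1 response has a known, fixed layout,
  # so a single binary pattern can destructure it directly
  def parse_status_line(<<"HTTP/1.1 ", status::binary-size(3), " ", reason::binary>>) do
    {:ok, String.to_integer(status), String.trim_trailing(reason)}
  end

  def parse_status_line(_other), do: {:error, :unrecognized}
end

# FixedStructure.parse_status_line("HTTP/1.1 200 OK\r\n")
# #=> {:ok, 200, "OK"}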

What Binary Pattern Matching IS NOT Good For

JSON repair is fundamentally about handling the unknown and malformed:

# This works for valid JSON:
defp parse_value(<<"{", rest::binary>>, ctx) do
  parse_object(rest, ctx)
end

# But what about these real-world cases?
malformed_cases = [
  "{name: 'Alice'",           # Missing quotes, single quotes
  "{ name : Alice }",         # Unquoted value
  "{/*comment*/ name: 'Alice'}", # Comments
  "{'name': 'Alice' 'age': 30}", # Missing comma
  "{name: 'Alice\nage: 30}",  # Newlines in strings
]

You cannot pattern match on malformed structure because by definition, you don’t know what the malformation will look like.

The Fundamental Problem

JSON Repair is Inherently Context-Sensitive

Consider this example:

{
  "message": "Hello, world",
  "items": [1, 2, 3,]
}

  • The comma in "Hello, world" should be preserved (it’s content)
  • The comma in [1, 2, 3,] should be removed (it’s a trailing comma)

Binary pattern matching cannot distinguish context without complex state management that negates its benefits.
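
To make the problem concrete, here is what a naive binary-match clause does with that example (illustrative sketch; NaiveCommaStripper is a hypothetical helper, not from the codebase):

defmodule NaiveCommaStripper do
  # This clause fires on BOTH commas above: the one inside "Hello, world"
  # and the trailing one in [1, 2, 3,]. The pattern alone cannot tell them
  # apart; an :in_string flag would have to thread through every clause,
  # which is exactly the state management that erodes the approach's benefits.
  def strip_comma(<<",", rest::binary>>, acc), do: strip_comma(rest, acc)
  def strip_comma(<<c, rest::binary>>, acc), do: strip_comma(rest, <<acc::binary, c>>)
  def strip_comma(<<>>, acc), do: acc
end

# NaiveCommaStripper.strip_comma(~S({"message": "Hello, world"}), "")
# #=> {"message": "Hello world"}  -- the content comma was corrupted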

Real-World Malformations Are Unpredictable

LLMs and legacy systems produce wildly different malformations:

# From GPT models:
"{name: \"Alice\", age: 30, active: true"

# From legacy Python systems:
"{'name': 'Alice', 'active': True, 'scores': [1, 2, 3,]}"

# From copy-paste errors:
"```json\n{\"name\": \"Alice\"}\n```"

# From incomplete responses:
"{\"users\": [{\"name\": \"Alice\""

Each requires different repair strategies that are impossible to express as binary patterns.
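
For example, the copy-paste case alone needs its own targeted fix that has nothing to do with parsing JSON at all (a sketch under stated assumptions; FenceStripper is a hypothetical helper):

defmodule FenceStripper do
  # Remove a leading ```json (or bare ```) fence and a trailing ``` fence
  def strip_code_fence(input) do
    input
    |> String.replace(~r/\A```(?:json)?\s*/, "")
    |> String.replace(~r/\s*```\z/, "")
  end
end

# FenceStripper.strip_code_fence("```json\n{\"name\": \"Alice\"}\n```")
# #=> "{\"name\": \"Alice\"}"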

Alternative Approaches Analysis

1. Regex Preprocessing + Standard Parser (Current approach)

# Pros:
- Handles wide variety of malformations
- Can be ordered and composed
- Leverages battle-tested JSON parsers
- Relatively simple to understand and debug

# Cons:
- Not "true" binary pattern matching
- Order dependency between regex rules
- Rules can interfere with each other (example below)
- Performance overhead of multiple string passes
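
One concrete instance of the interference risk (illustration only): a single-quote conversion rule also fires on apostrophes inside already-valid double-quoted strings.

input = ~S({"note": "it's fine", 'name': 'Alice'})
Regex.replace(~r/'([^']*)'/, input, ~S("\1"))
# The first match spans from the apostrophe in "it's" to the quote
# before 'name', mangling content that was already correct. Rule
# ordering and pre-masking of string content are needed to avoid this.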

2. Pure Binary Pattern Matching Parser

# Pros:
- True to Elixir idioms
- Potentially very fast for known patterns
- Memory efficient
- Composable parsing functions

# Cons:
- Cannot handle unknown malformations
- Requires predefined error patterns (sketched below)
- Complex state management for context
- Limited flexibility for new malformation types
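
To see why, sketch what tolerating even one malformation class does to the clause set (hypothetical helpers, in the same spirit as the other sketches in this document):

defmodule PatternOnlyParser do
  # Every tolerated malformation multiplies the function heads
  defp parse_key(<<?", rest::binary>>, ctx), do: parse_quoted_key(rest, ctx)
  defp parse_key(<<?', rest::binary>>, ctx), do: parse_single_quoted_key(rest, ctx)
  defp parse_key(<<c, _::binary>> = input, ctx) when c in ?a..?z or c in ?A..?Z do
    parse_bare_key(input, ctx)   # tolerate unquoted keys
  end
  # ...and comments, stray commas, and truncation each still need their
  # own heads, duplicated for every container state.
end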

3. Hybrid State Machine Approach

defmodule JsonRepair.StateMachine do
  # Track parsing context with states:
  # :in_string, :in_object_key, :in_object_value, :in_array, etc.

  defp repair_char(char, :in_string, state) do
    # Inside a string, characters are content: pass them through unchanged
    {char, state}
  end

  defp repair_char(",", :after_array_element, state) do
    # Here a comma may be trailing; defer the decision until the next
    # significant character shows whether another element follows
    {:maybe_trailing_comma, state}
  end
end

4. AST-Based Repair

# Parse into a partial AST, then repair the tree
defmodule JsonRepair.ASTRepair do
  def repair(malformed_json) do
    malformed_json
    |> parse_partial_ast()    # best-effort parse that keeps error nodes
    |> repair_ast_nodes()     # patch or drop the malformed nodes
    |> render_to_json()       # serialize the repaired tree back to JSON
  end
end

The Verdict: Binary Pattern Matching is NOT the Right Approach

Why Binary Pattern Matching Fails Here

  1. Malformation is unpredictable: You cannot pattern match on “anything that might be wrong”
  2. Context sensitivity: Same character needs different treatment based on location
  3. Composition complexity: Handling multiple malformations simultaneously
  4. Maintenance burden: Every new malformation type requires new patterns

What IS the Right Approach?

A Layered Strategy:

defmodule JsonRemedy.LayeredRepair do
  def repair(input) do
    input
    |> normalize_encoding()           # Handle UTF-8, encoding issues
    |> remove_non_json_wrapper()      # Strip code fences, comments
    |> apply_structural_fixes()       # Fix missing braces, brackets
    |> apply_syntax_fixes()           # Fix quotes, commas, colons
    |> attempt_parse()                # Try Jason.decode
    |> fallback_to_tolerant_parser()  # Last resort: custom parser
  end
end

Each layer handles specific concerns:

  • Layer 1: Content cleaning (regex is perfect here)
  • Layer 2: Structural repair (state machine works well)
  • Layer 3: Syntax normalization (regex again)
  • Layer 4: Validation (standard parser; example after this list)
  • Layer 5: Fallback (custom tolerant parser)
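
As a concrete example of Layer 4, the validation stage can be a thin wrapper over the standard parser (a sketch; attempt_parse/1 mirrors the name in the pipeline above and assumes Jason as a dependency):

defp attempt_parse(input) do
  # A clean parse ends the pipeline; a failure hands the original input
  # to the fallback stage, which pattern-matches on the tuple
  case Jason.decode(input) do
    {:ok, term} -> {:ok, term}
    {:error, %Jason.DecodeError{}} -> {:needs_fallback, input}
  end
end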

Recommendation: Embrace the Right Tool for Each Job

Instead of forcing binary pattern matching everywhere:

1. Use Regex for What It’s Good At

# Content cleaning and syntax normalization
defmodule JsonRemedy.Preprocessing do
  @syntax_fixes [
    {~r/\bTrue\b/, "true"},     # leading \b so words like "IsTrue" are left alone
    {~r/\bFalse\b/, "false"},
    {~r/'([^']*)'/, ~S("\1")},  # \1 (not \\1) so the capture is substituted
    # etc.
  ]
end
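
Applying the fixes is then a simple ordered reduce (apply_fixes/1 is a hypothetical helper added to the module above for illustration; ordering between rules still matters):

# Run each fix in list order over the whole input
def apply_fixes(input) do
  Enum.reduce(@syntax_fixes, input, fn {regex, replacement}, acc ->
    Regex.replace(regex, acc, replacement)
  end)
end

# apply_fixes("{'name': 'Alice', 'active': True}")
# #=> "{\"name\": \"Alice\", \"active\": true}"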

2. Use State Machines for Context

# Track parsing context for context-sensitive repairs
defmodule JsonRemedy.ContextualRepair do
  def repair_with_context(input) do
    input
    |> String.graphemes()
    |> Enum.reduce(%{state: :start, repairs: []}, &process_char/2)
  end
end
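
A few reducer clauses show how the tracked state changes the treatment of the same character (hypothetical process_char/2 clauses for the module above; real states would also cover escapes and nesting depth):

# Entering and leaving strings toggles the state...
defp process_char("\"", %{state: :in_string} = acc), do: %{acc | state: :after_value}
defp process_char("\"", acc), do: %{acc | state: :in_string}

# ...so a comma is content in one state and a repair candidate in another
defp process_char(",", %{state: :in_string} = acc), do: acc
defp process_char(",", acc), do: %{acc | repairs: [:check_trailing_comma | acc.repairs]}

defp process_char(_char, acc), do: acc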

3. Use Binary Patterns for Final Parsing

# After preprocessing, use binary patterns for clean parsing
defmodule JsonRemedy.CleanParser do
  defp parse_value(<<"{", rest::binary>>, ctx) do
    # Now we can safely use binary patterns
    parse_object(rest, ctx)
  end
end
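
Sibling clauses follow the same shape (a sketch; parse_array/2 and parse_string/2 are assumed helpers mirroring parse_object/2):

defp parse_value(<<"[", rest::binary>>, ctx), do: parse_array(rest, ctx)
defp parse_value(<<"\"", rest::binary>>, ctx), do: parse_string(rest, ctx)
defp parse_value(<<"true", rest::binary>>, ctx), do: {true, rest, ctx}
# After the repair layers have run, the head byte reliably signals the
# value type, so these patterns are safe again.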

4. Be Honest About the Architecture

The current approach is actually quite reasonable:

  • Regex preprocessing handles most malformations
  • Jason.decode handles the clean parsing
  • Custom parser as fallback

The problem isn’t the architecture—it’s the misleading marketing about “binary pattern matching supremacy.”

Conclusion

Binary pattern matching is NOT the right primary approach for JSON repair because:

  1. Repair requires handling unknown malformations
  2. Context sensitivity needs state tracking
  3. Regex is actually perfect for many repair tasks
  4. Existing parsers (Jason) are extremely optimized

The right approach is:

  • Multi-layered repair pipeline
  • Each layer uses appropriate tools
  • Honest about what each layer does
  • Focus on correctness and maintainability over theoretical purity

The current codebase should be repositioned as a practical, multi-strategy JSON repair toolkit rather than a “binary pattern matching showcase.”