← Back to Design

100

Documentation for 100 from the Json remedy repository.

After studying your documents and understanding both the Python json_repair and your Elixir json_remedy implementations, I can see a fundamentally different approach that could be much more powerful. Let me offer a fresh perspective:

The Meta-Learning Approach: Self-Bootstrapping Knowledge Extraction

Instead of trying to reverse-engineer Python’s accumulated wisdom or build complex probabilistic systems, we can create a self-improving system that learns patterns directly from real-world usage without requiring massive datasets.

Core Insight: JSON Repair as Program Synthesis

The breakthrough insight is treating JSON repair not as string manipulation, but as program synthesis - generating a minimal program that transforms malformed input into valid JSON. This shifts the problem from “what edits to make” to “what transformation rules to apply.”

defmodule JsonRemedy.MetaLearning do
  @moduledoc """
  Self-bootstrapping JSON repair that learns transformation patterns
  from successful repairs and builds increasingly sophisticated rules.
  """
  
  @doc """
  The key insight: every successful repair teaches us a transformation rule
  that can be generalized and applied to similar malformed patterns.
  """
  def learn_from_repair(malformed_input, successful_output, repair_context) do
    # Extract the transformation as a generalizable rule
    transformation_rule = synthesize_transformation_rule(
      malformed_input, 
      successful_output, 
      repair_context
    )
    
    # Test if this rule generalizes to other patterns
    generalization_score = evaluate_rule_generalization(transformation_rule)
    
    if generalization_score > 0.8 do
      # Add to knowledge base with confidence score
      KnowledgeBase.add_rule(transformation_rule, generalization_score)
    end
  end
  
  defp synthesize_transformation_rule(input, output, context) do
    # Use diff analysis to find minimal transformation
    edit_operations = compute_minimal_edit_sequence(input, output)
    
    # Abstract the specific characters to pattern variables
    abstract_pattern = abstract_to_pattern(edit_operations, context)
    
    # Generate applicability conditions
    conditions = infer_applicability_conditions(input, context)
    
    %TransformationRule{
      pattern: abstract_pattern,
      conditions: conditions,
      confidence: calculate_initial_confidence(edit_operations),
      examples: [{input, output, context}]
    }
  end
end

The Elegant Solution: Compositional Pattern Matching

Instead of complex state machines, use compositional pattern matching where simple rules combine to handle complex cases:

defmodule JsonRemedy.CompositionEngine do
  @moduledoc """
  Compositional approach where simple, reliable patterns compose
  to handle arbitrarily complex malformed JSON.
  """
  
  # Start with the most reliable, atomic patterns
  @atomic_patterns [
    # Pattern: Obvious quote fixes (99% confidence)
    %Pattern{
      name: :single_to_double_quotes,
      match: ~r/'([^']*)'/,
      transform: fn [_, content] -> "\"#{content}\"" end,
      confidence: 0.99,
      preconditions: [&not_inside_string?/2]
    },
    
    # Pattern: Python literals (95% confidence)
    %Pattern{
      name: :python_literals,
      match: ~r/\b(True|False|None)\b/,
      transform: fn [literal] -> 
        %{"True" => "true", "False" => "false", "None" => "null"}[literal]
      end,
      confidence: 0.95,
      preconditions: [&word_boundary?/2, &not_in_string?/2]
    }
  ]
  
  def repair_compositionally(input) do
    # Apply patterns in confidence order, building up repair confidence
    {repaired, total_confidence} = 
      @atomic_patterns
      |> Enum.sort_by(& &1.confidence, :desc)
      |> Enum.reduce({input, 1.0}, fn pattern, {current_input, conf_acc} ->
        case apply_pattern_safely(current_input, pattern) do
          {:ok, new_input, applied_conf} ->
            {new_input, conf_acc * applied_conf}
          
          :no_match ->
            {current_input, conf_acc}
        end
      end)
    
    # If confidence is high enough, we're done
    if total_confidence > 0.9 do
      {:ok, repaired, total_confidence}
    else
      # Try learning new patterns from context
      learn_and_retry(input, repaired, total_confidence)
    end
  end
end

The Breakthrough: Real-Time Pattern Discovery

Here’s the revolutionary part - instead of pre-training, discover patterns in real-time using structural analysis:

defmodule JsonRemedy.StructuralDiscovery do
  @moduledoc """
  Discovers repair patterns by analyzing the structural invariants
  that must hold for valid JSON, without requiring training data.
  """
  
  def discover_structural_repairs(malformed_input) do
    # Analyze what JSON structure this is trying to be
    intended_structure = infer_intended_structure(malformed_input)
    
    # Find the minimal changes needed to achieve that structure
    structural_repairs = compute_structural_alignment(
      malformed_input, 
      intended_structure
    )
    
    # Each repair is a hypothesis - test by seeing if result parses
    structural_repairs
    |> Enum.map(&apply_structural_repair(malformed_input, &1))
    |> Enum.find(fn result -> valid_json?(result) end)
  end
  
  defp infer_intended_structure(input) do
    # Use bracket/brace analysis to infer intended nesting
    bracket_analysis = analyze_bracket_patterns(input)
    
    # Use comma/colon patterns to infer key-value vs array structure
    delimiter_analysis = analyze_delimiter_patterns(input)
    
    # Use quote patterns to infer string boundaries
    quote_analysis = analyze_quote_patterns(input)
    
    # Combine analyses to hypothesize intended structure
    synthesize_structure_hypothesis(
      bracket_analysis, 
      delimiter_analysis, 
      quote_analysis
    )
  end
  
  defp analyze_bracket_patterns(input) do
    # This is where we get sophisticated about bracket matching
    # But using structural invariants, not heuristics
    
    chars = String.graphemes(input)
    
    # Find all bracket-like characters and their contexts
    bracket_contexts = chars
    |> Enum.with_index()
    |> Enum.filter(fn {char, _idx} -> char in ~w({ } [ ]) end)
    |> Enum.map(fn {char, idx} ->
      %{
        char: char,
        position: idx,
        local_context: extract_local_context(input, idx),
        nesting_level: calculate_nesting_at_position(chars, idx)
      }
    end)
    
    # Use bracket contexts to infer intended nesting structure
    infer_bracket_intentions(bracket_contexts)
  end
end

The Game-Changer: Zero-Shot Generalization

The most powerful aspect is zero-shot generalization - handling patterns we’ve never seen before:

defmodule JsonRemedy.ZeroShotRepair do
  @moduledoc """
  Handles novel malformed patterns by reasoning about JSON's
  compositional structure rather than memorized patterns.
  """
  
  def repair_novel_pattern(input) do
    # Step 1: Identify what makes this invalid JSON
    violations = detect_json_violations(input)
    
    # Step 2: For each violation, generate repair hypotheses
    repair_hypotheses = violations
    |> Enum.flat_map(&generate_repair_hypotheses/1)
    |> Enum.sort_by(& &1.simplicity_score)
    
    # Step 3: Test hypotheses by applying and validating
    Enum.find_value(repair_hypotheses, fn hypothesis ->
      case apply_repair_hypothesis(input, hypothesis) do
        {:ok, repaired} when valid_json?(repaired) ->
          {:ok, repaired, hypothesis}
        
        _ ->
          nil
      end
    end) || {:error, "No valid repair found"}
  end
  
  defp detect_json_violations(input) do
    [
      # Structural violations
      &detect_bracket_mismatches/1,
      &detect_quote_mismatches/1,
      &detect_comma_issues/1,
      
      # Lexical violations  
      &detect_invalid_literals/1,
      &detect_malformed_strings/1,
      &detect_invalid_numbers/1,
      
      # Contextual violations
      &detect_context_mismatches/1
    ]
    |> Enum.flat_map(fn detector -> detector.(input) end)
    |> Enum.sort_by(& &1.severity)
  end
  
  defp generate_repair_hypotheses(violation) do
    case violation.type do
      :missing_comma ->
        [
          %Hypothesis{action: :insert_comma, position: violation.position, simplicity: 1},
          %Hypothesis{action: :merge_values, position: violation.position, simplicity: 3}
        ]
      
      :unmatched_quote ->
        [
          %Hypothesis{action: :close_quote, position: violation.end_position, simplicity: 1},
          %Hypothesis{action: :escape_quote, position: violation.position, simplicity: 2},
          %Hypothesis{action: :remove_quote, position: violation.position, simplicity: 2}
        ]
      
      :invalid_literal ->
        [
          %Hypothesis{action: :normalize_literal, position: violation.position, 
                     replacement: violation.suggested_replacement, simplicity: 1}
        ]
    end
  end
end

Why This Approach is Superior

  1. No Training Data Required: The system learns from its own successful repairs and structural analysis.

  2. Mathematical Guarantees: Based on JSON’s formal grammar, not heuristics.

  3. Compositional: Simple patterns combine to handle complex cases naturally.

  4. Self-Improving: Gets better with use without manual intervention.

  5. Transparent: Every repair decision is explainable and debuggable.

  6. Efficient: Leverages Elixir’s pattern matching for O(1) common cases.

The Implementation Strategy

  1. Start with structural invariants - what must be true for valid JSON
  2. Build compositional rules that maintain these invariants
  3. Add real-time learning from successful repairs
  4. Use zero-shot reasoning for novel patterns

This approach sidesteps the need to reverse-engineer Python’s accumulated wisdom while potentially achieving superior results through principled structural reasoning.

The key insight is that JSON’s structure is simple enough that we can reason about it directly, rather than learning patterns from examples. We’re essentially building a JSON-aware theorem prover that finds the minimal proof (repair) that the input can be transformed into valid JSON.