← Back to Design

102

Documentation for 102 from the Json remedy repository.

The Complete Data-Driven Development Process

You’re absolutely right - I outlined the system architecture but skipped the crucial question: What data do you actually need and how do you get it? Let me map out the complete process from “zero data” to “working system.”

The Data Reality Check

Short answer: You DO need data, but not massive synthetic datasets. You need strategic, targeted data that you can generate systematically.

Phase 0: Bootstrap Data Generation (Week 0)

Step 1: Create Your “Ground Truth” Generator

defmodule JsonRemedy.DataGeneration do
  @moduledoc """
  Generate the minimal dataset needed to bootstrap the learning system.
  This is your 'seed knowledge' that everything else builds from.
  """
  
  def generate_bootstrap_dataset do
    # Start with valid JSON, then systematically break it
    valid_examples = generate_valid_json_corpus()
    
    # Apply systematic corruptions to create malformed examples
    malformed_examples = apply_systematic_corruptions(valid_examples)
    
    # Each example is: {malformed_input, correct_output, corruption_type}
    labeled_examples = create_labeled_pairs(malformed_examples, valid_examples)
    
    # Save as your initial training set
    save_bootstrap_dataset(labeled_examples)
  end
  
  defp generate_valid_json_corpus do
    [
      # Simple objects
      ~s|{"name": "Alice", "age": 30}|,
      ~s|{"active": true, "tags": ["dev", "senior"]}|,
      
      # Nested structures  
      ~s|{"user": {"name": "Bob", "settings": {"theme": "dark"}}}|,
      
      # Arrays
      ~s|[1, 2, 3, "hello", true, null]|,
      
      # Complex mixed
      ~s|{"users": [{"id": 1, "active": true}, {"id": 2, "active": false}]}|,
      
      # Edge cases
      ~s|{"": "empty key", "unicode": "café", "numbers": [1.5, -2, 0]}|
    ]
  end
  
  defp apply_systematic_corruptions(valid_examples) do
    corruption_types = [
      &remove_random_comma/1,
      &change_quotes_to_single/1,
      &add_python_literals/1,
      &remove_closing_quote/1,
      &add_trailing_comma/1,
      &remove_quote_from_key/1,
      &mismatch_brackets/1
    ]
    
    # Apply each corruption to each example
    for example <- valid_examples,
        corruption <- corruption_types do
      {corruption.(example), example, get_corruption_name(corruption)}
    end
  end
end

Step 2: Real-World Data Collection Strategy

defmodule JsonRemedy.DataCollection do
  @doc """
  The key insight: You don't need millions of examples.
  You need *diverse, representative* examples that cover the error space.
  """
  
  def collect_real_world_data do
    # Strategy 1: GitHub mining for malformed JSON in issues/PRs
    github_examples = mine_github_for_json_errors()
    
    # Strategy 2: Generate from API documentation examples
    api_examples = corrupt_api_documentation_examples()
    
    # Strategy 3: Elixir community - ask for malformed JSON examples
    community_examples = collect_community_examples()
    
    # Strategy 4: Convert Python json_repair test cases
    python_test_cases = convert_python_test_cases()
    
    combine_and_deduplicate([github_examples, api_examples, community_examples, python_test_cases])
  end
  
  defp mine_github_for_json_errors do
    # Search GitHub for issues containing malformed JSON
    # Look for: "invalid json", "json parse error", "malformed json"
    # Extract the malformed JSON from issue descriptions
  end
  
  defp convert_python_test_cases do
    # This is your goldmine! Python json_repair probably has test cases
    # Convert their test cases to your format
    # Each test case is: input -> expected_output
  end
end

Phase 1: Initial Data Requirements (Week 1)

What You Actually Need:

  1. ~100 carefully chosen examples covering major error types
  2. Systematic corruption patterns (not random noise)
  3. Ground truth pairs: {malformed, corrected}
  4. Error type labels: what kind of error each example represents

The Data Collection Process:

# Step 1: Mine Python json_repair test cases
git clone https://github.com/mangiucugna/json_repair
cd json_repair
grep -r "test_" tests/ | grep -v __pycache__ > test_cases.txt

# Step 2: Extract test cases programmatically
python3 -c "
import json_repair
import json

# Test cases from their test suite
test_cases = [
    ('{\"name\": \"value\"}', '{\"name\": \"value\"}'),  # Already valid
    ('{name: \"value\"}', '{\"name\": \"value\"}'),      # Unquoted key
    # ... extract more from their actual tests
]

for malformed, expected in test_cases:
    print(f'{malformed} -> {expected}')
" > python_test_cases.txt

Step 3: Create Your Initial Dataset

defmodule JsonRemedy.InitialDataset do
  def create_initial_dataset do
    %{
      # 20 examples of each major error type
      unquoted_keys: generate_unquoted_key_examples(20),
      single_quotes: generate_single_quote_examples(20), 
      python_literals: generate_python_literal_examples(20),
      missing_commas: generate_missing_comma_examples(20),
      trailing_commas: generate_trailing_comma_examples(20),
      
      # Real-world examples from Python library
      python_cases: load_python_test_cases(),
      
      # Edge cases that break most parsers
      edge_cases: generate_edge_cases(10)
    }
  end
  
  def generate_unquoted_key_examples(count) do
    base_templates = [
      ~s|{name: "value"}|,
      ~s|{user_id: 123, active: true}|,
      ~s|{nested: {inner_key: "value"}}|
    ]
    
    # Generate variations
    Enum.flat_map(base_templates, fn template ->
      generate_variations(template, count)
    end)
    |> Enum.take(count)
    |> Enum.map(fn malformed ->
      corrected = fix_unquoted_keys(malformed)
      {malformed, corrected, :unquoted_keys}
    end)
  end
end

Phase 2: The Learning Process (Weeks 2-3)

Step 1: Pattern Extraction from Your Data

defmodule JsonRemedy.PatternExtraction do
  @doc """
  This is where the magic happens: extracting generalizable patterns
  from your carefully curated examples.
  """
  
  def extract_patterns_from_dataset(dataset) do
    # Group examples by error type
    grouped_examples = Enum.group_by(dataset, &elem(&1, 2))
    
    # Extract patterns for each error type
    patterns = Enum.map(grouped_examples, fn {error_type, examples} ->
      {error_type, extract_patterns_for_error_type(examples)}
    end)
    
    # Validate patterns work across examples
    validated_patterns = validate_pattern_generalization(patterns, dataset)
    
    validated_patterns
  end
  
  defp extract_patterns_for_error_type(examples) do
    # For each pair of {malformed, corrected}, find the transformation
    transformations = Enum.map(examples, fn {malformed, corrected, _type} ->
      extract_transformation(malformed, corrected)
    end)
    
    # Find common patterns in the transformations
    common_patterns = find_common_transformation_patterns(transformations)
    
    # Convert to executable rules
    Enum.map(common_patterns, &convert_to_rule/1)
  end
  
  defp extract_transformation(malformed, corrected) do
    # This is your "edit distance with memory" algorithm
    edit_sequence = compute_detailed_edit_sequence(malformed, corrected)
    
    # Abstract the specific edits to pattern rules
    abstract_pattern = abstract_edit_sequence(edit_sequence)
    
    %{
      original_example: malformed,
      corrected_example: corrected,
      edit_sequence: edit_sequence,
      abstract_pattern: abstract_pattern
    }
  end
end

Step 2: The Pattern Validation Process

defmodule JsonRemedy.PatternValidation do
  @doc """
  Critical step: Test if your extracted patterns actually work
  on examples they weren't trained on.
  """
  
  def validate_patterns(patterns, test_examples) do
    # Hold out 20% of your data for testing
    {train_examples, test_examples} = split_dataset(test_examples, 0.8)
    
    # Test each pattern on held-out examples
    validation_results = Enum.map(patterns, fn pattern ->
      test_pattern_on_examples(pattern, test_examples)
    end)
    
    # Keep only patterns that generalize well
    good_patterns = Enum.filter(validation_results, fn result ->
      result.accuracy > 0.8  # 80% success rate threshold
    end)
    
    good_patterns
  end
  
  defp test_pattern_on_examples(pattern, test_examples) do
    # Apply pattern to each test example
    results = Enum.map(test_examples, fn {malformed, expected, _type} ->
      case apply_pattern(pattern, malformed) do
        {:ok, repaired} -> 
          {repaired == expected, repaired, expected}
        {:error, _} -> 
          {false, nil, expected}
      end
    end)
    
    # Calculate success metrics
    successes = Enum.count(results, &elem(&1, 0))
    total = length(results)
    
    %{
      pattern: pattern,
      accuracy: successes / total,
      successes: successes,
      total: total,
      failures: Enum.reject(results, &elem(&1, 0))
    }
  end
end

Phase 3: The Bootstrapping Process (Week 4)

Step 1: Start With High-Confidence Patterns

defmodule JsonRemedy.Bootstrap do
  @doc """
  Bootstrap the system with patterns you're confident about,
  then use success to build confidence in more complex patterns.
  """
  
  def bootstrap_system do
    # Start with the most reliable patterns (95%+ accuracy)
    reliable_patterns = get_high_confidence_patterns()
    
    # Initialize the system with these patterns
    :ok = PatternDatabase.initialize(reliable_patterns)
    
    # Test the system on your validation set
    validation_results = test_on_validation_set()
    
    # Use successful repairs to learn new patterns
    learn_from_successful_repairs(validation_results.successes)
    
    # Iteratively improve
    iterate_and_improve()
  end
  
  defp get_high_confidence_patterns do
    [
      # Pattern 1: Single quotes -> double quotes (99% reliable)
      %Pattern{
        name: :single_to_double_quotes,
        regex: ~r/'([^']*)'/,
        replacement: "\"\\1\"",
        confidence: 0.99,
        conditions: [&not_inside_string?/1]
      },
      
      # Pattern 2: Python True/False (95% reliable)
      %Pattern{
        name: :python_booleans,
        regex: ~r/\b(True|False)\b/,
        replacement: fn [bool] -> String.downcase(bool) end,
        confidence: 0.95,
        conditions: [&word_boundary?/1]
      }
    ]
  end
end

Step 2: The Iterative Improvement Loop

defmodule JsonRemedy.IterativeImprovement do
  @doc """
  The key process: Use each successful repair to improve the system.
  """
  
  def improvement_loop(max_iterations \\ 10) do
    improvement_loop(0, max_iterations, get_initial_metrics())
  end
  
  defp improvement_loop(iteration, max_iter, previous_metrics) 
       when iteration >= max_iter do
    {:max_iterations_reached, previous_metrics}
  end
  
  defp improvement_loop(iteration, max_iter, previous_metrics) do
    IO.puts("=== Improvement Iteration #{iteration + 1} ===")
    
    # Step 1: Run system on test cases
    test_results = run_comprehensive_test()
    
    # Step 2: Analyze failures to find new patterns
    new_patterns = analyze_failures_for_patterns(test_results.failures)
    
    # Step 3: Add promising new patterns
    :ok = add_patterns_if_promising(new_patterns)
    
    # Step 4: Measure improvement
    current_metrics = calculate_current_metrics()
    
    if metrics_improved?(current_metrics, previous_metrics) do
      IO.puts("✅ Improvement: #{current_metrics.accuracy - previous_metrics.accuracy}")
      improvement_loop(iteration + 1, max_iter, current_metrics)
    else
      IO.puts("⚠️  No improvement, stopping iteration")
      {:converged, current_metrics}
    end
  end
  
  defp analyze_failures_for_patterns(failures) do
    # This is where you discover new patterns from failures
    failure_groups = group_failures_by_similarity(failures)
    
    Enum.flat_map(failure_groups, fn group ->
      if length(group) >= 3 do  # Need at least 3 similar failures
        extract_pattern_from_failure_group(group)
      else
        []
      end
    end)
  end
end

Phase 4: Data Collection Strategy (Ongoing)

The Real-World Data Pipeline

defmodule JsonRemedy.DataPipeline do
  @doc """
  Once your system is working, collect real-world usage data
  to continuously improve it.
  """
  
  def setup_data_collection do
    # Strategy 1: Anonymous usage metrics (with user consent)
    setup_telemetry_collection()
    
    # Strategy 2: Community contributions
    setup_community_contribution_system()
    
    # Strategy 3: Automated testing against other JSON libraries
    setup_comparative_testing()
    
    # Strategy 4: Synthetic data generation based on learned patterns
    setup_synthetic_data_generation()
  end
  
  defp setup_telemetry_collection do
    # Collect (with permission):
    # - Input patterns that fail
    # - Successful repair patterns
    # - Performance metrics
    # - Error types encountered
  end
  
  defp setup_community_contribution_system do
    # Create a simple way for users to contribute examples:
    # mix json_remedy.contribute "malformed json" "corrected json"
  end
end

The Complete Process Timeline

Week 0: Data Foundation

  • Extract Python json_repair test cases (50 examples)
  • Generate systematic corruptions (50 examples)
  • Create edge cases (20 examples)
  • Validate all examples manually
  • Total: ~120 high-quality examples

Week 1: Pattern Extraction

  • Implement edit distance with abstraction
  • Extract patterns from each error type
  • Validate patterns on held-out data
  • Keep only patterns with >80% accuracy

Week 2: Initial System

  • Implement pattern matching engine
  • Create hypothesis testing framework
  • Build compositional repair system
  • Test on validation set

Week 3: Bootstrap and Iterate

  • Start with high-confidence patterns only
  • Use successful repairs to learn new patterns
  • Iterate improvement loop 5-10 times
  • Measure convergence

Week 4: Real-World Testing

  • Test on diverse real-world examples
  • Set up data collection pipeline
  • Community feedback system
  • Performance benchmarking

Key Success Metrics

  1. Pattern Quality: >80% accuracy on held-out data
  2. System Improvement: Measurable improvement each iteration
  3. Coverage: Handle 95%+ of common JSON errors
  4. Performance: 10x faster than Python for common cases
  5. Learning: System gets better with more data

The Bottom Line

You need strategic data, not big data:

  • ~120 carefully chosen examples to start
  • Systematic pattern extraction process
  • Iterative improvement based on failures
  • Real-world data collection pipeline

The key insight: Quality over quantity. 120 well-chosen examples that cover the error space systematically will teach you more than 10,000 random malformed JSON strings.