The Complete Data-Driven Development Process
You’re absolutely right - I outlined the system architecture but skipped the crucial question: What data do you actually need and how do you get it? Let me map out the complete process from “zero data” to “working system.”
The Data Reality Check
Short answer: You DO need data, but not massive synthetic datasets. You need strategic, targeted data that you can generate systematically.
Phase 0: Bootstrap Data Generation (Week 0)
Step 1: Create Your “Ground Truth” Generator
defmodule JsonRemedy.DataGeneration do
@moduledoc """
Generate the minimal dataset needed to bootstrap the learning system.
This is your 'seed knowledge' that everything else builds from.
"""
def generate_bootstrap_dataset do
# Start with valid JSON, then systematically break it
valid_examples = generate_valid_json_corpus()
# Apply systematic corruptions. Each result is already a labeled triple:
# {malformed_input, correct_output, corruption_type}
labeled_examples = apply_systematic_corruptions(valid_examples)
# Save as your initial training set
save_bootstrap_dataset(labeled_examples)
end
defp generate_valid_json_corpus do
[
# Simple objects
~s|{"name": "Alice", "age": 30}|,
~s|{"active": true, "tags": ["dev", "senior"]}|,
# Nested structures
~s|{"user": {"name": "Bob", "settings": {"theme": "dark"}}}|,
# Arrays
~s|[1, 2, 3, "hello", true, null]|,
# Complex mixed
~s|{"users": [{"id": 1, "active": true}, {"id": 2, "active": false}]}|,
# Edge cases
~s|{"": "empty key", "unicode": "café", "numbers": [1.5, -2, 0]}|
]
end
defp apply_systematic_corruptions(valid_examples) do
corruption_types = [
&remove_random_comma/1,
&change_quotes_to_single/1,
&add_python_literals/1,
&remove_closing_quote/1,
&add_trailing_comma/1,
&remove_quote_from_key/1,
&mismatch_brackets/1
]
# Apply each corruption to each example
for example <- valid_examples,
corruption <- corruption_types do
{corruption.(example), example, get_corruption_name(corruption)}
end
end
end
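The corruption helpers named in the list above are left undefined. Here is a minimal sketch of three of them, assuming simple regex and string rewrites are good enough for seed data. The module name `CorruptionSketch`, the function `corrupt_all/1`, and the exact behavior of each helper are illustrative assumptions, not part of any existing library.

```elixir
defmodule CorruptionSketch do
  @moduledoc "Illustrative corruption helpers; names and behavior are assumptions."

  # Insert a comma before the final closing brace or bracket: {"a": 1} -> {"a": 1,}
  def add_trailing_comma(json) do
    Regex.replace(~r/([}\]])\s*\z/, json, ",\\1")
  end

  # Swap every double quote for a single quote (naive: ignores escaped quotes).
  def change_quotes_to_single(json), do: String.replace(json, "\"", "'")

  # Replace JSON literals with their Python spellings
  # (naive: also rewrites these words inside string values).
  def add_python_literals(json) do
    json
    |> String.replace(~r/\btrue\b/, "True")
    |> String.replace(~r/\bfalse\b/, "False")
    |> String.replace(~r/\bnull\b/, "None")
  end

  # Pair each named corruption with each valid example and
  # skip corruptions that left the example unchanged.
  def corrupt_all(valid_examples) do
    corruptions = [
      trailing_comma: &add_trailing_comma/1,
      single_quotes: &change_quotes_to_single/1,
      python_literals: &add_python_literals/1
    ]

    for example <- valid_examples,
        {name, corrupt} <- corruptions,
        corrupted = corrupt.(example),
        corrupted != example do
      {corrupted, example, name}
    end
  end
end
```

Keeping the corruptions in a keyword list gives every triple an explicit label, and dropping no-op results avoids examples that are labeled as broken but are still valid JSON.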
Step 2: Real-World Data Collection Strategy
defmodule JsonRemedy.DataCollection do
@doc """
The key insight: You don't need millions of examples.
You need *diverse, representative* examples that cover the error space.
"""
def collect_real_world_data do
# Strategy 1: GitHub mining for malformed JSON in issues/PRs
github_examples = mine_github_for_json_errors()
# Strategy 2: Generate from API documentation examples
api_examples = corrupt_api_documentation_examples()
# Strategy 3: Elixir community - ask for malformed JSON examples
community_examples = collect_community_examples()
# Strategy 4: Convert Python json_repair test cases
python_test_cases = convert_python_test_cases()
combine_and_deduplicate([github_examples, api_examples, community_examples, python_test_cases])
end
defp mine_github_for_json_errors do
# Search GitHub for issues containing malformed JSON
# Look for: "invalid json", "json parse error", "malformed json"
# Extract the malformed JSON from issue descriptions
[] # placeholder: return an empty list until the mining step is implemented
end
defp convert_python_test_cases do
# This is your goldmine! Python json_repair probably has test cases
# Convert their test cases to your format
# Each test case is: input -> expected_output
[] # placeholder: return an empty list until the conversion is implemented
end
end
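`combine_and_deduplicate/1` is called above but never defined. One way to sketch it, assuming every example is a `{malformed, corrected, source}` tuple and that two examples count as duplicates when their malformed inputs differ only in whitespace. The module name `DedupSketch` and the normalization rule are assumptions.

```elixir
defmodule DedupSketch do
  # Each example is assumed to be a {malformed, corrected, source} tuple.
  def combine_and_deduplicate(example_lists) do
    example_lists
    |> List.flatten()
    # Tolerate sources that are not implemented yet and return nil.
    |> Enum.reject(&is_nil/1)
    # Keep the first example for each normalized malformed input.
    |> Enum.uniq_by(fn {malformed, _corrected, _source} -> normalize(malformed) end)
  end

  # Collapse whitespace runs so trivially different copies compare equal.
  # Only the comparison key is normalized; the stored example is untouched.
  defp normalize(text) do
    text |> String.trim() |> String.replace(~r/\s+/, " ")
  end
end
```

Usage: `DedupSketch.combine_and_deduplicate([github_examples, api_examples])` returns one flat list in first-seen order.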
Phase 1: Initial Data Requirements (Week 1)
What You Actually Need:
- ~100 carefully chosen examples covering major error types
- Systematic corruption patterns (not random noise)
- Ground truth pairs: {malformed, corrected}
- Error type labels: what kind of error each example represents
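A quick way to check that a dataset meets these requirements is to count examples per error type and flag the thin spots. The sketch below assumes examples are `{malformed, corrected, error_type}` triples; the module name `CoverageSketch` and the function `underrepresented/3` are hypothetical.

```elixir
defmodule CoverageSketch do
  # Returns a map of error_type => count for every required type
  # that has fewer than `min_count` examples.
  def underrepresented(examples, required_types, min_count) do
    counts = Enum.frequencies_by(examples, fn {_malformed, _corrected, type} -> type end)

    for type <- required_types,
        count = Map.get(counts, type, 0),
        count < min_count,
        into: %{} do
      {type, count}
    end
  end
end
```

An empty map means every required error type has at least `min_count` examples.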
The Data Collection Process:
# Step 1: Mine Python json_repair test cases
git clone https://github.com/mangiucugna/json_repair
cd json_repair
grep -r "test_" tests/ | grep -v __pycache__ > test_cases.txt
# Step 2: Extract test cases programmatically
python3 -c "
# Hand-copied sample pairs; replace with cases extracted from their actual tests
test_cases = [
    ('{\"name\": \"value\"}', '{\"name\": \"value\"}'),  # Already valid
    ('{name: \"value\"}', '{\"name\": \"value\"}'),      # Unquoted key
    # ... extract more from their actual tests
]
for malformed, expected in test_cases:
    print(f'{malformed} -> {expected}')
" > python_test_cases.txt
Step 3: Create Your Initial Dataset
defmodule JsonRemedy.InitialDataset do
def create_initial_dataset do
%{
# 20 examples of each major error type
unquoted_keys: generate_unquoted_key_examples(20),
single_quotes: generate_single_quote_examples(20),
python_literals: generate_python_literal_examples(20),
missing_commas: generate_missing_comma_examples(20),
trailing_commas: generate_trailing_comma_examples(20),
# Real-world examples from Python library
python_cases: load_python_test_cases(),
# Edge cases that break most parsers
edge_cases: generate_edge_cases(10)
}
end
def generate_unquoted_key_examples(count) do
base_templates = [
~s|{name: "value"}|,
~s|{user_id: 123, active: true}|,
~s|{nested: {inner_key: "value"}}|
]
# Generate variations
Enum.flat_map(base_templates, fn template ->
generate_variations(template, count)
end)
|> Enum.take(count)
|> Enum.map(fn malformed ->
corrected = fix_unquoted_keys(malformed)
{malformed, corrected, :unquoted_keys}
end)
end
end
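`fix_unquoted_keys/1` is used above to produce the corrected side of each pair. A minimal regex-based sketch follows, assuming keys are plain identifiers that directly follow `{` or `,`. The module name `UnquotedKeySketch` is hypothetical, and the regex is naive: it would also rewrite text such as `, word:` that appears inside a string value, so spot-check the generated pairs by hand.

```elixir
defmodule UnquotedKeySketch do
  # Wrap bare identifiers in double quotes when they follow `{` or `,`
  # and precede `:`. Groups: 1 = prefix, 2 = key, 3 = colon with padding.
  def fix_unquoted_keys(malformed) do
    Regex.replace(
      ~r/([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)/,
      malformed,
      "\\1\"\\2\"\\3"
    )
  end

  # Build the labeled triple used throughout the dataset.
  def labeled(malformed) do
    {malformed, fix_unquoted_keys(malformed), :unquoted_keys}
  end
end
```

Already-quoted keys are left alone because a quote character cannot start the identifier group.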
Phase 2: The Learning Process (Weeks 2-3)
Step 1: Pattern Extraction from Your Data
defmodule JsonRemedy.PatternExtraction do
@doc """
This is where the magic happens: extracting generalizable patterns
from your carefully curated examples.
"""
def extract_patterns_from_dataset(dataset) do
# Group examples by error type
grouped_examples = Enum.group_by(dataset, &elem(&1, 2))
# Extract patterns for each error type
patterns = Enum.map(grouped_examples, fn {error_type, examples} ->
{error_type, extract_patterns_for_error_type(examples)}
end)
# Validate patterns work across examples
validated_patterns = validate_pattern_generalization(patterns, dataset)
validated_patterns
end
defp extract_patterns_for_error_type(examples) do
# For each pair of {malformed, corrected}, find the transformation
transformations = Enum.map(examples, fn {malformed, corrected, _type} ->
extract_transformation(malformed, corrected)
end)
# Find common patterns in the transformations
common_patterns = find_common_transformation_patterns(transformations)
# Convert to executable rules
Enum.map(common_patterns, &convert_to_rule/1)
end
defp extract_transformation(malformed, corrected) do
# This is your "edit distance with memory" algorithm
edit_sequence = compute_detailed_edit_sequence(malformed, corrected)
# Abstract the specific edits to pattern rules
abstract_pattern = abstract_edit_sequence(edit_sequence)
%{
original_example: malformed,
corrected_example: corrected,
edit_sequence: edit_sequence,
abstract_pattern: abstract_pattern
}
end
end
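For the edit-sequence step, Elixir's standard library already ships a character-level diff, `String.myers_difference/2`, which may be enough to start with. The sketch below uses it in place of a custom "edit distance with memory" algorithm and then abstracts each edit to a coarse class. The module name `EditSketch` and the classes `:quote`, `:comma`, `:bracket`, and `:other` are assumptions chosen for illustration.

```elixir
defmodule EditSketch do
  # Character-level diff: a keyword list of :eq, :del, and :ins segments.
  def edit_sequence(malformed, corrected) do
    String.myers_difference(malformed, corrected)
  end

  # Drop unchanged segments and replace the concrete text with a coarse
  # class, so edits from different examples can be compared.
  def abstract_edits(edit_sequence) do
    for {op, text} <- edit_sequence, op != :eq do
      {op, classify(text)}
    end
  end

  defp classify(text) do
    cond do
      text =~ ~r/\A["']+\z/ -> :quote
      text =~ ~r/\A,+\z/ -> :comma
      text =~ ~r/\A[\[\]{}]+\z/ -> :bracket
      true -> :other
    end
  end
end
```

Two examples that produce the same abstract edit list are candidates for the same repair rule.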
Step 2: The Pattern Validation Process
defmodule JsonRemedy.PatternValidation do
@doc """
Critical step: Test if your extracted patterns actually work
on examples they weren't trained on.
"""
def validate_patterns(patterns, examples) do
# Hold out 20% of your data for testing. For a fair test, the patterns
# should have been extracted from the other 80% only.
{_train_examples, test_examples} = split_dataset(examples, 0.8)
# Test each pattern on held-out examples
validation_results = Enum.map(patterns, fn pattern ->
test_pattern_on_examples(pattern, test_examples)
end)
# Keep only patterns that generalize well
good_patterns = Enum.filter(validation_results, fn result ->
result.accuracy > 0.8 # 80% success rate threshold
end)
good_patterns
end
defp test_pattern_on_examples(pattern, test_examples) do
# Apply pattern to each test example
results = Enum.map(test_examples, fn {malformed, expected, _type} ->
case apply_pattern(pattern, malformed) do
{:ok, repaired} ->
{repaired == expected, repaired, expected}
{:error, _} ->
{false, nil, expected}
end
end)
# Calculate success metrics
successes = Enum.count(results, &elem(&1, 0))
total = length(results)
%{
pattern: pattern,
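# Guard against an empty test set, which would otherwise raise ArithmeticError
accuracy: if(total > 0, do: successes / total, else: 0.0),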
successes: successes,
total: total,
failures: Enum.reject(results, &elem(&1, 0))
}
end
end
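`split_dataset/2` is not defined above. A minimal sketch, assuming a seeded shuffle followed by a cut at the given ratio is acceptable. The module name `SplitSketch` and the `seed` parameter are assumptions. Note that it reseeds the calling process's `:rand` state.

```elixir
defmodule SplitSketch do
  # Shuffle with a fixed seed so the split is reproducible,
  # then cut the list at `ratio` (0.8 keeps 80% for training).
  def split_dataset(examples, ratio, seed \\ 42) do
    :rand.seed(:exsss, {seed, seed, seed})
    shuffled = Enum.shuffle(examples)
    Enum.split(shuffled, round(length(examples) * ratio))
  end
end
```

With only about 20 examples per error type, a plain random split can leave a type with no held-out examples, so splitting each error type separately may be worth the extra lines.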
Phase 3: The Bootstrapping Process (Week 4)
Step 1: Start With High-Confidence Patterns
defmodule JsonRemedy.Bootstrap do
@doc """
Bootstrap the system with patterns you're confident about,
then use success to build confidence in more complex patterns.
"""
def bootstrap_system do
# Start with the most reliable patterns (95%+ accuracy)
reliable_patterns = get_high_confidence_patterns()
# Initialize the system with these patterns
:ok = PatternDatabase.initialize(reliable_patterns)
# Test the system on your validation set
validation_results = test_on_validation_set()
# Use successful repairs to learn new patterns
learn_from_successful_repairs(validation_results.successes)
# Iteratively improve
iterate_and_improve()
end
defp get_high_confidence_patterns do
[
# Pattern 1: Single quotes -> double quotes (99% reliable)
%Pattern{
name: :single_to_double_quotes,
regex: ~r/'([^']*)'/,
replacement: "\"\\1\"",
confidence: 0.99,
conditions: [&not_inside_string?/1]
},
# Pattern 2: Python True/False (95% reliable)
%Pattern{
name: :python_booleans,
regex: ~r/\b(True|False)\b/,
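# Assumes the replacement is passed to Regex.replace/3, which calls a
# one-arity function with the whole match as a string, not a list
replacement: fn bool -> String.downcase(bool) end,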
confidence: 0.95,
conditions: [&word_boundary?/1]
}
]
end
end
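How a pattern gets applied is not shown above. Here is a sketch of one possible `apply_pattern/2` built on `Regex.replace/3`, which accepts either a replacement string or a function. The `PatternSketch` struct is a simplified stand-in for `%Pattern{}` and is an assumption; it leaves out the `conditions` field, so the quote pattern would also rewrite single quotes that appear inside double-quoted strings.

```elixir
defmodule PatternSketch do
  defstruct [:name, :regex, :replacement, confidence: 0.0]

  # Returns {:ok, repaired} when the pattern changed the input,
  # {:error, :no_match} when it left the input untouched.
  def apply_pattern(%__MODULE__{regex: regex, replacement: replacement}, input) do
    case Regex.replace(regex, input, replacement) do
      ^input -> {:error, :no_match}
      repaired -> {:ok, repaired}
    end
  end

  def single_to_double_quotes do
    %__MODULE__{
      name: :single_to_double_quotes,
      regex: ~r/'([^']*)'/,
      replacement: "\"\\1\"",
      confidence: 0.99
    }
  end

  def python_booleans do
    %__MODULE__{
      name: :python_booleans,
      regex: ~r/\b(True|False)\b/,
      # A two-arity function receives the whole match and the first capture.
      replacement: fn _match, bool -> String.downcase(bool) end,
      confidence: 0.95
    }
  end
end
```

Returning `{:error, :no_match}` for unchanged input lets the validation step tell "pattern did not apply" apart from "pattern applied and produced the wrong result".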
Step 2: The Iterative Improvement Loop
defmodule JsonRemedy.IterativeImprovement do
@doc """
The key process: Use each successful repair to improve the system.
"""
def improvement_loop(max_iterations \\ 10) do
improvement_loop(0, max_iterations, get_initial_metrics())
end
defp improvement_loop(iteration, max_iter, previous_metrics)
when iteration >= max_iter do
{:max_iterations_reached, previous_metrics}
end
defp improvement_loop(iteration, max_iter, previous_metrics) do
IO.puts("=== Improvement Iteration #{iteration + 1} ===")
# Step 1: Run system on test cases
test_results = run_comprehensive_test()
# Step 2: Analyze failures to find new patterns
new_patterns = analyze_failures_for_patterns(test_results.failures)
# Step 3: Add promising new patterns
:ok = add_patterns_if_promising(new_patterns)
# Step 4: Measure improvement
current_metrics = calculate_current_metrics()
if metrics_improved?(current_metrics, previous_metrics) do
IO.puts("✅ Improvement: #{current_metrics.accuracy - previous_metrics.accuracy}")
improvement_loop(iteration + 1, max_iter, current_metrics)
else
IO.puts("⚠️ No improvement, stopping iteration")
{:converged, current_metrics}
end
end
defp analyze_failures_for_patterns(failures) do
# This is where you discover new patterns from failures
failure_groups = group_failures_by_similarity(failures)
Enum.flat_map(failure_groups, fn group ->
if length(group) >= 3 do # Need at least 3 similar failures
extract_pattern_from_failure_group(group)
else
[]
end
end)
end
end
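`group_failures_by_similarity/1` carries most of the weight in that loop. One cheap way to approximate similarity is to reduce each failed input to a structural "shape" and group on that. The module name `FailureGroupingSketch`, the shape rules, and the default of 3 (taken from the threshold above) are assumptions; a real system might compare edit sequences instead.

```elixir
defmodule FailureGroupingSketch do
  # Reduce an input to its shape: runs of letters become `a`, runs of
  # digits become `0`, whitespace is dropped, punctuation is kept.
  def shape(input) do
    input
    |> String.replace(~r/[A-Za-z_]+/, "a")
    |> String.replace(~r/\d+/, "0")
    |> String.replace(~r/\s+/, "")
  end

  # Group failed inputs by shape and keep only the groups that have
  # at least `min_size` members.
  def group_failures(failed_inputs, min_size \\ 3) do
    failed_inputs
    |> Enum.group_by(&shape/1)
    |> Map.values()
    |> Enum.filter(&(length(&1) >= min_size))
  end
end
```

Inputs such as `{name: 'x'}` and `{id: 'abc'}` share the shape `{a:'a'}`, so they land in the same group even though their keys and values differ.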
Phase 4: Data Collection Strategy (Ongoing)
The Real-World Data Pipeline
defmodule JsonRemedy.DataPipeline do
@doc """
Once your system is working, collect real-world usage data
to continuously improve it.
"""
def setup_data_collection do
# Strategy 1: Anonymous usage metrics (with user consent)
setup_telemetry_collection()
# Strategy 2: Community contributions
setup_community_contribution_system()
# Strategy 3: Automated testing against other JSON libraries
setup_comparative_testing()
# Strategy 4: Synthetic data generation based on learned patterns
setup_synthetic_data_generation()
end
defp setup_telemetry_collection do
# Collect (with permission):
# - Input patterns that fail
# - Successful repair patterns
# - Performance metrics
# - Error types encountered
end
defp setup_community_contribution_system do
# Create a simple way for users to contribute examples:
# mix json_remedy.contribute "malformed json" "corrected json"
end
end
The Complete Process Timeline
Week 0: Data Foundation
- Extract Python json_repair test cases (50 examples)
- Generate systematic corruptions (50 examples)
- Create edge cases (20 examples)
- Validate all examples manually
- Total: ~120 high-quality examples
Week 1: Pattern Extraction
- Implement edit distance with abstraction
- Extract patterns from each error type
- Validate patterns on held-out data
- Keep only patterns with >80% accuracy
Week 2: Initial System
- Implement pattern matching engine
- Create hypothesis testing framework
- Build compositional repair system
- Test on validation set
Week 3: Bootstrap and Iterate
- Start with high-confidence patterns only
- Use successful repairs to learn new patterns
- Iterate improvement loop 5-10 times
- Measure convergence
Week 4: Real-World Testing
- Test on diverse real-world examples
- Set up data collection pipeline
- Community feedback system
- Performance benchmarking
Key Success Metrics
- Pattern Quality: >80% accuracy on held-out data
- System Improvement: Measurable improvement each iteration
- Coverage: Handle 95%+ of common JSON errors
- Performance: target of 10x faster than Python json_repair for common cases, to be confirmed by benchmarking
- Learning: System gets better with more data
The Bottom Line
You need strategic data, not big data:
- ~120 carefully chosen examples to start
- Systematic pattern extraction process
- Iterative improvement based on failures
- Real-world data collection pipeline
The key insight: Quality over quantity. 120 well-chosen examples that cover the error space systematically will teach you more than 10,000 random malformed JSON strings.