Of course! Based on the provided codebase, here is a series of diagrams illustrating the core workings of BoundaryML’s internal-baml-core
library.
The process can be broken down into four main stages:
High-Level Compilation Pipeline: The overall flow from source files to a usable artifact.
Schema Validation: A d %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a
class A setupPhase class B,C,D,E,F concurrentEngine class Diagnostics1 taskNode class G,H processNode class I resultNode class InputGen,ValidationLoading,OutputConf,DownstreamImpact subgraphTitleTop
%% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2pxBAML code is checked for correctness.
AST to IR Transformation: How the parsed code is converted into a more structured Intermediate Representation (IR).
IR Hashing & Signatures: An advanced mechanism for change detection and caching.
Diagram 1: The High-Level BAML Compilation Pipeline
This diagram shows the main journey of your BAML code from raw text files to a validated, structured Intermediate Representation (IR). The internal-baml-core
library orchestrates this entire process.
Explanation:
- Input: The process starts with one or more
.baml
source files. - Parsing: An external crate,
internal-baml-parser-database
, consumes these files. It parses the text into an Abstract Syntax Tree (AST), which is a raw, tree-like representation of the code. This AST also includes information about file locations (spans) for accurate error reporting. - Validation & IR Generation: This is the heart of
internal-baml-core
. The mainvalidate()
function (fromsrc/lib.rs
) takes the AST and orchestrates a series of complex steps to:- Check the code for logical errors, inconsistencies, and adherence to BAML’s rules.
- Transform the raw AST into a structured, easier-to-use Intermediate Representation (IR).
- Output: The pipeline produces two key outputs:
- Validated IR: A rich, in-memory representation of the entire BAML project, which code generators (e.g., for Python or TypeScript) can use to produce client code.
- Diagnostics: A collection of any errors or warnings found during the process, complete with locations pointing back to the original source files.
Diagram 2: Deep Dive into the Validation Pipeline
Validation is a critical, multi-step process. It ensures that your BAML schema is not only syntactically correct but also semantically valid.
Explanation:
- Load Generators: The pipeline first parses any
generator
blocks in your schema to understand what code you intend to generate (e.g.,python
,typescript
). This context is crucial for later steps, like checking for reserved keywords. - Run Validation Checks:
internal-baml-core
runs a battery of checks, as defined insrc/validate/validation_pipeline/validations/
. These include:- Types: Are all referenced types (
MyClass
,MyEnum
) actually defined? Are types used correctly (e.g., you can’t use a class as a map key)? - Cycles: Are there any infinite dependency cycles between your classes (e.g.,
class A { b: B }
,class B { a: A }
)? BAML allows some cycles through lists and maps but flags direct, non-terminating ones. - Functions, Classes, Enums: Each major entity has its own specific rules. For example, ensuring function clients are defined, or class fields don’t conflict with language keywords.
- Expression Type-Checking: For BAML
expr
, the system validates and infers the types to ensure operations are valid (e.g., you can’t add a string to an integer).
- Types: Are all referenced types (
- Validate
test
Blocks: A special feature of BAML is thattest
blocks can define their own local types. The validator creates a temporary, “scoped” database for each test case, combining global types with the test’s local types, and runs the entire validation pipeline again on this temporary scope. - Finalize & Link: After validation, a final linking step resolves all references, ensuring the entire schema is consistent.
Diagram 3: AST to Intermediate Representation (IR) Transformation
Once the AST is validated, it’s transformed into the more structured and powerful Intermediate Representation (IR). This process happens in src/ir/repr.rs
within the from_parser_database
function.
Explanation:
The transformation is a systematic conversion:
- The
from_parser_database
function receives the validated AST. - It uses “walkers” (
db.walk_classes()
,db.walk_enums()
, etc.) to iterate over every top-level definition in the AST. - For each AST node (like a class or function), it calls a
.repr(db)
method. This method is responsible for:- Extracting all relevant data (name, fields, arguments, return types, docstrings, attributes like
@alias
). - Resolving any type references.
- Creating a corresponding IR struct (e.g.,
ir::Class
,ir::Function
).
- Extracting all relevant data (name, fields, arguments, return types, docstrings, attributes like
- Each new IR object is wrapped in a
Node<T>
which attaches metadata like source spans and attributes. - All these
Node
objects are collected into vectors (Vec<ir::Class>
,Vec<ir::Enum>
, etc.) inside the finalIntermediateRepr
struct. This struct is the definitive, validated model of your BAML project.
Diagram 4: The IR Hashing & Signature System
BAML Core includes a sophisticated hashing system (src/ir/ir_hasher/
) to create a unique, deterministic “signature” for every element in the IR. This is primarily used for change detection and caching. If a function’s signature hasn’t changed, a tool using the IR knows it doesn’t need to be re-processed.
create a hash of its own properties
e.g. class name field names] Step2[2 Dependency Analysis] Step2_Desc[Each IR item declares its dependencies
e.g. A Function depends on its argument Types] Step3[3 Recursive Signature Generation] Step3_Desc[Final Hash =
Hash ShallowHash +
Hash Dep1_Name + Dep1_FinalHash +
Hash Dep2_Name + Dep2_FinalHash + ...] end subgraph OutputSig["Output"] Signatures[IRSignature] SigDesc[Contains a unique interface_hash and implementation_hash
for every single element in the IR] end IR --> StartHash StartHash --> Step1 Step1 --> Step2 Step2 --> Step3 Step1 -.-> Step1_Desc Step2 -.-> Step2_Desc Step3 -.-> Step3_Desc Step3 --> Signatures Signatures -.-> SigDesc %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a class IR setupPhase class StartHash,Step1,Step2,Step3 concurrentEngine class Step1_Desc,Step2_Desc,Step3_Desc taskNode class Signatures,SigDesc resultNode class InputHash,HashingProcess,OutputSig subgraphTitleTop %% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2px
Explanation:
The signature of an element changes if either its own code changes or if any of its dependencies change.
- Shallow Hash: First, the system computes a “shallow” hash for every item. This hash only considers the item’s own definition. For a class, this would be its name and the names and types of its fields.
- Dependency Analysis: The system builds a graph of dependencies. A function
MyFn(input: MyClass) -> MyEnum
depends onMyClass
andMyEnum
. - Recursive Signature Generation: To compute the final, deep signature for
MyFn
, the system does the following:- It starts a hasher.
- It adds
MyFn
’s shallow hash. - It then recursively looks at its dependencies,
MyClass
andMyEnum
. - It adds the final signature of
MyClass
to the hasher. - It adds the final signature of
MyEnum
to the hasher. - The result is a hash that is stable as long as
MyFn
,MyClass
, andMyEnum
(and any of their dependencies) do not change.
This creates an extremely reliable way to detect exactly what has changed between different versions of the BAML source.
Excellent. Let’s continue by exploring the next set of high-level functionalities and then drill down into the specifics of how they are implemented.
Diagram 5: Configuration Loading and Processing
Before validation can be fully effective, the system needs to understand the project’s configuration, which is primarily defined in generator
blocks. This process is handled by the generator_loader
.
output_type python/pydantic
output_dir ../gen
version 1.2.3
on_generate my-script.sh] end subgraph ValidationLoading["Validation & Loading src/validate/generator_loader/v2.rs"] B[1 Parse Generator Block] C{2 Check Property Allowlist} D[3 Parse Required Keys e.g. output_type] E[4 Parse Optional Keys e.g. output_dir version] F[5 Construct Generator Struct] end subgraph OutputConf["Output"] G[configuration CodegenGenerator] H[Overall configuration Configuration] end subgraph DownstreamImpact["Downstream Impact"] I[Influences Further Validation
e.g. Reserved Name Checks] end A --> B B --> C C -->|Valid Properties| D C -->|Unknown Property| Diagnostics1[Diagnostics Error] D --> E E --> F F --> G G --> H H --> I %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a class A setupPhase class B,C,D,E,F concurrentEngine class Diagnostics1 taskNode class G,H processNode class I resultNode class InputGen,ValidationLoading,OutputConf,DownstreamImpact subgraphTitleTop %% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2px
Explanation:
- Parsing: The
generator_loader
(src/validate/generator_loader/v2.rs
) is invoked early in the validation pipeline. It finds allgenerator
blocks in the AST. - Allowlist Check: It first validates that only known properties (like
output_type
,output_dir
,version
,default_client_mode
, etc.) are used. An unknown property will immediately generate an error. - Key Parsing: It then systematically parses each property. Some are mandatory (
output_type
), and their absence is an error. Others are optional. The parser validates the value’s type (e.g., ensuringversion
is a string). - Struct Construction: The parsed and validated properties are used to build a strongly-typed
CodegenGenerator
orCloudProject
struct (defined insrc/configuration.rs
). - Downstream Impact: This final
Configuration
object is passed to the rest of the validation pipeline. It’s used, for example, to determine which set of reserved language keywords to check against when validating class and enum field names.
Diagram 6: Interacting with the IR (The ir_helpers
Module)
Once the IntermediateRepr
is built, it’s not just a static data structure. The ir_helpers
module provides a rich API to query, manipulate, and reason about the types and values within the IR. This is the primary interface for any tool that consumes the IR.
trait IRHelperExtended] subgraph KeyHelperFunctions["Key Helper Functions"] Find[find_class / find_function etc
Name-based lookup] IsSubtype[is_subtype base other
Complex type compatibility check] DistributeType[distribute_type value type
Annotates BamlValue with types from FieldType] CoerceArg[coerce_arg value type
Attempts to fit value into given type] DistributeConstraints[distribute_constraints type
Collects all @assert/@check from type and definition] end end subgraph OutputHelp["Output"] Walker[Walker to an IR Node] BoolResult[Boolean true/false] TypedValue[BamlValueWithMeta
Value annotated with types] CoercedValue[Result BamlValue Error] Constraints[Vec Constraint] end IR -.-> HelperTraits UserInput -.-> Find & IsSubtype & DistributeType & CoerceArg & DistributeConstraints HelperTraits --> Find & IsSubtype & DistributeType & CoerceArg & DistributeConstraints Find --> Walker IsSubtype --> BoolResult DistributeType --> TypedValue CoerceArg --> CoercedValue DistributeConstraints --> Constraints %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a class IR,UserInput setupPhase class HelperTraits concurrentEngine class Find,IsSubtype,DistributeType,CoerceArg,DistributeConstraints taskNode class Walker,BoolResult,TypedValue,CoercedValue,Constraints resultNode class InputHelp,CoreAPI,KeyHelperFunctions,OutputHelp subgraphTitleTop %% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2px
Explanation:
The IRHelper
and IRHelperExtended
traits, implemented for IntermediateRepr
, provide essential functionality:
- Lookup (
find_*
): Simple, efficient methods to find a specific class, enum, or function by its name string. - Type Compatibility (
is_subtype
): A powerful recursive function that determines if a value of one type can be safely used where another type is expected. It correctly handles unions, optionals, lists, and maps. This is fundamental to type checking. - Value Annotation (
distribute_type
): This is a cornerstone function. It takes a rawBamlValue
(which has no type information) and aFieldType
and walks both structures in parallel. It produces aBamlValueWithMeta
, where every single node in the value tree (even nested ones) is annotated with its corresponding type. This is invaluable for detailed validation and transformation. - Argument Coercion (
coerce_arg
): When processing function arguments, this helper attempts to convert user-provided values into the schema-defined types. It handles cases like converting a string to a matching enum value, or parsing a dictionary into a class instance. - Constraint Aggregation (
distribute_constraints
): A type’s constraints (@assert
or@check
) can be defined in multiple places (on the type itself, on the class/enum definition). This helper recursively traverses the type definition to gather all applicable constraints into a single list.
Drilling Down: Specific Subsystems
Now we’ll move into more detailed diagrams of specific, complex internal mechanisms.
Diagram 7: The BAML Expression Subsystem (expr_fn
)
BAML’s expr_fn
allows for more complex, code-like logic. This requires a multi-pass compilation approach within internal-baml-core
.
let x = val * val
x] end subgraph Pass1["Pass 1: AST to IR Expression src/ir/repr.rs"] B[convert_function_body] C[ir repr Expr Enum
e.g. Expr Let Expr App Expr Atom] end subgraph Pass2["Pass 2: Type Inference src/validate/validation_pipeline/validations/expr_typecheck.rs"] D[infer_types_in_context] E[Walks Expr tree propagating types from
known variables like val int
and function signatures] F[Outputs a new Expr tree where
each node metadata contains its inferred type] end subgraph Pass3["Pass 3: Type Checking expr_typecheck.rs"] G[typecheck_in_context] H[Verifies the annotated Expr tree
Checks if val * val is valid operation for int
Checks if final return type matches -> int] end subgraph OutputExpr["Output"] I[Fully Typed & Validated Expression IR] J[Diagnostics if type mismatch] end A --> B B --> C C --> D D --> E E --> F F --> G G --> H H --> I H --> J %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a class A setupPhase class B,D,G concurrentEngine class C,E,F,H taskNode class I,J resultNode class InputExpr,Pass1,Pass2,Pass3,OutputExpr subgraphTitleTop %% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2px
Explanation:
- AST to IR Expression: During the main IR transformation, the raw expression from the AST is converted into a structured
ir::repr::Expr
enum. This turns the code into a tree of operations (Let
,App
for function application, etc.). At this stage, most type information is missing. - Type Inference: The
infer_types_in_context
function performs a crucial pass. It traverses theExpr
tree, using aHashMap
context that maps variable names to their types. It starts with the function’s parameters (val: int
). As it descends the tree, it infers the type of each sub-expression (e.g., it knowsval * val
results in anint
) and annotates theExpr
node’s metadata with this new type information. - Type Checking: Finally,
typecheck_in_context
performs a verification pass on the type-annotated tree. It confirms that all operations are valid for the given types. For example, in a function callMyFn(x)
, it usesis_subtype
to ensure the type ofx
is compatible withMyFn
’s expected parameter type. It also confirms that the function’s final expression type matches its declared return type. Any mismatch results in a diagnostic error.
Diagram 8: IR to JSON Schema Generation
This is a great example of a “downstream” application of the validated IR. The src/ir/json_schema.rs
module transforms the BAML IR into a standard JSON Schema, which is useful for validation, documentation, and tooling.
A complete JSON Schema document] end IR --> Trait Trait --> ImplRoot & ImplTypes ImplRoot -->|Orchestrates by calling json_schema on all top-level items| ImplTypes ImplTypes -->|Recursively transforms types| TransformationLogic TransformationLogic --> JSONSchema %% BAML-inspired styling classDef setupPhase fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#fff classDef concurrentEngine fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff classDef taskNode fill:#22d3ee,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef processNode fill:#67e8f9,stroke:#0891b2,stroke-width:2px,color:#0f172a classDef resultNode fill:#a7f3d0,stroke:#059669,stroke-width:1px,color:#0f172a classDef aggregationPhase fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151 classDef subgraphTitleTop fill:#e0f2fe,stroke:#0891b2,stroke-width:2px,color:#0f172a class IR setupPhase class Trait,ImplRoot,ImplTypes concurrentEngine class Class,List,Optional,Enum taskNode class Ref,Array,Nullable,EnumRef processNode class JSONSchema resultNode class InputJSON,CoreLogic,TransformationLogic,OutputJSON subgraphTitleTop %% BAML brand styling for arrows linkStyle default stroke:#0e7490,stroke-width:2px
Explanation:
This feature is elegantly implemented using a trait.
- The Trait: A trait
WithJsonSchema
is defined with a single method,json_schema() -> serde_json::Value
. - Implementation: This trait is implemented for the
IntermediateRepr
and, crucially, for every variant ofFieldType
and other IR nodes. - Orchestration: The implementation for
IntermediateRepr
is the entry point. It creates the top-level JSON Schema structure (with$defs
for definitions) and then iterates through all its classes, enums, and functions, calling.json_schema()
on each to populate the$defs
. - Recursive Transformation: The core logic resides in
impl WithJsonSchema for FieldType
. This implementation is recursive:- A BAML
class
orenum
becomes a JSON Schema$ref
to its definition in$defs
. - A BAML
list
becomes a JSON Schema object withtype: "array"
. Theitems
property is generated by recursively callingjson_schema()
on the list’s inner type. - An
optional
BAML type is handled by adding"null"
to thetype
array in the resulting schema. - Primitives like
int
,string
,bool
map directly tointeger
,string
,boolean
.
- A BAML
- Output: The final result is a single, self-contained
serde_json::Value
that represents a valid JSON Schema for the entire BAML project.