WWDC26 · 21 min · AI & Machine Learning

Create robust evaluations for agentic apps

Learn how to leverage advanced features of the Evaluations framework to build robust evaluations for your app. Explore evaluating flows with tool calling and dynamic conditions, and how to define what correct behavior means for your use case. Discover how to generate synthetic data, use judges effectively, and validate your datasets for reliable results.

Watch at developer.apple.com ↗

Chapters

0:00 — Introduction
2:21 — The dataset problem in BookTracker
3:46 — Generating synthetic data with makeSamples
6:27 — Customizing generation with SampleGenerator
8:38 — Sampling strategies
10:11 — Validating synthetic samples
13:04 — Comparing evaluation results
15:09 — Tool calling and tool evaluations
18:54 — Trajectory expectations
21:26 — Building a tool call evaluation
22:02 — Synthetic data for tool evaluations
23:49 — Next steps

Code shown on screen · 12 snippets

Generate synthetic data with makeSamples swift · at 5:16 ↗

// Synthetic data
  let prompt = Prompt("""
      Generate a diverse range of book reviews and corresponding tags.
      Cover a wide range of genres, time periods, cultures, and
      reader personas. Do not repeat books already in the dataset.
      """)
  
  let dataset = Book.sampleBooks.map { book in
      ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
  }
  
  let targetCount = 100
  var expandedDataset = dataset

  for try await sample in dataset.makeSamples(prompt, targetCount: targetCount) {
      expandedDataset.append(sample)
      // Subtract the seed samples so the count reflects only generated ones.
      print("Generated \(expandedDataset.count - dataset.count) samples so far.")
  }

Configure a custom SampleGenerator swift · at 5:53 ↗

// Define your own configuration
  let generator = SampleGenerator<ModelSample<BookTags>>(
      prompt,
      samples: dataset,
      targetCount: targetCount,
      sessionProvider: {
          LanguageModelSession( 
              model: PrivateCloudComputeLanguageModel(),
              instructions: """
                  You are a synthetic data generator for a book-tracking app's evaluation suite.
                  Your job is to produce realistic, diverse book entries that will stress-test
                  a tagging system.

                  Rules:
                  - Review must be at least 100 characters long.
                  - Review should cover a mix of genre, mood/tone, and themes.
                  - Reviews should vary in length.
                  - Create between 3 and 8 tags.
                  - Tags must be lowercase.
                  """ 
          )
      }
  )

Validate generated samples swift · at 10:37 ↗

// Define validation metrics
  validator: { sample in
      guard let book = sample.expected else { return false }

      // Review must be at least 100 characters
      guard sample.promptDescription.count >= 100 else { return false }

      // Must have between 3 and 8 tags
      guard (3...8).contains(book.tags.count) else { return false }

      // All tags must be lowercase
      guard book.tags.allSatisfy({ $0 == $0.lowercased() }) else { return false }

      return true
  }
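
The three rules in the validator above can be exercised on their own, outside the generator. The following is a minimal sketch under stated assumptions: `ReviewSample` and `isValid` are hypothetical stand-ins written for illustration, not types or functions from the Evaluations framework.

```swift
// Hypothetical stand-in for a generated sample; not a framework type.
struct ReviewSample {
    var review: String
    var tags: [String]
}

/// Returns true only when every rule from the generator instructions holds.
func isValid(_ sample: ReviewSample) -> Bool {
    // Review must be at least 100 characters.
    guard sample.review.count >= 100 else { return false }
    // Must have between 3 and 8 tags.
    guard (3...8).contains(sample.tags.count) else { return false }
    // All tags must be lowercase.
    return sample.tags.allSatisfy { $0 == $0.lowercased() }
}

// 21 characters repeated 6 times gives a 126-character review.
let longReview = String(repeating: "A slow, eerie novel. ", count: 6)

let good = ReviewSample(review: longReview, tags: ["gothic", "slow-burn", "eerie"])
let tooShort = ReviewSample(review: "Loved it.", tags: ["gothic", "slow-burn", "eerie"])
let mixedCase = ReviewSample(review: longReview, tags: ["Gothic", "slow-burn", "eerie"])

print(isValid(good), isValid(tooShort), isValid(mixedCase))
```

Keeping the rules in a plain function like this makes it possible to unit-test them before wiring them into a generator's validator closure.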

Access valid and invalid results swift · at 10:58 ↗

// Accessing results
  for try await sample in generator.run() {
      // During iteration
      expandedDataset.append(sample)
  }

  // After iteration
  let allSamples = await generator.samples
  let invalidSamples = await generator.invalidSamples
  
  print("Generated \(allSamples.count) new samples. Total: \(expandedDataset.count)")

Define a tool's Generable argument swift · at 15:30 ↗

@Generable
  struct SearchBooksArguments {
      @Guide(description: "A freeform search term to match against titles, reviews, or tags")
      var query: String?
  
      @Guide(description: "Filter results to books with this specific tag")
      var tag: String?

      @Guide(description: "Filter results by mood")
      var mood: String?

      @Guide(description: "Filter results by genre")
      var genre: String?

      @Guide(description: "Maximum number of results to return. Defaults to 5.")
      var limit: Int? 
  }

A basic trajectory expectation swift · at 16:37 ↗

// "Find books tagged gothic"
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .exact(argumentName: "tag", value: .string("gothic"))
              ]
          )
      ]
  )

Match arguments by intent (naturalLanguage) swift · at 17:07 ↗

// "Find something cheerful"
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .naturalLanguage(
                      argumentName: "mood",
                      criteria: "Should relate to uplifting, hopeful, or positive feelings"
                  )
              ]
          )
      ]
  )
  // Other matchers available: .contains, .oneOf, .pattern, .range, and more.
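
To make the difference between the deterministic matchers concrete, here is a rough sketch in plain Swift. Everything in it is an assumption for illustration: `ArgumentCheck` is a hypothetical enum whose case names only echo the matchers mentioned on this page, arguments are modeled as a simple string dictionary, and `.naturalLanguage` is left out because it would need a model acting as judge.

```swift
import Foundation

// Hypothetical matcher model; the framework's real types and signatures
// are not shown on this page and may differ.
enum ArgumentCheck {
    case exact(name: String, value: String)
    case keyOnly(name: String)
    case contains(name: String, substring: String)
    case range(name: String, bounds: ClosedRange<Int>)

    /// `arguments` maps argument names to string-encoded values.
    func matches(_ arguments: [String: String]) -> Bool {
        switch self {
        case let .exact(name, value):
            // The argument must be present with exactly this value.
            return arguments[name] == value
        case let .keyOnly(name):
            // Any value is acceptable as long as the argument was supplied.
            return arguments[name] != nil
        case let .contains(name, substring):
            return arguments[name]?.contains(substring) ?? false
        case let .range(name, bounds):
            // The value must parse as an integer inside the bounds.
            guard let raw = arguments[name], let number = Int(raw) else { return false }
            return bounds.contains(number)
        }
    }
}

let call = ["tag": "gothic", "limit": "5"]

print(ArgumentCheck.exact(name: "tag", value: "gothic").matches(call))
print(ArgumentCheck.keyOnly(name: "bookId").matches(call))
print(ArgumentCheck.range(name: "limit", bounds: 1...10).matches(call))
```

In this sketch, exact and range checks fail when the argument is missing, which is one reasonable reading of how a strict matcher would behave.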

Expect tool calls in order swift · at 17:34 ↗

// "Find gothic books and show details on the first"
  TrajectoryExpectation(
      ordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .exact(argumentName: "tag", value: .string("gothic"))
              ]
          ),
          ToolExpectation(
              "getBookDetails",
              arguments: [
                  .keyOnly(argumentName: "bookId")
              ]
          )
      ]
  )

Disallow specific tool calls swift · at 17:55 ↗

// "Show only sci-fi books. Don't look for similar ones."
  TrajectoryExpectation(
      unordered: [
          ToolExpectation(
              "searchBooks",
              arguments: [
                  .naturalLanguage(
                      argumentName: "genre",
                      criteria: "Should refer to science fiction")
              ]
          )
      ],
      disallowed: [
          ToolExpectation("findSimilarBooks")
      ]
  )
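
The ordered, unordered, and disallowed lists in the snippets above suggest three different checks against the tools a model actually called. The sketch below shows one plausible interpretation in plain Swift. `TrajectoryCheck` is a hypothetical type, tool calls are reduced to their names, and the matching rules (ordered means "appears as a subsequence", unordered means "appears at least once") are assumptions rather than documented framework behavior.

```swift
// Hypothetical model of a trajectory check; the framework's real rules may differ.
struct TrajectoryCheck {
    var ordered: [String] = []
    var unordered: [String] = []
    var disallowed: [String] = []

    /// `calls` is the sequence of tool names the model invoked, in order.
    func passes(_ calls: [String]) -> Bool {
        // Disallowed tools must never appear.
        if calls.contains(where: { disallowed.contains($0) }) { return false }
        // Unordered tools must each appear at least once, in any position.
        if !unordered.allSatisfy({ calls.contains($0) }) { return false }
        // Ordered tools must appear in the same relative order; gaps are allowed.
        var remaining = ordered[...]
        for call in calls {
            if let next = remaining.first, next == call {
                remaining = remaining.dropFirst()
            }
        }
        return remaining.isEmpty
    }
}

let searchThenDetails = TrajectoryCheck(ordered: ["searchBooks", "getBookDetails"])
let sciFiOnly = TrajectoryCheck(unordered: ["searchBooks"], disallowed: ["findSimilarBooks"])

print(searchThenDetails.passes(["searchBooks", "getBookDetails"]))
print(searchThenDetails.passes(["getBookDetails", "searchBooks"]))
print(sciFiOnly.passes(["searchBooks", "findSimilarBooks"]))
```

Checking the disallowed list first means a forbidden call fails the sample even when every expected tool was also called.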

Build a tool call evaluation swift · at 18:14 ↗

// Tool call evaluations
  let samples = SampleArrayLoader(samples: [
      ModelSample(
          prompt: "Find all the books tagged with 'gothic'.",
          instructions: "Help the user explore their book collection.",
          expectations: TrajectoryExpectation(  )
      )
  ])

  struct BookLibraryToolCallEval: Evaluation {
      var dataset = samples

      let pass = Metric("All Passed")
      let percent = Metric("Percentage Passed")

      var evaluators: Evaluators { 
          ToolCallEvaluator(allPass: pass, percentagePass: percent)
      }
  }
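
The metric names "All Passed" and "Percentage Passed" suggest two aggregates over per-sample pass/fail results. The following sketch shows how such aggregates could be computed in plain Swift. The functions `allPassed` and `percentagePassed` are hypothetical, and how ToolCallEvaluator actually derives its metrics is not shown on this page.

```swift
// Hypothetical aggregation over per-sample pass/fail results.

/// True when every sample passed. Note that an empty list also yields true.
func allPassed(_ results: [Bool]) -> Bool {
    results.allSatisfy { $0 }
}

/// Share of passing samples as a percentage from 0 to 100.
func percentagePassed(_ results: [Bool]) -> Double {
    // Avoid dividing by zero when there are no results.
    guard !results.isEmpty else { return 0 }
    let passed = results.filter { $0 }.count
    return Double(passed) / Double(results.count) * 100
}

let results = [true, true, false, true]

print(allPassed(results))
print(percentagePassed(results))  // 75.0
```

Tracking both values is useful because a single failing sample flips the first metric, while the second shows how far the suite is from passing.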

Synthesize tool-evaluation samples swift · at 19:20 ↗

// Tool call evaluations
  let prompt = Prompt("""
      Generate diverse user queries for a personal book library assistant.
      Each sample needs a prompt (what the user says), and a trajectory
      expectation describing which tools should be called and in what order.
      """)

  let instructions = """
      AVAILABLE TOOLS:
      - searchBooks(query?, tag?, mood?, genre?, limit?): search the library
      - getBookDetails(bookId): full details for one book
      - findSimilarBooks(bookId, maxResults?): find books sharing tags
      ORDER REQUIREMENTS:
      - searchBooks must come before getBookDetails or findSimilarBooks
      - Use TrajectoryExpectation(ordered:) when sequence matters, else (unordered:)
      USE THESE ARGUMENT MATCHERS:
      - .exact for precise values, .naturalLanguage for fuzzy matching
      - .keyOnly when any value is acceptable, .range for numeric constraints
      - .contains/.hasPrefix/.hasSuffix for partial string matching
      """

Validate tool-evaluation samples swift · at 19:51 ↗

// Tool call evaluations
  validator: { sample in
      // Must have expectations defined
      guard let expectations = sample.output.expectations else { return false }

      // Must reference at least one tool
      let totalExpectations = expectations.ordered.count + expectations.unordered.count
      guard totalExpectations > 0 else { return false }

      // All tool names must be from the valid set
      let validTools: Set<String> = ["searchBooks", "getBookDetails", "findSimilarBooks"]
      let allExpectations = expectations.ordered + expectations.unordered + expectations.disallowed
      for expectation in allExpectations {
          guard validTools.contains(expectation.name) else { return false }
      }
  
      return true
  }

Resources

[samplecode] Book Tracker: Using Evaluations to evaluate an intelligent feature
[documentation] Generating synthetic datasets
[documentation] Evaluating tool-calling behavior
[documentation] Scoring with model-as-judge evaluators