---
title: "Before Using Flue: Figuring Out What This Framework Actually Is"
description: "A rough summary of how Flue thinks about harnesses, agents, workflows, skills, tools, sandboxes, and persistence — before actually running anything."
lang: "en"
canonical: "https://llm-lab.dev/en/posts/flue-framework-overview/"
source: "https://llm-lab.dev/en/posts/flue-framework-overview.md"
publishedAt: "2026-06-18"
updatedAt: "2026-06-18"
category: "Flue"
tags:
  - "flue"
  - "agent"
  - "typescript"
  - "observability"
---

# Before Using Flue: Figuring Out What This Framework Actually Is

import LinkCard from "../../components/LinkCard.astro";

Before working with the AI agent framework [Flue 1.0 Beta](https://flueframework.com/), I wanted to get a clear picture of what this framework is actually for.

My first read of the docs suggested it wasn't just a thin SDK for calling LLM APIs. Instead, it looked like a foundation for bringing together the execution environment, state management, tools, skills, and sandboxes that an agent needs to keep doing real work.

This post is the background research. The follow-up covers actually running the quickstart.

<LinkCard
  href="https://llm-lab.dev/posts/flue-1-0-beta-local-check/"
  title="When I Tried Flue 1.0 Beta Locally, I Hit a Wall in the First Line of the Quickstart"
  description=""
  siteName=""
  image="/images/posts/flue-1-0-beta-local-check/heroImage.webp"
/>

I planned to start with the quickstart to see how the CLI behaves, then move on to building a small practical use case combining agents, skills, channels, and observability. To prepare, I organized these points first:

- Does the quickstart guide match the actual CLI behavior?
- How naturally can the `src/agents/` and `src/workflows/` conventions be used?
- How should tools and skills be separated?
- Which sandbox should be chosen for agents handling external input?
- Are agent continuation IDs, workflow run IDs, and observability structured in a way that can be traced later?
- How do persistence and operational context differ between Node.js and Cloudflare?

What I especially wanted to see was how "AI-agent-like abstractions" translate into concrete design decisions during implementation. Are skills just descriptive text, or do they serve as stable units of judgment? Is a sandbox a convenience feature, or a trust boundary that needs to be designed as one? Are channels and observability just demo decorations, or things that should be wired in from the start? The follow-up posts are the verification logs for these questions.

## What Flue Centers On

Flue is a TypeScript framework for autonomous AI agents. It targets execution environments like Node.js or Cloudflare, aims to stay independent of specific models and hosting platforms, and focuses on handling long-running sessions and recovery from interruptions.

The most prominent thing in the docs is how heavily Flue foregrounds the concept of a **harness**. Here, a harness means the complete environment an LLM needs to move forward with real-world tasks:

- File system
- Tools
- Sandboxes
- Instructions and context
- Subagents
- Skills
- External connections such as MCP servers

An LLM on its own only returns text. But when given the above, it can read files, gather needed information, call external APIs, and sustain long-running work. Flue seems to be designed around the question of "what do you give the model?"

## Harness-First as a Design Lens

Traditional agent implementations often end up with the developer encoding step-by-step procedures: call this, then that, and if it fails do this. That makes for understandable workflows, but they tend to be brittle against unexpected inputs or ambiguous tasks.

Flue's harness-first approach shifts the focus from locking down every step to equipping the model with the tools and context it needs. The developer doesn't write a sequential script; instead they compose a model, instructions, tools, skills, sandbox, and subagents to create a state in which the task can be solved.

This mindset feels quite natural if you've worked with coding agents like Claude Code or Codex. Even without the human spelling out every micro-step, once the working directory, command execution, search, editing permissions, and project rules are in place, the agent can move forward on its own. I understood Flue as a framework for assembling that structure on the application side.

That said, it's important not to reduce harness-first to "just let the model do whatever it wants." In practice, you still face fine-grained boundary decisions: what information belongs in a skill, what operation becomes a tool, what input should not be trusted, and what processing should be offloaded to a workflow.

In the follow-up issue triage agent, for example, the criteria for judging an issue from its body alone can live in a skill. Anything with side effects or external references, such as writing comments on GitHub, adding labels, or searching past issues, becomes the responsibility of tools or channels. If this separation is blurry, you end up in a state where everything is written in the prompt but it's unclear what counts as safe behavior.

## How Agents Are Built

In Flue, you place TypeScript files under `src/agents/` and define an agent with `createAgent()`. The file name becomes the agent name, and by convention it serves as the HTTP endpoint or dispatch target.

A heavily simplified shape looks like this:

```ts
import { createAgent } from '@flue/runtime';

export default createAgent(() => ({
  model: 'anthropic/claude-sonnet-4-6',
  instructions: 'You are a code review agent.',
  tools: [],
  skills: [],
}));
```

Into this you add which model to use, what to instruct, which tools to provide, which skills to read, and in which sandbox to run.

During local development you can connect interactively from the CLI, while in production the agent is invoked via an HTTP endpoint like `POST /agents/<name>/<id>`, or from within the application via `dispatch()`. The `id` can serve as a persistent session identifier, so for example you can keep the same agent instance alive across events tied to "repository name + issue number."

The design of this `id` turned out to be more important than I initially expected. A one-off chat can get away with a random ID, but for subjects like GitHub issues or support tickets where additional information arrives later, the application itself must decide "what counts as the same session."

In the follow-up post I used an ID like `repository.full_name#issue.number`. This was to ensure that additional events for the same issue were routed to the same persistent instance. Even if Flue can hold state, the granularity at which state continues is an application-level design decision, not the framework's.
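
As a minimal sketch of that ID scheme, the helper below derives a session ID from a webhook payload. The `IssueEvent` type, `sessionIdFor`, and `agentPath` are hypothetical names for illustration, not Flue APIs. The percent-encoding step is my own assumption: `/` and `#` have special meaning in URLs, so an ID like this probably needs encoding before it goes into a path such as `POST /agents/<name>/<id>`.

```ts
// Hypothetical subset of a GitHub issue webhook payload.
interface IssueEvent {
  repository: { full_name: string };
  issue: { number: number };
}

// Same issue => same ID, so later events land in the same session.
function sessionIdFor(event: IssueEvent): string {
  return `${event.repository.full_name}#${event.issue.number}`;
}

// Assumption: the ID travels as one URL path segment, so "/" and "#" are escaped.
function agentPath(agentName: string, sessionId: string): string {
  return `/agents/${encodeURIComponent(agentName)}/${encodeURIComponent(sessionId)}`;
}

const opened: IssueEvent = { repository: { full_name: 'octo/widgets' }, issue: { number: 42 } };
const commented: IssueEvent = { repository: { full_name: 'octo/widgets' }, issue: { number: 42 } };

console.log(sessionIdFor(opened) === sessionIdFor(commented)); // true
console.log(agentPath('issue-triage', sessionIdFor(opened)));
// /agents/issue-triage/octo%2Fwidgets%2342
```

Two events for the same issue produce the same ID, which is the property that matters here. Whether a random suffix, a tenant prefix, or a hash belongs in the ID is still an application decision.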

## Tools and Skills Are Different Things

One of the first things that gets confusing when reading about Flue is the distinction between tools and skills. At a glance both look like "things that add capability to the agent," but their roles differ.

A **tool** is executable code. It carries out real actions: querying order status from a database, writing a GitHub comment, or calling an external API. When the model calls a tool, an application-side function runs and the result is returned to the model.

A **skill**, on the other hand, is a reusable instruction document. It collects specialized ways of working — a code review process, troubleshooting viewpoints, ticket classification rules, internal writing tone — into Markdown. A skill does not add execution power; it guides the model's judgment and workflow.

My mental model is: a tool is "what you can do," and a skill is "how to do it." If you want the agent to manipulate external systems, you need tools. If you want stable judgment criteria or working perspectives, skills are effective.

This distinction shows up immediately even in a small agent like issue triage. Definitions of severity, judgments on whether reproduction info is sufficient, and ways to think about label candidates can all be skills. But actually writing the comment to GitHub, applying the label, and reading the repository settings — these parts are better made explicit as tools or channels so responsibilities stay clear.

If you blur this boundary and try to cram everything into instructions, you can build something quickly at first. But observing behavior, restricting permissions, and testing later become much harder. One of the interesting things about Flue is that it encodes this separation into the framework's vocabulary from the start.
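
To make the split concrete, here is a small sketch in plain TypeScript. The `Skill` and `Tool` shapes, the `sideEffect` flag, and the `readOnly` helper are all hypothetical and are not Flue's own types. The point is only that a skill is inert text, while a tool is code that can be filtered by permission and tested directly.

```ts
// Hypothetical shapes for illustration; these are not Flue's types.
interface Skill {
  name: string;
  description: string;
  body: string; // Markdown guidance; never executed
}

interface Tool {
  name: string;
  sideEffect: boolean; // true when the tool changes an external system
  run(input: string): string;
}

const triageSkill: Skill = {
  name: 'issue-triage',
  description: 'Criteria for severity and missing reproduction info.',
  body: 'Treat data loss as high severity. Ask for repro steps when none are given.',
};

// In-memory stand-ins for real GitHub calls.
const appliedLabels: string[] = [];

const searchIssues: Tool = {
  name: 'search_issues',
  sideEffect: false,
  run: (query) => `no past issues match "${query}"`, // canned result
};

const addLabel: Tool = {
  name: 'add_label',
  sideEffect: true,
  run: (label) => {
    appliedLabels.push(label);
    return `labeled ${label}`;
  },
};

// A dry-run mode can drop every tool that writes to the outside world.
function readOnly(tools: Tool[]): Tool[] {
  return tools.filter((tool) => !tool.sideEffect);
}

const dryRunTools = readOnly([searchIssues, addLabel]);
console.log(dryRunTools.map((tool) => tool.name), triageSkill.name);
```

Nothing comparable can be done with a paragraph buried in the instructions, which is why the permission and testing story gets harder once the boundary blurs.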

## How Skills Are Imported

A skill is written as a `SKILL.md` with `name` and `description` in its frontmatter. You can either place it in the project and import it explicitly from TypeScript, or let it be auto-discovered from `.agents/skills/` inside the sandbox.

```md
---
name: code-review
description: Reviews code changes using the project review checklist.
---

# Code Review

Check correctness, security, maintainability, and missing tests.
Do not comment on unrelated style issues unless they affect behavior.
```

On the TypeScript side, you import the Markdown as a skill and pass it to the agent:

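```ts
import { createAgent } from '@flue/runtime';
// The `type: 'skill'` import attribute loads the Markdown file as a skill.
import reviewSkill from '../skills/code-review/SKILL.md' with { type: 'skill' };

export default createAgent(() => ({
  model: 'anthropic/claude-sonnet-4-6',
  skills: [reviewSkill],
}));
```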

This structure feels close to Codex skills or Claude Code custom instructions. Rather than embedding all behavior into application code, being able to separate working patterns into Markdown seems convenient for adjusting while operating.

## View the Sandbox as a Trust Boundary

Flue provides multiple sandboxes. Roughly speaking, these are the default virtual sandbox, a local sandbox available in Node.js environments, and remote sandboxes using services like Daytona.

The virtual sandbox is a lightweight in-memory environment that does not directly access the host file system. The local sandbox can touch the host file system and shell. Remote sandboxes offer an isolated Linux environment.

This is a dangerous choice to make on convenience alone. `local()` in particular is handy for trusted local development or ephemeral CI runners, but it should not be used in a long-running service that processes input from external users. A sandbox is both a working environment and a trust boundary.

When building the GitHub issue triage agent in the follow-up post, I initially reached for the local sandbox, but because issue bodies are external input, I eventually switched to the virtual sandbox.

This decision was the strongest case of "I'm glad I read the overview first" in this series. Skimming the docs makes `local()` look appealing because it gives access to files and shells. But handing host access to an agent processing externally received issue bodies is less about convenience and more about trust boundaries.

In other words, choosing a sandbox is operational design, not feature selection. Solo local experiments, ephemeral CI runners, long-running production services, and multi-tenant environments each start from different premises. When working with Flue, it seems better to first decide "where does this input come from" and "what is this agent allowed to touch" before picking a sandbox.
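
Those two questions can be written down as a small decision function. This is only one possible reading of the guidance above, not a rule taken from Flue: `AgentContext`, `chooseSandbox`, and the string labels are hypothetical, and the function returns a label rather than calling any real sandbox factory.

```ts
type SandboxKind = 'virtual' | 'local' | 'remote';

// Hypothetical description of an agent's operating context.
interface AgentContext {
  inputSource: 'trusted' | 'external'; // where does the input come from?
  needsShell: boolean; // does the agent need a real shell or file system?
}

function chooseSandbox(ctx: AgentContext): SandboxKind {
  if (ctx.inputSource === 'external') {
    // Untrusted input never gets host access; isolate it remotely if a shell is needed.
    return ctx.needsShell ? 'remote' : 'virtual';
  }
  // Trusted input: host access is acceptable only when it is actually needed.
  return ctx.needsShell ? 'local' : 'virtual';
}

console.log(chooseSandbox({ inputSource: 'external', needsShell: false })); // virtual
console.log(chooseSandbox({ inputSource: 'external', needsShell: true })); // remote
console.log(chooseSandbox({ inputSource: 'trusted', needsShell: true })); // local
```

The issue triage agent falls into the first case: external input, no shell required, so the virtual sandbox is enough.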

## Subagents as a Context Isolation Tool

Flue also supports delegating tasks to subagents. Instead of the parent agent holding every investigation, classification, review, and patch creation, specialized child agents take on those roles.

The benefit of subagents is not just parallelization. It's the ability to do exploratory work without polluting the parent's context, to equip the child with only the tools it needs, and to let the parent integrate only the results.

In human work too, splitting into investigation, implementation, and review roles often makes things easier. The same applies to agent design: rather than cramming everything into one giant prompt, separating by role often leads to better clarity.
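
The context-isolation idea can be shown with a toy example that involves no LLM at all. Everything here is hypothetical: `investigate` stands in for a subagent, its `scratch` array stands in for the child's exploratory context, and `parentContext` stands in for the parent's conversation history.

```ts
interface SubtaskResult {
  summary: string;
  stepsTaken: number;
}

// Stand-in for a subagent: it explores in its own scratch context.
function investigate(paths: string[], keyword: string): SubtaskResult {
  const scratch: string[] = []; // exploratory notes that never leave this function
  for (const path of paths) {
    scratch.push(`opened ${path}`);
    if (path.includes(keyword)) scratch.push(`match in ${path}`);
  }
  const matches = scratch.filter((note) => note.startsWith('match')).length;
  return { summary: `${matches} path(s) mention "${keyword}"`, stepsTaken: scratch.length };
}

const parentContext: string[] = ['user: why does login fail?'];
const result = investigate(['auth/login.ts', 'auth/token.ts', 'ui/button.ts'], 'auth');

// The parent integrates only the result, not the five intermediate notes.
parentContext.push(`subagent: ${result.summary}`);
console.log(parentContext.length, result.stepsTaken); // 2 5
```

The child took five steps, but the parent's context grew by a single line.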

## Agent and Workflow Durability Are Different

Flue has not only agents but also workflows. What's important here is that their durability and recovery model are not the same.

An agent is treated as a persistent session, where resuming after interruption is a core design concern. It picks up safely from partial outputs, tool calls, and conversation history.

A workflow, on the other hand, is closer to a finite function execution. Retrying does not mean "resume from the middle" but rather "run a new execution." Because of this, any workflow step that affects external systems must be designed by the application with idempotency in mind.

For example, creating tickets, posting comments, requesting payments, or sending emails can cause problems if the same action runs twice on failure. For webhook-driven processing, you might need to store delivery IDs or event IDs for deduplication, or pass idempotency keys to external APIs.

This is basically standard distributed systems design that predates LLMs, but when agents and workflows enter the picture it's easy to fall into the feeling that "the model will handle it smartly." That makes it something to be explicitly careful about.
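
The deduplication idea mentioned above can be sketched without any framework. `DeliveryLog` and `handleDelivery` are hypothetical names, and the in-memory `Set` is a placeholder: a real service would need a durable store with a unique constraint on the delivery ID.

```ts
// Hypothetical in-memory log of handled webhook deliveries.
class DeliveryLog {
  private readonly seen = new Set<string>();

  // Returns true only the first time a delivery ID is claimed.
  claim(deliveryId: string): boolean {
    if (this.seen.has(deliveryId)) return false;
    this.seen.add(deliveryId);
    return true;
  }
}

type Outcome = 'processed' | 'duplicate';

function handleDelivery(log: DeliveryLog, deliveryId: string, sideEffect: () => void): Outcome {
  if (!log.claim(deliveryId)) return 'duplicate'; // retry of an already-seen event
  sideEffect();
  return 'processed';
}

const log = new DeliveryLog();
let commentsPosted = 0;
const postComment = () => {
  commentsPosted += 1; // stand-in for a real "post comment" call
};

const first = handleDelivery(log, 'delivery-001', postComment);
const retry = handleDelivery(log, 'delivery-001', postComment);
console.log(first, retry, commentsPosted); // processed duplicate 1
```

Claiming the ID before running the side effect gives at-most-once behavior: if the side effect fails after the claim, a retry is skipped. Passing an idempotency key to the external API covers the opposite ordering, where the action runs first and the record is written afterwards.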

In the quickstart, the workflow's `run ID` was visible; in the issue triage agent, the agent's `id` and `submissionId` showed up. These are subtle but become very important when tracing operational logs later. AgentOps is not just about looking at the model's output. Being able to trace "which input went into which session or run, and which operation failed" is close to the core of the work.

Before working with Flue, I saw the difference between agents and workflows as roughly "things you let an LLM handle" versus "fixed processes." But once you start looking at logs and errors, the unit of resumability, the unit of retry, and the responsibility for avoiding duplicate execution all differ. Without understanding this, you can get stuck when moving from a working demo to an operational agent.
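
To show why those IDs matter for tracing, here is a sketch of a log record and a lookup over it. The `TraceRecord` shape and `failuresFor` are hypothetical and do not reflect Flue's actual log format. The field names simply mirror the agent `id`, `submissionId`, and workflow run ID mentioned above.

```ts
// Hypothetical log record; not Flue's real log schema.
interface TraceRecord {
  kind: 'agent' | 'workflow';
  sessionId?: string; // agent `id`
  submissionId?: string; // as seen in the triage agent's output
  runId?: string; // workflow run ID
  operation: string;
  ok: boolean;
}

// Answers "which operation failed in this session?"
function failuresFor(records: TraceRecord[], sessionId: string): TraceRecord[] {
  return records.filter((record) => record.sessionId === sessionId && !record.ok);
}

const records: TraceRecord[] = [
  { kind: 'agent', sessionId: 'octo/widgets#42', submissionId: 's-1', operation: 'classify', ok: true },
  { kind: 'agent', sessionId: 'octo/widgets#42', submissionId: 's-2', operation: 'add_label', ok: false },
  { kind: 'agent', sessionId: 'octo/widgets#43', submissionId: 's-3', operation: 'add_label', ok: false },
  { kind: 'workflow', runId: 'run-7', operation: 'summarize', ok: true },
];

const failed = failuresFor(records, 'octo/widgets#42');
console.log(failed.map((record) => `${record.submissionId}:${record.operation}`));
```

Without a stable session ID on every record, the second and third entries would be indistinguishable beyond their operation name.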

## Persistence Changes by Deployment Target

Flue's conversation history and workflow execution records can be persisted via a `PersistenceAdapter`. However, the configuration differs depending on the execution environment.

When deploying to Cloudflare, a setup using Durable Objects and SQLite seems to be the assumption, with the necessary state storage built into Flue. On the other hand, the default in a Node.js environment is in-memory SQLite, so state disappears on process restart. For persistence, you need to define an adapter for SQLite or PostgreSQL in something like `src/db.ts`.

That said, what gets stored here is the Flue runtime's state. It does not mean customer data, business data, or application-specific persistent data is automatically handled. It seems better to keep agent conversation history and business data separate.

## What Caught My Attention Before Trying It

From reading the docs, Flue seemed less like a "convenient library for calling LLM APIs" and more like an execution foundation for keeping agent applications alive over the long term.

These were the points that especially stood out:

- Treating an agent as a persistent session
- Separating tools and skills
- Treating the sandbox as a trust boundary
- Requiring idempotency design for workflows
- Differing persistence assumptions between Node.js and Cloudflare

Conversely, for a lightweight chatbot it might feel a bit heavy. But for use cases like issue triage, code review, investigation, or multi-step business assistance — where state, external integration, and observability are needed — it seems like it will shine.

I'll split the follow-up into implementation verification logs in separate posts!
