Groq + Llama 3.3 AI
Why Groq + Llama 3.3 70B, two-pass pattern
This page explains the rationale for using Groq with the Llama 3.3 70B model in the AI layer and details the two-pass pattern used for reliable tool calling in both the WhatsApp intake agent and the AI assistant.
Groq + Llama 3.3 70B Selection
The AI layer uses Groq's inference engine with the llama-3.3-70b-versatile model in two key components: the AI assistant (src/app/api/ai/chat/route.ts) and the WhatsApp intake agent (src/lib/ai-intake.ts). This choice is driven by performance and cost efficiency.
Groq delivers significantly faster inference—critical for back-to-back LLM calls—while being more cost-effective than GPT-4-class models. The model is invoked with the same llama-3.3-70b-versatile identifier in both files, confirming consistent usage across the AI surface.
Although Llama 3.3 70B may have weaker instruction-following capabilities compared to larger proprietary models, this is mitigated by the narrow scope of available tools. For example:
- In `src/lib/ai-intake.ts`, tools like `get_available_slots` and `book_appointment` are tightly scoped to scheduling logic.
- In `src/app/api/ai/chat/route.ts`, tools such as `get_stats` and `get_no_show_clients` serve specific analytical queries.
The system prompt in each file enforces strict rules (e.g., "never invent numbers", "use tools to fetch data") to reduce hallucination risk, making the model’s limitations manageable within bounded domains.
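To make the "narrow scope" concrete, the sketch below shows what one such tool declaration might look like in the OpenAI-compatible function-calling format that Groq accepts. The tool name comes from this page; the parameter schema and description text are illustrative assumptions, not the actual definitions in `src/lib/ai-intake.ts`.

```typescript
// Hypothetical sketch of a narrowly scoped tool declaration.
// The name matches the docs; the schema and wording are assumptions.
type ToolDef = {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>;
  };
};

const TOOLS: ToolDef[] = [
  {
    type: "function",
    function: {
      name: "get_available_slots",
      description:
        "List open appointment slots for a given date. Never invent slots.",
      parameters: {
        type: "object",
        properties: {
          date: { type: "string", description: "ISO date, e.g. 2025-01-31" },
        },
        required: ["date"],
      },
    },
  },
];
```

Because each tool exposes only a small, typed parameter surface, the model has little room to misuse it even when its instruction following is imperfect.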
Two-Pass Tool Calling Pattern
Both AI implementations follow a two-pass pattern to separate tool decision from response generation, improving reliability and reducing incorrect tool usage.
First Pass: Tool Decision
The LLM receives the user message, system prompt, and tool definitions. Its sole task is to decide whether to invoke a tool and return a function_call object.
In src/lib/ai-intake.ts, this occurs via:
```typescript
const first = await getGroq().chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: llmMessages,
  tools: TOOLS,
  tool_choice: "auto",
})
```

If `choice.tool_calls` is present, the system proceeds to execution.
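Between the two passes, each returned tool call must be dispatched to a local handler and its result serialized. A minimal sketch of that dispatch step, assuming a handler map keyed by tool name (the handler body and stubbed return value here are illustrative, not the actual scheduling logic):

```typescript
// Shape of a tool call as returned by an OpenAI-compatible API:
// arguments arrive as a JSON string and must be parsed before dispatch.
type ToolCall = { id: string; function: { name: string; arguments: string } };

// Hypothetical handler map; real handlers would hit the scheduling layer.
const handlers: Record<string, (args: any) => unknown> = {
  get_available_slots: ({ date }: { date: string }) => ["09:00", "10:30"], // stubbed
};

function executeToolCalls(toolCalls: ToolCall[]) {
  return toolCalls.map((tc) => ({
    tool_call_id: tc.id,
    name: tc.function.name,
    // JSON-serialize the result so it can be injected as a tool message.
    result: JSON.stringify(
      handlers[tc.function.name](JSON.parse(tc.function.arguments))
    ),
  }));
}
```

Keeping execution in plain application code (rather than letting the model "run" anything) is what makes the first pass auditable: every decision is a named tool plus parsed arguments.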
Second Pass: Response Generation
Tool results are injected into a new message list with role tool, including tool_call_id and JSON-serialized result. A second LLM call generates the final natural language response.
In src/app/api/ai/chat/route.ts, this is implemented as:
```typescript
const second = await getGroq().chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: toolMessages,
})
```

This separation ensures:
- Deterministic tool invocation before any user-facing text is generated.
- Clear audit trail of inputs, decisions, and outputs.
- Reduced risk of hallucinated or inconsistent responses based on unexecuted tool calls.
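The message list for the second pass has a fixed shape in OpenAI-compatible APIs: the assistant turn carrying the `tool_calls`, followed by one `role: "tool"` message per result, matched by `tool_call_id`. A sketch of that assembly, assuming a `buildToolMessages` helper (the helper name and types are hypothetical; only the message shapes follow the API contract):

```typescript
// OpenAI-compatible chat message shapes relevant to the second pass.
type Msg =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string | null; tool_calls?: unknown[] }
  | { role: "tool"; tool_call_id: string; content: string };

// Hypothetical helper: append the assistant turn that requested the tools,
// then one tool message per result, keyed by tool_call_id.
function buildToolMessages(
  history: Msg[],
  assistantTurn: { content: string | null; tool_calls: { id: string }[] },
  results: { tool_call_id: string; result: unknown }[]
): Msg[] {
  return [
    ...history,
    { role: "assistant", ...assistantTurn },
    ...results.map((r) => ({
      role: "tool" as const,
      tool_call_id: r.tool_call_id,
      content: JSON.stringify(r.result), // tool output must be a string
    })),
  ];
}
```

Omitting the assistant turn or mismatching a `tool_call_id` is rejected by the API, which is itself a guardrail: a malformed second pass fails loudly instead of producing text from phantom tool results.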
Known Gaps
- Instruction Following: Llama 3.3 70B is weaker than GPT-4-class models on complex, open-ended instruction following. This is mitigated by narrow tool scope and strict system prompts, but could limit future expansion into more nuanced AI behaviors without additional guardrails.
Why This Shape
The combination of Groq and Llama 3.3 70B enables fast, low-cost inference, making the two-pass pattern economically viable. Each user interaction may trigger two LLM calls, which would be prohibitively expensive with slower, higher-cost providers.
The two-pass pattern enhances reliability by decoupling decision from response. This design allows the system to use a high-performance but less instruction-precise model like Llama 3.3 70B safely, as tool execution is validated before final output generation. The pattern is consistently applied across both src/lib/ai-intake.ts and src/app/api/ai/chat/route.ts, ensuring uniform behavior in user-facing and professional-facing AI interactions.