Groq Llama AI Layer
Why Groq + Llama 3.3 70B, two-pass pattern
The AI layer in Odys uses a large language model to automate client-facing interactions and to give professionals insight into their business. This document explains why Odys integrates Groq with the Llama 3.3 70B model and describes the two-pass tool-calling pattern that both AI components share. The pattern keeps AI interactions controlled and context-aware, whether they serve client-facing automation or internal professional analytics.
Overview
The Odys AI layer is composed of two primary components, each tailored for distinct use cases but sharing a common underlying architecture:
- AI WhatsApp Intake Agent (src/lib/ai-intake.ts): This agent automates client interactions via WhatsApp, primarily focusing on checking availability and creating appointments. It acts as a conversational interface for clients to schedule sessions with professionals.
- In-App AI Assistant (src/app/api/ai/chat/route.ts): This component provides professionals with an interactive assistant within the Odys application. It offers insights into their business, such as appointment statistics, upcoming schedules, and client no-show rates.
Both components utilize the groq-sdk to interface with the llama-3.3-70b-versatile model. A critical design choice across both implementations is the "two-pass" tool-calling pattern, which ensures that the LLM's decisions regarding tool use are carefully evaluated before generating a final response. The system interacts with a Drizzle ORM-backed PostgreSQL database, which includes 10 tables, notably professionals, appointments, clients, and availability, to retrieve and store relevant information. The api/ai/chat route is also protected by a rate limit of 20 requests per hour per user, managed by the getAiChatLimiter.
The Two-Pass Tool-Calling Pattern
A cornerstone of the Odys AI layer is the "two-pass" tool-calling pattern. This pattern is employed in both the handleIncomingMessage function within src/lib/ai-intake.ts and the POST handler in src/app/api/ai/chat/route.ts. The rationale behind this approach is to enhance reliability and control over the LLM's behavior, especially when interacting with external tools that can modify data or trigger actions.
Here's how it works:
- First Pass: Tool Decision: The LLM receives the user's message along with a set of predefined TOOLS and a SYSTEM_PROMPT. In this initial call, the LLM's primary task is not to generate a direct response, but to decide whether any of the available tools are relevant to the user's query. If a tool is deemed necessary, the LLM outputs a tool_calls object specifying the tool's name and its required arguments.
- Tool Execution: If the first pass indicates a tool call, the application intercepts this. It then parses the tool's name and arguments, executes the corresponding backend function (e.g., getAvailableSlots, bookAppointment, getStats), and captures the result.
- Second Pass: Response Generation: The original conversation history, the LLM's initial tool_calls message, and the results from the executed tool are all fed back to the LLM. In this second pass, the LLM's role is to synthesize a natural language response to the user, incorporating the information obtained from the tool's execution.
This two-pass mechanism gives the application a control point between the model's decision and its final answer. Because tools run in application code rather than inside the LLM's generation loop, every result the model sees in the second pass comes from a real function call, which reduces the risk of the LLM "hallucinating" tool results or making incorrect assumptions. It also leaves room for more complex logic, such as handling tool errors or refining arguments, before the final user-facing response is crafted.
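The three steps above can be sketched as follows. This is a minimal illustration, not the Odys implementation: ChatFn, ToolHandler, runTwoPass, and fakeChat are hypothetical names, and the model call is replaced by a stub so the example runs without network access. In the real modules that call goes through groq-sdk.

```typescript
// Hypothetical, simplified message shapes modelled on the OpenAI-compatible chat format.
type ToolCall = { id: string; function: { name: string; arguments: string } };
type Message =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };
type AssistantMessage = Extract<Message, { role: "assistant" }>;

// Stand-in for the LLM call; `allowTools` marks whether tools are offered in this pass.
type ChatFn = (messages: Message[], allowTools: boolean) => Promise<AssistantMessage>;
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

async function runTwoPass(
  chat: ChatFn,
  tools: Record<string, ToolHandler>,
  messages: Message[],
): Promise<string> {
  // Pass 1: the model decides whether a tool is needed.
  const first = await chat(messages, true);
  if (!first.tool_calls || first.tool_calls.length === 0) {
    return first.content ?? "";
  }

  // Tool execution happens in application code, outside the model.
  const toolMessages: Message[] = [];
  for (const call of first.tool_calls) {
    const handler = tools[call.function.name];
    let result: unknown;
    try {
      if (!handler) throw new Error(`unknown tool: ${call.function.name}`);
      result = await handler(JSON.parse(call.function.arguments));
    } catch (err) {
      // Failures are passed back as data so the model can explain them to the user.
      result = { error: err instanceof Error ? err.message : String(err) };
    }
    toolMessages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) });
  }

  // Pass 2: history + tool call + tool results produce the natural-language answer.
  const second = await chat([...messages, first, ...toolMessages], false);
  return second.content ?? "";
}

// Stub model: requests a tool in pass 1, then echoes the tool output in pass 2.
const fakeChat: ChatFn = async (messages, allowTools) => {
  if (allowTools) {
    return {
      role: "assistant",
      content: null,
      tool_calls: [
        { id: "call_1", function: { name: "get_available_slots", arguments: '{"date":"2025-03-10"}' } },
      ],
    };
  }
  const last = messages[messages.length - 1];
  const toolOutput = last ? last.content : "";
  return { role: "assistant", content: `Slots: ${toolOutput}` };
};
```

Returning tool failures as ordinary tool messages, rather than throwing, is one way to let the second pass turn an error into a readable reply.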
Groq and Llama 3.3 70B Integration
The choice of Groq as the LLM provider and Llama 3.3 70B as the model is a deliberate one, driven by performance and capability considerations. The groq-sdk is used to interact with the Groq API, which is known for its high inference speed. This speed is particularly beneficial for conversational agents like the WhatsApp intake, where low latency is critical for a natural user experience.
The llama-3.3-70b-versatile model is selected for its balance of versatility and performance. A 70 billion parameter model offers significant reasoning capabilities, allowing it to understand complex user queries, accurately interpret tool descriptions, and generate coherent, contextually appropriate responses. This combination ensures that the AI layer can handle a wide range of interactions effectively, from scheduling appointments to providing detailed business analytics.
AI WhatsApp Intake Agent (src/lib/ai-intake.ts)
The ai-intake.ts module is responsible for powering the WhatsApp conversational agent. Its core function, handleIncomingMessage, processes inbound WhatsApp messages, maintains conversation state, and orchestrates LLM interactions.
Tools and Their Purpose
The agent is equipped with three specific tools, defined in the TOOLS array, to manage client bookings:
- get_available_slots: This tool is crucial for checking a professional's availability. It takes a date (in YYYY-MM-DD format) as input and returns a list of available time slots. The getAvailableSlots function queries the professionals table for session duration and the availability table for general rules, then cross-references with existing appointments to find open slots. It explicitly handles São Paulo's UTC-3 timezone using helper functions like saoPauloDate and formatSaoPauloTime.
- book_appointment: Designed to create a new appointment, this tool requires date, time, client_name, and client_phone. The bookAppointment function performs an atomic transaction using Drizzle ORM's serializable isolation level. This is vital to prevent race conditions where two users might try to book the same slot simultaneously. It also handles upserting client information into the clients table if a new client is detected and sends notifications (in-app, WhatsApp via sendWhatsApp with the msgBookingRequest template, and email via sendBookingRequestEmailToProfessional) to the professional.
- get_professional_info: This tool retrieves essential details about the professional, such as their name, profession, sessionDuration, sessionPrice, and schedule based on availability rules. This allows the LLM to answer client questions about the professional's services.
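The availability lookup can be pictured as generating candidate slots inside an availability window and dropping any that overlap an existing appointment. The sketch below shows only that core step under simplifying assumptions: computeOpenSlots, Interval, and the minutes-since-midnight representation are hypothetical, and the database queries and timezone conversion done by the real getAvailableSlots are left out.

```typescript
// Hypothetical representation: minutes since local midnight, end exclusive.
type Interval = { startMin: number; endMin: number };

function toHHMM(minutes: number): string {
  const h = Math.floor(minutes / 60);
  const m = minutes % 60;
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}`;
}

function computeOpenSlots(open: Interval, sessionMin: number, booked: Interval[]): string[] {
  if (sessionMin <= 0) throw new RangeError("sessionMin must be positive");
  const slots: string[] = [];
  // Step through the window in session-sized increments.
  for (let start = open.startMin; start + sessionMin <= open.endMin; start += sessionMin) {
    const end = start + sessionMin;
    // Two half-open intervals overlap when each starts before the other ends.
    const overlaps = booked.some((b) => start < b.endMin && b.startMin < end);
    if (!overlaps) slots.push(toHHMM(start));
  }
  return slots;
}
```

For example, a 09:00 to 12:00 window with 60-minute sessions and one booking at 10:00 leaves 09:00 and 11:00 open.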
Conversation Management
The handleIncomingMessage function uses getConversation and saveConversation to persist and retrieve conversation history for a given senderPhone and professionalId. This ensures that the LLM has context from previous turns, allowing for more natural and continuous dialogue. It also attempts to pre-fill clientName from the clients table if the phone number is recognized. A notable feature is the ability to seed the LLM with outboundContext if the inbound message is a reply to a recent transactional notification, helping the AI understand the user's intent regarding a specific appointment.
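A minimal in-memory sketch of per-conversation history, keyed by professional and sender phone, is shown below. It only illustrates the keying idea: ConversationStore and Turn are hypothetical, and the storage backend and signatures of the real getConversation and saveConversation are not described in this document.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Hypothetical in-memory store; a real deployment would persist this elsewhere.
class ConversationStore {
  private readonly turns = new Map<string, Turn[]>();

  private key(professionalId: string, senderPhone: string): string {
    // One conversation per (professional, client phone) pair.
    return `${professionalId}:${senderPhone}`;
  }

  get(professionalId: string, senderPhone: string): Turn[] {
    // Return a copy so callers cannot mutate stored history by accident.
    return [...(this.turns.get(this.key(professionalId, senderPhone)) ?? [])];
  }

  append(professionalId: string, senderPhone: string, ...newTurns: Turn[]): void {
    const k = this.key(professionalId, senderPhone);
    this.turns.set(k, [...(this.turns.get(k) ?? []), ...newTurns]);
  }
}
```

Keying on both identifiers keeps the same client's chats with two different professionals separate.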
System Prompt
The buildSystemPrompt function crafts a detailed SYSTEM_PROMPT that guides the LLM's behavior. It instructs the LLM to act as the professional's booking assistant, respond in Brazilian Portuguese, be sympathetic and concise, never invent schedules, and always use tools for availability and booking. It also includes rules for asking for missing information (date, client name) and handling out-of-scope questions.
In-App AI Assistant (src/app/api/ai/chat/route.ts)
The api/ai/chat/route.ts file defines the API endpoint for the in-app AI assistant, providing professionals with data-driven insights.
Tools and Their Purpose
This assistant is equipped with a different set of tools, tailored for business analytics:
- get_stats: This tool fetches comprehensive statistics for the last six months, including global and month-by-month breakdowns of total appointments, completed sessions, no-shows, no-show rates, and revenue. The getStats function queries the appointments table, filtering by professionalId and startsAt within the last six months, and then aggregates the data. It uses the professional.sessionPrice to calculate revenue.
- get_upcoming: This tool retrieves a list of appointments scheduled for the next seven days, including client names, dates, and statuses. The getUpcoming function joins appointments with clients to fetch the necessary details.
- get_no_show_clients: This tool identifies and ranks clients with the highest no-show rates over the past six months. The getNoShowClients function performs a grouped query on appointments and clients, counting total appointments and no-shows per client.
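The month-by-month aggregation behind get_stats can be sketched as a single pass over appointment rows. Several details here are assumptions rather than facts from the source: the status values, the definition of no-show rate as no-shows divided by all appointments in the month, revenue counted as sessionPrice per completed session, and grouping by UTC month instead of São Paulo local time. The names aggregateByMonth and MonthStats are hypothetical.

```typescript
// Assumed status values; the real schema may differ.
type Status = "completed" | "no_show" | "confirmed" | "cancelled";
type Appointment = { startsAt: Date; status: Status };
type MonthStats = { total: number; completed: number; noShows: number; noShowRate: number; revenue: number };

function aggregateByMonth(appointments: Appointment[], sessionPrice: number): Map<string, MonthStats> {
  const byMonth = new Map<string, MonthStats>();
  for (const appt of appointments) {
    // "YYYY-MM" key in UTC (a simplification; see the assumptions above).
    const month = appt.startsAt.toISOString().slice(0, 7);
    const stats = byMonth.get(month) ?? { total: 0, completed: 0, noShows: 0, noShowRate: 0, revenue: 0 };
    stats.total += 1;
    if (appt.status === "completed") {
      stats.completed += 1;
      stats.revenue += sessionPrice; // assumed: only completed sessions earn revenue
    } else if (appt.status === "no_show") {
      stats.noShows += 1;
    }
    stats.noShowRate = stats.noShows / stats.total;
    byMonth.set(month, stats);
  }
  return byMonth;
}
```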
Access Control and Rate Limiting
Before processing any AI chat request, the POST handler performs several checks:
- Authentication: It verifies the user's identity using getUser and ensures they are associated with a professional profile via getProfessional.
- Rate Limiting: To prevent abuse and manage resource consumption, the getAiChatLimiter applies a rate limit of 20 requests per hour per authenticated user. This is a crucial operational safeguard.
- Feature Gating: The canUseFeature utility checks if the professional's current plan (e.g., "Pro" or "Premium") allows access to the "assistant" feature, potentially considering trialEndsAt. This ensures that AI capabilities are aligned with subscription tiers.
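To make the 20-requests-per-hour rule concrete, here is a small sliding-window limiter kept in process memory. It is an illustrative sketch only: SlidingWindowLimiter and aiChatLimiter are hypothetical names, and the document does not say how getAiChatLimiter is actually implemented or where it stores its counters.

```typescript
// Hypothetical in-memory sliding-window limiter (state is lost on restart).
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the user is under the limit.
  allow(userId: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only timestamps that still fall inside the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(userId, recent);
    return allowed;
  }
}

// 20 requests per hour per user, matching the limit described above.
const aiChatLimiter = new SlidingWindowLimiter(20, 60 * 60 * 1000);
```

A handler would call aiChatLimiter.allow(userId) before invoking the model and return an HTTP 429 response when it yields false.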
System Prompt
The SYSTEM_PROMPT for the in-app assistant is designed to guide the LLM in providing business insights. It instructs the LLM to act as the Odys assistant, respond in Portuguese, use tools for data retrieval, and format financial values in BRL. It also provides specific formatting guidelines for responses related to "Taxa de no-show" and "Resumo do mês," ensuring consistency and clarity in the generated reports.
Design Decisions
- Two-Pass Tool-Calling Pattern: This pattern was chosen to enhance the reliability and safety of tool interactions. By separating the LLM's decision to call a tool from its final response generation, the system gains an opportunity to validate, execute, and process tool results before presenting them to the user. This reduces the risk of the LLM fabricating information or misinterpreting tool capabilities, which is a common challenge in single-pass tool-calling architectures.
- Groq and Llama 3.3 70B: The selection of Groq as the inference engine and Llama 3.3 70B as the model was a trade-off balancing performance, cost, and model capability. Groq's specialized hardware offers exceptionally fast inference, which is critical for real-time conversational agents like the WhatsApp intake, where user experience is directly tied to response latency. Llama 3.3 70B provides a robust and versatile foundation for understanding complex queries and generating high-quality text, making it suitable for both client-facing and analytical tasks.
- Dedicated AI Agents for Distinct Use Cases: Instead of a monolithic AI, Odys employs two separate AI agents (ai-intake.ts and api/ai/chat/route.ts). This modular design allows for highly specialized SYSTEM_PROMPTs and TOOLS tailored to each agent's specific domain (client booking vs. professional analytics). This separation improves the LLM's focus and accuracy for each task, as it's not burdened with irrelevant context or tools.
- Atomic Transaction for Booking: The bookAppointment function in src/lib/ai-intake.ts uses a Drizzle ORM transaction with isolationLevel: "serializable". This is a critical design choice to ensure data consistency and prevent race conditions. By wrapping the conflict check, client upsert, and appointment insertion in a serializable transaction, the system guarantees that if two concurrent requests attempt to book the same slot, only one will succeed, and the other will gracefully fail with a SLOT_TAKEN error.
- Explicit Timezone Handling: The src/lib/ai-intake.ts module includes explicit timezone helper functions (saoPauloDate, formatSaoPauloTime, formatSaoPauloDate). This is essential for accurately managing appointment times and availability. Since Brazil abolished daylight saving time in 2019, São Paulo has stayed at a fixed UTC-3 offset, so the helpers can apply that offset directly. Without precise timezone management, scheduling conflicts or incorrect availability displays could arise.
- Rate Limiting and Feature Gating: The api/ai/chat/route.ts implements rate limiting (getAiChatLimiter) and feature gating (canUseFeature). These are operational decisions to manage infrastructure costs associated with LLM usage and to align AI features with different subscription plans. Rate limiting prevents a single user from exhausting the Groq API budget, while feature gating ensures that advanced AI capabilities are part of the value proposition for higher-tier plans.
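As an illustration of the explicit timezone handling listed above, the sketch below converts between São Paulo wall-clock time and UTC using the fixed UTC-3 offset. The function names toUtcFromSaoPaulo and toSaoPauloHHMM are hypothetical and are not the module's own helpers, whose signatures this document does not show.

```typescript
// São Paulo has been at a fixed UTC-3 offset since Brazil abolished DST in 2019.
const SAO_PAULO_OFFSET_MS = -3 * 60 * 60 * 1000;

// Interprets "YYYY-MM-DD" and "HH:MM" as São Paulo local time.
function toUtcFromSaoPaulo(date: string, time: string): Date {
  // The explicit offset makes parsing independent of the server's own timezone.
  return new Date(`${date}T${time}:00-03:00`);
}

// Formats an instant as São Paulo local "HH:MM".
function toSaoPauloHHMM(instant: Date): string {
  // Shift by the offset, then read the UTC fields of the shifted value.
  const shifted = new Date(instant.getTime() + SAO_PAULO_OFFSET_MS);
  const hh = String(shifted.getUTCHours()).padStart(2, "0");
  const mm = String(shifted.getUTCMinutes()).padStart(2, "0");
  return `${hh}:${mm}`;
}
```

Hard-coding the offset is only safe while the region keeps a fixed offset; a version based on the IANA zone America/Sao_Paulo would also cover historical dates when DST was still in effect.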
Potential Improvements
- Enhanced Tool Parameter Validation: In src/lib/ai-intake.ts, the getAvailableSlots function performs a basic regex check for the date format. This could be expanded to include more robust validation, such as checking if the date is in the past or if it falls within a reasonable future range. Similarly, bookAppointment could validate the time format more strictly. This would reduce the system's reliance on the LLM generating perfectly formatted inputs and allow for more graceful error handling for malformed requests.
- Asynchronous Notification Handling: The bookAppointment function in src/lib/ai-intake.ts sends WhatsApp and email notifications synchronously after the appointment is created. While functional, this can introduce latency to the API response if external services are slow. Decoupling these notifications into an asynchronous queue (e.g., using a message broker or a dedicated background job system) would improve the responsiveness of the booking process and make it more resilient to transient issues with notification providers.
- Dynamic System Prompt Augmentation: The SYSTEM_PROMPTs in both src/lib/ai-intake.ts and src/app/api/ai/chat/route.ts are largely static. For the WhatsApp agent, the buildSystemPrompt function already injects the professional's name and profession. This could be extended to dynamically include other relevant professional details (e.g., specific services offered, special instructions) from the professionals table, providing the LLM with richer context without needing to call get_professional_info explicitly for every interaction.
- More Granular Error Reporting from Tools: Currently, tool implementations often return generic error messages like { error: "Profissional não encontrado" } or { success: false, reason: "Este horário acabou de ser preenchido." }. While the LLM can interpret these, providing more structured error codes or types (e.g., {"errorCode": "PROFESSIONAL_NOT_FOUND"}) would allow the LLM to generate more precise and helpful user-facing error messages, and also enable the application to log or react to specific error conditions programmatically.
- Conversation Summarization for Long Chats: The handleIncomingMessage function in src/lib/ai-intake.ts appends all previous messages to llmMessages. For very long conversations, this can lead to increased token usage and potentially hit context window limits. Implementing a conversation summarization step (e.g., periodically summarizing older messages into a concise context string) could optimize token usage and maintain relevant context over extended interactions.
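Two of the suggestions above, stricter date validation and structured error codes, combine naturally. The sketch below is one possible shape, not existing Odys code: ToolResult, ToolErrorCode, validateDateArg, and the 90-day default horizon are all hypothetical.

```typescript
// Hypothetical structured result type for tool handlers.
type ToolErrorCode = "INVALID_DATE" | "DATE_IN_PAST" | "DATE_TOO_FAR";
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; errorCode: ToolErrorCode; message: string };

const DAY_MS = 24 * 60 * 60 * 1000;

// `today` is passed in as "YYYY-MM-DD" so the check is deterministic and testable.
function validateDateArg(date: string, today: string, maxDaysAhead: number = 90): ToolResult<string> {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    return { ok: false, errorCode: "INVALID_DATE", message: "Expected YYYY-MM-DD." };
  }
  const parsed = new Date(`${date}T00:00:00Z`);
  // Reject impossible calendar dates such as 2025-02-30, whether the engine
  // yields an invalid Date or silently rolls over into the next month.
  if (Number.isNaN(parsed.getTime()) || parsed.toISOString().slice(0, 10) !== date) {
    return { ok: false, errorCode: "INVALID_DATE", message: "Not a real calendar date." };
  }
  // ISO dates compare correctly as plain strings.
  if (date < today) {
    return { ok: false, errorCode: "DATE_IN_PAST", message: "Date is in the past." };
  }
  const daysAhead = (parsed.getTime() - Date.parse(`${today}T00:00:00Z`)) / DAY_MS;
  if (daysAhead > maxDaysAhead) {
    return { ok: false, errorCode: "DATE_TOO_FAR", message: `Date is more than ${maxDaysAhead} days ahead.` };
  }
  return { ok: true, data: date };
}
```

With a discriminated union like this, application code can branch on errorCode for logging while the LLM still receives a readable message.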
References
- src/lib/ai-intake.ts
- src/app/api/ai/chat/route.ts