Gemini API in Ruby: Building AI Agents Without an SDK
Build Gemini AI agents in Ruby or Rails without an official SDK: Interactions API, Faraday, function calling, server-side state, and production guardrails.
Build an AI agent in Rails and the prompt turns out to be the least of your worries. What actually decides whether it ships is the seam between your application and the model provider: how tools are declared, how function calls get executed, how conversation state is kept, how failures surface, and how much provider-specific JSON ends up smeared across your jobs, controllers, and views.
For Gemini, that seam is cleanest through the Interactions API, its newer interface for agentic workflows. Rather than treating every call as a one-off content generation request, an interaction represents a complete turn in a task or conversation, carrying user input, model output, reasoning steps, tool calls, tool results, usage metadata, and state you can reuse on the next turn. An agent lives or dies on that carried-over state; a plain request-and-response throws it away every single turn.
There is one wrinkle if you work in Ruby: Google ships official GenAI SDKs for Python, JavaScript, Go, Java, and C#, but not Ruby. So you call the REST API yourself, usually through Faraday, and you own the client boundary. That ownership sounds heavier than it turns out to be. If you are building against the Anthropic API instead, the architecture is similar but the client boundary differs, which I covered in Building AI Agents in Ruby with the Anthropic SDK.

| Feature | generateContent API | Interactions API |
|---|---|---|
| Primary use | Standalone text generation | Multi-turn agentic workflows |
| Server-side state | No | Yes (interaction IDs) |
| Background execution | No | Yes (background: true) |
| Observable execution steps | No | Yes (step log in response) |
| Multi-turn without full resend | No | Yes (previous_interaction_id) |
| Google recommendation | Legacy, still supported | Primary for new projects |
Why the Interactions API Is the Right Default
For any new Gemini agent, reach for the Interactions API rather than generateContent. generateContent still works, but it treats every call as a standalone completion, and an agent is not a string of standalone completions. It is a workflow with memory: it takes some input, decides whether it needs a tool, waits while your code runs that tool, picks up where it left off, and goes around again until it actually has an answer.
And while the model runs that loop, your app has its own loop to run: streaming progress to the user, persisting state for the next turn, handing slow tools to a background job, and keeping enough step data to explain what happened when a run goes sideways in production.
The Interactions API models all of that as a first-class resource. The core object is the interaction, one complete turn in a conversation or task, and it carries its execution steps in order: user input, model thoughts, function calls, function results, and the final model output. Parsing that beats reconstructing intent from a flat completion response, where you end up regexing the text or guessing from a finish_reason whether the model wanted a tool at all. Here the structure is already in the payload; you read it instead of inferring it.
The Ruby Problem
Ruby is not on Google’s SDK list, and if you have wired Rails up to a Google product before, you already knew it would not be. That leaves a Rails app three realistic options:
- Use a community Gemini gem.
- Use a multi-provider abstraction.
- Call the REST API directly.
The gems are fine, and for plain model calls they will get you to a working demo fastest. The moment you start caring about the exact shape of the provider boundary, though - the request format, the response parsing, retries, logging, how streaming behaves, where tool execution actually happens - calling REST directly starts to win. The cost is that everything an SDK normally hides is now yours to write and maintain. In Ruby that means a small Faraday client and an adapter that turns your app’s internal idea of an agent into Gemini interaction requests. In production it lands at a few hundred lines of client code: narrow enough to own without a second thought, specific enough that when a run fails you know exactly which file to open.
Two Interaction Surfaces
There are two surfaces, and they differ by authentication, not just URL. The Gemini Developer API uses a simple API key, so you can prototype from a console with one env var. The Enterprise Agent Platform on Google Cloud uses IAM and a Bearer token, and is where managed agents, governance, and the people who own your Google Cloud bill all live.
The Gemini Developer API exposes interactions at:
POST https://generativelanguage.googleapis.com/v1beta/interactions
No token to mint, no project to configure, just the key in a header and you are making calls. That simplicity is exactly where the Enterprise path parts ways.
Gemini Enterprise Agent Platform exposes the same capability through Google Cloud, on a project- and location-scoped path on aiplatform.googleapis.com instead of the flat Developer API path. This one has moved around as the platform evolved, so treat the path below as a placeholder and confirm the current one against the Enterprise Agent Platform reference before you wire it in:
# Placeholder shape - verify against the current Enterprise Agent Platform reference
POST https://aiplatform.googleapis.com/v1beta1/projects/{project}/locations/{location}/interactions
That is the Google Cloud path, with Google Cloud authentication, and it is where you land if you are building around managed agents, enterprise governance, or infrastructure you already run on Google Cloud. The request bodies look much the same on both, but operationally they are different animals: one is an API key you drop into an env var, the other drags in projects, locations, IAM, billing, and platform controls. A Ruby client should decide which of those two worlds it lives in once, at construction time, and keep the rest of your code from ever thinking about it.
class GeminiInteractionsClient
  def initialize(api_key: nil, project: nil, location: "global", authorizer: nil)
    @api_key = api_key
    @project = project
    @location = location
    @authorizer = authorizer
  end

  # An API key means the Developer API; anything else is the Google Cloud path.
  def developer_api?
    @api_key.present?
  end

  def base_url
    developer_api? ? "https://generativelanguage.googleapis.com" : "https://aiplatform.googleapis.com"
  end

  def interactions_path
    if developer_api?
      "/v1beta/interactions"
    else
      "/v1beta1/projects/#{@project}/locations/#{@location}/interactions"
    end
  end
end
The developer_api? switch carries that whole decision: one boolean, set once, and nothing downstream has to ask again. The one sharp edge is the path itself. The Developer API’s /v1beta/interactions is verified; the Enterprise path is not, so treat it as a value you fill in from the current reference rather than copying mine.
A Small Faraday Client
A minimal client is mostly just careful, boring HTTP.
class GeminiInteractionsClient
  def create_interaction(payload)
    response = connection.post(interactions_path) do |request|
      request.headers.update(auth_headers)
      request.params.update(query_params)
      request.body = payload
    end

    response.body
  end

  private

  def connection
    @connection ||= Faraday.new(url: base_url) do |faraday|
      faraday.request :json
      faraday.response :json
      faraday.response :raise_error
      faraday.options.open_timeout = 5
      faraday.options.timeout = 60
      faraday.adapter Faraday.default_adapter
    end
  end

  def query_params
    {}
  end

  def auth_headers
    return { "x-goog-api-key" => @api_key } if developer_api?

    headers = {}
    @authorizer.apply!(headers)
    headers
  end
end
The line that earns its keep has nothing to do with Gemini:
faraday.response :raise_error
Leave it out and Faraday will cheerfully hand you a failed HTTP response as if it were a normal JSON body. Inside an agent loop that shows up as a blank or half-parsed assistant message, and you lose an afternoon debugging the model when the real problem was a 401 from an expired key. A failed model call should look like a failed model call, not like the model deciding to say nothing.
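Once failures raise, the next step is telling them apart. Here is a minimal sketch, in plain Ruby, of mapping an HTTP status to an application-level error so the agent loop can distinguish an expired key from a rate limit. The GeminiErrors module and every class in it are hypothetical names of mine, not part of Faraday or Gemini; in a real client the status would come from the exception that raise_error raises.

```ruby
# Hypothetical application-level errors; none of these names come from Gemini or Faraday.
module GeminiErrors
  class ProviderError < StandardError
    attr_reader :status

    def initialize(message, status:)
      super(message)
      @status = status
    end

    # Whether retrying the same request later could reasonably succeed.
    def retryable?
      false
    end
  end

  class AuthError < ProviderError; end
  class BadRequest < ProviderError; end

  class RateLimited < ProviderError
    def retryable?
      true
    end
  end

  class ServerError < ProviderError
    def retryable?
      true
    end
  end

  # Pick an error class from an HTTP status code.
  # The specific statuses are listed before the generic 4xx range on purpose.
  def self.for_status(status)
    case status
    when 401, 403 then AuthError
    when 429 then RateLimited
    when 400..499 then BadRequest
    when 500..599 then ServerError
    else ProviderError
    end
  end

  # Build an error instance for a failed response.
  def self.build(status)
    for_status(status).new("Gemini request failed with HTTP #{status}", status: status)
  end
end
```

With something like this in place, a job can retry only when retryable? is true and fail fast on everything else.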
Creating an Interaction
A minimal Gemini interaction needs exactly two things: a model name and an input. After all the talk of execution steps and server-side state, the smallest useful payload is almost anticlimactic. System instructions, tools, and generation config layer on top.
payload = {
  model: "gemini-3.5-flash",
  input: "Explain the difference between optimistic and pessimistic locking in Rails."
}

client.create_interaction(payload)
For an agent, you normally add a system instruction and tools.
payload = {
  model: "gemini-3.5-flash",
  system_instruction: "You are a careful assistant inside a Rails application. Use tools when you need application data. Do not guess internal records.",
  input: "Which invoices are overdue for customer 123?",
  tools: [
    {
      type: "function",
      name: "lookup_customer_invoices",
      description: "Look up invoices for one known customer. Use this when the user asks about that customer's invoices, payment status, or overdue balance.",
      parameters: {
        type: "object",
        properties: {
          customer_id: {
            type: "integer",
            description: "The internal customer ID. Do not guess this value."
          },
          status: {
            type: "string",
            enum: ["draft", "open", "paid", "overdue"],
            description: "Optional invoice status filter."
          }
        },
        required: ["customer_id"]
      }
    }
  ]
}
Notice the shape: tools are part of the model interface, not just implementation details. The function name, description, parameter names, parameter descriptions, enum values, and required fields are all part of how the model decides what to do, so a sloppy declaration produces a sloppy agent.
Tool Design Matters More Than Prompt Tricks
The temptation is to point a tool straight at a controller action or service object you already have and call it done. Resist it. A human using your existing API can read a cluttered table, notice something looks off, click around, recover from a bad guess, and ask a follow-up. The model gets one shot at reading your description and parameters, so the interface has to make the right next move obvious on its own.
A tool the model can actually use well answers a few questions up front:
- When should this tool be used?
- When should it not be used?
- What identifiers are safe to pass?
- What should the model do if the identifier is missing?
- What does the result mean?
- Is this a read, preview, write, or destructive action?
- Does the action require user confirmation?
For example, this tool is too vague:
{
  type: "function",
  name: "lookup",
  description: "Looks things up.",
  parameters: {
    type: "object",
    properties: {
      id: { type: "integer" }
    }
  }
}
This is better:
{
  type: "function",
  name: "lookup_customer_invoices",
  description: "Look up invoices for one known customer. Use this only after the customer has been resolved to an internal customer_id. This tool does not search customers by name and does not create or update invoices.",
  parameters: {
    type: "object",
    properties: {
      customer_id: {
        type: "integer",
        description: "The internal customer ID. Do not guess this. Resolve the customer first if needed."
      },
      status: {
        type: "string",
        enum: ["draft", "open", "paid", "overdue"],
        description: "Optional invoice status filter."
      },
      limit: {
        type: "integer",
        description: "Maximum number of invoices to return. Defaults to 20."
      }
    },
    required: ["customer_id"]
  }
}
The second version is not just wordier. It hands the model an actual decision boundary: when to call this, when not to, and what to do when it does not have the customer_id yet. That is the difference between a tool the model uses correctly and one it guesses at.
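You can catch the vague kind of declaration before it ever reaches the model. The sketch below is a hypothetical lint, lint_tool_declaration, that checks a declaration hash of the shape used above. The rules and the 40-character threshold are my own heuristics rather than anything Gemini enforces, and a tool with no required parameters can be perfectly legitimate.

```ruby
# Hypothetical lint for tool declarations shaped like the hashes above.
# Returns a list of human-readable problems; an empty list means the declaration passed.
def lint_tool_declaration(tool)
  problems = []
  name = tool[:name].to_s
  description = tool[:description].to_s
  parameters = tool[:parameters] || {}
  properties = parameters[:properties] || {}
  required = Array(parameters[:required]).map(&:to_s)

  problems << "name is missing" if name.empty?
  # The 40-character floor is an arbitrary threshold chosen for this sketch.
  problems << "description is too short to say when to use the tool" if description.length < 40
  # Heuristic only: some tools genuinely take no required parameters.
  problems << "no required parameters declared" if required.empty?

  properties.each do |param, schema|
    problems << "parameter #{param} has no description" if schema[:description].to_s.empty?
  end

  # Every required name must also be declared under properties.
  (required - properties.keys.map(&:to_s)).each do |missing|
    problems << "required parameter #{missing} is not declared in properties"
  end

  problems
end
```

Run against the vague lookup declaration, this flags the short description, the undescribed id parameter, and the missing required list; the lookup_customer_invoices version passes clean. A check like this fits naturally in a unit test over every declaration the registry exposes.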
Reading Execution Steps
The response is step-oriented, so parse it that way. The reflex from generateContent days is to grab the text field and move on, but here that habit drops half the payload on the floor: model thoughts, function calls, results, and usage metadata all live in the steps. Normalize them into internal step objects at the provider boundary rather than letting raw Gemini JSON flow into your jobs and views:
module Agent
  Step = Data.define(:id, :type, :name, :arguments, :content, :raw)
  FunctionCall = Data.define(:id, :name, :arguments, :raw)
  ModelOutput = Data.define(:text, :raw)
end
Then the rest of your Rails app deals with your own objects, not raw Gemini JSON. That boundary matters: if raw provider JSON spreads through your jobs, controllers, views, and service objects, every provider change becomes an application-wide refactor. Keep the weirdness at the edge. Keep the function_call id on each step too, because tool results have to reference the call they answer.
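To make the normalization concrete, here is a minimal sketch of a parser at that boundary. The response shape is an assumption: I am guessing at a top-level "steps" array whose entries carry "id", "type", "name", "arguments", and "text" keys, and the list of known step types is a placeholder too, so check the real field names against your own fixtures. It uses Struct rather than Data only so the sketch also runs on Ruby versions before 3.2.

```ruby
require "json"

module AgentSketch
  # Struct stands in for the Data classes above so this runs on Ruby older than 3.2.
  Step = Struct.new(:id, :type, :name, :arguments, :content, :raw, keyword_init: true)

  # ASSUMPTION: placeholder type names, not a confirmed list from the API.
  KNOWN_TYPES = %w[user_input thought function_call function_result model_output].freeze

  # Turn a parsed response hash into Step objects.
  # ASSUMPTION: steps live under "steps" and use the keys read below.
  def self.parse_steps(response)
    Array(response["steps"]).map do |raw|
      # Unrecognized types are kept as "unknown" instead of raising,
      # so a new provider step type degrades gracefully.
      type = KNOWN_TYPES.include?(raw["type"]) ? raw["type"] : "unknown"

      Step.new(
        id: raw["id"],
        type: type,
        name: raw["name"],
        arguments: raw["arguments"] || {},
        content: raw["text"],
        raw: raw
      )
    end
  end
end
```

Keeping the original hash in raw means an unknown step still lands in your logs intact, even though the rest of the app only ever sees the normalized type.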
The Agent Loop
The agent loop is small enough to hold in your head: create an interaction, read its execution steps, run any tools the model asked for, hand the results back, and go around again until the model stops asking for tools and just answers. Every agent framework you have heard of is, underneath the marketing, some version of this.
class AgentRunner
  MAX_STEPS = 8

  def initialize(client:, tool_registry:)
    @client = client
    @tool_registry = tool_registry
  end

  def run(input:, previous_interaction_id: nil)
    interaction_id = previous_interaction_id
    final_output = nil

    MAX_STEPS.times do
      response = @client.create_interaction(
        build_payload(input: input, previous_interaction_id: interaction_id)
      )
      interaction_id = response["id"]
      steps = parse_steps(response)
      function_calls = steps.select { |step| step.type == "function_call" }

      # No tool requests left: the model has answered, so stop looping.
      if function_calls.empty?
        final_output = extract_model_output(steps)
        break
      end

      tool_results = function_calls.map do |call|
        execute_tool(call)
      end
      input = tool_results_to_input(tool_results)
    end

    {
      interaction_id: interaction_id,
      output: final_output
    }
  end
end
Production code piles on stricter parsing, real error handling, streaming, and observability, but the skeleton never changes: create, read steps, run tools, continue. The tool_results_to_input helper is where you build the continuation payload. A function result is a function_result step whose call_id matches the id of the function_call it answers, with result as an array of content objects:
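def tool_results_to_input(tool_results)
  # Each result is a hash with :call_id, :name, and :output keys. Read them
  # explicitly: on Ruby 3, a block with keyword parameters does not unpack a
  # hash element, so |call_id:, name:, output:| would raise ArgumentError.
  tool_results.map do |result|
    {
      type: "function_result",
      call_id: result[:call_id],
      name: result[:name],
      result: [{ type: "text", text: result[:output].to_json }]
    }
  end
end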
That array becomes the next request’s input, which is why keeping the function_call id on each step matters: on a turn with several calls, it is the only thing linking each result back to its call. The rest of the history never appears in the payload. Instead of rebuilding the context array yourself, you pass previous_interaction_id, and the full history is the API’s problem.
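The loop leans on an execute_tool helper that the skeleton leaves undefined. One way to write it, sketched under my own assumptions: it takes the registry explicitly plus a call object that responds to id, name, and arguments, and it always returns a hash with call_id, name, and output, so a failing tool becomes data the model can read instead of a crashed run. Whether you want to surface every error to the model this way is a judgment call; authorization failures in particular may deserve a blander message.

```ruby
# Hypothetical helper: run one function call and always return a result hash.
# `registry` is anything that responds to call(name, arguments);
# `call` is anything that responds to id, name, and arguments.
def execute_tool(registry, call)
  output = registry.call(call.name, call.arguments)
  { call_id: call.id, name: call.name, output: output }
rescue StandardError => error
  # Report the failure as the tool's output so the model can react to it,
  # while keeping the call_id that links this result to its function call.
  {
    call_id: call.id,
    name: call.name,
    output: { error: error.class.name, message: error.message }
  }
end
```

Because the result hash keeps call_id in both branches, a turn with three calls and one failure still produces three correctly linked results.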
Server-Side State with previous_interaction_id
Pass an ID instead of a transcript. When an interaction completes, the API returns an ID; on the next turn you pass it as previous_interaction_id and Gemini retrieves the conversation history from the prior interaction rather than waiting for you to resend it.
payload = {
  model: "gemini-3.5-flash",
  previous_interaction_id: previous_interaction_id,
  input: "Now summarize that in three bullet points.",
  system_instruction: system_instruction,
  tools: tool_declarations
}
This changes the architecture, not just your convenience. Without server-side state, your app keeps rebuilding the full model conversation: you append user messages, model messages, tool calls, tool results, and maybe compact older turns as the context grows. With server-side state, Gemini retrieves the previous history from the prior interaction ID, which can improve cache hit rates and cuts the context you resend on multi-turn conversations. In a customer support workflow I shipped for a SaaS product, ten-turn conversations dropped from a few thousand tokens resent per turn down to the latest message plus an ID.
One detail trips people up: the history carries over, but interaction-scoped settings do not. Tools, system instructions, temperature, thinking level - none of those ride along automatically just because you passed an interaction ID. You resend them on every request, which means your Rails code wants one durable place to keep that configuration:
class AgentConfig
  def system_instruction
    "You are a careful assistant inside a Rails application..."
  end

  def tool_declarations
    ToolRegistry.declarations
  end

  def generation_config
    {
      temperature: 0.2,
      thinking_level: "medium"
    }
  end
end
Then feed that same config into every interaction, new turn or continuation alike.
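A small builder makes "same config on every request" hard to get wrong. This is a sketch, not the API’s required shape: build_payload is a hypothetical helper, config is any object that responds to the three methods on AgentConfig, and placing the generation settings under a generation_config key is my assumption.

```ruby
# Hypothetical builder: merges the durable configuration into every request body.
# `config` must respond to system_instruction, tool_declarations, and generation_config.
def build_payload(config:, model:, input:, previous_interaction_id: nil)
  payload = {
    model: model,
    input: input,
    system_instruction: config.system_instruction,
    tools: config.tool_declarations,
    # ASSUMPTION: generation settings travel under a generation_config key.
    generation_config: config.generation_config
  }

  # Only continuations carry an interaction ID; a first turn omits the key entirely.
  payload[:previous_interaction_id] = previous_interaction_id if previous_interaction_id
  payload
end
```

The first turn and the tenth go through the same method, so a continuation can never quietly drop its tools or system instruction.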
store=false vs Server-Side State
Storing conversation data on Google’s servers is a product and privacy decision before it is a technical one. By default the Interactions API stores interaction objects, and that is what makes previous_interaction_id, background execution, and step-level observability possible. If you would rather nothing was stored, send:
{
  store: false
}
But there is a trade-off. If you disable storage, you cannot use stored-state features such as previous_interaction_id, and it is incompatible with background execution. The trap is shipping with store: true by default and only learning what retention means when someone in legal reads the data-processing terms back to you.
How that shakes out in practice:
- A throwaway prototype: stored interactions, nobody cares.
- A normal chat assistant: stored, as long as you actually understand the retention window.
- A workflow touching sensitive data: store: false, full stop.
- A background research task: storage on, because it has no choice.
- A regulated production system: pick a mode and write down why, because an auditor will eventually ask.
Make this a deliberate call, not a client default nobody revisits.
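The two incompatibilities are cheap to enforce before a request leaves your app. Here is a hedged sketch of such a guard; validate_storage_options! and InvalidInteractionOptions are hypothetical names, and the guard only encodes the constraints described in this section, with a missing store key treated as stored.

```ruby
# Hypothetical error raised when a payload combines options that cannot work together.
class InvalidInteractionOptions < ArgumentError; end

# Check a request payload before sending it. Returns the payload when it is consistent.
def validate_storage_options!(payload)
  # Storage is described as on by default, so a missing key counts as stored.
  stored = payload.fetch(:store, true)
  return payload if stored

  if payload[:previous_interaction_id]
    raise InvalidInteractionOptions, "previous_interaction_id requires stored interactions"
  end
  raise InvalidInteractionOptions, "background: true requires store: true" if payload[:background]

  payload
end
```

Failing locally with a clear message beats discovering the conflict from a provider error in the middle of a run.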
Background Execution
Use background: true with store: true to run long Gemini interactions asynchronously, because a multi-tool agent run will happily outlive both your Puma worker’s timeout and the user’s patience. The controller creates the agent run record and returns immediately, and a background job manages the interaction lifecycle, persists steps, and notifies the UI.
The Interactions API supports background execution with:
{
  background: true,
  store: true
}
An ApplicationJob is the natural wrapper - it gives you retries, error recording, and queue management with no agent-specific plumbing:
class AgentRunJob < ApplicationJob
  queue_as :default

  def perform(agent_run_id)
    agent_run = AgentRun.find(agent_run_id)

    result = AgentRunner.new(
      client: GeminiInteractionsClient.build,
      tool_registry: ToolRegistry.new(agent_run.user)
    ).run(
      input: agent_run.input,
      previous_interaction_id: agent_run.previous_interaction_id
    )

    agent_run.update!(
      output: result[:output],
      previous_interaction_id: result[:interaction_id],
      status: "completed"
    )
  rescue => error
    # agent_run is nil when the lookup itself failed, so guard the update
    # instead of masking the original error with a NoMethodError.
    agent_run&.update!(
      status: "failed",
      error_class: error.class.name,
      error_message: error.message
    )
    raise
  end
end
The model provider should not dictate your Rails architecture. Treat the interaction as an external workflow and wrap it in normal application primitives: jobs, records, logs, retries, and authorization. For queue setup and concurrency tuning that affects agent throughput, see the practical Solid Queue guide.
Streaming Events to the UI
Streaming Gemini agent output matters whenever a user is watching. A multi-tool run can take ten or twenty seconds, and a spinner with no movement reads as broken. The Interactions API streams events, so in Ruby you parse a streaming HTTP response and treat each event as another update to the run. The Faraday adapter you pick matters more than usual here. I have watched a streaming agent behave perfectly in rails console and then buffer the entire response under Puma, flushing nothing until the run finished, because the adapter’s on_data callback fired differently under the app server than it did interactively. net_http works, but confirm the callbacks actually arrive in your real runtime before you trust them. The parser looks like this:
def stream_interaction(payload)
  buffer = +""

  connection.post(interactions_path) do |request|
    request.headers.update(auth_headers)
    request.params.update(query_params)
    request.body = payload.merge(stream: true)
    # A proc rather than a lambda: procs tolerate extra arguments, and some
    # Faraday versions call on_data with a third env argument.
    request.options.on_data = proc do |chunk, _bytes, _env|
      buffer << chunk
      while (line_end = buffer.index("\n"))
        line = buffer.slice!(0..line_end).strip
        handle_stream_line(line)
      end
    end
  end

  # Flush whatever is left once the stream closes.
  handle_stream_line(buffer.strip) if buffer.present?
end

def handle_stream_line(line)
  return if line.blank?
  return unless line.start_with?("data:")

  payload = line.delete_prefix("data:").strip
  return if payload.empty?

  event = JSON.parse(payload)
  handle_interaction_event(event)
rescue JSON::ParserError
  # Keepalive or non-JSON sentinel line. Ignore it and wait for the next event.
end
A production parser needs to be more defensive:
- Ignore empty keepalive lines.
- Flush the final buffer after the stream closes.
- Do not assume every line is JSON.
- Reset buffers between retries.
- Persist enough event data to debug broken runs.
- Do not report success until you see a terminal event.
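Several of those points get easier if the buffering lives in its own object, separate from the HTTP call. The sketch below is a hypothetical SseEventBuffer that you feed raw chunks from whatever callback your adapter provides. It covers the line-splitting, final flush, and reset-on-retry items from the list; persisting events and detecting a terminal event are left to the caller.

```ruby
require "json"

# Hypothetical line buffer for a "data:"-prefixed event stream.
# Transport-agnostic: feed it raw chunks from any HTTP streaming callback.
class SseEventBuffer
  def initialize(&on_event)
    @on_event = on_event
    @buffer = +""
  end

  # Append a chunk and emit every complete line it finishes.
  def feed(chunk)
    @buffer << chunk
    while (line_end = @buffer.index("\n"))
      handle_line(@buffer.slice!(0..line_end))
    end
  end

  # Call once after the stream closes to flush a final line that had no newline.
  def finish
    handle_line(@buffer)
    @buffer = +""
  end

  # Call before a retry so a half-received line never leaks into the next attempt.
  def reset
    @buffer = +""
  end

  private

  def handle_line(line)
    line = line.strip
    return unless line.start_with?("data:")

    data = line.delete_prefix("data:").strip
    return if data.empty?

    @on_event.call(JSON.parse(data))
  rescue JSON::ParserError
    # Non-JSON sentinel or keepalive payload: skip it and keep reading.
  end
end
```

Because it never touches the network, this is the piece you can cover with plain unit tests: split an event across two chunks, end the stream without a trailing newline, and assert on what comes out.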
The UI should not need raw provider events. Convert them to product events:
- agent.started
- agent.thinking
- agent.tool_call.started
- agent.tool_call.completed
- agent.output.delta
- agent.completed
- agent.failed
That keeps the frontend stable even if the provider event shape changes.
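The translation itself can be a lookup table. In this sketch the product event names on the right are the ones listed above, but every provider event type on the left is a placeholder I made up, as is the "text" field; substitute whatever types and fields you actually see in your stream fixtures.

```ruby
# ASSUMPTION: the provider event types on the left are placeholders, not
# confirmed Interactions API values. Replace them with the types you observe.
PRODUCT_EVENTS = {
  "interaction.start" => "agent.started",
  "thought.delta" => "agent.thinking",
  "function_call.start" => "agent.tool_call.started",
  "function_call.done" => "agent.tool_call.completed",
  "output.delta" => "agent.output.delta",
  "interaction.complete" => "agent.completed",
  "interaction.error" => "agent.failed"
}.freeze

# Convert one provider event hash into a product event, or nil if it has no
# product-level meaning. Unknown provider events are dropped, never forwarded raw.
def to_product_event(provider_event)
  name = PRODUCT_EVENTS[provider_event["type"]]
  return nil unless name

  # ASSUMPTION: text deltas arrive under a "text" key.
  { name: name, text: provider_event["text"] }
end
```

When the provider renames an event, the fix is one line in the table, and the frontend never hears about it.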
Function Calls and Tool Results
When Gemini emits a function call, your app decides whether to run it - and “whatever the model asked for” is not an acceptable answer. Route everything through a registry so there is exactly one place that maps a tool name to real code, and anything off the list raises instead of doing something surprising:
class ToolRegistry
  def initialize(user)
    @user = user
  end

  def call(name, arguments)
    case name
    when "lookup_customer_invoices"
      LookupCustomerInvoicesTool.new(user: @user).call(**arguments.symbolize_keys)
    else
      raise UnknownTool, name
    end
  end
end
Every tool enforces authorization internally. Authorization can’t live in the model or the prompt; the tool itself must check whether the current user is allowed to perform the requested action.
class LookupCustomerInvoicesTool
  def initialize(user:)
    @user = user
  end

  def call(customer_id:, status: nil, limit: 20)
    customer = Customer.find(customer_id)
    # Same policy object the human-facing controllers use.
    raise NotAuthorized unless CustomerPolicy.new(@user, customer).show?

    invoices = customer.invoices
    invoices = invoices.where(status: status) if status.present?
    invoices = invoices.limit(limit)

    {
      customer_id: customer.id,
      invoices: invoices.map do |invoice|
        {
          id: invoice.id,
          number: invoice.number,
          status: invoice.status,
          due_on: invoice.due_on,
          amount_cents: invoice.amount_cents
        }
      end
    }
  end
end
This is the part where being on Rails actually pays off. You are not inventing an authorization story for the agent; it reuses the same policies, models, and audit logs your human-facing controllers already run through. The agent is just one more caller with no special privileges, and a tool that skips the policy check is a bug you would catch anywhere else in the app.
The Write Tool Rule
Never let a write tool fire in one step. Split it into a preview tool the model can draft freely and a separate execute tool that runs only after the user confirms. Read tools are dangerous when they leak data; write tools are dangerous because they change reality, so they need the extra gate.
The model drafts the intended action, the user confirms it, and only then does your app execute the write. Here is a preview tool from a SaaS support workflow:
{
  type: "function",
  name: "preview_help_article",
  description: "Prepare a draft of a help-center article for review. This does not publish anything.",
  parameters: {
    type: "object",
    properties: {
      collection_id: { type: "integer" },
      title: { type: "string" },
      tone: {
        type: "string",
        enum: ["concise", "detailed", "beginner_friendly"]
      }
    },
    required: ["collection_id", "title"]
  }
}
Then a separate execution tool:
{
  type: "function",
  name: "publish_help_article",
  description: "Publish a previously previewed help-center article. Only use this after the user explicitly confirms publishing.",
  parameters: {
    type: "object",
    properties: {
      preview_id: { type: "string" },
      confirmation: {
        type: "string",
        description: "The exact user confirmation text."
      }
    },
    required: ["preview_id", "confirmation"]
  }
}
Do not let the model jump directly from “draft an article” to “publish it.” Agents need friction around side effects.
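The friction has to live in your code, not in the tool description. Below is a minimal sketch of a gate that sits between the preview tool and the execute tool. PreviewGate is a hypothetical, in-memory stand-in for what would be a user-scoped table in a real app, and it makes one design choice worth stating: the confirmation is recorded by your UI, so the confirmation string the model passes is never what unlocks the write.

```ruby
require "securerandom"

# Hypothetical in-memory gate between a preview tool and its execute tool.
# A real app would persist previews in a table scoped to the user.
class PreviewGate
  class NotConfirmed < StandardError; end

  def initialize
    @previews = {}
  end

  # Called by the preview tool: stores the draft and returns an opaque ID.
  def register(user_id:, draft:)
    preview_id = SecureRandom.uuid
    @previews[preview_id] = { user_id: user_id, draft: draft, confirmed: false }
    preview_id
  end

  # Called by your UI when the user clicks confirm, never by the model.
  def confirm!(preview_id, user_id:)
    fetch(preview_id, user_id)[:confirmed] = true
  end

  # Called by the execute tool: hands back the draft once, and only if confirmed.
  def consume!(preview_id, user_id:)
    preview = fetch(preview_id, user_id)
    raise NotConfirmed, "preview #{preview_id} has not been confirmed" unless preview[:confirmed]

    @previews.delete(preview_id)
    preview[:draft]
  end

  private

  # Unknown IDs and previews owned by another user fail the same way.
  def fetch(preview_id, user_id)
    preview = @previews[preview_id]
    raise NotConfirmed, "unknown preview" if preview.nil? || preview[:user_id] != user_id

    preview
  end
end
```

With this in place, publish_help_article calls consume! and simply fails if the model tries to skip ahead, and because consume! deletes the preview, one confirmation cannot be replayed into a second write.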
Observability
The Interactions API is more observable than a raw completion, and that lulls you into thinking you can skip your own logging. You cannot.
For every agent run, capture enough to reconstruct it cold: who ran it, which model and interaction IDs were involved, whether store and background were on, the tool calls requested versus the ones you actually executed, how long each took and how big its result was, the final status, and the usage metadata. Grab the provider request ID too, if you get one.
Create an agent_runs table and an agent_steps table, and persist the normalized steps you show in the UI.
create_table :agent_runs do |t|
  t.references :user, null: false
  t.string :provider, null: false
  t.string :model
  t.string :interaction_id
  t.string :previous_interaction_id
  t.string :status, null: false
  t.boolean :stored, null: false, default: true
  t.boolean :background, null: false, default: false
  t.jsonb :usage, null: false, default: {}
  t.text :input
  t.text :output
  t.text :error_class
  t.text :error_message
  t.timestamps
end

create_table :agent_steps do |t|
  t.references :agent_run, null: false
  t.string :step_type, null: false
  t.string :tool_name
  t.jsonb :arguments, null: false, default: {}
  t.jsonb :result, null: false, default: {}
  t.jsonb :raw, null: false, default: {}
  t.timestamps
end
The first time an agent does something genuinely strange in production, these tables are what let you reconstruct what it saw and why it acted on it. Without step-level logs you are guessing, and guessing about a probabilistic system is a miserable way to spend a Friday.
Production Guardrails
Give an agent tools and no ceiling and it eventually finds a failure mode you did not picture: a run that loops on the same tool forty times, one that burns the month’s token budget in an afternoon, a tool result big enough to OOM the job. A production Gemini agent in Rails needs hard limits.
At minimum:
- Maximum number of tool calls per run.
- Maximum execution time per run.
- Maximum tool result size.
- Maximum streamed event size.
- Allowlist of callable tools.
- Authorization inside every tool.
- User confirmation for writes.
- Audit log for side effects.
- Safe failure state.
- Clear status in the UI.
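The first three limits on that list fit in one small object that the loop consults as it goes. This is a sketch under my own assumptions: RunBudget is a hypothetical class, the default numbers are placeholders rather than recommendations, and the clock is injectable only so the time limit can be tested without sleeping.

```ruby
# Hypothetical per-run budget. The default limits are placeholders, not recommendations.
class RunBudget
  class Exceeded < StandardError; end

  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(max_tool_calls: 20, max_seconds: 120, max_result_bytes: 64 * 1024, clock: MONOTONIC_CLOCK)
    @max_tool_calls = max_tool_calls
    @max_seconds = max_seconds
    @max_result_bytes = max_result_bytes
    @clock = clock
    @started_at = clock.call
    @tool_calls = 0
  end

  # Call before each tool execution; raises once either limit is crossed.
  def charge_tool_call!
    @tool_calls += 1
    raise Exceeded, "tool call limit of #{@max_tool_calls} reached" if @tool_calls > @max_tool_calls
    raise Exceeded, "run exceeded #{@max_seconds}s" if @clock.call - @started_at > @max_seconds
  end

  # Call with the serialized tool result before sending it back to the model.
  def check_result!(serialized)
    return serialized if serialized.bytesize <= @max_result_bytes

    raise Exceeded, "tool result of #{serialized.bytesize} bytes exceeds #{@max_result_bytes}"
  end
end
```

An Exceeded error is a normal way for a run to end: rescue it in the job, mark the run failed with the reason, and show that status in the UI.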
The agent loop should also have a strict state machine.
VALID_TRANSITIONS = {
  "pending" => ["running", "failed"],
  "running" => ["waiting_for_confirmation", "completed", "failed"],
  "waiting_for_confirmation" => ["running", "cancelled"],
  "completed" => [],
  "failed" => [],
  "cancelled" => []
}
None of this trusts the model to behave, because it is probabilistic and will eventually do the unlikely thing. Your application owns the bounds; the model only gets to make suggestions inside them.
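A transition table only helps if something enforces it. Here is a minimal sketch of that enforcement; the AgentRunState module and its transition method are hypothetical names, and the table is repeated inside the module only so the example runs on its own.

```ruby
module AgentRunState
  class InvalidTransition < StandardError; end

  # Same table as above, repeated so the sketch is self-contained.
  VALID_TRANSITIONS = {
    "pending" => ["running", "failed"],
    "running" => ["waiting_for_confirmation", "completed", "failed"],
    "waiting_for_confirmation" => ["running", "cancelled"],
    "completed" => [],
    "failed" => [],
    "cancelled" => []
  }.freeze

  # Return the new state if the move is allowed, otherwise raise.
  def self.transition(from, to)
    allowed = VALID_TRANSITIONS.fetch(from) do
      raise InvalidTransition, "unknown state #{from}"
    end
    raise InvalidTransition, "cannot move from #{from} to #{to}" unless allowed.include?(to)

    to
  end

  # Terminal states have no outgoing transitions.
  def self.terminal?(state)
    VALID_TRANSITIONS.fetch(state, nil) == []
  end
end
```

Routing every status change through one method like this means a completed run can never drift back to running, whatever the model or a retried job tries to do.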
Where Thought Signatures Fit Now
If you have used a Gemini thinking model directly through the content generation APIs, you have probably hit this: you parse out the text, throw away the rest because it looks like internal noise, and two turns later the model has lost the thread. Those discarded blobs are thought signatures, the record of the reasoning chain Gemini’s thinking models attach so they can pick up where they left off. Keep only the text and you have thrown away the model’s scratch paper.
The Interactions API handles most of that for you, since the interaction resource and server-side state carry the reasoning chain between turns instead of making you reconstruct it by hand. The same principle applies anywhere you touch a Gemini thinking model: hang on to whatever state the provider hands back unless the docs explicitly say you can drop it. If a response gives you IDs, step metadata, environment IDs, or other continuation fields, treat them as part of the protocol and keep them around, rather than grabbing the text and binning the rest.
Choosing Developer API or Enterprise Agent Platform
In practice, most Rails teams start on the Developer API path and never need to leave it. An API key, an env var, and a Faraday client is enough to ship something real.
Use it when:
- You want to ship with an API key and an env var, not an IAM setup.
- You are calling Gemini models directly, not invoking managed agents.
- You do not need Google Cloud governance or enterprise platform controls.
- The Developer API’s data-handling model is acceptable for your domain.
Use Gemini Enterprise Agent Platform when:
- You are building around Google Cloud.
- You need IAM-based authentication.
- You want managed agents.
- You need enterprise platform controls.
- You want to invoke deployed agents through the Interactions API.
- You are already operating in Google Cloud infrastructure.
These are not the same decision. The Ruby code can hide both behind one provider adapter, but picking the Enterprise platform pulls in IAM, billing, and whoever owns your Google Cloud org. That is a procurement and governance call, not something you settle in a pull request.
What You Own in Ruby
Here is the bill for skipping the SDK. Every layer of the client boundary that a Python or JavaScript developer gets handed for free is, in Ruby, code you write, test, and keep working as the API drifts:
- HTTP client configuration.
- Authentication.
- API versioning.
- Request serialization.
- Response parsing.
- Streaming event parsing.
- Error handling.
- Retry behavior.
- Tool declaration generation.
- Tool execution.
- State persistence.
- Observability.
- Tests around provider fixtures.
It only stays manageable if you keep the boundary small and refuse to let it sprawl into the rest of the app.
Do not build a giant “AI service” object with every provider detail mixed into application logic. Build a narrow client and a narrow adapter. Split the responsibilities: one object speaks HTTP, one knows Gemini’s interaction shape, one runs the loop, and the registry executes tools. When Gemini ships a breaking change, you fix the one object that owns it instead of grepping the whole app for the leak.
When a Hand-Rolled Gemini Client Is the Wrong Call
Owning the boundary is not always worth it. If all you need is the occasional one-shot completion, a community gem or a multi-provider abstraction will ship faster and the extra control buys you nothing. If you cannot stomach the no-SDK tax, keep in mind that when the Interactions API shifts, the breakage and the fix are both yours, where an official SDK would have absorbed them. If your domain has strict retention rules, stored interactions keep conversation history on Google’s side for as long as you leave it there, and that may simply be off the table. And if nobody on the team wants to babysit parser fixtures as the API moves, that work does not disappear on its own.
Reach for the hand-rolled client when you genuinely need the boundary: real control over retries, logging, streaming, tool execution, and how much provider detail leaks into the rest of the app. For anything lighter, a gem is the cheaper answer, and there is no shame in reaching for it.
Testing the Provider Boundary
Do not let live API calls be your only test for the integration. They are slow, they cost money, and they go red when Google is having a bad afternoon rather than when your code is actually wrong. What you want instead is fixture-based tests covering every response shape your app leans on:
- Text-only interaction.
- Function call interaction.
- Function result continuation.
- Streaming model output.
- Streaming tool call.
- Background interaction created.
- Completed interaction event.
- Failed provider response.
- Rate limit response.
- Unknown step type.
The most valuable tests are parser tests.
RSpec.describe GeminiInteractionParser do
  it "extracts function calls from interaction steps" do
    response = JSON.parse(file_fixture("gemini/function_call_interaction.json").read)

    steps = described_class.new(response).steps

    expect(steps.first.type).to eq("function_call")
    expect(steps.first.name).to eq("lookup_customer_invoices")
    expect(steps.first.arguments).to include("customer_id" => 123)
  end
end
Provider APIs drift, and a fixture is a precise, checked-in record of exactly which response shapes your app is betting on. When Google changes one, a parser test goes red and the diff tells you what moved, instead of a customer telling you for free in production.
A Practical Checklist
Here is the checklist I run down when wiring the Interactions API into a Rails app, roughly in the order I build things:
- Start with the Interactions API for new agent workflows.
- Decide Developer API vs Enterprise Agent Platform up front.
- Hide that decision inside one client constructor.
- Use Faraday with JSON middleware, timeouts, and raise_error.
- Keep raw Gemini JSON at the provider boundary.
- Normalize interaction steps into internal step objects.
- Store interaction IDs.
- Use previous_interaction_id for multi-turn flows when storage is acceptable.
- Resend system instructions, tools, and generation config on every interaction.
- Make store a deliberate product decision.
- Use background: true for long-running work.
- Stream events into product-level UI events.
- Execute tools through a registry.
- Enforce authorization inside tools.
- Require confirmation for write tools.
- Persist agent runs and agent steps.
- Cap iterations, duration, and tool result sizes.
- Test parser behavior with provider fixtures.
The prompt is not the architecture. The architecture is the client boundary, the tool boundary, the state model, and the failure model.
The Missing SDK Is Not What Stops You
Building a Gemini agent in Ruby is not blocked by the lack of an official SDK. The API is HTTP, Ruby is good at HTTP, and Rails already gives you background jobs, persistence, authorization, and audit trails, which are exactly the pieces a production agent needs.
What the missing SDK changes is where the work goes. You own the provider boundary, the interaction-step parsing, the decision about how server-side state fits your data policy, safe streaming, tool execution through normal application authorization, confirmable writes, and logs that explain what happened after the model surprises you.
The Interactions API is a better primitive for that than the older completion-shaped approach because it models the thing you are actually building: a stateful, observable, tool-using interaction. In Ruby the recipe is small: Faraday handles transport, the Interactions API holds state, Rails jobs run the work in the background, a tool registry exposes capabilities, your policies handle authorization, and your logs tell you what happened. That is enough to build something that holds up in production.
Before you pick a job backend for agent runs, the Solid Queue vs Sidekiq vs GoodJob comparison is worth reading, since background execution is where most of the operational weight lands.
Need help shipping a Gemini or Anthropic agent in Rails? I help teams with agent architecture, tool design, and the production guardrails that keep these systems safe to run.