Building a Proactive Infrastructure Knowledge Base with Grafana Assistant
Overview
When an unexpected alert fires, the first instinct of many engineers is to consult an AI assistant for help. However, traditional assistants require you to provide context—data sources, services, connections, labels, and metrics—each time you start a new conversation. This discovery process eats into precious troubleshooting time. Grafana Assistant changes this paradigm by preemptively learning your infrastructure and building a persistent knowledge base. It knows your services, their dependencies, metrics, logs, and traces before you even ask a question. This guide walks you through how Grafana Assistant works, how to set it up, and best practices for leveraging its pre-built context to speed up incident response.
Prerequisites
Grafana Cloud Account
Grafana Assistant is a feature of Grafana Cloud. You need an active Grafana Cloud stack (free or paid tier). Sign up at grafana.com if you haven’t already.
Connected Data Sources
Assistant automatically discovers and learns from the following data sources configured in your Grafana Cloud stack:
- Prometheus – for metrics and service discovery
- Loki – for logs and log formats
- Tempo – for traces and service dependencies
Ensure at least one of these is connected and ingesting data. For best results, have all three configured.
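As a quick pre-check, the sketch below reports which of the three data source types are missing from a stack. It assumes Grafana's HTTP API endpoint GET /api/datasources and the plugin type IDs prometheus, loki, and tempo; base_url and token are placeholders, and the sample entries are made up. It only confirms that a data source is configured, not that it is ingesting data.

```python
import json
import urllib.request

# Plugin type IDs for the three sources this guide names.
WANTED_TYPES = ("prometheus", "loki", "tempo")


def missing_types(datasources):
    """Return the wanted types that no configured data source provides."""
    present = {ds.get("type") for ds in datasources}
    return [t for t in WANTED_TYPES if t not in present]


def fetch_datasources(base_url, token):
    """List data sources via Grafana's HTTP API (GET /api/datasources).

    base_url and token are placeholders for your stack URL and a
    service account token; this function needs network access.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/datasources",
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Offline example with made-up entries shaped like the API response.
sample = [
    {"name": "metrics", "type": "prometheus"},
    {"name": "logs", "type": "loki"},
]
print(missing_types(sample))  # ['tempo']
```

Against a live stack, missing_types(fetch_datasources(base_url, token)) would run the same check on real data.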
Permissions
You need administrative or editor-level access to your Grafana Cloud stack to enable Assistant and view its knowledge base.
Step-by-Step Instructions
Step 1: Enable Grafana Assistant
In your Grafana Cloud instance, navigate to Administration > Grafana Assistant (or search for “Assistant” in the sidebar). Toggle the feature on. No further configuration is required—Assistant starts scanning your environment immediately.
Step 2: Let the Assistant Build Its Knowledge Base
Behind the scenes, a swarm of AI agents performs the following tasks:
- Data source discovery: All connected Prometheus, Loki, and Tempo sources are identified.
- Metrics scans: Agents query Prometheus data sources in parallel to find services, deployments, and infrastructure components (e.g., pods, nodes, containers).
- Enrichments via logs and traces: Loki log streams and Tempo trace data are correlated with metrics, adding context about log formats, trace structures, and service-to-service dependencies.
- Structured knowledge generation: For each discovered service group, Assistant produces documentation covering five key areas: what the service is, its key metrics and labels, deployment details (e.g., Kubernetes namespace, replicas), dependencies (upstream and downstream), and where to find its logs and traces.
This process runs in the background with zero configuration. You don’t need to tell Assistant what to scan—it proactively explores your entire stack.
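To make the five documentation areas concrete, here is a minimal sketch of how one knowledge base entry could be modeled. The ServiceKnowledge class, its field names, and the sample values are hypothetical illustrations, not Assistant's actual schema or storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceKnowledge:
    """Hypothetical record mirroring the five documented areas."""
    name: str
    description: str = ""
    key_metrics: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)
    deployment: dict = field(default_factory=dict)
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    logs_location: str = ""
    traces_location: str = ""

    def missing_areas(self):
        """Name the areas that are still empty for this service."""
        areas = {
            "description": self.description,
            "metrics_and_labels": self.key_metrics or self.labels,
            "deployment": self.deployment,
            "dependencies": self.upstream or self.downstream,
            "logs_and_traces": self.logs_location and self.traces_location,
        }
        return [area for area, value in areas.items() if not value]


# Made-up entry: logs and traces have not been correlated yet.
payment = ServiceKnowledge(
    name="payment",
    description="Handles card payments for checkout.",
    key_metrics=["http_requests_total"],
    labels={"service": "payment"},
    deployment={"namespace": "shop", "replicas": "3"},
    upstream=["checkout"],
)
print(payment.missing_areas())  # ['logs_and_traces']
```

A record like this shows why having all three data sources connected matters: without Loki and Tempo, the last area stays empty.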
Step 3: View the Pre-Built Knowledge Base
To inspect what Assistant has learned, go to Assistant > Knowledge Base in the left-hand menu (or via the Assistant panel). You’ll see a list of services and infrastructure components. Click on any entry to see its auto-generated documentation, including:
- Service description: A natural-language summary of the service’s role.
- Key metrics & labels: e.g., http_requests_total, service="payment".
- Dependencies: Upstream and downstream services.
- Log format: Whether logs are JSON, plain text, etc., and where they live in Loki.
- Trace structure: Typical spans and their relationships.
You can also search the knowledge base using natural language queries.
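If you want to query a documented metric yourself, the metric name and labels from an entry map directly onto a PromQL selector. The helper below is an illustrative sketch; promql_selector is a hypothetical name, and it escapes only backslashes and double quotes in label values.

```python
def promql_selector(metric, labels):
    """Build a PromQL instant vector selector such as name{key="value"}."""
    def escape(value):
        # Backslashes and double quotes must be escaped inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    matchers = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}" if matchers else metric


print(promql_selector("http_requests_total", {"service": "payment"}))
# http_requests_total{service="payment"}
```

The resulting string can be pasted into the Explore view against a Prometheus data source.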
Step 4: Use Assistant for Incident Response
Now when an alert fires, you can ask Assistant directly (e.g., in the Grafana Explore view or via the Assistant chat panel) without providing context. Example interaction:
User: "Why is the checkout service slow?"
Assistant: "The checkout service depends on Payment (latency p99=2.3s) and Inventory (error rate 12%). Recent logs show database connection timeouts. Metrics spike at 15:00 UTC. Trace IDs point to slow SQL queries in the Payment service. Want me to dive into logs?"
Assistant already knows your services, metrics, logs, and traces—so it can jump straight to root cause analysis.
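The value of knowing dependencies up front can be illustrated with a small graph walk: given a map of which service calls which, list everything a slow service relies on so each can be checked in turn. This is a sketch with a made-up dependency map, not a description of how Assistant stores or traverses dependencies.

```python
from collections import deque


def transitive_dependencies(graph, service):
    """Return every service reachable from `service`, breadth-first."""
    seen = []
    queue = deque(graph.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen or current == service:
            continue  # skip repeats and cycles back to the start
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


# Made-up dependency map: each key calls the services in its list.
graph = {
    "checkout": ["payment", "inventory"],
    "payment": ["payments-db"],
    "inventory": [],
}
print(transitive_dependencies(graph, "checkout"))
# ['payment', 'inventory', 'payments-db']
```

Breadth-first order lists direct dependencies first, which matches how an investigation usually proceeds outward from the alerting service.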
Step 5: Keep the Knowledge Base Updated
Assistant continuously refreshes its knowledge base, so no action is needed on your part. Even after major changes (e.g., adding a new service, renaming metrics, or changing log formats), Assistant adapts automatically within a few minutes. If you need the update sooner, you can manually trigger a rescan from the Knowledge Base page.
Common Mistakes
Mistake 1: Expecting Assistant to Work Without Data
If your Prometheus, Loki, or Tempo data sources are not ingesting data, Assistant will have nothing to learn. Ensure data is flowing before expecting meaningful insights. Check the data source health in Grafana.
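One way to confirm that metrics are flowing is to run an instant query such as up and count the returned series. The sketch below assumes the standard Prometheus HTTP API (/api/v1/query) and its documented response shape; the base URL is a placeholder, authentication is omitted, and the sample response is hand-written.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus HTTP API instant query endpoint."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def series_count(response):
    """Count the series in a parsed /api/v1/query response body."""
    if response.get("status") != "success":
        raise ValueError("query failed: " + str(response.get("error")))
    return len(response["data"]["result"])


# Hand-written response in the documented shape; values are invented.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "payment"}, "value": [1700000000, "1"]},
        ],
    },
}
print(instant_query_url("http://localhost:9090", "up"))
# http://localhost:9090/api/v1/query?query=up
print(series_count(sample))  # 1
```

A count of zero for up would suggest that no targets are being scraped, in which case Assistant has nothing to learn from that data source.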
Mistake 2: Overlooking Permissions
If you cannot see the Assistant panel or knowledge base, verify you have at least editor role in your Grafana Cloud stack. Some admin-level features may be hidden if your role is too restrictive.
Mistake 3: Disabling Assistant After Initial Build
Assistant only works when enabled. If you turn it off, it stops updating the knowledge base. Incident response will then fall back to context-sharing mode. Keep it on for continuous benefit.
Mistake 4: Assuming It Knows Everything Instantly
Assistant scans periodically (e.g., every few minutes). New services or changes may not appear for a short time. Be patient or manually trigger a rescan if you need immediate updates.
Mistake 5: Not Using the Knowledge Base for Onboarding
Many teams miss the opportunity to use Assistant’s pre-built knowledge as a reference for new team members. Encourage developers to explore the knowledge base to understand service dependencies without reading outdated runbooks.
Summary
Grafana Assistant eliminates the repetitive context-sharing that slows down incident response. By automatically discovering your Prometheus, Loki, and Tempo data sources, scanning metrics, correlating logs and traces, and generating structured documentation for each service, it builds a persistent knowledge base that is ready when you need it. Setup requires zero configuration: just enable Assistant and let it run. The result is faster, more accurate troubleshooting, especially for teams where not everyone knows the full infrastructure picture. Use the knowledge base as a living map of your system, and keep Assistant enabled so that it continues to refresh and your insights stay current.