
Architecture Overview

Agnitra combines profiling, AI-assisted kernel tuning, and usage-based billing so you can prove performance gains before pushing new models to production. This overview explains how the pieces fit together from a user’s point of view.

Optimization Loop at a Glance

  1. Profile & capture telemetry – The CLI or SDK traces your TorchScript/ONNX model, capturing latency, throughput, memory, and GPU utilization.
  2. Plan optimizations – Telemetry is fed to the Agnitra optimizer, which pairs OpenAI Responses API hints with reinforcement learning to propose kernel tweaks.
  3. Patch & validate – Agnitra generates Triton/CUDA kernels, validates correctness against your baseline, and swaps them into the runtime.
  4. Report uplift – Before/after telemetry becomes a structured usage event you can forward to finance systems, marketplaces, or dashboards.

Every stage emits artifacts (telemetry JSON, optimized models, usage events) so you can audit changes or roll back quickly.
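The final step of the loop, folding before/after telemetry into a structured usage event, can be sketched as a small pure function. The field names below (latency_ms, tokens_per_s, event_type) are illustrative assumptions, not Agnitra's actual telemetry schema:

```python
import json

def build_usage_event(baseline: dict, optimized: dict, model_id: str) -> dict:
    """Fold two telemetry snapshots into one structured usage event (hypothetical schema)."""
    uplift = (optimized["tokens_per_s"] - baseline["tokens_per_s"]) / baseline["tokens_per_s"]
    latency_cut = (baseline["latency_ms"] - optimized["latency_ms"]) / baseline["latency_ms"]
    return {
        "event_type": "optimization.completed",
        "model_id": model_id,
        "baseline": baseline,
        "optimized": optimized,
        "tokens_per_s_uplift_pct": round(uplift * 100, 1),
        "latency_reduction_pct": round(latency_cut * 100, 1),
    }

baseline = {"latency_ms": 42.0, "tokens_per_s": 1200.0, "gpu_util_pct": 71.0}
optimized = {"latency_ms": 33.6, "tokens_per_s": 1500.0, "gpu_util_pct": 88.0}
event = build_usage_event(baseline, optimized, "llama-3-8b")
payload = json.dumps(event, sort_keys=True)  # ready to forward to a dispatcher
```

Because the event carries both raw snapshots and the derived percentages, downstream systems (finance, marketplaces, dashboards) can re-derive or audit the uplift claim independently.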

Core Surfaces You’ll Use

  • CLI & Python SDK – Run agnitra optimize locally or embed agnitra.optimize() inside services to automate tuning.
  • Agentic Optimization API (agnitra-api) – Offload work to a remote control plane, queue jobs, and replay usage events via REST.
  • Telemetry exporters – Push JSON snapshots to S3, HTTP endpoints, Kafka, or your observability stack for long-term analysis.
  • Marketplace dispatchers – Enable adapters (Stripe, AWS Marketplace, GCP Marketplace) to translate usage events into billable records.
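The exporter surface amounts to fanning one telemetry snapshot out to several sinks. A minimal sketch follows; TelemetryHub and the exporter signature are hypothetical names, and real sinks (S3, Kafka, an HTTP endpoint) would replace the callables:

```python
import json
from typing import Callable

# An exporter receives one serialized snapshot; real ones would write to S3, Kafka, etc.
Exporter = Callable[[str], None]

class TelemetryHub:
    """Fan a telemetry snapshot out to every registered exporter (illustrative name)."""

    def __init__(self) -> None:
        self._exporters: list[Exporter] = []

    def register(self, exporter: Exporter) -> None:
        self._exporters.append(exporter)

    def publish(self, snapshot: dict) -> None:
        payload = json.dumps(snapshot, sort_keys=True)
        for export in self._exporters:
            export(payload)

received: list[str] = []
hub = TelemetryHub()
hub.register(received.append)  # stand-in for a real sink
hub.publish({"latency_ms": 33.6, "model": "llama-3-8b"})
```

Serializing once and handing the same payload to every sink keeps the exporters uniform and makes long-term analysis stores byte-identical to what the dashboards saw.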

Deployment Options

  • Managed – Point the CLI/SDK at the hosted control plane (https://api.agnitra.ai) and use preconfigured telemetry dashboards.
  • Self-hosted – Run agnitra-api in your environment, connect it to GPU workers, and plug telemetry into your data stores.
  • Offline / regulated – Enable offline mode with enterprise licenses to generate optimizations without external network access; usage events sync once connectivity returns.
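The offline mode's "sync once connectivity returns" behavior can be approximated with a local spool file. OfflineEventQueue and its methods are illustrative, not the enterprise feature's real interface:

```python
import json
import tempfile
from pathlib import Path

class OfflineEventQueue:
    """Spool usage events to disk while offline; drain them on reconnect (illustrative)."""

    def __init__(self, spool_path: Path) -> None:
        self.spool = Path(spool_path)

    def enqueue(self, event: dict) -> None:
        # One JSON line per event; the spool survives process restarts.
        with self.spool.open("a") as f:
            f.write(json.dumps(event) + "\n")

    def drain(self, send) -> int:
        # Replay every spooled event through `send`, then clear the spool.
        if not self.spool.exists():
            return 0
        events = [json.loads(line) for line in self.spool.read_text().splitlines()]
        for event in events:
            send(event)
        self.spool.unlink()
        return len(events)

with tempfile.TemporaryDirectory() as d:
    queue = OfflineEventQueue(Path(d) / "usage.jsonl")
    queue.enqueue({"event_type": "optimization.completed", "model_id": "llama-3-8b"})
    queue.enqueue({"event_type": "optimization.completed", "model_id": "resnet50"})
    delivered: list = []
    sent = queue.drain(delivered.append)  # "connectivity returns"
```

An append-only JSON-lines spool is a common pattern for this: events stay auditable on disk in regulated environments, and replay order matches capture order.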

Performance Goals

Metric                     Target
Tokens per second uplift   ≥ 20%
Latency reduction          ≥ 15%
Memory efficiency          ≥ 25%
Integration time           < 10 minutes
Output correctness         ≥ 99.9% parity with baseline
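The output-correctness target can be made concrete with a token-level parity measure: the fraction of positions where the optimized model reproduces the baseline token. This is one plausible definition, assumed here rather than taken from Agnitra's validator:

```python
def parity(baseline_tokens: list[int], optimized_tokens: list[int]) -> float:
    """Fraction of positions where the optimized output matches the baseline token."""
    if not baseline_tokens:
        return 1.0
    matches = sum(a == b for a, b in zip(baseline_tokens, optimized_tokens))
    return matches / len(baseline_tokens)

baseline_out = [101, 7, 42, 9, 13] * 200   # 1000 baseline tokens
optimized_out = list(baseline_out)
optimized_out[3] = 10                      # a single divergent token
score = parity(baseline_out, optimized_out)
meets_target = score >= 0.999              # the ≥ 99.9% bar from the table
```

For models with floating-point outputs rather than discrete tokens, an elementwise tolerance check would replace exact equality, but the pass/fail threshold works the same way.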
