AI Ops Engineer

Parspec

Software Engineering, Operations, Data Science

Bengaluru, Karnataka, India · Sterling, VA, USA

Posted on Mar 13, 2026

About Parspec

Parspec is building the AI and digital infrastructure for the construction materials supply chain.

Construction is a $15 trillion industry, yet the systems that underpin the buying and selling of materials remain fragmented, manual, and disconnected. Distributors and rep agencies rely on spreadsheets, PDFs, phone calls, and siloed tools to find new products and to quote and manage projects, creating delays, errors, and margin erosion across the supply chain.

Parspec is an AI-native platform that powers how construction products are discovered, bought, and sold. Trusted by more than 300 MEP distributors and rep agencies, Parspec helps project-driven businesses bid faster, win more work, and operate more profitably. By combining product intelligence, AI-powered workflows, and a connected ecosystem, Parspec is laying the foundation for a more intelligent, efficient construction supply chain.

Founded in 2021 and headquartered in San Mateo, California, Parspec has raised $31 million from leading deep-tech and construction-technology investors.

The Opportunity

We are looking for an experienced AIOps / LLMOps Engineer to help design, deploy, and manage the AI infrastructure that powers Parspec’s next-generation AI systems.

This role focuses on designing, deploying, and operating LLMOps/AIOps platforms for high-volume generative AI workloads, with deep emphasis on self-hosted model serving performance alongside observability, security, and cost. You will ensure that our AI infrastructure is scalable and secure, using async patterns on AWS.

We run a high-throughput document AI pipeline: vision-language models reading construction documents, plus a fleet of smaller models (rerankers, embedders, layout detectors) serving search and extraction. Our throughput ceiling and our infrastructure bill are both set primarily by how efficiently we serve models. The core of this role is moving that ceiling.

A note on fit: this is an inference-performance and AI-platform role, not an ITOps/AIOps-for-infrastructure role, not an observability-tooling role, and not a general SRE or DevOps role. Kubernetes, Terraform and CI/CD are tools you'll use daily, but they are not what we're hiring for. If your strongest work is platform engineering rather than model serving performance, this is likely the wrong fit.

Preferred location: Bengaluru, with regular in-office collaboration.

What You Will Achieve and Key Responsibilities

Model serving performance and efficiency

Own throughput and cost per unit of work for self-hosted models — tune serving stacks (vLLM, SGLang, TensorRT-LLM, TGI) at the level of scheduler configuration, batching policy and memory layout: continuous/in-flight batching, chunked prefill, max_num_seqs / max_num_batched_tokens, admission control, preemption.
Manage KV cache economics: PagedAttention block sizing, memory-utilization headroom, KV quantization, cache offload, and prefix/radix caching. Our extraction prompts carry long fixed schemas across every page, so cache hit rate is a first-class operating metric here.
Quantize models to FP8 / INT8 / INT4 (AWQ, GPTQ, W8A8) and prove quality held using an accuracy harness on real production documents. Building and maintaining that harness is part of the role.
Serve vision-language models at scale: image token budgets, tiling and dynamic resolution, DPI/resize tradeoffs, the vision encoder as an independent bottleneck, encoder output caching, page-level batching. Our workload is batch document throughput, not interactive chat.
Make and defend parallelism and topology decisions — tensor vs pipeline parallelism vs replica scale-out, including when tensor parallelism hurts (e.g. PCIe-only nodes without NVLink, where N replicas at TP=1 often beat one TP=N).
Improve accelerator packing and multi-tenancy: MIG, MPS, time-slicing, multi-LoRA serving, model load and cold-start times, weight streaming — so our small-model zoo runs co-resident efficiently rather than one model per node.
Serve non-LLM models (layout detection, embeddings, rerankers) on a real inference server such as Triton, with dynamic batching and ensembles, rather than a Python web process.
Own capacity and autoscaling policy: scale on queue depth, KV-cache utilization or TTFT rather than accelerator utilization, and be able to explain why. This covers Spot and capacity blocks, Karpenter node pools, warm pools, bin-packing across L4 / A10G / L40S / A100 / H100, and selecting the cheapest instance that fits after quantization.
Establish benchmarking rigour: load generation using real input/output length distributions from production traffic, saturation curves and latency–throughput Pareto plots normalized per accelerator, plus performance and quality regression gates in CI.
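
As an illustration of the KV cache economics mentioned above, here is a minimal sketch of the sizing arithmetic: how many tokens of KV cache fit on one accelerator once weights are loaded, and how KV quantization changes that. The model shape, the 24 GiB device, the 16 GiB of weights and the names `ModelShape` and `kv_token_capacity` are all assumptions made for the example, not a description of our stack. The estimate also ignores activation workspace and allocator fragmentation, so treat it as an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; substitute the real model's config."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, dtype_bytes: int) -> int:
    # Keys and values are each stored per layer and per KV head, hence the 2.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * dtype_bytes


def kv_token_capacity(gpu_bytes: int, utilization: float, weight_bytes: int,
                      shape: ModelShape, dtype_bytes: int) -> int:
    """Tokens of KV cache that fit after weights (activations ignored)."""
    budget = int(gpu_bytes * utilization) - weight_bytes
    if budget <= 0:
        return 0
    return budget // kv_bytes_per_token(shape, dtype_bytes)


GIB = 1024 ** 3
# Assumed example values: 32 layers, 8 KV heads (GQA), head_dim 128.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
fp16_tokens = kv_token_capacity(24 * GIB, 0.90, 16 * GIB, shape, dtype_bytes=2)
fp8_tokens = kv_token_capacity(24 * GIB, 0.90, 16 * GIB, shape, dtype_bytes=1)
print(f"FP16 KV: {fp16_tokens} tokens, FP8 KV: {fp8_tokens} tokens")
```

Under these assumptions, halving the KV element size roughly doubles the number of cached tokens, which is why KV quantization and memory-utilization headroom are tuned together rather than separately.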

Platform, reliability and governance

Build document AI platforms using generative AI with asynchronous architectures — queues and event-driven workers for elastic scaling and non-blocking inference.
Architect self-hosting infrastructure for LLMs on Kubernetes/EC2 with accelerator orchestration and scheduling.
Oversee correct LLM usage, including gateways (LiteLLM, Portkey), prompt engineering support and guardrails to prevent misuse or hallucinations.
Implement observability for AI/ML pipelines: distributed tracing with OpenTelemetry and X-Ray, drift and hallucination monitoring via Prometheus/Grafana, LLM-specific metrics via Langfuse.
Manage AI/ML workflows — MLflow (including serverless SageMaker MLflow setup) and Kubeflow for experiment tracking, versioning and deployment.
Add security layers: Bedrock Guardrails (PII/toxicity), KMS encryption, IAM least-privilege, VPC endpoints, CloudTrail auditing.
Optimize AWS resources (Bedrock, SageMaker, EKS) for cost, security and performance in production AI workloads.

Required Skills

Inference performance

Production experience optimizing self-hosted model serving: continuous batching, KV cache tuning, prefix caching, chunked prefill, quantization with accuracy validation, parallelism tradeoffs.
Hands-on with at least one of vLLM, SGLang, TensorRT-LLM or TGI at the level of reading and modifying scheduler and memory configuration, not only deploying a container.
Profiling and diagnosis: Nsight Systems, DCGM, NCCL behaviour; comfortable reasoning about memory-bandwidth-bound vs compute-bound regimes.
Able to explain any throughput number in terms of batch size, memory bandwidth, sequence lengths and cache hit rate.
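
A rough sketch of what explaining a throughput number can look like: a first-order estimate of decode throughput that compares the time to stream weights and KV cache from memory against the time to do the matrix math. Everything here is an assumption for illustration, including the function name `decode_estimate`, the hardware figures, and the common approximation of about 2 FLOPs per parameter per generated token (which ignores attention FLOPs). Real measurements will differ.

```python
def decode_estimate(batch: int, ctx_len: int, params: float, weight_bytes: float,
                    kv_bytes_per_token: float, mem_bw: float, peak_flops: float):
    """Return (tokens_per_second, regime) for one decode step of the batch.

    Simplified model: each step reads all weights once plus every sequence's
    KV cache, and performs ~2 FLOPs per parameter per sequence.
    """
    mem_time = (weight_bytes + batch * ctx_len * kv_bytes_per_token) / mem_bw
    compute_time = 2.0 * params * batch / peak_flops
    step_time = max(mem_time, compute_time)
    regime = "memory-bound" if mem_time >= compute_time else "compute-bound"
    return batch / step_time, regime


# Assumed example: 8B-parameter model in FP16 on a device with
# 300 GB/s of memory bandwidth and 120 TFLOP/s of peak compute.
common = dict(params=8e9, weight_bytes=16e9, kv_bytes_per_token=131072,
              mem_bw=300e9, peak_flops=120e12)

for batch, ctx_len in [(1, 2048), (32, 2048), (1024, 16)]:
    tps, regime = decode_estimate(batch=batch, ctx_len=ctx_len, **common)
    print(f"batch={batch:5d} ctx={ctx_len:5d} -> {tps:9.1f} tok/s ({regime})")
```

In this model, larger batches amortize the weight read across more sequences until either KV cache traffic or compute takes over, which is the reasoning behind tying throughput to batch size, sequence length and memory bandwidth.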

LLMOps Tools

Model serving runtimes as above; inference gateways (LiteLLM, Portkey); vector DB operations (Qdrant, pgvector, Pinecone, Weaviate); Triton Inference Server for CV and embedding models.

Cloud Proficiency

AWS for ML infrastructure: EC2 (incl. accelerated instance families), EKS, Lambda, S3, Bedrock, SageMaker, Step Functions, API Gateway, EventBridge, SQS/SNS fan-out. IaC with Terraform or CloudFormation.

Async / Architecture

Queues (SQS, Kafka), micro-batching, and event-driven systems for scalable GenAI workloads.
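
To make the micro-batching pattern concrete, here is a minimal sketch using only Python's `asyncio`. An in-process `asyncio.Queue` stands in for SQS or Kafka, and `fake_infer` stands in for a model server call; both are assumptions for the example, as are the names `micro_batch_worker`, `max_batch` and `max_wait_s`. The worker collects requests until the batch is full or a short deadline passes, then runs one batched call and resolves each caller's future.

```python
import asyncio
from typing import Awaitable, Callable, List


async def micro_batch_worker(
    queue: asyncio.Queue,
    infer_batch: Callable[[List[str]], Awaitable[List[str]]],
    max_batch: int = 8,
    max_wait_s: float = 0.02,
) -> None:
    """Drain (payload, future) pairs from the queue in small batches."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]          # block until there is work
        deadline = loop.time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break                        # deadline hit: ship a partial batch
        try:
            results = await infer_batch([payload for payload, _ in batch])
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)
        except Exception as exc:             # propagate failures to callers
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)


async def fake_infer(payloads: List[str]) -> List[str]:
    """Hypothetical stand-in for a batched model call."""
    await asyncio.sleep(0)
    return [p.upper() for p in payloads]


async def demo(pages: List[str]) -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(micro_batch_worker(queue, fake_infer))
    loop = asyncio.get_running_loop()
    futures = []
    for page in pages:
        fut = loop.create_future()
        await queue.put((page, fut))
        futures.append(fut)
    results = await asyncio.gather(*futures)
    worker.cancel()
    return list(results)


print(asyncio.run(demo(["page-1", "page-2", "page-3"])))
```

The deadline bounds the latency added by waiting for a fuller batch. For batch document workloads it can usually be set more generously than for interactive traffic.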

Observability / AIOps

Prometheus and Grafana, ELK, X-Ray tracing for async flows, Langfuse for LLM metrics (latency, token usage, quality), DCGM for hardware-level telemetry.

Programming / DevOps

Python — FastAPI, asyncio, and enough PyTorch to read model and serving code. Docker and Kubernetes. CI/CD with GitHub Actions and ArgoCD.

Strong plus

Vision-language model serving at scale; multi-LoRA serving or MIG/MPS partitioning in production; speculative decoding, including a clear view of when it doesn't pay off; kernel-adjacent familiarity (FlashAttention/FlashInfer, CUDA graphs, torch.compile, custom Triton kernels); cost-optimized inference on alternative accelerators (Trainium/Inferentia).

Who You Are

Experience Level

5+ years of experience in MLOps/LLMOps/AIOps or DevOps for AI, having built production GenAI systems — including at least 2 years spent specifically on self-hosted model serving performance. Depth of optimization work matters considerably more than total years.
Track record of deploying self-hosted LLMs on AWS, production serverless document AI platforms, and ML services with >99.9% uptime and demonstrated cost optimization.
Candidates should be able to walk through one concrete throughput or cost win end to end: what was measured, what was tried, what was rejected and why.

Preferred Qualifications

Education: BS/MS in CS, EE, Data Science, or equivalent practical experience. Strong self-taught candidates with demonstrable performance work are welcome.
Emerging Tech: Multimodal LLMs, agentic workflows, cost-optimized inference (e.g. AWS Trainium).

What We Offer

Competitive salary and benefits, including family insurance coverage, free health tele-consultations, and learning/up-skilling budgets
Equity in the company
Flexible hours and a hybrid work setup
Unlimited PTO
Opportunity to grow with a fast-scaling company transforming a large market
Preferred location: Bengaluru, with regular in-office presence

Join Us

At Parspec, we recognize that traditional job descriptions don’t always capture the full range of your unique abilities—and that’s perfectly okay. You may not meet every requirement, but if you bring a mix of experiences, fresh perspectives, and a passion that aligns with our mission, we want to hear from you!

The Parspec team believes that varied backgrounds drive better outcomes and fuel innovation. We are a team of self-starters who lead from every seat. We think big, set a standard of excellence, and are committed to diversity and a discrimination-free workplace. We welcome applicants from all walks of life to join us and help shape the future at Parspec.

How to Apply

Submit your application and resume highlighting your achievements. Apply now and help drive transformative change in one of the world’s oldest and largest industries!