115 Devops Engineers jobs in Jaipur
Site Reliability Engineer
Posted 3 days ago
Job Description
Location: Bangalore / Remote
Experience: 4–10 years
Type: Full-time (6-month probation)
CodeKarma is redefining how engineering teams understand and evolve complex systems — bringing production context directly into the developer’s workflow.
Our platform runs both as SaaS and as sub-account / on-prem deployments within our customers’ cloud environments.
We’re looking for engineers who can take ownership of these deployments end-to-end — from setup to monitoring, upgrades, and ongoing reliability.
You’ll be responsible for managing CodeKarma’s distributed deployments across client environments — ensuring reliability, security, and performance at scale.
- Deploy and manage CodeKarma clusters across AWS, GCP, and Azure customer sub-accounts.
- Monitor, upgrade, and maintain Kubernetes clusters and related infrastructure.
- Implement observability, alerting, and disaster recovery for each deployment.
- Handle CI/CD automation for platform releases, patches, and version upgrades.
- Work closely with client engineering teams to adapt deployments to their environments, policies, and security constraints.
- Diagnose and resolve environment-specific issues across networking, storage, and configuration layers.
- Build and maintain infrastructure playbooks, Helm charts, and Terraform modules for standardized deployment.
- Strong experience managing Kubernetes clusters (EKS, GKE, AKS, or on-prem equivalents).
- Deep understanding of Kubernetes internals, Helm, ingress controllers, networking, and storage classes.
- Hands-on experience with CI/CD tools (GitHub Actions, ArgoCD, or similar).
- Familiarity with monitoring and alerting stacks (Prometheus, Grafana, Loki, ELK, etc.).
- Working knowledge of cloud infrastructure across AWS / GCP / Azure.
- Ability to work directly with client engineering and DevOps teams, understanding their constraints and helping them integrate CodeKarma.
- Strong debugging and communication skills — you’ll often be the bridge between CodeKarma and client infrastructure.
- Manage real, large-scale production environments across multiple enterprises.
- Work directly with founders and senior engineers to shape how CodeKarma scales across clients.
- High ownership, fast-moving environment, and exposure to deep-tech systems.
Please share:
- A short summary of your Kubernetes experience (cluster management, scaling, debugging, etc.).
- Any automation or deployment tooling you’ve built or maintained.
- Links to your GitHub / GitLab / blog posts (if available).
Site Reliability Engineer
Posted 3 days ago
Job Description
Be part of something revolutionary
At o9 Solutions, our mission is clear: be the Most Valuable Platform (MVP) for enterprises. With our AI-driven platform — the o9 Digital Brain — we integrate global enterprises’ siloed planning capabilities, helping them capture millions and, in some cases, billions of dollars in value leakage. But our impact doesn’t stop there. Businesses that plan better and faster also reduce waste, which drives better outcomes for the planet, too.
We're on the lookout for the brightest, most committed individuals to join us on our mission. Along the journey, we’ll provide you with a nurturing environment where you can be part of something truly extraordinary and make a real difference for companies and the planet.
Site Reliability Engineer
You'll be working the following shift: Remote (WFH), Night Shift (6 PM - 2 AM)
About the role.
This SRE professional will have the opportunity to work for an AI-based unicorn recognized as one of the fastest-growing companies on the Inc. 5000 list. The role gives you the opportunity to deploy, maintain, and support the o9 Digital Brain platform across the world on AWS, Azure, GCP, and Samsung Cloud using state-of-the-art CI/CD tools. It will empower you to continuously challenge the status quo and implement your ideas to create value for o9 clients.
Major focus is on deployments, provisioning, upgrades, resizing, migrations, vulnerability fixes, security bug fixes, and patching at the infrastructure level. This team handles upgrades primarily on weekends, particularly Saturdays, and operates in shifts that cover different days of the week, including Tuesday to Saturday to support weekend upgrades.
What you will do in this role:
- Deploy, maintain, and support o9 Digital Brain SaaS environments on all major clouds
- Monitor availability and maintain system in good health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, cost, time-to-deploy, and time-to-upgrade
- Monitor, measure and optimize system performance
- Provide on-call support on rotation basis
- Work flexibly with teams globally, across time zones
What you’ll have.
- Education: Bachelor’s degree in Computer Science, Software Engineering, Information Technology, Industrial Engineering, or Engineering Management
- Cloud (at least one) and Kubernetes administration certification
- Experience: 4-8 years of experience in an SRE role, deploying and maintaining applications, performance tuning, conducting application upgrades and patches, and supporting continuous integration and deployment tooling
- 4+ years of experience deploying and maintaining applications in any one of the clouds (AWS, Azure, GCP)
- Experience with Docker or similar and with Kubernetes or similar
- Skills: Ability to debug issues and solve problems
- Working knowledge of Jenkins, Ansible, Terraform, and ArgoCD
- Knowledge of at least one scripting language (Bash, shell, PowerShell, Python, etc.)
- Working knowledge of Linux and Windows operating systems
- Primary Skill: strong in operating system concepts, Linux and troubleshooting.
- Secondary Skill: Automation and cloud
- Characteristics: Passion to learn and adapt to new technology
- We really value team spirit: Transparency and frequent communication are key. At o9, this is not limited by hierarchy, distance, or function.
What we’ll do for you:
- Flat organization: With a very strong entrepreneurial culture (and no corporate politics).
- Great people and unlimited fun at work.
- Possibility to really make a difference in a scale-up environment.
- Support network: Work with a team you can learn from every day.
- Diversity: We pride ourselves on our international working environment.
- Work-Life Balance:
- Feel part of A team:
How the process works.
- Respond with your interest to us.
- We’ll contact you either via video call or phone call - whatever you prefer, with the further schedule status.
- HackerEarth Online Assessment - Domain specific
- During the interview phase, you will meet with the technical panel for 60 minutes. We will contact you after the interview to let you know if we’d like to progress your application.
- There will be 2 rounds of technical discussion followed by a Managerial round.
- We will let you know if you’re the successful candidate.
Good luck!
Site Reliability Engineer
Posted 4 days ago
Job Description
Key Responsibilities
- Manage and scale production systems hosted on Google Cloud Platform (GCP)
- Implement SRE best practices: monitoring, alerting, SLAs, SLOs, and error budgets
- Automate operational tasks using Infrastructure as Code (IaC) tools like Terraform
- Improve system reliability and reduce manual interventions through automation
- Collaborate with development teams to ensure new services are production-ready
- Lead incident response and post-mortem analysis to prevent recurring issues
- Design and implement CI/CD pipelines for rapid and safe deployments
- Manage GCP resources: IAM, VPC, Compute Engine, GKE, Cloud Functions, Pub/Sub, BigQuery, etc.
- Ensure security, compliance, and cost optimization on the cloud infrastructure
Required Skills & Qualifications
- 5+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
- Strong hands-on experience with Google Cloud Platform (GCP) services
- Proficiency with Terraform or other IaC tools
- Solid knowledge of Kubernetes (GKE), containerization, and microservices
- Strong scripting skills in Python, Go, or Shell
- Familiarity with incident response and post-mortem culture
- Knowledge of networking, security, and cloud cost management
Preferred Qualifications
- GCP certifications (e.g., Professional Cloud DevOps Engineer)
- Prior experience working with e-commerce or high-scale platforms
- Familiarity with SRE tooling like Chaos Engineering, Service Mesh (Istio), etc.
Soft Skills
- Strong communication and stakeholder management
- Problem-solving mindset with a focus on reliability and automation
- Ability to work independently in a distributed, outsourced team model
Senior Site Reliability Engineer- ELK Expert
Posted today
Job Description
Location: India (Remote) - Must be available to work in the EST (US/Canada) Time Zone.
Role Summary:
Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?
We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you'll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.
- Career Growth: Work alongside industry experts on cutting-edge cloud technologies
- Competitive Compensation and Benefits: We recognize and reward top talent
- Exciting, Impactful Work: Design and build scalable, resilient cloud environments
- Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure
- Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure
- Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration
- Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor
- Enhance Security and Compliance: Implement security best practices across DevOps workflows
- Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency
- Manage and scale large ELK clusters handling 2–3+ TB/day log volumes, ensuring high availability and performance
- Optimize ELK architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage
- Build and tune log pipelines: Scale Logstash and Beats pipelines across distributed environments
- Support Kibana observability layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)
- 7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering
- 4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)
- Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)
- Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components
- Expertise in GitHub Actions, Terraform, Ansible, and Infrastructure as Code (IaC)
- Proficiency in Python, Go, or Bash for automation and scripting
- Deep understanding of Kubernetes, Docker, and cloud-native architectures
- Experience with observability tools such as Prometheus, Grafana, Azure Monitor
- Ability to work in a fast-paced, collaborative environment and solve complex operational issues
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
- Microsoft Azure certifications: AZ-104, AZ-400
Senior Site Reliability Engineer- ELK Expert
Posted 27 days ago
Job Description
Location: India (Remote) - Must be available to work in the EST (US/Canada) Time Zone.
Role Summary:
Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?
We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you’ll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.
- Career Growth: Work alongside industry experts on cutting-edge cloud technologies
- Competitive Compensation and Benefits: We recognize and reward top talent
- Exciting, Impactful Work: Design and build scalable, resilient cloud environments
- Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure
- Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure
- Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration
- Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor
- Enhance Security and Compliance: Implement security best practices across DevOps workflows
- Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency
- Manage and scale large ELK clusters handling 2–3+ TB/day log volumes, ensuring high availability and performance
- Optimize ELK architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage
- Build and tune log pipelines: Scale Logstash and Beats pipelines across distributed environments
- Support Kibana observability layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)
- 7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering
- 4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)
- Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)
- Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components
- Expertise in GitHub Actions, Terraform, Ansible, and Infrastructure as Code (IaC)
- Proficiency in Python, Go, or Bash for automation and scripting
- Deep understanding of Kubernetes, Docker, and cloud-native architectures
- Experience with observability tools such as Prometheus, Grafana, Azure Monitor
- Ability to work in a fast-paced, collaborative environment and solve complex operational issues
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field
- Microsoft Azure certifications: AZ-104, AZ-400
Cloud Engineer
Posted today
Job Description
THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS.
Client's R&D team is looking for a talented and highly passionate individual to join its product development team and develop world-class cloud-based software products and solutions to solve interesting problems in the construction industry. If you are a software developer who is proficient in web-based technologies, possesses strong design and coding skills, and is passionate about critical thinking while solving problems, we would love to hear from you. You will be part of an agile team of smart and highly motivated engineers building highly scalable, secure, cloud-based products and services. You will work in a global team and collaborate with local and remote colleagues from various disciplines such as business, engineering, operations, and support. You will work with the latest technologies in a flexible environment.
Minimum Qualifications:
- 5+ years of overall experience with good knowledge of data structures, algorithms, and object-oriented programming.
- Solid understanding of typical web architecture (data, application, web tiers, etc.).
- Proficient in Python-based technologies, with a conceptual understanding of web and RESTful APIs.
- Experience with AWS networking services such as VPC and subnets.
- Experience with monitoring tools such as Splunk or CloudWatch.
- Ability to work with a team in an Agile environment.
Responsibilities:
- Be involved in all aspects of software development.
- Design and develop highly scalable, reliable and fault tolerant systems with minimal guidance.
- Ensure the best possible performance, quality, and responsiveness of the applications.
- Identify bottlenecks and bugs, and devise solutions to these problems.
- Help maintain code quality, organization, and automation.
- Take on full-stack development when necessary.
- Write and maintain code with high attention to detail, perform peer code reviews, and participate in technical design discussions.
Build & Release Automation:
- Provide solutions for implementing continuous integration and continuous deployment for medium-sized projects.
- Manage medium- to large-sized projects.
- Guide the team in solving build and deployment automation issues.
- Design and implement release orchestration solutions for medium- or large-sized projects.
- Participate in the discovery phase of large projects to produce the high-level design.
Infrastructure Automation:
- Develop playbooks/cookbooks for configuration management for medium- to large-sized projects.
- Install and optimize tools on cloud infrastructure.
- Design and implement infrastructure automation solutions on cloud infrastructure.
- Implement optimized complex networking configurations.
- Implement optimized complex storage setups.
- Develop Terraform scripts for deploying AWS resources.
- Develop Python scripts for automating DevOps tasks.
- Develop shell scripts for infrastructure automation.
Must Have Skills:
- AWS Cloud and DevOps.
- Jenkins CI/CD.
- Git.
- Docker containerization and administration.
- Linux administration.
- Python.
Our large, Fortune Technology client is ranked as one of the best companies in the world to work with. As a global leader in 3D design, engineering, and entertainment software, they foster a progressive culture, creativity, and a flexible work environment. They use cutting-edge technologies to stay ahead of the curve. Diversity in all aspects is respected. Integrity, experience, honesty, people, humanity, and passion for excellence are some of the other qualities that define this global technology leader.
Cloud Engineer
Posted today
Job Description
We're looking for a highly skilled and experienced Cloud AI Engineer to join our dynamic team. In this role, you'll be instrumental in designing, developing, and deploying cutting-edge artificial intelligence and machine learning solutions leveraging the full suite of Google Cloud Platform (GCP) services.
Objectives of this role
- Lead the end-to-end development cycle of AI applications, from conceptualization and prototyping to deployment and optimization, with a core focus on LLM-driven solutions.
- Architect and implement highly performant and scalable AI services, effectively integrating with GCP's comprehensive AI/ML ecosystem.
- Collaborate closely with product managers, data scientists, and MLOps engineers to translate complex business requirements into tangible, AI-powered features.
- Continuously research and apply the latest advancements in LLM technology, prompt engineering, and AI frameworks to enhance application capabilities and performance.
Responsibilities
- Develop and deploy production-grade AI applications and microservices primarily using Python and FastAPI, ensuring robust API design, security, and scalability.
- Design and implement end-to-end LLM pipelines, encompassing data ingestion, processing, model inference, and output generation.
- Utilize Google Cloud Platform (GCP) services extensively, including Vertex AI (Generative AI, Model Garden, Workbench), Cloud Functions, Cloud Run, Cloud Storage, and BigQuery, to build, train, and deploy LLMs and AI models.
- Expertly apply prompt engineering techniques and strategies to optimize LLM responses, manage context windows, and reduce hallucinations.
- Implement and manage embeddings and vector stores for efficient information retrieval and Retrieval-Augmented Generation (RAG) patterns.
- Work with advanced LLM orchestration frameworks such as LangChain, LangGraph, Google ADK, and CrewAI to build sophisticated multi-agent systems and complex AI workflows.
- Integrate AI solutions with other enterprise systems and databases, ensuring seamless data flow and interoperability.
- Participate in code reviews, establish best practices for AI application development, and contribute to a culture of technical excellence.
- Keep abreast of the latest advancements in GCP AI/ML services and broader AI/ML technologies, evaluating and recommending new tools and approaches.
Required skills and qualifications
- Two or more years of hands-on experience as an AI Engineer with a focus on building and deploying AI applications, particularly those involving Large Language Models (LLMs).
- Strong programming proficiency in Python, with significant experience in developing web APIs using FastAPI.
- Demonstrable expertise with Google Cloud Platform (GCP), specifically with services like Vertex AI (Generative AI, AI Platform), Cloud Run/Functions, and Cloud Storage.
- Proven experience in prompt engineering, including advanced techniques like few-shot learning, chain-of-thought prompting, and instruction tuning.
- Practical knowledge and application of embeddings and vector stores for semantic search and RAG architectures.
- Hands-on experience with at least one major LLM orchestration framework (e.g., LangChain, LangGraph, CrewAI).
- Solid understanding of software engineering principles, including API design, data structures, algorithms, and testing methodologies.
- Experience with version control systems (Git) and CI/CD pipelines.
Preferred skills and qualifications
- Bachelor's or Master's degree in Computer Science.
Good to have:
- Experience with MLOps practices for deploying, monitoring, and maintaining AI models in production.
- Understanding of distributed computing and data processing technologies.
- Contributions to open-source AI projects or a strong portfolio showcasing relevant AI/LLM applications.
- Excellent analytical and problem-solving skills with a keen attention to detail.
- Strong communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.