Cloud Site Reliability Engineer

Bengaluru, Karnataka ₹800000 - ₹2400000 Y Mastek

Posted today

Job Description

Role: Cloud Site Reliability Engineer (6-9 Years)

Location: Bangalore (Hybrid)

Full-time

6-8 years of hands-on experience with AWS in a production environment

Experience building and deploying Docker images including Docker Compose

Production experience running Kubernetes workloads ideally on AWS EKS

Experience managing and maintaining Kubernetes Clusters on AWS EKS

Experience creating and deploying Helm charts & libraries

Production experience with infrastructure-as-code (IaC), Terraform preferred

Hands-on experience with Jenkins Core, including authoring and maintaining declarative CI/CD pipelines and libraries

Experience with monitoring tools, e.g., CloudWatch, Datadog, and Splunk Cloud

Proficiency with UNIX operating systems and shell scripting

Programming experience, Python preferred

Experience with distributed version control systems, Git preferred

Experience with the agile software development lifecycle and Kanban preferred

Experience with CDN providers, e.g., Akamai, preferred

Site Reliability Engineer

Bengaluru, Karnataka ₹1200000 - ₹3600000 Y Microsoft

Posted today

Job Description

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky-is-the-limit thinking in a cloud-enabled world.

Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

Within Azure Data, the databases team builds and maintains Microsoft's operational database systems. We store and manage data in a structured way to enable a multitude of applications across various industries. We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational databases across relational, non-relational, and OSS offerings.

We are hiring a Software Engineer 2 to join the Azure Cosmos DB team, where you will be working on a large-scale distributed operational database. In this role, you will work on distributed systems problems and technologies to help determine the future of our planet-scale database.

We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.

Responsibilities
  • Operational Efficiency: Lead designing systems/solutions at org scale, streamlining processes and enhancing efficiency.
  • AIOps: Use AI tools and agents to improve SLO/SLAs and reduce toil.
  • Monitoring/Observability Architecture: Develop and implement monitoring agents, dashboards, escalations, and alerts to proactively manage and improve service reliability.
  • Incident Management: Participate in a distributed on-call rotation, drive root cause analysis during outages, and write and review postmortems to continuously improve our services and practices.
  • Team Growth: Advocate for SRE best practices, work independently, and help grow the SRE team by onboarding and mentoring new teammates.

  • Embody our culture and values.

Qualifications

Required/Minimum Qualifications

  • Hands-On Experience: Demonstrate 3-8 years of practical experience in site reliability engineering within commercial large-scale software organizations.
  • Proficiency in coding languages (such as Python, .NET).
  • Live Site Troubleshooting: Adept at troubleshooting live site issues and providing guidance to engineering teams to resolve them promptly.
  • Cloud Proficiency: Possess a good understanding of public cloud offerings such as Azure, Google Cloud, or AWS.
  • Distributed Systems: Experience with distributed systems and micro-service-based architectures.
  • Performance Analysis: Conduct in-depth analysis of web application performance, identifying bottlenecks and areas for improvement. Utilize various monitoring tools and performance profiling techniques to diagnose and troubleshoot performance issues.

Other Requirements

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings. Microsoft Cloud Background Check:

  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications

Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

Site Reliability Engineer

Bengaluru, Karnataka ₹1200000 - ₹3600000 Y Betsol

Posted today

Job Description

Company Description

BETSOL is a cloud-first digital transformation and data management company offering products and IT services to enterprises in over 40 countries. The BETSOL team holds several engineering patents and is recognized with industry awards, and BETSOL maintains a net promoter score that is 2x the industry average.

BETSOL's open source backup and recovery product line, Zmanda, delivers up to 50% savings in total cost of ownership (TCO) and best-in-class performance.

BETSOL Global IT Services builds and supports end-to-end enterprise solutions, reducing time-to-market for its customers.

BETSOL offices are set against the vibrant backdrops of Broomfield, Colorado, and Bangalore, India.

We take pride in being an employee-centric organization, offering comprehensive health insurance, competitive salaries, 401K, volunteer programs, and scholarship opportunities. Office amenities include a fitness center, cafe, and recreational facilities.

Job Description

Own the reliability, availability, performance, and scalability of customer- and employee-facing platforms. Partner with application, infrastructure, security, and NOC teams to engineer resilient services and automate operations across Azure and on-prem environments. Drive incident response and post-incident reviews, implement observability, and continuously improve service health through automation and best practices.

Responsibilities:

  • Build and operate production platforms across Azure (e.g., AKS, App Services, Functions), Windows/Linux, and networking layers in partnership with Platform/Server/Network teams.
  • Engineer end-to-end observability: metrics, logs, and traces via Azure Monitor, Application Insights, Log Analytics, Prometheus, Grafana, and centralized logging.
  • Automate provisioning and configuration using Infrastructure as Code (Terraform/Bicep) and configuration management (Ansible/PowerShell DSC).
  • Design and maintain CI/CD pipelines (Azure DevOps/GitHub Actions) with automated testing, canary/blue-green deployments, and change control alignment.
  • Establish runbooks, SOPs, and self-healing automations to reduce MTTR and ticket volume from the NOC and Service Desk.
  • Harden platform security (identity, secrets, certificates, network segmentation) leveraging Azure Key Vault, managed identities, and policy guardrails.
  • Perform capacity planning, performance tuning, and cost optimization (FinOps) for compute, storage, and networking.
  • Partner with Data/ETL teams to ensure reliability of batch and streaming jobs, scheduling, and dependencies.
  • Create and maintain documentation (architecture, runbooks, dashboards) and support audits and compliance requirements.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.

  • 2–5+ years in SRE/DevOps/Platform Engineering with hands-on production ownership.
  • Proficiency with Azure services (AKS, App Services, Functions, Azure Monitor, Log Analytics, Application Insights).
  • Strong Kubernetes/Docker skills; Helm, ingress, service mesh (e.g., Istio/Linkerd) experience is a plus.
  • IaC (Terraform or Bicep) and scripting (PowerShell and/or Python); Git-based workflows.
  • CI/CD (Azure DevOps or GitHub Actions), artifact management, and release strategies (canary/blue-green).
  • Observability tooling (Prometheus, Grafana, ELK/OpenSearch, Azure Monitor) and alert design to minimize noise.
  • Experience with ITIL processes (incident, change, problem) and tools (ServiceNow/Jira).
  • Knowledge of networking, DNS, TLS/certificates, load balancers, and security fundamentals.
  • Excellent troubleshooting, communication, and cross-functional collaboration skills.
  • Certifications such as Microsoft Azure Administrator/DevOps, CKA/CKAD, or ITIL Foundation are a plus.

Additional Information

All your information will be kept confidential according to EEO guidelines.

Site Reliability Engineer

Bengaluru, Karnataka ₹1200000 - ₹3600000 Y Zetamicron

Posted today

Job Description

Job Title: Site Reliability Engineer (SRE)

About the Role

We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our platform. The ideal candidate will have strong experience in managing production environments, automating operational processes, and enhancing system performance through continuous improvement and innovation.

Key Responsibilities

1. Platform Stability and Reliability

  • Ensure the platform consistently meets defined performance, availability, and reliability SLAs.
  • Identify and resolve performance bottlenecks and potential production risks proactively.
  • Maintain and enhance monitoring, logging, and alerting systems to prevent downtime and incidents.

2. Incident Management

  • Serve as the primary responder during critical incidents, ensuring rapid resolution and minimal impact.
  • Conduct post-incident analysis and implement preventive measures.
  • Develop and maintain detailed runbooks and playbooks to improve operational readiness.

3. Automation and Efficiency

  • Build and maintain automation tools for deployment, scaling, and failover.
  • Enhance CI/CD pipeline performance for faster and more reliable releases.
  • Implement and manage Infrastructure as Code (IaC) using tools like Terraform or Pulumi.

4. Collaboration and Mentorship

  • Collaborate closely with SRE, CI/CD, Developer Experience, and Templates teams to improve platform reliability.
  • Mentor junior engineers and promote best practices in SRE and system operations.
  • Partner with development teams to integrate observability and reliability into the application lifecycle.

5. Observability and Metrics

  • Implement and optimize observability tools such as Dynatrace, Prometheus, or Grafana.
  • Define and maintain key performance metrics and dashboards for system health monitoring.
  • Continuously analyze operational data to identify areas for optimization and improvement.

Qualifications

Required:

  • Minimum 5 years of experience in Site Reliability Engineering, Software Engineering, or related domains.
  • At least 3 years of experience managing AWS cloud environments.
  • Strong programming proficiency in Python, Java, or TypeScript.
  • Hands-on experience with Kubernetes and Docker.
  • Proficiency in CI/CD tools like GitLab, Jenkins, or similar.
  • Experience with monitoring and alerting tools (preferably Dynatrace).

Preferred:

  • Advanced expertise in Kubernetes (K8s) for container orchestration and deployment.
  • Familiarity with observability stacks like Prometheus and Grafana.
  • Exposure to Agile development environments.
  • Experience with additional cloud platforms (Azure or Google Cloud) is a plus.

Why Join Us

  • Opportunity to work on cutting-edge cloud and DevOps technologies.
  • Collaborative, growth-oriented, and learning-driven work culture.
  • Competitive compensation and clear career progression.

Site Reliability Engineer

Bengaluru, Karnataka ₹104000 - ₹130878 Y Enterprise Minds

Posted today

Job Description

Job Title: Site Reliability Engineer

Department: Engineering / Infrastructure

- Reports To: SRE Manager / DevOps Lead
- Location: Bangalore, India

Role Summary

The Site Reliability Engineer (SRE) will be responsible for ensuring the availability, performance, and scalability of critical systems. This role involves managing CI/CD pipelines, monitoring production environments, automating operations, and driving platform reliability improvements in collaboration with development and infrastructure teams.

Key Responsibilities
- Manage alerts and monitoring of critical production systems.
- Operate and enhance CI/CD pipelines and improve deployment and rollback strategies.
- Work with central platform teams on reliability initiatives.
- Automate testing, regression, and build tooling for operational efficiency.
- Execute NFR testing on production systems.
- Plan and implement Debian version migrations with minimal disruption.

Required Qualifications & Skills
- CI/CD and Packaging Tools:
  - Hands-on experience with Jenkins, Docker, JFrog for packaging and deployment.
- Operating System Expertise:
  - Experience in Debian OS migration and upgrade processes.
- Monitoring Systems:
  - Knowledge of Grafana, Nagios, and other observability tools.
- Configuration Management:
  - Proficiency with Ansible, Puppet, or Chef.
- Version Control:
  - Working knowledge of Git and related version control systems.
- Kubernetes:
  - Deep understanding of Kubernetes architecture, deployment pipelines, and debugging.
  - Ability to deploy components with detailed insights into:
    - Configuration parameters and system requirements
    - Monitoring and alerting needs
    - Performance tuning
    - Designing for high availability and fault tolerance
- Networking:
  - Understanding of TCP/IP, UDP, Multicast, Broadcast.
  - Experience with TCPDump, Wireshark for network diagnostics.
- Linux & Databases:
  - Strong skills in Linux tools and scripting.
  - Familiarity with MySQL and NoSQL database systems.

Soft Skills
- Strong problem-solving and analytical skills
- Effective communication and collaboration with cross-functional teams
- Ownership mindset and accountability
- Adaptability to fast-paced and dynamic environments
- Detail-oriented and proactive approach

Preferred Qualifications
- Bachelor's degree in Computer Science, Engineering, or related technical field
- Certifications in Kubernetes (CKA/CKAD), Linux, or DevOps practices
- Experience with cloud platforms (AWS, GCP, Azure)
- Exposure to service mesh, observability stacks, or SRE toolkits

Key Relationships
- Internal: DevOps, Infrastructure, Software Development, QA, Security Teams
- External: Tool vendors, platform service providers (if applicable)

Role Dimensions
- Impact on uptime and reliability of business-critical services
- Ownership of CI/CD and production deployment processes
- Contributor to cross-team reliability and scalability initiatives

Success Measures (KPIs)
- System uptime and availability (SLA adherence)
- Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) incidents
- Deployment success rate and rollback frequency
- Automation coverage of operational tasks
- Completion of OS migration and infrastructure upgrade projects

Competency Framework Alignment
- Technical Mastery: Infrastructure, automation, CI/CD, Kubernetes, monitoring
- Execution Excellence: Timely project delivery, process improvements
- Collaboration: Cross-functional team engagement and support
- Resilience: Problem solving under pressure and incident response
- Innovation: Continuous improvement of operational reliability and performance

Site Reliability Engineer

Bengaluru, Karnataka ₹1500000 - ₹2500000 Y Amiseq

Posted today

Job Description

Job Functions:

You will be a member of our AI Platform Team, supporting the next generation AI architecture for various research and engineering teams within the organization.

You'll partner with vendors and the infrastructure engineering team for security and service availability

You'll fix production issues with engineering teams, researchers, data scientists, including performance and functional issues

Diagnose and solve customer technical problems

Participate in training customers and prepare reports on customer issues

Be responsible for customer service improvements and recommend product improvements

Write support documentation

You'll design and implement zero-downtime approaches to monitor and achieve a highly available service.

As a support engineer, find opportunities to automate as part of the problem management process, creating automation to avoid issues.

Define engineering excellence for operational maturity

You'll work together with AI platform developers to provide the CI/CD model to deploy and configure the production system automatically

Develop and follow operational standard processes for tools and automation development, including style guides, versioning practices, source control, branching and merging patterns, and advising other engineers on development standards

Deliver solutions that accelerate the activities phenomenal engineers would perform, through automation, deep domain expertise, and knowledge sharing

Required Skills:

Demonstrated ability in designing, building, refactoring and releasing software written in Python.

Hands-on experience with ML frameworks such as PyTorch, TensorFlow, Triton.

Ability to handle framework-related issues, version upgrades, and compatibility with data processing / model training environments.

Experience with AI/ML model training and inferencing platforms is a big plus.

Experience with LLM fine-tuning systems is a big plus.

Debugging and triaging skills.

Cloud technologies like Kubernetes, Docker and Linux fundamentals.

Familiar with DevOps practices and continuous testing.

DevOps pipeline and automations: app deployment/configuration & performance monitoring.

Test automations, Jenkins CI/CD.

Excellent communication, presentation, and leadership skills to be able to work and collaborate with partners, customers and engineering teams.

Well organized and able to manage multiple projects in a fast paced and demanding environment.

Good oral/reading/writing English ability

Site Reliability Engineer

Bengaluru, Karnataka ₹500000 - ₹1500000 Y PhonePe

Posted today

Job Description

About PhonePe Limited:

PhonePe Limited is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in Aug 2016. As of April 2025, PhonePe has over 60 Crore (600 Million) registered users and a digital payments acceptance network spread across over 4 Crore (40+ million) merchants. PhonePe also processes over 33 Crore (330+ Million) transactions daily with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.

PhonePe's portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses (Pincode, a hyperlocal e-commerce platform, and Indus AppStore, a localized app store for the Android ecosystem) in India, which are aligned with the company's vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.

Culture:

At PhonePe, we go the extra mile to make sure you can bring your best self to work, every day. And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you're excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us.

Site Reliability Engineer - Intern

Job Description:

  • Knowledge of Linux/Unix administration
  • Knowledge of networking
  • Knowledge of a wide variety of open source technologies/tools and cloud services
  • Knowledge of best practices and IT operations in an always-up, always-available service
  • Good English communication skills
  • Strong background in Linux networking (ip, iptables, ipsec)
  • Knowledge of working with MySQL
  • A working understanding of code and scripting (e.g., Perl/Golang preferred)
  • Knowledge of automation/configuration management using either SaltStack or equivalent

Knowledge of the following will be a plus:

  • Implementation of cloud services on Linux using KVM/QEMU
  • DC/OS (Mesos & Mesos frameworks)
  • Aerospike (NoSQL)
  • Perl/Golang
  • Galera
  • OpenBSD
  • Data center related activities (rarely needed)

PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)

  • Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
  • Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
  • Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
  • Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
  • Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
  • Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy

Our inclusive culture promotes individual expression, creativity, innovation, and achievement and in turn helps us better understand and serve our customers. We see ourselves as a place for intellectual curiosity, ideas and debates, where diverse perspectives lead to deeper understanding and better quality results. PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally; regardless of gender, sexual preference, religion, race, color or disability. If you have a disability or special need that requires assistance or reasonable accommodation, during the application and hiring process, including support for the interview or onboarding process, please fill out this form.

Site Reliability Engineer

Bengaluru, Karnataka ₹1500000 - ₹2500000 Y Xebia

Posted today

Job Description

We are seeking an experienced AWS DevOps Engineer with strong expertise in Observability and Site Reliability Engineering (SRE) to design, build, and manage scalable, reliable, and secure cloud environments. The role requires hands-on experience with AWS services, Infrastructure as Code (IaC), CI/CD, monitoring & observability frameworks, and incident response practices to ensure high availability, performance, and resilience of business-critical systems.

Key Responsibilities

  • Cloud Infrastructure (AWS):
    • Design, implement, and manage scalable, resilient, and cost-optimized cloud infrastructure using AWS services (EC2, EKS, Lambda, RDS, S3, CloudFront, IAM, VPC, etc.).
    • Implement Infrastructure as Code (IaC) using tools like Terraform / CloudFormation.
  • DevOps & Automation:
    • Build and maintain CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, or AWS CodePipeline) for automated deployments.
    • Automate repetitive tasks to improve development velocity and operational efficiency.
  • Observability & Monitoring:
    • Define and implement observability strategy covering monitoring, logging, tracing, and alerting.
    • Work with tools like Prometheus, Grafana, ELK/EFK stack, AWS CloudWatch, Datadog, New Relic, Splunk, or Dynatrace.
    • Establish SLIs, SLOs, and SLAs to measure and improve system reliability.
  • Site Reliability Engineering (SRE):
    • Drive incident management processes – detection, alerting, root cause analysis, and postmortems.
    • Apply chaos engineering principles to validate resilience and recovery.
    • Optimize reliability, latency, scalability, and system efficiency.
  • Security & Compliance:
    • Implement best practices for cloud security, identity & access management, and compliance frameworks (ISO, SOC2, GDPR, etc.).
    • Ensure observability and monitoring meet security and audit requirements.
  • Collaboration & Leadership:
    • Partner with development, QA, and product teams to ensure seamless deployments.
    • Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement.

Required Skills & Qualifications

  • 7+ years of professional experience in DevOps, Cloud Infrastructure, or SRE roles.
  • Strong expertise in AWS Cloud (certification preferred: AWS Certified DevOps Engineer, Solutions Architect, or SysOps).
  • Proficiency in IaC tools (Terraform, CloudFormation).
  • Solid experience in CI/CD pipeline tools (Jenkins, GitHub Actions, GitLab CI/CD, AWS CodePipeline).
  • Hands-on with observability tools: Prometheus, Grafana, CloudWatch, ELK, Datadog, New Relic, Splunk, or similar.
  • Deep understanding of SRE principles: SLIs/SLOs, error budgets, incident response, chaos testing.
  • Strong scripting/coding experience (Python, Bash, Go, or similar).
  • Knowledge of containers & orchestration (Docker, Kubernetes, EKS).
  • Familiarity with security best practices in cloud-native environments.

Preferred Skills

  • Experience with multi-cloud or hybrid-cloud environments.
  • Exposure to resiliency testing & chaos engineering tools (Gremlin, Litmus, Chaos Mesh).
  • Knowledge of cost-optimization and FinOps in AWS.
  • Excellent communication and stakeholder management skills.

What We Offer

  • Opportunity to work on cutting-edge cloud-native architectures.
  • A culture focused on automation, reliability, and innovation.
  • Growth opportunities with certifications, training, and leadership exposure.

Site Reliability Engineer

Bengaluru, Karnataka ₹1200000 - ₹3600000 Y Weave

Posted today

Job Description

This position supports TrueLark, a recently acquired brand under the Weave umbrella. Your work will directly contribute to the TrueLark product and team. TrueLark is an AI-powered virtual receptionist designed for appointment-based businesses. Its agentic AI platform manages scheduling, rescheduling, and client inquiries through SMS and web chat, providing 24/7 support. TrueLark helps businesses recover missed calls, increase bookings, and streamline front-office operations.

At TrueLark, our infrastructure powers mission-critical business applications for customers across industries. As we continue to grow, we need a Scaling Engineer who can design and implement scalable, high-performance infrastructure solutions.

You will focus on optimizing performance, improving system resilience, and scaling our services built on Redis, MySQL, ActiveMQ, and Jetty Spark servers. This is a highly impactful role where you will collaborate with engineering and DevOps teams to ensure our systems perform reliably under heavy load.

  • This position will be onsite 2-3 times per week in the Bengaluru office
  • Reports to: Head of TrueLark Technology

What You Will Own

  • Infrastructure Scaling & Optimization: Analyze current infrastructure and design solutions to handle increasing load with high availability.
  • Database Performance Tuning: Optimize MySQL (indexes, queries, partitioning, replication, sharding) and Redis caching strategies for throughput and latency improvements.
  • Message Queue Scaling: Enhance and optimize ActiveMQ for large-scale, distributed messaging with high durability and reliability.
  • Server & Application Scaling: Tune Jetty Spark servers for concurrency, session management, load balancing, and failover strategies.
  • Monitoring & Bottleneck Detection: Implement monitoring dashboards (Grafana, Prometheus, ELK/EFK, or Azure Monitor) to detect slow queries, CPU/memory bottlenecks, and scaling issues.
  • Distributed Systems Reliability: Ensure fault tolerance, disaster recovery, and horizontal scaling strategies.
  • CI/CD & Automation: Work with GitHub Actions, Jenkins, or Azure DevOps for automated deployments and scaling workflows.
  • Collaboration: Partner with backend engineers, DevOps, and product teams to deliver reliable infrastructure that meets performance goals.

What You Will Need to Accomplish the Job

  • Bachelor's degree in Computer Science, Information Technology, or related field.
  • 3+ years of experience in scaling infrastructure for distributed systems or high-traffic applications.
  • Strong hands-on experience with:
    • Databases: MySQL (query optimization, replication, partitioning, indexing strategies).
    • Caching: Redis (cluster setup, pub/sub, eviction policies, memory optimization).
    • Message Queues: ActiveMQ (scaling consumers/producers, HA clusters, performance tuning).
    • Web Servers: Jetty with Spark framework (thread pools, tuning JVM, load balancing).
  • Proficiency in monitoring & observability tools (Grafana, Prometheus, ELK/EFK stack, Azure Monitor, or DataDog).
  • Knowledge of distributed systems principles (CAP theorem, sharding, replication, consensus).
  • Familiarity with CI/CD pipelines, Docker, Kubernetes (AKS/EKS/GKE optional).
  • Strong problem-solving skills in debugging production performance bottlenecks.
  • Ability to work independently and collaboratively in a fast-paced environment.

What Will Make Us Love You

  • Experience with horizontal scaling of microservices in production.
  • Knowledge of cloud platforms (Azure, AWS, GCP) for scaling and infrastructure automation.
  • Hands-on experience with load testing tools (JMeter, Locust, k6, Artillery).
  • Familiarity with event-driven architecture and streaming platforms (Kafka).
  • Contributions to open-source performance/scaling projects.

Weave is an equal opportunity employer that is committed to fostering an inclusive workplace where all individuals are valued and supported. We welcome anyone who is hungry to learn, problem-solve and progress regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, or other applicable legally protected characteristics. If you have a disability or special need that requires accommodation, please let us know.

All official correspondence will occur through Weave branded email. We will never ask you to share bank account information, cash a check from us, or purchase software or equipment as part of your interview or hiring process.

Site Reliability Engineer

Bangalore, Karnataka Iron Mountain

Posted 4 days ago

Job Description

At Iron Mountain we know that work, when done well, makes a positive impact for our customers, our employees, and our planet. That's why we need smart, committed people to join us. Whether you're looking to start your career or make a change, talk to us and see how you can elevate the power of your work at Iron Mountain.
We provide expert, sustainable solutions in records and information management, digital transformation services, data centers, asset lifecycle management, and fine art storage, handling, and logistics. We proudly partner every day with our 225,000 customers around the world to preserve their invaluable artifacts, extract more from their inventory, and protect their data privacy in innovative and socially responsible ways.
Are you curious about being part of our growth story while evolving your skills in a culture that will welcome your unique contributions? If so, let's start the conversation.
Category: Information Technology
Iron Mountain is a global leader in storage and information management services trusted by more than 225,000 organizations in 60 countries. We safeguard billions of our customers' assets, including critical business information, highly sensitive data, and invaluable cultural and historic artifacts.
Iron Mountain helps lower cost and risk, comply with regulations, recover from disaster, and enable digital and sustainable solutions, whether in information management, digital transformation, secure storage and destruction, data center operations, cloud services, or art storage and logistics. Please see our Values and Code of Ethics for a look at our principles and aspirations in elevating the power of our work together.
If you have a physical or mental disability that requires special accommodations, please let us know by email. See the Supplement to learn more about Equal Employment Opportunity.
Iron Mountain is committed to a policy of equal employment opportunity. We recruit and hire applicants without regard to race, color, religion, sex (including pregnancy), national origin, disability, age, sexual orientation, veteran status, genetic information, gender identity, gender expression, or any other factor prohibited by law.
**Requisition:** J