1776 DevOps Engineer jobs in Bengaluru
Cloud Site Reliability Engineer
Posted today
Job Description
Role: Cloud Site Reliability Engineer (6-9 Years)
Location: Bangalore (Hybrid)
Full-time
6-8 years of hands-on experience with AWS in a production environment
Experience building and deploying Docker images including Docker Compose
Production experience running Kubernetes workloads ideally on AWS EKS
Experience managing and maintaining Kubernetes Clusters on AWS EKS
Experience creating and deploying Helm charts & libraries
Production experience with infrastructure-as-code (IaC), Terraform preferred
Hands-on experience with Jenkins Core, including authoring and maintaining declarative CI/CD pipelines and libraries
Experience with monitoring tools, e.g., CloudWatch, Datadog, and Splunk Cloud
Proficiency with UNIX operating systems and shell scripting
Programming experience, Python preferred
Experience with distributed version control systems, Git preferred
Experience with the agile software development lifecycle, Kanban preferred
Experience with CDN providers, Akamai preferred
Site Reliability Engineer
Posted today
Job Description
Microsoft is a company where passionate innovators come to collaborate, envision what can be and take their careers further. This is a world of more possibilities, more innovation, more openness, and the sky is the limit thinking in a cloud-enabled world.
Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.
Within Azure Data, the databases team builds and maintains Microsoft's operational database systems. We store and manage data in a structured way to enable a multitude of applications across various industries. We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational databases across relational, non-relational, and OSS offerings.
We are hiring a Software Engineer 2 to join the Azure Cosmos DB team, where you will be working on a large-scale distributed operational database. In this role, you will work on distributed systems problems and technologies to help determine the future of our planet scale database.
We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.
Responsibilities
- Operational Efficiency: Lead designing systems/solutions at org scale, streamlining processes and enhancing efficiency.
- AIOps: Use AI tools and agents to improve SLO/SLAs and reduce toil.
- Monitoring/Observability Architecture: Develop and implement monitoring agents, dashboards, escalations, and alerts to proactively manage and improve service reliability.
- Incident Management: Participate in a distributed on-call rotation, drive root cause analysis during outages, and write and review postmortems to continuously improve our services and practices.
- Team Growth: Advocate for SRE best practices, work independently, and help grow the SRE team by onboarding and mentoring new teammates.
Embody our culture and values
Qualifications
Required/Minimum Qualifications
- Hands-On Experience: Demonstrate 3-8 years of practical experience in site reliability engineering within commercial large-scale software Organizations.
- Proficiency in coding languages (such as Python, .NET).
- Live Site Troubleshooting: Adept at troubleshooting live site issues and providing guidance to engineering teams to resolve them promptly.
- Cloud Proficiency: Possess a good understanding of public cloud offerings such as Azure, Google Cloud, or AWS.
- Distributed Systems: Experience with distributed systems and micro-service-based architectures.
- Performance Analysis: Conduct in-depth analysis of web application performance, identifying bottlenecks and areas for improvement. Utilize various monitoring tools and performance profiling techniques to diagnose and troubleshoot performance issues.
Other Requirements
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check:
- This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.
Site Reliability Engineer
Posted today
Job Description
Company Description
BETSOL is a cloud-first digital transformation and data management company offering products and IT services to enterprises in over 40 countries. The BETSOL team holds several engineering patents and is recognized with industry awards, and BETSOL maintains a net promoter score that is 2x the industry average.
BETSOL's open source backup and recovery product line, Zmanda, delivers up to 50% savings in total cost of ownership (TCO) and best-in-class performance.
BETSOL Global IT Services builds and supports end-to-end enterprise solutions, reducing time-to-market for its customers.
BETSOL offices are set against the vibrant backdrops of Broomfield, Colorado, and Bangalore, India.
We take pride in being an employee-centric organization, offering comprehensive health insurance, competitive salaries, 401K, volunteer programs, and scholarship opportunities. Office amenities include a fitness center, cafe, and recreational facilities.
Job Description
Own the reliability, availability, performance, and scalability of customer- and employee-facing platforms. Partner with application, infrastructure, security, and NOC teams to engineer resilient services and automate operations across Azure and on-prem environments. Drive incident response and post-incident reviews, implement observability, and continuously improve service health through automation and best practices.
Responsibilities:
- Build and operate production platforms across Azure (e.g., AKS, App Services, Functions), Windows/Linux, and networking layers in partnership with Platform/Server/Network teams.
- Engineer end-to-end observability: metrics, logs, and traces via Azure Monitor, Application Insights, Log Analytics, Prometheus, Grafana, and centralized logging.
- Automate provisioning and configuration using Infrastructure as Code (Terraform/Bicep) and configuration management (Ansible/PowerShell DSC).
- Design and maintain CI/CD pipelines (Azure DevOps/GitHub Actions) with automated testing, canary/blue-green deployments, and change control alignment.
- Establish runbooks, SOPs, and self-healing automations to reduce MTTR and ticket volume from the NOC and Service Desk.
- Harden platform security (identity, secrets, certificates, network segmentation) leveraging Azure Key Vault, managed identities, and policy guardrails.
- Perform capacity planning, performance tuning, and cost optimization (FinOps) for compute, storage, and networking.
- Partner with Data/ETL teams to ensure reliability of batch and streaming jobs, scheduling, and dependencies.
- Create and maintain documentation (architecture, runbooks, dashboards) and support audits and compliance requirements.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- 2–5+ years in SRE/DevOps/Platform Engineering with hands-on production ownership.
- Proficiency with Azure services (AKS, App Services, Functions, Azure Monitor, Log Analytics, Application Insights).
- Strong Kubernetes/Docker skills; Helm, ingress, service mesh (e.g., Istio/Linkerd) experience is a plus.
- IaC (Terraform or Bicep) and scripting (PowerShell and/or Python); Git-based workflows.
- CI/CD (Azure DevOps or GitHub Actions), artifact management, and release strategies (canary/blue-green).
- Observability tooling (Prometheus, Grafana, ELK/OpenSearch, Azure Monitor) and alert design to minimize noise.
- Experience with ITIL processes (incident, change, problem) and tools (ServiceNow/Jira).
- Knowledge of networking, DNS, TLS/certificates, load balancers, and security fundamentals.
- Excellent troubleshooting, communication, and cross-functional collaboration skills.
- Certifications such as Microsoft Azure Administrator/DevOps, CKA/CKAD, or ITIL Foundation are a plus.
Additional Information
All your information will be kept confidential according to EEO guidelines.
Site Reliability Engineer
Posted today
Job Description
Job Title: Site Reliability Engineer (SRE)
About the Role
We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our platform. The ideal candidate will have strong experience in managing production environments, automating operational processes, and enhancing system performance through continuous improvement and innovation.
Key Responsibilities
1. Platform Stability and Reliability
- Ensure the platform consistently meets defined performance, availability, and reliability SLAs.
- Identify and resolve performance bottlenecks and potential production risks proactively.
- Maintain and enhance monitoring, logging, and alerting systems to prevent downtime and incidents.
2. Incident Management
- Serve as the primary responder during critical incidents, ensuring rapid resolution and minimal impact.
- Conduct post-incident analysis and implement preventive measures.
- Develop and maintain detailed runbooks and playbooks to improve operational readiness.
3. Automation and Efficiency
- Build and maintain automation tools for deployment, scaling, and failover.
- Enhance CI/CD pipeline performance for faster and more reliable releases.
- Implement and manage Infrastructure as Code (IaC) using tools like Terraform or Pulumi.
4. Collaboration and Mentorship
- Collaborate closely with SRE, CI/CD, Developer Experience, and Templates teams to improve platform reliability.
- Mentor junior engineers and promote best practices in SRE and system operations.
- Partner with development teams to integrate observability and reliability into the application lifecycle.
5. Observability and Metrics
- Implement and optimize observability tools such as Dynatrace, Prometheus, or Grafana.
- Define and maintain key performance metrics and dashboards for system health monitoring.
- Continuously analyze operational data to identify areas for optimization and improvement.
Qualifications
Required:
- Minimum 5 years of experience in Site Reliability Engineering, Software Engineering, or related domains.
- At least 3 years of experience managing AWS cloud environments.
- Strong programming proficiency in Python, Java, or TypeScript.
- Hands-on experience with Kubernetes and Docker.
- Proficiency in CI/CD tools like GitLab, Jenkins, or similar.
- Experience with monitoring and alerting tools (preferably Dynatrace).
Preferred:
- Advanced expertise in Kubernetes (K8s) for container orchestration and deployment.
- Familiarity with observability stacks like Prometheus and Grafana.
- Exposure to Agile development environments.
- Experience with additional cloud platforms (Azure or Google Cloud) is a plus.
Why Join Us
- Opportunity to work on cutting-edge cloud and DevOps technologies.
- Collaborative, growth-oriented, and learning-driven work culture.
- Competitive compensation and clear career progression.
Site Reliability Engineer
Posted today
Job Description
Job Title: Site Reliability Engineer
- Department: Engineering / Infrastructure
- Reports To: SRE Manager / DevOps Lead
- Location: Bangalore, India
Role Summary
The Site Reliability Engineer (SRE) will be responsible for ensuring the availability, performance, and scalability of critical systems. This role involves managing CI/CD pipelines, monitoring production environments, automating operations, and driving platform reliability improvements in collaboration with development and infrastructure teams.
Key Responsibilities
- Manage alerts and monitoring of critical production systems.
- Operate and enhance CI/CD pipelines and improve deployment and rollback strategies.
- Work with central platform teams on reliability initiatives.
- Automate testing, regression, and build tooling for operational efficiency.
- Execute NFR testing on production systems.
- Plan and implement Debian version migrations with minimal disruption.
Required Qualifications & Skills
- CI/CD and Packaging Tools:
- Hands-on experience with Jenkins, Docker, JFrog for packaging and deployment.
- Operating System Expertise:
- Experience in Debian OS migration and upgrade processes.
- Monitoring Systems:
- Knowledge of Grafana, Nagios, and other observability tools.
- Configuration Management:
- Proficiency with Ansible, Puppet, or Chef.
- Version Control:
- Working knowledge of Git and related version control systems.
- Kubernetes:
- Deep understanding of Kubernetes architecture, deployment pipelines, and debugging.
- Ability to deploy components with detailed insights into:
- Configuration parameters and system requirements
- Monitoring and alerting needs
- Performance tuning
- Designing for high availability and fault tolerance
- Networking:
- Understanding of TCP/IP, UDP, Multicast, Broadcast.
- Experience with TCPDump, Wireshark for network diagnostics.
- Linux & Databases:
- Strong skills in Linux tools and scripting.
- Familiarity with MySQL and NoSQL database systems.
Soft Skills
- Strong problem-solving and analytical skills
- Effective communication and collaboration with cross-functional teams
- Ownership mindset and accountability
- Adaptability to fast-paced and dynamic environments
- Detail-oriented and proactive approach
Preferred Qualifications
- Bachelor's degree in Computer Science, Engineering, or related technical field
- Certifications in Kubernetes (CKA/CKAD), Linux, or DevOps practices
- Experience with cloud platforms (AWS, GCP, Azure)
- Exposure to service mesh, observability stacks, or SRE toolkits
Key Relationships
- Internal: DevOps, Infrastructure, Software Development, QA, Security Teams
- External: Tool vendors, platform service providers (if applicable)
Role Dimensions
- Impact on uptime and reliability of business-critical services
- Ownership of CI/CD and production deployment processes
- Contributor to cross-team reliability and scalability initiatives
Success Measures (KPIs)
- System uptime and availability (SLA adherence)
- Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) incidents
- Deployment success rate and rollback frequency
- Automation coverage of operational tasks
- Completion of OS migration and infrastructure upgrade projects
Competency Framework Alignment
- Technical Mastery: Infrastructure, automation, CI/CD, Kubernetes, monitoring
- Execution Excellence: Timely project delivery, process improvements
- Collaboration: Cross-functional team engagement and support
- Resilience: Problem solving under pressure and incident response
- Innovation: Continuous improvement of operational reliability and performance
Site Reliability Engineer
Posted today
Job Description
Job Functions:
You will be a member of our AI Platform Team, supporting the next generation AI architecture for various research and engineering teams within the organization.
You'll partner with vendors and the infrastructure engineering team for security and service availability
You'll fix production issues with engineering teams, researchers, data scientists, including performance and functional issues
Diagnose and solve customer technical problems
Participate in training customers and prepare reports on customer issues
Be responsible for customer service improvements and recommend product improvements
Write support documentation
You'll design and implement zero-downtime to monitor and accomplish a highly available service %)
As a support engineer, find opportunities to automate as part of the problem management process, creating automation to avoid issues.
Define engineering excellence for operational maturity
You'll work together with AI platform developers to provide the CI/CD model to deploy and configure the production system automatically
Develop and follow operational standard processes for tools and automation development, including style guides, versioning practices, source control, branching and merging patterns, and advising other engineers on development standards
Deliver solutions that accelerate the activities phenomenal engineers would perform, through automation, deep domain expertise, and knowledge sharing
Required Skills:
Demonstrated ability in designing, building, refactoring and releasing software written in Python.
Hands-on experience with ML frameworks such as PyTorch, TensorFlow, Triton.
Ability to handle framework-related issues, version upgrades, and compatibility with data processing / model training environments.
Experience with AI/ML model training and inferencing platforms is a big plus.
Experience with the LLM fine tuning system is a big plus.
Debugging and triaging skills.
Cloud technologies like Kubernetes, Docker and Linux fundamentals.
Familiar with DevOps practices and continuous testing.
DevOps pipeline and automations: app deployment/configuration & performance monitoring.
Test automations, Jenkins CI/CD.
Excellent communication, presentation, and leadership skills to be able to work and collaborate with partners, customers and engineering teams.
Well organized and able to manage multiple projects in a fast paced and demanding environment.
Good oral/reading/writing English ability
Site Reliability Engineer
Posted today
Job Description
About PhonePe Limited:
PhonePe Limited is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in Aug 2016. As of April 2025, PhonePe has over 60 Crore (600 Million) registered users and a digital payments acceptance network spread across over 4 Crore (40+ million) merchants. PhonePe also processes over 33 Crore (330+ Million) transactions daily with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.
PhonePe's portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses (Pincode, a hyperlocal e-commerce platform, and Indus AppStore, a localized app store for the Android ecosystem) in India, which are aligned with the company's vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.
Culture:
At PhonePe, we go the extra mile to make sure you can bring your best self to work, every day. And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you're excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us.
Site Reliability Engineer - Intern
Job Description:
- Knowledge in Linux/Unix Administration
- Knowledge in networking
- Knowledge of wide variety of open source technologies/tools and cloud services
- Knowledge of best practices and IT operations in an always-up, always-available service
- Good English communication skills
- Strong background in Linux networking (ip, iptables, ipsec)
- Knowledge of working with MySQL
- A working understanding of code and scripting (e.g., Perl/Golang preferred)
- Knowledge of automation/configuration management using either SaltStack or equivalent
Knowledge of the following will be a plus:
- Implementation of cloud services on Linux using KVM/QEMU
- DC/OS (Mesos & Mesos frameworks)
- Aerospike (NoSQL)
- Perl/Golang
- Galera
- OpenBSD
- Data center related activities (rarely needed)
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
- Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
- Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
- Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
- Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
- Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
- Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Our inclusive culture promotes individual expression, creativity, innovation, and achievement and in turn helps us better understand and serve our customers. We see ourselves as a place for intellectual curiosity, ideas and debates, where diverse perspectives lead to deeper understanding and better quality results. PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally; regardless of gender, sexual preference, religion, race, color or disability. If you have a disability or special need that requires assistance or reasonable accommodation, during the application and hiring process, including support for the interview or onboarding process, please fill out this form.
Read more about PhonePe on our blog.
Site Reliability Engineer
Posted today
Job Description
We are seeking an experienced AWS DevOps Engineer with strong expertise in Observability and Site Reliability Engineering (SRE) to design, build, and manage scalable, reliable, and secure cloud environments. The role requires hands-on experience with AWS services, Infrastructure as Code (IaC), CI/CD, monitoring & observability frameworks, and incident response practices to ensure high availability, performance, and resilience of business-critical systems.
Key Responsibilities
- Cloud Infrastructure (AWS):
- Design, implement, and manage scalable, resilient, and cost-optimized cloud infrastructure using AWS services (EC2, EKS, Lambda, RDS, S3, CloudFront, IAM, VPC, etc.).
- Implement Infrastructure as Code (IaC) using tools like Terraform / CloudFormation.
- DevOps & Automation:
- Build and maintain CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, or AWS CodePipeline) for automated deployments.
- Automate repetitive tasks to improve development velocity and operational efficiency.
- Observability & Monitoring:
- Define and implement observability strategy covering monitoring, logging, tracing, and alerting.
- Work with tools like Prometheus, Grafana, ELK/EFK stack, AWS CloudWatch, Datadog, New Relic, Splunk, or Dynatrace.
- Establish SLIs, SLOs, and SLAs to measure and improve system reliability.
- Site Reliability Engineering (SRE):
- Drive incident management processes – detection, alerting, root cause analysis, and postmortems.
- Apply chaos engineering principles to validate resilience and recovery.
- Optimize reliability, latency, scalability, and system efficiency.
- Security & Compliance:
- Implement best practices for cloud security, identity & access management, and compliance frameworks (ISO, SOC2, GDPR, etc.).
- Ensure observability and monitoring meet security and audit requirements.
- Collaboration & Leadership:
- Partner with development, QA, and product teams to ensure seamless deployments.
- Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement.
Required Skills & Qualifications
- 7+ years of professional experience in DevOps, Cloud Infrastructure, or SRE roles.
- Strong expertise in AWS Cloud (certification preferred: AWS Certified DevOps Engineer, Solutions Architect, or SysOps).
- Proficiency in IaC tools (Terraform, CloudFormation).
- Solid experience in CI/CD pipeline tools (Jenkins, GitHub Actions, GitLab CI/CD, AWS CodePipeline).
- Hands-on with observability tools: Prometheus, Grafana, CloudWatch, ELK, Datadog, New Relic, Splunk, or similar.
- Deep understanding of SRE principles: SLIs/SLOs, error budgets, incident response, chaos testing.
- Strong scripting/coding experience (Python, Bash, Go, or similar).
- Knowledge of containers & orchestration (Docker, Kubernetes, EKS).
- Familiarity with security best practices in cloud-native environments.
Preferred Skills
- Experience with multi-cloud or hybrid-cloud environments.
- Exposure to resiliency testing & chaos engineering tools (Gremlin, Litmus, Chaos Mesh).
- Knowledge of cost-optimization and FinOps in AWS.
- Excellent communication and stakeholder management skills.
What We Offer
- Opportunity to work on cutting-edge cloud-native architectures.
- A culture focused on automation, reliability, and innovation.
- Growth opportunities with certifications, training, and leadership exposure.
Site Reliability Engineer
Posted today
Job Description
This position supports TrueLark, a recently acquired brand under the Weave umbrella. Your work will directly contribute to the TrueLark product and team. TrueLark is an AI-powered virtual receptionist designed for appointment-based businesses. Its agentic AI platform manages scheduling, rescheduling, and client inquiries through SMS and web chat, providing 24/7 support. TrueLark helps businesses recover missed calls, increase bookings, and streamline front-office operations.
At TrueLark, our infrastructure powers mission-critical business applications for customers across industries. As we continue to grow, we need a Scaling Engineer who can design and implement scalable, high-performance infrastructure solutions.
You will focus on optimizing performance, improving system resilience, and scaling our services built on Redis, MySQL, ActiveMQ, and Jetty Spark servers. This is a highly impactful role where you will collaborate with engineering and DevOps teams to ensure our systems perform reliably under heavy load.
- This position will be onsite 2-3 times per week in the Bengaluru office
- Reports to: Head of TrueLark Technology
What You Will Own
- Infrastructure Scaling & Optimization: Analyze current infrastructure and design solutions to handle increasing load with high availability.
- Database Performance Tuning: Optimize MySQL (indexes, queries, partitioning, replication, sharding) and Redis caching strategies for throughput and latency improvements.
- Message Queue Scaling: Enhance and optimize ActiveMQ for large-scale, distributed messaging with high durability and reliability.
- Server & Application Scaling: Tune Jetty Spark servers for concurrency, session management, load balancing, and failover strategies.
- Monitoring & Bottleneck Detection: Implement monitoring dashboards (Grafana, Prometheus, ELK/EFK, or Azure Monitor) to detect slow queries, CPU/memory bottlenecks, and scaling issues.
- Distributed Systems Reliability: Ensure fault tolerance, disaster recovery, and horizontal scaling strategies.
- CI/CD & Automation: Work with GitHub Actions, Jenkins, or Azure DevOps for automated deployments and scaling workflows.
- Collaboration: Partner with backend engineers, DevOps, and product teams to deliver reliable infrastructure that meets performance goals.
What You Will Need to Accomplish the Job
- Bachelor's degree in Computer Science, Information Technology, or related field.
- 3+ years of experience in scaling infrastructure for distributed systems or high-traffic applications.
- Strong hands-on experience with:
- Databases: MySQL (query optimization, replication, partitioning, indexing strategies).
- Caching: Redis (cluster setup, pub/sub, eviction policies, memory optimization).
- Message Queues: ActiveMQ (scaling consumers/producers, HA clusters, performance tuning).
- Web Servers: Jetty with Spark framework (thread pools, tuning JVM, load balancing).
- Proficiency in monitoring & observability tools (Grafana, Prometheus, ELK/EFK stack, Azure Monitor, or DataDog).
- Knowledge of distributed systems principles (CAP theorem, sharding, replication, consensus).
- Familiarity with CI/CD pipelines, Docker, Kubernetes (AKS/EKS/GKE optional).
- Strong problem-solving skills in debugging production performance bottlenecks.
- Ability to work independently and collaboratively in a fast-paced environment.
What Will Make Us Love You
- Experience with horizontal scaling of microservices in production.
- Knowledge of cloud platforms (Azure, AWS, GCP) for scaling and infrastructure automation.
- Hands-on experience with load testing tools (JMeter, Locust, k6, Artillery).
- Familiarity with event-driven architecture and streaming platforms (Kafka).
- Contributions to open-source performance/scaling projects
Weave is an equal opportunity employer that is committed to fostering an inclusive workplace where all individuals are valued and supported. We welcome anyone who is hungry to learn, problem-solve and progress regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, or other applicable legally protected characteristics. If you have a disability or special need that requires accommodation, please let us know.
All official correspondence will occur through Weave branded email. We will never ask you to share bank account information, cash a check from us, or purchase software or equipment as part of your interview or hiring process.
Site Reliability Engineer

Posted 4 days ago
Job Description
We provide expert, sustainable solutions in records and information management, digital transformation services, data centers, asset lifecycle management, and fine art storage, handling, and logistics. We proudly partner every day with our 225,000 customers around the world to preserve their invaluable artifacts, extract more from their inventory, and protect their data privacy in innovative and socially responsible ways.
Are you curious about being part of our growth story while evolving your skills in a culture that will welcome your unique contributions? If so, let's start the conversation.
Category: Information Technology
Iron Mountain is a global leader in storage and information management services trusted by more than 225,000 organizations in 60 countries. We safeguard billions of our customers' assets, including critical business information, highly sensitive data, and invaluable cultural and historic artifacts. Take a look at our history here.
Iron Mountain helps lower cost and risk, comply with regulations, recover from disaster, and enable digital and sustainable solutions, whether in information management, digital transformation, secure storage and destruction, data center operations, cloud services, or art storage and logistics. Please see our Values and Code of Ethics for a look at our principles and aspirations in elevating the power of our work together.
If you have a physical or mental disability that requires special accommodations, please let us know by sending an email to See the Supplement to learn more about Equal Employment Opportunity.
Iron Mountain is committed to a policy of equal employment opportunity. We recruit and hire applicants without regard to race, color, religion, sex (including pregnancy), national origin, disability, age, sexual orientation, veteran status, genetic information, gender identity, gender expression, or any other factor prohibited by law.
**Requisition:** J