How to Build a Platform Engineering Team [Focus & Roles]

Platform engineering is slowly becoming the de facto standard for managing software development and operations. As organizations grow and continue to adopt cloud-native architectures and microservices, the complexity of managing the underlying infrastructure grows significantly.

In this post, we will talk about how to build a successful platform engineering team and what you should consider when hiring engineers.

What we will cover:

  1. Platform engineering focus areas
  2. Platform engineering roles
  3. Hiring the right talent
  4. Onboarding, goals, and training

What is platform engineering?

Platform engineering focuses on designing, building, and maintaining the foundational infrastructure that supports the entire software development lifecycle, so developers can move faster while the organization maintains control. The main goal of a platform engineering team is to create reusable, easy-to-scale infrastructure that can be used throughout the organization.

Platform engineering focus areas

Platform engineering focuses on the following areas:

  • Infrastructure as code – Use tools such as OpenTofu, Terraform, Pulumi, and AWS CloudFormation to automate, build, and scale your infrastructure.
  • Configuration management – Automatically configure your infrastructure after it is provisioned.
  • Continuous integration and delivery – Integrate code changes and run lint checks, vulnerability scans, builds, tests, and deployments.
  • Container orchestration – Orchestrate your microservices applications.
  • Monitoring and observability – Maintain the health and performance of your application.
  • Security and compliance – Ensure your platform respects industry standards and regulations and your users’ information is safe.
  • Infrastructure management platforms – Easily integrate all focus areas with a single product.

Platform engineering team structure

A platform engineering team often includes roles such as platform engineers, site reliability engineers, cloud architects, and security engineers. Other roles might include DevOps engineers, automation engineers, and quality assurance engineers.

When platform engineering is implemented, the most common split for the infrastructure team is between platform engineers and site reliability engineers. Usually, at least one cloud architect assists the platform team with the platform’s architecture and design. A dedicated security team also assists with everything related to security, governance, and compliance.

Platform engineers

Platform engineers are responsible for designing, building, and maintaining the platform infrastructure:

Responsibilities:

  • Implement and manage infrastructure as code (IaC), configuration management (CM), CI/CD, and container orchestration (CO) solutions
  • Ensure platform scalability, reliability, and performance
  • Automate the monitoring, governance, and security tasks
  • Collaborate with other teams to understand their needs
  • Implement platform security

Technical skills:

  • Proficiency in IaC (OpenTofu/Terraform/Terragrunt/Pulumi/CloudFormation/Azure Bicep)
  • Proficiency in cloud platforms (AWS/Azure/GCP)
  • Strong CI/CD knowledge (GitHub Actions, Jenkins, GitLab CI/CD, Azure DevOps)
  • Strong knowledge of CM and/or CO
  • Knowledge of observability, monitoring, governance, compliance, and security

Choosing the right platform engineers is key to the successful implementation of the platform. You need to find engineers who are automation-first, demonstrate strong problem-solving skills, and can collaborate effectively with other teams. They should be proficient in a programming language such as Golang, Python, TypeScript, or JavaScript.

Site reliability engineers

Site reliability engineers (SREs) handle deployments to higher environments (such as staging and production), are experienced in incident management and response, and have strong problem-solving skills.

Responsibilities:

  • Ensure the availability and reliability of the platform
  • Perform the actual deployments to the higher environments
  • Perform capacity planning
  • Implement monitoring, logging, and alerting
  • Conduct post-incident reviews

Technical skills:

  • Proficiency in monitoring
  • Good knowledge of IaC
  • Good CI/CD knowledge (GitHub Actions, Jenkins, GitLab CI/CD, Azure DevOps)
  • Good knowledge of CM and/or CO
  • Good knowledge of cloud platforms (AWS/Azure/GCP)

The best people for SRE positions have a good grasp of distributed systems and cloud-native architectures and are keen to continuously improve the performance of the systems they manage. They should also have a solid understanding of scripting and automation.

Cloud architects

Cloud architects should be experts in designing platforms and ensuring the platform architecture is resilient and satisfies all the organization’s needs.

Responsibilities:

  • Design and implement highly reliable architectures
  • Develop and enforce architectural standards and best practices
  • Guide all teams involved
  • Have a security-first approach

Technical skills:

  • Proficiency in cloud platforms (AWS/Azure/GCP)
  • Proficiency in cloud architecture principles and in designing highly scalable systems
  • Good IaC, CI/CD, CM, and CO knowledge
  • Proficiency in governance and compliance

When selecting a cloud architect, look for candidates who have extensive experience in designing and implementing cloud infrastructure solutions in different cloud providers. They should be able to enforce architectural standards and best practices and collaborate extensively with other teams.

Security engineers

Security engineers work hand in hand with platform engineers and cloud architects to implement and maintain the security best practices for the platform:

Responsibilities:

  • Implement and maintain security best practices
  • Enforce security vulnerability scanning in all tools used
  • Enforce compliance checks
  • Respond to security incidents

Technical skills:

  • Proficiency in security tools
  • Knowledge of encryption, authentication, and authorization methods
  • Experience with compliance services
  • Ability to perform security audits

The best security engineers can enforce security best practices and are aware of industry standards and regulations (GDPR, HIPAA, etc.). They should work regularly with all teams involved in the platform and ensure vulnerability scanning is implemented for everything.

Hiring the right talent

Hiring the right talent for a platform engineering team involves many steps, and the first one is the job description.

Tips for creating a good job description

Most contemporary IT jobs have many requirements. It is extremely difficult to find engineers who excel in all the areas a platform engineer works in, and that’s why you need to create focused jobs. For example, if your platform uses AWS, Terraform, Kubernetes, GitHub Actions, and Ansible, you should expect all applicants to know something about these tools but not to be experts in all of them.

If you have to hire four platform engineers for this tech stack, you should create focused jobs such as:

  • Platform engineer – Terraform focus
  • Platform engineer – Kubernetes focus
  • Platform engineer – CI/CD focus
  • Platform engineer – Ansible focus

In the focused job description, you should require the engineer to be an expert in the focus tool, have extensive experience with at least one other tool in the stack, and have some experience with the rest.

If you don’t offer focused jobs, at least ensure that the job description specifies that engineers don’t need to be experts in all technologies. Otherwise, some good candidates might not even apply.

Another important aspect of creating a job description that attracts top talent is to specify what the company does, how collaboration will work, and the perks and benefits of joining your company.

In addition, it is really important to outline the necessary qualifications (education, years of experience, certifications) you require from the engineers applying for the job. These need to be realistic — at one point, FastAPI’s creator couldn’t apply for a role that required FastAPI experience because the number of years of experience required exceeded the number of years the tool had existed. Be realistic and do your research when you are creating job descriptions.

Tips for creating a good interview process

The interview process is crucial for evaluating a candidate’s technical competencies and cultural fit, so structuring it effectively can make the difference between hiring the right people for the job and missing them:

Do’s:

  • Verify programming skills in the technical interview: Seek to understand how future colleagues think about a problem and arrive at a solution. This part can contain live coding or at least evaluate how the candidate would address a particular problem.
  • Have at least one future team member participate in an interview with the candidate: Culture fit is very important, and existing team members should be involved in the hiring process. After all, they will collaborate with the new person after they get hired.
  • Discuss job expectations from the beginning: The candidate must understand what you would expect from them and what a typical working day looks like.

Don’ts:

  • Add more than five interview rounds for a position: Many people will be discouraged from applying for the job.
  • Start with an algorithm interview: Although HackerRank-style tests can be useful for platform engineers and other roles in platform engineering, they can be excessive, and people with decent programming skills can be penalized because they don’t use a certain algorithm.
  • Overdo take-home assignments: Many jobs require take-home assignments, and some of them can feel like a full consulting gig. Many are unpaid, which makes engineers abandon the interview process due to the time involved.

Onboarding, goals, and training

Hiring the right talent is very important, but retaining talent is also crucial. Many things can go wrong when you are hiring a new team member, so let’s explore a couple of things that you can do to improve your retention.

Without proper onboarding, some engineers can get frustrated and be tempted to leave within the first three months. This increases hiring costs and can also hurt team morale and productivity.

A good onboarding process includes, at minimum, an orientation meeting (company overview, mission, values, team introduction, and what portals to use for different kinds of requests) and a technical onboarding meeting in which a colleague (a buddy) explains the tools used and presents the kind of access the new colleague should have. Regular check-ins are needed to discuss progress, address any challenges, and provide feedback to help identify and resolve issues early on, thus minimizing frustration.

Monthly, quarterly, and yearly goals should be set for the new hire from the beginning of their tenure. This is important because it clarifies to the new engineer what is expected from them and what they can do to exceed the expectations you’ve set. Setting clear, achievable goals (the SMART goals framework is a useful reference) provides direction and motivation, helping new hires stay focused and engaged.

Training is another important aspect an engineer will be interested in. Having access to LinkedIn Learning, Pluralsight, or Udemy can significantly enhance their skills and keep them updated on the latest industry trends and technologies. By offering access to platforms like these, you demonstrate your organization’s commitment to your team’s professional development. Offering two or three hours of training per week will encourage your engineers to improve.

How can Spacelift help with platform engineering?

Spacelift offers all the mechanisms required to build a successful platform. Apart from being a product you can easily leverage in step 7 of how to implement platform engineering, its website, documentation, and blog posts can help with all the other steps.

Let’s explore how Spacelift can help.

1. Policies

With policies, you can control what kind of resources people can create and what parameters these resources can have, build custom policies for third-party tools you integrate into your workflow, control how many approvals you need for runs, and more.

For example, a policy can enforce a couple of mandatory tags for your resources (Name, env, and owner).
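
To make the mandatory-tag idea concrete, here is a minimal Python sketch of the check such a policy performs. It is only an illustration of the logic: Spacelift policies themselves are written in Rego, and the resource shape used here (`address` and `tags` keys) and the sample resources are assumptions made for this example.

```python
from __future__ import annotations

# Tags every resource must carry (taken from the example in the text).
REQUIRED_TAGS = {"Name", "env", "owner"}


def missing_tags(resource: dict) -> set[str]:
    """Return the required tags that the resource does not define."""
    tags = resource.get("tags") or {}
    return REQUIRED_TAGS - set(tags)


def deny_reasons(resources: list[dict]) -> list[str]:
    """Build one human-readable denial message per non-compliant resource."""
    reasons = []
    for resource in resources:
        missing = missing_tags(resource)
        if missing:
            names = ", ".join(sorted(missing))
            reasons.append(f"{resource['address']} is missing tags: {names}")
    return reasons


# Hypothetical planned resources, for illustration only.
planned = [
    {"address": "aws_instance.web",
     "tags": {"Name": "web", "env": "dev", "owner": "platform"}},
    {"address": "aws_s3_bucket.logs", "tags": {"Name": "logs"}},
]

for reason in deny_reasons(planned):
    print(reason)  # aws_s3_bucket.logs is missing tags: env, owner
```

A run would be allowed only when `deny_reasons` returns an empty list.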

2. Stack dependencies

With stack dependencies, you can build dependencies between your configurations and even share outputs between them. There is no limit on the number of dependencies you can create, and whenever a parent configuration finishes a run successfully, it triggers runs on its children. As Spacelift supports multiple infrastructure tools, you can build dependencies between them, so a parent stack can use OpenTofu, for example, and a child stack can use Kubernetes.
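
The trigger behavior described above (a successful parent run triggers its children) can be modeled with a short Python sketch. This is a simplified model, not Spacelift's implementation: the stack names, the `CHILDREN` mapping, and the `run` callback are hypothetical.

```python
from __future__ import annotations

from collections import deque
from typing import Callable

# Hypothetical dependency graph: parent stack -> child stacks.
CHILDREN = {
    "network": ["cluster", "database"],
    "cluster": ["apps"],
    "database": [],
    "apps": [],
}


def triggered_runs(root: str, run: Callable[[str], bool]) -> list[str]:
    """Return the stacks that were run, in order, starting from root.

    `run` executes a stack and returns True on success. Children are
    only triggered when their parent's run succeeded.
    """
    ran = []
    seen = {root}
    queue = deque([root])
    while queue:
        stack = queue.popleft()
        ran.append(stack)
        if not run(stack):
            continue  # failed run: do not trigger the children
        for child in CHILDREN.get(stack, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return ran


# Every run succeeds: the whole tree is processed.
print(triggered_runs("network", lambda stack: True))
# The "cluster" run fails, so "apps" is never triggered.
print(triggered_runs("network", lambda stack: stack != "cluster"))
```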

3. Blueprints

Blueprints enable you to configure every aspect of your stack, including governance and compliance. With blueprints, you can create self-service infrastructure, which can considerably increase developer velocity.

4. Cloud integrations

Static credentials are easily intercepted and can be used with malicious intent. Spacelift understands that, so it offers you the ability to integrate natively with AWS, Microsoft Azure, and Google Cloud to generate dynamic credentials. Based on the roles you are using, these integrations can offer as few or as many permissions as you want.

5. Spaces

Spaces help you implement RBAC and give partial admin rights to your users.

For example, in a hierarchy where the production space sits under the resources space, a user who has admin rights to the resources space and no other rights will have all permissions in the resources and production spaces, but won’t even be able to view resources in other spaces.
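
The inheritance behavior in this example can be sketched in a few lines of Python. This is a conceptual model only: the space hierarchy below is hypothetical, and the sketch assumes that rights granted on a space extend to its child spaces, as the example implies.

```python
from __future__ import annotations

# Hypothetical hierarchy: space -> parent space (None for the top level).
PARENT = {
    "root": None,
    "resources": "root",
    "production": "resources",
    "networking": "root",
}


def has_access(granted_space: str, target_space: str) -> bool:
    """True if rights on granted_space cover target_space.

    Walks up from the target space; access is granted when the walk
    reaches the space the user has rights on.
    """
    space: str | None = target_space
    while space is not None:
        if space == granted_space:
            return True
        space = PARENT[space]
    return False


print(has_access("resources", "production"))  # True: child of resources
print(has_access("resources", "networking"))  # False: sibling space
print(has_access("resources", "root"))        # False: parent space
```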

6. Contexts

Contexts are logical containers that can be shared between multiple configurations and contain environment variables, mounted files, and lifecycle hooks, making it easier to ensure reusability and idempotency.

7. Drift detection and optional remediation

Infrastructure drift can be one of the worst problems you can have because if, for example, you fix something manually and then apply the code again at a later time, you will reintroduce the bug into your configuration.

Spacelift offers a drift detection mechanism that runs on a schedule, informs you about drift, and can optionally remediate it.
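
Conceptually, drift detection compares what the code declares with what actually exists. The Python sketch below shows that comparison on plain dictionaries. It is an illustration under simplifying assumptions, not how Spacelift or any IaC tool is implemented: real tools compute the difference through a plan against provider APIs, and the attribute names here are made up.

```python
from __future__ import annotations


def detect_drift(desired: dict, actual: dict) -> dict:
    """Map each drifted attribute to a (desired, actual) pair.

    None stands for "attribute not present" on that side.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def remediate(desired: dict, actual: dict) -> dict:
    """Reconcile by making the actual state match the declared one."""
    return dict(desired)


# Hypothetical declared configuration and live state.
desired = {"instance_type": "t3.micro", "port": 443}
actual = {"instance_type": "t3.large", "port": 443, "debug": True}

for attribute, (want, have) in detect_drift(desired, actual).items():
    print(f"{attribute}: declared={want!r}, live={have!r}")

# After remediation, no drift remains.
print(detect_drift(desired, remediate(desired, actual)))  # {}
```

Note that remediation overwrites manual changes with what the code declares, which is why a manual fix needs to be ported back into the code before the next apply.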

8. Resources view

With Spacelift, you can see all the resources that have been deployed into your Spacelift account (based on the permissions you have), along with details about them and their health.

There are other features that Spacelift offers that enable you to enhance your platform’s capabilities.

If you want to use a product that greatly enhances the lives of your platform team members, create a free account with Spacelift today, or book a demo with one of our engineers.

Key points

Building a successful platform engineering team involves planning across multiple stages, starting with defining what your platform looks like, crafting job descriptions, defining the interview processes, and implementing effective retention strategies.

By setting clear goals, fostering a culture of collaboration, and implementing training programs, you can ensure your platform team’s success.
