Hacking the Software Engineer Interview14266 2018-05-02 18:41 (Last Updated 2019-05-11 16:16)
This topic is work in process. You can also visit the google doc version - BigDataKV.
- Table of Contents
- Experience Deep Dive
- Cracking the System Design Interview (Designing Pinterest or Instagram as an example)
- 1. Four steps to Crack the System Design Interview
- 1.1 Step 1. Clarify Requirements and Specs
- 1.2 Step 2. Sketch Out High Level Design
- 1.3 Step 3. Discuss individual components and how they interact in detail
- 1.4 (Optional) Step 4. Back-of-the-envelope Calculation
- 2 Three Common Topics
- 1. Four steps to Crack the System Design Interview
- System Design Theories
- Introduction to Architecture
- Data Partition and Routing
- Replica and Consistency
- Availability (TODO)
- Big Data Algorithms and Data Structures
- Resource Management and Scheduler (TODO)
- Distributed Coordination System (TODO)
- Data Stores (TODO)
- Stream and Batch Processing
- Cloud Design Patterns
- Communication Protocols and APIs
- Lambda architecture
- iOS Architecture Patterns
- System Design in Practice
- Designing Uber (or a realtime marketplace)
- Scaling Facebook social graph store
- Designing Netflix view state service
- Designing robust and predictable APIs with idempotency
- Designing video streaming over HTTP
- Designing a distributed logging system
- Designing a url shortener
- Design a key-value store with external storage
- Designing Facebook Photo Storage (or a Blob storage)
- How to write solid code?
The overall process:
- HR screen: HR will clarify the interview process and ask quick questions to filter out know-nothings
- Phone screen: 1 or 2 rounds of phone screens on coding
- Onsite: spending a whole day interviewing at the office in person
- 2 or 3 rounds of coding interviews
- 1 round of system design and architecture
- 1 round of behavior questions or culture fit
- (optional) 1 round of bar raiser interview
- Making decisions and negotiating. Never accept an offer without negotiating!
If the candidate is local, phone screens can be reduced or even skipped.
If it is a internship position, onsite is skipped.
- throw your book Cracking the Coding Interview into the trash. The coding questions there are obsolete and too easy. Elements of Programming Interviews is better. However, by and large, books are not very helpful because what you need is deliberate practice not reading books.
- Go to Online Judge like Leetcode and Lintcode. Solve at least 70 medium to hard questions. The key is to practice in a large amount and with speed repeatedly.
Since OJ can help you tackle the coding interviews, here in this documentation we focus on soft skills, experience deep dive, system design and architectures.
Building applications is not a rocket science, but having the vision of the overall architecture really makes a difference. We have to crack the system interview sooner or later in our career as a software engineer or engineering manager.
Interviewers are looking for future teammates that they like to work with. The future teammates are expected to be, at least, capable of solving problems independently. There are so many solutions to any given problem, but not all of them are suited given the context. So the interviewee has to specify different choices and their tradeoffs. To summarize, system design interview is a happy show-off of our knowledge on technologies and their tradeoffs. Ideally, keep talking what the interviewer expect throughout the interview, before they even have to ask.
Keep talking for 45 mins could be easy, as long as we are armed with the following four steps and three common topics. Take “design Pinterest” for example.
Breaking down a complex task into small chunks helps us handle the problem at a better pace and in a more actionable way.
First things first, the ultimate goals should always be clear.
Pinterest is a highly scalable photo-sharing service:
- features: user profile, upload photos, news feed, etc.
- scaling out: horizontal scalability and micro services.
Do not dive into details before outlining the big picture. Otherwise, going off too far towards a wrong direction would make it harder to even provide a roughly correct solution. We will regret wasting time on irrelevant details when we do not have time to finish the task.
OK, let us sketch out the following diagram without concerning too much about the implementation detail of these components.
So far so good! To some extent, congrats, we have solved the problem!
When we truly understand a system, we should be able to identify what each component is and explain how they interact with one another. Take these components in the above diagram and specify each one by one. This could lead to more general discussions, such as the three common topics in Section 2, and to more specific domains, like how to design the photo storage data layout…
Generally speaking, load balancers fall into three categories:
- DNS Round Robin (rarely used): clients get a randomly-ordered list of IP addresses.
- pros: easy to implement and free
- cons: hard to control and not responsive, since DNS cache needs time to expire
- L3/L4 Load Balancer: traffic is routed by IP address and port. L3 is network layer (IP). L4 is session layer (TCP).
- pros: better granularity, simple, responsive
- L7 Load Balancer: traffic is routed by what is inside the HTTP protocol. L7 is application layer (HTTP).
It is good enough to talk in this level of detail on this topic, but in case the interviewer wants more, we can suggest exact algorithms like round robin, weighted round robin, least loaded, least loaded with slow start, utilization limit, latency, cascade, etc.
Reverse proxy, like varnish, centralizes internal services and provides unified interfaces to the public. For example, www.example.com/index and www.example.com/sports appear to come from the same domain, but in fact they are from different micro services behind the reverse proxy. Reverse proxy could also help with caching and load balancing.
This is where web pages are served, and usually combined with the service / backend tier in the very early stage of a web service.
There are two major bottlenecks of the whole system – requests per second (rps) and bandwidth. We could improve the situation by using more efficient tech stack, like frameworks with async and non-blocking reactor pattern, and enhancing the hardware, like scaling up (aka vertical scaling) or scaling out (aka horizontal scaling).
Internet companies prefer scaling out, since it is more cost-efficient with a huge number of commodity machines. This is also good for recruiting, because the target skillsets are equipped by. After all, people rarely play with super computers or mainframes at home.
Frontend web tier and service tier must be stateless in order to add or remove hosts conveniently, thus achieving horizontal scalability. As for feature switch or configs, there could be a database table / standalone service to keep those states.
MVC(MVT) or MVVC pattern is the dominant pattern for this layer. Traditionally, view or template is rendered to HTML by the server at runtime. In the age of mobile computing, view can be as simple as serving the minimal package of data transporting to the mobile devices, which is called web API. People believe that the API can be shared by clients and browsers. And that is why single page web applications are becoming more and more popular, especially with the assistance of frontend frameworks like react.js, angular.js, backbone.js, etc.
The view can also be APIs in addition to visual web pages and app screens. Visit Public API Choices for detailed comparison of what to use for building the public APIs.
The single responsibility principle advocates small and autonomous services that work together, so that each service can do one thing well and not block others. Small teams with small services can plan much more aggressively for the sake of hyper-growth.
How do those services find each other? Zookeeper is a popular and centralized choice. Instances with name, address, port, etc. are registered into the path in ZooKeeper for each service. If one service does not know where to find another service, it can query Zookeeper for the location and memorize it until that location is unavailable.
Zookeeper is a CP system in terms of CAP theorem (See Section 2.3 for more discussion), which means it stays consistent in the case of failures, but the leader of the centralized consensus will be unavailable for registering new services.
For the Pinterest case, these micro services could be user profile, follower, feed, search, spam, etc. Any of those topics could lead to an in-depth discussion. Useful links are listed in Section 3: Future Studies, to help us deal with them.
However, for a general interview question like “design Pinterest”, it is good enough to leave those services as black boxes… If we want to show more passion, elaborate with some sample endpoints / APIs for those services would be great.
Although a relational database can do almost all the storage work, please remember do not save a blob, like a photo, into a relational database, and choose the right database for the right service. For example, read performance is important for follower service, therefore it makes sense to use a key-value cache. Feeds are generated as time passes by, so HBase / Cassandra’s timestamp index is a great fit for this use case. Users have relationships with other users or objects, so a relational database is our choice by default in an user profile service.
Data and storage is a rather wide topic, and we will discuss it later in Section 2.2 Storage.
The final step, estimating how many machines are required, is optional, because time is probably up after all the discussions above and three common topics below. In case we run into this topic, we’d better prepare for it as well. It is a little tricky… we need come up with some variables and a function first, and then make some guesses for the values of those variables, and finally calculate the result.
The cost is a function of CPU, RAM, storage, bandwidth, number and size of the images uploaded each day.
- N users 10^10
- i images / (user * day) 10
- s size in bytes / image 10^6
- viewed v times / image 100
- d days
- h requests / sec 10^4 (bottleneck)
- b bandwidth (bottleneck)
- Server cost: $1000 / server
- Storage cost: $0.1 / GB
- Storage = Nisd
Remember the two bottlenecks we mentioned in section 1.3.3 Web Tier? – requests per second (rps) and bandwidth. So the final expression would be
Number of required servers = max(Niv/h, Nivs/b)
There are three common topics that could be applied to almost every system design question. They are extracted and summarized in this section.
How do different components interact with each other? – communication protocols.
Here is a simple comparison of those protocols.
- UDP and TCP are both transport layer protocols. TCP is reliable and connection-based. UDP is connectionless and unreliable.
- HTTP is in the application layer and normally TCP based, since HTTP assumes a reliable transport.
- RPC, an application layer protocol, is an inter-process communication that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network), without the programmer explicitly coding the details for this remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. In an Object-Oriented Programming context, RPC is also called remote invocation or remote method invocation (RMI).
Since RPC is super useful, some interviewers may ask how RPC works. The following picture is a brief answer.
*Stub procedure: a local procedure that marshals the procedure identifier and the arguments into a request message, and then to send via its communication module to the server. When the reply message arrives, it unmarshals the results.
We do not have to implement our own RPC protocols. There are off-the-shelf frameworks.
- Google Protobuf: an open source RPC with only APIs but no RPC implementations. Smaller serialized data and slightly faster. Better documentations and cleaner APIs.
- Facebook Thrift: supports more languages, richer data structures: list, set, map, etc. that Protobuf does not support) Incomplete documentation and hard to find good examples.
- User case: Hbase/Cassandra/Hypertable/Scrib/…
- Apache Avro: Avro is heavily used in the hadoop ecosystem and based on dynamic schemas in Json. It features dynamic typing, untagged data, and no manually-assigned field IDs.
Generally speaking, RPC is internally used by many tech companies for performance issues, but it is rather hard to debug and not flexible. So for public APIs, we tend to use HTTP APIs, and are usually following the RESTful style.
- REST (Representational state transfer of resources)
- Best practice of HTTP API to interact with resources.
- URL only decides the location. Headers (Accept and Content-Type, etc.) decide the representation. HTTP methods(GET/POST/PUT/DELETE) decide the state transfer.
- minimize the coupling between client and server (a huge number of HTTP infras on various clients, data-marshalling).
- stateless and scaling out.
- service partitioning feasible.
- used for public API.
How do clients in the external world interact with this system? – Please visit [the public API choices: ].
Please visit Intro to Relational Database / SQL Database.
When we design a distributed system, trading off among CAP (consistency, availability, and partition tolerance) is almost the first thing we want to consider.
- Consistency: all nodes see the same data at the same time
- Availability: a guarantee that every request receives a response about whether it succeeded or failed
- Partition tolerance: system continues to operate despite arbitrary message loss or failure of part of the system
In a distributed context, the choice is between CP and AP. Unfortunately, CA is just a joke, because single point of failure is a red flag in the real distributed systems world.
To ensure consistency, there are some popular protocols to consider: 2PC, eventual consistency (vector clock + RWN), Paxos, In-Sync Replica, etc.
To ensure availability, we can add replicas for the data. As to components of the whole system, people usually do cold standby, warm standby, hot standby, and active-active to handle the failover.
Intro to 4 kinds of NoSQL database: Key-value Store vs. Document Store vs. Column-oriented Store vs. Graph Database
empathy / perspective-taking is the most important.
- realize that code is written for human to read first and then for machines to execute.
- software is so “soft” and there are many ways to achieve one thing. It’s all about making the proper trade-offs to fulfill the requirements.
choose a sustainable architecture to reduce human resources costs per feature.
adopt patterns and best practices.
- missing error-handling
- callback hell = spaghetti code + unpredictable error handling
- over-long inheritance chain
- circular dependency
- over-complicated code
- nested tertiary operation
- comment out unused code
- missing i18n, especially RTL issues
- don’t repeat yourself
- simple copy-and-paste
- unreasonable comments
- semantic version
- never introduce breaking change to non major versions
- two legged change