Clarifying Requirements

Design a money transfer backend service like Square Cash (we will call this system Cash App below) or PayPal that supports:

  1. Deposit from and payout to bank
  2. Transfer between accounts
  3. High scalability and availability
  4. i18n: language, timezone, currency exchange
  5. Deduplication for non-idempotent APIs and for at-least-once delivery.
  6. Consistency across multiple data sources.

Architecture

[Architecture diagram. Labels shown: Presentation Layer (SDK/Docs, mobile-dashboard, web-dashboard, dashboard-client, mobile-wallet, web-wallet, wallet-client, web-chrome-extension); actors (Merchant User, End User, Operators); api-gateway (monolithic); payment; payment-gateway; banks / vendors; task-queue; financial-reporter; risk control; side-effect maker; help service portal; Event Queue; User Profiles / AuthDB; Payment DB; Aurora; AWS CloudHSM.]

Features and Components

Payment Service

The payment data model is essentially “double-entry bookkeeping”. Every entry to an account requires a corresponding and opposite entry to a different account. The sum of all debits and credits equals zero.

Deposit and Payout

Transaction: new user Jane Doe deposits $100 from her bank to Cash App. This one transaction involves the following DB entries:

bookkeeping table (for history)

+ debit, USD, 100, CashAppAccountNumber, txId
- credit, USD, 100, RoutingNumber:AccountNumber, txId

transaction table

txId, timestamp, status(pending/confirmed), [bookkeeping entries], narration

Once the bank confirms the transaction, update the pending status above and the following balance sheet in one DB transaction.

balance sheet

CashAppAccountNumber, USD, 100
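
The deposit flow above can be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database, not the actual Cash App schema: the table layouts, the column names, the function names `open_ledger`, `start_deposit`, and `confirm_deposit`, and the choice to store amounts as integer cents are all assumptions made for the example.

```python
import sqlite3

def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger with the three tables described above (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE transactions (tx_id TEXT PRIMARY KEY, status TEXT, narration TEXT);
        CREATE TABLE bookkeeping (tx_id TEXT, side TEXT, currency TEXT,
                                  amount_cents INTEGER, account TEXT);
        CREATE TABLE balances (account TEXT, currency TEXT, amount_cents INTEGER,
                               PRIMARY KEY (account, currency));
    """)
    return db

def start_deposit(db, tx_id, cash_account, bank_account, currency, amount_cents):
    """Record a pending deposit as one debit and one opposite credit."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("INSERT INTO transactions VALUES (?, 'pending', 'deposit')", (tx_id,))
        db.executemany(
            "INSERT INTO bookkeeping VALUES (?, ?, ?, ?, ?)",
            [(tx_id, "debit", currency, amount_cents, cash_account),
             (tx_id, "credit", currency, -amount_cents, bank_account)],
        )

def confirm_deposit(db, tx_id) -> bool:
    """Flip pending to confirmed and update the balance sheet in one DB transaction."""
    with db:
        cur = db.execute(
            "UPDATE transactions SET status = 'confirmed' "
            "WHERE tx_id = ? AND status = 'pending'", (tx_id,))
        if cur.rowcount == 0:
            return False  # unknown or already confirmed: do not touch the balance again
        account, currency, amount = db.execute(
            "SELECT account, currency, amount_cents FROM bookkeeping "
            "WHERE tx_id = ? AND side = 'debit'", (tx_id,)).fetchone()
        db.execute("INSERT OR IGNORE INTO balances VALUES (?, ?, 0)", (account, currency))
        db.execute(
            "UPDATE balances SET amount_cents = amount_cents + ? "
            "WHERE account = ? AND currency = ?", (amount, account, currency))
    return True
```

Because the status check and the balance update share one DB transaction, a repeated confirmation from the bank leaves the balance unchanged in this sketch.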

Transfer between accounts within Cash App

Similar to the case above, but there is no pending state because we do not need to wait for a slow external system to change its state. All changes to the bookkeeping table, transaction table, and balance sheet table happen in one DB transaction.

i18n

We solve the i18n problems in 3 dimensions.

  1. Language: All texts, such as copywriting, push notifications, and emails, are selected according to the Accept-Language header.
  2. Timezones: All server timestamps are in UTC. We convert timestamps to the local timezone on the client side.
  3. Currency: All transfers between users must be in the same currency. If users want to move money across currencies, they have to exchange the currency first, at a rate that is favorable to Cash App.

For example, Jane Doe wants to exchange 1 USD for 6.8 CNY with 0.2

bookkeeping table

- credit, USD, 1, CashAppAccountNumber, txId
+ debit, CNY, 6.8, CashAppAccountNumber, txId, @7.55 CNY/USD
+ debit, USD, 0.1, ExpensesOfExchangeAccountNumber, txId
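
As a rough check of the double-entry rule on the entries above, the sketch below converts the CNY leg back to USD at the quoted 7.55 CNY/USD and confirms the three entries sum to zero. The helper names `to_usd` and `is_balanced`, the signed-amount convention (credits negative, debits positive), and rounding to whole cents are assumptions for illustration only.

```python
from decimal import Decimal

CENT = Decimal("0.01")

def to_usd(amount: Decimal, currency: str, cny_per_usd: Decimal) -> Decimal:
    """Express an amount in USD, rounding converted amounts to whole cents."""
    if currency == "USD":
        return amount
    if currency == "CNY":
        return (amount / cny_per_usd).quantize(CENT)
    raise ValueError(f"unsupported currency: {currency}")

def is_balanced(entries, cny_per_usd: Decimal) -> bool:
    """Return True when signed entries (credits negative, debits positive) sum to zero in USD."""
    total = sum((to_usd(a, c, cny_per_usd) for a, c in entries), Decimal("0"))
    return total == 0

# The three bookkeeping entries from the example above.
exchange_entries = [
    (Decimal("-1"), "USD"),    # credit: 1 USD leaves the user's account
    (Decimal("6.8"), "CNY"),   # debit: 6.8 CNY arrives at 7.55 CNY/USD
    (Decimal("0.1"), "USD"),   # debit: 0.1 USD booked to the exchange expense account
]
```

With these numbers, 6.8 CNY at 7.55 CNY/USD is about 0.90 USD, so the 0.1 USD expense entry makes up the difference from the 1 USD credited.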

Transaction table, balance sheet, etc. are similar to the transaction discussed in Deposit and Payout. The major difference is that the bank or the vendor provides the exchange service.

How to sync across the transaction table and external banks and vendors?

Deduplication

Why is Deduplication a concern?

  1. not all endpoints are idempotent
  2. Event queue may be at-least-once.

Not all endpoints are idempotent: what if the external system is not idempotent?

For the poll case above, if the external gateway does not support idempotent APIs, then to avoid flooding it with duplicate entries we must record the order ID or reference ID that the external system returns with its 200 response, and afterwards query by that ID with GET instead of sending POST again.

For the callback case, we can make sure our own callback endpoint is idempotent: it mutates pending to confirmed no matter how many times it is called.
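
The polling rule can be sketched as follows. `FakeGateway` stands in for a non-idempotent external gateway and is entirely hypothetical, as are the `external_ref` field and the `sync_payment` function; the point is only that the POST happens once and every later poll is a GET by the stored reference ID.

```python
class FakeGateway:
    """Hypothetical stand-in for an external gateway whose POST is not idempotent."""

    def __init__(self):
        self.orders = {}
        self.post_calls = 0

    def post_charge(self, amount_cents: int) -> str:
        # Every POST creates a new order, so calling it twice would double-charge.
        self.post_calls += 1
        ref = f"ref-{self.post_calls}"
        self.orders[ref] = "pending"
        return ref

    def get_status(self, ref: str) -> str:
        # GET is safe to repeat as often as we like.
        return self.orders[ref]


def sync_payment(gateway: FakeGateway, local_tx: dict) -> str:
    """POST at most once, remember the reference ID, then only GET by that ID."""
    if local_tx.get("external_ref") is None:
        local_tx["external_ref"] = gateway.post_charge(local_tx["amount_cents"])
    return gateway.get_status(local_tx["external_ref"])
```

In a real system the reference ID would be persisted in the transaction table before the next poll, so that a crash between the POST and the first GET does not lose it.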

Event queue may be at-least-once

  • For the event queue, we can use Kafka with exactly-once semantics; the producer throughput declines by only about 3%.
  • In the database layer, we can use an idempotency key or a deduplication key.
  • In the service layer, we can use a Redis key-value store to track the keys already seen.
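
A minimal sketch of the database-layer option, assuming an in-memory SQLite store: the idempotency key is the primary key of a `processed_events` table, so a redelivered event fails the insert and its side effect is skipped. The table names, `make_store`, and `apply_credit_once` are hypothetical names for this example.

```python
import sqlite3

def make_store() -> sqlite3.Connection:
    """Create a tiny store with a dedup table and one balance row (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_events (idempotency_key TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount_cents INTEGER)")
    db.execute("INSERT INTO balances VALUES ('cash:jane', 0)")
    db.commit()
    return db

def apply_credit_once(db, idempotency_key: str, account: str, amount_cents: int) -> bool:
    """Apply the credit only the first time this idempotency key is seen."""
    try:
        with db:  # the key insert and the balance update commit or roll back together
            db.execute("INSERT INTO processed_events VALUES (?)", (idempotency_key,))
            db.execute(
                "UPDATE balances SET amount_cents = amount_cents + ? WHERE account = ?",
                (amount_cents, account))
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the key already exists, nothing is applied
    return True
```

Recording the key and applying the effect in the same DB transaction is what makes an at-least-once queue safe here: either both happen or neither does.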

Availability and Scalability

1. Clarifying Requirements

  1. Webhook to callback the merchant once the payment succeeds.
  2. Analytics & metrics.
  3. High availability & Failure-resilience.
    1. Async design. Assuming that the servers of merchants are located across the world, and may have a very high latency like 15s.
    2. At-least-once delivery.
    3. Robust & predictable retry.
  4. Security: informing merchants whether a payment has succeeded involves real money and real transactions, so security is always a concern.

2. Sketch out the high-level design

async design + retry + queuing + time-series DB + security

[High-level design diagram. Components shown: Merchants over Internet, webhook gateway, Event Queue, payment state machine, user settings, Time-series DB, Dashboard. Edge labels: publish events; subscribe events; get webhook URI, secret, and settings.]

3. Features and Components

Webhook Gateway

  1. Subscribe to the event queue for payment success events published by a payment state machine or other services.
  2. Once it accepts an event, fetch the webhook URI, secret, and settings from the user settings service. Prepare the request based on those settings.
  3. Make an HTTP POST request to the external merchant’s endpoints with event payload and security headers.

API Definition

// POST https://example.com/webhook/
{
    "id": 1,
    "scheduled_for": "2017-01-31T20:50:02Z",
    "event": {
        "id": "24934862-d980-46cb-9402-43c81b0cdba6",
        "resource": "event",
        "type": "charge:created",
        "api_version": "2018-03-22",
        "created_at": "2017-01-31T20:49:02Z",
        "data": {
          "code": "66BEOV2A", // or order ID the user need to fulfill
          "name": "The Sovereign Individual",
          "description": "Mastering the Transition to the Information Age",
          "hosted_url": "https://commerce.coinbase.com/charges/66BEOV2A",
          "created_at": "2017-01-31T20:49:02Z",
          "expires_at": "2017-01-31T21:49:02Z",
          "metadata": {},
          "pricing_type": "CNY",
          "payments": [
            // ...
          ],
          "addresses": {
            // ...
          }
        }
    }
}

The merchant server should respond with a 200 HTTP status code to acknowledge receipt of a webhook.

If there is no acknowledgment of receipt, we will retry with exponential backoff for up to three days. The maximum retry interval is 1 hour.
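
The retry policy above can be sketched as a schedule generator. The 1-hour cap and 3-day budget come from the text; the 1-minute starting delay, the doubling factor, and the name `retry_delays` are assumptions for illustration.

```python
from datetime import timedelta

def retry_delays(base=timedelta(minutes=1),
                 cap=timedelta(hours=1),
                 budget=timedelta(days=3)):
    """Yield waits that double each time, capped at `cap`, until `budget` is used up."""
    elapsed = timedelta(0)
    delay = base
    while elapsed + delay <= budget:
        yield delay
        elapsed += delay
        delay = min(delay * 2, cap)

# Materialize the whole schedule, e.g. to plan the delivery attempts for one event.
schedule = list(retry_delays())
```

With the assumed 1-minute base, the delays grow 1, 2, 4, 8, 16, 32 minutes and then stay at 60 minutes until the three-day budget runs out.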

Security

  • All webhook URIs from user settings must use HTTPS.
  • All callback requests carry an x-webhook-signature header containing a SHA256 HMAC signature. Its value is HMAC(webhook secret, raw request payload). We generate the secret for the developer to use.

Background Knowledge: HMAC (hash-based message authentication code). A message authentication code is a short piece of information used to authenticate a message; in other words, to confirm that the message came from the stated sender (its authenticity) and has not been changed in transit (its integrity). The receiver verifies integrity by recomputing the digest of the original message with the secret shared between the trusted parties and comparing it with the digest it received.
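
A minimal sketch of signing and verifying the x-webhook-signature value with Python's standard library. Hex encoding of the digest and the function names `sign_payload` and `verify_signature` are assumptions; the text only specifies SHA256 HMAC over the raw request payload.

```python
import hashlib
import hmac

def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Gateway side: compute HMAC-SHA256 over the raw request body (hex-encoded by assumption)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

def verify_signature(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Merchant side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected, received_signature)
```

The merchant should verify against the raw bytes it received rather than a re-serialized JSON object, since re-serialization can change whitespace or key order and break the comparison.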

Metrics

The webhook gateway service emits statuses into the time-series DB for metrics.

InfluxDB vs. Prometheus?

  • InfluxDB: Application pushes data to InfluxDB. It has a monolithic DB for metrics and indices.
  • Prometheus: Prometheus server pulls the metrics values from the running application periodically. It uses LevelDB for indices, but each metric is stored in its own file.

I will probably choose InfluxDB for easier maintenance of the monolithic data store.

Depending on how much further data aggregation we need, we can build a more advanced data pipeline. However, for just counting successes and failures, a simple time-series DB solves the problem.

  • After public speaking, the biggest social fear in the Western world is initiating a conversation with strangers. To conquer the fear of rejection, it helps to realize that in most cases, people appreciate it when you make an effort to speak with them.
  • In some cases, not talking can make you come off as arrogant or aloof. Initiating a conversation can be simple: first, smiling at someone; second, establishing eye contact; third, being the first to introduce yourself.
  • To approach a group of people: 1, demonstrate your interest to the group from a distance, paying attention to the speaker. 2, the group will notice and make room to include you. 3, let the group warm to you before you offer any strong opinions.
  • Guiding a conversation evokes the positive feelings that make people want to work or socialize with you. One easy way of assuming this responsibility is to act like you’re a host and ask, “What’s your name?”. Emphasize “your” to make them feel valued.
  • The best way to improve your conversations is to ask open-ended questions, which demonstrates that you genuinely care about what they have to say.
  • A conversation will inevitably dip into an awkward silence sometimes. You can get it back to a comfortable flow by asking open-ended questions with the current contexts or the acronym FORM: family, occupation, recreation and miscellaneous.
  • To be an active listener, body language is important: avoid crossing arms, hunching shoulders, or fiddling with clothes, hair or jewelry. Instead, lean forward, nod, smile and maintain eye contact.
  • To be an active listener, verbal cues are important: engage by asking follow-up questions about the details, respond enthusiastically to express your interest, or paraphrase what the speaker said to clarify.
  • To end a conversation gracefully, the first thing you can do is to circle back to the highlight of your discussion. If you genuinely want to continue the discussion later, exchange contact info, state what you will do next, (if you are acquaintances) shake hands, and say goodbye.
  • It’s important to follow through with whatever it is you say you’re doing next. Otherwise, the other side may think you simply were not enjoying your time with them, which hurts feelings.
  • A courteous way to end a conversation is to introduce your conversation partner to a new person, which will widen their network and ensure they do not feel that you’re abandoning them. Or, conversely, you can ask them to introduce you to someone else.
Name | Definition | Comment
DAU (Daily active users) | # of unique users per day | Downloads are misleading because 80 to 90 percent of those who download an app never return
MAU (Monthly active users) | # of unique users per month | Downloads are misleading
Stickiness | (DAU / MAU) x 100 | Higher stickiness = higher ROI, mobile stickiness is 20x more than mobile/desktop
Retention rate | ((# of customers at end of period – # of customers acquired during period) / # of customers at start of period) x 100 | High retention is almost always a good thing.
Churn rate | 1 - Customer Retention Rate |
CPA (Cost per acquisition) | Total Marketing Cost / Total User Acquisitions | The lower the better
Average daily sessions per DAU | How frequently your users log into your app each day | Not always a good thing.
LTV (Lifetime value) | Average value of conversion x Average # of conversions in a timeframe x Average customer value | is losing money = Boolean(LTV < CPA)
ARPU (Average revenue per user) | Lifetime revenue of app / Lifetime # of users | ARPU answers when you should be earning more revenue per user
ARPPU (Average revenue per paying user) | Lifetime revenue of app / Lifetime # of paying users |
ROI (Return on Investment) | Return / Investment | Stay consistent to measure relative progress year-to-year
App load time | Should be <= 2 sec |
User satisfaction | Measured in CSAT and NPS | Better user satisfaction = more user retention + more LTV
CSAT (Customer satisfaction score) | (# of satisfied customers / # of survey respondents) x 100 | Ask customers to rate their satisfaction on a scale from 1 to 5. 4 or 5 means satisfied
NPS (Net Promoter Score) | ((# of promoters – # of detractors) / # of survey respondents) x 100 | Ask customers to rate on a scale from 0 to 10. Users who reply 0 to 6 are detractors. Users who reply 9 or 10 are promoters.
Goal achievement | % of users that achieve their goals each session | Goals can be a purchase, a signup, a share, etc.
Marketing Acquisition %, $, and dollar value | % of visitors from a top marketing channel |
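
A few of the formulas above, written out as a small sketch. The function names are hypothetical; churn is expressed in percent to match the retention formula, and the NPS buckets follow the standard convention of 0 to 6 for detractors and 9 or 10 for promoters.

```python
def stickiness(dau: int, mau: int) -> float:
    """(DAU / MAU) x 100."""
    return 100 * dau / mau

def retention_rate(at_start: int, at_end: int, acquired: int) -> float:
    """((customers at end - customers acquired) / customers at start) x 100."""
    return 100 * (at_end - acquired) / at_start

def churn_rate(retention_pct: float) -> float:
    """1 - retention rate, expressed in percent."""
    return 100 - retention_pct

def csat(scores: list) -> float:
    """Share of 1-to-5 ratings that are 4 or 5, in percent."""
    satisfied = sum(1 for s in scores if s >= 4)
    return 100 * satisfied / len(scores)

def nps(scores: list) -> float:
    """Promoters (9-10) minus detractors (0-6), as a percent of all respondents."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

For example, an app with 200 daily and 1,000 monthly active users has a stickiness of 20.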

Requirements

  • 3 million users
  • 5000 stocks + 250 global stocks
  • a user gets notified about a price change when
    1. the user has subscribed to the stock
    2. the stock price has changed by 5% or 10%
    3. since a) the last week or b) the last day
  • extensibility. may support other kinds of notifications like breaking news, earnings call, etc.

Sketching out the Architecture

Contexts:

  • What is clearing? Clearing is the procedure by which financial trades settle – that is, the correct and timely transfer of funds to the seller and securities to the buyer. Often with clearing, a specialized organization acts as an intermediary known as a clearinghouse.
  • What is a stock exchange? A facility where stock brokers and traders can buy and sell securities.

[Architecture diagram. Components shown: External Vendors (Market Prices), price ticker, Time-series DB (influx or prometheus), price watcher, cronjob, throttler cache, Notification Queue, notifier, User Settings, Apple Push Notification service (APNs), Google Firebase Cloud Messaging (FCM), Email Services (AWS SES / sendgrid / etc.), Robinhood App, API Gateway, Reverse Proxy. Edge labels: tick every 5 mins; batch write; periodical read.]

What are those components and how do they interact with each other?

  • Price ticker
    • data fetching policies
      • option 1, preliminary: fetch data every 5 mins and flush it into the time-series database in batches.
      • option 2, advanced: nowadays external systems usually push data directly, so we do not have to pull all the time.
    • ~6000 points per request or per price change.
    • data retention of 1 week, because this is just the speed layer of the lambda architecture.
  • Price watcher
    • read the data from the last week or the last 24 hours for each stock.
    • calculate whether the fluctuation exceeds 5% or 10% in those two time spans. We get tuples like (stock, up 5%, 1 week).
      • corner case: should we normalize the price data? For example, an abnormal price caused by someone mistakenly selling UBER for $1 USD.
    • rate-limit (because a 5% or 10% delta may occur many times within one day), and then emit an event PRICE_CHANGE(STOCK_CODE, timeSpan, percentage) to the notification queue.
  • Periodical triggers are cron jobs, e.g. Airflow, Cadence.
  • notification queue
    • does not necessarily have to be introduced at the beginning, when the numbers of users and stocks are small.
    • may accept generic messaging events, like PRICE_CHANGE, EARNINGS_CALL, BREAKING_NEWS, etc.
  • Notifier
    • subscribes to the notification queue to get the events
    • then fetches who to notify from the user settings service
    • finally, based on user settings, sends out messages through APNs, FCM, or AWS SES.
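
The price watcher's threshold check and rate limiting might look roughly like this. The 5% and 10% thresholds and the two time spans come from the requirements; the function and class names, the tuple layout, and the fixed-window throttling rule are assumptions for illustration, and the throttler here is an in-process dict rather than a shared cache.

```python
THRESHOLDS = (10, 5)  # percent; check the larger threshold first

def percent_change(old_price: float, new_price: float) -> float:
    """Signed percentage change from old_price to new_price."""
    return (new_price - old_price) / old_price * 100

def detect_price_change(stock: str, time_span: str, old_price: float, new_price: float):
    """Return (stock, direction, threshold, time_span) or None if below 5%."""
    change = percent_change(old_price, new_price)
    for threshold in THRESHOLDS:
        if abs(change) >= threshold:
            direction = "up" if change > 0 else "down"
            return (stock, direction, threshold, time_span)
    return None

class Throttler:
    """Allow one event per key within a time window (seconds)."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.last_sent = {}

    def allow(self, key, now: float) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same event was already emitted within the window
        self.last_sent[key] = now
        return True
```

A watcher run would call `detect_price_change` once per stock and time span, and enqueue a PRICE_CHANGE event only when the result is not None and `Throttler.allow` accepts it.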

TianPan.co

Startup Engineering
© 2010-2018 Tian