Why such frameworks?

  • process high-throughput in low latency.
  • fault-tolerance in distributed systems.
  • generic abstraction to serve volatile business requirements.
  • for bounded data set (batch processing) and for unbounded data set (stream processing).

Brief history of batch/stream processing

  1. Hadoop and MapReduce. Google made batch processing as simple as MR result = pairs.map((pair) => (morePairs)).reduce(somePairs => lessPairs) in a distributed system.
  2. Apache Storm and DAG Topology. MR doesn’t efficiently express iterative algorithms. Thus Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark in-memory Computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines compared to Hadoop.
  4. Google Dataflow based on Millwheel and FlumeJava. Google supports both batch and streaming computing with the windowing API.
  1. its fast adoption of Google Dataflow/Beam programming model.
  2. its highly efficient implementation of Chandy-Lamport checkpointing.

How?

Architectural Choices

To serve requirements above with commodity machines, the steaming framework use distributed systems in these architectures…

  • master-slave (centralized): apache storm with zookeeper, apache samza with YARN.
  • P2P (decentralized): apache s4.

Features

  1. DAG Topology for Iterative Processing. e.g. GraphX in Spark, topologies in Apache Storm, DataStream API in Flink.
  2. Delivery Guarantees. How guaranteed to deliver data from nodes to nodes? at-least once / at-most once / exactly once.
  3. Fault-tolerance. Using cold/warm/hot standby, checkpointing, or active-active.
  4. Windowing API for unbounded data set. e.g. Stream Windows in Apache Flink. Spark Window Functions. Windowing in Apache Beam.

Comparison

Framework Storm Storm-trident Spark Flink
Model native micro-batch micro-batch native
Guarentees at-least-once exactly-once exactly-once exactly-once
Fault-tolerance Record-Ack record-ack checkpoint checkpoint
Overhead of fault-tolerance high medium medium low
latency very-low high high low
throughput low medium high high

Motivation

Leveraging user and device data during user login to fight against

  1. ATO (account takeovers)
  2. Botnet attacks

ATOs ranking from easy to hard to detect

  1. from single IP
  2. from IPs on the same device
  3. from IPs across the world
  4. from 100k IPs
  5. attacks on specific accounts
  6. phishing and malware

Solutions

Semi-supervised learning = unlabeled data + small amount of labeled data

Why? better learning accuracy than unsupervised learning + less time and costs than supervised learning

  • K-means: not good
  • DBSCAN: better. Use labels to
    1. Tune hyperparameter
    2. Constrain

Challenges

  • Manual feature selection
  • Feature evolution in adversarial environment
  • Scalability
  • No online DBSCAN

Architecture

Anti-fraud Query

Anti-fraud Featuring

Production Setup

  • Batch: 7 days worth of data, run DBSCAN hourly
  • Streaming: 60 minutes moving window, run streaming k-means
  • Used feedback signal success ratios to mark clusters as good, bad or unknown
  • Bad clusters: Always throw
  • Good clusters: Small % of attempts
  • Unknown clusters: X% of attempts

Ads Ecosystem

  • Brand / Advertiser: individuals or organizations who want to publish advertising messages to the customers.

  • Agency: they help the brand to interact with the rest of the ecosystem and manage the whole lifecycle of the advertising messages, including planning, creating, and distributing ad campaigns.

  • Trading Desk: It streamlines the media buying process.

  • Demand-side Platform (DSP): it automates online ad inventory and buying, helping agencies to manage accounts across different accounts and campaigns through one platform.

  • Data-management Platform (DMP)

    1. Ads-based Analytics: attrition, targeting, profiling, session replay, and more.
    2. Anti-fraud
    3. Market-based Analytics
  • Ad Exchange / Real-time Bidding (RTB): It matches ads suppliers with buyers.

  • Ad Network: It aggregates publisher inventory and sells it to advertisers.

  • Supply Side Platform (SSP): It monitors the entire ads inventory and suggest prices for ad space.

  • Publisher: Ad-space owners like website operators.

Andy Grove emphasizes that a manager’s most important responsibility is to elicit top performance from his subordinates..

Unfortunately, one management style does not fit all the people in all the scenarios. A fundamental variable to find the best management style is task-relevant maturity (TRM) of the subordinates.

TRM Effective Management Style
low structured; task-oriented; detailed-oriented; instruct exactly “what/when/how mode”
medium Individual-oriented; support, “mutual-reasoning mode”
high goal-oriented; monitoring mode

A person’s TRM depends on the specific work items. It takes time to improve. When TRM reaches the highest level, the person’s both knowledge-level and motivation are ready for her manager to delegate work.

The key here is to regard any management mode not as either good or bad but rather as effective or not effective.

What is change aversion?

By and large, anytime you change what people regularly use in a product, they will always throw an uproar. This happens to almost every release of products like Gmail, YouTube, iPhone, etc.

How to avoid or mitigate change aversion?

  1. Let users understand, in advance and afterward. Warn them about the significant changes early and communicate why those places changed. Provide transition instructions afterward.
  2. Let users switch. Don’t shut the door and leave them alone in the helplessness.
  3. Let users give feedbacks and follow through.

Change Aversion isn’t an Excuse

The product changes may turn out to be good or bad ones.

change aversion patterns

Tian Pan's Notes

Software Engineering and Startup
© 2010-2018 Tian
Built with ❤️ in San Francisco