WeChat’s Feature Compute Engine for Real-time Recommender Systems
Jin Shang, Software Engineer, Tencent WeChat
Who am I?
Present:
Software Engineer @ WeChat's ML Platform Team
Lead developer of our vectorized feature compute engine
Active community contributor to Apache Arrow
Previous:
SWE Intern @ Google Cloud Vertex AI Feature Store
Master's in Computer Science @ CMU
Background
Machine learning at WeChat
Recommender Systems: Ads, Articles, Videos, Live Streams, Feeds, Search Ranking…
Trust and Safety: Fraud Detection, Content Moderation
Internal Usage: Code Generation, Documentation Chatbot
Offline feature generation
Fixed-interval batch jobs: Apache Spark
Event-triggered streaming jobs: Apache Flink
Online feature extraction
Transforms data in the DB into features for models
OLAP-like operations on small batches: <1000 items per request
Requires low latency: <50 ms per request
Facebook F3 / OpenMLDB Online Engine
Feature Engineering for Recommendation
Operations that are impossible or expensive offline:
Filtering: Eliminate invalid/abnormal values
Joining: Cross user and item features online to save key space
Sorting and Aggregation: # of requests < # of actions
Numerical transformations: Different parameters for each model
Discretization
Hashing
Recommender System: Online Feature Extraction
Workload pattern:
Each request contains a single user and multiple items
The same set of features is computed for each item
Features have fixed computation logic once deployed
The same computation tasks are repeated over and over
Conclusion:
Perfect scenario for vectorization and compilation
Technical Details
Apache Arrow
De facto standard for in-memory columnar (vector) data layout
The most widely used vector format in the data analytics world
Ad hoc analysis: PyArrow, Pandas, Polars…
OLAP engines: DuckDB, Velox, DataFusion
ML data: Huggingface Dataset, Ray Dataset
A complete toolbox for vector data:
Primitive, List, Map, Struct, Union, Extensions…
IO, Serialization, Compute
C, C++, Rust, Python, Go, Java, Matlab, Julia, JS…
Feature Representation with Apache Arrow
Each feature value is represented as an Arrow Array or Scalar
Item ratings: [9, 8, 9, null, 7], representing the values for items 1-5
Arrow arrays natively support null values
Item tags: [[“action”, “horror”], [“romance”], [], null, [“comedy”]]
Nuanced difference between empty list and null list
Number of feature values: 0 vs. null
                   User               Item
Scalar Feature     Scalar             Primitive Array
Sequence Feature   Primitive Array    List Array
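A minimal PyArrow sketch of the representations above (the feature values are made up for illustration):

import pyarrow as pa

# User scalar feature: one user per request -> a single Arrow Scalar
user_age = pa.scalar(30, type=pa.int32())

# Item scalar feature: one value per candidate item -> Primitive Array
item_ratings = pa.array([9, 8, 9, None, 7], type=pa.int64())
print(item_ratings.null_count)  # 1 (item 4 has no rating)

# Item sequence feature: a list of values per item -> List Array
item_tags = pa.array(
    [["action", "horror"], ["romance"], [], None, ["comedy"]],
    type=pa.list_(pa.string()),
)
print(item_tags[2].as_py())  # []   -> the item has zero tags
print(item_tags[3].as_py())  # None -> tags unknown for this item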
Feature Computation with Apache Arrow
We provide common operators such as Join, Sort, and Math Expressions…
Users define a feature by combining operators
Several ways to achieve SIMD vectorization on Arrow arrays:
LLVM JIT Engine
Arrow native compute functions
Compiler auto-vectorization
Hand-written SIMD intrinsics
LLVM JIT Engine
Gandiva Expression Compiler
Developed by Dremio, maintained by the Arrow team
Provides arithmetic functions pre-compiled to LLVM IR
Combines functions into expressions at the LLVM IR level
Leverages LLVM for various optimizations:
loop vectorization
function inlining
instruction combining
Used for Projection and Filtering
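As a rough illustration (not the engine's actual code path), the same “(a+b)*c” projection can be assembled with PyArrow's Gandiva bindings, which are only available in PyArrow builds compiled with Gandiva support:

import pyarrow as pa
import pyarrow.gandiva as gandiva  # requires a Gandiva-enabled PyArrow build

schema = pa.schema([("a", pa.float64()), ("b", pa.float64()), ("c", pa.float64())])

builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(schema.field("a"))
node_b = builder.make_field(schema.field("b"))
node_c = builder.make_field(schema.field("c"))
a_plus_b = builder.make_function("add", [node_a, node_b], pa.float64())
product = builder.make_function("multiply", [a_plus_b, node_c], pa.float64())
expr = builder.make_expression(product, pa.field("out", pa.float64()))

# JIT-compile the expression once, then reuse the projector per request
projector = gandiva.make_projector(schema, [expr], pa.default_memory_pool())

batch = pa.RecordBatch.from_pydict(
    {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}, schema=schema
)
out, = projector.evaluate(batch)
print(out)  # [20, 36]

LLVM then optimizes the combined IR, producing vectorized loops like the excerpt below.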
Gandiva Expression Compiler
LLVM IR for “(a+b)*c” on the Armv8 NEON instruction set (excerpt):
vector.body: ; preds = %vector.body, %vector.ph
%lsr.iv4042 = bitcast double* %lsr.iv40 to <2 x double>*
%lsr.iv4547 = bitcast double* %lsr.iv45 to <2 x double>*
%lsr.iv5052 = bitcast double* %lsr.iv50 to <2 x double>*
%lsr.iv5557 = bitcast double* %lsr.iv55 to <2 x double>*
%21 = fadd <2 x double> %wide.load, %wide.load28
%22 = fadd <2 x double> %wide.load27, %wide.load29
%scevgep58 = getelementptr <2 x double>, <2 x double>* %lsr.iv5557, i64 -1
%wide.load30 = load <2 x double>, <2 x double>* %scevgep58, align 8, !alias.scope !16
%wide.load31 = load <2 x double>, <2 x double>* %lsr.iv5557, align 8, !alias.scope !16
%23 = fmul <2 x double> %21, %wide.load30
%24 = fmul <2 x double> %22, %wide.load31
Arrow Compute Functions
Natively supported vectorized functions
Gandiva is very fast, but…
Hard to develop and maintain
Hard to debug
Only used for simple operations such as math expressions
Arrow Compute
C++ functions that are dynamically dispatched at runtime
Provides vectorized kernels for many functions
Supports various complex operations, e.g. Sort, Aggregate
All functions are exported to Python, making them easier to experiment with
Arrow Compute Functions
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> a = pa.array([5, 3, 4, 1, 2])
>>> sorted_indices = pc.sort_indices(a)
>>> pc.take(a, sorted_indices)
<pyarrow.lib.Int64Array object at 0x127a5d1e0>
[
1,
2,
3,
4,
5
]
>>> pc.sum(a)
<pyarrow.Int64Scalar: 15>
>>> user_history_ids = pa.array([999, 777, 555])
>>> user_history_watch_time = pa.array([30, 50, 10])
>>> request_items = pa.array([111, 222, 777, 888, 999])
>>> join_index = pc.index_in(request_items, user_history_ids)
>>> request_items_user_watch_time = pc.take(user_history_watch_time, join_index)
>>> request_items_user_watch_time
<pyarrow.lib.Int64Array object at 0x127a5d4e0>
[
null,
null,
50,
null,
30
]
Common complex operations, e.g. Sort, Aggregate, Join…
Arrow Compute Functions
We have contributed several functions to Arrow
cumulative_sum/prod/min/max
rolling_sum/min/max
adjoin_as_list
pairwise_diff
integer round functions
arithmetic for temporal types
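Two of these kernels as they appear in pyarrow.compute (a small illustrative example, assuming a PyArrow release recent enough to ship them):

import pyarrow as pa
import pyarrow.compute as pc

watch_time = pa.array([30, 50, 10, 25])

# Running total over the user's action history
print(pc.cumulative_sum(watch_time))  # [30, 80, 90, 115]

# Difference between consecutive actions; the first element has no predecessor
print(pc.pairwise_diff(watch_time))   # [null, 20, -40, 15]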
Compiler Auto Vectorization
Arrow arrays are always stored in contiguous memory
Gandiva/Compute are too heavy for some lightweight operations
Rely on the compiler to auto-vectorize
GCC: -ftree-vectorize
Clang: Enabled by default
Example: null bitmap bitwise AND
Handwritten SIMD with Intrinsics
Call SIMD intrinsic functions provided by Intel
Compilers fail to vectorize complex operations, e.g. hashing
Manual vectorization by calling SIMD intrinsics
Performance Issue for Small Batch Size
Some scenarios have only a few candidate items
Online training with one sample
Push notifications
Vectorized operators perform poorly when the batch size is small
Array overhead
Need to store array metadata
Arrow arrays are always aligned to 64 bytes
Vectorization overhead
Computes more values than needed
Compute Function overhead
Native Operators
Provide a native implementation for all operators
Each item's feature is stored separately as a single value
Dynamically dispatch to Native (<8 items) or Vectorized (>=8 items), sketched below
Consistency is ensured:
Full test coverage
Operator-level unit tests
Engine-level E2E tests
Daily pipeline to examine diffs from raw feature logs
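A toy Python sketch of the dispatch idea; the engine's operators are not implemented in Python, and the function name, the log(x+1) feature, and the cutover details here are illustrative only:

import math
import pyarrow as pa
import pyarrow.compute as pc

NATIVE_THRESHOLD = 8  # below this, the per-item scalar path wins

def log1p_feature(values):
    """Compute log(x + 1) for one feature, choosing the path by batch size."""
    if len(values) < NATIVE_THRESHOLD:
        # Native path: plain scalar loop, no array metadata,
        # no 64-byte alignment, no kernel dispatch
        return [None if v is None else math.log(v + 1) for v in values]
    # Vectorized path: Arrow array + vectorized compute kernels
    arr = pa.array(values, type=pa.float64())
    return pc.ln(pc.add(arr, 1)).to_pylist()

print(log1p_feature([1.0, 3.0]))                     # 2 items  -> native path
print(log1p_feature([float(i) for i in range(16)]))  # 16 items -> vectorized path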
Operator Fusion
Overhead for each operator call
Virtual dispatch
Input validation and copy
Output memory allocation
Most operators can be fused with Projection & Filtering
Three operations done in one operator call
Example: log(x+1) ORDER BY x WHERE x>0 AND x<100
PROJECTOR (log(x+1)), SORT (ORDER BY x), FILTER (WHERE x>0 AND x<100)
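For comparison, the unfused version needs three separate kernel calls, sketched here with pyarrow.compute on made-up values; the fused operator does the same work in one call:

import pyarrow as pa
import pyarrow.compute as pc

x = pa.array([42.0, -3.0, 250.0, 7.0, 0.5])

# FILTER: x > 0 AND x < 100
mask = pc.and_(pc.greater(x, 0), pc.less(x, 100))
kept = pc.filter(x, mask)                    # [42, 7, 0.5]

# SORT: ORDER BY x
kept = pc.take(kept, pc.sort_indices(kept))  # [0.5, 7, 42]

# PROJECT: log(x + 1)
feature = pc.ln(pc.add(kept, 1))             # [ln(1.5), ln(8), ln(43)]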
Performance Benchmark
When the item count is < 8, native is faster
At 64 items, vectorized is 6x faster
At 128 items, vectorized is 10x faster
Overhead is amortized over larger batches
CPU usage dropped by 50% after deployment
[Chart: latency (ms) vs. number of items (1 to 128) for the Vectorized and Native implementations]
Q&A