Tencent open-sources Fast-Causal-Inference, a data science component for distributed vectorized statistical analysis

IT Home, September 18: Tencent announced on its "Tencent Open Source" public account that Fast-Causal-Inference, its open-source distributed data science component, has been released on GitHub.

▲ Source: "Tencent Open Source" public account

According to the announcement, the project was developed by Tencent's WeChat team. It is a SQL-driven library for statistical analysis and causal inference, built on distributed vectorized computation. It is said to "solve the performance bottleneck of existing statistical model libraries (R / Python) on big data, provide causal inference that executes in seconds on tens of billions of rows, and lower the barrier to using statistical models through the SQL language, making it easy to use in production. It has been applied in WeChat Channels, WeChat Search, and other internal WeChat businesses."

Official introduction:

Second-level causal inference on massive data: built on the vectorized OLAP execution engines ClickHouse / StarRocks, whose speed supports a better user experience.

Simplified SQL usage: the SQLGateway WebServer lowers the barrier to using statistical models through the SQL language. It offers a simplified SQL syntax at the upper layer and transparently handles engine-specific SQL expansion and optimization.

Basic and high-order operators for causal inference, plus upper-layer application packaging: supports ttest, OLS, Lasso, tree-based models, matching, bootstrap, DML, and more.

IT Home also learned that, according to the project team, the first version already supports the following features:

Basic causal inference tools: a deltamethod-based ttest with CUPED support; OLS that runs in under a second on billions of rows.

Advanced causal inference tools: OLS-based IV, WLS, and other GLS, DID, synthetic control, CUPED, and mediation are incubating. uplift: minute-level computation on tens of millions of rows. Data simulation frameworks such as bootstrap/permutation, which address variance estimation problems that have no explicit (closed-form) solution.
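To make the delta-method and bootstrap items in the feature list more concrete, here is a minimal single-machine sketch in plain Python. It is not the Fast-Causal-Inference API and does not use SQL, ClickHouse, or StarRocks. All function names and the synthetic data are hypothetical, chosen only to illustrate two ways of estimating the variance of a ratio metric such as clicks per page view.

```python
import random
import statistics


def make_demo_data(n_users=500, seed=0):
    """Synthetic per-user data (hypothetical): x = page views, y = clicks."""
    rng = random.Random(seed)
    x = [rng.randint(1, 10) for _ in range(n_users)]
    # Each page view turns into a click with probability 0.3.
    y = [sum(1 for _ in range(views) if rng.random() < 0.3) for views in x]
    return x, y


def delta_method_ratio_variance(x, y):
    """First-order delta-method approximation of Var(mean(y) / mean(x))."""
    n = len(x)
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    var_x = statistics.variance(x)
    var_y = statistics.variance(y)
    # Sample covariance between denominator and numerator.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    ratio = mean_y / mean_x
    return (var_y - 2 * ratio * cov_xy + ratio**2 * var_x) / (n * mean_x**2)


def bootstrap_ratio_variance(x, y, n_resamples=1000, seed=0):
    """Variance of the ratio estimated by resampling users with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(sum(y[i] for i in idx) / sum(x[i] for i in idx))
    return statistics.variance(estimates)


if __name__ == "__main__":
    x, y = make_demo_data()
    print("delta method:", delta_method_ratio_variance(x, y))
    print("bootstrap:   ", bootstrap_ratio_variance(x, y))
```

On data like this the two estimates should land close to each other. The delta method needs only a handful of aggregates (means, variances, one covariance), which is the kind of computation that fits naturally into a vectorized SQL engine, while the bootstrap trades extra computation for generality when no closed-form variance is available.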