← 返回主平台

概述

聚合规则(aggregate)用于按指定字段分组,对每组数据进行统计计算。例如:

  • 按部门统计员工数和平均薪资
  • 按日期统计销售额和订单数
  • 按地区统计用户总数和活跃度
💡 关键概念:聚合规则会改变数据结构,从多行变为较少行(按分组分组)。

基本语法

配置结构

{
  "type": "aggregate",
  "groupBy": "分组字段",  // 按哪个字段分组
  "targets": [          // 聚合目标数组
    {
      "field": "要统计的字段名",
      "agg": "统计方法",  // sum, count, avg, max, min, distinct 等
      "outputField": "输出字段名"
    }
  ]
}

完整示例

原始数据:

[
  { status: "active",   salary: 8000 },
  { status: "active",   salary: 12000 },
  { status: "inactive", salary: 6500 }
]

聚合规则:

{
  "type": "aggregate",
  "groupBy": "status",
  "targets": [
    { "field": "salary", "agg": "sum", "outputField": "salary_sum" },
    { "field": "salary", "agg": "avg", "outputField": "salary_avg" },
    { "field": "salary", "agg": "count", "outputField": "emp_count" }
  ]
}

结果:

[
  { status: "active",   salary_sum: 20000, salary_avg: 10000, emp_count: 2 },
  { status: "inactive", salary_sum: 6500,  salary_avg: 6500,  emp_count: 1 }
]

聚合方法

COUNT - 计数

统计指定字段的非空值数量。

{
  "field": "salary",
  "agg": "count",
  "outputField": "emp_count"
}

// 数据:salary=[8000, 12000, null]
// 结果:2(count=2,不计null值)

COUNT_DISTINCT - 去重计数

统计指定字段中不重复值的个数。

{
  "field": "region",
  "agg": "count_distinct",
  "outputField": "region_count"
}

// 数据:region=["BJ", "BJ", "SH", "SH", "GZ"]
// 结果:3(三个不同的地区)

SUM - 求和

累加指定字段的数值。

{
  "field": "salary",
  "agg": "sum",
  "outputField": "total_salary"
}

// 数据:salary=[8000, 12000, 6500]
// 结果:26500

AVG - 平均值

计算指定字段的平均值。

{
  "field": "salary",
  "agg": "avg",
  "outputField": "avg_salary"
}

// 数据:salary=[8000, 12000, 6500]
// 结果:8833.33

MAX / MIN - 最大/最小值

{
  "field": "salary",
  "agg": "max",
  "outputField": "max_salary"
}

// 数据:salary=[8000, 12000, 6500]
// 结果:12000

CONCAT_VALUES - 字符串拼接

将组内的值拼接为一个字符串。

{
  "field": "name",
  "agg": "concat_values",
  "outputField": "all_names"
}

// 数据:name=["Alice", "Bob", "Carol"]
// 结果:all_names = "Alice,Bob,Carol"

MEDIAN - 中位数

计算数据的中位数。

{
  "field": "salary",
  "agg": "median",
  "outputField": "median_salary"
}

STDDEV - 标准差

计算数据的统计离散度。

{
  "field": "salary",
  "agg": "stddev",
  "outputField": "salary_stddev"
}

FIRST / LAST - 首值/末值

取分组中的第一个或最后一个值。

{
  "field": "time",
  "agg": "first",
  "outputField": "first_time"
}

输出模式(outputMode)

聚合规则支持三种输出模式:

summary_rows(默认)

仅返回聚合后的汇总行,原始数据被替换:

{
  "type": "aggregate",
  "groupBy": "department",
  "outputMode": "summary_rows",
  "targets": [
    { "field": "salary", "agg": "avg", "outputField": "avg_salary" }
  ]
}

// 原始:3行(3个员工)
// 输出:2行(2个部门的统计)

inject_field

将汇总值注回原始行,每行都会添加对应的聚合值:

{
  "type": "aggregate",
  "groupBy": "department",
  "outputMode": "inject_field",
  "targets": [
    { "field": "salary", "agg": "avg", "outputField": "dept_avg_salary" }
  ]
}

// 原始行:
// { name: "Alice", salary: 8000, dept: "sales" }
// 输出后:
// { name: "Alice", salary: 8000, dept: "sales", dept_avg_salary: 9000 }

append_summary

在原始数据后面附加汇总行,保留所有原始数据:

{
  "type": "aggregate",
  "groupBy": "department",
  "outputMode": "append_summary",
  "targets": [
    { "field": "salary", "agg": "sum", "outputField": "salary_sum" }
  ]
}

// 原始3行 + 聚合后2行 = 共5行输出

分组细节

单字段分组

{
  "type": "aggregate",
  "groupBy": "department",
  "targets": [
    { "field": "salary", "agg": "avg", "outputField": "avg_salary" }
  ]
}

多字段分组

按多个字段组合分组:

{
  "type": "aggregate",
  "groupBy": ["department", "location"],  // 按两个字段
  "targets": [
    { "field": "salary", "agg": "sum", "outputField": "total_salary" }
  ]
}

原始数据:

[
  { dept: "sales", loc: "BJ", salary: 8000 },
  { dept: "sales", loc: "BJ", salary: 9000 },
  { dept: "sales", loc: "SH", salary: 10000 },
  { dept: "hr", loc: "BJ", salary: 6000 }
]

结果:

[
  { dept: "sales", loc: "BJ", total_salary: 17000 },
  { dept: "sales", loc: "SH", total_salary: 10000 },
  { dept: "hr", loc: "BJ", total_salary: 6000 }
]

多重聚合

在一个聚合规则中对多个字段进行不同的统计:

{
  "type": "aggregate",
  "groupBy": "product",
  "targets": [
    { "field": "quantity", "agg": "sum", "outputField": "total_qty" },
    { "field": "price", "agg": "avg", "outputField": "avg_price" },
    { "field": "order_id", "agg": "count", "outputField": "order_count" },
    { "field": "region", "agg": "distinct", "outputField": "region_count" }
  ]
}

结果会包含所有这些聚合后的新字段。

实战示例

例 1:销售数据统计

计算每个销售员的销售总额、平均订单金额、订单数。

{
  "type": "aggregate",
  "groupBy": "salesman",
  "targets": [
    { "field": "amount", "agg": "sum", "outputField": "total_sales" },
    { "field": "amount", "agg": "avg", "outputField": "avg_order" },
    { "field": "order_id", "agg": "count", "outputField": "order_count" }
  ]
}

// 结果示例:
// { salesman: "Alice", total_sales: 50000, avg_order: 2500, order_count: 20 }
// { salesman: "Bob", total_sales: 45000, avg_order: 2250, order_count: 20 }

例 2:按部门和职级的薪资分析

{
  "type": "aggregate",
  "groupBy": ["department", "level"],
  "targets": [
    { "field": "salary", "agg": "sum", "outputField": "total_payroll" },
    { "field": "salary", "agg": "avg", "outputField": "avg_salary" },
    { "field": "name", "agg": "count", "outputField": "headcount" },
    { "field": "salary", "agg": "max", "outputField": "max_salary" },
    { "field": "salary", "agg": "min", "outputField": "min_salary" }
  ]
}

// 结果:清晰展示每个部门、每个职级的薪资情况

例 3:日期聚合(需配合计算规则)

统计每日销售额:

// 第一步:用 compute 规则提取日期部分
{
  "type": "compute",
  "targetField": "sale_date",
  "formula": "DATE_FORMAT(timestamp, 'YYYY-MM-DD')"
}

// 第二步:按日期聚合
{
  "type": "aggregate",
  "groupBy": "sale_date",
  "targets": [
    { "field": "amount", "agg": "sum", "outputField": "daily_sales" },
    { "field": "amount", "agg": "count", "outputField": "transaction_count" }
  ]
}

例 4:数据质量检查

统计每个客户的订单数、金额范围、是否有异常:

{
  "type": "aggregate",
  "groupBy": "customer_id",
  "targets": [
    { "field": "order_id", "agg": "count", "outputField": "order_count" },
    { "field": "amount", "agg": "sum", "outputField": "total_spent" },
    { "field": "amount", "agg": "max", "outputField": "max_single_order" },
    { "field": "amount", "agg": "min", "outputField": "min_single_order" }
  ]
}

高级技巧

聚合后再计算

聚合后,可以继续用 compute 规则对结果进行计算:

// 第一步:聚合
{
  "type": "aggregate",
  "groupBy": "region",
  "targets": [
    { "field": "revenue", "agg": "sum", "outputField": "total_revenue" },
    { "field": "cost", "agg": "sum", "outputField": "total_cost" }
  ]
}

// 第二步:计算利润
{
  "type": "compute",
  "targetField": "profit",
  "formula": "total_revenue - total_cost"
}

// 第三步:计算利润率
{
  "type": "compute",
  "targetField": "profit_rate",
  "formula": "IF(total_revenue == 0, 0, profit / total_revenue * 100)"
}

分组后筛选

使用 清洗规则 中的过滤来选择满足条件的分组:

// 第一步:聚合
{
  "type": "aggregate",
  "groupBy": "product",
  "targets": [
    { "field": "quantity", "agg": "sum", "outputField": "total_qty" }
  ]
}

// 第二步:只保留销售量 > 1000 的产品
{
  "type": "filter",
  "condition": "total_qty > 1000"
}

性能优化

聚合性能指标

  • 📊 10 万行数据、10 个分组字段:通常 < 1 秒
  • 📊 10 万个不同分组值、多个聚合指标:通常 < 500ms
  • ⚠️ 避免在超大字符串字段上做 GROUP_CONCAT

最佳实践

  1. 先过滤不需要的行,再聚合(提高处理速度)
  2. 只聚合需要的字段,不要全聚合
  3. 如果结果仍太大,考虑进一步分组

常见错误与排查

❌ 错误 1:groupBy 字段不存在

// 错误:字段名拼写错误或不存在
"groupBy": "depart"  // 应该是 "department"

解决:检查 groupBy 字段名是否与数据中的字段名完全一致

❌ 错误 2:数值字段用 SUM

// 错误:在字符串或日期字段上做 SUM
{ "field": "email", "agg": "sum" }  // 无意义

// 正确:
{ "field": "salary", "agg": "sum" }  // 数值字段

解决:确保用数值字段做 SUM、AVG 等算术聚合

❌ 错误 3:混淆 COUNT 和 COUNT_ALL

COUNT      // 非空值数量
COUNT_ALL  // 总行数(包括 null)

// 数据:[{a:1}, {a:2}, {a:null}]
// COUNT 结果:2
// COUNT_ALL 结果:3

❌ 错误 4:不同聚合方法混用错误

// 错误:在非数值字段用 AVG
{ "field": "name", "agg": "avg" }  // name 是字符串

// 正确:
{ "field": "salary", "agg": "avg" }  // salary 是数字

与其他规则的组合

完整的数据处理流程

// 1. 脱敏:保护敏感信息
{
  "type": "mask",
  "preset": "name",
  "field": "employee_name"
}

// 2. 清洗:修正错误和缺失值
{
  "type": "normalize",
  "field": "salary",
  "mode": "number"
}

// 3. 计算:生成新字段
{
  "type": "compute",
  "targetField": "annual_salary",
  "formula": "salary * 12"
}

// 4. 聚合:统计分析
{
  "type": "aggregate",
  "groupBy": "department",
  "targets": [
    { "field": "annual_salary", "agg": "sum", "outputField": "dept_payroll" },
    { "field": "annual_salary", "agg": "avg", "outputField": "avg_salary" }
  ]
}
✅ 完成了聚合规则的学习!

结合计算和聚合规则,你可以进行复杂的数据分析和统计。更多实战案例见应用场景