聚合规则详解
统计分析和数据汇总的核心工具
概述
聚合规则(aggregate)用于按指定字段分组,对每组数据进行统计计算。例如:
- 按部门统计员工数和平均薪资
- 按日期统计销售额和订单数
- 按地区统计用户总数和活跃度
💡 关键概念:聚合规则会改变数据结构,从多行变为较少行(按分组分组)。
基本语法
配置结构
{
"type": "aggregate",
"groupBy": "分组字段", // 按哪个字段分组
"targets": [ // 聚合目标数组
{
"field": "要统计的字段名",
"agg": "统计方法", // sum, count, avg, max, min, distinct 等
"outputField": "输出字段名"
}
]
}
完整示例
原始数据:
[
{ status: "active", salary: 8000 },
{ status: "active", salary: 12000 },
{ status: "inactive", salary: 6500 }
]
聚合规则:
{
"type": "aggregate",
"groupBy": "status",
"targets": [
{ "field": "salary", "agg": "sum", "outputField": "salary_sum" },
{ "field": "salary", "agg": "avg", "outputField": "salary_avg" },
{ "field": "salary", "agg": "count", "outputField": "emp_count" }
]
}
结果:
[
{ status: "active", salary_sum: 20000, salary_avg: 10000, emp_count: 2 },
{ status: "inactive", salary_sum: 6500, salary_avg: 6500, emp_count: 1 }
]
聚合方法
COUNT - 计数
统计指定字段的非空值数量。
{
"field": "salary",
"agg": "count",
"outputField": "emp_count"
}
// 数据:salary=[8000, 12000, null]
// 结果:2(count=2,不计null值)
COUNT_DISTINCT - 去重计数
统计指定字段中不重复值的个数。
{
"field": "region",
"agg": "count_distinct",
"outputField": "region_count"
}
// 数据:region=["BJ", "BJ", "SH", "SH", "GZ"]
// 结果:3(三个不同的地区)
SUM - 求和
累加指定字段的数值。
{
"field": "salary",
"agg": "sum",
"outputField": "total_salary"
}
// 数据:salary=[8000, 12000, 6500]
// 结果:26500
AVG - 平均值
计算指定字段的平均值。
{
"field": "salary",
"agg": "avg",
"outputField": "avg_salary"
}
// 数据:salary=[8000, 12000, 6500]
// 结果:8833.33
MAX / MIN - 最大/最小值
{
"field": "salary",
"agg": "max",
"outputField": "max_salary"
}
// 数据:salary=[8000, 12000, 6500]
// 结果:12000
CONCAT_VALUES - 字符串拼接
将组内的值拼接为一个字符串。
{
"field": "name",
"agg": "concat_values",
"outputField": "all_names"
}
// 数据:name=["Alice", "Bob", "Carol"]
// 结果:all_names = "Alice,Bob,Carol"
MEDIAN - 中位数
计算数据的中位数。
{
"field": "salary",
"agg": "median",
"outputField": "median_salary"
}
STDDEV - 标准差
计算数据的统计离散度。
{
"field": "salary",
"agg": "stddev",
"outputField": "salary_stddev"
}
FIRST / LAST - 首值/末值
取分组中的第一个或最后一个值。
{
"field": "time",
"agg": "first",
"outputField": "first_time"
}
输出模式(outputMode)
聚合规则支持三种输出模式:
summary_rows(默认)
仅返回聚合后的汇总行,原始数据被替换:
{
"type": "aggregate",
"groupBy": "department",
"outputMode": "summary_rows",
"targets": [
{ "field": "salary", "agg": "avg", "outputField": "avg_salary" }
]
}
// 原始:3行(3个员工)
// 输出:2行(2个部门的统计)
inject_field
将汇总值注回原始行,每行都会添加对应的聚合值:
{
"type": "aggregate",
"groupBy": "department",
"outputMode": "inject_field",
"targets": [
{ "field": "salary", "agg": "avg", "outputField": "dept_avg_salary" }
]
}
// 原始行:
// { name: "Alice", salary: 8000, dept: "sales" }
// 输出后:
// { name: "Alice", salary: 8000, dept: "sales", dept_avg_salary: 9000 }
append_summary
在原始数据后面附加汇总行,保留所有原始数据:
{
"type": "aggregate",
"groupBy": "department",
"outputMode": "append_summary",
"targets": [
{ "field": "salary", "agg": "sum", "outputField": "salary_sum" }
]
}
// 原始3行 + 聚合后2行 = 共5行输出
分组细节
单字段分组
{
"type": "aggregate",
"groupBy": "department",
"targets": [
{ "field": "salary", "agg": "avg", "outputField": "avg_salary" }
]
}
多字段分组
按多个字段组合分组:
{
"type": "aggregate",
"groupBy": ["department", "location"], // 按两个字段
"targets": [
{ "field": "salary", "agg": "sum", "outputField": "total_salary" }
]
}
原始数据:
[
{ dept: "sales", loc: "BJ", salary: 8000 },
{ dept: "sales", loc: "BJ", salary: 9000 },
{ dept: "sales", loc: "SH", salary: 10000 },
{ dept: "hr", loc: "BJ", salary: 6000 }
]
结果:
[
{ dept: "sales", loc: "BJ", total_salary: 17000 },
{ dept: "sales", loc: "SH", total_salary: 10000 },
{ dept: "hr", loc: "BJ", total_salary: 6000 }
]
多重聚合
在一个聚合规则中对多个字段进行不同的统计:
{
"type": "aggregate",
"groupBy": "product",
"targets": [
{ "field": "quantity", "agg": "sum", "outputField": "total_qty" },
{ "field": "price", "agg": "avg", "outputField": "avg_price" },
{ "field": "order_id", "agg": "count", "outputField": "order_count" },
{ "field": "region", "agg": "distinct", "outputField": "region_count" }
]
}
结果会包含所有这些聚合后的新字段。
实战示例
例 1:销售数据统计
计算每个销售员的销售总额、平均订单金额、订单数。
{
"type": "aggregate",
"groupBy": "salesman",
"targets": [
{ "field": "amount", "agg": "sum", "outputField": "total_sales" },
{ "field": "amount", "agg": "avg", "outputField": "avg_order" },
{ "field": "order_id", "agg": "count", "outputField": "order_count" }
]
}
// 结果示例:
// { salesman: "Alice", total_sales: 50000, avg_order: 2500, order_count: 20 }
// { salesman: "Bob", total_sales: 45000, avg_order: 2250, order_count: 20 }
例 2:按部门和职级的薪资分析
{
"type": "aggregate",
"groupBy": ["department", "level"],
"targets": [
{ "field": "salary", "agg": "sum", "outputField": "total_payroll" },
{ "field": "salary", "agg": "avg", "outputField": "avg_salary" },
{ "field": "name", "agg": "count", "outputField": "headcount" },
{ "field": "salary", "agg": "max", "outputField": "max_salary" },
{ "field": "salary", "agg": "min", "outputField": "min_salary" }
]
}
// 结果:清晰展示每个部门、每个职级的薪资情况
例 3:日期聚合(需配合计算规则)
统计每日销售额:
// 第一步:用 compute 规则提取日期部分
{
"type": "compute",
"targetField": "sale_date",
"formula": "DATE_FORMAT(timestamp, 'YYYY-MM-DD')"
}
// 第二步:按日期聚合
{
"type": "aggregate",
"groupBy": "sale_date",
"targets": [
{ "field": "amount", "agg": "sum", "outputField": "daily_sales" },
{ "field": "amount", "agg": "count", "outputField": "transaction_count" }
]
}
例 4:数据质量检查
统计每个客户的订单数、金额范围、是否有异常:
{
"type": "aggregate",
"groupBy": "customer_id",
"targets": [
{ "field": "order_id", "agg": "count", "outputField": "order_count" },
{ "field": "amount", "agg": "sum", "outputField": "total_spent" },
{ "field": "amount", "agg": "max", "outputField": "max_single_order" },
{ "field": "amount", "agg": "min", "outputField": "min_single_order" }
]
}
高级技巧
聚合后再计算
聚合后,可以继续用 compute 规则对结果进行计算:
// 第一步:聚合
{
"type": "aggregate",
"groupBy": "region",
"targets": [
{ "field": "revenue", "agg": "sum", "outputField": "total_revenue" },
{ "field": "cost", "agg": "sum", "outputField": "total_cost" }
]
}
// 第二步:计算利润
{
"type": "compute",
"targetField": "profit",
"formula": "total_revenue - total_cost"
}
// 第三步:计算利润率
{
"type": "compute",
"targetField": "profit_rate",
"formula": "IF(total_revenue == 0, 0, profit / total_revenue * 100)"
}
分组后筛选
使用 清洗规则 中的过滤来选择满足条件的分组:
// 第一步:聚合
{
"type": "aggregate",
"groupBy": "product",
"targets": [
{ "field": "quantity", "agg": "sum", "outputField": "total_qty" }
]
}
// 第二步:只保留销售量 > 1000 的产品
{
"type": "filter",
"condition": "total_qty > 1000"
}
性能优化
聚合性能指标
- 📊 10 万行数据、10 个分组字段:通常 < 1 秒
- 📊 10 万个不同分组值、多个聚合指标:通常 < 500ms
- ⚠️ 避免在超大字符串字段上做 GROUP_CONCAT
最佳实践
- 先过滤不需要的行,再聚合(提高处理速度)
- 只聚合需要的字段,不要全聚合
- 如果结果仍太大,考虑进一步分组
常见错误与排查
❌ 错误 1:groupBy 字段不存在
// 错误:字段名拼写错误或不存在
"groupBy": "depart" // 应该是 "department"
解决:检查 groupBy 字段名是否与数据中的字段名完全一致
❌ 错误 2:数值字段用 SUM
// 错误:在字符串或日期字段上做 SUM
{ "field": "email", "agg": "sum" } // 无意义
// 正确:
{ "field": "salary", "agg": "sum" } // 数值字段
解决:确保用数值字段做 SUM、AVG 等算术聚合
❌ 错误 3:混淆 COUNT 和 COUNT_ALL
COUNT // 非空值数量
COUNT_ALL // 总行数(包括 null)
// 数据:[{a:1}, {a:2}, {a:null}]
// COUNT 结果:2
// COUNT_ALL 结果:3
❌ 错误 4:不同聚合方法混用错误
// 错误:在非数值字段用 AVG
{ "field": "name", "agg": "avg" } // name 是字符串
// 正确:
{ "field": "salary", "agg": "avg" } // salary 是数字
与其他规则的组合
完整的数据处理流程
// 1. 脱敏:保护敏感信息
{
"type": "mask",
"preset": "name",
"field": "employee_name"
}
// 2. 清洗:修正错误和缺失值
{
"type": "normalize",
"field": "salary",
"mode": "number"
}
// 3. 计算:生成新字段
{
"type": "compute",
"targetField": "annual_salary",
"formula": "salary * 12"
}
// 4. 聚合:统计分析
{
"type": "aggregate",
"groupBy": "department",
"targets": [
{ "field": "annual_salary", "agg": "sum", "outputField": "dept_payroll" },
{ "field": "annual_salary", "agg": "avg", "outputField": "avg_salary" }
]
}
✅ 完成了聚合规则的学习!
结合计算和聚合规则,你可以进行复杂的数据分析和统计。更多实战案例见应用场景。