我们将使用 BigQuery 的标准 SQL 方言中提供的REGEXP_EXTRACT_ALL 函数从 URL 的查询部分提取参数并将它们作为数组返回。
代码
#standardSQL
SELECT
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM
table
工作示例
#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
SELECT 1 AS id,
'?foo=bar' AS query,
'simple' AS description
UNION ALL SELECT 2, '?foo=bar&bar=baz', 'multiple params'
UNION ALL SELECT 3, '?foo[]=bar&foo[]=baz', 'arrays'
UNION ALL SELECT 4, '', 'no query'
)
SELECT
id,
query,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values,
description
FROM examples
结果
Row | id | query | params | keys | values | description |
---|---|---|---|---|---|---|
1 | 1 | ?foo=bar | foo=bar | foo | bar | simple |
2 | 2 | ?foo=bar&bar=baz | foo=bar | foo | bar | multiple params |
bar=baz | bar | baz | ||||
3 | 3 | ?foo[]=bar&foo[]=baz | foo[]=bar | foo[] | bar | arrays |
foo[]=baz | foo[] | baz | ||||
4 | 4 | no query |
在 BigQuery 上运行
笔记
REGEXP_EXTRACT_ALL
只有 1 个捕获组除外,因此我们需要将所有其他组标记为非捕获(?:
- 如果 URL 包含片段部分(例如 https://example.org/?foo=bar#baz),则片段当前不会被删除。为此,请在使用 REGEXP_REPLACE提取之前删除片段,例如:
REGEXP_EXTRACT_ALL( REGEXP_EXTRACT(query, r'#.*', ''), r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
链接
用例
- 从您的日志文件中编译所有参数的列表
- 评估参数键/值的频率