json形式のログからデータ抽出をしないといけなくなって、さてどうしようかと考えた結果、以前使ったApache Drillを久しぶりに使ってみたら、思いのほかスムーズに欲しいデータが抽出できて満足したので手順のメモです

こんな人向け

  • json形式のログから必要なデータをSQLっぽく抽出したい
  • でも抽出のためにいちいちDB作ったりCSVに変換したりしたくない → jsonのままなんとかしたい
  • これのために新しいツールとかインストールしたりしたくない → Dockerだけでなんとかしたい

サンプルデータ

こんな感じに、jsonオブジェクトが1行ごとに並んでいる、とあるテストの実行ログを使います。 log.json

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
$ cat log.json
{"time":"2021-05-07 13:22:40","elapsed":"00:09:53","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 13:23:46","elapsed":"00:10:39","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 13:25:07","elapsed":"00:11:52","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 13:41:28","elapsed":"00:10:57","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 13:23:34","elapsed":"00:10:34","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 13:48:51","elapsed":"00:00:51","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 13:49:42","elapsed":"00:01:28","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 13:50:01","elapsed":"00:01:51","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 13:48:49","elapsed":"00:00:55","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 13:51:12","elapsed":"00:03:17","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:37:02","elapsed":"00:04:24","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:38:02","elapsed":"00:05:24","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:42:24","elapsed":"00:10:46","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:43:52","elapsed":"00:13:24","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 15:36:44","elapsed":"00:04:27","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:43:56","elapsed":"00:13:31","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 15:45:23","elapsed":"00:13:30","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:44:54","elapsed":"00:14:00","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 15:46:08","elapsed":"00:14:47","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:37:01","elapsed":"00:04:59","fail":1,"label":"Critical Tests","pass":0}

DockerでApache Drillを使う

apache/drillイメージを使用します

1
2
3
4
5
6
7
8
$ ls
log.json

// カレントディレクトリを/mntにマウントしつつ実行
$ docker run -i --rm -p 8047:8047 -v ${PWD}:/mnt -t apache/drill:latest sqlline -u jdbc:drill:zk=local
Apache Drill 1.18.0
"Got Drill?"
apache drill>

普通にselect

普通に5件参照

1
2
3
4
5
6
7
8
9
10
11
apache drill> select * from dfs.`/mnt/log.json` limit 5;
+---------------------+----------+------+----------------+------+
| time | elapsed | fail | label | pass |
+---------------------+----------+------+----------------+------+
| 2021-05-07 13:22:40 | 00:09:53 | 1 | Critical Tests | 0 |
| 2021-05-07 13:23:46 | 00:10:39 | 0 | Critical Tests | 1 |
| 2021-05-07 13:25:07 | 00:11:52 | 0 | Critical Tests | 1 |
| 2021-05-07 13:41:28 | 00:10:57 | 0 | Critical Tests | 1 |
| 2021-05-07 13:23:34 | 00:10:34 | 0 | Critical Tests | 1 |
+---------------------+----------+------+----------------+------+
5 rows selected (0.336 seconds)

order byでソート

timeで降順ソート

1
2
3
4
5
6
7
8
9
10
apache drill> select * from dfs.`/mnt/log.json` order by `time` desc limit 5;
+---------------------+----------+------+----------------+------+
| time | elapsed | fail | label | pass |
+---------------------+----------+------+----------------+------+
| 2021-05-07 15:46:08 | 00:14:47 | 1 | Critical Tests | 0 |
| 2021-05-07 15:45:23 | 00:13:30 | 1 | Critical Tests | 0 |
| 2021-05-07 15:44:54 | 00:14:00 | 0 | Critical Tests | 1 |
| 2021-05-07 15:43:56 | 00:13:31 | 0 | Critical Tests | 1 |
| 2021-05-07 15:43:52 | 00:13:24 | 0 | Critical Tests | 1 |
+---------------------+----------+------+----------------+------+

group by

failで集約してカウント

1
2
3
4
5
6
7
8
apache drill> select fail,count(*) as cnt from dfs.`/mnt/log.json` group by `fail`;
+------+-----+
| fail | cnt |
+------+-----+
| 0 | 7 |
| 1 | 13 |
+------+-----+
2 rows selected (0.209 seconds)

between

timeで期間の絞り込み

1
2
3
4
5
6
7
8
9
10
apache drill> select * from dfs.`/mnt/log.json` where `time` between '2021-05-07 15:40:00' and '2021-05-07 15:45:00' ;
+---------------------+----------+------+----------------+------+
| time | elapsed | fail | label | pass |
+---------------------+----------+------+----------------+------+
| 2021-05-07 15:42:24 | 00:10:46 | 1 | Critical Tests | 0 |
| 2021-05-07 15:43:52 | 00:13:24 | 0 | Critical Tests | 1 |
| 2021-05-07 15:43:56 | 00:13:31 | 0 | Critical Tests | 1 |
| 2021-05-07 15:44:54 | 00:14:00 | 0 | Critical Tests | 1 |
+---------------------+----------+------+----------------+------+
4 rows selected (0.34 seconds)

可変長構造のjson

NoSQL的に扱えます。たとえばこんなデータ

1
2
3
4
5
6
$ cat log.json
{"time":"2021-05-07 15:36:44","elapsed":"00:04:27","fail":1,"label":"Critical Tests","pass":0}
{"time":"2021-05-07 15:43:56","elapsed":"00:13:31","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 15:45:23","elapsed":"00:13:30","fail":1,"label":"Critical Tests","pass":0,"hoge":"aaa"}
{"time":"2021-05-07 15:44:54","elapsed":"00:14:00","fail":0,"label":"Critical Tests","pass":1}
{"time":"2021-05-07 15:46:08","elapsed":"00:14:47","fail":1,"label":"Critical Tests","pass":0}

null扱いになります

1
2
3
4
5
6
7
8
9
10
11
apache drill> select * from dfs.`/mnt/log.json`;
+---------------------+----------+------+----------------+------+------+
| time | elapsed | fail | label | pass | hoge |
+---------------------+----------+------+----------------+------+------+
| 2021-05-07 15:36:44 | 00:04:27 | 1 | Critical Tests | 0 | null |
| 2021-05-07 15:43:56 | 00:13:31 | 0 | Critical Tests | 1 | null |
| 2021-05-07 15:45:23 | 00:13:30 | 1 | Critical Tests | 0 | aaa |
| 2021-05-07 15:44:54 | 00:14:00 | 0 | Critical Tests | 1 | null |
| 2021-05-07 15:46:08 | 00:14:47 | 1 | Critical Tests | 0 | null |
+---------------------+----------+------+----------------+------+------+
5 rows selected (1.513 seconds)

補足情報: ネスト構造のjson

ここでは例示しませんが、Nested Data Functionsの、FLATTENなどを使用すると、ネスト構造を持ったjsonなども解析出来ます

参考サイト: ネスト構造のJSONデータにApache Drillで直接SQLクエリをかける - nagix

補足情報: RDBMS Storage Plugin

また、RDBMS-STORAGE-PLUGINを使用すると、RDBMSとjsonデータをunionで縦結合して解析するみたいな事も出来ます

Apache Drillを終了する

1
apache drill> !quit

マニュアル

以上のように、一般的なSQLで使用する構文はそのまま使用出来ます。公式にSQLリファレンスもあるのでご参考下さい