Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage

インストール

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// javaが入っていない場合は必要
$ sudo apt-get install -y default-jdk

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

// ダウンロード
$ cd ~
$ wget http://ftp.jaist.ac.jp/pub/apache/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz
$ tar zxvf apache-drill-1.12.0.tar.gz

// 必要に応じて、PATHの通った所にリンクを貼ります
$ ln -s ~/apache-drill-1.12.0/bin/sqlline ~/bin/

基本的なところ

Drill in 10 Minutes - Apache Drill

↑このあたりにだいたい載ってます

1
2
// 実行
$ sqlline -u jdbc:drill:zk=local

zk=localの部分は、接続先のZookeeperホスト名などが入るようです。特に使用しないならlocalでOK

1
2
// 終了
0: jdbc:drill:zk=local> !quit

触ってみる:CSV

使ったデータ:2016_presidential_candidate_expenditures.csv (Resources | Tableau Publicより)

2018-01-26追記: 下記では、1行目の列名がデータ扱いされていますが、拡張子を「csvh」にすると、1行目はフィールド名として認識してくれるようです

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
// 10件select
0: jdbc:drill:zk=local> select * from dfs.`/var/tmp/tableau-datasets/2016_presidential_candidate_expenditures.csv` limit 10;
+---------+
| columns |
+---------+
| ["cmte_id","cand_id","cand_nm","Party","recipient_nm","disb_amt","disb_dt","recipient_city","recipient_st","recipient_zip","disb_desc","memo_cd","memo_text","form_tp","file_num","tran_id","election_tp\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","GREEN, PERRY MR.","500","15-Sep-15","ANCHORAGE","AK","995084442","IN-KIND CONTRIBUTION","","IN KIND: FACILITY RENTAL","SB23","1031526","SB23.831217","P2016\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","ALASKA REPUBLICAN PARTY","2500","16-Nov-15","ANCHORAGE","AK","99503","BALLOT ACCESS","","","SB23","1047126","SB23.I14767","\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","JOHNSON, MARK","1000","30-Sep-15","ANCHORAGE","AK","995163436","IN-KIND CONTRIBUTION","","IN KIND: BUMPER STICKERS SIGNS AND BUTTONS","SB23","1031526","SB23.831218","P2016\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","ENTERPRISE","336.09","18-Dec-15","BIRMINGHAM","AL","352124572","CAR RENTAL","X","","SB23","1047126","SB23.I15112","\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","ALABAMA REPUBLICAN PARTY","10000","22-Oct-15","BIRMINGHAM","AL","35216","BALLOT ACCESS","","","SB23","1047126","SB23.I14766","\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","GUNTERSVILLE REC CENTER","500","3-Dec-15","GUNTERSVILLE","AL","35976","SITE RENTAL","","","SB23","1047126","SB23.I14675","\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","GUNTERSVILLE SENIOR CENTER","250","23-Nov-15","GUNTERSVILLE","AL","35976","SITE RENTAL","","","SB23","1047126","SB23.I14676","\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","GARCIA-NOWLIN, DEBBIE MRS.","857.5","29-Feb-16","HUNTSVILLE","AL","358021277","IN-KIND CONTRIBUTION","","","SB23","1056862","SB23.1064004","P2016\r"] |
| ["C00458844","P60006723","Rubio, Marco","#REF!","MINUTEMAN PRESS","381.5","18-Mar-16","BIRMINGHAM","AL","352174710","SUPPLIES","X","","SB23","1066300","SB23.I27933","\r"] |
+---------+
10 rows selected (0.211 seconds)

// 3カラム目だけ
0: jdbc:drill:zk=local> select columns[2] from dfs.`/var/tmp/tableau-datasets/2016_presidential_candidate_expenditures.csv` limit 10;
+---------------+
| EXPR$0 |
+---------------+
| cand_nm |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
| Rubio, Marco |
+---------------+
10 rows selected (0.196 seconds)

// distinctしてみる
0: jdbc:drill:zk=local> select distinct(columns[2]) from dfs.`/var/tmp/tableau-datasets/2016_presidential_candidate_expenditures.csv` limit 10;
+--------------------------+
| EXPR$0 |
+--------------------------+
| Perry, James R. (Rick) |
| Jindal, Bobby |
| Bush, Jeb |
| Kasich, John R. |
| Paul, Rand |
| Fiorina, Carly |
| Santorum, Richard J. |
| Huckabee, Mike |
| Rubio, Marco |
| O'Malley, Martin Joseph |
+--------------------------+
10 rows selected (2.975 seconds)

なんとなしにdistinctしてみたら普通に使えた すばらC(SQL2003完全準拠、らしい)

触ってみる:json

使ったデータ:Database test data generator - Fill your database with random test data!の「Shop Product」の初期設定(5000件)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
// それっぽい在庫一覧
0: jdbc:drill:zk=local> select * from dfs.`/var/tmp/sample-datasets/shop-product.json` limit 10;
+-------------+--------+------------------------------------+--------+
| Product Id | Price | Listing Name | Stock |
+-------------+--------+------------------------------------+--------+
| 0 | 210 | ["perferendis","ex","rerum"] | 775 |
| 1 | 503 | ["et","perspiciatis","ipsa"] | 861 |
| 2 | 584 | ["distinctio","quis","ullam"] | 979 |
| 3 | 655 | ["ea","quas","facere"] | 281 |
| 4 | 322 | ["eum","sint","labore"] | 451 |
| 5 | 941 | ["consectetur","possimus","odit"] | 886 |
| 6 | 517 | ["quae","ut","expedita"] | 479 |
| 7 | 879 | ["debitis","et","et"] | 693 |
| 8 | 798 | ["et","amet","sunt"] | 430 |
| 9 | 963 | ["ipsa","dolores","inventore"] | 710 |
+-------------+--------+------------------------------------+--------+
10 rows selected (0.194 seconds)

// 値段の高い順に10件
0: jdbc:drill:zk=local> select `Product Id`,`Price` from dfs.`/var/tmp/sample-datasets/shop-product.json` order by `Price` desc limit 10;
+-------------+--------+
| Product Id | Price |
+-------------+--------+
| 4746 | 999 |
| 2132 | 999 |
| 4218 | 999 |
| 4735 | 999 |
| 1794 | 999 |
| 4272 | 999 |
| 3932 | 999 |
| 654 | 999 |
| 2625 | 998 |
| 689 | 998 |
+-------------+--------+
10 rows selected (0.183 seconds)

// 在庫の多い順に10件
0: jdbc:drill:zk=local> select `Product Id`,`Stock` from dfs.`/var/tmp/sample-datasets/shop-product.json` order by `Stock` desc limit 10;
+-------------+--------+
| Product Id | Stock |
+-------------+--------+
| 2453 | 999 |
| 3536 | 999 |
| 4415 | 999 |
| 2567 | 999 |
| 1639 | 999 |
| 409 | 999 |
| 1636 | 998 |
| 514 | 998 |
| 162 | 997 |
| 1023 | 997 |
+-------------+--------+
10 rows selected (0.183 seconds)

// 在庫数で集約(group by)して在庫数上位10件に所属する商品数一覧
0: jdbc:drill:zk=local> select `Stock`,count(`Product Id`) from dfs.`/var/tmp/sample-datasets/shop-product.json` group by `Stock` order by `Stock` desc limit 10;
+--------+---------+
| Stock | EXPR$1 |
+--------+---------+
| 999 | 6 |
| 998 | 2 |
| 997 | 7 |
| 996 | 4 |
| 995 | 2 |
| 994 | 4 |
| 993 | 3 |
| 992 | 3 |
| 991 | 6 |
| 990 | 4 |
+--------+---------+
10 rows selected (0.24 seconds)

そのほかにも、joinunionなど、そこそこの事は出来るようです

参考サイト(公式)

参考サイト