Hive Hands-On Operations

Installing and configuring MySQL:

  • Install: mysql / mysql-server

    yum install -y mysql
    yum install -y mysql-server
    /etc/init.d/mysqld start
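  • The hive-site.xml below connects to MySQL as root/root, so the MySQL root password must be set to match; a minimal sketch (the password value is only an assumption mirroring that config):

    mysqladmin -u root password 'root'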
  • Install Hive

    wget http://archive.apache.org/dist/hive/hive-0.13.0/apache-hive-0.13.0-bin.tar.gz
    tar -zxvf apache-hive-0.13.0-bin.tar.gz
    cd /usr/local/src/apache-hive-0.13.0-bin/conf
  • vi conf/hive-site.xml

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
      </property>
      <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
      </property>
    </configuration>
  • vi ~/.bashrc

    ### Append the following at the bottom
    ### java
    export JAVA_HOME=/usr/local/src/jdk1.6.0_45
    export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
    # export PATH=$PATH:$JAVA_HOME/bin

    ### hadoop
    export HADOOP_HOME=/usr/local/src/hadoop-1.2.1


    ### Hbase
    export HBASE_HOME=/usr/local/src/hbase-0.98.0-hadoop1
    export HBASE_CLASSPATH=$HBASE_HOME/conf
    export HBASE_LOG_DIR=$HBASE_HOME/logs


    ### hive conf
    export HIVE_HOME=/usr/local/src/apache-hive-0.13.0-bin
    # export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin:$PATH

    export PATH=$PATH:$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin
  • source ~/.bashrc

  • Copy the MySQL JDBC driver into Hive's lib directory

    cp mysql-connector-java-5.1.41-bin.jar /usr/local/src/apache-hive-0.13.0-bin/lib/
  • Test Hive

    hive
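    Once the CLI comes up, any statement that touches the metastore doubles as a sanity check of the MySQL connection; a minimal sketch:

    show databases;
    -- should list at least the built-in `default` database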
  • Create tables

    create EXTERNAL TABLE w_a
    (
    usrid STRING,
    age STRING,
    sex STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';

    create EXTERNAL TABLE w_b
    (
    usrid STRING,
    active STRING,
    time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';

  • List tables

    show tables;
  • Describe a table

    desc w_a;
  • Load data

    • a.txt
      user1	27	1
      user2	28	1
      user3	29	0
      ...
  • b.txt

    user1	100	20170301
    user3	101	20170302
    user4	102	20170303
    ...
  • Load commands

    hive -e "LOAD DATA LOCAL INPATH './a.txt' OVERWRITE INTO TABLE w_a"
    hive -e "LOAD DATA LOCAL INPATH './b.txt' OVERWRITE INTO TABLE w_b"
  • Verify the loaded data

    select * from w_a;
  • Join query across the two tables

    select a.*, b.* from w_a a join w_b b on a.usrid=b.usrid;
hive> select a.*, b.* from w_a a join w_b b on a.usrid=b.usrid;
Total jobs = 1
Execution log at: /tmp/root/root_20180311051515_f78a507f-805c-4360-8ff4-976cb05a798f.log
2018-03-11 05:15:19 Starting to launch local task to process map join; maximum memory = 1013645312
2018-03-11 05:15:20 Dump the side-table into file: file:/tmp/root/hive_2018-03-11_05-15-15_970_401222296198383944-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2018-03-11 05:15:20 Uploaded 1 File to: file:/tmp/root/hive_2018-03-11_05-15-15_970_401222296198383944-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (543 bytes)
2018-03-11 05:15:20 End of local task; Time Taken: 0.674 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201803110434_0001, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201803110434_0001
Kill Command = /usr/local/src/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201803110434_0001
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-03-11 05:15:27,738 Stage-3 map = 0%, reduce = 0%
2018-03-11 05:15:29,755 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.73 sec
2018-03-11 05:15:32,768 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.73 sec
MapReduce Total cumulative CPU time: 730 msec
Ended Job = job_201803110434_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.73 sec HDFS Read: 324 HDFS Write: 180 SUCCESS
Total MapReduce CPU Time Spent: 730 msec
OK
user1 27 1 user1 100 20170301
user3 29 0 user3 101 20170302
user4 30 1 user4 102 20170303
user5 31 0 user5 103 20170304
user7 33 1 user7 104 20170305
user8 34 0 user8 105 20170306
Time taken: 16.83 seconds, Fetched: 6 row(s)
  • Load from HDFS

    LOAD DATA INPATH '/user_name.data.utf.txt' OVERWRITE INTO TABLE u_info
  • OVERWRITE means the loaded data replaces the table's existing contents

  • Compare with the local variant: LOAD DATA LOCAL INPATH

  • Load data with the INSERT command

    • Example command:
    insert into table table1 select usrid, age from w_a limit 3;
    • Dynamic partition inserts are also supported (see the settings sketch after this command):
    insert into table test1 partition(c) select * from test2;
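    Fully dynamic partition inserts only run after relaxing Hive's default safety checks; a minimal sketch using the same test1/test2 tables from the example above:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert into table test1 partition(c) select * from test2;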
  • Create a table directly from a query (CTAS)

    create table test2 as select * from test1;
  • Export data (to the local filesystem; note that Hive treats the path as a directory and writes result files under it)

insert overwrite local directory '/home/developer/hive_test/1.txt' select usrid, sex from w_a;
  • Export data (to HDFS)

    insert overwrite directory '/hive_output' select * from w_b;
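    The exported files can then be read straight off HDFS; a usage sketch matching the output path above:

    hadoop fs -cat /hive_output/*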

Partition

  • A partition is a mechanism Hive provides for the user to control how data is laid out by specifying one or more partition keys, which in turn speeds up queries. A table may declare multiple partition keys, and each partition exists in Hive as a directory.
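  • Because each partition is a directory, the layout can be inspected on HDFS after loading the p_t table below; a sketch assuming the default warehouse location:

    hadoop fs -ls /user/hive/warehouse/p_t
    ## dt=20180311 shows up as a subdirectory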

  • Create the table

    create TABLE p_t
    (
    usrid STRING,
    age STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';

  • p1.txt

    user2	28	20170302
    user4	30	20170302
    user6	32	20170302
    user8	34	20170302
  • Load the data

    LOAD DATA LOCAL INPATH '/root/hive_test/p1.txt' OVERWRITE INTO TABLE p_t partition(dt='20180311');
    ## Drop LOCAL to load the file from HDFS instead (see the session below)
hive> LOAD DATA INPATH  '/root/hive_test/p1.txt' OVERWRITE INTO TABLE p_t partition(dt='20180311');
FAILED: SemanticException Line 1:18 Invalid path ''/root/hive_test/p1.txt'': No files matching path hdfs://192.168.57.110:9000/root/hive_test/p1.txt
hive> LOAD DATA LOCAL INPATH '/root/hive_test/p1.txt' OVERWRITE INTO TABLE p_t partition(dt='20180311');
Copying data from file:/root/hive_test/p1.txt
Copying file: file:/root/hive_test/p1.txt
Loading data to table default.p_t partition (dt=20180311)
Partition default.p_t{dt=20180311} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 0.721 seconds
  • Query
    select * from p_t where dt='20180311';
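    The dt predicate prunes the scan to a single partition directory, so only that day's files are read. Listing the partitions confirms what is available; a usage sketch:

    show partitions p_t;
    -- dt=20180311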

Transform

  • Part of what TRANSFORM does can be replaced with a UDF, but when the field being assembled depends on the result of a previous query a UDF cannot be used: a UDF only operates on the current row.
  • The drawback of TRANSFORM is that it is somewhat less efficient.
  • vi transform.awk
    {
    print $1"_"$2   # join the first two input fields with an underscore
    }
hive> add file ./transform.awk;
hive>  select transform(usrid,age) using "awk -f transform.awk" as (uuu) from w_a;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201803110434_0002, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201803110434_0002
Kill Command = /usr/local/src/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201803110434_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-03-11 05:38:15,036 Stage-1 map = 0%, reduce = 0%
2018-03-11 05:38:18,051 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
2018-03-11 05:38:20,060 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.69 sec
MapReduce Total cumulative CPU time: 690 msec
Ended Job = job_201803110434_0002
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 298 HDFS Write: 72 SUCCESS
Total MapReduce CPU Time Spent: 690 msec
OK
user1_27
user2_28
user3_29
user4_30
user5_31
user6_32
user7_33
user8_34
Time taken: 10.161 seconds, Fetched: 8 row(s)

Integrating Hive with HBase

  • Create the HBase table:

    • create 'classes','user'
  • Insert data:

    put 'classes','001','user:name','jack'
    put 'classes','001','user:age','20'
    put 'classes','002','user:name','liza'
    put 'classes','002','user:age','18'
  • Create the Hive table and verify:

    create external table classes(id int, name string, age int)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,user:name,user:age")
    TBLPROPERTIES("hbase.table.name" = "classes");
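    The verification itself can be a plain scan from the Hive side; a usage sketch (the rows come from the puts above, with the HBase row keys parsed into the int id column):

    select * from classes;
    -- 1  jack  20
    -- 2  liza  18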
  • Add more data to HBase:

    • put 'classes','003','user:age','1820183291839132'
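  • Since user:age is mapped to a Hive int column, this oversized value cannot be parsed on read; selecting it from Hive should come back as NULL (and user:name is NULL too, since row 003 never set it):

    select * from classes where id=3;
    -- 3  NULL  NULL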