Architecture

In remote database mode, the components are distributed across the nodes:

| Node   | Hive CLI | MetaStore Server | MySQL |
|--------|----------|------------------|-------|
| node01 | *        |                  |       |
| node02 |          | *                |       |
| node03 |          |                  | *     |
| node04 | *        |                  |       |

Configuration

Server Configuration Parameters

| Config Param | Config Value | Comment |
|---|---|---|
| javax.jdo.option.ConnectionURL | jdbc:mysql://host_name/database_name?createDatabaseIfNotExist=true | metadata is stored in a MySQL server |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class |
| javax.jdo.option.ConnectionUserName | user_name | user name for connecting to MySQL server |
| javax.jdo.option.ConnectionPassword | password | password for connecting to MySQL server |
| hive.metastore.warehouse.dir | base hdfs path | default location for Hive tables |
| hive.metastore.thrift.bind.host | host_name | Host name to bind the metastore service to. When empty, "localhost" is used. Available in Hive 4.0.0 onwards. |

Client Configuration Parameters

| Config Param | Config Value | Comment |
|---|---|---|
| hive.metastore.uris | thrift://host_name:port | host and port for the Thrift metastore server. If hive.metastore.thrift.bind.host is specified, the host should match that configuration. Read more about this in the dynamic service discovery configuration parameters. |
| hive.metastore.local | false | Metastore is remote. Note: no longer needed as of Hive 0.10; setting hive.metastore.uris is sufficient. |
| hive.metastore.warehouse.dir | base hdfs path | Points to the default location of non-external Hive tables in HDFS. |

hive-site.xml

<configuration>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive_remote/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://node02:9083</value>
    </property>
</configuration>

Initialize the database

On node02:

schematool -dbType mysql -initSchema
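
To confirm that schematool created the schema, you can look inside the metastore database in MySQL. This is only a sketch: the database name hive_remote is an assumption, substitute the database name from your ConnectionURL.

-- run in the MySQL client on node03 (assumed metastore database name: hive_remote)
use hive_remote;
show tables;                          -- metastore tables such as DBS, TBLS, PARTITIONS
select SCHEMA_VERSION from VERSION;   -- schema version written by schematool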

Startup

Metastore service

On node02:

hive --service metastore

CLI

On node01:

hive
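
Once the CLI comes up it talks to the metastore over Thrift; a quick sanity check from the Hive prompt:

show databases;   -- should at least list 'default'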

Hive SQL

Person table

Sample data file (columns: id, name, likes, address):
1,小明1,lol-book-movie,benjing:chaoyao-shanghai:pudong
2,小明2,lol-book-movie,benjing:chaoyao-shanghai:pudong
3,小明3,lol-book-movie,benjing:chaoyao-shanghai:pudong
4,小明4,lol-book-movie,benjing:chaoyao-shanghai:pudong
5,小明5,lol-movie,benjing:chaoyao-shanghai:pudong

Table creation statement

create table psn2 (
    id int,
    name string,
    likes array<string>,
    address map<string,string>
)
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';

Loading data

# without 'local', Hive looks for the data file on HDFS instead
load data local inpath '/root/data/data' into table psn2;
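
The '-' and ':' delimiters turn likes into an array and address into a map, so collection syntax works in queries. A quick sketch (the key 'benjing' comes from the sample data above):

select name,
       likes[0]           as first_like,   -- array index access
       address['benjing'] as district      -- map key access
from psn2
where array_contains(likes, 'book');       -- built-in collection function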

Uploading directly to the table's HDFS directory

Contents of data2:

81,小明81,lol-book-movie,benjing:chaoyao-shanghai:pudong
82,小明82,lol-movie,benjing:chaoyao-shanghai:pudong

hdfs dfs -put ./data2 /user/hive_remote/warehouse/psn2
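
Hive treats every file under the table's directory as table data, so the new rows show up without any LOAD statement:

select id, name from psn2 where id >= 81;   -- returns the rows from data2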

Internal and external tables

By default, Hive creates internal (managed) tables.

How do you create an external table?

create external table psn4 (
    id int,
    name string,
    likes array<string>,
    address map<string,string>
)
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':'
    location '/hive_data';

Differences between internal and external tables

# Load data
load data local inpath '/root/hive/data' into table psn4;

# Query
select * from psn4;

![](https://oss.wyxxt.org.cn/images/2021/09/18/wp_editor_md_cfd52807f211858ba24b9af63795c6dd.jpg)

Data location

1. When an internal table is created, its data is stored in Hive's default warehouse directory; an external table requires an extra location to be specified at creation time.
2. When an internal table is dropped, both its metadata and its data are deleted; dropping an external table removes only the metadata, never the data.

Use cases:

- Internal tables: the table is created first and data is then loaded into it; well suited for storing intermediate tables.
- External tables: the table can be created before the data is added, or the data can already exist before the table is created; in essence an external table maps an HDFS directory onto a Hive table, so it suits raw-data storage, where the data must not disappear because of an accidental drop (see the sketch below).
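
A minimal sketch of the deletion difference, using the tables created above; dropping them removes psn2's files from HDFS but leaves psn4's files in place:

-- internal table: removes the metadata AND deletes
-- /user/hive_remote/warehouse/psn2 from HDFS
drop table psn2;

-- external table: removes only the metadata;
-- the files under /hive_data are untouched
drop table psn4;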

Hive partitioned tables:

By default Hive keeps all of a table's data under one HDFS directory, so retrieving just the rows that match some condition means scanning the full dataset, which costs a lot of IO and is inefficient. Following a divide-and-conquer idea, rows that match a given condition can instead be placed in their own directory; a query then only has to scan that directory rather than the whole dataset.

Table creation statement

create table psn5 (
    id int,
    name string,
    likes array<string>,
    address map<string,string>
)
partitioned by(gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':' ;

Loading data

load data local inpath '/root/hive/data' into table psn5 partition(gender='man');

HDFS directory layout
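
Each partition column becomes a directory level. With the warehouse path configured above, the load should leave roughly this layout (illustrative):

/user/hive_remote/warehouse/psn5/gender=man/data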

Creating a multi-partition table

create table psn6 (
    id int,
    name string,
    likes array<string>,
    address map<string,string>
)
partitioned by(gender string,age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':' ;

1. Once a partitioned table is created, each partition column appears as a directory in HDFS when data is stored, forming multi-level directories.
2. When inserting into a multi-partition table, you cannot supply only one partition column; every partition column must be given a value.
3. When supplying the partition column values, the order does not matter; values are matched to the partition columns by name (see the example below).
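
A sketch of point 3: psn6 was declared with gender before age, but the values here are given in the opposite order and Hive still matches them by name (the value 20 is arbitrary):

load data local inpath '/root/hive/data' into table psn6 partition(age=20, gender='man');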

Repairing partitions

When using a Hive external table, you can upload the data to an HDFS directory first and then create the external table to establish the mapping. If the uploaded data was laid out in multi-level, partition-style directories, the data will not be queryable after the table is created, because the partition metadata was never saved in MySQL. The partitions therefore have to be repaired, syncing the metadata into MySQL; only then can the data be queried. The concrete steps are as follows:

create external table psn7 (
    id int,
    name string,
    likes array<string>,
    address map<string,string>
)
    partitioned by(age int)
    row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':'
    location '/msb';


msck repair table psn7;
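
A quick way to watch the repair take effect; this sketch assumes partition-style subdirectories such as /msb/age=20 were uploaded before the table was created:

show partitions psn7;     -- empty before the repair
msck repair table psn7;   -- scans /msb and registers the partitions in the metastore
show partitions psn7;     -- now lists age=20, ...
select * from psn7;       -- the data is finally queryable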
