Today's task: given a specified set of 2 million users (table meids_tmp), move those users' application-usage data from the last 15 days out of table tb_yl_upload_info (partitioned by day) into another table, upload_info_sub (also partitioned by day).
The approach is straightforward: meids_tmp is only 63 MB, so a map-side join applies; and since the target table's data must be organized by date, dynamic partitioning naturally comes to mind so that rows land in the right partition automatically on insert. This led to the following SQL:
SELECT /*+mapjoin(b)*/
    a.sn, a.isenabled, a.netflow, a.icount, a.pkg_name, a.isdelete, a.label,
    a.vername, a.vercode, a.ispreset, a.apksize, a.firsttime, a.usagetime,
    a.uptime, a.parsed, a.aid, a.pkgid, a.meid, a.wifi, a.mobile, a.batchid,
    a.parsed1, a.model, a.install_type, a.ds AS ds
FROM tb_yl_upload_info a
JOIN meids_tmp b
    ON (a.ds >= '2015-09-05' AND a.ds < '2015-09-15' AND a.meid = b.meid);
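For completeness, a minimal sketch of the dynamic-partition insert this SELECT feeds (the INSERT clause is not shown above; it assumes upload_info_sub has the same non-partition columns as tb_yl_upload_info and relies on the dynamic-partition settings configured below):

INSERT OVERWRITE TABLE upload_info_sub PARTITION (ds)
SELECT /*+mapjoin(b)*/
    a.sn, a.isenabled, a.netflow, a.icount, a.pkg_name, a.isdelete, a.label,
    a.vername, a.vercode, a.ispreset, a.apksize, a.firsttime, a.usagetime,
    a.uptime, a.parsed, a.aid, a.pkgid, a.meid, a.wifi, a.mobile, a.batchid,
    a.parsed1, a.model, a.install_type,
    a.ds AS ds -- the dynamic partition column must come last in the SELECT
FROM tb_yl_upload_info a
JOIN meids_tmp b
    ON (a.ds >= '2015-09-05' AND a.ds < '2015-09-15' AND a.meid = b.meid);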
First, configure Hive's dynamic-partition parameters:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
But once it ran, every single map failed; not one completed normally. The command line kept printing the following until the job exited with failure:
2015-09-22 13:41:40,794 Stage-1 map = 0%, reduce = 0%
2015-09-22 13:42:42,836 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 8708.64 sec
2015-09-22 13:43:43,635 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12395.12 sec
2015-09-22 13:44:44,877 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:45:45,270 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:46:45,408 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:47:47,019 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:48:48,130 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:49:48,657 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
2015-09-22 13:50:49,195 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 12781.82 sec
The first suspect was the mapjoin, so I explicitly set the mapjoin-related parameters:
set hive.auto.convert.join.noconditionaltask.size=200000000;
set hive.mapjoin.smalltable.filesize=200000000;
set hive.auto.convert.join.noconditionaltask=true;
set hive.ignore.mapjoin.hint=false;  -- with the three settings above, this one is actually redundant
set mapred.max.split.size=100000000; -- shrink each map's input a bit (default block size: 256M)
Running it again: still a failure, with exactly the same symptoms. Finally it occurred to me that since the maps were failing, I should look at what each map's JVM was actually doing: jmap -heap to inspect the map's memory usage, jstat -gc to check whether its garbage collection was healthy, and jstack to see what state the worker and GC threads were in. I picked one map task on one of the NodeManagers; partial results follow:
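For reference, a minimal shell sketch of that inspection on a NodeManager (<pid> is a placeholder for the task JVM's process id; YarnChild is the main class of an MRv2 task):

# find running MR task JVMs on this node
jps -ml | grep YarnChild

# one-shot view of heap layout and occupancy
jmap -heap <pid>

# GC counters sampled every 5 seconds
jstat -gc <pid> 5000

# thread dump: what is the worker thread doing?
jstack <pid>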
jmap -heap:
Heap Usage:
PS Young Generation
Eden Space:
   capacity = 262668288 (250.5MB)
   used     = 262668288 (250.5MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 43515904 (41.5MB)
   used     = 0 (0.0MB)
   free     = 43515904 (41.5MB)
   0.0% used
To Space:
   capacity = 43515904 (41.5MB)
   used     = 0 (0.0MB)
   free     = 43515904 (41.5MB)
   0.0% used
PS Old Generation
   capacity = 699400192 (667.0MB)
   used     = 699265240 (666.8712997436523MB)
   free     = 134952 (0.12870025634765625MB)
   99.98070460924323% used
jstat -gc (a single full GC is very time-consuming; combined with the jmap -heap result showing the old generation pinned at 99.9% usage, this means the map was doing garbage collection the whole time, so its worker thread never actually got to run; YGC/YGCT are young-GC count/time, FGC/FGCT full-GC count/time, GCT total GC time in seconds):
YGC  YGCT   FGC  FGCT    GCT
2    0.366  4    52.569  52.934
2    0.366  4    52.569  52.934
2    0.366  4    52.569  52.934
2    0.366  4    79.129  79.495
2    0.366  5    79.129  79.495
2    0.366  5    79.129  79.495
2    0.366  5    79.129  79.495
2    0.366  5    79.129  79.495
2    0.366  5    79.129  79.495
jstack result (the worker thread's state is BLOCKED; after several more dumps it stayed BLOCKED, which corroborates the jstat and jmap results above):
Thread 5863: (state = BLOCKED)
 - java.util.ArrayList.<init>(int) @bci=37, line=144 (Compiled frame)
 - org.apache.hadoop.hive.ql.exec.persistence.MapJoinEagerRowContainer.<init>() @bci=25, line=48 (Compiled frame)
 - org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe.load(java.io.ObjectInputStream) @bci=141, line=78 (Compiled frame)
 - org.apache.hadoop.hive.ql.exec.mr.HashTableLoader.load(org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainer[], org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe[]) @bci=208, line=98 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable() @bci=78, line=155 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp() @bci=17, line=181 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged() @bci=1, line=1032 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged() @bci=41, line=1036 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged() @bci=41, line=1036 (Interpreted frame)
 - org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged() @bci=41, line=1036 (Interpreted frame)
The fix: since the maps are running out of memory, the first move is simply to give them more. Set the following in the Hive CLI (note that the container limit mapreduce.map.memory.mb must stay above the JVM heap set by -Xmx, leaving headroom for off-heap usage):
set mapreduce.map.memory.mb=2000; -- mapred-site.xml has 1100
set mapreduce.map.java.opts=-Xmx1900m -Xms1900m -Xmn200m;
Re-running the SQL, it succeeded; problem solved. One question remains, though: the maps' original 1 GB heap was already quite generous, even wasteful, so why did this particular job need so much more? Finding out is also simple: dump the map task's JVM heap with jmap -dump, then inspect it in a visual analyzer such as jvisualvm or Eclipse MAT. I didn't have time for that analysis, but a rough guess: tb_yl_upload_info contains a lot of Chinese text and has many columns to begin with, and the JVM stores strings with considerable per-object overhead, so memory ran short. On top of that, map-side settings such as mapreduce.task.io.sort.mb claim part of the memory as well. In short, every problem has a cause.
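A minimal sketch of that follow-up, again with <pid> as a placeholder and an illustrative dump path:

# dump live objects in binary format for offline analysis
jmap -dump:live,format=b,file=/tmp/map_task.hprof <pid>

# then open /tmp/map_task.hprof in jvisualvm or Eclipse MAT
# to see which classes dominate the heap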